Twitter Open Source Intelligence with TWINT

Published on - 12 min read

TWINT, short for Twitter Intelligence, is an open source Twitter OSINT tool writtein in Python, created by Franesco Poldi. TWINT provides a termainal or code interface to Twitter in which you can search tweets, replies, favourites, followers, following, retweets and more over specific time periods and easily export the intelligence to CSV, JSON and an ElasticSearch-based intergration.

In this post we’ll work through a basic workflow in a Jupyter Notebook using TWINT and Pandas, a python data science library.

Setup

If you have never used any of the technologies I mentioned, start here. If you already have Jupyter Notebooks skip to step.

Step 0: Download and install Anaconda. Anaconda is a data science platform which includes Jupyter Notebooks. Installers are availalbe for Windows, MacOS and Linux. pip3 install jupyter is a fast terminal-based method for Mac and Linux.
Step 1: Install TWINT.pip3 install twint for Mac and Linux is possible, however, installing in-notebook provides a more consistent approach for using TWINT with Jupyter across Windows, Mac and Linux.

However, feel free to pip3 install twint install via terminal on Mac or Linux so you can use the twint command in-terminal, as well as in-notebook. If you do this, it isn’t necessary to type the following part of this step.

In the first cell of a notebook, type:

import sys
!{sys.executable} -m pip install twint

Step 2: We need to import twint and pandas for use in the notebook. The code below this allows us to use multiple cells, rather than wait for cells to finish. You will still need to wait for cells which depend on previous cells of course.

import twint
import pandas as pd

# Allows the running of multiple event loops in Jupyter Notebooks.
# Fixes: "RuntimeError: This event loop is already running"
import nest_asyncio

nest_asyncio.apply()

Basic TWINT Usage

Now that we’re all set up, we can gather some Twitter intelligence.

There are two ways to use TWINT, in the terminal and within python code. We will not be discussing the use of TWINT via the terminal, only its use within python scripts.

The simplest usable TWINT script used to scrape tweets from a users timeline is as follows:

c = twint.Config()

c.Username = 'jack'
c.Limit = 10

twint.run.Search(c)

We utlise the TWINT Config() method, which allows us to use various functions such as Username, Limit and many others. The Limit function allows us to limit the number of tweets returned. The output will be the tweet ID, datetime, username and tweet.

If you have issues at this stage, for example the error CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value [...], try uninstalling twint pip3 uninstall twint and reinstall with via the following: pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint.

A Useful Workflow

The above is the essance of TWINT, returning Twitter information directly (either in-terminal or in-notebook). This isn’t particulalry useful or readable at the moment. Pandas can change that!

c = twint.Config()

c.Username = 'jack'
c.Since = '2021-02-01 00:00:00'
c.Until = '2021-03-17 00:00:00'
c.Pandas = True

twint.run.Search(c)

Above, we target tweets by user jack (the founder of Twitter) from Febuary 17st 2021 to the date of this post, March 17th 2021 and utilise the Pandas module of TWINT (pandas is a seperate data science library, this is just a flag to store in data so we can use the functions ahead).

You can also use pandas on any CSV files we have created using TWINT by importing the data into a dataframe.

df = twint.storage.panda.Tweeds_df
df

Here, we assign the data we collected and stored previously to a dataframe and print the contents. A dataframe is a two dimensional data structure with coloumns and rows, like a table. This allows us to tabulate data, making it much more readable.

This allows us to manipulate the data by focusing on indivudal columns and allows grouping etc. For example, we are dealing with one user so there’s no need to see that information, and much of the other information is verbose. We may only care about the tweet content, and the datetime.

df[['date', 'tweet']]

We have access to a great deal of tweet information. We can view information on what the dataframe contains with the info() function.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               12 non-null     object
 1   conversation_id  12 non-null     object
 2   created_at       12 non-null     int64 
 3   date             12 non-null     object
 4   timezone         12 non-null     object
 5   place            12 non-null     object
 6   tweet            12 non-null     object
 7   hashtags         12 non-null     object
 8   cashtags         12 non-null     object
 9   user_id          12 non-null     int64 
 10  user_id_str      12 non-null     object
 11  username         12 non-null     object
 12  name             12 non-null     object
 13  day              12 non-null     int64 
 14  hour             12 non-null     object
 15  link             12 non-null     object
 16  retweet          12 non-null     bool  
 17  nlikes           12 non-null     int64 
 18  nreplies         12 non-null     int64 
 19  nretweets        12 non-null     int64 
 20  quote_url        12 non-null     object
 21  search           12 non-null     object
 22  near             12 non-null     object
 23  geo              12 non-null     object
 24  source           12 non-null     object
 25  user_rt_id       12 non-null     object
 26  user_rt          12 non-null     object
 27  retweet_id       12 non-null     object
 28  reply_to         12 non-null     object
 29  retweet_date     12 non-null     object
 30  translate        12 non-null     object
 31  trans_src        12 non-null     object
 32  trans_dest       12 non-null     object
dtypes: bool(1), int64(6), object(26)
memory usage: 3.1+ KB

We can manipulate any of these columns as we did above.

Visualisations for Understanding

We can visualise the data we have gathered to gain a better understanding. We’ll re-gather a larger dataset from jacks timeline so we have more data to deal wrangle insights from.

c = twint.Config()

c.Username = 'jack'
c.Since = '2021-01-01 00:00:00'
c.Until = '2021-03-17 00:00:00'
c.Pandas = True

twint.run.Search(c)

We have increased our search space by 1 month.

df = twint.storage.panda.Tweeds_df
df

We reassign the collected tweets to the dataframe, as we have new data (we increased teh search space above).

We want to graph the number of tweets over the period using matplotlib.

# Convert date colum into a list for the next step
dates_list = df['date'].to_list()

We convert the dataframes data column to a python list and assign it to the variable data_list.

# Create a histogram of tweet frequency
# from https://stackoverflow.com/questions/44929555/how-to-properly-create-a-histogram-displaying-the-frequency-of-the-tweets-for-e

dates = []
for t in dates_list:
    # extract the date part of the datetime
    date_str = t.split(' ')[0]
    # extract the time from the date
    year,month,day = [int(i) for i in date_str.split('-')]
    # create a date object
    d = date(year, month, day)
    # sort
    dates.append(d)
    
# sort dates
dates.sort()

# find the first and last date
min_date = dates[0]
max_date = dates[-1]

# compute num days
length = (max_date - min_date).days + 1

# plot histogram
plt.figure(figsize=(12,8))
plt.hist(dates)
plt.show()

Above we create a histogram of tweets over the period, the tweet frequency. For the data we gathered on jack this produces the following:

A histogram showing the freqnecy of tweets by jack over a 2 and a half month period.

A histogram showing the freqnecy of tweets by jack.

Extract Unique Hashtags

We want to find the individual hashtags a user has used during the collection period.

# Extract unique hashtags
# Get hashtag list
hashtag_list = df['hashtags'].to_list()

# Remove empty elements
filtered_list = list(filter(None, hashtag_list))

# Find unique entries (will be slow with *very* large lists)
unique_hashtags = []
for value in filtered_list:
    if value not in unique_hashtags:
        unique_hashtags.append(value)

# In this case, we only have one unique hashtag
print(unique_hashtags)

Above we create a list containing the hashtags from the associated column in the data. We filter out empty entires, there will be many of these unless a user is a prolific hashtag user on every tweet. Finally, we loop through each entry to find unique hashtags to remove duplicates. We end printing this unique hashtag list.

In this case there is only one output, “#bitcoin”.

Word Cloud

We want to create a word cloud to visualise the possible topics a user is discussing during the captured period. The word cloud package serves this purpose.

tweets = df['tweet'].to_list()

words = ''
stopwords = set(STOPWORDS)

# Iterate through tweets
for value in tweets:
    # Convert to string
    value = str(value)
    # Tokenize
    tokens = value.split()
    # Convert each to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
        
    # Add to word
    words += " ".join(tokens)+" "

# Word cloud setup
wordcloud = WordCloud(width = 800, height = 800,
                     background_color = 'white',
                     stopwords = stopwords,
                     min_font_size = 10).generate(words)

# Plot
plt.figure(figsize=(8,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Above, we create a list of the tweets from the associated column. We iterate through each tweet, turning the values to strings, tokenizing these strings and converting them to lowercase. Once each word has been cleansed in this way, they are added to the words variable which will be used to generate a word cloud using the WordCloud() function. Finally, we plot the word cloud using matplotlib.

A word cloud composed of tweets from jack during Feb - Mar 2021..

A word cloud composed of tweets from jack during Feb - Mar 2021.

Further Analysis and Improvements

This is only a small set of what is possible with Python, pandas, matplotlib and TWINT of course. There are also a number of things we can do to improve on what we have done here. For example, the word cloud does not remove hyperlinks or emojis, on large datasets this unnecessary analysis would add up.

References

  1. Matplotlib (2021) Matplotlibhttps://matplotlib.org.
  2. Pandas (2021) Pandashttps://pandas.pydata.org.
  3. Pandas (2021) Pandas.DataFramehttps://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html.
  4. Twitter (2021) Profile: Jackhttps://twitter.com/jack.
  5. Wikipedia (2021) Jack Dorseyhttps://en.wikipedia.org/wiki/Jack_Dorsey.
  6. Wordcloud (2021) WordCloud for Python documentationhttps://amueller.github.io/word_cloud/.