Twitter Open Source Intelligence with TWINT
TWINT, short for Twitter Intelligence, is an open source Twitter OSINT tool written in Python, created by Francesco Poldi. TWINT provides a terminal or code interface to Twitter with which you can search tweets, replies, favourites, followers, following, retweets and more over specific time periods, and easily export the results to CSV, JSON or an Elasticsearch-based integration.
In this post we’ll work through a basic workflow in a Jupyter Notebook using TWINT and pandas, a Python data science library.
If you have never used any of the technologies I mentioned, start here. If you already have Jupyter Notebooks, you can skip Step 0.
Step 0: Download and install Anaconda. Anaconda is a data science platform which includes Jupyter Notebooks. Installers are available for Windows, macOS and Linux.
Alternatively, running pip3 install jupyter in a terminal is a fast method on Mac and Linux.
Step 1: Install TWINT.
Running pip3 install twint in a terminal is possible on Mac and Linux; however, installing in-notebook provides a more consistent approach for using TWINT with Jupyter across Windows, Mac and Linux.
Still, feel free to install via the terminal on Mac or Linux so you can use the twint command in-terminal as well as in-notebook. If you do this, the in-notebook installation is not necessary.
Otherwise, install TWINT with pip from the first cell of a notebook.
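A minimal sketch of such a cell, assuming pip is available to the Python interpreter behind the notebook. Many notebook users simply prefix the terminal command with an exclamation mark (!pip3 install twint); the version below does the same job from plain Python. The helper names pip_install_command and pip_install are illustrative and not part of TWINT.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command that targets the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]


def pip_install(package):
    """Run the pip command; raises CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(package))


# Uncomment to install TWINT from inside the notebook:
# pip_install("twint")
```

Using sys.executable means the package lands in the same environment the notebook kernel uses, which is the usual source of "installed but cannot import" confusion.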
Step 2: Import twint and pandas for use in the notebook. We also add a line of setup that lets us run TWINT across multiple cells, rather than wait for cells to finish. You will still need to wait for cells which depend on previous cells, of course.
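A sketch of what this step might look like. The setup line is not named in the text; this sketch assumes it is nest_asyncio.apply(), from the separate nest_asyncio package, which lets TWINT's asyncio code run inside the event loop Jupyter already has running. The availability check around the imports is an addition for convenience.

```python
import importlib.util

REQUIRED = ("twint", "pandas", "nest_asyncio")

# Find any packages that still need installing before we try to import them.
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Install these packages first:", ", ".join(missing))
else:
    import nest_asyncio
    import pandas as pd
    import twint

    # Assumption: this is the setup line the step refers to. It patches
    # asyncio so TWINT can run inside Jupyter's already-running event loop.
    nest_asyncio.apply()
```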
Basic TWINT Usage
Now that we’re all set up, we can gather some Twitter intelligence.
There are two ways to use TWINT: in the terminal and within Python code. We will not be discussing the use of TWINT via the terminal, only its use within Python scripts.
The simplest usable TWINT script scrapes tweets from a user's timeline. We utilise the TWINT Config() object, which lets us set various options such as Limit and many others. The Limit option limits the number of tweets returned. The output will be the tweet ID, datetime, username and tweet.
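The script described above can be sketched as follows. The wrapper function scrape_timeline is hypothetical, and the username and limit in the usage line are example values chosen for illustration; Config, Username, Limit and run.Search are TWINT's own names.

```python
def scrape_timeline(username, limit):
    """Print tweets from one user's timeline and return the config used."""
    # Imported lazily so this cell can be defined before TWINT is installed.
    import twint

    c = twint.Config()
    c.Username = username  # whose timeline to scrape
    c.Limit = limit        # cap on the number of tweets returned
    twint.run.Search(c)    # prints tweet ID, datetime, username and tweet text
    return c


# Example call (needs network access and a working TWINT install):
# scrape_timeline("jack", limit=1)
```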
If you have issues at this stage, for example the error
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value [...]
try uninstalling twint with pip3 uninstall twint and reinstalling with the following:
pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
A Useful Workflow
The above is the essence of TWINT: returning Twitter information directly (either in-terminal or in-notebook). This isn’t particularly useful or readable at the moment. Pandas can change that!
We target tweets by the user jack (a co-founder of Twitter) from February 17th 2021 to the date of this post, March 17th 2021, and utilise the Pandas module of TWINT (pandas is a separate data science library; this is just a flag to store the data so we can use the functions ahead).
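A sketch of the search just described. The wrapper function search_user_since is hypothetical; Since and Pandas are TWINT configuration options. No end date is set, so the search runs up to the present, which matches "to the date of this post" only if it is run on that day.

```python
def search_user_since(username, since):
    """Collect a user's tweets from `since` (YYYY-MM-DD) into TWINT's pandas storage."""
    # Imported lazily so this cell can be defined before TWINT is installed.
    import twint

    c = twint.Config()
    c.Username = username
    c.Since = since   # start of the collection window
    c.Pandas = True   # keep the results in memory for use with pandas
    twint.run.Search(c)
    return c


# Example call (needs network access and a working TWINT install):
# search_user_since("jack", "2021-02-17")
```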
You can also use pandas on any CSV files we have created using TWINT by importing the data into a dataframe.
Next, we assign the data we collected and stored previously to a dataframe and print the contents. A dataframe is a two-dimensional data structure with columns and rows, like a table. This allows us to tabulate data, making it much more readable.
This allows us to manipulate the data by focusing on individual columns, grouping, and so on. For example, we are dealing with one user so there’s no need to see that information, and much of the other information is verbose. We may only care about the tweet content and the datetime.
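Both routes into a dataframe can be sketched as below. The function names are hypothetical; twint.storage.panda.Tweets_df is where TWINT keeps the results of a search run with the Pandas flag set, and pandas.read_csv is the standard pandas loader for CSV files.

```python
def collected_tweets():
    """Return the dataframe TWINT filled during the last search run with Pandas = True."""
    # Imported lazily so this cell can be defined before TWINT is installed.
    import twint

    return twint.storage.panda.Tweets_df


def tweets_from_csv(path):
    """Load a CSV file previously exported by TWINT into a dataframe."""
    import pandas as pd

    return pd.read_csv(path)


# Tweets_df = collected_tweets()
# print(Tweets_df)
```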
We have access to a great deal of tweet information, and we can inspect what the dataframe contains.
We can manipulate any of these columns in the same way.
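A short sketch of both ideas: narrowing the dataframe to the columns we care about, and listing what it holds. The column names "date" and "tweet" are assumptions about TWINT's dataframe layout, and the function names are hypothetical. Calling tweets_df.info() is another common way to get a summary.

```python
def date_and_tweet(tweets_df):
    """Keep only the datetime and text of each tweet."""
    return tweets_df[["date", "tweet"]]


def column_names(tweets_df):
    """List every column the dataframe holds."""
    return list(tweets_df.columns)


# date_and_tweet(Tweets_df)
# print(column_names(Tweets_df))
```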
Visualisations for Understanding
We can visualise the data we have gathered to gain a better understanding. We’ll re-gather a larger dataset from jack’s timeline so we have more data to wrangle insights from.
This time we increase our search space by 1 month.
We then reassign the collected tweets to the dataframe, as we have new data from the larger search space.
We want to graph the number of tweets over the period using matplotlib.
We convert the dataframe’s date column to a Python list and assign it to a variable.
We then create a histogram of tweets over the period, showing the tweet frequency. For the data we gathered on jack this produces the following:
A histogram showing the frequency of tweets by jack.
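A chart like the one described above could be produced along these lines. This is a sketch rather than the exact plotting call used for the figure: it counts tweets per day with the standard library and draws the counts as bars. It assumes each date value looks like "2021-03-17 10:20:30"; the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(dates):
    """Count tweets per calendar day from 'YYYY-MM-DD HH:MM:SS' values."""
    days = [datetime.fromisoformat(str(value)).date() for value in dates]
    return dict(sorted(Counter(days).items()))


def plot_tweet_frequency(dates):
    """Draw the per-day tweet counts as a bar chart."""
    # Imported lazily; matplotlib is only needed when plotting.
    import matplotlib.pyplot as plt

    counts = tweets_per_day(dates)
    plt.bar(list(counts.keys()), list(counts.values()))
    plt.xlabel("Date")
    plt.ylabel("Number of tweets")
    plt.xticks(rotation=45)
    plt.show()


# dates = Tweets_df["date"].to_list()
# plot_tweet_frequency(dates)
```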
Extract Unique Hashtags
We want to find the individual hashtags a user has used during the collection period.
We create a list containing the hashtags from the associated column in the data. We filter out empty entries; there will be many of these unless a user adds hashtags to every tweet. Finally, we loop through each entry to find the unique hashtags, removing duplicates. We end by printing this unique hashtag list.
In this case there is only one output, “#bitcoin”.
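The filtering and de-duplication described above can be sketched in plain Python. It assumes each entry in the hashtags column is a list of strings (empty when a tweet has no hashtags); the function name unique_hashtags is hypothetical.

```python
def unique_hashtags(hashtag_lists):
    """Return each hashtag once, in order of first appearance."""
    unique = []
    for hashtags in hashtag_lists:
        if not hashtags:  # skip tweets without hashtags
            continue
        for tag in hashtags:
            if tag not in unique:
                unique.append(tag)
    return unique


# print(unique_hashtags(Tweets_df["hashtags"].to_list()))
```

If the data was loaded from a CSV file instead, the hashtag entries are likely to be strings rather than lists and would need parsing first.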
We want to create a word cloud to visualise the possible topics a user is discussing during the captured period. The wordcloud package serves this purpose.
We create a list of the tweets from the associated column. We iterate through each tweet, converting the values to strings, tokenizing these strings and converting them to lowercase. Once each word has been cleansed in this way, it is added to the words variable, which is used to generate a word cloud using the WordCloud() class. Finally, we plot the word cloud using matplotlib.
A word cloud composed of tweets from jack during Feb - Mar 2021.
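A sketch of the word cloud steps described above. The function names are hypothetical, and the width, height and background colour are illustrative choices; WordCloud(...).generate() and the matplotlib calls are the standard way to render the result.

```python
def tweets_to_words(tweets):
    """Join all tweets into one lowercase, space-separated string."""
    words = ""
    for tweet in tweets:
        tokens = str(tweet).split()  # tokenize on whitespace
        words += " ".join(token.lower() for token in tokens) + " "
    return words.strip()


def plot_word_cloud(words):
    """Render a word cloud from the cleaned text."""
    # Imported lazily; these packages are only needed when plotting.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white").generate(words)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()


# plot_word_cloud(tweets_to_words(Tweets_df["tweet"].to_list()))
```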
Further Analysis and Improvements
This is only a small set of what is possible with Python, pandas, matplotlib and TWINT, of course. There are also a number of things we can do to improve on what we have done here. For example, the word cloud does not remove hyperlinks or emojis; on large datasets this unnecessary analysis would add up.
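As one possible starting point for that clean-up, here is a rough sketch that strips hyperlinks and most emojis from a tweet before it reaches the word cloud. The approach is deliberately crude and is an assumption, not part of the workflow above: it drops every character outside the Basic Multilingual Plane, which covers most emojis but also leaves a few older symbols in place.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def strip_links_and_emojis(text):
    """Remove URLs and characters outside the Basic Multilingual Plane."""
    text = URL_PATTERN.sub("", text)
    # Most emojis have code points above U+FFFF; keep everything at or below it.
    text = "".join(ch for ch in text if ord(ch) <= 0xFFFF)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```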
- Matplotlib (2021) Matplotlib. https://matplotlib.org
- Pandas (2021) Pandas. https://pandas.pydata.org
- Pandas (2021) Pandas.DataFrame. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
- Twitter (2021) Profile: Jack. https://twitter.com/jack
- Wikipedia (2021) Jack Dorsey. https://en.wikipedia.org/wiki/Jack_Dorsey
- Wordcloud (2021) WordCloud for Python documentation. https://amueller.github.io/word_cloud/