In this tutorial I will show you how to create a relationship graph by extracting tweets.
The data are available on the OSINT-FR GitHub.
Pre-requisite installation :
I advise you to install the twint module with this command :
Step 1: Collecting tweets
For example, run the command :
Here is the command manual : https://github.com/twintproject/twint/wiki/Basic-usage
This will produce a CSV file with these columns :
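The same kind of collection can also be scripted from Python through twint's `Config` object. This is only a minimal sketch: the search term `#example`, the output name `tweets.csv` and the helper names are hypothetical, and `collect_tweets` needs the third-party twint module and network access to run.

```python
def make_search_settings(query, output_path, limit=None):
    """Build the twint Config attributes for a CSV search export."""
    settings = {"Search": query, "Store_csv": True, "Output": output_path}
    if limit is not None:
        settings["Limit"] = limit  # optional cap on the number of tweets
    return settings


def collect_tweets(settings):
    """Run the search with twint (third-party module, imported lazily)."""
    import twint

    config = twint.Config()
    for name, value in settings.items():
        setattr(config, name, value)
    twint.run.Search(config)


if __name__ == "__main__":
    # Hypothetical search term and output file, for illustration only.
    collect_tweets(make_search_settings("#example", "tweets.csv"))
```

Keeping the settings in a plain dictionary makes it easy to check them before launching a long scrape.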
id, conversation_id, created_at, date, time, timezone, user_id, username, name,
place, tweet, language, mentions, urls, photos, replies_count, retweets_count,
likes_count, hashtags, cashtag, link, retweet, quote_url, video, thumbnail,
near, geo, source, user_rt_id, user_rt, retweet_id, reply_to, retweet_date,
translate, trans_src, trans_dest
There is often a point where the scraping runs out and twint can’t go any further into the past.
Here we get a 1.3 MB file containing 2,195 tweets. When you open the CSV file with Excel or LibreOffice, select only the “tab” separator. The easiest way I found to get Python to digest this CSV is to save it in XLSX format.
Python can then read it with this code :
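As a minimal sketch, the export can also be loaded without going through XLSX, by reading the tab-separated CSV directly with the standard library. This is an alternative to the XLSX route, not the code originally used here; the file name `tweets.csv` and the function name are hypothetical.

```python
import csv


def load_tweets(path, delimiter="\t"):
    """Read a tab-separated export into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


if __name__ == "__main__":
    tweets = load_tweets("tweets.csv")  # hypothetical file name
    print(len(tweets), "tweets loaded")
```

Every value comes back as a string, so columns such as `mentions` still need to be parsed before use.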
Step 2: Clean up the file
For our relationship graph, the nodes are the users (we will use the username column) and the links are the mentions, so there is no need for the whole file. With this Python script, we keep only the tweets that contain a mention and extract the usernames so that the result is easy to read in the next step.
Important note: in this example I assume that every mentioned username also belongs to someone who tweeted. This is not necessarily the case.
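A minimal sketch of such a filter, under stated assumptions: each row is a dict with `username` and `mentions` keys, and the `mentions` cell holds the text form of a Python list, either of screen names or of dicts with a `screen_name` key. The function names, the lower-casing and the semicolon used to join several mentions are choices made for this illustration.

```python
import ast
import csv


def parse_mentions(cell):
    """Return the lower-cased screen names found in a `mentions` cell."""
    try:
        items = ast.literal_eval(cell) if cell else []
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, (list, tuple)):
        return []
    names = []
    for item in items:
        if isinstance(item, dict):  # assumed layout: {'screen_name': ...}
            item = item.get("screen_name", "")
        if item:
            names.append(str(item).lower())
    return names


def filter_mentions(rows):
    """Keep only tweets with at least one mention, reduced to two columns."""
    kept = []
    for row in rows:
        names = parse_mentions(row.get("mentions", ""))
        if names:
            kept.append({"username": row["username"].lower(),
                         "mentions": ";".join(names)})
    return kept


def write_edges(rows, path):
    """Write the filtered rows to a two-column CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["username", "mentions"])
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_edges(filter_mentions(rows), "mentions.csv")` would produce the reduced file, where `mentions.csv` is a hypothetical name.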
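To see how far this approximation holds on a given dataset, one can list the mentioned accounts that never tweeted themselves. This is a minimal sketch, assuming each row is a dict with a `mentions` string of semicolon-separated names; that layout and the function name are hypothetical.

```python
def mentioned_outside_authors(authors, rows):
    """Return the mentioned usernames that are not among the tweet authors."""
    known = {name.lower() for name in authors}
    mentioned = set()
    for row in rows:
        # Assumed layout: several mentions joined with ';' in one cell.
        mentioned.update(name.lower() for name in row["mentions"].split(";") if name)
    return sorted(mentioned - known)


if __name__ == "__main__":
    rows = [
        {"username": "alice", "mentions": "bob;carol"},
        {"username": "bob", "mentions": "alice"},
    ]
    print(mentioned_outside_authors(["alice", "bob"], rows))  # ['carol']
```

An empty result means the approximation is exact for that dataset.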
Step 3: Create the Gephi file
Go to https://medialab.github.io/table2net/ to use the CSV file from the previous step.
We check visually that the CSV file has been imported :
- Type of network: Normal (one type of node)
- Nodes : username (some usernames should appear below)
- Links : mentions (you should see the username appear below)
Step 4: Setting up Gephi
Open the file from the previous step in Gephi :
As there are not many nodes, there is no need to filter out those that do not have many links. We go straight to checking whether there are any communities.
Click on Modularity to compute the statistic.
In the left-hand tab, select Nodes / Partition / Modularity Class to colour the nodes by community, then click on Apply :
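To give an idea of what a modularity score measures, here is a minimal sketch that computes the modularity Q of a given partition for an undirected, unweighted graph. It does not reproduce Gephi's community search, only the score of a partition that is already known; the function name and the small example graph are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition: sum over communities of
    (internal edges / m) - (sum of degrees / 2m) ** 2."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


if __name__ == "__main__":
    # Two triangles joined by a single edge, one community per triangle.
    edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
    community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
    print(round(modularity(edges, community), 3))  # 0.357
```

A higher Q means that more links fall inside the communities than would be expected if the same links were placed at random.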
Finally, we will set the size of the nodes with the Size button on the right, then Nodes, then Ranking, then Apply :
Now you have to let Gephi do its calculations to arrange the nodes harmoniously.
In the left tab, choose a layout, Force Atlas 2. If you let it run without adjusting any parameters, you get this :
The nodes must be brought closer together and overlap prevented :
There are not many links in this example, and the users seem to be about equally important.
We can go to the Preview tab to see the labels.
If we check Show Labels and set the size to 5, we get this graph :
I reduced the maximum node size to 50 and refreshed :
You can save the image as PNG, SVG or PDF. The last two have the advantage of keeping the usernames as text.
I collected the last 75,000 tweets returned by the command (which gives tweets from 05 March 19:54 to 08 March 00:00) :
By extracting the interesting tweets (i.e. those with mentions), we get about 3,000. We see several communities without any overwhelming influencers, and a few accounts that link the communities together.