birdspotter: A tool to measure social attributes of Twitter users¶
birdspotter is a Python package that provides a toolkit to measure the social influence and botness of Twitter users. It takes a Twitter dump in json or jsonl format as input and produces measures for:
- Social Influence: The relative amount that one user can cause another user to adopt a behaviour, such as retweeting.
- Botness: The amount that a user appears automated.
References:¶
Rizoiu, M.A., Graham, T., Zhang, R., Zhang, Y., Ackland, R. and Xie, L. #DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 US Presidential Debate. In Twelfth International AAAI Conference on Web and Social Media (ICWSM'18), 2018. https://arxiv.org/abs/1802.09808
Ram, R., & Rizoiu, M.-A. A social science-grounded approach for quantifying online social influence. In Australian Social Network Analysis Conference (ASNAC'19) (p. 2). Adelaide, Australia, 2019.
Installation¶
pip3 install birdspotter
birdspotter requires Python 3 or later.
Usage¶
To use birdspotter on your own Twitter dump, replace ‘./2016.json’ with the path to your dump, for example ‘./path/to/tweet/dump.json’. This example uses Brendan Brown’s archive of @realdonaldtrump tweets from 2016. It can be downloaded here.
from birdspotter import BirdSpotter
bs = BirdSpotter('./2016.json')
# This may take a few minutes, go grab a coffee...
labeledUsers = bs.getLabeledUsers(out='./output.csv')
After extracting the tweets, getLabeledUsers() returns a pandas dataframe with the influence and botness labels of users. If a path is passed via out (e.g. ./output.csv), it also writes the result to a csv file at that path.
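Once the csv file has been written, it can be inspected without re-running the extraction. The following is a minimal sketch using only the standard library. The helper name top_users is hypothetical, and the column name passed as score_column (for example 'botness') is an assumption about how the output columns are named, so check the header of your own file first.

```python
import csv
import math


def top_users(csv_path, score_column, n=10):
    """Return the n rows with the highest numeric value in score_column.

    Hypothetical helper: score_column must match a header in your csv;
    the exact column names written by birdspotter are assumed, not confirmed.
    """
    with open(csv_path, newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))

    scored = []
    for row in rows:
        try:
            score = float(row[score_column])
        except (KeyError, ValueError, TypeError):
            # Column missing, or the cell is empty / not numeric: skip the row.
            continue
        if math.isnan(score):
            continue
        scored.append((score, row))

    # Highest score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:n]]
```

For example, top_users('./output.csv', 'botness', n=5) would list the five rows with the largest values in a column named botness, if such a column exists.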
birdspotter relies on the fastText word embeddings wiki-news-300d-1M.vec, which are downloaded automatically if they are not found in the current directory (./) or in a relative data folder (./data/).
Get Cascades Data¶
After extracting the tweets, the retweet cascades are accessible by using:
cascades = bs.getCascadesDataFrame()
The column expected_parent of this dataframe gives the expected structure of the retweet cascade, as defined by Rizoiu et al. (2018).
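To illustrate how a parent column encodes a cascade tree, the sketch below groups records by their expected parent. Only the column name expected_parent comes from the description above. The helper name children_by_parent and the identifier key 'id' are assumptions for illustration, and whether expected_parent holds tweet or user identifiers should be checked against your own dataframe.

```python
from collections import defaultdict


def children_by_parent(records, parent_key='expected_parent', id_key='id'):
    """Map each parent identifier to the list of its expected children.

    records is an iterable of dicts, one per row. id_key is a hypothetical
    column name; replace it with the identifier column in your data.
    """
    children = defaultdict(list)
    for rec in records:
        parent = rec.get(parent_key)
        # Skip roots: a missing parent may appear as None or as NaN
        # (NaN is the only value that is not equal to itself).
        if parent is None or parent != parent:
            continue
        children[parent].append(rec[id_key])
    return dict(children)
```

A pandas dataframe can be turned into such a list of records with cascades.to_dict('records').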
Advanced Usage¶
Adding more influence metrics¶
birdspotter computes DebateNight influence by default when getLabeledUsers is run. To generate spatial-decay influence, run:
bs.getInfluenceScores(time_decay=-0.000068, alpha=0.15, beta=1.0)
This returns the updated featureDataframe with influence scores appended, under the column influence (<alpha>,<time_decay>,<beta>).
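Because each parameter setting adds its own column, it can help to list every influence column present after several runs. The helper below is a small sketch. Its name is hypothetical, and it assumes that all influence column names begin with the word influence, which the naming pattern above suggests but does not guarantee for the default DebateNight score.

```python
def influence_columns(columns):
    """Return the column names that start with 'influence', in order.

    Assumption: every influence score is stored under a name with
    this prefix.
    """
    return [c for c in columns if str(c).startswith('influence')]
```

With a pandas dataframe df, influence_columns(df.columns) returns the matching names, which can then be used to select those columns.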
Training with your own botness data¶
birdspotter provides functionality for training the botness detector with your own training data. To generate a csv file to be annotated, run:
bs.getBotAnnotationTemplate('./annotation_file.csv')
Once the file is annotated, the botness detector can be trained with:
bs.trainClassifierModel('./annotation_file.csv')
Defining your own word embeddings¶
birdspotter provides functionality for defining your own word embeddings. For example:
customEmbedding = {}  # Placeholder: fill with a mapping (such as a dict) representing your word embeddings
bs = BirdSpotter('./2016.json', embeddings=customEmbedding)
Embeddings can be set in several ways; refer to setWord2VecEmbeddings for details.
Note that the default bot training data uses wiki-news-300d-1M.vec, so the bot detector needs to be retrained if alternative word embeddings are used.
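One way to build such a mapping is to read a file in the fastText .vec text format, where the first line holds the vocabulary size and the vector dimension and each following line holds a word followed by its vector components. The sketch below is illustrative only: the helper name load_vec is hypothetical, and whether BirdSpotter expects plain lists or NumPy arrays as values is not stated here, so convert the values if needed.

```python
def load_vec(path, limit=None):
    """Read a fastText-style .vec text file into a dict of word -> list of floats.

    limit optionally caps the number of words loaded, which keeps memory
    use small when experimenting with large embedding files.
    """
    embeddings = {}
    with open(path, encoding='utf-8', newline='\n', errors='ignore') as f:
        # Header line: "<vocabulary size> <dimension>"
        header = f.readline().split()
        dim = int(header[1])
        for line in f:
            if limit is not None and len(embeddings) >= limit:
                break
            parts = line.rstrip().split(' ')
            # Skip malformed lines whose length does not match the header.
            if len(parts) != dim + 1:
                continue
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```

The result could then be passed as the embeddings argument shown above, for instance with a hypothetical file ./my-embeddings.vec.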
Alternatives to Python¶
Command-line usage¶
birdspotter can also be run from the command line to produce a csv file, using the recipe below:
birdspotter ./path/to/twitter/dump.json ./path/to/output/directory/
R usage¶
birdspotter functionality can be accessed in R via the reticulate package. reticulate still requires a Python installation on your system, with birdspotter installed in it. The following produces the same results as the standard usage.
install.packages("reticulate")
library(reticulate)
use_python(Sys.which("python3"))
birdspotter <- import("birdspotter")
bs <- birdspotter$BirdSpotter("./2016.json")
bs$getLabeledUsers(out = "./output.csv")
Acknowledgements¶
The development of this package was partially supported through a UTS Data Science Institute seed grant.