birdspotter
: A tool to measure social attributes of Twitter users¶
PyPI status PyPI version fury.io Documentation Status
birdspotter
is a python package providing a toolkit to measures the social influence and botness of twitter users. It takes a twitter dump input in json
or jsonl
format and produces measures for:
- Social Influence: The relative amount that one user can cause another user to adopt a behaviour, such as retweeting.
- Botness: The amount that a user appears automated.
References:¶
Rizoiu, M.A., Graham, T., Zhang, R., Zhang, Y., Ackland, R. and Xie, L. # DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 US Presidential Debate. In Twelfth International AAAI Conference on Web and Social Media (ICWSM'18), 2018. https://arxiv.org/abs/1802.09808
Ram, R., & Rizoiu, M.-A. A social science-grounded approach for quantifying online social influence. In Australian Social Network Analysis Conference (ASNAC'19) (p. 2). Adelaide, Australia, 2019.
Installation¶
pip3 install birdspotter
birdspotter
requires a python version >=3
.
Usage¶
To use birdspotter
on your own twitter dump, replace ‘./2016.json’ with the path to your twitter dump ‘./path/to/tweet/dump.json’. In this example we use Brendan Brown’s archive of @realdonaldtrump
tweets in 2016. It can be downloaded here.
from birdspotter import BirdSpotter
bs = BirdSpotter('./2016.json')
# This may take a few minutes, go grab a coffee...
labeledUsers = bs.getLabeledUsers(out='./output.csv')
After extracting the tweets, getLabeledDataFrame()
returns a pandas
dataframe with the influence and botness labels of users and writes a csv
file if a path is specified i.e. ./output.csv
.
birdspotter
relies on the Fasttext word embeddings wiki-news-300d-1M.vec, which will automatically be downloaded if not available in the current directory (./
) or a relative data folder (./data/
).
Get Cascades Data¶
After extracting the tweets, the retweet cascades are accessible by using:
cascades = bs.getCascadesDataFrame()
This dataframe includes the expected structure of the retweet cascade as given by Rizoiu et al. (2018) via the column expected_parent
in this dataframe.
Advanced Usage¶
Adding more influence metrics¶
birdspotter
provides DebateNight influence as a standard, when getLabeledUsers
is run. To generate spatial-decay influence run:
bs.getInfluenceScores(time_decay = -0.000068, alpha = 0.15, beta = 1.0)
This returns the updated featureDataframe
with influence scores appended, under the column influence (<alpha>,<time_decay>,<beta>)
.
Training with your own botness data¶
birdspotter
provides functionality for training the botness detector with your own training data. To generate an csv
to be annotated run:
bs.getBotAnnotationTemplate('./annotation_file.csv')
Once annotated the botness detector can be trained with:
bs.trainClassifierModel('./annotation_file.csv')
Defining your own word embeddings¶
birdspotter
provides functionality for defining your own word embeddings. For example:
customEmbedding # A mapping such as a dict() representing word embeddings
bs = BirdSpotter('./2016.json', embeddings=customEmbedding)
Embeddings can be set through several methods, refer to setWord2VecEmbeddings.
Note the default bot training data uses the wiki-news-300d-1M.vec and as such we would need to retrain the bot detector for alternative word embeddings.
Alternatives to python¶
Command-line usage¶
birdspotter
can be accessed through the command-line to return a csv
, with the recipe below:
birdspotter ./path/to/twitter/dump.json ./path/to/output/directory/
R usage¶
birdspotter
functionality can be accessed in R
via the reticulate
package. reticulate
still requires a python
installation on your system and birdspotter
to be installed. The following produces the same results as the standard usage.
install.packages("reticulate")
library(reticulate)
use_python(Sys.which("python3"))
birdspotter <- import("birdspotter")
bs <- birdspotter$BirdSpotter("./2016.json")
bs$getLabeledDataFrame(out = './output.csv')
Acknowledgements¶
The development of this package was partially supported through a UTS Data Science Institute seed grant.