birdspotter package¶
Submodules¶
birdspotter.BirdSpotter module¶
birdspotter is a python package providing a toolkit to measures the social influence and botness of twitter users.
-
class
birdspotter.BirdSpotter.
BirdSpotter
(path, tweetLimit=None, embeddings='download', quiet=False)¶ Bases:
object
Birdspotter measures the social influence and botness of twitter users.
This class takes a twitter dump in (json or jsonl format) and extract metrics bot and influence metrics for the users. The class will download word2vec embeddings if they are not specified. It exposes processed data from the tweet dumps.
-
cascadeDataframe
¶ A dataframe of tweets ordered by cascades and time (the column casIndex denotes which cascade each tweet belongs to)
Type: pandas.DataFrame
-
featureDataframe
¶ A dataframe of users with their respective botness and influence scores.
Type: pandas.DataFrame
-
hashtagDataframe
¶ A dataframe of the text features for hashtags.
Type: pandas.DataFrame
-
extractTweets
(filePath, tweetLimit=None, embeddings='download')¶ Extracts tweets from a json or jsonl file and generates cascade, feature and hashtag dataframes as class attributes.
Note that we use the file extension to determine how to handle the file.
Parameters: - filePath (str) – The path to a tweet json or jsonl file containing the tweets for analysis.
- tweetLimit (int, optional) – A limit on the number of tweets to process if the tweet dump is too large, if None then all tweets are processed, by default None
- embeddings (collections.Mapping or str or None, optional) – A method for loading word2vec embeddings. A path to a embeddings pickle or txt file, a mapping object, the string ‘download’, by default ‘download’. If None, it does nothing.
Returns: A dataframe of user’s botness and influence scores (and other features).
Return type: DataFrame
-
getBotAnnotationTemplate
(filename='annotationTemplate.csv')¶ Writes a CSV with the list of users and a blank column “isbot” to be annotated.
A helper function which outputs a CSV to be annotated by a human. The output is a list of users with the blank “isbot” column.
Parameters: filename (str) – The name of the file to write the CSV Returns: A dataframe of the users, with their screen names and a blank “is_bot” column. Return type: Dataframe
-
getBotness
()¶ Adds the botness of users to the feature dataframe.
It requires the tweets be extracted and the classifier be trained, otherwise exceptions are raised respectively.
Returns: The current feature dataframe of users, with associated botness scores appended. Return type: DataFrame Raises: Exception
– Tweets haven’t been extracted yet. Need to run extractTweets.
-
getCascadeMembership
()¶
-
getCascadesDataFrame
()¶ Adds botness column and standard influence to the cascade dataframe and returns the cascadeDataframe
-
getInfluenceScores
(params={'alpha': None, 'beta': 1.0, 'time_decay': -6.8e-05})¶ Adds a specified influence score to feature dataframe
The specified influence will appear in the returned feature df, under the column ‘influence (<alpha>,<time_decay>,<beta>)’.
Parameters: - time_decay (float, optional) – The time-decay r parameter described in the paper, by default -0.000068
- alpha (float, optional) – A float between 0 and 1, as described in the paper. If None DebateNight method is used, else spatial-decay method, by default None
- beta (float, optional) – A social strength hyper-parameter, by default 1.0
Returns: The current feature dataframe of users, with associated botness scores.
Return type: Dataframe
Raises: Exception
– Tweets haven’t been extracted yet. Need to run extractTweets.
-
getLabeledUsers
(out=None)¶ Generates a standard dataframe of users with botness and DebateNight influence scores (and other features), and optionally outputs a csv.
Parameters: out (str, optional) – A output path for a csv of the results, by default None Returns: A dataframe of the botness and influence scores (and other feautes) of each user Return type: DataFrame Raises: Exception
– Tweets haven’t been extracted yet
-
getLabels
()¶ Adds labels of users to the feature dataframe.
It requires the tweets be extracted and the classifier be trained, otherwise exceptions are raised respectively.
Returns: The current feature dataframe of users, with associated label scores appended. Return type: DataFrame Raises: Exception
– Tweets haven’t been extracted yet. Need to run extractTweets.
-
loadClassifierModel
(fname)¶ Loads the XGB booster model, from the saved XGB binary file
Parameters: fname (str) – The path to the XGB binary file
-
loadPickledBooster
(fname)¶ Loads the pickled booster model
Parameters: fname (str) – The path to the pickled xgboost booster
-
process_tweet
(j)¶
-
set_word_embeddings
(embeddings='download', force_reload=True)¶ Sets the word2vec embeddings. The embeddings can be a path to a pickle or txt file, a mapping object or the string ‘download’ which will automatically download and use the FastText ‘wiki-news-300d-1M.vec’ if not available in the current path.
Parameters: - embeddings (collections.Mapping or str or None, optional) – A method for loading word2vec embeddings. A path to a embeddings pickle or txt file, a mapping object, or the string ‘download’, by default ‘download’. If None, it does nothing.
- force_reload (bool, optional) – If the embeddings are already set, force_reload determines whether to update them, by default True
-
trainClassifierModel
(labelledData, targetColumnName='isbot', saveFileName=None, update=False, iterations=100, hyper_parameter_search=True)¶ Trains the bot detection classifier.
Trains the bot detection classifier, using an XGB classifier. Due to the way XGB works, the features used are the intersection, between the features from the tweet dumps and the features from the training set.
Parameters: - labelledData (str or pandas.DataFrame) – A path to the data with bot labels, as either csv or pickled dataframe, or a dataframe
- targetColumnName (str) – The name of the column, describing whether a user is a bot or not, by default ‘isbot’
- saveFileName (str, optional) – The path of the file, to save the XGB model binary, which can be loaded with loadClassifierModel, by default None
- update (bool, optional) – Determines whether data will improve current classifier or restart training, by default False
- iterations (int, optional) – Determines the number of times the classifier training will iterate through data, by default 100
- hyper_parameter_search (bool, optional) – Determines if the hyper-parameters of the classifier should be search or if the default parameters will be used. The search may be time-consuming for large training datasets and doesn’t work with update flag.
-
birdspotter.user_influence module¶
-
birdspotter.user_influence.
P
(cascade, r=-6.8e-05, beta=1.0)¶ Computes the P matrix of a cascade
The P matrix describes the stochastic retweet graph.
Parameters: - cascade (DataFrame) – A dataframe describing a single cascade, with a time column ascending from 0, a magnitude column and index of user ids
- r (float, optional) – The time-decay r parameter described in the paper, by default -0.000068
- beta (float, optional) – A social strength hyper-parameter, by default 1.0
Returns: A matrix of size (n,n), where n is the number of tweets in the cascade, where P[i][j] is the probability that j is a retweet of tweet i.
Return type: array-like
-
birdspotter.user_influence.
casIn
(cascade, time_decay=-6.8e-05, alpha=None, beta=1.0)¶ Computes influence in one cascade
Parameters: - cascade (str or DataFrame) – Path to one cascade in a file
- time_decay (float) – The r parameter described in the paper
- alpha (float, optional) – A float between 0 and 1, as described in the paper. If None DebateNight method is used, else spatial-decay method, by default None
Returns: A dataframe describing the influence of each user in a single cascade.
Return type: DataFrame
-
birdspotter.user_influence.
influence
(p, alpha=None)¶ Estimates user influence
This function compute the user influence and store it in matirx m_ij
Parameters: - p (array-like) – The P matrix describing the stochastic retweet graph
- alpha (float, optional) – A float between 0 and 1, as described in the paper. If None DebateNight method is used, else spatial-decay method, by default None
Returns: A n-array describing the influence of n users/tweets and the (n,n)-array describing the intermediary contribution of influence between tweets
Return type: array-like, array-like
birdspotter.utils module¶
-
birdspotter.utils.
getSource
(sourcestring)¶
-
birdspotter.utils.
getTextFeatures
(key, text)¶
-
birdspotter.utils.
getURLs
(string)¶
-
birdspotter.utils.
grep
(sourcestring, pattern)¶
-
birdspotter.utils.
hourofweekday
(datestring)¶
-
birdspotter.utils.
parse
(x)¶