birdspotter package

Submodules

birdspotter.BirdSpotter module

birdspotter is a Python package providing a toolkit to measure the social influence and botness of Twitter users.

class birdspotter.BirdSpotter.BirdSpotter(path, tweetLimit=None, embeddings='download', quiet=False)

Bases: object

Birdspotter measures the social influence and botness of twitter users.

This class takes a Twitter dump (in json or jsonl format) and extracts bot and influence metrics for the users. The class will download word2vec embeddings if they are not specified. It exposes processed data from the tweet dumps.

cascadeDataframe

A dataframe of tweets ordered by cascades and time (the column casIndex denotes which cascade each tweet belongs to)

Type:pandas.DataFrame
featureDataframe

A dataframe of users with their respective botness and influence scores.

Type:pandas.DataFrame
hashtagDataframe

A dataframe of the text features for hashtags.

Type:pandas.DataFrame
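
A minimal usage sketch of the class, built only from the signatures listed in this reference. The function name label_users and its arguments are hypothetical, and the sketch assumes that constructing the object processes the dump at the given path; if it does not, extractTweets would need to be called first.

```python
def label_users(dump_path, out_csv=None):
    """Return the labeled-user dataframe for a tweet dump (sketch)."""
    # Imported lazily so the sketch can be loaded without the package installed.
    from birdspotter.BirdSpotter import BirdSpotter

    # Assumption: constructing the object extracts tweets from dump_path.
    spotter = BirdSpotter(dump_path, tweetLimit=None, embeddings='download', quiet=True)

    # Per this reference, getLabeledUsers returns botness and influence scores
    # and optionally writes them to a CSV file.
    return spotter.getLabeledUsers(out=out_csv)
```

After a call such as label_users('tweets.jsonl'), where the file name is a placeholder, the same data should also be reachable through the featureDataframe attribute of the object.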
extractTweets(filePath, tweetLimit=None, embeddings='download')

Extracts tweets from a json or jsonl file and generates cascade, feature and hashtag dataframes as class attributes.

Note that we use the file extension to determine how to handle the file.

Parameters:
  • filePath (str) – The path to a tweet json or jsonl file containing the tweets for analysis.
  • tweetLimit (int, optional) – A limit on the number of tweets to process, useful when the tweet dump is too large. If None, all tweets are processed. By default None.
  • embeddings (collections.Mapping or str or None, optional) – A method for loading word2vec embeddings: a path to an embeddings pickle or txt file, a mapping object, or the string ‘download’, by default ‘download’. If None, it does nothing.
Returns:

A dataframe of user’s botness and influence scores (and other features).

Return type:

DataFrame
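
The extension-based handling noted above can be pictured with a short sketch. It is not the package's code: it assumes a .jsonl file holds one JSON object per line and a .json file holds a single JSON array, and read_tweets is a hypothetical helper.

```python
import json
import os


def read_tweets(file_path, tweet_limit=None):
    """Load tweet objects from a .json or .jsonl file, choosing the parser by extension."""
    ext = os.path.splitext(file_path)[1].lower()
    tweets = []
    with open(file_path, encoding="utf-8") as handle:
        if ext == ".jsonl":
            # One JSON object per non-empty line.
            for line in handle:
                if tweet_limit is not None and len(tweets) >= tweet_limit:
                    break
                if line.strip():
                    tweets.append(json.loads(line))
        elif ext == ".json":
            # Assumption: the whole file is a single JSON array of tweets.
            data = json.load(handle)
            tweets = data if tweet_limit is None else data[:tweet_limit]
        else:
            raise ValueError(f"unsupported extension: {ext!r}")
    return tweets
```

The tweet_limit argument mirrors the role of tweetLimit: reading stops once that many tweets have been collected.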

getBotAnnotationTemplate(filename='annotationTemplate.csv')

Writes a CSV with the list of users and a blank column “isbot” to be annotated.

A helper function which outputs a CSV to be annotated by a human. The output is a list of users with the blank “isbot” column.

Parameters:filename (str) – The name of the file to write the CSV
Returns:A dataframe of the users, with their screen names and a blank “isbot” column.
Return type:DataFrame
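
To illustrate the shape of such a template, here is a sketch using only the standard library. The header name screen_name and the helper write_annotation_template are hypothetical; only the blank “isbot” column comes from the description above.

```python
import csv
import io


def write_annotation_template(screen_names, handle):
    """Write one row per user with an empty 'isbot' cell for a human to fill in."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["screen_name", "isbot"])  # 'screen_name' is a hypothetical header
    for name in screen_names:
        writer.writerow([name, ""])  # blank label, to be annotated later


buffer = io.StringIO()
write_annotation_template(["alice", "bob"], buffer)
template_text = buffer.getvalue()
```

Once the blank cells are filled in, a file of this shape could plausibly serve as the labelledData input of trainClassifierModel, whose default target column is also ‘isbot’.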
getBotness()

Adds the botness of users to the feature dataframe.

It requires that the tweets have been extracted and the classifier has been trained; otherwise an exception is raised.

Returns:The current feature dataframe of users, with associated botness scores appended.
Return type:DataFrame
Raises:Exception – Tweets haven’t been extracted yet. Need to run extractTweets.
getCascadeMembership()
getCascadesDataFrame()

Adds a botness column and the standard influence to the cascade dataframe, then returns cascadeDataframe.

getInfluenceScores(params={'alpha': None, 'beta': 1.0, 'time_decay': -6.8e-05})

Adds a specified influence score to the feature dataframe.

The specified influence will appear in the returned feature df, under the column ‘influence (<alpha>,<time_decay>,<beta>)’.

Parameters:
  • time_decay (float, optional) – The time-decay r parameter described in the paper, by default -0.000068
  • alpha (float, optional) – A float between 0 and 1, as described in the paper. If None, the DebateNight method is used; otherwise the spatial-decay method is used. By default None.
  • beta (float, optional) – A social strength hyper-parameter, by default 1.0
Returns:

The current feature dataframe of users, with associated influence scores.

Return type:

DataFrame

Raises:

Exception – Tweets haven’t been extracted yet. Need to run extractTweets.
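
The column label pattern ‘influence (<alpha>,<time_decay>,<beta>)’ can be reproduced with a small sketch. It assumes each value is rendered with Python's default string conversion, which may not match the package's exact formatting; influence_column is a hypothetical helper.

```python
def influence_column(params):
    """Build a column label of the form 'influence (<alpha>,<time_decay>,<beta>)'."""
    # Assumption: each value is rendered with str(); the package may format differently.
    return "influence ({},{},{})".format(
        params["alpha"], params["time_decay"], params["beta"]
    )


default_params = {"alpha": None, "beta": 1.0, "time_decay": -6.8e-05}
label = influence_column(default_params)
```

If the assumption holds, the label built from the default params identifies the column to select from the returned feature dataframe.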

getLabeledUsers(out=None)

Generates a standard dataframe of users with botness and DebateNight influence scores (and other features), and optionally outputs a csv.

Parameters:out (str, optional) – An output path for a csv of the results, by default None
Returns:A dataframe of the botness and influence scores (and other features) of each user
Return type:DataFrame
Raises:Exception – Tweets haven’t been extracted yet
getLabels()

Adds labels of users to the feature dataframe.

It requires that the tweets have been extracted and the classifier has been trained; otherwise an exception is raised.

Returns:The current feature dataframe of users, with associated label scores appended.
Return type:DataFrame
Raises:Exception – Tweets haven’t been extracted yet. Need to run extractTweets.
loadClassifierModel(fname)

Loads the XGB booster model from the saved XGB binary file.

Parameters:fname (str) – The path to the XGB binary file
loadPickledBooster(fname)

Loads the pickled booster model

Parameters:fname (str) – The path to the pickled xgboost booster
process_tweet(j)
set_word_embeddings(embeddings='download', force_reload=True)

Sets the word2vec embeddings. The embeddings can be a path to a pickle or txt file, a mapping object or the string ‘download’ which will automatically download and use the FastText ‘wiki-news-300d-1M.vec’ if not available in the current path.

Parameters:
  • embeddings (collections.Mapping or str or None, optional) – A method for loading word2vec embeddings: a path to an embeddings pickle or txt file, a mapping object, or the string ‘download’, by default ‘download’. If None, it does nothing.
  • force_reload (bool, optional) – If the embeddings are already set, force_reload determines whether to update them, by default True
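
A mapping of embeddings can be pictured as a dict from word to vector. The sketch below assumes the txt file follows the word2vec text layout used by FastText .vec files: a header line with vocabulary size and dimension, then one word and its vector per line. parse_vec_lines is a hypothetical helper, not the package's loader.

```python
def parse_vec_lines(lines):
    """Parse word2vec text-format lines into a dict of word -> list of floats."""
    iterator = iter(lines)
    header = next(iterator).split()
    dimension = int(header[1])  # header is "<vocabulary size> <dimension>"
    embeddings = {}
    for line in iterator:
        parts = line.rstrip().split(" ")
        if len(parts) != dimension + 1:
            continue  # skip malformed or empty lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


sample = ["2 3", "bird 0.1 0.2 0.3", "bot -1.0 0.0 0.5"]
vectors = parse_vec_lines(sample)
```

Whether the package expects list or array values in such a mapping is not stated in this reference, so treat the value type here as an assumption.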
trainClassifierModel(labelledData, targetColumnName='isbot', saveFileName=None, update=False, iterations=100, hyper_parameter_search=True)

Trains the bot detection classifier.

Trains the bot detection classifier using an XGB classifier. Because of the way XGB works, the features used are the intersection of the features from the tweet dumps and the features from the training set.

Parameters:
  • labelledData (str or pandas.DataFrame) – A path to the data with bot labels, as either csv or pickled dataframe, or a dataframe
  • targetColumnName (str) – The name of the column describing whether a user is a bot or not, by default ‘isbot’
  • saveFileName (str, optional) – The path of the file in which to save the XGB model binary, which can be loaded with loadClassifierModel, by default None
  • update (bool, optional) – Determines whether data will improve current classifier or restart training, by default False
  • iterations (int, optional) – Determines the number of times the classifier training will iterate through data, by default 100
  • hyper_parameter_search (bool, optional) – Determines whether the hyper-parameters of the classifier are searched or the default parameters are used. The search may be time-consuming for large training datasets and doesn’t work with the update flag. By default True.
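
The feature intersection described for trainClassifierModel can be sketched as follows. The column names are invented for illustration and shared_features is a hypothetical helper; the package's own selection logic may differ in detail.

```python
def shared_features(dump_columns, training_columns, target="isbot"):
    """Return the feature names present in both sets, excluding the target column."""
    training = set(training_columns)
    # Keep the order of the dump's columns so the result is deterministic.
    return [c for c in dump_columns if c in training and c != target]


dump_cols = ["followers_count", "statuses_count", "hashtag_vec_0"]
train_cols = ["isbot", "statuses_count", "followers_count", "verified"]
features = shared_features(dump_cols, train_cols)
```

In this example only the two columns present on both sides survive, so a training set that lacks a feature effectively removes it from the classifier.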

birdspotter.user_influence module

birdspotter.user_influence.P(cascade, r=-6.8e-05, beta=1.0)

Computes the P matrix of a cascade

The P matrix describes the stochastic retweet graph.

Parameters:
  • cascade (DataFrame) – A dataframe describing a single cascade, with a time column ascending from 0, a magnitude column and index of user ids
  • r (float, optional) – The time-decay r parameter described in the paper, by default -0.000068
  • beta (float, optional) – A social strength hyper-parameter, by default 1.0
Returns:

A matrix of size (n,n), where n is the number of tweets in the cascade and P[i][j] is the probability that tweet j is a retweet of tweet i.

Return type:

array-like
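
One way to picture the P matrix is the sketch below. It assumes that each earlier tweet i is weighted by magnitude_i ** beta multiplied by exp(r * (t_j - t_i)), and that the weights are normalized over all tweets preceding j. This weighting is an assumption about the method, not a copy of the package's code; retweet_matrix is a hypothetical name, and magnitudes are assumed to be positive.

```python
import math


def retweet_matrix(times, magnitudes, r=-6.8e-05, beta=1.0):
    """Return P as nested lists, where P[i][j] weights tweet i as the source of tweet j."""
    n = len(times)
    p = [[0.0] * n for _ in range(n)]
    for j in range(1, n):
        # Assumed weight of each earlier tweet i: magnitude**beta, decayed by elapsed time.
        weights = [
            (magnitudes[i] ** beta) * math.exp(r * (times[j] - times[i]))
            for i in range(j)
        ]
        total = sum(weights)
        for i in range(j):
            p[i][j] = weights[i] / total
    return p


P = retweet_matrix(times=[0, 60, 3600], magnitudes=[100, 10, 50])
```

Under these assumptions every column after the first sums to 1, and the first tweet, which has no predecessor, has an all-zero column.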

birdspotter.user_influence.casIn(cascade, time_decay=-6.8e-05, alpha=None, beta=1.0)

Computes influence in one cascade

Parameters:
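  • cascade (str or DataFrame) – A path to a file containing one cascade, or a dataframe describing one cascade
  • time_decay (float, optional) – The time-decay r parameter described in the paper, by default -0.000068
  • alpha (float, optional) – A float between 0 and 1, as described in the paper. If None, the DebateNight method is used; otherwise the spatial-decay method is used. By default None.
  • beta (float, optional) – A social strength hyper-parameter, by default 1.0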
Returns:

A dataframe describing the influence of each user in a single cascade.

Return type:

DataFrame

birdspotter.user_influence.influence(p, alpha=None)

Estimates user influence

This function computes the user influence and stores it in the matrix m_ij.

Parameters:
  • p (array-like) – The P matrix describing the stochastic retweet graph
  • alpha (float, optional) – A float between 0 and 1, as described in the paper. If None, the DebateNight method is used; otherwise the spatial-decay method is used. By default None.
Returns:

An n-array describing the influence of n users/tweets, and the (n,n)-array describing the intermediary contribution of influence between tweets

Return type:

array-like, array-like
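
The relationship between P, the matrix m_ij and the influence vector can be sketched for the alpha=None case. The recursion used here (m_ii = 1 and m_ij equal to the sum over earlier tweets k of m_ik * P[k][j] ** 2) is an assumption about the DebateNight method; the package's implementation may differ, and the spatial-decay variant is not covered. tweet_influence is a hypothetical name.

```python
def tweet_influence(p):
    """Return (influence, m) for an upper-triangular retweet matrix p (nested lists)."""
    n = len(p)
    m = [[0.0] * n for _ in range(n)]
    for j in range(n):
        m[j][j] = 1.0  # every tweet fully accounts for itself
        for i in range(j):
            # Assumed recursion: influence reaches j through each earlier tweet k.
            m[i][j] = sum(m[i][k] * p[k][j] ** 2 for k in range(i, j))
    influence = [sum(row) for row in m]
    return influence, m


# A simple chain: tweet 1 retweets tweet 0, and tweet 2 retweets tweet 1.
chain = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0]]
scores, contributions = tweet_influence(chain)
```

In the chain example the first tweet is credited with all three tweets, the second with two and the last only with itself, which matches the intuition that influence accumulates along retweet paths.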

birdspotter.utils module

birdspotter.utils.getSource(sourcestring)
birdspotter.utils.getTextFeatures(key, text)
birdspotter.utils.getURLs(string)
birdspotter.utils.grep(sourcestring, pattern)
birdspotter.utils.hourofweekday(datestring)
birdspotter.utils.parse(x)

Module contents