The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Scalable and Generalizable Social Bot Detection through Data Selection

Kai-Cheng Yang,1 Onur Varol,2 Pik-Mai Hui,1 Filippo Menczer1,3
1Center for Complex Networks and Systems Research, Indiana University, Bloomington, IN, USA
2Center for Complex Networks Research, Northeastern University, Boston, MA, USA
3Indiana University Network Science Institute, Bloomington, IN, USA

Abstract

Efficient and reliable social bot classification is crucial for detecting information manipulation on social media. Despite rapid development, state-of-the-art bot detection models still face generalization and scalability challenges, which greatly limit their applications. In this paper we propose a framework that uses minimal account metadata, enabling efficient analysis that scales up to handle the full stream of public tweets of Twitter in real time. To ensure model accuracy, we build a rich collection of labeled datasets for training and validation. We deploy a strict validation system so that model performance on unseen datasets is also optimized, in addition to traditional cross-validation. We find that strategically selecting a subset of training data yields better model accuracy and generalization than exhaustively training on all available data. Thanks to the simplicity of the proposed model, its logic can be interpreted to provide insights into social bot characteristics.

Introduction

Social bots are social media accounts controlled in part by software. Malicious bots serve as important instruments for orchestrated, large-scale opinion manipulation campaigns on social media (Ferrara et al. 2016a; Subrahmanian et al. 2016). Bots have been actively involved in online discussions of important events, including recent elections in the US and Europe (Bessi and Ferrara 2016; Deb et al. 2019; Stella, Ferrara, and De Domenico 2018; Ferrara 2017). Bots are also responsible for spreading low-credibility information (Shao et al. 2018) and extreme ideology (Berger and Morgan 2015; Ferrara et al. 2016b), as well as adding confusion to the online debate about vaccines (Broniatowski et al. 2018).

Scalable and reliable bot detection methods are needed to estimate the influence of social bots and develop effective countermeasures. The task is challenging due to the diverse and dynamic behaviors of social bots. For example, some bots act autonomously with minimal human intervention, others are manually controlled so that a single entity can create the appearance of multiple human accounts (Grimme et al. 2017). And while some bots are active continuously, others focus bursts of activity on different short-term targets. It is extremely difficult for humans with limited access to social media data to recognize bots, making people vulnerable to manipulation.

Many machine learning frameworks for bot detection on Twitter have been proposed in the past several years (see a recent survey (Alothali et al. 2018) and the Related Work section). However, two major challenges remain: scalability and generalization.

Scalability enables analysis of streaming data with limited computing resources. Yet most current methods attempt to maximize accuracy by inspecting rich contextual information about an account's actions and social connections. Twitter API rate limits and algorithmic complexity make it impossible to scale up to real-time detection of manipulation campaigns (Stella, Ferrara, and De Domenico 2018).

Generalization enables the detection of bots that are different from those in the training data, which is critical given the adversarial nature of the problem: new bots are easily designed to evade existing detection systems. Methods considering limited behaviors, like retweeting and temporal patterns, can only identify certain types of bots. Even more general-purpose systems are trained on labeled datasets that are sparse, noisy, and not representative enough. As a result, methods with excellent cross-validation accuracy suffer dramatic drops in performance when tested on cross-domain accounts (De Cristofaro et al. 2018).

In this paper, we propose a framework to address these two challenges. By focusing just on user profile information, which can be easily accessed in bulk, scalability is greatly improved. While the use of fewer features may entail a small compromise in individual accuracy, we gain the ability to analyze a large-volume stream of accounts in real time.

We build a rich collection of labeled data by compiling all labeled datasets available in the literature and three newly-produced ones. We systematically analyze the feature space of different datasets and the relationship between them; some of the datasets tend to overlap with each other while others have contradicting patterns.

The diversity of the compiled datasets allows us to build a more robust classifier. Surprisingly, by carefully selecting a subset of the training data instead of merging all of the datasets, we achieve better performance in terms of cross-validation, cross-domain generalization, and consistency with a widely-used reference system. We also show that the resulting bot detector is more interpretable. The lessons learned from the proposed data selection approach can be generalized to other adversarial problems.

Related Work

Most bot detection methods are based on supervised machine learning and involve manually labeled data (Ferrara et al. 2016a; Alothali et al. 2018). Popular approaches leverage user, temporal, content, and social network features with random forest classifiers (Davis et al. 2016; Varol et al. 2017; Gilani, Kochmar, and Crowcroft 2017; Yang et al. 2019). Faster classification can be obtained using fewer features and logistic regression (Ferrara 2017; Stella, Ferrara, and De Domenico 2018). Some methods include information from tweet content in addition to the metadata to detect bots at the tweet level (Kudugunta and Ferrara 2018). A simpler approach is to test the randomness of the screen name (Beskow and Carley 2019). The method presented here also adopts the supervised learning framework with random forest classifiers, attempting to achieve both the scalability of the faster methods and the accuracy of the feature-rich methods. We aim for greater generalization than all the models in the literature.

Another strain of bot detection methods goes beyond individual accounts to consider collective behaviors. Unsupervised learning methods are used to find improbable similarities among accounts. No human-labeled datasets are needed. Examples of this approach leverage similarity in post timelines (Chavoshi, Hamooni, and Mueen 2016), action sequences (Cresci et al. 2016; 2018a), content (Chen and Subramanian 2018), and friends/followers (Jiang et al. 2016). Coordinated retweeting behavior has also been used (Mazza et al. 2019). Methods aimed at detecting coordination are very slow as they need to consider many pairs of accounts. The present approach is much faster, but it considers individual accounts and so cannot detect coordination.

Feature Engineering

All Twitter bot detection methods need to query data before performing any evaluation, so they are bounded by API limits. Take Botometer, a popular bot detection tool, as an example. The classifier uses over 1,000 features from each account (Varol et al. 2017; Yang et al. 2019). To extract these features, the classifier requires the account's most recent 200 [...] that bounds Botometer. Moreover, each tweet collected from Twitter has an embedded user object. This brings two extra advantages. First, once tweets are collected, no extra queries are needed for bot detection. Second, while users lookups always report the most recent user profile, the user object embedded in each tweet reflects the user profile at the moment when the tweet is collected. This makes bot detection on archived historical data possible.

Table 1 lists the features extracted from the user object. The rate features build upon the user age, which requires the probe time to be available. When querying the users lookup API, the probe time is when the query happens. If the user object is extracted from a tweet, the probe time is the tweet creation time (created_at field). The user age is defined as the hour difference between the probe time and the creation time of the user (created_at field in the user object). User ages are associated with the data collection time, an artifact irrelevant to bot behaviors. In fact, tests show that including age in the model deteriorates accuracy. However, age is used to calculate the rate features. Every count feature has a corresponding rate feature to capture how fast the account is tweeting, gaining followers, and so on. In the calculation of the ratio between followers and friends, the denominator is max(friends_count, 1) to avoid division-by-zero errors.
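As an illustration of the age and rate computations, the following Python sketch derives rate features from a raw user object. It relies only on standard user-object fields (created_at and the count fields); the function name, the derived feature names, and the one-hour age floor are illustrative choices for this sketch, not the paper's released implementation.

```python
from datetime import datetime, timezone

# Twitter's legacy timestamp format, e.g. "Mon Apr 26 06:01:55 +0000 2010"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Count fields available in the user object; the exact feature set used in
# the paper is listed in its Table 1.
COUNT_FIELDS = [
    "statuses_count",
    "followers_count",
    "friends_count",
    "favourites_count",
    "listed_count",
]


def derive_features(user, probe_time=None):
    """Compute age-normalized rate features from a raw user object.

    probe_time is the query time when the user object comes from the users
    lookup API, or the tweet's created_at when the object is embedded in a
    tweet; it defaults to the current time.
    """
    if probe_time is None:
        probe_time = datetime.now(timezone.utc)
    created = datetime.strptime(user["created_at"], TWITTER_TIME_FORMAT)

    # User age in hours. The one-hour floor for brand-new accounts is a
    # safeguard added in this sketch, not something specified in the paper.
    age_hours = max((probe_time - created).total_seconds() / 3600.0, 1.0)

    # Age itself is excluded from the features (the paper reports that it
    # hurts accuracy); it is only used to normalize the counts.
    features = {}
    for field in COUNT_FIELDS:
        count = user.get(field, 0)
        features[field] = count
        # Every count feature gets a corresponding rate feature.
        features[field + "_per_hour"] = count / age_hours

    # Ratio of followers to friends, with the guarded denominator from the text.
    features["followers_friends_ratio"] = user.get("followers_count", 0) / max(
        user.get("friends_count", 0), 1
    )
    return features
```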
The screen name likelihood feature is inspired by the observation that bots sometimes have a random string as screen name (Beskow and Carley 2019). Twitter only allows letters (upper and lower case), digits, and underscores in the screen_name field, with a 15-character limit. We collected over 2M unique screen names and constructed the likelihood of all 3,969 possible bigrams. The likelihood of a screen name is defined by the geometric-mean likelihood of all bigrams in it. We do not consider longer n-grams as they require more resources with limited advantages. Tests show the likelihood feature can effectively distinguish random strings from authentic screen names.
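A minimal sketch of how such a bigram likelihood can be estimated and applied is shown below, assuming a reference corpus of screen names is available. The function names and the smoothing floor for unseen bigrams are assumptions of this sketch rather than details given in the paper.

```python
import math
from collections import Counter


def bigram_likelihoods(screen_names):
    """Estimate bigram likelihoods from a reference corpus of screen names."""
    counts = Counter()
    for name in screen_names:
        counts.update(name[i:i + 2] for i in range(len(name) - 1))
    total = sum(counts.values())
    return {bigram: c / total for bigram, c in counts.items()}


def screen_name_likelihood(name, bigram_probs, floor=1e-9):
    """Geometric mean of the likelihoods of all bigrams in a screen name.

    `floor` handles bigrams never seen in the reference corpus; the paper
    does not describe its smoothing strategy, so this is an assumption.
    """
    bigrams = [name[i:i + 2] for i in range(len(name) - 1)]
    if not bigrams:  # a single-character name has no bigrams
        return floor
    log_sum = sum(math.log(bigram_probs.get(b, floor)) for b in bigrams)
    return math.exp(log_sum / len(bigrams))


# Usage: authentic-looking names should score higher than random strings.
# probs = bigram_likelihoods(reference_names)  # ~2M names in the paper
# screen_name_likelihood("john_smith", probs) > screen_name_likelihood("xK9qZ2vWp1", probs)
```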
Datasets

To train and test our model, we collect all public datasets of labeled human and bot accounts and create three new ones, all available in the bot repository (botometer.org/bot-repository). See overview in Table 2.

Let us briefly summarize how the datasets were collected. The caverlee dataset consists of bot accounts lured by honeypot accounts and verified human accounts (Lee, Eoff, and Caverlee 2011). This dataset is older than the others.
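Combining labeled accounts such as these with the user-object features above yields the supervised setup described in Related Work: a random forest classifier trained and evaluated by cross-validation. The scikit-learn sketch below is an illustrative outline with assumed hyperparameters and input format, not the paper's released code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def train_bot_classifier(feature_dicts, labels, feature_names):
    """Train a random forest bot classifier on user-metadata features.

    feature_dicts: one dict of numeric features per account (for example,
    the output of the feature-extraction sketch above).
    labels: 1 for bot, 0 for human, as provided by a labeled dataset.
    """
    X = [[fd[name] for name in feature_names] for fd in feature_dicts]

    # Hyperparameters here are assumptions of this sketch.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)

    # Traditional cross-validation on the training data; the paper also
    # evaluates on entirely held-out datasets to measure generalization.
    cv_auc = cross_val_score(clf, X, labels, cv=5, scoring="roc_auc")

    clf.fit(X, labels)
    return clf, cv_auc
```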