Still out there: Modeling and Identifying Russian Troll Accounts on Twitter

Jane Im (University of Michigan, Ann Arbor, Michigan), Eshwar Chandrasekharan (University of Illinois at Urbana-Champaign, Urbana, Illinois), Jackson Sargent (University of Michigan, Ann Arbor, Michigan), Paige Lighthammer (University of Michigan, Ann Arbor, Michigan), Taylor Denby (University of Michigan, Ann Arbor, Michigan), Ankit Bhargava (University of Michigan, Ann Arbor, Michigan), Libby Hemphill (University of Michigan, Ann Arbor, Michigan), David Jurgens (University of Michigan, Ann Arbor, Michigan), Eric Gilbert (University of Michigan, Ann Arbor, Michigan)

ABSTRACT
There is evidence that Russia's Internet Research Agency attempted to interfere with the 2016 U.S. election by running fake accounts on Twitter—often referred to as "Russian trolls". In this work, we: 1) develop machine learning models that predict whether a Twitter account is a Russian troll within a set of 170K control accounts; and, 2) demonstrate that it is possible to use this model to find active accounts on Twitter still likely acting on behalf of the Russian state. Using both behavioral and linguistic features, we show that it is possible to distinguish between a troll and a non-troll with a precision of 78.5% and an AUC of 98.9%, under cross-validation. Applying the model to out-of-sample accounts still active today, we find that up to 2.6% of top journalists' mentions are occupied by Russian trolls. These findings imply that the Russian trolls are very likely still active today. Additional analysis shows that they are not merely software-controlled bots, and manage their online identities in various complex ways. Finally, we argue that if it is possible to discover these accounts using externally-accessible data, then the platforms—with access to a variety of private internal signals—should succeed at similar or better rates.

CCS CONCEPTS
• Human-centered computing → Empirical studies in collaborative and social computing.

KEYWORDS
Russian troll, misinformation, social media, political manipulation, political elections

ACM Reference Format:
Jane Im, Eshwar Chandrasekharan, Jackson Sargent, Paige Lighthammer, Taylor Denby, Ankit Bhargava, Libby Hemphill, David Jurgens, and Eric Gilbert. 2020. Still out there: Modeling and Identifying Russian Troll Accounts on Twitter. In 12th ACM Conference on Web Science (WebSci '20), July 6–10, 2020, Southampton, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3394231.3397889

1 INTRODUCTION
It is widely believed that Russia's Internet Research Agency (IRA) tried to interfere with the 2016 U.S. election as well as other elections by running fake accounts on Twitter—often called the "Russian troll" accounts [13, 16, 34]. This interference could have immense consequences considering the viral nature of some tweets [25, 26], the number of users exposed to Russian trolls' content [19, 33], and the critical role social media have played in past political campaigns [7].

In this paper, we develop models on a dataset of Russian trolls active on Twitter during the 2016 U.S. elections to predict currently active Russian trolls. We construct machine learning classifiers using profile elements, behavioral features, language distribution, function word usage, and linguistic features, on a highly unbalanced dataset of Russian troll accounts (2.2K accounts, or 1.4% of our sample) released by Twitter (https://about.twitter.com/en_us/values/elections-integrity.html#data) and "normal" control accounts (170K accounts, or 98.6% of our sample) collected by the authors. (See Figure 1 for a visual overview of the process used in this work.) Our goals are to determine whether "new" trolls can be identified by models built on "old" trolls and to demonstrate that troll detection is both possible and efficient, even with "old" data.

We find that it is possible to disambiguate between a Russian troll account and a large number of randomly selected control accounts.

Figure 1: Flowchart illustrating the steps of our research pipeline.

One model, a simple logistic regression, achieves a precision of 78.5% and an AUC of 98.9%. Next, we asked whether it was possible to use the model trained on past data to unmask Russian trolls currently active on Twitter (see Figure 2 for an example). The logistic regression is attractive in this context as its simplicity seems most likely to generalize to out-of-sample data. Toward that end, we apply our classifier to Twitter accounts that mentioned high-profile journalists in late 2018. We find the computational model flags 3.7% of them as statistically likely Russian trolls, and we find reasonable agreement between our classifier and human labelers.

Our model allows us to estimate the activity of trolls. As a case study, we estimate the activity of suspected Russian troll accounts engaging in one type of adversarial campaign: engaging with prominent journalists. Since we have no way of truly knowing which of these model-identified accounts are truly Russian trolls—perhaps only the IRA knows this—we perform a secondary human evaluation in order to establish consensus on whether the model is identifying validly suspicious accounts. Our human evaluation process suggests that roughly 70% of these model-flagged accounts—all of them still currently active on Twitter—are highly likely to be Russian trolls. As a result, we estimate that Russian trolls occupy 2.6% of the mentions of high-profile journalists. Moreover, we find that, in contrast with some prevailing narratives surrounding the Russian troll program, the model-flagged accounts do not score highly on the well-known Botometer scale [9], indicating that they are not simply automated software agents.

Finally, we perform an exploratory open coding of the identity deception strategies used by the currently active accounts discovered by our model. For instance, some pretend to be an American mother or a middle-aged white man via profile pictures and descriptions, but their tweet rates are abnormally high, and their tweets revolve solely around political topics.

This paper makes the following contributions, building on an emerging line of scholarship around the Russian troll accounts [5, 6, 17, 33, 35, 42]. First, we show that it is possible to separate Russian trolls from other accounts in the data previous to 2019, and that this computational model is still accurate on 2019 data. As a corollary, we believe this work establishes that a large number of Russian troll accounts are likely to be currently active on Twitter. Next, we find that accounts flagged by our model as Russian trolls are not merely bots but use diverse ways to build and manage their online identities. Finally, we argue that if it is possible to discover these accounts using externally-accessible data, then the social platforms—with access to a variety of private, internal signals—should succeed at similar or better rates at finding and deactivating Russian troll accounts.

2 RELATED WORK
First, we review what is known about Russia's interference in Western democracies via online campaigns, and then move on to the emerging work on these 2016 election-related Russian trolls themselves. We conclude by discussing work on social bots, and by reviewing theories of online deception that inform the quantitative approaches in this paper.

2.1 Russia's Interference in Political Campaigns
While state-level online interference in democratic processes is an emerging phenomenon, new research documents Russia's online political manipulation campaigns in countries other than the United States. For instance, previous work has shown that a high volume of Russian tweets were generated a few days before the voting day in the case of the 2016 E.U. Referendum (Brexit Referendum), and then dropped afterwards [16]. Furthermore, it is suspected that Russia is behind the MacronLeaks campaign that occurred during the 2017 French presidential election period [13], as well as the Catalonian referendum [34].

2.2 Emerging Work on the 2016 Russian Trolls
While a brand new area of scholarship, emerging work has examined the datasets of Russian trolls released by Twitter. Researchers from Clemson University identified five categories of trolls and argued that the behavior of these categories was radically different [5]. This was especially marked for left- and right-leaning accounts (the dataset contains both). For instance, the IRA promoted more left-leaning content than right-leaning on Facebook, while right-leaning Twitter handles received more engagement [33].

New work has looked at how the Russian troll accounts were retweeted in the context of the #BlackLivesMatter movement [35]—a movement targeted by the trolls. The retweets were divided among different political perspectives, and the trolls took advantage of this division. There is some disagreement about how predictable the Russian trolls are. Griffin and Bickel (2018) argue that the Russian trolls are composed of accounts with common but customized behavioral characteristics that can be used for future identification [17], while other work has shown that the trolls' tactics and targets change over time, implying that the task of automatic detection is not simple [42]. Finally, the Russian trolls show unique linguistic behavior as compared to a baseline cohort [6].

2.2.1 Users Who Interact with the Trolls. Recent work has also examined the users who interact with the Russian trolls on Twitter. For example, misinformation produced by the Russian trolls was shared more often by conservatives than liberals on Twitter [2]. Models can predict which users will spread the trolls' content by making use of political ideology, bot likelihood scores, and activity-related account metadata [3].

Figure 2: Example of a flagged account replying to a high-profile journalist on Twitter.

2.2.2 Measuring the Propaganda's Effect. Researchers have also worked to understand the influence of the Russian trolls' propaganda efforts on social platforms by using Facebook's ads data, IRA-related tweets on Twitter, and log data from browsers. 1 in 40,000 internet users were exposed to the IRA ads on any given day, but there was variation among left- and right-leaning content [33]. Furthermore, the influence of the trolls has been measured on platforms like Reddit, Twitter, Gab, and 4chan's Politically Incorrect board (/pol/) using Hawkes Processes [42].

2.3 Bots
While some bots are built for helpful things such as auto-replies, bots can also often be harmful, such as when they steal personal information on social platforms [14] and spread misinformation [16, 31]. Previous research has shown that bots largely intervened in the 2016 election. For instance, bots were responsible for millions of tweets right before the 2016 election [4]. This was not the first time, as a disinformation campaign was coordinated through bots before the 2017 French presidential election [13]. Current attempts to detect bots include systems based on social network information, systems based on crowdsourcing and human intelligence, and machine-learning methods using indicative features [14]. However, previous findings show it is becoming harder to filter out bots due to their sophisticated behavior [36], such as posting material collected from the Web at predetermined times.

2.4 Deception and Identity Online
Russian trolls tried to mask their identities on Twitter, for instance pretending to be African-American activists supporting #BlackLivesMatter [1]. Seminal research has shown that the importance of identity varies across online communities [10]. For example, the costliness of faking certain social signals is related to their trustworthiness [10], an insight that we use to compose quantitative features. The importance and salience of identity signals (and possible deception through them) extend to nearly all social platforms. Online dating site users attend to small details in others' profiles and are careful when crafting their own profiles, since fewer cues meant the importance of the remaining ones was amplified [12]. MySpace users listed books, movies, and TV shows in profiles to build elaborate taste performances in order to convey prestige, differentiation, or aesthetic preference [22]. And on Twitter, users manage their self-presentation both via profiles and ongoing conversations [24].

3 DATA
To model and identify potential Russian trolls on Twitter, we first construct a large dataset of both known Russian troll accounts and a control cohort of regular users.

3.1 Russian Trolls
The suspected Russian interference in the 2016 US presidential election led to multiple federal and industry investigations to identify bad actors and analyze their behavior [20]. As a part of these efforts, Twitter officially released a new dataset of 3,841 accounts believed to be connected to the Internet Research Agency (IRA). This dataset contains features such as profile description, account creation date, and poll choices. In our paper, we used the Russian troll accounts from this new dataset for our analysis and model construction. Out of the 3,841 accounts, we focus on the 2,286 accounts whose users selected English as their language of choice. This choice was motivated by our goal of distinguishing Russian trolls from prototypical US users, of which the vast majority speak only English. However, we note that despite a user selecting English, users may still tweet occasionally in other languages. We use the most recent 200 tweets from each troll account to form a linguistic sample. This allows us to directly compare the trolls with other Twitter users, whose tweets are collected via a single Twitter API call, which only provides 200 tweets from active users. In total, 346,711 tweets from the Russian troll dataset were used to construct the classifiers.

3.2 Control Accounts
To contrast with troll behavior, we construct a control dataset of users whose behavior is expected to be typical of US accounts. The initial control accounts are drawn from a historical 10% sample of Twitter data and then iteratively refined by geography and activity time. To ensure geographic proximity in the US, the total variation method [8] is used to geolocate all users. We then took a random sample of US-located users and ensured they tweeted at least 5 times between 2012-2017, to match the tweet activity times in the Russian troll cohort. We then randomly sampled 171,291 of these accounts, which we refer to as control accounts in the rest of the paper. This creates a substantial class imbalance, with 98.6% control accounts to 1.4% Russian troll accounts; this imbalance matches real-world expectations that such troll accounts are relatively rare (though it is difficult to know a priori exactly how rare). For each control account, the most recent 200 tweets are collected (due to API restrictions), for a total dataset size of 29,960,070 tweets. The total dataset is summarized in Table 1.
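For concreteness, the snippet below sketches how a 200-tweet sample like the one described above can be pulled for a single account. It uses the tweepy library and the v1.1 statuses/user_timeline endpoint that was available at the time of the study; the credentials and the helper name recent_tweets are placeholders for illustration, not part of the original pipeline.

```python
import tweepy

# Placeholder credentials -- assumptions for illustration, not values from the paper.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def recent_tweets(screen_name, n=200):
    """Fetch up to n of an account's most recent tweets; a single
    statuses/user_timeline call is capped at 200 tweets."""
    return api.user_timeline(screen_name=screen_name,
                             count=n,
                             tweet_mode="extended",
                             include_rts=True)

# Example: build the linguistic sample for one (hypothetical) account.
sample = recent_tweets("example_account")
texts = [status.full_text for status in sample]
```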

Table 1: Description of the dataset used to construct models.

                          Russian Trolls    Control Accounts
Total # of Accounts       2,286 (1.4%)      171,291 (98.6%)
Total # of Tweets         346,711           29,960,070
Avg Account Age (days)    1679.2            2734.5
Avg # of Followers        1731.8            1197
Avg # of Following        952               536.5

3.3 Journalists' Mentions
Recent research on the intersection of social media and journalism confirms that journalists use Twitter as a source of tips, especially for breaking news [37, 40]. If the trolls' goals are to influence the conversation about U.S. politics, contacting journalists is a natural strategy to influence news coverage and shape public opinion. Journalists have indeed quoted Russian trolls' tweets in mainstream news outlets like The New York Times and The Washington Post [23]. Therefore, as a case study, we collect unseen accounts who have recently contacted high-profile political journalists on Twitter (Figure 2). High-profile journalists were selected from a pre-compiled Twitter list (https://twitter.com/mattklewis/lists/political-journalists?lang=en), and the Twitter API was used to collect 47,426 mentions of these 57 journalists, resulting in 39,103 unique accounts on which we could apply our model. These accounts represent out-of-sample data for the model.

4 METHOD
Next, we describe how our classifiers were constructed, and then how we performed human evaluation on the classifiers' predictions on unseen, out-of-sample Twitter accounts.

4.1 Constructing Classifiers
Our goal is to build classifiers that can detect potential Russian trolls still active on Twitter. In order to characterize user accounts during classification, we used features that can be grouped into 5 broad categories.

Profile features. Considering that the Russian trolls are likely to have more recently created Twitter accounts [42], we calculated the time since creation by counting the number of days since a Twitter account's creation date up to January 1, 2019. We also hypothesized there would be a difference in profile descriptions, since it requires human effort to customize one's profile [2]. Thus, we calculated the length of profile (characters). Additionally, previous research has shown that Russian trolls tend to follow a lot of users, probably to increase the number of their followers [41]. So, we also calculated the number of followers, number of following, and the ratio of followers to following for each Twitter account.

Behavioral features. The behavioral features we computed were broadly in four categories: i) hashtags, ii) mentions, iii) shared links (URLs), and iv) volume and length of (re)tweets (i.e., tweets and retweets). First, considering that the Russian trolls used a high number of certain hashtags before and after the election [42], we hypothesized there would be a difference in the usage of hashtags and calculated the average number of hashtags (words) and the average number of hashtags (characters). Next, we calculated the average number of mentions (per tweet), as Russian trolls tend to mention more unique users compared to a randomly selected set of normal users [41]. In order to capture the Russian trolls' behaviors regarding the sharing of links (URLs) in (re)tweets, we also calculated the average number of links (per tweet), the ratio of retweets that contain links among all tweets, and the ratio of tweets that contain links among all tweets. Prior research has shown that temporal behaviors such as retweet rate (i.e., the rate at which accounts retweet content on Twitter) are useful in identifying online campaigns [15]. Therefore, we calculated the average number of tweets (per day), the standard deviation of the number of tweets (per day), and the ratio of retweets out of all tweets (retweet rate) for measuring (re)tweet volume and rate. Additionally, we calculated the average number of characters of tweets for obtaining tweet length.

Stop word usage features. Prior research has shown that non-native English speakers use function words, such as articles, and highly-frequent content words (e.g., "good") at different rates than native speakers [18, 28], which allows recovering their originally spoken language even when the speaker is fluent. We hypothesized that the linguistic cues in function words enable identifying Russian trolls due to their non-native language skills. We used the list of 179 stop words provided by the Python library sklearn (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py) to calculate their frequency rates [30]; this set is a superset of the function words and common content words typically used, but allows for maximum reproducibility by others. For each stop word, we calculated the ratio of the number of that stop word's occurrences among all words present in all tweets by a Twitter account.

Language distribution features. Additionally, we computed the distribution of languages used by a Twitter account, by examining the account's timeline. For example, if 5% of all tweets within a user's timeline were written in Russian, then the feature value for Russian would be calculated as 0.05. A total of 82 languages were identified when computing the intersection of all distinct languages used, after combining all the tweets made by Russian troll accounts and control accounts. In order to identify the language of tweets, we used the Google CLD2 library.

Bag of Words features. Finally, we tokenized the tweets (converted to lowercase) and broke the text contained in each tweet into words, discarding stopwords (which are captured above). Using these words, we extracted unigrams and bigrams from the text, and constructed a vocabulary for all tweets, containing only the top 5000 most frequent n-grams. During the classification phase, each tweet can be represented as a feature vector of all words and phrases (n-grams) present in the vocabulary (i.e., a Bag of Words (BoW) model), and the feature values would be the frequency of occurrence.
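To illustrate how the behavioral and stop word descriptions above translate into per-account feature values, here is a minimal sketch. The tweet representation (a dict with text, is_retweet, hashtags, mentions, urls, and day fields) and the function names are our own simplifications for illustration, not the authors' code, and only a subset of the Section 4.1 features is shown.

```python
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def behavioral_features(tweets):
    """A subset of the per-account behavioral features described above."""
    n = len(tweets)
    per_day = np.array(list(Counter(t["day"] for t in tweets).values()), dtype=float)
    return {
        "avg_hashtags_per_tweet": np.mean([len(t["hashtags"]) for t in tweets]),
        "avg_mentions_per_tweet": np.mean([len(t["mentions"]) for t in tweets]),
        "avg_links_per_tweet":    np.mean([len(t["urls"]) for t in tweets]),
        "retweet_rate":           sum(t["is_retweet"] for t in tweets) / n,
        "avg_tweet_length":       np.mean([len(t["text"]) for t in tweets]),
        "avg_tweets_per_day":     per_day.mean(),
        "std_tweets_per_day":     per_day.std(),
    }

def stop_word_features(tweets):
    """Frequency rate of each sklearn stop word over all tokens in an account's tweets
    (whitespace tokenization is a simplification)."""
    tokens = [w for t in tweets for w in t["text"].lower().split()]
    counts, total = Counter(tokens), max(len(tokens), 1)
    return {f"stopword_{w}": counts[w] / total for w in ENGLISH_STOP_WORDS}

# Toy example with two tweets for one account.
tweets = [
    {"text": "the election is a disaster #MAGA", "is_retweet": False,
     "hashtags": ["MAGA"], "mentions": [], "urls": [], "day": "2018-11-01"},
    {"text": "RT look at this", "is_retweet": True,
     "hashtags": [], "mentions": ["someone"], "urls": ["http://example.com"], "day": "2018-11-02"},
]
features = {**behavioral_features(tweets), **stop_word_features(tweets)}
```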

4.1.1 Classifiers. We ran classification tests using Logistic Regression, Decision Tree (maximum depth 10), and Adaptive Boosted Decision Tree, using all 5 categories of features described above. In order to understand the importance of each individual category of features, we built 5 additional classifiers, one for each category of features, as shown in Table 3.

Table 2: Average performance of the three classifiers using all features to predict from full data using 10-fold cross validation.

Algorithm   Precision   Recall   F1      AUC     Accuracy
LR          0.785       0.835    0.809   0.989   0.994
DT          0.911       0.775    0.837   0.926   0.996
ADT         0.942       0.884    0.912   0.912   0.998

Table 3: Average performance of the Logistic Regression classifiers using each category of features when conducting 10-fold cross validation.

            Profile   Behavior   Language   Stop word   BoW
Precision   0.07      0.24       0.15       0.14        0.58
Recall      0.01      0.4        0.21       0.39        0.79
F1          0.02      0.3        0.17       0.21        0.66
AUC         0.74      0.76       0.85       0.86        0.98
Accuracy    0.98      0.97       0.97       0.96        0.99

4.2 Testing on Unseen Data & Human Evaluation
While it is interesting to model pre-2019 Twitter data, a core question is: How does this model perform on present-day Twitter? In order to answer this question, we applied the computational models described above to unseen, out-of-sample Twitter accounts (the journalists' mentions), and manually assessed the models' predictions.

4.2.1 Out-of-sample Twitter accounts. For further analysis, we picked the Logistic Regression classifier trained on all features, as Logistic Regression's simplicity seems most likely to generalize to out-of-sample data. We ran this classifier on the 39,103 unique (unseen) accounts that mentioned 57 high-profile political journalists on Twitter (see Figure 2 for an example), in order to flag suspicious accounts that resemble Russian trolls. We use the term "flagged accounts" for the accounts among the journalists' mentions that were classified as potential Russian trolls by the model we built, and "unflagged accounts" for the accounts that were not classified as such. However, we have left the world of ground truth, since there is no hard evidence or labels available to prove or disprove the classifiers' prediction regarding an unseen Twitter account resembling a Russian troll.

4.2.2 Validation through human evaluation. As a result, we included an additional validation step and employed human evaluation in order to assess the quality of classifier predictions regarding unseen Twitter accounts that mentioned high-profile journalists. Out of all the potential Russian troll accounts that were flagged by the classifier, we randomly selected 50 flagged accounts for the manual validation step. In addition to these 50 flagged accounts, we also included 50 unflagged accounts, which were also randomly selected for human evaluation. These 100 Twitter accounts were combined in random order and manually annotated by three trained raters, who are authors of this paper and undergraduate students. One rater had Political Science expertise. Their task was to independently rate how likely it was for a given Twitter account to be a Russian troll. Annotators were not told how many accounts were flagged by the classifier in order to reduce confirmation bias [27].

Training for the manual inspection. Before the human evaluation was conducted, the three raters spent over 12 hours, over the course of 4 weeks, examining the tweets made by Russian trolls that were officially released by Twitter (the dataset described in Section 3.1). The raters examined the tweets' content as well as hashtags, links, and mentions in the tweets, and the trolls' language usage.

Execution of the manual inspection. The raters were instructed to go to each account's profile page and independently study the profile/background image, description, account creation date, etc., and examine the most recent 20 tweets from each of these 100 accounts, to decide how likely it was for each account to be a Russian troll. The raters focused on characteristics like: 1) frequent (re)tweeting and replying, causing the account to have an abnormal number of tweets compared to the account's age, 2) default or vague profiles, 3) whether the account was created recently, 4) behavior of (re)tweeting only political content, 5) no evidence of personal life on its profile page, and 6) extreme statements and inflammatory hashtags. Working independently, each rater marked their decision as a rating from 1 to 5, where 5 meant "highly likely the account is a Russian troll", along with writing a brief description of the reasoning behind their decision (e.g., what made the account (un)likely to be a Russian troll?).

After all three raters completed the evaluation, we calculated the agreement level among their ratings using Krippendorff's α [21]. Finally, we computed the average rating for each account to decide whether an account flagged by the classifier was indeed verified, through manual inspection, to be a potential Russian troll. The average ratings and the predictions from the classifier were then shown to the three raters, and they gathered to discuss their observations regarding the characteristics of flagged accounts.

5 RESULTS
Next, we describe the results of in-domain (2016 election troll dataset) classification, a SAGE analysis of discriminative words [11], and finally the results of human evaluation on out-of-sample data.

5.1 Cross-validation
To evaluate our models, we first performed 10-fold cross validation. In measuring performance, we focused on precision, since we want to minimize false positives considering their potential impact (i.e., flagging a normal account as a suspicious account has a large downside). As shown in Table 2, when using all features, including those from Bag of Words, we see the simplest model, the Logistic Regression classifier, achieving 78.5% precision and 98.9% AUC. Adaptive Boosted Decision Trees achieves over 85% for all metrics, with the precision being 94.2% and the AUC being 91.2%. Although one might think the baseline here is about 98.6%, considering just predicting "not troll" for all accounts, accuracy is not the right metric since the dataset is extremely unbalanced.

We also built classifiers using each category of features in order to see which features are more crucial. As shown in Table 3, the classifier built using the 5,000 Bag of Words features performs overall the best, achieving 98% AUC and 58% precision. We can also see that the classifiers built using language distribution features and stop word usage features both reach above 80% AUC. At the same time, we acknowledge that due to the unbalanced dataset, the precision and recall suffer across all models compared to the AUC.
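As a sketch of the training and evaluation protocol above (Sections 4.1.1 and 5.1), the following reproduces the 10-fold cross-validation loop with scikit-learn. The feature matrix is replaced by a synthetic stand-in that mirrors the roughly 1.4% positive rate, and hyperparameters other than the stated maximum tree depth of 10 are illustrative rather than the values used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real (highly unbalanced) account-level feature matrix.
X, y = make_classification(n_samples=20000, n_features=50,
                           weights=[0.986], random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(max_depth=10),
    "ADT": AdaBoostClassifier(DecisionTreeClassifier(max_depth=10)),  # Adaptive Boosted Decision Tree
}
scoring = ["precision", "recall", "f1", "roc_auc", "accuracy"]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(name, {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring})
```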

5.2 SAGE Analysis
As shown in Table 3, the BoW features were important predictors. Next, we automatically identified keywords in the vocabulary that distinguished Russian troll accounts from the rest. In order to do so, we used the most recent 200 tweets per Russian troll account (the same tweets used to construct our models). As a baseline comparison, we also compiled the set of tweets made by the control accounts (not identified to be Russian troll accounts). Our goal was to identify terms whose frequencies are especially large in the tweets made by Russian troll accounts, in comparison to tweets by the control group.

Due to the long-tail nature of word frequencies [43], straightforward comparisons often give unsatisfactory results. The difference in word frequencies between two groups is usually dominated by stopwords: a 1% difference in the frequency of "the" or "it" will be larger than the overall frequency of most terms in the vocabulary. The ratio of word frequencies—equivalent to the difference in log frequencies, and to pointwise mutual information—has the converse problem: without carefully tuned smoothing, the resulting keywords will include only the lowest frequency terms, suffering from high variance. SAGE offers a middle ground, selecting keywords by comparing the parameters of two logistically-parametrized multinomial models, using a self-tuned regularization parameter to control the tradeoff between frequent and rare terms [11]. SAGE has been used successfully for the analysis of many types of language differences, including age and gender [29], and politics [32].

Thus, we conducted SAGE analysis on the text contained in the tweets made by the two groups of accounts, namely Russian trolls and control accounts. The top 30 words that distinguished the English-speaking Russian trolls from the control accounts are shown in Table 5. We found that the use of keywords associated with chaos and distrust, such as danger, disaster, crisis, emergency, and attacks, distinguished Russian troll accounts. We also found that Russian troll accounts used the names of political figures, such as obama, more frequently compared to the non-troll accounts. Additionally, Russian troll accounts used words such as maga (Make America Great Again), a term often used by conservatives, and keywords related to the Fukushima disaster, such as fukushima2015 [39].

Figure 3: Distributions of average ratings given by 3 raters per account (flagged versus unflagged).

Table 4: Description of the results when testing the model on unseen accounts that mentioned high-profile journalists on Twitter.

# of Journalists                                    57
# of Journalists having at least 1 flagged account  57 (100%)
# of Unique accounts among the mentions             39,103
# of Unique flagged accounts                        1,466 (3.7%)

5.3 Testing on Unseen Data & Human Evaluation
As shown in Table 4, the Logistic Regression classifier trained on all features flagged 1,466 (3.7%) of the 39,103 accounts mentioning journalists as potential Russian trolls, an average of 25.7 per journalist. All 57 journalists had at least one account flagged as a potential Russian troll by the classifier.

We then asked our three raters to independently indicate how likely they thought 100 sample accounts from the set of accounts who mentioned journalists were to actually be Russian trolls. These 100 sample accounts were a mix of 50 randomly selected flagged accounts and 50 randomly selected unflagged accounts. Whether each account was flagged or not was not communicated to the raters. The Krippendorff's α for the three sets of ratings was computed to be 0.512, indicating that the agreement level among labelers was moderate [21]. This is expected considering this is a very difficult annotation task, since the Russian trolls are intentionally trying to blend in. However, this moderate agreement level at distinguishing suspected troll accounts from those of normal users indicates there is some salient behavioral difference between the two groups. Based on the human evaluations, the median rating for all accounts (out of 5.0) provided by the three raters was 2.5, with the median for the flagged accounts being 3.12. 50% (25/50) of the flagged accounts scored above 3 on average, indicating that they closely resembled Russian troll accounts (Figure 3). 70% (i.e., 35/50) of the flagged accounts had at least 2 raters giving them a score above 3, and 84% (42/50) of them had at least 1 rater giving a score above 3. This is strong evidence for classification generalization, as it is similar to the in-sample precision scores (these figures correspond to out-of-sample precision).

Limitation of agreement score. We acknowledge that the Krippendorff's α does not necessarily capture any innate knowledge of the raters that might have influenced the annotations.
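A minimal sketch of the agreement computation above, using the krippendorff Python package; the ratings matrix here is made up for illustration (rows are the three raters, columns are rated accounts) and is not the study's annotation data.

```python
import numpy as np
import krippendorff

# Illustrative 1-5 ratings: rows = raters, columns = accounts.
ratings = np.array([
    [4, 2, 5, 1, 3, 4, 2, 5],
    [5, 2, 4, 1, 2, 4, 3, 5],
    [4, 1, 4, 2, 3, 3, 2, 4],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")

# Average rating per account, as used to judge whether a flagged
# account closely resembles a Russian troll (average above 3).
print(ratings.mean(axis=0))
```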

Table 5: Top 30 n-grams distinguishing tweets by Russian troll accounts (treatment group) from control accounts (base group) obtained through SAGE analysis. SAGE denotes the score obtained for an n-gram through SAGE analysis—n-grams with higher SAGE scores are more distinguishing of the Russian troll accounts when compared to the control accounts. Treatment and base count denote the raw frequency count of an n-gram within the treatment and base groups respectively, while treatment and base rate denote the frequency rate of an n-gram within the treatment and base groups respectively.

Rank  n-gram              SAGE   Treatment Count  Treatment Rate  Base Count  Base Rate
1     fukushima2015       12.19  12498            0.0068          1           5.14e-09
2     fukushimaagain      11.87  8215             0.0045          0           0
3     ukraine             7.36   9327             0.0051          1018        5.23e-06
4     на                  6.79   13622            0.0074          2657        1.36e-05
5     не                  6.46   8799             0.0048          2373        1.22e-05
6     nuclear             5.70   6781             0.0037          3922        2.02e-05
7     danger              4.90   2822             0.0015          3644        1.87e-05
8     disaster            4.74   3427             0.0019          5187        2.67e-05
9     plant               4.14   2697             0.0015          7400        3.80e-05
10    louisiana           4.04   2031             0.0011          6170        3.17e-05
11    turkey              3.64   2314             0.0013          10509       5.40e-05
12    officials           3.54   1320             0.0007          6642        3.41e-05
13    crisis              3.52   1539             0.0008          7906        4.06e-05
14    midnight            3.40   2859             0.0015          16513       8.49e-05
15    rap                 3.32   2575             0.0014          16193       8.32e-05
16    rising              3.22   1017             0.0005          7004        3.60e-05
17    quotes              3.19   1075             0.0006          7635        3.92e-05
18    government          3.17   2543             0.0014          18579       9.55e-05
19    politics            3.14   1488             0.0008          11177       5.74e-05
20    dangerous           3.07   1390             0.0007          11154       5.73e-05
21    awful               3.02   1462             0.0008          12380       6.36e-05
22    quote               2.98   2334             0.0013          20581       1.05e-04
23    emergency           2.94   911              0.0005          8351        4.29e-05
24    hillary             2.89   1716             0.0009          16541       8.50e-05
25    hillary clinton     2.82   544              0.0003          5585        2.87e-05
26    happy thanksgiving  2.78   671              0.0004          7174        3.69e-05
27    clinton             2.71   1328             0.0007          15260       7.84e-05
28    attacks             2.69   527              0.0003          6176        3.17e-05
29    obama               2.63   3194             0.0017          39733       2.04e-04
30    maga                2.61   485              0.0003          6181        3.18e-05
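The SAGE scores in Table 5 come from the reference implementation accompanying [11] and are not re-derived here; the treatment/base counts and rates in the table, however, are plain corpus statistics, sketched below over toy token lists (unigram tokenization here is a simplification of the n-gram extraction described in Section 4.1).

```python
from collections import Counter

def group_stats(token_lists):
    """Raw term counts and frequency rates for one group of accounts,
    given each account's token list (unigrams only, for brevity)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values()) or 1
    return counts, {w: c / total for w, c in counts.items()}

# Toy stand-ins for the troll (treatment) and control (base) corpora.
treatment_tokens = [["maga", "danger", "obama"], ["maga", "crisis", "disaster"]]
base_tokens = [["weekend", "game", "obama"], ["news", "game"]]

t_counts, t_rates = group_stats(treatment_tokens)
b_counts, b_rates = group_stats(base_tokens)
print("maga:", t_counts["maga"], t_rates["maga"], b_counts["maga"], b_rates.get("maga", 0.0))
```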

6 ANALYSIS OF THE FLAGGED ACCOUNTS
Next, we examine the bot scores of the model-flagged accounts, as well as other quantitative covariates, such as deletion and suspension rates.

6.1 Botometer Scores
An interesting question we can ask is whether the flagged accounts are just bot accounts, because previous disinformation campaigns have been coordinated through bots [13]. Researchers at Indiana University provide a public tool called Botometer (https://botometer.iuni.iu.edu/#!/) that assesses an account for bot characteristics [9]. It uses user-based features, friends features, network features, temporal features, content and language features, and sentiment features to classify each account [38].

6.1.1 Method. For the potential trolls, we use the 1,466 accounts flagged by the Logistic Regression classifier constructed by using all features. We randomly selected 1,500 users from the unflagged accounts among the mentions for the comparison group. For each group we used the Python Botometer API (https://github.com/IUNetSci/botometer-python) to get each account's overall bot score, as well as the scores obtained when using only the features in each category.

6.1.2 Results. While the flagged accounts do exhibit some bot-like characteristics, it seems that this is not a dominant behavior. As shown in Figure 4, although the average bot score of the accounts flagged as potential Russian trolls is higher than that of the unflagged accounts, both are less than 1 (on a 5-point scale).

Figure 4: Flagged and unflagged accounts have Botometer scores under 1 (though flagged accounts score higher at a statistically significant level). Inset shows a zoomed-in y-axis of the larger plot. Bars show 95% confidence intervals.

This lack of difference between the two groups would seem essential both to defeat bot-detection software already running on Twitter and to provide authenticity in conversations. One working hypothesis is that the Russian trolls might be software-assisted human workers, making it hard to detect them with just bot-detection algorithms.

When we take the 6 categories into account, we see that the category scores for the flagged accounts are all higher than those of the unflagged ones, but still lower than 2 (Table 6). Comparing each category's score for flagged and unflagged accounts shows not much difference, with the largest difference occurring when only user-based features are used (0.49).
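As a sketch of the scoring step in Section 6.1.1, the snippet below uses the botometer-python package. The credentials and handles are placeholders, the constructor argument names follow a recent release of the package (older releases used a Mashape key instead of a RapidAPI key), and the exact keys in the returned dictionary depend on the Botometer API version, so the access below is deliberately defensive.

```python
import botometer

# Placeholder credentials for the Botometer (RapidAPI) and Twitter APIs.
twitter_app_auth = {
    "consumer_key": "CONSUMER_KEY",
    "consumer_secret": "CONSUMER_SECRET",
    "access_token": "ACCESS_TOKEN",
    "access_token_secret": "ACCESS_TOKEN_SECRET",
}
bom = botometer.Botometer(wait_on_ratelimit=True,
                          rapidapi_key="RAPIDAPI_KEY",
                          **twitter_app_auth)

flagged_handles = ["@example_flagged_account"]  # placeholder list of flagged accounts

for handle, result in bom.check_accounts_in(flagged_handles):
    # 'display_scores' (0-5 scale) holds the per-category and overall scores
    # summarized in Table 6; fall back to 'scores' if the key is absent.
    print(handle, result.get("display_scores", result.get("scores")))
```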

6.2 Flagged Accounts' Behavioral Characteristics
Next, we highlight interesting quantitative covariates among the flagged accounts. We believe these findings show that the flagged accounts do reasonably resemble Russian trolls.

Deletion or suspension of accounts. As of January 13, 2019, 4 flagged accounts were suspended, while 8 flagged accounts showed up as not existing (the model was run on the accounts before January 13, and these accounts existed at that time).

Table 6: Comparison of flagged and unflagged accounts' average bot scores calculated using each category and all categories. Highest score is 5.

Category              Flagged          Unflagged
User-based            1.3 (σ=1.25)     0.81 (σ=0.97)
Friends               1.35 (σ=0.95)    1.02 (σ=0.67)
Network               1.33 (σ=1.01)    1.06 (σ=0.79)
Temporal              1.42 (σ=1.25)    1.08 (σ=0.99)
Content & Language    1.34 (σ=1.22)    0.94 (σ=0.91)
Sentiment             1.22 (σ=1.21)    0.88 (σ=0.92)
All                   0.91 (σ=0.95)    0.54 (σ=0.61)

Table 7: Comparison of flagged and a subset of unflagged accounts regarding account and tweet volume/rate.

                                Flagged            Unflagged
Total # of Accounts             1,441              1,485
Avg Account Age (days)          1202.43            2175.25
Avg # of Followers              2205.49            1621.7
Avg # of Following              1772.6             1361.57
Total # of Tweets               3,527,171          3,515,765
Avg # of Tweets                 2447.72            2268.12
Avg of avg daily # of tweets    41.99 (σ=24.5)     26.83 (σ=16.9)

Table 8: Comparison of the top 3 languages used in tweets by the flagged and a subset of unflagged accounts.

Flagged                           Unflagged
Total tweets      3,527,171       Total tweets      3,515,765
English tweets    3,368,336       English tweets    3,339,469
German tweets     6,736           Spanish tweets    5,767
Russian tweets    3,145           German tweets     3,833

Table 9: Comparison of top 10 hashtags used by the flagged and unflagged accounts.

Flagged                        Unflagged
#MAGA              12,385      #Trump             4,553
#TrumpResign       10,874      #MAGA              4,499
#Trump             7,879       #TrumpResign       3,809
#TrumpShutdown     6,230       #TrumpShutdown     3,047
#Resist            3,554       #ResignTrump       2,451
#allaboutthemusic  3,208       #BREAKING          2,078
#GOP               3,089       #TrumpShutDown     1,873
#Kavanaugh         2,358       #TheDailyBark      1,838
#TrumpRussia       2,257       #DogsOfBauwow      1,793
#labrador          2,180       #Kavanaugh         1,660

Account information. In Table 7 we can see that, on average, the flagged accounts had more followers and following than unflagged accounts. Russian trolls had more followers and following than the control accounts as well (Table 1). The account age was calculated by getting the number of days since account creation up to January 1st, 2019. We can see that the flagged accounts are relatively newly created compared to the unflagged ones, which is also a characteristic of Russian trolls (cf. Table 1).

Tweet volume and rate. The flagged accounts (41.99) tweeted more on a daily basis compared to unflagged ones (26.83), with a higher standard deviation, as shown in Table 7.

Language. Note that since we ran the model on accounts that recently mentioned high-profile political journalists in the U.S., it is likely that both flagged and unflagged accounts use English most frequently. However, we can see that the flagged accounts had more tweets in Russian compared to the unflagged ones (Table 8).

Hashtags. As shown in Table 9, many of the top hashtags used by the flagged and unflagged accounts overlap, with the majority being political ones, such as #MAGA and #Trump. However, we can see the flagged accounts use more hashtags overall, such as using #MAGA more than twice as often as unflagged accounts, despite the fact that the flagged accounts had only slightly more tweets.

7 DISCUSSION
We have presented results showing that: 1) Russian trolls can be separated from a large control group of accounts active at the same time; 2) that same computational model can uncover likely Russian trolls still active on Twitter today; and 3) those flagged accounts don't appear to be bots, at least in the way that we currently understand bot behavior. Next, we reflect on observations our raters made while examining flagged accounts, suggesting high-level strategies used by Russian trolls today. We conclude this section with thoughts on implications for social platforms.

Reflections on Flagged Accounts. Our three raters closely examined many flagged accounts still active on Twitter today. While their observations are at their core anecdotal, we believe they will be helpful to researchers seeking to identify Russian troll accounts.

Mismatch of bio and tweets. Many flagged accounts had discordant bios and tweet content. For instance, some flagged accounts had specific universities mentioned in their bio, but none of the tweets the authors examined contained any reference to them.

Profile picture strategies. Profile pictures play important roles in signalling identity on Twitter, and we witnessed various attempts to manage identity through profile pictures among this group as well. For instance, some flagged accounts used a middle-aged white person's profile picture or a cartoon photo, while other accounts had no profile or (and) no cover photo at all. One rater used reverse image search to discover that one flagged account claimed to live in Brooklyn, New York, with a profile picture of a restaurant called "Brooklyn Bar and Grill." However, that restaurant is in Iceland. We believe it may be a common strategy to reuse images found elsewhere on the internet as profile photos, as creating unique and believable profile photos is costly.

Using controversial issues and aggressive behavior. Based on the tweets we examined, some of the flagged accounts showed aggressive behavior. This included calling names and tweeting about controversial issues like immigration or gun control. An example is depicted in Figure 5.

Figure 5: Example of a flagged account replying to various accounts in an aggressive manner, such as name-calling.

Abnormal (re)tweet rate. The raters noted that many of the flagged accounts showed a high (re)tweet rate of mostly political content. For instance, one account has been tweeting roughly 65 times per day. As another example, one of the flagged accounts examined by our raters has tweeted over 44,200 times since January 2017.

7.1 Implications for Social Platforms
One key takeaway from our study is that identifying active Russian trolls on Twitter is possible given even limited access to tweets and user data. It follows that social platforms, which have access to complete data about users, their content, and their interactions, can identify Russian trolls among their users. We argue that social platforms should take the initiative to build models and tools to police these accounts. As indicated by our Botometer results, current bot-detection efforts are unlikely to sweep up most Russian troll accounts.

7.2 Social Platforms Other than Twitter
Research has shown that the Russian trolls' activities were different on different platforms. For instance, campaigns on Facebook targeted the interests of African-Americans and Mexican-Americans, while the Twitter campaign focused more on topics such as general news and politics [33]. Thus, it would be interesting to see how our model can be adapted to other social platforms such as Facebook and Reddit, and whether our qualitative findings hold on other platforms.

8 CONCLUSION
Due to the societal importance and impact of social media, coordinated efforts by adversarial individuals and organizations, e.g., trolls of the IRA, have the potential to substantially negatively impact society. Here, we show that Twitter accounts belonging to Russian trolls can be used to identify new suspected troll accounts. We develop a new machine learning model that in cross-validation tests achieves an AUC of 98.9%, allowing us to identify and study new trolls to answer unanswered questions about their activity. When testing our model on accounts that recently mentioned prominent journalists—a likely action to influence the conversation about U.S. politics—we find that Russian trolls are likely still active, with 3.7% of the accounts contacting these journalists being flagged as suspected Russian trolls. Our human evaluation suggests that roughly 70% of these accounts are valid. Finally, we argue that the platforms, which have access to internal signals, can succeed at similar or better rates at finding and deactivating similar adversarial actors.

ACKNOWLEDGMENTS
We thank the reviewers for their helpful reviews. We also thank John Joon Young Chung for his feedback.

REFERENCES
[1] Ahmer Arif, Leo Graiden Stewart, and Kate Starbird. 2018. Acting the Part: Examining Information Operations Within #BlackLivesMatter Discourse. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 20.
[2] Adam Badawy, Emilio Ferrara, and Kristina Lerman. 2018. Analyzing the Digital Traces of Political Manipulation: The 2016 Russian Interference Twitter Campaign. WWW 2018 (2018).
[3] Adam Badawy, Kristina Lerman, and Emilio Ferrara. 2019. Who falls for online political manipulation? In Companion Proceedings of The 2019 World Wide Web Conference. 162–168.
[4] Alessandro Bessi and Emilio Ferrara. 2016. Social bots distort the 2016 US Presidential election online discussion. (2016).
[5] Brandon C Boatwright, Darren L Linvill, and Patrick L Warren. 2018. Troll Factories: The Internet Research Agency and State-Sponsored Agenda Building. (2018).
[6] Ryan L Boyd, Alexander Spangher, Adam Fourney, Besmira Nushi, Gireeja Ranade, James Pennebaker, and Eric Horvitz. 2018. Characterizing the Internet Research Agency's Social Media Operations During the 2016 US Presidential Election using Linguistic Analyses. (2018).
[7] Derrick L Cogburn and Fatima K Espinoza-Vasquez. 2011. From networked nominee to networked nation: Examining the impact of Web 2.0 and social media on political participation and civic engagement in the 2008 Obama campaign. Journal of Political Marketing 10, 1-2 (2011), 189–213.
[8] Ryan Compton, David Jurgens, and David Allen. 2014. Geotagging one hundred million twitter accounts with total variation minimization. In Proceedings of the IEEE International Conference on Big Data. 393–401.
[9] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. BotOrNot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 273–274.
[10] Judith S Donath. 2002. Identity and deception in the virtual community. In Communities in Cyberspace. Routledge, 37–68.
[11] Jacob Eisenstein, Amr Ahmed, and Eric P Xing. 2011. Sparse additive generative models of text. (2011).
[12] Nicole Ellison, Rebecca Heino, and Jennifer Gibbs. 2006. Managing impressions online: Self-presentation processes in the online dating environment. Journal of Computer-Mediated Communication 11, 2 (2006), 415–441.
[13] Emilio Ferrara. 2017. Disinformation and social bot operations in the run up to the 2017 French presidential election. (2017).
[14] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Commun. ACM 59, 7 (2016), 96–104.
[15] Rumi Ghosh, Tawan Surachawala, and Kristina Lerman. 2011. Entropy-based classification of 'retweeting' activity on twitter. Social Network Mining and Analysis (2011).
[16] Yuriy Gorodnichenko, Tho Pham, and Oleksandr Talavera. 2018. Social media, sentiment and public opinions: Evidence from #Brexit and #USElection. Technical Report. National Bureau of Economic Research.
[17] Christopher Griffin and Brady Bickel. 2018. Unsupervised Machine Learning of Open Source Russian Twitter Data Reveals Global Scope and Operational Characteristics. arXiv preprint arXiv:1810.01466 (2018).
[18] Tania Ionin, Maria Luisa Zubizarreta, and Salvador Bautista Maldonado. 2008. Sources of linguistic knowledge in the second language acquisition of English articles. Lingua 118, 4 (2008), 554–576.
[19] Mike Isaac and Daisuke Wakabayashi. 2017. Russian influence reached 126 million through Facebook alone. The New York Times (2017).
[20] Michael Jensen. 2018. Russian Trolls and Fake News: Information or Identity Logics? Journal of International Affairs 71, 1.5 (2018), 115–124.
[21] Klaus Krippendorff. 2011. Computing Krippendorff's alpha-reliability. (2011).
[22] Hugo Liu. 2007. Social network profiles as taste performances. Journal of Computer-Mediated Communication 13, 1 (2007), 252–275.
[23] Josephine Lukito and Chris Wells. 2018. Most major outlets have used Russian tweets as sources for partisan opinion: Study. Columbia Journalism Review 8 (2018).
[24] Alice E Marwick and danah boyd. 2011. I tweet honestly, I tweet passionately: Twitter users, context collapse, and the imagined audience. New Media & Society 13, 1 (2011), 114–133.
[25] Panagiotis T Metaxas and Eni Mustafaraj. 2012. Social media and the elections. Science 338, 6106 (2012), 472–473.
[26] Eni Mustafaraj and P Takis Metaxas. 2010. From obscurity to prominence in minutes: Political speech and real-time search. (2010).
[27] Raymond S Nickerson. 1998. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology 2, 2 (1998), 175.
[28] Garrett Nicolai and Grzegorz Kondrak. 2014. Does the Phonology of L1 Show Up in L2 Texts? In Proceedings of ACL.
[29] Umashanthi Pavalanathan and Jacob Eisenstein. 2015. Confounds and Consequences in Geotagged Twitter Data. Association for Computational Linguistics, 2138–2148. https://doi.org/10.18653/v1/D15-1256
[30] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
[31] Chengcheng Shao, Giovanni Luca Ciampaglia, Onur Varol, Alessandro Flammini, and Filippo Menczer. 2017. The spread of fake news by social bots. arXiv preprint arXiv:1707.07592 (2017), 96–104.
[32] Yanchuan Sim, Brice D. L. Acree, Justin H. Gross, and Noah A. Smith. 2013. Measuring Ideological Proportions in Political Speeches. Association for Computational Linguistics, 91–101. http://aclweb.org/anthology/D13-1010
[33] Alexander Spangher, Gireeja Ranade, Besmira Nushi, Adam Fourney, and Eric Horvitz. 2018. Analysis of Strategy and Spread of Russia-sponsored Content in the US in 2017. arXiv preprint arXiv:1810.10033 (2018).
[34] Massimo Stella, Emilio Ferrara, and Manlio De Domenico. 2018. Bots sustain and inflate striking opposition in online social systems. arXiv preprint arXiv:1802.07292 (2018).
[35] Leo G Stewart, Ahmer Arif, and Kate Starbird. 2018. Examining trolls and polarization with a retweet network.
[36] VS Subrahmanian, Amos Azaria, Skylar Durst, Vadim Kagan, Aram Galstyan, Kristina Lerman, Linhong Zhu, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. The DARPA Twitter Bot Challenge. Computer 49, 6 (2016), 38–46.
[37] Alecia Swasy. 2016. A Little Birdie Told Me: Factors that Influence the Diffusion of Twitter in Newsrooms. J. Broadcast. Electron. Media 60, 4 (Oct. 2016), 643–656.
[38] Onur Varol, Emilio Ferrara, Clayton A Davis, Filippo Menczer, and Alessandro Flammini. 2017. Online human-bot interactions: Detection, estimation, and characterization. In Eleventh International AAAI Conference on Web and Social Media.
[39] Dror Walter, Yotam Ophir, and Kathleen Hall Jamieson. 2020. Russian Twitter Accounts and the Partisan Polarization of Vaccine Discourse, 2015–2017. American Journal of Public Health 110, 5 (2020), 718–724.
[40] Lars Willnat and David H Weaver. 2018. Social Media and U.S. Journalists. Digital Journalism 6, 7 (Aug. 2018), 889–909.
[41] Savvas Zannettou, Tristan Caulfield, Emiliano De Cristofaro, Michael Sirivianos, Gianluca Stringhini, and Jeremy Blackburn. 2019. Disinformation warfare: Understanding state-sponsored trolls on Twitter and their influence on the web. In Companion Proceedings of The 2019 World Wide Web Conference. 218–226.
[42] Savvas Zannettou, Tristan Caulfield, William Setzer, Michael Sirivianos, Gianluca Stringhini, and Jeremy Blackburn. 2019. Who let the trolls out? Towards understanding state-sponsored trolls. In Proceedings of the 10th ACM Conference on Web Science. 353–362.
[43] George Kingsley Zipf. 1949. Human behavior and the principle of least effort. (1949).