arXiv:1910.02001v1 [cs.CL] 4 Oct 2019

Predicting the Role of Political Trolls in Social Media

Atanas Atanasov, Sofia University, Bulgaria ([email protected])
Gianmarco De Francisci Morales, ISI Foundation, Italy ([email protected])
Preslav Nakov, Qatar Computing Research Institute, HBKU, Qatar ([email protected])

Abstract

We investigate the political roles of "Internet trolls" in social media. Political trolls, such as the ones linked to the Russian Internet Research Agency (IRA), have recently gained enormous attention for their ability to sway public opinion and even influence elections. Analysis of the online traces of trolls has shown different behavioral patterns, which target different slices of the population. However, this analysis is manual and labor-intensive, thus making it impractical as a first-response tool for newly-discovered troll farms. In this paper, we show how to automate this analysis by using machine learning in a realistic setting. In particular, we show how to classify trolls according to their political role (left, news feed, right) by using features extracted from social media, i.e., Twitter, in two scenarios: (i) in a traditional supervised learning scenario, where labels for trolls are available, and (ii) in a distant supervision scenario, where labels for trolls are not available, and we rely on more-commonly-available labels for news outlets mentioned by the trolls. Technically, we leverage the community structure and the text of the messages in the online social network of trolls represented as a graph, from which we extract several types of learned representations, i.e., embeddings, for the trolls. Experiments on the "IRA Russian Troll" dataset show that our methodology improves over the state-of-the-art in the first scenario, while providing a compelling case for the second scenario, which has not been explored in the literature thus far.

1 Introduction

"Trolls" are users of an online community who quarrel and upset people, seeking to sow discord by posting inflammatory content. More recently, organized "troll farms" of political opinion manipulation trolls have also emerged. Such farms usually consist of state-sponsored agents who control a set of pseudonymous user accounts and personas, the so-called "sockpuppets", which disseminate misinformation and propaganda in order to sway opinions, destabilize the society, and even influence elections (Linvill and Warren, 2018).

The behavior of political trolls has been analyzed in different recent circumstances, such as the 2016 US Presidential Elections and the Brexit referendum in the UK (Linvill and Warren, 2018; Llewellyn et al., 2018). However, this kind of analysis requires painstaking and time-consuming manual labor to sift through the data and to categorize the trolls according to their actions. Our goal in the current paper is to automate this process with the help of machine learning (ML). In particular, we focus on the case of the 2016 US Presidential Elections, for which a public Twitter dataset is available. For this case, we consider only accounts that post content in English, and we wish to divide the trolls into some of the functional categories identified by Linvill and Warren (2018): left troll, right troll, and news feed.

We consider two possible scenarios. The first, prototypical ML scenario is supervised learning, where we want to learn a function from users to the categories {left, right, news feed}, and the ground truth labels for the troll users are available. This scenario has been considered previously in the literature by Kim et al. (2019). Unfortunately, a solution for such a scenario is not directly applicable to a real-world use case. Suppose a new troll farm trying to sway the upcoming European or US elections has just been discovered. While the identities of the troll accounts might be available, the labels to learn from would not be present. Thus, any supervised machine learning approach would fall short of being a fully automated solution to our initial problem.
A more realistic scenario assumes that labels for troll accounts are not available. In this case, we need to use some external information in order to learn a labeling function. Indeed, we leverage more persistent entities and their labels: news media. We assume a learning scenario with distant supervision, where labels for news media are available. By combining these labels with a citation graph from the troll accounts to news media, we can infer the final labeling on the accounts themselves without any need for manual labeling.

One advantage of using distant supervision is that we can get insights about the behavior of a newly-discovered troll farm quickly and effortlessly. Differently from troll accounts in social media, which usually have a high churn rate, news media accounts in social media are quite stable. Therefore, the latter can be used as an anchor point to understand the behavior of trolls, for which data may not be available.

We rely on embeddings extracted from social media. In particular, we use a combination of embeddings built on the user-to-user mention graph, the user-to-hashtag mention graph, and the text of the tweets of the troll accounts. We further explore several possible approaches using label propagation for the distant supervision scenario.
As a result of our approach, we improve the classification accuracy by more than 5 percentage points for the supervised learning scenario. The distant supervision scenario has not previously been considered in the literature, and is one of the main contributions of the paper. We show that even by hiding the labels from the ML algorithm, we can recover 78.5% of the correct labels.

The contributions of this paper can be summarized as follows:

• We predict the political role of Internet trolls (left, news feed, right) in a realistic, unsupervised scenario, where labels for the trolls are not available, and which has not been explored in the literature before.

• We propose a novel distant supervision approach for this scenario, based on graph embeddings, BERT, and label propagation, which projects the more-commonly-available labels for news media onto the trolls who cited these media.

• We improve over the state of the art in the traditional, fully supervised setting, where training labels are available.

2 Related Work

2.1 Trolls and Opinion Manipulation

The promise of social media to democratize content creation (Kaplan and Haenlein, 2010) has been accompanied by many malicious attempts to spread misleading information over this new medium, which quickly got populated by sockpuppets (Kumar et al., 2017), Internet water army (Chen et al., 2013), astroturfers (Ratkiewicz et al., 2011), and seminar users (Darwish et al., 2017). Several studies have shown that trust is an important factor in online relationships (Ho et al., 2012; Ku, 2012; Hsu et al., 2014; Elbeltagi and Agag, 2016; Ha et al., 2016), but building trust is a long-term process and our understanding of it is still in its infancy (Salo and Karjaluoto, 2007). This makes it easy for politicians and companies to manipulate user opinions in forums (Dellarocas, 2006; Li et al., 2016; Zhuang et al., 2018).

Trolls. Social media have seen the proliferation of fake news and clickbait (Hardalov et al., 2016; Karadzhov et al., 2017a), aggressiveness (Moore et al., 2012), and trolling (Cole, 2015). The latter is often understood to concern malicious online behavior that is intended to disrupt interactions, to aggravate interacting partners, and to lure them into fruitless argumentation in order to disrupt online interactions and communication (Chen et al., 2013). Here we are interested in studying not just any trolls, but those that engage in opinion manipulation (Mihaylov et al., 2015a,b, 2018). This latter definition of troll has also become prominent in the general public discourse recently. Del Vicario et al. (2016) have also suggested that the spreading of misinformation online is fostered by the presence of polarization and echo chambers in social media (Garimella et al., 2016, 2017, 2018).

Trolling behavior is present and has been studied in all kinds of online media: online magazines (Binns, 2012), social networking sites (Cole, 2015), online computer games (Thacker and Griffiths, 2012), online encyclopedia (Shachaf and Hara, 2010), and online newspapers (Ruiz et al., 2011), among others.

Troll detection was addressed using domain-adapted sentiment analysis (Seah et al., 2015), lexico-syntactic features about writing style and structure (Chen et al., 2012; Mihaylov and Nakov, 2016), and graph-based approaches over signed social networks (Kumar et al., 2014).

Sockpuppet is a related notion, and refers to a person who assumes a false identity in an Internet community and then speaks to or about themselves while pretending to be another person. The term has also been used to refer to opinion manipulation, e.g., in Wikipedia (Solorio et al., 2014). Sockpuppets have been identified by using authorship-identification techniques and link analysis (Bu et al., 2013). It has been also shown that sockpuppets differ from ordinary users in their posting behavior, linguistic traits, and social network structure (Kumar et al., 2017).
Internet Water Army is a literal translation of the Chinese term wangluo shuijun, which is a metaphor for a large number of people who are well organized to flood the Internet with purposeful comments and articles. The Internet water army has been allegedly used in China by the government (also known as the 50 Cent Party) as well as by a number of private organizations.

Astroturfing is an effort to simulate a political grass-roots movement. It has attracted strong interest from political science, and research on it has focused on massive streams of microblogging data (Ratkiewicz et al., 2011).

Identification of malicious accounts in social media includes detecting spam accounts (Almaatouq et al., 2016; McCord and Chuah, 2011), fake accounts (Fire et al., 2014; Cresci et al., 2015), and compromised and phishing accounts (Adewole et al., 2017). Fake profile detection has also been studied in the context of cyber-bullying (Galán-García et al., 2016). A related problem is that of Web spam detection, which has been addressed as a text classification problem (Sebastiani, 2002), e.g., using spam keyword spotting (Dave et al., 2003), lexical affinity of arbitrary words to spam content (Hu and Liu, 2004), and frequency of punctuation and word co-occurrence (Li et al., 2006).

Trustworthiness of online statements is an emerging topic, given the interest in fake news (Lazer et al., 2018). It is related to trolls, as they often engage in opinion manipulation and spread rumors (Vosoughi et al., 2018). Research topics include predicting the credibility of information in social media (Ma et al., 2016; Mitra et al., 2017; Karadzhov et al., 2017b; Popat et al., 2017) and in political debates (Hassan et al., 2015; Gencheva et al., 2017; Jaradat et al., 2018), as well as stance classification (Mohtarami et al., 2018). For example, Castillo et al. (2011) leverage user reputation, author writing style, and various time-based features, Canini et al. (2011) analyze the interaction of content and social network structure, and Morris et al. (2012) studied how Twitter users judge truthfulness. Zubiaga et al. (2016) study how people handle rumors in social media, and found that users with higher reputation are more trusted, and thus can spread rumors easily. Lukasik et al. (2015) use temporal patterns to detect rumors and to predict their frequency, and Zubiaga et al. (2016) focus on conversational threads. More recent work has focused on the credibility and the factuality in community forums (Nakov et al., 2017; Mihaylova et al., 2018, 2019; Mihaylov et al., 2018).

2.2 Understanding the Role of Political Trolls

None of the above work has focused on understanding the role of political trolls. The only closely relevant work is that of Kim et al. (2019), who predict the roles of the Russian trolls on Twitter by leveraging social theory and Actor-Network Theory approaches. They characterize trolls using the digital traces they leave behind, which is modeled using a time-sensitive semantic edit distance. For this purpose, they use the "IRA Russian Troll" dataset (Linvill and Warren, 2018), which we also use in our experiments. However, we have a very different approach based on graph embeddings, which we show to be superior to their method in the supervised setup. We further experiment with a new, and arguably more realistic, setup based on distant supervision, where labels are not available. To the best of our knowledge, this setup has not been explored in previous work.

2.3 Graph Embeddings

Graph embeddings are machine learning techniques to model and capture key features from a graph automatically. They can be trained either in a supervised or in an unsupervised manner (Cai et al., 2018). The produced embeddings are latent vector representations that map each vertex in a graph G to a d-dimensional vector. The vectors capture the underlying structure of the graph by putting "similar" vertices close together in the vector space. By expressing our data as a graph structure, we can leverage and extract critical insights about the topology and the contextual relationships between the vertices in the graph.
In mathematical terms, graph embeddings can be expressed as a function f : V → R^d from the set of vertices V to a set of embeddings, where d is the dimensionality of the embeddings. The function f can be represented as a matrix of dimensions |V| × d. In our experiments, we train graph embeddings in an unsupervised manner by using node2vec (Grover and Leskovec, 2016), which is based on random walks over the graph. Essentially, this is an application of the well-known skip-gram model (Mikolov et al., 2013) from word2vec to random walks on graphs.

Besides node2vec, there have been a number of competing proposals for building graph embeddings; see (Cai et al., 2018) for an extensive overview of the topic. For example, SNE (Liao et al., 2018) models both the graph structure and some node attributes. Similarly, LINE (Tang et al., 2015) represents each node as the concatenation of two embedded vectors that model first- and second-order proximity. TriDNR (Pan et al., 2016) represents nodes by coupling several neural network models. For our experiments, we use node2vec, as we do not have access to user attributes: the users have been banned from Twitter, their accounts were suspended, and we only have access to their tweets thanks to the "IRA Russian Trolls" dataset.
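To make the random-walks-plus-skip-gram recipe concrete, here is a minimal sketch in Python using networkx and gensim. It is a simplification we introduce for exposition only: it uses uniform random walks, whereas node2vec proper biases its walks with the return and in-out parameters p and q, and the hyper-parameter values here are illustrative rather than the paper's configuration.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def node2vec_style_embeddings(graph, dims=128, walks_per_node=10, walk_len=40):
    """Skip-gram over random walks, in the spirit of node2vec
    (uniform walks here; node2vec biases them via p and q)."""
    walks = []
    for node in graph.nodes():
        for _ in range(walks_per_node):
            walk = [node]
            for _ in range(walk_len - 1):
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    # Each walk is a "sentence" and each node a "word"; sg=1 selects skip-gram.
    model = Word2Vec(walks, vector_size=dims, window=10, min_count=1, sg=1)
    return {node: model.wv[str(node)] for node in graph.nodes()}

# Toy usage on a built-in graph; real experiments run on the Twitter graphs.
emb = node2vec_style_embeddings(nx.karate_club_graph(), dims=32)
```

The point of the sketch is why the approach works at all: vertices that co-occur in the same walk neighborhoods get similar vectors, exactly as words sharing contexts do in word2vec. The reference node2vec implementation of Grover and Leskovec (2016) can be used in place of this helper.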
3 Method

Given a set of known political troll users (each user being represented as a collection of their tweets), we aim to detect their role: left, right, or news feed. Linvill and Warren (2018) describe these roles as follows:

Right Trolls spread nativist and right-leaning populist messages. Such trolls support the candidacy and Presidency of Donald Trump and denigrate the Democratic Party; moreover, they often send divisive messages about mainstream and moderate Republicans.

Left Trolls send socially liberal messages and discuss gender, sexual, religious, and, especially, racial identity. Many of their tweets seem intentionally divisive, attacking mainstream Democratic politicians, particularly Hillary Clinton, while supporting Bernie Sanders prior to the elections.

News Feed Trolls overwhelmingly present themselves as US local news aggregators, linking to legitimate regional news sources and tweeting about issues of local interest.

Technically, we leverage the community structure and the text of the messages in the social network of political trolls represented as a graph, from which we learn and extract several types of vector representations, i.e., troll user embeddings. Then, armed with these representations, we tackle the following tasks:

T1 A fully supervised learning task, where we have labeled training data with example trolls and their roles;

T2 A distant supervision learning task, in which labels for the troll roles are not available at training time, and thus we use labels for news media as a proxy, from which we infer labels for the troll users.

3.1 Embeddings

We use two graph-based (user-to-hashtag and user-to-mentioned-user) and one text-based (BERT) embedding representations.

3.1.1 U2H

We build a bipartite, undirected User-to-Hashtag (U2H) graph, where nodes are users and hashtags, and there is an edge (u, h) between a user node u and a hashtag node h if user u uses hashtag h in their tweets. This graph is bipartite as there are no edges connecting two user nodes or two hashtag nodes. We run node2vec (Grover and Leskovec, 2016) on this graph, and we extract the embeddings for the users (we ignore the hashtag embeddings). We use 128 dimensions for the output embeddings. These embeddings capture how similar troll users are based on their usage of hashtags.
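A minimal sketch of this construction follows. The (user, hashtags) input format and the node-name prefixes are our own assumptions for illustration; the prefixes keep the two node namespaces disjoint, so the graph is bipartite by construction. The embedding call reuses the node2vec-style helper sketched in Section 2.3, whereas the actual experiments use node2vec itself.

```python
import networkx as nx

def build_u2h_graph(user_hashtags):
    """user_hashtags: iterable of (user, hashtags) pairs -- an assumed format.
    Prefixing node names keeps users and hashtags disjoint (bipartite)."""
    g = nx.Graph()
    for user, hashtags in user_hashtags:
        for tag in hashtags:
            g.add_edge("u:" + user, "h:" + tag.lower())
    return g

u2h = build_u2h_graph([("troll_1", ["MAGA"]), ("troll_2", ["MAGA", "news"])])
emb = node2vec_style_embeddings(u2h, dims=128)  # helper sketched in Sec. 2.3
# Keep only the user vectors; hashtag embeddings are discarded, as in the text.
user_vecs = {n[2:]: v for n, v in emb.items() if n.startswith("u:")}
```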
3.1.2 U2M

We build an undirected User-to-Mentioned-User (U2M) graph, where the nodes are users, and there is an edge (u, v) between two nodes if user u mentions user v in their tweets (i.e., u has authored a tweet that contains "@v"). We run node2vec on this graph and we extract the embeddings for the users. As we are interested only in the troll users, we ignore the embeddings of users who are only mentioned by other trolls. We use 128 dimensions for the output embeddings. The embeddings extracted from this graph capture how similar troll users are according to the targets of their discussions on the social network.

3.1.3 BERT

BERT offers state-of-the-art text embeddings based on the Transformer (Devlin et al., 2019). We use the pre-trained BERT-large, uncased model, which has 24 layers, 1024 hidden units, 16 heads, and 340M parameters, and which yields output embeddings with 768 dimensions. Given a tweet, we generate an embedding for it by averaging the representations of the BERT tokens from the penultimate layer of the neural network. To obtain a representation for a user, we average the embeddings of all their tweets. The embeddings extracted from the text capture how similar users are according to their use of language.
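One way to compute such representations is sketched below with the HuggingFace transformers library; the library choice, truncation setting, and function names are our assumptions rather than details from the paper. With output_hidden_states=True, the model returns the activations of every layer, so hidden_states[-2] is the penultimate encoder layer described above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased",
                                  output_hidden_states=True)
model.eval()

def tweet_embedding(text):
    """Average the token vectors from the penultimate encoder layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embeddings + 24 layers
    return hidden_states[-2].mean(dim=1).squeeze(0)    # mean over tokens

def user_embedding(tweets):
    """A user is represented as the average of their tweet embeddings."""
    return torch.stack([tweet_embedding(t) for t in tweets]).mean(dim=0)
```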

3.2 Fully Supervised Learning (T1)

Given a set of troll users for which we have labels, we use the above embeddings as a representation to train a classifier. We use an L2-regularized logistic regression (LR) classifier. Each troll user is an example, and the label for the user is available for training thanks to manual labeling. We can therefore use cross-validation to evaluate the predictive performance of the model, and thus the predictive power of the features.

We experiment with two ways of combining features: embedding concatenation and model ensembling. Embedding concatenation concatenates the feature vectors from different embeddings into a longer feature vector, which we then use to train the LR model. Model ensembling instead trains a separate model with each kind of embedding, and then merges the predictions of the different models by averaging the posterior probabilities for the different classes. Henceforth, we denote embedding concatenation with the symbol ∥ and model ensembling with ⊕. For example, U2H ∥ U2M is a model trained on the concatenation of the U2H and U2M embeddings, while U2H ⊕ BERT represents the average predictions of two models, one trained on U2H embeddings and one on BERT.
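A minimal sketch of the two combination strategies, assuming row-aligned per-user feature matrices X_u2h, X_u2m, X_bert and a label vector y (hypothetical variable names); scikit-learn's LogisticRegression is L2-regularized by default, matching the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def concat_predictions(X_parts, y):
    """Embedding concatenation: one LR model over the stacked features."""
    X = np.hstack(X_parts)
    return cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)

def ensemble_predictions(X_parts, y):
    """Model ensembling: one LR model per embedding; average the posteriors."""
    posteriors = [cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                                    cv=5, method="predict_proba")
                  for X in X_parts]
    avg = np.mean(posteriors, axis=0)
    classes = np.unique(y)  # predict_proba columns follow sorted class labels
    return classes[avg.argmax(axis=1)]
```

Under this sketch, concat_predictions([X_u2h, X_u2m], y) would correspond to U2H ∥ U2M, while ensemble_predictions([X_u2h, X_bert], y) would correspond to U2H ⊕ BERT.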
3.3 Distant Supervision (T2)

In the distant supervision scenario, we assume not to have access to user labels. Given a set of troll users without labels, we use the embeddings described in Section 3.1 together with mentions of news media by the troll users to create proxy models. We assume that labels for news media are readily available, as they are stable sources of information that have a low churn rate.

We propagate labels from the given media to the troll users that mention them according to the following media-to-user mapping:

LEFT → left
RIGHT → right        (1)
CENTER → news feed

This propagation can be done in different ways: (a) by training a proxy model for media and then applying it to users, or (b) by additionally using label propagation (LP) for semi-supervised learning.

Let us describe the proxy model propagation (a) first. Let M be the set of media, and U be the set of users. We say a user u ∈ U mentions a medium m ∈ M if u posts a tweet that contains a link to the website of m. We denote the set of users that mention the medium m as Cm ⊆ U. We can therefore create a representation for a medium by aggregating the embeddings of the users that mention the target medium. Such a representation is convenient as it lies in the same space as the user representation. In particular, given a medium m ∈ M, we compute its representation R(m) as

R(m) = (1 / |Cm|) Σ_{u ∈ Cm} R(u),   (2)

where R(u) is the representation of user u, i.e., one (or a concatenation) of the embeddings described in Section 3.1. Finally, we can train a LR model that uses R(m) as features and the label for the medium l(m). This model can be applied to predict the label of a user u by using the same type of representation R(u), and the label mapping in Equation 1.

Label Propagation (b) is a transductive, graph-based, semi-supervised machine learning algorithm that, given a small set of labeled examples, assigns labels to previously unlabeled examples. The labels of each example change in relationship to the labels of neighboring ones in a properly-defined graph. More formally, given a partially-labeled dataset of examples X = Xu ∪ Xl, of which Xl are labeled examples with labels Yl, and Xu are unlabeled examples, and a similarity graph G(X, E), the label propagation algorithm finds the set of unknown labels Yu such that the number of discordant pairs (u, v) ∈ E : y_u ≠ y_v is minimized, where y_z is the label assigned to example z.

The algorithm works as follows: at every iteration of propagation, each unlabeled node updates its label to the most frequent one among its neighbors. LP reaches convergence when each node has the same label as the majority of its neighbors. We define two different versions of LP by creating two different versions of the similarity graph G.

LP1: Label Propagation using direct mention. In the first case, the set of edges among users U in the similarity graph G consists of the logical OR between the 2-hop closure of the U2H and the U2M graphs. That is, for each two users u, v ∈ U, there is an edge (u, v) ∈ E in the similarity graph if u and v share a common hashtag or a common user mention:

((u, h) ∈ U2H ∧ (v, h) ∈ U2H) ∨ ((u, w) ∈ U2M ∧ (v, w) ∈ U2M).

The graph therefore uses the same information that is available to the embeddings. To this graph, which currently encompasses only the set of users U, we add connections to the set of media M: we add an edge between each pair (u, m) if u ∈ Cm. Then, we run the label propagation algorithm, which propagates the labels from the labeled nodes M to the unlabeled nodes U, thanks to the mapping from Equation 1.

LP2: Label Propagation based on a similarity graph. In this case, we use the same representation for the media as in the proxy model case above, as described by Equation 2. Then, we build a similarity graph among media and users based on their embeddings. For each pair x, y ∈ U ∪ M, there is an edge (x, y) ∈ E in the similarity graph iff sim(R(x), R(y)) > τ, where sim is a similarity function between vectors, e.g., cosine similarity, and τ is a user-specified parameter that regulates the sparseness of the similarity graph. Finally, we perform label propagation on the similarity graph defined by the embedding similarity, with the set of nodes corresponding to M starting with labels, and with the set of nodes corresponding to U starting without labels.
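A compact sketch of this machinery follows: the medium representation of Equation 2, the τ-thresholded similarity graph used by LP2, and an iterative majority-vote propagation as described above. The data structures (dicts keyed by user and medium identifiers) and the arbitrary tie-breaking in the majority vote are our assumptions.

```python
import numpy as np
import networkx as nx

def medium_repr(cited_by, user_vecs):
    """Eq. (2): R(m) is the mean embedding of the users C_m who cite medium m."""
    return {m: np.mean([user_vecs[u] for u in users], axis=0)
            for m, users in cited_by.items()}

def similarity_graph(vecs, tau=0.55):
    """LP2: connect any pair whose cosine similarity exceeds the threshold tau."""
    keys = list(vecs)
    X = np.stack([vecs[k] / np.linalg.norm(vecs[k]) for k in keys])
    sims = X @ X.T
    g = nx.Graph()
    g.add_nodes_from(keys)
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if sims[i, j] > tau:
                g.add_edge(keys[i], keys[j])
    return g

def propagate(graph, seed_labels, max_iter=100):
    """Majority-vote label propagation; media seed labels stay fixed."""
    labels = dict(seed_labels)
    for _ in range(max_iter):
        changed = False
        for node in graph.nodes():
            if node in seed_labels:
                continue
            votes = [labels[n] for n in graph.neighbors(node) if n in labels]
            if votes:
                majority = max(set(votes), key=votes.count)
                if labels.get(node) != majority:
                    labels[node] = majority
                    changed = True
        if not changed:  # every node agrees with its neighborhood majority
            break
    return labels
```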
4 Data

4.1 IRA Russian Troll Tweets

Our main dataset contains 2 973 371 tweets by 2848 Twitter users, which the US House Intelligence Committee has linked to the Russian Internet Research Agency (IRA). The data was collected and published by Linvill and Warren (2018), and then made available online.¹ The time span covers the period from February 2012 to May 2018.

The trolls belong to the following manually assigned roles: Left Troll, Right Troll, News Feed, Commercial, Fearmonger, Hashtag Gamer, Non-English, Unknown. Kim et al. (2019) have argued that the first three categories are not only the most frequent, but also the most interesting ones. Moreover, focusing on these troll types allows us to establish a connection between troll types and the political bias of the news media they mention. Table 1 shows a summary of the troll role distribution, the total number of tweets per role, as well as examples of troll usernames and tweets.

Role        Users   Tweets    User Example     Tweet Example
Left        233     427 141   @samirgooden     @MichaelSkolnik @KatrinaPierson @samesfandiari Trump folks need to stop going on CNN.
Right       630     711 668   @chirrmorre      BREAKING: Trump ERASES Obama's Islamic Refugee Policy! https://t.co/uPTneTMNM5
News Feed   54      598 226   @dailysandiego   Exit poll: Wisconsin GOP voters excited, scared about Trump #politics

Table 1: Statistics and examples from the IRA Russian Trolls Tweets dataset.

4.2 Media Bias/Fact Check

We use data from Media Bias/Fact Check (MBFC)² to label news media sites. MBFC divides news media into the following bias categories: Extreme-Left, Left, Center-Left, Center, Center-Right, Right, and Extreme-Right. We reduce the granularity to three categories by grouping Extreme-Left and Left as LEFT, Extreme-Right and Right as RIGHT, and Center-Left, Center-Right, and Center as CENTER. Table 2 shows some basic statistics about the resulting media dataset. Similarly to the IRA dataset, the distribution is right-heavy.

Bias     Count   Example
LEFT     341     www.cnn.com
RIGHT    619     www.foxnews.com
CENTER   372     www.apnews.com

Table 2: Summary statistics about the Media Bias/Fact Check (MBFC) dataset.

¹http://github.com/fivethirtyeight/russian-troll-tweets
²http://mediabiasfactcheck.com
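The coarsening of the MBFC scale, and the matching of expanded tweet links to MBFC domains, amounts to something like the following sketch; the mbfc_bias lookup table and its key format are hypothetical.

```python
from urllib.parse import urlparse

# Collapse MBFC's seven-way bias scale into the three labels used in the paper.
MBFC_TO_COARSE = {
    "extreme-left": "LEFT", "left": "LEFT",
    "center-left": "CENTER", "center": "CENTER", "center-right": "CENTER",
    "right": "RIGHT", "extreme-right": "RIGHT",
}

def media_label(expanded_url, mbfc_bias):
    """mbfc_bias: a (hypothetical) dict from domain to MBFC bias category."""
    domain = urlparse(expanded_url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return MBFC_TO_COARSE.get(mbfc_bias.get(domain))

print(media_label("http://www.cnn.com/article", {"cnn.com": "left"}))  # LEFT
```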

5 Experiments and Evaluation

5.1 Experimental Setup

For each user in the IRA dataset, we extracted all the links in their tweets, we expanded them recursively if they were shortened, we extracted the domain of the link, and we checked whether it could be found in the MBFC dataset. By grouping these relationships by media, we constructed the sets of users Cm that mention a given medium m ∈ M.

The U2H graph consists of 108 410 nodes and 443 121 edges, while the U2M graph has 591 793 nodes and 832 844 edges. We ran node2vec on each graph to extract 128-dimensional vectors for each node. We used these vectors as features for the fully supervised and for the distant-supervision scenarios. For Label Propagation, we used an empirical threshold for edge materialization τ = 0.55, to obtain a reasonably sparse similarity graph.

We used two evaluation measures: accuracy, and macro-averaged F1 (the harmonic average of precision and recall). In the supervised scenario, we performed 5-fold cross-validation. In the distant-supervision scenario, we propagated labels from the media to the users. Therefore, in the latter case, the user labels were only used for evaluation.
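For completeness, a small sketch of the two measures as we would compute them with scikit-learn, scaled to 0–100 to match the tables that follow; this bookkeeping is our assumption, not code from the paper.

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """The two measures used throughout: accuracy and macro-averaged F1."""
    return {"accuracy": 100 * accuracy_score(y_true, y_pred),
            "macro_f1": 100 * f1_score(y_true, y_pred, average="macro")}
```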
5.2 Evaluation Results

Table 3 shows the evaluation results. Each line of the table represents a different combination of features, models, or techniques. As mentioned in Section 3, the symbol '∥' denotes a single model trained on the concatenation of the features, while the symbol '⊕' denotes an averaging of individual models trained on each feature separately. The tags 'LP1' and 'LP2' denote the two label propagation versions, by mention and by similarity, respectively.

                              Full Supervision (T1)    Distant Supervision (T2)
Method                        Accuracy    Macro F1     Accuracy    Macro F1
Baseline (majority class)     68.7        27.1         68.7        27.1
Kim et al. (2019)             84.0        75.0         N/A         N/A
BERT                          86.9        83.1         75.1        60.5
U2H                           87.1        83.2         76.3        60.9
U2M                           88.1        83.9         77.3        62.4
U2H ⊕ U2M                     88.3        84.1         77.9        64.1
U2H ∥ U2M                     88.7        84.4         78.0        64.6
U2H ⊕ U2M ⊕ BERT              89.0        84.4         78.0        65.0
U2H ∥ U2M ∥ BERT              89.2        84.7         78.2        65.1
U2H ∥ U2M ∥ BERT + LP1        89.3        84.7         78.3        65.1
U2H ∥ U2M ∥ BERT + LP2        89.6        84.9         78.5        65.7

Table 3: Predicting the role of the troll users using full vs. distant supervision.

We can see that accuracy and macro-averaged F1 are strongly correlated and yield very consistent rankings for the different models. Thus, henceforth we will focus our discussion on accuracy.

We can see in Table 3 that it is possible to predict the roles of the troll users by using distant supervision with relatively high accuracy. Indeed, the results for T2 are lower compared to their T1 counterparts by only 10 and 20 points absolute in terms of accuracy and F1, respectively. This is impressive considering that the models for T2 have no access to labels for troll users.

Looking at individual features, for both T1 and T2, the embeddings from U2M outperform those from U2H and from BERT. One possible reason is that the U2M graph is larger, and thus contains more information. It is also possible that the social circle of a troll user is more indicative than the hashtags they used. Finally, the textual content on Twitter is quite noisy, and thus the BERT embeddings perform slightly worse when used alone.

All our models with a single type of embedding easily outperform the model of Kim et al. (2019). The difference is even larger when combining the embeddings, be it by concatenating the embedding vectors or by training separate models and then combining the posteriors of their predictions.

By concatenating the U2M and the U2H embeddings (U2H ∥ U2M), we fully leverage the hashtag and the mention representations in the latent space, thus achieving accuracy of 88.7 for T1 and 78.0 for T2, which is slightly better than when training separate models and then averaging their posteriors (U2H ⊕ U2M): 88.3 for T1 and 77.9 for T2. Adding BERT embeddings to the combination yields further improvements, and follows a similar trend, where feature concatenation works better, yielding 89.2 accuracy for T1 and 78.2 for T2 (compared to 89.0 accuracy for T1 and 78.0 for T2 for U2H ⊕ U2M ⊕ BERT).

Adding label propagation yields further improvements, both for LP1 and for LP2, with the latter being slightly superior: 89.6 vs. 89.3 accuracy for T1, and 78.5 vs. 78.3 for T2.

Overall, our methodology achieves sizable improvements over previous work, reaching an accuracy of 89.6 vs. 84.0 for Kim et al. (2019) in the fully supervised case. Moreover, it achieves 78.5 accuracy in the distant supervision case, which is only 11 points behind the result for T1, and is about 10 points above the majority class baseline.

6 Discussion

6.1 Ablation Study

We performed different experiments with the hyper-parameters of the graph embeddings. With smaller dimensionality (i.e., using 16 dimensions instead of 128), we noticed 2–3 points of absolute decrease in accuracy across the board.

Moreover, we found that using all of the data for learning the embeddings was better than focusing only on the users that we target in this study, namely left, right, and news feed: using the rest of the data adds additional context to the embedding space, and makes the target labels more contextually distinguishable. Accordingly, we observe a 5–6 point absolute drop in accuracy when training our embeddings only on tweets by trolls labeled as left, right, and news feed.

6.2 Comparison to Full Supervision

Next, we compared to the work of Kim et al. (2019), who had a fully supervised learning scenario, based on Tarde's Actor-Network Theory. They paid more attention to the content of the tweet by applying a text-distance metric in order to capture the semantic distance between two sequences. In contrast, we focus on critical elements of information that are salient in Twitter: hashtags and user mentions. By building a connection between users, hashtags, and user mentions, we effectively filtered out the noise and focused only on the most sensitive type of context, thus automatically capturing features from this network via graph embeddings.

6.3 Reverse Classification: Media from Trolls

Table 4 shows an experiment in distant supervision for reverse classification, where we trained a model on the IRA dataset with the troll labels, and then we applied that model to the representation of the media in the MBFC dataset, where each medium is represented as the average of the embeddings of the users who cited that medium. We can see that we could improve over the baseline by 20 points absolute in terms of accuracy and by 41 points absolute in terms of macro-averaged F1.

Method                  Accuracy   Macro F1
Baseline (majority)     46.5       21.1
BERT                    61.8       60.4
U2H                     61.6       60.0
U2M                     62.7       61.4
U2H ⊕ U2M               63.5       61.8
U2H ∥ U2M               63.8       61.9
U2H ⊕ U2M ⊕ BERT        63.7       61.8
U2H ∥ U2M ∥ BERT        64.0       62.2

Table 4: Leveraging user embeddings to predict the bias of the media cited by troll users.

We can see in Table 4 that the relative ordering in terms of performance for the different models is consistent with that for the experiments in the previous section. This suggests that the relationship between trolls and media goes both ways, and thus we can use labels for media as a way to label users, and we can also use labels for troll users as a way to label media.

7 Conclusion and Future Work

We have proposed a novel approach to analyze the behavior patterns of political trolls according to their political leaning (left vs. news feed vs. right) using features from social media, i.e., from Twitter. We experimented with two scenarios: (i) supervised learning, where labels for trolls are provided, and (ii) distant supervision, where such labels are not available, and we rely on more commonly available labels for news outlets cited by the trolls. Technically, we leveraged the community structure and the text of the messages in the online social network of trolls represented as a graph, from which we extracted several types of representations, i.e., embeddings, for the trolls. Our experiments on the "IRA Russian Troll" dataset have shown improvements over the state-of-the-art in the supervised scenario, while providing a compelling case for the distant-supervision scenario, which has not been explored before.³

In future work, we plan to apply our methodology to other political events such as Brexit, as well as to other election campaigns around the world, in connection to which large-scale troll campaigns have been revealed. We further plan experiments with other graph embedding methods, and with other social media. Finally, the relationship between media bias and a troll's political role that we have highlighted in this paper is extremely interesting. We have shown how to use it to go from the media-space to the user-space and vice-versa, but so far we have just scratched the surface in terms of understanding the process that generated these data and its possible applications.

Acknowledgments

This research is part of the Tanbih project,⁴ which aims to limit the effect of "fake news", propaganda and media bias by making users aware of what they are reading. The project is developed in collaboration between the Qatar Computing Research Institute, HBKU and the MIT Computer Science and Artificial Intelligence Laboratory. Gianmarco De Francisci Morales acknowledges support from Intesa Sanpaolo Innovation Center. The funder had no role in the study design, in the data collection and analysis, in the decision to publish, or in the preparation of the manuscript.

³Our data and code are available at http://github.com/amatanasov/conll_political_trolls
⁴http://tanbih.qcri.org/

References

Kayode Sakariyah Adewole, Nor Badrul Anuar, Amirrudin Kamsin, Kasturi Dewi Varathan, and Syed Abdul Razak. 2017. Malicious accounts: Dark of the social networks. Journal of Network and Computer Applications, 79:41–67.

Abdullah Almaatouq, Erez Shmueli, Mariam Nouh, Ahmad Alabdulkareem, Vivek K Singh, Mansour Alsaleh, Abdulrahman Alarifi, Anas Alfaris, et al. 2016. If it looks like a spammer and behaves like a spammer, it must be a spammer: analysis and detection of microblogging spam accounts. International Journal of Information Security, 15(5):475–491.

Amy Binns. 2012. DON'T FEED THE TROLLS! Managing troublemakers in magazines' online communities. Journalism Practice, 6(4):547–562.

Zhan Bu, Zhengyou Xia, and Jiandong Wang. 2013. A sock puppet detection algorithm on virtual spaces. Knowledge-Based Systems, 37:366–377.

Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637.

Kevin R. Canini, Bongwon Suh, and Peter L. Pirolli. 2011. Finding credible information sources in social networks based on content and social structure. In Proceedings of the IEEE Conference on Privacy, Security, Risk, and Trust, and the IEEE Conference on Social Computing, SocialCom/PASSAT '11, pages 1–8, Boston, MA, USA.

Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 675–684, Hyderabad, India.

Cheng Chen, Kui Wu, Venkatesh Srinivasan, and Xudong Zhang. 2013. Battling the internet water army: Detection of hidden paid posters. In Proceedings of the International Conference on Advances in Social Networks Analysis and Mining, ASONAM '13, pages 116–120, Niagara Falls, ON, Canada.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. In Proceedings of the IEEE Conference on Privacy, Security, Risk and Trust and of the IEEE Conference on Social Computing, PASSAT/SocialCom '12, pages 71–80, Amsterdam, Netherlands.

Kirsti K Cole. 2015. "It's like she's eager to be verbally abused": Twitter, trolls, and (en)gendering disciplinary rhetoric. Feminist Media Studies, 15(2):356–358.

Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2015. Fame for sale: efficient detection of fake Twitter followers. Decision Support Systems, 80:56–71.
Kareem Darwish, Dimitar Alexandrov, Preslav Nakov, and Yelena Mejova. 2017. Seminar users in the Arabic Twitter sphere. In Proceedings of the 9th International Conference on Social Informatics, SocInfo '17, pages 91–108, Oxford, UK.

Kushal Dave, Steve Lawrence, and David M Pennock. 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th International World Wide Web Conference, WWW '03, pages 519–528, Budapest, Hungary.

Michela Del Vicario, Alessandro Bessi, Fabiana Zollo, Fabio Petroni, Antonio Scala, Guido Caldarelli, H Eugene Stanley, and Walter Quattrociocchi. 2016. The spreading of misinformation online. Proceedings of the National Academy of Sciences, 113(3):554–559.

Chrysanthos Dellarocas. 2006. Strategic manipulation of internet opinion forums: Implications for consumers and firms. Management Science, 52(10):1577–1593.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '19, pages 4171–4186, Minneapolis, MN, USA.

Ibrahim Elbeltagi and Gomaa Agag. 2016. E-retailing ethics and its impact on customer satisfaction and repurchase intention: A cultural and commitment-trust theory perspective. Internet Research, 26(1):288–310.

Michael Fire, Dima Kagan, Aviad Elyashar, and Yuval Elovici. 2014. Friend or foe? Fake profile identification in online social networks. Social Network Analysis and Mining, 4(1):1–23.

Patxi Galán-García, José Gaviria de la Puerta, Carlos Laorden Gómez, Igor Santos, and Pablo García Bringas. 2016. Supervised machine learning for the detection of troll profiles in Twitter social network: Application to a real case of cyberbullying. Logic Journal of the IGPL, 24(1):42–53.

Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, and Michael Mathioudakis. 2016. Quantifying controversy in social media. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining, WSDM '16, pages 33–42, San Francisco, CA, USA.

Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, and Michael Mathioudakis. 2017. The effect of collective attention on controversial debates on social media. In Proceedings of the 9th International ACM Web Science Conference, WebSci '17, pages 43–52, Troy, NY, USA.

Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, and Michael Mathioudakis. 2018. Political discourse on social media: Echo chambers, gatekeepers, and the price of bipartisanship. In Proceedings of the International World Wide Web Conference, WWW '18, pages 913–922, Lyon, France.

Pepa Gencheva, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, and Ivan Koychev. 2017. A context-aware approach for detecting worth-checking claims in political debates. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP '17, pages 267–276, Varna, Bulgaria.

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 855–864, San Francisco, CA, USA.

Hong-Youl Ha, Joby John, J. Denise John, and Yongkyun Chung. 2016. Temporal effects of information from social networks on online behavior: The role of cognitive and affective trust. Internet Research, 26(1):213–235.

Momchil Hardalov, Ivan Koychev, and Preslav Nakov. 2016. In search of credible news. In Proceedings of the 17th International Conference on Artificial Intelligence: Methodology, Systems, and Applications, AIMSA '16, pages 172–180, Varna, Bulgaria.

Naeemul Hassan, Chengkai Li, and Mark Tremayne. 2015. Detecting check-worthy factual claims in presidential debates. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM '15, pages 1835–1838, Melbourne, Australia.
Li-An Ho, Tsung-Hsien Kuo, and Binshan Lin. 2012. How social identification and trust influence organizational online knowledge sharing. Internet Research, 22(1):4–28.

Meng-Hsiang Hsu, Li-Wen Chuang, and Cheng-Se Hsu. 2014. Understanding online shopping intention: the roles of four types of trust and their antecedents. Internet Research, 24(3):332–352.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, Seattle, WA, USA.

Israa Jaradat, Pepa Gencheva, Alberto Barrón-Cedeño, Lluís Màrquez, and Preslav Nakov. 2018. ClaimRank: Detecting check-worthy claims in Arabic and English. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '18, New Orleans, LA, USA.

Andreas M Kaplan and Michael Haenlein. 2010. Users of the world, unite! The challenges and opportunities of social media. Business Horizons, 53(1):59–68.

Georgi Karadzhov, Pepa Gencheva, Preslav Nakov, and Ivan Koychev. 2017a. We built a fake news & click-bait filter: What happened next will blow your mind! In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP '17, pages 334–343, Varna, Bulgaria.

Georgi Karadzhov, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, and Ivan Koychev. 2017b. Fully automated fact checking using external sources. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP '17, pages 344–353, Varna, Bulgaria.
Dongwoo Kim, Timothy Graham, Zimin Wan, and Marian-Andrei Rizoiu. 2019. Tracking the digital traces of Russian trolls: Distinguishing the roles and strategy of trolls on Twitter. CoRR, abs/1901.05228.

Edward C. S. Ku. 2012. Beyond price, how does trust encourage online group's buying intention? Internet Research, 22(5):569–590.

Srijan Kumar, Justin Cheng, Jure Leskovec, and V.S. Subrahmanian. 2017. An army of me: Sockpuppets in online discussion communities. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 857–866, Perth, Australia.

Srijan Kumar, Francesca Spezzano, and V.S. Subrahmanian. 2014. Accurately detecting trolls in Slashdot Zoo via decluttering. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, ASONAM '14, pages 188–195, Beijing, China.

David M. J. Lazer, Matthew A. Baum, Yochai Benkler, Adam J. Berinsky, Kelly M. Greenhill, Filippo Menczer, Miriam J. Metzger, Brendan Nyhan, Gordon Pennycook, David Rothschild, Michael Schudson, Steven A. Sloman, Cass R. Sunstein, Emily A. Thorson, Duncan J. Watts, and Jonathan L. Zittrain. 2018. The science of fake news. Science, 359(6380):1094–1096.

Wenbin Li, Ning Zhong, and Chunnian Liu. 2006. Combining multiple email filters based on multivariate statistical analysis. In Foundations of Intelligent Systems, pages 729–738. Springer.

Xiaodong Li, Xinshuai Guo, Chuang Wang, and Shengliang Zhang. 2016. Do buyers express their true assessment? Antecedents and consequences of customer praise feedback behaviour on Taobao. Internet Research, 26(5):1112–1133.

Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2018. Attributed social network embedding. IEEE Transactions on Knowledge and Data Engineering, 30(12):2257–2270.

Darren L Linvill and Patrick L Warren. 2018. Troll factories: The Internet Research Agency and state-sponsored agenda building. Resource Centre on Media Freedom in Europe.

Clare Llewellyn, Laura Cram, Robin L. Hill, and Adrian Favero. 2018. For whom the bell trolls: Shifting troll behaviour in the Twitter Brexit debate. JCMS: Journal of Common Market Studies, 57(5):1148–1164.

Michal Lukasik, Trevor Cohn, and Kalina Bontcheva. 2015. Point process modelling of rumour dynamics in social media. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL-IJCNLP '15, pages 518–523, Beijing, China.

Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J. Jansen, Kam-Fai Wong, and Meeyoung Cha. 2016. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, IJCAI '16, pages 3818–3824, New York, NY, USA.

Michael McCord and M. Chuah. 2011. Spam detection on Twitter using traditional classifiers. In Proceedings of the 8th International Conference on Autonomic and Trusted Computing, ATC '11, pages 175–186, Banff, Canada.

Todor Mihaylov, Georgi Georgiev, and Preslav Nakov. 2015a. Finding opinion manipulation trolls in news community forums. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, CoNLL '15, pages 310–314, Beijing, China.

Todor Mihaylov, Ivan Koychev, Georgi Georgiev, and Preslav Nakov. 2015b. Exposing paid opinion manipulation trolls. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP '15, pages 443–450, Hissar, Bulgaria.

Todor Mihaylov, Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Georgi Georgiev, and Ivan Koychev. 2018. The dark side of news community forums: Opinion manipulation trolls. Internet Research, 28(5):1292–1312.

Todor Mihaylov and Preslav Nakov. 2016. Hunting for troll comments in news community forums. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL '16, pages 399–405, Berlin, Germany.

Tsvetomila Mihaylova, Georgi Karadzhov, Pepa Atanasova, Ramy Baly, Mitra Mohtarami, and Preslav Nakov. 2019. SemEval-2019 task 8: Fact checking in community question answering forums. In Proceedings of the 13th International Workshop on Semantic Evaluation, SemEval '19, pages 860–869, Minneapolis, MN, USA.
Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, Mitra Mohtarami, Georgi Karadjov, and James Glass. 2018. Fact checking in community forums. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI '18, pages 879–886, New Orleans, LA, USA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the International Conference on Neural Information Processing Systems, NIPS '13, pages 3111–3119, Lake Tahoe, NV, USA.

Tanushree Mitra, Graham P. Wright, and Eric Gilbert. 2017. A parsimonious language model of social media credibility across disparate events. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW '17, pages 126–145, Portland, OR, USA.

Mitra Mohtarami, Ramy Baly, James Glass, Preslav Nakov, Lluís Màrquez, and Alessandro Moschitti. 2018. Automatic stance detection using end-to-end memory networks. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '18, pages 767–776, New Orleans, LA, USA.

Michael J Moore, Tadashi Nakano, Akihiro Enomoto, and Tatsuya Suda. 2012. Anonymity and roles associated with aggressive posts in an online forum. Computers in Human Behavior, 28(3):861–867.

Meredith Ringel Morris, Scott Counts, Asta Roseway, Aaron Hoff, and Julia Schwarz. 2012. Tweeting is believing?: Understanding microblog credibility perceptions. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW '12, pages 441–450, Seattle, WA, USA.

Preslav Nakov, Tsvetomila Mihaylova, Lluís Màrquez, Yashkumar Shiroya, and Ivan Koychev. 2017. Do not trust the trolls: Predicting credibility in community question answering forums. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP '17, pages 551–560, Varna, Bulgaria.

Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang, and Yang Wang. 2016. Tri-party deep network representation. Network, 11(9):12.

Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17, pages 1003–1012, Perth, Australia.

Jacob Ratkiewicz, Michael Conover, Mark Meiss, Bruno Gonçalves, Snehal Patil, Alessandro Flammini, and Filippo Menczer. 2011. Truthy: Mapping the spread of astroturf in microblog streams. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 249–252, Hyderabad, India.

Carlos Ruiz, David Domingo, Josep Lluís Micó, Javier Díaz-Noci, Koldo Meso, and Pere Masip. 2011. Public sphere 2.0? The democratic qualities of citizen debates in online newspapers. The International Journal of Press/Politics, 16(4):463–487.

Jari Salo and Heikki Karjaluoto. 2007. A conceptual model of trust in the online environment. Online Information Review, 31(5):604–621.

Chun-Wei Seah, Hai Leong Chieu, Kian Ming Adam Chai, Loo-Nin Teow, and Lee Wei Yeong. 2015. Troll detection by domain-adapting sentiment analysis. In Proceedings of the 18th International Conference on Information Fusion, FUSION '15, pages 792–799, Washington, DC, USA.

Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Pnina Shachaf and Noriko Hara. 2010. Beyond vandalism: Wikipedia trolls. Journal of Information Science, 36(3):357–370.

Thamar Solorio, Ragib Hasan, and Mainul Mizan. 2014. Sockpuppet detection in Wikipedia: A corpus of real-world deceptive writing for linking identities. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC '14, pages 1355–1358, Reykjavik, Iceland.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW '15, pages 1067–1077, Florence, Italy.
Scott Thacker and Mark D Griffiths. 2012. An exploratory study of trolling in online video gaming. International Journal of Cyber Behavior, Psychology and Learning (IJCBPL), 2(4):17–33.

Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science, 359(6380):1146–1151.

Mengzhou Zhuang, Geng Cui, and Ling Peng. 2018. Manufactured opinions: The effect of manipulating online product reviews. Journal of Business Research, 87:24–35.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE, 11(3):1–29.
