Predicting the Role of Political Trolls in Social Media
Atanas Atanasov, Sofia University, Bulgaria
Gianmarco De Francisci Morales, ISI Foundation, Italy
Preslav Nakov, Qatar Computing Research Institute, HBKU, Qatar

Abstract

We investigate the political roles of “Internet trolls” in social media. Political trolls, such as the ones linked to the Russian Internet Research Agency (IRA), have recently gained enormous attention for their ability to sway public opinion and even influence elections. Analysis of the online traces of trolls has shown different behavioral patterns, which target different slices of the population. However, this analysis is manual and labor-intensive, which makes it impractical as a first-response tool for newly-discovered troll farms. In this paper, we show how to automate this analysis by using machine learning in a realistic setting. In particular, we show how to classify trolls according to their political role (left, news feed, right) by using features extracted from social media, i.e., Twitter, in two scenarios: (i) a traditional supervised learning scenario, where labels for trolls are available, and (ii) a distant supervision scenario, where labels for trolls are not available, and we rely on more-commonly-available labels for news outlets mentioned by the trolls. Technically, we leverage the community structure and the text of the messages in the online social network of trolls, represented as a graph, from which we extract several types of learned representations, i.e., embeddings, for the trolls. Experiments on the “IRA Russian Troll” dataset show that our methodology improves over the state of the art in the first scenario, while providing a compelling case for the second scenario, which has not been explored in the literature thus far.

1 Introduction

Internet “trolls” are users of an online community who quarrel and upset people, seeking to sow discord by posting inflammatory content. More recently, organized “troll farms” of political opinion manipulation trolls have also emerged. Such farms usually consist of state-sponsored agents who control a set of pseudonymous user accounts and personas, the so-called “sockpuppets”, which disseminate misinformation and propaganda in order to sway opinions, destabilize society, and even influence elections (Linvill and Warren, 2018).

The behavior of political trolls has been analyzed in different recent circumstances, such as the 2016 US Presidential Elections and the Brexit referendum in the UK (Linvill and Warren, 2018; Llewellyn et al., 2018). However, this kind of analysis requires painstaking and time-consuming manual labor to sift through the data and to categorize the trolls according to their actions. Our goal in the current paper is to automate this process with the help of machine learning (ML). In particular, we focus on the case of the 2016 US Presidential Elections, for which a public dataset from Twitter is available. For this case, we consider only accounts that post content in English, and we wish to divide the trolls into some of the functional categories identified by Linvill and Warren (2018): left troll, right troll, and news feed.

We consider two possible scenarios. The first, prototypical ML scenario is supervised learning, where we want to learn a function from users to categories {left, right, news feed}, and the ground-truth labels for the troll users are available. This scenario has been considered previously in the literature by Kim et al. (2019). Unfortunately, a solution for such a scenario is not directly applicable to a real-world use case.
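To make the first, supervised scenario concrete, the following minimal sketch trains an off-the-shelf classifier to map per-account feature vectors to the three roles. It is an illustration only: the random placeholder embeddings, the data shapes, and the choice of logistic regression are assumptions of this example, not the paper's actual models.

# Minimal sketch of the supervised scenario: classify troll accounts
# into {left, news feed, right} from per-account feature vectors.
# The feature matrix is a random placeholder standing in for learned
# user embeddings; the labels are likewise synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ROLES = ["left", "news feed", "right"]

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 64))             # placeholder user embeddings
y = rng.integers(0, len(ROLES), size=600)  # placeholder ground-truth roles

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))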
Suppose a new troll farm trying to sway the upcoming European or US elections has just been discovered. While the identities of the accounts might be available, the labels to learn from would not be present. Thus, any supervised machine learning approach would fall short of being a fully automated solution to our initial problem.

A more realistic scenario assumes that labels for troll accounts are not available. In this case, we need to use some external information in order to learn a labeling function. Indeed, we leverage more persistent entities and their labels: news media. We assume a learning scenario with distant supervision, where labels for news media are available. By combining these labels with a citation graph from the troll accounts to the news media, we can infer the final labeling of the accounts themselves without any need for manual labeling.

One advantage of using distant supervision is that we can get insights into the behavior of a newly-discovered troll farm quickly and effortlessly. Differently from troll accounts in social media, which usually have a high churn rate, news media accounts in social media are quite stable. Therefore, the latter can be used as an anchor point to understand the behavior of trolls for which data may not be available.

We rely on embeddings extracted from social media. In particular, we use a combination of embeddings built on the user-to-user mention graph, the user-to-hashtag mention graph, and the text of the tweets of the troll accounts. We further explore several possible approaches using label propagation for the distant supervision scenario; a simple variant is sketched below.
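The sketch below illustrates, on toy data, the two ingredients just described: DeepWalk-style embeddings learned from random walks over the user-to-user mention graph, and a citation-weighted vote as one simple form of label propagation from news media to trolls. The graphs, edge weights, and hyper-parameters are assumptions of this example, not the paper's exact pipeline.

# Part 1 learns node embeddings for the user-to-user mention graph by
# running random walks and feeding them to Word2Vec (DeepWalk-style).
# Part 2 labels each troll by the weighted majority label of the news
# media it cites (one simple variant of label propagation).
import random
from collections import Counter, defaultdict

from gensim.models import Word2Vec

# --- Part 1: embeddings of the user-to-user mention graph ----------
mention_graph = {                      # hypothetical adjacency lists
    "troll_a": ["troll_b", "troll_c"],
    "troll_b": ["troll_a", "troll_c"],
    "troll_c": ["troll_a"],
}

def random_walks(graph, walks_per_node=10, walk_len=8, seed=0):
    """Uniform random walks, used as 'sentences' for Word2Vec."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in graph:
            walk = [start]
            while len(walk) < walk_len and graph.get(walk[-1]):
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks

w2v = Word2Vec(sentences=random_walks(mention_graph),
               vector_size=32, window=3, min_count=1, sg=1, seed=0)
embedding = w2v.wv["troll_a"]          # 32-dim vector for one account

# --- Part 2: distant supervision via the citation graph ------------
# (troll account, cited media domain, number of citations)
citations = [
    ("troll_a", "leftnews.example", 12),
    ("troll_a", "wire.example", 3),
    ("troll_b", "rightnews.example", 9),
    ("troll_b", "wire.example", 1),
]
media_labels = {
    "leftnews.example": "left",
    "rightnews.example": "right",
    "wire.example": "news feed",
}

votes = defaultdict(Counter)
for troll, medium, weight in citations:
    if medium in media_labels:         # only labeled media cast votes
        votes[troll][media_labels[medium]] += weight

# Each troll takes the label with the largest total citation weight.
troll_labels = {t: c.most_common(1)[0][0] for t, c in votes.items()}
print(troll_labels)                    # {'troll_a': 'left', 'troll_b': 'right'}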
As a result of our approach, we improve the classification accuracy by more than 5 percentage points for the supervised learning scenario. The distant supervision scenario has not previously been considered in the literature, and it is one of the main contributions of this paper. We show that even when the labels are hidden from the ML algorithm, we can recover 78.5% of the correct labels.

The contributions of this paper can be summarized as follows:

• We predict the political role of Internet trolls (left, news feed, right) in a realistic, unsupervised scenario, where labels for the trolls are not available, and which has not been explored in the literature before.

• We propose a novel distant supervision approach for this scenario, based on graph embeddings, BERT, and label propagation, which projects the more-commonly-available labels for news media onto the trolls who cited these media.

• We improve over the state of the art in the traditional, fully supervised setting, where training labels are available.

2 Related Work

2.1 Trolls and Opinion Manipulation

The promise of social media to democratize content creation (Kaplan and Haenlein, 2010) has been accompanied by many malicious attempts to spread misleading information over this new medium, which quickly got populated by sockpuppets (Kumar et al., 2017), Internet water army (Chen et al., 2013), astroturfers (Ratkiewicz et al., 2011), and seminar users (Darwish et al., 2017). Several studies have shown that trust is an important factor in online relationships (Ho et al., 2012; Ku, 2012; Hsu et al., 2014; Elbeltagi and Agag, 2016; Ha et al., 2016), but building trust is a long-term process and our understanding of it is still in its infancy (Salo and Karjaluoto, 2007). This makes it easy for politicians and companies to manipulate user opinions in community forums (Dellarocas, 2006; Li et al., 2016; Zhuang et al., 2018).

Trolls. Social media have seen the proliferation of fake news and clickbait (Hardalov et al., 2016; Karadzhov et al., 2017a), aggressiveness (Moore et al., 2012), and trolling (Cole, 2015). The latter is often understood as malicious online behavior intended to disrupt interactions, to aggravate interacting partners, and to lure them into fruitless argumentation (Chen et al., 2013). Here we are interested in studying not just any trolls, but those that engage in opinion manipulation (Mihaylov et al., 2015a,b, 2018). This latter definition of troll has also become prominent in the general public discourse recently. Del Vicario et al. (2016) have also suggested that the spreading of misinformation online is fostered by the presence of polarization and echo chambers in social media (Garimella et al., 2016, 2017, 2018).

Trolling behavior is present and has been studied in all kinds of online media: online magazines (Binns, 2012), social networking sites (Cole, 2015), online computer games (Thacker and Griffiths, 2012), online encyclopedias (Shachaf and Hara, 2010), and online newspapers (Ruiz et al., 2011), among others.

Troll detection has been addressed by using domain-adapted sentiment analysis (Seah et al., 2015), various lexico-syntactic features about user writing style and structure (Chen et al., 2012; Mihaylov and Nakov, 2016), and graph-based approaches over signed social networks (Kumar et al., 2014).

Sockpuppet is a related notion, and it refers to a person who assumes a false identity in an Internet community and then speaks to or about themselves while pretending to be another person. The term has also been used to refer to opinion manipulation, e.g., in Wikipedia (Solorio et al., 2014). Sockpuppets have been identified by using authorship-identification techniques and link analysis (Bu et al., 2013).

Credibility and veracity in social media have also been studied. For example, Castillo et al. (2011) leverage user reputation, author writing style, and various time-based features, Canini et al. (2011) analyze the interaction of content and social network structure, and Morris et al. (2012) study how Twitter users judge truthfulness. Zubiaga et al. (2016) study how people handle rumors in social media, and find that users with higher reputation are more trusted, and thus can spread rumors easily.