Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter

Federico Albanese ∗1,2, Leandro Lombardi1, Esteban Feuerstein2,3, and Pablo Balenzuela4,5

1 Instituto de Cálculo, CONICET - Universidad de , 2 Instituto de Ciencias de la Computación, CONICET - Universidad de Buenos Aires, Argentina 3 Departamento de Computación, Universidad de Buenos Aires, Argentina 4 Instituto de Física de Buenos Aires (IFIBA), CONICET, Argentina 5 Departamento de Física, Universidad de Buenos Aires, Argentina

August 2020

Abstract

The formation of majorities in public discussions often depends on individuals who shift their opinion over time. The detection and characterization of this type of individual is therefore extremely important for the political analysis of social networks. In this paper, we study changes in individuals’ affiliations on Twitter using natural language processing techniques and graph machine learning algorithms. In particular, we collected 9 million Twitter messages from 1.5 million users and constructed the retweet networks. We identified communities with explicit political orientation and the topics of discussion associated with them, which provide the topological representation of the political map on Twitter in the analyzed periods. With that data, we present a machine learning framework for social media user classification which efficiently detects “shifting users” (i.e. users that may change their affiliation over time). Moreover, this machine learning framework allows us to identify not only which topics are more persuasive (using a low-dimensional topic embedding), but also which individuals are more likely to change their affiliation given their topological properties in a Twitter graph.

arXiv:2008.10749v1 [cs.SI] 24 Aug 2020

1 Introduction

Technologically mediated social networks flourished as a social phenomenon at the beginning of this century with exponents such as Friendster (2002) or Myspace (2003) [1], but other popular websites soon took their place. Twitter is an online platform where news or data can reach millions of users in a matter of minutes [2]. Twitter is also of great academic interest, since individuals voluntarily and openly express their opinions and can interact with other users by retweeting their tweets. In particular, in the last decade there has been growing interest from computational social scientists, and numerous political studies have been published using information from this platform [3–8].

Previous works applied different machine learning models to these datasets. Xu et al. collected tweets using the Streaming API and implemented an unsupervised machine learning framework for detecting online wildlife trafficking using topic modeling [41]. Kurnaz et al. proposed a methodology which first extracts features from the text of a tweet and then applies deep sparse autoencoders in order to classify the sentiment of tweets [10]. Pinto et al. detected and analyzed the topics of discussion in the text of tweets and news articles, using Non-Negative Matrix Factorization [32], in order to understand the role of mass media in the formation of public opinion [11]. Kannangara, in turn, implemented a probabilistic method to identify the topic, sentiment and political orientation of tweets [44].

Other works focus on political analysis and the interaction between users. Aruguete et al. described how Twitter users frame political events by sharing content exclusively with like-minded users, forming two well-defined communities [12]. Dang-Xuan et al. downloaded tweets during the 2011 parliament elections in Germany and characterized the role of influencers using the retweet network [13]. Stewart et al. used community detection algorithms over a network of retweets to understand the behavior of trolls in the context of the #BlackLivesMatter movement [14]. Conover et al. [15] used similar techniques over a retweet network and showed a segregated partisan structure, with extremely limited connection between clusters of users with different political ideologies, during the 2010 U.S. congressional midterm elections. The same polarization on the Twitter network can be found in other contexts and countries (Canada [53], Egypt [51], [52]).

Opinion shifts in group discussions have been studied from different points of view. In particular, it has been argued that opinion shifts can be produced by the exchange of arguments, according to the Persuasive Arguments Theory (PAT) [48, 54, 55]. Primario et al. applied this theory to measure the evolution of political polarization on Twitter during the 2016 US Presidential election [47]. Along the same lines, Holthoefer et al. analyzed the Egyptian polarization dynamics on Twitter [51]. They classified the tweets into two groups (pro/anti military intervention) based on their text and estimated the overall proportion of users that changed their position.

These works analyzed the macro dynamics of polarization rather than focusing on individuals. In contrast, we set out not only to characterize the Twitter users who change their political opinion, but also to predict these “shifting voters”. The focus of this paper is therefore on the individuals rather than on the aggregated dynamics, using machine learning algorithms. Moreover, once these users are correctly identified, we seek to distinguish between persuasive and non-persuasive topics¹.

¹ We decided to use the adjective “shifting” for individuals, and “persuasive” (resp. non-persuasive) for the topics relevant (resp. non-relevant) to those individuals.

In this paper, we examined three Twitter network datasets constructed with tweets from the 2017 Argentina parliamentary elections, the 2019 Argentina presidential elections and 2020 United States tweets of Donald Trump. Three datasets were used in order to show that the methodology can be easily generalized to different scenarios. For each dataset, we analyzed two different time periods and identified the largest communities, corresponding to the main political forces. Using graph topological information and the topics of discussion detected in the first network, we built and trained a model that effectively predicts when an individual will change his/her community over time, identifying persuasive topics and relevant features of the shifting users.

Our main contributions are the following:

1. We described a generalized machine learning framework for social media user classification, in particular for detecting a user’s affiliation at a given time and whether the user will change it in the future. This framework includes natural language processing techniques and graph machine learning algorithms in order to describe the features of an individual.

2. We observed that the proposed machine learning model performs well on the task of predicting changes in a user’s affiliation over time.

3. We experimentally analyzed the machine learning framework by performing a feature importance analysis. While previous works used text, Twitter profiles and some tweeting behavior characteristics to automatically classify users with machine learning [16–19], here we showed the value of adding graph features in order to identify the label of a user. In particular, we showed the importance of PageRank for this specific task.

4. We also identified the topics that are considerably more relevant and persuasive to the shifting users. Identifying these key topics has a valuable impact for social science and politics.

The paper is organized as follows. In the Data Collection section, we describe the data used in the study. In the Methods section, we describe the graph unsupervised learning algorithms and other graph metrics that were used, the natural language processing tools applied to the tweets and the machine learning model. In the Results section, we analyze the performance of the model on the task of detecting shifting individuals. Finally, we interpret these results in the Conclusions section. The code is in github (omitted for anonymity reasons).

2 Data Collection

Twitter has several APIs available to developers. Among them is the Streaming API, which allows the developer to download in real time a sample of the tweets uploaded to the social network, filtered by language, terms, hashtags, etc. [20, 21]. The data is composed of the tweet id, the text, the date and time of the tweet, the user id and username, among other features. In the case of a retweet, it also includes the information of the original tweet’s user account.

For this research, we collected 3 datasets: 2017 Argentina parliamentary elections (2017ARG), 2019 Argentina presidential elections (2019ARG) and 2020 United States tweets of Donald Trump (2020US). For the Argentinean datasets, the Streaming API was used during the week before the primary elections and the week before the general elections took place. Keywords were chosen according to the four main political parties present in the elections. Details and context can be found in the Appendix. For the 2020US dataset, “realDonaldTrump” (the official account of president Donald Trump) was used as keyword. Twitter messages are in the public domain and only public tweets filtered by the Twitter API were collected for this work. For the purpose of this research, we analyzed more than 9 million tweets and more than 1.5 million individuals in total. The specific start and end collection dates and the total number of tweets and users can be seen in Table 1.
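As an illustration of how retweet interactions could be read from the collected data, the following minimal sketch extracts (retweeting user, original user) pairs from tweets stored as JSON lines. The field names (`user`, `id_str`, `retweeted_status`) follow the classic Twitter v1.1 tweet payload and are an assumption about the storage format; the sample records and the helper name `retweet_pair` are hypothetical.

```python
import json


def retweet_pair(tweet):
    """Return (retweeting_user_id, original_user_id), or None if not a retweet.

    Assumes a v1.1-style payload where retweets carry a "retweeted_status" object.
    """
    original = tweet.get("retweeted_status")
    if original is None:
        return None
    return tweet["user"]["id_str"], original["user"]["id_str"]


# Hypothetical sample records, one JSON document per line.
raw_lines = [
    '{"id_str": "10", "text": "RT @b: hi", "user": {"id_str": "a"}, '
    '"retweeted_status": {"id_str": "7", "user": {"id_str": "b"}}}',
    '{"id_str": "11", "text": "hello", "user": {"id_str": "c"}}',
]

pairs = []
for line in raw_lines:
    pair = retweet_pair(json.loads(line))
    if pair is not None:
        pairs.append(pair)

print(pairs)
```

Only the first record is a retweet, so only one pair is produced; original tweets are skipped.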

3 Methods

In this section, we introduce the methodology used to characterize the Twitter users. First, we describe the retweet networks (Section 3.1) and the algorithm used to find communities (Section 3.2). Then, we present the metrics that describe each user’s position in the interaction network (Section 3.3), followed by the features obtained by analyzing the text of the tweets (Section 3.4). Finally, we describe the supervised learning model, which takes the individuals’ characteristics as input and predicts the shifting users (Section 3.5).

3.1 The retweet network

We represent the interaction among individuals as a graph, where users are nodes and retweets between them (one or more) are edges (undirected and unweighted). Isolated nodes (users who never retweeted nor were retweeted) were not taken into account in this analysis. Figure 1 shows the retweet network for each time period and dataset. In the US dataset, most of the users are concentrated in two groups, which makes the political polarization visible. In the Argentinean datasets, on the other hand, we can identify two large groups and also some smaller ones. The graph visualizations were produced with the Force Atlas 2 layout using the Gephi software [22].
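The graph construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the input is a list of (retweeting user, original author) pairs, self-retweets are ignored (an assumption, not stated in the text), and the function name `build_retweet_graph` is hypothetical.

```python
from collections import defaultdict


def build_retweet_graph(retweet_pairs):
    """Build an undirected, unweighted graph as a dict of neighbor sets."""
    adjacency = defaultdict(set)
    for retweeter, author in retweet_pairs:
        if retweeter == author:
            continue  # assumption: self-retweets do not create an edge
        # Sets collapse repeated retweets into a single unweighted edge.
        adjacency[retweeter].add(author)
        adjacency[author].add(retweeter)
    return dict(adjacency)


# Hypothetical retweet pairs: repeated and reciprocal retweets yield one edge.
pairs = [("ana", "bob"), ("ana", "bob"), ("bob", "ana"), ("carl", "bob"), ("dan", "dan")]
graph = build_retweet_graph(pairs)

for user in sorted(graph):
    print(user, sorted(graph[user]))
```

Because nodes are only created when an edge is added, users without any retweet interaction never enter the graph, which mirrors the exclusion of isolated nodes.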

3.2 Unsupervised Learning: Community Detection

In a given graph, a community is a set of nodes densely connected among themselves and with little or no connection to nodes of other communities [28]. We implemented an algorithm to detect communities in large networks, which allows us to characterize the users by their relationship with other users. In this context, modularity is defined as the fraction of the edges that fall within a given community minus the expected fraction if edges were distributed at random [46]. The Louvain method for community detection [29] seeks to maximize modularity using a greedy optimization algorithm. This method was chosen because of the characteristics of the data. While other algorithms such as label propagation scale well to large networks, their performance decreases if clusters are not well defined [30]. In such cases the Louvain or Infomap methods obtain better results. Moreover, given that the number of nodes is in the order of hundreds of thousands and the number of edges in the order of one million, the Louvain method performs better than the alternatives [31].

Although several communities were found, we only considered the largest ones in each case. For the 2017ARG and 2019ARG datasets we used the four biggest communities because, when examining the text of the tweets and the users with the highest degree, each one had a clear political orientation corresponding to one of the four biggest political parties in the election. These communities are labeled as “Cambiemos”, “Unidad Ciudadana”, “Partido Justicialista” and “1 Pais” for 2017ARG and “”, “”, “Consenso Federal” and “Frente de Izquierda-Unidad” for 2019ARG (electoral context is provided in the Appendix). For the 2020US dataset, we used the 2 biggest communities because of the bipartisan political system of the United States (Republicans and Democrats) and the clear structure present in the retweet networks, where only two big clusters concentrate almost all of the users and interactions (see Figure 1). In contrast, the Argentinean election datasets have two principal communities and some minor communities as well.

Considering that our dataset has more than 9 million tweets and more than 1.5 million users, it was not feasible to determine true labels of political identification of the users for this task, nor was it viable to assign them manually. Therefore, we decided to use the community labels of the retweet network as a proxy for political membership, and to interpret changes in a user’s label as changes in affiliation over time. This decision is supported by previous literature, which shows that communities identify a user’s ideology and political membership [12, 14, 15, 23, 24, 43].

Moreover, taking into account the stochasticity of the Louvain method and following [45], we used for the machine learning task only the nodes that were always assigned to the same community, in order to minimize the possibility of incorrect labeling. Additionally, we did not use individuals with fewer than 5 retweets, since we might have insufficient data to classify them correctly. Finally, we also manually sampled and checked users from different communities to verify their political identification.
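To make the modularity definition used above concrete, the sketch below evaluates it for a fixed partition of a small undirected graph. It uses the standard form Q = Σ_c [ L_c / m − (d_c / 2m)² ], where m is the number of edges, L_c the number of edges inside community c and d_c the sum of the degrees of its nodes. This only scores a given partition; it is not the Louvain optimization itself, and the toy graph is hypothetical.

```python
def modularity(adjacency, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = sum(len(neighbors) for neighbors in adjacency.values()) / 2
    internal_edges = {}
    degree_sum = {}
    for node, neighbors in adjacency.items():
        c = community_of[node]
        degree_sum[c] = degree_sum.get(c, 0) + len(neighbors)
        for other in neighbors:
            if community_of[other] == c:
                # Each internal edge is visited from both endpoints.
                internal_edges[c] = internal_edges.get(c, 0) + 0.5
    return sum(
        internal_edges.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical toy graph: two triangles joined by the single edge c-d.
adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in adjacency}

q_two = modularity(adjacency, two_groups)
q_one = modularity(adjacency, one_group)
print(q_two, q_one)
```

Splitting the graph into its two triangles gives Q = 5/14, while placing every node in a single community gives Q = 0, which is why a modularity-maximizing method prefers the split.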

3.3 Graph Features

To characterize topologically the users of the primary election network, we computed the following metrics: the degree of each user in the network (i.e., the number of users that have retweeted a given one), PageRank [25], betweenness centrality [26], clustering coefficient [27] and cluster affiliation (the community detected by the Louvain method). We used all these metrics as features in the machine learning classification task.
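Some of these metrics can be illustrated with a short, dependency-free sketch. The code below computes degree, the local clustering coefficient and PageRank (by power iteration) on an undirected graph stored as neighbor sets. The damping factor of 0.85 and the fixed number of iterations are assumptions, betweenness centrality is omitted for brevity, and the toy graph is hypothetical.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """PageRank by power iteration; assumes every node has at least one neighbor."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # Each node spreads its damped rank evenly over its neighbors.
            share = damping * rank[v] / len(adjacency[v])
            for u in adjacency[v]:
                new_rank[u] += share
        rank = new_rank
    return rank


def clustering_coefficient(adjacency, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    neighbors = list(adjacency[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adjacency[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
adjacency = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

degree = {v: len(adjacency[v]) for v in adjacency}
rank = pagerank(adjacency)
clustering = {v: clustering_coefficient(adjacency, v) for v in adjacency}
print(degree, rank, clustering)
```

In this toy graph the well-connected node c receives the highest PageRank and the pendant node d the lowest, which is the kind of peripheral position the feature is meant to capture.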

3.4 Natural Language Processing Features

In order to determine the topics of discussion during the , we analyzed the text of the tweets using natural language processing and calculated a low-dimensional embedding for each user.

The tweets were described as vectors through the Term Frequency - Inverse Document Frequency (tf-idf) representation [33]. Each value in the vector corresponds to the frequency of a term in the tweet (the term frequency, tf) weighted by a factor which measures its degree of specificity (the inverse document frequency, idf). We used 3-grams and a modified stop-word dictionary that contained not only articles, prepositions, pronouns and some verbs but also the names of the candidates, parties and words like “election”. Then, we constructed a matrix M by stacking the tf-idf vectors, with dimensions equal to the number of tweets times the number of terms.

We performed topic decomposition using Non-Negative Matrix Factorization (NMF) [32] on the matrix M. NMF is an unsupervised topic model which factorizes the matrix M into two matrices H and W with the property that all three matrices have no negative elements. We selected the NMF algorithm because this non-negativity makes the resulting matrices easier to inspect and interpret. The matrix H contains the representation of the tweets in the topic space, in which the columns are the degree of membership of each tweet to a given topic. The matrix W, in turn, provides the combination of terms which describes each topic [42]. The results obtained by analyzing just the tweets corresponding to the first time period are detailed in the Appendix.

The decomposition dimension was swept between 5 and 30, and for each dataset we chose a number of topics that gave a clear interpretation of each one. The same methodology was used and described in [11, 42].

With all this information, Twitter users were also characterized by a vector of features where each cell corresponds to one of the topics and its value is the percentage of the user’s tweets on that topic.
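The last step, turning per-tweet topic memberships into a per-user feature vector, can be sketched as follows. The rule used to assign a tweet to a topic is an assumption here (each tweet is assigned to the topic with the largest weight in its row of H); the text does not specify it. The small matrix and the function name `user_topic_shares` are hypothetical.

```python
def user_topic_shares(tweet_topic_weights, tweet_users, n_topics):
    """Percentage of each user's tweets assigned to each topic.

    Assumption: a tweet belongs to the topic with the largest weight in its
    row of the tweet-topic matrix H (ties resolve to the lowest topic index).
    """
    counts = {}
    for weights, user in zip(tweet_topic_weights, tweet_users):
        dominant = max(range(n_topics), key=lambda t: weights[t])
        row = counts.setdefault(user, [0] * n_topics)
        row[dominant] += 1
    return {
        user: [100.0 * c / sum(row) for c in row]
        for user, row in counts.items()
    }


# Hypothetical rows of H (one per tweet) and the author of each tweet.
H = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.8, 0.0, 0.2],
    [0.0, 0.1, 0.9],
]
authors = ["u1", "u1", "u1", "u2"]

shares = user_topic_shares(H, authors, n_topics=3)
print(shares)
```

Each user ends up with a vector that sums to 100, so users with very different activity levels remain comparable.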

3.5 Machine Learning classifying model

Given that our objective was to identify shifting individuals and persuasive arguments, we implemented a predictive model whose instances are the Twitter users who were active during both time periods [34] and belonged to one of the biggest communities in the networks of both time periods. Consequently, the number of users used at this stage was reduced. Individuals were characterized by a feature vector with components corresponding to the topological metrics described in Section 3.3 and others corresponding to the percentage of tweets in each one of the topics extracted in Section 3.4. The information used to construct these embeddings was gathered from the whole retweet network of the first time period. The target was a binary vector that takes the value 1 if the user changed communities between the first and the second time periods and 0 otherwise.

The summary of the datasets is shown in Table 2. Considering the percentage of positive targets, this is clearly a class imbalance scenario, especially in 2020US, which is reasonable given the bipartisan retweet network with big and opposed communities [51].

The gradient boosting technique uses an ensemble of predictive models to perform supervised classification and regression tasks [35]. These predictive models are optimized iteration by iteration using the gradient of the cost function of the previous iteration. XGBoost, a particular implementation of this technique, has proven to be efficient in a wide variety of supervised scenarios, outperforming previous models [36].

We used a 67/33 random split between train and test. For hyperparameter tuning, we used the randomized search method [37] over the training dataset with 3-fold cross-validation, which consists of trying different random combinations of parameters and keeping the best one.
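The construction of the binary target can be illustrated with a minimal sketch. It assumes that the community of each user in each period is available as a dictionary and keeps only users present in both periods; the user ids, the assignments and the function name `build_targets` are hypothetical.

```python
def build_targets(first_period, second_period):
    """Map each user present in both periods to 1 if the community changed, else 0."""
    return {
        user: int(first_period[user] != second_period[user])
        for user in first_period
        if user in second_period
    }


# Hypothetical community assignments for the two time periods.
first = {"u1": "Cambiemos", "u2": "1 Pais", "u3": "Unidad Ciudadana", "u4": "Cambiemos"}
second = {"u1": "Cambiemos", "u2": "Cambiemos", "u3": "Unidad Ciudadana", "u5": "1 Pais"}

targets = build_targets(first, second)
percent_shifting = 100.0 * sum(targets.values()) / len(targets)
print(targets, percent_shifting)
```

Users seen in only one period (u4 and u5 here) are dropped, which is why the number of instances is smaller than the number of users in either network.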

4 Results

4.1 Predicting shifting individuals

To measure the performance of our machine learning model, two other models, namely random and polar, were taken as baselines for comparison. In the former, the selected user changes community with a probability of 50%. In the latter, for a user that belongs to one of the two biggest communities in the network, we predict that he/she will stay in that community, while a user that belongs to a smaller community will change to one of the two main communities with the same probability. This polar model is inspired by the idea that, in a polarized election, members of the smallest communities shift and are attracted to the biggest communities; it was used in the Argentinean datasets.

We trained three different gradient boosting models for each dataset: the first one was trained only with the features obtained via text mining (the share of the user’s tweets in each of the selected topics); the second one was trained only with features obtained through complex network analysis (degree, PageRank, betweenness centrality, clustering coefficient and cluster affiliation); and the last one was trained with all the data. In this way, we could compare the importance of natural language processing and complex network analysis for this task.

Figure 2 shows the ROC curves [38] of the different models for each dataset. In all cases, the best performance is obtained by the machine learning model built with all the characteristics of the users, which is able to efficiently predict which users are shifting individuals. This result is expected, since an ensemble of models has sufficient depth and robustness to exploit the network information, the topics of the tweets and the graph characteristics of the users.
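The two baselines and the area under the ROC curve can be sketched without external libraries. In this illustration the polar baseline is expressed as a score of 1 (predicted to shift) for users outside the two main communities and 0 otherwise, and the random baseline as a constant score of 0.5; both encodings are assumptions made for the sake of the example, as are the toy labels.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def polar_scores(communities, main_communities):
    """Polar baseline: users outside the main communities are predicted to shift."""
    return [0.0 if c in main_communities else 1.0 for c in communities]


# Hypothetical first-period communities and true shift labels.
communities = ["A", "A", "B", "C", "D", "B"]
labels = [0, 0, 0, 1, 0, 1]

auc_polar = roc_auc(labels, polar_scores(communities, {"A", "B"}))
auc_random = roc_auc(labels, [0.5] * len(labels))
print(auc_polar, auc_random)
```

A constant score carries no ranking information and therefore yields an AUC of exactly 0.5, the reference level against which the trained models are compared.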

4.2 Feature Importance

We performed a random permutation of the feature values among users in order to understand which features are the most important for the performance of our model (the so-called Permutation Feature Importance algorithm [39]). In Figure 3, we observe that the most important feature in all cases is a measure of the node’s connectivity, PageRank, which suggests that shifting individuals are the peripheral and least important nodes of big communities. This can be verified by comparing the average PageRank of users who changed their affiliation (2017ARG PR = 8.97e−6, 2019ARG PR = 2.57e−6 and 2020US PR = 2.03e−6) with that of users who did not (2017ARG PR = 1.99e−5, 2019ARG PR = 8.12e−6 and 2020US PR = 5.41e−6), the latter being more than twice as large in every dataset. This is also consistent with the fact that the model trained with network features gets a better AUC than the model trained with the texts of user tweets in all datasets.

Previous works have used text, Twitter profile and some tweeting behavior characteristics to automatically classify users with machine learning, but none of them incorporated these graph metrics [16–19, 40]. Our work shows the importance of also including these graph features in order to identify shifting individuals. This result has a relevant sociological meaning: unpopular individuals are more prone to change their opinion.
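The idea behind permutation feature importance can be shown with a small, self-contained sketch: shuffle one feature column at a time and record how much the score of a fixed model drops. The toy model, the data and the use of accuracy as the score are assumptions made for illustration only; they stand in for the trained classifier and its evaluation metric.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled among users."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier: predicts a shift when PageRank (feature 0) is low."""
    return 1 if row[0] < 5e-6 else 0


# Hypothetical users described by [pagerank, share of tweets on one topic].
rows = [
    [2e-6, 10.0], [3e-6, 80.0], [1e-6, 40.0], [4e-6, 60.0],
    [9e-6, 20.0], [2e-5, 70.0], [8e-6, 50.0], [1e-5, 30.0],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Shuffling the topic column leaves the toy model’s predictions unchanged, so its importance is zero, while shuffling the PageRank column breaks the link between feature and label and lowers the score.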

4.3 Persuasive Topics

Besides the importance of the topological properties mentioned above, some of the discussed topics are also relevant to the classifier. A simple analysis of the most discussed topics in the network does not differentiate between topics discussed by shifting individuals and by other users. Considering that most users do not change their affiliation, it is interesting to analyze those who do. The Persuasive Arguments Theory affirms that changes in opinion occur when people exchange strong (or persuasive) arguments [48, 54, 55]. Consequently, we defined a “persuasive topic” as a topic used primarily by shifting individuals and not by non-shifting individuals.

For a deeper analysis of the topic embedding of the 2017ARG dataset, we first enumerate the main topics in that corpus:

1. Santiago Maldonado: An activist who disappeared on August 1, 2017, after a gendarmerie repression of a riot.

2. Missing ballots: People reporting missing ballots on Twitter.

3. Election official and poll workers: People tweeting about the absence of election officials when voting in some schools.

4. Economy: This topic particularly focused on poverty, unemployment and structural adjustment programs.

5. “Once” Tragedy: The 2012 rail disaster in which 51 persons died and 703 were injured during the presidency of Cristina Kirchner in Buenos Aires’ train station “Once”.

6. Exit polls: The results of opinion polls of people leaving the polling station.

7. Santa Cruz: An important Argentinean province whose Governor, , belongs to the Opposition.

8. Buenos Aires: The biggest and most populated province in Argentina.

9. Venezuela: Tweets about the government of Venezuela and its relationship with Cristina Kirchner.

An equivalent analysis can be done with the other two corpora; the topic decomposition for each can be found in the Appendix.

In Figure 3, the most important topics for the classifier are “Venezuela”, “Economy” and “Santiago Maldonado”. We can contextualize these results by looking at which are the main topics discussed in each community, as well as the ones discussed among the users that change between them, as shown in Figure 4. We can see that “Venezuela” is one of the most discussed topics among the users remaining in the four communities, and “Santiago Maldonado” is a relevant topic in the communities “Unidad Ciudadana” and “1 Pais”. When we look at the main topics discussed by users that change communities between elections, we observe that “Venezuela” identifies those that go from “Partido Justicialista (PJ)” to “1 Pais” and “Cambiemos”, while “Santiago Maldonado” is a key topic among those who arrive at “Unidad Ciudadana” from “Partido Justicialista (PJ)” and “1 Pais”. Considering that these topics are used considerably more by the shifting Twitter users than by the other users, it can be affirmed that these are “persuasive topics”. In contrast, other topics such as “Economy” or “Santa Cruz” were commonly used by most of the users but not by the shifting individuals.
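One simple way to operationalize the comparison between shifting and non-shifting users is to contrast the mean topic share of the two groups. The score below (difference of means, in percentage points) is a hypothetical illustration and not necessarily the measure used to produce the figures; the topic names and user vectors are toy values.

```python
def mean_topic_share(topic_shares, users):
    n_topics = len(next(iter(topic_shares.values())))
    return [
        sum(topic_shares[u][t] for u in users) / len(users)
        for t in range(n_topics)
    ]


def topic_usage_gap(topic_shares, shifted):
    """Mean topic share of shifting users minus that of non-shifting users."""
    shifting = [u for u in topic_shares if shifted[u] == 1]
    stable = [u for u in topic_shares if shifted[u] == 0]
    mean_shifting = mean_topic_share(topic_shares, shifting)
    mean_stable = mean_topic_share(topic_shares, stable)
    return [s - t for s, t in zip(mean_shifting, mean_stable)]


# Hypothetical per-user topic percentages for two topics and shift labels.
topics = ["Venezuela", "Economy"]
topic_shares = {
    "u1": [80.0, 20.0],
    "u2": [60.0, 40.0],
    "u3": [10.0, 90.0],
    "u4": [30.0, 70.0],
}
shifted = {"u1": 1, "u2": 1, "u3": 0, "u4": 0}

gap = topic_usage_gap(topic_shares, shifted)
for name, value in zip(topics, gap):
    print(name, value)
```

A positive gap marks a topic used more by shifting users, in the spirit of the definition of a persuasive topic given above; a negative gap marks a topic used mainly by users who keep their affiliation.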

5 Conclusion

In this paper we presented a machine learning framework to identify shifting individuals and persuasive topics that, unlike previous works, focuses on the persuadable users rather than on the political polarization on social media as a whole. The framework includes natural language processing techniques and graph machine learning algorithms in order to describe the features of an individual.

Three datasets were used for the experiments: 2017ARG, 2019ARG and 2020US. These datasets were constructed with tweets from 2 countries, in different political contexts (a parliamentary election, a presidential election and a non-election period) and in both a multi-party system and a two-party system. The machine learning framework was applied to these different datasets with similar results, showing that the methodology can be easily generalized.

The implemented predictive models effectively detected whether a user will change his/her political affiliation. We showed that better performance is achieved when representing the individuals with their community and other graph features rather than with the topic embedding. Our results therefore indicate that the proposed features do a reasonable job of identifying user characteristics that determine whether a user changes opinion, features that were neglected in previous works on user classification on Twitter [16–19, 40]. In particular, PageRank was the most relevant feature according to the permutation feature importance analysis in all datasets, showing that popular people have a lower tendency to change their opinion.

Finally, the proposed framework also identifies which topics are persuasive and are good predictors of individuals changing their political affiliation. Consequently, this methodology could be useful for a political party to see which issues should be prioritized in its agenda with the intention of maximizing the number of individuals that migrate to its community.

Understanding the characteristics and the topics of interest of politically shifting individuals in a polarized environment can provide an enormous benefit for social scientists and political parties. This research supplies them with tools to improve their understanding of shifting individuals and their behavior.

6 Figures

Figure 1: Retweet networks: (a) 2017 Argentina primary election; (b) 2017 Argentina general election; (c) 2019 Argentina primary election; (d) 2019 Argentina general election; (e) first time period of the 2020US dataset and (f) second time period of the 2020US dataset. Each Twitter user is represented by a node (colored depending on its community) and each edge (undirected and unweighted, in black) represents one or more retweets between two given users.

Figure 2: Results of the proposed models: The ROC curves of the predictive models with XGBoost, the simple model and the random model for the task of predicting users who will change their community, for each dataset: (a) 2017ARG, (b) 2019ARG and (c) 2020US.

[Figure 3 panels, horizontal axis: F score. (a) 2017ARG: pagerank, community, topic Maldonado, topic Once Tragedy, topic Economy. (b) 2019ARG: pagerank, centrality, community, topic Election Acts, topic Buenos Aires City. (c) 2020US: pagerank, topic Donald Trump, topic Barack Obama, topic Obamagate, topic Fake News.]

Figure 3: Feature importance analysis: The permutation feature importance of the XGBoost model trained with features obtained with text mining and complex network analysis for each dataset: (a) 2017ARG, (b) 2019ARG and (c) 2020US.

Figure 4: Community topics: The sizes of the word clouds represent the importance of a topic in each community. The arrows indicate the flow of users from the communities of the primary elections to those of the general elections. The percentages on the arrows are the percentages of users that changed from one community to the other (when the percentage was less than 1%, the corresponding arrow is not drawn). The topics on the arrows show the most important topics among the users that change between those communities.

7 Tables

Table 1: The total number of tweets and users of each dataset for both time periods.

Dataset  Time period  Start date        End date          # Tweets  # Users
2017ARG  1st          August 7, 2017    August 13, 2017   2117708   86361
2017ARG  2nd          October 15, 2017  October 22, 2017  1751813   298866
2019ARG  1st          August 5, 2019    August 12, 2019   1638585   256860
2019ARG  2nd          October 20, 2019  October 27, 2019  1587563   271910
2020US   1st          May 9, 2020       May 16, 2020      1148032   335077
2020US   2nd          June 10, 2020     June 16, 2020     1137785   348793

Table 2: Summary of the datasets used in the experiments.

                        2017ARG  2019ARG  2020US
# Individuals           12137    29942    110346
# Communities           4        4        2
# Text Features         9        7        6
# Graph Features        5        5        5
% Shifting Individuals  11.1     13.4     0.82
Training set size       8132     20062    73601
Test set size           4005     9880     36745

References

[1] Island, Lon. From Friendster To MySpace To Facebook: The Evolution and Deaths Of Social Networks. longislandpress.com. Retrieved August, 2013, vol. 7

[2] López, Garcí. Emotions in Health Tweets: Analysis of American Government Official Accounts. International Journal of Humanities and Social Sciences, 2019, vol. 13, no. 3, pp. 289–292

[3] Small, Tamara A. What the hashtag? A content analysis of Canadian politics on Twitter. Information, communication & society, 2011, vol. 14, no. 6, pp. 872–895, Taylor & Francis

[4] Stieglitz, Stefan and Dang-Xuan, Linh. Political communication and influence through microblogging–An empirical analysis of sentiment in Twitter messages and retweet behavior. 2012 45th Hawaii International Conference on System Sciences, 2012, pp. 3500–3509, IEEE

[5] A. Badawy and E. Ferrara and K. Lerman. Analyzing the Digital Traces of Political Manipulation: The 2016 Russian Interference Twitter Campaign. 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2018, pp. 258–265

[6] Kušen, Ema and Strembeck, Mark. Politics, sentiments, and misinformation: An analysis of the Twitter discussion on the 2016 Austrian Presidential Elections. Online Social Networks and Media, 2018, vol. 5, pp. 37–50, Elsevier

[7] Beguerisse-Díaz, Mariano and Garduno-Hernández, Guillermo and Vangelov, Borislav and Yaliraki, Sophia N and Barahona, Mauricio. Interest communities and flow roles in directed networks: the Twitter network of the UK riots. Journal of The Royal Society Interface, 2014, vol. 11, no. 101, pp. 20140940, The Royal Society

[8] Ott, Brian L. The age of Twitter: Donald J. Trump and the politics of debasement. Critical studies in media communication, 2017, vol. 34, no. 1, pp. 59–68, Taylor & Francis

[9] Reynolds, Kelly and Kontostathis, April and Edwards, Lynne. Using machine learning to detect cyberbullying. 2011 10th International Conference on Machine learning and applications and workshops, 2011, vol. 2, pp. 241–244, IEEE

[10] Kurnaz, Sefer and Mahmood, Mustafa Ahme. Sentiment Analysis in Data of Twitter using Machine Learning Algorithms. International Journal of Computer Science and Mobile Computing, 2019, vol. 8, no. 3, pp. 31–35

[11] Pinto, Sebastián and Albanese, Federico and Dorso, Claudio O and Balenzuela, Pablo. Quantifying time-dependent Media Agenda and public opinion by topic modeling. Physica A: Statistical Mechanics and its Applications, 2019, vol. 524, pp. 614–624, Elsevier

[12] Aruguete, Natalia and Calvo, Ernesto. Time to #protest: Selective exposure, cascading activation, and framing in social media. Journal of communication, 2018, vol. 68, no. 3, pp. 480–502, Oxford University Press

[13] Dang-Xuan, Linh and Stieglitz, Stefan and Wladarsch, Jennifer and Neuberger, Christoph. An investigation of influentials and the role of sentiment in political communication on Twitter during election periods. Information, communication & society, 2013, vol. 16, no. 5, pp. 795–825, Taylor & Francis

[14] Stewart, Leo G and Arif, Ahmer and Starbird, Kate. Examining trolls and polarization with a retweet network. Proc. ACM WSDM, workshop on misinformation and misbehavior mining on the web, 2018

[15] Conover, Michael D and Ratkiewicz, Jacob and Francisco, Matthew and Gonçalves, Bruno and Menczer, Filippo and Flammini, Alessandro. Political polarization on twitter. Fifth international AAAI conference on weblogs and social media, 2011

[16] Conover, Michael D and Gonçalves, Bruno and Ratkiewicz, Jacob and Flammini, Alessandro and Menczer, Filippo. Predicting the political alignment of twitter users. 2011 IEEE third international conference on privacy, security, risk and trust and 2011 IEEE third international conference on social computing, 2011, pp. 192–199, IEEE

[17] Pennacchiotti, Marco and Popescu, Ana-Maria. Democrats, republicans and starbucks afficionados: user classification in twitter. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp. 430–438

[18] Benevenuto, Fabricio and Magno, Gabriel and Rodrigues, Tiago and Almeida, Virgilio. Detecting spammers on twitter. Collaboration, electronic messaging, anti-abuse and spam conference (CEAS), 2010, vol. 6, pp. 12

[19] Borondo, Javier and Morales, Alfredo Jose and Losada, Juan Carlos and Benito, Rosa M. Characterizing and modeling an electoral campaign in the context of Twitter: 2011 Spanish Presidential election as a case study. Chaos: An Interdisciplinary Journal of Nonlinear Science, 2012, vol. 22, no. 2, pp. 023138, American Institute of Physics

[20] Makice, Kevin. Twitter API: Up and running: Learn how to build applications with the Twitter API. O’Reilly Media, Inc., 2009

[21] Morstatter, Fred and Pfeffer, Jürgen and Liu, Huan and Carley, Kathleen M. Is the sample good enough? comparing data from twitter’s streaming api with twitter’s firehose. Seventh international AAAI conference on weblogs and social media, 2013

[22] Jacomy, Mathieu and Venturini, Tommaso and Heymann, Sebastien and Bastian, Mathieu. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PloS one, 2014, vol. 9, no. 6, pp. e98679, Public Library of Science

[23] Brady, William J and Wills, Julian A and Jost, John T and Tucker, Joshua A and Van Bavel, Jay J. Emotion shapes the diffusion of moralized content in social networks. Proceedings of the National Academy of Sciences, 2017, vol. 114, no. 28, pp. 7313–7318, National Acad Sciences

[24] Cherepnalkoski, Darko and Mozetic, Igor. A retweet network analysis of the European Parliament. 2015 11th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), 2015, pp. 350–357, IEEE

[25] Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab, 1999

[26] Brandes, Ulrik. A faster algorithm for betweenness centrality. Journal of mathematical sociology, 2001, vol. 25, no. 2, pp. 163–177, Taylor & Francis

[27] Saramäki, Jari and Kivelä, Mikko and Onnela, Jukka-Pekka and Kaski, Kimmo and Kertesz, Janos. Generalizations of the clustering coefficient to weighted complex networks. Physical Review E, 2007, vol. 75, no. 2, pp. 027105, APS

[28] Yang, Zhao and Algesheimer, René and Tessone, Claudio J. A comparative analysis of community detection algorithms on artificial networks. Scientific reports, 2016, vol. 6, pp. 30750, Nature Publishing Group

[29] Blondel, Vincent D and Guillaume, Jean-Loup and Lambiotte, Renaud and Lefebvre, Etienne. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008, vol. 2008, no. 10, pp. P10008, IOP Publishing

[30] Lancichinetti, Andrea and Fortunato, Santo. Community detection algorithms: a comparative analysis. Physical review E, 2009, vol. 80, no. 5, pp. 056117, APS

[31] Emmons, Scott and Kobourov, Stephen and Gallant, Mike and Börner, Katy. Analysis of network clustering algorithms and cluster quality metrics at scale. PloS one, 2016, vol. 11, no. 7, pp. e0159161, Public Library of Science, San Francisco, CA, USA

[32] Xu, Wei and Liu, Xin and Gong, Yihong. Document clustering based on non-negative matrix factorization. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, 2003, pp. 267–273

[33] Ramos, Juan and others. Using tf-idf to determine word relevance in document queries. Proceedings of the first instructional conference on machine learning, 2003, vol. 242, pp. 133–142, New Jersey, USA

[34] Cazabet, Remy and Rossetti, Giulio. Challenges in community discovery on temporal networks. Temporal Network Theory, Springer, 2019, pp. 181–197

[35] Chen, Tianqi and Guestrin, Carlos. Xgboost: A scalable tree boosting system. Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794

[36] Nielsen, Didrik. Tree boosting with xgboost-why does xgboost win “every” machine learning competition? NTNU, 2016

[37] Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, vol. 13, no. 1, pp. 281–305, JMLR.org

[38] Rice, Marnie E and Harris, Grant T. Comparing effect sizes in follow-up studies: ROC Area, Cohen’s d, and r. Law and human behavior, 2005, vol. 29, no. 5, pp. 615–620, Springer

[39] Altmann, André and Toloşi, Laura and Sander, Oliver and Lengauer, Thomas. Permutation importance: a corrected feature importance measure. Bioinformatics, 2010, vol. 26, no. 10, pp. 1340–1347, Oxford University Press

[40] Tinati, Ramine and Carr, Leslie and Hall, Wendy and Bentwood, Jonny. Identifying communicator roles in twitter. Proceedings of the 21st International Conference on World Wide Web, 2012, pp. 1161–1168

[41] Xu, Qing and Li, Jiawei and Cai, Mingxiang and Mackey, Tim Ken. Use of Machine Learning to Detect Illegal Wildlife Product Promotion and Sales on Twitter. Frontiers in Big Data, 2019, vol. 2, pp. 28, Frontiers

[42] Federico Albanese and Sebastián Pinto and Viktoriya Semeshenko and Pablo Balenzuela. Analyzing mass media influence using natural language processing and time series analysis. Journal of Physics: Complexity, 2020, vol. 1, no. 2, pp. 025005, July, doi:10.1088/2632-072x/ab8784

[43] Garimella, Kiran and De Francisci Morales, Gianmarco and Gionis, Aristides and Mathioudakis, Michael. Quantifying Controversy in Social Media. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 2016, WSDM ’16, pp. 33–42, New York, NY, USA, Association for Computing Machinery, ISBN 9781450337168, https://doi.org/10.1145/2835776.2835792

[44] Kannangara, Sandeepa. Mining Twitter for Fine-Grained Political Opinion Polarity Classification, Ideology Detection and Sarcasm Detection. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, WSDM ’18, pp. 751–752, New York, NY, USA, Association for Computing Machinery, ISBN 9781450355810, https://doi.org/10.1145/3159652.3170461

[45] Lancichinetti, Andrea and Fortunato, Santo. Consensus clustering in complex networks. Scientific reports, 2012, vol. 2, pp. 336, doi:10.1038/srep00336, Nature Publishing Group

[46] U. Brandes and D. Delling and M. Gaertler and R. Gorke and M. Hoefer and Z. Nikoloski and D. Wagner. On Modularity Clustering. IEEE Transactions on Knowledge and Data Engineering, 2008, vol. 20, no. 2, pp. 172–188, doi:10.1109/TKDE.2007.190689

[47] S. Primario and D. Borrelli and L. Iandoli and G. Zollo and C. Lipizzi. Measuring Polarization in Twitter Enabled in Online Political Conversation: The Case of 2016 US Presidential Election. 2017 IEEE International Conference on Information Reuse and Integration (IRI), 2017, pp. 607–613

[48] Kaplan, Martin F and Miller, Charles E. Judgments and group discussion: Effect of presentation and memory factors on polarization. Sociometry, 1977, pp. 337–343, JSTOR

[49] Mercier, Hugo and Sperber, Dan. Why do humans reason? Arguments for an argumentative theory. 2011

[50] Hinsz, Verlin B and Davis, James H. Persuasive arguments theory, group polarization, and choice shifts. Personality and Social Psychology Bulletin, 1984, vol. 10, no. 2, pp. 260–268, Sage Publications, Thousand Oaks, CA

[51] Borge-Holthoefer, Javier and Magdy, Walid and Darwish, Kareem and Weber, Ingmar. Content and network dynamics behind Egyptian political polarization on Twitter. Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 2015, pp. 700–711

[52] Morales, Alfredo Jose and Borondo, Javier and Losada, Juan Carlos and Benito, Rosa M. Measuring political polarization: Twitter shows the two sides of Venezuela. Chaos: An Interdisciplinary Journal of Nonlinear Science, 2015, vol. 25, no. 3, pp. 033114, AIP Publishing LLC

[53] Gruzd, Anatoliy and Roy, Jeffrey. Investigating political polarization on Twitter: A Canadian perspective. Policy & internet, 2014, vol. 6, no. 1, pp. 28–45, Wiley Online Library

[54] Burnstein, Eugene and Vinokur, Amiram. Testing two classes of theories about group induced shifts in individual choice. Journal of Experimental Social Psychology, 1973, vol. 9, no. 2, pp. 123–137, Elsevier

[55] Burnstein, Eugene and Vinokur, Amiram. What a person thinks upon learning he has chosen differently from others: Nice evidence for the persuasive-arguments explanation of choice shifts. Journal of Experimental Social Psychology, 1975, vol. 11, no. 5, pp. 412–426, Elsevier

A 2017 Argentina parliamentary elections

A.1 Electoral Context

The electoral context was the following: former president and opposition leader Cristina Fernández de Kirchner of “Unidad Ciudadana” was one of the main candidates for the Senate. Other parties also fielded Senate candidates: Esteban Bullrich of “Cambiemos” (the party in power), Sergio Massa of “1Pais” (former Chief of the Cabinet of Ministers of Cristina Kirchner, then leader of the opposition against Cristina Kirchner in 2013, when he won his provincial election) and Florencio Randazzo of “Partido Justicialista” (former interior minister of Cristina Kirchner), among others.

A.2 Twitter Keywords

Considering the previous subsection, the following terms were chosen as keywords for Twitter:

• Candidates for Senate of the main four parties: their names and official Twitter usernames (i.e., “SergioMassa”, “Massa”, “RandazzoF”, “Randazzo”, “estebanbullrich”, “Bullrich”, “CFKArgentina”, “CFK” and “Kirchner”).

• The names of the official accounts of the first candidates for deputies of the parties mentioned above (i.e., “felipe sola”, “BuccaBali”, “gracielaocana” and “fvallejoss”).

• The names of the official accounts of the political parties on Twitter (i.e., “1PaisUnido”, “1Pais”, “FJCumplir”, “Frente Justicialista”, “cambiemos”, “UniCiudadanaAR” and “Unidad Ciudadana”).

• The president of Argentina and the governor of the province of Buenos Aires at the time of the elections (i.e., “mauriciomacri”, “Macri” and “mariuvidal”). These last two were added, despite not appearing on the candidate lists, because of their political importance and their participation in the campaign.

In addition, the tweets were restricted to Spanish.
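The selection criterion described above (a tweet is kept when it mentions at least one keyword and is written in Spanish) can be illustrated with a minimal sketch. It is not the collection code used for this work: the tweet is assumed to be a dictionary with `text` and `lang` fields, the whole-token, case-insensitive matching rule is an assumption, and the names `KEYWORDS`, `build_matcher` and `keep_tweet` are hypothetical.

```python
import re

# Illustrative subset of the keywords listed in this subsection.
KEYWORDS = ["SergioMassa", "Massa", "RandazzoF", "Randazzo",
            "CFKArgentina", "CFK", "Kirchner"]


def build_matcher(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole token.

    Assumption: a keyword may be preceded by '@' or '#', and must not be
    embedded inside a longer word.
    """
    # Longer keywords first, so "CFKArgentina" is tried before "CFK".
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"(?<!\w)[@#]?(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def keep_tweet(tweet, matcher, lang="es"):
    """Return True if the tweet is in the target language and mentions a keyword."""
    in_language = tweet.get("lang") == lang
    mentions_keyword = matcher.search(tweet.get("text", "")) is not None
    return in_language and mentions_keyword


matcher = build_matcher(KEYWORDS)

# Toy tweets, invented for illustration only.
sample_tweets = [
    {"text": "Hoy habló @CFKArgentina en La Plata", "lang": "es"},
    {"text": "Massachusetts es un estado", "lang": "es"},
    {"text": "Kirchner speaks today", "lang": "en"},
]
kept = [t for t in sample_tweets if keep_tweet(t, matcher)]
```

Only the first toy tweet passes both conditions: the second contains “Massa” only as part of a longer word, and the third is not tagged as Spanish.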

B 2019 Argentina presidential election

B.1 Electoral Context

The electoral context was the following: former president and opposition leader Cristina Fernández de Kirchner (formerly “Unidad Ciudadana”) and Sergio Massa (formerly “1Pais”) formed “Frente de Todos”, with Alberto Fernández as candidate for president. On the other hand, Mauricio Macri (formerly “Cambiemos”) ran for reelection as the candidate of “Juntos por el Cambio”. The socialist Nicolás del Caño of “Frente de Izquierda-Unidad” and Roberto Lavagna of “Consenso Federal” were also candidates for president, among others.

B.2 Twitter Keywords

Considering the previous subsection and the candidates for the Senate, for deputy and for governor, the following terms were chosen as keywords for Twitter: “Elisacarrio”, “OfeFernandez ”, “PatoBullrich”, “macri”, “macrismo”, “mauriciomacri”, “pichetto”, “MiguelPichetto”, “JuntosPorElCambio”, “alferdez”, “CFKArgentina”, “CFK”, “kirchner”, “kirchnerismo”, “FrenteTodos”, “FrenteDeTodos”, “Lavagna”, “RLavagna”, “Urtubey”, “UrtubeyJM”, “ConsensoFederal”, “2030ConsensoFederal”, “DelCao”, “NicolasdelCano”, “DelPla”, “RominaDelPla”, “FitUnidad”, “FdeIzquierda”, “Fte Izquierda”, “Castaeira”, “ManuelaC22”, “Mulhall”, “NuevoMas”, “Espert”, “jlespert”, “FrenteDespertar”, “Centurion”, “juanjomalvinas”, “Hotton”, “CynthiaHotton”, “Biondini”, “Venturino”, “FrentePatriota”, “RomeroFeris”, “PartidoAutonomistaNacional”, “Vidal”, “mariuvidal”, “Kicillof”, “Kicillofok”, “Bucca”, “BuccaBali”, “chipicastillo”, “Larreta”, “horaciorlarreta”, “Lammens”, “MatiasLammens”, “Tombolini”, “matiastombolini”, “Solano”, “Solanopo”, “Lousteau”, “GugaLusto”, “Recalde”, “marianorecalde”, “RAMIROMARRA”, “Maxiferraro”, “fernandosolanas”, “MarcoLavagna”, “myriambregman”, “cristianritondo”, “Massa”, “SergioMassa”, “GracielaCamano”, “nestorpitrola”.

In addition, the tweets were restricted to Spanish.

B.3 Topic decomposition

The topics obtained with non-negative matrix factorization were the following:

1. Election acts.
2. The Government of the Ciudad Autonoma de Buenos Aires: The capital city of Argentina.
3. The Argentinean province of Buenos Aires: The biggest and most populated province in Argentina.
4. The Argentinean cities of Cordoba and Rosario: Two of the most populated cities in Argentina.
5. Cesar Milani: Former Lieutenant general during the administration of Cristina Kirchner.
6. Maria Eugenia Vidal: Former Governor of the province of Buenos Aires and candidate of “Juntos por el Cambio”.
7. Economy: This topic focused particularly on poverty and unemployment.
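As an illustration of how a topic list like the one above can be obtained, the following minimal sketch factorizes a small document-term matrix V into non-negative factors W (documents by topics) and H (topics by terms), and reads the highest-weighted terms of each row of H as a topic. It rests on several assumptions: it uses the classical multiplicative update rules of Lee and Seung, which are not necessarily the solver used in this work; the matrix holds invented raw counts rather than tf-idf weights from real tweets; and the helper names (`matmul`, `nmf`, `top_terms`) are hypothetical.

```python
import random


def matmul(A, B):
    """Plain-Python product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of the reconstruction residual V - WH."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr)) ** 0.5


def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (docs x terms) by W (docs x topics) times H (topics x terms)."""
    rng = random.Random(seed)
    W = [[rng.random() + eps for _ in range(n_topics)] for _ in range(len(V))]
    H = [[rng.random() + eps for _ in range(len(V[0]))] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


def top_terms(topic_row, vocab, n):
    """Return the n vocabulary terms with the largest weight in one topic."""
    order = sorted(range(len(vocab)), key=lambda j: topic_row[j], reverse=True)
    return [vocab[j] for j in order[:n]]


# Toy vocabulary and document-term counts, invented for illustration only.
vocab = ["economy", "poverty", "unemployment", "election", "vote", "candidate"]
V = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 1, 2],
]

W0, H0 = nmf(V, n_topics=2, n_iter=0)   # random initialization only
W, H = nmf(V, n_topics=2, n_iter=200)   # same initialization, then updates
initial_error = frobenius_error(V, W0, H0)
final_error = frobenius_error(V, W, H)

for k, topic_row in enumerate(H):
    print(k, top_terms(topic_row, vocab, 3))
```

Each row of W then gives a low-dimensional topic representation of one document, which is the sense in which the factorization acts as a topic embedding. The descriptive labels attached to topics, such as those in the list above, are assigned by inspecting the top terms.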

C 2020 tweets of Donald Trump

C.1 Twitter Keywords

The following term was used as the keyword for the Twitter API: “realDonaldTrump”. In addition, the tweets were restricted to English.

C.2 Topic decomposition

The topics obtained with non-negative matrix factorization were the following:

1. President Donald Trump: The 45th President of the United States.
2. Obamagate: The accusation that Barack Obama is conspiring against Donald Trump.
3. World Health Organization: President Trump announcing that the US will pull out of the World Health Organization.
4. Thank you: Individuals thanking President Trump for his actions in regard to the COVID-19 pandemic.
5. Fake news: Individuals discussing and claiming that certain news is fake.
6. President Barack Obama: The 44th President of the United States and his administration.
