Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter
Total Page:16
File Type:pdf, Size:1020Kb
Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter Federico Albanese ∗1,2, Leandro Lombardi1, Esteban Feuerstein2,3, and Pablo Balenzuela4,5 1Instituto de C´alculo,CONICET- Universidad de Buenos Aires, Argentina 2Instituto de Ciencias de la Computaci´on,CONICET- Universidad de Buenos Aires, Argentina 3Departamento de Computaci´on,Universidad de Buenos Aires, Argentina 4Instituto de F´ısicade Buenos Aires (IFIBA), CONICET, Argentina 5Departamento de F´ısica,Universidad de Buenos Aires, Argentina August 2020 Abstract The formation of majorities in public discussions often depends on individuals who shift their opinion over time. The detection and char- acterization of these type of individuals is therefore extremely important for political analysis of social networks. In this paper, we study changes in individual's affiliations on Twitter using natural language processing techniques and graph machine learning algorithms. In particular, we col- lected 9 million Twitter messages from 1:5 million users and constructed the retweet networks. We identified communities with explicit political orientation and topics of discussion associated to them which provide the topological representation of the political map on Twitter in the analyzed periods. With that data, we present a machine learning framework for social media users classification which efficiently detects \shifting users" (i.e. users that may change their affiliation over time). Moreover, this machine learning framework allows us to identify not only which topics arXiv:2008.10749v1 [cs.SI] 24 Aug 2020 are more persuasive (using low dimensional topic embedding), but also which individuals are more likely to change their affiliation given their topological properties in a Twitter graph. 1 Introduction Technologically mediated Social networks flourished as a social phenomenon at the beginning of this century with exponents such as Friendster (2002) or ∗[email protected] 1 Myspace (2003) [1] but other popular websites soon took their place. Twitter is an online platform where news or data can reach millions of users in a matter of minutes [2]. Twitter is also of great academic interest, since individuals voluntarily express openly their opinions and they can interact with other users by retweeting the others' tweets. In particular, in the last decade there has been an increase in interest from computational social scientists and numerous political studies have been published using information from this platform [3{8]. Previous works applied different machine learning models to these datasets. Xu et al. collected tweets using the streaming API and implemented an un- supervised machine learning framework for detecting online wildlife trafficking using topic modeling [41]. Kurnaz et al. proposed a methodology which first extracts features of a tweet text and then applies deep sparse autoencoders in order to classify the sentiment of tweets [10]. Pinto et al. detected and an- alyzed the topics of discussion in the text of tweets and news articles, using Non Negative Matrix Factorization [32], in order to understand the role of mass media in the formation of public opinion [11]. On the other hand, Kannangara implemented a probabilistic method so as to identify the topic, sentiment and political orientation of tweets [44]. Some other works are focused in political analysis and the interaction be- tween users, as for instance the one of Aruguete et al., which described how Twitter users frame political events by sharing content exclusively with like- minded users forming two well-defined communities [12]. Dang-Xuan et al. downloaded tweets during the 2011 parliament elections in Germany and char- acterize the role of influencers utilizing the retweet network [13]. Stewart et al. used community detection algorithms over a network of retweets to understand the behavior of trolls in the context of the #BlackLivesMatter movement [14]. Conver et al. [15] also used similar techniques over a retweets network and showed the segregated partisan structure with extremely limited connection be- tween clusters of users with different political ideologies during the 2010 U.S. congressional midterm elections. The same polarization on the Twitter net- work can be found in other contexts and countries (Canada [53], Egypt [51], Venezuela [52]). Opinion shifts in group discussions have been studied from different points of view. In particular, it was stated that opinion shifts can be produced by arguments interchange, according to the Persuasive Arguments Theory (PAT) [48, 54, 55]. Primario et al. applied this theory to measure the evolution of the political polarization on Twitter during the 2016 US Presidential election [47]. In the same line, Holthoefer et al analyzed the Egyptian polarization dynamics on Twitter [51]. They classified the tweets in two groups (pro/anti military intervention) based on their text and estimated the overall proportion of users that change their position. These works analyzed the macro dynamics of polarization, rather than focus on the individuals. In contrast, we found it interesting not only to character- ize the Twitter users who change their political opinion, but also predict these \shifting voters". Therefore, the focus of this paper is centered on the individ- uals rather than the aggregated dynamic, using machine learning algorithms. Moreover, once we were able to correctly determine these users, we seek to distinguish between persuasive and non persuasive topics1. 1We decided to use the adjective "shifting" for individuals, and "persuasive" (resp. non 2 In this paper, we examined three Twitter networks datasets constructed with tweets from: 2017 Argentina parliamentary elections, 2019 Argentina presiden- tial elections and 2020 tweets of Donald Trump. Three datasets were con- structed and used in order to show that the methodology can be easily gener- alized to different scenarios. For each dataset, we analyzed two different time periods and identify the larger communities corresponding to the main political forces. Using graph topological information and detecting topics of discussion of the first network, we built and trained a model that effectively predicts when an individual will change his/her community over time, identifying persuasive topics and relevant features of the shifting users. Our main contributions are the following: 1. We described a generalized machine learning framework for social media users classification, in particular, for detecting their affiliation at a given time and whether the user will change it in the future. This framework in- cludes natural language processing techniques and graph machine learning algorithms in order to describe the features of an individual. 2. We observed that the proposed machine learning model has a good per- formance for the task of predicting changes of the user's affiliation over time. 3. We experimentally analyzed the machine learning framework by perform- ing a feature importance analysis. While previous works used text, Twit- ter profiles and some twitting behavior characteristics to automatically classify users with machine learning [16{19], here we showed the value of adding graph features in order to identify the label of a user. In particular, the importance of the \PageRank" for this specific task. 4. We also identified the topics that are considerably more relevant and per- suasive to the shifting users. Identifying this key topics has a valuable impact for social science and politics. The paper is organized as follows. In the Data Collection section, we describe the data used in the study. In the Methods section, we describe the graph unsupervised learning algorithms and other graph metrics that were used, the natural language processing tools applied to the tweets and the machine learning model. In the Results section, we analyze the performance of the model for the task of detecting shifting individuals. Finally, we interpret these results in the Conclusions section. The code is in github (omitted for anonymity reasons). 2 Data Collection Twitter has several APIs available to developers. Among them is the Streaming API that allows the developer to download in real time a sample of tweets that are uploaded to the social network filtering it by language, terms, hashtags, etc. [20, 21]. The data is composed of the tweet id, the text, the date and time of the tweet, the user id and username, among other features. In case of a retweet, it has also the information of the original tweet's user account. persuasive) for the topics relevant (resp. non relevant) to those individuals 3 For this research, we collected 3 datasets: 2017 Argentina parliamentary elec- tions (2017ARG), 2019 Argentina presidential elections (2019ARG) and 2020 United States tweets of Donald Trump (2020US). For the Argentinan datasets, the Streaming API was used during the week before the primary elections and the week before the general elections took place. Keywords were chosen accord- ing the four main political parties present in the elections. Details and context can be found in the Appendix. For the 2020US dataset, \realDonaldTrump" (the official account of president Donald Trump) was used as keyword. Twitter messages are in the public domain and only public tweets filtered by the Twit- ter API were collected for this work. For the purpose of this research, we have analyzed more than 9 million tweets and more than 1.5 million individuals in total. The specific start and end collection date, the total number of tweets