Beyond Binary Labels: Political Ideology Prediction of Twitter Users
Total Page:16
File Type:pdf, Size:1020Kb
Beyond Binary Labels: Political Ideology Prediction of Twitter Users Daniel Preot¸iuc-Pietro Ye Liu∗ Positive Psychology Center School of Computing University of Pennsylvania National University of Singapore [email protected] [email protected] Daniel J. Hopkins Lyle Ungar Political Science Department Computing & Information Science University of Pennsylvania University of Pennsylvania [email protected] [email protected] Abstract User trait prediction from text is based on the as- sumption that language use reflects a user’s de- Automatic political preference prediction mographics, psychological states or preferences. from social media posts has to date proven Applications include prediction of age (Rao et al., successful only in distinguishing between 2010; Flekova et al., 2016b), gender (Burger et al., publicly declared liberals and conserva- 2011; Sap et al., 2014), personality (Schwartz tives in the US. This study examines et al., 2013; Preot¸iuc-Pietro et al., 2016), socio- users’ political ideology using a seven- economic status (Preot¸iuc-Pietro et al., 2015a,b; point scale which enables us to identify Liu et al., 2016c), popularity (Lampos et al., 2014) politically moderate and neutral users – or location (Cheng et al., 2010). groups which are of particular interest to Research on predicting political orientation has political scientists and pollsters. Using focused on methodological improvements (Pen- a novel data set with political ideology nacchiotti and Popescu, 2011) and used data sets labels self-reported through surveys, our with publicly stated dichotomous political orien- goal is two-fold: a) to characterize the po- tation labels due to their easy accessibility (Syl- litical groups of users through language wester and Purver, 2015). However, these data use on Twitter; b) to build a fine-grained sets are not representative samples of the entire model that predicts political ideology of population (Cohen and Ruths, 2013) and do not unseen users. Our results identify differ- accurately reflect the variety of political attitudes ences in both political leaning and engage- and engagement (Kam et al., 2007). ment and the extent to which each group For example, we expect users who state their tweets using political keywords. Finally, political affiliation in their profile description, we demonstrate how to improve ideology tweet with partisan hashtags or appear in public prediction accuracy by exploiting the rela- party lists to use social media as a means of popu- tionships between the user groups. larizing and supporting their political beliefs (Bar- berASa, 2015). Many users may choose not to 1 Introduction publicly post about their political preference for various social goals or perhaps this preference Social media is used by people to share their opin- may not be strong or representative enough to be ions and views. Unsurprisingly, an important part disclosed online. Dichotomous political prefer- of the population shares opinions and news related ence also ignores users who do not have a political to politics or causes they support, thus offering ideology. All of these types of users are very im- strong cues about their political preferences and portant for researchers aiming to understand group ideologies. In addition, political membership is preferences, traits or moral values (Lewis and Rei- also predictable purely from one’s interests or de- ley, 2014; Hersh, 2015). mographics — it is much more likely for a reli- The most common political ideology spectrum gious person to be conservative or for a younger in the US is the conservative – liberal (Ellis and person to lean liberal (Ellis and Stimson, 2012). Stimson, 2012). We collect a novel data set of ∗ Work carried out during a research visit at the Univer- Twitter users mapped to this seven-point spectrum sity of Pennsylvania which allows us to: 1. Uncover the differences in language use be- or to detect bias (Yano et al., 2010) or impartial- tween ideological groups; ity (Zafar et al., 2016) at a document level. 2. Develop a user-level political ideology predic- Our study belongs to the second category, where tion algorithm that classifies all levels of en- political orientation is inferred at a user-level. All gagement and leverages the structure in the po- previous studies study labeling US conservatives litical ideology spectrum. vs. liberals using either text (Rao et al., 2010), First, using a broad range of language features social network connections (Zamal et al., 2012), including unigrams, word clusters and emotions, platform-specific features (Conover et al., 2011) we study the linguistic differences between the or a combination of these (Pennacchiotti and two ideologically extreme groups, the two ideo- Popescu, 2011; Volkova et al., 2014), with very logically moderate groups and between both ex- high reported accuracies of up to 94.9% (Conover tremes and moderates in order to provide insight et al., 2011). into the content they post on Twitter. In addition, However, all previous work on predicting user- we examine the extent to which the ideological level political preferences are limited to a binary groups in our data set post about politics and com- prediction between liberal/democrat and conser- pare it to a data set obtained similarly to previous vative/republican, disregarding any nuances in po- work. litical ideology. In addition, as the focus of the In prediction experiments, we show how accu- studies is more on the methodological or inter- rately we can distinguish between opposing ideo- pretation aspects of the problem, another down- logical groups in various scenarios and that previ- side is that the user labels were obtained in sim- ous binary political orientation prediction has been ple, albeit biased ways. These include users who oversimplified. Then, we measure the extent to explicitly state their political orientation on user which we can predict the two dimensions of polit- lists of party supporters (Zamal et al., 2012; Pen- ical leaning and engagement. Finally, we build an nacchiotti and Popescu, 2011), supporting par- ideology classifier in a multi-task learning setup tisan causes (Rao et al., 2010), by following that leverages the relationships between groups.1 political figures (Volkova et al., 2014) or party accounts (Sylwester and Purver, 2015) or that 2 Related Work retweet partisan hashtags (Conover et al., 2011). Automatically inferring user traits from their on- As also identified in (Cohen and Ruths, 2013) and line footprints is a prolific topic of research, en- further confirmed later in this study, these data sets abled by the increasing availability of user gen- are biased: most people do not clearly state their erated data and advances in machine learning. political preference online – fewer than 5% ac- Beyond its research oriented goals, user profil- cording to Priante et al.(2016) – and those that ing has important industry applications in online state their preference are very likely to be political marketing, personalization or large-scale audience activists. Cohen and Ruths(2013) demonstrated profiling. To this end, researchers have used a that predictive accuracy of classifiers is signifi- wide range of types of online footprints, includ- cantly lower when confronted with users that do ing video (Subramanian et al., 2013), audio (Alam not explicitly mention their political orientation. and Riccardi, 2014), text (Preot¸iuc-Pietro et al., Despite this, their study is limited because in their 2015a), profile images (Liu et al., 2016a), social hardest classification task, they use crowdsourced data (Van Der Heide et al., 2012; Hall et al., 2014), political orientation labels, which may not corre- social networks (Perozzi and Skiena, 2015; Rout spond to reality and suffer from biases (Flekova et al., 2013), payment data (Wang et al., 2016) and et al., 2016a; Carpenter et al., 2016). Further, they endorsements (Kosinski et al., 2013). still only look at predicting binary political orien- Political orientation prediction has been studied tation. To date, no other research on this topic has in two related, albeit crucially different scenarios, taken into account these findings. as also identified in (Zafar et al., 2016). First, re- searchers aimed to identify and quantify orienta- 3 Data Set tion of words (Monroe et al., 2008), hashtags (We- The main data set used in this study consists of ber et al., 2013) or documents (Iyyer et al., 2014), 3,938 users recruited through the Qualtrics plat- 1 Data is available at http://www.preotiuc.ro form (D1). Each participant was compensated 1000 Political Orientation @JoeBiden, @CoryBooker, @JohnKerry) or US conservative politics (@marcorubio, @tedcruz, 750 @RandPaul, @RealBenCarson). Liberals in our set (Nl = 7417) had to follow on Twitter all of 500 the liberal political figures and none of the con- 250 servative figures. Likewise, conservative users (Nc = 6234) had to follow all of the conservative 0 1 2 3 4 5 6 7 figures and no liberal figures. We downloaded up to 3,200 of each user’s most recent tweets, leading Figure 1: Distribution of political ideology in our data set, to a total of 25,493,407 tweets. All tweets were from 1 – Very Conservative through 7 – Very Liberal. downloaded around 10 August 2016. 4 Features with 3 USD for 15 minutes of their time. All participants first answered the same demographic In our analysis, we use a broad range of linguistic questions (including political ideology), then were features described below. directed to one of four sets of psychological ques- Unigrams We use the bag-of-words representa- tionnaires unrelated to the political ideology ques- tion to reduce each user’s posting history to a nor- tion. They were asked to self-report their politi- malised frequency distribution over the vocabulary cal ideology on a seven point scale: Very conser- consisting of all words used by at least 10% of the vative (1), Conservative (2), Moderately conser- users (6,060 words). vative (3), Moderate (4), Moderately liberal (5), LIWC Traditional psychological studies use a Liberal (6), Very liberal (7).