Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings
Dorottya Demszky1 Nikhil Garg1 Rob Voigt1 James Zou1 Matthew Gentzkow1 Jesse Shapiro2 Dan Jurafsky1
1Stanford University 2Brown University
{ddemszky, nkgarg, robvoigt, jamesz, gentzkow, jurafsky}@stanford.edu
jesse shapiro [email protected]

Abstract

We provide an NLP framework to uncover four linguistic dimensions of political polarization in social media: topic choice, framing, affect and illocutionary force. We quantify these aspects with existing lexical methods, and propose clustering of tweet embeddings as a means to identify salient topics for analysis across events; human evaluations show that our approach generates more cohesive topics than traditional LDA-based models. We apply our methods to study 4.4M tweets on 21 mass shootings. We provide evidence that the discussion of these events is highly polarized politically and that this polarization is primarily driven by partisan differences in framing rather than topic choice. We identify framing devices, such as grounding and the contrasting use of the terms "terrorist" and "crazy", that contribute to polarization. Results pertaining to topic choice, affect and illocutionary force suggest that Republicans focus more on the shooter and event-specific facts (news) while Democrats focus more on the victims and call for policy changes. Our work contributes to a deeper understanding of the way group divisions manifest in language and to computational methods for studying them.1

1 Introduction

Elites, political parties, and the media in the US are increasingly polarized (Layman et al., 2010; Prior, 2013; Gentzkow et al., forthcoming), and the propagation of partisan frames can influence public opinion (Chong and Druckman, 2007) and party identification (Fiorina and Abrams, 2008). Americans increasingly get their news from internet-based sources (Mitchell et al., 2016), and political information-sharing is highly ideologically segregated on platforms like Twitter (Conover et al., 2011; Halberstam and Knight, 2016) and Facebook (Bakshy et al., 2015). Prior NLP work has shown, e.g., that polarized messages are more likely to be shared (Zafar et al., 2016) and that certain topics are more polarizing (Balasubramanyan et al., 2012); however, we lack a broader understanding of the many ways that polarization can be instantiated linguistically. This work builds a more comprehensive framework for studying linguistic aspects of polarization in social media, by looking at topic choice, framing, affect, and illocutionary force.

1.1 Mass Shootings

We explore these aspects of polarization by studying a sample of more than 4.4M tweets about 21 mass shooting events, analyzing polarization within and across events. Framing and polarization in the context of mass shootings is well studied, though much of the literature studies the role of media (Chyi and McCombs, 2004; Schildkraut and Elsass, 2016) and politicians (Johnson et al., 2017). Several works find that frames have changed over time and between such events (Muschert and Carr, 2006; Schildkraut and Muschert, 2014), and that frames influence opinions on gun policies (Haider-Markel and Joslyn, 2001). Prior NLP work in this area has considered how to extract factual information on gun violence from news (Pavlick et al., 2016) as well as how to quantify stance and public opinion on Twitter (Benton et al., 2016) and across the web (Ayers et al., 2016); here we advance NLP approaches to the public discourse surrounding gun violence by introducing methods to analyze other linguistic manifestations of polarization.

1.2 The Role of the Shooter's Race

We are particularly interested in the role of the shooter's race in shaping polarized responses to these events. Implicit or explicit racial biases can be central in people's understanding of social problems (Drakulich, 2015); in the mass shooting context, race is a factor in an event's newsworthiness (Schildkraut et al., 2018) and is often mentioned prominently in media coverage, particularly when the shooter is non-white (Mingus and Zopf, 2010; Park et al., 2012). Duxbury et al. (2018) find that media representations of white shooters disproportionately divert blame by framing them as mentally ill, while representations of non-white shooters are more frequently criminalized, highlighting histories of violent behavior. The important question remains as to how polarized ideologies surrounding race take shape on forums such as Twitter. Therefore, in all of the analyses throughout this paper we consider the race of the shooter as a potential factor. We note that among the 21 shooting events we study, shootings in schools and places of worship are overwhelmingly carried out by white perpetrators, so we cannot fully disentangle the effect of race from other factors.

Figure 1: To validate our partisanship assignment, we compare the proportion of users we classify as Republican separately for each US state against the Republican share of the two-party vote. A weighted linear regression with the number of partisan users per state as weights obtains an adjusted R2 of .82, indicating that the distribution of Republican users in our data and that of Republican voters across states is similar.

2 Data: Tweets on Mass Shootings

Data collection. We compiled a list of mass shootings between 2015 and 2018 from the Gun Violence Archive.2 For each, we identified a list of keywords representative of their location (see Appendix A). Given the Twitter search API's limitations on past tweets, we retrieved data from a Stanford lab's archived intake of the Twitter firehose.3 For each event, we built a list of relevant tweets for the two weeks following the event. A tweet is relevant if it contained at least one of the event's location-based representative keywords and at least one lemma from the following list: "shoot", "gun", "kill", "attack", "massacre", "victim". We filtered out retweets and tweets from users who have since been deactivated. We kept those 21 events with more than 10,000 tweets remaining. For more details see Appendix A.

Partisan assignment. We estimate the party affiliation of users in the dataset from the political accounts they follow, using a method similar to that of Volkova et al. (2014), which takes advantage of homophily in the following behavior of users on Twitter (Halberstam and Knight, 2016). We compile a list of Twitter handles of US Congress members in 2017, the 2016 presidential and vice presidential candidates, and other party-affiliated pages.4 We label a user as a Democrat if they followed more Democratic than Republican politicians in November 2017, and as a Republican if the reverse is true. For each event, 51–72% of users can be assigned partisanship in this way; to validate our method, we compare state averages of these inferred partisan labels to state two-party vote shares, finding a high correlation (Figure 1).5

3 Quantifying Overall Polarization

We begin by quantifying polarization (equivalently, partisanship) between the language of users labeled Democrats and Republicans after mass shooting events. We establish that there is substantial polarization, and that the polarization increases over time within most events.

3.1 Methods

Pre-processing. We first build a vocabulary for each event as follows. Each vocabulary contains unigrams and bigrams that occur in a given event's tweets at least 50 times, counted after stemming via NLTK's SnowballStemmer and stopword removal.6 We refer to these unigrams and bigrams collectively as tokens.

Measure of partisanship. We apply the leave-out estimator of phrase partisanship from Gentzkow et al. (forthcoming). Partisanship is defined as the expected posterior probability that an observer with a neutral prior would assign to a tweeter's true party after observing a single token drawn at random from the tweets produced by the tweeter.

1 All data and code is available at: https://github.com/ddemszky/framing-twitter
2 https://www.gunviolencearchive.org
3 With the exception of the two most recent shootings, in Pittsburgh and Thousand Oaks, for which we collected tweets in real time via the Twitter API.
4 See Appendix B.1 for the complete list.
5 We performed the sanity check for all partisan users with a valid US state as part of their geo-location (~350k users).
6 The stopword list is provided in Appendix A.1.

Proceedings of NAACL-HLT 2019, pages 2970–3005, Minneapolis, Minnesota, June 2 - June 7, 2019. (c) 2019 Association for Computational Linguistics
If there is no difference in token usage between the two parties, then this probability is .5, i.e. we cannot guess the user's party any better after observing a token.

Figure 2: Tweets on mass shootings are highly polarized, as measured by the leave-out estimate of phrase partisanship (Gentzkow et al., forthcoming). [Scatter of events by year, 2016–2019, against leave-out estimates ranging from roughly .50 to .54, compared against the value resulting from random party assignment.]

The leave-out estimator consistently estimates partisanship under the assumption that a user's to-
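The partisan assignment rule from Section 2 can be sketched as follows. This is an illustration of the follow-count comparison described in the text, not the authors' released code; the handle sets passed in are hypothetical stand-ins for the paper's curated list (Appendix B.1).

```python
def assign_party(followed, dem_handles, rep_handles):
    """Label a user by comparing how many Democratic vs. Republican
    political accounts they follow; None if tied or no signal.

    followed, dem_handles, rep_handles: sets of Twitter handles.
    """
    d = len(followed & dem_handles)  # Democratic accounts followed
    r = len(followed & rep_handles)  # Republican accounts followed
    if d > r:
        return "Democrat"
    if r > d:
        return "Republican"
    return None  # unassigned: tie, or follows no listed political account
```

Users returning None here correspond to the 28–49% of users per event that the paper leaves unassigned.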
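The pre-processing step from Section 3.1 can be sketched as below. This is not the authors' code: the tokenizer regex and the tiny stopword set are placeholders (the paper's actual stopword list is in its Appendix A.1), though the Snowball stemming and the minimum count of 50 follow the text.

```python
import re
from collections import Counter

from nltk.stem.snowball import SnowballStemmer

# Placeholder stopword set; the paper's list is given in its Appendix A.1.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "in", "and", "on"}

stemmer = SnowballStemmer("english")

def event_vocabulary(tweets, min_count=50):
    """Unigrams and bigrams occurring at least `min_count` times in an
    event's tweets, counted after stopword removal and stemming."""
    counts = Counter()
    for tweet in tweets:
        stems = [stemmer.stem(w)
                 for w in re.findall(r"[a-z']+", tweet.lower())
                 if w not in STOPWORDS]
        counts.update(stems)                                         # unigrams
        counts.update(" ".join(bg) for bg in zip(stems, stems[1:]))  # bigrams
    return {tok for tok, n in counts.items() if n >= min_count}
```

For example, with `min_count=2`, the tweets "gun control now" and "gun control works" yield the vocabulary {"gun", "control", "gun control"}.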
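The posterior-probability intuition behind the partisanship measure can be sketched numerically as follows. This is our reading of the leave-out idea, not the exact published estimator of Gentzkow et al.: for each user we take the expected posterior probability of their true party over their token distribution, leaving the user's own counts out of their party's aggregate; the array layout and the `eps` smoothing are our assumptions.

```python
import numpy as np

def expected_posterior(own, other_total):
    """Mean over users of q_i . rho_{-i}: the expected posterior probability
    of a user's true party after one random token, with the user's own
    counts left out of their party's aggregate token counts."""
    eps = 1e-12  # guards against 0/0 for tokens unused by both parties
    own_total = own.sum(axis=0)
    other_freq = other_total / other_total.sum()
    vals = []
    for i in range(own.shape[0]):
        q_i = own[i] / own[i].sum()   # user i's empirical token distribution
        loo = own_total - own[i]      # leave user i out of own party
        own_freq = loo / loo.sum()
        rho = own_freq / (own_freq + other_freq + eps)  # posterior of own party
        vals.append(q_i @ rho)
    return float(np.mean(vals))

def leave_out_partisanship(dem_counts, rep_counts):
    """Average of the Democratic- and Republican-side terms; .5 means
    token usage carries no information about party."""
    dem = np.asarray(dem_counts, dtype=float)  # rows: users, cols: tokens
    rep = np.asarray(rep_counts, dtype=float)
    return 0.5 * (expected_posterior(dem, rep.sum(axis=0))
                  + expected_posterior(rep, dem.sum(axis=0)))
```

When both parties use tokens identically the estimate is .5, matching the baseline discussed above; with fully disjoint vocabularies it approaches 1.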