Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings

Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings Dorottya Demszky1 Nikhil Garg1 Rob Voigt1 James Zou1 Matthew Gentzkow1 Jesse Shapiro2 Dan Jurafsky1 1Stanford University 2Brown University fddemszky, nkgarg, robvoigt, jamesz, gentzkow, [email protected] jesse shapiro [email protected] Abstract 2016) and Facebook (Bakshy et al., 2015). Prior NLP work has shown, e.g., that polarized mes- We provide an NLP framework to uncover sages are more likely to be shared (Zafar et al., four linguistic dimensions of political polar- 2016) and that certain topics are more polarizing ization in social media: topic choice, framing, affect and illocutionary force. We quantify (Balasubramanyan et al., 2012); however, we lack these aspects with existing lexical methods, a more broad understanding of the many ways that and propose clustering of tweet embeddings polarization can be instantiated linguistically. as a means to identify salient topics for analy- This work builds a more comprehensive frame- sis across events; human evaluations show that work for studying linguistic aspects of polariza- our approach generates more cohesive topics tion in social media, by looking at topic choice, than traditional LDA-based models. We ap- framing, affect, and illocutionary force. ply our methods to study 4.4M tweets on 21 mass shootings. We provide evidence that the 1.1 Mass Shootings discussion of these events is highly polarized politically and that this polarization is primar- We explore these aspects of polarization by study- ily driven by partisan differences in framing ing a sample of more than 4.4M tweets about rather than topic choice. We identify framing 21 mass shooting events, analyzing polarization devices, such as grounding and the contrasting within and across events. use of the terms “terrorist” and “crazy”, that contribute to polarization. Results pertaining Framing and polarization in the context of mass to topic choice, affect and illocutionary force shootings is well-studied, though much of the lit- suggest that Republicans focus more on the erature studies the role of media (Chyi and Mc- shooter and event-specific facts (news) while Combs, 2004; Schildkraut and Elsass, 2016) and Democrats focus more on the victims and call politicians (Johnson et al., 2017). Several works for policy changes. Our work contributes to find that frames have changed over time and be- a deeper understanding of the way group di- tween such events (Muschert and Carr, 2006; visions manifest in language and to computa- Schildkraut and Muschert, 2014), and that frames tional methods for studying them.1 influence opinions on gun policies (HaiderMarkel 1 Introduction and Joslyn, 2001). Prior NLP work in this area has considered how to extract factual informa- Elites, political parties, and the media in the US tion on gun violence from news (Pavlick et al., arXiv:1904.01596v2 [cs.CL] 4 Apr 2019 are increasingly polarized (Layman et al., 2010; 2016) as well as quantify stance and public opin- Prior, 2013; Gentzkow et al., forthcoming), and ion on Twitter (Benton et al., 2016) and across the the propagation of partisan frames can influence web (Ayers et al., 2016); here we advance NLP ap- public opinion (Chong and Druckman, 2007) and proaches to the public discourse surrounding gun party identification (Fiorina and Abrams, 2008). violence by introducing methods to analyze other Americans increasingly get their news from linguistic manifestations of polarization. internet-based sources (Mitchell et al., 2016), and political information-sharing is highly ide- 1.2 The Role of the Shooter’s Race ologically segregated on platforms like Twitter We are particularly interested in the role of the (Conover et al., 2011; Halberstam and Knight, shooter’s race in shaping polarized responses to 1All data and code is available at: https://github. these events. Implicit or explicit racial biases com/ddemszky/framing-twitter can be central in people’s understanding of social problems (Drakulich, 2015); in the mass shoot- Avg partisanship 1.00 ing context, race is a factor in an event’s news- 0.8 0.75 WY worthiness (Schildkraut et al., 2018) and is often 0.7 MS SD SC 0.50 AL AR mentioned prominently in media coverage, partic- AZ AK OK 0.6 IA LA TN 0.25 NC FL IN NE ID ularly when the shooter is non-white (Mingus and TX KY 0.00 0.5 NV WV KS UT ND Zopf, 2010; Park et al., 2012). Duxbury et al. VA NH WI OH 0.4 MT No. of partisan NJ CO GA MO users (2018) find that media representations of white HI ME DE PA 10000 shooters disproportionately divert blame by fram- 0.3 MD IL RI NM MI CA WA CT MN 20000 NY ing them as mentally ill while representations of 0.2 VT OR 30000 MA 40000 non-white shooters are more frequently criminal- 0.3 0.4 0.5 0.6 0.7 Proportion of Rep users in our data ized, highlighting histories of violent behavior. Repuplican two−party share in 2016 elections The important question remains as to how polarized ideologies surrounding race take shape on fo- Figure 1: To validate our partisanship assignment, we compare the proportion of users we classify as Repub- rums such as Twitter. Therefore, in all of the anal- lican separately for each US state against the Republi- yses throughout this paper we consider the race of can share of the two-party vote. A weighted linear re- the shooter as a potential factor. We note that in the gression with the number of partisan users per state as 21 shooting events we study, shootings in schools weights obtains an adjusted R2 of :82, indicating that and places of worship are overwhelmingly carried the distribution of Republican users in our data and that out by white perpetrators, so we cannot fully dis- of Republican voters across states is similar. entangle the effect of race from other factors. presidential candidates, and other party-affiliated 2 Data: Tweets on Mass Shootings pages.4 We label a user as a Democrat if they followed more Democratic than Republican politi- Data collection. We compiled a list of mass cians in November 2017, and as a Republican if shootings between 2015 and 2018 from the Gun the reverse is true. For each event, 51–72% of Violence Archive.2 For each, we identified a list of users can be assigned partisanship in this way; to keywords representative of their location (see Ap- validate our method we compare state averages pendixA). Given Twitter search API’s limitations of these inferred partisan labels to state two-party on past tweets, we retrieved data from a Stanford vote shares, finding a high correlation (Figure1). 5 lab’s archived intake of the Twitter firehose.3 For each event, we built a list of relevant tweets for the two weeks following the event. 3 Quantifying Overall Polarization A tweet is relevant if it contained at least one We begin by quantifying polarization (equiva- of the event’s location-based representative key- lently, partisanship) between the language of users words and at least one lemma from the following labeled Democrats and Republicans after mass list: “shoot”, “gun”, “kill”, “attack”, “massacre”, shooting events. We establish that there is sub- “victim”. We filtered out retweets and tweets from stantial polarization, and that the polarization in- users who have since been deactivated. We kept creases over time within most events. those 21 events with more than 10,000 tweets re- maining. For more details see AppendixA. 3.1 Methods Partisan assignment. We estimate the party af- Pre-processing. We first build a vocabulary for filiation of users in the dataset from the political each event as follows. Each vocabulary contains accounts they follow using a method similar to that unigrams and bigrams that occur in a given event’s of Volkova et al. (2014), which takes advantage tweets at least 50 times, counted after stemming of homophily in the following behavior of users via NLTK’s SnowballStemmer and stopword re- on twitter (Halberstam and Knight, 2016). We moval.6 We refer to these unigrams and bigrams compile a list of Twitter handles of US Congress collectively as tokens. members in 2017, the 2016 presidential and vice 2https://www.gunviolencearchive.org 4See Appendix B.1 for the complete list. 3With the exception of the two most recent shootings in 5We performed the sanity check for all partisan users with Pittsburgh and Thousand Oaks, for which we collected tweets a valid US state as part of their geo-location (∼350k users). real time via the Twitter API. 6Stopword list is provided in Appendix A.1 Measure of partisanship. We apply the actual value value resulting from random party assignment leave-out estimator of phrase partisanship from Pittsburgh Burlington 0.54 Thousand Oaks Gentzkow et al. (forthcoming). Partisanship is Orlando Fresno Vegas Colorado Springs Sutherland Springs defined as the expected posterior probability that 0.53 Chattanooga Thornton Parkland Dallas an observer with a neutral prior would assign to a Roseburg San Francisco Nashville Annapolis 0.52 Santa Fe San Bernardino Baton Rouge tweeter’s true party after observing a single token Kalamazoo 0.51 drawn at random from the tweets produced by the estimate Leave−out tweeter. If there is no difference in token usage 0.50 between the two parties, then this probability is 2016 2017 2018 2019 Year :5, i.e. we cannot guess the user’s party any better after observing a token. Figure 2: Tweets on mass shootings are highly polar- The leave-out estimator consistently estimates ized, as measured by the leave-out estimate of phrase partisanship under the assumption that a user’s to- partisanship (Gentzkow et al., forthcoming).

Load more