CORPUS OF THE #BLACKLIVESMATTER MOVEMENT AND COUNTER PROTESTS: 2013 TO 2020

A PREPRINT

Salvatore Giorgi (1, 2), Sharath Chandra Guntuku (2, 3), Muhammad Rahman (1), McKenzie Himelein-Wachowiak (1), Amy Kwarteng (1), and Brenda Curtis (1)

(1) National Institutes of Health, National Institute on Drug Abuse, Bethesda, MD
(2) Computer and Information Science Department, University of Pennsylvania, Philadelphia, PA
(3) Center for Digital Health, Penn Medicine, Philadelphia, PA

July 20, 2020

ABSTRACT

Black Lives Matter (BLM) is a movement protesting violence towards Black individuals and communities, with a focus on police brutality. The movement has gained significant media and political attention following the killings of Ahmaud Arbery, Breonna Taylor, and George Floyd and the shooting of Jacob Blake in 2020 [1]. Due to the movement's decentralized nature, the hashtag #BlackLivesMatter has come to represent the movement and has also been used as a call to action. Similar hashtags have appeared to counter the BLM movement, such as #AllLivesMatter and #BlueLivesMatter. We introduce a data set of 41.8 million tweets from 10 million users which contain one of the following keywords: BlackLivesMatter, AllLivesMatter and BlueLivesMatter.^1 This data set contains all currently available tweets from the beginning of the BLM movement in 2013 to June 2020. We summarize the data set and show temporal trends in the use of both the BlackLivesMatter keyword and the keywords associated with counter movements. Similarly themed BLM data sets, though much smaller in scope, have previously been used for studying discourse in protest and counter protest movements [2, 3], predicting retweets [4], examining the role of social media in protest movements [5, 6] and exploring narrative agency [7]. This paper open-sources a large-scale data set to facilitate research in the areas of computational social science, communications, political science, natural language processing, and machine learning.

Keywords Social media · Twitter · hashtags · social movements · protests · policing

arXiv:2009.00596v2 [cs.SI] 28 Sep 2020

1 Value of the Data

• These data are useful because they showcase the entire course of a large, ongoing social movement (Black Lives Matter) and its counter protests (All Lives Matter and Blue Lives Matter). To our knowledge, no other Twitter data sets exist that cover the entire span of the Black Lives Matter movement to date.

• Researchers interested in systemic racism, social movements, grassroots campaigns, racial inequality, police brutality and counter protests, especially those working in the fields of computational social science, communications, and political science, can benefit from these data.

• The data set contains 41.8 million posts from 10 million users and can be used to identify linguistic patterns associated with the social movements and their counter protests, social networks (through friend/follower user data), temporal and spatial patterns (through the use of timestamps and latitude/longitude coordinates), inter- and intra-movement dialog, and the spread of news and misinformation (through retweets and tweets linking to news articles).

• Since 2013, the BLM movement has grown exponentially, resulting in global protests and several counter protests. These historical data, starting in 2013 and ending in 2020, permit researchers to track this grassroots movement from its social media beginnings.

^1 Data available at: https://doi.org/10.5281/zenodo.4056563

2 Data Description

Tweets containing the keywords BlackLivesMatter, AllLivesMatter and BlueLivesMatter were collected from the Twitter API from January 2013 to June 2020. Table 1 contains counts of total number of tweets and users for the entire data set and each keyword. It also includes counts for the following: retweets (original tweets which are shared by other users on the platform), replies (tweets which directly respond to another tweet), geotagged (latitude/longitude coordinates associated with the tweet) and top languages (automatically detected language of the tweet). Retweets may or may not contain additional content created by the user doing the retweeting.

Keyword            Tweets       Users        Retweets     Replies     Geotagged   Top Languages
All                41,801,153   10,136,019   30,377,162   2,033,245   69,969      en, fr, es, pt, ja
BlackLivesMatter   36,892,699    9,543,924   27,565,206   1,583,077   61,392      en, fr, es, pt, ja
AllLivesMatter      3,001,012    1,462,712    1,463,972     368,035    8,977      en, es, nl, ja, fr
BlueLivesMatter     3,352,437      811,805    2,174,139     195,525    2,049      en, fr, es, ja, de

Table 1: Descriptive counts for the entire data set and each keyword. Note that tweets can contain more than one keyword and can therefore be included in more than one row. ISO 639-1 language codes: en = English, fr = French, es = Spanish, pt = Portuguese, ja = Japanese, nl = Dutch, de = German.
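The counts in Table 1 could be recomputed from hydrated tweets along the following lines. This is a minimal sketch, not the authors' code: it assumes each tweet is a dict in the Twitter API v1.1 JSON layout (fields `user.id_str`, `retweeted_status`, `in_reply_to_status_id_str`, `coordinates`, `lang`), and the function name `describe` and the toy `sample` records are hypothetical.

```python
from collections import Counter


def describe(tweets, top_n=5):
    """Compute Table 1-style counts from an iterable of tweet dicts."""
    users = set()
    langs = Counter()
    counts = {"tweets": 0, "retweets": 0, "replies": 0, "geotagged": 0}
    for tweet in tweets:
        counts["tweets"] += 1
        users.add(tweet["user"]["id_str"])
        # Assumption: a retweet carries a `retweeted_status` object.
        if "retweeted_status" in tweet:
            counts["retweets"] += 1
        # Assumption: a reply has a non-null `in_reply_to_status_id_str`.
        if tweet.get("in_reply_to_status_id_str"):
            counts["replies"] += 1
        # Assumption: "geotagged" means exact coordinates, not a tagged Place.
        if tweet.get("coordinates"):
            counts["geotagged"] += 1
        if tweet.get("lang"):
            langs[tweet["lang"]] += 1
    counts["users"] = len(users)
    counts["top_languages"] = [lang for lang, _ in langs.most_common(top_n)]
    return counts


# Toy records for illustration only; these are not tweets from the data set.
sample = [
    {"id_str": "1", "user": {"id_str": "a"}, "lang": "en", "coordinates": None},
    {"id_str": "2", "user": {"id_str": "b"}, "lang": "en",
     "retweeted_status": {"id_str": "1"}},
    {"id_str": "3", "user": {"id_str": "a"}, "lang": "fr",
     "in_reply_to_status_id_str": "1",
     "coordinates": {"type": "Point", "coordinates": [-75.16, 39.95]}},
]
summary = describe(sample)
print(summary)
```

Running the same function once per keyword subset would give the per-keyword rows; because a tweet can match more than one keyword, the rows need not sum to the "All" row.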

Tweets also carry a large amount of additional metadata, such as user profile data and place information. User profiles contain information such as user handles, free text descriptions and profile images. Places are named locations that users choose to associate with a tweet. While Places describe physical locations, they do not imply that the tweet originated from that location: Twitter users may manually tag a location when their tweet is about that Place. Due to the large number of additional fields available for each tweet, we do not provide counts for any additional content.

The monthly volume of each keyword is plotted in Figure 1. Here we plot the seven-day running average of the total count (logged) of all tweets containing one of our keywords. Labels consisting of a single name mark the dates of high-profile police brutality-related killings.

[Figure 1 plot: tweet count (log) on the y-axis against year (2013 to 2020) on the x-axis, with one line each for BlackLivesMatter, AllLivesMatter and BlueLivesMatter. Labeled events: Acquitted; Mistrial in Jordan Davis shooting; Eric Garner; Michael Brown; Tamir Rice; Freddie Gray; Sandra Bland; Alton Sterling; Philando Castile; Stephon Clark; Viral video of cops harassing Black mother; Ahmaud Arbery; Breonna Taylor; George Floyd.]

Figure 1: Seven-day moving average of logged monthly tweet count from 2013 to 2020 for all three keywords. We include markers for high-profile events associated with the BLM movement.
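A smoothed, logged time series of the kind shown in Figure 1 might be computed as follows. This is a sketch under stated assumptions rather than the authors' plotting code: it counts tweets per day, averages over a trailing seven-day window, and then takes the natural log using `log1p` so that empty days do not fail. The order of averaging and logging, the trailing window, and the function names `daily_counts` and `smoothed_log_counts` are all assumptions.

```python
import math
from collections import Counter
from datetime import date, timedelta


def daily_counts(dates, start, end):
    """Count tweets per calendar day from start to end (inclusive).

    Days with no tweets are filled with zero so the series has no holes.
    """
    tally = Counter(dates)
    n_days = (end - start).days + 1
    return [tally.get(start + timedelta(days=i), 0) for i in range(n_days)]


def smoothed_log_counts(counts, window=7):
    """Trailing moving average of the counts, followed by log(1 + x)."""
    out = []
    for i in range(len(counts)):
        # The first few days use a shorter window instead of being dropped.
        chunk = counts[max(0, i - window + 1): i + 1]
        out.append(math.log1p(sum(chunk) / len(chunk)))
    return out


# Toy example: three tweet dates over a three-day span.
first_day = date(2020, 5, 25)
tweet_dates = [first_day, first_day, first_day + timedelta(days=2)]
series = daily_counts(tweet_dates, first_day, first_day + timedelta(days=2))
smoothed = smoothed_log_counts(series)
```

Each keyword would get its own series, built from the creation dates of the tweets matching that keyword.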

3 Experimental Design, Materials and Methods

On July 14, 2016, we set up a data puller using the Python package TwitterMySQL^2 to collect tweets matching at least one of our keywords: BlackLivesMatter, AllLivesMatter and BlueLivesMatter. This package uses the official Twitter Application Programming Interface (API) to stream tweets in real time. The data puller continuously collected tweets from the Twitter stream until the time of writing (July 2020). In total we collected 50,574,955 tweets. While the Twitter API was queried using the keywords BlackLivesMatter, AllLivesMatter and BlueLivesMatter, the API delivers a broader set of matching tweets. For example, a tweet might contain the phrase “black lives matter” or “blm”, among other variations, instead of the keyword BlackLivesMatter.

^2 https://github.com/dlatk/TwitterMySQL

We note that the Twitter API limits such streams to 1% of the total Twitter volume at any given moment. To see if our keyword data set was limited at any point, we compared the monthly keyword volume to a full 1% monthly pull (not limited to any single keyword, location, etc.). Over the 4-year time span, our keyword data set pulled in a monthly average of 1,176,161 tweets (4,629,878 SD) as compared to a monthly average of 94,893,476 tweets (27,394,826 SD) from the full 1% pull. On average, the keyword stream therefore amounted to roughly 1.2% of the volume of the full 1% pull. Thus, we do not believe our data set was limited by the Twitter API.

Due to server maintenance, there were periods when we were unable to collect data. These include: October 17, 2016 through November 23, 2016; January 1, 2017 through January 21, 2017; March 11, 2017 through March 16, 2017; May 2, 2018 through December 18, 2018; and March 16, 2019 through March 20, 2019. Additionally, the Black Lives Matter movement began in 2013, roughly three years before the beginning of our data collection. In order to fill these gaps, we used the Python package GetOldTweets3^3 [9], which pulls historical tweets containing a given keyword. These tweets were pulled in June 2020. Using this method, we collected 4,276,423 historical tweets. Having two separate methods of pulling tweet data (prospective using the streaming API and retrospective using GetOldTweets3) caused inconsistencies when reconstructing timelines of keyword use.
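The collection windows described above can be written down explicitly, which makes it easy to check whether a given date had to be covered by the retrospective pull. The following is an illustrative sketch only: the dates come from the text, but treating both endpoints of each gap as inclusive is an assumption, and the names `GAPS`, `STREAM_START`, `needs_backfill` and `gap_days` are hypothetical.

```python
from datetime import date

# First day of the streaming collection, as reported in the text.
STREAM_START = date(2016, 7, 14)

# Server maintenance gaps reported in the text.
# Assumption: both endpoints of each gap are inclusive.
GAPS = [
    (date(2016, 10, 17), date(2016, 11, 23)),
    (date(2017, 1, 1), date(2017, 1, 21)),
    (date(2017, 3, 11), date(2017, 3, 16)),
    (date(2018, 5, 2), date(2018, 12, 18)),
    (date(2019, 3, 16), date(2019, 3, 20)),
]


def needs_backfill(day):
    """Return True if `day` precedes the stream or falls inside a gap."""
    if day < STREAM_START:
        return True
    return any(start <= day <= end for start, end in GAPS)


def gap_days():
    """Total number of days covered by the maintenance gaps."""
    return sum((end - start).days + 1 for start, end in GAPS)
```

For example, `needs_backfill(date(2014, 8, 9))` is true because the date precedes the stream, while a date in May 2020 falls in a period covered by the streaming collection.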
While Twitter data is publicly available, at any point a user may delete a tweet, delete their account, or set their account to private. Thus, when pulling prospective data, we collected tweets which may have been deleted or made private at some point after the initial pull. On the other hand, one cannot pull deleted or private tweets with a retrospective collection. In order to ensure the data set only contained presently available tweets, we executed a one-time historical pull in June 2020. As a result, any tweet that was deleted or made private after our initial pull is not included. Our final data set consisted of 41,801,153 tweets.

Due to Twitter’s Terms of Service, only numeric tweet IDs can be publicly shared. The numeric IDs can be used to pull the full tweet set using the Twitter API. There are a number of open source software packages which allow researchers to easily interface with the API. The authors used the Python package TwitterMySQL^2, which saves tweet information in a MySQL database. Other packages exist which do not rely on relational databases, such as the Python package twarc^4, which saves tweets to text files in JSON format. Finally, Hydrator^5, which relies on an easy-to-use GUI, saves tweets to both JSON and CSV formats.
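Rebuilding the full tweets from the shared IDs ("hydration") generally means sending the IDs to a lookup endpoint in batches and keeping whatever is still available. The sketch below shows only that batching logic and is not the code of any of the packages named above. It assumes a batch size of 100 IDs per request, and `lookup` is a hypothetical callable standing in for a real API client; the `fake_lookup` function and `available` records exist only for the demonstration.

```python
from itertools import islice


def batched(ids, size=100):
    """Yield lists of at most `size` IDs from any iterable of tweet IDs."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def hydrate(ids, lookup, size=100):
    """Yield full tweet objects for the IDs that `lookup` can still resolve.

    `lookup` is a hypothetical callable: it takes a list of ID strings and
    returns the tweet dicts that are still publicly available.
    """
    for batch in batched(ids, size):
        yield from lookup(batch)


# Stand-in for a real API client: tweet "2" plays a deleted or private tweet.
available = {
    "1": {"id_str": "1", "text": "first toy tweet"},
    "3": {"id_str": "3", "text": "third toy tweet"},
}


def fake_lookup(batch):
    return [available[i] for i in batch if i in available]


tweets = list(hydrate(["1", "2", "3"], fake_lookup, size=2))
```

Because deleted and private tweets are silently absent from the results, the number of hydrated tweets can be smaller than the number of IDs requested.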

4 Ethics Statement

The data used in this article are publicly available and are distributed in accordance with Twitter’s Terms of Service. Additionally, no human subjects were used in the data collection.

5 Acknowledgments

This research was supported in part by the Intramural Research Program of the NIH, National Institute on Drug Abuse (NIDA).

6 Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

References

[1] Monica Anderson, Michael Barthel, Andrew Perrin, and Emily A. Vogels. #BlackLivesMatter surges on Twitter after George Floyd’s death. Pew Research Center, Jun 2020. https://www.pewresearch.org/fact-tank/2020/06/10/blacklivesmatter-surges-on-twitter-after-george-floyds-death/.

[2] Ryan J Gallagher, Andrew J Reagan, Christopher M Danforth, and Peter Sheridan Dodds. Divergent discourse between protests and counter-protests: #BlackLivesMatter and #AllLivesMatter. PLoS ONE, 13(4):e0195644, 2018.

[3] Jeffrey Layne Blevins, James Jaehoon Lee, Erin E McCabe, and Ezra Edgerton. Tweeting for social justice in #Ferguson: Affective discourse in Twitter hashtags. New Media & Society, 21(7):1636–1653, 2019.

[4] Kate Keib, Itai Himelboim, and Jeong-Yeob Han. Important tweets matter: Predicting retweets in the #BlackLivesMatter talk on Twitter. Computers in Human Behavior, 85:106–115, 2018.

[5] Marcia Mundt, Karen Ross, and Charla M Burnett. Scaling social movements through social media: The case of Black Lives Matter. Social Media + Society, 4(4):2056305118807911, 2018.

[6] Jelani Ince, Fabio Rojas, and Clayton A Davis. The social media response to Black Lives Matter: How Twitter users interact with Black Lives Matter through hashtag use. Ethnic and Racial Studies, 40(11):1814–1830, 2017.

[7] Guobin Yang. Narrative agency in hashtag activism: The case of #BlackLivesMatter. Media and Communication, 4(4):13, 2016.

^3 https://github.com/Mottl/GetOldTweets3
^4 https://github.com/DocNow/twarc
^5 https://github.com/DocNow/hydrator
