Detecting East Asian Prejudice on Social Media
Total Page:16
File Type:pdf, Size:1020Kb
Detecting East Asian Prejudice on Social Media Bertie Vidgen1,2, Austin Botelho2, David Broniatowski3, Ella Guest1,6, Matthew Hall1,4, Helen Margetts1,2, Rebekah Tromble1,3, Zeerak Waseem5, and Scott Hale1,2 1The Alan Turing Institute 2The Oxford Internet Institute 3The George Washington University 4The University of Surrey 5University of Sheffield 6The University of Manchester May 2020 Abstract The outbreak of COVID-19 has transformed so- can be mitigated [8]. There is a pressing need to also research cieties across the world as governments tackle the health, and understand other forms of harm and danger which are economic and social costs of the pandemic. It has also raised spreading during the pandemic. concerns about the spread of hateful language and prejudice Social media is one of the most important battlegrounds online, especially hostility directed against East Asia. In in the fight against social hazards during COVID-19. As this paper we report on the creation of a classifier that de- life moves increasingly online, it is crucial that social me- tects and categorizes social media posts from Twitter into dia platforms and other online spaces remain safe, accessible four classes: Hostility against East Asia, Criticism of East and free from abuse [9] { and that people's fears and dis- Asia, Meta-discussions of East Asian prejudice and a neutral tress during this time are not exploited and social tensions class. The classifier achieves an F1 score of 0.83 across all stirred up. Computational tools, utilizing recent advances four classes. We provide our final model (coded in Python), in machine learning and natural language processing, offer as well as a new 20,000 tweet training dataset used to make powerful ways of creating scalable and robust models for de- the classifier, two analyses of hashtags associated with East tecting and measuring prejudice. These, in turn, can assist Asian prejudice and the annotation codebook. The classi- with both online content moderation processes and further fier can be implemented by other researchers, assisting with research into the dynamics, prevalence and impact of East both online content moderation processes and further re- Asian prejudice. search into the dynamics, prevalence and impact of East In this paper we report on the creation of a classifier to Asian prejudice online during this global pandemic.1 detect East Asian prejudice in social media data. It distin- guishes between four primary categories: Hostility against Keywords Hate speech, Sinophobia, Prejudice, Social me- East Asia, Criticism of East Asia, Meta-discussions of East dia, Covid-19, Twitter, East Asia, Online abuse. Asian prejudice and a neutral class. The classifier achieves an F1 score of 0.83. We also provide a new 20,000 tweet arXiv:2005.03909v1 [cs.CL] 8 May 2020 training dataset used to create the classifier, the annotation 1 Introduction codebook and two analyses of hashtags associated with East Asian prejudice. The training dataset contains annotations The outbreak of COVID-19 has raised concerns about the for several secondary categories, including threatening lan- spread of Sinophobia and other forms of East Asian preju- guage, interpersonal abuse and dehumanization, which can dice across the world, with reports of online and offline abuse be used for further research. 2 directed against East Asian people in the first few months of the pandemic, including physical attacks [1, 2, 3, 4, 5, 6]. The United Nations High Commissioner for Human Rights 2 Literature Review has drawn attention to increased prejudice against people of East Asian background and has called on UN member East Asian prejudice, such as Sinophobia, can be understood states to fight such discrimination [7]. Thus far, most of as fear or hatred of East Asia and East Asian people [10]. the academic response to COVID-19 has focused on under- This prejudice has a long history in the West: in the 19th standing its health- and economic- impacts and how these century the term "yellow peril" was used to refer to Chinese 1This work is a collaboration between The Alan Turing Institute and the Oxford Internet Institute. It was funded by the Criminal Justice Theme of the Alan Turing Institute under Wave 1 of The UKRI Strategic Priorities Fund, EPSRC Grant EP/T001569/1. 2All research materials are available at https://zenodo.org/record/3816667. 1 Detecting East Asian Prejudice on Social Media Vidgen et al. (2020) immigrants who were stereotyped as dirty and diseased, and as such, automated detection tools) means that researchers considered akin to a plague [11]. The association of COVID- have to rely instead on far cruder ways of measuring East 19 with China plays into these centuries old stereotypes, as Asian prejudice, such as searching for slurs and other pejo- shown by derogatory references to `bats' and `monkeys' [12]. ratives. These methods drive substantial errors [29] as lots Similar anti-Chinese prejudices emerged during the SARS of less overt prejudice is missed because the content does outbreak in the early 2000s, with lasting adverse social, eco- not contain the target keywords, and non-hateful content is nomic, and political impacts on East Asian diasporas glob- misclassified just because it does contain the keywords. ally [13]. However, developing new detection tools is a complex A 2019 Pew survey examined attitudes towards China and lengthy process. The field of hate speech detection sits from people in 34 countries. A median of 41% of citizens had at the intersection of social science and computer science, an unfavorable opinion of China. Negative opinions were and is fraught with not only technical challenges but also particularly common in North America, Western Europe, deep-routed ethical and theoretical considerations [30]. Re- and neighboring East Asian countries [14]. The 2019 survey, cent studies show that many existing datasets and tools con- which was conducted just before the pandemic, marked a tain substantial biases, such as overclassifying African Amer- historic high in unfavorable attitudes towards China. Simi- ican vernacular as hateful compared with Standard Amer- larly, In 2017, a study found that 21% of Asian Americans ican [31, 32], penalising hate against certain targets more had received threats based on their Asian identity, and 10% strongly than others [33], or being easily fooled by simple had been victims of violence [15]. Likewise, a 2009 report spelling changes that any human can identify [34]. These is- by the Universities of Hull, Leeds and Nottingham Trent re- sues are considerable limitations as they mean that, if used ported on the discrimination and attacks that East Asian in the `real world', computational tools for hate speech de- people are subjected to in the UK, describing Sinophobia as tection could not only be ineffective, they could potentially a problem that was `hidden from public view' [16]. Official perpetuate and deepen the social injustices they are designed government statistics on hate crimes are not currently avail- to address. Put simply, whilst machine learning presents able for East Asian prejudice as figures for racist attacks are many exciting research opportunities, it is no panacea and not broken down by type [17]). There is relatively little re- tools need to be developed in dialogue with social science search into the causal factors behind East Asian prejudice, research if they are to be effective [30]. although evidence suggests that some people may feel threat- ened by China's growing economic and military power [18]. New research related to COVID-19 has already provided 3 Dataset Collection insight into the nature, prevalence and dynamics of East Asian prejudice, with Schild et al. demonstrating an in- To create a 20,000 tweet training dataset, we collected crease in Sinophobic language on some social media plat- tweets from Twitter's Streaming API, using 14 hashtags forms, such as 4chan [19]. Analysis by the company Moon- which relate to both East Asia and the Virus: #chinavirus, shot CVE also suggests that the use of anti-Chinese hash- #wuhan, #wuhanvirus, #chinavirusoutbreak, #wuhan- tags has increased substantially. They analysed more than coronavirus, #wuhaninfluenza, #wuhansars, #chinacoron- 600 million tweets and found that 200,000 contained either avirus, #wuhan2020, #chinaflu, #wuhanquarantine, #chi- Sinophobic hate speech or conspiracy theories, and identi- nesepneumonia, #coronachina and #wohan. Some of these fied a 300% increase in hashtags that support or encourage hashtags express anti-East Asian sentiments (e.g. '#chi- violence against China during a single week in March 2020 naflu’) but others, such as '#wohan' are more neutral. Data [20]. East Asian prejudice has also been linked to the spread collection ran initially from 11 to 17 March 2020, and we col- of COVID-19 health-related misinformation [21]. In March lected 769,763 tweets, of which 96,283 were unique entries in 2020, the polling company YouGov found that 1 in 5 Brits English. To minimize biases which could emerge from col- believed the conspiracy theory that the coronavirus was de- lecting data over a relatively short period of time, we then veloped in a Chinese lab [22]. During this time of heightened collected tweets from 1st January to 10th March 2020 using tension, prejudice and misinformation could exacerbate and Twitter's `Decahose', provided by a third party. We iden- reinforce each other, making online spaces deeply unpleasant tified a further 63,037 unique tweets in English. The final and potentially even dangerous. database comprises 159,320 tweets. Research into computational tools for detecting, cate- We extracted the 1,000 most used hashtags from the gorizing and measuring hate speech has received substan- 159,320 tweets and three annotators independently marked tial attention in recent years, contributing to online content them up for: (1) whether they are East Asian relevant and, moderation processes in industry, and enabling new scien- if so, (2) what Asian entity is discussed (e.g.