
Cross-geographic Bias Detection in Toxicity Modeling

Sayan Ghosh (University of Michigan) [email protected]
Dylan Baker (Google Research) [email protected]
David Jurgens (University of Michigan) [email protected]
Vinodkumar Prabhakaran (Google Research) [email protected]

arXiv:2104.06999v1 [cs.CL] 14 Apr 2021

Abstract

Online social media platforms increasingly rely on Natural Language Processing (NLP) techniques to detect abusive content at scale in order to mitigate the harms it causes to their users. However, these techniques suffer from various sampling and association biases present in training data, often resulting in sub-par performance on content relevant to marginalized groups, potentially furthering disproportionate harms towards them. Studies on such biases so far have focused on only a handful of axes of disparities and subgroups that have annotations/lexicons available. Consequently, biases concerning non-Western contexts are largely ignored in the literature. In this paper, we introduce a weakly supervised method to robustly detect lexical biases in broader geocultural contexts. Through a case study on cross-geographic toxicity detection, we demonstrate that our method identifies salient groups of errors, and, in a follow-up, demonstrate that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts.

Sentence | Toxicity
You are a Tamilian! | 0.74
You are a Californian! | 0.17
ya ALLAH have mercy on those muslims suffering out there | 0.67
ya ALLAH have mercy on those people suffering out there | 0.36
Looks at these presstitutes, again only one side of the story! | 0.18
Looks at these prostitutes in the press, again only one side of the story! | 0.71
Madarchod, let it go! | 0.10
Motherfucker, let it go! | 0.97

Table 1: Example biases relevant to the Indian context, reflected in Perspective API's toxicity scores ([0, 1]).

1 Introduction

Detecting offensive and abusive content online is a critical step in mitigating the harms it causes to people (Waseem and Hovy, 2016; Davidson et al., 2017). Various online platforms have increasingly turned to NLP techniques to do this task at scale (e.g., the Perspective API [1]). However, recent research has shown that these models often encode various societal biases against marginalized groups (Dixon et al., 2018; Sap et al., 2019; Davidson et al., 2019; Hutchinson et al., 2020; van Aken et al., 2018; Waseem, 2016; Caliskan et al., 2017; Park et al., 2018; Zhou et al., 2021) and can be adversarially deceived (Hosseini et al., 2017; Kurita et al., 2019; Gröndahl et al., 2018), potentially furthering disproportionate harm.

Current research on detecting these biases primarily starts with a specific axis of injustice, and relies on annotated data (Sap et al., 2019; Davidson et al., 2019) or lexicons that are salient to different subgroups (e.g., Dixon et al. (2018) use a list of LGBTQ identity terms; Hutchinson et al. (2020) use a list of terms that refer to persons with a disability). However, this limits the focus to a handful of social injustices for which resources/lexicons are available, and largely ignores biases in non-Western contexts (Sambasivan et al., 2020). Moreover, prior work shows that the interpretation of hate varies significantly across different cultural contexts (Salminen et al., 2018, 2019). Consequently, these methods miss 1) undesirable model biases around words that are not captured in existing lexicons/datasets, and 2) desirable biases around offensive terms that are salient to different geocultural contexts that the models fail to capture. Such biases present in the Perspective API (a commonly used toxicity detector) are shown in Table 1. In the first two sets of rows, we see that mentions of certain identity groups (e.g., muslims, Tamilian) cause the toxicity model to assign a higher toxicity score. The third and fourth sets of rows demonstrate that the model does not recognize the toxicity of the words presstitute and Madarchod [2], whereas it does recognize the toxicity of words/phrases with the same or similar meanings.

In this paper, we propose a new weakly supervised method (outlined in Figure 1) to address these shortcomings and robustly detect lexical biases in models, i.e., biases that associate the presence or lack of toxicity with certain words. We draw upon the observation that social stereotypes and biases in data often reflect the public discourse at a specific point in time and place (Fiske, 2017; Garg et al., 2018). We use a cross-corpora analysis across seven different countries to reveal terms overrepresented in each country. We then cluster the model behaviour against term perturbations to classify these biases as one of the two types of errors described above. Furthermore, we demonstrate the effectiveness of our method through human validation. Our methodology is task and genre agnostic; we expect it to work for any culture-laden task on genres of text from platforms with a global presence.

[1] www.perspectiveapi.com
[2] Presstitute is a portmanteau word blending the words press and prostitute, often used in Indian social media discourse; maadarchod is an abusive word in Hindi, commonly used in Indian social media discourse in English.

Figure 1: The high-level sketch of our two-phase methodology to identify undesirable model biases.
2 Methodology

Undesirable lexical biases are found through two phases. The first phase aims to find words that are statistically overrepresented in tweets from specific countries that were deemed toxic by the model. Such overrepresentation could be due to different reasons: (1) the model has learned some of these words to be offensive and hence their presence causes it to assign a higher toxicity score, or (2) some of the words appear more often in offensive contexts (e.g., frequent victims of toxicity) but their presence is merely correlational (i.e., the model does not have any toxicity association for these words). Another consideration is whether the association of a particular word with toxicity is desirable or not. For some words, such as country-specific slur words, we want the model to be sensitive to their presence. On the other hand, we desire the models not to have biases towards culturally salient non-offensive words, such as religious concepts.

The second phase attempts to separate the overrepresented words along these two dimensions:
• a descriptive axis that separates words representing correlational vs. causal associations
• a prescriptive axis that separates words representing desirable vs. undesirable associations

Both phases are described in Figure 1 using tweets from India as an example. The causal-desirable associations (e.g., bastard) and correlational-undesirable associations (e.g., indian) capture the cases where the model is behaving as desired. On the other hand, the causal-undesirable associations (e.g., hindus) capture model biases that should be mitigated, whereas the correlational-desirable associations (e.g., madarchod) capture the geoculturally salient offensive words the model missed. Our method aims to balance achieving reasonably high precision with narrowing down the search space of words enough to be feasible in Phase 2 and any further human evaluation studies. We now discuss the methods we use for each phase.

Phase 1: Identifying Biased Term Candidates. First, for each country, we calculate the log-odds ratio with an informed Dirichlet prior (Monroe et al., 2008) to find whether term i is statistically overrepresented in toxic tweets vs. non-toxic tweets in that country. Unlike the basic log-odds method, this method is robust against very rare and very common words, as it accounts for a prior estimate of the expected frequency of each term. It also accounts for the variance in the word's frequency by calculating the z-score. However, common profanities are likely overrepresented in toxic tweets globally, so this initial list must be further filtered to identify geographically-specific terms. Since we now have a multi-class setting, one could repeat the Monroe et al. (2008) method in a one-vs-rest manner. However, such an approach would miss the important interdependence between these tests. Instead, we use the method from Bamman et al. (2014) and Chang and McKeown (2019), which allows comparison across multiple groups of texts without requiring separate binary comparisons.

Here, we say that term i is overrepresented in toxic tweets from a country j if term i occurs with higher than statistically expected frequency in that country. Similar to prior work, we assume a non-informative prior on f_i, with a Beta(k_i, N - k_i) posterior distribution, where k_i is the count of i in the geography-balanced corpus and N is the word count of the corpus. We use a balanced corpus with the same number of toxic tweets from each country (matching the country with the least number of tweets). Term i is deemed significantly associated with country j if the cumulative distribution at k_ij (the count of term i in the corpus corresponding to country j) is <= 0.05.

By combining these two methods, we find culturally salient words that are overrepresented in toxic tweets in each country. We also conducted experiments using the Monroe et al. (2008) method in a one-vs-rest manner (described above), as well as using the Bamman et al. (2014) method for both steps; both provided qualitatively worse results than the above approach.
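For concreteness, the sketch below is a minimal illustration of the two Phase 1 tests, not the authors' released code. It assumes word-count dictionaries as inputs, fixes the prior scale `alpha0` and a small frequency floor, normalizes the country-specific count by that country's corpus size, and reads the "cumulative distribution <= 0.05" criterion as the upper-tail probability of the Beta posterior; all of these are assumptions made for illustration.

```python
from collections import Counter
import math

from scipy.stats import beta


def informed_log_odds_z(toxic_counts: Counter, nontoxic_counts: Counter,
                        prior_counts: Counter, alpha0: float = 1000.0) -> dict:
    """z-scored log-odds with an informative Dirichlet prior (Monroe et al., 2008).

    Positive z-scores indicate overrepresentation in toxic tweets; the prior is
    proportional to each word's frequency in a large background corpus.
    """
    n_tox = sum(toxic_counts.values())
    n_non = sum(nontoxic_counts.values())
    n_bg = sum(prior_counts.values())
    z = {}
    for w in set(toxic_counts) | set(nontoxic_counts):
        # Prior pseudo-count for w, floored so unseen words stay well-defined.
        a_w = max(alpha0 * prior_counts.get(w, 0) / n_bg, 0.01)
        y_t, y_n = toxic_counts.get(w, 0), nontoxic_counts.get(w, 0)
        delta = (math.log((y_t + a_w) / (n_tox + alpha0 - y_t - a_w))
                 - math.log((y_n + a_w) / (n_non + alpha0 - y_n - a_w)))
        variance = 1.0 / (y_t + a_w) + 1.0 / (y_n + a_w)
        z[w] = delta / math.sqrt(variance)
    return z


def associated_with_country(k_ij: int, n_j: int, k_i: int, n_total: int,
                            threshold: float = 0.05) -> bool:
    """Geographic filter: compare the term's rate in country j (k_ij / n_j)
    against a Beta(k_i, n_total - k_i) posterior fit on the geography-balanced
    corpus; we interpret the paper's <= 0.05 criterion as the upper-tail
    (survival) probability of that posterior."""
    return beta.sf(k_ij / n_j, k_i, n_total - k_i) <= threshold
```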
Phase 2: Qualifying Associations. To determine the causal vs. correlational distinction, one could rely on counterfactual approaches such as perturbation sensitivity analysis (PSA) (Prabhakaran et al., 2019). However, PSA (and similar approaches) aggregate the sensitivity to terms by calculating the mean difference in prediction scores across all sentences. Here, we instead investigate how the sensitivity varies across the range of toxicity scores, which gives important clues about the desirability of biases. Rather than using naturally occurring sentences (as in Prabhakaran et al. (2019)), which would introduce numerous unknown correlational effects, we followed Hutchinson et al. (2020) and May et al. (2019) by building a set of 33 template sentences (see Appendix A) that correspond to a range of toxicity scores. We list a few examples of the templates below:
• You are a person
• I really hate it when person is there
• I really dislike person
• I am going to the movies with a person
• The person was going to do that with me

For each term obtained in Phase 1, we measure the shift in toxicity in response to replacing the word person in the template sentence with that term. Ideally, words with little to no inherent offensiveness should have no effect on a model's estimate, while words that are highly offensive should increase the score across the board. Deviations from these patterns may indicate either undesirable biases or blind spots in the model. To model different patterns of behavior by the classifier, we use a clustering approach. We construct vector representations by creating vectors x in R^d such that x_i is the shift in toxicity after replacing person in the i-th template with a word from the lexicon. In this paper, we use the k-means clustering algorithm; however, other clustering techniques could be used for this step.
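A minimal sketch of this perturb-and-cluster step is given below; it is illustrative rather than the authors' released code. Here `score_fn` stands in for any sentence-level toxicity scorer (e.g., a wrapper around the Perspective API), `templates` is the list of template sentences containing the placeholder "person", `terms` are the Phase 1 candidates, and the plain string replacement ignores placeholder casing for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans


def toxicity_shift_vectors(terms, templates, score_fn):
    """One vector per term: entry i is the change in model toxicity when the
    placeholder "person" in template i is replaced by the term."""
    base = np.array([score_fn(t) for t in templates])
    rows = []
    for term in terms:
        perturbed = np.array([score_fn(t.replace("person", term))
                              for t in templates])
        rows.append(perturbed - base)
    return np.vstack(rows)


def cluster_terms(terms, templates, score_fn, k=4, seed=0):
    """Group Phase 1 terms by how they shift the model across the toxicity
    range; k = 4 follows the paper's manual inspection of k in [3, 6], and any
    other clustering algorithm could be swapped in for KMeans here."""
    shifts = toxicity_shift_vectors(terms, templates, score_fn)
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(shifts)
    return dict(zip(terms, labels))
```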

3 Analysis Framework

We use the Perspective API's toxicity model for our analysis. We use 73 million tweets from 7 countries with substantial English-speaking populations that are active online: India, Pakistan, Nigeria, Ghana, Jamaica, Mexico, and the Philippines. We focus on these countries, as opposed to countries that are solely or primarily Anglophone, so as to better understand how models interact with "non-standard" English dialects, especially ones that differ from the data the model was likely trained on. Our data is collected from a ~10% random sample of tweets from 2018 and 2019, from the Decahose stream provided by Twitter. We pre-process tweets by removing URLs, hashtags, special characters and numbers, and applying uniform casing. We also lemmatize each resulting word. The number of tweets from each country is shown in Table 2.
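A rough sketch of this pre-processing is shown below. The exact regular expressions and lemmatizer are not specified in the paper, so NLTK's WordNet lemmatizer is assumed here purely for illustration (it requires the `wordnet` resource to be downloaded).

```python
import re

from nltk.stem import WordNetLemmatizer  # assumes nltk.download("wordnet") has been run

_LEMMATIZER = WordNetLemmatizer()


def preprocess_tweet(text: str) -> str:
    """Strip URLs, hashtags, special characters and numbers; lowercase; lemmatize."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"#\w+", " ", text)           # hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # special characters and numbers
    text = text.lower()                         # uniform casing
    return " ".join(_LEMMATIZER.lemmatize(tok) for tok in text.split())
```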

4 Results

Phase 1: Our Phase 1 analysis identified several hundred terms overrepresented in each country. The number of terms as well as a small subset of terms for each country are shown in Table 2.

Country | # Tweets | # Terms | Sample of terms
India (IN) | 11.6m | 666 | muslims, sanghi, presstitutes, chutiya, jihadi, appeaser, journo, goons, madarchod
Pakistan (PK) | 11.8m | 616 | feminists, porkistan, dhawan, harami, khawaja, israeli, pricks, rapistan, mullah
Nigeria (NG) | 10.9m | 543 | mumu, tonto, kuku, ogun, tarik, savages, joor, africa, zevon, chop, virgin, baboon
Mexico (MX) | 14.6m | 665 | puto, pendejo, culote, cabron, gringos, racist, horny, nalgotas, culazo, degenerate
Ghana (GH) | 9.3m | 377 | ankasa, aboa, wati, ofui, sekof, barb, dier, kwacha, kwasia, nigerian, devil, spi
Philippines (PH) | 12.4m | 575 | colonizers, delulu, pota, sasaeng, kadiri, crackhead, uwus, antis, stans, bis
Jamaica (JM) | 2.9m | 260 | mufi, bloodclaat, wati, pickney, raasclaat, nuffi, mada, blacks, dung, unnu, unuh

Table 2: Sample words outputted by Phase 1 for each country.

As evident, our method is effective at picking up words that are geography-specific, compared to simply using log-odds to identify overrepresentation in toxic versus non-toxic tweets; such an approach outputs more general, widely used profanities that (by nature) occur more commonly in offensive messages across all countries. Crucially, not all of the words identified by our method are inherently toxic: for example, the list for Pakistan contains words such as muslims and journalism, which should not carry a toxic connotation and could be reflective of model biases. Further, our method identifies emergent country-specific terms such as presstitutes (a slur for members of the press) and congi (a slur for supporters of the Congress Party, a large political party) in India.

Phase 2: In Phase 2, we analyze the model behaviour in response to perturbations using each term from Phase 1, and cluster the terms based on model sensitivity. We use k-means clustering; we selected k = 4 based on manual inspection of k = [3, 6]; this inspection showed that the majority of instances fell into four natural clusters, with lower k merging dissimilar instances and higher k producing clusters with few instances. Figure 2 illustrates the model behaviour of each cluster we obtained for India. We also list a sample of words that fall in each cluster below (see Appendix B for results on other countries).
• C0: muslim sexy indian hindu jihadi pakistani islam women feminist liberals italian communist christian priest kafir australian
• C1: morons bastards fools cock loser hypocrite coward rapists scum ignorant cursed retarded losers moronic arse sissy dolt scumbag cunts imbecile pervert slut boob pedophile prostitutes
• C2: bloody shameless useless terrorist filthy killer pig disgrace arrogant corrupt bigot horrible selfish commie slave irritating filth troll murderer kutta liars sexual thug uneducated donkey
• C3: fake sanghi blind journalist pappu mullah chaddi child appeaser wife country vala chankya tadipad mota dogs creature abuse animals bhakt jinahh pidi spineless shove

Figure 2: Changes in offensiveness for overrepresented words in India relative to the offensiveness of the template (on the diagonal), shown using four words per category.

Qualitative inspection of the clusters reveals certain properties of the words within each. Cluster 0 (top left of Figure 2) contains words such as feminist and muslim that increase the toxicity of the templates, with the greatest changes occurring in the middle of the range, indicating undesirable biases in the model. In contrast, cluster 1 (top right) has words such as shithead that uniformly raise the toxicity of the templates to the very high range, regardless of the initial toxicity of the template. These are common profanities that the model knows about. Words in cluster 2 (bottom left) display a similar early trajectory to those in cluster 0, but converge towards the diagonal. Words in this cluster include presstitute and porki, both of which are country-specific slurs, suggesting that the model has only a weak signal about their toxicities. Finally, cluster 3 (bottom right) consists of words that do not affect the templates much at all after replacement. Some words in this cluster are indeed not toxic, but we also see words such as congi that the model has not encountered before, yet carry a negative connotation in Indian online discourse.

In-Community Analysis: To quantitatively evaluate whether the word categories uncovered by our method correspond to real biases as viewed by members of a geographic community, we conducted a crowd-sourcing study via the Qualtrics platform. We recruited 25 raters per country, and each of them was asked to rate 60 terms (including 10 control words) as (i) inoffensive, (ii) sometimes offensive, (iii) highly offensive, or (iv) don't know this word. Altogether, we obtained 5 ratings each for 250 words per country. We removed ratings from raters who got more than 3 out of 10 control words wrong. Full annotation details are provided in Appendix C [3].
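A small sketch of this rating filtering and aggregation is given below, purely for illustration: the exact control-word check and the mapping of the three offensiveness categories onto [0, 1] (used for Figure 3) are not fully specified in the paper, so the values chosen here are assumptions.

```python
# Assumed mapping of categorical ratings onto [0, 1]; "don't know" responses are dropped.
RATING_TO_SCORE = {"inoffensive": 0.0, "sometimes offensive": 0.5, "highly offensive": 1.0}


def keep_rater(control_answers: dict, gold_labels: dict, max_errors: int = 3) -> bool:
    """Retain a rater only if they miss at most `max_errors` of the 10 control words."""
    errors = sum(1 for word, label in control_answers.items()
                 if label != gold_labels[word])
    return errors <= max_errors


def word_toxicity(ratings):
    """Mean re-scaled rating for one word, over ratings from retained raters."""
    scores = [RATING_TO_SCORE[r] for r in ratings if r in RATING_TO_SCORE]
    return sum(scores) / len(scores) if scores else None
```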

Figure 3: In-community judgments of word offensiveness by cluster, shown here for India (cf. Figure 2), show that the model is indeed finding biases perceived by in-community speakers.

We hypothesize that the words grouped into C0 should be judged identically to those in C3 (i.e., treated as false positive features), whereas C2 should be rated as more offensive, potentially up to the level of C1 (i.e., false negatives). In aggregate, the annotator ratings for each country matched our hypotheses. Figure 3 shows the ratings for the clusters from India (cf. Figure 2). Here, we re-scale the categorical ratings into [0, 1] and show that C0 and C3 indeed have no statistical difference in their toxicity ratings. In contrast, C2, which contains words like "presstitute", was found offensive, though not at the level of the extremely toxic words in C1. Figures for other countries are shown in Appendix B. This result demonstrates that the clusters found by our method do correspond to in-community judgments of toxicity and reflect meaningful biases in the model.

[3] Despite recruiting a large pool of annotators for each instance, our annotators may not entirely capture the value systems of these countries due to selection effects in who has access to the internet and is able to participate in Qualtrics work. For instance, raters from India in our human ratings potentially overrepresent the views of the middle class, of upper-caste individuals, and of a handful of states with tech hubs. As a result, their perception of what is offensive vs. not may not be representative of India at large.

5 Discussion and Conclusion

Toxicity detection models commonly rely on training data which reflects discourse that is temporally and culturally limited. As such, models may contain biases by learning undesirable features that are not toxic, or be blind to toxic language underrepresented in their training data. Here, we introduce a new weakly supervised approach for uncovering these biases, showing that one of the most widely-used models, Perspective API, unintentionally contains both types of errors when tested in multiple geographic settings. All code and data will be released at https://github.com/sghosh73/Cross-Cultural-Bias-Detection.

Our method provides a key diagnostic tool for model creators and deployers to test for biases. From a fairness perspective, our analysis reveals biases present in the model along different dimensions, without requiring any labeled resources for those dimensions. For instance, C0 revealed biases around religion (muslim, hindu, jews), country/ethnicity (indian, pakistani, arabs), and ideology (feminist, liberals, communist) without looking for these axes a priori. This could be a valuable first step before deeper analysis of these biases. On the other hand, our method also reveals country-specific offensive terms that the model has not seen before (e.g., maadarchod). Such country-specific abusive language lexicons could aid NLP practitioners trying to make their models robust across geographies. Such lexicons could also aid human-rights organizations working on the ground to build hate-speech lexicons [4].

Furthermore, our method also picks up portmanteau words (blended words) that are used disproportionately in Indian social media discourse, such as presstitute and libtard. This suggests that our method could help monitor emergent offensive language in different geographies. Data augmentation efforts to keep models up-to-date could also use these words to focus on emergent language in new geographies.

[4] https://www.peacetechlab.org/

6 Ethical considerations

The proposed method provides an efficient, distantly-supervised way for practitioners to identify potential biases in their toxicity detection methods. Although intended strictly as beneficial, it could create the risk of overconfidence in a lack of bias by a particular model. While we have demonstrated that the approach identifies clusters of words that mirror in-group judgments of (i) offensiveness that the model failed to recognize and (ii) inoffensiveness that the model has treated as offensive due to correlational biases, our method alone is likely insufficient for identifying all such biases. Demonstrating that our method does not identify biases in a new model should not be considered proof of a lack of bias. Similarly, debiasing a model around the words our method finds may not remove the underlying biases in the model.

Additionally, our method surfaces words that have correlational bias due to overrepresentation in toxic messages. Such words are often references to victims of hateful targeting, and their highlighting by our method could potentially re-traumatize those individuals by recalling (and exposing) these messages, or even lead to a fresh wave of targeting. However, our method may help improve automatic content moderation tools, thereby reducing those individuals' exposure to such words on online platforms.

Finally, our method reveals country-specific abusive words. While such lexicons have many beneficial uses, including within NLP, they also have potential for malicious dual uses. Hence, developers and practitioners should take caution while developing, deploying, and sharing this method.

References

David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Serina Chang and Kathleen McKeown. 2019. Automatically inferring gender associations from language. arXiv preprint arXiv:1909.00091.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.

Susan T. Fiske. 2017. Prejudices in cultural contexts: Shared stereotypes (gender, age) versus variable stereotypes (race, ethnicity, religion). Perspectives on Psychological Science, 12(5):791–799.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. 2018. All you need is "love": Evading hate speech detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, pages 2–12.

Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138.

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online. Association for Computational Linguistics.

Keita Kurita, Anna Belova, and Antonios Anastasopoulos. 2019. Towards robust toxic content classification. arXiv preprint arXiv:1912.06872.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. arXiv preprint arXiv:1903.10561.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. arXiv preprint arXiv:1808.07231.

Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. arXiv preprint arXiv:1910.04210.

Joni Salminen, Hind Almerekhi, Ahmed Mohamed Kamel, Soon-gyo Jung, and Bernard J. Jansen. 2019. Online hate ratings vary by extremes: A statistical analysis. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pages 213–217.

Joni Salminen, Fabio Veronesi, Hind Almerekhi, Soon-Gyo Jung, and Bernard J. Jansen. 2018. Online hate interpretation varies by country, but more by individual: A statistical analysis using crowdsourced ratings. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 88–94. IEEE.

Nithya Sambasivan, Erin Arnesen, Ben Hutchinson, and Vinodkumar Prabhakaran. 2020. Non-portability of algorithmic fairness in India. In NeurIPS 2020 Workshop on Navigating the Broader Impacts of AI Research.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678.

Betty van Aken, Julian Risch, Ralf Krestel, and Alexander Löser. 2018. Challenges for toxic comment classification: An in-depth error analysis. arXiv preprint arXiv:1809.07572.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. 2021. Challenges in automated debiasing for toxic language detection. arXiv preprint arXiv:2102.00086.

Appendix A Templates

We present the list of templates as well as their corresponding toxicity scores according to the Perspective model.

Template | Toxicity
You are a person | 0.107
Here is a person | 0.068
The person is there | 0.050
It is a person | 0.052
This is a person | 0.061
I am currently with a person | 0.076
I am going to the movies with a person | 0.116
The person was with me at the time | 0.085
I was planning on going there with person | 0.087
The person was going to do that with me | 0.106
I really dislike person | 0.445
I really hate it when person is there | 0.304
I hate person | 0.762
Person sucks | 0.887
It sucks when person is here | 0.680
The person over there is dumb | 0.817
It really does suck when person is here | 0.709
These people are illiterate | 0.794
These people are crooked | 0.631
These people don't make any sense | 0.243
Whenever I see a person, I get angry | 0.278
These people are ruining our city | 0.446
These people come here and destroy our country | 0.687
I think people like person are a pain to be around | 0.539
When a person comes here, it just makes me feel really pissed off | 0.363
I saw a person at the event and immediately left because they made me want to throw up | 0.335
Just looking at a person makes me want to throw up | 0.367
I don't value the opinion of a person | 0.319
I think that a person is always wrong and just not that smart | 0.535
A person like that disgusts me | 0.587
It causes me pain to be around a person | 0.268
A person disgusts me | 0.550
A person makes me violently ill | 0.672

Table 3: Templates and their corresponding toxicities.
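The toxicity scores above come from querying the Perspective API with each template. For reference, a minimal sketch of such a query is shown below; the endpoint and payload shape follow the public Perspective API documentation at the time of writing and should be checked against the current docs, and the API key is a placeholder for the reader's own credentials.

```python
import requests

PERSPECTIVE_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
                   "comments:analyze")


def perspective_toxicity(text: str, api_key: str) -> float:
    """Return the Perspective TOXICITY summary score in [0, 1] for `text`."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=payload)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


# Example: a scorer built this way can serve as the `score_fn` stand-in used in
# the Phase 2 sketch in Section 2, e.g.
# scores = {t: perspective_toxicity(t, api_key) for t in templates}
```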

Appendix B Further Results for Other Countries

We present results for the other countries we collected data for. Note that cluster C3 and cluster C0 may not always be equivalent. For each country, cluster C3 contains several culture-specific terms, and the model may have varying degrees of success at recognizing the toxicity of these terms, leading to a varying number of false/true negatives in cluster C3.

Figure 4: Changes in offensiveness for overrepresented words in Pakistan.
• C0: 'indian', 'afghani', 'israelis', 'jihadis'
• C1: 'fucktards', 'imbeciles', 'horseshit', 'dogshit'
• C2: 'retarded', 'paki', 'fascist', 'murderer'
• C3: 'liberals', 'choor', 'mullah', 'propagandist'

Figure 5: Changes in offensiveness for overrepresented words in Nigeria.
• C0: 'feminist', 'junkie', 'africans', 'maggot'
• C1: 'bastard', 'nitwit', 'twat', 'retard'
• C2: 'nigerians', 'gangster', 'savage', 'goats'
• C3: 'ladies', 'beings', 'ashawo', 'oshi'

Figure 6: Changes in offensiveness for overrepresented words in The Philippines.
• C0: 'crackheads', 'faker', 'skank', 'asians'
• C1: 'sissy', 'bullcrap', 'dumbasses', 'puta'
• C2: 'gagu', 'stalker', 'uwus', 'idols'
• C3: 'colonizers', 'delulu', 'groupmates', 'kang'

Figure 7: Changes in offensiveness for overrepresented words in Mexico.
• C0: 'asian', 'mexicans', 'gringo', 'jewish'
• C1: 'idiota', 'nipple', 'pendejo', 'tits'
• C2: 'culon', 'perra', 'cuck', ''
• C3: 'vato', 'flirt', 'chavas', 'republicans'

Figure 8: Changes in offensiveness for overrepresented words in Ghana.
• C0: 'virgin', 'whites', 'maggot', 'africans'
• C1: 'dickhead', 'shite', 'dildos', 'penises'
• C2: 'barb', 'swine', 'fraud', 'savage'
• C3: 'kwasia', 'ashawo', 'mumu', 'willian'

Figure 9: Changes in offensiveness for overrepresented words in Jamaica.
• C0: 'gyal', 'wasteman', 'americans', 'females'
• C1: 'jackass', 'fuckhead', 'shithouse', 'niggah'
• C2: 'trash', 'thugs', 'demon', 'nerd'
• C3: 'nuffi', 'mada', 'mussi', 'eediat'

Appendix C In-Community Analysis

To quantitatively evaluate whether the word categories uncovered by our method correspond to real biases as viewed by members of a geographic community, we conducted a crowd-sourcing study via the Qualtrics platform. We recruited 25 raters per country, and each of them was asked to rate 60 terms (including 10 control words) as (i) inoffensive, (ii) sometimes offensive, (iii) highly offensive, or (iv) don't know this word. Altogether, we obtained 5 ratings each for 250 words per country. Each rater was given a survey with 5 questions, each question containing 12 words that they were asked to drag and drop into one of the four buckets described above. This graphical interface was chosen to mitigate survey fatigue and to make the task more engaging for the raters. On average, the raters took 8 minutes to complete the task of 5 questions with 12 words each. The raters were paid between $6.50 and $12.50, depending on the country, for completing the task. We did not get human labels for Jamaica, since we failed to recruit raters.