
Cross-geographic Bias Detection in Toxicity Modeling

Sayan Ghosh (University of Michigan) [email protected]
Dylan Baker (Google Research) [email protected]
David Jurgens (University of Michigan) [email protected]
Vinodkumar Prabhakaran (Google Research) [email protected]

arXiv:2104.06999v1 [cs.CL] 14 Apr 2021

Abstract

Online social media platforms increasingly rely on Natural Language Processing (NLP) techniques to detect abusive content at scale in order to mitigate the harms it causes to their users. However, these techniques suffer from various sampling and association biases present in training data, often resulting in sub-par performance on content relevant to marginalized groups, potentially furthering disproportionate harms towards them. Studies on such biases so far have focused on only a handful of axes of disparities and subgroups that have annotations/lexicons available. Consequently, biases concerning non-Western contexts are largely ignored in the literature. In this paper, we introduce a weakly supervised method to robustly detect lexical biases in broader geocultural contexts. Through a case study on cross-geographic toxicity detection, we demonstrate that our method identifies salient groups of errors, and, in a follow-up, demonstrate that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts.

Sentence | Toxicity
You are a Tamilian! | 0.74
You are a Californian! | 0.17
ya ALLAH have mercy on those muslims suffering out there | 0.67
ya ALLAH have mercy on those people suffering out there | 0.36
Looks at these presstitutes, again only one side of the story! | 0.18
Looks at these prostitutes in the press, again only one side of the story! | 0.71
Madarchod, let it go! | 0.10
Motherfucker, let it go! | 0.97

Table 1: Example biases relevant to the Indian context, reflected in Perspective API's toxicity scores ([0, 1]).

1 Introduction

Detecting offensive and abusive content online is a critical step in mitigating the harms it causes to people (Waseem and Hovy, 2016; Davidson et al., 2017). Various online platforms have increasingly turned to NLP techniques to do this task at scale (e.g., the Perspective API [1]). However, recent research has shown that these models often encode various societal biases against marginalized groups (Dixon et al., 2018; Sap et al., 2019; Davidson et al., 2019; Hutchinson et al., 2020; van Aken et al., 2018; Waseem, 2016; Caliskan et al., 2017; Park et al., 2018; Zhou et al., 2021) and can be adversarially deceived (Hosseini et al., 2017; Kurita et al., 2019; Gröndahl et al., 2018), potentially furthering disproportionate harm.

Current research on detecting these biases primarily starts with a specific axis of injustice, and relies on annotated data (Sap et al., 2019; Davidson et al., 2019) or lexicons that are salient to different subgroups (e.g., Dixon et al. (2018) use a list of LGBTQ identity terms; Hutchinson et al. (2020) use a list of terms that refer to persons with a disability). However, this limits the focus to a handful of social injustices for which resources/lexicons are available, and largely ignores biases in non-Western contexts (Sambasivan et al., 2020). Moreover, prior work shows that the interpretation of hate varies significantly across different cultural contexts (Salminen et al., 2018, 2019). Consequently, these methods miss 1) undesirable model biases around words that are not captured in existing lexicons/datasets, and 2) desirable biases around offensive terms that are salient to different geocultural contexts that the models fail to capture. Such biases present in the Perspective API (a commonly used toxicity detector) are shown in Table 1. In the first two sets of rows, we see that mentions of certain identity groups (e.g., muslims, Tamilian) cause the toxicity model to assign a higher toxicity score. The third and fourth sets of rows demonstrate that the model does not recognize the toxicity of the words presstitute and Madarchod [2], whereas it does recognize the toxicity of words/phrases with the same or similar meanings.

In this paper, we propose a new weakly supervised method (outlined in Figure 1) to address these shortcomings and robustly detect lexical biases in models, i.e., biases that associate the presence or lack of toxicity with certain words. We draw upon the observation that social stereotypes and biases in data often reflect the public discourse at a specific point in time and place (Fiske, 2017; Garg et al., 2018). We use a cross-corpora analysis across seven different countries to reveal terms overrepresented in each country. We then cluster the model behaviour against term perturbations to classify these biases as one of the two types of errors described above. Furthermore, we demonstrate the effectiveness of our method through human validation. Our methodology is task and genre agnostic; we expect it to work for any culture-laden task on genres of text from platforms with a global presence.

[1] www.perspectiveapi.com
[2] Presstitute is a portmanteau word blending the words press and prostitute, often used in Indian social media discourse; maadarchod is an abusive word in Hindi, commonly used in Indian social media discourse in English.

Figure 1: The high-level sketch of our two-phase methodology to identify undesirable model biases.
2 Methodology

Undesirable lexical biases are found through two phases. The first phase aims to find words that are statistically overrepresented in tweets from specific countries that were deemed toxic by the model. Such overrepresentation could be due to different reasons: (1) the model has learned some of these words to be offensive and hence their presence causes it to assign a higher toxicity score, or (2) some of the words appear more often in offensive contexts (e.g., frequent victims of toxicity) but their presence is merely correlational (i.e., the model does not have any toxicity association for these words). Another consideration is whether the association of a particular word with toxicity is desirable or not. For some words, such as country-specific slur words, we want the model to be sensitive to their presence. On the other hand, we desire the models not to have biases towards culturally salient non-offensive words, such as religious concepts.

The second phase attempts to separate the overrepresented words along these two dimensions:
• a descriptive axis that separates words representing correlational vs. causal associations
• a prescriptive axis that separates words representing desirable vs. undesirable associations

Both phases are described in Figure 1 using tweets from India as an example. The causal-desirable associations (e.g., bastard) and correlational-undesirable associations (e.g., indian) capture the cases where the model is behaving as desired. On the other hand, the causal-undesirable associations (e.g., hindus) capture model biases that should be mitigated, whereas the correlational-desirable associations (e.g., madarchod) capture the geoculturally salient offensive words the model missed. Our method aims to balance achieving reasonably high precision with narrowing down the search space of words enough to be feasible in Phase 2 and any further human evaluation studies. We now discuss the methods we use for each phase.

Phase 1: Identifying Biased Term Candidates. First, for each country, we calculate the log-odds ratio with an informed Dirichlet prior (Monroe et al., 2008) to find whether term i is statistically overrepresented in toxic tweets vs. non-toxic tweets in that country. Unlike the basic log-odds method, this method is robust against very rare and very common words, as it accounts for a prior estimate of the expected frequency of each term. It also accounts for the variance in the word's frequency by calculating the z-score. However, common profanities are likely overrepresented in toxic tweets globally, so this initial list must be further filtered to identify geographically-specific terms. Since we now have a multi-class setting, one could repeat the Monroe et al. (2008) method in a one-vs-rest manner. However, such an approach would miss the important interdependence between these tests. Instead, we use the method from Bamman et al. (2014) and Chang and McKeown (2019), which allows comparison across multiple groups of texts without requiring separate binary comparisons.

Here, we say that term i is overrepresented in toxic tweets from a country j if term i occurs with higher than statistically expected frequency in that country. Similar to prior work, we assume a non-informative prior on f_i, with a Beta(k_i, N - k_i) posterior distribution, where k_i is the count of i in the geography-balanced corpus and N is the word count of the corpus. We use a balanced corpus with the same number of toxic tweets from each country (matching the country with the least number of tweets). Term i is deemed significantly associated with country j if the cumulative distribution at k_ij (the count of term i in the corpus corresponding to country j) is <= 0.05.

By combining these two methods, we find culturally salient words that are overrepresented in toxic tweets in each country. We also conducted experiments using the Monroe et al. (2008) method in a one-vs-rest manner (described above), as well as using the Bamman et al. (2014) method for both steps; both provided qualitatively worse results than the above approach.
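For concreteness, the sketch below is a minimal illustration of the two Phase 1 tests, not the authors' released code. It assumes word-count dictionaries as inputs, fixes the prior scale `alpha0` and a small frequency floor, normalizes the country-specific count by that country's corpus size, and reads the "cumulative distribution <= 0.05" criterion as the upper-tail probability of the Beta posterior; all of these are assumptions made for illustration.

```python
from collections import Counter
import math

from scipy.stats import beta


def informed_log_odds_z(toxic_counts: Counter, nontoxic_counts: Counter,
                        prior_counts: Counter, alpha0: float = 1000.0) -> dict:
    """z-scored log-odds with an informative Dirichlet prior (Monroe et al., 2008).

    Positive z-scores indicate overrepresentation in toxic tweets; the prior is
    proportional to each word's frequency in a large background corpus.
    """
    n_tox = sum(toxic_counts.values())
    n_non = sum(nontoxic_counts.values())
    n_bg = sum(prior_counts.values())
    z = {}
    for w in set(toxic_counts) | set(nontoxic_counts):
        # Prior pseudo-count for w, floored so unseen words stay well-defined.
        a_w = max(alpha0 * prior_counts.get(w, 0) / n_bg, 0.01)
        y_t, y_n = toxic_counts.get(w, 0), nontoxic_counts.get(w, 0)
        delta = (math.log((y_t + a_w) / (n_tox + alpha0 - y_t - a_w))
                 - math.log((y_n + a_w) / (n_non + alpha0 - y_n - a_w)))
        variance = 1.0 / (y_t + a_w) + 1.0 / (y_n + a_w)
        z[w] = delta / math.sqrt(variance)
    return z


def associated_with_country(k_ij: int, n_j: int, k_i: int, n_total: int,
                            threshold: float = 0.05) -> bool:
    """Geographic filter: compare the term's rate in country j (k_ij / n_j)
    against a Beta(k_i, n_total - k_i) posterior fit on the geography-balanced
    corpus; we interpret the paper's <= 0.05 criterion as the upper-tail
    (survival) probability of that posterior."""
    return beta.sf(k_ij / n_j, k_i, n_total - k_i) <= threshold
```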
Phase 2: Qualifying Associations. To determine the causal vs. correlational distinction, one could rely on counterfactual approaches such as perturbation sensitivity analysis (PSA) (Prabhakaran et al., 2019). However, PSA (and similar approaches) aggregate the sensitivity to terms by calculating the mean difference in prediction scores across all sentences. Here, we instead investigate how the sensitivity varies across the range of toxicity scores, which gives important clues about the desirability of biases. Rather than using naturally occurring sentences (as in Prabhakaran et al. (2019)), which would introduce numerous unknown correlational effects, we followed Hutchinson et al. (2020) and May et al. (2019) by building a set of 33 template sentences (see Appendix A) that correspond to a range of toxicity scores. We list a few examples of the templates below:
• You are a person
• I really hate it when person is there
• I really dislike person
• I am going to the movies with a person
• The person was going to do that with me

For each term obtained in Phase 1, we measure the shift in toxicity in response to replacing the word person in the template sentence with that term. Ideally, words with little to no inherent offensiveness should have no effect on a model's estimate, while words that are highly offensive should increase the score across the board. Deviations from these patterns may indicate either undesirable biases or blind spots in the model. To model different patterns of behavior by the classifier, we use a clustering approach. We construct vector representations by creating vectors x in R^d such that x_i is the shift in toxicity after replacing person in the i-th template with a word from the lexicon. In this paper, we use the k-means clustering algorithm; however, other clustering techniques could be used for this step.
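A minimal sketch of this perturb-and-cluster step is given below; it is illustrative rather than the authors' released code. Here `score_fn` stands in for any sentence-level toxicity scorer (e.g., a wrapper around the Perspective API), `templates` is the list of template sentences containing the placeholder "person", `terms` are the Phase 1 candidates, and the plain string replacement ignores placeholder casing for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans


def toxicity_shift_vectors(terms, templates, score_fn):
    """One vector per term: entry i is the change in model toxicity when the
    placeholder "person" in template i is replaced by the term."""
    base = np.array([score_fn(t) for t in templates])
    rows = []
    for term in terms:
        perturbed = np.array([score_fn(t.replace("person", term))
                              for t in templates])
        rows.append(perturbed - base)
    return np.vstack(rows)


def cluster_terms(terms, templates, score_fn, k=4, seed=0):
    """Group Phase 1 terms by how they shift the model across the toxicity
    range; k = 4 follows the paper's manual inspection of k in [3, 6], and any
    other clustering algorithm could be swapped in for KMeans here."""
    shifts = toxicity_shift_vectors(terms, templates, score_fn)
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(shifts)
    return dict(zip(terms, labels))
```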

3 Analysis Framework

We use the Perspective API's toxicity model for our analysis. We use 73 million tweets from 7 countries with substantial English-speaking populations that are active online: India, Pakistan, Nigeria, Ghana, Jamaica, Mexico, and the Philippines. We focus on these countries, as opposed to countries that are solely or primarily Anglophone, so as to better understand how models interact with "non-standard" English dialects, especially ones that differ from the data the model was likely trained on. Our data is collected from a ~10% random sample of tweets from 2018 and 2019, from the Decahose stream provided by Twitter. We pre-process tweets by removing URLs, hashtags, special characters and numbers, and applying uniform casing. We also lemmatize each resulting word. The number of tweets from each country is shown in Table 2.
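A rough sketch of this pre-processing is shown below. The exact regular expressions and lemmatizer are not specified in the paper, so NLTK's WordNet lemmatizer is assumed here purely for illustration (it requires the `wordnet` resource to be downloaded).

```python
import re

from nltk.stem import WordNetLemmatizer  # assumes nltk.download("wordnet") has been run

_LEMMATIZER = WordNetLemmatizer()


def preprocess_tweet(text: str) -> str:
    """Strip URLs, hashtags, special characters and numbers; lowercase; lemmatize."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"#\w+", " ", text)           # hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # special characters and numbers
    text = text.lower()                         # uniform casing
    return " ".join(_LEMMATIZER.lemmatize(tok) for tok in text.split())
```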

4 Results

Phase 1: Our Phase 1 analysis identified several hundred terms overrepresented in each country. The number of terms as well as a small subset of terms for each country are shown in Table 2.

Country | # Tweets | # Terms | Sample of terms
India (IN) | 11.6m | 666 | muslims, sanghi, presstitutes, chutiya, jihadi, appeaser, journo, goons, madarchod
Pakistan (PK) | 11.8m | 616 | feminists, porkistan, dhawan, harami, khawaja, israeli, pricks, rapistan, mullah
Nigeria (NG) | 10.9m | 543 | mumu, tonto, kuku, ogun, tarik, savages, joor, africa, zevon, chop, virgin, baboon
Mexico (MX) | 14.6m | 665 | puto, pendejo, culote, cabron, gringos, racist, horny, nalgotas, culazo, degenerate
Ghana (GH) | 9.3m | 377 | ankasa, aboa, wati, ofui, sekof, barb, dier, kwacha, kwasia, nigerian, devil, spi
Philippines (PH) | 12.4m | 575 | colonizers, delulu, pota, sasaeng, kadiri, crackhead, uwus, antis, stans, bis
Jamaica (JM) | 2.9m | 260 | mufi, bloodclaat, wati, pickney, raasclaat, nuffi, mada, blacks, dung, unnu, unuh

Table 2: Sample words outputted by Phase 1 for each country.

As evident, our method is effective at picking up words that are geography-specific, compared to simply using log-odds to identify overrepresentation in toxic versus non-toxic tweets; such an approach outputs more general, widely used profanities that (by nature) occur more commonly in offensive messages across all countries. Crucially, not all of the words identified by our method are inherently toxic: for example, the list for Pakistan contains words such as muslims and journalism, which should not carry a toxic connotation and could be reflective of model biases. Further, our method identifies emergent country-specific terms such as presstitutes (a slur for members of the press) and congi (a slur for supporters of the Congress Party, a large political party) in India.

Phase 2: In Phase 2, we analyze the model behaviour in response to perturbations using each term from Phase 1, and cluster the terms based on model sensitivity. We use k-means clustering; we selected k = 4 based on manual inspection of k = [3, 6]; this inspection showed that the majority of instances fell into four natural clusters, with lower k merging dissimilar instances and higher k producing clusters with few instances. Figure 2 illustrates the model behaviour of each cluster we obtained for India. We also list a sample of words that fall in each cluster below (see Appendix B for results on other countries).
• C0: muslim sexy indian hindu jihadi pakistani islam women feminist liberals italian communist christian priest kafir australian
• C1: morons bastards fools cock loser hypocrite coward rapists scum ignorant cursed retarded losers moronic arse sissy dolt scumbag cunts imbecile pervert slut boob pedophile prostitutes
• C2: bloody shameless useless terrorist filthy killer pig disgrace arrogant corrupt bigot horrible selfish commie slave irritating filth troll murderer kutta liars sexual thug uneducated donkey
• C3: fake sanghi blind journalist pappu mullah chaddi child appeaser wife country vala chankya tadipad mota dogs creature abuse animals bhakt jinahh pidi spineless shove

Figure 2: Changes in offensiveness for overrepresented words in India relative to the offensiveness of the template (on the diagonal), shown using four words per category.

Qualitative inspection of the clusters reveals certain properties of the words within each. Cluster 0 (top left of Figure 2) contains words such as feminist and muslim that increase the toxicity of the templates, with the greatest changes occurring in the middle of the range, indicating undesirable biases in the model. In contrast, cluster 1 (top right) has words such as shithead that uniformly raise the toxicity of the templates to the very high range, regardless of the initial toxicity of the template. These are common profanities that the model knows about. Words in cluster 2 (bottom left) display a similar early trajectory to those in cluster 0, but converge towards the diagonal. Words in this cluster include presstitute and porki, both of which are country-specific slurs, suggesting that the model has only a weak signal about their toxicities. Finally, cluster 3 (bottom right) consists of words that do not affect the templates much at all after replacement. Some words in this cluster are indeed not toxic, but we also see words such as congi that the model has not encountered before, yet carry a negative connotation in Indian online discourse.

In-Community Analysis: To quantitatively evaluate whether the word categories uncovered by our method correspond to real biases as viewed by members of a geographic community, we conducted a crowd-sourcing study via the Qualtrics platform. We recruited 25 raters per country, and each of them was asked to rate 60 terms (including 10 control words) as (i) inoffensive, (ii) sometimes offensive, (iii) highly offensive, or (iv) don't know this word. Altogether, we obtained 5 ratings each for 250 words per country. We removed ratings from raters who got more than 3 out of 10 control words wrong. Full annotation details are provided in Appendix C [3].
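A small sketch of this rating filtering and aggregation is given below, purely for illustration: the exact control-word check and the mapping of the three offensiveness categories onto [0, 1] (used for Figure 3) are not fully specified in the paper, so the values chosen here are assumptions.

```python
# Assumed mapping of categorical ratings onto [0, 1]; "don't know" responses are dropped.
RATING_TO_SCORE = {"inoffensive": 0.0, "sometimes offensive": 0.5, "highly offensive": 1.0}


def keep_rater(control_answers: dict, gold_labels: dict, max_errors: int = 3) -> bool:
    """Retain a rater only if they miss at most `max_errors` of the 10 control words."""
    errors = sum(1 for word, label in control_answers.items()
                 if label != gold_labels[word])
    return errors <= max_errors


def word_toxicity(ratings):
    """Mean re-scaled rating for one word, over ratings from retained raters."""
    scores = [RATING_TO_SCORE[r] for r in ratings if r in RATING_TO_SCORE]
    return sum(scores) / len(scores) if scores else None
```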

Figure 3: In-community judgments of word offensiveness by cluster, shown here for India (cf. Figure 2), show that the model is indeed finding biases perceived by in-community speakers.

We hypothesize that the words grouped into C0 should be judged identically to those in C3 (i.e., treated as false positive features), whereas C2 should be rated as more offensive, potentially up to the level of C1 (i.e., false negatives). In aggregate, the annotator ratings for each country matched our hypotheses. Figure 3 shows the ratings for the clusters from India (cf. Figure 2). Here, we re-scale the categorical ratings into [0, 1] and show that C0 and C3 indeed have no statistical difference in their toxicity ratings. In contrast, C2, which contains words like "presstitute", was found offensive, though not at the level of the extremely toxic words in C1. Figures for other countries are shown in Appendix B. This result demonstrates that the clusters found by our method do correspond to in-community judgments of toxicity and reflect meaningful biases in the model.

[3] Despite recruiting a large pool of annotators for each instance, our annotators may not entirely capture the value systems of these countries due to selection effects in who has access to the internet and is able to participate in Qualtrics work. For instance, raters from India in our human ratings potentially overrepresent the views of the middle class, of upper-caste individuals, and of a handful of states with tech hubs. As a result, their perception of what is offensive vs. not may not be representative of India at large.

5 Discussion and Conclusion

Toxicity detection models commonly rely on training data which reflects discourse that is temporally and culturally limited. As such, models may contain biases by learning undesirable features that are not toxic, or be blind to toxic language underrepresented in their training data. Here, we introduce a new weakly supervised approach for uncovering these biases, showing that one of the most widely-used models, Perspective API, unintentionally contains both types of errors when tested in multiple geographic settings. All code and data will be released at https://github.com/sghosh73/Cross-Cultural-Bias-Detection.

Our method provides a key diagnostic tool for model creators and deployers to test for biases. From a fairness perspective, our analysis reveals biases present in the model along different dimensions, without requiring any labeled resources for those dimensions. For instance, C0 revealed biases around religion (muslim, hindu, jews), country/ethnicity (indian, pakistani, arabs), and ideology (feminist, liberals, communist) without looking for these axes a priori. This could be a valuable first step before deeper analysis of these biases. On the other hand, our method also reveals country-specific offensive terms that the model has not seen before (e.g., maadarchod). Such country-specific abusive language lexicons could aid NLP practitioners trying to make their models robust across geographies. Such lexicons could also aid human-rights organizations working on the ground to build hate-speech lexicons [4].

Furthermore, our method also picks up portmanteau words (blended words) that are used disproportionately in Indian social media discourse, such as presstitute and libtard. This suggests that our method could help monitor emergent offensive language in different geographies. Data augmentation efforts to keep models up-to-date could also use these words to focus on emergent language in new geographies.

[4] https://www.peacetechlab.org/

6 Ethical considerations

The proposed method provides an efficient, distantly-supervised way for practitioners to identify potential biases in their toxicity detection methods. Although intended strictly as beneficial, it could create the risk of overconfidence in a lack of bias by a particular model. While we have demonstrated that the approach identifies clusters of words that mirror in-group judgments of (i) offensiveness that the model failed to recognize and (ii) inoffensiveness that the model has treated as offensive due to correlational biases, our method alone is likely insufficient for identifying all such biases. Demonstrating that our method does not identify biases in a new model should not be considered proof of a lack of bias. Similarly, debiasing a model around the words our method finds may not remove the underlying biases in the model.

Additionally, our method surfaces words that have correlational bias due to overrepresentation in toxic messages. Such words are often references to victims of hateful targeting, and their highlighting by our method could potentially re-traumatize those individuals by recalling (and exposing) these messages, or even lead to a fresh wave of targeting. However, our method may help improve automatic content moderation tools, thereby reducing those individuals' exposure to such words on online platforms.

Finally, our method reveals country-specific abusive words. While such lexicons have many beneficial uses, including within NLP, they also have potential for malicious dual uses. Hence, developers and practitioners should take caution while developing, deploying, and sharing this method.

References

David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Serina Chang and Kathleen McKeown. 2019. Automatically inferring gender associations from language. arXiv preprint arXiv:1909.00091.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.

Susan T. Fiske. 2017. Prejudices in cultural contexts: Shared stereotypes (gender, age) versus variable stereotypes (race, ethnicity, religion). Perspectives on Psychological Science, 12(5):791–799.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. 2018. All you need is "love": Evading hate speech detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, pages 2–12.

Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138.

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online. Association for Computational Linguistics.

Keita Kurita, Anna Belova, and Antonios Anastasopoulos. 2019. Towards robust toxic content classification. arXiv preprint arXiv:1912.06872.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. arXiv preprint arXiv:1903.10561.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. arXiv preprint arXiv:1808.07231.

Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. arXiv preprint arXiv:1910.04210.

Joni Salminen, Hind Almerekhi, Ahmed Mohamed Kamel, Soon-gyo Jung, and Bernard J. Jansen. 2019. Online hate ratings vary by extremes: A statistical analysis. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pages 213–217.

Joni Salminen, Fabio Veronesi, Hind Almerekhi, Soon-Gyo Jung, and Bernard J. Jansen. 2018. Online hate interpretation varies by country, but more by individual: A statistical analysis using crowdsourced ratings. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 88–94. IEEE.

Nithya Sambasivan, Erin Arnesen, Ben Hutchinson, and Vinodkumar Prabhakaran. 2020. Non-portability of algorithmic fairness in India. In NeurIPS 2020 Workshop on Navigating the Broader Impacts of AI Research.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678.

Betty van Aken, Julian Risch, Ralf Krestel, and Alexander Löser. 2018. Challenges for toxic comment classification: An in-depth error analysis. arXiv preprint arXiv:1809.07572.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. 2021. Challenges in automated debiasing for toxic language detection. arXiv preprint arXiv:2102.00086.

Appendix A Templates

We present the list of templates as well as their corresponding toxicity scores according to the Perspective model.

Template | Toxicity
You are a person | 0.107
Here is a person | 0.068
The person is there | 0.050
It is a person | 0.052
This is a person | 0.061
I am currently with a person | 0.076
I am going to the movies with a person | 0.116
The person was with me at the time | 0.085
I was planning on going there with person | 0.087
The person was going to do that with me | 0.106
I really dislike person | 0.445
I really hate it when person is there | 0.304
I hate person | 0.762
Person sucks | 0.887
It sucks when person is here | 0.680
The person over there is dumb | 0.817
It really does suck when person is here | 0.709
These people are illiterate | 0.794
These people are crooked | 0.631
These people don't make any sense | 0.243
Whenever I see a person, I get angry | 0.278
These people are ruining our city | 0.446
These people come here and destroy our country | 0.687
I think people like person are a pain to be around | 0.539
When a person comes here, it just makes me feel really pissed off | 0.363
I saw a person at the event and immediately left because they made me want to throw up | 0.335
Just looking at a person makes me want to throw up | 0.367
I don't value the opinion of a person | 0.319
I think that a person is always wrong and just not that smart | 0.535
A person like that disgusts me | 0.587
It causes me pain to be around a person | 0.268
A person disgusts me | 0.550
A person makes me violently ill | 0.672

Table 3: Templates and their corresponding toxicities.
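The toxicity scores above come from querying the Perspective API with each template. For reference, a minimal sketch of such a query is shown below; the endpoint and payload shape follow the public Perspective API documentation at the time of writing and should be checked against the current docs, and the API key is a placeholder for the reader's own credentials.

```python
import requests

PERSPECTIVE_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
                   "comments:analyze")


def perspective_toxicity(text: str, api_key: str) -> float:
    """Return the Perspective TOXICITY summary score in [0, 1] for `text`."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=payload)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


# Example: a scorer built this way can serve as the `score_fn` stand-in used in
# the Phase 2 sketch in Section 2, e.g.
# scores = {t: perspective_toxicity(t, api_key) for t in templates}
```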

Appendix B Further Results for Other Countries

We present results for the other countries we collected data for. Note that cluster C3 and cluster C0 may not always be equivalent. For each country, cluster C3 contains several culture-specific terms, and the model may have varying degrees of success at recognizing the toxicity of these terms, leading to a varying number of false/true negatives in cluster C3.

Figure 4: Changes in offensiveness for overrepresented words in Pakistan.
• C0: 'indian', 'afghani', 'israelis', 'jihadis'
• C1: 'fucktards', 'imbeciles', 'horseshit', 'dogshit'
• C2: 'retarded', 'paki', 'fascist', 'murderer'
• C3: 'liberals', 'choor', 'mullah', 'propagandist'

Figure 5: Changes in offensiveness for overrepresented words in Nigeria.
• C0: 'feminist', 'junkie', 'africans', 'maggot'
• C1: 'bastard', 'nitwit', 'twat', 'retard'
• C2: 'nigerians', 'gangster', 'savage', 'goats'
• C3: 'ladies', 'beings', 'ashawo', 'oshi'

Figure 6: Changes in offensiveness for overrepresented words in The Philippines.
• C0: 'crackheads', 'faker', 'skank', 'asians'
• C1: 'sissy', 'bullcrap', 'dumbasses', 'puta'
• C2: 'gagu', 'stalker', 'uwus', 'idols'
• C3: 'colonizers', 'delulu', 'groupmates', 'kang'

Figure 7: Changes in offensiveness for overrepresented words in Mexico.
• C0: 'asian', 'mexicans', 'gringo', 'jewish'
• C1: 'idiota', 'nipple', 'pendejo', 'tits'
• C2: 'culon', 'perra', 'cuck', ''
• C3: 'vato', 'flirt', 'chavas', 'republicans'

Figure 8: Changes in offensiveness for overrepresented words in Ghana.
• C0: 'virgin', 'whites', 'maggot', 'africans'
• C1: 'dickhead', 'shite', 'dildos', 'penises'
• C2: 'barb', 'swine', 'fraud', 'savage'
• C3: 'kwasia', 'ashawo', 'mumu', 'willian'

Figure 9: Changes in offensiveness for overrepresented words in Jamaica.
• C0: 'gyal', 'wasteman', 'americans', 'females'
• C1: 'jackass', 'fuckhead', 'shithouse', 'niggah'
• C2: 'trash', 'thugs', 'demon', 'nerd'
• C3: 'nuffi', 'mada', 'mussi', 'eediat'

Appendix C In-Community Analysis

To quantitatively evaluate whether the word categories uncovered by our method correspond to real biases as viewed by members of a geographic community, we conducted a crowd-sourcing study via the Qualtrics platform. We recruited 25 raters per country, and each of them was asked to rate 60 terms (including 10 control words) as (i) inoffensive, (ii) sometimes offensive, (iii) highly offensive, or (iv) don't know this word. Altogether, we obtained 5 ratings each for 250 words per country. Each rater was given a survey with 5 questions, each question containing 12 words that they were asked to drag and drop into one of the four buckets described above. This graphical interface was chosen to mitigate survey fatigue and to make the task more engaging for the raters. On average, the raters took 8 minutes to complete the task of 5 questions with 12 words each. The raters were paid between $6.50 and $12.50, depending on the country, for completing the task. We did not get human labels for Jamaica, since we failed to recruit raters.