Comparative Evaluation of Label-Agnostic Selection Bias in Multilingual Hate Speech Datasets
Nedjma Ousidhoum, Yangqiu Song, Dit-Yan Yeung
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

Abstract

Work on bias in hate speech typically aims to improve classification performance while relatively overlooking the quality of the data. We examine selection bias in hate speech in a language- and label-independent fashion. We first use topic models to discover latent semantics in eleven hate speech corpora; then, we present two bias evaluation metrics based on the semantic similarity between topics and the search words frequently used to build corpora. We discuss the possibility of revising the data collection process by comparing datasets and analyzing contrastive case studies.

1 Introduction

Hate speech in social media dehumanizes minorities through direct attacks or incitement to defamation and aggression. Despite its scarcity in comparison to normal web content, Mathew et al. (2019) demonstrated that it tends to reach large audiences faster due to the dense connections between users who share such content. Hence, a search based on generic hate speech keywords or controversial hashtags may result in a set of social media posts generated by a limited number of users (Arango et al., 2019). This would lead to an inherent bias in hate speech datasets, similar to other tasks involving social data (Olteanu et al., 2019), rather than a selection bias (Heckman, 1977) particular to hate speech data.

Mitigation methods usually focus on classification performance and investigate how to debias detection given false positives caused by gender group identity words such as "women" (Park et al., 2018), racial terms reclaimed by communities in certain contexts (Davidson et al., 2019), or names of groups that belong to the intersection of gender and racial terms such as "black men" (Kim et al., 2020). The various aspects of dataset construction are less studied, although it has recently been shown, by looking at historical documents, that we may somehow neglect the data collection process (Jo and Gebru, 2020). Thus, in the present work, we are interested in improving hate speech data collection with evaluation before focusing on classification performance.

We conduct a comparative study on English, French, German, Arabic, Italian, Portuguese, and Indonesian datasets using topic models, specifically Latent Dirichlet Allocation (LDA) (Blei et al., 2003). We use multilingual word embeddings or word associations to compute semantic similarity scores between topic words and predefined keywords, and we define two metrics that quantify bias in hate speech based on these measures. We use the same lists of keywords reported by Ross et al. (2016) for German, Sanguinetti et al. (2018) for Italian, Ibrohim and Budi (2019) for Indonesian, and Fortuna et al. (2019) for Portuguese; we allow more flexibility in both English (Waseem and Hovy, 2016; Founta et al., 2018; Ousidhoum et al., 2019) and Arabic (Albadi et al., 2018; Mulki et al., 2019; Ousidhoum et al., 2019) in order to compare different datasets based on shared concepts reported in their respective paper descriptions; and for French, we make use of a subset of keywords that covers most of the targets reported by Ousidhoum et al. (2019). Our first bias evaluation metric measures the average similarity between topics and the whole set of keywords, and the second evaluates how often keywords appear in topics. We analyze our methods through different use cases which show how one can benefit from the assessment.
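Concretely, one plausible formalization of these two metrics is the following sketch, assuming cosine similarity sim(w, s) between the embeddings of a topic word w and a keyword s, K topics t_1, ..., t_K of n top words each, and a keyword set S; this is an illustrative reading of the description above, not necessarily the paper's exact equations:

\[
\mathrm{Bias}_{\mathrm{sim}} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{n\,|S|} \sum_{w \in t_k}\sum_{s \in S} \mathrm{sim}(w, s),
\qquad
\mathrm{Bias}_{\mathrm{freq}} = \frac{1}{K}\sum_{k=1}^{K} \frac{\lvert\{\, w \in t_k : w \in S \,\}\rvert}{n}.
\]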
Our main contributions consist of (1) designing bias metrics that evaluate hateful web content using topic models; (2) examining selection bias in eleven datasets; and (3) turning present hate speech corpora into an insightful resource that may help us balance training data and reduce bias in the future.

2 Related Work

Hate speech labeling schemes depend on the general purpose of the dataset. The annotations may include hateful vs. non-hateful (Basile et al., 2019) or racist, sexist, and none (Waseem and Hovy, 2016), as well as discriminated target attributes (ElSherief et al., 2018), the degree of intensity (Sanguinetti et al., 2018), and the annotator's sentiment towards the tweets (Ousidhoum et al., 2019). Besides English (Basile et al., 2019; Waseem and Hovy, 2016; Davidson et al., 2017; ElSherief et al., 2018; Founta et al., 2018; Qian et al., 2018), we notice a growing interest in the study of hate speech in other languages, such as Portuguese (Fortuna et al., 2019), Italian (Sanguinetti et al., 2018), German (Ross et al., 2016), Indonesian (Ibrohim and Budi, 2019), French (Ousidhoum et al., 2019), Dutch (Hee et al., 2015), and Arabic (Albadi et al., 2018; Mulki et al., 2019; Ousidhoum et al., 2019). Challenging questions tackled in this area involve the way abusive language spreads online (Mathew et al., 2019), fast-changing topics during data collection (Liu et al., 2019), user bias in publicly available datasets (Arango et al., 2019), and bias in hate speech classification together with methods to reduce it (Park et al., 2018; Davidson et al., 2019; Kennedy et al., 2020).

Bias in social data is broad and covers a wide range of issues (Olteanu et al., 2019; Papakyriakopoulos et al., 2020). Shah et al. (2020) present a framework to predict the origin of different types of bias, including label bias (Sap et al., 2019), selection bias (Garimella et al., 2019), model over-amplification (Zhao et al., 2017), and semantic bias (Garg et al., 2018). Existing work deals with bias through the construction of large datasets and the definition of social frames (Sap et al., 2020), the investigation of how current NLP models might be non-inclusive of marginalized groups such as people with disabilities (Hutchinson et al., 2020), mitigation (Dixon et al., 2018; Sun et al., 2019), or better data splits (Gorman and Bedrick, 2019). However, Blodgett et al. (2020) report a missing normative process for inspecting the initial reasons behind bias in NLP without the main focus being on performance, which is why we choose to investigate the data collection process in the first place.

In order to operationalize the evaluation of selection bias, we use topic models to capture latent semantics. Regularly used topic modeling techniques such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have proven efficient in several NLP applications such as data exploration (Rodriguez and Storer, 2020), Twitter hashtag recommendation (Godin et al., 2013), authorship attribution (Seroussi et al., 2014), and text categorization (Zhou et al., 2009).

In order to evaluate the consistency of the generated topics, Newman et al. (2010) used crowdsourcing and semantic similarity metrics, essentially based on Pointwise Mutual Information (PMI), to assess coherence; Mimno et al. (2011) estimated coherence scores using conditional log-probability instead of PMI; Lau et al. (2014) enhanced this formulation based on normalized PMI (NPMI); and Lau and Baldwin (2016) investigated the effect of cardinality on topic generation. Similarly, we use topics and semantic similarity metrics to determine the quality of hate speech datasets, and we test on corpora that vary in language, size, and general collection purpose in order to examine different facets of bias.
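As an illustration of this family of topic-quality checks, the sketch below trains an LDA model and scores it with NPMI-based coherence (Lau et al., 2014), as implemented in gensim's CoherenceModel; the toy corpus and all variable names are placeholders, not the paper's data or code.

```python
# A minimal sketch of NPMI-based topic coherence, assuming gensim is
# installed; the toy corpus stands in for a real hate speech dataset.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Placeholder tokenized documents (a real corpus would be preprocessed tweets).
texts = [
    ["migrant", "attack", "invasion", "border"],
    ["women", "sexist", "insult", "online"],
    ["migrant", "border", "hate", "online"],
    ["women", "online", "attack", "insult"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Train LDA (Blei et al., 2003) with a small number of topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

# Score the topics with normalized PMI (c_npmi), following Lau et al. (2014).
coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_npmi").get_coherence()
print(f"NPMI coherence: {coherence:.3f}")
```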
3 Bias Estimation Method

The construction of toxic language and hate speech corpora is commonly based on keywords and/or hashtags. However, the lack of an unequivocal definition of hate speech, the use of slurs in friendly conversations as opposed to sarcasm and metaphors in elusive hate speech (Malmasi and Zampieri, 2018), and the data collection timeline (Liu et al., 2019) all contribute to the complexity and imbalance of the available datasets. Therefore, training hate speech classifiers easily produces false positives when tested on posts that contain controversial or search-related identity words (Park et al., 2018; Sap et al., 2019; Davidson et al., 2019; Kim et al., 2020).

To determine whether a dataset is robust to keyword-based selection or not, we present two label-agnostic metrics to evaluate bias using topic models. First, we generate topics using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Then, we compare the topics to predefined sets of keywords using a semantic similarity measure. We test our methods on different numbers of topics and topic words; a code sketch of this pipeline is given after the keyword and topic-word examples below.

3.1 Predefined Keywords

In contrast to Waseem (2016), who legitimately questions the labeling process by comparing amateur and expert annotations, […]

Examples of search keywords used to build the datasets:

DATASET                        KEYWORDS
Ousidhoum et al. (2019) EN     ni**er, invasion, attack
Ousidhoum et al. (2019) FR     migrant, sale, m*ng*l
                               (EN: migrant, filthy, mong****d)
Albadi et al. (2018) AR        الشيعة، اليهود، المسيحية
                               (EN: Shia, Jewish, Christianity)

Examples of LDA topic words generated from the datasets:

DATASET                        TOPIC WORDS
Founta et al. (2018)           f***ing, like, know
Ousidhoum et al. (2019) EN     ret***ed, sh*t**le, c***
Waseem and Hovy (2016)         sexist, andre, like
Ousidhoum et al. (2019) FR     m*ng*l, gauchiste, sale
                               (EN: mon*y, leftist, filthy)
Ousidhoum et al. (2019) AR     امرأة، البعير، خنزير
                               (EN: woman, camels, pig)
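To make the pipeline concrete, the following is a minimal sketch of the two label-agnostic metrics, assuming gensim for LDA and a pre-trained (multilingual) word-embedding file loadable with gensim's KeyedVectors; the keyword list, the embedding path, and the exact metric definitions here are illustrative stand-ins for the paper's setup, not the released implementation.

```python
# A hedged sketch of the two bias metrics: (1) average semantic similarity
# between topic words and search keywords, and (2) how often keywords
# appear among the top topic words. Paths and keywords are placeholders.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import KeyedVectors, LdaModel

def topic_words(lda, num_topics, topn):
    """Return the top `topn` words of each LDA topic as lists of strings."""
    return [[w for w, _ in lda.show_topic(k, topn=topn)]
            for k in range(num_topics)]

def similarity_bias(topics, keywords, vectors):
    """Metric 1: mean cosine similarity between topic words and keywords."""
    scores = [vectors.similarity(w, s)
              for topic in topics for w in topic for s in keywords
              if w in vectors and s in vectors]
    return float(np.mean(scores)) if scores else 0.0

def keyword_frequency_bias(topics, keywords):
    """Metric 2: fraction of top topic words that are search keywords."""
    keyword_set = set(keywords)
    hits = sum(w in keyword_set for topic in topics for w in topic)
    total = sum(len(topic) for topic in topics)
    return hits / total

# Placeholder corpus and keywords; a real run would use one of the
# eleven datasets and its reported search keywords.
texts = [["migrant", "invasion", "attack"], ["women", "sexist", "like"]]
keywords = ["migrant", "invasion", "attack"]  # illustrative only

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0)
topics = topic_words(lda, num_topics=2, topn=3)

# Hypothetical path to pre-trained vectors in word2vec text format
# (e.g., fastText); not a file shipped with the paper.
vectors = KeyedVectors.load_word2vec_format("embeddings.vec")
print(similarity_bias(topics, keywords, vectors))
print(keyword_frequency_bias(topics, keywords))
```

Both metrics are label-agnostic by construction: they only consume the raw texts and the search keywords, never the hate/non-hate annotations, which is what allows the comparison across corpora with incompatible labeling schemes.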