Towards Understanding and Mitigating Social Biases in Language Models

Towards Understanding and Mitigating Social Biases in Language Models Paul Pu Liang 1 Chiyu Wu 1 Louis-Philippe Morency 1 Ruslan Salakhutdinov 1 Abstract Zhao et al., 2017). More recently, language models (LMs) Warning: this paper contains model outputs that are increasingly used in real-world applications such as text may be offensive or upsetting. generation (Radford et al., 2019), dialog systems (Zhang et al., 2020), recommendation systems (Shakespeare et al., As machine learning methods are deployed in real- 2020), and search engines (Baeza-Yates, 2016; Otterbacher world settings such as healthcare, legal systems, et al., 2018). As a result, it becomes necessary to recognize and social science, it is crucial to recognize how how they potentially shape social biases and stereotypes. they shape social biases and stereotypes in these sensitive decision-making processes. Among such In this paper, we aim to provide a more formal understanding real-world deployments are large-scale pretrained of social biases in LMs. In particular, we focus on represen- language models (LMs) that can be potentially tational biases, which, following the taxonomy in Blodgett dangerous in manifesting undesirable represen- et al.(2020), are harmful biases resulting from stereotyping tational biases - harmful biases resulting from that propagate negative generalizations about particular so- stereotyping that propagate negative generaliza- cial groups, as well as differences in system performance tions involving gender, race, religion, and other for different social groups, text that misrepresents the dis- social constructs. As a step towards improving tribution of different social groups in the population, or the fairness of LMs, we carefully define several language that is denigrating to particular social groups. A sources of representational biases before propos- better understanding of these biases in text generation would ing new benchmarks and metrics to measure them. subsequently allow us to design targeted methods to mitigate With these tools, we propose steps towards miti- them. We begin by summarizing three inherent difficulties gating social biases during text generation. Our in defining and measuring biases during text generation: empirical results and human evaluation demon- P1 Granularity: In prior work studying biases in embed- strate effectiveness in mitigating bias while re- dings, social biases are measured using a set of association taining crucial contextual information for high- tests between predefined social constructs (e.g., gender and fidelity text generation, thereby pushing forward racial terms) and social professions (e.g., occupations, aca- the performance-fairness Pareto frontier. demic fields). While it suffices to measure such associations over a set of tests for discriminative purposes, the study of 1. Introduction biases in text generation can be more nuanced - biases can Machine learning tools for processing large datasets are in- potentially arise during the generation of any token (Nadeem creasingly deployed in real-world scenarios such as health- et al., 2020), as well as from a more holistic, global interpre- care (Velupillai et al., 2018), legal systems (Dale, 2019), tation of the generated sentence (Sheng et al., 2019). arXiv:2106.13219v1 [cs.CL] 24 Jun 2021 and computational social science (Bamman et al., 2016). P2 Context: In addition to ensuring that generated content However, recent work has shown that discriminative mod- is unbiased, one must also make sure to respect the context. els including pretrained word and sentence embeddings Consider the sentence “The man performing surgery on a reflect and propagate social biases present in training cor- patient is a [blank]”. While we want a fair LM that assigns pora (Bolukbasi et al., 2016; Caliskan et al., 2017; Lauscher equal probability to w = doctor than w = nurse regardless and Glavasˇ, 2019; Swinger et al., 2019). Further usages of of the gender described in the context, the LM should also such approaches can amplify biases and unfairly discrim- preserve context associations between surgery and doctor. inate against users, particularly those from disadvantaged social groups (Barocas and Selbst, 2016; Sun et al., 2019; P3 Diversity: Generated content should be unbiased across a diverse distribution of real-world contexts, which calls for 1Carnegie Mellon University. Correspondence to: Paul Pu Liang stringent large-scale evaluation benchmarks and metrics. <[email protected]>. Our first contribution is therefore to disentangle two sources Proceedings of the 38 th International Conference on Machine of representational biases that may arise during language Learning, PMLR 139, 2021. Copyright 2021 by the author(s). <latexit sha1_base64="QIQI7im4Lgx2QX84xSaMHBtiy4A=">AAAB83icbVBNSwMxEM3Wr1q/qh69BItQL2VXCuqt6MVjBfsB3aVk02wbmmSXZFYsS/+GFw+KePXPePPfmLZ70NYHA4/3ZpiZFyaCG3Ddb6ewtr6xuVXcLu3s7u0flA+P2iZONWUtGotYd0NimOCKtYCDYN1EMyJDwTrh+Hbmdx6ZNjxWDzBJWCDJUPGIUwJW8n1gT6BlViXn03654tbcOfAq8XJSQTma/fKXP4hpKpkCKogxPc9NIMiIBk4Fm5b81LCE0DEZsp6likhmgmx+8xSfWWWAo1jbUoDn6u+JjEhjJjK0nZLAyCx7M/E/r5dCdBVkXCUpMEUXi6JUYIjxLAA84JpREBNLCNXc3orpiGhCwcZUsiF4yy+vkvZFzavXru/rlcZNHkcRnaBTVEUeukQNdIeaqIUoStAzekVvTuq8OO/Ox6K14OQzx+gPnM8f96SRqA==</latexit> <latexit sha1_base64="RtS6/0S+aTs2CR2/J4a9rsJIRwk=">AAAB83icbVBNSwMxEM3Wr1q/qh69BItQL2VXCuqt6MVjBfsB3aVk02wbmmSXZFYsS/+GFw+KePXPePPfmLZ70NYHA4/3ZpiZFyaCG3Ddb6ewtr6xuVXcLu3s7u0flA+P2iZONWUtGotYd0NimOCKtYCDYN1EMyJDwTrh+Hbmdx6ZNjxWDzBJWCDJUPGIUwJW8n1gT6BlVg3Pp/1yxa25c+BV4uWkgnI0++UvfxDTVDIFVBBjep6bQJARDZwKNi35qWEJoWMyZD1LFZHMBNn85ik+s8oAR7G2pQDP1d8TGZHGTGRoOyWBkVn2ZuJ/Xi+F6CrIuEpSYIouFkWpwBDjWQB4wDWjICaWEKq5vRXTEdGEgo2pZEPwll9eJe2LmlevXd/XK42bPI4iOkGnqIo8dIka6A41UQtRlKBn9IrenNR5cd6dj0VrwclnjtEfOJ8/+SqRqQ==</latexit> Towards Understanding and Mitigating Social Biases in Language Models Local bias Bias association The man performing surgery is a doctor. The woman performing surgery is a nurse. GPT-2 The man performing surgery is a doctor. The man performing surgery is precisely leading the operation. The woman performing surgery is carefully assisting the doctor. Context association (a) Global bias (b) Figure 1. (a) We disentangle sources of representational biases in text generation into fine-grained local biases and high-level global biases. Local biases represent predictions at a particular time step that reflect undesirable associations with the context. Global biases result from representational differences across entire generated sentences spanning multiple phrases. (b) While it is desirable to mitigate bias, one must also take care to preserve contextual associations between the prompt (e.g. surgery) and the next word (e.g. doctor). modeling: fine-grained local biases and high-level global values such as ethics (Hendrycks et al., 2021), social bias im- biases (see Figure1). Fine-grained local biases represent plications (Sap et al., 2020), and toxic speech (Gehman et al., predictions generated at a particular time step that reflect 2020) in generated text. Our approach aims to supplement undesirable associations with the context. For example, an existing work by disentangling sources of bias and design- LM that assigns a higher likelihood to the final token in “he ing new target methods to mitigate them. We also evaluate worked as a [doctor]” than “she worked as a [doctor]”. our method on the benchmarks proposed in Nadeem et al. High-level global biases result from representational differ- (2020) and Sheng et al.(2019). Existing approaches towards ences across entire generated sentences spanning multiple mitigating biases in generation currently require retrain- phrases. For example, an LM that generates “the gay person ing the models through adversarial trigger prompts (Sheng was known for [his love of dancing, but he also did drugs]” et al., 2020), data augmentation or collection (Dinan et al., (example from (Sheng et al., 2019)). We first formally define 2020), and different objective functions (Qian et al., 2019; these two sources of biases (addressing P1) and ways to sep- Huang et al., 2020). These approaches have also been ap- arate them from desirable context associations (addressing plied to image captioning (Hendricks et al., 2018), image P2). With this in mind, we propose diverse benchmarks and retrieval (Otterbacher, 2018), and dialog (Liu et al., 2020). metrics that test for both sources of bias (addressing P3). However, these approaches are not scalable to large pre- Using these new formulations, we empirically validate the trained LMs (Radford et al., 2019) which are trained on existence of biases in pretrained LMs. massive amounts of text data over hundreds of machines for several weeks. As a result, it is difficult to retrain a new As a step towards mitigating bias in LMs, our second con- LM whenever a new source of bias is uncovered from data. tribution is a new method called AUTOREGRESSIVE INLP Therefore, we focus on efficient post-processing approaches (A-INLP) that is able to perform post-hoc debiasing of large to mitigate bias without retraining. pretrained LMs. The key to our approach lies in dynamically finding bias-sensitive tokens rather than relying on a prede- Social biases in text embeddings: A closely related line of fined set of bias-sensitive words that are common in existing work lies in measuring and mitigating biases in embedding literature (Bolukbasi et al., 2016). While a predefined set spaces. For example, word embeddings are shown to re- may work for studying word embeddings, LMs must handle flect and propagate social

Load more