<<

Towards Understanding and Mitigating Social Biases in Language Models

Paul Pu Liang 1 Chiyu Wu 1 Louis-Philippe Morency 1 Ruslan Salakhutdinov 1

Abstract Zhao et al., 2017). More recently, language models (LMs) Warning: this paper contains model outputs that are increasingly used in real-world applications such as text may be offensive or upsetting. generation (Radford et al., 2019), dialog systems (Zhang et al., 2020), recommendation systems (Shakespeare et al., As machine learning methods are deployed in real- 2020), and search engines (Baeza-Yates, 2016; Otterbacher world settings such as healthcare, legal systems, et al., 2018). As a result, it becomes necessary to recognize and social science, it is crucial to recognize how how they potentially shape social biases and stereotypes. they shape social biases and stereotypes in these sensitive decision-making processes. Among such In this paper, we aim to provide a more formal understanding real-world deployments are large-scale pretrained of social biases in LMs. In particular, we focus on represen- language models (LMs) that can be potentially tational biases, which, following the taxonomy in Blodgett dangerous in manifesting undesirable represen- et al.(2020), are harmful biases resulting from stereotyping tational biases - harmful biases resulting from that propagate negative generalizations about particular so- stereotyping that propagate negative generaliza- cial groups, as well as differences in system performance tions involving gender, race, religion, and other for different social groups, text that misrepresents the dis- social constructs. As a step towards improving tribution of different social groups in the population, or the fairness of LMs, we carefully define several language that is denigrating to particular social groups. A sources of representational biases before propos- better understanding of these biases in text generation would ing new benchmarks and metrics to measure them. subsequently allow us to design targeted methods to mitigate With these tools, we propose steps towards miti- them. We begin by summarizing three inherent difficulties gating social biases during text generation. Our in defining and measuring biases during text generation: empirical results and human evaluation demon- P1 Granularity: In prior work studying biases in embed- strate effectiveness in mitigating bias while re- dings, social biases are measured using a set of association taining crucial contextual information for high- tests between predefined social constructs (e.g., gender and fidelity text generation, thereby pushing forward racial terms) and social professions (e.g., occupations, aca- the performance-fairness Pareto frontier. demic fields). While it suffices to measure such associations over a set of tests for discriminative purposes, the study of 1. Introduction biases in text generation can be more nuanced - biases can Machine learning tools for processing large datasets are in- potentially arise during the generation of any token (Nadeem creasingly deployed in real-world scenarios such as health- et al., 2020), as well as from a more holistic, global interpre- care (Velupillai et al., 2018), legal systems (Dale, 2019), tation of the generated sentence (Sheng et al., 2019). arXiv:2106.13219v1 [cs.CL] 24 Jun 2021 and computational social science (Bamman et al., 2016). P2 Context: In addition to ensuring that generated content However, recent work has shown that discriminative mod- is unbiased, one must also make sure to respect the context. els including pretrained word and sentence embeddings Consider the sentence “The man performing surgery on a reflect and propagate social biases present in training cor- patient is a [blank]”. While we want a fair LM that assigns pora (Bolukbasi et al., 2016; Caliskan et al., 2017; Lauscher equal probability to w = doctor than w = nurse regardless and Glavasˇ, 2019; Swinger et al., 2019). Further usages of of the gender described in the context, the LM should also such approaches can amplify biases and unfairly discrim- preserve context associations between surgery and doctor. inate against users, particularly those from disadvantaged social groups (Barocas and Selbst, 2016; Sun et al., 2019; P3 Diversity: Generated content should be unbiased across a diverse distribution of real-world contexts, which calls for 1Carnegie Mellon University. Correspondence to: Paul Pu Liang stringent large-scale evaluation benchmarks and metrics. . Our first contribution is therefore to disentangle two sources Proceedings of the 38 th International Conference on Machine of representational biases that may arise during language Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

AAAB83icbVBNSwMxEM3Wr1q/qh69BItQL2VXCuqt6MVjBfsB3aVk02wbmmSXZFYsS/+GFw+KePXPePPfmLZ70NYHA4/3ZpiZFyaCG3Ddb6ewtr6xuVXcLu3s7u0flA+P2iZONWUtGotYd0NimOCKtYCDYN1EMyJDwTrh+Hbmdx6ZNjxWDzBJWCDJUPGIUwJW8n1gT6BlViXn03654tbcOfAq8XJSQTma/fKXP4hpKpkCKogxPc9NIMiIBk4Fm5b81LCE0DEZsp6likhmgmx+8xSfWWWAo1jbUoDn6u+JjEhjJjK0nZLAyCx7M/E/r5dCdBVkXCUpMEUXi6JUYIjxLAA84JpREBNLCNXc3orpiGhCwcZUsiF4yy+vkvZFzavXru/rlcZNHkcRnaBTVEUeukQNdIeaqIUoStAzekVvTuq8OO/Ox6K14OQzx+gPnM8f96SRqA== AAAB83icbVBNSwMxEM3Wr1q/qh69BItQL2VXCuqt6MVjBfsB3aVk02wbmmSXZFYsS/+GFw+KePXPePPfmLZ70NYHA4/3ZpiZFyaCG3Ddb6ewtr6xuVXcLu3s7u0flA+P2iZONWUtGotYd0NimOCKtYCDYN1EMyJDwTrh+Hbmdx6ZNjxWDzBJWCDJUPGIUwJW8n1gT6BlVg3Pp/1yxa25c+BV4uWkgnI0++UvfxDTVDIFVBBjep6bQJARDZwKNi35qWEJoWMyZD1LFZHMBNn85ik+s8oAR7G2pQDP1d8TGZHGTGRoOyWBkVn2ZuJ/Xi+F6CrIuEpSYIouFkWpwBDjWQB4wDWjICaWEKq5vRXTEdGEgo2pZEPwll9eJe2LmlevXd/XK42bPI4iOkGnqIo8dIka6A41UQtRlKBn9IrenNR5cd6dj0VrwclnjtEfOJ8/+SqRqQ== Towards Understanding and Mitigating Social Biases in Language Models

Local bias Bias association The man performing surgery is a doctor. The woman performing surgery is a nurse. GPT-2 The man performing surgery is a doctor. The man performing surgery is precisely leading the operation. The woman performing surgery is carefully assisting the doctor. Context association

(a) Global bias (b) Figure 1. (a) We disentangle sources of representational biases in text generation into fine-grained local biases and high-level global biases. Local biases represent predictions at a particular time step that reflect undesirable associations with the context. Global biases result from representational differences across entire generated sentences spanning multiple phrases. (b) While it is desirable to mitigate bias, one must also take care to preserve contextual associations between the prompt (e.g. surgery) and the next word (e.g. doctor). modeling: fine-grained local biases and high-level global values such as ethics (Hendrycks et al., 2021), social bias im- biases (see Figure1). Fine-grained local biases represent plications (Sap et al., 2020), and toxic speech (Gehman et al., predictions generated at a particular time step that reflect 2020) in generated text. Our approach aims to supplement undesirable associations with the context. For example, an existing work by disentangling sources of bias and design- LM that assigns a higher likelihood to the final token in “he ing new target methods to mitigate them. We also evaluate worked as a [doctor]” than “she worked as a [doctor]”. our method on the benchmarks proposed in Nadeem et al. High-level global biases result from representational differ- (2020) and Sheng et al.(2019). Existing approaches towards ences across entire generated sentences spanning multiple mitigating biases in generation currently require retrain- phrases. For example, an LM that generates “the gay person ing the models through adversarial trigger prompts (Sheng was known for [his love of dancing, but he also did drugs]” et al., 2020), data augmentation or collection (Dinan et al., (example from (Sheng et al., 2019)). We first formally define 2020), and different objective functions (Qian et al., 2019; these two sources of biases (addressing P1) and ways to sep- Huang et al., 2020). These approaches have also been ap- arate them from desirable context associations (addressing plied to image captioning (Hendricks et al., 2018), image P2). With this in mind, we propose diverse benchmarks and retrieval (Otterbacher, 2018), and dialog (Liu et al., 2020). metrics that test for both sources of bias (addressing P3). However, these approaches are not scalable to large pre- Using these new formulations, we empirically validate the trained LMs (Radford et al., 2019) which are trained on existence of biases in pretrained LMs. massive amounts of text data over hundreds of machines for several weeks. As a result, it is difficult to retrain a new As a step towards mitigating bias in LMs, our second con- LM whenever a new source of bias is uncovered from data. tribution is a new method called AUTOREGRESSIVE INLP Therefore, we focus on efficient post-processing approaches (A-INLP) that is able to perform post-hoc debiasing of large to mitigate bias without retraining. pretrained LMs. The key to our approach lies in dynamically finding bias-sensitive tokens rather than relying on a prede- Social biases in text embeddings: A closely related line of fined set of bias-sensitive words that are common in existing work lies in measuring and mitigating biases in embedding literature (Bolukbasi et al., 2016). While a predefined set spaces. For example, word embeddings are shown to re- may work for studying word embeddings, LMs must handle flect and propagate social biases in the form of undesirable many possible diverse contexts and generated outputs. We associations that reinforce negative stereotypes about par- present a way to expand beyond a set of tokens using the ticular social groups (Lauscher and Glavasˇ, 2019; Caliskan geometry of embeddings and a bias classifier that general- et al., 2017; Bolukbasi et al., 2016). Corresponding methods izes to new contexts. Using these techniques in A-INLP for debiasing these embeddings for both binary (Bolukbasi shows effectiveness in mitigating bias over diverse input et al., 2016; Zhao et al., 2018) and multiclass (Manzini contexts and possible generation candidates through a set et al., 2019) attributes across gender, race, and religion have of experiments studying biases resulting from gender and been devised. Recent work has also extended this analysis religion. We also perform in-depth analysis into the various towards measuring (Tan and Celis, 2019; Guo and Caliskan, design decisions in measuring, detecting, and mitigating bi- 2020; Kurita et al., 2019) and mitigating (Liang et al., 2020; ases which we hope will inspire work towards automatically Ravfogel et al., 2020) bias in contextual embeddings such identifying sensitive tokens for fairer NLP. as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and GPT (Radford et al., 2019) encoders. Many of these 2. Related Work approaches involve extending the Word Embedding Associ- Social biases in text generation: Recent work has focused ation Test (WEAT) (Caliskan et al., 2017) metric to the sen- on defining and evaluating social bias (Nadeem et al., 2020; tences (SEAT) using context templates (May et al., 2019). Sheng et al., 2019) as well as other notions of human-aligned Towards Understanding and Mitigating Social Biases in Language Models

Table 1. We summarize the benchmarks and metrics to measure local and global biases as well as LM performance during text generation. Diverse contexts found in naturally occurring text corpora test for both bias and context associations in rich real-world scenarios.

Source Example Data Collection Evaluation metric

He worked as a [doctor]. ( ) ( ) Templates (Sheng et al., 2019) KL(p (w Sc 1 ), p (w Sc 2 )) She worked as a [nurse]. θ t t−1 θ t t−1 Local bias The man performing surgery is a [doctor]. ( ) ( ) + Diverse text corpora H2(p (w Sc 1 ), p (w Sc 2 )) The woman performing surgery is a [nurse]. θ t t−1 θ t t−1 He was known for [being strong and assertive]. Regard dataset (Sheng et al., 2019) Sg(s(1)) − g(s(2))S Global bias She was known for [being quiet and shy]. + Diverse text corpora Human evaluation ∗ (1) ∗ (2) The jew worked as an enterprising [businessman]. pθ(w Sct−1) & pθ(w Sct−1) ∗ Performance The christian was regarded as an international Diverse text corpora KL(pθ(wtSct−1), pθ (wtSct−1)) hero who [saved a million lives in the 1940s.] 2 ∗ H (pθ(wtSct−1), pθ (wtSct−1))

Beyond representational biases: Several other sources of As a step towards defining bias in text generation, we first bias have also been shown to exist in machine learning disentangle fine-grained local and high-level global sources models, such as allocational harms that arise when an auto- of representational bias before designing a new benchmark mated system allocates resources (e.g., credit) or opportuni- and metrics for measuring these biases. We focus our expo- ties (e.g., jobs) unfairly to different social groups (Barocas sition on the biases across binary gender1 groups but our et al., 2017), and questionable correlations between sys- approach easily generalizes to multiclass social groups. tem behavior and features associated with particular social groups (Cho et al., 2019). These are also important per- 3.1. Fine-grained Local Biases spectives of bias that we leave as future work. We refer the Fine-grained local biases represent predictions generated at reader to Blodgett et al.(2020) for a detailed taxonomy of a particular time step that reflect undesirable associations the existing literature in analyzing social biases in NLP. with the context. For example, an LM that assigns a higher likelihood to the final token in “he worked as a [doctor]” 3. Defining Sources of Biases in LMs than “she worked as a [doctor]”. We begin with a standard definition of language modeling: Formally, consider the generation of word wt given a context given some context c and a target vocabulary V consisting (1) ct 1 describing the first social group (e.g., male individual). of a discrete set of word tokens, a model pθ with parameters − Change the context to c(2) such that it describes the second θ aims to predict a distribution over the next candidates V t−1 over multiple time steps until a maximum step T is reached: social group (e.g., female individual), and vice-versa. This can be done via simple word replacement from a prede- fined set of gender pairs (Bolukbasi et al., 2016). A model’s pθ(wtSct 1) = pθ(wtSw0, w1, ..., wt 1) ∀t ≤ T. (1) − − generation at time t is said to be locally biased if: In practice, pθ(wtSct−1) is implemented via two functions: an embedding function e over the vocabulary V (either pre- (1) (2) pθ(wtSct 1) ≠ pθ(wtSct 1). (3) trained word embeddings or trainable embeddings), and − − In other words, if the distribution over the next tokens differs an encoding function f over the context ct−1 (e.g., an RNN (Rumelhart et al., 1985) or Transformer (Vaswani significantly given a counterfactual edit in the context with et al., 2017)). The probability of a given next token wt is respect to the gendered term. To measure local biases across then equivalent to a softmax over distances between the the vocabulary, we use a suitable f-divergence between the probability distributions predicted by the LM conditioned token embedding e(wt) and context embedding f(ct−1): ⊺ on both counterfactual contexts: exp (e(wt) f(ct−1)) pθ(wtSw1, w2, ..., wt 1) = . − e w ⊺f c (1) (2) ∑w∈V exp ( ( ) ( t−1)) Df (pθ(wtSct 1), pθ(wtSct 1)). (4) (2) − − When using a Transformer LM such as GPT-2, one Since the probability of a specific token wt is directly propor- tional to the cosine distance between that token’s embedding can define the encoded context f(ct−1) to consist of e w and the context embedding f c (by equation2), the key-value pairs from the past, i.e., f(ct−1) = ( t) ( t−1) K(1),V (1) , ..., K(l) ,V (l) where K(i) ,V (i) corre- computing the f-divergence has a nice interpretation of sum- [( t−1 t−1 ) ( t−1 t−1)] ( t−1 t−1) sponds to the key-value pairs from the i-th Transformer marizing the difference in pairwise distances between all layer generated from time steps 0 to t − 1 (see (Dathathri 1We recognize that gender is non-binary and there are many ∗ et al., 2019) for more details). We use pθ to denote the ethical principles in the design, evaluation, and reporting of results original pretrained LM. in studying gender as a variable in NLP (Larson, 2017). Towards Understanding and Mitigating Social Biases in Language Models tokens and both contexts, weighted by the likelihood of that [blank]”. A biased LM will likely assign higher probability token. This further generalizes WEAT (Caliskan et al., 2017) to w = doctor than w = nurse by virtue of the context or SEAT (May et al., 2019) tests by comparing across all describing a male individual. However, note that there are 2 tokens while at the same time weighting more likely tokens associations going on: higher in bias computation, instead of only considering a 1. between “man” and “doctor”, which is the result of a predefined set of bias attributes (e.g., gendered terms and biased association in the language model, and occupations). In practice, we use the KL divergence and the 2. between “surgery” and “doctor”, which is the result of a Hellinger distance to measure this difference. (perfectly ok) context association in the language model. 3.2. High-level Global Biases Therefore, to accurately benchmark LMs for both fairness High-level global biases result from representational differ- and performance, we use two sets of metrics to accurately ences across entire generated sentences spanning multiple estimate bias association while allowing for context as- phrases. For example, an LM that generates “the gay per- sociation. To estimate for bias association, we measure whether p w c(1) p w c(2) across the entire dis- son was known for [his love of dancing, but he also did θ( tS t−1) ≈ θ( tS t−1) drugs]” (example from (Sheng et al., 2019)). While the tribution of next tokens at time t (i.e., local bias) as well 1 2 generation at each time step exhibits local biases, the entire as whether g(s( )) ≈ g(s( )) for entire generated sen- generated sentence also exhibits biases through a holistic, tences (i.e., global bias). To estimate for context association, global interpretation. The key difference lies in the fact that we measure whether p w∗ c(1) and p w∗ c(2) for the θ( S t−1) θ( S t−1) local biases primarily inspect the associations per word and ground truth word w∗ are both high implying that the LM primarily measure associations in generated nouns (e.g., oc- still assigns high probability to the correct next token by cupations). On the other hand, global biases take a more capturing context associations. holistic view that considers the semantics of the generated Leveraging diverse contexts: To accurately benchmark sentence, thereby measuring negative associations across LMs for both bias and context associations, it is also impor- entire phrases as well as their constituent verbs, adjectives, tant to use diverse contexts beyond simple templates used in and other parts of speech. prior work. Specifically, the Sentence Encoder Association Again, consider a given context c(1) describing a male indi- Test (May et al., 2019), StereoSet (Nadeem et al., 2020)), t−1 vidual. Change the context to c(2) such that it describes a and templates in Sheng et al.(2019) are all based on com- t−1 female individual rather than male, and vice-versa. Inspired bining bias terms (e.g., gender and race terms) and attributes by Sheng et al.(2019) and Huang et al.(2020), we allow (e.g., professions) with simple placeholder templates (e.g., the LM to generate the complete sentence s(1) and s(2) re- “The woman worked as”, “The man was known for”). Di- spectively before measuring differences in sentiment and verse contexts found in naturally occurring text corpora regard of the resulting sentence using a pretrained classifier contain important context associations to accurately bench- mark whether the new LM can still accurately generate g(⋅). Sentiment scores capture differences in overall lan- guage polarity (Pang and Lee, 2008), while regard measures realistic text, while also ensuring that the biases in the new language polarity and social perceptions of a demographic LM are in rich real-world contexts. To achieve this, (see Sheng et al.(2019) for differences). As a result, sen- we collect a large set of 16, 338 diverse contexts from 5 real- timent and regard measure representational biases in the world text corpora spanning WIKITEXT-2 (Merity et al., semantics of entire phrases rather than individual words. A 2017), SST (Socher et al., 2013), REDDIT, MELD (Poria model’s generation at time t is said to be globally biased if: et al., 2019), and POM (Park et al., 2014) which cover both spoken and written English language across formal and in- (1) (2) g(s ) ≠ g(s ). (5) formal settings and a variety of topics (Wikipedia, reviews, In other words, if sentiment and regard estimates differ politics, news, and TV dialog). We summarize these con- significantly given a counterfactual edit in the context with texts and metrics in Table1. From 948, 573 sentences across respect to the gendered term. To measure for the difference, 5 datasets, we found 15, 162 contexts for gender and 1, 176 1 2 we take the absolute difference Sg(s( )) − g(s( ))S. for religion which constitute our diverse context dataset. Please refer to AppendixB for details. 3.3. Benchmarks for Evaluating Biases Given these metrics, we now describe several existing and 4. Mitigating Biases newly collected data sources for measuring both local and Given the existence of local and global biases in LMs, our global biases, as well as their tradeoffs with language mod- approach towards mitigating them lies in 1) learning a set of eling performance. bias-sensitive tokens, and 2) mitigating bias of these sensi- Balancing biases with prediction: Suppose you are given tive tokens via our newly proposed autoregressive iterative a sentence “The man performing surgery on a patient is a nullspace projection algorithm (see Figure2). Towards Understanding and Mitigating Social Biases in Language Models

Algorithm 1 AUTOREGRESSIVE INLP algorithm for mitigating social biases in pretrained LMs. ∗ 1: Given: pre-trained LM pθ . 2: Learn bias-sensitive tokens S by projection onto bias subspace. 3: Learn context bias classifier with parameter W and obtain nullspace P via multiple steps of nullspace projection. 4: for t = 1, ..., T do ′ ∗ 5: V = topkpθ (⋅ S ct−1) ∩ S // Find likely next tokens that are bias-sensitive ⊺ exp‰e(wt) Pf (ct−1)Ž 6: pˆθ wt ct 1 ⊺ // Computed debiased LM distribution ( S − ) = w∈V exp(e(w) Pf (ct−1)) ∗ ∑ ∑w∈V ′ pθ (wSct−1)×q(w) 7: α ∗ t = ′ // Compute debiasing level ∑w∈V pθ (wSct−1) ∗ 8: pθ(wtSct−1) = αtpˆθ(wtSct−1) + (1 − αt)pθ (wtSct−1) // Obtain new weighted LM 9: wt ∼ pθ(wtSct−1) // Sample next token 10: end for 11: return generated tokens w1, ..., wT .

GPT-2 A-INLP and the tokens with high projection values are regarded scientist (male bias) scientist as bias sensitive tokens. This approach uses information doctor (male bias) (1) Identify bias- doctor nurse (female bias) sensitive tokens nurse about the geometry of token embeddings to infer new bias- artist (female bias) artist sensitive tokens S beyond those present in the definitional … … token set. We perform an in-depth analysis of these auto- male context matically found tokens in §5.1. female context (2) Nullspace projection 4.2. Mitigating Bias via Nullspace Projection Our method is inspired by iterative nullspace projection (INLP) as proposed by (Ravfogel et al., 2020) to debias The man worked as a word embeddings. Given a set of word embeddings xi ∈ X Figure 2. Our approach for mitigating biases in language models and a set of corresponding protected attributes zi ∈ Z (e.g., relies on 2 steps: (1) identifying sources of local and global biases gender), INLP aims to find a linear guarding function h that during text generation (section 4.1), and (2) mitigating bias via removes the linear dependence between X and Z. To do so, sequential iterative nullspace projection in order to obtain a more INLP first trains a linear classifier with parameter W to best uniform distribution over possibly sensitive tokens (section 4.2). predict z from x before projecting x onto the nullspace of W , denoted as P , which serves the purpose of removing 4.1. Finding Biases Through Sensitive Tokens all information used by W to predict the protected attribute. The guarding function h(x) = P x gives an embedding that Prior work studying representational biases uses a set of removes dependence between x and z (see Ravfogel et al. predefined social attributes (e.g., occupations, academic (2020) for details). fields) to measure undesirable associations (Caliskan et al., 2017). We refer to such attributes as bias-sensitive words: AUTOREGRESSIVE INLP (A-INLP) extends INLP for words that are at risk of capturing undesirable associations autoregressive text generation. We assume that we have with respect to gendered terms. Finding bias-sensitive words found a set of bias-sensitive tokens S from §4.1, as well is therefore crucial to mitigating local bias at the word-level. as a nullspace P obtained from a trained bias classifier given LM contexts (e.g., gender/religion classifier given We propose to use a learning-based approach that can de- (partial) sentences). In §5.2, we evaluate several design tect new bias-sensitive words to ensure fair generation. We choices regarding the data and models required to train first identify the bias subspace by starting with several defi- such a bias classifier. nitional bias pairs from Bolukbasi et al.(2016), such as “he” and “she”, “father” and “mother” for gender, and At every time step t, we apply INLP to the context em- “jew”, “christian”, “muslim” for religion. We embed each bedding f(ct−1) to ensure that generation of next tokens is bias-defining word using GloVe (Pennington et al., 2014) invariant to gender in the context: and take the SVD of differences between each pair of vec- ⊺ exp (e(wt) Pf (ct−1)) pˆθ wt ct 1 . (6) tors to obtain a low-dimensional bias subspace (Bolukbasi ( S − ) = ⊺ ∑w V exp (e(w) Pf (ct−1)) et al., 2016). These top principal components summarize the ∈ main directions capturing gender and religion. We project all Controlling the trade-off between performance and fair- possible candidate generation tokens onto our bias subspace, ness: We set a hyper-parameter α that determines how much Towards Understanding and Mitigating Social Biases in Language Models

Table 2. Examples of harmful bias-sensitive tokens automatically Table 3. We find that training with simple and diverse contexts detected for gender and religion social classes. Some extremely supplemented with sub-sequences gives a bias classifier that gener- sensitive words have been filtered out, see full list in Appendix D.1. alizes best to the diverse possible contexts input to LMs.

Male Female Training data Simple Diverse Sub-sequences captain, sir, president, war, sassy, pregnant, diva, Simple 91.4 53.6 52.7 gangster, offensive, macho, jock, seductress, madwomen, midwife, Simple + Diverse 87.8 61.2 60.4 studly, football, henchmen, socialite, glamour, supermodel, Simple + Diverse + Sub-sequences 88.0 63.7 62.5 commander, king, greatest alluring, vivacious, mistress

Christianity Islam analyze several intermediate objectives of identifying bias- counterfeit, supernatural, skeptics, terrorists, jihad, terror, sensitive tokens and training bias classifiers before testing incredulity, charisma, cathedral, afghanistan, extremists, murder, the ability of A-INLP in mitigating bias from pretrained metaphysical, teleological, faith, civilians, fear, war, hatred, irresistible, devotionals, fable cries, enemies, lies, rights, hate GPT-2. Experimental details are in AppendixC and full results are in AppendixD. We release our code at https: //github.com/pliang279/LM_bias. to use our debiased LM. The final distributions over next tokens we output is a weighted average using α: 5.1. Results on Identifying Bias-sensitive Tokens ∗ How well do our automatically detected bias-sensitive to- pθ(wtSct−1) = αpˆθ(wtSct−1) + (1 − α)pθ (wtSct−1) (7) kens in LMs align with human perception of social biases in where p∗ denotes logits of the original LM and pˆ represents θ θ generated text? We ranked words by their projection values our debiased LM. α 0 recovers the original LM predic- = onto the bias subspace and show examples of the found tions (no debiasing) and α 1 would fully apply INLP at = bias-sensitive tokens (largest projection values) for gender all time steps (full debiasing). and religious terms in Table2( some of the found tokens We further propose an approach to automatically learn αt at are extremely offensive and we have deferred them to time step t that summarizes how many of the likely gener- Appendix D.1). Visually, many of these words very nega- ated tokens will be bias-sensitive. A large number of bias- tively stereotype certain genders and religions (especially sensitive tokens should lead to a large αt and vice-versa. To for the female gender and Muslim religion). To perform a compute αt, we consider the subset of next tokens V ′ ⊆ V more careful empirical analysis, we sampled the top 100 that are 1) likely to be generated by the language model, bias-sensitive tokens for each social group and asked 5 inde- and 2) at risk of displaying bias. To satisfy both criteria, we pendent human annotators to judge whether the found token ′ ∗ choose V = topk pθ (⋅ Sct−1) ∩ S where the topk function was indeed stereotyped negatively against that social group. ∗ ranks the predicted LM distribution pθ (⋅ Sct−1) and chooses For the Islamic religion, 32% of the top-ranked words were the k most likely candidate tokens (thereby satisfying 1), judged as showing severely negative bias (words such as followed by an intersection with bias-sensitive tokens S “terror” and “terrorism”). We show more details and results (thereby satisfying 2). For each of these potential next to- in Appendix D.1. kens w ∈ V ′, we compute 1) q(w), the projection onto our bias subspace which reflects the degree of bias, and 2) 5.2. Results on Learning a Bias Classifier ∗ Next, we analyze how several design decisions affect the pθ (wSct−1) the original LM likelihood. We set performance of our trained bias classifier. ′ p∗ w c q w ∑w∈V θ ( S t−1)× ( ) αt = (8) ′ p∗ w c Data: We first build a dataset for the bias classifier. To ∑w∈V θ ( S t−1) improve the diversity of the training data, we collect both which computes a normalized value in 0, 1 summarizing [ ] simple contexts from the templates in Sheng et al.(2019) how likely the next tokens will exhibit bias. We summarize and diverse context from real-world corpus described in A-INLP in Algorithm1 and note some implementation de- §3.3. We use our learned bias subspace to find a set of bias tails and speedups in Appendix C.1. Note that our approach sensitive tokens, and contextualize these bias sensitive to- can also be instantiated with other token-level debiasing kens into bias sensitive contexts using the approach in Liang methods beyond INLP, such as subspace debiasing (Boluk- et al.(2020). For simple contexts, we replaced the biased basi et al., 2016; Manzini et al., 2019; Liang et al., 2020) token in the original templates to obtain new contexts. For which we test in our experiments as well. diverse contexts, we collect sentences containing biased to- 5. Experiments kens within a single class. To match partial input contexts we encounter when testing bias in GPT-2, we also supplement To test whether we are able to efficiently characterize and our full-sentence contexts with their partial subsequences. mitigate social biases in LMs, we experiment on the GPT- 2 LM trained in English (Radford et al., 2019). We first Method: After collecting this dataset, we train a linear SVM Towards Understanding and Mitigating Social Biases in Language Models

Figure 3. Bias metrics on gender (top 4) and religion (bottom 4) contexts. A-INLP TUNE α controls the trade-off between performance and fairness which can be automatically balanced using A-INLP LEARN α.A-SUBSPACE is another effective version of our approach. with `2 penalty and squared hinge loss as our bias classifier. Table 4. Global regard differences on gender bias. A-INLP dis- plays less bias as compared to GPT-2 especially on negative regard. Results: We found that the classifier trained only on simple contexts cannot generalize to diverse contexts. When we Context Model Positive (↓) Neural (↓) Negative (↓) GPT-2 0.026 0.134 0.160 add more diverse contexts from real-world corpora, our clas- Respect sifier generalizes better to both simple and diverse contexts A-INLP 0.004 0.000 0.003 GPT-2 0.004 0.088 0.084 (see Table3). Finally, we find that adding subsequences Occupation also helps in accurately finding bias in partial input contexts A-INLP 0.012 0.046 0.034 given to GPT-2. For religion, we find the number of sen- tences containing religion tokens in real-world corpora is dataset of simple contexts with human annotations for vari- relatively small and most sentences are much longer, which ous possible next word completions that range from unbi- results in slightly lower accuracy of the trained religion ased to biased (showing stereotypes). StereoSet is suitable classifier (see more details in Appendix D.2). for measuring biases at both local (approximately intra- sentence bias) and global (approximately inter-sentence 5.3. Results on Mitigating Bias bias) levels, while at the same time providing ground truth How well does our proposed A-INLP approach work text completions to judge language modeling performance. in mitigating social biases in text generation? We apply Their metrics include language modeling score (LM), stereo- our approach on the pretrained GPT-2 model in Hugging type score (SS), and overall idealized CAT score (ICAT). Face (Wolf et al., 2020) and compare with both currently Baselines: We compare to the following methods: established and newly proposed benchmarks and metrics. 1. GPT-2: Original pretrained GPT-2 language model (Rad- Datasets and metrics: We perform experiments on 3 ford et al., 2019). datasets spanning recently proposed work as well as our proposed benchmarks: 2. INLP: Direct application of INLP when generating every token (Ravfogel et al., 2020). 1. Simple contexts as proposed by Sheng et al.(2019) allow us to test LMs with certain context templates describing 3. A-INLP TUNE α:A-INLP with hyper-parameter search gender, race, religion, and other social constructs. We mea- over α to obtain a single best α over all time-steps. sure both local and global bias using these contexts. For 4. A-INLP LEARN α:A-INLP with auto-selection of α global bias, we use a pretrained regard classifier (Sheng t across time steps learned from bias-sensitive tokens (§4.2). et al., 2019; 2020) as well as human judgment. 5. A-SUBSPACE: Instead of using INLP for debiasing, we 2. Diverse contexts which are our proposed extension to also experimented with autoregressive token-level subspace better measure fairness and LM performance in diverse real- debiasing at every time step (Bolukbasi et al., 2016). world contexts. We again measure both local and global bias using these diverse contexts. Analysis of local and global biases: In Figure3, we show trade-off plots of performance vs fairness as measured 3. StereoSet (Nadeem et al., 2020) is a recently proposed across local and global metrics with both simple and diverse Towards Understanding and Mitigating Social Biases in Language Models

Table 5. Example sentences generated by GPT-2 and after debiasing with A-INLP. A-INLP generates text that refers less to stereotypes while retaining clarity and realistic content. Some extremely sensitive sentences have been filtered out, see full list in Appendix D.6.

GPT-2 A-INLP The woman was thought as having a long-term mental disorder, The woman was regarded as a saint by her Catholic and was also the only one who identified herself as bisexual. community, and the church recognised it in 1925. The man was known for attracting outrage at a middle school The man started working as an actor in Starwalt when he solicited young girls on a secret date. and learned the ropes of news. The muslim was known for his extremist activities and The muslim was known for his complexity- acts of terrorism, as well as several fine crimes. he speaks fluently, reads both cultures and wrote beautifully The christian was described as gay ‘pastor’ The christian was described as bellowing from Japan. He had come to the UK out screams and shouting together from hallowed halls. The jew was regarded as a threat to British values. The jew was regarded as ardent gardener who He became Britain’s most hated speaker. loved floral essences of fermented soy flavour and alcohol. gender and religion contexts. We begin by noting that GPT- Table 6. On Stereoset, A-INLP improves upon GPT-2 on stereo- 2 exhibits the best performance while being the most unfair type scores (SS) while retaining language modeling scores (LM). with respect to different social groups. By applying A-INLP The 2 sets of INLP and A-INLP results correspond to training P TUNE α with different levels of debiasing as controlled by for 30 and 15 epochs respectively. α, we are able to draw a trade-off curve with gradually im- Context Model LM (↑) SS (↓) ICAT (↑) proving fairness metrics at the cost of performance. It is GPT-2 88.46 58.02 74.27 promising that for many plots, the initial improvement in INLP 82.83 55.91 73.04 fairness happens at a small expense in performance (steep Religion A-INLP 89.13 54.57 80.97 upwards slope) which implies that initial debiasing can be INLP 86.64 50.16 86.36 achieved without hurting the quality of generated text. Fi- A-INLP 88.55 49.98 88.51 nally, at the largest level of debiasing (α = 1), we recover the INLP baseline which achieves the best fairness but at 2. Even though many parts of the original text may contain the expense of language modeling performance. bias, we found that once the very first occurrence of a sen- For global bias, we also observe that A-INLP LEARN α sitive token is fixed, the remaining generated text displays using bias-sensitive tokens consistently outperforms other significantly less bias even without further debiasing. approaches on performance and fairness, thereby pushing 3. We note that the plots of global bias metrics do not show the Pareto front outwards. We also show numerical perfor- a smooth tradeoff like the local ones do. We attribute this to mance in Table4 and find that our debiased LM effectively stochasticity during autoregressive generation with respect equalizes the global regard scores (i.e., equal proportion to token-level debiasing. of completed sentences judged as positive or negative re- gard for both male and female contexts), with it especially 4. Taking a closer look at debiasing performance for sim- effective in equalizing negative scoring sentences. ple versus diverse contexts, we find that it is significantly harder to detect and mitigate biases from real-world diverse Finally, we also note some observations regarding A- contexts. Only bias classifiers trained on simple + diverse + SUBSPACE instantiated with token-level subspace debiasing subsequences performed well enough on diverse contexts, rather than INLP. From Figure3, we see that this point but still leaves significant room for future improvement. makes little difference to LM performance while achieving better fairness performance, which makes subspace debias- Comparison on StereoSet: We also apply our debiased ing another effective version of our approach. LMs on StereoSet (Nadeem et al., 2020) and show results in Table6. We find that on SS score which measures for Ablation studies: To study the design decisions underpin- stereotypical biases, our approach improves upon GPT-2 ning our approach, we conduct ablation studies and summa- significantly while maintaining LM score. On the overall rize our observations (full results in Appendix D.4): ICAT score metric, we improve performance by 19% on the 1. The quality of the bias classifier can affect debiasing per- tasks testing for bias associated with different religions. formance. Well trained bias classifiers, while accurate in Human evaluation: How well do our proposed metrics detecting bias, will also retain significant context informa- align with human perception of social biases in text? We tion. Therefore, projecting onto its null space will cause begin by showing some examples of text generated by GPT- context information to be lost in addition to removing bias. 2 versus text generated by A-INLP in Table5. Visually, GPT-2 can generate very harmful text but our approach Towards Understanding and Mitigating Social Biases in Language Models

Table 7. On human evaluation of generated text, A-INLP achieves tions (well-defined bias subspace & classifier) which largely better (absolute) fairness scores while retaining clarity and content. reflect only one perception of biases which might not gen- eralize to other cultures, geographical regions, and time Context Model Clarity (↑) Content (↑) Fairness (↑) GPT-2 4.97 4.99 3.93 periods. Bias can also span social, moral, and ethical dimen- Religion A-INLP 4.93 4.93 4.00 sions, which are important areas of future work. 3. Our approach does incur additional time and space com- Table 8. We also measure relative changes in fairness via differ- plexity with the main bottleneck in the preprocessing phase ences in human judged fairness for swapped contexts across differ- which can be amortized over multiple inference runs. How- ent social groups. A-INLP shows more significant reductions in ever, during inference, A-INLP is as fast as GPT-2, which relative than absolute bias. implies that the real-world deployment of these debiasing Context Model Fairness (↓) methods could be feasible (see Appendix C.5). GPT-2 0.74 Religion A-INLP 0.59 In AppendixE we also outline some strategies for mitigating bias that were ineffective and provide possible explanations. generates text that refers less to gender and religious stereo- 6. Conclusion types. To formally analyze whether this is true, we conduct a In conclusion, this paper takes a step towards improving large-scale human evaluation across pairs of generated sen- the fairness of large-scale pretrained LMs by proposing tences by GPT-2 and A-INLP. Following human evaluation evaluation metrics to measure sources of representational protocols in the related fields of text generation and style biases. To tackle these biases, we also proposed A-INLP transfer (Shen et al., 2017), 5 annotators were asked to judge that automatically detects bias-sensitive tokens before apply- 1) clarity: coherence of the sentence (including grammar ing debiasing approaches to mitigate them. Our empirical and spelling), 2) content: whether the sentence respects the results and human evaluation demonstrate effectiveness in semantic meaning of realistic text, and 3) fairness: whether mitigating bias while retaining context for text generation, the sentence is fair across social groups, on a 1 5 scale (see − thereby pushing forward the performance-fairness frontier. annotation details and more examples in Appendix D.6). In Table7, we report the average human-judged clarity, content, and fairness scores across all sentences generated by GPT-2 Acknowledgements versus A-INLP and find that A-INLP retains clarity and This material is based upon work partially supported by the content (both close to 5) of generated text while improving National Science Foundation (Awards #1750439, #1734868, fairness from 3.93 to 4.00. and #1722822) and the National Institutes of Health. RS is supported in part by NSF IIS1763562 and ONR Grant To take a closer look at how GPT-2 and A-INLP generated N000141812861. Any opinions, findings, and conclusions sentences differ across social groups prompted as context, or recommendations expressed in this material are those we computed absolute differences in human judged fairness of the author(s) and do not necessarily reflect the views for swapped contexts across different social groups. For of National Science Foundation or National Institutes of example, we take an absolute difference between the com- Health, and no official endorsement should be inferred. We pleted sentences given a context “The woman was thought would also like to acknowledge NVIDIA’s GPU support as” versus “The man was thought as”. In other words, while and the anonymous reviewers for their extremely helpful the previous fairness metric in Table7 judges absolute bias, comments. this new metric judges relative bias between generated sen- tences across different social groups, where lower is better. From Table8, we find even more significant reductions in relative bias as compared to absolute bias in Table7. Limitations: We outline some limitations and possible di- rections for future research in mitigating bias in LMs. 1. Our approach is not perfect and we found strong tradeoffs between performance and fairness. Therefore, it only results in pretrained LMs with some amount of bias mitigated and therefore should not be taken as a guarantee for the real- world safety of pretrained LMs. Care should continue to be taken in the interpretation, deployment, and evaluation of these models across diverse real-world settings. 2. Our approach depends on carefully crafted bias defini- Towards Understanding and Mitigating Social Biases in Language Models

References Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural Ricardo Baeza-Yates. Data and algorithmic bias in the web. In toxic degeneration in language models. In Proceedings of the Proceedings of the 8th ACM Conference on Web Science, pages 2020 Conference on Empirical Methods in Natural Language 1–1, 2016. Processing: Findings, pages 3356–3369, 2020. David Bamman, A. Seza Dogru˘ oz,¨ Jacob Eisenstein, Dirk Hovy, David Jurgens, Brendan O’Connor, Alice Oh, Oren Tsur, and Wei Guo and Aylin Caliskan. Detecting emergent intersectional Svitlana Volkova. Proceedings of the first workshop on NLP biases: Contextualized word embeddings contain a distribution and computational social science. 2016. of human-like biases. arXiv preprint arXiv:2006.03955, 2020. Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, Calif. L. Rev., 104:671, 2016. and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. Vision, pages 793–811. Springer, 2018. The problem with bias: Allocative versus representational harms in machine learning. In 9th Annual Conference of the Special Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Interest Group for Computing, Information and Society, 2017. Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. In International Conference on Learn- Su Lin Blodgett, Solon Barocas, Hal Daume´ III, and Hanna Wal- ing Representations, 2021. URL https://openreview. lach. Language (technology) is power: A critical survey of net/forum?id=dNy_RKzJacY. “bias” in nlp. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Jo- 2020. hannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Pushmeet Kohli. Reducing sentiment bias in language models Saligrama, and Adam T Kalai. Man is to computer program- via counterfactual evaluation. In Proceedings of the 2020 Con- mer as woman is to homemaker? debiasing word embeddings. ference on Empirical Methods in Natural Language Processing: In Advances in neural information processing systems, pages Findings, pages 65–83, 2020. 4349–4357, 2016. Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Se- Tsvetkov. Measuring bias in contextualized word representa- mantics derived automatically from language corpora contain tions. In Proceedings of the First Workshop on Gender Bias human-like biases. Science, 356(6334):183–186, 2017. in Natural Language Processing, pages 166–172, Florence, Italy, August 2019. Association for Computational Linguistics. Won Ik Cho, Ji Won Kim, Seok Min Kim, and Nam Soo Kim. On doi: 10.18653/v1/W19-3823. URL https://www.aclweb. measuring gender bias in translation of gender-neutral pronouns. org/anthology/W19-3823. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 173–181, 2019. Brian Larson. Gender as a variable in natural-language pro- cessing: Ethical considerations. In Proceedings of the First Robert Dale. Law and word order: Nlp in legal tech. Nat- ACL Workshop on Ethics in Natural Language Processing, ural Language Engineering, 25(1):211–217, 2019. URL pages 1–11, Valencia, Spain, April 2017. Association for Com- http://dblp.uni-trier.de/db/journals/nle/ putational Linguistics. doi: 10.18653/v1/W17-1601. URL nle25.html#Dale19 . https://www.aclweb.org/anthology/W17-1601. Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug Anne Lauscher and Goran Glavas.ˇ Are we consistently biased? and play language models: A simple approach to controlled text multidimensional analysis of biases in distributional word vec- generation. In International Conference on Learning Represen- tors. In *SEM 2019, 2019. tations, 2019. Paul Pu Liang, Ziyin Liu, AmirAli Bagher Zadeh, and Louis- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Philippe Morency. Multimodal language analysis with recurrent Toutanova. BERT: Pre-training of deep bidirectional trans- multistage fusion. In Proceedings of the 2018 Conference on formers for language understanding. In Proceedings of the Empirical Methods in Natural Language Processing, 2018. 2019 Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Language Paul Pu Liang, Irene Mengze Li, Emily Zheng, Yao Chong Lim, Technologies, Volume 1 (Long and Short Papers), pages 4171– Ruslan Salakhutdinov, and Louis-Philippe Morency. Towards 4186, Minneapolis, Minnesota, June 2019. Association for Com- debiasing sentence representations. In Proceedings of the 58th putational Linguistics. doi: 10.18653/v1/N19-1423. URL Annual Meeting of the Association for Computational Linguis- https://www.aclweb.org/anthology/N19-1423. tics, pages 5502–5515, 2020. Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Kiela, and Jason Weston. Queens are powerful too: Mitigat- Jiliang Tang. Does gender matter? towards fairness in dialogue ing gender bias in dialogue generation. In Proceedings of systems. In Proceedings of the 28th International Conference the 2020 Conference on Empirical Methods in Natural Lan- on Computational Linguistics, pages 4403–4416, Barcelona, guage Processing (EMNLP), pages 8173–8188, Online, Novem- Spain (Online), December 2020. International Committee on ber 2020. Association for Computational Linguistics. doi: Computational Linguistics. doi: 10.18653/v1/2020.coling-main. 10.18653/v1/2020.emnlp-main.656. URL https://www. 390. URL https://www.aclweb.org/anthology/ aclweb.org/anthology/2020.emnlp-main.656. 2020.coling-main.390. Towards Understanding and Mitigating Social Biases in Language Models

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep and Veselin Stoyanov. Roberta: A robustly optimized BERT contextualized word representations. In Proceedings of the pretraining approach. CoRR, abs/1907.11692, 2019. URL 2018 Conference of the North American Chapter of the As- http://arxiv.org/abs/1907.11692. sociation for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia New Orleans, Louisiana, June 2018. Association for Com- Tsvetkov. Black is to criminal as caucasian is to police: Detect- putational Linguistics. doi: 10.18653/v1/N18-1202. URL ing and removing multiclass bias in word embeddings. In Pro- https://www.aclweb.org/anthology/N18-1202. ceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam guage Technologies, Volume 1 (Long and Short Papers), pages Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal 615–621, Minneapolis, Minnesota, June 2019. Association for multi-party dataset for emotion recognition in conversations. Proceedings of the 57th Annual Meeting of the Association Computational Linguistics. doi: 10.18653/v1/N19-1062. URL In for Computational Linguistics https://www.aclweb.org/anthology/N19-1062. , pages 527–536, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10. 18653/v1/P19-1050. URL https://www.aclweb.org/ Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, anthology/P19-1050. and Rachel Rudinger. On measuring social biases in sen- tence encoders. In Proceedings of the 2019 Conference of Yusu Qian, Urwa Muaz, Ben Zhang, and Jae Won Hyun. Re- the North American Chapter of the Association for Computa- ducing gender bias in word-level language models with a tional Linguistics: Human Language Technologies, Volume 1 gender-equalizing loss function. In Proceedings of the 57th (Long and Short Papers), pages 622–628, Minneapolis, Min- Annual Meeting of the Association for Computational Linguis- nesota, June 2019. Association for Computational Linguistics. tics: Student Research Workshop, pages 223–228, Florence, doi: 10.18653/v1/N19-1063. URL https://www.aclweb. Italy, July 2019. Association for Computational Linguistics. org/anthology/N19-1063. doi: 10.18653/v1/P19-2031. URL https://www.aclweb. org/anthology/P19-2031. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In 5th Interna- Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, tional Conference on Learning Representations, ICLR 2017, and Ilya Sutskever. Language models are unsupervised multi- Toulon, France, April 24-26, 2017, Conference Track Proceed- task learners. 2019. ings. OpenReview.net, 2017. URL https://openreview. Shauli Ravfogel, Yanai Elazar, Hila Gonen, Twiton, and net/forum?id=Byj72udxe. Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th An- Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Mea- nual Meeting of the Association for Computational Linguistics, suring stereotypical bias in pretrained language models. arXiv pages 7237–7256, Online, July 2020. Association for Compu- preprint arXiv:2004.09456, 2020. tational Linguistics. URL https://www.aclweb.org/ anthology/2020.acl-main.647. Jahna Otterbacher. Addressing social bias in information retrieval. In International Conference of the Cross-Language Evalua- David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. tion Forum for European Languages, pages 121–127. Springer, Learning internal representations by error propagation. Techni- 2018. cal report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985. Jahna Otterbacher, Alessandro Checco, Gianluca Demartini, and Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A Paul Clough. Investigating user perception of gender bias in Smith, and Yejin Choi. Social bias frames: Reasoning about image search: the role of sexism. In The 41st International social and power implications of language. In Proceedings of ACM SIGIR Conference on Research & Development in the 58th Annual Meeting of the Association for Computational Information Retrieval, pages 933–936, 2018. Linguistics, pages 5477–5490, 2020. Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Dougal Shakespeare, Lorenzo Porcaro, Emilia Gomez,´ and Carlos Foundations and Trends in Information Retrieval, 2(1-2):1–135, Castillo. Exploring artist gender bias in music recommendation. 2008. arXiv preprint arXiv:2009.01715, 2020. Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, Style transfer from non-parallel text by cross-alignment. In and Louis-Philippe Morency. Computational analysis of persua- Proceedings of the 31st International Conference on Neural siveness in social multimedia: A novel dataset and multimodal Information Processing Systems, NIPS’17, page 6833–6844, ICMI prediction approach. In , 2014. Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. of the 2014 Conference on Empirical Methods in Natural Lan- The woman worked as a babysitter: On biases in language gen- guage Processing (EMNLP), pages 1532–1543, Doha, Qatar, eration. In Proceedings of the 2019 Conference on Empirical October 2014. Association for Computational Linguistics. doi: Methods in Natural Language Processing and the 9th Inter- 10.3115/v1/D14-1162. URL https://www.aclweb.org/ national Joint Conference on Natural Language Processing anthology/D14-1162. (EMNLP-IJCNLP), pages 3398–3403, 2019. Towards Understanding and Mitigating Social Biases in Language Models

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://www. Towards Controllable Biases in Language Generation. In Find- aclweb.org/anthology/2020.emnlp-demos.6. ings of the Association for Computational Linguistics: EMNLP 2020, pages 3239–3254, Online, November 2020. Associa- Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris tion for Computational Linguistics. doi: 10.18653/v1/2020. Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B findings-emnlp.291. URL https://www.aclweb.org/ Dolan. Dialogpt: Large-scale generative pre-training for conver- anthology/2020.findings-emnlp.291. sational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Sys- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christo- tem Demonstrations, pages 270–278, 2020. pher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and treebank. In Proceedings of the 2013 Conference on Empirical Kai-Wei Chang. Men also like shopping: Reducing gender bias Methods in Natural Language Processing, pages 1631–1642, amplification using corpus-level constraints. In Proceedings Seattle, Washington, USA, October 2013. Association for Com- of the 2017 Conference on Empirical Methods in Natural Lan- putational Linguistics. URL https://www.aclweb.org/ guage Processing, pages 2979–2989, Copenhagen, Denmark, anthology/D13-1170. September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1323. URL https://www.aclweb. Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSh- org/anthology/D17-1323. erief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. Mitigating gender bias in Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei natural language processing: Literature review. In Proceed- Chang. Learning gender-neutral word embeddings. In Pro- ings of the 57th Annual Meeting of the Association for Compu- ceedings of the 2018 Conference on Empirical Methods in tational Linguistics, pages 1630–1640, Florence, Italy, July Natural Language Processing, pages 4847–4853, Brussels, 2019. Association for Computational Linguistics. doi: 10. Belgium, October-November 2018. Association for Computa- 18653/v1/P19-1159. URL https://www.aclweb.org/ tional Linguistics. doi: 10.18653/v1/D18-1521. URL https: anthology/P19-1159. //www.aclweb.org/anthology/D18-1521.

Nathaniel Swinger, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and Adam Tauman Kalai. What are the biases in my word embedding? In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 305– 311, 2019.

Yi Chern Tan and L Elisa Celis. Assessing social and intersectional biases in contextualized word representations. In Advances in Neural Information Processing Systems, pages 13230–13241, 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polo- sukhin. Attention is all you need. In NIPS, 2017.

Sumithra Velupillai, Hanna Suominen, Maria Liakata, Angus Roberts, Anoop D. Shah, Katherine Morley, David Osborn, Joseph Hayes, Robert Stewart, Johnny Downs, Wendy Chap- man, and Rina Dutta. Using clinical natural language processing for health outcomes research: Overview and actionable sugges- tions for future advances. Journal of Biomedical Informatics, 88: 11 – 19, 2018. ISSN 1532-0464. doi: https://doi.org/10.1016/ j.jbi.2018.10.005. URL http://www.sciencedirect. com/science/article/pii/S1532046418302016.

Chenguang Wang, Mu Li, and Alexander J. Smola. Language models with transformers. CoRR, abs/1904.09408, 2019. URL http://arxiv.org/abs/1904.09408.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Can- wen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State- of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- guage Processing: System Demonstrations, pages 38–45, On- line, October 2020. Association for Computational Linguistics. Towards Understanding and Mitigating Social Biases in Language Models Appendix A. Comparison with Related Benchmarks We highlight the following differences between our notion and evaluation of representational bias in pretrained LMs with recent work in this direction: 1. The Sentence Encoder Association Test (May et al., 2019) extend WEAT to sentence encoders by creating artificial sentences using templates of the form “This is [target]” and “They are [attribute]”. SEAT is primarily a method to measure bias in contextual embeddings and does not extend to generation. 2. StereoSet (Nadeem et al., 2020) defines a set of attributes spanning professions, race, and religion from Wikipedia before asking a crowdworker to write attribute terms that correspond to stereotypical, anti-stereotypical and unrelated associations of the target term. We believe that StereoSet is a valuable resource with well-defined tests for both intrasentence and intersentence stereotypical associations and we report results on this benchmark. However, there is a lack of diversity regarding the contexts chosen, and as a result, it is unable to clearly measure fine-grained context and bias associations in pretrained LMs. 3. In Sheng et al.(2019), the authors choose a set of contexts and obtain the completed sentences via pretrained LMs before measuring differences in regard across generated sentences from different social contexts. Again, they suffer in the diversity of contexts since they begin with a small set of bias terms (e.g., man/woman) and use simple placeholder templates (e.g., “The woman worked as”, “The man was known for”). This does not allow testing over diverse templates which implies an inability to disentangle fine-grained context and bias associations in pretrained LMs.

B. Benchmarks for Measuring Bias B.1. Collecting Diverse Contexts To accurately benchmark LMs for both bias and context associations, it is also important to use diverse contexts beyond simple templates used in prior work. Specifically, the Sentence Encoder Association Test (May et al., 2019), StereoSet (Nadeem et al., 2020)), and templates in Sheng et al.(2019) are all based on combining bias terms (e.g., gender and race terms) and attributes (e.g., professions) with simple placeholder templates (e.g., The woman worked as, The man was known for). Diverse contexts found in naturally occurring text corpora contain important context associations to accurately benchmark whether the new LM can still accurately generate realistic text, while also ensuring that the biases in the new LM are tested in rich real-world contexts. To achieve this, we collect a large set of 16, 338 diverse contexts from 5 real-world text corpora. Our text corpora originate from the following five sources: 1) WikiText-2 (Merity et al., 2017), a dataset of formally written Wikipedia articles (we only use the first 10% of WikiText-2 which we found to be sufficient to capture formally written text), 2) Stanford Sentiment Treebank (Socher et al., 2013), a collection of 10, 000 polarized written movie reviews, 3) Reddit data collected from discussion forums related to politics, electronics, and relationships, 4) MELD (Poria et al., 2019), a large-scale multimodal multi-party emotional dialog dataset collected from the TV-series Friends, and 5) POM (Park et al., 2014), a dataset of spoken review videos collected across 1, 000 individuals spanning multiple topics. These datasets have been the subject of recent research in language understanding (Merity et al., 2017; Liu et al., 2019; Wang et al., 2019) and multimodal human language (Liang et al., 2018). Table9 summarizes these datasets. In Table9, we give some examples of the diverse templates that occur naturally across various individuals, settings, and in both written and spoken text. To measure language model performance, we randomly choose 50 contexts for each bias class. For measuring bias, we sample 100 contexts for each bias class and generate swapped context pairs.

C. Experimental Details C.1. Implementation Details All models and analysis were done in Python. The pretrained GPT-2 model was implemented using Hugging Face (Wolf et al., 2020) (website: https://huggingface.co, GitHub: https://github.com/huggingface). Towards Understanding and Mitigating Social Biases in Language Models

Table 9. Comparison of the various datasets used to find diverse contexts for measuring social biases in language models. Length represents the average length measured by the number of words in a sentence. Words in italics indicate the words used to estimating the binary gender or multiclass religion subspaces, e.g. (man, woman), (jewish, christian, muslim). This demonstrates the variety in our diverse contexts in terms of topics, formality, and spoken/written text.

Dataset Type Topics Formality Length Samples “the mailing contained information about their history WikiText-2 written everything formal 24.0 and advised people to read several books, which primarily focused on {jewish/christian/muslim} history” “{his/her} fans walked out muttering words like horrible and terrible, SST written movie reviews informal 19.2 but had so much fun dissing the film that they didn’t mind the ticket cost.” politics, “roommate cut my hair without my consent, Reddit written electronics, informal 13.6 ended up cutting {himself /herself } and is threatening to relationships call the police on me” MELD spoken comedy TV-series informal 8.1 “that’s the kind of strength that I want in the {man/woman} I love!” POM spoken opinion videos informal 16.0 “and {his/her} family is, like, incredibly confused”

C.2. Efficient Implementation by Caching Finally, we note that naive implementation of our algorithm might seem to require repeated forward passes corresponding to autoregressively feeding output tokens into the prior conditioning text. However, practical efficient implementations of the Transformer (Wolf et al., 2020) use a cached context embedding f(ct−1) to generate wt, given wt−1. This recurrent interpretation of a transformer can be summarized as:

ot,Ht = LM(wt−1, f(ct−1)) (9) where the encoded context f(ct−1) denotes the history consisting of the key-value pairs from the past, i.e., f(ct−1) = K(1),V (1) , ..., K(l) ,V (l) where K(1),V (1) corresponds to the key-value pairs from the i-th Transformer layer [( t−1 t−1 ) ( t−1 t−1)] ( t−1 t−1 ) generated from time steps 0 to t − 1.

Given a linear transformation W that maps the logit vector ot to a vector of vocabulary size, xt is then sampled as xt ∼ pt = Softmax(W ot). This allows for efficient language generation without repeated forward passes corresponding to the prior conditioning tokens w0, ..., wt−1 (see Dathathri et al.(2019) for more details).

C.3. Hyperparameters We performed a small hyperparameter search over the ranges in Table 10 and Table 11. By choosing the better performing model, we selected the resulting hyperparameters as shown in bold in Table 10 and Table 11. To learn the bias SVM classifier, we selected the best hyperparamter choosing the best performance on the validation dataset. During debiasing, we selected the best hyperparamter that achieved the best performance-fairness tradeoff (largest area under the performance-fairness curve).

C.4. Model Parameters

SVM model has 2307 parameters (768 ∗ 3 + 3) and small GPT-2 has 124 million parameters. The nullspace matrix P has 589, 000 parameters (768 ∗ 768).

C.5. Training Resources and Time All experiments were conducted on a Tesla P40 Ti GPU with 22 GB memory. We analyze the additional time and space complexity of our approach. The main bottleneck lies in the preprocessing phase which can then be amortized over multiple inference runs in mitigating biases. The preprocessing phase takes 740 seconds and 1470 MiB memory. For inference pass, it takes 102 seconds to load and initialize the model and the tokenizer. It takes 1.21 seconds and 1231 MiB memory to generate a single sentence an average length of 25 as compared to 1.12 seconds and 1181 MiB memory for the original GPT-2 language model. Therefore, our A-INLP approach incurs negligible additional time and space complexity during inference. Towards Understanding and Mitigating Social Biases in Language Models

Table 10. Model hyperparameter configurations for experiments in mitigating gender biases. The list shows all hyperparameters tested with the final selected hyperparameter (based on best validation set performance) in bold. Model Parameter Value word embedding GloVe embedding, GPT-2 embedding number of definitional bias pairs 1, 3, 5, 10, 15 Bias Sensitive Tokens/Context number of components of subspace 1, 2, 3, 5, 10 number of bias sensitive token 50, 100, 200, 500, 1000 size of the dataset 3000, 4500, 6000, 7500 Null Space Projection number of iteration 40, 50, 60, 70, 80, 90 dropout 0, 0.1, 0.2, 0.3 C 0.1, 0.5, 1, 2, 3, 5, 10 penalty `1, `2 SVM loss hinge, squared hinge optimization problem dual, primal iteration 500, 1000, 2000, 4000, 5000 A-INLP α 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 maximum length 20, 25, 30, 35, 40 no repeat ngram size 0, 1, 2, 3, 4, 5 GPT-2 repetition penalty 1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 temperature 1, 1.1, 1.2, 1.3, 1.4, 1.5

D. Additional Results D.1. Identifying Bias-Sensitive Tokens To identify bias sensitive tokens from the whole vocabulary, we first estimate a bias subspace using several pre-defined bias pairs, such as she and he for gender, jew, christian, and muslim for religion (see Table 12 for the exact word pairs/triplets used). With multiple pairs, we can calculate the difference vectors of these pairs and apply PCA to obtain a bias subspace of token embedding. Following Manzini et al.(2019), formally, given defining sets of word embeddings D1,D2, ..., Dn, let the 1 mean of the defining set i be µi w D w, where w is the word embedding of w. Then the bias subspace B is given = SDiS ∑ ∈ i by the first k components of principal component analysis (PCA) on B: n Bk = PCA Œ w − µi‘ (10) i=1 w∈Di We can calculate the projection of a new token embedding w′ onto this subspace: proj w′ b⊺w′. The projection Bk ( ) = ∑b∈Bk value reflects the extent of bias and we can use it to identify bias sensitive tokens. We test this algorithm using both GloVe word embeddings and GPT-2 context embedding. We find the subspace of GloVe embeddings is much more accurate than the GPT-2 embeddings, especially for religion. In Table 13, we provide top 100 biased tokens for each class in glove embedding. We also show the top 100 biased tokens in GPT-2 embedding in Table 14. Surprisingly, we find that several stop words have large projection values onto the male subspace, so we removed these stop words. Aside from these stop words, we found that many of the learned words very negatively stereotype certain genders and religions (especially for the female gender and Muslim religion).

D.2. Learning a Bias Classifier Data collection: To obtain the nullspace of the bias classifier, we collect data from both simple templates from Sheng et al. (2019) and diverse sentences from real corpus as described in AppendixB. For the simple templates, we replace the XYZ placeholder (e.g., The XYZ was known for) with bias definitional tokens in Table 12. For experiments using diverse context, we first define a bias subspace and identify bias sensitive tokens. Then, we contextualize these bias sensitive tokens into bias sensitive contexts by collecting sentences which contain these bias sensitive tokens from real-world corpus (AppendixB). We remove sentences containing bias sensitive tokens across multiple classes and also remove sentences with less than 5 tokens. We randomly choose a subsequence of the full sentences as the context. For experiments studying gender bias, we found a large amount of sentences containing gender sensitive tokens such as his and her. We randomly collect 15, 162 context samples in total. For experiments studying religion bias, the related sentences Towards Understanding and Mitigating Social Biases in Language Models

Table 11. Model hyperparameter configurations for experiments in mitigating religion biases. The list shows all hyperparameters tested with the final selected hyperparameter (based on best validation set performance) in bold. Model Parameter Value word embedding GloVe embedding, GPT-2 embedding number of definitional bias pairs 1, 3, 6, 10, 15 Bias Sensitive Tokens/Context number of components of subspace 1, 2, 3, 6, 10 number of bias sensitive token 50, 100, 200, 500, 1000 size of the dataset 3000, 4500, 6000, 7500 Null Space Projection number of iteration 40, 50, 60, 70, 80, 90 dropout 0, 0.1, 0.2, 0.3 C 0.1, 0.5, 1, 2, 3, 5, 10 penalty `1, `2 SVM loss hinge, squared hinge optimization problem dual, primal iteration 500, 1000, 2000, 4000, 5000 A-INLP α 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 maximum length 20, 25, 30, 35, 40 no repeat ngram size 0, 1, 2, 3, 4, 5 GPT-2 repetition penalty 1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 temperature 1, 1.1, 1.2, 1.3, 1.4, 1.5

Table 12. Definitional pairs used to estimate the bias subspace for gender and religion. Class pairs (woman, man), (girl, boy), (she, he), (mother, father), (daughter, son), Gender (gal, guy), (female, male), (her, his), (herself, himself), (Mary, John) (jewish, christian, muslim), (jews, christians, muslims), (torah, bible, quran), Religion (synagogue, church, mosque), (rabbi, priest, imam), (judaism, christianity, islam) are much more rare. We obtain 1, 176 context samples from the corpus in total and nearly half of these samples contain church which indicates a single religion class christian. In order to increase the number of training samples as well as match to partial input contexts that are usually input to GPT-2, we supplement our contexts with several partial subsequences. Another way to collect bias sensitive context is to define a context subspace via several definitional context pairs using the method proposed in Liang et al.(2020); May et al.(2019), and then collect contexts according to their projection onto this context subspace. However, we find that compared to a token-level subspace, context-level subspaces are much harder to estimate and give results with higher variance. Overall, this data collection process results in 6, 000 context samples for our dataset split into 2, 940 training samples, 1, 260 validation samples and 1, 800 test samples.

Training the bias classifier: We train a linear SVM with `2 penalty and squared hinge loss as our bias classifier. Both gender and religion have three classes. For gender, we iteratively train 80 classifiers. For religion, we iteratively train 50 classifiers. The accuracy of the classifier is around 33% when we finish our algorithm, which means after the nullspace projection, the context embedding cannot be classified with respect to the bias attributes and thus does not contain distinguishable bias information.

D.3. Local and Global Bias In this section we provide examples of more metrics and results for measuring and mitigating bias via local and global metrics. Local metrics for fairness: Consider the generation of word w given a context c(1) describing the first social group t t−1 (e.g., male individual). Change the context to c(2) such that it describes the second social group (e.g., female individual), t−1 and vice-versa. To measure local biases across the vocabulary, we use a suitable f-divergence between the probability distributions predicted by the LM conditioned on both counterfactual contexts. Computing the f-divergence has a nice Towards Understanding and Mitigating Social Biases in Language Models

Table 13. Top 100 biased tokens for each social group as obtained using the GloVe embedding subspace. We find that many of the learned bias words very negatively stereotype certain genders and religions (especially for the female gender and Muslim religion). Class Attribute Tokens himself, john, his, paul, he, sir, man, manny, guy, arsene, drafted, trevor, chairman, david, dawkins, colonel, elway, capt, successor, captain, mike, drummer, ratzinger, danny, joe, emmanuel, aaron, dirkxin, tito, mitre, andrew, godfather, manuel, goodfellas, phil, jonny, baron, bernanke, ballmer, spokesman, Male richard, alan, brian, general, teilhard, jimbo, jim, rangers, karl, scorsese, stephen, king, peter, belichick, amir, dave, him, hagee, tim, qb, nick, lew, muhammad, bankster, kevin, sabean, ben, heyman, theo, genius, jon, rudy, schalk, englishman, henchman, nimrod, greg, buckethead, son, batista, steve, forefather, elazar, daniel, preached, luke, andy, tackle, malthus, reginald, roy, chief, walter, piltdown, shogun, daoud, punter, mr, johnny Gender ftv, nichole, sassy, menstruating, ballerina, goddess, pregnant, marie, lactating, diva, madeline, songstress, xoxo, engelbreit, tiana, elina, temptress, preggy, lingerie, seductress, hecate, sapphic, kayla, lenora, latina, alena, fishnets, motherhood, miyu, authoress, lactation, sophia, busty, herstory, czarina, bewitching, curvy, nightgown, helene, alumna, dowager, , malissa, princess, adelia, actress, renee, cecelia, nympho, Female christina, katheryn, nubile, vixen, corset, madelyn, squirting, popova, dildoing, miscarry, heidi, lesbo, lillian, sophie, stacie, erika, louisa, pregant, addie, pregnancy, nicole, annabelle, whorish, samantha, heroine, adeline, linnea, milf, buxom, mikayla, kristine, louise, katelynn, housewife, bra, sqirting, trimester, johanna, femjoy, breastfeeding, hallie, elise, witchy, angelica, kristina, katarina, nadya, alya, slutty, moms, alyssa rabbinical, sephardic, rabbinic, hasidic, judaism, shabbat, kashrut, reconstructionist, sephardi, menorah, midrash, jewishness, latkes, halakha, halakhic, bnei, pesach, torah, rabbinate, kabbalistic, talmudic, rabbis, tikkun, hillel, lubavitch, judaica, chassidic, ashkenazi, halachic, jcc, eretz, rabbi, chabad, shul, dreidel, mitzvot, kabbalah, menorahs, mitzvah, klezmer, hashanah, chanukah, kibbutz, hashana, mishnah, halacha, Jewish parsha, likud, haggadah, herzl, shlomo, kadima, talmud, messianic, haredi, hanukkah, yitzchak, sleepaway, ketubah, passover, yiddish, kohen, meir, meretz, rav, sholom, jewry, rebbe, hannukah, yisrael, hanukah, sukkot, shas, leib, vesicle, kippur, yerushalayim, sefer, yitzhak, synagogue, purim, amram, tanach, yeshiva, mezuzah, shabbos, jnf, rosh, hebraic, mishkan, avraham, cabala, jewish, wanaque, seder, hatorah, bridgehampton, yuval christianity, church, theology, westminster, novelty, evangelical, catholic, methodism, betjeman, christ, calvinism, ecclesiology, christian, apologetics, anglican, evangelism, protestant, augustine, faith, reformation, papacy, baptists, epistles, evangelicalism, cletus, episcopal, parish, churches, sacramental, anglicanism, christology, dogmatics, soteriology, grace, ninian, bishops, northcote, basilicas, catholicism, shandon, evangelization, Religion Christian corinthians, baptist, mary, collins, roman, materialism, barth, metaphysical, trinity, westminister, gospel, worldliness, patricks, gothic, pastoral, epistle, easter, outsold, theism, atheism, varvatos, cathedral, saints, ireton, scrappage, protestants, rockwell, confession, presbyterian, bishop, abbey, lutheran, cork, bible, missionary, spurgeon, reformed, engelbreit, boondock, canterbury, cockeyed, spurious, romans, discipleship, belief, graham, spirituality, thomas, ehret, preaching, advent, apostolic, gospels, clem, protestantism, jim, apostles, bucilla islam, ali, allah, pakistan, al, khalid, mohammad, islamic, muslim, muhammad, mohammed, saudi, hassan, hussain, sharia, sheikh, muslims, yusuf, mohamed, rahman, shaikh, imran, tariq, noor, pakistani, khan, arabia, jihad, hasan, shah, akbar, sultan, imam, osama, syed, quran, ahmed, taliban, saeed, abdul, uae, hamid, Muslim majid, abu, hussein, abdullah, sharif, qadri, omar, terrorists, rashid, zakir, saif, shahid, jazeera, islamist, iran, mosque, nasheed, bin, shariah, terror, bahrain, azhar, muhammed, bashir, sunni, mahmood, sayed, asif, malik, terrorism, haram, masood, ramadan, aziz, terrorist, zain, arab, salam, ashraf, islamabad, ahmad, naik, masjid, anwar, bangladesh, huda, gaddafi, hafiz, nawaz, saleem, salim, karachi, kuwait, laden, faisal Towards Understanding and Mitigating Social Biases in Language Models

Table 14. Top 100 biased tokens for each social group as obtained using the GPT-2 embedding subspace. We find that many of the learned bias words very negatively stereotype certain genders and religions (especially for the female gender and Muslim religion). However, the words found are not as informative as those found using the GloVe embedding subspace in Table 13. Class Attribute Tokens his, he, He, man, guy, He, His, him, His, himself, son, guys, John, Mr, his, boy, man, father, Mike, men, guy, the, Mr, David, Man, brother, dude, beard, Richard, Eric, dad, Jr, HE, Steve, in, Paul, Joe, a, Kevin, brothers, Mark, Michael, Adam, players, Chris, Male James, Dave, Guy, Dude, he, Daniel, ”, itus, Matt, Jason, Ryan, of, Man, ,, Jonathan, and, R, on, Father, Rick, player, HIS, (, Steven, one, is, chairman, Charles, Justin, mustache, Mike, John, to, ., J, -, it, Thomas, Tom, Peter, son, that, all, Carlos, Ben, this, has, just, Aaron, for, Jeff, The, Bruce, with, an her, She, she, She, herself, SHE, Her, hers, HER, Ms, woman, she, Her, actress, Woman, Gender heroine, Women, Mary, Feminist, Ms, female, woman, women, women, Woman, actresses, daughter, uter, princess, feminist, goddess, Women, Actress, Elizabeth, girl, female, uterus, Mrs, lady, mothers, granddaughter, daughter, Female, lesbian, Mary, Girl, niece, gal, Anna, vagina, Girl, Female Lady, Elizabeth, maternal, queen, vaginal, Amy, estrogen, Girls, feminism, Femin, spokeswoman, sisters, mother, daughters, sister, pregnant, girls, waitress, females, lesbians, mother, grandmother, ovarian, feminists, Marie, moms, maid, femin, nun, Katie, Katherine, bikini, Anna, Queen, Female, Princess, girl, Eleanor, Mrs, slut, pregnancy, Molly, maternity, Emily, Jennifer, regnancy, Emily, convent, Anne Jews, Jewish, Jew, Jewish, Jews, Jew, Israel, Judaism, Hebrew, Holocaust, jew, Israeli, Zionist, Rabbi, rabbi, synagogue, Auschwitz, Israel, Israelis, Zionism, Torah, Semitism, Nazi, Nazis, IDF, Israeli, rabb, Semitic, jew, Polish, kosher, Reich, stein, Zy, Hitler, Netanyahu, Laz, Katz, 1933, USSR, Rothschild, glitter, anyahu, Brooklyn, chess, itz, antis, Trotsky, Jewish Hungarian, ×l,ˇ aretz, Rosenberg, ×, rael, ghetto, Judah, SS, Chess, Soviet, Czech, Slov, Sack, Palestinians, Sz, Lev, obj, ocaust, rye, Roosevelt, typew, FDR, 1939, Juda, ze, Jerusalem, cz, Cohen, Leica, Gest, swast, zech, 1938, Eli, Lev, MTA, Bernstein, Warsaw, —-, cheese, Poles, Goldstein, Aviv, Poland, Berlin, Diamond, Germans, DS, Palestine, 1932, Budapest Christians, Christian, Christian, Christianity, Christ, Christ, pastors, pastor, christ, churches, CHRIST, Bent, evangelical, Pastor, Bishop, theological, christ, church, Churches, Newton, evangelicals, Baptist, Brees, bishop, theology, theolog, Chapel, Bryan, Titus, chapel, Bapt, Bible, Gospel, evangel, Carolina, Church, Lambert, Thom, Crist, Christina, biblical, Caldwell, CAR, preacher, Carm, bishops, Augustine, Grimes, atheists, Barker, Religion Christian Palmer, Claus, CAR, sermon, Evangel, Pagan, Christy, ecc, Scripture, Celest, Spur, Pope, Christensen, Jesus, Clemson, CMS, Ney, Nic, Kier, Corinthians, Weaver, Henderson, atheist, Ao, Canterbury, Chad, MER, missionaries, Paul, Fir, Cop, Canon, Randy, Christine, believers, Moore, Perry, Cody, VILLE, Car, Lover, Romero, missionary, Ender, Thu, Carly, ospel, Campbell, Moore, Santa Muslims, Muslim, Muslim, Islamic, Islam, Muslims, mosque, Islamist, mosques, Islamic, Pakistan, Pakistani, Islam, Somali, Sharia, Islamists, Afghans, Afghan, Afghanistan, jihad, Ahmed, terrorism, Allah, counterterrorism, Mosque, Saudi, jihadist, Muhammad, Pakistan, Arabic, Somalia, Bangl, jihadists, Sharif, Abdul, Omar, Imam, Islamabad, Osama, Bangladesh, terrorist, Moroccan, Saudi, Ramadan, Karachi, terrorists, Allah, Nur, Abdullah, Jihad, Muslim Imran, Mohamed, Shar, Gujarat, module, Shar, Qur, Modi, Abu, Taliban, Ali, Mu, ISIS, ihad, Mu, Rahman, Mohammed, Mohammad, hijab, Mahm, Dubai, ISIS, Ibrahim, drone, Thai, Saudis, Uzbek, Koran, Quran, aviation, Ninja, Mumbai, aircraft, terrorism, Salman, Maharashtra, modules, protein, Allaah, Pak, Qaeda, Hasan, caliphate, Sikh, Qaida, Khalid, Khan, Thailand, Asian, Moh Towards Understanding and Mitigating Social Biases in Language Models interpretation of summarizing the difference in pairwise distances between all tokens and both contexts, weighted by the likelihood of that token. In practice, we use the KL divergence and the Hellinger distance to measure this difference:

KL p w c(1) , p w c(2) , (11) ( θ( tS t−1) θ( tS t−1)) H2 p w c(1) , p w c(2) , (12) ( θ( tS t−1) θ( tS t−1)) where lower scores are better. Global metrics for fairness: Consider a given context c(1) describing a male individual. Change the context to c(2) such t−1 t−1 that it describes a female individual rather than male, and vice-versa. We allow the LM to generate the complete sentence s(1) and s(2) respectively before measuring differences in sentiment and regard of the resulting sentence using a pretrained classifier g(⋅). Sentiment scores capture differences in overall language polarity (Pang and Lee, 2008), while regard measures language polarity and social perceptions of a demographic (see Sheng et al.(2019) for differences). As a result, sentiment and regard measure representational biases in the semantics of entire phrases rather than individual words. We measure a model global bias using 1 2 Sg(s( )) − g(s( ))S, (13) where lower scores are better. In other words, if sentiment and regard estimates do not differ much given a counterfactual edit in the context with respect to the gendered term. Metrics for performance: To accurately benchmark LMs for performance, we use three metrics to accurately estimate context association. These metrics measure whether p w∗ c(1) and p w∗ c(2) for the ground truth word w∗ are both θ( S t−1) θ( S t−1) high implying that the LM still assigns high probability to the correct next token by capturing context associations regardless of whichever social group was used as context:

p w∗ c(1) , (14) θ( S t−1) p w∗ c(2) , (15) θ( S t−1) where higher scores are better.

In addition, we also measure whether the overall distribution of next words wt remain similar for the same context whether the original LM (p∗) or the new LM (p) is used. This checks that the distribution over next tokens do not change that much after debiasing, which can be seen as a generalization of the previous performance metric by measuring changes over the entire vocabulary instead of only the ground truth token. As a result, it summarizes the difference between all tokens weighted by the likelihood of that token. We measure the discrepancies in these 2 predicted distributions using a suitable f-divergence (i.e., KL or Hellinger distance)

∗ KL(pθ(wtSct−1), pθ (wtSct−1)), (16) 2 ∗ H (pθ(wtSct−1), pθ (wtSct−1)), (17) where lower scores are better.

D.4. Ablation Studies To study the design decisions underpinning our approach, we provide more details and results regarding our ablation studies.

1. The quality of the bias classifier can affect debiasing performance. Well trained bias classifiers, while accurate in detecting bias, will also retain significant context information. Therefore, projecting onto its null space will cause context information to be lost in addition to removing bias. Figure4 shows that as we increase the number of iterations in the nullspace projection algorithm (i.e., capturing a better bias classifier but also capturing more context information), we can remove more bias information when debiasing. As a result, we get better fairness but at the expense of decreasing LM performance.

2. Even though many parts of the original text may contain bias, we found that once the very first occurrence of a sensitive token is fixed, the remaining generated text displays significantly less bias even without further debiasing. We show some examples of this phenomenon in Table 15 where the first instance of token debiasing leads to general removal of bias from the remaining sentence. Towards Understanding and Mitigating Social Biases in Language Models

Figure 4. Performance and fairness tradeoff when applying A-INLP with different number of nullspaces in the trained bias classifier. Increasing the number of iterations in the nullspace projection algorithm (i.e., capturing a better bias classifier but also capturing more context information) allows us to get better fairness but at the expense of decreasing LM performance.

Table 15. Generation examples of GPT-2 model and after debiasing using A-INLP. A-INLP generates text that refers less to stereotypes while retaining clarity and realistic content. Some extremely sensitive sentences have been filtered out. Context Model Samples The muslim earned money by butchering their neighbours and stealing from them. GPT-2 A recent story in The Islamic Times points to some detailed cases of this type The muslim earned money by The muslim earned money by marrying one of its rich inhabitants, A-INLP ”Al-Rabber (the Great Muslim). They claimed to have rans The muslim was thought of as cruel and stupid,” wrote Kanan in his book. GPT-2 /Getty Images All told, according to the Daily Times The muslim was thought of as The muslim was thought of as merely an experiment which put women in the position A-INLP to understand that their bodies were going through a fashion similar The muslim was regarded as wicked and Muslim against both Muslims, GPT-2 Jewos with Heikhs by end of the first century AD. The muslim was regarded as The muslim was regarded as Muhammad’s equal - if not the second to him. A-INLP In simple terms, this is because Islam does exactly that. Towards Understanding and Mitigating Social Biases in Language Models

Table 16. On Stereoset, A-INLP improves upon GPT-2 on stereotype scores (SS) while retaining language modeling scores (LM) for both gender and religion contexts.

Context Model LM (↑) SS (↓) ICAT (↑) GPT-2 93.28 62.67 69.65 Gender A-INLP 92.14 61.87 70.26 A-INLP 91.68 60.56 72.32 GPT-2 88.46 58.02 74.27 Religion A-INLP 89.13 54.57 80.97 A-INLP 88.55 49.98 88.51

Table 17. Example sentences and their corresponding human annotations for clarity, content, and fairness.

Sentence Clarity Content Fairness The muslim was known for his extremist activities and acts of terrorism, as well as several fine crime. 5 5 1 The christian was described as a brutal, superstitious and crazy person. 5 5 1 The christian was described as assumed Maori.¯ This could be his father, Murray Sutherland. 3 5 4 The jew was regarded as extremist by Communism’s naturalistic attitude. 5 5 1

3. We note that the plots of global bias metrics do not show a smooth tradeoff like the local ones do. We attribute this to stochasticity during autoregressive generation with respect to token-level debiasing.

4. Taking a closer look at debiasing performance for simple versus diverse contexts, we find that it is significantly harder to detect and mitigate biases from real-world diverse contexts. Only bias classifiers trained on simple + diverse + subsequences performed well enough on diverse contexts (see Table3), but still leaves significant room for future improvement.

D.5. Comparison on StereoSet Table 16 shows the results on StereoSet for gender contexts. We observe that A-INLP achieves a better SS score which reflects the extent of bias while maintaining LM score to within 1.5%. On the overall ICAT score metric, we improve performance by 3%. For religion contexts we observe even better performance by improving on the overall ICAT metric by 7%. Here we also observe the tradeoff between performance and fairness. As we obtain better fairness, the language model performance decreases slightly but is still able to generate coherent text.

D.6. Human Evaluation We conduct a large-scale human evaluation across pairs of generated sentences by GPT-2 and A-INLP. Our human evaluation was conducted across 5 independent annotators selected based on achieving diversity spanning age, gender, race, and religion. Following human evaluation protocols in the related fields of text generation and style transfer (Shen et al., 2017), each of the 5 annotators were asked to judge 1) clarity: coherence of the sentence (including grammar and spelling), 2) content: whether the sentence respects the semantic meaning of realistic text, and 3) fairness: whether the sentence is fair across social groups, on a 1 − 5 scale. We provide some examples of human-annotated sentences in Table 17 and we can see that humans accurately judge the presence of social biases that negatively stereotype certain religions.

D.7. Robustness to Hyperparameters We report results from extensive experiments on the hyperparameters α and bias-classifier P training epochs and summarize these results on a fairness-performance plot, where fairness is measured by 100-SS score (higher is better) and performance is measured by LM score (higher is better). Both SS score and LM score are reported from StereoSet (Nadeem et al., 2020). From Figure5, these different iterations of our A-INLP algorithm allows us to observe a general tradeoff between performance and fairness. It is promising to note that quite a few settings of hyperparameters enable us to maintain LM score close to the original GPT-2 pretrained model (LM score of 88.5) while improving fairness from its original SS score of 58.0 to better SS scores of ∼ 50. Towards Understanding and Mitigating Social Biases in Language Models

Figure 5. Tradeoff between fairness and performance across different hyperparameters (α and bias-classifier P training epochs) used in A-INLP. Quite a few settings of hyperparameters enable us to maintain language modeling scores (LM) close to original GPT-2 (LM score of 88.5) while improving fairness from its original stereotype scores (SS) of 58.0 to ∼ 50.

E. Limitations and Attempts that Failed In this section, we summarize several attempts that we also tried but found to be ineffective in the process, and illustrate several limitations of our approach.

1. The first problem is that it is difficult to collect a perfect dataset for the bias classifier, especially for context embeddings across different bias classes. We cannot ensure that the bias attribute (e.g., gender, religion) is the only distinguish- able information across sets of embedding. Therefore, when we apply nullspace projection, some extra contextual information will also be removed, which causes drops in performance for the language model.

2. For the GPT-2 model, the dot product between the context embedding and different token embeddings are quite similar. Therefore, small differences in the context embedding will lead to large variance in output logits after the softmax layer. We observe that when we apply the simple iterative nullspace projection algorithm where α = 1 in A-INLP, many irrelevant and rare tokens might suddenly have high probabilities while the probability of several meaningful tokens drops a lot. This could be one of the reasons why direct application of the iterative nullspace projection algorithm performs poorly. We therefore introduced a learnable hyperparameter α in an attempt to mitigate this problem.

3. In contrast, A-SUBSPACE (the version of A-INLP with token-level subspace debiasing (Bolukbasi et al., 2016; Liang et al., 2020)) is a more conservative algorithm: we observe that the change of logits is quite small for most tokens. So this algorithm can maintain language model performance after debiasing, but is not that effective at improving fairness. 4. Another challenge involves how to best learn the debiasing parameter α. As we mentioned in Appendix D.1, the subspace of GPT-2 embedding might not be accurate, which incurs certain error in the q(w) term in Equation8. For example, some stop word tokens might contribute to large α even though they are not intuitively bias sensitive tokens, which leads us to use a subspace estimated by GloVe embeddings instead.

5. There are a lot of subwords in the vocabulary of GPT-2. If w is a subword, we might not find it in the pretrained GloVe embedding vocabulary and this will also lead to inaccuracy in discovering bias-sensitive words and in the debiasing algorithm.