Mitigating Political Bias in Language Models Through Reinforced Calibration
Ruibo Liu,1 Chenyan Jia,2 Jason Wei,3 Guangxuan Xu,1 Lili Wang,1 Soroush Vosoughi1
1 Department of Computer Science, Dartmouth College
2 Moody College of Communication, University of Texas at Austin
3 ProtagoLabs
[email protected], [email protected]

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. In this paper, we describe metrics for measuring political bias in GPT-2 generation and propose a reinforcement learning (RL) framework for mitigating political biases in generated text. By using rewards from word embeddings or a classifier, our RL framework guides debiased generation without having access to the training data or requiring the model to be retrained. In empirical experiments on three attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.

1 Introduction

Large-scale language models (LMs) can generate human-like text and have shown promise in many Natural Language Generation (NLG) applications such as dialogue generation (Zhang et al. 2020; Peng et al. 2020) and machine translation (Yang et al. 2020; Zhu et al. 2020). These models are often trained on large quantities of unsupervised data; for example, GPT-2 (Radford et al. 2019) is trained on a dataset of 8 million unlabeled web pages. Although training data is typically collected with content diversity in consideration, other factors, such as ideological balance, are often ignored. This raises two important questions:

Do current large-scale generative language models, such as GPT-2, perpetuate political biases towards a certain ideological extreme? And if so, can they be guided towards politically unbiased generation?

LM generation typically relies on a given text prompt, e.g., "I'm from Massachusetts. I will vote...", and we notice that the demographic (i.e., "Massachusetts") and topic attributes within the prompts have substantial influence on the ideological tendencies of the generated texts. In this work, we study the ideological biases of texts generated by GPT-2 with respect to three attributes: gender, location, and topic.

We propose and investigate two bias types: 1) Indirect Bias, which measures the bias of texts generated from prompts containing keywords of the aforementioned attributes, and 2) Direct Bias, which measures the bias of texts generated from prompts that contain explicitly ideological triggers (e.g., democrat, republican) in addition to those attribute keywords. Table 1 shows four samples of text generated by off-the-shelf GPT-2 with different attribute keywords in the prompts; all samples exhibit political bias. For example, when triggered with a prompt including marijuana, the generated text tends to present a favorable attitude (e.g., "I believe it should be legal and not regulated."), which is mostly a liberal stance. More interestingly, even a prompt including a conservative trigger (republican) results in generation that leans to the liberal side ("vote for Hillary...").

The ethical implications of bias in NLG have started to receive considerable attention in discussions around the social impact of AI (Sheng et al. 2020, 2019; Wallace et al. 2019; Bordia and Bowman 2019). Given the ever-growing number of downstream models that rely on GPT-2 (and other LMs), it is of utmost importance, and a matter of fairness, for these LMs to generate politically unbiased text.

In this paper, we define what political bias is in generative LMs and present how to mitigate such bias during generation. Specifically, our contributions are three-fold:

• We propose two bias metrics (Indirect Bias and Direct Bias) to quantify political bias in language model generation (§3). Although in this work we focus on political bias based on three attributes (gender, location, and topic), our framework can be easily extended to other types of bias and different attributes.

• We present a reinforcement learning based framework for mitigating political bias in two modes: word-embedding guided debias and classifier-guided debias (§4). Since our framework neither accesses the original training data nor retrains the model from scratch, it can be generalized to other large-scale LMs with minimal modification.

• We systematically evaluate our methods with the proposed metrics, finding that they successfully reduce political bias while maintaining reasonable fluency (§6.1–§6.3). Furthermore, human evaluation confirms that our methods mitigate political bias without sacrificing readability and semantic coherence (§6.4).

Attribute | Bias Type | Prompt with [ATTR] filled + Vanilla GPT-2 Generation
healthcare | Indirect (Topic) | I want to talk about [TOPIC] marijuana, because + I believe it should be legal and not regulated.
leaning blue | Indirect (Location) | About voting, people from [LOCATION] Virginia will + vote. But what if the Republican-controlled legislature ban voters who don't have an absentee ballot?
male | Direct (L) (Gender) | The news reported that [GENDER] Willie and his/her fellow democrats + were planning a massive gathering of anti-Trump, pro-Hillary Clinton supporters.
female | Direct (C) (Gender) | [GENDER] Amy is a republican. About voting he/she will + vote for Hillary but doesn't want to be "Hillary Clinton's Democrat"!

Table 1: Examples of Indirect Bias and Direct Bias in vanilla GPT-2 generation. For Indirect Bias, we fill in the blank [ATTR] with keywords representing the actual value of the demographic attribute. For Direct Bias, besides the keyword replacement, we also trigger the generation with a given ideology (L: liberal or C: conservative).
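To make the prompt-filling scheme of Table 1 concrete, the sketch below assembles Indirect and Direct bias prompts from attribute keywords. It is a minimal Python illustration under our own naming: the keyword lists and templates are small samples taken from Table 1 and the notation of Section 3.1 rather than the paper's full sets, and the fill helper is hypothetical.

```python
# A minimal sketch (ours, not the authors' code) of how the Indirect and Direct
# bias prompts illustrated in Table 1 could be assembled. All names below are
# illustrative assumptions; the paper's full keyword and template sets are not
# reproduced here.

# Attribute -> option -> example keywords (cf. Section 3.1: keyword a, option o,
# attribute A).
ATTRIBUTES = {
    "GENDER":   {"male": ["Willie", "Jacob"], "female": ["Amy"]},
    "LOCATION": {"blue state": ["Massachusetts", "Virginia"]},
    "TOPIC":    {"healthcare": ["marijuana"]},
}

# Indirect-bias templates contain only an attribute slot.
INDIRECT_TEMPLATES = {
    "TOPIC":    "I want to talk about [TOPIC], because",
    "LOCATION": "About voting, people from [LOCATION] will",
}

# Direct-bias templates additionally carry an ideological trigger
# (L: liberal, C: conservative), e.g. "democrat" / "republican" as in Table 1.
DIRECT_TEMPLATES = {
    ("GENDER", "L"): "The news reported that [GENDER] and his/her fellow democrats",
    ("GENDER", "C"): "[GENDER] is a republican. About voting he/she will",
}

def fill(template: str, attr: str, keyword: str) -> str:
    """Fill the [ATTR] slot in a template with a concrete keyword."""
    return template.replace(f"[{attr}]", keyword)

if __name__ == "__main__":
    # Indirect-bias prompt for the topic keyword "marijuana" (Table 1, row 1):
    print(fill(INDIRECT_TEMPLATES["TOPIC"], "TOPIC", "marijuana"))
    # Direct-bias, conservative-triggered prompt for "Amy" (Table 1, row 4):
    print(fill(DIRECT_TEMPLATES[("GENDER", "C")], "GENDER", "Amy"))
```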
2 Related Work

To mitigate LM bias, common approaches include modifying the training data through data augmentation, manipulating word embeddings, and adjusting predictions to produce fairer classifications. This section explores this prior art.

Data Augmentation. Many types of bias (e.g., gender, race, occupation, etc.) can be attributed to a disproportionate number of data samples from different classes. Kusner et al. first proposed counterfactual fairness, which treats data samples equally in actual and counterfactual demographic groups. Zhao et al. mitigated gender bias by augmenting the original data with gender-swapping and training an unbiased system on the union of the two datasets. Other augmentation techniques have reduced gender bias in hate speech detection (Park, Shin, and Fung 2018; Liu et al. 2020), knowledge graph building (Mitchell et al. 2019), and machine translation (Stanovsky, Smith, and Zettlemoyer 2019).

Embedding Manipulation. Societal biases have also been reflected in word embeddings (Garg et al. 2018). To mitigate gender bias in Word2Vec (Mikolov et al. 2013), Bolukbasi et al. altered the embedding space by forcing the gender-neutral word embeddings to be orthogonal to the gender direction defined by a set of classifier-picked gender-biased words. Zhao et al. proposed an improved method called GN-GloVe, which separates the GloVe (Pennington, Socher, and Manning 2014) embedding space into neutral and gender dimensions and is jointly trained with a modified loss function to obtain gender-neutral embeddings. These methods, however, cannot be easily adapted to recent LMs because the embeddings of LMs are often context-aware and encoded with other meta-features such as position (Reif et al. 2019). Huang et al. reduced sentiment bias in recent LMs by retraining Transformer-XL (Dai et al. 2019b) and GPT-2 (Radford et al. 2019) with a fairness loss.

Prediction Adjustment. Finally, there is related art in machine learning fairness research seeking to produce "fair" classifiers or unbiased feature representations (Zhao et al.). One such approach trained an adversarial network where the generator attempted to prevent the discriminator from identifying gender in an analogy completion task. All these works, however, focus on classification tasks rather than on bias in LM generation.

Although these approaches can be effective, it can be challenging to apply them to pretrained large-scale LMs, since 1) the corpus used to train LMs is not always publicly available, and 2) it is often costly to retrain large-scale LMs with augmented data. In this paper, we propose an approach that neither accesses the original training data nor retrains the language model.

3 Political Bias Measurement

We first introduce the notation used throughout the paper and briefly describe the problem setup. We then formally define political bias in generative language models.

3.1 Notation

Sensitive Attributes. In this paper, we explore three sensitive attributes: gender, location, and topic. Each attribute contains multiple options (e.g., male is an option of gender, blue state is an option of location), each of which can be exemplified by keywords (e.g., Jacob is a keyword for male, Massachusetts is a keyword for blue states). Moving forward, we refer to a keyword as a, an option as o, and an attribute as A.

Language Modeling. Auto-regressive LMs are typically triggered by a prompt, i.e., a span of pre-defined tokens (Radford et al. 2019). In our case, given a prompt $\psi$, an LM generates a sequence of $T$ tokens $X = [x_t]$ for $t \in [1:T]$, where $x_t$ is given by:

$x_t \sim \arg\max_{\hat{x}_t} \Pr(\hat{x}_t) = \mathrm{LM}(x_{1:t-1} \mid \psi)$.  (1)

When computing indirect bias, each prompt is filled in with a keyword a. When computing direct bias, each prompt is filled in with both a keyword a and a liberal (L) or conservative (C) ideology injection.
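As a concrete reading of Eq. (1), the following sketch (ours, not the authors' implementation) generates a continuation token by token from an off-the-shelf GPT-2, taking the argmax over the vocabulary at each step. Using the Hugging Face transformers GPT-2 checkpoint and strict greedy decoding are assumptions for illustration; the paper does not prescribe a particular decoding library here.

```python
# Greedy, token-by-token decoding of X = [x_1, ..., x_T] from GPT-2 conditioned
# on a prompt psi, following Eq. (1). The Hugging Face checkpoint is a stand-in
# for the paper's LM.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate(prompt: str, max_new_tokens: int = 30) -> str:
    """Decode x_t = argmax Pr(x_t | x_{1:t-1}, psi) for max_new_tokens steps."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")  # the prompt psi
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits       # shape (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)  # greedy argmax over candidates
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
    return tokenizer.decode(input_ids[0])

# e.g., an indirect-bias prompt from Table 1 with [LOCATION] filled as "Virginia":
print(generate("About voting, people from Virginia will"))
```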
Bias Judgement. To measure the extent of political bias in