Mitigating Political Bias in Language Models Through Reinforced Calibration

Ruibo Liu,1 Chenyan Jia,2 Jason Wei,3 Guangxuan Xu,1 Lili Wang,1 and Soroush Vosoughi1
1 Department of Computer Science, Dartmouth College
2 Moody College of Communication, University of Texas at Austin
3 ProtagoLabs
[email protected], [email protected]

Abstract

Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. In this paper, we describe metrics for measuring political bias in GPT-2 generation and propose a reinforcement learning (RL) framework for mitigating political biases in generated text. By using rewards from word embeddings or a classifier, our RL framework guides debiased generation without having access to the training data or requiring the model to be retrained. In empirical experiments on three attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.

1 Introduction

Large-scale language models (LMs) can generate human-like text and have shown promise in many Natural Language Generation (NLG) applications such as dialogue generation (Zhang et al. 2020; Peng et al. 2020) and machine translation (Yang et al. 2020; Zhu et al. 2020). These models are often trained on large quantities of unsupervised data; for example, GPT-2 (Radford et al. 2019) is trained on a dataset of 8 million unlabeled web pages. Although training data is typically collected with content diversity in consideration, other factors, such as ideological balance, are often ignored. This raises a couple of important questions:

Do current large-scale generative language models, such as GPT-2, perpetuate political biases towards a certain ideological extreme? And if so, can they be guided towards politically unbiased generation?

LM generation typically relies on a given text prompt, e.g., "I'm from Massachusetts. I will vote...", and we notice that the demographic (i.e., "Massachusetts") and topic attributes within the prompts have substantial influence on the ideological tendencies of the generated texts. In this work, we study the ideological biases of texts generated by GPT-2 with respect to three attributes: gender, location, and topic. We propose and investigate two bias types: 1) Indirect Bias, which measures bias of texts generated using prompts with particular keywords of the aforementioned attributes, and 2) Direct Bias, which measures bias in texts generated using prompts that have direct ideological triggers (e.g., democrat, republican) in addition to keywords of the aforementioned attributes. Table 1 shows four samples of text generated by off-the-shelf GPT-2 with different attribute keywords in the prompts; all samples exhibit political bias. For example, when triggered with a prompt including marijuana, the generated text tends to present a favorable attitude (e.g., "I believe it should be legal and not regulated."), which is mostly a liberal stance. More interestingly, even a prompt including a conservative trigger (republican) results in generation that leans to the liberal side ("vote for Hillary...").

The ethical implications of bias in NLG have started to receive considerable attention in discussions around the social impact of AI (Sheng et al. 2020, 2019; Wallace et al. 2019; Bordia and Bowman 2019). Given the ever-growing number of downstream models that rely on GPT-2 (and other LMs), it is of utmost importance, and a matter of fairness, for these LMs to generate politically unbiased text.

In this paper, we define what political bias is in generative LMs and present how to mitigate such bias during generation. Specifically, our contributions are three-fold:

• We propose two bias metrics (Indirect Bias and Direct Bias) to quantify the political bias in language model generation (§3). Although in this work we focus on political bias based on three attributes (gender, location, and topic), our framework can be easily extended to other types of bias and different attributes.

• We present a reinforcement learning based framework for mitigating political bias in two modes: word-embedding guided debias and classifier-guided debias (§4). Since our framework neither accesses the original training data nor retrains the model from scratch, it can be generalized to other large-scale LMs with minimal modification.

• We systematically evaluate our methods with the proposed metrics, finding that they successfully reduce political bias while maintaining reasonable fluency (§6.1-§6.3). Furthermore, human evaluation confirms that our methods successfully mitigate the political bias without sacrificing readability and semantic coherence (§6.4).

Attribute            | Bias Type  | Prompt with [ATTR] filled + Vanilla GPT-2 Generation
healthcare (Topic)   | Indirect   | I want to talk about [TOPIC] marijuana, because + I believe it should be legal and not regulated.
leaning blue (Location) | Indirect | About voting, people from [LOCATION] Virginia will + vote. But what if the Republican-controlled legislature ban voters who don't have an absentee ballot?
male (Gender)        | Direct (L) | The news reported that [GENDER] Willie and his/her fellow democrats + were planning a massive gathering of anti-Trump, pro-Hillary Clinton supporters.
female (Gender)      | Direct (C) | [GENDER] Amy is a republican. About voting he/she will + vote for Hillary but doesn't want to be "Hillary Clinton's Democrat"!

Table 1: Demo examples of Indirect Bias and Direct Bias existing in vanilla GPT-2 generation. For Indirect Bias, we fill in the blank [ATTR] with keywords representing the actual value of the demographic attribute. For Direct Bias, besides the keyword replacement, we also trigger the generation with a given ideology (L: liberal or C: conservative).
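Generations of the kind shown in Table 1 can be reproduced with an off-the-shelf model. The following is a minimal sketch, assuming the HuggingFace transformers package and the small GPT-2 checkpoint; the template and keyword are illustrative of the prompts above rather than the paper's exact prompt set.

```python
# Minimal sketch: fill an attribute slot in a prompt and let vanilla GPT-2 continue it,
# as in the Table 1 examples. Assumes the HuggingFace `transformers` package; the
# template and keyword are illustrative, not the paper's exact prompt set.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

template = "I want to talk about [TOPIC], because"  # Indirect-bias style prompt
prompt = template.replace("[TOPIC]", "marijuana")   # fill the slot with an attribute keyword

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=40,
    do_sample=True,                       # sampled continuations vary run to run
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,  # avoid the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Single samples vary; the ideological lean described above is a tendency observed over many generations rather than a property of any one continuation.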
2 Related Work

To mitigate LM bias, common approaches include modifying the training data through data augmentation, manipulating word embeddings, and adjusting predictions to produce fairer classifications. This section reviews this prior art.

Data Augmentation. Many types of bias (e.g., gender, race, occupation) can be attributed to a disproportionate number of data samples from different classes. Kusner et al. first proposed counterfactual fairness, which treats data samples equally in actual and counterfactual demographic groups. Zhao et al. mitigated gender bias by augmenting the original data with gender-swapping and training an unbiased system on the union of the two datasets. Other augmentation techniques have reduced gender bias in hate speech detection (Park, Shin, and Fung 2018; Liu et al. 2020), knowledge graph building (Mitchell et al. 2019), and machine translation (Stanovsky, Smith, and Zettlemoyer 2019).

Embedding Manipulation. Societal biases are also reflected in word embeddings (Garg et al. 2018). To mitigate gender bias in Word2Vec (Mikolov et al. 2013), Bolukbasi et al. altered the embedding space by forcing the gender-neutral word embeddings to be orthogonal to the gender direction defined by a set of classifier-picked gender-biased words. Zhao et al. proposed an improved method called GN-GloVe, which separates the GloVe (Pennington, Socher, and Manning 2014) embedding space into neutral and gender dimensions and is jointly trained with a modified loss function to obtain gender-neutral embeddings. These methods, however, cannot be easily adapted to recent LMs, because the embeddings of LMs are often context-aware and encoded with other meta-features such as position (Reif et al. 2019). Huang et al. reduced sentiment bias in recent LMs, retraining Transformer-XL (Dai et al. 2019b) and GPT-2 (Radford et al. 2019) with a fairness loss.

Prediction Adjustment. Finally, there is related art in machine learning fairness research seeking to produce "fair" predictions, for example by training an adversarial network where the generator attempted to prevent the discriminator from identifying gender in an analogy completion task. All these works, however, focus on classification tasks rather than exploring bias in LM generation.

Although these approaches can be effective, it can be challenging to apply them to pretrained large-scale LMs, since 1) the corpora used to train LMs are not always publicly available, and 2) it is often costly to retrain large-scale LMs with augmented data. In this paper, we propose an approach that neither accesses the original training data nor retrains the language model.

3 Political Bias Measurement

We first introduce the notation used throughout the paper and briefly describe the problem setup. We then formally define political bias in generative language models.

3.1 Notation

Sensitive Attributes. In this paper, we explore three sensitive attributes: gender, location, and topic. Each attribute contains multiple options (e.g., male is an option of gender, blue state is an option of location), each of which can be exemplified by keywords (e.g., Jacob is a keyword for male, Massachusetts is a keyword for blue states). Moving forward, we refer to a keyword as a, an option as o, and an attribute as A.

Language Modeling. Auto-regressive LMs are typically triggered by a prompt, a span of pre-defined tokens (Radford et al. 2018, 2019). In our case, given a prompt p, an LM generates a sequence of T tokens X = [x_t] for t ∈ [1 : T], where x_t is given by:

    x_t ∼ argmax_{x̂_t} P(x̂_t) = LM(x_{1:t−1} | p).    (1)

When computing indirect bias, each prompt is filled in with a keyword a. When computing direct bias, each prompt is filled in with both a keyword a and a liberal (L) or conservative (C) ideology injection.
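To make this setup concrete, here is a minimal sketch assuming the HuggingFace transformers package: it represents the attribute-option-keyword hierarchy, builds Indirect and Direct prompts, and decodes greedily in the spirit of Eq. (1). The attribute lists, keywords, templates, and the way the ideological trigger is appended are all invented for illustration; this is not the paper's implementation.

```python
# Illustrative sketch of the Section 3.1 setup: an attribute -> option -> keyword
# hierarchy, prompt filling for Indirect vs. Direct bias, and greedy decoding in the
# spirit of Eq. (1). All lists, templates, and the trigger phrasing are hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Attribute A -> option o -> example keywords a (hypothetical values).
ATTRIBUTES = {
    "GENDER": {"male": ["Jacob"], "female": ["Amy"]},
    "LOCATION": {"blue state": ["Massachusetts"], "red state": ["Alabama"]},
    "TOPIC": {"healthcare": ["marijuana"]},
}

def indirect_prompt(template: str, attr: str, keyword: str) -> str:
    """Indirect bias: fill the [ATTR] slot with an attribute keyword only."""
    return template.replace(f"[{attr}]", keyword)

def direct_prompt(template: str, attr: str, keyword: str, ideology: str) -> str:
    """Direct bias: fill the keyword and additionally inject an ideological trigger."""
    trigger = {"L": "democrat", "C": "republican"}[ideology]
    return indirect_prompt(template, attr, keyword) + f" As a {trigger},"

def greedy_generate(prompt: str, num_tokens: int = 20) -> str:
    """Greedy decoding: pick x_t = argmax P(x_t | x_{1:t-1}, prompt), as in Eq. (1)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(num_tokens):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]              # next-token distribution
        next_id = torch.argmax(logits, dim=-1, keepdim=True)  # the argmax token
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

keyword = ATTRIBUTES["GENDER"]["female"][0]
print(greedy_generate(direct_prompt("[GENDER] will vote today.", "GENDER", keyword, "C")))
```

Note that in Table 1 the ideological trigger appears inside the prompt itself; appending it at the end here is a simplification of the same idea.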
Bias Judgement.