Predictive in Natural Language Processing Models: A Conceptual Framework and Overview

Deven Shah H. Andrew Schwartz Dirk Hovy Stony Brook University Bocconi University {dsshah,has}@cs.stonybrook.edu [email protected]

Abstract the training process. In the case of domains, these non-generalizations are words, phrases, or senses An increasing number of natural language that occur in one text type, but not another. processing papers address the effect of However, this kind of variation is not just re- on predictions, introducing mitigation tech- niques at different parts of the standard NLP stricted to text domains: it is a fundamental prop- pipeline (data and models). However, these erty of human-generated language: we talk differ- works have been conducted individually, with- ently than our parents or people from a different out a unifying framework to organize efforts part of our country, etc. (Pennebaker and Stone, within the field. This situation leads to repet- 2003; Eisenstein et al., 2010; Kern et al., 2016). itive approaches, and focuses overly on bias In other words, language reflects the diverse de- symptoms/effects origins , rather than on their , mographics, backgrounds, and personalities of the which could limit the development of effective countermeasures. In this paper, we propose a people who use it. While these differences are of- unifying predictive bias framework for NLP. ten subtle, they are distinct and cumulative (Trudg- We summarize the NLP literature and suggest ill, 2000; Kern et al., 2016; Pennebaker, 2011). general mathematical definitions of predictive Similar to text domains, this variation can lead bias. We differentiate two consequences of models to pick up on patterns that do not gener- bias: outcome disparities and error dispari- alize to other author-demographics, or to rely on ties, as well as four potential origins of bi- undesirable word-demographic relationships. ases: label bias, , model over- Bias may be an inherent property of any NLP amplification, and semantic bias. Our frame- work serves as an overview of predictive bias system (and broadly any statistical model), but in NLP, integrating existing work into a single this is not per se negative. In essence, biases are structure, and providing a conceptual baseline priors that inform our decisions (a dialogue sys- for improved frameworks. tem designed for elders might work differently than one for teenagers). Still, undetected and 1 Introduction unaddressed, biases can lead to negative conse- Predictive models in NLP are sensitive to a variety quences: There are aggregate effects for demo- of (often unintended) biases throughout the devel- graphic groups, which combine to produce predic- opment process. As a result, fitted models do not tive bias. I.e., the label distribution of a predictive arXiv:1912.11078v2 [cs.CL] 12 Sep 2020 generalize well, incurring performance and relia- model reflects a human attribute in a way that di- bility losses on unseen data. They also have so- verges from a theoretically defined “ideal distribu- cially undesirable effects by systematically under- tion.” For example, a Part Of Speech (POS) tag- serving or mispredicting certain user groups. ger reflecting how an older generation uses words The general phenomenon of biased predictive (Hovy and Søgaard, 2015) diverges from the pop- models in NLP is not recent. The community ulation as a whole. has long worked on the domain adaptation prob- A variety of papers have begun to address lem (Jiang and Zhai, 2007; Daume III, 2007): countermeasures for predictive biases (Li et al., models fit on newswire data do not perform well 2018; Elazar and Goldberg, 2018; Coavoux et al., on social media and other text types. This prob- 2018).1 Each identifies a specific bias and counter- lem arises from the tendency of statistical mod- 1An even more extensive body of work on fairness exists els to pick up on non-generalizable signals during as part of the FAT* conferences, which goes beyond the scope Error disparity Example Dep, is_older? Matched / uniform: outcome - 0 (young) : .001 over-amplification label bias outcome disparity ideal: origin - 1 (old) : .001 The model discriminates on Biased annotations, The distribution of outcomes, given attribute A, 0, 0 : .30

consequence a given human attribute interaction, or latent bias is dissimilar than the ideal distribution: 0, 1 : .35 From predicted beyond its source base-rate. from past classifications. Q(Ŷ |A) ≁ P(Y |A) t t 1, 0 : .20 - 0 (young) :

1, 1 : .15 mean=.0005 Embedding - 1 (old) : Corpus Source Population Target Population From predicted mean=.002

biased 0, 0 : .25 features features outcomes features 0, 1 : .40 Q(ϵ |A) ≁ Uniform => there fit predict outcomes t 휃 X Y X 1, 0 : .25 exists an i,j such that embedding source source target Ŷ target ϵ ϵ (Pre-trained Side) (Model Side) (Application Side) 1, 1 : .20 Q( t|Ai) =/= Q( t|Aj)

semantic bias selection bias error disparity Non-ideal associations between attributed The sample of observations The distribution of error (ϵ) over at least two lexeme (e.g. gendered pronouns) and themselves are not representative different values of an attribute A( ) are unequal: non-attributed lexeme (e.g. occupation). of the application population. ϵ ϵ Q( t|Ai) ≁ Q( t|Aj)

Figure 1: The Predictive Bias Framework for NLP: Depiction of where bias mayerror originate disparity within a standard supervised NLP pipeline. Evidence of bias is seen in yˆ via outcome disparityTheand distributionerror disparityof error (ϵ) .is inconsistent over different values of an attribute A( ): ϵ Q( t|A) ≁ Uniform measure on their terms, but it is often not explic- lected works in social science and adjacent fields. itly clear which bias is addressed, where it orig- We identify four distinct sources of bias: selec- inates, or how it generalizes. There are multi- tion bias, label bias, model overamplification, ple sources from which bias can arise within the and semantic bias. We can express all of these as predictive pipeline, and methods proposed for one differences between (a) a “true” or intended distri- specific bias often do not apply to another. As bution (e.g., over users, labels, or outcomes), and a consequence, much work has focused on bias (b) the distribution used or produced by the model. effects and symptoms rather than their origins. These cases arise at specific points within a typical While it is essential to address the effects of bias, it predictive pipeline: embeddings, source data, la- can leave the fundamental origin unchanged (Go- bels (human annotators), models, and target data. nen and Goldberg, 2019), requiring researchers to We provide quantitative definitions of predictive rediscover the issue over and over. The “bias” dis- bias in this framework intended to make it easier cussed in one paper may, therefore, be quite dif- to: (a) identify biases (because they can be clas- ferent than that in another.2 sified), (b) develop countermeasures (because the A shared definition and framework of predictive underlying problem is known), and (c) compare bias can unify these efforts, provide a common ter- biases and countermeasures across papers. We minology, help to identify underlying causes, and hope this paper will help researchers spot, com- allow coordination of countermeasures (Sun et al., pare, and address bias in all its various forms. 2019). However, such a general framework had Contributions Our primary contributions in- yet to be proposed within the NLP community. clude: (1) a conceptual framework for identify- To address these problems, we suggest a joint ing and quantifying predictive bias and its origins conceptual framework, depicted in Figure1, out- within a standard NLP pipeline, (2) a survey of bi- lining and relating the different origins of bias. ases identified in NLP models, and (3) a survey We base our framework on an extensive survey of methods for countering bias in NLP organized of the relevant NLP literature, informed by se- within our conceptual framework. of this biased-focused paper. Note also that while bias is an 2 Definition - Two Types of Disparities ethical issue and contributes to many papers in the ethics in NLP area, the two should not be conflated: ethics covers more Our definition of predictive bias in NLP builds on than bias. 2Quantitative social science offers a background for its definition within the literature on standardized bias (Berk, 1983). However, NLP differs fundamentally in testing (i.e., SAT, GRE, etc.) Specifically, Swinton analytic goals (namely out-of-sample prediction for NLP ver- (1981) states: sus parameter inference for hypothesis testing in social sci- ence) that bring about NLP-specific situations: biases in word embeddings, annotator labels, or predicting over-amplified By “predictive bias," we refer to a situation in demographics. which a [predictive model] is used to predict a specific criterion for a particular population, and pronouns associated with computer science). Our is found to give systematically different predic- framework should enable its users to apply evolv- tions for subgroups of this population who are in ing standards and norms across NLP’s many ap- fact identical on that specific criterion.3 plication contexts. A prototypical example of outcome disparity We generalize Swinton’s definition in two ways: is gender disparity in image captions. Zhao et al. First, to align notation with standard supervised (2017) and Hendricks et al.(2018) demonstrate modeling, we say there are both Y (a random a systematic difference with respect to gender in variable representing the “true” values of an out- the outcome of the model, Yˆ even when taking come) and Yˆ (a random variable representing the the source distribution as an ideal target distribu- predictions. Next, we allow the concept to apply tion: Q(Yˆ |gender) Q(Y |gender) ∼ to differences associated with continuously-valued target  target Q(Y |gender). As a result, captions over- human attributes rather than simply discrete sub- source predict females in images with ovens and males groups of people.4 Below, we define two types of in images with snowboards. measurable systematic differences (i.e. “dispari- ties”): (1) a systematic difference between Y and Error disparity. We say there is an error dis- Yˆ ( outcome disparity) and (2) a difference in error parity when model predictions have larger error ( = |Y − Yˆ ) error disparity, both as a function of for individuals with a given user attribute (or range a given human attribute, A. of attributes in the case of continuously-valued at- Outcome disparity. Formally, we say an out- tributes). Formally, the error of a predicted distri- come disparity exists for outcome, Y , a domain bution is ˆ D (with values source or target), and with re- D = |YD − YD| spect to attribute, A, when the distribution of the If there is a difference in  over at least two dif- ˆ D predicted outcome (Q(YD|AD)) is dissimilar to a ferent values of an attribute A (assuming they have given theoretical ideal distribution (P (YD|AD)): been adequately sampled to establish a distribution ˆ of D) then there is an error disparity. Q(YD|AD)  P (YD|AD)

The ideal distribution is specific to the target Q(D|Ai)  Q(D|Aj) application. Our framework allows researchers to use their own criteria to determine this distribu- In other words, the error for one group might tion. However, the task of doing so may be non- systematically differ from the error for another trivial. First, the current distribution within a pop- group, e.g., the error for green people differs ulation may not be accessible. Even when it is, it from the error for blue people. Under unbiased may not be what most consider the ideal distribu- conditions, the difference would be equal. This tion (e.g., the distribution of gender in computer formulation allows us to capture both the discrete science and the associated disparity of NLP mod- case (arguably more common in NLP, for exam- els attributing male pronouns to computer scien- ple, in POS tagging) and the continuous case (for tists more frequently (Hovy, 2015)). Second, it example, in age or income prediction). may be difficult to come to an agreed-upon ideal distribution from a moral or ethical perspective. In We propose that if either of these two dis- such a case, it may be helpful to use an ideal “di- parities exist in our target application, then there rection,” rather than specifying a specific distribu- is a predictive bias. Note that predictive bias is tion (e.g., moving toward a uniform distribution of then a property of a model given a specific appli- cation, rather than merely an intrinsic property 3 We have substituted “test" with “predictive model”. of the model by itself. This definition mirrors 4“Attributes” include both continuously valued user-level variables, like age, personality on a 7-point scale, etc. (also predictive bias in standardized testing (Swinton, referred to as “dimensional” or “factors”), and discrete cat- 1981): “a [predictive model] cannot be called egories like membership in an ethnic group. Psychological research suggests that people are better represented by con- biased without reference to a specific prediction tinuously valued scores, where possible, than discrete cate- situation; thus, the same instrument may be biased gories (Baumeister et al., 2007; Widiger and Samuel, 2005; in one application, but unbiased in another." McCrae and Costa Jr., 1989). In NLP, Lynn et al.(2017) shows benefits from treating user-level attributes as contin- A prototypical example of error disparity is uously when integrating into NLP models. the “Wall Street Journal Effect” – a systematic difference in error as a function of demograph- placing words suspected to carry semantic differ- ics, first documented by Hovy and Søgaard(2015). ences (‘he’, ‘she’) with a mask: In theory, POS tagging errors increase the further an author’s demographic attributes differ from the log(P ([Mask] = “hPRONi”|[Mask] is “hNOUNi”)) − average WSJ author of the 1980s and 1990s (on log(P ([Mask] = “hPRONi”|[Mask] is [Mask]))) whom many POS taggers were trained – a selec- hNOUNi is replaced with a specific noun to tion bias, discussed next). Work by Sap et al. check for semantic bias (e.g., an occupation), (2019) shows error disparity from a different ori- and hPRONi is an associated demographic word gin, namely unfairness in hate speech detection. (e.g., “he” or “she”). They find that annotators for hate speech on so- cial media make more mistakes on posts of black 3 Four Origins of Bias individuals. Contrary to the case above, the dis- But what leads to an outcome disparity or er- parity is not necessarily due to a difference be- ror disparity? We identify four points within the tween author and annotator population (a selection standard supervised NLP pipeline where bias may bias). Instead, the label disparity stems from an- originate: (1) the training labels (label bias), (2) notators failing to account for the authors’ racial the samples used as observations — for training background and sociolinguistic norms. or testing (selection bias), (3) the representation of data (semantic bias), or (4) due to the fit method Source and Target Populations. An important itself (overamplification). assumption of our framework is that disparities are dependent on the population for which the model Label Bias Label bias emerges when the distri- will be applied. This assumption is reflected in bution of the dependent variable in the data source distinguishing a separate “target population” from diverges substantially from the ideal distribution: the “source population” on which the model was Q(Y |A ) P (Y |A ) trained. In cross-validation over random folds, s s  s s models are trained and tested over the same popu- Here, the labels themselves are erroneous concern- lation. However, in practice, models are often ap- ing the demographic attribute of interest (as com- plied to novel data that may originate from a dif- pared to the source distribution). Sometimes, this ferent population of people. In other words, the bias is due to a non-representative group of anno- disparity may exist as a model property for one tators (Joseph et al., 2017). In other cases, it may application, but not for another. be due to a lack of domain expertise (Plank et al., Quantifying disparity. Given the definitions of 2014), or due to preconceived notions and stereo- the two types of disparities, we can quantify bias types held by the annotators (Sap et al., 2019). with well-established measures of distributional Selection bias. Selection bias emerges due to divergence or deviance. Specifically, we suggest non-representative observations. I.e., when the the Log-likelihood ratio as a central metric: users generating the training (source) observa- tions differ from the user distribution of the tar- D(Y, Yˆ |A) = 2(log(p(Y |A)) − log(p(Yˆ |A))) get, where the model will be applied. Selection bias (sometimes also referred to as sample bias) where p(Y |A) is the specified ideal distribution has long been a concern in the social sciences. At (either derived empirically or theoretically) and this point, testing for such a bias is a fundamen- p(Yˆ |A) is the distribution within the data. For tal consideration in study design (Berk, 1983; Cu- error disparity the ideal distribution is always the lotta, 2014). Non-representative data is the origin Uniform and Yˆ is replaced with the error. KL di- for selection bias. vergence (DKL[P (Yˆ |A)P (Y |A)]) can be used as Within NLP, some of the first works to a secondary, more scalable alternative. note demographic biases were due to a selec- Our measure above attempts to synthesize met- tion bias (Hovy and Søgaard, 2015; Jørgensen rics others have used in works focused on specific et al., 2015). A prominent example is the biases. For example, the definition of outcome dis- so-called “Wall Street Journal effect”, where parity is analogous to that used for semantic bias. syntactic parsers and part-of-speech taggers are Kurita et al.(2019) quantify bias in embeddings most accurate over language written by middle- as the difference in log probability score when re- aged white men. The effect occurs because this group happened to be the predominant au- ANNOTATION thors’ demographics of the WSJ articles, which incorrect correct are traditionally used to train syntactic mod- not- selection bias, selection els (Garimella et al., 2019). The same effect was repr. label bias bias reported for language identification difficulties for AMPLE

African-American Vernacular English (Blodgett S repr. label bias no bias and O’Connor, 2017; Jurgens et al., 2017). Table 1: Interaction between selection and label bias The predicted output is dissimilar from the ideal under different conditions for sample representative- distribution, leading, for example, to lower accu- ness and annotation quality racy for a given demographic, since the source did not reflect the ideal distribution. We say that the distribution of human attribute, A, within the We will see later how better documentation can source data, s, is dissimilar to the distribution of A help future researchers address this problem. within the target data, t: Overamplification. Another source of bias can occur even when there is no label or selection Q(As)  P (At) bias. In overamplification, a model relies on a small difference between human attributes with re- Selection bias has several peculiarities. First, it spect to the objective (even an acceptable differ- is dependent on the ideal distribution of the target ence matching the ideal distribution), but amplifies population, so a model may have selection bias for this difference to be much more pronounced in the one application (and its associated target popula- predicted outcomes. The origins of overamplifica- tion), but not for another. Also, consider that ei- tion are during learning itself. The model learns ther the source features (Xs) or source labels (Ys) to pick up on imperfect evidence for the outcome, may be non-representative. In many situations, which brings out the bias. the distributions for the features and labels are the Formally, in overamplification the predicted same. However, there are some cases where they distribution (Q(Yˆs|As)) is dissimilar to the source diverge. For example, when using features from training distribution (Q(Ys|As)) with respect to a age-biased tweets, but labels from non-biased cen- human attribute, A. The predicted distribution is sus surveys. In such cases, we need to take multi- therefore also dissimilar to the target ideal distri- ple analysis levels into account: corrections can be bution: applied to user features as they are aggregated to ˆ communities (Almodaresi et al., 2017). The con- Q(Ys|As)  Q(Ys|As) ∼ P (Yt|At) sequences could be both outcome and error dis- parity. For example, Yatskar et al.(2016) found that One of the challenges in addressing selection in the imSitu image captioning data set, 58% of bias is that we can not know a priori what sort of captions involving a person in a kitchen mention (demographic) attribute will be important to con- women. However, standard models trained on trol. Age and gender are well-studied, but others such data end up predicting people depicted in might be less obvious. We might someday real- kitchens as women 63% of the time (Zhao et al., ize that a formerly innocuous attribute (say, hand- 2017). In other words, an error in generating a edness) turns out to be relevant for selection bi- gender reference within the text (e.g., “A [woman ases. This problem is known as The Known and k man] standing next to a counter-top”) males an Unknown Unknowns. incorrect female reference much more common. The occurrence of overamplification in the ab- As we know, there are known knowns: sence of other biases is an important motivation there are things we know we know. We for countermeasures. It does not require bias on also know there are known unknowns: the part of the annotator, data collector, or even that is to say, we know there are some the programmer/data analyst (though it can es- things we do not know. But there are calate existing biases and the models’ statistical also unknown unknowns: the ones we discrimination along a demographic dimension). don’t know we don’t know. In particular, it extends countermeasures beyond — Donald Rumsfeld the point some authors have made, that they are merely cosmetic and do not address the underly- Multiple Biases. Biases occur not only in isola- ing cause: biased language in society (Gonen and tion, but they also compound to increase their ef- Goldberg, 2019). fects. Label and selection bias can – and often do – interact, so it can be challenging to distinguish Semantic bias. Embeddings (i.e., vectors repre- them. Table1 shows the different conditions to senting the meaning of words or phrases) have be- understand the boundaries of one or another. come a mainstay of modern NLP, providing more Consider the case where a researcher chooses flexible representations that feed both traditional to balance a sentiment data set for a user attribute, and deep learning models. However, these repre- e.g., age. This decision can directly impact the sentations often contain unintended or undesirable label distribution of the target variable. E.g., be- associations and societal stereotypes (e.g., con- cause the positive label is over-represented in a mi- necting medical doctors more frequently to male nority age group. Models learn to exploit this con- pronouns than female pronouns, see Bolukbasi founding correlation between age and label preva- et al.(2016); Caliskan et al.(2017)). We adopt lence and magnify it even more. The resulting the term used for this phenomenon by others, “se- model may be useless, as it only captures the dis- mantic bias”. tribution in the synthetic data sample. We see Formally, we attribute semantic bias to the pa- this situation in early work on using social media rameters of the embedding model (θemb). Seman- data to predict mental health conditions. Models tic bias is a unique case since it indirectly af- to distinguish PTSD from depression turned out fects both outcome disparity and error disparity to mainly capture the differences in user age and by creating other biases, such as overamplifica- gender, rather than language reflecting the actual tion (Yatskar et al., 2016; Zhao et al., 2017) or di- conditions (Preo¸tiuc-Pietroet al., 2015). verging words associations within embeddings or language models (Bolukbasi et al., 2016; Rudinger 3.1 Other Bias Definitions and Frameworks et al., 2018). However, we distinguish it from the While this is the first attempt at a comprehensive other biases, since the population does not have to conceptual framework for bias in NLP, alternative be people, but rather words in contexts that yield frameworks exist, both in other fields and based on non-ideal associations. For example, the issue is more qualitative definitions. Friedler et al.(2016) not (only) that a particular gender authors more define bias as unfairness in algorithms. They spec- of the training data for the embeddings. Instead, ify the idea of a “construct” space, which captures that gendered pronouns are mentioned alongside the latent features in the data that help predict the occupations according to a non-ideal distribution right outcomes. They suggest that finding those (e.g., texts talk more about male doctors and fe- latent variables would also enable us to produce male nurses than vice versa). Furthermore, pre- the right outcomes. Hovy and Spruit(2016) take trained embeddings are often used without access a broader scope on bias based on ethics in new to the original data (or the resources to process technologies. They list three qualitative sources it). We thus suggest that embedding models them- (data, modeling, and research design), and sug- selves are a distinct source of bias within NLP pre- gest three corresponding types of biases: demo- dictive pipelines. graphic bias, overgeneralization, and topic expo- They have consequently received increased at- sure. Suresh and Guttag(2019) propose a qualita- tention, with dedicated sessions at NAACL and tive framework for bias in machine learning, defin- ACL 2019. As an example, Kurita et al.(2019) ing bias as a “potential harmful property of the quantify human-like bias in BERT. Using the Gen- data”. They categorize bias into historical bias, der Pronoun Resolution (GPR) task, they find that, representation bias, measurement bias, and evalu- even after balancing the data set, the model pre- ation bias. Glymour and Herington(2019) clas- dicts no female pronouns with high probability. sify algorithmic bias, in general, into four dif- Semantic bias is also of broad interest to the social ferent categories, depending on the causal condi- sciences as a diagnostic tool (see SectionA). How- tional dependencies to which it is sensitive: pro- ever, their inclusion in our framework is not for cedural bias, , behavior-relative er- reasons of social scientific diagnostics, but rather ror bias, and score-relative error bias. Corbett- to guide mindful researchers where to look for Davies and Goel(2018) propose statistical limi- problems. tations of the three prominent definitions of fair- ness (anti-classification, classification parity, and majority voting, by attaching confidence scores calibration), enabling researchers to develop fairer to each annotator, and reweighting them through machine learning algorithms. that method. Other approaches have explored har- Our framework focuses on NLP, but it follows nessing the inherent disagreement among anno- Glymour and Herington(2019) in providing prob- tators to guide the training process (Plank et al., abilistic based definitions of bias. It incorporates 2014). By weighting updates by the amount of and formalizes the above to varying degrees. disagreement on the labels, this method prevents In social sciences, bias definitions often relate bias towards any one label. The weighted up- to the ability to test causal hypotheses. Hernán dates act as a regularizer during training, which et al.(2004) propose a common structure for var- might also help prevent overamplification. If an- ious types of selection bias. They define bias as notators behave in predictable ways to produce ar- the difference between a variable and the outcome, tifacts (i.e., always add “not” to form a contradic- and the causal effect of a variable on the outcome. tion), we can train a model on such biased fea- E.g., when the causal risk ratio (CRR) differs from tures and use it in ensemble learning (Clark et al., associational risk ratio (ARR). Similarly, Baker 2019). Hays et al.(2015) attempt to make Web et al.(2013) define bias as uncontrolled covariates studies equivalent to representative focus group or “disturbing variables” that are related to mea- panels. They give an overview of probabilistic sures of interest. and non-probabilistic approaches to generate the Others provide definitions restricted to partic- Internet panels that contribute to the data gener- ular applications. For example, Caliskan et al. ation. Along with the six demographic attributes (2017) propose the Word-Embedding Association (age, gender, race/ethnicity, education, marital sta- Test (WEAT). It quantifies semantic bias based tus, and income), they use poststratification to re- on the distance between words with demographic duce the bias (some of these methods cross into associations in the embedding space. The previ- addressing selection bias). ously mentioned work by Kurita et al.(2019) and Selection bias. The primary source for selection Sweeney and Najafian(2019) extend such mea- bias is the mismatch between the sample distribu- sures. Similarly, Romanov et al.(2019) define tion and the ideal distribution. Consequently, any bias based on the correlation between the embed- countermeasures need to re-align the two distribu- dings of human attributes with the difference in tions to minimize this mismatch. the True Positive rates between human traits. This The easiest way to address the mismatch is to approach is reflective of an error disparity. re-stratify the data to more closely match the ideal Our framework encompasses bias-related work distribution. However, this often involves down- in the social sciences. Please see the supplement sampling an overly represented class, which re- in A.1 for a brief overview. duces the number of available instances. Moham- 4 Countermeasures mady and Culotta(2014) use a stratified sampling technique to reduce the selection bias in the data. We group proposed countermeasures based on the Almeida et al.(2015) use demographic user at- origin(s) on which they act. tributes, including age, gender, and social status, Label Bias. There are several ways to address to predict the election results in six different cities label bias, typically by controlling for biases of of Brazil. They use stratified sampling on all the the annotators (Pavlick et al., 2014). Disagree- resulting groups to reduce selection bias. ment between annotators has long been an active Rather than re-sampling, others use reweighting research area in NLP, with various approaches to or poststratifying to reduce selection bias. Cu- measure and quantify disagreement through inter- lotta(2014) estimates county-level health statis- annotator agreement (IAA) scores to remove out- tics based on social media data. He shows we liers (Artstein and Poesio, 2008). Lately, there can stratify based on external socio-demographic has been more of an emphasis on embracing vari- data about a community’s composition (e.g., gen- ation through the use of Bayesian annotation mod- der and race). Park et al.(2006) estimate state- els (Hovy et al., 2013; Passonneau and Carpenter, wise public opinions using the National Surveys 2014; Paun et al., 2018). These models arrive at a corpus. To reduce bias, they use various socioe- much less biased estimate for the final label than conomic and demographic attributes (state of res- idence, sex, ethnicity, age, and education level) in Overamplification. In its simplest form, over- a multilevel logistic regression. Choy et al.(2011) amplification of inherent bias by the model can be and Choy et al.(2012) also use race and gender corrected by downweighting the biased instances as features for reweighting in predicting the re- in the sample, to discourage the model from exag- sults of the Singapore and US presidential elec- gerating the effects. tions. Baker et al.(2013) study how selection bias A common approach involves using synthetic manifests in inferences for a larger population, and matched distributions. To address gender bias in how to avoid it. Apart from the basic demographic neural network approaches to coreference resolu- attributes, they also consider attitudinal and be- tion Rudinger et al.(2018); Zhao et al.(2018) sug- havioral attributes for the task. They suggest us- gest matching the label distributions in the data, ing reweighting, ranking reweighting or propen- and training the model on the new data set. They sity score adjustment, and sample-matching tech- swap male and female instances and merge them niques to reduce selection bias. with the original data set for training. In the same vein, Webster et al.(2018) provide a gender- Others have suggested combinations of these balanced training corpus for coreference resolu- approaches. Hernán et al.(2004), propose Di- tion. Based on the first two corpora, Stanovsky rected Acyclic graphs for various heterogeneous et al.(2019) introduce a bias evaluation for ma- types of selection bias, and suggest using stratified chine translation, showing that most systems over- sampling, regression adjustment, or inverse prob- amplify gender bias (see also Prates et al.(2018)). ability weighting to avoid the bias in the data. Za- Hovy et al.(2020) show that this overamplifica- gheni and Weber(2015), study the use of Internet tion consistently makes translations sound older Data for demographic studies and propose two ap- and more male than the original authors. proaches to reduce the selection bias in their task. Several authors have suggested it is essential for If the ground truth is available, they adjust selec- language to be understood within the context of tion bias based on the calibration of a stochastic the author and their social environment Jurgens microsimulation. If unavailable, they suggest us- (2013); Danescu-Niculescu-Mizil et al.(2013); ing a difference-in-differences technique to find Hovy(2018); Yang et al.(2019). Considering out trends on the Web. the author demographics improves the accuracy of Zmigrod et al.(2019) show that gender-based text classifiersVolkova et al.(2013); Hovy(2015); selection bias could be addressed by data augmen- Lynn et al.(2017), and in turn, could lead to de- tation, i.e., by adding slightly altered examples to creased error disparity. the data. This addition addresses selection bias Semantic bias. Countermeasures for semantic originating in the features (X ), so that the source bias in embeddings typically attempt to adjust the model is fit on a more gender-representative sam- parameters of the embedding model to reflect a tar- ple. Their approach is similar to the reweighting get distribution more accurately. Because all of the of poll data based on demographics, which can above techniques can be applied for model fitting, be applied more directly to tweet-based population here we highlight techniques that are more specific surveillance (see our last case study, A.2). to addressing bias in embeddings. Li et al.(2018) introduce a model-based coun- Bolukbasi et al.(2016) suggest that techniques termeasure. They use an adversarial multitask- to de-bias embeddings can be classified into two learning setup to model demographic attributes as approaches: hard de-biasing (completely removes auxiliary tasks explicitly. By reversing the gradi- bias) and soft de-biasing (partially removes bias ent for those tasks during backpropagation, they avoiding side effects). Romanov et al.(2019) gen- effectively force the model to ignore confound- eralize this work to a multi-class setting, exploring ing signals associated with the demographic at- methods to mitigate bias in an occupation classifi- tributes. Apart from improving overall perfor- cation task. They reduce the correlation between mance across demographics, they show that it also the occupation of people and the word embedding protects user privacy. The findings from Elazar of their names, and manage to simultaneously re- and Goldberg(2018), however, suggest that even duce race and gender biases without reducing the with adversarial training, internal representations classifier’s performance. Manzini et al.(2019), still retain traces of demographic information. identify the bias subspace using principal compo- nent analysis and remove the biased components We would like to leave the reader with these using hard Neutralize and Equalize de-biasing and main points: (1) every predictive model with er- soft biasing methods proposed by Bolukbasi et al. rors is bound to have disparities over human at- (2016). The above examples evaluate success tributes (even those not directly integrating human through the semantic analogy task (Mikolov et al., attributes); (2) disparities can result from a vari- 2013), a method whose informativeness has since ety of origins — the embedding model, the feature been questioned, though (Nissim et al., 2019). For sample, the fitting process, and the outcome sam- a dedicated overview of semantic de-biasing tech- ple — within the standard predictive pipeline; (3) niques see Lauscher et al.(2020). selection of “protected attributes” (or human at- tributes along which to avoid biases) is necessary Social-Level Mitigation. Several initiatives for measuring bias, and often helpful for mitigat- propose standardized documentation to trace ing bias and increasing the generalization ability potential biases, and to ultimately mitigate them. of the models. Data Statements Bender and Friedman(2018) We see this paper as a step toward a unified un- suggest clearly disclosing data selection, an- derstanding of bias in NLP. We hope it inspires notation, and curation processes explicitly and further work in both identifying and countering transparently. Similarly, Gebru et al.(2018) bias, as well as conceptually and mathematically suggest Datasheets to cover the lifecycle of defining bias in NLP. data including “motivation for dataset creation; Framework Application Steps (TL;DR) dataset composition; data collection process; data preprocessing; dataset distribution; dataset main- 1. Specify target population and an ideal tenance; and legal and ethical considerations”. distribution of the attribute (A) to be Mitchell et al.(2019) extend this idea to include investigated for bias; Consult datasheets and 5 model specifications and performance details on data statements if available for the model different user groups. Hitti et al.(2019) propose a source; taxonomy for assessing the gender bias of a data 2. If outcome disparity or error disparity, set. While these steps do not directly mitigate check for potential origins: bias, they can encourage researchers to identify (a) if label bias: use post-stratification or and communicate sources of label or selection retrain annotators. bias. Such documentation, combined with a (b) if selection bias: use stratified sampling conceptual framework to guide specific mitiga- to match source to target populations, tion techniques, acts as an essential mitigation or use post-stratification, re-weighting technique at the level of the research community. techniques. See Appendix A.2 for case studies outlining (c) if overamplification: synthetically various types of bias in several NLP tasks. match distributions or add outcome disparity to cost function. 5 Conclusion (d) if semantic bias: retrain or retrofit embeddings considering approaches We present a comprehensive overview of the re- above, but with attributed (e.g., cent literature on predictive bias in NLP. Based gendered) words (rather than people) as on this survey, we develop a unifying conceptual the population. framework to describe bias sources and their ef- Acknowledegments fects (rather than just their effects). This frame- work allows us to group and compare works on The authors would like to thank Vinod Prab- countermeasures. Rather than giving the impres- hakaran, Niranjan Balasubramanian, Joao Sedoc, sion that bias is a growing problem, we would like Lyle Ungar, Rediet Abebe, Salvatore Giorgi, Mar- to point out that bias is not necessarily something garet Kern and the anonymous reviewers for their gone awry, but rather something nearly inevitable constructive comments. Dirk Hovy is a member of in statistical models. We do, however, stress that the Bocconi Institute for Data Science and Analyt- we need to acknowledge and address bias with ics (BIDSA) and the Data and Marketing Insights proactive measures. Having a formal framework (DMI) unit. of the causes can help us achieve this. 5 (Gebru et al., 2018; Bender and Friedman, 2018) References Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically Jussara M Almeida, Gisele L Pappa, et al. 2015. Twit- from language corpora contain human-like biases. ter population sample bias and its impact on predic- Science, 356(6334):183–186. tive outcomes: A case study on elections. In Pro- ceedings of the 2015 IEEE/ACM International Con- Murphy Choy, Michelle Cheong, Ma Nang Laik, and ference on Advances in Social Networks Analysis Koo Ping Shung. 2012. Us presidential elec- and Mining 2015, pages 1254–1261. ACM. tion 2012 prediction using census corrected twitter model. arXiv preprint arXiv:1211.0938. Fatemeh Almodaresi, Lyle Ungar, Vivek Kulkarni, Mohsen Zakeri, Salvatore Giorgi, and H. Andrew Murphy Choy, Michelle LF Cheong, Ma Nang Laik, Schwartz. 2017. On the distribution of lexical fea- and Koo Ping Shung. 2011. A sentiment analy- tures at multiple levels of analysis. In The 55th An- sis of singapore presidential election 2011 using nual Meeting of the Association for Computational twitter data with census correction. arXiv preprint Linguistics. arXiv:1108.5520.

Ron Artstein and Massimo Poesio. 2008. Inter-coder Christopher Clark, Mark Yatskar, and Luke Zettle- agreement for computational linguistics. Computa- moyer. 2019. DonâA˘ Zt´ take the easy way out: En- tional Linguistics, 34(4):555–596. semble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Reg Baker, J Michael Brick, Nancy A Bates, Mike Empirical Methods in Natural Language Processing Battaglia, Mick P Couper, Jill A Dever, Krista J and the 9th International Joint Conference on Natu- Gile, and Roger Tourangeau. 2013. Summary re- ral Language Processing (EMNLP-IJCNLP), pages port of the aapor task force on non-probability sam- 4060–4073. pling. Journal of Survey Statistics and Methodology, 1(2):90–143. Maximin Coavoux, Shashi Narayan, and Shay B Co- hen. 2018. Privacy-preserving neural representa- Roy F Baumeister, Kathleen D Vohs, and David C Fun- tions of text. In Proceedings of the 2018 Confer- der. 2007. Psychology as the science of self-reports ence on Empirical Methods in Natural Language and finger movements: Whatever happened to actual Processing, pages 1–10. behavior? Perspectives on Psychological Science, 2(4):396–403. Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, and Margaret Mitchell. 2015. Emily M Bender and Batya Friedman. 2018. Data Clpsych 2015 shared task: Depression and ptsd on statements for natural language processing: Toward twitter. In CLPsych@ HLT-NAACL, pages 31–39. mitigating system bias and enabling better science. Transactions of the Association for Computational Sam Corbett-Davies and Sharad Goel. 2018. The Linguistics, 6:587–604. measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint Adrian Benton, Margaret Mitchell, and Dirk Hovy. arXiv:1808.00023. 2017. Multitask learning for mental health condi- tions with limited social media data. In Proceedings Fintan Costello and Paul Watts. 2014. Surprisingly ra- of the 15th Conference of the European Chapter of tional: Probability theory plus noise explains biases the Association for Computational Linguistics: Vol- in judgment. Psychological review, 121(3):463. ume 1, Long Papers, volume 1, pages 152–162. Mick P Couper. 2013. Is the sky falling? new tech- nology, changing media, and the future of surveys. Richard A Berk. 1983. An introduction to sample se- In Survey Research Methods, volume 7, pages 145– American Socio- lection bias in sociological data. 156. logical Review, pages 386–398. Aron Culotta. 2014. Reducing in social Sudeep Bhatia. 2017. Associative judgment and vector media data for county health inference. In Joint Sta- space semantics. Psychological review, 124(1):1. tistical Meetings Proceedings, pages 1–12.

Su Lin Blodgett and Brendan O’Connor. 2017. Racial Cristian Danescu-Niculescu-Mizil, Robert West, Dan disparity in natural language processing: A case Jurafsky, Jure Leskovec, and Christopher Potts. study of social media African-American English. 2013. No country for old members: User lifecy- arXiv preprint arXiv:1707.00061. cle and linguistic change in online communities. In Proceedings of the 22nd international conference on Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, World Wide Web, pages 307–318. Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to Hal Daume III. 2007. Frustratingly easy domain adap- homemaker? word embeddings. In Ad- tation. In Proceedings of the 45th Annual Meeting of vances in neural information processing systems, the Association of Computational Linguistics, pages pages 4349–4357. 256–263. Jacob Eisenstein, Brendan O’Connor, Noah A Smith, models. In European Conference on Computer Vi- and Eric P Xing. 2010. A latent variable model for sion, pages 793–811. Springer. geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Joseph Henrich, Steven J Heine, and Ara Norenzayan. Language Processing, pages 1277–1287. Associa- 2010. The weirdest people in the world? Behav- tion for Computational Linguistics. ioral and brain sciences, 33(2-3):61–83. Yanai Elazar and Yoav Goldberg. 2018. Adversarial Miguel A Hernán, Sonia Hernández-Díaz, and removal of demographic attributes from text data. James M Robins. 2004. A structural approach to In Proceedings of the 2018 Conference on Empiri- selection bias. Epidemiology, pages 615–625. cal Methods in Natural Language Processing, pages 11–21. Yasmeen Hitti, Eunbee Jang, Ines Moreno, and Car- olyne Pelletier. 2019. Proposed taxonomy for gen- Sorelle A Friedler, Carlos Scheidegger, and Suresh der bias in text; a filtering methodology for the gen- Venkatasubramanian. 2016. On the (im) possibility der generalization subtype. In Proceedings of the of fairness. corr abs/1609.07236 (2016). First Workshop on Gender Bias in Natural Language Processing, pages 8–17, Florence, Italy. Association Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and for Computational Linguistics. James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Pro- Dirk Hovy. 2015. Demographic factors improve clas- ceedings of the National Academy of Sciences, sification performance. In Proceedings of the 53rd 115(16):E3635–E3644. Annual Meeting of the Association for Computa- tional Linguistics and the 7th International Joint Aparna Garimella, Carmen Banea, Dirk Hovy, and Conference on Natural Language Processing (Vol- Rada Mihalcea. 2019. Women’s syntactic resilience ume 1: Long Papers), volume 1, pages 752–762. and men’s grammatical luck: Gender-bias in part- of-speech tagging and dependency parsing. In Pro- Dirk Hovy. 2018. The social and the neural net- ceedings of the 57th Annual Meeting of the Asso- work: How to make natural language processing ciation for Computational Linguistics, pages 3493– about people again. In Proceedings of the Sec- 3498, Florence, Italy. Association for Computa- ond Workshop on Computational Modeling of Peo- tional Linguistics. pleâA˘Zs´ Opinions, Personality, and Emotions in So- cial Media, pages 42–49. Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, Daumé III, and Kate Crawford. 2018. Datasheets and Eduard Hovy. 2013. Learning whom to trust for datasets. In Proceedings of the 5 th Workshop on with mace. In Proceedings of the 2013 Conference Fairness, Accountability, and Transparency in Ma- of the North American Chapter of the Association chine Learning, Stockholm, Sweden. for Computational Linguistics: Human Language Salvatore Giorgi, Veronica Lynn, Keshav Gupta, San- Technologies, pages 1120–1130. dra Matz, Lyle Ungar, and H.A. Schwartz. 2019. Dirk Hovy, Federico Bianchi, and Tommaso Forna- Correcting sociodemographic selection biases for ciari. 2020. Can you translate that into man? com- population prediction. ArXiv. mercial machine translation systems include stylis- Bruce Glymour and Jonathan Herington. 2019. Mea- tic biases. In Proceedings of the 58th Annual Meet- suring the biases that matter: The ethical and ca- ing of the Association for Computational Linguis- sual foundations for measures of fairness in algo- tics, Seattle, WA, USA. Association for Computa- rithms. In Proceedings of the Conference on Fair- tional Linguistics. ness, Accountability, and Transparency, FAT* ’19, pages 269–278, New York, NY, USA. ACM. Dirk Hovy and Anders Søgaard. 2015. Tagging perfor- mance correlates with author age. In Proceedings Hila Gonen and Yoav Goldberg. 2019. Lipstick on a of the 53rd Annual Meeting of the Association for pig: Debiasing methods cover up systematic gender Computational Linguistics and the 7th International biases in word embeddings but do not remove them. Joint Conference on Natural Language Processing In Proceedings of the 2019 Conference of the North (Volume 2: Short Papers), volume 2, pages 483–488. American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Dirk Hovy and Shannon L Spruit. 2016. The social Volume 1 (Long and Short Papers), pages 609–614. impact of natural language processing. In Proceed- ings of the 54th Annual Meeting of the Association Ron D Hays, Honghu Liu, and Arie Kapteyn. 2015. for Computational Linguistics (Volume 2: Short Pa- Use of internet panels to conduct surveys. Behavior pers), volume 2, pages 591–598. research methods, 47(3):685–690. Jing Jiang and ChengXiang Zhai. 2007. Instance Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, weighting for domain adaptation in nlp. In Proceed- Trevor Darrell, and Anna Rohrbach. 2018. Women ings of the 45th annual meeting of the association of also snowboard: Overcoming bias in captioning computational linguistics, pages 264–271. Anna Jørgensen, Dirk Hovy, and Anders Søgaard. Thomas Manzini, Yao Chong Lim, Yulia Tsvetkov, and 2015. Challenges of studying and processing di- Alan W Black. 2019. Black is to criminal as cau- alects in social media. In Proceedings of the Work- casian is to police: Detecting and removing mul- shop on Noisy User-generated Text, pages 9–18. ticlass bias in word embeddings. arXiv preprint arXiv:1904.04047. Kenneth Joseph, Lisa Friedland, William Hobbs, Oren Tsur, and David Lazer. 2017. Constance: Modeling Robert R. McCrae and Paul T. Costa Jr. 1989. Rein- annotation contexts to improve stance classification. terpreting the Myers-Briggs type indicator from the arXiv preprint arXiv:1708.06309. perspective of the five-factor model of personality. Journal of Personality, 57(1):17–40. David Jurgens. 2013. That’s what friends are for: Infer- ring location in online social media platforms based Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey on social relationships. In Seventh International Dean. 2013. Efficient estimation of word represen- AAAI Conference on Weblogs and Social Media. tations in vector space. In Proceedings of ICLR. Margaret Mitchell, Simone Wu, Andrew Zaldivar, David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. Parker Barnes, Lucy Vasserman, Ben Hutchinson, 2017. Incorporating dialectal variability for socially Elena Spitzer, Inioluwa Deborah Raji, and Timnit equitable language identification. In Proceedings of Gebru. 2019. Model cards for model reporting. the 55th Annual Meeting of the Association for Com- In Proceedings of the Conference on Fairness, Ac- putational Linguistics (Volume 2: Short Papers), countability, and Transparency, pages 220–229. pages 51–57. ACM.

ML Kern, G Park, JC Eichstaedt, HA Schwartz, M Sap, Ehsan Mohammady and Aron Culotta. 2014. Using LK Smith, and LH Ungar. 2016. Gaining insights county demographics to infer attributes of twitter from social media language: Methodologies and users. In Proceedings of the joint workshop on so- challenges. Psychological methods, 21(4):507–525. cial dynamics and personal attributes in social me- dia, pages 7–16. Svetlana Kiritchenko and Saif Mohammad. 2018. Ex- amining gender and race bias in two hundred sen- Malvina Nissim, Rik van Noord, and Rob van der timent analysis systems. In Proceedings of the Goot. 2019. Fair is better than sensational: Man Seventh Joint Conference on Lexical and Computa- is to doctor as woman is to doctor. arXiv preprint tional Semantics, pages 43–53. arXiv:1905.09866.

Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, David K Park, Andrew Gelman, and Joseph Bafumi. and Cass R Sunstein. 2018. Discrimination in the 2006. State-level opinions from national surveys: age of algorithms. Journal of Legal Analysis, 10. Poststratification using multilevel logistic regres- sion. Public opinion in state politics, pages 209–28. Austin C Kozlowski, Matt Taddy, and James A Evans. 2018. The geometry of culture: Analyzing mean- Rebecca J Passonneau and Bob Carpenter. 2014. The ing through word embeddings. arXiv preprint benefits of a model of annotation. Transactions arXiv:1803.09288. of the Association for Computational Linguistics, 2:311–326. Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019. Measuring bias in Silviu Paun, Bob Carpenter, Jon Chamberlain, Dirk contextualized word representations. arXiv preprint Hovy, Udo Kruschwitz, and Massimo Poesio. 2018. arXiv:1906.07337. Comparing bayesian models of annotation. Trans- actions of the Association for Computational Lin- guistics, 6:571–585. Anne Lauscher, Goran Glavaš, Simone Paolo Ponzetto, and Ivan Vulic.´ 2020. A general framework for im- Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, plicit and explicit debiasing of distributional word and Chris Callison-Burch. 2014. The language de- vector spaces. mographics of amazon mechanical turk. Transac- tions of the Association for Computational Linguis- Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. tics, 2:79–92. Towards robust and privacy-preserving text repre- sentations. In Proceedings of the 56th Annual Meet- James W Pennebaker. 2011. The secret life of pro- ing of the Association for Computational Linguistics nouns. New Scientist, 211(2828):42–45. (Volume 2: Short Papers), volume 2, pages 25–30. James W. Pennebaker and Lori D. Stone. 2003. Words Veronica E. Lynn, Youngseo Son, Vivek Kulkarni, Ni- of wisdom: Language use over the life span. Journal ranjan Balasubramanian, and H. Andrew Schwartz. of Personality and Social Psychology, 85(2):291. 2017. Human centered nlp with user-factor adap- tation. In Empirical Methods in Natural Language Barbara Plank, Dirk Hovy, and Anders Søgaard. 2014. Processing. Learning part-of-speech taggers with inter-annotator agreement loss. In Proceedings of the 14th Confer- translation. In Proceedings of the 57th Annual Meet- ence of the European Chapter of the Association for ing of the Association for Computational Linguis- Computational Linguistics, pages 742–751. tics, pages 1679–1684, Florence, Italy. Association for Computational Linguistics. Marcelo OR Prates, Pedro H Avelar, and Luís C Lamb. 2018. Assessing gender bias in machine translation: Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, a case study with google translate. Neural Comput- Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth ing and Applications, pages 1–19. Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating gender bias in natural language Daniel Preo¸tiuc-Pietro, Johannes Eichstaedt, Gregory processing: Literature review. In Proceedings of Park, Maarten Sap, Laura Smith, Victoria Tobolsky, the 57th Conference of the Association for Computa- H Andrew Schwartz, and Lyle Ungar. 2015. The tional Linguistics, pages 1630–1640, Florence, Italy. role of personality, age, and gender in tweeting about Association for Computational Linguistics. mental illness. In Proceedings of the 2nd workshop on computational linguistics and clinical psychol- Harini Suresh and John V Guttag. 2019. A framework ogy: From linguistic signal to clinical reality, pages for understanding unintended consequences of ma- 21–30. chine learning. arXiv preprint arXiv:1901.10002. Chris Sweeney and Maryam Najafian. 2019. A trans- Daniel Preotiuc-Pietro, Maarten Sap, H Andrew parent framework for evaluating unintended demo- Schwartz, and Lyle Ungar. 2015. Mental illness graphic bias in word embeddings. In Proceedings detection at the world well-being project for the of the 57th Conference of the Association for Com- clpsych 2015 shared task. In Proceedings of the putational Linguistics, pages 1662–1667, Florence, Workshop on Computational Linguistics and Clini- Italy. Association for Computational Linguistics. cal Psychology: From Linguistic Signal to Clinical Reality, NAACL. Spencer S Swinton. 1981. Predictive bias in gradu- ate admissions tests. ETS Research Report Series, Iyad Rahwan, Manuel Cebrian, Nick Obradovich, 1981(1):i–53. Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W Crandall, Nicholas A Christakis, Peter Trudgill. 2000. Sociolinguistics: An introduction Iain D Couzin, Matthew O Jackson, et al. 2019. Ma- to language and society. Penguin UK. chine behaviour. Nature, 568(7753):477. Amos Tversky and Daniel Kahneman. 1973. Availabil- Alexey Romanov, Maria De-Arteaga, Hanna Wal- ity: A heuristic for judging frequency and probabil- lach, Jennifer Chayes, Christian Borgs, Alexandra ity. Cognitive psychology, 5(2):207–232. Chouldechova, Sahin Geyik, Krishnaram Kentha- padi, Anna Rumshisky, and Adam Tauman Kalai. Svitlana Volkova, Theresa Wilson, and David 2019. What’s in a name? reducing bias in bios with- Yarowsky. 2013. Exploring demographic lan- out access to protected attributes. arXiv preprint guage variations to improve multilingual sentiment arXiv:1904.05233. analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Rachel Rudinger, Jason Naradowsky, Brian Leonard, Language Processing, pages 1815–1827. and Benjamin Van Durme. 2018. Gender bias in Kellie Webster, Marta Recasens, Vera Axelrod, and Ja- coreference resolution. In Proceedings of the 2018 son Baldridge. 2018. Mind the gap: A balanced Conference of the North American Chapter of the corpus of gendered ambiguous pronouns. Transac- Association for Computational Linguistics: Human tions of the Association for Computational Linguis- Language Technologies, Volume 2 (Short Papers), tics, 6:605–617. volume 2, pages 8–14. Thomas A. Widiger and Douglas B. Samuel. 2005. Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, Diagnostic categories or dimensions? A question and Noah A. Smith. 2019. The risk of racial bias in for the Diagnostic and Statistical Manual of Men- hate speech detection. In Proceedings of the 57th tal Disorders—Fifth Edition. Journal of Abnormal Conference of the Association for Computational Psychology, 114(4):494. Linguistics, pages 1668–1678, Florence, Italy. As- sociation for Computational Linguistics. Diyi Yang, Robert E Kraut, Tenbroeck Smith, Eli- jah Mayfield, and Dan Jurafsky. 2019. Seekers, H Andrew Schwartz, Johannes C Eichstaedt, Mar- providers, welcomers, and storytellers: Modeling garet L Kern,