
Social Norm Bias: Residual Harms of Fairness-Aware Algorithms

Myra Cheng
California Institute of Technology, Microsoft Research
Pasadena, California, USA

Maria De-Arteaga
University of Texas at Austin
Austin, Texas, USA

Lester Mackey
Microsoft Research
Cambridge, Massachusetts, USA

Adam Tauman Kalai
Microsoft Research
Cambridge, Massachusetts, USA

ABSTRACT
Many modern learning algorithms mitigate bias by enforcing fairness across coarsely-defined groups related to a sensitive attribute like gender or race. However, the same algorithms seldom account for the within-group harms that arise due to the heterogeneity of group members. In this work, we characterize Social Norm Bias (SNoB), a subtle but consequential type of discrimination that may be exhibited by automated decision-making systems, even when these systems achieve group fairness objectives. We study this issue through the lens of gender bias in occupation classification from biographies. We quantify SNoB by measuring how an algorithm's predictions are associated with conformity to gender norms, which is measured using a machine learning approach. This framework reveals that for classification tasks related to male-dominated occupations, fairness-aware classifiers favor biographies written in ways that align with masculine gender norms. We compare SNoB across fairness intervention techniques and show that post-processing interventions do not mitigate this type of bias at all.

1 INTRODUCTION
As automated decision-making systems play a growing role in our daily lives, concerns about algorithmic unfairness have come to light. It is well-documented that machine learning (ML) algorithms can perpetuate existing social biases and discriminate against marginalized populations [16, 59, 74]. To avoid algorithmic discrimination based on sensitive attributes, various approaches to measure and achieve "algorithmic fairness" have been proposed. These approaches are typically based on group fairness, which partitions a population into groups based on a protected attribute (e.g., gender, race, or religion) and then aims to equalize some metric of the system across the groups [2, 27, 35, 44, 50, 61].

Group fairness makes the implicit assumption that a group is defined solely by the possession of particular characteristics [39], ignoring the heterogeneity within groups. For example, much of the existing work on gender bias in ML compares groups based on binary gender; however, different women's experiences of gender and concepts of womanhood may vary drastically [72, 82]. These approaches do not account for the complex, multi-dimensional nature of concepts like gender and race [17, 34]. Consider, for instance, social norms, which represent the implicit expectations of how particular groups behave. In many settings, adherence to or deviation from these expectations results in harm, but current algorithmic fairness approaches overlook whether and how ML may replicate these harms.

In this work, we characterize Social Norm Bias (SNoB)—the associations between an algorithm's predictions and individuals' adherence to social norms—as a source of algorithmic unfairness. We study SNoB through the task of occupation classification on a dataset of online biographies. In this setting, masculine/feminine SNoB occurs when an algorithm favors biographies written in ways that adhere to masculine/feminine gender norms, respectively. We examine how existing algorithmic fairness techniques neglect this issue.

Discrimination based on gender norms has concrete harms, which are documented in the social science literature (Section 3.2). For example, in the labor context, women in the workplace are often obligated to conform to descriptive stereotypes [37], while male-dominated professions such as computing may discriminate against those who do not adhere to masculine social norms [29]. We illuminate the gap between these concerns and current techniques for algorithmic fairness. In Section 2, we discuss related work in the computer science and fairness literature. In Section 3, we provide background on how previous work motivates and enables our measurement and analysis of gender norms.

Our approach, which is described in Section 4, measures how an algorithm's predictions are associated with masculine or feminine gender norms. We quantify a biography's adherence to gender norms using a supervised machine learning approach. This framework quantifies an algorithm's bias on another dimension of gender beyond explicit binary labels. Using this approach, we analyze the extent to which fairness-aware approaches encode gender norms in Section 5. We find that approaches that improve group fairness still exhibit SNoB. In particular, post-processing approaches are most closely associated with gender norms, while some in-processing approaches that improve group fairness also simultaneously reduce SNoB. We also collect a dataset of nonbinary biographies, on which we evaluate the classifiers' behavior in Section 6.1. We provide intuition for the underlying mechanisms of SNoB by exploring classifiers' reliance on gendered words in Section 6.2 and verify the robustness of our framework in Section 6.3.

The associations studied in this work may lead to representational and allocational harms for feminine-expressing people in male-dominated occupations [7, 10]. Furthermore, when fairness-aware algorithms exhibit SNoB, these harms are not only perpetuated but also obscured by claims of group fairness.

2 RELATED WORK
It is well-established that automated decision-making systems based on structured data reflect human biases. For example, semantic representations such as word embeddings encode gendered and racial biases [14, 18, 65, 75]. In coreference resolution, systems associate and thus assign gender pronouns to occupations when they co-occur more frequently in the data [66]. In toxicity detection, language related to marginalized groups is disproportionately flagged as toxic [83]. Various fairness-aware approaches have been developed to mitigate these issues from a group fairness perspective [15, 33, 47, 60].

However, existing "debiasing" approaches still exhibit discriminatory behavior and may cause harm to marginalized populations based on reductive definitions [5]. Using names or pronouns as a proxy for race or gender [14, 70] is not an accurate representation of the data, and furthermore, it may reinforce categorical definitions, which leads to further harms to those who are marginalized by these reductive notions [46, 48]. The diversity and inclusion literature also discusses the limitations of considering only reductive definitions across groups. For instance, Mitchell et al. [55] illustrate how image search results that depict diverse demographics fail to be meaningfully inclusive. More broadly, work such as Blodgett et al. [10] and Hanna et al. [34] highlights how bias mitigation approaches fail to consider relevant social science literature. Our work contributes to this body of literature by identifying a concrete type of harm, arising from reductive definitions, that remains in fairness-aware approaches.

Our work proposes a method to characterize a machine learning algorithm's associations with social norms, for which there are no ground-truth labels. Previous work in unsupervised bias measurement looks at metrics such as pointwise mutual information [65] and generative modelling [41]. Aka et al. [4] compare these unsupervised techniques for uncovering associations between a model's predictions and unlabeled sensitive attributes, which is similar to our goal. While they investigate different metrics for achieving this in the setting of a large discrete label space, we focus specifically on how social norms of writing in masculine or feminine ways affect algorithmic predictions.

Along a similar vein, Field and Tsvetkov [30] develop an approach to identify implicit gender bias from short text. However, rather than studying social norms, they focus on the task of identifying implicitly biased comments directed towards women, such as opinions about the appearance of public figures.

Our proposed machine learning approach to identify adherence to social norms builds on work such as Cryan et al. [24] and Tang et al. [76]. Cryan et al. [24] develop a machine learning-based method to detect gender stereotypes and compare the approach to crowd-sourced lexicons of gender stereotypes. Tang et al. [76] take a simpler approach to study gendered language in job advertisements. Their approach relies on manually compiled lists of gendered words and uses only the frequency of these words as a measure of gender bias. We quantify gender norms based on the predictions of a gender classifier. This allows us to leverage more information to quantify how biographies are written in styles that are similar to masculine or feminine gender norms.

Adherence to social norms is a form of heterogeneity within the groups used by group fairness approaches, which is connected to the concept of subgroup fairness. For example, work on multicalibration [36] and subgroup fairness [45] aims to bridge the gap between group and individual fairness by mitigating unfairness for subgroups within groups, even when they are not explicitly labeled. Our approach differs by making explicit the types of harms that we are concerned about and how they are related to social discrimination.

3 BACKGROUND
We provide background on the different axes of gender and how they are operationalized in the occupation setting and in the English language. We describe the gaps between these harms and existing approaches to address gender bias in automated hiring tasks.

3.1 The Multiplicity of Gender
In the literature, the term "gender" is a proxy for different constructs depending on the context [46]. It may mean gender identity, which is one's "felt, desired or intended identity" [32, p.2], or gender expression, which is how one "publicly expresses or presents their gender... others perceive a person's gender through these attributes" [21, p.1]. These concepts are also related to gender norms, i.e. "the standards and expectations to which women and men generally conform" [52, p.717], including personality traits, behaviors, occupations, and appearance [3]. Gender norms are closely related to and provide a frame of reference for stereotypes [42], which are ingrained beliefs (biases) about what a particular group's traits are and should be [24, 71].

These various notions of social gender encompass much more than the categorical gender labels that are used as the basis for group fairness approaches [20]. We focus on discrimination related to the ways that individuals' gender expression adheres to social gender norms.

There is a growing body of work on how language models operationalize stereotypes, which are related to social norms. Datasets such as StereoSet [57] and CrowS-Pairs [58] aim to provide benchmarks to characterize the extent of stereotyping in language models. These works have varying definitions of stereotypes [11]. We focus on social norms as a broader phenomenon than stereotypes, which are often embedded into social norms [67]. Since our notion of social norms arises from the language of the dataset, we capture not only stereotypes but also gender-related patterns present in the data. For example, the mention of a women's college or fraternity in one's biography is a proxy for gender; it is a norm rather than a stereotype that people who attend women's colleges identify as "she" while those who participate in fraternities identify as "he".

3.2 Harms Related to Gender Norms in the Workplace
Our concerns about the use of gender norms in machine learning systems are grounded in studies of how gender norms have been operationalized in various occupations, causing harm to gender minorities. It is well-established that "occupations are socially and culturally 'gendered'" [74]; many jobs in science, technology, and engineering are perceived as masculine [29, 49]. Women in these fields have been found to perform their gender in particular ways to gain respect and acceptance from their peers, in turn fostering a "masculine" environment that is hostile to women [62].

In social psychology, descriptive stereotypes are attributes believed to characterize women as a group. Heilman [37, 38] studies how the perceived lack of fit between feminine stereotypic attributes and male gender-typed jobs results in gender bias and impedes women's careers.

When these patterns are replicated by SNoB in machine learning algorithms, this results in two types of harms. The associations that we highlight may lead to 1) representational harm, when actual members of the occupation are made invisible, and 2) allocational harm, when certain individuals are allocated fewer career opportunities based on their degree of adherence to gender norms [7, 10].

3.3 Gender Norms in Language
Different English words have gender-related connotations [56]. Crawford et al. [22] provide a corpus of 600 words and human-labeled gender scores, scored on a 1-5 scale (1 is the most feminine, 5 is the most masculine) by undergraduates at U.S. universities. They find that words referring to explicitly gendered roles such as wife, husband, princess, and prince are the most strongly gendered, while words such as pretty and handsome also skew strongly in the feminine and masculine directions respectively. Gender also affects the ways that people are perceived and described [51]. A study on resumes describes gender-based disparities in how people represent themselves, despite having similar experiences [73]. In letters of recommendation, women are described as more communal and less agentic than men, which harms the women who are perceived in this way [51]. Wikipedia articles also contain significant gender differences, such as notable women's biographies focusing more on their romantic and family lives [78].

3.4 Addressing Gender Bias in Automated Hiring
Audit studies reveal that employers tend to discriminate against women [8, 40]. These biases are also replicated in automated hiring. For example, De-Arteaga et al. [25] measure the gender gap in error rates of an occupation classification algorithm. Many in academia and industry alike have been motivated to mitigate these concerns [12, 63, 68]. For instance, LinkedIn developed a post-processing approach for ranking candidates so that their candidate recommendations are demographically representative of the broader candidate pool; their system is deployed across a service affecting more than 600 million users [31]. Other intervention techniques have also been proposed [28, 64]. These approaches share a reliance on categorical gender labels to measure fairness.

3.5 Group Fairness Metric: Gap^RMS
Many established fairness intervention approaches focus on reducing Gap^RMS, which is a measure of the difference in the classifier's true positive rate (TPR) between "she" and "he" biographies. This is the group fairness metric used by prior research that studies the occupation classification task we consider [25, 64]. Let a, ¬a be the binary values of the sensitive attribute (gender). For the set of occupations C, we have

    TPR_{c,a} = P[\hat{Y}_c = c \mid Y_c = c, A = a],
    Gap_{c,a} = TPR_{c,a} - TPR_{c,\neg a}

for occupation c ∈ C. Then, Gap^RMS is defined as

    Gap^{RMS} = \sqrt{\frac{1}{|C|} \sum_{c \in C} Gap_{c,a}^2}.

The root mean square (RMS) penalizes larger values of the gender gap more heavily than a linear average across the occupations. Although fairness-aware classifiers may be effective at reducing Gap^RMS, they neglect other axes of harm related to SNoB, which our framework, as described in the next section, aims to capture.

4 METHODS
To study SNoB, we focus on the presence of gender norms in occupation classification, a component of automated recruiting. We assume that a "fair" occupation classification algorithm should not exhibit gender bias, including SNoB, since someone's career prospects should not be hindered by their gender. There ought to be no association between the probability that the classifier correctly predicts someone's occupation and any dimension of the person's gender (pronouns, expression, etc.). However, unlike for gender pronouns, there is no ground-truth label for other notions of gender in our biography dataset. Thus, we use a data-driven approach to measure each biography's adherence to gender norms. We validate this approach by comparing it to crowd-sourced notions of gender norms. We then compare the degree to which individuals' adherence to gender norms is correlated with occupation classification predictions. We introduce metrics to quantify masculine and feminine SNoB in occupation classifiers on two different scales: single-occupation association and cross-occupation association.

4.1 Occupation Classification
4.1.1 Dataset. We use the dataset[1] and task described by De-Arteaga et al. [25]. The dataset, containing 397,340 biographies spanning twenty-eight occupations, is obtained by using regular expressions to filter the Common Crawl for online biographies.

Each biography is labeled with its gender based on the use of "she" or "he" pronouns; biographies that contain neither pronoun are not captured in the original data (in Section 6.1, we study the effect of SNoB on a newly-collected small set of biographies with nonbinary pronouns). Let H_c, S_c be the sets of biographies in occupation c using "he" and "she" respectively. |H_c|, |S_c| are the numbers of biographies in the respective sets. To preserve the ratios between |H_c| and |S_c|, we use a stratified split to create the training, validation, and test datasets, containing 65%, 10%, and 25% of the biographies respectively. This split is used by other work focused on this dataset [25, 64]. We use the data to train and evaluate algorithms that predict a biography's occupation title from the subsequent sentences.

[1] The dataset is publicly available at http://aka.ms/biasbios.

4.1.2 Semantic Representations and Learning Algorithms. For the occupation classification algorithms, we use three semantic representations with different degrees of complexity: bag-of-words, word embeddings, and BERT. In the bag-of-words (BOW) representation, a biography x is represented as a sparse vector of the frequencies of the words in x. BOW is widely used in settings where interpretability is important. In the word embedding (WE) representation, x is represented by an average of the fastText word embeddings [13, 54] for the words in x. Previous work demonstrates that the WE representation captures semantic information effectively [1]. For the BOW and WE representations, we train a one-versus-all logistic regression model with L2 regularization on the training set, as done by De-Arteaga et al. [25]. The BERT contextualized word embedding model [26] is state-of-the-art for various natural language processing tasks, and it has been widely adopted for many uses. Unlike the other language representations, a biography's BERT encoding is context-dependent. We fine-tune the BERT model, which pre-trains deep bidirectional representations from unlabeled English text [81], for the occupation classification task.

4.2 Quantifying Gender Norms
We leverage the natural language properties of our biography dataset to measure how much a biography's gender expression aligns with social norms. As discussed in Section 3.3, there are many differences in the ways that language is used to describe people of different genders [53] and in the ways that people of different genders choose to use language [6].

Building on work that studies gender norms in language [22, 56], one could attempt to measure biographies' adherence to gender norms by obtaining crowdsourced gender ratings for every word used in the dataset and then scoring each biography using these ratings. However, the method of generating biography ratings from aggregated word scores would be a non-trivial question, and human-labeled corpora of gendered words [22, 24] are limited to a few hundred words, while our biography dataset has tens of thousands of unique words. Thus, we take a machine learning approach to quantify these notions rather than relying on the human-labeled corpora alone. We train a logistic regression classifier G on the biographies dataset to distinguish between biographies labeled with "she" or "he." For each biography, this yields a score of how much it adheres to the norms of "s/he" biographies.

Since some occupations are dominated by biographies of a single gender, we pre-process the data to prevent G from learning to identify gender from occupation alone. After pre-processing the data using the weighting scheme described in Section 5.1.1, the "she" and "he" biographies in each occupation are weighted equally. We train G using the WE semantic representation and excluding explicit gender indicators (e.g. "she", "he", "mr.", "mrs."). G achieves 0.68 accuracy.

For an individual biography, we use G's predicted probability of "s/he" as a measure of how much the biography aligns with feminine/masculine gender norms. To validate that G learns a meaningful notion of gender norms, we compare its similarity to human-labeled gender scores. Specifically, for a corpus of 600 words with gender scores labeled via crowdsourcing [22], we compare the human-labeled gender scores reported in the study to G's weights of these words, defined as the cosine similarity with the coefficients of the logistic regression classifier trained on the WE representation (see Section 6.2). We find a strong correlation (Spearman's r-value 0.68) between these values (Figure 1). In Section 6.3, we perform a robustness test using a gender classifier that omits occupation-relevant words.

[Figure 1: scatter plot of words (e.g. brother, testosterone, baseball, wife, lipstick, sorority) with their human-labeled gender scores on the x-axis ("Genderedness from Human-Labeled Words", most feminine to most masculine) and their weights in the WE gender classifier G on the y-axis; line of best fit, Spearman's r = 0.68.]

Figure 1: To validate that the WE gender classifier G effectively measures how the language of biographies adheres to gender norms, we use the corpus provided by Crawford et al. [22]. We compare the words' human-labeled gender scores (x-axis) to their weights in G (y-axis). We find a strong correlation between the notion of gender norms learned by G and the human-labeled gender scores.
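As a minimal sketch of the procedure in Section 4.2, G can be built by averaging fastText vectors over each biography (with explicit gender indicators removed) and fitting a logistic regression whose sample weights balance "she" and "he" biographies within each occupation. This is an assumption-laden reconstruction, not the authors' implementation: the embedding loader, token handling, and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
# from gensim.models.fasttext import load_facebook_vectors  # one way to load fastText vectors

# Explicit gender indicators to exclude (illustrative list, not the paper's exact set)
EXPLICIT = {"she", "he", "her", "his", "him", "hers", "mr", "mrs", "ms"}

def we_embed(tokens, vectors):
    """WE representation: average fastText embedding, skipping explicit gender words."""
    words = [w for w in tokens if w.lower().strip(".") not in EXPLICIT and w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def balance_weights(genders, occupations):
    """Sample weights so that "she" and "he" biographies carry equal total
    weight within every occupation (the Section 5.1.1 scheme)."""
    genders, occupations = np.asarray(genders), np.asarray(occupations)
    w = np.ones(len(genders))
    for c in np.unique(occupations):
        in_c = occupations == c
        n_she = (in_c & (genders == "she")).sum()
        n_he = (in_c & (genders == "he")).sum()
        m = max(n_she, n_he, 1)
        w[in_c & (genders == "she")] = n_he / m
        w[in_c & (genders == "he")] = n_she / m
    return w

# vectors = load_facebook_vectors("cc.en.300.bin")            # hypothetical local path
# X = np.stack([we_embed(toks, vectors) for toks in tokenized_bios])
# G = LogisticRegression(max_iter=1000).fit(
#         X, genders, sample_weight=balance_weights(genders, occupations))
# g_scores = G.predict_proba(X)[:, list(G.classes_).index("she")]  # G(x)
```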

4.3 Measuring Masculine and Feminine SNoB
4.3.1 Single Occupation Setting. For a given biography x, the occupation classification algorithm Y_c outputs the predicted probability Y_c(x) that the biography x belongs to occupation c. The gender classifier G outputs the predicted probability G(x) that the biography is labeled with "she".

To evaluate SNoB, we consider the correlation r_c between G(x) and Y_c(x), the predicted probabilities from the two classifiers, across the "she" bios in the occupation. Specifically, we compute Spearman's rank correlation coefficient r_c between {Y_c(x) | x ∈ S_c} and {G(x) | x ∈ S_c}. We use Spearman's correlation rather than Pearson's since the latter measures only the strength of a linear association, while we aim to capture general monotonic relationships. The magnitude of r_c is a measure of the degree of SNoB exhibited by the occupation classifier. A positive/negative value indicates that more feminine/masculine language is rewarded by Y_c.

Let the fraction of biographies in occupation c that use "she" be

    p_c = \frac{|S_c|}{|S_c| + |H_c|}.

If p_c < 0.5, c is male-dominated, and vice versa. If r_c is more negative in more male-dominated occupations, this implies that individuals whose biographies are more aligned to masculine gender norms are also more likely to have their occupation correctly predicted by the occupation classification algorithm. Since r_c is computed from S_c, these associations are present within the gender group. Thus, a negative r_c indicates that classification algorithms for male-dominated occupations operationalize gendered language, privileging not only the referential gender of pronouns but also the "she" biographies with more masculine words and writing styles.

4.3.2 SNoB Across Occupations. Let

    r_C = \{ r_c \mid c \in C \} \in [-1, 1]^{|C|},
    p_C = \{ p_c \mid c \in C \} \in [0, 1]^{|C|},

where C is the set of occupations. Let ρ(p_C, r_C) be Spearman's rank correlation coefficient between p_C and r_C. We use ρ(p_C, r_C) to quantify SNoB across all occupations: a high value of ρ(p_C, r_C) implies that in more gender-imbalanced occupations, r_c is larger in magnitude. Furthermore, a positive ρ(p_C, r_C) indicates that the more gender-imbalanced an occupation is, the more it favors adherence to the gender norms of the over-represented gender. ρ(p_C, r_C) enables us to compare the degree of SNoB in different techniques, including fairness-aware interventions.

5 COMPARING FAIRNESS APPROACHES
Algorithmic group fairness approaches fall broadly into three paradigms: pre-, post-, and in-processing techniques [43]. In Section 5.1, we describe the algorithmic fairness approaches that we evaluate, and we compare their SNoB in Section 5.2. In Section 6.1 we present an analysis on a dataset of biographies with non-binary pronouns. Section 6.2 provides insight into the mechanisms behind why the classifiers may rely on gender norms. Section 6.3 discusses the robustness analysis we perform to verify that our approach effectively captures adherence to social norms.

5.1 Fairness Intervention Techniques
5.1.1 Pre-processing. Pre-processing fairness approaches modify the data before it is used to train the algorithm, such as by reweighting the training distribution to account for representation imbalances or by changing features of the data [19, 43]. Since many of the occupations are gender-imbalanced in our dataset, we adopt the pre-processing (PR) approach of re-weighting the data such that half of the total weight belongs to each of S_c and H_c [79]. Specifically, we assign weight α_x to a biography x as follows:

    \alpha_x = \begin{cases} |H_c| / \max\{|S_c|, |H_c|\} & \text{if } x \in S_c, \\ |S_c| / \max\{|S_c|, |H_c|\} & \text{if } x \in H_c. \end{cases}

5.1.2 Post-processing. Post-processing fairness approaches apply an intervention after training the algorithm to balance some metric across groups [35, 44, 50, 61]. One common approach (PO) is to change the decision threshold for particular groups to equalize relevant fairness-related metrics, such as false positive rates or acceptance rates [35]. These approaches are relatively cost-effective to implement given that they do not require re-training, and they have been deployed in large-scale systems such as LinkedIn's Recruiter service [31].

5.1.3 In-processing. We also consider various state-of-the-art in-processing group fairness approaches, which modify the algorithm at training time, including decoupled classifiers (DE), reductions (RE), and Covariance Constrained Loss (CoCL). In DE, a separate classifier is trained for each gender group [28]. In RE, a classification task is reduced to a sequence of cost-sensitive classification problems [2]. RE is the primary in-processing mitigation method in the Fairlearn Python package [9]. We consider RE only for the WE representation since it is too computationally expensive to use with BERT and unable to reduce either Gap^RMS or ρ(p_C, r_C) with BOW. Finally, we evaluate CoCL, which differs from most in-processing techniques in that it does not require access to gender labels. The method adds a penalty to the loss function that minimizes the covariance between the probability that an individual's occupation is correctly predicted and the word embedding of their name [64]. Romanov et al. [64] validate CoCL's effectiveness in reducing Gap^RMS on the same biographies dataset using the BOW representation, which we also use.

Table 1: Comparison of group fairness and SNoB metrics across fairness approaches. Gap^RMS, ρ(p_C, r_C), and ρ(p_C, r_C^irrev) are defined in Sections 3.5, 4.3.2, and 6.3, respectively. For BOW and WE, the one-versus-all Y_c accuracy is averaged across all occupations. For BERT, the model is a multi-class classifier. Although post-processing (PO) fairness intervention techniques mitigate Gap^RMS the most, they have the highest values of ρ(p_C, r_C). This suggests that they are the least effective at reducing SNoB.[2]

    Approach      Y_c Accuracy   Gap^RMS   ρ(p_C, r_C)   ρ(p_C, r_C^irrev)
    BOW, PR       0.92           0.078     0.79**        0.44*
    BOW, PO       0.95           0         0.80**        0.52**
    BOW, DE       0.96           0.10      0.78**        0.26
    BOW, CoCL[3]  0.95           0.074     0.74**        0.46
    WE, PR        0.94           0.061     0.77**        0.32
    WE, PO        0.97           0         0.85**        0.60**
    WE, DE        0.94           0.060     0.82**        0.43*
    WE, RE        0.88           0.035     0.59**        0.06
    BERT, PR      0.84           0.11      0.46*         0.08
    BERT, PO      0.85           0         0.46*         -0.04
    BERT, DE      0.85           0.22      0.48*         -0.05

[2] We compute p-values for the two-sided test of zero correlation between p_c and r_c using SciPy's spearmanr function [77]. Values marked with a * indicate that the p-value is < 0.05. Values marked with ** denote that the p-value is < 0.01.
[3] CoCL [64] is modulated by a hyperparameter λ that determines the strength of the fairness constraint. We use λ = 2, which Romanov et al. [64] find to have the smallest Gap^RMS on the occupation classification task.
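The SNoB columns of Table 1 reduce to two applications of Spearman's rank correlation (Section 4.3): r_c over each occupation's "she" biographies, and ρ(p_C, r_C) across occupations. A minimal sketch, assuming the predicted probabilities Y_c(x) and G(x) have already been computed (variable names hypothetical):

```python
import numpy as np
from scipy.stats import spearmanr

def snob_metrics(occ_probs, g_scores, genders, occ_labels):
    """occ_probs[c] : array of Y_c(x) for every biography x
       g_scores     : array of G(x) for every biography x
       genders      : array of "she"/"he" pronoun labels
       occ_labels   : array of true occupation labels
       Returns {c: r_c}, {c: p_c}, and rho(p_C, r_C)."""
    genders, occ_labels = np.asarray(genders), np.asarray(occ_labels)
    r, p = {}, {}
    for c in occ_probs:
        in_c = occ_labels == c
        she_in_c = in_c & (genders == "she")
        # r_c: Spearman correlation of Y_c(x) and G(x) over the "she" bios in c
        r[c], _ = spearmanr(occ_probs[c][she_in_c], g_scores[she_in_c])
        # p_c: fraction of biographies in c that use "she"
        p[c] = she_in_c.sum() / in_c.sum()
    order = sorted(occ_probs)
    rho, _ = spearmanr([p[c] for c in order], [r[c] for c in order])
    return r, p, rho
```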

5.2 SNoB vs. Group Fairness Metrics for Different Approaches
We use ρ(p_C, r_C), introduced in Section 4.3 as a measure of SNoB, to compare the cross-occupation associations of different fairness approaches. Table 1 lists the values of accuracy, the group fairness metric Gap^RMS, and ρ(p_C, r_C) for each approach.

Figure 2 plots the gender proportion in each occupation versus its value of r_c when using the PO approach with a WE representation, and Figure 3 visualizes this trend for various fairness-aware classifiers. For all approaches, ρ(p_C, r_C) quantifies the relationship between gender imbalance and r_c.

[Figure 2: scatter of r_c (y-axis, "Corr. b/w Occupation and Gender Predictions (r_C)") against the percentage of "she" biographies per occupation (x-axis, p_C) for the WE, PO approach; each point is labeled with its occupation (e.g. surgeon, software_engineer, nurse, dietitian); ρ(p_C, r_C) = 0.848.]

Figure 2: SNoB across occupations using a post-processing (PO) approach. The extent to which an algorithm's predictions align with gender norms (y-axis) is correlated with the gender imbalance in the occupation (x-axis). Ideally, without any SNoB, the correlation r_c = 0, so every point would lie on the dotted red line. Other representations (BOW, BERT) have similar trends. Note that these values are the same for the fairness-unaware approach as the PO approach, since the latter does not change the ordering of individuals.

We find that the PO approaches have the largest value of ρ(p_C, r_C), i.e. the strongest associations with gender norms. Note that for PO, the within-group ranking of the individuals is unchanged from the fairness-unaware occupation classification algorithm. Since PO techniques do not change the ordering within a group, r_c and ρ(p_C, r_C) remain identical to those of the approach without any fairness intervention (Figure 2). Thus, the PO intervention continues to privilege individuals who align with the occupation's gender norms.

Even when the desired group fairness metric is perfectly met, i.e. Gap^RMS = 0, the SNoB correlations remain untouched. For PO, the group fairness and SNoB metrics are unrelated; the mitigation of one does not affect the presence of the other. This can be explained by the within-group order-preserving nature of PO approaches.

The pre- (PR) and in-processing approaches (DE, RE, CoCL) mitigate this observed association, as indicated by their lower ρ(p_C, r_C) compared to that of PO. Table 1 reveals that unlike for PO, there are more complex relationships between Gap^RMS and ρ(p_C, r_C) for pre- and in-processing approaches; they generally tend to improve both Gap^RMS and ρ(p_C, r_C). However, the improvements are not necessarily monotonic; for BOW, Gap^RMS is smaller in PR while ρ(p_C, r_C) is larger compared to DE in Table 1.

In-processing approaches seem to be most effective at minimizing SNoB, as shown in Table 1 and Figure 3. Since in-processing approaches are typically more expensive to implement than PO and PR, this highlights trade-offs between ease of implementation, classifier accuracy, and association with gender norms.

[Figure 3: "Comparison of Fairness Approaches using WE Representation" — r_c against the percentage of "she" biographies per occupation for WE, PR (ρ = 0.771), WE, PO (ρ = 0.848), WE, DE (ρ = 0.818), and WE, RE (ρ = 0.592).]

Figure 3: Comparing fairness interventions. While SNoB persists across group fairness interventions, it is somewhat mitigated by the in-processing approaches.

6 FURTHER ANALYSIS
6.1 Analysis on Nonbinary Dataset
We aim to consider how algorithmic fairness approaches affect nonbinary individuals, who are overlooked by group fairness approaches [46]. Using the same regular expression as De-Arteaga et al. [25] to identify biography-format strings, we collected a dataset of biographies that use nonbinary pronouns such as "they", "xe", and "hir." Since "they" frequently refers to plural people, we manually inspected a sample of 2000 biographies using "they" to identify those biographies that refer to individuals. professor is the only occupation title with more than 20 such biographies; the other occupations have too few biographies to perform meaningful statistical analysis.

Table 2: Correlation r^nb_professor (first three columns) and r_professor (latter three columns) across different approaches. For DE on the nonbinary dataset, we consider both classifiers, trained on the "she" and "he" data respectively.[4]

                  r^nb_professor                r_professor
    Approach      PR      PO      DE            PR        PO        DE
    BOW           0.36    0.34    0.35, 0.32    0.09**    0.10**    0.11**
    WE            0.29    0.29    0.31, 0.31    -0.01**   -0.02**   -0.01**

[4] The p-value corresponding to r^nb_professor is for the hypothesis test whose null hypothesis is that the rankings from G and Y are uncorrelated. Values marked with a * indicate that the p-value is < 0.05. Values marked with ** denote that the p-value is < 0.01.

We computed r^nb_professor, which is analogous to r_professor, the measure of SNoB for an individual occupation classifier introduced in Section 4.3.1. While r_professor is Spearman's correlation computed across the biographies in S_c, r^nb_professor is the correlation across the nonbinary biographies. The results are reported in Table 2. We find that r^nb_professor is positive across different approaches. However, the associated p-values are quite large (> 0.1), so it is challenging to analyze these associations. This is likely due to the small sample size; while r_professor is computed across the 10,677 professor biographies that use "she" pronouns, r^nb_professor is across only 21 biographies.

6.2 Gendered Words Used in Classifiers
We provide insight into some of the differences across the classifiers that may be driving the SNoB described in preceding sections. We define β_w as the weight of a word w based on the value of the classifiers' coefficients. We focus on the logistic regression classifiers using the BOW and WE representations since the BERT representations are contextualized, so each word does not have a fixed weight to the model that is easily interpretable.

For the BOW representation of a biography x, each feature in the input vector v_x corresponds to a word w in the vocabulary. We define β_w as the value of the corresponding coefficient in the logistic regression classifier. The magnitude of β_w is a measure of the importance of w to the occupation classification, while the sign (positive or negative) of β_w indicates whether w is correlated or anti-correlated with the positive class of the classifier.

For the WE representation, we compute the weight of each word as

    \beta_w = \frac{e_w \cdot W_c}{\|e_w\| \, \|W_c\|},

i.e. the cosine similarity between each word's fastText word embedding e_w and the coefficient weight vector W_c of the WE-representation classifier. As in the BOW representation, the magnitude of β_w quantifies the word's importance, while the sign indicates the direction of the association.

If a word w has positive/negative weight for classifier Y_c, then adding w to a biography x increases/decreases the predicted probability Y_c(x) respectively.

In Figure 4, we plot the weight of each word in the BOW vocabulary in the occupation classifiers and gender classifiers. These weights illuminate some of the mechanisms behind the predictions. Ideally, without SNoB, every point would have small magnitude in either the occupation or gender classifier, i.e. lie on either the x- or y-axis of Figure 4. We observe that in the DE approach, words are closer to the y-axis compared to the PO approach. This corresponds to the smaller value of ρ(p_C, r_C) exhibited by WE, DE compared to WE, PO in Table 1. Note that the PO classifier is trained on all of the biographies, while the DE classifier is trained on only biographies that use the same pronoun.

[Figure 4: two scatter plots of word weights for the surgeon occupation ("SURGEON: Significance of words to WE, PO" and "SURGEON: Significance of words to WE, DE ('she')"); each word is plotted by its weight in the occupation classifier Y (x-axis) and in the gender classifier G (y-axis, most feminine to most masculine).]

Figure 4: Words' weights in the occupation and gender classifiers for different approaches in the surgeon occupation. Each point represents a word; its x-position and y-position represent its weight in Y_c and G respectively. Each point is colored based on its quadrant in the WE, PO approach. Many points are closer to the y-axis in the DE approach. For animations of how the weights change between the approaches, see http://bit.ly/snobgifs.

Let β_w(Y_c) be the weights for approach Y_c. We examine the words whose weights β_w satisfy

    (1) |β_w(Y_c)| > T,
    (2) |β_w(G)| > T,
    (3) |β_w(Y_c)| > T' · |β_w(Y'_c)|,

where T, T' are significance thresholds and Y_c, Y'_c are two different occupation classification approaches. Words that satisfy these conditions are not only associated with either masculinity or femininity but also weighted more highly in approach Y_c compared to Y'_c. Thus, including these gendered words in a biography influences Y_c's classification more strongly than that of Y'_c. This suggests that they may contribute more strongly to ρ(p_C, r_C) in one approach than the other. For example, we examined these words for the occupations of surgeon, software engineer, composer, nurse, dietitian, and yoga teacher, which are the six most gender-imbalanced occupations, with Y_c = {BOW, PO}, Y'_c = {BOW, DE}, T = 0.5 and T' = 0.7. The words that satisfy these conditions are "miss", "mom", "wife", "mother", and "husband." Conversely, with Y_c = {BOW, DE} and Y'_c = {BOW, PO}, the words are "girls", "women", "gender", "loves", "mother", "romance", "daughter", "sister", and "female." These gendered words illustrate the multiplicity of gender present in the biographies beyond categorical labels, which standard group fairness interventions do not consider.

Our analysis is limited by the fact that we only consider the individual influence of each word conditioned on the remaining words, while the joint influence of two or more words may also be of relevance.

6.3 Robustness of Associations: Occupation-Irrelevant Gender Classifiers
For some occupations, it could be argued that certain "gendered words" may be acceptable for the occupation classifier to use and thus should not be considered as evidence of SNoB. For example, the word "computer" is learned to be weakly masculine by the gender classifier ("computer" has weight 0.124 in the WE approach). However, since the word "computer" is related to the software engineering profession, it does not seem discriminatory that the occupation classifier relies on "computer" to determine whether an individual is a software engineer. In other words, one could worry that the SNoB identified using ρ(p_C, r_C) in Section 5.2 could be attributed to the gender classifier using occupation as a proxy. While our approach to quantifying gender norms (Section 4.2) addresses this concern by pre-processing the training data of the gender classifier to balance gender within each occupation, this section presents an additional robustness check.

We train gender classifiers G^{c-irrev} using only the words that are task-irrelevant for occupation c. We define task-relevance based on conditional independence of the word to the occupation, conditioned on gender: a word w is task-relevant to occupation c if there is a significant difference in the frequency with which both "she" and "he" biographies in c use w compared to all "she" and "he" biographies, respectively. That is, a word w is task-relevant for occupation c' if there are significant differences between

    (1) Freq_w(S_{c'}) versus Freq_w({S_c} | c ∈ C),
    (2) Freq_w(H_{c'}) versus Freq_w({H_c} | c ∈ C),

where Freq_w(S_{c'}) denotes the number of times w appears in the "she" biographies of occupation c'. Freq_w(H_{c'}) is defined analogously. We determine conditional independence using the chi-squared test with a 99% significance threshold. Notice that this approach is more robust than the pre-processing approach (the reweighting scheme described in Section 5.1.1) since Freq_w(S_c) may still differ from Freq_w(H_c). If Freq_w(S_c) ≠ Freq_w(H_c), then w may continue to be used by G to infer gender even after the samples are reweighted to balance gender within each occupation; this is precluded in G^{c-irrev}.

[Figure 5: "SURGEON: Significance of task-ir/relevant words" — scatter of word weights in the occupation classifier Y (x-axis) and the gender classifier (y-axis, most feminine to most masculine) for the surgeon occupation, with task-relevant words in uppercase and task-irrelevant words in lowercase.]

Figure 5: Identifying task-irrelevant words. Each point represents a word, plotted based on its weight in the occupation classifier (x-axis) and gender classifier (y-axis). Task-relevant and task-irrelevant words are labeled as uppercase and lowercase respectively. G^{c-irrev} uses only the task-irrelevant words to learn a notion of gender norms. Words highly predictive of the occupation (large x-value) are determined to be task-relevant and thus not used in G^{c-irrev}.

Figure 5 illustrates the weights in the occupation and gender classifiers for task-irrelevant versus task-relevant words. We observe that the words with high weights in the occupation classifier are more likely to be labeled as task-relevant. Furthermore, this method is cautious in determining "task-irrelevant" words: it labels some words as task-relevant even when they do not seem related to the occupation and/or have low weight in the occupation classifier. Since we aim to eliminate the use of task-relevant words in the gender classifier for the purpose of a robustness check, it is acceptable to err on the side of labeling too many words as task-relevant. Following this definition, around 40% of the words are labeled as task-irrelevant for each of the occupations. The resulting gender classifiers, G^{c-irrev}, have average accuracy 0.61 with standard deviation 0.04. To verify that they learn a reasonable notion of social norms, we compared the vocabulary's weights to the human-labeled gender scores as in Section 4.2. Across the classifiers, they have average Spearman's correlation 0.58 with standard deviation 0.04.

[Figure 6: two panels — "SNoB with Occupation-Irrelevant Gender Classifier" (r_c^irrev against the percentage of "she" biographies per occupation for WE, PO; ρ(p_C, r_C^irrev) = 0.595) and "Comparison Using Occupation-Irrelevant Gender Classifiers" (WE, PR 0.318; WE, PO 0.595; WE, DE 0.432; WE, RE 0.056).]

Figure 6: SNoB using G^{c-irrev}. These plots measure the same associations as Figures 2 and 3, using G^{c-irrev} instead of G to obtain gender scores. The trends are similar, which suggests that our method effectively captures masculine and feminine SNoB.

We compute r_c^irrev and ρ(p_C, r_C^irrev), which are analogs of r_c and ρ(p_C, r_C) respectively, using the outputs obtained from G^{c-irrev} in place of G. ρ(p_C, r_C^irrev) is listed in the last column of Table 1. Figure 6 shows the trends of r_c^irrev and ρ(p_C, r_C^irrev) across occupations for different approaches. While the magnitude of ρ(p_C, r_C^irrev) is smaller than ρ(p_C, r_C), we find the same trends across fairness interventions discussed in the previous section (Table 1). As before, r_c for male-dominated occupations is more negative, and vice versa for female-dominated occupations (Figure 6). PO approaches have the highest values of ρ(p_C, r_C^irrev) across the different semantic representations, while in-processing approaches most strongly mitigate this effect. These results suggest that our proposed metrics capture meaningful notions of gender norms beyond occupation-relevant words, thus effectively quantifying SNoB.

7 LIMITATIONS
As Wojcik and Remy [80] note, a gender classifier's notion of gender is based only on the characteristics observable by the system. Thus, our classifiers quantify only the dimensions of gender and social norms that are present in the language of our biographies dataset; there are many other elements that we do not capture.

Our evaluation is based on the cultural gender norms embedded in the dataset, which is limited to English biographies. Our notions of masculinity and femininity are constructed from snapshots of Common Crawl; the broader notions of gender expression and gender identity are ever-changing cultural phenomena [69]. Furthermore, we do not explicitly engage with the ways in which gender's intersection with other attributes, such as race or class, affects individuals and their careers [23, 74].

8 IMPLICATIONS AND FUTURE WORK
Our work introduces a framework to quantify SNoB, which is not considered by group fairness techniques. We measure associations between algorithmic predictions and gender norms in occupation classification, revealing that SNoB is the strongest in post-processing approaches.

Since occupation classification is a subtask of automated recruiting, the associations may have significant, troubling consequences in people's lives. Fairness-aware approaches that are deployed in industry may conceal and thus enable harm to individuals under the guise of group-based "fairness." This work builds upon and raises further concerns about how uninterrogated assumptions underlying "machine learning fairness" efforts may hinder progress toward the goal of minimizing discrimination against and achieving justice for marginalized populations.

The differences found across pre-, post-, and in-processing approaches should be considered when choosing which of these approaches to use in practice. In particular, the results in this work introduce a new dimension that may justify the costs of re-training an algorithm rather than relying on post-processing approaches.

More broadly, we characterize how algorithms may discriminate based on SNoB, a non-categorical aspect of a sensitive attribute. SNoB can be measured in similar ways for other algorithms that make predictions based on text, such as automatic resume or application scanning. Our approach may also be generalized to capture SNoB in other types of data that may reflect social norms, such as images, behavioral data, and other forms of structured data that are influenced by sociocultural patterns. Future work is needed to understand the prevalence of SNoB across tasks and data types, as well as other sensitive attributes beyond gender.

By illuminating a new axis along which discrimination may occur, our work shows a type of harm that may stem from the reliance of "group fairness" approaches on reductive definitions and sets the stage for progress in mitigating these harms. In future work we hope to explore algorithmic approaches to reducing these associations, as well as socio-technical considerations of how the intersectionality [23] between different dimensions of a sensitive attribute affects an algorithm.

REFERENCES
[1] Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207 (2016).
[2] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. 2018. A reductions approach to fair classification. In International Conference on Machine Learning. PMLR, 60–69.
[3] Silvan Agius and Christa Tobler. 2012. Trans and intersex people. Discrimination on the grounds of sex, gender identity and gender expression. Office for Official Publications of the European Union.

[4] Osman Aka, Ken Burke, Alex Bäuerle, Christina Greer, and Margaret Mitchell. 2021. Measuring Model Biases in the Absence of Ground Truth. arXiv preprint arXiv:2103.03417 (2021).
[5] Maria Antoniak and David Mimno. 2021. Bad Seeds: Evaluating Lexical Methods for Bias Measurement. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 1889–1904. https://doi.org/10.18653/v1/2021.acl-long.148
[6] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. 2003. Gender, genre, and writing style in formal written texts. Text - Interdisciplinary Journal for the Study of Discourse 23, 3 (2003). https://doi.org/10.1515/text.2003.014
[7] Marion Bartl, Malvina Nissim, and Albert Gatt. 2020. Unmasking Contextual Stereotypes: Measuring and Mitigating BERT's Gender Bias. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing. 1–16.
[8] Marianne Bertrand and Sendhil Mullainathan. 2004. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review 94, 4 (2004), 991–1013.
[9] Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen Walker. 2020. Fairlearn: A toolkit for assessing and improving fairness in AI. Technical Report MSR-TR-2020-32. Microsoft. https://www.microsoft.com/en-us/research/publication/fairlearn-a-toolkit-for-assessing-and-improving-fairness-in-ai/
[10] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical Survey of "Bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 5454–5476.
[11] Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. 2021. Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 1004–1015. https://doi.org/10.18653/v1/2021.acl-long.81
[12] Miranda Bogen and Aaron Rieke. 2018. HELP WANTED: An Examination of Hiring Algorithms, Equity, and Bias (2018).
[13] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[14] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems 29 (2016), 4349–4357.
[15] Shikha Bordia and Samuel R. Bowman. 2019. Identifying and Reducing Gender Bias in Word-Level Language Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics, Minneapolis, Minnesota, 7–15. https://doi.org/10.18653/v1/N19-3002
[16] Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. PMLR, 77–91.
[17] Judith Butler. 2011. Gender trouble: Feminism and the subversion of identity. Routledge.
[18] Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (2017), 183–186.
[19] Flavio P Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. 2017. Optimized pre-processing for discrimination prevention. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 3995–4004.
[20] Yang Trista Cao and Hal Daumé III. 2019. Toward Gender-Inclusive Coreference Resolution. CoRR abs/1910.13913 (2019). arXiv:1910.13913 http://arxiv.org/abs/1910.13913
[21] Ontario Human Rights Commission. [n.d.]. Gender identity and gender expression. http://www.ohrc.on.ca/en/policy-preventing-discrimination-because-gender-identity-and-gender-expression/3-gender-identity-and-gender-expression
[22] Jarret T. Crawford, P. Andrew Leynes, Christopher B. Mayhorn, and Martin L. Bink. 2004. Champagne, beer, or coffee? A corpus of gender-related and neutral words. Behavior Research Methods, Instruments, & Computers 36, 3 (2004), 444–458. https://doi.org/10.3758/bf03195592
[23] Kimberle Crenshaw. 1990. Mapping the margins: Intersectionality, identity politics, and violence against women of color. Stan. L. Rev. 43 (1990), 1241.
[24] Jenna Cryan, Shiliang Tang, Xinyi Zhang, Miriam Metzger, Haitao Zheng, and Ben Y Zhao. 2020. Detecting Gender Stereotypes: Lexicon vs. Supervised Learning Methods. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–11.
[25] Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 120–128.
[26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[27] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. 214–226.
[28] Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. 2018. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency. PMLR, 119–133.
[29] Nathan Ensmenger. 2015. "Beards, sandals, and other signs of rugged individualism": masculine culture within the computing professions. Osiris 30, 1 (2015), 38–65.
[30] Anjalie Field and Yulia Tsvetkov. 2020. Unsupervised discovery of implicit gender bias. arXiv preprint arXiv:2004.08361 (2020).
[31] Sahin Cem Geyik, Stuart Ambler, and Krishnaram Kenthapadi. 2019. Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2221–2231.
[32] Jennifer L Glick, Katherine Theall, Katherine Andrinopoulos, and Carl Kendall. 2018. For data's sake: dilemmas in the measurement of gender minorities. Culture, Health & Sexuality 20, 12 (2018), 1362–1377.
[33] Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862 (2019).
[34] Alex Hanna, Emily Denton, Andrew Smart, and Jamila Smith-Loud. 2020. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 501–512.
[35] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 3323–3331.
[36] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. 2018. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning. PMLR, 1939–1948.
[37] Madeline E Heilman. 2001. Description and prescription: How gender stereotypes prevent women's ascent up the organizational ladder. Journal of Social Issues 57, 4 (2001), 657–674.
[38] Madeline E Heilman. 2012. Gender stereotypes and workplace bias. Research in Organizational Behavior 32 (2012), 113–135.
[39] Lily Hu and Issa Kohler-Hausmann. 2020. What's sex got to do with machine learning?. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 513–513.
[40] Stefanie K Johnson, David R Hekman, and Elsa T Chan. 2016. If there's only one woman in your candidate pool, there's statistically no chance she'll be hired. Harvard Business Review 26, 04 (2016).
[41] Kenneth Joseph, Wei Wei, and Kathleen M Carley. 2017. Girls rule, boys drool: Extracting semantic and affective stereotypes from Twitter. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. 1362–1374.
[42] John T Jost and Aaron C Kay. 2005. Exposure to benevolent sexism and complementary gender stereotypes: consequences for specific and diffuse forms of system justification. Journal of Personality and Social Psychology 88, 3 (2005), 498.
[43] Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, 1 (2012), 1–33.
[44] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. 2012. Decision theory for discrimination-aware classification. In 2012 IEEE 12th International Conference on Data Mining. IEEE, 924–929.
[45] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2017. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. arXiv preprint arXiv:1711.05144 (2017).
[46] Os Keyes, Chandler May, and Annabelle Carrell. 2021. You Keep Using That Word: Ways of Thinking about Gender in Computing Research. Proceedings of the ACM on Human-Computer Interaction 5, CSCW1 (2021), 1–23.
[47] Vaibhav Kumar, Tenzin Singhay Bhotia, Vaibhav Kumar, and Tanmoy Chakraborty. 2020. Nurse is Closer to Woman than Surgeon? Mitigating Gender-Biased Proximities in Word Embeddings. Transactions of the Association for Computational Linguistics 8 (2020), 486–503. https://doi.org/10.1162/tacl_a_00327
[48] Brian N Larson. 2017. Gender as a variable in natural-language processing: Ethical considerations. (2017).
[49] Jennifer S Light. 1999. When computers were women. Technology and Culture 40, 3 (1999), 455–483.
[50] Pranay K Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha, Kush R Varshney, and Ruchir Puri. 2019. Bias mitigation post-processing for individual and group fairness. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2847–2851.

[51] Juan M Madera, Michelle R Hebl, and Randi C Martin. 2009. Gender and letters of recommendation for academia: Agentic and communal differences. Journal of Applied Psychology 94, 6 (2009), 1591.
[52] MN Mangheni, HA Tufan, L Nkengla, BO Aman, and B Boonabaana. 2019. Gender norms, technology access, and women farmers' vulnerability to climate change in sub-Saharan Africa. In Agriculture and Ecosystem Resilience in Sub Saharan Africa. Springer, 715–728.
[53] Michela Menegatti and Monica Rubini. 2017. Gender bias and sexism in language. In Oxford Research Encyclopedia of Communication.
[54] Tomáš Mikolov, Édouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
[55] Margaret Mitchell, Dylan Baker, Nyalleng Moorosi, Emily Denton, Ben Hutchinson, Alex Hanna, Timnit Gebru, and Jamie Morgenstern. 2020. Diversity and inclusion metrics in subset selection. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 117–123.
[56] Rosamund Moon. 2014. From gorgeous to grumpy: Adjectives, age and gender. Gender & Language 8, 1 (2014).
[57] Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456 (2020).
[58] Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R Bowman. 2020. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133 (2020).
[59] Safiya Umoja Noble. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.
[60] Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing Gender Bias in Abusive Language Detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 2799–2804. https://doi.org/10.18653/v1/D18-1302
[61] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. 2017. On fairness and calibration. arXiv preprint arXiv:1709.02012 (2017).
[62] Abigail Powell, Barbara Bagilhole, and Andrew Dainty. 2009. How women engineers do and undo gender: Consequences for gender equality. Gender, Work & Organization 16, 4 (2009), 411–428.
[63] Manish Raghavan, Solon Barocas, Jon Kleinberg, and Karen Levy. 2020. Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 469–481.
[64] Alexey Romanov, Maria De-Arteaga, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, Anna Rumshisky, and Adam Tauman Kalai. 2019. What's in a Name? Reducing Bias in Bios without Access to Protected Attributes. arXiv preprint arXiv:1904.05233 (2019).
[65] Rachel Rudinger, Chandler May, and Benjamin Van Durme. 2017. Social Bias in Elicited Natural Language Inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. Association for Computational Linguistics, Valencia, Spain, 74–79. https://doi.org/10.18653/v1/W17-1609
[66] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301 (2018).
[67] Brenda Russell. 2012. Perceptions of female offenders: How stereotypes and social norms affect criminal justice responses. Springer Science & Business Media.
[68] Javier Sánchez-Monedero, Lina Dencik, and Lilian Edwards. 2020. What does it mean to 'solve' the problem of discrimination in hiring? Social, technical and legal perspectives from the UK on automated hiring systems. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 458–468.
[69] Morgan Klaus Scheuerman, Jacob M. Paul, and Jed R. Brubaker. 2019. How Computers See Gender. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–33. https://doi.org/10.1145/3359246
[70] Maya Sen and Omar Wasow. 2016. Race as a bundle of sticks: Designs that estimate effects of seemingly immutable characteristics. Annual Review of Political Science 19 (2016), 499–522.
[71] Muzafer Sherif. 1936. The psychology of social norms. (1936).
[72] Stephanie A Shields. 2008. Gender: An intersectionality perspective. Sex Roles 59, 5 (2008), 301–311.
[73] Kieran Snyder. 2015. The resume gap: Are different gender styles contributing to tech's dismal diversity? Fortune Magazine (2015).
[74] Luke Stark, Amanda Stanhaus, and Denise L. Anthony. 2020. "I Don't Want Someone to Watch Me While I'm Working": Gendered Views of Facial Recognition Technology in Workplace Surveillance. Journal of the Association for Information Science and Technology 71, 9 (2020), 1074–1088. https://doi.org/10.1002/asi.24342
[75] Nathaniel Swinger, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and Adam Tauman Kalai. 2019. What are the biases in my word embedding?. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 305–311.
[76] Shiliang Tang, Xinyi Zhang, Jenna Cryan, Miriam J Metzger, Haitao Zheng, and Ben Y Zhao. 2017. Gender bias in the job market: A longitudinal analysis. Proceedings of the ACM on Human-Computer Interaction 1, CSCW (2017), 1–19.
[77] Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. 2020. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17, 3 (2020), 261–272.
[78] Claudia Wagner, David Garcia, Mohsen Jadidi, and Markus Strohmaier. 2015. It's a man's Wikipedia? Assessing gender inequality in an online encyclopedia. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 9.
[79] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. 2019. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5310–5319.
[80] Stefan Wojcik and Emma Remy. 2020. The challenges of using machine learning to identify gender in images. https://www.pewresearch.org/internet/2019/09/05/the-challenges-of-using-machine-learning-to-identify-gender-in-images/
[81] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
[82] Wendy Wood and Alice H Eagly. 2009. Gender identity. Handbook of Individual Differences in Social Behavior (2009), 109–125.
[83] Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. 2021. Challenges in automated debiasing for toxic language detection. arXiv preprint arXiv:2102.00086 (2021).