Arxiv:1909.02224V2 [Cs.CL] 9 Sep 2019 Code Gender Bias in Society (Bolukbasi Et Al., 2016; Other Languages, Such As French (FR)
Total Page:16
File Type:pdf, Size:1020Kb
Examining Gender Bias in Languages with Grammatical Gender Pei Zhou1;2, Weijia Shi1, Jieyu Zhao1, Kuan-Hao Huang1, Muhao Chen1;3, Ryan Cotterell4, Kai-Wei Chang1 1Department of Computer Science, University of California Los Angeles 2Department of Computer Science, University of Southern California 3Department of Computer and Information Science, University of Pennsylvania 4Department of Computer Science, Johns Hopkins University [email protected]; fswj0419,jyzhao, khhuang, [email protected]; [email protected]; [email protected] Abstract in English word embeddings cannot be directly applied to languages with grammatical gender1, Recent studies have shown that word embed- where all nouns are assigned a gender class and dings exhibit gender bias inherited from the the corresponding dependent articles, adjectives, training corpora. However, most studies to and verbs must agree in gender with the noun (e.g. date have focused on quantifying and mitigat- ing such bias only in English. These analy- in Spanish: la buena enfermera the good female ses cannot be directly extended to languages nurse, el buen enfermero the good male nurse) (Cor- that exhibit morphological agreement on gen- bett, 1991, 2006). This is because most existing ap- der, such as Spanish and French. In this paper, proaches define bias in word embeddings based on we propose new metrics for evaluating gender the projection of a word on a gender direction (e.g. bias in word embeddings of these languages “nurse” in English is biased because its projection and further demonstrate evidence of gender on the gender direction inclines towards female). bias in bilingual embeddings which align these When the grammatical gender exists, such bias def- languages with English. Finally, we extend an existing approach to mitigate gender bias in inition is problematic as masculine and feminine word embeddings under both monolingual and words naturally carry gender information. For ex- bilingual settings. Experiments on modified ample, “enfermero” (male nurse) and “enfermera” Word Embedding Association Test, word sim- (female nurse) lean toward male and female, re- ilarity, word translation, and word pair transla- spectively. This is due to morphological agreement tion tasks show that the proposed approaches and should not be considered as a stereotype. effectively reduce the gender bias while pre- serving the utility of the embeddings. However, gender bias in the embeddings of lan- guages with grammatical gender indeed exists. For 1 Introduction example, when we align Spanish (ES) embeddings to English embeddings, the word “abogado” (male Word embeddings are widely used in modern nat- lawyer) is closer to “lawyer” than “abogada” (fe- ural language processing tools. By virtue of their male lawyer). This observation implies a discrep- being trained on large, human-written corpora, re- ancy in semantics between the masculine and femi- cent studies have shown that word embeddings, in nine forms of the same occupation in Spanish word addition to capturing a word’s semantics, also en- embeddings. A similar observation is also found in arXiv:1909.02224v2 [cs.CL] 9 Sep 2019 code gender bias in society (Bolukbasi et al., 2016; other languages, such as French (FR). Caliskan et al., 2017). As a result of this bias, the In this paper, we refer languages that have gen- embeddings may cause undesired consequences der distinctions for all nouns (e.g., Spanish and in the resulting models (Zhao et al., 2018a; Font German) as gendered languages and others that do and Costa-jussa`, 2019). Therefore, extensive ef- not mark grammatical gender as languages with- fort has been put toward analyzing and mitigating out grammatical gender (e.g., English). We use gender bias in word embeddings (Bolukbasi et al., Spanish and French as running examples and pro- 2016; Zhao et al., 2018b; Dev and Phillips, 2019; Ethayarajh et al., 2019). 1Grammatical gender is a complicated linguistic phe- Existing studies on gender bias almost exclu- nomenon; many languages contain more than two gender classes. In this paper, we focus on languages with masculine sively focus on English (EN) word embeddings. and feminine classes. For gender in semantics, we follow the Unfortunately, the techniques used to mitigate bias literature and address only binary gender. pose quantitative methods for evaluating bias in ysis of gender bias assumes that the language (e.g,. word embeddings of gendered languages and bilin- English) only has the former direction. However, gual word embeddings that align a gendered lan- when analyzing languages like Spanish and French, guage with English. We first define gender bias considering the second is necessary. We discuss in the word embeddings of gendered languages by these two directions in detail as follows. constructing two gender directions: the semantic gender direction and the grammatical gender di- Semantic Gender Following Bolukbasi et al. rection. We then analyze gender bias in bilingual (2016), we collect a set of gender-definition pairs embeddings using a similar approach. (e.g., “mujer” (woman) and “hombre” (man) in To mitigate gender bias in these embeddings, we Spanish). Then, the gender direction is de- propose two approaches. One is shifting words rived by conducting principal component analysis along the semantic gender direction with respect to (PCA) (Jolliffe, 2011) over the differences between an anchor point and the other is mitigating English male- and female-definition word vectors. Similar first and then aligning the embedding spaces. Re- to the analysis in English, we observe that there is sults show that a hybrid of these two approaches ef- one major principal component carrying the mean- fectively mitigate bias in Spanish and French word ing of gender in French and Spanish and we define ~ embeddings as well as ES-EN, FR-EN bilingual it as the semantic gender direction dPCA. word embeddings. We also show through word Grammatical Gender The number of grammat- similarity and word translation experiments that ical gender classes ranges from two to several the utility of the original monolingual and bilingual tens (Corbett, 1991). To simplify the discussion, word embeddings is preserved. we focus on noun class systems where two major Our contributions are summarized in the follow- gender classes are feminine and masculine. How- ing. (1) We show that word embeddings of gen- ever, the proposed approach can be generalized to dered languages such as Spanish and French con- languages with multiple gender classes (e.g., Ger- tain gender bias and bilingual word embeddings man). To identify grammatical gender direction, aligning these languages to English also inherit we collect around 3,000 common nouns that are the bias. (2) Based on the observation, we pro- grammatically masculine and 3,000 nouns that are pose new definitions of gender bias by constructing feminine. Since most nouns (e.g., water) do not two gender directions. (3) We propose methods to have a paired word in the other gender class, we do reduce gender bias for both monolingual and bilin- not have pairs of words to represent different gram- gual embeddings and show that they effectively matical genders. Therefore, instead of applying mitigate bias while preserving the original utility PCA, we learn the grammatical gender direction of embeddings. Source code and data are avail- d~g by Linear Discriminant Analysis (LDA) (Fisher, able at https://github.com/shaoxia57/ 1936), which is a standard approach for supervised Bias_in_Gendered_Languages. dimension reduction. The model achieves an aver- age accuracy of 0.92 for predicting the grammatical 2 Gender Bias Analysis gender in Spanish and 0.83 in French with 5-fold This section discusses how to analyze bias in em- cross-validation. ~ ~ beddings of gendered languages and bilingual word Comparing dPCA with dg, the cosine similar- embeddings. ity between them is 0.389, indicating these two directions are overlapped to some extent but not 2.1 Bias in Gendered Languages identical. This is reasonable because words such In gendered languages, all nouns are assigned a as “mujer (woman)”, “doctor (female doctor)” are gender class. However, inanimate objects (e.g., wa- both semantically and grammatically marked as ter and spoon) do not carry the meaning of male feminine. To better distinguish between these two or female. To address this issue, we define two directions, we project out the grammatical gender gender directions: (1) semantic gender, which is component in the computed gender direction to ~ defined by a set of gender definition words (e.g., make the semantic gender direction ds orthogonal man, male, waitress) and (2) grammatical gender, to the grammatical gender direction: which is defined by a set of masculine and femi- ~ ~ D~ ~ E ~ nine nouns (e.g., water, table, woman). Most anal- ds = dPCA − dPCA; dg dg; where h~x;~yi represents the inner product of two 0.2 hombre(man) vectors. abogado(lawyer_m) mapa(map) doctor(doctor_m) Visualizing and Analyzing Bias in Spanish We 0.1 cuchillo(knife) take Spanish as an example and analyze gender bias agua(water) 0.0 in Spanish fastText word embeddings (Bojanowski cuchara(spoon) et al., 2017) pre-trained on Spanish Wikipedia. For abogada(lawyer_f) simplicity, we assume that the embeddings contain 0.1 doctora(doctor_f) mujer(woman) all gender forms of the words. To show bias, we Grammatical Gender Direction randomly select several pairs of gender-definition 0.2 and occupation words as well as inanimate nouns 0.4 0.3 0.2 0.1 0.0 0.1 0.2 0.3 in Spanish and visualize them on the two gender Semantic Gender Direction directions defined above. Figure 1: Projections of selected words in Spanish on Figure1 shows that the inanimate nouns lie near grammatical and semantic directions with masculine the origin point on the semantic gender direction, nouns in blue and feminine nouns in red. We find while the masculine and feminine forms of the oc- that: (1) Most inanimate nouns (e.g., map, spoon; en- cupation words are on the opposite sides for both closed by black dotted lines) lie near the origin point of the semantic gender axis; (2) Masculine definition and directions.