How Does Grammatical Gender Affect Noun Representations in Gender-Marking Languages?
Total Page:16
File Type:pdf, Size:1020Kb
How does Grammatical Gender Affect Noun Representations in Gender-Marking Languages? Hila Gonen1 Yova Kementchedjhieva2 Yoav Goldberg1;3 1Department of Computer Science, Bar-Ilan University 2University of Copenhagen 3Allen Institute for Artificial Intelligence [email protected],[email protected],[email protected] Abstract to inanimate nouns (e.g. dream, book). This gram- matical gender assignment is mostly arbitrary: the Many natural languages assign grammatical same inanimate concept can have different gender gender also to inanimate nouns in the lan- in different languages. For example, a flower is guage. In such languages, words that relate masculine in Italian (fiore) and feminine in Ger- to the gender-marked nouns are inflected to man (Blume). agree with the noun’s gender. We show that this affects the word representations of inani- Languages often maintain an agreement system mate nouns, resulting in nouns with the same in which certain words agree on different morpho- gender being closer to each other than nouns logical features with other words they relate to. with different gender. While “embedding de- For example, English present-tense verbs are in- biasing” methods fail to remove the effect, we flected to agree with their nominal subject on the demonstrate that a careful application of meth- number feature. In other languages the agreement ods that neutralize grammatical gender signals system is more elaborate, and in particular verbs, from the words’ context when training word embeddings is effective in removing it. Fix- adjectives, determiners and other functions agree ing the grammatical gender bias yields a posi- with nouns on many features, including gender tive effect on the quality of the resulting word (Corbett, 2006).1 embeddings, both in monolingual and cross- Such grammatical agreement affects the distri- lingual settings. We note that successfully re- butional environment of nouns, as nouns of differ- moving gender signals, while achievable, is ent gender become surrounded by different word not trivial to do and that a language-specific forms: feminine nouns co-occur more with the morphological analyzer, together with careful usage of it, are essential for achieving good re- feminine forms of words, while masculine nouns sults. with the masculine forms. For example, the Italian word viaggio (“journey”-masc) will co-occur with 1 Introduction durato (“last”-masc) and lungo (“long”-masc), while the word gita (“trip”-fem) will co-occur Work on distributional word embeddings focuses with durata (“last”-fem) and lunga (“long”-fem). almost exclusively on English, or on cross-lingual Such changes in the distributional environment and language-agnostic techniques. However, lan- may bias the learned distributional representations guages are diverse and different languages ex- of inanimate nouns. Indeed, we see that the major- hibit different linguistic phenomena, which may ity of the top-10 nearest neighbors of the word gita interact with the English-centric embedding learn- in Italian (“trip”-fem) are feminine words. Also, ing algorithms. In this work we look into one we notice that the word viaggio (“journey”-masc) such phenomenon—grammatical gender—and ex- is not on the list, while in English, for comparison, amine its effect on the learned representation. we can find journey in the top-10 nearest neigh- Many languages have rich grammatical sys- bors of trip. tems, that often include a complex gender system In this work, we are interested in investigating, as well (Corbett, 1991). Languages with grammat- demonstrating and quantifying this effect beyond ical gender assign and morphologically mark gen- 1As the gender of nouns is fixed, the other elements are in- der not only to animate nouns (which have biologi- flected to accommodate the agreement constraint. The nouns cal sex, e.g. man, woman, mother, father), but also are said to assign gender to the other words. 463 Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 463–471 Hong Kong, China, November 3-4, 2019. c 2019 Association for Computational Linguistics the anecdotal level. We also explore methods for stream tasks. These models are based on the dis- removing such unwanted biases. tributional hypothesis according to which words We demonstrate that both in Italian and in Ger- that occur in the same contexts tend to have sim- man, the grammatical gender affects similarities ilar meanings (Harris, 1954). Indeed, they aim to between word representations (using words from create word representations that are derived from SimLex-999 (Hill et al., 2015; Leviant and Re- their shared contexts, where the context of a word ichart, 2015)): pairs of nouns with similar gender is essentially the words in its proximity (be it ac- are closer to each other while pairs of nouns with cording to linear order in the sentence or according different gender are farther apart. to syntactic relations) (Mikolov et al., 2013; Pen- After quantifying the effect, we explore several nington et al., 2014; Levy and Goldberg, 2014). methods of reducing it. A popular choice would be to simply lemmatize all the words prior to feed- Gender Biases in Word Embeddings Social ing them to the embedding learning algorithm. gender bias was demonstrated to be consistent However, full lemmatization can be destructive, in and pervasive across different word embeddings the sense that it will also remove morphological (Caliskan et al., 2017). Bolukbasi et al. (2016) distinction that we may want to keep. We thus show that using word embeddings for simple seek more surgical approaches. Interestingly, re- analogies surfaces many gender stereotypes. In cent embedding debiasing approaches (Bolukbasi addition, they define the gender bias of a word w by its projection on the “gender direction”: et al., 2016) do not work well. We instead look −! −! −! for methods that attempt to neutralize the gender w · (he − she), assuming all vectors are normal- signals from the training data. We find that such ized. Positive bias stands for male-bias. For ex- methods are effective in reducing the effect, but ample, the bias of manager is 0:06, while the bias 2 are also language specific and tricky to get right: of nurse is −0:10 . we rely on language specific morphological ana- Recently, some work has been done to reduce lyzers while carefully accounting for their pecu- social gender bias in word embeddings, both as liarities and adjusting our use for each language. a post-processing step (Bolukbasi et al., 2016) We take this work as a reminder that (a) linguistic and as part of the training procedure (Zhao et al., resources such as lexicons and morphological an- 2018). Bolukbasi et al. (2016) use a post- alyzers are still relevant and useful (cf. (Zalmout processing debiasing method. Given a word em- and Habash, 2017)); (b) languages are diverse and bedding matrix, they make changes to the word different languages require different treatments; vectors in order to reduce the gender bias for all and (c) small details may matter a lot. In partic- words that are not inherently gendered. They do ular, existing tools and resources, either learned or that by zeroing the gender projection of each word 3 human curated, should not be trusted blindly, but on a predefined gender direction. be carefuly adapted for the problem. In Zmigrod et al. (2019), the authors mitigate Finally, we show that reducing the effect of social gender bias in gender marking languages grammatical agreement also has a positive ef- using counterfactual data augmentation. Gender- fect on the quality of the resulting word repre- marking languages add several interesting dimen- sentations, both in monolingual and cross-lingual sions to the story: words relating to animate con- settings. We conclude that grammatical gen- cepts such as “nurse” or “cat” may have both mas- der indeed has its imprints on the representa- culine and feminine versions; the distributional tions of inanimate nouns, and that this should be environment of a word contains many more ex- taken into account when working with gender- plicit gender cues; and inanimate concepts are marking languages. Our code and debiased em- also assigned gender. All of these factors inter- beddings are available at https://github. act in complicated ways. In this work we focus on com/gonenhila/grammatical_gender. purely grammatical gender—the gender that is as- signed to inanimate nouns—and its effect on their 2 Background and Related Work resulting representations. 2 Word Embeddings Word embeddings have be- in English word2vec embeddings (Mikolov et al., 2013) trained on Wikipedia. come an important component in many NLP mod- 3The gender direction is chosen to be the top principal els and are widely used for a vast range of down- component (PC) of ten gender pair difference vectors. 464 Grammatical Gender Bias in Word Embed- 3.1 Inanimate Noun Pairs from SimLex-999 dings Grammatical gender is manifested in a We take the inanimate noun portion of the similar way to social bias. For example, when −! −! SimLex-999 dataset (Hill et al., 2015), a gold stan- lui − lei projected on the Italian gender direction dard resource for evaluating distributional seman- (Italian equivalents of “he” and “she”), the word tic models. This dataset has an English version, century “secolo” ( , masculine) has positive bias of and also German and Italian versions (Leviant and 0.073, while the word “zuppa” (soup, feminine) 4 Reichart, 2015), and includes both similar and dis- has negative bias of -0.079. similar word pairs, with human-assigned similar- We attribute this behavior to grammatical agree- ity judgments for each pair. This gives us 529 pairs ment. Since the context of different-gender nouns of English words, along with high quality trans- is expected to be very different because of the lations to Italian and German. We manually as- agreement of the surrounding words, and since sociate the Italian and German words with their the resulting representations are based on the con- grammatical gender.