Gender Bias in Natural Language Processing Across Human
Total Page:16
File Type:pdf, Size:1020Kb
Gender Bias in Natural Language Processing Across Human Languages Abigail Matthews Isabella Grasso Mariama Njie University of Christopher Mahoney Iona College Wisconsin-Madison Yan Chen Esma Wali Thomas Middleton Jeanna Matthews Clarkson University [email protected] Abstract relationship of profession words like doctor, nurse, or teacher relative to this gendered vector space. They Natural Language Processing (NLP) systems demonstrated that word embedding software trained on are at the heart of many critical automated a corpus of Google news could associate men with the decision-making systems making crucial rec- profession computer programmer and women with the ommendations about our future world. Gender profession homemaker. Systems based on such mod- bias in NLP has been well studied in English, els, trained even with “representative text” like Google but has been less studied in other languages. news, could lead to biased hiring practices if used to, In this paper, a team including speakers of 9 for example, parse resumes and suggest matches for languages - Chinese, Spanish, English, Ara- a computer programming job. However, as with many bic, German, French, Farsi, Urdu, and Wolof - results in NLP research, this influential result has not reports and analyzes measurements of gender been applied beyond English. bias in the Wikipedia corpora for these 9 lan- guages. We develop extensions to profession- In some earlier work from this team, “Quantifying level and corpus-level gender bias metric cal- Gender Bias in Different Corpora”, we applied Boluk- culations originally designed for English and basi et al.’s methodology to computing and compar- apply them to 8 other languages, including ing corpus-level gender bias metrics across different languages that have grammatically gendered corpora of the English text (Babaeianjelodar 2020). nouns including different feminine, masculine, We measured the gender bias in pre-trained models and neuter profession words. We discuss fu- based on a “representative” Wikipedia and Book Cor- ture work that would benefit immensely from pus in English and compared it to models that had been a computational linguistics perspective. fine-tuned with various smaller corpora including the General Language Understanding Evaluation (GLUE) 1. Introduction benchmarks and two collections of toxic speech, Rt- Gender and IdentityToxic. We found that, as might be Corpora of human language are regularly fed into ma- expected, the RtGender corpora produced the highest chine learning systems as a key way to learn about gender bias score. However, we also found that the hate the world. Natural Language Processing plays a signifi- speech corpus, IdentityToxic, had lower gender bias cant role in many powerful applications such as speech scores than some of more representative corpora found recognition, text translation, and autocomplete and is in the GLUE benchmarks. By examining the contents at the heart of many critical automated decision sys- of the IdentityToxic corpus, we found that most of the tems making crucial recommendations about our fu- text in Identity Toxic reflected bias towards race or sex- ture world (Yordanov 2018)(Banerjee 2020)(Garbade ual orientation, rather than gender. These results con- 2018). Systems are taught to identify spam email, sug- firmed the use of a corpus-level gender bias metric as gest medical articles or diagnoses related to a patient’s a way of measuring gender bias in an unknown corpus symptoms, sort resumes based on relevance for a given and comparing across corpora, but again was only ap- position, and many other tasks that form key compo- plied in English. nents of critical decision making systems in areas such as criminal justice, credit, housing, allocation of public Here we build on the work of Bolukbasi et al. and resources and more. our own earlier work to extend these important tech- In a highly influential paper “Man is to Computer niques in gender bias measurement and analysis be- Programmer as Woman is to Homemaker? Debiasing yond English. This is challenging because unlike En- Word Embeddings”, Bolukbasi et al. (2016) developed glish, many languages like Spanish, Arabic, German, a way to measure gender bias using word embedding French, and Urdu, have grammatically gendered nouns systems like Word2vec. Specifically, they defined a including feminine, masculine and, neuter or neutral set of gendered word pairs such as (“he”,“she”) and profession words. We translate and modify Boluk- used the difference between these word pairs to de- basi et al.’s defining sets and profession sets in En- fine a gendered vector space. They then evaluated the glish for 8 additional languages and develop exten- 45 Proceedings of the First Workshop on Trustworthy Natural Language Processing, pages 45–54 June 10, 2021. ©2021 Association for Computational Linguistics sions to the profession-level and corpus-level gender Bolukbasi et al. in order to extend the methodology be- bias metric calculations for languages with grammati- yond English. Before applying these changes to other cally gendered nouns. We use this methodology to an- languages, we evaluate the impact of the changes on alyze the gender bias in Wikipedia corpora for Chinese calculations in English. In this section, we also describe (Mandarin Chinese), Spanish, English, Arabic, Ger- the Wikipedia corpora we used across 9 languages and man, French, Farsi, Urdu, and Wolof. We demonstrate analyze the occurrences of our defining set and profes- how the modern NLP pipeline not only reflects gender sion set words in these corpora. This work is also de- bias, but also leads to substantially over-representing scribed, but with a different focus in Wali et al. (2020) some (especially English voices recorded in the digital and Chen et al. (2021). text) and under-representing most others (speakers of most of the 7000 human languages and even writers of 2.1. Defining Set classic works that have not been digitized). The defining set is a list of gendered word pairs In Section 2, we describe modifications that we used to define what a gendered relationship looks made to the defining set and profession set proposed like. Bolukbasi et al’s original defining set contained by Bolukbasi et al. in order to extend the method- 10 English word pairs (she-he, daughter-son, her-his, ology beyond English. In Section 3, we discuss the mother-father, woman-man, gal-guy, Mary-John, girl- Wikipedia corpora and the occurrence of words in the boy, herself-himself, and female-male) (Bobluski et al. modified defining and profession sets for 9 languages 2016). We began with this set, but made substantial in Wikipedia. In Section 4, we extend Bolukbasi’s gen- changes in order to compute gender bias effectively der bias calculation to languages, like Spanish, Arabic, across 9 languages. German, French, and Urdu, with grammatically gen- Specifically, we removed 6 of the 10 pairs, added 3 dered nouns. We apply this to calculate and compare new pairs and translated the final set into 8 additional profession-level and corpus-level gender bias metrics languages. For example, we removed the pairs she-he for Wikipedia corpora in the 9 languages. We conclude and herself-himself because they are the same word in and discuss future work in Section 5. Throughout this some languages like Wolof, Farsi, Urdu, and German. paper, we discuss future work that would benefit im- Similarly, we removed the pair her-his because in some mensely from a computational linguistics perspective. languages like French and Spanish, the gender of the object does not depend on the person to which it be- 2. Modifying Defining Sets and Profession longs. Sets We also added 3 new pairs (queen-king, wife- Word embedding is a powerful NLP technique that rep- husband, and madam-sir) for which more consistent resents words in the form of numeric vectors. It is used translations were available across languages. Interest- for semantic parsing, representing the relationship be- ingly, as we will discuss, the pair wife-husband intro- tween words, and capturing the context of a word in a duces surprising results in many languages. Our final document (Karani 2018). For example, Word2vec is a defining set for this study thus contained 7 word pairs system used to efficiently create word embeddings by and Table 1 shows our translations of this final defining using a two-layer neural network that efficiently pro- set across the 9 languages included in our study. cesses huge data sets with billions of words, and with millions of words in the vocabulary (Mikolov 2013). 2.2 Professions Set Bolukbasi et al. developed a method for measur- We began with Bolukbasi et al’s profession word set ing gender bias using word embedding systems like in English, but again made substantial changes in or- Word2vec. Specifically, they defined a set of highly der to compute gender bias effectively across 9 lan- gendered word pairs such as (“he”, “she”) and used the guages. Bolukbasi et al. had an original list of 327 difference between these word pairs to define a gen- profession words (2016), including some words that dered vector space. They then evaluated the relation- would not technically be classified as professions like ship of profession words like doctor, nurse or teacher saint or drug addict. We narrowed this list down to 32 relative to this gendered vector space. Ideally, profes- words including: nurse, teacher, writer, engineer, sci- sion words would not reflect a strong gender bias. How- entist, manager, driver, banker, musician, artist, chef, ever, in practice, they often do. According to such a filmmaker, judge, comedian, inventor, worker, soldier, metric, doctor might be male biased or nurse female bi- journalist, student, athlete, actor, governor, farmer, per- ased based on how these words are used in the corpora son, lawyer, adventurer, aide, ambassador, analyst, as- from which the word embedding model was produced.