Quantifying 60 Years of Gender Bias in Biomedical Research with Word Embeddings

Anthony Rios1, Reenam Joshi2, and Hejin Shin3
1Department of Information Systems and Cyber Security, 2Department of Computer Science, 3Library Systems
University of Texas at San Antonio
{Anthony.Rios, Reenam.Joshi, Hejin.Shin}@utsa.edu

Abstract

Gender bias in biomedical research can have an adverse impact on the health of real people. For example, there is evidence that heart disease-related funded research generally focuses on men. Health disparities can form between men and at-risk groups of women (i.e., elderly and low-income) if there is not an equal number of heart disease-related studies for both genders. In this paper, we study temporal bias in biomedical research articles by measuring gender differences in word embeddings. Specifically, we address multiple questions, including: How has gender bias changed over time in biomedical research, and what health-related concepts are the most biased? Overall, we find that traditional gender stereotypes have reduced over time. However, we also find that the embeddings of many medical conditions are as biased today as they were 60 years ago (e.g., concepts related to drug addiction and body dysmorphia).

1 Introduction

It is important to develop gender-specific best-practice guidelines for biomedical research (Holdcroft, 2007). If research is heavily biased towards one gender, then the biased guidance may contribute to health disparities because the evidence drawn on may be questionable (i.e., not well studied). For example, there is more research funding for the study of heart disease in men (Weisz et al., 2004). Therefore, the at-risk populations of older women in low economic classes are not as well investigated, which opens up the possibility of an increase in the health disparities between genders.

Among informatics researchers, there has been increased interest in understanding, measuring, and overcoming bias associated with machine learning methods. Researchers have studied many application areas to understand the effect of bias. For example, Kay et al. (2015) found that the Google image search application is biased. Specifically, they found an unequal representation of gender stereotypes in image search results for different occupations (e.g., all police images are of men). Likewise, ad-targeting algorithms may include characteristics of sexism and racism (Datta et al., 2015; Sweeney, 2013). Sweeney (2013) found that the names of black men and women are likely to generate ads related to arrest records. In healthcare, much of the prior work has studied the bias in the diagnosis process made by doctors (Young et al., 1996; Hartung and Widiger, 1998). There have also been studies about ethical considerations for the use of machine learning in healthcare (Cohen et al., 2014).

It is possible to analyze and measure the presence of gender bias in text. Garg et al. (2018) analyzed the presence of well-known gender stereotypes over the last 100 years. Hamberg (2008) showed that gender blindness and stereotyped preconceptions are key causes of gender bias in medicine. Heath et al. (2019) studied the gender-based linguistic differences in physician trainee evaluations of medical faculty. Salles et al. (2019) measured the implicit and explicit gender bias among health care professionals and surgeons. Feldman et al. (2019) quantified the exclusion of females in clinical studies at scale with automated data extraction. Recently, researchers have studied methods to quantify gender bias using word embeddings trained on biomedical research articles (Kurita et al., 2019). Kurita et al. (2019) showed that the resulting embeddings capture some well-known gender stereotypes. Moreover, the embeddings exhibit the stereotypes at a lower rate than embeddings trained on other corpora (e.g., Wikipedia). However, to the best of our knowledge, there has not been an automated temporal study of the change in gender bias.

In this paper, we look at the temporal change of gender bias in biomedical research. To study social biases, we make use of word embeddings trained on different decades of biomedical research articles. The two main questions driving this work are: In what ways has bias changed over time, and Are there certain illnesses associated with a specific gender? We leverage three computational techniques to answer these questions: the Word Embedding Association Test (WEAT) (Caliskan et al., 2017), the Embedding Coherence Test (ECT) (Dev and Phillips, 2019), and the Relational Inner Product Association (RIPA) (Ethayarajh et al., 2019). To the best of our knowledge, this is the first temporal analysis of bias in word embeddings trained on biomedical research articles. Moreover, to the best of our knowledge, this is the first analysis that measures the gender bias associated with individual biomedical words.

Our work is most similar to Garg et al. (2018), who study the temporal change of both gender and racial biases using word embeddings. Our work substantially differs in three ways. First, this paper is focused on biomedical literature, not general text corpora. Second, we analyze gender stereotypes using three distinct methods to see if the bias is robust to various measurement techniques. Third, we extend the study beyond gender stereotypes. Specifically, we look at bias in sets of occupation words, as well as bias in mental health-related word sets. Moreover, we quantify the bias of individual occupational and mental health-related words.

In summary, the paper makes the following contributions:

• We answer the question: How has the usage of gender stereotypes changed in the last 60 years of biomedical research? Specifically, we look at the change in well-known gender stereotypes (e.g., Math vs Art, Career vs Family, Intelligence vs Appearance, and occupations) in biomedical literature from 1960 to 2020.

• The second contribution answers the question: What are the most gender-stereotyped words for each decade during the last 60 years, and have they changed over time? This contribution is more focused than simply looking at traditional gender stereotypes. Specifically, we analyze two groups of words: occupations and mental health disorders. For each group, we measure the overall change in bias over time. Moreover, we measure the individual bias associated with each occupation and mental health disorder.

2 Related Work

In this section, we discuss research related to the three major themes of this paper: gender disparities in healthcare, biomedical word embeddings, and bias in natural language processing (NLP).

2.1 Gender Disparities in Healthcare.

There is evidence of gender disparities in the healthcare system, from the diagnosis of mental health disorders to differences in treatment. An important question is: Do similar biases appear in biomedical research? In this work, while we explore traditional gender stereotypes (e.g., Intelligence vs Appearance), we also measure potential bias in the occupations and mental health-related disorders associated with each gender.

With regard to mental health, as an example, affecting more than 17 million adults in the United States (US) alone, major depression is one of the most common mental health illnesses (Pratt and Brody, 2014). Depression can cause people to lose pleasure in daily life, complicate other medical conditions, and possibly lead to suicide (Pratt and Brody, 2014). Moreover, depression can occur in anyone, at any age, and in people of any race or ethnic group. While treatment can help individuals suffering from major depression, or mental illness in general, only about 35% of individuals suffering from severe depression seek treatment from mental health professionals. It is common for people to resist treatment because of the belief that depression is not serious, that they can treat themselves, or that it would be seen as a personal weakness rather than a serious medical illness (Gulliver et al., 2010). Unfortunately, while depression can affect anyone, women are almost twice as likely as men to have had depression (Albert, 2015). Moreover, depression is generally higher among certain demographic groups, including, but not limited to, Hispanic, non-Hispanic black, low-income, and low-education groups (Bailey et al., 2019). The focus of this paper is to understand the impact of these mental health disparities on word embeddings trained on biomedical corpora.

2.2 Biomedical Word Embeddings.

Word embeddings capture the distributional nature of words (i.e., words that appear in similar contexts will have a similar vector encoding). Over the years, there have been multiple methods of producing word embeddings, including, but not limited to, latent semantic analysis (Deerwester et al., 1990), Word2Vec (Mikolov et al., 2013a,b), and GloVe (Pennington et al., 2014). Moreover, pre-trained word embeddings have been shown to be useful for a wide variety of downstream biomedical NLP tasks (Wang et al., 2018), such as text classification (Rios and Kavuluru, 2015), named entity recognition (Habibi et al., 2017), and relation extraction (He et al., 2019). In Chiu et al. (2016), the authors study a standard methodology to train good biomedical word embeddings. Essentially, they study the impact of the various Word2Vec-specific hyperparameters. In this paper, we use the strategies proposed in Chiu et al. (2016) to train optimal decade-specific biomedical word embeddings.

2.3 Bias and Natural Language Processing.

Unfortunately, because word embeddings are learned using naturally occurring data, implicit biases expressed in text will be transferred to the vectors. Bias (and fairness) is an important topic among natural language processing researchers. Bias has been found in word embeddings (Bolukbasi et al., 2016; Zhao et al., 2018, 2019), text classification models (Dixon et al., 2018; Park et al., 2018; Badjatiya et al., 2019; Rios, 2020), and machine translation systems (Font and Costa-jussà, 2019; Escudé Font, 2019). In general, each paper focuses on either testing whether bias exists in various models or removing bias from classification models for specific applications.

Much of the work on measuring (gender) bias using word embeddings neither studies the temporal aspect (i.e., how bias changes over time) nor focuses on biomedical research (Chaloner and Maldonado, 2019). For example, Caliskan et al. (2017) studied the bias in groups of words, focusing on traditional gender stereotypes. Kurita et al. (2019) expanded on Caliskan et al. (2017) to generalize to contextual word embeddings. Garg et al. (2018) developed a technique to study 100 years of gender and racial bias using word embeddings. They evaluated the bias over time using the US Census as a baseline to compare embedding bias to demographic and occupation shifts. There has also been work on measuring bias in sentence embeddings (May et al., 2019). Furthermore, there has been a significant amount of research that explores different ways to measure bias in word embeddings (Caliskan et al., 2017; Dev and Phillips, 2019; Ethayarajh et al., 2019). In this work, we apply many of these bias measurement techniques (Caliskan et al., 2017; Dev and Phillips, 2019; Ethayarajh et al., 2019) to the biomedical domain.

3 Dataset

We analyze PubMed-indexed titles and abstracts published between 1960 and 2020. The total number of articles per decade is shown in Table 1. The text is lower-cased and tokenized using the SimpleTokenizer available in GenSim (Khosrovian et al., 2008). We find that the total number of papers has grown substantially each decade, from 1.4 million indexed articles in the 1960s to 8.6 million in the 2010s. Yet, the rate of growth stayed relatively stable each decade.

Year        # Articles
1960-1969    1,479,370
1970-1979    2,305,257
1980-1989    3,322,556
1990-1999    4,109,739
2000-2010    6,134,431
2010-2020    8,686,620
Total       26,037,973

Table 1: The total number of articles in each decade.
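To make this preprocessing step concrete, the following is a minimal sketch, not the authors' released code. The tab-separated (year, title, abstract) input layout is hypothetical, gensim's simple_preprocess (which also lower-cases) stands in for the SimpleTokenizer mentioned above, and the bucketing shown uses plain calendar decades rather than the paper's exact 2000-2010/2010-2020 split.

```python
# Sketch of the corpus preparation: bucket PubMed titles and abstracts by
# decade and tokenize them for Word2Vec training.
from collections import defaultdict

from gensim.utils import simple_preprocess


def decade_of(year: int) -> str:
    """Map a publication year to a decade label such as '1960-1969'."""
    start = (year // 10) * 10
    return f"{start}-{start + 9}"


def build_decade_corpora(path: str) -> dict:
    """Return {decade: [token_list, ...]} for all articles in `path`."""
    corpora = defaultdict(list)
    with open(path, encoding="utf8") as f:
        for line in f:
            year, title, abstract = line.rstrip("\n").split("\t")
            # simple_preprocess lower-cases and strips punctuation.
            tokens = simple_preprocess(f"{title} {abstract}")
            corpora[decade_of(int(year))].append(tokens)
    return corpora
```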

4 Method

We train the Skip-Gram model on PubMed-indexed titles and abstracts from 1960 to 2020. The hyperparameters of the Skip-Gram model are optimized independently for each decade. Next, given the best set of embeddings for each decade, we explore three different techniques to measure bias: the Word Embedding Association Test (WEAT), the Embedding Coherence Test (ECT), and the Relational Inner Product Association (RIPA). Each method allows us to quantify bias in a different way, such as comparing multiple sets of words (e.g., comparing the bias with respect to Career vs Family), comparing a single set of words (e.g., occupations), and measuring the bias of individual words (e.g., nurse). In this section, we briefly discuss the procedure we used to train the word embeddings, as well as provide descriptions of each of the bias measurement techniques.

Attribute Words
  Male vs Female
    X: male, man, boy, brother, he, him, his, son, father, uncle, grandfather
    Y: female, woman, girl, sister, she, her, hers, daughter, mother, aunt, grandmother

Target Words
  Career vs Family
    A: executive, management, professional, corporation, salary, office, business, career
    B: home, parents, children, family, cousins, marriage, wedding, relatives
  Math vs Art
    A: math, algebra, geometry, calculus, equations, computation, numbers, addition
    B: poetry, art, Shakespeare, dance, literature, novel, symphony, drama
  Science vs Art
    A: science, technology, physics, chemistry, Einstein, NASA, experiment, astronomy
    B: poetry, art, Shakespeare, dance, literature, novel, symphony, drama
  Intelligence vs Appearance
    A: precocious, resourceful, inquisitive, genius, inventive, astute, adaptable, reflective, discerning, intuitive, inquiring, judicious, analytical, apt, venerable, imaginative, shrewd, thoughtful, wise, smart, ingenious, clever, brilliant, logical, intelligent
    B: alluring, voluptuous, blushing, homely, plump, sensual, gorgeous, slim, bald, athletic, fashionable, stout, ugly, muscular, slender, feeble, handsome, healthy, attractive, fat, weak, thin, pretty, beautiful, strong
  Strong vs Weak
    A: power, strong, confident, dominant, potent, command, assert, loud, bold, succeed, triumph, leader, shout, dynamic, winner
    B: weak, surrender, timid, vulnerable, weakness, wispy, withdraw, yield, failure, shy, follow, lose, fragile, afraid, loser

Table 2: Attribute and target words used by WEAT to measure the presence of traditional gender stereotypes in biomedical literature.

4.1 Word2Vec Model Training.

We train a Skip-Gram model using GenSim (Khosrovian et al., 2008). Following Chiu et al. (2016), we search over the following key hyperparameters: negative sample size, sub-sampling, minimum count, learning rate, vector dimension, and context window size. See Chiu et al. (2016, Table 2) for more details.

To find the best model as we search over the various hyperparameters, we make use of the UMLS-Sim dataset (McInnes et al., 2009). UMLS-Sim consists of 566 medical concept pairs for measuring similarity. The degree of association between terms in UMLS-Sim was rated by four medical residents from the University of Minnesota medical school. All these clinical terms correspond to Unified Medical Language System (UMLS) concepts included in the Metathesaurus (Bodenreider, 2004). Evaluation is performed using Spearman's rho rank correlation between a vector of cosine similarities for each of the 566 pairs of words and their respective medical-resident ratings. Intuitively, the ranking of the pairs using cosine similarity, from most similar pairs to least, should be similar to the human (medical expert) annotations.
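The following is a minimal sketch of this model-selection loop, assuming gensim >= 4 and a umls_sim list of (term1, term2, expert_rating) triples; the hyperparameter grid shown is illustrative and much smaller than the full grid of Chiu et al. (2016).

```python
# Per-decade Skip-Gram training with hyperparameter search, scored by
# Spearman's rho against UMLS-Sim expert ratings.
from itertools import product

from gensim.models import Word2Vec
from scipy.stats import spearmanr


def umls_sim_score(model, umls_sim):
    """Spearman's rho between embedding cosines and expert ratings,
    restricted to word pairs present in the decade's vocabulary."""
    cosines, ratings = [], []
    for w1, w2, rating in umls_sim:
        if w1 in model.wv and w2 in model.wv:
            cosines.append(model.wv.similarity(w1, w2))
            ratings.append(rating)
    return spearmanr(cosines, ratings).correlation


def best_model_for_decade(sentences, umls_sim):
    """Train one Skip-Gram model per hyperparameter setting and keep the
    model with the highest UMLS-Sim correlation."""
    best, best_rho = None, float("-inf")
    for dim, window, negative in product([100, 200], [2, 5, 10], [5, 10]):
        model = Word2Vec(sentences, sg=1, vector_size=dim, window=window,
                         negative=negative, min_count=5, workers=4)
        rho = umls_sim_score(model, umls_sim)
        if rho > best_rho:
            best, best_rho = model, rho
    return best
```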

4.2 Word Embedding Association Test

The implicit association test measures unconscious prejudice (Greenwald et al., 1998). WEAT is a generalization of this test to word embeddings, measuring the association between two sets of attributes and two sets of target concepts. We use the same target and attribute sets as Kurita et al. (2019), listed in Table 2. The attribute sets contain the words related to the groups that the embeddings may be biased towards or against, e.g., Male vs Female. The words in the target categories (Career vs Family, Math vs Art, Science vs Art, Intelligence vs Appearance, and Strong vs Weak) represent the specific types of biases. For example, using the attributes and targets, we want to know whether the learned embeddings that represent men are more related to career than the female-related words are (i.e., we test whether female words are more related to family than male words).

Formally, let X and Y be equal-sized sets of attribute (gender) word embeddings and let A and B be sets of target concept embeddings. To measure the bias, we follow Caliskan et al. (2017), who define the following test statistic as the difference between the sums over the respective attribute sets:

s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B),

where s(w, A, B) measures the association between a single attribute word w (e.g., man) and each of the target words:

s(w, A, B) = \sum_{a \in A} \cos(\vec{w}, \vec{a}) - \sum_{b \in B} \cos(\vec{w}, \vec{b}),

such that \cos(\cdot, \cdot) is the cosine similarity between two vectors, \vec{w}, \vec{a}, \vec{b} \in \mathbb{R}^d are the word embeddings of w, a, and b, respectively, and d is the dimension of each word embedding.

Instead of using the test statistic directly, we measure bias with the effect size. The effect size is a normalized measure of the separation of the two distributions, defined as

\frac{\mu_{x \in X}[s(x, A, B)] - \mu_{y \in Y}[s(y, A, B)]}{\sigma_{w \in X \cup Y}[s(w, A, B)]},

where \mu_{x \in X} and \mu_{y \in Y} denote the mean association score over the words in X and Y, respectively, and \sigma_{w \in X \cup Y} is the standard deviation of the scores over all words in the union of X and Y. Intuitively, a positive score means that the attribute words in X (e.g., male, man, boy) are more similar to the target words in A (e.g., strong, power, dominant) than the attribute words in Y (e.g., female, woman, girl) are. Moreover, larger effect sizes represent more biased embeddings.

As previously stated, the attribute and target words are from Kurita et al. (2019). It is important to note that the list is manually curated. Moreover, the bias measurement can change depending on the exact list of words. RIPA is more robust to slight changes in the attribute words than WEAT (Ethayarajh et al., 2019).
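To make the definitions above concrete, here is a minimal sketch of the WEAT statistic and effect size, assuming embeddings stored in a plain dict mapping words to numpy arrays, with word lists filtered to the model's vocabulary beforehand; X and Y would be the attribute sets and A and B the target sets from Table 2.

```python
# WEAT association scores and effect size over dict[str, np.ndarray]
# embeddings.
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def s_word(w, A, B, emb):
    """s(w, A, B): summed cosine association of one attribute word w."""
    return (sum(cosine(emb[w], emb[a]) for a in A)
            - sum(cosine(emb[w], emb[b]) for b in B))


def weat_effect_size(X, Y, A, B, emb):
    """Normalized separation of the two attribute sets' scores; positive
    values mean X (e.g., male words) is closer to A than Y is."""
    x_scores = [s_word(x, A, B, emb) for x in X]
    y_scores = [s_word(y, A, B, emb) for y in Y]
    return ((np.mean(x_scores) - np.mean(y_scores))
            / np.std(x_scores + y_scores))
```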
4.3 Embedding Coherence Test.

We also explore a second method of measuring bias, the Embedding Coherence Test (ECT) (Dev and Phillips, 2019). Unlike WEAT, it compares the attribute words (e.g., Male vs Female) with a single target set (e.g., Career). Thus, we do not need two contrasting target sets (e.g., Career vs Family) to measure bias. We take advantage of this to measure the bias associated with occupations and mental health-related disorders. Specifically, we use a total of 290 occupation words and 222 mental health-related words. The occupation words come from prior work measuring per-word bias (Dev and Phillips, 2019). To form a list of mental health words, we use the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), a taxonomic and diagnostic tool published by the American Psychiatric Association (Association et al., 2013). Each mental health disorder in DSM-5 is generally a multi-word expression, so we split it into individual words. Next, we manually remove uninformative adjective and function words. For example, the disorder "Specific learning disorder, with impairment in mathematics" is tokenized into the following words: "learning", "disorder", "impairment", and "mathematics". A complete listing of the occupational and mental health words can be found in the appendix.

Formally, ECT first computes the mean vectors for the attribute word sets X and Y, defined as

\vec{v}_X = \frac{1}{|X|} \sum_{x \in X} \vec{x},

where \vec{v}_X \in \mathbb{R}^d and |X| represents the number of words in category X. \vec{v}_Y is calculated similarly. For both \vec{v}_X and \vec{v}_Y, ECT computes the cosine similarities with all vectors a \in A, i.e., the cosine similarity is calculated between each target word a and \vec{v}_X and stored in s_X \in \mathbb{R}^{|A|}. The two resultant vectors of similarity scores, s_X (for X) and s_Y (for Y), are used to obtain the final ECT score: Spearman's rank correlation between the rank orders of s_X and s_Y. The higher the correlation, the lower the bias. Intuitively, if the correlation is high, then the ranking of target words by similarity is consistent whether it is calculated for X or for Y (i.e., male and female).
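A minimal sketch of the ECT computation defined above, under the same dict-of-numpy-arrays embedding assumption as the WEAT sketch; targets would be a single word set, such as the occupation list.

```python
# ECT: Spearman correlation between the target-similarity profiles of the
# two attribute mean vectors; higher correlation means lower bias.
import numpy as np
from scipy.stats import spearmanr


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def ect_score(X, Y, targets, emb):
    """A score of 1 means the targets are ranked identically for both
    attribute (gender) sets, i.e., no measurable bias."""
    v_x = np.mean([emb[x] for x in X], axis=0)
    v_y = np.mean([emb[y] for y in Y], axis=0)
    s_x = [cosine(emb[a], v_x) for a in targets]
    s_y = [cosine(emb[a], v_y) for a in targets]
    return spearmanr(s_x, s_y).correlation
```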

[Figure 1: five subplots of WEAT bias scores (y-axis: Bias) by decade (x-axis: 1960-1969 through 2010-2020), one per gender stereotype; see the caption below.]
Figure 1: Each subfigure plots the bias measured using WEAT for one of five gender stereotypes: (a) Career vs Family, (b) Math vs Art, (c) Science vs Art, (d) Intelligence vs Appearance, and (e) Strong vs Weak. A bias score of zero represents no bias, i.e., no measurable difference between the two target categories for each gender. The shaded area of each subplot represents the bootstrap estimated 95% confidence interval.

4.4 Relational Inner Product Association.

While ECT only requires a single target set, both WEAT and ECT1 calculate the bias between sets of words. However, neither approach calculates a robust bias score for individual words. To study the most gender-biased words over time, we make use of RIPA (Ethayarajh et al., 2019). Intuitively, RIPA uses a single vector to represent gender; each word is then scored by taking the dot product between the gender embedding and the word's embedding. The sign of the score determines whether the embedding is more male- or female-related.

The major aspect of RIPA is creating the gender embedding. Formally, given S, a non-empty set of ordered word pairs (x, y) (e.g., ('man', 'woman'), ('male', 'female')) that defines the gender association, we take the first principal component of all the difference vectors \{\vec{x} - \vec{y} \mid (x, y) \in S\}, which we call the relation vector \vec{g} \in \mathbb{R}^d; this forms a one-dimensional bias subspace. Then, for some word vector \vec{w} \in \mathbb{R}^d, the dot product with \vec{g} is taken to measure bias.

1 The cosine similarities from ECT can be used to measure scores for individual words, but they are not as robust as RIPA (Ethayarajh et al., 2019).
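Below is a minimal sketch of RIPA as described above, again assuming dict-of-numpy-arrays embeddings; whether the difference vectors are mean-centered before extracting the principal component, and the sign convention of the resulting direction, are implementation choices not pinned down by the text.

```python
# RIPA: score each word by its dot product with a one-dimensional gender
# direction derived from gendered word pairs.
import numpy as np


def relation_vector(pairs, emb):
    """First principal component of {x - y : (x, y) in S}."""
    diffs = np.stack([emb[x] - emb[y] for x, y in pairs])
    # Top right-singular vector of the centered difference matrix spans the
    # one-dimensional bias subspace.
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    return vt[0]


def ripa_scores(words, pairs, emb):
    """The sign of each score marks the male- vs. female-associated side
    (up to the arbitrary sign of the relation vector)."""
    g = relation_vector(pairs, emb)
    return {w: float(np.dot(emb[w], g)) for w in words if w in emb}


# Example gender-defining pairs (hypothetical subset):
# pairs = [("man", "woman"), ("male", "female"), ("he", "she")]
```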

5 Results

In this section, we present the results of our study in four parts. First, we report the embedding quality using UMLS-Sim. Second, we study the temporal bias of traditional gender stereotypes, such as Career vs Family and Strong vs Weak. Ideally, we want to understand how, and which, stereotypes have changed over time. To understand the biased stereotypes, we make use of the WEAT method. Third, we look at whether occupational and mental health-related words are biased, and how the bias has changed over time. For this result, we only use a single set of target words; thus, we make use of ECT. Fourth, we use RIPA to find the most biased words for each gender in each decade.

5.1 Embedding Quality.

Year       Sim    Pair Cnt
1960-1969  .6586  101
1970-1979  .6715  207
1980-1989  .7033  277
1990-1999  .7282  265
2000-2010  .7078  272
2010-2020  .6867  306

Table 3: Quality of the embeddings trained for each decade, measured using the UMLS-Sim dataset. Sim represents Spearman's rho rank correlation. Pair Cnt is the number of UMLS-Sim word pairs present in that decade's embeddings.

In Table 3, we report the quality of each decade's embeddings based on the UMLS-Sim dataset. Overall, we find that the quality consistently improves until the 1990s; however, we see drops in the 2000s and 2010s. We hypothesize that the reason for the decrease in embedding quality is the growth of research articles indexed on PubMed. Intuitively, word embeddings are only able to capture a single sense of a word. However, given the breadth of articles PubMed indexes, from machine learning (e.g., BioNLP) to biomaterials, multiple word meanings are being stored in a single vector. Thus, the overall quality begins to drop.

5.2 Traditional Gender Stereotypes.

In Figure 1, we plot the bias scores measured using WEAT. Remember, a large positive score means that the male words are more similar to the targets A (e.g., career) than the female words are.

[Figure 2: two subplots of ECT bias scores (y-axis: Bias) by decade (x-axis: 1960-1969 through 2010-2020) for (a) Occupations and (b) Mental Disorders; see the caption below.]
Figure 2: ECT bias estimates for both the occupation and mental disorder word sets. The shaded area of each subplot represents the bootstrap estimated 95% confidence interval.

There is no measurable bias at a value of zero. Overall, we find that the results from the WEAT test vary depending on the stereotype. For Career vs Family, in Figure 1a, we find a steady linear decrease in bias each decade, with the exception of the 1990s. We also find similar linear decreases in bias for both Science vs Art and Strong vs Weak (Figures 1c and 1e). In Figure 1b, for Math vs Art, however, the bias stays relatively static, i.e., it does not dramatically change over time. Moreover, the WEAT score for Math vs Art is negative, meaning that the female words are more similar to math than the male words. Likewise, for Intelligence vs Appearance (Figure 1d), we see relatively little bias from 1960 to 1989; however, in the 1990s and 2000s, there is a substantial jump in the bias score.

Our evaluation supports prior work evaluating bias in biomedical word embeddings (e.g., Strong vs Weak is the most biased stereotype in biomedical literature) (Chaloner and Maldonado, 2019, Table 2). However, we also find differences when measuring bias over time. For example, we find that from 2010 to 2019 there is not a lot of evidence for the Career vs Family stereotype in biomedical corpora, matching the results from Chaloner and Maldonado (2019, Table 2). Yet, this is only a recent phenomenon. The embeddings trained on articles published from 1990 to 1999 exhibit a Career vs Family bias score greater than 1.5. Overall, compared to Chaloner and Maldonado (2019, Table 2), this means that the bias in recently published biomedical literature may not be as strong as what is found in general text corpora. But, if we exclude the most recent decade's embeddings, the bias in biomedical literature becomes much stronger. Future work should explore comparing the temporal bias in general text corpora to what is found in biomedical literature.

5.3 Occupational and Mental Health Bias.

In Figure 2, we report the gender bias results from ECT on two categories: occupations (e.g., doctor, nurse, teacher) and mental health disorders (e.g., depression, alcoholism, PTSD). Again, unlike WEAT, ECT calculates bias scores on a single target set of words. Therefore, we do not need two contrasting target word sets (e.g., Math vs Art); instead, we can focus on the bias for a single set (e.g., Math). Also, the larger the score, the lower the bias; a score of one would represent no difference between male and female words for that specific target set. Interestingly, we find that the ECT scores follow a similar pattern to that found in Table 3: the better the embedding quality, the lower the bias.

Comparing Figures 2a and 2b, we find that the word embeddings for both occupations and mental disorders have relatively little bias in the 1990s. Furthermore, while there was small variation, mental disorders experienced little change in bias decade-by-decade. Yet, occupation-related words had a substantial amount of bias in the 1960s and 1970s. Moreover, we find that the bias related to occupations experienced more change than that of mental disorders, starting at 0.83 in the 1960s and increasing by more than ten points to 0.94 in the 1990s, whereas mental disorder-related bias scores only ranged from 0.90 to 0.94.

5.4 Biased Words.

In Table 4, we analyze the bias of individual occupational and mental health-related words. We found a substantial change in the bias of occupation-related words, but little change in the bias of mental health-related words since the 1960s.

Occupations, Male
Rank  1970-1979       1980-1989     1990-1999      2000-2010      2010-2020
1     promoter        conductor     chef           dentist        mediator
2     collector       chef          baker          counselor      promoter
3     investigator    biologist     astronaut      librarian      dentist
4     principal       collector     swimmer        pharmacist     principal
5     baker           dad           prisoner       teenager       collector
6     researcher      singer        mechanic       bishop         cop
7     character       chemist       character      acquaintance   conductor
8     mechanic        butler        worker         cardiologist   substitute
9     analyst         mechanic      soldier        promoter       coach
10    conductor       promoter      analyst        attorney       employee

Occupations, Female
Rank  1970-1979       1980-1989     1990-1999      2000-2010      2010-2020
1     teacher         housewife     neurosurgeon   swimmer        priest
2     professor       teenager      pediatrician   baker          fisherman
3     counselor       bishop        educator       butcher        teenager
4     physician       lawyer        teenager       medic          chef
5     pediatrician    pediatrician  counselor      barber         writer
6     consultant      athlete       neurologist    physicist      nanny
7     doctor          physician     consultant     soldier        historian
8     student         pathologist   dentist        baron          president
9     lawyer          educator      athlete        director       inventor
10    pathologist     carpenter     doctor         singer         housewife

Mental Disorders, Male
Rank  1970-1979       1980-1989     1990-1999      2000-2010      2010-2020
1     caffeine        cannabis      separation     lacunar        lacunar
2     restrictive     hypnotic      restrictive    bulimia        circadian
3     attachment      caffeine      coordination   erectile       nicotine
4     separation      coordination  dyskinesia     gambling       gambling
5     circadian       hallucinogen  conversion     bereavement    phencyclidine
6     coordination    dependence    mathematics    binge          ocpd
7     benzodiazepine  attachment    attachment     nervosa        cocaine
8     dependence      mathematics   residual       mood           …
9     selective       restrictive   parasitosis    depressive     sleep
10    conversion      pdd           developmental  polysubstance  caffeine

Mental Disorders, Female
Rank  1970-1979       1980-1989     1990-1999      2000-2010      2010-2020
1     dysmorphic      factitious    binge          dissociative   munchausen
2     psychogenic     dysmorphic    nervosa        coordination   mutism
3     …               nervosa       bulimia        separation     factitious
4     adolescent      mutism        opioid         parasitosis    dysmorphic
5     nervosa         bulimia       …              hypersomnia    terror
6     mutism          tourette      narcolepsy     hysteria       cotard
7     infancy         infancy       anorexia       conversion     claustrophobia
8     munchausen      episode       panic          malingering    ekbom
9     factitious      anorexia      korsakoff      tic            diogenes
10    disorder        munchausen    factitious     munchausen     encopresis

Table 4: The top ten words with the largest RIPA scores (i.e., the most biased) for each decade, reported for both occupations and mental health disorders. While all the listed words are biased, they are ranked from the most biased word to the least.

Yet, while we found little change in mental health bias overall, are there at least a few disorders that changed over time? Moreover, we found a slight bias in mental health terms; therefore, what are the biased terms in each group? We look at the most gender-biased occupational and mental health-related terms for each decade in Table 4. Because of space limitations, we only display the gendered words from the 1970s to the 2010s. The words from the 1960s can be found in the appendix. The word-level scores were generated using RIPA.

First, for occupations, the words vary between male and female. For example, in the 1970s, male-related words include "mechanic", "principal", and "investigator", while the female-related words include "teacher", "counselor", and "pediatrician". Interestingly, jobs associated with men, such as "principal" and "researcher", are positions with power over the jobs associated with women. For example, "principals" (male) have power over "teachers" (female), and "researchers" (male) have power over "students" (female). We also find other well-known occupations that appear to be gender-related. For instance, "butler" in the 1980s is associated with male, while "nanny" is related to female in the 2010s.

With regard to mental health, we find that disorders associated with well-known gender disparities appear to be biased using RIPA (Organization, 2013). For example, through the last 60 years, words associated with addictions are male-related, e.g., "caffeine", "cannabis", "nicotine", and "gambling". Similarly, disorders related to appearance are more female-related, e.g., "dysmorphic"2 and "anorexia". We also find that disorders related to emotions are more female-related, such as "munchausen"3, "hysteria"4, and "terror". Interestingly, we find that the word "hysteria" is heavily biased in the 2010s. Even though the diagnosis of female hysteria substantially fell in the 1900s (Micale, 1993), it still seems to be a biased term. We want to note that this could simply be caused by research studying mental health diagnosis bias in women; however, the underlying cause of why the term is biased in the 2010s is left for future work.

6 Discussion

In this section, we discuss the impact of the results on two stakeholders of this research: BioNLP researchers and general biomedical researchers. Furthermore, we discuss the limitations of focusing on binary gender (Male vs Female).

2 Dysmorphia is a mental health disorder in which a person can't stop thinking about one or more perceived defects or flaws.
3 Munchausen is a mental disorder in which a person repeatedly and deliberately acts as if he or she has a physical or mental illness.
4 Hysteria is a (biased) catch-all for symptoms including, but not limited to, nervousness and emotional outbursts.

6.1 Impact on BioNLP Researchers.

The results in this paper are important for BioNLP research in two ways. First, we have produced decade-specific word embeddings.5 Therefore, BioNLP researchers can use the embeddings to study other historical phenomena in biomedical research articles. Second, the analysis of historical bias in biomedical research in this paper can be applied to other domains, beyond occupations and mental disorders.

5 https://github.com/AnthonyMRios/Gender-Bias-PubMed

6.2 Impact on Biomedical Researchers.

With regard to general biomedical researchers (e.g., medical researchers and biologists), this work can provide a way to measure, in an automated fashion, which demographics current research is leaning towards. As discussed in Holdcroft (2007), if research is heavily focused on a single gender, then health disparities can increase. Treatments should be explored equally for all at-risk patients. Furthermore, with the use of contextual word embeddings (Scheuerman et al., 2019), implicit bias measurement techniques can be used as part of the writing process to avoid gendered language when it is not necessary (e.g., using singular they vs he/she).

6.3 A Note About Gender.

Similar to prior work measuring gender bias (Chaloner and Maldonado, 2019), we focus on binary gender. However, it is important to note that the results for binary gender do not necessarily generalize to other genders, including, but not limited to, binary trans people, non-binary people, and gender non-conforming people (Scheuerman et al., 2019). Therefore, we want to explicitly note that our research does not necessarily generalize beyond binary gender. In future work, we recommend that studies be performed for other genders, beyond simply studying Male vs Female.

How can this study be expanded beyond binary gender? The three bias measurement techniques studied in this paper (i.e., WEAT, ECT, and RIPA) require sets of words representing a single gender (e.g., boy, men, male). Unfortunately, there is not a large number of words to represent every gender of interest. A promising area of research is to explore bias in contextual word embeddings. With the use of contextual word embeddings (Kurita et al., 2019), we can measure the bias of individual words across many contexts. Thus, we can possibly overcome the problem of a limited number of words per gender.

7 Conclusion

In this paper, we studied the historical bias present in word embeddings trained on biomedical text from 1960 to 2020. In summary, we found that while some biases have shown a consistent decrease over time (e.g., Strong vs Weak), others have stayed relatively static or, worse, increased (e.g., Intelligence vs Appearance). Moreover, we found that the gender bias towards occupations has substantially changed over time, showing that in the past there was more gender bias associated with certain jobs.

There are two major avenues for future work. First, this work quantified various aspects of gender bias over time. However, we do not know why the bias is present in the word embeddings. For example, is the word "hysteria" biased in 2010 because researchers are associating it with women implicitly, or is it that researchers are studying the historical usage of the diagnosis to ensure the diagnosis is not made because of implicit bias in the future? Thus, our future work will focus on causal studies of bias in biomedical literature. Second, we simply trained Skip-Gram word embeddings independently for each decade. However, recent work has shown that dynamic embeddings, rather than static (decade-specific) ones, perform better with regard to analyzing public perception over time (Gillani and Levy, 2019). Future work will focus on developing new techniques to study bias temporally. Moreover, many techniques may depend on the magnitude of the bias; therefore, we plan to analyze the circumstances in which one embedding approach (e.g., Skip-Gram) may measure bias better than another (e.g., dynamic embeddings).

Acknowledgements

We would like to thank the anonymous reviewers for their invaluable help improving this manuscript. This material is based upon work supported by the National Science Foundation under Grant No. 1947697.

References

Paul R Albert. 2015. Why is depression more prevalent in women? Journal of Psychiatry & Neuroscience: JPN, 40(4):219.

American Psychiatric Association et al. 2013. Diagnostic and statistical manual of mental disorders (DSM-5). American Psychiatric Pub.

Pinkesh Badjatiya, Manish Gupta, and Vasudeva Varma. 2019. Stereotypical bias removal for hate speech detection task using knowledge-based generalizations. In The World Wide Web Conference, pages 49–59.

Rahn Kennedy Bailey, Josephine Mokonogho, and Alok Kumar. 2019. Racial and ethnic differences in depression: current perspectives. Neuropsychiatric Disease and Treatment, 15:603.

Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.

Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Kaytlin Chaloner and Alfredo Maldonado. 2019. Measuring gender bias in word embeddings across domains and discovering new gender bias word categories. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 25–32.

Billy Chiu, Gamal Crichton, Anna Korhonen, and Sampo Pyysalo. 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 166–174.

I Glenn Cohen, Ruben Amarasingham, Anand Shah, Bin Xie, and Bernard Lo. 2014. The legal and ethical concerns that arise from using complex predictive analytics in health care. Health Affairs, 33(7):1139–1147.

Amit Datta, Michael Carl Tschantz, and Anupam Datta. 2015. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies, 2015(1):92–112.

Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Sunipa Dev and Jeff Phillips. 2019. Attenuating bias in word vectors. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 879–887.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Conference on AI, Ethics, and Society.

Joel Escudé Font. 2019. Determining bias in machine translation with deep learning techniques. Master's thesis, Universitat Politècnica de Catalunya.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Understanding undesirable word embedding associations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1696–1705.

Sergey Feldman, Waleed Ammar, Kyle Lo, Elly Trepman, Madeleine van Zuylen, and Oren Etzioni. 2019. Quantifying sex bias in clinical studies at scale with automated data extraction. JAMA Network Open, 2(7):e196700–e196700.

Joel Escudé Font and Marta R Costa-jussà. 2019. Equalizing gender biases in neural machine translation with word embeddings techniques. arXiv preprint arXiv:1901.03116.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Nabeel Gillani and Roger Levy. 2019. Simple dynamic word embeddings for mapping perceptions in the public sphere. In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, pages 94–99.

Anthony G Greenwald, Debbie E McGhee, and Jordan LK Schwartz. 1998. Measuring individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology, 74(6):1464.

Amelia Gulliver, Kathleen M Griffiths, and Helen Christensen. 2010. Perceived barriers and facilitators to mental health help-seeking in young people: a systematic review. BMC Psychiatry, 10(1):113.

Maryam Habibi, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33(14):i37–i48.

Katarina Hamberg. 2008. Gender bias in medicine. Women's Health, 4(3):237–243.

Cynthia M Hartung and Thomas A Widiger. 1998. Gender differences in the diagnosis of mental disorders: Conclusions and controversies of the DSM-IV. Psychological Bulletin, 123(3):260.

Bin He, Yi Guan, and Rui Dai. 2019. Classifying medical relations in clinical text via convolutional neural networks. Artificial Intelligence in Medicine, 93:43–49.

Janae K Heath, Gary E Weissman, Caitlin B Clancy, Haochang Shou, John T Farrar, and C Jessica Dine. 2019. Assessment of gender-based linguistic differences in physician trainee evaluations of medical faculty using automated text mining. JAMA Network Open, 2(5):e193520–e193520.

Anita Holdcroft. 2007. Gender bias in research: how does it affect evidence based medicine? Journal of the Royal Society of Medicine, 100(1):2.

Matthew Kay, Cynthia Matuszek, and Sean A Munson. 2015. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3819–3828. ACM.

Keyvan Khosrovian, Dietmar Pfahl, and Vahid Garousi. 2008. Gensim 2.0: a customizable process simulation model for software process evaluation. In International Conference on Software Process, pages 294–306. Springer.

Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019. Quantifying social biases in contextual word representations. In 1st ACL Workshop on Gender Bias for Natural Language Processing.

Chandler May, Alex Wang, Shikha Bordia, Samuel Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628.

Bridget T McInnes, Ted Pedersen, and Serguei VS Pakhomov. 2009. UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity. In AMIA Annual Symposium Proceedings, volume 2009, page 431. American Medical Informatics Association.

Mark S Micale. 1993. On the "disappearance" of hysteria: A study in the clinical deconstruction of a diagnosis. Isis, 84(3):496–526.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. ICLR Workshop.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

World Health Organization. 2013. Gender Disparities in Mental Health. World Health Organization.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

LA Pratt and DJ Brody. 2014. Depression and obesity in the US adult household population, 2005-2010. NCHS Data Brief, (167):1–8.

Anthony Rios. 2020. FuzzE: Fuzzy fairness evaluation of offensive language classifiers on African-American English. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34.

Anthony Rios and Ramakanth Kavuluru. 2015. Convolutional neural networks for biomedical text classification: application in indexing biomedical articles. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, pages 258–267.

Arghavan Salles, Michael Awad, Laurel Goldin, Kelsey Krus, Jin Vivian Lee, Maria T Schwabe, and Calvin K Lai. 2019. Estimating implicit and explicit gender bias among health care professionals and surgeons. JAMA Network Open, 2(7):e196545–e196545.

Morgan Klaus Scheuerman, Jacob M. Paul, and Jed R. Brubaker. 2019. How computers see gender: An evaluation of gender classification in commercial facial analysis services. Proc. ACM Hum.-Comput. Interact., 3(CSCW).

Latanya Sweeney. 2013. Discrimination in online ad delivery. Queue, 11(3):10.

Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. 2018. A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics, 87:12–20.

Daniel Weisz, Michael K Gusmano, and Victor G Rodwin. 2004. Gender and the treatment of heart disease in older persons in the United States, France, and England: a comparative, population-based view of a clinical phenomenon. Gender Medicine, 1(1):29–40.

Terry Young, Rebecca Hutton, Laurel Finn, Safwan Badr, and Mari Palta. 1996. The gender bias in sleep apnea diagnosis: are women missed because they have different symptoms? Archives of Internal Medicine, 156(21):2445–2451.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. arXiv preprint arXiv:1904.03310.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. In Proc. of EMNLP, pages 4847–4853.

A 1960s Most Biased Words

Male:

• physician
• doctor
• president
• dentist
• psychiatrist
• surgeon
• student
• nurse
• worker
• professor

Female:

• substitute
• principal
• editor
• baker
• character
• author
• pharmacist
• scientist
• therapist
• teacher

B Mental Health-Related Terms

[abuse, acute, adaptation, adjustment, adolescent, adult, affective, agoraphobia, alcohol, alcoholic, alzheimer, amnesia, amnestic, amphetamine, anorexia, anosognosia, anterograde, antisocial, anxiety, anxiolytic, asperger, atelophobia, attachment, attention, atypical, autism, autophagia, avoidant, avoidant, restrictive, barbiturate, behavior, benzodiazepine, bereavement, bibliomania, binge, bipolar, body, borderline, brief, bulimia, caffeine, cannabis, capgras, catalepsy, catatonia, catatonic, childhood, circadian, claustrophobia, cocaine, cognitive, communication, compulsive, condition, conduct, conversion, coordination, cotard, cyclothymia, daydreaming, defiant, deficit, delirium, delusion, delusional, delusions, dependence, depersonalization, depression, depressive, derealization, dermatillomania, desynchronosis, deux, developmental, diogenes, disease, disorder, dissociative, dyscalculia, dyskinesia, dyslexia, dysmorphic, eating, ejaculation, ekbom, encephalitis, encopresis, enuresis, episode, erectile, erotomania, exhibitionism, factitious, fantastica, fetishism, fregoli, fugue, functioning, gambling, ganser, grandiose, hallucinogen, hallucinosis, histrionic, huntington, hyperactivity, hypersomnia, hypnotic, hypochondriasis, hypomanic, hysteria, ideation, identity, impostor, induced, infancy, insomnia, intellectual, intermittent, intoxication, kleptomania, korsakoff, lacunar, lethargica, love, major, maladaptive, malingering, mania, mathematics, megalomania, melancholia, misophonia, mood, munchausen, mutism, narcissistic, narcolepsy, nervosa, neurocysticercosis, neurodevelopmental, nicotine, nightmare, nos, obsessive, obsessive–compulsive, ocd, ocpd, oneirophrenia, opioid, oppositional, orthorexia, pain, panic, paralysis, paranoid, parasitosis, parasomnia, parkinson, partialism, pathological, pdd, perception, persecutory, personality, pervasive, phencyclidine, phobia, phobic, phonological, physical, pica, polysubstance, posttraumatic, pseudologia, psychogenic, psychotic, ptsd, pyromania, reactive, residual, retrograde, rumination, schizoaffective, schizoid, schizophrenia, schizophreniform, schizotypal, seasonal, sedative, selective, separation, sexual, sleep, sleepwalking, social, sociopath, somatic, somatization, somatoform, stereotypic, stockholm, stress, stuttering, substance, suicidal, suicide, tardive, terror, tic, tourette, transient, transvestic, tremens, trichotillomania, truman, withdrawal, wonderland]

C Occupations

[detective, ambassador, coach, officer, epidemiologist, rabbi, ballplayer, secretary, actress, manager, scientist, cardiologist, actor, industrialist, welder, biologist, undersecretary, captain, economist, politician, baron, pollster, environmentalist, photographer, mediator, character, housewife, jeweler, physicist, hitman, geologist, painter, employee, stockbroker, footballer, tycoon, dad, patrolman, chancellor, advocate, bureaucrat, strategist, pathologist, psychologist, campaigner, magistrate, judge, illustrator, surgeon, nurse, missionary, stylist, solicitor, scholar, naturalist, artist, mathematician, businesswoman, investigator, curator, soloist, servant, broadcaster, fisherman, landlord, housekeeper, crooner, archaeologist, teenager, councilman, attorney, choreographer, principal, parishioner, therapist, administrator, skipper, aide, chef, gangster, astronomer, educator, lawyer, midfielder, evangelist, novelist, senator, collector, goalkeeper, singer, acquaintance, preacher, trumpeter, colonel, trooper, understudy, paralegal, philosopher, councilor, violinist, priest, cellist, hooker, jurist, commentator, gardener, journalist, warrior, cameraman, wrestler, hairdresser, lawmaker, psychiatrist, clerk, writer, handyman, broker, boss, lieutenant, neurosurgeon, protagonist, sculptor, nanny, teacher, homemaker, cop, planner, laborer, programmer, philanthropist, waiter, barrister, trader, swimmer, adventurer, monk, bookkeeper, radiologist, columnist, banker, neurologist, barber, policeman, assassin, marshal, waitress, artiste, playwright, electrician, student, deputy, researcher, caretaker, ranger, lyricist, entrepreneur, sailor, dancer, composer, president, dean, comic, medic, legislator, salesman, observer, pundit, maid, archbishop, firefighter, vocalist, tutor, proprietor, restaurateur, editor, saint, butler, prosecutor, sergeant, realtor, commissioner, narrator, conductor, historian, citizen, worker, pastor, serviceman, filmmaker, sportswriter, poet, dentist, statesman, minister, dermatologist, technician, nun, instructor, alderman, analyst, chaplain, inventor, lifeguard, bodyguard, bartender, surveyor, consultant, athlete, cartoonist, negotiator, promoter, socialite, architect, mechanic, entertainer, counselor, janitor, firebrand, sportsman, anthropologist, performer, crusader, envoy, trucker, publicist, commander, professor, critic, comedian, receptionist, financier, valedictorian, inspector, steward, confesses, bishop, shopkeeper, ballerina, diplomat, parliamentarian, author, sociologist, photojournalist, guitarist, butcher, mobster, drummer, astronaut, protester, custodian, maestro, pianist, pharmacist, chemist, pediatrician, lecturer, foreman, cleric, musician, cabbie, fireman, farmer, headmaster, soldier, carpenter, substitute, director, cinematographer, warden, marksman, congressman, prisoner, librarian, magician, screenwriter, provost, saxophonist, plumber, correspondent, organist, baker, doctor, constable, treasurer, superintendent, boxer, physician, infielder, businessman, protege]
