, Neo-Whorfianism, and Word Embeddings: A Data-Driven Approach to

Katharina Kann New York University, USA [email protected]

Abstract 2012). Ultimately, Whorf went on to hypothesize that the Hopi perceive time differently as a result The relation between and thought of their language’s grammar, kicking off a more has occupied linguists for at least a century. encompassing debate on the relation of language Neo-Whorfianism, a weak version of the con- troversial Sapir–Whorf hypothesis, holds that and thought and engendering one of the larger con- our thoughts are subtly influenced by the gram- troversies in to date (Deutscher, 2010). matical structures of our native language. One While influential, the strong version of what area of investigation in this vein focuses on has come to be known as the Sapir–Whorf2 hy- how the grammatical gender of nouns affects pothesis has been disproven. For instance, even the way we perceive the corresponding objects. though differ in the basic color terms For instance, does the fact that key is mascu- they employ, e.g., Korean has one word that rep- line in German (der Schlussel¨ ), but feminine in Spanish (la llave) change the speakers’ views resents both green and blue, the development of of those objects? Psycholinguistic evidence color terminology and perception is subject to uni- presented by Boroditsky et al.(2003, §4) sug- versalist constraints due to biology (Berlin and Kay, gested the answer might be yes: When asked 1969). Recent years, however, have witnessed a to produce adjectives that best described a key, resurgence of a milder strain of the hypothesis, al- German and Spanish speakers named more ternatively known as neo-Whorfianism or the weak stereotypically masculine and feminine ones, Sapir–Whorf hypothesis (Boroditsky, 2003). One respectively. However, recent attempts to repli- prominent controversial research direction inves- cate those experiments have failed (Mickan et al., 2014). In this work, we offer a com- tigates this hypothesis with respect to grammat- putational analogue of Boroditsky et al.(2003, icalized notions of gender and tense. Based on §4)’s experimental design on 9 languages, find- experimental evidence obtained from native Ger- ing evidence against neo-Whorfianism. man and Spanish speakers, Boroditsky(2003, §4.6) argued that grammatical gender affects how speak- 1 Introduction ers view objects in their native language. Similarly, During his tenure as a graduate student, 20th- differences in the usage of tense in Mandarin and century American linguist Benjamin Whorf con- English were found to affect how the respective ducted field work on Hopi, an Uto-Aztecan lan- speakers view time (Boroditsky, 2001). In this guage spoken in Southern Arizona. To his surprise, work, we focus on the influence of grammatical arXiv:1910.09729v1 [cs.CL] 22 Oct 2019 he found that Hopi does not mark the tense of a verb gender and how we can use NLP tools to shed light in the way many Western European languages do on the validity of this aspect of neo-Whorfianism. (Whorf, 1956).1 Thus, according to Whorf, a Hopi The gist of Boroditsky et al.(2003, §4)’s hy- speaker must infer whether an action takes place pothesis is that speakers of languages that mark in the past, present or future only from the senten- grammatical gender perceive nouns, even inani- tial context in which the verb occurs. This finding mate ones, differently depending on the noun’s inspired Whorf to start questioning whether lan- gender. They sought psycholinguistic evidence for guage influences thought, a position that has come this claim, arguing that speakers will more often to be known as linguistic relativity (Whorf et al., choose stereotypically masculine adjectives to de- scribe grammatically masculine inanimate nouns 1Whorf’s claim has subsequently been challenged. Later analyses of Hopi grammar suggest that the language marks 2The hypothesis is named after both Benjamin Whorf and two tenses: future and non-future (Malotki, 1983). his Ph.D. advisor Edward Sapir. POS= POS= POS= POS= POS= POS= POS= CASE= CASE= CASE= CASE= CASE= CASE= NUM= NUM= NUM= NUM= NUM= NUM= GEN= Все счастливые семьи похожи друг на друга… All happy families are similar to each other

Figure 1: A Russian sentence showing the original forms and their decomposition into tag–lemma pairs. and stereotypically feminine adjectives to describe either two or three genders: the bipartite distinc- grammatically feminine inanimate nouns. To give tions masculine-feminine and the tripartite distinc- a concrete example, the German word for key, tion masculine-feminine-neuter. An important as- Schlussel¨ , happens to be masculine, so speakers sumption of our study is the tenet that the gender are more likely to use words such as heavy, jagged, assigned to inanimate objects is arbitrary. and hard. In contrast, the Spanish word for key, All languages we experiment on exhibit concord llave, happens to be feminine, so speakers are more in grammatical gender, i.e., articles and adjectives likely to use stereotypically feminine words such that modify a noun must agree with that noun in as elegant, pretty, and delicate. However, Mickan gender as well as in other features. Consider the et al.(2014) failed to replicate this experiment, German and Spanish translations of the English leaving the validity of the findings questionable. sentence: The beautiful key is on the table. In this work, we provide evidence against La llave bonita esta´ sobre la mesa. (1) Boroditsky et al.(2003, §4)’s hypothesis using NLP techniques. We contend that if this instance Der schone¨ Schlussel¨ liegt auf dem Tisch. (2) of neo-Whorfianism is true, then we should see Here, German employs the masculine article der a reflection of it in corpus co-occurrence counts. as Schlussel¨ is masculine; had Schlussel¨ been fem- We propose two experiments, based on word em- inine, German speakers would say die Schlussel¨ . beddings trained on large corpora, to investigate Spanish exhibits a similar behavior, making use of how speakers of different languages use certain la, rather than el, as llave is feminine. For a thor- nouns. Our results on 9 languages suggest that ough treatment of the subject, see Corbett(2006). the grammatical gender of inanimate nouns does A key part of our experimental design will be not influence how they are used in context, taking stripping side effects of the grammatical concord credence away from Boroditsky et al.(2003, §4)’s from articles, adjectives, and verbs, thus removing claims. However, we caution that it is important overt signals of the noun’s gender. not to overstate the findings of one study; we see our results as providing further evidence in the lin- 2.2 Morphological Tagging and guistic relativity debate against neo-Whorfianism Lemmatization from a largely orthogonal source. Our study will make use of NLP tools that automate bits of linguistic analysis: morphological tagging 2 Background and lemmatization. Both will be explained here. In languages that exhibit inflectional morphol- 2.1 Grammatical Gender ogy, we may decompose a word into a bundle In this section, we provide a brief overview of of morpho-syntactic features and a lemma, its grammatical gender systems since those play an canonical form. Formally, we denote a word important role in Boroditsky et al.(2003, §4)’s, in a natural language as w, and a sentence of and hence our, experiments. Languages range length n as w = w1 ··· wi ··· wn. Each word from encoding no grammatical gender on inan- can be factored into a lemma ` and a bundle imate nouns, like English or Mandarin Chinese, of morphological features m. For instance, we to distinguishing tens of gender-like noun classes, may think of the German word Schlussels¨ as as found in the Bantu languages of Africa (Cor- the lemma ` = Schlussel¨ and the features m = bett, 1991). In this work, we will exclusively con- [ POS=n, GEN=masc, NUM=sg, CASE=gen ]. sider gendered languages of Indo-European and Each morphological feature in m is an attribute– Afro-Asiatic (Semitic) stock; they all distinguish value pair. Attributes encompass lexical properties bg es fr he it pl ro ru sk where Xij denotes the probability that vi co-occurs all 96.6 98.5 98.4 96.5 98.2 96.5 98.1 96.5 95.0 with vj within a certain context window. For in- OOV 82.1 91.7 89.6 79.6 91.2 86.9 89.0 88.1 86.6 stance, a symmetric context window of size 5 is a common choice. This asks how often vj occurs Table 1: Token-level lemmatization accuracy obtained by within five positions to the left or right of vi. Un- LEMMING on the test splits of the UD treebanks for all lan- guages when evaluated on all tokens or just OOVs. der this interpretation, WORD2VEC is an instance of exponential-family PCA (Collins et al., 2001; such as gender, number, and case, taking values Cotterell et al., 2017). The output of WORD2VEC such as masculine, singular and genitive, respec- is a mapping of word types to a vector space: d tively. Following Booij(1996), we may divide e : V → R . We highlight that the model only the morphological attributes into two categories: considers words categorically, i.e., it is unable to inherent and contextual. Inherent categories are look at the surface form of the word. This is rel- those that are embedded in the lemma itself: the evant as there are sub-word indicators of gender, lemma Schlussel¨ without any additional inflection e.g., feminine nouns in Spanish often end in -a. reveals POS=n and GEN=masc. However, the In practice, embeddings are taken as a proxy for sentential context dictates that NUM=sg and lexical semantic meaning—words that have similar CASE=gen. We write m for a sequence of meanings should be closer together in the space. morphological tags and ` for a sequence of This idea has a long history in NLP; Firth(1957) lemmata. A morphologically tagged Russian famously quipped, “You shall know a word by the example sentence is given in Fig.1. company it keeps.” As we will discuss in §3, are In general, each word in a sentential context we interested in how well the embeddings for inan- will have exactly one lemma and one bundle of imate nouns encode the respective genders. morpho-syntactic attributes. We may think of a decomposed sentence of length n as an interleaved 3 Neo-Whorfianism and Word Embeddings: What is the Link? trisequence: hw1, `1, m1i · · · hwn, `n, mni, where w ith ` m i is the word, i is its lemma, and i is the set Our primary contribution is the development of of its morphological features. a computational analogue of the previously con- Several techniques exist to map sentences to the ducted psycholinguistic study by Boroditsky et al. lemmata of their words together with their morpho- (2003, §4), mentioned in §1. While, naturally, the syntactic attributes. The task of mapping a sentence signals in a big-data analysis are different than to a sequence of morphological tags, in above no- those extracted from subjects in a laboratory, we w 7→ m tation , is known as morphological tag- find it useful to first describe the original work. ging. The task of mapping a sequence of words to a sequence of lemmata, i.e., w 7→ `, is known 3.1 The Psycholinguistic Experiment as lemmatization. Performing the tasks jointly can improve performance (Muller¨ et al., 2015). To test the hypothesis that the grammatical gender assigned to inanimate objects, even though it is not 2.3 Word Embeddings (and, in fact, cannot be) a direct reflection of natural In our experiments, we will make strong use of gender (since the latter is not defined for inanimate d nouns), has an influence on the manner in which embeddings of words into R . Given a fixed vocab- speakers perceive those objects, Boroditsky et al. ulary V = {v1, . . . , v|V |} of word types, we will d (2003, §4) use the following experimental proce- denote the embedding of a type v as e(v) ∈ R .3 dure. They created a list of 24 words in German We employ the WORD2VEC (Mikolov et al., and Spanish that were selected to be translations of 2013a; Goldberg and Levy, 2014) toolkit, in par- each other. Importantly, the same number of mas- ticular the skip-gram model, for the creation of culine and feminine nouns were present in each our word embeddings. Skip-gram may be consid- language; however, all nouns had different genders ered a form of matrix factorization; specifically, it |V |×|V | in German and Spanish. We show a list of stimuli factorizes a matrix of probabilities X ∈ R , for the experiment in Tab.2, taken from Borodit- 3For notational clarity, we use different symbols for types sky and Schmidt(2000, Appendix A). Then, native and tokens. In this work, vi will always denote a word type, that is, the ith lexical item in a fixed vocabulary V , whereas speakers of each language were brought into the th wi will denote the i token in a sentence w. laboratory and asked to describe the nouns with English Spanish German published in their own right, but rather merely de- apple manzana f. Apfel m. scribed in a summary book chapter. Nevertheless, arrow flecha f. Pfeil m. the idea that grammatical gender may influence boot bota f. Stiefel m. broom escoba f. Besen m. thought has taken off with much ink spilled on the moon luna f. Mond m. subject in the popular press. Indeed, in partial re- spoon cuchara f. Loffel¨ m. sponse to the popularity of the idea, McWhorter star estrella f. Stern m. toaster tostador m. Roster¨ m. (2014) authored an entire volume, The Language pumpkin calabaza f. Kurbis¨ m. Hoax, with the explicit purpose of removing much bench banco m. Bank f. of the hype surrounding neo-Whorfian claims. brush cepillo m. Burste¨ f. clock reloj m. Uhr f. disk disco m. Scheibe f. 3.2 Neo-Whorfianism, Gender, and Word drum tambor m. Trommel f. Embeddings fork tenedor m. Gabel f. sun sol m. Sonne f. How can we use NLP to test the claims investigated toilet inodoro m. Toilette f. in §3.1? We contend that the thrust of Boroditsky violin viol´ın m. Geige f. et al.(2003, §4)’s argument may be reduced to one Table 2: A list of 18 English words for common inanimate of basic lemma co-occurrence counts in a large cor- objects with their translations into both Spanish and German. pus. If German speakers are more likely to describe The list, inspired by Boroditsky and Schmidt(2000, Appendix A), demonstrates that grammatical gender is relatively arbi- a Schlussel¨ (key) with a stereotypically masculine trary in how it differs between languages. adjective, i.e., jagged, rather than with a stereo- typically feminine one, i.e., delicate, we should the first three adjectives that came to mind. (Note reasonably expect this predisposition to manifest the that the experiment took place in English, even itself in corpus counts: grammatically masculine though each subject’s native language was either nouns should have stereotypically masculine adjec- German or Spanish.) As a second step, a group tives modifying them with higher frequency, while of native English speakers were asked to judge the feminine nouns should be more likely to be modi- elicited adjectives as either −1 (feminine) or +1 fied by stereotypically feminine ones. (masculine), which yielded a gender rating. Using Linking this idea to NLP, recall from §2.3 that this rating, Boroditsky et al.(2003, §4) found a cor- co-occurrence counts are the primary signal for relation between the genderedness of the German training word embeddings, one of the most popular and Spanish speakers’ choice of adjectives in the lexical semantic representations of meaning at the first experiment and the English speakers’ rating of type level. Expanding on Firth(1957)’s original how gendered each adjective was. This was taken statement, we further ask if you shall also know the to be a very conservative test for neo-Whorfian word’s grammatical gender from its kept company? effects regarding gender, as the experiment took Operationalizing this, we will attempt to predict place in English, which lacks grammatical gender. the gender of a noun type from its word embed- As a qualitative example, they report that ding under several experimental conditions. This the German-speaking participants described “key,” is a reformulation of Boroditsky et al.(2003, §4)’s which is masculine in German, as hard, heavy, and experimental paradigm. The original experiment jagged, whereas the Spanish-speaking participants asked the participants to generate adjectives given described “key,” which is feminine in Spanish, as a noun stimulus, whereas we look at the sponta- beautiful, elegant, and fragile. They argue that neously written contexts of nouns in large corpora. these findings show that the manner in which Ger- In more detail, Boroditsky et al.(2003, §4)’s par- man and Spanish speakers think about inanimate ticipants are given nouns as stimuli, whose gender objects is influenced by the grammatical gender they assume the participants have access to in their that the language assigns to their nouns. internal representation of the lexicon. (Recall that The findings of Boroditsky et al.(2003, §4) have gender is an inherent morphological property, as not gone uncontested. Mickan et al.(2014) re- discussed in §2.1.) Then, those participants are port two unsuccessful attemps to replicate the ex- asked to generate adjectives that would be good periments described in §3. They also note that, descriptors for those nouns, i.e., they generate con- while the study is widely cited, the experiments, texts for those nouns. The contexts are then scored along with their experimental stimuli, were never in a second experiment, where English speakers determined how gender-stereotyped the adjectives trained on forms (the original corpus), (ii) lem- are. In our experiments, we induce a representa- mata: the embeddings are trained on lemmata (the tion of the noun’s context from copious amounts whole corpus is lemmatized), (iii) nouns: the em- of raw text. (The context will include adjectives, beddings are trained on lemmatized nouns where like the ones Boroditsky et al.(2003, §4)’s partic- the rest of the corpus is left unlemmatized, (iv) ipants generate.) Then, we employ a classifier to ¬nouns: the embeddings are trained on unlemma- reconstruct the gender of the inanimate noun from tized nouns and the rest are lemmatized. its embedded (lemmatized) context. Despite this difference, the idea tested is identical: can one pre- Hypotheses. We compare the ability of the clas- dict a noun’s grammatical gender from the words sifier to predict the gender of the noun in the word humans associate with it? conditions outlined above and, additionally, com- pare the results to a majority-class baseline. We Methodologically, we will introduce two experi- hypothesize the classifier to easily be able to pre- ments that investigate the relation between a noun’s dict the gender from conditions (i) forms and (iii) grammatical gender and the noun’s context. Exper- nouns since the primary cue for gender is the con- iment 1 (§4.3) attempts to predict the gender from a cord exhibited by the context words. Indeed, we word embedding with a black-box classifier, as we see no a-priori reason why (i) should perform sig- hint at above. Experiment 2 (§4.4) complements nificantly differently than (iii). The conditions (ii) experiment 1 in that it provides a technique for lemmata and (iv) ¬nouns are more interesting: If analyzing the “genderedness” of inanimate nouns. the inherent gender in the inanimate nouns influ- Both experiments make used of the same multilin- ences the choice of context lemmata, as Boroditsky gual data, whose preparation is described in §4.2. et al.(2003, §4) believe, then we hypothesize the 4 Experimental Setup classifier to be able to predict the gender from the embeddings in (ii) lemmata and (iv) ¬nouns better The goal of both our experiment and that of Borodit- than a majority-class baseline. However, if speak- sky et al.(2003, §4) is to determine whether the ers are uninfluenced by grammatical gender, then words that occur in the context of a noun are in- we should fail to predict grammatical gender from fluenced by its grammatical gender. In our exper- context. We note (i) and (iii) are skylines since (ii) iment, we opt to represent a context by its word and (iv) contain less gender-related information. embedding and try to predict the gender of a (inan- imate) noun given its word embedding. 4.2 Data Preparation Extracting the gender of a word from its vector The first step in the data creation pipeline is lemma- representation is often trivial for many languages tization. This is necessary to remove grammatical due to certain grammatical artifacts: e.g., in Span- concord, for instance, because the Spanish phrase ish, nouns are usually accompanied by a gender- la llave bonita shows concord between the femi- specific article (el or la). Thus, in order to obtain nine noun llave, the definite article la and the adjec- meaningful experimental results, we need to con- tive bonita. The lemmatized version of the sentence trol for such obvious indicators, i.e., lemmatize our is el llave bonito, where both the article and the ad- corpora and train lemma embeddings instead of jective have been reverted to their canonical form, embeddings for all inflected forms. Indeed, word which, traditionally, is the masculine singular form embeddings famously capture gender, as evinced in Spanish. Hence, gender information is no longer by Mikolov et al.(2013b)’s (approximate) equation trivially detectable using neighboring words.

e(king) − e(man) + e(woman) ≈ e(queen). Lemmatizer. We choose the LEMMING (Muller¨ et al., 2015) package4 for lemmatization. LEM- Recall from §2.3 that our embeddings do not have MING is a conditional random field (Lafferty et al., access to subword information, so clues for a 2001) with artisanal feature templates. While oc- noun’s gender must come from context. casionally surpassed in performance, LEMMING is competitive with neural state-of-the-art models, 4.1 Word Embedding Comparison while remaining robust in low-resource settings and being fast to train (Heigold et al., 2017). For We induce word embeddings under four experi- mental conditions: (i) forms: the embeddings are 4http://cistern.cis.lmu.de/lemming/ majority all languages in our experiments, we keep LEM- 100 lemmata not nouns MING’s defaults as our hyperparameters and train nouns forms it on the training and development splits of the 90 corresponding Universal Dependencies (UD) tree- banks (Nivre et al., 2017). Since errors during 80 lemmatization could significantly alter our experi- ments, we make use of the dev splits to ensure that a 70 high lemmatization performance is reached. Tab.1 shows the resulting token-based accuracies of our final models, both on the entire UD test sets and 60 on out-of-vocabulary words only. The latter serves as a lower-bound; it corresponds to the (extreme) 50 case where none of the words of the corresponding 40 Wikipedia corpus appear in the union of the UD bg es fr he it pl ro ru sk training and development sets. Figure 2: Accuracies for our classifer on all languages. Experimental Languages. We choose 9 experi- The majority-class baseline (blue bar) is generally above the lemmata-based prediction (green bar), but much below the mental languages randomly from UD: Bulgarian form-based prediction (yellow bar). This leads us to con- (bg), French (fr), Hebrew (he), Italian (it), Pol- clude that we cannot reliably predict grammatical gender for ish (pl), Romanian (ro), Russian (ru), Slovak (sk), inanimate nouns from context alone. Spanish (es). The only imposed condition is that that they have ≥ 80% performance on lemmatizing vectors. Eq. (3) represents a network of depth 2, out-of-vocabulary words on the UD development but we consider depth-k networks where k is a set. This is to ensure that our lemmatizer is rea- hyperparameter. We additionally consider the non- sonably leak-free for rare words. We drop neuter linearities ReLU and sigmoid. words in languages that exhibit a neuter, such as Training Set. To learn the parameters of Eq. (3), Bulgarian and Russian. we construct the following training set. Given Word Vectors. We employ the skip-gram model the lemmatized and morphologically analyzed from the WORD2VEC package to induce 100- Wikipedias discussed in §4.2, we construct a lexi- dimensional embeddings. We use negative sam- con as follows. For every lemma type that occurs pling with 10 samples. For all languages, the vec- more than 50 times, we find the gender, extracted tors are trained on corpora lemmatized in the way from the morphological tag of each token, that most we just described; namely, we make use of multi- frequently occurs among its tokens in the corpus. lingual Wikipedia editions from March 2018. All This yields a lexicon of lemma-gender pairs. Note words with a frequency below 5 are ignored, and that this training set will include animate words as we compare symmetric context window sizes of 2, we are unable to exclude them easily. However, we 5 and 10, finding 2 works the best. will only evaluate on inanimate nouns; see below.

4.3 Experiment 1: Gender Classification Evaluation Set. For the evaluation, we focus ex- clusively on the same common, inanimate words Our gender prediction problem constitutes a binary in all languages. We use the NorthEuraLex dataset classification task, where the classes are masculine (Dellert and Jager¨ , 2017),5 which is a multi-way and feminine and the input is the word embedding concept-aligned dictionary. In order to avoid bio- of a noun. We employ a multi-layer perceptron logical gender interfering with grammatical gender (MLP) for this classification, defining the probabil- and, thus, influencing our experiments, we manu- ity of the gender g given a noun type v, as ally annotate all concepts as animate or inanimate p(g | v) = and exclude all animate nouns. For instance, we keep eye, lake, and circle, while discarding words softmax(W tanh (W e(v) + b ) + b ) (3) 2 1 1 2 like wife, dog, or son. The list containing all con- where we feed in the noun’s embedding e(v) into cepts in our evaluation set can be found in App.A. d0×d 2×d0 We take care to remove these words from the train- the network, W1 ∈ R and W2 ∈ R are 0 d 2 5 weight matrices, b1 ∈ R and b2 ∈ R are bias http://www.northeuralex.org/ forms lemmata 4.4 Experiment 2: A Gender Dimension fem masc fem masc In addition to Experiment 1, we would also like to estrella (f.) sonido (m.) labio (m.) hoguera (f.) analyze the degree to which words are more mas- ola (f.) labio (m.) mejilla (f.) aldea (f.) bolsa (f.) tejido (m.) entendimiento (m.) esqu´ı (m.) culine or feminine using their word embeddings, isla (f.) lazo (m.) angulo´ (m.) altitud (f.) which, in turn, tells us how masculine or feminine carta (f.) nudo (m.) vena (f.) silla (f.) their contexts are. A high-level overview of how Table 3: The most gendered nouns in our Spanish test set we can achieve this is as follows. We may iso- according to our induced gender dimension under the forms late genderedness by fixing one of the dimensions and the lemmata conditions. The gender dimension is the column header and the true gender is marked in parentheses. of the word embeddings to be the gender dimen- sion. Then, we seek a method that will shift the information regarding gender into that dimension. ing set. Note that not all words are observed in each In effect, we need a method which maps every language, due to the varying sizes of the Wikipedia word to a scalar quantity which corresponds to how corpora. We split the resulting lexicons in half, gendered the word is, allowing us to compare indi- creating a dev and test set for each language. vidual words and to discover those whose gender is more saliently encoded in the embeddings. We Training and Hyperparameters. The model is adopt the ultra-dense strategy developed by Rothe implemented in PyTorch (Paszke et al., 2017). We et al.(2016), which we will describe below. train our models on the training sets using Adam Learning Ultra-Dense Embeddings. Given an (Kingma and Ba, 2015) with a base learning rate embedding e(v) ∈ d of a word v, we are of 0.1 for all models. All models are trained for 50 R interested in learning a real orthogonal matrix epochs. Hyperparameters include network depth Q ∈ d×d in order to create a new embedding (k ∈ [1, 5]), size of the hidden layer (taken from R e0(v) = Q e(v). Defining Q to be real orthogonal {100, 200, 300}) and type of nonlinearity (taken ensures that no information is lost or gained as a from {tanh, sigmoid, ReLU}). We randomly parti- result of the transformation—the dot product, and, tion the evaluation set in half 10 times and sweep thus, the cosine similarity between vectors will be the hyperparameters on each dev partition, perform- preserved. In order to learn a transformation that ing early stopping. Final results average the perfor- moves gender information to certain components mance on test across these splits. of the embeddings, let S be the set of of all pairs of distinct nouns that have the same gender and let Results and Discussion. The gender-prediction D be the set of all pairs of distinct nouns that have d×d accuracies from the four embeddings conditions a different gender. Let P ∈ R be a matrix with are shown in Fig.2. As discussed, classification all entries being zero except for P11, which is 1. accuracy is highest for conditions (i) forms and Now, we minimize the following objective (iii) nouns, i.e., embeddings trained on the corpora where the context words are unaltered. In these con-   X 0  2 O Q; S,D = || PQ e(v) − e(v ) ||2 ditions, where we see unlemmatized forms as con- (v,v0)∈S text, our classifier handily surpasses the majority- X − || PQ e(v) − e(v0) ||2 (4) class baseline with differences up to 30 points. All 2 0 differences are significant (p < 0.05). On the other (v,v )∈D hand, inspecting conditions (ii) lemmata and (iv) with respect to the matrix Q subject to the con- ¬nouns, we see that performance of each is rarely straint that Q>Q = I; that is, Q is real orthogonal. better than the majority-class baseline and in no case is statistically better (p < 0.05). Thus, de- Stochastic Projected Gradient. The above spite a relatively extensive hyperparameter search, objective can be optimized using a stochas- we are unable to reliably predict grammatical gen- tic projected-gradient-style algorithm (Bertsekas, der from the context words along. This negative 1999). This algorithm alternates between two steps result provides evidence against Boroditsky et al. until convergence: (i) A stochastic gradient step: (2003, §4)’s hypothesis that the inherent gender of During this step, one element is randomly sampled the word will have an effect on the context words from each of S and D. Then, the gradient of Eq. (4) that a speaker uses for inanimate nouns. is computed with respect to Q, and Q is updated Algorithm 1 Stochastic Projected Gradient Algorithm Lang Spearman’s ρ

1: input S, D . same- and different-gender pairs lemmata ¬nouns nouns forms 2: Q ← I bg 07.80 02.38 78.29 84.65 3: for t = 1 to T do es 06.80 22.65 84.79 85.47 0 fr 06.88 19.24 84.23 85.08 4: (vS , vS ) ∼ uniform(S) 0 he 09.94 18.57 56.53 63.09 5: (vD, vD) ∼ uniform(D) it 02.93 09.21 80.70 82.68 ˜ 0 ˜ 0 6: S ← {(vS , vS )} ; D ← {(vD, vD)} pl 14.92 06.46 03.43 18.42 0 ro 08.52 14.21 53.14 82.37 7: Q ← ηt · ∇Q O(Q; S˜, D˜) > 0 ru 18.79 42.02 70.09 79.22 8: UΣV ← SVD (Q ) sk 15.14 33.21 81.02 82.74 9: Q ← UV > Table 4: Spearman’s ρ between the generated dimension on test data and the true gender annotation. The correlation is statistically different than zero with p < 0.05 in those entries by taking a step in the direction of the gradient. (ii) in blue; the ones in red were not found to be significant. A projection step: After obtaining a new matrix 0 Q = Q + η · ∇QO during the gradient update inine gender dimension are displayed in Tab.3 in where η is the learning rate, we no longer have the the (i) forms and (ii) lemmata conditions. The guarantee that Q0 is orthogonal. Thus, we must qualitative analysis shows the same trend. perform a projection step to orthogonalize Q0. This can be achieved through singular value decomposi- 5 Other Related Work tion (SVD). We compute the SVD: Q0 = UΣV >, In the realm of NLP, the closest work to ours deals where U, Σ and V > are guaranteed to be real as Q0 with bias in word embeddings. Many have ob- is real. Then, we may define Q = UV >, which is served that word embeddings encode the biases the closest to Q0 under the Frobenius norm (Horn present in the data they were trained on on. For and Johnson, 2012). Pseudocode for this algorithm instance, Bolukbasi et al.(2016) and Zhao et al. is given in Alg.1. (2017) note that the engineer embedding has a Training Details. Our learning rate schedule ηt higher cosine similarity with man than with woman, (see Alg.1) is chosen by the Adam optimizer reflecting a structural imbalance in the gender of (Kingma and Ba, 2015). We run T = 1000 itera- the profession; they propose to debias the embed- tions. Upon termination, we extract a scalar-valued dings such that gender no longer plays a role. gender quantity as follows: [PQ e(w)]11, i.e., the first component of the new embedding. We train 6 Conclusion on half the NorthEuraLex data and test on the other Using word embeddings in 9 different languages half, using the splits described in §4.3. trained on lemmatized corpora, we investigated Analyzing the Data. The gender dimension ad- whether adjective choice is influenced by the gram- mits both a quantitative and a qualitative analysis. matical gender of inanimate nouns. This question We start with a quantitative analysis; we consider has larger implication in the debate on the relation Spearman’s ρ between the gender dimension and between language and thought. We developed a the grammatical gender of the nouns, marking mas- computational analogue of Boroditsky et al.(2003, culine as 0 and feminine as 1. The results are §4)’s experimental paradigm and showed that con- shown in Tab.4. They mirror those found in exper- text in which a noun occurs, stripped of its overt iment 1 (§4.3): we are unable to find correlation gender markings, is no longer predictive of the significantly different than 0 for any of the cases inherent gender of the original, inanimate noun. where the context words have been lemmatized These negative results contradict Boroditsky et al. (conditions (i) lemmata and (iii) ¬nouns). On the (2003, §4)’s claims. contrary, when the context words are left unlem- Any scientific study should be viewed with a matized (conditions (i) forms and (iii) nouns), we healthy dose of skepticism, especially one, such as are generally able to find a significant correlation. ours, that considers a controversial question. We Qualitatively, the gender dimension tells us which believe our big-data study should be taken as com- words are more gendered than others. Here, we plementary evidence in the context of the larger perform a case study of our Spanish test set. The debate that inanimate nouns’ gender does not influ- five words with the most masculine and most fem- ence the way speakers describe them in a corpus. Acknowledgments Guy Deutscher. 2010. Through the language glass: Why the world looks different in other languages. I would like to thank Hanna Wallach, Lawrence Metropolitan Books. Wolf-Sonkin, and Ryan Cotterell for discussions John R. Firth. 1957. A synopsis of linguistic theory, and contributions, and they approve this acknowl- 1930-1955. Studies in linguistic analysis. edgment. Except for the addition of author informa- tion, these acknowledgments, and minor changes Yoav Goldberg and Omer Levy. 2014. word2vec this paper is unchanged from a manuscript written explained: Deriving mikolov et al.’s negative- sampling word-embedding method. arXiv preprint in 2018. arXiv:1402.3722.

Georg Heigold, Guenter Neumann, and Josef van Gen- References abith. 2017. An extensive empirical evaluation of character-based morphological tagging for 14 lan- Brent Berlin and Paul Kay. 1969. Basic color terms: guages. In EACL. Their universality and evolution. University of Cali- fornia Press. Roger A. Horn and Charles R. Johnson. 2012. Matrix Analysis. Cambridge University Press. Dimitri P. Bertsekas. 1999. Nonlinear Programming. Athena scientific Belmont. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR. Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. John Lafferty, Andrew McCallum, and Fernando CN 2016. Man is to computer programmer as woman Pereira. 2001. Conditional random fields: Prob- is to homemaker? Debiasing word embeddings. In abilistic models for segmenting and labeling se- NIPS. quence data. In ICML.

Geert Booij. 1996. Inherent versus contextual inflec- Ekkehart Malotki. 1983. Hopi time: A linguistic anal- tion and the split morphology hypothesis authors. In ysis of the temporal concepts in the Hopi language, Yearbook of Morphology 1995, pages 1–16. Springer volume 20. Walter de Gruyter. Netherlands. John H. McWhorter. 2014. The language hoax: Why Lera Boroditsky. 2001. Does language shape thought? the world looks the same in any language. Oxford Mandarin and English speakers’ conceptions of time. University Press. Cognitive , 43:1–22. Anne Mickan, Maren Schiefke, and Anatol Stefanow- Lera Boroditsky. 2003. Linguistic relativity. Encyclo- itsch. 2014. Key is a llave is a schlussel: A failure to pedia of . replicate an experiment from Boroditsky et al 2003. Yearbook of the German Cognitive Linguistics Asso- Lera Boroditsky and Lauren A. Schmidt. 2000. Sex, ciation, 2(1):39. syntax, and . In Proceedings of the Annual Meeting of the Cognitive Science Society. Tomas Mikolov, Kai Chen, Greg Corrado, and Jef- frey Dean. 2013a. Efficient estimation of word Lera Boroditsky, Lauren A. Schmidt, and Webb representations in vector space. arXiv preprint Phillips. 2003. Sex, syntax, and semantics. Lan- arXiv:1301.3781. guage in Mind: Advances in the Study of Language and Thought, pages 61–79. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space Michael Collins, Sanjoy Dasgupta, and Robert E. word representations. In NAACL-HLT. Schapire. 2001. A generalization of principal com- ponents analysis to the exponential family. In NIPS. Thomas Muller,¨ Ryan Cotterell, Alexander Fraser, and Hinrich Schutze.¨ 2015. Joint lemmatization and Greville G. Corbett. 1991. Gender. Cambridge Univer- morphological tagging with lemming. In EMNLP. sity Press. Joakim Nivre, Zeljkoˇ Agic,´ Lars Ahrenberg, Maria Je- Greville G. Corbett. 2006. Agreement. Cambridge Uni- sus Aranzabe, Masayuki Asahara, Aitziber Atutxa, versity Press. Miguel Ballesteros, John Bauer, Kepa Bengoetxea, Riyaz Ahmad Bhat, Eckhard Bick, Cristina Ryan Cotterell, Adam Poliak, Benjamin Van Durme, Bosco, Gosse Bouma, Sam Bowman, Marie Can- and Jason Eisner. 2017. Explaining and generaliz- dito, Guls¸en¨ Cebiroglu˘ Eryigit,˘ Giuseppe G. A. ing skip-gram through exponential family principal Celano, Fabricio Chalub, Jinho Choi, C¸agrı˘ component analysis. In EACL. C¸ oltekin,¨ Miriam Connor, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Johannes Dellert and Gerhard Jager.¨ 2017. Arantza Diaz de Ilarraza, and Dobrovoljc. 2017. NorthEuraLex. Version 0.9. Universal dependencies 2.0. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W. Sascha Rothe, Sebastian Ebert, and Hinrich Schutze.¨ 2016. Ultradense word embeddings by orthogonal transformation. In NAACL-HLT. . 1956. The Hopi language, Toreva dialect. Benjamin Lee Whorf, John B. Carroll, Stephen C. Levinson, and Penny Lee. 2012. Language, thought, and reality: Selected writings of Benjamin Lee Whorf. MIT Press. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Or- donez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP. A (Inanimate) NorthEuraLex Concepts Continued from previous column Concept ID English Concept ID English Buch::N book Abend::N evening Buchstabe::N character Abhang::N slope Bucht::N cove Abstand::N gap Busen::N bosom Ader::N vein Butter::N butter Alter::N age Bundel::N¨ bundle Angelegenheit::N matter Dach::N roof Anhohe::N¨ elevation Decke::N blanket Anzahl::N count Deckel::N cover Apfel::N apple Donner::N thunder Arbeit::N work Dorf::N village Arm::N arm Dreck::N filth Art::N sort Ecke::N corner Arznei::N medicine Ei::N egg Asche::N ashes Eimer::N bucket Ast::N limb Eis::N ice Atem::N breath Eisen::N iron Auge::N eye Ellenbogen::N elbow Bach::N brook Ende::N end Band::N ribbon Entfernung::N distance Bart::N beard Erde::N earth Bau::N lair Erzahlung::N¨ story Bauch::N belly Essen::N meal Baum::N tree Faden::N thread Beere::N berry Falle::N trap Bein::N leg Farbe::N paint Berg::N mountain Feder::N feather Besen::N broom Fehler::N mistake Bett::N bed Fell::N fur Beutel::N pouch Fenster::N window Bild::N picture Ferse::N heel Birke::N birch Festland::N land Blatt::N leaf Fett::N fat Blume::N flower Feuer::N fire Blut::N blood Fieber::N fever Boden::N ground, soil Figur::N figure Bogen[Waffe]::N bow Finger::N finger Boot::N boat Fingernagel::N fingernail Brei::N mush Fleisch::N meat Brett::N board Fluss::N river Brief::N letter Flugel::N¨ wing Brot::N bread Frost::N frost Brunnen::N well Funke::N spark Brust::N breast, chest Fuß::N foot Brucke::N¨ bridge Fußboden::N floor Gabel::N fork Continued from previous column Continued from previous column Concept ID English Concept ID English Gang::N walk Herz::N heart Gast::N guest Heu::N hay Gedanke::N thought Hilfe::N help Gedachtnis::N¨ memory Himmel::N sky Gegend::N area Hitze::N heat Gegenstand::N item Holz::N wood Gehirn::N brain Honig::N honey Geist::N spirit Horn::N horn Geld::N money Hose::N trousers Gelachter::N¨ laughter Hunger::N hunger Genick::N nape Halfte::N¨ half Geruch::N odour Hohe::N¨ height Geschenk::N gift Hohle::N¨ cave Geschirr::N dishes Hugel::N¨ hill Geschmack::N flavour Insel::N island Geschaft::N¨ business Jahr::N year Gesicht::N face Kamm::N comb Gesprach::N¨ talk Kampf::N fight Gesundheit::N health Kante::N edge Getreide::N corn Kehle::N throat Gewalt::N violence Kessel::N kettle Gewehr::N gun Kiefer[Anatomie]::N jaw Gewicht::N weight Kiefer[Baum]::N pine Gipfel::N summit Kinn::N chin Glas::N glass Kirche::N church Gluck::N¨ happiness Kissen::N pillow Gold::N gold Kiste::N box Grab::N grave Klaue::N claw Gras::N grass Kleidung::N clothes Grenze::N border Knie::N knee Griff::N handle Knochen::N bone Grube::N pit Knopf::N button Grund::N reason Knoten::N knot Große::N¨ size Kohle::N coal Gurtel::N¨ belt Kopf::N head Haar::N hair Korn::N grain Haken::N hook Kraft::N force Hals::N neck Kragen::N collar Hand::N hand Kralle::N claw Handflache::N¨ palm Krankheit::N illness Handtuch::N towel Kreis::N circle Haufen::N heap Kreuz::N cross Haus::N house Krieg::N war Haut::N skin Kummer::N grief Heim::N home Kalte::N¨ chill Hemd::N shirt Korper::N¨ body Continued from previous column Continued from previous column Concept ID English Concept ID English Kuste::N¨ coast Nahrung::N food Laden::N shop Name::N name Lagerfeuer::N campfire Nase::N nose Land::N country Nebel::N fog Last::N load Nest::N nest Laut::N sound Netz::N net Leben::N life Neuigkeit::N news Leber::N liver Norden::N north Leder::N leather Nutzen::N profit Lehm::N clay Oberschenkel::N thigh Leine::N leash Ofen::N stove Leiter::N ladder Ohr::N ear Leute::N people Ort::N place Licht::N light Osten::N east Lied::N song Pfad::N path Linie::N line Pfeil::N arrow Lippe::N lip Pilz::N mushroom Loch::N hole Platte::N slab Luft::N air Platz::N space Lust::N desire Preis::N price Lange::N¨ length Puppe::N doll Larm::N¨ noise Quelle::N source Loffel::N¨ spoon Rand::N fringe Luge::N¨ lie Rauch::N smoke Macht::N power Raureif::N hoarfrost Magen::N stomach Rede::N speech Meer::N sea Regal::N shelf Menge::N amount Regen::N rain Messer::N knife Regenbogen::N rainbow Milch::N milk Reichtum::N wealth Mittag::N noon Reihe::N row Mitte::N middle Riemen::N strap Monat::N month Rinde::N bark Mond::N moon Ring::N ring Moor::N moor Rohr::N pipe Morgen::N morning Ruder::N oar Mund::N mouth Ruf::N call Muster::N pattern Ruhe::N calm Marchen::N¨ fairy tale Ratsel::N¨ puzzle Mutze::N¨ cap Rucken::N¨ back, spine Nabel::N navel Saat::N seed Nachricht::N message Sache::N thing Nacht::N night Sack::N sack Nadel::N needle Salz::N salt Nagel::N nail Sand::N sand Nagel[Anatomie]::N nail Schaden::N damage Continued from previous column Continued from previous column Concept ID English Concept ID English Schale::N husk Stoff::N cloth Schatten::N shadow Straße::N road Schaufel::N shovel Strich::N stroke Schaum::N foam Stromung::N¨ current Scheibe::N slice Stuhl::N chair Schlaf::N sleep Starke::N¨ strength Schlinge::N noose Stuck::N¨ piece Schlitten::N sleigh Stutze::N¨ bracket Schloss::N lock Sumpf::N swamp Schluss::N conclusion Suppe::N soup Schmerz::N pain Suden::N¨ south Schmutz::N dirt Sunde::N¨ sin Schnee::N snow Tag::N day Schnur::N string Tanne::N fir Schnurrbart::N moustache Tasche::N bag Schritt::N step Tasse::N cup Schuh::N shoe Tee::N tea Schuld::N fault Teil::N part Schulter::N shoulder Tisch::N table Schwanz::N tail Tod::N death See::N lake Ton::N tone Sehne::N sinew Topf::N pot Seite::N side Tor::N gate Silber::N silver Traum::N dream Sinn::N meaning Tropfen::N drop Ski::N ski Trane::N¨ tear Sonne::N sun Tuch::N scarf Spaten::N spade Tur::N¨ door Speise::N dish Ufer::N shore Spiegel::N mirror Ungluck::N¨ misfortune Spiel::N game Verstand::N mind Spitze::N tip Volk::N nation Sprache::N language Wahrheit::N truth Spur::N track Wald::N forest Staat::N state Wange::N cheek Stab::N staff Ware::N ware Stadt::N town Wasser::N water Stamm::N trunk Weg::N way Stange::N pole Weide::N pasture Staub::N dust Weide[Baum]::N willow Stein::N stone Welle::N wave Stern::N star Welt::N world Stiefel::N boot Westen::N west Stimme::N voice Wetter::N weather Stirn::N forehead Wiege::N cradle Stock::N stick Wiese::N meadow Continued from previous column Concept ID English Wind::N wind Winkel::N angle Woche::N week Wolke::N cloud Wolle::N wool Wort::N word Wunde::N wound Wunsch::N wish Wurzel::N root Zahn::N tooth Zaun::N fence Zeh::N toe Zeichen::N sign Zeit::N time Zeitung::N newspaper Zunge::N tongue Zweig::N branch Zwiebel::N onion Armel::N¨ sleeve Ol::N¨ oil