Arxiv:2008.01946V2 [Cs.CL] 3 Nov 2020 H Raino Meddwr Representa- Word Embedded of Creation the Fifrainaotgne Nwr Embeddings
Total Page:16
File Type:pdf, Size:1020Kb
An exploration of the encoding of grammatical gender in word embeddings Hartger Veeman Ali Basirat Dep. of Linguistics and Philology Dep. of Linguistics and Philology Uppsala University Uppsala University [email protected] ali.basirat@lingfil.uu.se Abstract The way this information is encoded in word em- beddings is of interest because it can contribute to The vector representation of words, known as the discussion of how nouns of a language are clas- word embeddings, has opened a new research sified into different gender categories. approach in linguistic studies. These represen- tations can capture different types of informa- This paper explores the presence of information tion about words. The grammatical gender of about grammatical gender in word embeddings nouns is a typical classification of nouns based from three perspectives: on their formal and semantic properties. The study of grammatical gender based on word 1. Examine how such information is encoded embeddings can give insight into discussions across languages using a model transfer be- on how grammatical genders are determined. tween languages. In this study, we compare different sets of word embeddings according to the accuracy of 2. Determine the classifier’s reliance on seman- a neural classifier determining the grammati- tic information by using contextualized word cal gender of nouns. It is found that there is an embeddings. overlap in how grammatical gender is encoded in Swedish, Danish, and Dutch embeddings. 3. Test the effect of gender agreements between Our experimental results on the contextualized nouns and other words categories on the in- embeddings pointed out that adding more con- formation encoded into word embeddings. textual information to embeddings is detrimen- tal to the classifier’s performance. We also 2 Grammatical gender observed that removing morpho-syntactic fea- tures such as articles from the training corpora Grammatical gender is a nominal classification of embeddings decreases the classification per- system found in many languages. In languages formance dramatically, indicating a large por- with grammatical gender, the noun forms an agree- tion of the information is encoded in the rela- ment with another language aspect based on the tionship between nouns and articles. noun class. The most common grammatical 1 Introduction gender divisions are masculine/feminine, mascu- arXiv:2008.01946v2 [cs.CL] 3 Nov 2020 line/feminine/neuter, and uter/neuter, but many The creation of embedded word representa- other divisions exist. tions has been a key breakthrough in natu- Gender is assigned based on a noun’s meaning ral language processing (Mikolov et al., 2013; and form (Basirat et al., in press). Gender assign- Pennington et al., 2014; Bojanowski et al., 2017; ment systems vary between languages and can be Peters et al., 2018a; Devlin et al., 2019). Words based solely on meaning (Corbett, 2013). The ex- embeddings have proven to capture informa- periments in this paper concern grammatical gen- tion relevant to many tasks within the field. der in Swedish, Danish, and Dutch. Basirat and Tang (2019) have shown that word em- Nouns in Swedish and Danish can have one of beddings contain information about the grammat- two grammatical genders, uter, and neuter indi- ical gender of nouns. In their study, a classifier’s cated through the article for indefinite form and ability to classify embeddings on grammatical gen- the suffix for the definite and plural form. Dutch der is interpreted as an indication of the presence nouns can technically be classified as masculine, of information about gender in word embeddings. feminine, or neuter. However, in practice, the masculine and feminine genders are only used for Table 1: Accuracy for model (vertical) applied to test nouns referring to people, with the distinction be- set (horizontal) tween masculine and feminine being the subject and object pronouns, while in other cases, all non- SV DA NL neuter nouns can be categorized as uter. Grammat- ical gender in Dutch is indicated in the definite ar- SV 93.55 73.89 73.37 ticle, adjective agreement, pronouns, and demon- DA 81.18 91.81 78.50 stratives. NL 71.32 78.54 93.34 3 Multilingual embeddings and model Table 2: Corrected accuracy for model (vertical) ap- transferability plied to test set (horizontal) The first experiment explores if grammatical gen- SV DA NL der is encoded in a similar way between different languages. For this experiment, we leverage the SV 32.03 15.25 11.37 multilingual word embeddings. Multilingual word DA 22.54 35.33 19.50 embeddings are word embeddings that are mapped to the same latent space so that the vector of words NL 9.32 19.54 31.15 that have a similar meaning has a high similarity regardless of the source language. The aligned word embeddings allow for the model transfer of 3.2 Results the neural classifier. This is done by applying the To isolate how much of the accuracy of a model is neural classifier model to a different language than caused by successfully applying learned patterns it is trained on. If the model effectively classifies from the source language to the target language embeddings from a different language, it is likely rather than successful random guessing, the fol- grammatical gender is encoded in the embeddings lowing baseline is defined: of two languages similarly. X p(gs)p(gt) 3.1 Data & Experiment g∈{u,n} The lemma form of all unique nouns and their This baseline is the chance the model would make genders were extracted from Universal Dependen- a correct guess purely based on the class distribu- cies treebanks (Nivre et al., 2016). This resulted tions of the training data and test data. Subtract- in about 4.5k, 5.5k, and 7.2k nouns in Danish (da), ing this chance from the results should give an Swedish (sv), and Dutch (nl), respectively. Ten indication of how much information in the model percent of each data set was sampled randomly was transferable to the task of classifying test data to be used as test data. For each noun, the em- in another language. The absolute results are dis- beddings was extracted from pre-trained aligned played in table 1 while the results corrected with word embeddings published in the MUSE project this baseline are displayed in table 2. (Conneau et al., 2017). The uter/neuter class dis- All transfers yielded a result above the random tributions are 74/26 for Swedish, 68/32 for Danish, guessing baseline. This means the classifier is and 75/25 for Dutch. learning patterns in the source language data that The classification is performed using a feed- are applicable to the target language data. From forward neural network. The network has an in- this, it can be concluded that grammatical gender put size of 300 and a hidden layer size of 600. is encoded similarly between these languages to a The loss function used is binary cross-entropy loss. certain degree. The training is run until no further improvement is The performance of the Swedish model for clas- observed. sifying Dutch grammatical gender and vice versa For every language, a network was trained us- is the lowest of all language pairs. This language ing the data described above. Every model was pair is also the pair with the largest phylogenetic then applied to its own test data and the other lan- language distance. The language pair with the guages’ test data to test the model’s transferability. smallest distance is Danish/Swedish. The Danish model applied to the Swedish test data produces Table 3: Accuracy and loss for different layers of the the best result of all model transfers, achieving an ELMO model accuracy of 21.06% over the baseline. This ob- servation on the inverse relationship between the Loss Accuracy language distance and the gender transferability agrees with the broader experiments performed by Word representation layer 0.168 93.82 (Veeman et al., 2020). First layer 0.260 92.13 Interestingly, the success of the model trans- Second layer 0.274 91.24 fer is not symmetrical. This is most clear in the Swedish-Danish pair which achieve an accuracy of 22.54% over baseline from Danish to Swedish, ers’ output was made to discover the influence of but only 15.25% over baseline from Swedish to this difference in semantic information. Danish. The patterns the Danish model learns are more applicable to the Swedish data than vice 4.2 Data & Experiment versa. The comparison was performed using a Swedish pre-trained ELMo model (Che et al., 2018). The 4 Contextual word embeddings nouns and their gender labels were extracted from Adding contextual information to a word embed- UD treebanks. The noun embeddings were gener- ding model has proven to be effective for seman- ated using their treebank sentence as context. The tic tasks like named entity recognition or seman- output is collected at the word representation layer, tic role labeling (Peters et al., 2018b). This addi- the first contextualized layer, and the second con- tion of semantic information can be leveraged to textualized (output) layer. This resulted in a set of measure its influence on the gender classification a little over 10k embeddings for every layer, from task. A contextualized word embedding model which 10% was randomly sampled and split as test was used to quantify the role of semantic informa- data. The embeddings have a size of 1024, and the tion in the noun classification. hidden layer size of the classifier has a size double the input size, 2048. The results for this compari- 4.1 Embeddings from Language Models son are shown in table 2. (ELMo) 4.3 Results The ELMo model (Peters et al., 2018a) is a multi- An apparent decrease in accuracy and an increase layer biRNN that leverages pre-trained deep bidi- in loss is observed when classifying the gender of rectional language models to create contextual the contextualized word embeddings.