arXiv:2008.01946v2 [cs.CL] 3 Nov 2020 h raino meddwr representa- word embedded of creation The fifrainaotgne nwr embeddings. word in presence the about of information indication of an as interpreted is classifier’s der gen- a grammatical study, on embeddings their classify to In ability . field. of grammat- gender the the ical about within information contain tasks beddings informa- many Tang capture and to Basirat to relevant proven tion have embeddings al. et Peters al. et Pennington in a enakybekhog nnatu- in breakthrough ( key processing a language been ral has tions Introduction 1 odebdig codn oteacrc of accuracy the to according embeddings word research new a opened has embeddings, word ino h nomto secddi h rela- articles. the and in nouns encoded between is tionship information the por- of large tion a indicating dramatically, formance per- classification the corpora decreases training embeddings of the from articles also as such We tures fea- morpho-syntactic removing performance. that classifier’s observed the to tal detrimen- is embeddings to con- information more textual adding that out pointed embeddings contextualized embeddings. the on Dutch results experimental and Our Danish, Swedish, encoded an in is is gender there grammatical that how found in is overlap It grammati- nouns. of the gender cal determining of classifier sets neural different a compare we study, determined. this are In grammatical discussions how into word on insight on give based can gender embeddings The grammatical properties. of semantic study and formal based their nouns on of classification of typical a gender is grammatical nouns The words. informa- about of tion types different capture can represen- tations These studies. linguistic in approach as known words, of representation vector The nepoaino h noigo rmaia edri word in gender grammatical of encoding the of exploration An [email protected] e.o igitc n Philology and of Dep. , 2018a , ( 2019 2014 psl University Uppsala ; Abstract ate Veeman Hartger elne al. et Devlin aesonta odem- word that shown have ) ; oaosie al. et Bojanowski ioo tal. et Mikolov , 2019 .Words ). , , embeddings 2017 2013 ; ; h a hsifraini noe nwr em- word in encoded is information this way The ihgamtclgne,tenu om nagree- an forms the gender, grammatical with bu rmaia edri odembeddings word perspectives: three in from gender grammatical about categories. gender clas- different are into language sified a of nouns to how contribute of can discussion it the because interest of is beddings n om( form and many exist. but divisions grammatical other uter/neuter, common and mascu- most line/feminine/neuter, masculine/feminine, are The divisions the gender on based class. aspect noun language another with languages ment In languages. many classification in found a system is gender Grammatical 2 ae oeyo enn ( meaning be on can solely and based languages between vary systems ment eiie rnue.Hwvr npatc,the practice, in However, neuter. masculine, as or Dutch classified feminine, be form. technically can and and nouns definite form the indefinite for for suffix indi- the article neuter the and through uter, cated genders, grammatical two Dutch. and Danish, gen- Swedish, grammatical in der concern paper this in periments .Ts h feto edrareet between agreements gender of effect the Test 3. seman- on reliance classifier’s the Determine 2. encoded is information such how Examine 1. e.o igitc n Philology and Linguistics of Dep. hspprepoe h rsneo information of presence the explores paper This edri sindbsdo onsmeaning noun’s a on based assigned is Gender on nSeihadDns a aeoeof one have can Danish and Swedish in Nouns on n te od aeoiso h in- embeddings. the word on into encoded categories formation words other and nouns wordembeddings. contextualized using by information tic be- transfer languages. model tween a using languages across ali.basirat@lingfil.uu.se psl University Uppsala aia tal. et Basirat l Basirat Ali , npress in Corbett .Gne assign- Gender ). , 2013 .Teex- The ). masculine and feminine genders are only used for Table 1: Accuracy for model (vertical) applied to test nouns referring to people, with the distinction be- set (horizontal) tween masculine and feminine being the and pronouns, while in other cases, all non- SV DA NL neuter nouns can be categorized as uter. Grammat- ical gender in Dutch is indicated in the definite ar- SV 93.55 73.89 73.37 ticle, adjective , pronouns, and demon- DA 81.18 91.81 78.50 stratives. NL 71.32 78.54 93.34

3 Multilingual embeddings and model Table 2: Corrected accuracy for model (vertical) ap- transferability plied to test set (horizontal)

The first experiment explores if grammatical gen- SV DA NL der is encoded in a similar way between different languages. For this experiment, we leverage the SV 32.03 15.25 11.37 multilingual word embeddings. Multilingual word DA 22.54 35.33 19.50 embeddings are word embeddings that are mapped to the same latent space so that the vector of words NL 9.32 19.54 31.15 that have a similar meaning has a high similarity regardless of the source language. The aligned word embeddings allow for the model transfer of 3.2 Results the neural classifier. This is done by applying the To isolate how much of the accuracy of a model is neural classifier model to a different language than caused by successfully applying learned patterns it is trained on. If the model effectively classifies from the source language to the target language embeddings from a different language, it is likely rather than successful random guessing, the fol- grammatical gender is encoded in the embeddings lowing baseline is defined: of two languages similarly. X p(gs)p(gt) 3.1 Data & Experiment g∈{u,n}

The lemma form of all unique nouns and their This baseline is the chance the model would make genders were extracted from Universal Dependen- a correct guess purely based on the class distribu- cies treebanks (Nivre et al., 2016). This resulted tions of the training data and test data. Subtract- in about 4.5k, 5.5k, and 7.2k nouns in Danish (da), ing this chance from the results should give an Swedish (sv), and Dutch (nl), respectively. Ten indication of how much information in the model percent of each data set was sampled randomly was transferable to the task of classifying test data to be used as test data. For each noun, the em- in another language. The absolute results are dis- beddings was extracted from pre-trained aligned played in table 1 while the results corrected with word embeddings published in the MUSE project this baseline are displayed in table 2. (Conneau et al., 2017). The uter/neuter class dis- All transfers yielded a result above the random tributions are 74/26 for Swedish, 68/32 for Danish, guessing baseline. This means the classifier is and 75/25 for Dutch. learning patterns in the source language data that The classification is performed using a feed- are applicable to the target language data. From forward neural network. The network has an in- this, it can be concluded that grammatical gender put size of 300 and a hidden layer size of 600. is encoded similarly between these languages to a The loss function used is binary cross-entropy loss. certain degree. The training is run until no further improvement is The performance of the Swedish model for clas- observed. sifying Dutch grammatical gender and vice versa For every language, a network was trained us- is the lowest of all language pairs. This language ing the data described above. Every model was pair is also the pair with the largest phylogenetic then applied to its own test data and the other lan- language distance. The language pair with the guages’ test data to test the model’s transferability. smallest distance is Danish/Swedish. The Danish model applied to the Swedish test data produces Table 3: Accuracy and loss for different layers of the the best result of all model transfers, achieving an ELMO model accuracy of 21.06% over the baseline. This ob- servation on the inverse relationship between the Loss Accuracy language distance and the gender transferability agrees with the broader experiments performed by Word representation layer 0.168 93.82 (Veeman et al., 2020). First layer 0.260 92.13 Interestingly, the success of the model trans- Second layer 0.274 91.24 fer is not symmetrical. This is most clear in the Swedish-Danish pair which achieve an accuracy of 22.54% over baseline from Danish to Swedish, ers’ output was made to discover the influence of but only 15.25% over baseline from Swedish to this difference in semantic information. Danish. The patterns the Danish model learns are more applicable to the Swedish data than vice 4.2 Data & Experiment versa. The was performed using a Swedish pre-trained ELMo model (Che et al., 2018). The 4 Contextual word embeddings nouns and their gender labels were extracted from Adding contextual information to a word embed- UD treebanks. The noun embeddings were gener- ding model has proven to be effective for seman- ated using their treebank sentence as context. The tic tasks like named entity recognition or seman- output is collected at the word representation layer, tic role labeling (Peters et al., 2018b). This addi- the first contextualized layer, and the second con- tion of semantic information can be leveraged to textualized (output) layer. This resulted in a set of measure its influence on the gender classification a little over 10k embeddings for every layer, from task. A contextualized word embedding model which 10% was randomly sampled and split as test was used to quantify the role of semantic informa- data. The embeddings have a size of 1024, and the tion in the noun classification. hidden layer size of the classifier has a size double the input size, 2048. The results for this compari- 4.1 Embeddings from Language Models son are shown in table 2. (ELMo) 4.3 Results The ELMo model (Peters et al., 2018a) is a multi- An apparent decrease in accuracy and an increase layer biRNN that leverages pre-trained deep bidi- in loss is observed when classifying the gender of rectional language models to create contextual the contextualized word embeddings. The added word embeddings. The ELMo model has three semantic and contextual information is not only layers, where layer 0 is the non-contextualized to- unhelpful but even detrimental to the classifier’s ken representation, which is a concatenation of performance. Based on these results, it can be ar- the word embeddings and character-based embed- gued that the classifier uses very little semantic in- dings created with a CNN or RNN. This token rep- formation for classifying grammatical gender and resentation is fed into a biRNN. Afterward, the re- that the semantic information added in this experi- sulting hidden state is concatenated with the states ment acted like noise. of both directions of the language model and fed into another biRNN layer. 5 Word embeddings from stripped (Peters et al., 2018b) show that the word rep- corpus resentation layer of ELMo has can capture mor- phology faithfully while encoding little semantics. In the previous experiment, we have observed that The semantic information is instead represented the classifier does not seem to strongly rely on in the contextual layers. Comparing the results semantic data to classify grammatical gender in for embeddings extracted from different layers al- word embeddings. This observation would lead lows us to compare less semantically rich embed- to the hypothesis that it relies on information on dings (non-contextualized word representations) form and agreement instead. with more semantically rich embeddings (contex- Form and agreement could be encoded in tualized embeddings). A comparison of these lay- word embeddings through the noun’s relation with Table 4: Accuracy and loss for embeddings with differ- be connected to language relatedness. It has also ent source corpora been shown through a comparison of embeddings from different layers of a contextualized embed- Loss Accuracy dings model that adding semantic information to embeddings is detrimental to a grammatical gen- Wikipedia corpus 0.247 91.37 der classifier’s performance. Lastly, by creating Wikipedia corpus, no articles 0.368 85.61 embeddings from a corpus that is stripped of infor- Wikipedia corpus, stemmed 0.397 84.66 mation on form and agreement, it has been shown that a noun’s form and relationship to gender spe- cific articles are an essential source of information agreed words. A neuter noun in Swedish would for grammatical gender in word embeddings. It have a high co-occurrence with the neuter article has also shown that the model with stemming and ’ett’, thus leading to a strong relationship between no articles still performs better than chance. the vectors for the noun and the word ’ett’. To test this hypothesis, embeddings have been created from a corpus that has all forms of agree- References ment removed through the removal of articles and Ali Basirat, Marc Allassonni`ere-Tang, and Aleksan- the stemming of all words. drs Berdicevskis. in press. An empirical study on fastText (Bojanowski et al., 2017) was used to the contribution of formal and semantic features to the grammatical gender of nouns. Linguistics Van- create Swedish word embeddings from a corpus guard. consisting of all Swedish articles on Wikipedia. Another set of embeddings was created from the Ali Basirat and Marc Tang. 2019. Linguistic informa- same corpus, but all articles were removed from tion in word embeddings. In Agents and Artificial Intelligence, pages 492–513, Cham. Springer Inter- the corpus. A third set of embeddings was created national Publishing. from the same corpus after stemming it with the Snowball stemmer (Porter, 2001). A classifier was Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with trained on these embeddings in the same configu- subword information. Transactions of the Associa- ration as the previous experiments. The results can tion for Computational Linguistics, 5:135–146. be found in table 3. Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, The classifier only manages accuracy of 85.61% and Ting Liu. 2018. Towards better UD parsing: on the embeddings from the no articles corpus. Deep contextualized word embeddings, ensemble, This is almost a 6% drop caused by the missing and treebank concatenation. In Proceedings of the information, which is very significant, considering CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages the 70% majority baseline. This indicates that the 55–64, Brussels, Belgium. Association for Compu- relationship between nouns and articles in word tational Linguistics. embeddings is a large part of what encoded in- formation on grammatical gender in word embed- Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herv´eJ´egou. 2017. dings for Swedish. Word translation without parallel data. arXiv When classifying the stemmed embeddings, the preprint arXiv:1710.04087. accuracy falls to . . It could be argued this is 84 66% Greville G Corbett. 2013. Systems of gender assign- in part due to the decrease in quality of the embed- ment. In Matthew S Dryer and Martin Haspel, edi- dings overall that comes with stemming the corpus. tors, The world atlas of language structures online. However, it is still a clear indicator that the formal Max Planck Institute for Evolutionary Anthropol- information is a source of information for the clas- ogy. sifier. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of 6 Conclusion deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference Our exploration of grammatical gender in word of the North American Chapter of the Association for Computational Linguistics: Human Language embeddings has shown some overlap between how Technologies, Volume 1 (Long and Short Papers), grammatical gender is encoded in different lan- pages 4171–4186, Minneapolis, Minnesota. Associ- guages and that the degree of the overlap could ation for Computational Linguistics. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013. Distributed representa- tions of words and phrases and their composition- ality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Ad- vances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc. Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin- ter, Yoav Goldberg, Jan Hajic, Christopher D. Man- ning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth In- ternational Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA). Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Ken- ton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics. Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. CoRR, abs/1808.08949. Martin F. Porter. 2001. Snowball: A language for stem- ming algorithms. Published online. Hartger Veeman, Marc Allassonni`ere-Tang, Aleksan- drs Berdicevskis, and Ali Basirat. 2020. Cross- lingual embeddings reveal universal and lineage- specific patterns in grammatical gender assignment. In Proceedings of the SIGNLL Conference on Com- putational Natural Language Learning (CoNLL), Online. Association for Computational Linguistics.