Learning Language Representations for Typology Prediction
Chaitanya Malaviya and Graham Neubig and Patrick Littell
Language Technologies Institute
Carnegie Mellon University
{cmalaviy,gneubig,pwl}@cs.cmu.edu

Abstract

One central mystery of neural NLP is what neural models "know" about their subject matter. When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages? Can this knowledge be extracted from the system to fill holes in human scientific knowledge? Existing typological databases contain relatively full feature specifications for only a few hundred languages. Exploiting the existence of parallel texts in more than a thousand languages, we build a massive many-to-one neural machine translation (NMT) system from 1017 languages into English, and use this to predict information missing from typological databases. Experiments show that the proposed method is able to infer not only syntactic, but also phonological and phonetic inventory features, and improves over a baseline that has access to information about the languages' geographic and phylogenetic neighbors.¹

¹ Code and learned vectors are available at http://github.com/chaitanyamalaviya/lang-reps

[Figure 1: Learning representations from multilingual neural MT for typology classification. (Model MTBOTH)]

1 Introduction

Linguistic typology is the classification of human languages according to syntactic, phonological, and other classes of features, and the investigation of the relationships and correlations between these classes/features. This study has been a scientific pursuit in its own right since the 19th century (Greenberg, 1963; Comrie, 1989; Nichols, 1992), but recently typology has borne practical fruit within various subfields of NLP, particularly on problems involving lower-resource languages.

Typological information from sources like the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) has proven useful in many NLP tasks (O'Horan et al., 2016), such as multilingual dependency parsing (Ammar et al., 2016), generative parsing in low-resource settings (Naseem et al., 2012; Täckström et al., 2013), phonological language modeling and loanword prediction (Tsvetkov et al., 2016), POS tagging (Zhang et al., 2012), and machine translation (Daiber et al., 2016).

However, the needs of NLP tasks differ in many ways from the needs of scientific typology, and typological databases are often only sparsely populated, by necessity or by design.² In NLP, on the other hand, what is important is having a relatively full set of features for the particular group of languages one is working on. This mismatch of needs has motivated various proposals to reconstruct missing entries, in WALS and other databases, from known entries (Daumé III and Campbell, 2007; Daumé III, 2009; Coke et al., 2016; Littell et al., 2017).

² For example, each chapter of WALS aims to provide a statistically balanced set of languages over language families and geographical areas, and so many languages are left out in order to maintain balance.

In this study, we examine whether we can tackle the problem of inferring linguistic typology from parallel corpora, specifically by training a massively multilingual neural machine translation (NMT) system and using the learned representations to infer typological features for each language. This is motivated by prior work in linguistics (Bugarski, 1991; García, 2002) demonstrating strong links between translation studies and tools for contrastive linguistic analysis, by work on inferring typology from bilingual data (Östling, 2015) and English-as-a-second-language texts (Berzak et al., 2014), and by work in NLP (Shi et al., 2016; Kuncoro et al., 2017; Belinkov et al., 2017) showing that syntactic knowledge can be extracted from neural nets at the word-by-word or sentence-by-sentence level. This work presents a more holistic analysis of whether we can discover what neural networks learn about the linguistic concepts of an entire language by aggregating their representations over a large number of sentences in that language.

We examine several methods for discovering feature vectors for typology prediction, including those that learn a language vector specifying the language while training multilingual neural language models (Östling and Tiedemann, 2017) or neural machine translation systems (Johnson et al., 2016). We further propose a novel method for aggregating the values of the latent states of the encoder neural network into a single vector representing the entire language. We calculate these feature vectors using an NMT model trained on 1017 languages, and use them for typology prediction both on their own and in combination with feature vectors from previous work based on the genetic and geographic distance between languages (Littell et al., 2017). Results show that the extracted representations do in fact allow us to learn about the typology of languages, with particular gains for syntactic features like word order and the presence of case markers.

2 Dataset and Experimental Setup

Typology Database: To perform our analysis, we use the URIEL language typology database (Littell et al., 2017), a collection of binary features extracted from multiple typological, phylogenetic, and geographical databases such as WALS (World Atlas of Language Structures; Dryer and Haspelmath, 2013), SSWL (Collins and Kayne, 2011), PHOIBLE (Moran et al., 2014), Ethnologue (Lewis et al., 2015), and Glottolog (Hammarström et al., 2015). These features are divided into separate classes covering syntax (e.g. whether a language has prepositions or postpositions), phonology (e.g. whether a language has complex syllabic onset clusters), and phonetic inventory (e.g. whether a language has interdental fricatives). There are 103 syntactic features, 28 phonology features, and 158 phonetic inventory features in the database.

Baseline Feature Vectors: Several previous methods take advantage of typological implicature, the fact that some typological traits correlate strongly with others, to use the known features of a language to help infer its unknown features (Daumé III and Campbell, 2007; Takamura et al., 2016; Coke et al., 2016). As an alternative that does not necessarily require pre-existing knowledge of the typological features of the language at hand, Littell et al. (2017) proposed a method for inferring typological features directly from the language's k nearest neighbors (k-NN) according to geodesic distance (distance on the Earth's surface) and genetic distance (distance according to a phylogenetic family tree). In our experiments, our baseline uses this method by taking the 3-NN for each language according to normalized geodesic+genetic distance and calculating the average feature vector of these three neighbors.
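A minimal sketch of this baseline follows, assuming two hypothetical (n_languages × n_languages) distance matrices geodesic and genetic and an (n_languages × n_features) binary feature matrix features; these names and this data layout are illustrative, not URIEL's actual interface:

import numpy as np

def knn_baseline(geodesic, genetic, features, k=3):
    """Average the feature vectors of each language's k nearest
    neighbors under a normalized geodesic+genetic distance."""
    # Normalize each distance matrix so the two scales are comparable.
    dist = geodesic / geodesic.max() + genetic / genetic.max()
    np.fill_diagonal(dist, np.inf)         # a language is never its own neighbor
    baseline = np.empty(features.shape, dtype=float)
    for i in range(dist.shape[0]):
        nearest = np.argsort(dist[i])[:k]  # indices of the k nearest languages
        baseline[i] = features[nearest].mean(axis=0)
    return baseline

Used on their own, these averaged neighbor vectors correspond to the "NONE +Aux" 3-NN classification in Table 1; used as additional inputs, they are the "+Aux" auxiliary features.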
Typology Prediction: To perform prediction, we trained a logistic regression classifier³ on the baseline k-NN feature vectors described above and on the proposed NMT feature vectors described in the next section. We train an individual classifier for each typological feature in a class (syntax, etc.), and we perform 10-fold cross-validation over the URIEL database: in each fold we train on 9/10 of the languages and predict the features of the remaining 1/10.

³ We experimented with a non-linear classifier as well, but the logistic regression classifier performed better.

3 Learning Representations for Typology Prediction

In this section we describe three methods for learning representations for typology prediction with multilingual neural models.

LM Language Vector: Several methods have been proposed to learn multilingual language models (LMs) that utilize vector representations of languages (Tsvetkov et al., 2016; Östling and Tiedemann, 2017). Specifically, these models train a recurrent neural network LM (RNNLM; Mikolov et al., 2010) with long short-term memory units (LSTM; Hochreiter and Schmidhuber, 1997), taking an additional vector that represents the current language as input. The expectation is that this vector will capture the features of the language and improve LM accuracy. Östling and Tiedemann (2017) noted that, intriguingly, agglomerative clustering of these language vectors results in something that looks roughly like a phylogenetic tree, but stopped short of performing typological inference. We train this vector by appending a special token representing the source language (e.g. "⟨fra⟩" for French) to the beginning of the source sentence, as shown in Fig. 1, then using the word representation learned for this token as a representation of the language. We call this first set of feature vectors LMVEC, and examine their utility for typology prediction.
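The token trick amounts to the sketch below, where vocab (a token-to-row-index mapping) and embeddings (the trained embedding matrix) are hypothetical stand-ins for what such a model exposes after training:

def add_language_token(tokens, lang_code):
    # e.g. add_language_token(["je", "mange"], "fra") -> ["<fra>", "je", "mange"]
    return ["<%s>" % lang_code] + tokens

def language_vector(lang_code, vocab, embeddings):
    # After training, a language's representation is simply the embedding
    # row learned for its special token.
    return embeddings[vocab["<%s>" % lang_code]]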
           Syntax          Phonology       Inventory
           -Aux   +Aux     -Aux   +Aux     -Aux   +Aux
NONE       69.91  83.07    77.92  86.59    85.17  90.68
LMVEC      71.32  82.94    80.80  86.74    87.51  89.94
MTVEC      74.90  83.31    82.41  87.64    89.62  90.94
MTCELL     75.91  85.14    84.33  88.80    90.01  90.85
MTBOTH     77.11  86.33    85.77  89.04    90.06  91.03

Table 1: Accuracy on syntactic, phonological, and inventory features using LM language vectors (LMVEC), MT language vectors (MTVEC), MT encoder cell averages (MTCELL), or both MT feature vectors (MTBOTH). "Aux" indicates auxiliary information from geodesic/genetic nearest neighbors; "NONE -Aux" is the majority-class chance rate, while "NONE +Aux" is a 3-NN classification.

NMT Language Vector: In our second set of experiments, we learn the same kind of language-token vector, but within the many-to-one NMT system rather than a multilingual LM; we call these vectors MTVEC.

NMT Encoder Mean Cell States: Motivated by findings that LSTM cell states can capture interpretable properties such as sentiment (Radford et al., 2017), we expect that the cell states will represent features that may be linked to the typology of the language. To create vectors for each language using the LSTM hidden states, we take the mean of the cell states (c in the standard LSTM equations) over all time steps of all sentences in each language.
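Concretely, this aggregation reduces to the following sketch, assuming a hypothetical encode(sentence) that runs the trained encoder over one sentence and returns its (timesteps × hidden_dim) array of cell states c:

def mean_cell_state(sentences, encode):
    """Mean of the encoder's LSTM cell states over every time step of
    every sentence in one language."""
    total, count = None, 0
    for sentence in sentences:
        states = encode(sentence)     # (timesteps, hidden_dim) cell states
        step_sum = states.sum(axis=0)
        total = step_sum if total is None else total + step_sum
        count += states.shape[0]
    return total / count              # one hidden_dim-sized vector for the language

This single averaged vector is the MTCELL representation of Table 1; using both it and MTVEC together gives the MTBOTH setting.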