From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings
Johannes Bjerva and Isabelle Augenstein
Department of Computer Science, University of Copenhagen, Denmark

Abstract

A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structures (WALS). Doing this manually is prohibitively time-consuming, which is in part evidenced by the fact that only 100 out of over 7,000 languages spoken in the world are fully covered in WALS.

We learn distributed language representations, which can be used to predict typological properties on a massively multilingual scale. Additionally, quantitative and qualitative analyses of these language embeddings can tell us how language similarities are encoded in NLP models for tasks at different typological levels. The representations are learned in an unsupervised manner alongside tasks at three typological levels: phonology (grapheme-to-phoneme prediction, and phoneme reconstruction), morphology (morphological inflection), and syntax (part-of-speech tagging).

We consider more than 800 languages and find significant differences in the language representations encoded, depending on the target task. For instance, although Norwegian Bokmål and Danish are typologically close to one another, they are phonologically distant, which is reflected in their language embeddings growing relatively distant in a phonological task. We are also able to predict typological features in WALS with high accuracies, even for unseen language families.

1 Introduction

For more than two and a half centuries, linguistic typologists have studied languages with respect to their structural and functional properties (Haspelmath, 2001; Velupillai, 2012). Although typology has a long history (Herder, 1772; Gabelentz, 1891; Greenberg, 1960, 1974; Dahl, 1985; Comrie, 1989; Haspelmath, 2001; Croft, 2002), computational approaches have only recently gained popularity (Dunn et al., 2011; Wälchli, 2014; Östling, 2015; Cotterell and Eisner, 2017; Asgari and Schütze, 2017; Malaviya et al., 2017; Bjerva and Augenstein, 2018). One part of traditional typological research can be seen as assigning sparse explicit feature vectors to languages, for instance manually encoded in databases such as the World Atlas of Language Structures (WALS, Dryer and Haspelmath, 2013). A recent development which can be seen as analogous to this is the process of learning distributed language representations in the form of dense real-valued vectors, often referred to as language embeddings (Tsvetkov et al., 2016; Östling and Tiedemann, 2017; Malaviya et al., 2017). We hypothesise that these language embeddings encode typological properties of language, reminiscent of the features in WALS, or even of parameters in Chomsky's Principles and Parameters framework (Chomsky, 1993).

In this paper, we model languages in deep neural networks using language embeddings, considering three typological levels: phonology, morphology and syntax. We consider four NLP tasks to be representative of these levels: grapheme-to-phoneme prediction and phoneme reconstruction, morphological inflection, and part-of-speech tagging. We pose three research questions (RQs):

RQ 1 Which typological properties are encoded in task-specific distributed language representations, and can we predict phonological, morphological and syntactic properties of languages using such representations?

RQ 2 To what extent do the encoded properties change as the representations are fine-tuned for tasks at different linguistic levels?

RQ 3 How are language similarities encoded in fine-tuned language embeddings?

One of our key findings is that language representations differ considerably depending on the target task. For instance, for grapheme-to-phoneme mapping, the differences between the representations for Norwegian Bokmål and Danish increase rapidly during training. This is due to the fact that, although the languages are typologically close to one another, they are phonologically distant.
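This kind of divergence can be made concrete by tracking the distance between two language embeddings over training. The snippet below is a minimal sketch and not the paper's code: the vectors are random placeholders and the 64-dimensional size is only an assumed example.

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine distance between two language embedding vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder 64-dimensional language embeddings for Norwegian Bokmål ("nob")
# and Danish ("dan"), e.g. as saved after an epoch of a phonological task.
rng = np.random.default_rng(0)
emb_nob = rng.normal(size=64)
emb_dan = rng.normal(size=64)

# Recomputing this distance over successive checkpoints would show whether the
# two languages drift apart, as described for grapheme-to-phoneme prediction.
print(f"cosine distance(nob, dan) = {cosine_distance(emb_nob, emb_dan):.3f}")
```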
2 Related work

Computational linguistics approaches to typology are now possible on a larger scale than ever before due to advances in neural computational models. Even so, recent work only deals with fragments of typology compared to what we consider here.

Computational typology has to a large extent focused on exploiting word or morpheme alignments on the massively parallel New Testament, in approximately 1,000 languages, in order to infer word order (Östling, 2015) or assign linguistic categories (Asgari and Schütze, 2017). Wälchli (2014) similarly extracts lexical and grammatical markers using New Testament data. Other work has taken a computational perspective on language evolution (Dunn et al., 2011), and phonology (Cotterell and Eisner, 2017; Alishahi et al., 2017).

Language embeddings In this paper, we follow an approach which has seen some attention in the past year, namely the use of distributed language representations, or language embeddings. Some typological experiments are carried out by Östling and Tiedemann (2017), who learn language embeddings via multilingual language modelling and show that they can be used to reconstruct genealogical trees. Malaviya et al. (2017) learn language embeddings via neural machine translation, and predict syntactic, morphological, and phonetic features.

Contributions Our work bears the most resemblance to Bjerva and Augenstein (2018), who fine-tune language embeddings on the task of PoS tagging, and investigate how a handful of typological properties are coded in these for four Uralic languages. We expand on this and thus contribute to previous work by: (i) introducing novel qualitative investigations of language embeddings, in addition to thorough quantitative evaluations; (ii) considering four tasks at three different typological levels; (iii) considering a far larger sample of several hundred languages; and (iv) grounding the language representations in linguistic theory.

3 Background

3.1 Distributed language representations

There are several methods for obtaining distributed language representations by training a recurrent neural language model (Mikolov et al., 2010) simultaneously for different languages (Tsvetkov et al., 2016; Östling and Tiedemann, 2017). In these recurrent multilingual language models with long short-term memory cells (LSTM, Hochreiter and Schmidhuber, 1997), languages are embedded into an n-dimensional space. In order for multilingual parameter sharing to be successful in this setting, the neural network is encouraged to use the language embeddings to encode features of language. In this paper, we explore the embeddings trained by Östling and Tiedemann (2017), both in their original state, and by further tuning them for our four tasks. These embeddings are obtained by training a multilingual language model with language representations on a collection of texts from the New Testament, covering 975 languages. While other work has looked at the types of representations encoded in different layers of deep neural models (Kádár et al., 2017), we choose to look at the representations only in the bottom-most embedding layer. This is motivated by the fact that we look at several tasks using different neural architectures, and want to ensure comparability between these.
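The following is a minimal PyTorch sketch of the general architecture described above: a multilingual LSTM language model in which a learned language embedding is concatenated to every input symbol embedding. The class name, dimensionalities, and other details are illustrative assumptions, not the implementation of Östling and Tiedemann (2017).

```python
import torch
import torch.nn as nn

class MultilingualLM(nn.Module):
    """Minimal multilingual LSTM language model with language embeddings.

    Each input symbol embedding is concatenated with a learned embedding of the
    language the sequence comes from, so the LSTM shares parameters across
    languages while the language vector absorbs language-specific information.
    """
    def __init__(self, vocab_size, n_languages, sym_dim=128, lang_dim=64, hidden=512):
        super().__init__()
        self.sym_emb = nn.Embedding(vocab_size, sym_dim)
        self.lang_emb = nn.Embedding(n_languages, lang_dim)  # one vector per language
        self.lstm = nn.LSTM(sym_dim + lang_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, symbols, lang_ids):
        # symbols: (batch, seq_len) symbol ids; lang_ids: (batch,) language ids
        syms = self.sym_emb(symbols)                                # (batch, seq, sym_dim)
        langs = self.lang_emb(lang_ids)                             # (batch, lang_dim)
        langs = langs.unsqueeze(1).expand(-1, symbols.size(1), -1)  # repeat over time steps
        states, _ = self.lstm(torch.cat([syms, langs], dim=-1))
        return self.out(states)                                     # next-symbol logits
```

After training such a model, the rows of `lang_emb.weight` are the language embeddings that can be inspected, compared across languages, or fine-tuned for downstream tasks.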
3.1.1 Language embeddings as continuous Chomskyan parameter vectors

We now turn to the theoretical motivation of the language representations. The field of NLP is littered with distributional word representations, which are theoretically justified by distributional semantics (Harris, 1954; Firth, 1957), summarised in the catchy phrase You shall know a word by the company it keeps (Firth, 1957). We argue that language embeddings, or distributed representations of language, can also be theoretically motivated by Chomsky's Principles and Parameters framework (Chomsky and Lasnik, 1993; Chomsky, 1993, 2014). Language embeddings encode languages as dense real-valued vectors, in which the dimensions are reminiscent of the parameters found in this framework. Briefly put, Chomsky argues that languages can be described in terms of principles (abstract rules) and parameters (switches) which can be turned either on or off for a given language (Chomsky and Lasnik, 1993). An example of such a switch might represent the positioning of the head of a clause (i.e. either head-initial or head-final). For English, this switch would be set to the 'initial' state, whereas for Japanese it would be set to the 'final' state. Each dimension in an n-dimensional language embedding might also describe such a switch, albeit in a more continuous fashion. The number of dimensions

are typically found in several African languages as well as languages in South-East Asia.

Morphological features cover a total of 41 features. We consider the features included in the Morphology chapter as well as those included in the Nominal Categories chapter to be morphological in nature. This includes features such as the number of genders, usage of definite and indefinite articles and reduplication. As an example, consider WALS feature 37A (Definite Articles). This
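A natural way to combine such WALS features with language embeddings is to treat each feature as a classification target over languages. Below is a minimal, hypothetical sketch (not the paper's experimental setup): the file names, the -1 missing-value convention, and the choice of logistic regression are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: one language embedding per row, and the WALS 37A
# (Definite Articles) category for each language, with -1 marking languages
# whose value is not recorded in WALS (an assumed convention).
lang_embeddings = np.load("language_embeddings.npy")  # shape: (n_languages, dim)
wals_37a = np.load("wals_37a_labels.npy")             # shape: (n_languages,)

# Fit a simple classifier on languages with a known feature value.
known = wals_37a != -1
classifier = LogisticRegression(max_iter=1000)
classifier.fit(lang_embeddings[known], wals_37a[known])

# Predict the feature value for languages not (yet) covered in WALS.
predicted = classifier.predict(lang_embeddings[~known])
```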