Reconstructing Language Ancestry by Performing Word Prediction with Neural Networks
MSc Artificial Intelligence
Track: Natural Language Processing
Master Thesis

Reconstructing language ancestry by performing word prediction with neural networks
by Peter Dekker
10820973

Defense date: 25th January 2018
42 ECTS
January 2017 – January 2018

Supervisors: dr. Jelle Zuidema, prof. dr. Gerhard Jäger
Assessor: dr. Raquel Fernández

Institute for Logic, Language and Computation – University of Amsterdam
Seminar für Sprachwissenschaft – University of Tübingen

Contents

1 Introduction
1.1 Historical linguistics
1.1.1 Historical linguistics: the comparative method and beyond
1.1.2 Sound changes
1.1.3 Computational methods in historical linguistics
1.2 Developments in natural language processing
1.2.1 Natural language processing
1.2.2 Machine learning and language
1.2.3 Deep neural networks
1.3 Word prediction
1.3.1 Word prediction
1.3.2 Model desiderata
1.4 Summary
2 Method
2.1 Pairwise word prediction
2.1.1 Task
2.1.2 Models
2.1.3 Data
2.1.4 Experiments
2.2 Applications
2.2.1 Phylogenetic tree reconstruction
2.2.2 Sound correspondence identification
2.2.3 Cognate detection
2.3 Summary
3 Results
3.1 Word prediction
3.2 Phylogenetic tree reconstruction
3.3 Identification of sound correspondences
3.4 Cognate detection
3.5 Summary
4 Context vector analysis
4.1 Extraction of context vectors and input/target words
4.2 PCA visualization
4.3 Cluster analysis
4.3.1 Distance matrices
4.3.2 Clustering
4.4 Summary
5 Phylogenetic word prediction
5.1 Beyond pairwise word prediction
5.2 Method
5.2.1 Network architecture
5.2.2 Weight sharing and protoform inference
5.2.3 Implementation details
5.2.4 Training and prediction
5.2.5 Experiments
5.3 Results
5.4 Discussion
5.5 Summary
6 Conclusion and discussion
7 Appendix

Chapter 1
Introduction

How are the languages of the world related and how have they evolved? This is the central question in one of the oldest linguistic disciplines: historical linguistics. Recently, computational methods have been applied to aid historical linguists. Independently, in the computer science and natural language processing community, machine learning methods have become popular: by learning from data, a computer can learn to perform advanced tasks. In this thesis, I will explore the following question:

How can machine learning algorithms be used to predict words between languages and serve as a model of sound change, in order to reconstruct the ancestry of languages?

In this introductory chapter, I will first describe existing methods in historical linguistics. Subsequently, I will cover the machine learning methods used in natural language processing. Finally, I will propose word prediction as a task for applying machine learning to historical linguistics.

1.1 Historical linguistics

1.1.1 Historical linguistics: the comparative method and beyond

In historical linguistics, the task of reconstructing the ancestry of languages is generally performed using the comparative method (Clackson, 2007). The goal is to find genetic relationships between languages, excluding borrowings. Different linguistic levels, such as phonology, morphology and syntax, can be taken into account. Mostly, the phonological forms of a fixed list of basic vocabulary are taken into account, containing concepts with a low probability of being borrowed.

Durie and Ross (1996, ch. Introduction) describe the general workflow of the comparative method, as it is followed by many historical linguists.

1. Word lists are taken into account for languages which are assumed to be genetically related, based on diagnostic evidence. Diagnostic evidence can be common knowledge, e.g.
Nichols (1996) argues that, in various historical texts by Slavic speakers, the Slavic languages have long been assumed to be related. Another possibility is morphological or syntactic evidence.
2. Sets of cognates (words that are ancestrally related) are created.
3. Based on these cognates, sound correspondences are identified: which sounds consistently change between cognates in two languages?
4. The sound correspondences are used to reconstruct a common protolanguage. First, protosounds are inferred; from these, protoforms of words are reconstructed.
5. Common innovations of sounds and words between groups of languages are identified.
6. Based on the common innovations, a phylogenetic tree is reconstructed: language B is a child of language A if B has all the innovations that A has.
7. An etymological dictionary can be constructed, taking into account borrowing and semantic change.

This process can be performed in an iterative fashion: cognate sets (2) are updated by finding new sound correspondences (3), and by further steps, the set of languages taken into account (1) can be altered.

Although the comparative method is regarded as the most reliable method to establish genetic relationships between languages, less constrained methods are sometimes applied in order to look further back in time (Campbell, 2003, p. 348). One of the most notable is Greenberg's mass lexical comparison (Greenberg, 1957), in which genetic relationships are inferred from the surface forms of words across a large number of languages, instead of from sound correspondences. This approach is criticized for not distinguishing between genetically related words (cognates) and other effects which could cause similarity of words, such as chance similarity and borrowing (Campbell, 2013).

In this thesis, we will therefore take the comparative method as a starting point, and look for computational methods that can automate parts of that workflow.
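As a rough illustration of what automating one part of this workflow could look like, the Python sketch below counts sound correspondences (step 3) over a list of cognate pairs. It is not the method developed in this thesis. It assumes that cognates have already been identified and aligned symbol by symbol, which is itself a non-trivial step. The two Latin/Italian pairs serve as toy data, and the names ALIGNED_COGNATES, count_correspondences and changed_correspondences are hypothetical.

```python
from collections import Counter

# Toy data: Latin/Italian cognate pairs, assumed to be pre-aligned so that
# the i-th symbol of the source corresponds to the i-th symbol of the target.
ALIGNED_COGNATES = [
    ("nokte", "notte"),    # 'night'
    ("strata", "strada"),  # 'road'
]


def count_correspondences(pairs):
    """Count how often each (source sound, target sound) pair is aligned."""
    counts = Counter()
    for source, target in pairs:
        if len(source) != len(target):
            raise ValueError(f"pair is not aligned: {source!r} / {target!r}")
        counts.update(zip(source, target))
    return counts


def changed_correspondences(counts):
    """Keep only the correspondences in which the sound actually differs."""
    return {pair: n for pair, n in counts.items() if pair[0] != pair[1]}


if __name__ == "__main__":
    all_counts = count_correspondences(ALIGNED_COGNATES)
    for (src, tgt), n in sorted(changed_correspondences(all_counts).items()):
        print(f"{src} > {tgt}: {n}")
```

With only two word pairs, every correspondence is observed at most a few times, so nothing can be said about regularity. On a realistic word list, correspondences that recur across many cognates would be the candidates for regular sound changes.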
1.1.2 Sound changes

When comparing words in different languages, sound changes between these words are observed. Most sound changes are assumed to be regular: if a change occurs in a word in a certain context, it should also occur in the same context in other words. The Neogrammarian hypothesis of the regularity of sound change states that “sound change takes place according to laws that admit no exception” (Osthoff and Brugmann, 1880). There are, however, also cases of irregular, sporadic sound change (Durie and Ross, 1996). Regular sound change is crucial for the reconstruction of language ancestry using the comparative method.

I will now discuss possible sound changes, divided into three categories: phonemic changes, loss of segments and insertion/movement of segments. For each sound change, I will mention whether it is regular or sporadic. I will refer to the sound changes and their regularity later, when looking at different computer algorithms that should be able to model these sound changes. This is a compilation of the sound changes described in Hock and Joseph (2009), Trask (1996), Beekes (2011) and Campbell (2013).

Phonemic changes

Phonemic changes are sound changes which change the inventory of phonemes. When a phonemic change occurs, a sound changes into another sound in all words in a language; it is thus a regular change. Two general patterns of phonemic change can be distinguished: mergers and splits. A merger is a phonemic change where two distinct sounds in the phoneme inventory are merged into one sound, either existing or new. A split is a phonemic change where one phoneme splits into two phonemes. Based on these two general patterns at the level of the phoneme inventory, a number of concrete changes in word forms can occur, of which I will highlight three: assimilation, lenition and vowel changes.

Assimilation (regular)
Assimilation is the process where one sound in a word becomes more similar to another sound.
For example, in Latin nokte ‘night’ > Italian notte, /k/ changes to /t/, influenced by the neighboring /t/. Assimilation of sounds located further apart in a word is also possible. Umlaut in Germanic phonology is an example of this: the first vowel of the word changes to be more similar to the second vowel. For example, the plural of the Proto-Germanic word gast ‘guest’ became gestiz. /a/ changed to /e/ to be more similar to /i/, as both /e/ and /i/ are front vowels. This Proto-Germanic umlaut is still visible in the German plural Gast/Gäste.

Lenition (regular)
Lenition is a process which reduces articulatory effort and only affects consonants. Lenitive changes include voicing (voiceless to voiced), degemination (geminate, two of the same sounds, to simplex, one sound) and nasalization (non-nasal to nasal). An example of voicing is the change from Latin strata ‘road’ to Italian strada: a voiceless /t/ changes to a voiced /d/. Degemination can be observed in the change from Latin gutta ‘drop’ to Spanish gota. The reduction of articulatory effort can go so far that a sound is completely omitted. For example, Latin regāle > Spanish real.

Vowel changes (regular)
There are a number of changes where vowels change into other vowels, some of them caused by the aforementioned processes, like lenition. However, some vowel changes should be mentioned explicitly. In a number of changes, a vowel acquires a new phonetic feature, as in lowering, fronting and rounding. An example of fronting is Basque dut ‘I have it’ > Zuberoan dʏt, where back /u/ changes to near-front /ʏ/. Coalescence is the effect where two identical vowels change into one. Compensatory lengthening is the lengthening of a vowel after the loss of the following consonant. For example, Old French bɛst ‘beast’ > French bɛ:t.

Loss of segments

In the previous section, I discussed changes of a sound into another sound.
Now, I will cover sound changes where segments are completely lost.

Loss (regular)
There are several types of regular loss. Aphaeresis is loss at the beginning of a word, such as the loss of k in English knee.