MSc Artificial Intelligence Track: Natural Language Processing

Master Thesis

Reconstructing language ancestry by performing word prediction with neural networks

by Peter Dekker 10820973

Defense date: 25th January 2018

42 ECTS January 2017 – January 2018

Supervisors: dr. Jelle Zuidema, prof. dr. Gerhard Jäger
Assessor: dr. Raquel Fernández

Institute for Logic, Language and Computation – University of Amsterdam
Seminar für Sprachwissenschaft – University of Tübingen

Contents

1 Introduction
  1.1 Historical linguistics
    1.1.1 Historical linguistics: the comparative method and beyond
    1.1.2 Sound changes
    1.1.3 Computational methods in historical linguistics
  1.2 Developments in natural language processing
    1.2.1 Natural language processing
    1.2.2 Machine learning and language
    1.2.3 Deep neural networks
  1.3 Word prediction
    1.3.1 Word prediction
    1.3.2 Model desiderata
  1.4 Summary

2 Method
  2.1 Pairwise word prediction
    2.1.1 Task
    2.1.2 Models
    2.1.3 Data
    2.1.4 Experiments
  2.2 Applications
    2.2.1 Phylogenetic tree reconstruction
    2.2.2 Sound correspondence identification
    2.2.3 Cognate detection
  2.3 Summary

3 Results
  3.1 Word prediction
  3.2 Phylogenetic tree reconstruction
  3.3 Identification of sound correspondences
  3.4 Cognate detection
  3.5 Summary

4 Context vector analysis
  4.1 Extraction of context vectors and input/target words
  4.2 PCA visualization
  4.3 Cluster analysis
    4.3.1 Distance matrices
    4.3.2 Clustering
  4.4 Summary

5 Phylogenetic word prediction
  5.1 Beyond pairwise word prediction
  5.2 Method
    5.2.1 Network architecture
    5.2.2 Weight sharing and protoform inference
    5.2.3 Implementation details
    5.2.4 Training and prediction
    5.2.5 Experiments
  5.3 Results
  5.4 Discussion
  5.5 Summary

6 Conclusion and discussion

7 Appendix

Chapter 1

Introduction

How are the languages of the world related and how have they evolved? This is the central question in one of the oldest linguistic disciplines: historical linguistics. Recently, computational methods have been applied to aid historical linguists. Independently, in the computer science and natural language processing community, machine learning methods have become popular: by learning from data, computers can learn to perform advanced tasks. In this thesis, I will explore the following question: How can machine learning algorithms be used to predict words between languages and serve as a model of sound change, in order to reconstruct the ancestry of languages? In this introductory chapter, I will first describe existing methods in historical linguistics. Subsequently, I will cover the machine learning methods used in natural language processing. Finally, I will propose word prediction as a task for applying machine learning to historical linguistics.

1.1 Historical linguistics

1.1.1 Historical linguistics: the comparative method and beyond

In historical linguistics, the task of reconstructing the ancestry of languages is generally performed using the comparative method (Clackson, 2007). The goal is to find genetic relationships between languages, excluding borrowings. Different linguistic levels, like phonology, morphology and syntax, can be taken into account. Mostly, the phonological forms of a fixed list of basic vocabulary are taken into account, containing concepts with a low probability of being borrowed. Durie and Ross (1996, ch. Introduction) describe the general workflow of the comparative method, as it is followed by many historical linguists.

1. Word lists are taken into account for languages which are assumed to be genetically related, based on diagnostic evidence. Diagnostic evidence can be common knowledge, e.g. Nichols (1996) argues that in different historical texts by Slavic speakers, the Slavic languages have already been assumed to be related for ages. Another possibility is morphological or syntactic evidence.

2. Sets of cognates (words that are ancestrally related) are created.

3. Based on these cognates, sound correspondences are identified: which sounds always change between cognates in the two languages?

4. The sound correspondences are used to reconstruct a common protolanguage. First, protosounds are inferred; from these, protoforms of words are reconstructed.


5. Common innovations of sounds and words between groups of languages are identified.

6. Based on the common innovations, a phylogenetic tree is reconstructed: language B is a child of language A, if B has all the innovations that A has.

7. An etymological dictionary can be constructed, taking into account borrowing and semantic change.

This process can be performed in an iterative fashion: cognate sets (2) are updated by finding new sound correspondences (3), and through further steps, the set of languages taken into account (1) can be altered. Although the comparative method is regarded as the most reliable method to establish genetic relationships between languages, sometimes less constrained methods are applied, to be able to look further back in time (Campbell, 2003, p. 348). One of the most notable is Greenberg's mass lexical comparison (Greenberg, 1957), in which genetic relationships are inferred from the surface forms of words, instead of sound correspondences, across a large number of languages. This workflow is criticized for not distinguishing between genetically related words (cognates) and other effects which could cause similarity of words, such as chance similarity and borrowing (Campbell, 2013). In this thesis, we will therefore take the comparative method as starting point, and look for computational methods that can automate parts of that workflow.

1.1.2 Sound changes

When comparing words in different languages, sound changes between these words are observed. Most sound changes are assumed to be regular: if a change occurs in a certain context in one word, it should also occur in the same context in other words. The Neogrammarian hypothesis of the regularity of sound change states that "sound change takes place according to laws that admit no exception" (Osthoff and Brugmann, 1880). There are, however, also cases of irregular, sporadic sound change (Durie and Ross, 1996). Regular sound change is crucial for the reconstruction of language ancestry using the comparative method. I will now discuss possible sound changes, divided into three categories: phonemic changes, loss of segments and insertion/movement of segments. For each sound change, I will mention whether it is regular or sporadic. I will refer to the sound changes and their regularity later when looking at different computer algorithms that should be able to model these sound changes. This is a compilation of the sound changes described in Hock and Joseph (2009), Trask (1996), Beekes (2011) and Campbell (2013).

Phonemic changes

Phonemic changes are sound changes which change the inventory of phonemes. When a phonemic change occurs, a sound changes into another sound in all words in a language; this is thus a regular change. Two general patterns of phonemic change can be distinguished: mergers and splits. A merger is a phonemic change where two distinct sounds in the phoneme inventory are merged into one, existing or new, sound. A split is a phonemic change where one phoneme splits into two phonemes. Based on these two general patterns on the phoneme inventory level, a number of concrete changes in word forms can occur, of which I will highlight a few: assimilation, lenition and vowel changes.

Assimilation (regular) Assimilation is the process where one sound in a word becomes more similar to another sound. For example, in Latin nokte 'night' > Italian notte, /k/ changes to /t/, influenced by the neighboring /t/. Assimilation of sounds more distantly located in a word is also possible. Umlaut in Germanic phonology is an example of this: the first vowel of the word changes to be more similar to the second vowel. For example, the plural of the Proto-Germanic word gast 'guest' became gestiz. /a/ changed to /e/ to be more similar to /i/, as both /e/ and /i/ are front vowels. This Proto-Germanic umlaut is still visible in the German plural Gast/Gäste.

Lenition (regular) Lenition is a process which reduces articulatory effort and only affects consonants. Lenitive changes include voicing (voiceless to voiced), degemination (geminate, two of the same sounds, to simplex, one sound) and nasalization (non-nasal to nasal). An example of voicing is the change from Latin strata 'road' to Italian strada: a voiceless /t/ changes to a voiced /d/. Degemination can be observed in the change from Latin gutta 'drop' to Spanish gota. The reduction of articulatory effort can go so far that a sound is completely omitted, for example Latin regāle > Spanish real.

Vowel changes (regular) There are a number of changes where vowels change into other vowels, some of them caused by the aforementioned processes, like lenition. However, some vowel changes should be mentioned explicitly. In several changes, a vowel acquires a new phonetic feature, like lowering, fronting or rounding. An example of fronting is Basque dut 'I have it' > Zuberoan dʏt, where back /u/ changes to near-front /ʏ/. Coalescence is the effect where two identical vowels change into one. Compensatory lengthening is the lengthening of a vowel after the loss of the following consonant, for example Old French bɛst 'beast' > French bɛ:t.

Loss of segments

In the previous section, I discussed changes of a sound into another sound. Now, I will cover sound changes where segments are completely lost.

Loss (regular) There are several types of regular loss. Apheresis is loss at the beginning of a word, as the loss of k in English knee. Loss at the end of a word is called apocope. When a sound in the middle is omitted, this is called syncope, as in English chocolate.

Haplology (sporadic) When two similar segments occur in a word, haplology is the process which removes one of these segments. E.g. combining Basque sagar 'apple' with ardo 'wine' gives sagardo 'cider', dropping an ar segment.

Insertion/movement of segments

Next to change and loss of segments, segments can also be inserted or moved in a word.

Insertion (regular) New sounds can be inserted at different places in a word. Insertion at the start of the word is called prothesis, e.g. Latin scala 'ladder' > Spanish escala. Insertion of a sound in the middle is called epenthesis, illustrated by the development of Latin poclum 'goblet' > poculum. Addition of a sound to the end of a word is excrescence, e.g. Middle English amonges > English amongst.

Metathesis (sporadic) Metathesis is a sound change which alters the order of sounds in a word. E.g. Old English wæps > English wasp and Latin parabola 'word' > Spanish palabra.

1.1.3 Computational methods in historical linguistics

Computational methods are applied to automate parts of the workflow in historical linguistics. Reasons to apply computational methods include speeding up the process and having a process that follows more formal guidelines, instead of expert intuition (Jäger and Sofroniev, 2016). Approaches which received a lot of attention were Gray and Atkinson (2003), which estimated the age of the Indo-European languages, and Bouckaert et al. (2012), which located the Indo-European homeland in Anatolia. Steps in the workflow of the comparative method for which automatic methods are available include the detection of cognates (2), the detection of sound correspondences (3), reconstruction of protoforms (4) and reconstruction of phylogenetic trees (6). Some computational methods stay conceptually closer to the comparative method than others. List (2012) distinguishes between computational methods which act based on genotypic similarity and methods based on phenotypic similarity. Genotypic methods compare languages based on the language-specific regular sound correspondences that can be established between the languages. Phenotypic methods compare languages based on the surface forms of words. When using surface forms, it is harder to detect ancestral relatedness of words which underwent much phonetic change and it is more challenging to detect borrowings. I will now describe computational methods for different tasks in historical linguistics. I will refer to these tasks later, when I introduce a model that can be applied to a number of these tasks.

Cognate detection

In cognate detection, the task is to detect ancestrally related words (cognates) in different languages. Inkpen et al. (2005) apply different orthographic features (number of common n-grams, normalized edit distance, common prefix, longest common subsequence, etc.) for the task of cognate detection. This gives good results, even when not training a model, but just applying the features as a similarity measure. List (2012) places phonetic strings of words into sound classes. Then, a matrix of language-pair dependent scores for sound correspondences is extracted. Based on this matrix, distances are assigned to cognate candidates. Finally, they are clustered into cognate classes. Jäger and Sofroniev (2016) compute pairwise probabilities of cognacy of words using an SVM classifier. PMI-based features are used to compare phonetic strings. Probabilities are then converted to distances and words are clustered into cognate clusters. Rama (2016b) applies a siamese convolutional neural network (CNN), trained on pairs of possible cognates. A CNN is a machine learning model that uses a sliding filter over the input to compute the output. A siamese CNN runs two parallel versions of the network, one for each input word, but shares the weights. After the two parallel networks, a layer calculating the absolute distance between the outputs of the two networks is applied and the network outputs a cognacy decision.

Sound correspondence detection

Sound correspondence detection concerns the identification of regular sound correspondences: a sound in one language that always changes into another sound in a different language, given the same context. Kondrak (2002) treats sound correspondences in the same way as translation equivalence is treated in bilingual corpora in machine translation. Both should occur regularly: if they occur in one place, they should also occur in another place. The alignment links that are made between words in machine translation are now made between phonemes. The benefit of this method is that it gives an explicit list of sound correspondences and is also suitable for cognate detection. Hruschka et al. (2015) create a phylogeny of languages using Bayesian MCMC, while at the same time giving a probabilistic description of regular sound correspondences.

Protoform reconstruction

In protoform reconstruction, word forms for an ancestor of known current languages are reconstructed. Bouchard-Côté et al. (2013) perform protoform reconstruction by directly comparing phonetic strings, without manual cognate judgments. Probabilistic string transducers model the sound changes, taking context into account. A tree is postulated, and in an iterative process, candidate protoforms are generated. Parameters are estimated using Expectation Maximization. This approach works for both protolanguage reconstruction and cognate detection. Furthermore, the results support the functional load hypothesis of language change, which states that sounds that are used to distinguish words have a lower probability of changing.

Phylogenetic tree reconstruction

The reconstruction of a phylogenetic, ancestral, tree of languages can be performed using different types of input data. Depending on the type of data, different methods are applied. When a distance matrix between languages is used, distance-based methods are applied. A different type of data is character data, where every language is represented by a string of features, e.g. cognate judgments. In the case of character data, maximum parsimony or likelihood-based methods are used.

Distance-based methods UPGMA (Sokal and Michener, 1958) and neighbor joining (Saitou and Nei, 1987) are methods which hierarchically cluster entities based on their pairwise distances. At every step, the UPGMA method joins the two clusters which are closest to each other. UPGMA implicitly assumes a molecular clock (a term which originates from phylogenetic models in bioinformatics): the rate of change through time is constant. The neighbor joining method uses a Q matrix at every step, in which the distance of a language to a newly created node is based on the distances to all other languages. Neighbor joining does not assume a molecular clock, so different branches can evolve at different rates. An example of using a distance-based algorithm for tree reconstruction is Jäger (2015). String similarities between alignments of words are directly used as distances between the languages. This enables the use of data without cognate judgments, which is more widely available. Taking into account surface forms, instead of cognates and sound correspondences, resembles Greenberg's mass lexical comparison, described in 1.1.1. The concerns raised against mass lexical comparison are accommodated by removing words which have a high probability of being a loanword or occurring by chance. The resulting distances are fed as input to a distance-based clustering algorithm, the greedy minimum evolution algorithm, to construct a phylogenetic tree.

Character-based methods Now I will describe methods which operate on character data, where every language is represented by a string of features. Two types of these character-based methods are maximum parsimony methods and likelihood-based methods. Maximum parsimony methods try to create a tree by using a minimum number of evolutionary changes to explain the character data, in most cases cognate judgments. One of the problems of this approach is long branch attraction: long branches, branches with a lot of change, tend to be falsely clustered together. Likelihood-based methods solve this problem by looking at the likelihood: the probability of the data, given a certain tree. Evaluating the likelihood for all possible trees is computationally infeasible. Maximizing the likelihood can be efficiently performed using Bayesian Markov Chain Monte Carlo (Bayesian MCMC) methods. These methods randomly sample from the space of possible trees, in order to find the tree with the highest likelihood (Dunn, 2015). Applications of Bayesian MCMC methods for phylogenetic tree reconstruction include Gray and Atkinson (2003), Bouckaert et al. (2012) and Chang et al. (2015).

1.2 Developments in natural language processing

In the previous section, I have looked at historical linguistics, the comparative method, and computational automation of this method. Now, I will look at a different field, natural language processing, which has a different research goal. It does, however, provide techniques to learn from large amounts of data, which I will try to apply to research in historical linguistics.

1.2.1 Natural language processing

The field of natural language processing (NLP) is involved with "getting computers to perform useful tasks involving human language, tasks like enabling human-machine communication, improving human-human communication, or simply doing useful processing of text or speech" (Jurafsky, 2000, p. 35). Contrary to linguistics, the main objective of NLP is to perform practical tasks involving language; getting a better understanding of language as a system is only a secondary goal. However, the methods employed in natural language processing can be useful for problems in linguistics. Tasks in natural language processing include the syntactic parsing of sentences, sentiment analysis, machine translation and the creation of dialogue systems. Several approaches exist in natural language processing, including the logical and the statistical approach. In recent years, the statistical approach, using machine learning methods, has shown success in performing a range of tasks.

1.2.2 Machine learning and language

Machine learning "is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories" (Bishop, 2006). In a supervised setting, this is performed by learning from training examples (x, y) during the training phase. During the prediction phase, the algorithm is presented with test examples x, without a label y. The goal of the algorithm is to predict correct labels y* for the test examples. The algorithm is able to do this by its ability to generalize over the training examples. Language is a sequential phenomenon: when a speaker performs an utterance, this does not happen at one moment in time, but stretches out over time. Furthermore, there are dependencies between the linguistic items at different time steps. For example, in many languages, a speaker has a high probability of using a vowel after two consonants have occurred. It is also likely that a determiner will be followed by a noun or an adjective. When using machine learning methods for prediction tasks in language, this sequential nature can be exploited. Instead of predicting the linguistic items at different time steps independently, the dependencies between the items at different time steps can be taken into account. To this end, sequential (Dietterich, 2002) or, more generally, structured prediction methods (Daume and Marcu, 2006) are employed. Examples of sequential and structured methods used in NLP are Hidden Markov Models (HMMs) (Baum and Petrie, 1966), Probabilistic Context Free Grammars (PCFGs) (Baker, 1979; Lari and Young, 1990), Conditional Random Fields (CRFs) (Lafferty et al., 2001) and structured perceptrons (Collins, 2002).

1.2.3 Deep neural networks

Neural networks are a class of machine learning algorithms which consist of nodes ordered in layers. Between the different layers, non-linear functions are applied, allowing the network to learn very complex patterns. The architecture of neural networks is loosely inspired by the structure of the brain. The networks are therefore sometimes proposed as cognitive models. However, in many cases, biological plausibility is not claimed, and a neural network is just used to solve a certain machine learning problem.

In recent years, with the availability of more computing power and large amounts of data, deep learning has seen its advent. By employing deep neural networks with many layers and feeding a high volume of data, representation learning becomes possible. In traditional machine learning, feature engineering takes place: the data is structured into a representation from which the machine learning algorithm can learn. In deep learning, data is fed in as raw as possible; the successive layers of the network are able to learn the right representation of the data (Goodfellow et al., 2016). Deep learning methods have been applied to areas as diverse as pedestrian detection (Sermanet et al., 2013), exploration of new molecules for drugs (Dahl et al., 2014) and analysis of medical images (Avendi et al., 2016). As described in section 1.2.2, machine learning methods have to be adjusted to the sequential nature of language. Recurrent neural networks (RNNs) are neural networks designed for sequential input (Rumelhart et al., 1986). When feeding the input at a certain timestep forward through the network, the input from the previous timestep is also taken into account. In order to accommodate long-distance dependencies well, modified network units were developed: the Long Short Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014). LSTM and GRU networks have been successful in natural language processing tasks, such as language modelling and machine translation. In machine translation, encoder-decoder approaches have been adopted: one recurrent network (encoder) encodes the input into a representation, which is decoded by another recurrent network (decoder) (Sutskever et al., 2014; Cho et al., 2014).

1.3 Word prediction

1.3.1 Word prediction

In section 1.1, I have given an overview of the challenges in historical linguistics and the efforts to automate these tasks. In section 1.2, I showed the recent successes of machine learning methods, and specifically deep neural networks, in natural language processing. In this thesis, I propose the task of word prediction, phrasing the reconstruction of language ancestry as a machine learning problem. For this, a dataset of words for a large number of concepts in a large number of languages is needed. A machine learning model is trained on pairs of word forms denoting the same concept in two languages. Through training, the model learns correspondences between sounds in the two languages. Then, for an unseen concept, the model can predict the word form in one language given the word form in the other language. This task can be performed as pairwise word prediction, where predictions are made per language pair: information from other language pairs is not taken into account. I also explore possibilities to exploit the assumed phylogenetic structure of the languages during prediction, which I call phylogenetic word prediction. In this setting, information is shared between language pairs. Word prediction can be used to automate several tasks in historical linguistics. I assume that language pairs with lower prediction error are more closely related, enabling reconstruction of phylogenetic trees. Languages can be hierarchically clustered, using the prediction error for every language pair as distance. Furthermore, the model learns sound correspondences between input and output. These can be identified by visualizing the learned model parameters or by looking at the substitutions between source and prediction. Finally, cognate detection can be performed: the clustering of words based on their ancestral relatedness. I use the prediction error per word from the model to perform this task. Earlier work related to word prediction is Mulloni (2007), Beinborn et al. (2013) and Ciobanu (2016). Although the applied methods differ, the approaches have in common that their input consists solely of cognates. Furthermore, orthographic input is used. In my approach, the algorithm can be trained on data which is not labelled for cognacy. The input is phonetic, reducing the effect of orthographic differences between languages.

Some of the computational methods in historical linguistics, described in 1.1.3, also apply machine learning algorithms. However, in the word prediction task I propose, machine learning is at the core of the method. I try to exploit the analogy between the regularity of sound change and the regularities machine learning algorithms use to learn. Machine learning is aimed at retrieving regularities from data and predicting based on these regularities. Sound changes, a central notion in historical linguistics, are also assumed to be regular. The machine learning algorithm serves as a model of sound change.

1.3.2 Model desiderata

When building a machine learning model for the task of word prediction, one has to ask which phenomena the algorithm should be able to model. The task in this thesis is to predict a word w_{d,B} in language B from a word w_{d,A} in language A. The question is which sound changes can occur between w_{d,A} and w_{d,B} in different languages, for which the model should account. These sound changes are described in section 1.1.2. Bouchard-Côté et al. (2013) describe that most regular sound changes (e.g. lenition, epenthesis) can be captured by a probabilistic string transducer. This is an algorithm with a relatively simple structure, but sensitive to the context in which a sound change occurs. For other changes (e.g. metathesis, reduplication), more complex models need to be applied. However, these changes are in many cases not regular, but sporadic. Models applied to word prediction should at least have the capabilities of a probabilistic string transducer to model regular sound change. A phenomenon for which a model should ideally also account is semantic shift. The words for concept c in languages A and B may not be cognate. Therefore, the sound changes learned from this pair can be seen as noise during training. However, the meaning of concept c may have shifted to concept d. The word for another concept d in language B may be cognate with the word for c in language A, so this pair could be used as training data. It would be beneficial if an algorithm could find these cross-concept cognate pairs, or at least be able to give less significance to non-cognate pairs during training.

1.4 Summary

In this introductory chapter, I described the general method in historical linguistics and computational methods applied in the field. Then, I introduced the machine learning and deep learning methods currently applied in natural language processing. Finally, I proposed the word prediction task, using machine learning algorithms as models of sound change, to automate tasks in historical linguistics. In the next chapter, I will specifically describe which machine learning models and linguistic data I will use in my pairwise word prediction experiments. Furthermore, I will give an overview of the tasks in historical linguistics that can be performed based on word prediction. In subsequent chapters, I will show the results of the experiments on the pairwise word prediction task and propose an extension of the task, sharing more information between language pairs: phylogenetic word prediction. Finally, I will draw conclusions on the contributions that the methods described in this thesis bring to research in historical linguistics.

Chapter 2

Method

Now that the task of word prediction has been defined, I will describe the models and data I will use in my experiments. Furthermore, I will show how I will apply word prediction to different tasks in historical linguistics. In this chapter, I will discuss pairwise word prediction: the prediction of words between two languages, without taking into account information from other languages. Phylogenetic word prediction, where information is shared between language pairs, will be covered in a later chapter.

2.1 Pairwise word prediction

2.1.1 Task

I will now more precisely define the task of word prediction, described informally in section 1.3.1. A machine learning model is trained on pairs of phonetic word forms (w_{c,A}, w_{c,B}) denoting the same concept c in two languages A and B. By learning the sound correspondences between the two languages, the model can then predict, for an unseen concept d, the word form w_{d,B} given a word form w_{d,A}. In this training and prediction process between languages A and B, information from a third language C is not taken into account. After training a model for a language pair, the prediction distance between the prediction and the real target word is informative. Also, the internal parameters of the model can convey interesting information on the learned correspondences between words in the two languages. In section 2.2, I will further describe how these outcomes of word prediction can be applied to useful tasks in historical linguistics. But first, I will turn to the core of the method, word prediction itself, and specify the models and data used.

2.1.2 Models

As machine learning algorithms to perform word prediction, I apply two neural networks (see section 1.2): a more complex RNN encoder-decoder and a simpler structured perceptron.

RNN encoder-decoder

The first model I apply is a recurrent neural network (RNN), in an encoder-decoder structure. An RNN takes a sequence as input and produces a sequence. RNNs are good at handling sequential information, because the output of a recurrent node depends both on the input at the current time step (the phoneme at the current position) and on the values of the previous recurrent node, which carry a representation of the previous phonemes in the word.


A single RNN emits one output phoneme per input phoneme, which assumes that the source and target lengths are the same. I therefore apply an encoder-decoder structure, inspired by models used in machine translation (Sutskever et al., 2014; Cho et al., 2014). An encoder-decoder model consists of two RNNs, see Figure 2.1. The encoder processes the input string and uses the output at the last time step to summarize the input into a fixed-size vector. This fixed-size vector serves as input to the decoder RNN at every time step. The decoder outputs a predicted word. This architecture enables the use of different source and target lengths and outputs a phoneme based on the whole input string. Instead of normal recurrent network nodes, I use Gated Recurrent Units (GRU) (Cho et al., 2014), which are capable of capturing long-distance dependencies. The GRU is an adaptation of the Long Short Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997). Both the encoder and decoder consist of 400 hidden units. I apply a bidirectional encoder, combining the output vectors of a forward and backward encoder.

The weights of the network are initialized using Xavier initialization (Glorot and Bengio, 2010). With the right initialization, the network can be trained faster, because the incoming data fits better to the activation functions of the layers. Xavier initialization is designed to keep the variance of the input and output of a layer equal. It works by sampling weights from a normal distribution N(0, 1/n_incoming), with n_incoming the number of units in the previous layer. I apply dropout, the random disabling of network nodes to prevent overfitting to training data; the dropout factor is 0.1. Data is supplied in batches; the default batch size is 10.

Because I use one-hot output encoding, predicting a phoneme corresponds to single-label classification: only one element of the vector can be 1. Therefore, the output layer of the network is a softmax layer, which outputs a probability distribution over the possible one-hot positions, corresponding to phonemes. The network outputs are compared to the target values using a categorical cross-entropy loss function, which is known to work well together with softmax output. I add an L2 regularization term to the loss function, which penalizes large weight values, to prevent overfitting on the training data. The applied optimization algorithm is Adagrad (Duchi et al., 2011): an algorithm that updates the weights with the gradient of the loss, using an adaptive learning rate. The initial learning rate is 0.01. The threshold for gradient clipping is set to 100. In the experiments, the default number of training epochs, the number of times the training set is run through, is 15. The network was implemented using the Lasagne neural network library (Dieleman et al., 2015).

In the model desiderata (section 1.3.2), I formulated the types of sound changes that a model should account for. Schmidhuber et al. (2006) showed that LSTM models (closely related to the GRU used in the RNN encoder-decoder) are capable of recognizing context-free and context-sensitive languages, up to a certain length. This is even more than my desideratum of modelling sound change described by a regular language.
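The thesis implements this architecture in Lasagne; purely as an illustration, the following is a minimal sketch of a comparable encoder-decoder in Keras. The dimensions (padded word length, feature size, phoneme inventory) are assumed values, and dropout and L2 regularization are omitted for brevity.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed dimensions: padded word length, input feature size, ASJP phoneme inventory.
max_len, n_features, n_phonemes, hidden = 12, 41, 42, 400

inputs = keras.Input(shape=(max_len, n_features))
# Bidirectional GRU encoder; its final state summarizes the source word.
context = layers.Bidirectional(layers.GRU(hidden))(inputs)
# The fixed-size context vector is fed to the decoder at every output time step.
repeated = layers.RepeatVector(max_len)(context)
decoded = layers.GRU(hidden, return_sequences=True)(repeated)
# Softmax over the phoneme inventory at each position (one-hot targets).
outputs = layers.TimeDistributed(layers.Dense(n_phonemes, activation="softmax"))(decoded)

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adagrad(learning_rate=0.01),
              loss="categorical_crossentropy")
```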

Cognacy prior

As formulated in the desiderata, ideally, the model should learn more from cognate word pairs than from non-cognate word pairs in the training data. I tried to cater for this by including a cognacy prior in the loss function, one of the contributions of this thesis. The network should learn as little as possible from non-cognate word pairs. The weights are updated using a derivative of the loss. Therefore, I would like to make the loss dependent on an estimation of the cognacy of the input. A heuristic for cognacy could be edit distance: words with a small distance can still be deemed cognate, but words with a large edit distance cannot. I propose a new loss function L_new, based on the cross-entropy loss L_CE and a cognacy prior function CP:

L_new = L_CE(t, p) · CP(t, p)

CP(t, p) = 1 / (1 + e^(L_CE(t, p) − θ))

θ = L_CE_history + v · σ

where:

L_new: the new loss function, which takes cognacy into account
L_CE(t, p): the original categorical cross-entropy loss between target t and prediction p
CP(t, p): the cognacy prior: the estimated score (between 0 and 1) of t and p being cognate
θ: threshold after which the inverse sigmoid function starts to decline steeply
L_CE_history: mean of all previous cross-entropy loss values
σ: standard deviation of the previous cross-entropy loss values
v: constant that determines the number of standard deviations added to the threshold

Note that the cross-entropy loss L_CE(t, p) occurs twice in the formula: as the "body" of the new loss function and inside the cognacy prior. In its original formulation, a larger distance between t and p gives a higher loss. Inside the cognacy prior, the cross-entropy loss is wrapped in an inverse sigmoid function, see Figure 2.2, where a larger distance suddenly gives a lower loss, after a certain threshold. The idea is that not much should be learned from words with a very large distance, which are probably non-cognates. The threshold at which a distance is considered "very large" is determined by taking the mean of all previous distances (losses), plus a constant number of standard deviations.
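As a sketch only, the cognacy prior can be written as a custom Keras-style loss. Here θ is passed in as a fixed value; in the thesis it is re-estimated during training from the mean and standard deviation of the previous losses, which is not shown here.

```python
import tensorflow as tf

def cognacy_prior_loss(theta):
    """Per-example cross-entropy scaled by an inverse-sigmoid cognacy prior.
    theta is assumed fixed here; the thesis derives it from the loss history."""
    def loss(y_true, y_pred):
        # Categorical cross-entropy per time step, averaged over the word.
        l_ce = tf.reduce_mean(
            -tf.reduce_sum(y_true * tf.math.log(y_pred + 1e-9), axis=-1), axis=-1)
        prior = tf.sigmoid(theta - l_ce)   # close to 1 below theta, drops sharply above it
        return l_ce * prior
    return loss
```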

Structured perceptron

The second machine learning model I use in my experiments is a simpler model, the structured perceptron. A structured perceptron (Collins, 2002) is an extension of a perceptron (one-layer neural network) (Rosenblatt, 1958) for performing sequential tasks.

Algorithm

The structured perceptron algorithm is run for I iterations. At every iteration, all N data points are processed. For every input sequence (a word, in this case) x_n, a sequence ŷ_n is predicted, based on the current model parameters w:

ŷ_n = argmax_{y ∈ Y} w^T ϕ(x_n, y)    (2.1)

w^T ϕ(x_n, y) is the equation of a perceptron: a one-layer neural network with feature function ϕ and weights w. Because of the argmax, the perceptron has to be evaluated for all possible values of y; the value which gives the highest output is used as the prediction ŷ_n. This argmax is computationally expensive; therefore, the Viterbi algorithm (Viterbi, 1967) can be run to efficiently estimate the best value ŷ_n. If the predicted sequence ŷ_n is different from the target sequence y_n, the weights are updated using the difference between the feature function applied to the target and the feature function applied to the predicted value:

w ← w + ϕ(x_n, y_n) − ϕ(x_n, ŷ_n)    (2.2)

After I iterations, the weights w of the last iteration are returned. In practice, the averaged structured perceptron is used, which outputs an average of the weights over all updates. Figure 2.3 shows the pseudocode of the averaged structured perceptron algorithm.
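As an illustration only, here is a minimal numpy sketch of an averaged structured perceptron with per-position input features and first-order label transitions; the thesis itself uses the seqlearn implementation, so the names and the exact feature decomposition here are assumptions.

```python
import numpy as np

def viterbi(x, w_emit, w_trans):
    """Best label sequence for one word; x is a (T, d) feature matrix."""
    T, L = x.shape[0], w_emit.shape[0]
    score = x @ w_emit.T                           # (T, L) emission scores
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        total = score[t - 1][:, None] + w_trans    # rows: previous label, cols: current label
        back[t] = total.argmax(axis=0)
        score[t] += total.max(axis=0)
    y = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1]

def train_averaged_perceptron(data, n_labels, n_feats, n_iter=100):
    """data: list of (x, y) pairs; x: (T, n_feats) array, y: list of gold label ids."""
    w_emit = np.zeros((n_labels, n_feats)); w_trans = np.zeros((n_labels, n_labels))
    sum_emit = np.zeros_like(w_emit); sum_trans = np.zeros_like(w_trans)
    for _ in range(n_iter):
        for x, y in data:
            y_hat = viterbi(x, w_emit, w_trans)
            if y_hat != list(y):
                # Perceptron update, decomposed over positions and transitions.
                for t, (g, p) in enumerate(zip(y, y_hat)):
                    w_emit[g] += x[t]; w_emit[p] -= x[t]
                    if t > 0:
                        w_trans[y[t - 1], g] += 1; w_trans[y_hat[t - 1], p] -= 1
            sum_emit += w_emit; sum_trans += w_trans
    n_steps = n_iter * len(data)
    return sum_emit / n_steps, sum_trans / n_steps   # averaged weights
```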

Application

I formulated in the model desiderata (Section 1.3.2) that a machine learning model should at least be capable of modelling regular languages. The structured perceptron has been successfully applied to POS tagging (Collins, 2002), a task which can be described by a regular language, so a structured perceptron should be powerful enough for the word prediction task. I use the implementation from the seqlearn library.¹ In the experiments, the structured perceptron algorithm is run for 100 iterations of parameter training.

¹ https://github.com/larsmans/seqlearn

Figure 2.1: Structure of the RNN encoder-decoder model. The encoder reads the source word (here Dutch ɣ eː s t) and summarizes it into a fixed-size context vector C, which the decoder receives at every time step to produce the target word (here German ɡ aɪ s t); the loss is computed against the target phonemes.

Figure 2.2: The cognacy prior of the loss function has an inverse sigmoid shape (plotted as cognacy probability p(cog) against the distance E(t, p) between target and prediction): when the distance between target and prediction is below θ, the cognacy prior value is close to 1, and it sharply decreases when the distance is greater than or equal to θ.

Figure 2.3: Pseudocode of the averaged structured perceptron algorithm (adapted from Daume and Marcu (2006)).

2.1.3 Data

Data set

Data from many linguistic levels can be used to study language change, including lexical, phonetic or syntactic data. Using word forms (lexical or phonetic) seems suitable for the prediction task. There are many training examples (words) available per language and the prediction algorithm can generalize over the relations between phonemes. Word forms also have a lower probability of being borrowed or being similar by chance than syntactic data (Greenhill et al., 2017). I use word forms in phonetic representation, because this stays close to the actual use of language by speakers. Word forms in orthographic representation are dependent on political conventions: the same sound can be described by different letters in different languages. I use the NorthEuraLex dataset (Dellert and Jäger, 2017), which consists of phonetic word forms for 1016 concepts in 107 languages of Northern Eurasia. In historical linguistics, generally, only basic vocabulary (e.g. kinship terms, body parts) is used, because this vocabulary is least prone to borrowing (Campbell, 2013, p. 352). However, machine learning algorithms need a large number of examples to train on and a meaningful number of examples to evaluate the algorithm. I hope that the performance increase of the algorithm from using enough training examples compensates for the possible performance decrease due to borrowing. I use a version of the dataset which is formatted in the ASJPcode alphabet (Brown et al., 2008). ASJPcode consists of 41 sound classes, considerably fewer than the number of IPA phonemes, reducing the complexity of the prediction problem. Table 7.1 gives an overview of ASJP phonemes. There can be multiple word forms for a concept in one language. Per language pair, I create a dataset by using all combinations of the alternatives for a concept in both languages. The dataset is then split into a training set (80%), development set (10%) and test set (10%). The training set is used to train the model. The training and test set should be separated, so the model predicts on different data than it learned from. The development set is used to tune model parameters. Models are run for different parameters and evaluated on the development set. The parameter setting with the highest performance on the development set is used for the real experiments on the test set.
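A minimal sketch of how such a per-language-pair dataset could be assembled and split; the 80/10/10 proportions follow the text, but the exact pairing and shuffling procedure shown here is an assumption.

```python
import random
from itertools import product

def make_pair_dataset(forms_a, forms_b, seed=42):
    """forms_a/forms_b: dicts mapping concept -> list of alternative word forms."""
    pairs = []
    for concept in sorted(set(forms_a) & set(forms_b)):
        # All combinations of the alternative forms for this concept in both languages.
        pairs.extend(product(forms_a[concept], forms_b[concept]))
    random.Random(seed).shuffle(pairs)
    n_train, n_dev = int(0.8 * len(pairs)), int(0.1 * len(pairs))
    return (pairs[:n_train],                      # training set (80%)
            pairs[n_train:n_train + n_dev],       # development set (10%)
            pairs[n_train + n_dev:])              # test set (10%)
```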

Input encoding

To enable a machine learning algorithm to process the phonetic data, every phoneme is encoded into a numerical vector. I will evaluate three types of encoding: one-hot, phonetic and embedding encoding. The embedding encoding is new in computational historical linguistics and one of the contributions of this thesis.

One-hot In one-hot encoding, every phoneme is represented by a vector of length n_characters, with a 1 at the position which corresponds to the current character, and 0 at all other positions. No qualitative information about the phoneme is stored. Table 2.1 gives an example of a one-hot feature matrix.

Phonetic In phonetic encoding, a phoneme is encoded as a vector of its phonetic features (e.g. back, bilabial, voiced), enabling the model to generalize observed sound changes across different phonemes. Rama (2016a), using a siamese convolutional neural network for cognate detection, shows that a phonetic representation gives a better performance for some datasets. I used the phonetic feature matrix for ASJP tokens from Rama (2016a), adding the encoding of the vowels from Brown et al. (2008). Table 2.2 shows an example of a phonetic feature matrix. Table 7.2 shows the full table of phonetic features used in the experiments.

ASJP phoneme
p  1 0 0 0
b  0 1 0 0
f  0 0 1 0
v  0 0 0 1

Table 2.1: Example of a feature matrix for one-hot encoding, for an alphabet consisting of four phonemes. Every phoneme is represented by one feature that is turned on; that feature is unique to that phoneme.

ASJP phoneme  Voiced  Labial  Dental  Alveolar  ···
p  0 1 0 0 ···
b  1 1 0 0 ···
f  0 1 1 0 ···
v  1 1 1 0 ···
m  1 1 0 0 ···
8  1 0 1 0 ···

Table 2.2: Example of feature matrix for phonetic encoding: every phoneme can have multiple features turned on.

Embedding Encoding linguistic items as a distribution of the items appearing in their context is called an embedding encoding. Word embeddings are successfully applied in many NLP tasks, where a word is represented by a distribution of the surrounding words (Mikolov et al., 2013; Pennington et al., 2014). The assumption is that "you shall know a word by the company it keeps" (Firth, 1957). If two words have a similar embedding vector, they usually appear in the same context and can thus relatively easily be interchanged. I would like to apply embeddings, and the notion of interchangeability, to a different linguistic level: phonology. I encode a phoneme as a vector of the phonemes occurring in its context. The same interchangeability of word embeddings is assumed: if two phoneme vectors are similar, they appear in a similar context. This corresponds to language-specific rules in phonotactics (the study of the combination of phonemes), which specify which classes of phonemes can follow which other classes. It can be expected that embeddings of phonemes inside a certain class are more likely to be similar to each other than to phonemes in other classes. In some respects, the embedding encoding learns the same feature matrix as the phonetic encoding, but inferred from the data, and with more room for language-specific phonotactics.

I created language-specific embedding encodings from the whole NorthEuraLex corpus. For every phoneme, the preceding and following phonemes, for all occurrences of the phoneme in the corpus, are counted. Position is taken into account, i.e. an /a/ appearing before a certain phoneme is counted separately from an /a/ appearing after a certain phoneme. Start and end tokens, for phonemes at the start and end of a word, are also counted. This approach is different from word embedding approaches in NLP, which count a larger window of e.g. 15 surrounding words and do not take position into account. However, I wanted to put more emphasis on the direct neighbours of a phoneme, as most phonotactic rules describe these relations. After collecting the counts, the values are normalized per row, so all the features for a phoneme sum to 1. Table 2.3 shows an example of an embedding feature matrix.

ASJP phoneme  START  i LEFT  S LEFT  p RIGHT  ···
3  0.004 0.003 0.001 0.002 ···
E  0.024 0.000 0.000 0.003 ···
a  0.050 0.002 0.000 0.012 ···
b  0.388 0.000 0.000 0.004 ···
p  0.152 0.039 0.000 0.000 ···

Table 2.3: Example of a feature matrix for embedding encoding: every phoneme is represented by an array of floating point values, which correspond to the probabilities that other phonemes occur before or after this phoneme. The values in a row sum to 1.

To analyze what representation of the phonetic space the embedding encoding learns, I generated embeddings for Dutch from the whole NorthEuraLex corpus. I then reduced these embeddings to two dimensions using Principal Components Analysis (PCA) (Pearson, 1901). For comparison, I also applied PCA to the phonetic feature matrix (Table 7.2). Figure 2.4 shows PCA plots for the embedding and phonetic encoding. It can be seen that both encodings partially show the same representation of the phonetic space, but generally the phonetic representation is more clustered, while the embedding representation is more spread out. In both plots, a cluster of vowels can be observed. Also, in both plots, n, l, r and N are close together. There are also striking differences between the encodings: S and s are, in accordance with phonological theory, close together in the phonetic matrix, but far apart in the embedding encoding. This remoteness does not have to be wrong: it can be a phonotactic pattern present in the NorthEuraLex data, and it can be an effective way to code information. It however shows that the embedding and phonetic encoding do not learn the same representation in all cases.
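A sketch of one way to build such a count-based embedding matrix from a list of word forms; the start/end token names and the (phoneme, side) feature keys are assumptions, not the thesis' exact implementation.

```python
import numpy as np
from collections import defaultdict

def embedding_matrix(words):
    """Position-sensitive neighbour counts per phoneme, normalized so rows sum to 1."""
    counts = defaultdict(lambda: defaultdict(float))
    for word in words:
        tokens = ["<s>"] + list(word) + ["</s>"]
        for i in range(1, len(tokens) - 1):
            counts[tokens[i]][(tokens[i - 1], "LEFT")] += 1
            counts[tokens[i]][(tokens[i + 1], "RIGHT")] += 1
    phonemes = sorted(counts)
    features = sorted({f for row in counts.values() for f in row})
    matrix = np.zeros((len(phonemes), len(features)))
    for i, p in enumerate(phonemes):
        for j, f in enumerate(features):
            matrix[i, j] = counts[p][f]
    matrix /= matrix.sum(axis=1, keepdims=True)   # each row sums to 1
    return phonemes, features, matrix

# e.g. phonemes, features, M = embedding_matrix(["blut", "brot", "mElk3"])
```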

Target encoding For the target data in the neural network models, one-hot encoding is used, regardless of the input encoding. This means that target words are encoded in one-hot encoding and the algorithm will output predictions in one-hot encoding. One-hot output encoding facilitates convenient decoding of the predictions. Other output encodings did not show good results in preliminary experiments. As target for the structured perceptron model, unencoded data is supplied, since this is the format required by the implementation of the algorithm used.

Input normalization The training data is standardized, in order to fit it better to the activation functions of the neural network nodes. For the training data, the mean and standard deviation per feature are calculated over the whole training set for this language pair. The mean is subtracted from the data and the resulting value is divided by the standard deviation. After standardization, per feature, the standardized training data has mean 0 and standard deviation 1. The test data is standardized using the mean and standard deviation of the training data. This transfer of knowledge can be regarded as being part of the training procedure.
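A minimal numpy sketch of this standardization step; the small epsilon guarding against constant features is an addition not mentioned in the text.

```python
import numpy as np

def standardize(train_X, test_X):
    """Feature-wise standardization; statistics come from the training set only,
    so no information from the test set leaks into training."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0) + 1e-8   # epsilon avoids division by zero for constant features
    return (train_X - mean) / std, (test_X - mean) / std
```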

2.1.4 Experiments

I have specified the machine learning models and data used in the experiments. Now, I will describe how the training and prediction in the experiments will take place.

Figure 2.4: PCA plots of (a) the embedding encoding matrix for Dutch, generated from NorthEuraLex, and (b) the phonetic feature matrix (Table 7.2).

Languages                              Description
ces bul rus bel ukr pol slk slv hrv    Slavic
swe isl nld deu dan nor                Western Germanic
lat fra ron ita por spa cat            Romance
lav hrv rus bel ukr ces slv            Balto-Slavic
fin krl liv ekk vep olo                Finnic
lit hrv ukr ces slv slk                Balto-Slavic
lit hrv ukr ces slv lav                Balto-Slavic
uzn tur tat bak azj                    Turkic
sjd smj sms smn sme                    Sami
smj sma sms smn sme                    Sami

Table 2.4: Overview of the 10 largest maximal cliques of languages with at least 100 shared cognates. To give an impression of the languages in the cliques, informal descriptions of the language groups are added; these differ in level of grouping. For a mapping from the ISO codes used here to language names, see Table 7.3.

Training

Training is performed on the full training set per language pair, which consists of both cognate and non-cognate words. It would be easier for the model to learn sound correspondences if it would only receive cognate training examples. However, I want to develop a model that can be applied to problems where no cognate judgments are available.

Evaluation

I evaluate the models by comparing the predictions and targets on the test section of the dataset. During development and tuning of parameters, the development set is used. I only evaluate on cognate pairs of words: if words are not genetically related, the algorithm will not be able to predict them via regular sound correspondences. Cognate judgments from the IELex dataset are used.² For words in NorthEuraLex for which no IELex cognate judgments are available, LexStat (threshold 0.6) automatic cognate judgments are generated. Languages which are not closely related do not share many cognates. Because I only evaluate on cognate words, the test set for those language pairs would become too small. To alleviate this problem, I evaluate only on groups of more closely related languages. In these groups, every language shares at least n cognates with all other languages in the group. I determine these groups by generating a graph of all languages, where two languages are connected if and only if the number of shared cognates exceeds the threshold n. Then, I determine the maximal cliques in this graph: groups of nodes where all nodes are connected to each other and it is not possible to add another node that is connected to all existing nodes. These maximal cliques correspond to my definition of language groups which share n cognates. Table 2.4 shows the 10 largest maximal cliques of languages with at least 100 shared cognates. The distance metric used between target and prediction is the Levenshtein distance (Levenshtein, 1966) (also called edit distance) divided by the length of the longest sequence. An average of this distance metric, over all words in the test set, is used as the distance between two languages. I use the prediction distances for two goals: to determine the distance of languages to each other and to determine the general accuracy of a model. If a certain model has a lower prediction distance over all language pairs than another model, I consider it to be more accurate.

² An intersection, which applied the IELex cognate judgments to the NorthEuraLex dataset, was supplied by Gerhard Jäger.
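The length-normalized edit distance used above can be computed with a standard dynamic program; the following sketch is illustrative, not the thesis' exact implementation.

```python
def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longest sequence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b), 1)

# The mean over all test words gives the distance between two languages, e.g.:
# distance = sum(normalized_levenshtein(t, p) for t, p in pairs) / len(pairs)
```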

The prediction results are compared to two baselines, for which the distance between target and baseline is calculated. The first baseline is the trivial source prediction baseline, predicting exactly the source word. The second baseline is based on Pointwise Mutual Information (PMI). As in Jäger et al. (2017), multiple runs of Needleman-Wunsch alignment (Needleman and Wunsch, 1970) of words are performed on a training set, iteratively optimizing PMI scores and alignments. At prediction time, the alignment of the last training iteration is used. For every source phoneme, the target phoneme with the highest probability of being aligned to the source phoneme is predicted.

2.2 Applications

In the next paragraphs, I will show multiple applications of word prediction in historical linguistics: phylogenetic tree reconstruction, sound correspondence identification and cognate detection. The applications use the outcomes of pairwise word prediction as a basis.

2.2.1 Phylogenetic tree reconstruction

Language pairs with a good prediction score (low edit distance) usually share more cognates, since these are predictable through regular sound correspondences. I regard the prediction score between language pairs as a measure of ancestral relatedness and use these scores to reconstruct a phylogenetic tree. I perform hierarchical clustering on the matrix of edit distances for all language pairs, using the UPGMA (Sokal and Michener, 1958) and neighbor joining (Saitou and Nei, 1987) algorithms (described in section 1.1.3), implemented in the LingPy library (List and Forkel, 2016). The generated trees are then compared to reference trees from Glottolog (Hammarström et al., 2017), based on current insights in historical linguistics. Evaluation is performed using the Generalized Quartet Distance (Pompei et al., 2011), a generalization of the Quartet Distance (Bryant et al., 2000) to non-binary trees. I apply the algorithm as implemented in the QDist program (Mailund and Pedersen, 2004).
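The UPGMA step corresponds to average-linkage hierarchical clustering of the language distance matrix. The sketch below uses scipy rather than LingPy, and the distance values and language codes are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

# Hypothetical matrix of mean word prediction distances (symmetric, zero diagonal).
langs = ["ces", "pol", "rus", "bel"]
dist = np.array([[0.00, 0.35, 0.48, 0.50],
                 [0.35, 0.00, 0.45, 0.47],
                 [0.48, 0.45, 0.00, 0.20],
                 [0.50, 0.47, 0.20, 0.00]])

# UPGMA = average linkage on the condensed distance matrix.
root = to_tree(linkage(squareform(dist), method="average"))

def newick(node):
    """Serialize the cluster tree to a (topology-only) Newick string."""
    if node.is_leaf():
        return langs[node.id]
    return "(%s,%s)" % (newick(node.left), newick(node.right))

print(newick(root) + ";")   # e.g. ((rus,bel),(ces,pol));
```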

2.2.2 Sound correspondence identification

To be able to make predictions, the word prediction model has to learn the probabilities of phonemes changing into other phonemes, given a certain context. I would like to extract these correspondences from the model. It is challenging to identify specific neural network nodes that fire when a certain sound correspondence is applied. Instead, I estimate the internal sound correspondences that the network learned by looking at the output: the substitutions made between the source word and the prediction. Pairs of source and prediction words are aligned using the Needleman-Wunsch algorithm. Then, the substituted phoneme pairs in these source-prediction alignments can be counted. These can be compared to the counts of substituted phonemes between source and target.
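Once the word pairs are aligned, the counting step amounts to tallying mismatched positions. A small sketch, assuming the alignment (with '-' for gaps) has already been produced by a Needleman-Wunsch implementation; the example pairs are equal-length forms taken from Table 3.1, so they align trivially.

```python
from collections import Counter

def substitution_counts(aligned_pairs):
    """Count phoneme substitutions between aligned sequences of equal length,
    where '-' marks a gap introduced by the alignment."""
    counts = Counter()
    for src, pred in aligned_pairs:
        for s, p in zip(src, pred):
            if s != p and s != "-" and p != "-":
                counts[(s, p)] += 1
    return counts

# Source-prediction pairs from the Dutch-German example (ASJPcode).
pairs = [("xlot", "glat"), ("brot", "bGat")]
print(substitution_counts(pairs).most_common())
# e.g. [(('o', 'a'), 2), (('x', 'g'), 1), (('r', 'G'), 1)]
```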

2.2.3 Cognate detection

Cognate detection is the detection of word forms in different languages (usually per concept) which derive from the same ancestral word. In order to perform cognate detection based on word prediction, I cluster the words for the same concept in different languages based on the prediction distances per word. First, word prediction is performed for all pairs of the languages which we want to evaluate. In the normal word prediction workflow (section 2.1.4), predictions are made only on word pairs which are deemed cognate by existing judgments. When performing cognate detection, the whole point is to make these judgments, so I perform word prediction on the full test set: cognates and non-cognates. I take into account concepts for which word forms occur in all languages, which vastly reduces the number of concepts. For every concept, I create a distance matrix between the word forms in all languages, based on the prediction distance per word. Next, I perform a flat clustering algorithm on this distance matrix. The applied clustering algorithms are flat UPGMA (Sokal and Michener, 1958), link clustering (Ahn et al., 2010) and MCL (van Dongen, 2000), implemented in the LingPy library. Preliminary experiments show that a threshold of θ = 0.7 gives the best results for MCL and link clustering, and θ = 0.8 gives the best results for flat UPGMA. Conceptually, the performed cognate detection operation is the same as the phylogenetic tree reconstruction operation, but now I cluster per word, instead of per language, and I perform a flat clustering instead of a hierarchical clustering. For evaluation, I use cognate judgments from IELex (Dunn, 2012). Evaluation is performed using the B-Cubed F measure (Bagga and Baldwin, 1998; Amigó et al., 2009), implemented in the bcubed library.³
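The thesis uses LingPy's flat UPGMA, link clustering and MCL; purely as an illustration of the per-concept clustering step, here is a scipy-based sketch that cuts an average-linkage tree at a distance threshold. The word forms and distances are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def detect_cognates(words, dist, threshold=0.8):
    """Flat clustering of one concept's word forms into cognate classes,
    given a symmetric matrix of word prediction distances."""
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=threshold, criterion="distance")
    return dict(zip(words, labels))

# Hypothetical word forms for one concept in four languages, with made-up distances.
words = ["voda", "woda", "water", "vatten"]
dist = np.array([[0.0, 0.2, 0.9, 0.9],
                 [0.2, 0.0, 0.9, 0.9],
                 [0.9, 0.9, 0.0, 0.3],
                 [0.9, 0.9, 0.3, 0.0]])
print(detect_cognates(words, dist))   # e.g. {'voda': 1, 'woda': 1, 'water': 2, 'vatten': 2}
```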

2.3 Summary

In this chapter, I described the method of the experiments on pairwise word prediction. After defining the task, models and data, I showed how several tasks in historical linguistics can be performed based on pairwise word prediction. Contributions of the methods described in this section include:

• The proposal of a new cognacy prior loss, enabling a neural network to learn more from some training examples than from others.

• The usage of embedding encoding, inspired by word embeddings in natural language processing, to encode phonemes in historical linguistics.

• Use of clustering algorithms to identify the patterns learned by a neural network.

• Inference of cognates from word prediction distances. Earlier cognate detection algorithms did not use prediction as a basis.

In the next chapter, I will show the results of the experiments.

³https://github.com/hhromic/python-bcubed

Chapter 3

Results

3.1 Word prediction

Pairwise word prediction was performed for all possible language pairs of the 9 languages from the Slavic group of the Indo-European family: Czech, Bulgarian, Russian, Belarusian, Ukrainian, Polish, Slovak, Slovene and Croatian. This is the largest group of languages sharing at least 100 cognates, as found in section 2.1.4. A smaller number of experiments was performed for the second largest clique, a group of 8 Germanic languages: Swedish, Icelandic, English, Dutch, German, Danish and Norwegian. The results were evaluated for three free parameters:

• Machine learning model: RNN encoder-decoder/structured perceptron

• Input encoding: one-hot/phonetic/embedding

• Cognacy prior (only for encoder-decoder): off/v = 1.0/v = 2.0

All other parameters were set as described in section 2.1.2. Table 3.1 shows the output of the word prediction algorithm for a structured perceptron model on the language pair Dutch-German. For every word, a prediction distance is calculated. From these word distances, a mean distance per language pair is calculated. Taking the mean of the scores of all language pairs in a family then gives a score which represents the performance of a model on that language family.

Table 3.2 shows the mean word prediction distances for the different conditions. The structured perceptron model outperforms the RNN encoder-decoder for the language groups under consideration. This difference is larger for the Slavic than for the Germanic language family. The differences between conditions (input encoding and cognacy prior) for the RNN model are small, but the embedding encoding generally seems to perform a bit better. For the Slavic language family, the PMI baseline model outperforms both prediction models. For the Germanic language family, the structured perceptron slightly outperforms the baseline. This difference may be explained by the fact that the Slavic languages in the dataset are more closely related and therefore easier for the baseline models to predict correctly, since only small changes have to be made to the source words.
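The aggregation from word distances to a family score can be sketched as follows. The length-normalized edit distance used here (edit distance divided by the length of the longer word) is consistent with the distances shown in Table 3.1, but the authoritative definition is the one given in section 2.1; the word pairs below are hypothetical.

from statistics import mean

def edit_distance(a, b):
    # Plain Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def word_distance(prediction, target):
    return edit_distance(prediction, target) / max(len(prediction), len(target))

# pair_results maps a language pair to its (prediction, target) word pairs (hypothetical data).
pair_results = {("nld", "deu"): [("blut", "blut"), ("Slep", "Slaf")],
                ("nld", "eng"): [("blat", "bl3d")]}
pair_means = {pair: mean(word_distance(p, t) for p, t in words)
              for pair, words in pair_results.items()}
family_score = mean(pair_means.values())   # mean over all language pairs in the family
print(pair_means, family_score)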

3.2 Phylogenetic tree reconstruction

In section 3.1, word prediction was performed for the Slavic language family. From the prediction results, I now reconstruct phylogenetic trees, by hierarchically clustering the matrix of edit distances of all language pairs.


Input       Target      Prediction   Distance
blut        blut        blut         0.00
inslap3     ainSlaf3n   inSlaun      0.33
blot        blat        blat         0.00
wExan       vEge3n      vag3n        0.33
xlot        glat        glat         0.00
warhEit     vaahait     vaahait      0.00
orbEit      aabait      oabait       0.17
mElk3       mElk3n      mEl3n        0.17
vostbind3   anbind3n    fostaiN3n    0.78
hak         hak3n       hak          0.40
stEl3       StEl3n      Staln        0.33
hust3       hust3n      hiSta        0.67
xord3l      giat3l      goad3l       0.33
l3is        laus        laiS         0.50
mont        munt        mant         0.25
ler3        leG3n       le3n         0.20
fEift3x     finfciS     faift3n      0.71
zwEm3       Svim3n      Svam3        0.33
slap        Slaf        Slep         0.50
klop3       klop3n      klaun        0.50
vex3        feg3n       feg3         0.20
tont3       tant3       tanta        0.20
dox         tak         daS          0.67
nevEl       neb3l       nebEl        0.20
ku          ku          kl           0.50
spits       Spic        Spist        0.40
lerar       leGa        leGa         0.00
dot         das         dat          0.33
brot        bGot        bGat         0.25
bind3l      bind3l      b3nd3l       0.17

Table 3.1: Word prediction output for a structured perceptron on language pair Dutch-German, encoded in ASJP phonemes (see Table 7.1 for an overview of ASJP). Prediction is the German word predicted by the model when Input is given as Dutch input. The edit distance between the prediction and the target German word, which is not seen by the model, is calculated. Lower distance is better performance.

Model                        Input encoding   Cognacy prior   Slavic   Germanic
Encoder-decoder              One-hot          None            0.5582   0.5721
Encoder-decoder              Phonetic         None            0.5767   0.5853
Encoder-decoder              Embedding        None            0.5579   0.5710
Encoder-decoder              One-hot          v = 1.0         0.5607   0.5754
Encoder-decoder              Phonetic         v = 1.0         0.5770   0.5824
Encoder-decoder              Embedding        v = 1.0         0.5573   0.5620
Encoder-decoder              One-hot          v = 2.0         0.5620   0.5688
Encoder-decoder              Phonetic         v = 2.0         0.5752   0.5744
Encoder-decoder              Embedding        v = 2.0         0.5543   0.5580
Structured perceptron        One-hot          -               0.3436   0.4374
Structured perceptron        Phonetic         -               0.3465   0.4497
Structured perceptron        Embedding        -               0.3423   0.4375
Source prediction baseline   -                -               0.3714   0.4933
PMI-based baseline           -                -               0.3249   0.4520

Table 3.2: Word prediction distance (edit distance between prediction and target) for different test conditions, for two language families: Slavic and Germanic. The distance is the mean of the distances of all language pairs in the family. Lower distance means better prediction.

Table 3.3 shows the generalized Quartet distance between the generated trees, for different conditions, and a Glottolog reference tree. In the table, one can see that the structured perceptron model consistently creates good trees. Some trees generated from the RNN encoder-decoder model give the same performance, but this is less stable across conditions. The baseline models, especially the PMI model, also create good trees. The performance differences between models are smaller than in Table 3.2. This is not very surprising, given that phylogenetic tree reconstruction is an easier task than word prediction: there are fewer possible branchings in a tree than possible combinations of phonemes in a word. Even a model with lower performance on word prediction can generate a relatively good tree.

Figure 3.1 graphically shows the trees inferred from three different models. The reference tree is added for comparison. Trees (a) and (b) (structured perceptron, UPGMA and NJ) are the same; the branches are only shown in a different order. These trees receive the lowest possible distance to the reference tree of 0.047619: the generated binary trees will never precisely match the multiple-branching reference tree. This is also the reason why no generated tree reaches a Quartet distance of 0 in Table 3.3. Tree (c) is one of the worst-performing models (encoder-decoder, one-hot, UPGMA), with a Quartet distance to the reference tree of 0.31746. Comparing to the reference tree, it can be observed that Polish, Czech and Slovak are placed in the wrong subtrees.

3.3 Identification of sound correspondences

I identified sound correspondences between Dutch and German, two closely related Germanic languages. Source-prediction substitutions were extracted using a structured perceptron model, with the default settings described in section 2.1.2. The test set, on which the model was evaluated, consisted of 93 cognate word pairs. Table 3.4 shows the most frequent sound substitutions for source-prediction and source-target.

[Tree drawings of Figure 3.1 not reproduced. Panels: (a) Structured perceptron, one-hot encoding, UPGMA clustering, Quartet distance 0.047619; (b) Structured perceptron, one-hot encoding, NJ clustering, Quartet distance 0.047619; (c) RNN encoder-decoder, one-hot encoding, UPGMA clustering, Quartet distance 0.31746; (d) Glottolog reference tree.]

Figure 3.1: Phylogenetic trees for the Slavic language family, using different models and clustering methods, and the Glottolog reference tree. For the model trees, Quartet distances to the Glottolog reference tree are shown. Lower distance means better correspondence.

Model                        Input encoding   Cognacy prior   UPGMA      Neighbor joining
Encoder-decoder              One-hot          None            0.047619   0.047619
Encoder-decoder              Phonetic         None            0.047619   0.190476
Encoder-decoder              Embedding        None            0.047619   0.047619
Encoder-decoder              One-hot          v = 1.0         0.31746    0.190476
Encoder-decoder              Phonetic         v = 1.0         0.047619   0.047619
Encoder-decoder              Embedding        v = 1.0         0.31746    0.31746
Encoder-decoder              One-hot          v = 2.0         0.31746    0.190476
Encoder-decoder              Phonetic         v = 2.0         0.31746    0.047619
Encoder-decoder              Embedding        v = 2.0         0.190476   0.190476
Structured perceptron        One-hot          -               0.047619   0.047619
Structured perceptron        Phonetic         -               0.047619   0.047619
Structured perceptron        Embedding        -               0.047619   0.047619
Source prediction baseline   -                -               0.269841   0.047619
PMI-based baseline           -                -               0.047619   0.047619

Table 3.3: Generalized Quartet distance between trees of the Slavic language family, inferred from word prediction results, and the Glottolog reference tree. Lower is better: a generated tree equal to the reference tree will have a theoretical distance of 0. In this case, the lower bound is 0.047619, because the generated binary trees will never precisely match the multiple-branching reference tree.

It can be observed that the most frequent substitutions between source and prediction are also frequent between source and target. This implies that the model has learned meaningful sound correspondences.

3.4 Cognate detection

I perform cognate detection for the Slavic and Germanic language families, by clustering words based on word prediction distances. I evaluate performance for the encoder-decoder and structured perceptron models. The models use the parameter settings described in section 2.1.2. As described in section 2.1.4, during cognate detection, contrary to the default setting, prediction is performed on both cognates and non-cognates. I apply three clustering algorithms, as described in section 2.2.3: MCL (θ = 0.7), link clustering (θ = 0.7) and flat UPGMA (θ = 0.8).

Table 3.5 shows B-Cubed F scores for cognate detection, using different models and clustering algorithms on the Slavic and Germanic language families. For the Slavic language family, the source prediction baseline model slightly outperforms the prediction models. The structured perceptron model performs better than the RNN model. For the Germanic language family, both prediction models perform above the baseline. Here, the one-hot structured perceptron using MCL clustering performs best. It must be noted that the sample of shared concepts in a language family is small, which makes the results less stable.

Substitution   Source-prediction frequency   Source-target frequency
a        21   13
r a      14   13
s S      14    8
v f      12   10
E a      12    9
3 n      10    1
r G       9   12
x g       9   10
w v       8    9
- n       7   28
3 a       4    1
i n       4
t -       3    3
r -       3    2
- 3       3    7
p u       3
r n       2
e a       2    3
w -       2    2
i 3       2
- t       2    1
e E       2    1
x S       2    2
p f       2    5
u l       2
x v       2
v b       2    3
k -       2
x n       1
i -       1

Table 3.4: Substitutions between aligned source-prediction pairs and substitutions between aligned source-target pairs for Dutch-German word prediction, using a structured perceptron model, with a test set consisting of 93 cognate pairs, under standard conditions. The list is ordered by the frequency of the source-prediction substitutions; the 30 most frequent entries are shown.

Model                        Input encoding   Cognacy prior   Clustering   Slavic     Germanic
Encoder-decoder              One-hot          None            MCL          0.800022   0.895440
Encoder-decoder              One-hot          None            LC           0.911171   0.866622
Encoder-decoder              One-hot          None            fUPGMA       0.898277   0.861093
Structured perceptron        One-hot          -               MCL          0.877458   0.932077
Structured perceptron        One-hot          -               LC           0.915680   0.905044
Structured perceptron        One-hot          -               fUPGMA       0.919739   0.889788
Source prediction baseline   -                -               MCL          0.920806   0.851781
Source prediction baseline   -                -               LC           0.926061   0.875475
Source prediction baseline   -                -               fUPGMA       0.929840   0.878705

Table 3.5: B-Cubed F scores for cognate detection using different models and clustering algorithms on the Slavic (27 concepts) and Germanic (29 concepts) language families. MCL=MCL clustering (θ = 0.7), LC=Link Clustering (θ = 0.7), fUPGMA=Flat UPGMA (θ = 0.8). Higher F score means better correspondence between computed and real clustering.

3.5 Summary

In this Results chapter, I have shown that the word prediction models generally perform on a comparable level to a PMI-based baseline model, on both the word prediction task and a number of related tasks in historical linguistics. When comparing the two word prediction models, the structured perceptron model outperforms the encoder-decoder RNN. In the next chapter, I will take a closer look at what the neural network learned, by visualizing activations of the network. Subsequently, I will introduce an elaboration of the word prediction task, phylogenetic word prediction, in which information is shared between language pairs.

Chapter 4

Context vector analysis

In this chapter, I will try to visualize what the encoder-decoder neural network has learned, by examining the network activations. I will look at the activation of the context vectors when different words are supplied. The context vector is situated between the encoder and decoder (see Figure 2.1) and carries an abstract, word-length-independent representation of the correspondences between input and target words. These context vectors can be compared to the encoded representations of the input and target words. I will first describe how the context vectors are extracted from the network. Then, I will show how the context vectors and input/target words can be visualized graphically using Principal Components Analysis (PCA). Finally, I will describe and perform an analysis of the clusterings of the context vectors, compared to clusterings of the input and target words.

4.1 Extraction of context vectors and input/target words

To extract the context vectors, the encoder-decoder model is trained for 15 epochs, with batch size 1, on the training part of the data set. Next, during the prediction phase, I supply the input words from the test set and extract the context vector activations for these words. These context vectors have dimensionality n_hidden, which is 400 in my experiments. For comparison, vectors per word are generated from the input and target words, encoded in one-hot encoding. The dimensions representing word length and number of features are merged. This yields word vectors for the input and target words with dimensionality len(word) × n_features, which is around 500 in most experiments. Every element of such a vector represents one feature in one position. I now have three vector representations of the words: context vectors, input words and target words. I compare the three representations in two ways: by visualizing the three representations in a plot using PCA, and by analyzing the clusters of words in the three representations.
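The three representations can be collected as in the following sketch, where the arrays stand in for the extracted context vector activations and the one-hot encoded words; the array names and the random placeholder data are not part of the thesis code.

import numpy as np

n_words, max_len, n_features, n_hidden = 88, 10, 50, 400   # hypothetical sizes

context_vectors = np.random.rand(n_words, n_hidden)        # activations taken from the trained model
input_onehot = np.random.randint(0, 2, (n_words, max_len, n_features))
target_onehot = np.random.randint(0, 2, (n_words, max_len, n_features))

# Merge the word-length and feature dimensions into a single vector per word,
# giving dimensionality max_len * n_features (around 500 in the experiments).
input_vectors = input_onehot.reshape(n_words, -1)
target_vectors = target_onehot.reshape(n_words, -1)
print(context_vectors.shape, input_vectors.shape, target_vectors.shape)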

4.2 PCA visualization

In order to visualize the high-dimensional context vectors graphically, their dimensionality is reduced to two dimensions using principal components analysis (PCA) (Pearson, 1901). The result is a scatter plot of words, in which patterns can be seen: words that are close together have a similar context vector representation. I also plot the vectors of the input words and target words. Learned sound correspondences can now be identified by looking for clusters that appear in the context vector plot, but are not apparent in the input words plot.
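A minimal sketch of this visualization step, using scikit-learn's PCA and matplotlib; the thesis does not prescribe a particular implementation, and the random array below merely stands in for the extracted context vectors.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

context_vectors = np.random.rand(88, 400)                     # stand-in for the extracted activations
coords = PCA(n_components=2).fit_transform(context_vectors)   # reduce to two dimensions

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("Context vectors (PCA)")
plt.show()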


Distance matrix pair   Cosine distance
Vectors-Input          0.0520
Vectors-Target         0.0615
Input-Target           0.0152

Table 4.1: Cosine distances between the three distance matrices: pairwise distances of context vectors, pairwise distances of input words and pairwise distances of target words. Lower score means closer relationship.

Sound correspondences were identified between Dutch and German, two closely related Germanic languages. Context vector activations are extracted from an encoder-decoder model, with all parameters set as described in section 2.1.2, but with batch size 1. Figure 4.1 visualizes the context vectors of the encoder-decoder for different words using PCA and compares them to PCA visualizations of the input words and target words. It is challenging to immediately discern shared clusterings across the three plots. It can be noted that more clusters are apparent in the context vector plot than in the visualizations of the input and target words.

4.3 Cluster analysis

I would like to compare the clusterings of words in the three representations: context vectors, input words and target words. It is challenging to visually distinguish clusters based on the PCA plots. Furthermore, information has been lost by the dimensionality reduction. Instead, it is also possible to automatically cluster the high-dimensional vectors of the words using a clustering algorithm and compare the clusterings in the three representations.

4.3.1 Distance matrices

To create a clustering, a distance matrix is needed. To obtain a distance matrix per representation (context vectors/input words/target words), I compute the pairwise cosine distance for every pair of word vectors, giving a distance matrix between all words in one representation. Cosine distance captures the angle between vectors instead of their magnitude: vectors with different magnitudes, but the same proportions between the dimensions, will receive a low distance. This gives distances which are in the same range for all three representations. I use the pairwise cosine distance calculation method from the scikit-learn library. Table 4.1 shows the cosine distances between the distance matrices of the context vectors, input words and target words.
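The construction of the three distance matrices can be sketched as follows; the comparison of the matrices themselves via the cosine distance of their flattened forms is an assumption made for this illustration, and the random arrays stand in for the actual word representations.

import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from scipy.spatial.distance import cosine

context_vectors = np.random.rand(88, 400)    # stand-ins for the three representations
input_vectors = np.random.rand(88, 500)
target_vectors = np.random.rand(88, 500)

dist_context = cosine_distances(context_vectors)   # pairwise word-by-word distance matrices
dist_input = cosine_distances(input_vectors)
dist_target = cosine_distances(target_vectors)

# One way to compare two distance matrices is the cosine distance between their flattened forms.
print(cosine(dist_context.ravel(), dist_input.ravel()))
print(cosine(dist_context.ravel(), dist_target.ravel()))
print(cosine(dist_input.ravel(), dist_target.ravel()))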

4.3.2 Clustering

On the three resulting distance matrices, I perform flat clustering algorithms. These belong to the same class of algorithms I used for cognate detection in section 2.2.3. The applied clustering method and threshold are crucial for the outcome of the cluster analysis; therefore, I evaluated different flat clustering algorithms: flat UPGMA, MCL, link clustering and affinity propagation (Frey and Dueck, 2007), implemented in the LingPy library. I wanted to find a clustering method that gives a clustering with relatively small clusters (cluster size around 5), not too much variability in size between clusters, and similar cluster sizes for the three distance matrices. Per clustering method, I assess the minimum, maximum and median cluster sizes. Table 4.2 shows the min/max/median cluster sizes for different clustering methods on the validation set.

[Scatter plots of Figure 4.1 not reproduced. Panels: (a) Context vectors; (b) Input words, one-hot encoding; (c) Target words, one-hot encoding.]

Figure 4.1: The weights of the context vectors of all words in the encoder-decoder have been visualized using PCA. For comparison, PCA visualizations of the input words and target words have been added.

                        Context                 Input                   Output
Method          θ     Min   Med.   Max      Min   Med.   Max      Min   Med.   Max
Flat UPGMA      0.6     1    3.0    59        1   44.0    87        1    2.0    85
Flat UPGMA      0.5     1    3.0    17        1    3.0    79        1    2.5    82
Affinity prop   0.4     3    6.0    11        1    4.0    35        1    7.5    22
Affinity prop   0.5     1    7.5    12       10   18.0    42        1   38.0    49
Affinity prop   0.2     1    8.5    15        1    5.5    10        1    3.0     9
Affinity prop   0.1     1   10.0    17        1    3.0    11        1    4.0    13
Affinity prop   0.6     7   11.0    15       34   44.0    54       14   17.0    57
MCL             0.5     1   11.5    64       88   88.0    88        2   44.0    86
Link cluster.   0.5     1   43.5    86       87   87.0    87       87   87.0    87
InfoMap         0.5     1   44.0    87       88   88.0    88        2   44.0    86
Link cluster.   0.6    87   87.0    87       87   87.0    87       87   87.0    87
MCL             0.6    88   88.0    88       88   88.0    88       88   88.0    88
InfoMap         0.6    88   88.0    88       88   88.0    88       88   88.0    88

Table 4.2: Minimum, maximum and median cluster sizes for the clustering of the distance matrices of the context vectors, input words (one-hot representation) and target words (one-hot representation), using different clustering algorithms, on the validation set. The results have been sorted by the median cluster size of the context vectors. Methods with a median cluster size of 1 have been removed; clusterings with such small clusters are definitely not suitable for the purposes of this thesis.

Judgments   Gold     Precision   Recall   F
Context     Input    0.324       0.413    0.363
Context     Target   0.279       0.438    0.340
Input       Target   0.376       0.474    0.419

Table 4.3: B-Cubed recall, precision and F scores between the clusterings (using affinity propagation, θ = 0.2) of the context vectors, input words and target words distance matrices. Higher score means closer relationship.

Affinity propagation with threshold 0.2 seems the best choice, because of the small clusters and relatively small variability, and will be used in the experiments. The distance matrices were clustered using affinity propagation, with θ = 0.2. Table 4.3 shows B-Cubed F scores between the clusterings for the three distance matrices. According to the direct cosine distances, the context vector weights are closer to the input words than to the target words. This pattern is also observed when looking at the B-Cubed precision and F of the clusterings. However, when looking at the B-Cubed recall of the clusterings, the context vector weights are closer to the target words than to the input words. The fact that the context vector weights recall many of the clusterings in the target words is striking, since the target words have not been seen by the network when the weights are extracted. This shows that the network has learned sound correspondences. In all comparisons in Tables 4.1 and 4.3, the input and target words remain closer to each other than the context vectors are to the input or target words. This underlines that the context vector weights are really situated in a different space than the input and target words.

4.4 Summary

In this chapter, I tried to gain insight into the learned correspondences of the neural network, by visualizing weights using PCA and by performing an analysis of word clusters. I propose a clustering of context vector activations as a new method to analyze learned patterns. A qualitative inspection of the PCA plots and a comparison of the clusterings of context vectors, input words and target words shows that the network has learned sound correspondences. However, it remains challenging to draw solid conclusions from the high-dimensional activations of the layers of the neural network.

Chapter 5

Phylogenetic word prediction

5.1 Beyond pairwise word prediction

A flaw of pairwise prediction is that no information is shared between word pairs, and pairs of distantly related languages are assumed to have the same relation to each other as pairs of closely related languages. In historical linguistics, a phylogenetic tree of languages is assumed. In this framework, languages genetically share information if they have a common ancestor. Also, languages are related hierarchically, giving closely related languages a different relationship to each other than distantly related languages. I propose phylogenetic word prediction, in which the task is to predict words between languages, while exploiting the phylogenetic structure of language ancestry. I assume a fixed phylogenetic tree structure, given by the linguistic literature, and evaluate the prediction performance for all language pairs in this tree. From the ancestral nodes in the tree, protoforms, word forms in a postulated ancestral language, can be inferred.

In principle, the approach could be extended to a joint task of word prediction and phylogenetic tree reconstruction. Phylogenetic word prediction can be performed for a number of possible trees for a group of languages. The tree with the lowest average prediction distance can then be labelled as the optimal phylogenetic tree. This is however outside the scope of this thesis.

5.2 Method

5.2.1 Network architecture

In pairwise word prediction, there was a separate model for every language pair. In phylogenetic word prediction, there is one model for all language pairs. I build a neural network which is structured as a given phylogenetic tree. The languages which are taken into account are at the leaves of the tree, as input layers. Dense layers are used to represent the edges of the phylogenetic tree: they connect input languages to ancestral languages and ancestral languages to output languages. The general structure of the network is that of a feed-forward neural network, where every layer has a fixed input length, without modelling sequentiality. At the leaf of every input language, an RNN encoder (same settings as in section 2.1.2) is attached to the input layer. The RNN encoder transforms the variable-length input into a fixed-size vector. This fixed-size vector is fed through the layers of the network. At the leaf of every output language, an RNN decoder is attached, to transform the fixed-size vector into a variable-length word. Figure 5.1 shows the architecture of a phylogenetic neural network for three languages.



Figure 5.1: Phylogenetic neural network architecture for three languages A, B and C. When feeding data from the encoder of language A to the decoder of language C (red), edges are visited that are also visited when feeding data from language B to language C (blue). The weights for these edges are shared (purple). For other language pairs, different edges can be shared.

The difference to the neural network of the pairwise prediction model (section 2.1.2) is that the phylogenetic neural network is a feedforward neural network, with encoders and decoders at the leaves to model sequentiality. The pairwise prediction model was a full encoder-decoder, without any feed-forward layers between the encoder and decoder.

In the existing literature, recursive neural networks (RxNNs) (Goller and Kuchler, 1996) are examples of tree-structured neural networks. They have been applied to tasks where hierarchical structure is involved, like syntactic parsing (Socher et al., 2010) and dependency parsing (Le and Zuidema, 2014). In RxNNs, the weight applied at every layer of the tree is the same, which models a recursive structure, as occurs e.g. in language syntax. In the phylogenetic neural network, the weight applied at every layer is different, because the relationship between languages and protolanguages at different levels of the tree is not the same. Another difference is that, in most RxNNs, information flows bottom-up and the output of the network is collected at the root. In the phylogenetic neural network I propose, information passes through the root of the phylogenetic tree; the output is collected at the leaf of the target language. Inside-Outside RxNNs (Le and Zuidema, 2014) are another example of tree-structured neural networks where information does not only flow bottom-up; this network however has a different architecture.

5.2.2 Weight sharing and protoform inference

Training and prediction are performed per language pair. Per language pair, data travels from the RNN encoder of one input language to the RNN decoder of one output language, following a path through the tree. The weights of all the tree branches that were visited are then updated. The weights along the tree branches are shared between the language pairs, so information can be shared between languages. Another view is that every language pair has its own feed-forward neural network (the path of tree edges), where the weights of the feed-forward edges are shared with other language pairs according to the phylogenetic tree. Weights are shared between all dense layers which represent tree edges with the same location in the tree and the same directionality. Edges with the same location, but different directionality, do not share weights. I assume that predicting from a certain language requires different weights than predicting to that language. Figure 5.1 gives an example of weight sharing in the phylogenetic neural network, for two language pairs.

I chose to share weights between all encoders and to share weights between all decoders. There is no weight sharing between encoders and decoders mutually, because these have to perform different tasks. By sharing weights, the network is trained to develop universal encoders and universal decoders. The encoders can transfer word forms from every language into a fixed-size representation. The decoders can convert fixed-size representations into any language. This is vaguely reminiscent of the universal encoder-decoder model applied by Ha et al. (2016) in machine translation. The universality of the encoders and decoders forces the dense layers, representing tree edges, to learn language-dependent sound correspondences. This is a difference to the encoder-decoder used in pairwise word prediction: in that setup, the encoders and decoders were responsible for learning language-specific sound correspondences.

The universality of encoders and decoders enables me to infer protoforms for ancestral languages. By connecting the universal decoder to an ancestral node in the network and feeding words at an input language, I can extract protoforms. With language-dependent encoders and decoders, this would not have been straightforward.

5.2.3 Implementation details

The number of hidden units for the dense layers is 400; this provided the best results in preliminary experiments. The dense layers use a rectified linear unit (Hahnloser et al., 2000) as nonlinear activation function, the function f(x) = max(0, x), which has been shown to improve training in deep neural networks (Glorot et al., 2011). The layers are initialized using Xavier initialization. Between the dense layers, dropout layers with dropout factor 0.1 have been added, to prevent overfitting. In all experiments, I use one-hot input encoding.
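The following NumPy sketch illustrates the core idea of dense layers as tree edges with weights shared across language pairs, using the hyperparameters just described (400 hidden units, ReLU, Xavier-style initialization; dropout is omitted). The fixed-size input vector stands in for the RNN encoder output, and the toy tree and helper functions are illustrative, not the implementation used in the thesis.

import numpy as np

n_hidden = 400
parent = {"nld": "franconian", "deu": "franconian",      # reference tree ((nld, deu), eng)
          "eng": "westgermanic", "franconian": "westgermanic"}

def edge_path(src, tgt):
    # Directed edges from the source leaf up to the lowest common ancestor and down to the target leaf.
    up = [src]
    while up[-1] in parent:
        up.append(parent[up[-1]])
    down = [tgt]
    while down[-1] not in up:
        down.append(parent[down[-1]])
    up = up[:up.index(down[-1]) + 1]
    down_rev = down[::-1]
    return list(zip(up, up[1:])) + list(zip(down_rev, down_rev[1:]))

rng = np.random.default_rng(0)
weights = {}                                             # one weight matrix per directed tree edge

def dense(vec, edge):
    if edge not in weights:                              # Xavier-style uniform initialization
        limit = np.sqrt(6.0 / (2 * n_hidden))
        weights[edge] = rng.uniform(-limit, limit, (n_hidden, n_hidden))
    return np.maximum(0.0, vec @ weights[edge])          # dense layer with ReLU activation

def predict(encoded, src, tgt):
    for edge in edge_path(src, tgt):                     # weights on shared edges are reused
        encoded = dense(encoded, edge)
    return encoded                                       # would be fed into the RNN decoder of tgt

encoded = rng.random(n_hidden)                           # stand-in for the RNN encoder output
predict(encoded, "nld", "deu")                           # visits nld->franconian, franconian->deu
predict(encoded, "nld", "eng")                           # reuses the nld->franconian weights
print(len(weights))                                      # 4 distinct directed edge matrices so far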

5.2.4 Training and prediction

Training is performed by feeding batches for the language pairs alternately. By training the weights with alternating impressions from different language pairs, I try to make the network learn a generalization over the different languages. The network is trained for 40 epochs, longer than in pairwise word prediction, to make sure the larger number of weights gets trained well.

Predictions are performed per language pair, as in pairwise word prediction. Predictions of protoforms are performed by supplying input words at one input language and collecting the output at the protolanguage decoder.
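The alternating training schedule can be sketched as follows; train_on_batch is a placeholder for one gradient update of the shared network, and the language pairs and word pairs are illustrative.

import random

language_pairs = [("nld", "deu"), ("deu", "nld"), ("nld", "eng"),
                  ("eng", "nld"), ("deu", "eng"), ("eng", "deu")]
training_data = {pair: [("blut", "blut")] * 64 for pair in language_pairs}  # hypothetical word pairs
n_epochs, batch_size = 40, 16

def train_on_batch(pair, batch):
    pass  # placeholder: forward pass along the tree path for this pair, then a weight update

for epoch in range(n_epochs):
    for data in training_data.values():
        random.shuffle(data)                             # reshuffle each pair's word pairs per epoch
    batches = {pair: [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
               for pair, data in training_data.items()}
    # Feed one batch per language pair in turn, so impressions from different pairs alternate.
    for step in range(max(len(b) for b in batches.values())):
        for pair in language_pairs:
            if step < len(batches[pair]):
                train_on_batch(pair, batches[pair][step])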

5.2.5 Experiments

In my experiments, I will perform a pilot study on phylogenetic word prediction, on the family tree of the West-Germanic languages Dutch, German and English. I will evaluate performance when the model assumes the reference family tree, where Dutch and German are more closely related, and English is grouped on a higher level: ((nld, deu), eng) in Newick notation. For comparison, I will also evaluate performance for the "false" (or at least different from reference) trees ((nld, eng), deu) and ((deu, eng), nld). Furthermore, I will generate protoforms from the model which uses the reference tree.

5.3 Results

I performed phylogenetic word prediction for the West-Germanic languages Dutch, German and English, using three models: one assuming the reference tree ((nld, deu), eng) and two assuming the false trees ((nld, eng), deu) and ((deu, eng), nld). I compare the performance of these models to the predictions of the pairwise prediction model (one-hot, v = 1.0) from section 3.1, which I now split out per language pair.

Model                                 deu-eng   deu-nld   eng-deu   eng-nld   nld-deu   nld-eng
PhylNet, ref. tree ((nld,deu),eng)    0.7273    0.5949    0.7085    0.5919    0.5528    0.6009
PhylNet, false tree ((nld,eng),deu)   0.7726    0.5900    0.6807    0.5647    0.5826    0.6323
PhylNet, false tree ((deu,eng),nld)   0.6889    0.5909    0.6981    0.6199    0.5733    0.5806
Pairwise encoder-decoder              0.6472    0.49155   0.6569    0.5773    0.4676    0.5725

(a) Prediction distances for all language pairs for the languages Dutch, German and English, with three phylogenetic neural networks and a pairwise encoder-decoder. One phylogenetic network is modelled with the true reference tree for these languages, two others are modelled with false reorderings of the reference tree.

Model                                 Pearson correlation
PhylNet, ref. tree ((nld,deu),eng)    0.8945
PhylNet, false tree ((nld,eng),deu)   0.7546
PhylNet, false tree ((deu,eng),nld)   0.8899

(b) Pearson r correlation between results of the different phylogenetic networks and the pairwise encoder-decoder model, on the 6 language pairs shown in (a).

Table 5.1: Phylogenetic word prediction results for three phylogenetic neural networks: one modelled with the true reference tree for these languages, and two modelled with false reorderings of the reference tree.

Table 5.1 shows the results of the three models, the pairwise model, and the Pearson r correlation (Pearson, 1896) between the phylogenetic models and the pairwise model.

From the table, it can be observed that the prediction distances of phylogenetic word prediction are higher than the distances from the pairwise word prediction model. The distances of phylogenetic word prediction are however not very high, and the proportions between the languages follow the pattern seen in pairwise word prediction. This shows that the new architecture of phylogenetic word prediction works reasonably well. One can see that the phylogenetic network based on the reference tree does not generally have a lower prediction distance than the networks based on the false trees. The phylogenetic network following the reference tree does have the highest correlation with the pairwise encoder-decoder, by a small margin.

Next, protoforms are extracted from the ancestral nodes, representing postulated protolanguages. I use the phylogenetic network assuming the reference tree and generate protoforms for Proto-Franconian, the ancestor of Dutch and German, and Proto-West-Germanic, the ancestor of Dutch, German and English. I use Dutch as input language. Table 5.2 shows the inferred forms. Most protoforms seem not very reliable, but some forms of Franconian, such as the form i33n for Dutch inslap3 (fall asleep), seem to capture at least some patterns from the input language.

Input (Dutch)   Protoform          Input (Dutch)   Protoform
blut            oi                 blut            kik
inslap3         i33n               inslap3         op
blot            no                 blot            silEm
wExan           bol                wExan           doi
xlot            no                 xlot            Ekkom
warhEit         fol                warhEit         sE
orbEit          st3                orbEit          sk3r3
mElk3           oi                 mElk3           kutaa
vostbind3       s3rn               vostbind3       sorl
hak             po3                hak             on

(a) Protoforms for Proto-Franconian (ancestor of Dutch and German). (b) Protoforms for Proto-West Germanic (ancestor of Dutch, German and English).

Table 5.2: Reconstructed protoforms, extracted from a phylogenetic neural network trained on Dutch, German and English. The protoforms were extracted by giving Dutch words as input.

5.4 Discussion

I showed the prediction results using a phylogenetic neural network assuming the reference tree and compared them to phylogenetic neural networks using "false" tree structures. The network based on the reference tree performed equally well or only slightly better than the models based on the false trees. I expected that using the right phylogenetic structure would give a clear performance benefit. These results seem to convey that the network in its current setup does not optimally exploit the phylogenetic structure. The universal encoders and decoders require the dense layers, representing tree branches, to model all language-specific sound correspondences between languages. It may be beneficial for the performance to move some language-specific processing to the encoders and decoders. An interesting experiment would be to share the weights of encoders only for the same input language, and to share the weights of decoders only for the same output language. In that setup, a new way will have to be found to decode protoforms, since universal decoders are no longer available.

I generated protoforms for Proto-Franconian and Proto-West Germanic. The forms generally have a high variability and bear little resemblance to the input word. These protoforms are decoded from neural network weights that were trained to perform predictions for a different output language. No specific training is performed on the protoforms, so the weights are not optimized for the generation of protoforms. In future experiments, it may be good to perform training on the protoforms, to constrain their form more. Because there is no target word available for the protoforms, a loss function may be based on the characteristics of the outputted protoform itself, for example its word length, or on its similarity to the input word.
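As an illustration of the kind of loss hinted at here, the sketch below scores a candidate protoform by its length difference with the input word plus its normalized edit distance to that word. Both components and their equal weighting are assumptions for the sake of the example, not a method from the thesis.

def edit_distance(a, b):
    # Plain Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def protoform_loss(protoform, input_word, length_weight=0.5, similarity_weight=0.5):
    max_len = max(len(protoform), len(input_word), 1)
    length_term = abs(len(protoform) - len(input_word)) / max_len
    similarity_term = edit_distance(protoform, input_word) / max_len
    return length_weight * length_term + similarity_weight * similarity_term

print(protoform_loss("i33n", "inslap3"))     # a short, dissimilar form receives a higher loss
print(protoform_loss("inslap", "inslap3"))   # a form close to the input receives a lower loss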

5.5 Summary

In this chapter, I presented the phylogenetic word prediction task, a linguistically informed extension of word prediction, in which information is shared between language pairs according to the assumed phylogenetic tree of the languages. This task enables the reconstruction of protoforms and may lead to a joint performance of word prediction and phylogenetic tree reconstruction in the future. I developed a neural network architecture and training regime, making extensive use of weight sharing, to perform this task. Experimental results show that the model works, although the phylogenetic structure still has to be exploited more. I therefore propose several future modifications to improve the performance of the models.

Chapter 6

Conclusion and discussion

In this thesis, I applied the machine learning paradigm, successful in many computing tasks, to historical linguistics. I proposed the task of word prediction: by training a machine learning model on pairs of words in two languages, it learns the sound correspondences between the two languages and should be able to predict unseen words. I used two neural network models, a recurrent neural network (RNN) encoder-decoder and a structured perceptron, to perform this task. I have shown that, by performing the task of word prediction, results for multiple tasks in historical linguistics can be obtained, such as phylogenetic tree reconstruction, identification of sound correspondences and cognate detection. On top of this, I showed that the task of word prediction can be extended to phylogenetic word prediction, which could be used for protoform reconstruction and a joint performance of word prediction and phylogenetic tree reconstruction in the future.

By combining insights from two fields, machine learning and historical linguistics, this thesis provides some notable contributions. Firstly, to my knowledge, this is the first publication to use a deep neural network as a model of sound correspondences in historical linguistics. Secondly, in this thesis I propose a new cognacy prior loss, enabling a neural network to learn more from some training examples than from others. This new loss function has not yet given a clear performance increase in my experiments. I however hope it can be a first step in finding a method to learn more from cognate than from non-cognate training examples, a key issue when applying machine learning to historical linguistics, and to other disciplines. Thirdly, I use embedding encoding, inspired by word embeddings in natural language processing, to encode phonemes in historical linguistics. In my experiments, this encoding seems to work better than the existing one-hot and phonetic encodings. Furthermore, I developed a method to visualize the patterns learned by a neural network by comparing clusterings of network activations and input and target words. Additionally, in this thesis, I introduced a new method to infer cognate judgments from word prediction results. Finally, I propose phylogenetic word prediction, sharing weights between language pairs along a phylogenetic tree, which enables protoform reconstruction from a neural network.

Results of the prediction models on the different tasks are generally at the same level as the baseline models. On the one hand, this means that I am not yet in a position to gather new insights about language ancestry from my models. On the other hand, the methods developed in this thesis do open up new possibilities. By extending the task to phylogenetic word prediction, the phylogenetic structure is immediately taken into account at prediction time. In the future, it may be possible to perform word prediction and phylogenetic tree reconstruction at the same time, by evaluating word prediction for different tree structures. Also, as shown with some first examples in this thesis, protoforms can be derived from the phylogenetic neural network. This is relevant, since protoform reconstruction is one of the tasks in historical linguistics for which fewer computational methods are already available.


A striking fact is that the simpler structured perceptron model performs better on word prediction than the more complex encoder-decoder model. Apparently, the larger number of parameters and the potentially beneficial word-length-independent representation between encoder and decoder do not give a benefit. However, the encoder-decoder gives more possibilities to identify the learned representations, by visualizing network layers. Also, the encoder-decoder can be extended to a phylogenetic neural network, opening up new prospects, as described in the last paragraph. In the future, a fine-grained analysis of the solutions the network has found, possibly using toy data expressing certain phonological changes, can be valuable. Another area in which future work could be performed is phylogenetic word prediction. Experiments with different weight sharing strategies, with more language-specific encoders and decoders, could lead to better performance. Additionally, the protoforms could possibly be improved by optimizing the network for the prediction of these forms, using a specific loss function.

With this thesis, I hope to contribute to future insights about the ancestry of languages. By applying computational methods in historical linguistics, advances have been made in recent years. In this thesis, I built further upon this development and proposed a central role for machine learning in historical linguistics. This is motivated both from a practical perspective, since machine learning has shown successes in many other research areas, and from a fundamental perspective, given the observed parallel between regular sound change and generalization in machine learning. I am looking forward to the new findings in historical linguistics that may follow from this new line of methods.

Acknowledgments

I would like to thank my supervisors, Jelle Zuidema and Gerhard Jäger, for their enthusiasm about an unexplored topic and valuable new ideas, methods and data with which they supplied me. I am also grateful to the members of their research groups, for useful comments during discussions. Thanks to all people who commented or advised on parts of my thesis: Arjen Versloot, Marian Klamer, Dieuwke Hupkes, Minh Ngo, Taraka Rama, Johann-Mattis List and the anonymous reviewer of an earlier workshop abstract.

Chapter 7

Appendix


ASJPcode   Description                                                                        IPA
i          high front vowel, rounded and unrounded                                            i, ɪ, y, ʏ
e          mid front vowel, rounded and unrounded                                             e, ø
E          low front vowel, rounded and unrounded                                             a, æ, ɛ, ɶ, œ
3          high and mid central vowel, rounded and unrounded                                  ɨ, ɘ, ə, ɜ, ʉ, ɵ, ɞ
a          low central vowel, unrounded                                                       ɐ
u          high back vowel, rounded and unrounded                                             ɯ, u
o          mid and low back vowel, rounded and unrounded                                      ɤ, ʌ, ɑ, o, ɔ, ɒ
p          voiceless bilabial stop and fricative                                              p, ɸ
b          voiced bilabial stop and fricative                                                 b, β
m          bilabial nasal                                                                     m
f          voiceless labiodental fricative                                                    f
v          voiced labiodental fricative                                                       v
8          voiceless and voiced dental fricative                                              θ, ð
4          dental nasal                                                                       n̪
t          voiceless alveolar stop                                                            t
d          voiced alveolar stop                                                               d
s          voiceless alveolar fricative                                                       s
z          voiced alveolar fricative                                                          z
c          voiceless and voiced alveolar affricate                                            ts, dz
n          voiceless and voiced alveolar nasal                                                n
S          voiceless postalveolar fricative                                                   ʃ
Z          voiced postalveolar fricative                                                      ʒ
C          voiceless palato-alveolar affricate                                                tʃ
j          voiced palato-alveolar affricate                                                   dʒ
T          voiceless and voiced palatal stop                                                  c, ɟ
5          palatal nasal                                                                      ɲ
k          voiceless velar stop                                                               k
g          voiced velar stop                                                                  g
x          voiceless and voiced velar fricative                                               x, ɣ
N          velar nasal                                                                        ŋ
q          voiceless uvular stop                                                              q
G          voiced uvular stop                                                                 G
X          voiceless and voiced uvular fricative, voiceless and voiced pharyngeal fricative   χ, ʁ, ħ, ʕ
7          voiceless glottal stop                                                             ʔ
h          voiceless and voiced glottal fricative                                             h, ɦ
l          voiced alveolar lateral approximant                                                l
L          all other laterals                                                                 ʟ, ɭ, ʎ
w          voiced bilabial-velar approximant                                                  w
y          palatal approximant                                                                j
r          voiced apico-alveolar trill and all varieties of "r-sounds"                        r, R, etc.
!          all varieties of "click-sounds"                                                    ǃ, ǀ, ǁ, ǂ

Table 7.1: Table of ASJP phonemes, adopted from Brown et al. (2008).

[Table 7.2, the binary feature matrix per ASJP phoneme, is not reproduced here; its feature columns are Voiced, Labial, Dental, Alveolar, Palatal/Post-alveolar, Velar, Uvular, Glottal, Stop, Fricative, Affricate, Nasal, Approximant, Click, Lateral, Rhotic, High, Mid, Low, Front, Central and Back.]

Table 7.2: Table of phonetic features per phoneme of the ASJP alphabet, used when the phonetic encoding is applied. Adopted from Rama (2016a), with the vowel features based on Table 7.1 (Brown et al., 2008).

ISO code   Language name            ISO code   Language name
abk        Abkhaz                   kor        Korean
ady        Adyghe                   kpv        Komi-Zyrian
ain        Hokkaido Ainu            krl        North Karelian
ale        Aleut                    lat        Latin
arb        Standard Arabic          lav        Latvian
ava        Avar                     lbe        Lak
azj        North Azerbaijani        lez        Lezgian
bak        Bashkir                  lit        Lithuanian
bel        Belarusian               liv        Livonian
ben        Bengali                  mal        Malayalam
bre        Breton                   mdf        Moksha
bsk        Burushaski               mhr        Meadow Mari
bua        Buryat                   mnc        Manchu
bul        Bulgarian                mns        Northern Mansi
cat        Catalan                  mrj        Hill Mari
ces        Czech                    myv        Erzya
che        Chechen                  nio        Nganasan
chv        Chuvash                  niv        Nivkh
ckt        Chukchi                  nld        Dutch
cmn        Mandarin Chinese         nor        Norwegian (Bokmål)
cym        Welsh                    olo        Olonets Karelian
dan        Danish                   oss        Ossetian
dar        Dargwa                   pbu        Northern Pashto
ddo        Tsez                     pes        Western Farsi
deu        German                   pol        Polish
ekk        Estonian                 por        Portuguese
ell        Modern Greek             ron        Romanian
enf        Forest Enets             rus        Russian
eng        English                  sah        Sakha (Yakut)
ess        Central Siberian Yupik   sel        Northern Selkup
eus        Basque                   sjd        Kildin Sami
evn        Evenki                   slk        Slovak
fin        Finnish                  slv        Slovene
fra        French                   sma        Southern Sami
gld        Nanai                    sme        Northern Sami
gle        Irish                    smj        Lule Sami
heb        Modern Hebrew            smn        Inari Sami
hin        Hindi                    sms        Skolt Sami
hrv        Croatian                 spa        Spanish
hun        Hungarian                sqi        Standard Albanian
hye        Armenian                 swe        Swedish
isl        Icelandic                tam        Tamil
ita        Italian                  tat        Tatar
itl        Itelmen                  tel        Telugu
jpn        Japanese                 tur        Turkish
kal        Kalaallisut              udm        Udmurt
kan        Kannada                  ukr        Ukrainian
kat        Georgian                 uzn        Northern Uzbek
kaz        Kazakh                   vep        Veps
kca        Northern Khanty          xal        Kalmyk
ket        Ket                      ykg        Northern Yukaghir
khk        Khalkha Mongolian        yrk        Tundra Nenets
kmr        Northern Kurdish         yux        Southern Yukaghir
koi        Komi-Permyak

Table 7.3: Mapping from the ISO language codes used in this thesis to language names.

Bibliography

Ahn, Y. Y., Bagrow, J. P., and Lehmann, S. (2010). Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764.

Amigó, E., Gonzalo, J., Artiles, J., and Verdejo, F. (2009). A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval, 12(4):461–486.

Avendi, M., Kheradvar, A., and Jafarkhani, H. (2016). A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac mri. Medical image analysis, 30:108–119.

Bagga, A. and Baldwin, B. (1998). Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, pages 79–85. Association for Computational Linguistics.

Baker, J. K. (1979). Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 65(S1):S132–S132.

Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37(6):1554–1563.

Beekes, R. S. (2011). Comparative Indo-European linguistics: an introduction. John Benjamins Publishing.

Beinborn, L., Zesch, T., and Gurevych, I. (2013). Cognate production using character-based machine translation. In IJCNLP, pages 883–891.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Bouchard-Côté, A., Hall, D., Griffiths, T. L., and Klein, D. (2013). Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110(11):4224–4229.

Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., Gray, R. D., Suchard, M. A., and Atkinson, Q. D. (2012). Mapping the origins and expansion of the indo-european language family. Science, 337(6097):957–960.

Brown, C. H., Holman, E. W., Wichmann, S., and Velupillai, V. (2008). Automated classification of the world’s languages: a description of the method and preliminary results. STUF-Language Typology and Universals Sprachtypologie und Universalienforschung, 61(4):285–308.

Bryant, D., Tsang, J., Kearney, P. E., and Li, M. (2000). Computing the quartet distance between evolutionary trees. In Symposium on Discrete Algorithms: Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, volume 9, pages 285–286.


Campbell, L. (2003). Beyond the comparative method? AMSTERDAM STUDIES IN THE THEORY AND HISTORY OF LINGUISTIC SCIENCE SERIES 4, pages 33–58.

Campbell, L. (2013). Historical linguistics: an introduction. MIT Press, second edition edition.

Chang, W., Cathcart, C., Hall, D., and Garrett, A. (2015). Ancestry-constrained phylogenetic analysis supports the indo-european steppe hypothesis. Language, 91(1):194–244.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Ciobanu, A. M. (2016). Sequence labeling for cognate production. Procedia Computer Science, 96:1391– 1399.

Clackson, J. (2007). Indo-European linguistics: an introduction. ”Cambridge University Press”.

Collins, M. (2002). Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 1–8. Association for Computational Linguistics.

Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for qsar predictions. arXiv preprint arXiv:1406.1231.

Daume, H. C. and Marcu, D. (2006). Practical structured learning techniques for natural language processing. University of Southern California.

Dellert, J. and Jäger, G. e. (2017). Northeuralex (version 0.9).

Dieleman, S., Schlüter, J., Raffel, C., Olson, E., Sønderby, S. K., Nouri, D., Maturana, D., Thoma,M., Battenberg, E., Kelly, J., Fauw, J. D., Heilman, M., de Almeida, D. M., McFee, B., Weideman, H.,Takács, G., de Rivaz, P., Crall, J., Sanders, G., Rasul, K., Liu, C., French, G., and Degrave, J. (2015). Lasagne: First release.

Dietterich, T. G. (2002). Machine learning for sequential data: A review. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 15–30. Springer.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Dunn, M. (2012). Indo-European lexical cognacy database (IELex). URL: http://ielex.mpi.nl.

Dunn, M. (2015). Language phylogenies. The Routledge Handbook of Historical Linguistics. Routledge, pages 190–211.

Durie, M. and Ross, M. (1996). The comparative method reviewed: regularity and irregularity in language change. Oxford University Press.

Firth, J. R. (1957). A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis.

Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points. Science, 315:972–976.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323.

Goller, C. and Kuchler, A. (1996). Learning task-dependent distributed representations by backpropagation through structure. In Neural Networks, 1996., IEEE International Conference on, volume 1, pages 347–352. IEEE.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT press.

Gray, R. D. and Atkinson, Q. D. (2003). Language-tree divergence times support the anatolian theory of indo-european origin. Nature, 426(6965):435–439.

Greenberg, J. H. (1957). Essays in linguistics.

Greenhill, S. J., Wu, C.-H., Hua, X., Dunn, M., Levinson, S. C., and Gray, R. D. (2017). Evolutionary dynamics of language systems. Proceedings of the National Academy of Sciences, 114(42):E8822–E8829.

Ha, T.-L., Niehues, J., and Waibel, A. (2016). Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.

Hahnloser, R. H., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., and Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947–951.

Hammarström, H., Bank, S., Forkel, R., and Haspelmath, M. (2017). Glottolog 3.1. Max Planck Institute for the Science of Human History.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.

Hock, H. H. and Joseph, B. D. (2009). Language History, Language Change, Language Relationship: An Introduction to Historical and Interpretative Linguistics. 2 edition.

Hruschka, D. J., Branford, S., Smith, E. D., Wilkins, J., Meade, A., Pagel, M., and Bhattacharya, T. (2015). Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology, 25(1):1–9.

Inkpen, D., Frunza, O., and Kondrak, G. (2005). Automatic identification of cognates and false friends in french and english. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 251–257.

Jäger, G. (2015). Support for linguistic macrofamilies from weighted sequence alignment. Proceedings of the National Academy of Sciences, 112(41):12752–12757.

Jäger, G., List, J.-M., and Sofroniev, P. (2017). Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists. Mayan, 895:0–05.

Jäger, G. and Sofroniev, P. (2016). Automatic cognate classification with a support vector machine.

Jurafsky, D. (2000). Speech & language processing. Pearson Education India.

Kondrak, G. (2002). Determining recurrent sound correspondences by inducing translation models. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pages 1–7. Association for Computational Linguistics.

Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289.

Lari, K. and Young, S. J. (1990). The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4(1):35–56.

Le, P. and Zuidema, W. (2014). The inside-outside recursive neural network model for dependency parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 729–739.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710.

List, J.-M. (2012). LexStat: Automatic detection of cognates in multilingual wordlists. EACL 2012, page 117.

List, J.-M. and Forkel, R. (2016). LingPy: A Python library for historical linguistics.

Mailund, T. and Pedersen, C. N. (2004). QDist – quartet distance between evolutionary trees. Bioinformatics, 20(10):1636–1637.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Mulloni, A. (2007). Automatic prediction of cognate orthography using support vector machines. In Proceedings of the 45th Annual Meeting of the ACL: Student Research Workshop, pages 25–30. Association for Computational Linguistics.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453.

Nichols, J. (1996). The comparative method as heuristic. In Durie, M. and Ross, M., editors, The Compar- ative Method Reviewed: Regularity and Irregularity in Language Change, page 39. Oxford University Press.

Osthoff, H. and Brugmann, K. (1880). Morphologische Untersuchungen auf dem Gebiete der indogerman- ischen Sprachen, volume 3. S. Hirzel.

Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 187:253–318.

Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572.

Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Pompei, S., Loreto, V., and Tria, F. (2011). On the accuracy of language trees. PLoS One, 6(6):e20109.

Rama, T. (2016a). Siamese convolutional networks based on phonetic features for cognate identification. arXiv preprint arXiv:1605.05172.

Rama, T. (2016b). Siamese convolutional networks for cognate identification.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–538.

Saitou, N. and Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.

Schmidhuber, J., Gers, F., and Eck, D. (2006). Learning nonregular languages: A comparison of simple recurrent networks and LSTM. Learning, 14(9).

Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection with unsupervised multi-stage feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3626–3633.

Socher, R., Manning, C. D., and Ng, A. Y. (2010). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, pages 1–9.

Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 28:1409–1438.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Trask, L. (1996). Historical Linguistics. Arnold.

van Dongen, S. M. (2000). Graph clustering by flow simulation. PhD thesis, University of Utrecht.

Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.