
Improved Reconstruction of Protolanguage Word Forms

Alexandre Bouchard-Côté* Thomas L. Griffiths† Dan Klein*
*Computer Science Division, †Department of Psychology
University of California at Berkeley, Berkeley, CA 94720

Abstract

We present an unsupervised approach to reconstructing ancient word forms. The present work addresses three limitations of previous work. First, previous work focused on faithfulness features, which model changes between successive languages. We add markedness features, which model well-formedness within each language. Second, we introduce universal features, which support generalizations across languages. Finally, we increase the number of languages to which these methods can be applied by an order of magnitude by using improved inference methods. Experiments on the reconstruction of Proto-Oceanic, Proto-Malayo-Javanic, and Classical Latin show substantial reductions in error rate, giving the best results to date.

1 Introduction

A central problem in diachronic linguistics is the reconstruction of ancient languages from their modern descendants (Campbell, 1998). Here, we consider the problem of reconstructing phonological forms, given a known linguistic phylogeny and known cognate groups. For example, Figure 1 (a) shows a collection of word forms in several Oceanic languages, all meaning to cry. The ancestral form in this case has been presumed to be /taŋis/ in Blust (1993). We are interested in models which take as input many such word tuples, each representing a cognate group, along with a language tree, and induce word forms for hidden ancestral languages.

The traditional approach to this problem has been the comparative method, in which reconstructions are done manually using assumptions about the relative probability of different kinds of changes (Hock, 1986). There has been work attempting to automate part (Durham and Rogers, 1969; Eastlack, 1977; Lowe and Mazaudon, 1994; Covington, 1998; Kondrak, 2002) or all of the process (Oakes, 2000; Bouchard-Côté et al., 2008). However, previous automated methods have been unable to leverage three important ideas a linguist would employ. We address these omissions here, resulting in a more powerful method for automatically reconstructing ancient protolanguages.

First, linguists triangulate reconstructions from many languages, while past work has been limited to small numbers of languages. For example, Oakes (2000) used four languages to reconstruct Proto-Malayo-Javanic (PMJ) and Bouchard-Côté et al. (2008) used two languages to reconstruct Classical Latin (La). We revisit these small datasets and show that our method significantly outperforms these previous systems. However, we also show that our method can be applied to a much larger data set (Greenhill et al., 2008), reconstructing Proto-Oceanic (POc) from 64 modern languages. In addition, performance improves with more languages, which was not the case for previous methods.

Second, linguists exploit knowledge of phonological universals. For example, small changes in vowel height or consonant place are more likely than large changes, and much more likely than changes to arbitrarily different phonemes. In a statistical system, one could imagine either manually encoding or automatically inferring such preferences. We show that both strategies are effective.

Finally, linguists consider not only how languages change, but also how they are internally consistent. Past models described how sounds do (or, more often, do not) change between nodes in the tree. To borrow broad terminology from the Optimality Theory literature (Prince and Smolensky, 1993), such models incorporated faithfulness features, capturing the ways in which successive forms remained similar to one another. However, each language has certain regular phonotactic patterns which constrain these changes.

We encode such patterns using markedness features, characterizing the internal phonotactic structure of each language. Faithfulness and markedness play roles analogous to the channel and language models of a noisy-channel system. We show that markedness features improve reconstruction, and can be used efficiently.

2 Related work

Our focus in this section is on describing the properties of the two previous systems for reconstructing ancient word forms to which we compare our method. Citations for other related work, such as similar approaches to using faithfulness and markedness features, appear in the body of the paper.

In Oakes (2000), the word forms in a given protolanguage are reconstructed using a Viterbi multi-alignment between a small number of its descendant languages. The alignment is computed using hand-set parameters. Deterministic rules characterizing changes between pairs of observed languages are extracted from the alignment when their frequency is higher than a threshold, and a proto-phoneme inventory is built using linguistically motivated rules and parsimony. A reconstruction of each observed word is first proposed independently for each language. If at least two reconstructions agree, a majority vote is taken; otherwise, no reconstruction is proposed. This approach has several limitations. First, it is not tractable for larger trees, since the time complexity of the multi-alignment algorithm grows exponentially in the number of languages. Second, deterministic rules, while elegant in theory, are not robust to noise: even in experiments with only four daughter languages, a large fraction of the words could not be reconstructed.

In Bouchard-Côté et al. (2008), a stochastic model of sound change is used and reconstructions are inferred by performing probabilistic inference over an evolutionary tree expressing the relationships between languages. The model does not support generalizations across languages, and has no way to capture phonotactic regularities within languages. As a consequence, the resulting method does not scale to large phylogenies. The work we present here addresses both of these issues, with a richer model and faster inference allowing improved reconstruction and increased scale.

3 Model

We start this section by introducing some notation. Let τ be a tree of languages, such as the examples in Figure 3 (c-e). In such a tree, the modern languages, whose word forms will be observed, are the leaves of τ. All internal nodes, particularly the root, are languages whose word forms are not observed. Let L denote all languages, modern and otherwise. All word forms are assumed to be strings in Σ*, where Σ is the International Phonetic Alphabet (IPA).¹

We assume that word forms evolve along the branches of the tree τ. However, it is not the case that each cognate set exists in each modern language. Formally, we assume there to be a known list of C cognate sets. For each c ∈ {1, ..., C}, let L(c) denote the subset of modern languages that have a word form in the c-th cognate set. For each set c ∈ {1, ..., C} and each language ℓ ∈ L(c), we denote the modern word form by w_{cℓ}. For cognate set c, only the minimal subtree τ(c) containing L(c) and the root is relevant to the reconstruction inference problem for that set.

From a high-level perspective, the generative process is quite simple. Let c be the index of the current cognate set, with topology τ(c). First, a word is generated for the root of τ(c) using an (initially unknown) root language model (a distribution over strings). The other nodes of the tree are drawn incrementally as follows: for each edge ℓ → ℓ′ in τ(c), use a branch-specific distribution over changes in strings to generate the word at node ℓ′.

In the remainder of this section, we clarify the exact form of the conditional distributions over string changes, the distribution over strings at the root, and the parameterization of this process.

¹ The choice of a phonemic representation is motivated by the fact that most of the data available comes in this form. Diacritics are available in a smaller number of languages and may vary across dialects, so we discarded them in this work.

3.1 Markedness and Faithfulness

In Optimality Theory (OT) (Prince and Smolensky, 1993), two types of constraints influence the selection of a realized output given an input form: faithfulness and markedness constraints. Faithfulness encourages similarity between the input and output, while markedness favors well-formed output.

[Figure 1: (a) A cognate set from the Austronesian dataset; all word forms mean to cry: Proto-Oceanic /taŋis/, Lau /aŋi/, Kwara'ae /angi/, Taiof /taŋis/. (b-d) The mutation model used in this paper. (b) The mutation of POc /taŋis/ to Kwara'ae /angi/: the top word x = x₁…x₇ = # t a ŋ i s # generates the bottom word in chunks y₁…y₇, with /t/ and /s/ deleted, /a/ and /i/ unchanged, and /ŋ/ substituted to /n/ followed by an insertion of /g/. (c) Graphical model depicting the dependencies among variables in one step of the mutation Markov chain. (d) Active features for one step in this process. (e-f) Comparison of two inference procedures on trees: single sequence resampling (e) draws one sequence at a time, conditioned on its parent and children, while ancestry resampling (f) draws an aligned slice from all words simultaneously. In large trees, the latter is more efficient than the former.]

Viewed from this perspective, previous computational approaches to reconstruction are based almost exclusively on faithfulness, expressed through a mutation model. Only the words in the language at the root of the tree, if any, are explicitly encouraged to be well-formed. In contrast, we incorporate constraints on markedness for each language, with both general and branch-specific constraints on faithfulness. This is done using a lexicalized stochastic string transducer (Varadarajan et al., 2008).

We now make precise the conditional distributions over pairs of evolving strings, referring to Figure 1 (b-d). Consider a language ℓ′ evolving to ℓ for cognate set c. Assume we have a word form x = w_{cℓ′}. The generative process for producing y = w_{cℓ} works as follows. First, we consider x to be composed of characters x₁x₂…xₙ, with the first and last being a special boundary symbol # ∈ Σ which is never deleted, mutated, or created. The process generates y = y₁y₂…yₙ in n chunks yᵢ ∈ Σ*, i ∈ {1, ..., n}, one for each xᵢ. The yᵢ's may be a single character, multiple characters, or even empty. In the example shown, all three of these cases occur.

To generate yᵢ, we define a mutation Markov chain that incrementally adds zero or more characters to an initially empty yᵢ. First, we decide whether the current phoneme in the top word, t = xᵢ, will be deleted, in which case yᵢ = ∅, as in the example of /s/ being deleted. If t is not deleted, we choose a single substitution character in the bottom word. This is the case both when /a/ is unchanged and when /ŋ/ substitutes to /n/. We write S = Σ ∪ {ζ} for this set of outcomes, where ζ is the special outcome indicating deletion. Importantly, the probabilities of this multinomial can depend on both the previous character generated so far (i.e., the rightmost character p of yᵢ₋₁) and the current character in the previous generation string (t). As we will see shortly, this allows modelling markedness and faithfulness at every branch, jointly. This multinomial decision acts as the initial distribution of the mutation Markov chain.

We consider insertions only if a deletion was not selected in the first step. Here, we draw from a multinomial over S, where this time the special outcome ζ corresponds to stopping insertions, and the other elements of S correspond to symbols that are appended to yᵢ. In this case, the conditioning environment is t = xᵢ and the current rightmost symbol p in yᵢ. Insertions continue until ζ is selected. In the example, we follow the substitution of /ŋ/ to /n/ with an insertion of /g/, followed by a decision to stop that yᵢ. We will use θ_{S,t,p,ℓ} and θ_{I,t,p,ℓ} to denote the probabilities over the substitution and insertion decisions in the current branch ℓ′ → ℓ.
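The following sketch makes the two-stage sampling process concrete. It is an illustration of the mutation chain as described above, not the authors' code; `theta_subst` and `theta_ins` are hypothetical names for the θ_{S,t,p,ℓ} and θ_{I,t,p,ℓ} tables of a fixed branch, and `ZETA` stands for the special outcome ζ.

```python
import random

# Illustrative sketch of the branch mutation process described above (not the
# authors' code). theta_subst[(t, p)] and theta_ins[(t, p)] are assumed to map
# a (top character, previous output character) context to a distribution over
# S = Sigma + {ZETA}; drawing ZETA means "delete" in the first step and "stop
# inserting" afterwards. Parameters are expected to force '#' -> '#'.
ZETA = "<zeta>"

def draw(dist):
    """Sample a symbol from a {symbol: probability} dictionary."""
    symbols, probs = zip(*dist.items())
    return random.choices(symbols, weights=probs)[0]

def mutate(x, theta_subst, theta_ins):
    """Generate y from x = '#...#', one chunk y_i per top character x_i."""
    chunks = []
    prev = "#"  # rightmost character generated so far (p in the text)
    for t in x:  # current character in the top word
        out = draw(theta_subst[(t, prev)])
        if out == ZETA:          # t is deleted: y_i is empty
            chunks.append("")
            continue
        chunk = [out]            # single substitution character
        prev = out
        while True:              # insertions until ZETA is drawn
            ins = draw(theta_ins[(t, prev)])
            if ins == ZETA:
                break
            chunk.append(ins)
            prev = ins
        chunks.append("".join(chunk))
    return "".join(chunks)
```

On the running example, a draw of ζ for t = /t/, the pair (/n/, /g/) for t = /ŋ/, and ζ for t = /s/ yields exactly the derivation of /angi/ from /taŋis/ shown in Figure 1 (b).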
A similar process generates the word at the root ℓ of a tree, treating this word as a single string y₁ generated from a dummy ancestor t = x₁. In this case, only the insertion probabilities matter, and we separately parameterize these probabilities with θ_{R,t,p,ℓ}. There is no actual dependence on t at the root, but this formulation allows us to unify the parameterization, with each θ_{ω,t,p,ℓ} ∈ ℝ^{|Σ|+1} where ω ∈ {R, S, I}.

3.2 Parameterization

Instead of directly estimating the transition probabilities of the mutation Markov chain (as the parameters of a collection of multinomial distributions), we express them as the output of a log-linear model. We used the following feature templates:

OPERATION identifies whether an operation in the mutation Markov chain is an insertion, a deletion, a substitution, a self-substitution (i.e., of the form x → y, x = y), or the end of an insertion event. Examples in Figure 1 (d): 1[Subst] and 1[Insert].

MARKEDNESS consists of language-specific n-gram indicator functions for all symbols in Σ. Only unigram and bigram features are used for computational reasons, but we show in Section 5 that this already captures important constraints. Examples in Figure 1 (d): the bigram indicator 1[(n g)@Kw] (Kw stands for Kwara'ae, a language of the Solomon Islands), and the unigram indicators 1[(n)@Kw] and 1[(g)@Kw].

FAITHFULNESS consists of indicators for mutation events of the form 1[x → y], where x ∈ Σ, y ∈ S. Examples: 1[ŋ → n], 1[ŋ → n@Kw].

Feature templates similar to these can be found, for instance, in Dreyer et al. (2008) and Chen (2003), in the context of string-to-string transduction. Note also the connection with stochastic OT (Goldwater and Johnson, 2003; Wilson, 2006), where a log-linear model mediates markedness and faithfulness in the production of an output form from an underlying input form.

3.3 Parameter sharing

Data sparsity is a significant challenge in protolanguage reconstruction. While the experiments we present here use an order of magnitude more languages than previous computational approaches, the increase in observed data also brings with it additional unknowns in the form of intermediate protolanguages. Since there is one set of parameters for each language, adding more data is not sufficient for increasing the quality of the reconstruction: we show in Section 5.2 that adding extra languages can actually hurt reconstruction using previous methods. It is therefore important to share parameters across different branches in the tree in order to benefit from having observations from more languages.

As an example of useful parameter sharing, consider the faithfulness features 1[/p/ → /b/] and 1[/p/ → /r/], which are indicator functions for the appearance of two substitutions for /p/. We would like the model to learn that the former event (a simple voicing change) should be preferred over the latter. In Bouchard-Côté et al. (2008), this has to be learned for each branch in the tree. The difficulty is that not all branches will have enough information to learn this preference, meaning that we need to define the model in such a way that it can generalize across languages.

We used the following technique to address this problem: we augment the sufficient statistics of Bouchard-Côté et al. (2008) to include the current language (or language at the bottom of the current branch) and use a single, global weight vector instead of a set of branch-specific weights. Generalization across branches is then achieved by using features that ignore ℓ, while branch-specific features depend on ℓ.

For instance, in Figure 1 (d), 1[ŋ → n] is an example of a universal (global) feature shared across all branches, while 1[ŋ → n@Kw] is branch-specific. Similarly, all of the features in OPERATION, MARKEDNESS and FAITHFULNESS have universal and branch-specific versions.

3.4 Objective function

Concretely, the transition probabilities of the mutation and root generation are given by:

\[
\theta_{\omega,t,p,\ell}(\xi) = \frac{\exp\{\langle \lambda, f(\omega,t,p,\ell,\xi) \rangle\}}{Z(\omega,t,p,\ell,\lambda)} \times \mu(\omega,t,\xi),
\]

where ξ ∈ S, f : {S, I, R} × Σ × Σ × L × S → ℝᵏ is the sufficient statistics or feature function, ⟨·,·⟩ denotes inner product, and λ ∈ ℝᵏ is a weight vector. Here, k is the dimensionality of the feature space of the log-linear model. In the terminology of exponential families, Z and μ are the normalization function and reference measure respectively:

\[
Z(\omega,t,p,\ell,\lambda) = \sum_{\xi' \in S} \exp\{\langle \lambda, f(\omega,t,p,\ell,\xi') \rangle\}
\]

\[
\mu(\omega,t,\xi) =
\begin{cases}
0 & \text{if } \omega = S,\; t = \#,\; \xi \neq \# \\
0 & \text{if } \omega = R,\; \xi = \zeta \\
0 & \text{if } \omega \neq R,\; t \neq \#,\; \xi = \# \\
1 & \text{otherwise}
\end{cases}
\]

Here, μ is used to handle boundary conditions.
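As an illustration of how a single global weight vector yields all of these multinomials, the sketch below scores each outcome with universal and branch-specific copies of each feature and normalizes. The feature names are hypothetical simplifications of the templates above, and the minimal `mu` keeps only the uncontroversial boundary case (the paper's μ has additional root-specific cases).

```python
import math
from collections import defaultdict

ZETA = "<zeta>"  # deletion / stop-insertion outcome

def mu(omega, t, xi):
    # Reference measure handling the boundary symbol '#': under substitution,
    # '#' must map to itself. (Simplified; see the cases in Section 3.4.)
    if omega == "S" and t == "#" and xi != "#":
        return 0.0
    return 1.0

def features(omega, t, p, ell, xi):
    # Each template emits a universal feature and a branch-specific copy
    # tagged with the language ell, as in Section 3.3.
    if omega == "S":
        op = "Self" if xi == t else ("Delete" if xi == ZETA else "Subst")
    else:
        op = "StopInsert" if xi == ZETA else "Insert"
    feats = [("OP", op), ("OP", op, ell)]                 # OPERATION
    feats += [("FAITH", t, xi), ("FAITH", t, xi, ell)]    # FAITHFULNESS
    if xi != ZETA:                                        # MARKEDNESS n-grams
        feats += [("UNI", xi), ("UNI", xi, ell),
                  ("BI", p, xi), ("BI", p, xi, ell)]
    return feats

def theta(omega, t, p, ell, weights, alphabet):
    """theta_{omega,t,p,ell} as a log-linear (exponential-family) multinomial."""
    outcomes = list(alphabet) + [ZETA]                    # S = Sigma + {zeta}
    scores = {xi: mu(omega, t, xi) *
                  math.exp(sum(weights[f] for f in features(omega, t, p, ell, xi)))
              for xi in outcomes}
    Z = sum(scores.values())
    return {xi: s / Z for xi, s in scores.items()}

# Example: untrained (zero) weights give a uniform distribution over S.
w = defaultdict(float)
print(theta("S", "p", "#", "Kw", w, ["p", "b", "a"]))
```

Because the same weight entry, e.g. ("FAITH", "p", "b"), is shared by every branch while ("FAITH", "p", "b", ell) is not, evidence for a /p/ → /b/ voicing change on one branch raises its probability everywhere, which is exactly the generalization mechanism of Section 3.3.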

We will also need the following notation: let P_λ(·) and P_λ(·|·) denote the root and branch probability models described in Section 3.1 (with transition probabilities given by the above log-linear model); I(c), the set of internal (non-leaf) nodes in τ(c); pa(ℓ), the parent of language ℓ; r(c), the root of τ(c); and W(c) = (Σ*)^{|I(c)|}. We can summarize our objective function as follows:

\[
\sum_{c=1}^{C} \log \sum_{\vec{w} \in W(c)} P_\lambda\big(w_{c,r(c)}\big) \prod_{\ell \in \tau(c),\, \ell \neq r(c)} P_\lambda\big(w_{c,\ell} \mid w_{c,\mathrm{pa}(\ell)}\big) \; - \; \frac{\|\lambda\|_2^2}{2\sigma^2}
\]

The second term is a standard L2 regularization penalty (we used σ² = 1).

4 Learning algorithm

Learning is done using a Monte Carlo variant of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The M step is convex and computed using L-BFGS (Liu et al., 1989), but the E step is intractable (Lunter et al., 2003), so we used a Markov chain Monte Carlo (MCMC) approximation (Tierney, 1994). At E step t = 1, 2, ..., we simulated the chain for O(t) iterations; this regime is necessary for convergence (Jank, 2005).

In the E step, the inference problem is to compute an expectation under the posterior over strings in a protolanguage given observed word forms at the leaves of the tree. The typical approach in biology or historical linguistics (Holmes and Bruno, 2001; Bouchard-Côté et al., 2008) is to use Gibbs sampling, where the entire string at a single node in the tree is sampled, conditioned on its parent and children. This sampling domain is shown in Figure 1 (e), where the middle word is completely resampled but adjacent words are fixed. We will call this method Single Sequence Resampling (SSR). While conceptually simple, this approach suffers from problems in large trees (Holmes and Bruno, 2001). Consequently, we use a different MCMC procedure, called Ancestry Resampling (AR), that alleviates the mixing problems (Figure 1 (f)). This method was originally introduced for biological applications (Bouchard-Côté et al., 2009), but commonalities between the biological and linguistic cases make it possible to use it in our model.

Concretely, the problem with SSR arises when the tree under consideration is large or unbalanced. In this case, it can take a long time for information from the observed languages to propagate to the root of the tree. Indeed, samples at the root will initially be independent of the observations. AR addresses this problem by resampling one thin vertical slice of all sequences at a time, called an ancestry. For the precise definition, see Bouchard-Côté et al. (2009). Slices condition on observed data, avoiding the problems mentioned above, and can propagate information rapidly across the tree.
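Schematically, the training loop interleaves MCMC-based expectations with convex weight updates. The sketch below shows only the control flow; `sample_reconstructions`, `feature_expectations`, and `lbfgs_maximize` are hypothetical stand-ins for the Ancestry Resampling E step and the L-BFGS M step.

```python
def monte_carlo_em(model, cognate_sets, num_em_steps=20, base_sweeps=100):
    """Monte Carlo EM sketch: the number of MCMC sweeps grows linearly with
    the EM step index t, matching the O(t) regime cited above (Jank, 2005)."""
    for t in range(1, num_em_steps + 1):
        # E step (approximate): sample ancestral word forms given the
        # observed leaves, via MCMC (Ancestry Resampling in the paper).
        samples = model.sample_reconstructions(cognate_sets,
                                               sweeps=base_sweeps * t)
        expected_counts = model.feature_expectations(samples)
        # M step (exact, convex): refit the global weight vector lambda on
        # the L2-regularized expected complete log-likelihood.
        model.weights = model.lbfgs_maximize(expected_counts, sigma2=1.0)
    return model
```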
5 Experiments

We performed a comprehensive set of experiments to test the new method for reconstruction outlined above. In Section 5.1, we analyze in isolation the effects of varying the set of features, the number of observed languages, the topology, and the number of iterations of EM. In Section 5.2, we compare performance to an oracle and to three other systems. Evaluation of all methods was done by computing the Levenshtein distance (Levenshtein, 1966) between the reconstruction produced by each method and the reconstruction produced by linguists. We averaged this distance across reconstructed words to report a single number for each method. We show in Table 2 the average word length in each corpus; note that the Latin average is much larger, which explains the higher errors on the Romance dataset. The statistical significance of all performance differences is assessed using a paired t-test with significance level 0.05.

5.1 Evaluating system performance

We used the Austronesian Basic Vocabulary Database (Greenhill et al., 2008) as the basis for a series of experiments used to evaluate the performance of our system and the factors relevant to its success. The database contained, as of November 2008, 124,468 lexical items from 587 languages, mostly from the Austronesian language family. The database includes partial cognacy judgments and IPA transcriptions, as well as a few reconstructed protolanguages. A reconstruction of Proto-Oceanic (POc) originally developed by Blust (1993) using the comparative method was the basis for evaluation.

We used the cognate information provided in the database, automatically constructing a global tree² and set of subtrees from the cognate set indicator matrix M(ℓ, c) = 1[ℓ ∈ L(c)], c ∈ {1, ..., C}, ℓ ∈ L. For constructing the global tree, we used the implementation of neighbor joining in the Phylip package (Felsenstein, 1989). We used a distance based on overlap, d(ℓ₁, ℓ₂) = Σ_{c=1}^{C} M(ℓ₁, c) M(ℓ₂, c). We bootstrapped 1000 samples and formed an accurate (90%) consensus tree. The tree obtained is not binary, but the AR inference algorithm scales linearly in the branching factor of the tree (in contrast, SSR scales exponentially (Lunter et al., 2003)).

² The dataset included a tree, but it was out of date as of November 2008 (Greenhill et al., 2008).
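For concreteness, the guide-tree statistic can be computed directly from the indicator matrix; the NumPy sketch below uses hypothetical variable names. Note that d as defined counts shared cognate sets, so it is a similarity; some conversion to a dissimilarity is needed before handing it to neighbor joining, a detail the text does not spell out.

```python
import numpy as np

# M[l, c] = 1 iff language l has a word form in cognate set c
# (toy 3-language, 4-cognate-set example).
M = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 0]])

# d(l1, l2) = sum_c M[l1, c] * M[l2, c]: the number of cognate sets in
# which both languages are attested, for all pairs at once.
d = M @ M.T
print(d)  # 3x3 matrix of pairwise overlaps
```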

The first claim we verified experimentally is that having more observed languages aids reconstruction of protolanguages. To test this hypothesis, we added observed modern languages in increasing order of the distance d to the target reconstruction of POc, so that the languages that are most useful for POc reconstruction are added first. This prevents the effects of adding a close language after several distant ones being confused with an improvement produced by increasing the number of languages.

The results are reported in Figure 2 (left). They confirm that large-scale inference is desirable for automatic protolanguage reconstruction: going from 2-to-4, 4-to-8, 8-to-16, and 16-to-32 languages all significantly helped reconstruction. There was still an improvement in average edit distance of 0.05 from 32 to 64 languages, although this last difference was not statistically significant.

[Figure 2: Left: Mean distance to the target reconstruction of POc as a function of the number of modern languages used by the inference procedure. Right: Mean distance and confidence intervals as a function of the EM iteration, averaged over 20 random seeds and run on 4 languages.]

We then conducted a number of experiments intended to assess the robustness of the system, and to identify the contribution made by the different factors it incorporates. First, we ran the system with 20 different random seeds to assess the stability of the solutions found. In each case, learning was stable and accuracy improved during training; see Figure 2 (right).

Next, we found that all of the following ablations significantly hurt reconstruction: using a flat tree (in which all languages are equidistant from the reconstructed root and from each other) instead of the consensus tree, dropping the markedness features, dropping the faithfulness features, and disabling sharing across branches. The results of these experiments are shown in Table 1.

Table 1: Effects of ablation of various aspects of our system on mean edit distance to POc. -Sharing corresponds to the restriction to the subset of features that condition on the current language (i.e., disabling the universal features); -Topology corresponds to using a flat topology where the only edges in the tree connect modern languages to POc. The semi-supervised system is described in the text. All differences (compared to the unsupervised full system) are statistically significant.

Condition                | Edit dist.
Unsupervised full system | 1.87
-MARKEDNESS              | 2.02
-FAITHFULNESS            | 2.18
-Sharing                 | 1.99
-Topology                | 2.06
Semi-supervised system   | 1.75

For comparison, we also included in the same table the performance of a semi-supervised system trained by K-fold validation. The system was run K = 5 times, with a different (1 − K⁻¹) fraction of the POc word forms given to the system (as observations in the graphical model) in each run, so that the held-out portions are disjoint.³ It is semi-supervised in the sense that gold reconstructions for many internal nodes are not available in the dataset (for example, the common ancestor of Kwara'ae (Kw.) and Lau in Figure 3 (b)), so these are still not filled.

³ We also tried a fully supervised system where a flat topology is used so that all of these latent internal nodes are avoided, but it did not perform as well.

Figure 3 (b) shows the results of a concrete run of the experiment of Table 1 over 32 languages, zooming in to a pair of the Solomonic languages and the cognate set from Figure 1 (a). In the example shown, the reconstruction is as good as that of the ORACLE (described in Section 5.2), though off by one character (the final /s/ is not present in any of the 32 inputs and is therefore not reconstructed). In (a), the diagrams show, for both the global and the local (Kwara'ae) features, the expectations of each substitution superimposed on an IPA sound chart, as well as a list of the top changes. Darker lines indicate higher counts. This run did not use natural class constraints, but it can be seen that linguistically plausible substitutions are learned. The global features prefer a range of voicing changes, manner changes, adjacent vowel motion, and so on, including mutations like /s/ to /h/ which are common but poorly represented in a naive attribute-based natural class scheme. On the other hand, the features local to the language Kwara'ae pick out the subset of these changes which are active in that branch, such as /s/ → /t/ fortition.

[Figure 3 (c-e), the phylogenetic trees, appeared here; only their leaf labels (modern language names such as Nggela, Bugotu, Tape, Avava, and so on, plus the roots POc, PMJ, and La) survive extraction. See the Figure 3 caption before the references.]

5.2 Comparisons against other methods

The first two competing methods, PRAGUE and BCLKG, are described in Oakes (2000) and Bouchard-Côté et al. (2008) respectively and summarized in Section 2. Neither approach scales well to large datasets. In the first case, the bottleneck is the complexity of computing multi-alignments without guide trees and the vanishing probability that independent reconstructions agree. In the second case, the problem comes from the unregularized proliferation of parameters and slow mixing of the inference algorithm. For this reason, we built a third baseline that scales well to large datasets.

This third baseline, CENTROID, computes the centroid of the observed word forms in Levenshtein distance. Let L(x, y) denote the Levenshtein distance between word forms x and y. Ideally, we would like the baseline to return argmin_{x ∈ Σ*} Σ_{y ∈ O} L(x, y), where O = {y₁, ..., y_{|O|}} is the set of observed word forms. Note that the optimum is not changed if we restrict the minimization to be taken over x ∈ Σ(O)* such that m ≤ |x| ≤ M, where m = minᵢ |yᵢ|, M = maxᵢ |yᵢ|, and Σ(O) is the set of characters occurring in O. Even with this restriction, this optimization is intractable. As an approximation, we considered only strings built from at most k contiguous substrings taken from the word forms in O. If k = 1, then it is equivalent to taking the min over x ∈ O. At the other end of the spectrum, if k = M, it is exact. This scheme is exponential in k, but since words are relatively short, we found that k = 2 often finds the same solution as higher values of k. The difference was in all cases not statistically significant, so we report the approximation k = 2 in what follows.
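A direct, if brute-force, rendering of the k = 2 approximation: enumerate concatenations of at most two contiguous substrings of the observed forms and keep the candidate minimizing the total Levenshtein distance. This is a sketch under the definitions above; the m ≤ |x| ≤ M length restriction, which would prune the search, is omitted for brevity.

```python
from itertools import product

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def substrings(words):
    """All contiguous substrings of the observed forms, plus the empty string
    (so single-substring, k = 1 candidates are included)."""
    subs = {""}
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                subs.add(w[i:j])
    return subs

def centroid_k2(observed):
    """Approximate Levenshtein centroid over concatenations of at most two
    contiguous substrings of the observed word forms (k = 2)."""
    subs = substrings(observed)
    best, best_cost = None, float("inf")
    for a, b in product(subs, repeat=2):
        x = a + b
        cost = sum(levenshtein(x, y) for y in observed)
        if cost < best_cost:
            best, best_cost = x, cost
    return best
```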

We also compared against an oracle, denoted ORACLE, which returns argmin_{y ∈ O} L(y, x*), where x* is the target reconstruction. This is superior to picking a single closest language to be used for all word forms, but it is possible for systems to perform better than the oracle since it has to return one of the observed word forms.

We performed the comparison against Oakes (2000) and Bouchard-Côté et al. (2008) on the same datasets and experimental conditions as those used in the respective papers (see Table 2).

Table 2: Experimental setup: number of held-out proto-words (absolute and relative proportion), of modern languages, cognate sets, and total observed words. The split for BCLKG is the same as in Bouchard-Côté et al. (2008).

Protolanguage    | POc       | PMJ      | La
Held-out (prop.) | 243 (1.0) | 79 (1.0) | 293 (0.5)
Modern languages | 70        | 4        | 2
Cognate sets     | 1321      | 179      | 583
Observed words   | 10783     | 470      | 1463
Mean word length | 4.5       | 5.0      | 7.4
Note that the setup of Bouchard-Côté et al. (2008) provides supervision (half of the Latin word forms are provided); all of the other comparisons are performed in a completely unsupervised manner.

The PMJ dataset was compiled by Nothofer (1975), who also reconstructed the corresponding protolanguage. Since PRAGUE is not guaranteed to return a reconstruction for each cognate set, only 55 word forms could be directly compared to our system. We restricted comparison to this subset of the data. This favors PRAGUE, since the system only proposes a reconstruction when it is certain. Still, our system outperformed PRAGUE, with an average distance of 1.60 compared to 2.02 for PRAGUE. The difference is marginally significant (p = 0.06), partly due to the small number of word forms involved.

We also exceeded the performance of BCLKG on the Romance dataset. Our system's reconstruction had an edit distance of 3.02 to the truth, against 3.10 for BCLKG. However, this difference was not significant (p = 0.15). We think this is because of the high level of noise in the data (the Romance dataset is the only dataset we consider that was automatically constructed rather than curated by linguists). A second factor contributing to this small difference may be that the experimental setup of BCLKG used very few languages, while the performance of our system improves markedly with more languages.

We conducted another experiment to verify this by running both systems in larger trees. Because the Romance dataset had only three modern languages transcribed in IPA, we used the Austronesian dataset to perform the test. The results were all significant in this setup: while our method went from an edit distance of 2.01 to 1.79 in the 4-to-8 languages experiment described in Section 5.1, BCLKG went from 3.30 to 3.38. This suggests that more languages can actually hurt systems that do not support parameter sharing.

Since we have shown evidence that PRAGUE and BCLKG do not scale well to large datasets, we also compared against CENTROID and ORACLE in a large-scale setting. Specifically, we compare to the experimental setup on the 64 modern languages used to reconstruct POc described before. Encouragingly, while the system's average distance (1.49) does not attain that of the ORACLE (1.13), we significantly outperform the CENTROID baseline (1.79).

5.3 Incorporating prior linguistic knowledge

The model also supports the addition of prior linguistic knowledge. This takes the form of feature templates with more internal structure. We performed experiments with an additional feature template: STRUCT-FAITHFULNESS is a structured version of FAITHFULNESS, replacing x and y in 1[x → y] with their natural classes β(x) and β(y), where β indexes types of classes, ranging over {manner, place, phonation, isOral, isCentral, height, backness, roundedness}. This feature set is reminiscent of the featurized representation of Kondrak (2000).

We compared the performance of the system with and without STRUCT-FAITHFULNESS to check if the algorithm can recover the structure of natural classes in an unsupervised fashion. We observed that with 2 or 4 observed languages, STRUCT-FAITHFULNESS performed better, but for larger trees the difference was not significant. FAITHFULNESS even slightly outperformed its structured cousin with 16 observed languages.
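To illustrate the template, the sketch below expands a substitution x → y into class-level features β(x) → β(y) for a toy fragment of the class table. The phoneme-to-class assignments shown are an illustrative fragment, not the inventory used in the experiments.

```python
# Sketch of the STRUCT-FAITHFULNESS template: each substitution feature
# 1[x -> y] is duplicated at the level of natural classes beta(x) -> beta(y),
# for each class type beta (manner, place, phonation, ...).
CLASS_TABLES = {
    "manner":    {"p": "stop",      "b": "stop",
                  "s": "fricative", "h": "fricative"},
    "place":     {"p": "labial",    "b": "labial",
                  "s": "alveolar",  "h": "glottal"},
    "phonation": {"p": "voiceless", "b": "voiced",
                  "s": "voiceless", "h": "voiceless"},
}

def struct_faithfulness_features(x, y, ell):
    feats = []
    for beta, table in CLASS_TABLES.items():
        if x in table and y in table:
            cx, cy = table[x], table[y]
            feats.append(("SF", beta, cx, cy))        # universal version
            feats.append(("SF", beta, cx, cy, ell))   # branch-specific version
    return feats

# Example: /p/ -> /b/ shares its manner and place class features with other
# voicing changes, so evidence for one voicing change supports the others.
print(struct_faithfulness_features("p", "b", "Kw"))
```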

6 Conclusion

By enriching our model to include important features like markedness, and by scaling up to larger data sets than were previously possible, we obtained substantial improvements in reconstruction quality, giving the best results on past data sets. While many more complex phenomena are still unmodeled, from reduplication to borrowing to chained sound shifts, the current approach significantly increases the power, accuracy, and efficiency of automatic reconstruction.

Acknowledgments

We would like to thank Anna Rafferty and our reviewers for their comments. This work was supported by a NSERC fellowship to the first author and NSF grant number BCS-0631518 to the second author.
[Figure 3: (a) A visualization of the learned faithfulness parameters: on the top, universal features; on the bottom, features for one particular branch. Each pair of phonemes has a link with grayscale value proportional to the expectation of a transition between them; the five strongest links are also included at the right. (b) A sample taken from our POc experiments (see text): /taŋi/ (POc), /aŋi/ (Lau), /angi/ (Kw.). (c-e) Phylogenetic trees for the three language families: Proto-Malayo-Javanic, Austronesian, and Romance.]

References

R. Blust. 1993. Central and central-Eastern Malayo-Polynesian. Oceanic Linguistics, 32:241-293.

A. Bouchard-Côté, P. Liang, D. Klein, and T. L. Griffiths. 2008. A probabilistic approach to language change. In Advances in Neural Information Processing Systems 20.

A. Bouchard-Côté, M. I. Jordan, and D. Klein. 2009. Efficient inference in phylogenetic InDel trees. In Advances in Neural Information Processing Systems 21.

L. Campbell. 1998. Historical Linguistics. The MIT Press.

S. F. Chen. 2003. Conditional and joint models for grapheme-to-phoneme conversion. In Proceedings of Eurospeech.

M. A. Covington. 1998. Alignment of multiple languages for historical comparison. In Proceedings of ACL 1998.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38.

M. Dreyer, J. R. Smith, and J. Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In Proceedings of EMNLP 2008.

S. P. Durham and D. E. Rogers. 1969. An application of computer programming to the reconstruction of a proto-language. In Proceedings of the 1969 Conference on Computational Linguistics.

C. L. Eastlack. 1977. Iberochange: A program to simulate systematic sound change in Ibero-Romance. Computers and the Humanities.

J. Felsenstein. 1989. PHYLIP - PHYLogeny Inference Package (Version 3.2). Cladistics, 5:164-166.

S. Goldwater and M. Johnson. 2003. Learning OT constraint rankings using a maximum entropy model. In Proceedings of the Workshop on Variation within Optimality Theory.

S. J. Greenhill, R. Blust, and R. D. Gray. 2008. The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics, 4:271-283.

H. H. Hock. 1986. Principles of Historical Linguistics. Walter de Gruyter.

I. Holmes and W. J. Bruno. 2001. Evolutionary HMM: a Bayesian approach to multiple alignment. Bioinformatics, 17:803-820.

W. Jank. 2005. Stochastic variants of EM: Monte Carlo, quasi-Monte Carlo and more. In Proceedings of the American Statistical Association.

G. Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of NAACL 2000.

G. Kondrak. 2002. Algorithms for Language Reconstruction. Ph.D. thesis, University of Toronto.

V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10, February.

D. C. Liu, J. Nocedal, and C. Dong. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503-528.

J. B. Lowe and M. Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics, 20(3):381-417.

G. A. Lunter, I. Miklós, Y. S. Song, and J. Hein. 2003. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. Journal of Computational Biology, 10:869-889.

B. Nothofer. 1975. The Reconstruction of Proto-Malayo-Javanic. M. Nijhoff.

M. P. Oakes. 2000. Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quantitative Linguistics, 7(3):233-244.

A. Prince and P. Smolensky. 1993. Optimality theory: Constraint interaction in generative grammar. Technical Report 2, Rutgers University Center for Cognitive Science.

L. Tierney. 1994. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701-1728.

A. Varadarajan, R. K. Bradley, and I. H. Holmes. 2008. Tools for simulating evolution of aligned genomic regions with integrated parameter estimation. Genome Biology, 9:R147.

C. Wilson. 2006. Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cognitive Science, 30.5:945-982.
