Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côtéa,1, David Hallb, Thomas L. Griffithsc, and Dan Kleinb

aDepartment of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; bComputer Science Division and cDepartment of Psychology, University of California, Berkeley, CA 94720

Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012)

One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system's reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.

ancestral | computational | diachronic

Reconstruction of the protolanguages from which modern languages are descended is a difficult problem, occupying historical linguists since the late 18th century. To solve this problem, linguists have developed a labor-intensive manual procedure called the comparative method (1), drawing on information about the sounds and words that appear in many modern languages to hypothesize protolanguage reconstructions even when no written records are available, opening one of the few possible windows to prehistoric societies (2, 3). Reconstructions can help in understanding many aspects of our past, such as the technological level (2), migration patterns (4), and scripts (2, 5) of early societies. Comparing reconstructions across many languages can help reveal the nature of language change itself, identifying which aspects of language are most likely to change over time, a long-standing question in historical linguistics (6, 7).

In many cases, direct evidence of the form of protolanguages is not available. Fortunately, owing to the world's considerable linguistic diversity, it is still possible to propose reconstructions by leveraging a large collection of extant languages descended from a single protolanguage. Words that appear in these modern languages can be organized into cognate sets that contain words suspected to have a shared ancestral form (Table 1). The key observation that makes reconstruction from these data possible is that languages seem to undergo a relatively limited set of regular sound changes, each applied to the entire vocabulary of a language at specific stages of its history (1). Still, several factors make reconstruction a hard problem. For example, sound changes are often context sensitive, and many are string insertions and deletions.

In this paper, we present an automated system capable of large-scale reconstruction of protolanguages directly from words that appear in modern languages. This system is based on a probabilistic model of sound change at the level of phonemes, building on work on the reconstruction of ancestral sequences and alignment in computational biology (8–12). Several groups have recently explored how methods from computational biology can be applied to problems in historical linguistics, but such work has focused on identifying the relationships between languages (as might be expressed in a phylogeny) rather than reconstructing the languages themselves (13–18). Much of this type of work has been based on binary cognate or structural matrices (19, 20), which discard all information about the form that words take, simply indicating whether they are cognate. Such models did not have the goal of reconstructing protolanguages and consequently use a representation that lacks the resolution required to infer ancestral phonetic sequences. Using phonological representations allows us to perform reconstruction and does not require us to assume that cognate sets have been fully resolved as a preprocessing step. Representing the words at each point in a phylogeny and having a model of how they change give a way of comparing different hypothesized cognate sets and hence inferring cognate sets automatically.

The focus on problems other than reconstruction in previous computational approaches has meant that almost all existing protolanguage reconstructions have been done manually. However, to obtain accurate reconstructions for older languages, large numbers of modern languages need to be analyzed. The Proto-Austronesian language, for instance, has over 1,200 descendant languages (21). All of these languages could potentially increase the quality of the reconstructions, but the number of possibilities increases considerably with each language, making it difficult to analyze a large number of languages simultaneously. The few previous systems for automated reconstruction of protolanguages or cognate inference (22–24) were unable to handle this increase in computational complexity, as they relied on deterministic models of sound change and exact but intractable algorithms for reconstruction.

Being able to reconstruct large numbers of languages also makes it possible to provide quantitative answers to questions about the factors that are involved in language change. We demonstrate the potential for automated reconstruction to lead to novel results in historical linguistics by investigating a specific hypothesized regularity in sound changes called functional load. The functional load hypothesis, introduced in 1955, asserts that sounds that play a more important role in distinguishing words are less likely to change over time (6). Our probabilistic reconstruction of hundreds of protolanguages in the Austronesian phylogeny provides a way to explore this question quantitatively, producing compelling evidence in favor of the functional load hypothesis.

Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H. performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C., D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper. The authors declare no conflict of interest. This article is a PNAS Direct Submission. N.C. is a guest editor invited by the Editorial Board. Freely available online through the PNAS open access option. See Commentary on page 4159. 1To whom correspondence should be addressed. E-mail: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1204678110/-/DCSupplemental.

4224–4229 | PNAS | March 12, 2013 | vol. 110 | no. 11 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110

Table 1. Sample of reconstructions produced by the system

*Complete sets of reconstructions can be found in SI Appendix.
†Randomly selected by stratified sampling according to the Levenshtein edit distance Δ.
‡Levenshtein distance to a reference manual reconstruction, in this case the reconstruction of Blust (42).
§The colors encode cognate sets.
¶We use this symbol for encoding missing data.
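The Levenshtein edit distance Δ used in this table, and throughout the evaluations below, can be computed with the standard dynamic program. The sketch below is an illustrative implementation, not the authors' evaluation code.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute (free if equal)
        prev = cur
    return prev[-1]
```

For example, `levenshtein("mata", "maca")` returns 1 (one substitution), so such a pair would count as a reconstruction "within one character" of the reference.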

Model
We use a probabilistic model of sound change and a Monte Carlo inference algorithm to reconstruct the lexicon and phonology of protolanguages given a collection of cognate sets from modern languages. As in other recent work in computational historical linguistics (13–18), we make the simplifying assumption that each word evolves along the branches of a tree of languages, reflecting the languages' phylogenetic relationships. The tree's internal nodes are languages whose word forms are not observed, and the leaves are modern languages. The output of our system is a posterior probability distribution over derivations. Each derivation contains, for each cognate set, a reconstructed transcription of ancestral forms, as well as a list of sound changes describing the transformation from parent word to child word. This representation is rich enough to answer a wide range of queries that would normally be answered by carrying out the comparative method manually, such as which sound changes were most prominent along each branch of the tree.

We model the evolution of discrete sequences of phonemes, using a context-dependent probabilistic string transducer (8). Probabilistic string transducers efficiently encode a distribution over possible changes that a string might undergo as it changes through time. Transducers are sufficient to capture most types of regular sound changes (e.g., lenitions, epentheses, and elisions) and can be sensitive to the context in which a change takes place. Most types of changes not captured by transducers are not regular (1) and are therefore less informative (e.g., metatheses, reduplications, and haplologies). Unlike simple molecular InDel models used in computational biology such as the TKF91 model (25), the parameterization of our model is very expressive: Mutation probabilities are context sensitive, depending on the neighboring characters, and each branch has its own set of parameters. This context-sensitive and branch-specific parameterization plays a central role in our system, allowing explicit modeling of sound changes.

Formally, let τ be a phylogenetic tree of languages, where each language is linked to the languages that descended from it. In such a tree, the modern languages, whose word forms will be observed, are the leaves of τ. The most recent common ancestor of these modern languages is the root of τ. Internal nodes of the tree (including the root) are protolanguages with unobserved word forms. Let L denote all languages, modern and otherwise. All word forms are assumed to be strings in the International Phonetic Alphabet (IPA).

We assume that word forms evolve along the branches of the tree τ. However, it is usually not the case that a word belonging to each cognate set exists in each modern language: words are lost or replaced over time, meaning that words that appear in the root languages may not have cognate descendants in the languages at the leaves of the tree. For the moment, we assume there is a known list of C cognate sets. For each c ∈ {1, ..., C}, let L(c) denote the subset of modern languages that have a form in the cth cognate set. For each set c ∈ {1, ..., C} and each language ℓ ∈ L(c), we denote the modern word form by wcℓ. For cognate set c, only the minimal subtree τ(c) containing L(c) and the root is relevant to the reconstruction inference problem for that set.

Our model of sound change is based on a generative process defined on this tree. From a high-level perspective, the generative process is quite simple. Let c be the index of the current cognate set, with topology τ(c). First, a word is generated for the root of τ(c), using an (initially unknown) root language model (i.e., a probability distribution over strings). The words that appear at other nodes of the tree are generated incrementally, using a branch-specific distribution over changes in strings to generate each word from the word in the language that is its parent in τ(c). Although this distribution differs across branches of the tree, making it possible to estimate the pattern of changes involved in the transition from one language to another, it remains the same for all cognate sets, expressing changes that apply stochastically to all words. The probabilities of substitution, insertion, and deletion are also dependent on the context in which the change occurs. Further details of the distributions that were used and their parameterization appear in Materials and Methods.

The flexibility of our model comes at the cost of having literally millions of parameters to set, creating challenges not found in most computational approaches to phylogenetics. Our inference algorithm learns these parameters automatically, using established principles from machine learning and statistics. Specifically, we use a variant of the expectation-maximization algorithm (26), which alternates between producing reconstructions on the basis of the current parameter estimates and updating the parameter estimates on the basis of those reconstructions. The reconstructions are inferred using an efficient Monte Carlo inference algorithm (27). The parameters are estimated by optimizing a cost function that penalizes complexity, allowing us to obtain robust estimates of large numbers of parameters. See SI Appendix, Section 1 for further details of the inference algorithm.

If cognate assignments are not available, our system can be applied just to lists of words in different languages. In this case it automatically infers the cognate assignments as well as the reconstructions. This setting requires only two modifications to the model. First, because cognates are not available, we index the words by their semantic meaning (or gloss) g, and there are thus G groups of words. The model is then defined as in the previous case, with words indexed as wgℓ. Second, the generation process is augmented with a notion of innovation, wherein a word wgℓ′ in some language ℓ′ may instead be generated independently from its parent word wgℓ. In this instance, the word is generated from a language model as though it were a root string. In effect, the tree is "cut" at a language when innovation happens, and so the word begins anew. The probability of innovation in any given

Fig. 1. Quantitative validation of reconstructions and identification of some important factors influencing reconstruction quality. (A) Reconstruction error rates for a baseline (which consists of picking one modern word at random), our system, and the amount of disagreement between two linguists' manual reconstructions. Reconstruction error rates are Levenshtein distances normalized by the mean word form length so that errors can be compared across languages. Agreement between linguists was computed on only Proto-Oceanic because the dataset used lacked multiple reconstructions for other protolanguages. (B) The effect of the topology on the quality of the reconstruction. On one hand, the difference between reconstruction error rates obtained from the system that ran on an uninformed topology (first and second) and rates obtained from the system that ran on an informed topology (third and fourth) is statistically significant. On the other hand, the corresponding difference between a flat tree and a random binary tree is not statistically significant, nor is the difference between using the consensus tree of ref. 41 and the Ethnologue tree (29). This suggests that our method has a certain robustness to moderate topology variations. (C) Reconstruction error rate as a function of the number of languages used to train our automatic reconstruction system. Note that the error is not expected to go down to zero, perfect reconstruction being generally unidentifiable. The results in A and B are directly comparable: In fact, the entry labeled "Ethnologue" in B corresponds to the green Proto-Austronesian entry in A. The results in A and B and those in C are not directly comparable because the evaluation in C is restricted to those cognates with at least one reflex in the smallest evaluation set (to make the curve comparable across the horizontal axis of C).

language is initially unknown and must be learned automatically along with the other branch-specific model parameters.

Results

Our results address three questions about the performance of our system. First, how well does it reconstruct protolanguages? Second, how well does it identify cognate sets? Finally, how can this approach be used to address outstanding questions in historical linguistics?

Protolanguage Reconstructions. To test our system, we applied it to a large-scale database of Austronesian languages, the Austronesian Basic Vocabulary Database (ABVD) (28). We used a previously established phylogeny for these languages, the Ethnologue tree (29) (we also describe experiments with other trees in Fig. 1). For this first test of our system we also used the cognate sets provided in the database. The dataset contained 659 languages at the time of download (August 7, 2010), including a few languages outside the Austronesian family and some manually reconstructed protolanguages used for evaluation. The total data comprised 142,661 word forms and 7,708 cognate sets. The goal was to reconstruct the word in each protolanguage that corresponded to each cognate set and to infer the patterns of sound changes along each branch in the phylogeny. See SI Appendix, Section 2 for further details of our simulations.

We used the Austronesian dataset to quantitatively evaluate the performance of our system by comparing withheld words from known languages with automatic reconstructions of those words. The Levenshtein distance between the held-out and reconstructed forms provides a measure of the number of errors in these reconstructions. We used this measure to show that using more languages helped reconstruction and also to assess the overall performance of our system. Specifically, we compared the system's error rate on the ancestral reconstructions to a baseline and also to the amount of divergence between the reconstructions of two linguists (Fig. 1A). Given enough data, the system can achieve reconstruction error rates close to the level of disagreement between manual reconstructions. In particular, most reconstructions perfectly agree with manual reconstructions, and only a few contain large errors. Refer to Table 1 for examples of reconstructions. See SI Appendix, Section 3 for the full lists.

We also present in Fig. 1B the effect of the tree topology on reconstruction quality, reiterating the importance of using informative topologies for reconstruction. In Fig. 1C, we show that the accuracy of our method increases with the number of observed Oceanic languages, confirming that large-scale inference is desirable for automatic protolanguage reconstruction: Reconstruction improved statistically significantly with each increase except from 32 to 64 languages, where the average edit distance improvement was 0.05.

For comparison, we also evaluated previous automatic reconstruction methods. These previous methods do not scale to large datasets, so we performed comparisons on smaller subsets of the Austronesian dataset. We show in SI Appendix, Section 2 that our method outperforms these baselines.

We analyze the output of our system in more depth in Fig. 2 A–C, which shows that the system learned a variety of realistic sound changes across the Austronesian family (30). In Fig. 2D, we show the most frequent substitution errors in the Proto-Austronesian reconstruction experiments. See SI Appendix, Section 5 for details and similar plots for the most common incorrect insertions and deletions.

Cognate Recovery. Previous reconstruction systems (22) required that cognate sets be provided to the system. However, the creation of these large cognate databases requires considerable annotation effort on the part of linguists and often requires that at least some reconstruction be done by hand. To demonstrate that our model can accurately infer cognate sets automatically, we used a version of our system that learns which words are cognate, starting only from raw word lists and their meanings. This system uses a faster but lower-fidelity model of sound change to infer the correspondences. We then ran our reconstruction system on cognate sets that our cognate recovery system found. See SI Appendix, Section 1 for details.

This version of the system was run on all of the Oceanic languages in the ABVD, which comprise roughly half of the Austronesian languages. We then evaluated the pairwise precision (the fraction of cognate pairs identified by our system that are also in the set of labeled cognate pairs), pairwise recall (the fraction of labeled cognate pairs identified by our system), and pairwise F1 measure (defined as the harmonic mean of precision and recall) for the cognates found by our system against the known cognates that are encoded in the ABVD. We also report cluster purity, which is the fraction of words that are in a cluster whose known cognate group matches the cognate group of the cluster. See SI Appendix, Section 2.3 for a detailed description of the metrics.

Using these metrics, we found that our system achieved a precision of 0.844, recall of 0.621, F1 of 0.715, and cluster purity of 0.918. Thus, over 9 of 10 words are correctly grouped, and our system errs on the side of undergrouping words rather than clustering words that are not cognates. Because the null hypothesis in historical linguistics is to deem words to be unrelated unless proved otherwise, a slight undergrouping is the desired behavior.
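The evaluation measures described above are standard and easy to reproduce. The following is a minimal sketch (not the authors' implementation) of Levenshtein distance over phoneme sequences, pairwise precision/recall/F1 over cognate clusterings, and cluster purity; the function and variable names are our own.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two phoneme sequences (unit-cost insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pairs(clusters):
    """Set of unordered word-id pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pairwise_scores(gold, pred):
    """Pairwise precision, recall, and F1 of predicted clusters against gold clusters."""
    g, p = pairs(gold), pairs(pred)
    prec = len(g & p) / len(p) if p else 1.0
    rec = len(g & p) / len(g) if g else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def purity(gold, pred):
    """Fraction of words falling in a predicted cluster whose majority gold
    label matches their own; assumes every word appears in some gold cluster."""
    label = {w: i for i, c in enumerate(gold) for w in c}
    correct = sum(max(sum(label[w] == i for w in c) for i in {label[w] for w in c})
                  for c in pred)
    return correct / sum(len(c) for c in pred)
```

Note that pairwise recall penalizes undergrouping while purity does not, which is why a system can have high purity (0.918 here) alongside moderate recall (0.621).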

4226 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al. Downloaded by guest on September 23, 2021

[Fig. 2: (A) Austronesian phylogenetic tree; (B) IPA chart of the most supported sound changes; (C) close-up of the Oceanic subtree; (D) fraction of substitution errors. The figure graphics are not recoverable from the text extraction; see the caption below and SI Appendix, Figs. S2–S5.]

Fig. 2. Analysis of the output of our system in more depth. (A) An Austronesian phylogenetic tree from ref. 29 used in our analyses. Each quadrant is available in a larger format in SI Appendix, Figs. S2–S5, along with a detailed table of sound changes (SI Appendix, Table S5). The numbers in parentheses attached to each branch correspond to rows in SI Appendix, Table S5. The colors and numbers in parentheses encode the most prominent sound change along each branch, as inferred automatically by our system (SI Appendix, Section 4). (B) The most supported sound changes across the phylogeny, with the width of links proportional to the support. Note that the standard organization of the IPA chart into columns and rows according to place, manner, height, and backness is used only for visualization purposes: This information was not encoded in the model in this experiment, showing that the model can recover realistic cross-linguistic sound change trends. All of the arcs correspond to sound changes frequently used by historical linguists: sonorizations /p/ > /b/ (1) and /t/ > /d/ (2), voicing changes (3, 4), debuccalizations /f/ > /h/ (5) and /s/ > /h/ (6), spirantizations /b/ > /v/ (7) and /p/ > /f/ (8), changes of (9, 10), and changes in height (11) and backness (12) (1). Whereas this visualization depicts sound changes as undirected arcs, the sound changes are actually represented with directionality in our system. (C) Zooming in on a portion of the Oceanic languages, where the Nuclear Polynesian family (i) and Polynesian family (ii) are visible. Several attested sound changes, such as to Maori and the place of articulation change /t/ > /k/ to Hawaiian (30), are successfully localized by the system. (D) Most common substitution errors in the PAn reconstructions produced by our system. The first element in each pair (x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. Most of these errors could be plausible disagreements among human experts. For example, the most dominant error (p, v) could arise from a disagreement over the phonemic inventory of Proto-Austronesian, whereas are common sources of disagreement.
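Tallies like those in Fig. 2D can be computed from aligned reference/hypothesis reconstruction pairs. A minimal sketch (our own helper, not the authors' code), assuming an alignment step has already paired each reference phoneme with a hypothesized phoneme or a gap:

```python
from collections import Counter

def substitution_errors(aligned_pairs):
    """Rank (reference, hypothesized) phoneme substitution errors by their
    fraction of all substitution errors. aligned_pairs is an iterable of
    (ref, hyp) tuples, with None marking an insertion/deletion gap."""
    errs = Counter((r, h) for r, h in aligned_pairs
                   if r is not None and h is not None and r != h)
    total = sum(errs.values())
    return [((r, h), c / total) for (r, h), c in errs.most_common()]
```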

Because we are ultimately interested in reconstruction, we then compared our reconstruction system's ability to reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found by our system (run on the Oceanic subclade) with at least two words in it. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words, using our system. For evaluation, we then looked at the average Levenshtein distance from our reconstructions to the known reconstructions described in the previous sections. This time, however, we average per modern word rather than per cognate group, to provide a fairer comparison. (Results were not substantially different when averaging per cognate group.) Compared with reconstruction from manually labeled cognate sets, automatically identified cognates led to an increase in error rate of only 12.8%, with a significant reduction in the cost of curating linguistic databases. See SI Appendix, Fig. S1 for the fraction of words with each Levenshtein distance for these reconstructions.

Functional Load. To demonstrate the utility of large-scale reconstruction of protolanguages, we used the output of our system to investigate an open question in historical linguistics. The functional load hypothesis (FLH), introduced in 1955 (6), claims that the probability that a sound will change over time is related to the amount of information provided by that sound. Intuitively, if two phonemes appear only in words that are differentiated from one another by at least one other sound, then one can argue that no information is lost if those phonemes merge, because no new ambiguous forms can be created by the merger.

A first step toward quantitatively testing the FLH was taken in 1967 (7). By defining a statistic that formalizes the amount of information lost when a language undergoes a certain sound change—on the basis of the proportion of words that are discriminated by each pair of phonemes—it became possible to evaluate the empirical support for the FLH. However, this initial investigation was based on just four languages and found little evidence to support the hypothesis. This conclusion was criticized by several authors (31, 32) on the basis of the small number of languages and sound changes considered, although they provided no positive counterevidence.

Using the output of our system, we collected sound change statistics from our reconstruction of 637 Austronesian languages, including the probability of a particular change as estimated by our system. These statistics provided the information needed to give a more comprehensive quantitative evaluation of the FLH, using a much larger sample than previous work (details in SI Appendix, Section 2.4). We show in Fig. 3 A and B that this analysis provides clear quantitative evidence in favor of the FLH. The revealed pattern would not be apparent had we not been able to reconstruct large numbers of protolanguages and supply probabilities of different kinds of change taking place for each pair of languages.

Discussion

We have developed an automated system capable of large-scale reconstruction of protolanguage word forms, cognate sets, and sound change histories. The analysis of the properties of hundreds of ancient languages performed by this system goes far beyond the capabilities of any previous automated system and would require significant amounts of manual effort by linguists. Furthermore, the system is in no way restricted to applications like assessing the effects of functional load: It can be used as a tool to investigate a wide range of questions about the structure and dynamics of languages.

In developing an automated system for reconstructing ancient languages, it is by no means our goal to replace the careful reconstructions performed by linguists. It should be emphasized that the reconstruction mechanism used by our system ignores many of the phenomena normally used in manual reconstructions. We have mentioned limitations due to the transducer

Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4227

Fig. 3. Increasing the number of languages we can reconstruct gives new ways to approach questions in historical linguistics, such as the effect of functional load on the probability of merging two sounds. The plots shown are heat maps (x axis: functional load; y axis: merger posterior) where the color encodes the log of the number of sound changes that fall into a given two-dimensional bin. Each sound change x > y is encoded as a pair of numbers in the unit square, (l, m), as explained in Materials and Methods. To convey the amount of noise one could expect from a study with the number of languages that King previously used (7), we first show in A the heat map visualization for four languages. Next, we show the same plot for 637 Austronesian languages in B. Only in this latter setup is structure clearly visible: Most of the points with high probability of merging can be seen to have comparatively low functional load, providing evidence in favor of the functional load hypothesis introduced in 1955. See SI Appendix, Section 2.4 for details.
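A type-based functional load statistic in the spirit of the one discussed above can be sketched as follows. This is our own illustrative version, assuming a fixed word list per language; the statistic actually used in the paper is specified in SI Appendix, Section 2.4 and may normalize differently.

```python
from itertools import combinations

def functional_load(lexicon, x, y):
    """Fraction of distinct word pairs that become homophonous when the
    phonemes x and y are merged. lexicon is an iterable of words given as
    strings or tuples of phoneme symbols."""
    def merge(w):
        return tuple(x if s == y else s for s in w)
    words = list(set(map(tuple, lexicon)))
    n_pairs = len(words) * (len(words) - 1) // 2
    if n_pairs == 0:
        return 0.0
    collapsed = sum(1 for a, b in combinations(words, 2) if merge(a) == merge(b))
    return collapsed / n_pairs
```

On a toy lexicon {"pat", "bat", "mat"}, merging /p/ and /b/ collapses one of the three word pairs, giving a functional load of 1/3 for that phoneme pair, whereas merging two phonemes that never form minimal pairs gives 0.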

formalism, but other limitations include the lack of explicit modeling of changes at the level of the phoneme inventories used by a language and the lack of morphological analysis. Challenges specific to the cognate inference task, for example difficulties with polymorphisms, are also discussed in more detail in SI Appendix. Another limitation of the current approach stems from the assumption that languages form a phylogenetic tree, an assumption violated by borrowing, variation, and creole languages.

However, we believe our system will be useful to linguists in several ways, particularly in contexts where there are large numbers of languages to be analyzed. Examples might include using the system to propose short lists of potential sound changes and correspondences across highly divergent word forms.

An exciting possible application of this work is to use the model described here to infer the phylogenetic relationships between languages jointly with reconstructions and cognate sets. This will remove a source of circularity present in most previous computational work in historical linguistics. Systems for inferring phylogenies such as ref. 13 generally assume that cognate sets are given as a fixed input, but cognacy as determined by linguists is in turn motivated by phylogenetic considerations. The phylogenetic tree hypothesized by the linguist is therefore affecting the tree built by systems using only these cognates. This problem can be avoided by inferring cognates at the same time as a phylogeny, something that should be possible using an extended version of our probabilistic model.

Our system is able to reconstruct the words that appear in ancient languages because it represents words as sequences of sounds and uses a rich probabilistic model of sound change. This is an important step forward from previous work applying computational ideas to historical linguistics. By leveraging the full sequence information available in the word forms in modern languages, we hope to see in historical linguistics a breakthrough similar to the advances in evolutionary biology prompted by the transition from morphological characters to molecular sequences in phylogenetic analysis.

Materials and Methods

This section provides a more detailed specification of our probabilistic model. See SI Appendix, Section 1.2 for additional content on the algorithm and simulations.

Distributions. The conditional distributions over pairs of evolving strings are specified using a lexicalized stochastic string transducer (33).

Consider a language ℓ′ evolving to ℓ for cognate set c. Assume we have a word form x = w_{cℓ′}. The generative process for producing y = w_{cℓ} works as follows. First, we consider x to be composed of characters x_1 x_2 ... x_n, with the first and last ones being a special boundary symbol x_1 = # ∈ Σ, which is never deleted, mutated, or created. The process generates y = y_1 y_2 ... y_n in n chunks y_i ∈ Σ*, i ∈ {1, ..., n}, one for each x_i. The y_i may be a single character, multiple characters, or even empty. To generate y_i, we define a mutation Markov chain that incrementally adds zero or more characters to an initially empty y_i. First, we decide whether the current phoneme in the top word t = x_i will be deleted, in which case y_i = ε (the probabilities of the decisions taken in this process depend on a context to be specified shortly). If t is not deleted, we choose a single substitution character in the bottom word. We write S = Σ ∪ {ζ} for this set of outcomes, where ζ is the special outcome indicating deletion. Importantly, the probabilities of this multinomial can depend on both the previous character generated so far (i.e., the rightmost character p of y_{i−1}) and the current character in the previous generation string (t), providing a way to make changes context sensitive. This multinomial decision acts as the initial distribution of the mutation Markov chain. We consider insertions only if a deletion was not selected in the first step. Here, we draw from a multinomial over S, where this time the special outcome ζ corresponds to stopping insertions, and the other elements of S correspond to symbols that are appended to y_i. In this case, the conditioning environment is t = x_i and the current rightmost symbol p in y_i. Insertions continue until ζ is selected. We use θ_{S,t,p,ℓ} and θ_{I,t,p,ℓ} to denote the probabilities over the substitution and insertion decisions in the current branch ℓ′ → ℓ. A similar process generates the word at the root ℓ of a tree or when an innovation happens at some language ℓ, treating this word as a single string y_1 generated from a dummy ancestor t = x_1. In this case, only the insertion probabilities matter, and we separately parameterize these probabilities with θ_{R,t,p,ℓ}. There is no actual dependence on t at the root or in innovative languages, but this formulation allows us to unify the parameterization, with each θ_{ω,t,p,ℓ} ∈ R^{|Σ|+1}, where ω ∈ {R, S, I}. During cognate inference, the decision to innovate is controlled by a simple Bernoulli random variable for each language in the tree. When known cognate groups are assumed, this variable is set to 0 for all nonroot languages and to 1 for the root language. These Bernoulli distributions have parameters ν_ℓ.

Mutation distributions confined to the family of transducers miss certain phylogenetic phenomena. For example, reduplication (as in "bye-bye") is a well-studied mechanism to derive morphological and lexical forms that is not explicitly captured by transducers. The same situation arises in metatheses (e.g., frist > English first). However, these changes are generally not regular and therefore less informative (1). Moreover, because we are using a probabilistic framework, these events can still be handled in our system, even though their costs will simply not be as discounted as they should be.

Note also that the generative process described in this section does not allow explicit dependencies on the next character in ℓ. Relaxing this assumption can be done in principle by using weighted transducers, but at the cost of a more computationally expensive inference problem (caused by the transducer normalization computation) (34). A simpler approach is to use the next character in the parent ℓ′ as a surrogate for the next character in ℓ. Using the context in the parent word is also more aligned with the standard representation of sound change used in historical linguistics, where the context is defined on the parent as well.

More generally, dependencies limited to a bounded context on the parent string can be incorporated in our formalism. By bounded, we mean that it should be possible to fix an integer k beforehand such that all of the modeled dependencies are within k characters of the string operation. The caveat is that the computational cost of inference grows exponentially in k. We leave open the question of handling computation in the face of unbounded dependencies such as those induced by harmony (35).

Parameterization. Instead of directly estimating the transition probabilities of the mutation Markov chain (which could be done, in principle, by taking them to be the parameters of a collection of multinomial distributions), we express them as the output of a multinomial logistic regression model (36).
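A minimal sketch of such a multinomial logistic parameterization, not the authors' implementation: transition probabilities as a feature-based softmax masked by a 0/1 reference measure, together with the logistic innovation probability. The feature function, weight vector, outcome set, and mask passed in are placeholder assumptions.

```python
import math

def transition_probs(omega, t, p, lang, outcomes, feats, lam, mu):
    """Transition probabilities of the mutation Markov chain as a multinomial
    logistic regression: softmax over outcomes xi in S, with each score
    masked by a 0/1 reference measure mu handling boundary conditions.
    feats(omega, t, p, lang, xi) -> list of feature values (placeholder)."""
    scores = {
        xi: mu(omega, t, xi) * math.exp(
            sum(w * v for w, v in zip(lam, feats(omega, t, p, lang, xi))))
        for xi in outcomes
    }
    z = sum(scores.values())  # normalization function Z
    return {xi: s / z for xi, s in scores.items()}

def innovation_prob(kappa_global, kappa_lang):
    """Innovation probability as a logistic function of a global feature
    weight and a language-specific feature weight."""
    return 1.0 / (1.0 + math.exp(-kappa_global - kappa_lang))
```

Parameter sharing across branches then amounts to features that either ignore or depend on the language argument, all scored by a single global weight vector.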

4228 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al. Downloaded by guest on September 23, 2021 model specifies a distribution over transition probabilities by assigning Using these features and parameter sharing, the logistic regression model

weights to a set of features that describe properties of the sound changes defines the transition probabilities of the mutation process and the root SEE COMMENTARY involved. These features provide a more coherent representation of the language model to be transition probabilities, capturing regularities in sound changes that reflect Æλ; ω; ; ; ℓ; ξ æ the underlying linguistic structure. θ = θ ξ; λ = expf fð t p Þ g × μ ω; ; ξ ; ω;t;p;ℓ ω;t;p;ℓð Þ ω; ; ; ℓ; λ ð t Þ [1] We used the following feature templates: OPERATION, which identifies Zð t p Þ whether an operation in the mutation Markov chain is an insertion, a de- letion, a substitution, a self-substitution (i.e., of the form x > y, x = y), or the where ξ ∈ S , f : fS; I; Rg × Σ × Σ × L × S → Rk is the feature function (which end of an insertion event; MARKEDNESS, which consists of language-specific indicates which features apply for each event), Æ · ; · æ denotes inner product, n-gram indicator functions for all symbols in Σ (during reconstruction, only and λ ∈ Rk is a weight vector. Here, k is the dimensionality of the feature space unigram and bigram features are used for computational reasons; for cognate of the logistic regression model. In the terminology of exponential families, Z inference, only unigram features are used); FAITHFULNESS, which consists of and μ are the normalization function and the reference measure, respectively: indicators for mutation events of the form 1 [ x > y ], where x ∈ Σ, y ∈ S . X Feature templates similar to these can be found, for instance, in the work of Zðω; t; p; ℓ; λÞ = exp Æλ; f ω; t; p; ℓ; ξ′ æ refs. 37 and 38, in the context of string-to-string transduction models used in ξ′∈S computational linguistics. 
This approach to specifying the transition proba- 8 bilities produces an interesting connection to stochastic optimality theory (39, > ω = ; = #; ξ ≠ # <> 0 if S t ω = ; ξ = ζ 40), where a logistic regression model mediates markedness and faithfulness μ ω; ; ξ = 0 if R ð t Þ > ω ≠ ; ξ = # of the production of an output form from an underlying input form. :> 0 if R : : Data sparsity is a significant challenge in protolanguage reconstruction. 1 o w Although the experiments we present here use an order of magnitude more languages than previous computational approaches, the increase in observed Here, μ is used to handle boundary conditions, ensuring that the resulting data also brings with it additional unknowns in the form of intermediate probability distribution is well defined. protolanguages. Because there is one set of parameters for each language, During cognate inference, the innovation Bernoulli random variables νgℓ adding more data is not sufficient to increase the quality of the recon- are similarly parameterized, using a logistic regression model with two kinds struction; it is important to share parameters across different branches in the of features: a global innovation feature κglobal ∈ R and a language-specific tree to benefit from having observations from more languages. We used the feature κℓ ∈ R. The likelihood function for each νgℓ then takes the form following technique to address this problem: We augment the parameteri- zation to include the current language (or language at the bottom of the 1 νgℓ = : [2] current branch) and use a single, global weight vector instead of a set of 1 + exp −κglobal − κℓ branch-specific weights. Generalization across branches is then achieved by using features that ignore ℓ, whereas branch-specific features depend on ℓ. Similarly, all of the features in OPERATION, MARKEDNESS, and ACKNOWLEDGMENTS. This work was supported by Grant IIS-1018733 from FAITHFULNESS have universal and branch-specific versions. 
the National Science Foundation.
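The chunk-by-chunk generative process described in the Methods can be sketched in code. This is a minimal toy simulation, not the authors' implementation: the deletion probability, insertion-continuation probability, and substitution table below are hypothetical placeholders standing in for the branch-specific quantities that the paper learns via logistic regression.

```python
import random

BOUNDARY = "#"  # special boundary symbol: never deleted, mutated, or created

def evolve(parent, p_delete=0.1, p_insert=0.15, sub_table=None, rng=random):
    """Generate a child word y = y1...yn, one chunk y_i per parent symbol x_i."""
    word = BOUNDARY + parent + BOUNDARY
    chunks = []
    for x in word:
        if x == BOUNDARY:
            chunks.append(x)  # boundary symbols are copied unchanged
            continue
        if rng.random() < p_delete:
            chunks.append("")  # deletion: y_i is the empty string
            continue
        # substitution: copy x or replace it via a (toy) sound-change table
        y_i = rng.choice((sub_table or {}).get(x, [x]))
        # zero or more insertions appended to the same chunk
        while rng.random() < p_insert:
            y_i += rng.choice("aeiou")
        chunks.append(y_i)
    return "".join(chunks).strip(BOUNDARY)
```

With the stochastic steps switched off (both probabilities at zero, no substitution table), the parent word survives unchanged; raising the probabilities yields the mixture of deletions, substitutions, and insertions the mutation Markov chain models.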

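To make Eqs. 1 and 2 concrete, here is a small numerical sketch of the parameterization. The sparse feature dictionaries stand in for the paper's OPERATION, MARKEDNESS, and FAITHFULNESS templates and are purely illustrative: `transition_probs` computes the softmax-with-reference-measure form of Eq. 1, and `innovation_prob` the two-feature sigmoid of Eq. 2.

```python
import math

def transition_probs(outcomes, features, weights, mu):
    """Eq. 1: exp{<lambda, f(.)>} / Z, multiplied by the reference measure mu.

    `features(xi)` returns a sparse feature dict for outcome xi; `weights`
    maps feature names to real values (the weight vector lambda)."""
    scores = {xi: math.exp(sum(weights.get(name, 0.0) * value
                               for name, value in features(xi).items()))
              for xi in outcomes}
    z = sum(scores.values())  # normalization function Z (computed without mu)
    return {xi: scores[xi] / z * mu(xi) for xi in outcomes}

def innovation_prob(kappa_global, kappa_lang):
    """Eq. 2: Bernoulli parameter for an innovation variable on one branch."""
    return 1.0 / (1.0 + math.exp(-kappa_global - kappa_lang))
```

With all weights at zero and mu identically 1, every outcome is equally likely, and `innovation_prob(0.0, 0.0)` is 1/2; weighting one outcome's feature tilts the distribution toward it, mirroring how learned weights capture regular sound changes.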
1. Hock HH (1991) Principles of Historical Linguistics (Mouton de Gruyter, The Hague, Netherlands).
2. Ross M, Pawley A, Osmond M (1998) The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society (Pacific Linguistics, Canberra, Australia).
3. Diamond J (1999) Guns, Germs, and Steel: The Fates of Human Societies (WW Norton, New York).
4. Nichols J (1999) Archaeology and Language: Correlating Archaeological and Linguistic Hypotheses, eds Blench R, Spriggs M (Routledge, London).
5. Ventris M, Chadwick J (1973) Documents in Mycenaean Greek (Cambridge Univ Press, Cambridge, UK).
6. Martinet A (1955) Économie des Changements Phonétiques [Economy of phonetic sound changes] (Maisonneuve & Larose, Paris).
7. King R (1967) Functional load and sound change. Language 43:831–852.
8. Holmes I, Bruno WJ (2001) Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 17(9):803–820.
9. Miklós I, Lunter GA, Holmes I (2004) A "Long Indel" model for evolutionary sequence alignment. Mol Biol Evol 21(3):529–540.
10. Suchard MA, Redelings BD (2006) BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22(16):2047–2048.
11. Liberles DA, ed (2007) Ancestral Sequence Reconstruction (Oxford Univ Press, Oxford, UK).
12. Paten B, et al. (2008) Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res 18(11):1829–1843.
13. Gray RD, Jordan FM (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405(6790):1052–1055.
14. Ringe D, Warnow T, Taylor A (2002) Indo-European and computational cladistics. Trans Philol Soc 100:59–129.
15. Evans SN, Ringe D, Warnow T (2004) Inference of Divergence Times as a Statistical Inverse Problem, McDonald Institute Monographs, eds Forster P, Renfrew C (McDonald Institute, Cambridge, UK).
16. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426(6965):435–439.
17. Nakhleh L, Ringe D, Warnow T (2005) Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language 81:382–420.
18. Bryant D (2006) Phylogenetic Methods and the Prehistory of Languages, eds Forster P, Renfrew C (McDonald Institute for Archaeological Research, Cambridge, UK), pp 111–118.
19. Daumé H III, Campbell L (2007) A Bayesian model for discovering typological implications. Assoc Comput Linguist 45:65–72.
20. Dunn M, Levinson S, Lindstrom E, Reesink G, Terrill A (2008) Structural phylogeny in historical linguistics: Methodological explorations applied in Island Melanesia. Language 84:710–759.
21. Lynch J, ed (2003) Issues in Austronesian Historical Phonology (Pacific Linguistics, Canberra, Australia).
22. Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. J Quant Linguist 7:233–244.
23. Kondrak G (2002) Algorithms for Language Reconstruction. PhD thesis (Univ of Toronto, Toronto).
24. Ellison TM (2007) Bayesian identification of cognates and correspondences. Assoc Comput Linguist 45:15–22.
25. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124.
26. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38.
27. Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Adv Neural Inf Process Syst 21:177–184.
28. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evol Bioinform Online 4:271–283.
29. Lewis MP, ed (2009) Ethnologue: Languages of the World (SIL International, Dallas, TX, 16th Ed).
30. Lyovin A (1997) An Introduction to the Languages of the World (Oxford Univ Press, Oxford, UK).
31. Hockett CF (1967) The quantification of functional load. Word 23:320–339.
32. Surendran D, Niyogi P (2006) Competing Models of Linguistic Change. Evolution and Beyond (Benjamins, Amsterdam).
33. Varadarajan A, Bradley RK, Holmes IH (2008) Tools for simulating evolution of aligned genomic regions with integrated parameter estimation. Genome Biol 9(10):R147.
34. Mohri M (2009) Handbook of Weighted Automata, Monographs in Theoretical Computer Science, eds Droste M, Kuich W, Vogler H (Springer, Berlin).
35. Hansson GO (2007) On the evolution of consonant harmony: The case of secondary articulation agreement. Phonology 24:77–120.
36. McCullagh P, Nelder JA (1989) Generalized Linear Models (Chapman & Hall, London).
37. Dreyer M, Smith JR, Eisner J (2008) Latent-variable modeling of string transductions with finite-state methods. Empirical Methods on Natural Language Processing 13:1080–1089.
38. Chen SF (2003) Conditional and joint models for grapheme-to-phoneme conversion. Eurospeech 8:2033–2036.
39. Goldwater S, Johnson M (2003) Learning OT constraint rankings using a maximum entropy model. Proceedings of the Workshop on Variation Within Optimality Theory, eds Spenader J, Eriksson A, Dahl Ö (Stockholm University, Stockholm), pp 113–122.
40. Wilson C (2006) Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cogn Sci 30(5):945–982.
41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323(5913):479–483.
42. Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst Linguist Acad Sinica 1:31–94.

Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4229