Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côtéa,1, David Hallb, Thomas L. Griffithsc, and Dan Kleinb aDepartment of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; bComputer Science Division and cDepartment of Psychology, University of California, Berkeley, CA 94720

Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012)

One of the oldest problems in linguistics is reconstructing the building on work on the reconstruction of ancestral sequences words that appeared in the protolanguages from which modern and alignment in computational biology (8–12). Several groups languages evolved. Identifying the forms of these ancient lan- have recently explored how methods from computational biology guages makes it possible to evaluate proposals about the nature can be applied to problems in historical linguistics, but such work of language change and to draw inferences about human history. has focused on identifying the relationships between languages Protolanguages are typically reconstructed using a painstaking (as might be expressed in a phylogeny) rather than reconstructing – manual process known as the comparative method. We present a the languages themselves (13 18). Much of this type of work has family of probabilistic models of sound change as well as algo- been based on binary cognate or structural matrices (19, 20), which rithms for performing inference in these models. The resulting discard all information about the form that words take, simply in- system automatically and accurately reconstructs protolanguages dicating whether they are cognate. Such models did not have the from modern languages. We apply this system to 637 Austronesian goal of reconstructing protolanguages and consequently use a rep- resentation that lacks the resolution required to infer ancestral languages, providing an accurate, large-scale automatic reconstruc- ’ phonetic sequences. Using phonological representations allows us tion of a set of protolanguages. Over 85% of the system s recon- to perform reconstruction and does not require us to assume that structions are within one character of the manual reconstruction cognate sets have been fully resolved as a preprocessing step. Rep- provided by a linguist specializing in . Be- resenting the words at each point in a phylogeny and having a model ing able to automatically reconstruct large numbers of languages of how they change give a way of comparing different hypothesized provides a useful way to quantitatively explore hypotheses about the cognate sets and hence inferring cognate sets automatically. factors determining which sounds in a language are likely to change The focus on problems other than reconstruction in previous over time. We demonstrate this by showing that the reconstructed computational approaches has meant that almost all existing Austronesian protolanguages provide compelling support for a hy- protolanguage reconstructions have been done manually. How- pothesis about the relationship between the function of a sound ever, to obtain more accurate reconstructions for older languages, and its probability of changing that was first proposed in 1955. large numbers of modern languages need to be analyzed. The Proto-Austronesian language, for instance, has over 1,200 de- ancestral | computational | diachronic scendant languages (21). All of these languages could potentially increase the quality of the reconstructions, but the number of possibilities increases considerably with each language, making it econstruction of the protolanguages from which modern lan- fi guages are descended is a difficult problem, occupying histor- dif cult to analyze a large number of languages simultaneously. R The few previous systems for automated reconstruction of pro- ical linguists since the late 18th century. To solve this problem – linguists have developed a labor-intensive manual procedure called tolanguages or cognate inference (22 24) were unable to handle the comparative method (1), drawing on information about the this increase in computational complexity, as they relied on de- sounds and words that appear in many modern languages to hy- terministic models of sound change and exact but intractable pothesize protolanguage reconstructions even when no written algorithms for reconstruction. records are available, opening one of the few possible windows Being able to reconstruct large numbers of languages also to prehistoric societies (2, 3). Reconstructions can help in un- makes it possible to provide quantitative answers to questions derstanding many aspects of our past, such as the technological about the factors that are involved in language change. We dem- onstrate the potential for automated reconstruction to lead to level (2), migration patterns (4), and scripts (2, 5) of early societies. fi Comparing reconstructions across many languages can help reveal novel results in historical linguistics by investigating a speci c the nature of language change itself, identifying which aspects hypothesized regularity in sound changes called functional load. of language are most likely to change over time, a long-standing The functional load hypothesis, introduced in 1955, asserts that question in historical linguistics (6, 7). sounds that play a more important role in distinguishing words are In many cases, direct evidence of the form of protolanguages is less likely to change over time (6). Our probabilistic reconstruction not available. Fortunately, owing to the world’s considerable lin- of hundreds of protolanguages in the Austronesian phylogeny guistic diversity, it is still possible to propose reconstructions by provides a way to explore this question quantitatively, producing leveraging a large collection of extant languages descended from a compelling evidence in favor of the functional load hypothesis. single protolanguage. Words that appear in these modern lan- guages can be organized into cognate sets that contain words sus- pected to have a shared ancestral form (Table 1). The key Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H. observation that makes reconstruction from these data possible is performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C., that languages seem to undergo a relatively limited set of regular D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper. sound changes, each applied to the entire vocabulary of a language The authors declare no conflict of interest. at specific stages of its history (1). Still, several factors make re- This article is a PNAS Direct Submission. N.C. is a guest editor invited by the Editorial construction a hard problem. For example, sound changes are often Board. context sensitive, and many are string insertions and deletions. Freely available online through the PNAS open access option. In this paper, we present an automated system capable of See Commentary on page 4159. large-scale reconstruction of protolanguages directly from words 1To whom correspondence should be addressed. E-mail: [email protected]. that appear in modern languages. This system is based on a This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. probabilistic model of sound change at the level of phonemes, 1073/pnas.1204678110/-/DCSupplemental.

4224–4229 | PNAS | March 12, 2013 | vol. 110 | no. 11 www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Table 1. Sample of reconstructions produced by the system SEE COMMENTARY

*Complete sets of reconstructions can be found in SI Appendix. † Randomly selected by stratified sampling according to the Levenshtein edit distance Δ. ‡ Levenshtein distance to a reference manual reconstruction, in this case the reconstruction of Blust (42). §The colors encode cognate sets. {We use this symbol for encoding missing data.

Model denote the subset of modern languages that have a word form in We use a probabilistic model of sound change and a Monte Carlo the cth cognate set. For each set c ∈ f1; ...; Cg and each language ℓ ∈ inference algorithm to reconstruct the lexicon and phonology of LðcÞ, we denote the modern word form by wcℓ. For cognate set protolanguages given a collection of cognate sets from modern c, only the minimal subtree τðcÞ containing LðcÞ and the root is languages. As in other recent work in computational historical relevant to the reconstruction inference problem for that set. linguistics (13–18), we make the simplifying assumption that each Our model of sound change is based on a generative process word evolves along the branches of a tree of languages, reflecting defined on this tree. From a high-level perspective, the genera- the languages’ phylogenetic relationships. The tree’s internal tive process is quite simple. Let c be the index of the current nodes are languages whose word forms are not observed, and the cognate set, with topology τðcÞ. First, a word is generated for the leaves are modern languages. The output of our system is a pos- root of τðcÞ, using an (initially unknown) root language model terior probability distribution over derivations. Each derivation (i.e., a probability distribution over strings). The words that ap- contains, for each cognate set, a reconstructed transcription of pear at other nodes of the tree are generated incrementally, using ancestral forms, as well as a list of sound changes describing the a branch-specific distribution over changes in strings to generate transformation from parent word to child word. This represen- each word from the word in the language that is its parent in τðcÞ. tation is rich enough to answer a wide range of queries that would Although this distribution differs across branches of the tree, normally be answered by carrying out the comparative method making it possible to estimate the pattern of changes involved in manually, such as which sound changes were most prominent the transition from one language to another, it remains the same along each branch of the tree. for all cognate sets, expressing changes that apply stochastically COMPUTER SCIENCES We model the evolution of discrete sequences of phonemes, to all words. The probabilities of substitution, insertion and de- using a context-dependent probabilistic string transducer (8). letion are also dependent on the context in which the change Probabilistic string transducers efficiently encode a distribution occurs. Further details of the distributions that were used and over possible changes that a string might undergo as it changes their parameterization appear in Materials and Methods. through time. Transducers are sufficient to capture most types of The flexibility of our model comes at the cost of having literally regular sound changes (e.g., lenitions, epentheses, and elisions) and millions of parameters to set, creating challenges not found in can be sensitive to the context in which a change takes place. Most most computational approaches to phylogenetics. Our inference COGNITIVE SCIENCES types of changes not captured by transducers are not regular (1) and algorithm learns these parameters automatically, using estab- PSYCHOLOGICAL AND are therefore less informative (e.g., metatheses, reduplications, and lished principles from machine learning and statistics. Specifi- haplologies). Unlike simple molecular InDel models used in com- cally, we use a variant of the expectation-maximization algorithm putational biology such as the TKF91 model (25), the parameter- (26), which alternates between producing reconstructions on the ization of our model is very expressive: Mutation probabilities are basis of the current parameter estimates and updating the pa- context sensitive, depending on the neighboring characters, and rameter estimates on the basis of those reconstructions. The each branch has its own set of parameters. This context-sensitive reconstructions are inferred using an efficient Monte Carlo in- and branch-specific parameterization plays a central role in our ference algorithm (27). The parameters are estimated by opti- system, allowing explicit modeling of sound changes. mizing a cost function that penalizes complexity, allowing us to Formally, let τ be a phylogenetic tree of languages, where each obtain robust estimates of large numbers of parameters. See SI language is linked to the languages that descended from it. In Appendix, Section 1 for further details of the inference algorithm. such a tree, the modern languages, whose word forms will be If cognate assignments are not available, our system can be observed, are the leaves of τ. The most recent common ancestor applied just to lists of words in different languages. In this case it of these modern languages is the root of τ. Internal nodes of the automatically infers the cognate assignments as well as the tree (including the root) are protolanguages with unobserved reconstructions. This setting requires only two modifications to word forms. Let L denote all languages, modern and otherwise. the model. First, because cognates are not available, we index the All word forms are assumed to be strings in the International words by their semantic meaning (or gloss) g, and there are thus Phonetic Alphabet (IPA). G groups of words. The model is then defined as in the previous We assume that word forms evolve along the branches of the case, with words indexed as wgℓ. Second, the generation process is tree τ. However, it is usually not the case that a word belonging augmented with a notion of innovation, wherein a word wgℓ′ in to each cognate set exists in each modern language—words are some language ℓ′ may instead be generated independently from lost or replaced over time, meaning that words that appear in the its parent word wgℓ. In this instance, the word is generated from root languages may not have cognate descendants in the lan- a language model as though it were a root string. In effect, the guages at the leaves of the tree. For the moment, we assume there tree is “cut” at a language when innovation happens, and so the is a known list of C cognate sets. For each c ∈ f1; ...; Cg let LðcÞ word begins anew. The probability of innovation in any given

Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4225 Fig. 1. Quantitative validation of Random Automated Effect of tree Effect of tree size A modern reconstruction B C reconstructions and identification quality of some important factors influ- Agreement between two linguists 0.55 encing reconstruction quality. (A) 0.500

V

a K F

P

a i e U F i F S L u U j I T

p i a a v R a il u f utun T i N ua k v a a i o k o i e ka eaE o ra oke b R t u a t e n n m it u T n a n u b a u l T k n p n N a Ma aWes g B e T u n uv g m n Mel l i n n M pa Ti u i a u gi u o a no ah m Bell i n a a a a R a a a A m i r as u S a e n n T l k u n oro k u mba i p u a I ma s fui a u V a H a m a k Ea a n A vu i an nu u m Tahi e i n am ll e is T nu a soo E a o u a a s e K F V e a a u T a ro r aw l i e opi e u n r n u auna n P t ve g a ei u r a qu e a T v r pe ah ma u T u L W o u a i A N us S i U B w e t i M s a F i N e i F S L u t k i j t a s L e y M a U i r T Ruru a w r I l na W l bu i i M oan s p a o aii Ea Si fo u ba a v R a i T t s u f u T Lon p i N u k re L M a i e e o n a v a a i o t u s k o Wes L iY a Da i t e n a a o t u e k b e e a r t o i o R t a ani u a t n i s e n m i a s l a un T n ant r un a an va B B a a a ig u a ga s K Gim M nn a b a k P d a u l T k n p o n h an r N a Mi s um B ar M a g e T u n m M u g a l a n B ba i b E a mo tua n i M M i R e n th p T u e a a u gi uk N e n a m A B iu s a n wa B W a a a i se R a v a h M a u n A m i r n t a a M a S a u e n n o te r T n h a i e T l oo W u M k m a u po u n o ell a i I a m a la u s f a a H a m s N i a k a n or A E v un s nu u i m i a i B s e ingl T e e i t a W n a ll e i e r ki T nu E l a s o Am aw a o t u s e n a r a e V e s a a u u en T hy i r a l i r e a N r r o n e a u n u a n u a n o o a t u v n a g s e v r a hit M m e T p nFi lM q a r o u T a m u D r o u L W m T N u A S u i t il w B i a Mars o a A p i M a w N D Da t s N i e i r a amar h u a s t L k e o y M o a R hi t i a w r r g n mp s Ma n W i l b Ca r hE S o a E o S s f u b a e e an m D s e T i a t p s i tr i r L L M n a Y i r u e a e m o o j I E a t t a u s l ng T W t n e ii a an L a i i a D hu Se u u i e i o e a em n ut r o r i s m ra s E s a B r a a l n I a r ela a g v a K B a en Be P a u s G M d aa e o e r i e n a M s u a r a PingPonape h N Ja a ib na n n a a b h e C bon R e m t a n Big n M S a NA e s i u n A N t e w s B Ki K all w l r se h t i r g s r a w a W Mo n M a t a i T i tub K l ot t M T n h W m u a gae t N e o a w l n a a a a r u e ma A i o i i l s B i a a r k A W M s e r Uj NG i e u r h t n a e M il b u hai n a h u N a o p a m n r s uAm n i D M l Saipa ok apes a a W t e n m y T r t ili P W Ela i a oh t M N D F o i D r a E a a A So ti i ouA C g t n m s r g S u a e H m a l ar e e a i r i m e t h D r s e Ch o il S ul a o j i I E a t t a l n n o l a h n n n u l e n T Ch n o l e A a ner r e n S u s r o C ea m shall n I u E e a r e B a uu s Anna s P u m aa o r A b i o a nC lu ia a P P N J a e e ie b n u M o om u a C a b Sat r i e n r o K l S N a e a uk ol k A M i y i i K a w w a W r s u or esero ar nf o n n i i T i t K m l o Na a r u u a ge t m tlocin le b r M gi a i e a A ai i a e o B boru aa b s esr Uj NG a nA w s i m a p u l u A h W an s Werou hg n S ok l a a W t u e PuP al e S a k a P W a e E i ah o t u o A k B dg S u p t i o l l e M n o Ta C i o a i e H m a a l l o le K e d o p l il e S u e n u s s gaa a C h n o l e n A r ro wA a N n C u a e s m N nn e M o w h s n A s Pa unu a a a i e a M a u nn a l i om u a a P di j a k S u o C i e f n r S N m te n uk ro k r A M i ya i a o e a u a o e o a n r N o a A akaoPorte sen im a n t r l r a o o b r n v a K WB a a e t in s le o e bu e A e n o u uk a lo e B r aa a va S n gg b W w s ia s W ou m h n jo e va P P a e c S a k m v i a N m uu o l A k n B dggo a a S W ra u o a l lo l e K e M an d T Ta A u S d y u e s ga a N Ne s e nei t e sT waA a s N n a O s il o N n e M o i w p Ga a b e a i ea a ha e E or t na a P d j k Bal m el S N m e n o em a N v a F A a e s n i a k u n a N a L od n n k v a e K WB a Ng m a eratu e a A e n o u u g a S t q EnLi b i g j v S n g b o Or un a i n o o av e va ut k k gem eN la m P i a N m Paa o a ir a lu a o a S W ra u o a Mw h n NK a T A r u tS d y T RagaE ak r N N a n t e o s Pete o f P n a Argun n a e e O a s il e m t a A a h s pe i G a b r AmbrymSoue l te lau a e E al o seSap eku rm N B el raraMa SP e a Nv amdF ra n iT i ik N m aa L nio e tu ou i a gu t q EL b i g A Mo ot Ter So O a i e N n ra R n le u rk na k gm e la n kiS Mer t Atonambtu ra I Pa t on ir a lu a u SyeEr ta e ke t M h NK k rg o MT a l lo a w R E Pa r A u ei a o P me o a fa na a u n th emm A e tl g te A k a a Lerom K a mb ter a a e l m Ur L s e p S u er Tan nakean a amahka r ar S P T i ik Kwa mera SLi ng ym a o ti i a er T na a A M Ma u oon b T awarS l di r So R t m le o era ak uo A a tun a Sa u KE r i S i M ta e ke r tI o th luta yeE So e MT a l lo F nt ga Tapu u re o F a g aA A rr th i emma h a g an na ai L o K a a Ba a n e m U L ma uroPaniiRihu Per gu Ta na a ra ak Reconstruction error rates for a Agu u nnKw k n SLi ng T f n a Taw a a el a Ka huawa Iliu S m eadi V Seru ar oue KEr r i Aro n Sa og thra luta si Aros Nila r F nt a Tapu Aros Ta Teua F ag aAna A Sa wat i is a gaa ai ntaiOn Cae t K a B ni er n Ka h om mar au ni Ri P gu 0.5 ib R Da ro Agh FaguaMamial ast ea Paw uu Tu n a Ba E esn a f liu ru ur a ni tined mb Ka aV I S e BauroH LYame ni Aro hu la aunu iTa s A a Ni un Sa oB Ke Ar iT ro Te ar AreareWaaAuaroo laru a S osi aw si is r luV Se iIrirro an O at K a a Ar il wma o Ka ta Cne om am e i ohia hu ataib R tD areMa a KC ng Faa l as ea g Ba gMa E eesn a Do ras a mpuerin ur anmi tind mb io Lom Ba oHa i LeYam ni K nda ur un iTa Sa Sa Su e ja SaaoBa u Ke u aUl a gab Are A ro lar a Or awa ejangRiTa ar ulu o Se iIrrio SaaS oh R l A eW Vil iwma o Sa aaV a Tbo bili B re a re aia Koha aUk ill aganadal Maa C ung iNiM TKoro iB Do mping L on a rangan ri s La er LauNoggu Sa o Kom a LauW Kraoro S und ja a laderth lanSaaan Saa aa S gRe BilBia Ul ja n gab a S Or awa Re liTa KwaFar Atayy aa S oha Tbo li lB a atalekaeSo Ciuli Ata Saa aaV abi da Mba l Squliq UkiN ill Tagrona iB e lelea iq Lo iMa Ko gan Sed n ngg Sa ran Lau Bunu L au L a u u oaro M Wa North SnaKr baengg ao ang la de Blaailaan Lan Kwa uu FavoTh rl Bi galanio Kwa raF a tayay ga kai ataleeS k CiuliqAAta Kwa Ru n M b ola Squli Toamba i Paiwa nabu a e le le q ita Kanaka a Sedi n L a Bunu Ngge Mba u LengoParila Saaroa e nggu o ng Len p Ts ou Kw u FThavaorla goGhaim a L a ngal aio L Puyum anga kai engo Kav alan K Ru Gh Toam wai Paiwa n nabu ariNgini Basai baita Kanaka Ghari CentralAmi TaliNggseKeoro Siraya L Nggel a Pazeh engoPa ripa Saaro TalisePole L engoGha Tsou GhariTanda Saisiat im Puyuma L engo Kav alan Mbirao Yakan Gh GhariNggae Bajo ariNgini Bas ai lango Mapun lAmi Ma GhaTa rliiNseggKeoor SirCaenyatra S Pazeh Talise amalSia si TalisePole Tolo Inabaknon GhariTanda Sa isia t Ghari SubanunSin Su Mbira o Yakan hariNdi banonSio gae Bajo G Weste GhariNg Map Talise Mala rnBuk M a la ngo un i ManM a Talis eMotoul noobbooIliaW es SamalSiasi Bug t Talise In tuDh Manobo Tolo abaknon Mbugho M an Diba Ghari SubanunSin apes e oboAtad Su Y Vitu M ano hariNdi banonSio boAt G W este rn ai au Talis eMala Buk L a kaaiBlil ManoManobo M akan bo Tigw iseMoli M anoa nobbooW N utu M a no Ka la TalBugotu Iliaest Maut k Binu boSa Dh Manob Baro kid ra bughotu M oDiba M es e anoboA mas IraM Yap Vitu M tad kLa nua ran n a a noboA M ada gl Sasa o tau k k akaBlaili Man LihirSunTigiaang Bali kLanai M a nooboTigw T k M Na u M a noboKala Nali a manwa a utut Bi boS baseline (which consists of picking t Il M arok nukid ara es on B raW g Hili ggo s Ka Tun ga ma IraMnuna r gag a yn da kLa a nao Tun ari Waray on M a ungl S asak M adnon TBu Wa ihirS k Ba autuan ra L TTiigaang Bali kaYs sugon y lik M ka ta SurigaononJolo Na a m Kilo oaG Ak t Ilo a nw Kokng lan Wes nggo a bla Ce onBis ara ung Hili Bla ga Ka bua K gT gay lan as la no nga ra W non lab m ia TMansaga Tu a dani B a ra B uSa aK l aga n M ano Tautu yW gh n Le l B Ys usan a ra L a aba e ogAka aka ugJolon y Z ring o Bikol nt ok ta A Sur o M a r Tagbanw Kil okoaG kla iga 0.375 oPoat Na Kang C no on a eT g nB o ggng Kma Tag a labl a ebu is n Nari e o Bat b a C B ng Kal an M ing r a an Ka bla as a o r keHolilu P kP w la amKia TM a gan Ma he B B alawanBal aAb B S na el ag ns C uro an aw ghuba eL aloa k Fa va SWPgg L a Za ng Bi gAna no a H i a a ri ro T kol t o ru anunooa t M Poat ag M U ra u u P law oeT a ba Nag To Al Sia a gang m Ta n a o oa Da nglaua no Ngri eKolo B gb wa C on ng y h M a ng H r ata a Ka M on Ka a i n ri e lu Pa kPanw LMu a t kB M a ek Bi o Ba l aA a Da y ingaak Ch ur aw la b ng naK o a a a S nggan w u a L a M K n t oF av HW i B L at na n a era ak on ru u anP a at ab ta ta u i dor N M U ra unla B a TMa na gaju o lu SPa o w ba ab aghaAv a unj Mih T oA D in la o a n Ba B V a ala noa ay g u o an gg MKer unnya on ong K a hi an n o el M Mu a kB bat Se iri i inc g n aL Ka tin a Ba R is Mel ay g a D ga k r n uSari un an Lo ay at Va o u O L at a aa M K a n h gan a b n an e a d k iG aT ua In Lo y a a ta t hu ri o N is n kg do u B a g v TM na r ga r a ag B wM ab abaa A a u aa Mih j a at Hn n a n B B V na nj al u V b a Mel nj esia ta gg MKer unn a isi s p M a lay a n o e yan a B S is e M al re s e Ma b ri i l inc g N p of a n a Se Riis a TeoHa i in yuayB B r n M y i n a P a a o O e u T s C h n Br V h u ga lay S a ha lo e h anRg a iG T In L r o o MokeH ru ka u h s a ua d o n u a u a n a ri n kg B o w Ne S re k nv i a s a ta g n a n M a ruo a Iba na n b V iHan a n e a b ag g g a ba s M j s G n n nC a is pe M a i la M Mn g e Idaana C B S s p f Mi e a re a y a el a B y ha h Ni eoa o lay l se n V h t Timu o TH i n ay non a K u n a P a uB M a q aa e n m a T s C h n B h Ug okou gv Bil du h lo e h a g ru a G L ga a k M H ru n h b no e B n b g D e o uo o a R k n a e o N S re nv a s Ku uH h k K B er tu it us a ro Ib k in a b L g u M en elai B n u a e a n a oa aw lu a Mu u b ag G an n n g sa d bn e r n M Mn gg e a C N a ta r la yah a a n l Id y C h u imi a r E E L t n r V o e ta B h K v a n a h T u aa o a S tp ia a n n L n o a K i m o u g g g h auML o a k q aa e m n n R a nS Nia g an h Ug u gv d P To g o n o ga B la u uD a an t Madur a a n G b L i g u u u s no a g u no e e B n b T Ivata baBa n uk uH h Ke B e t i o u K o tup Ba o n K k u t n ndasS It M L g u o M e r l Ba s u lao I bay B a r n a u M u a ri ir Itb r d naa e l w Mol ae arb a a s t y a r u n K o i k I n N a E L l i a L ro s ay u nB e tak l u imbi a r E a a t n r V e uni Iv al y s K v a n a n h ag o i Y I a at S tp i a n L K a a m an a e o u g g h a L M R r a as m at a a S g N o M il s R n n T g g a u o n uad S m o y P s t i n v ma n Ka o e c a a M o a a a M n ob i a r ay u a u u n a li a K If i od r n T u I b s n u g D i s u a K m o K d o p v ad o n i r I a u p B I a o k 0.45 a Ga K a w Pa ni n n S otu o t M bu i a ll gao a ba i r la I I a b B B d a Ka g a r e i t r t u M o a ll M a a b a a a a a a I b a m K o I b r e Ubi m io nga a ili Lk on r u y n e t n l D I l al h l g r s a a n u f Ifugaoo k p Boto V e u Iv a y B s W ap li d i B ug ha B o Y a y a k an a n an a a i I l o o s i B on i o a K a m a a t a e p i a l Ba d ay M R rd a a a m a w i u ba Ka nk g M n s o si i n ng u il S s t y a M D A u It K a ug Ka a m o a S li is G o v m n Ka o e c eo K n tok o a a G ir ali l o na K n b i r alaw K Is Att n a t a K I i o r n Ma g r u i D a toc a D u a f m y o Il Ag Ba K enI e a ilis K Se wo e n i u r I a u p d o e e u dd A w P n i a a n ak G K a i n a ai n ok gB n G n b i ll g a b r Bu Y aPa g a K d l Ma ma a m d b a g Gum R v M W egDib M i ll W eu S O t t K e i G o a o a a m a ul pi o a g a u a I a b ob au e r V a n a e U i n r o S Ta m a B S I l l a l ano u ga D p o l Auh g M h p b e i a w i i n n u f o a P a k b gi gya al li a T G d I B r T g p S k n a d B l B a t i K i n n h i K an ug o Bo a a W a B i f i u e n g in an n i a i s G a a l u uka r k m a n i o W B o l i u o o Ja g o a a Au a ag o y o s o o u o a un B a Ta ili a o p a a e l d a n o i n M B S li n a i i J g u e g d w one modern word at random), our m t b g u K l a o t y A o g s n p r n M re r N n i g ar p n a n l n w s B k e p a A u a u y o at id t g S K u n g h e M D I K i o o va t a S a i a li i a o e e aa i e r Ga a e ss n a i g i a K t a t o o a Ca t ed n s r a l K M W G n a oB u rn l g i T i a n a k s o n n e mi e il l Is o m a n T a K A a t K Mou i o g r i D n t o a M r g M u i W B m G li A Ka e o e I M m e a B Ka o o eSo i w e l k a r a m p S n l A g M t ab ne S o l u tt k am e dd A ab n a M e i nK u i a a u n at s a ok g a a n a n n U a n c r l i r B Y g g G b a a u a u o b d M m a m Y g a r R v l W a G e M t e k b S O t l uhe W e G I W a K n i B g u t p l a u o a P a e r s a V a a Sengseng o n u o e r a n gga M j r o S T o B S g a i l a l a a g o u m l A g g M a e a i a a pu k a a i w P gi g a b T G a l u r d T a H g S g n k B K a i t n a a K a n n un K a D i o B a b i a um i u e n n ga i a a a n i n a i G y k s o u u r m W o l y a A a o o J g y n r u o a u u o a ilib a o y T t a n e ib P o n o li n M i n B e N u S g a o n J L e g e m d t u l a a n y a A p r n d M i r r N g a o p n n l n k B o B e p a i t g S h un n e o gil i o v tC a i e s e aa i K a ss e d i a a t o e o a n s M W r a r P u i T l e k s m e m a n T a g M m i K b n o M K r g i G W B l as o e M m o B K o e r a m a p S l A g M i t a n n a a n a M a n a a

a a U ly a r S b a a e u o b e g Y a a r t b l W a t o e r s a o n u S gg n M j n o m a m o an a l K H g e n a u u b a a y o

s a y P N u L n a n

ia B o

K P n o

l y

n

e

s

i a

n system, and the amount of dis- 0.250 agreement between two linguist’s 0.4 manual reconstructions. Reconstruc- 0.125 0.35 tion error rates are Levenshtein Reconstruction error rate Reconstruction error rate N/A Reconstruction error rate distances normalized by the mean 0 0.3 word form length so that errors 0 10 20 30 40 50 60 Number of modern languages can be compared across languages. Agreement between linguists was computed on only Proto-Oceanic because the dataset used lacked multiple reconstructions for other protolanguages.(B) The effect of the topology on the quality of the reconstruction. On one hand, the difference between reconstruction error rates obtained from the system that ran on an uninformed topology (first and second) and rates obtained from the system that ran on an informed topology (third and fourth) is statistically significant. On the other hand, the corresponding difference between a flat tree and a random binary tree is not statistically significant, nor is the difference between using the consensus tree of ref. 41 and the Ethnologue tree (29). This suggests that our method has a certain robustness to moderate topology variations. (C)Re- construction error rate as a function of the number of languages used to train our automatic reconstruction system. Note that the error is not expected to go down to zero, perfect reconstruction being generally unidentifiable. The results in A and B are directly comparable: In fact, the entry labeled “Ethnologue” in B corresponds to the green Proto-Austronesian entry in A. The results in A and B and those in C are not directly comparable because the evaluation in C is restricted to those cognates with at least one reflex in the smallest evaluation set (to make the curve comparable across the horizontal axis of C). language is initially unknown and must be learned automatically except from 32 to 64 languages, where the average edit distance along with the other branch-specific model parameters. improvement was 0.05. For comparison, we also evaluated previous automatic re- Results construction methods. These previous methods do not scale to Our results address three questions about the performance of our large datasets so we performed comparisons on smaller subsets system. First, how well does it reconstruct protolanguages? Second, of the Austronesian dataset. We show in SI Appendix, Section 2 how well does it identify cognate sets? Finally, how can this approach that our method outperforms these baselines. be used to address outstanding questions in historical linguistics? We analyze the output of our system in more depth in Fig. 2 A–C, which shows the system learned a variety of realistic sound Protolanguage Reconstructions. To test our system, we applied it to changes across the Austronesian family (30). In Fig. 2D, we show a large-scale database of Austronesian languages, the Austrone- the most frequent substitution errors in the Proto-Austronesian sian Basic Vocabulary Database (ABVD) (28). We used a pre- reconstruction experiments. See SI Appendix,Section5for viously established phylogeny for these languages, the Ethnologue details and similar plots for the most common incorrect inser- tree (29) (we also describe experiments with other trees in Fig. 1). tions and deletions. For this first test of our system we also used the cognate sets provided in the database. The dataset contained 659 languages at Cognate Recovery. Previous reconstruction systems (22) required the time of download (August 7, 2010), including a few languages that cognate sets be provided to the system. However, the crea- outside the Austronesian family and some manually reconstructed tion of these large cognate databases requires considerable an- protolanguages used for evaluation. The total data comprised notation effort on the part of linguists and often requires that at 142,661 word forms and 7,708 cognate sets. The goal was to re- least some reconstruction be done by hand. To demonstrate that construct the word in each protolanguage that corresponded to our model can accurately infer cognate sets automatically, we each cognate set and to infer the patterns of sound changes along used a version of our system that learns which words are cognate, each branch in the phylogeny. See SI Appendix,Section2for starting only from raw word lists and their meanings. This system further details of our simulations. uses a faster but lower-fidelity model of sound change to infer We used the Austronesian dataset to quantitatively evaluate the correspondences. We then ran our reconstruction system on performance of our system by comparing withheld words from cognate sets that our cognate recovery system found. See SI Ap- known languages with automatic reconstructions of those words. pendix The Levenshtein distance between the held-out and reconstructed , Section 1 for details. This version of the system was run on all of the Oceanic lan- forms provides a measure of the number of errors in these reconstructions. We used this measure to show that using more guages in the ABVD, which comprise roughly half of the Aus- tronesian languages. We then evaluated the pairwise precision languages helped reconstruction and also to assess the overall fi performance of our system. Specifically, we compared the system’s (the fraction of cognate pairs identi ed by our system that are also in the set of labeled cognate pairs), pairwise recall (the fraction of error rate on the ancestral reconstructions to a baseline and also to fi the amount of divergence between the reconstructions of two labeled cognate pairs identi ed by our system), and pairwise F1 fi linguists (Fig. 1A). Given enough data, the system can achieve measure (de ned as the harmonic mean of precision and recall) reconstruction error rates close to the level of disagreement be- for the cognates found by our system against the known cognates tween manual reconstructions. In particular, most reconstructions that are encoded in the ABVD. We also report cluster purity, perfectly agree with manual reconstructions, and only a few con- which is the fraction of words that are in a cluster whose known tain big errors. Refer to Table 1 for examples of reconstructions. cognate group matches the cognate group of the cluster. See SI See SI Appendix, Section 3 for the full lists. Appendix, Section 2.3 for a detailed description of the metrics. We also present in Fig. 1B the effect of the tree topology on Using these metrics, we found that our system achieved a pre- reconstruction quality, reiterating the importance of using in- cision of 0.844, recall of 0.621, F1 of 0.715, and cluster purity of formative topologies for reconstruction. In Fig. 1C, we show that 0.918. Thus, over 9 of 10 words are correctly grouped, and our the accuracy of our method increases with the number of ob- system errs on the side of undergrouping words rather than clus- served , confirming that large-scale inference tering words that are not cognates. Because the null hypothesis in is desirable for automatic protolanguage reconstruction: Recon- historical linguistics is to deem words to be unrelated unless struction improved statistically significantly with each increase proved otherwise, a slight undergrouping is the desired behavior.

4226 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al. A i y u B m n SEE COMMENTARY

(Fig. S5) (Fig. S2)

t u

a

ili a

a u o

o

n ta as

b

B t i a b

a h a m l

u

l lu

n T

a n a lu m u a

o g e

n

aE i e m i u

a a e i a Ir

i oo n n n

a e u 12 n

m k u v m n mb j im s u pon f en

e a i

kuu a una i

o

i a n

e i v iu a

u o u u

a r e apingamar p v ir

F v

V T a T a

N T n

U B ei

T

Pukapuka As k Luangiua en

T Ne

Nuk uoro i a

opia r W we Sikaiana S a

K W i

Tuva Lou e uss s Mis

Ta Na i VaeakauTau y ua 10 o

L li r

FutunaEast b

veUv a W e s t S m o a

IfiraMeleM L a

L a FutunaAniw f u s

L i o n iY b

Rennellese M u op a

Anuta D a

Emae K a e s l

Tik G a s

Bellona B ig m r B a

itianMo M

Samoan s ar d r a e RapanuiEas Mi

B

Hawaiian u g Mangareva A B

n th wa s as r Marquesan W

264) N mb i r M

ahtanth hitia Ta Ma we a a Rarotongan in a st

A o i a l M Ba

Tah W a l

Rurutuan 537) o

(333) a )

(483) N M

(592) m ( a

Manihiki ) r p

(694) i

) t

(647) e ) (704) D

Tuamotu al

(181)

(209)

(703) D a an Ar ) r

Penrhyn ) (218) ) s il

317 T (180) t Tahiti m

598

( hE D (13) ( 325 ) g S

)

) 323) Im

)

) (

Maori ) a u

(163) t

r

(675) n e

) E

WesternFij ) er (359) s

(63) r 682

459 ut a

(383) E s

Rotuman ( o

361) l T

708) e 654

666

(644) S e 98 26

656 680 442 ehu De

(534 ( e o (

( ( (

(702 ( b ( ( e

( )

Nengone C on 505) r

(643 NA i

(

(546)

( So r iB

Iaai b na

( e b

374) i

We ( a an la na Ca 330) (689)

) (524) e

435 Taranj a

( s (162) Jawe

( tu

) K m l (642) i

el wa m le Ne ) U

(384) e t se

( 408 ) Ng a A m A

auru Na (182) 68 i

(562) a (

( 25 G a Marshalles A

s n ( l u )

) W t

u aie Kus 378 )

)

u e E ah l ) ( 8 )

Kiribati (645) 742) loh t a

) ( )

581 Hi e o 748

Ponapean ( r 153 m u

(

role 667 (533) So K 729 ) m

Pingilapes ( ) )

329 ) na a p b t k gd g q ( 229 A

o (

( ) un a m Mokile ( 164)

) l fi ( Pa s Woleai ur n i

200 u

) 97 ( 588 )

(563) (266 A n i o r (122) ( n a

PuloAnna

se 115 r Na

244) ) r

) ( 606) M o y o )

SaipanCaro ) 211

683 ) (148 u

) ( ( a

( ) B r o b

( ( 604 keseA

So 460 (

( ) b ) We u

Chuukese ( 135 a (441) 707 m

701 ( 134 191) o na

734) ( B h

arlnan rolinia Ca ( ( (

(512) ngg a

S

) d Mortlockes

(514)

(156 a T

)

) Ma

a Chuu

( dok

) a

) (719

(481)

(543 177 g Satawale M

(411) i

) ) w (

616 (744)

) N

(417 532 on

Woleaian ) d

373

603) rtO

(526)

( je a

) 718) ) P o PuloAnnan

uluwatese ) u ( (425 (555) ) k

120 6

( (5) K

P a

106) ( 474

( ( a 11

(9) 503 ma

( ( Wei an (131

aman a m Na 622 a k

(485 )

(105) ) (513) u

( B b

( eveei e e v Ne

( v Ngg

577 386 m 7

So nu (419)

(515)

vava v a Av ( ) a a

(132) ) r u

) S o

SakaoPo ) ) S a

( 660 u t d

(146)

( ) Wa y

AnejomAnei 697 a e sT

(745) s ) 192

) ( 450 587

) G 1 (

Tape ( 161 ( a ) (527)

283) ) re ) lil

( E ese s Ne (385) (528) 159 o

738 mbol

( (297) 698

( ( Ba

(520) 160 ( ah q a v ha Na

F

(139) ( ) La e a (443 ati Na n

io

(214 (116)

n ) L 2 4 9

tu

608 ) g amakir k a m Na )

(446) ( ge ber

(725) End n

hEfate Ni

Nguna 433)

516) 108 465 m (436 e la

( Na

r o Ork

( ) 486) a u n

eSou a 8

( K l u k

Sout n

(445) ( ) ) a r

( Pa a

429)

Raga a (32)

86 n Arg (

mes ) ( 722 A k u m

Mwotlap (612 ( r

la e

) 06) Se

(434)

Paa ) u T

(432)

a

) i 454) i

1

(4

P t k ( i ) ymSout S m ry PeteraraMa (

h (45 a

426 o ri

(332) )

(

R on e Mot

) 586

) T

607) 326 mb (489 ( ( At e

Amb a l

) 165) )

(531

2) (

( ) M tun I

Merei

427) (357) 428

) e ) era ot

( ) ( l l

( 360449 T r i out kiS Ara

rman a rrom ) a (536 )

351 (

(35 ) 517 ) emak

(522 ( K m Ura

) ) aho (491) 287

119) a

(

557) 152 721 ) 2) L m

( SyeE 33 (

(510) 76) 4

( 5 ) (

enakel ) ( ( ( ) La a ( (420

L ) k ng

(609

(10) ( 29 )

444 578) (12) f v s zx h

724 ( ) Si

Kwamera 716 356 ) (659) ( (599

24) 101) ( ) (

(402)

( ( ( 601) 186 669 Keda

TannaSouth ( ( ) ai

149 ) r r (15) (150) ( )

) 44 ) ( 277 E u i

Tawaroga ( ( l

308 582 ) ta

) (

aniRihu ( Ta u SantaAna 9 114

g 525

( 166) ( p i ) a

) ( ) A ra F

ganiAguf n 636

(69 ) e

259 ) P u

Fa g

18) (

( 731

3 u (726) a 307 (467 496) ) ( )

( ( ) T n

BauroPawaV 11

( ( 306 ) a

) 170

3 ( ) Iliu

6 ru Kahu

(300) ( 155) ( 591 ) e

) 653

o i ros

(658) ( 273 ( ) S (6 A

569 ila

o T at wa iTa ros ( ) N (709) (

A

i

atal (14506

)

(174) ( ) r

Teuna

77) 589)

rsiOneib Aros ( 690 s ntaC ( i

)

(173) a

(112) ( 456) K

S (171 223) ( ma mar )

6 i o

(60 24 ) ( R tDa

(

KahuaMam s

( 479 a ese (657) E Fagan oo 5

) in na

(21) 57) de ba

il ) 4 ) Let m

BauroHaunu m V i

(23) 519 ( 285

u a a

( ) ) n auroBar (740 ) Y

B ( Ta

(22) 70)

( 461 178 ( 6 ( 540 i

) Ke 3 s ru

(570)

SaaAul

eWaia a W re ( la ia

aa e iIr

M (247 S a

r a Are iw rro

(172) 671)

(564) 6

Ko Areare

) amo g

9 ( ( 751) h

(59) C

58) Dorio

(54 ) 274) mpun g ( (

) 286 La rin (19)

Saa 623 ( ) me

(145 Ko

(18) )

SaaUlawa ( 42)

1 (

ha ( 322 unda Reja (111) o S

Or ( 676) ab

a jang

aaSaaVill Re (548)

) S

52

5 oliTag

( 275)

( i

Tb SaaUkiNiM

u 90) )

(4 583 gabil

( (621) ) Ta

0

Longg

(55

orth ronadalBiB

N

u 0) Ko 1) gan La 1

55 3 ra n

( ( 290) 65) Sa oro

LauWalade ( (6 K

46) n

3 aa ra

)

(

4 ( 638) Bil 31

ataleka

Sol

(

F

e

)

617)

aa ( 28 291)

) ( BilaanSa a

(3

Kwar )

(315 ( 288) ( 573 liAtay )

5 Ciu

17 (

Mbaelelea

( 70) liqAtay

5) 301)

34 ) Squ

(

( Lau

) (71

664 )

9 ( diq

8

3 ) Se (

(133

Mbaengguu

)

13 (625) Bunun (3 (127)

Kwaio

) ( 81) o

) 509 390)

73

4 ( ha (

T ( ) ( 580)

Langalanga g

634 rla n

( ) vo (299) Fa

Kwai ( 535 672)

) i (312) aita

18 ) ( a

( 28 uk

(6 ) R

Toamb ) 176

) (

298 (74

(

) )

158 Paiwan (

Nggela

)

( 100 kanabu

(678 ) Kana

LengoParip 737

( ) 3)

45 260 a

h im Gha

( ( Saaro

Lengo ( 545)

(321)

(523)

( 492) ( 553) Tsou

Lengo

(320)

) (687) 189

( 688) Puyuma

h riNgini Gha ( 319)

( (269) Kav alan

aliseKoo

T )

(198) ( 530 ( 472) (54) er Basai h riNgg Gha

(649)

( 147 ) ( 109) Centra lAmi 7) (19

TalisePole ( 596) Siraya 652) (

h ianda riTa Gha Proto-

(504) Pazeh (199)

Mbirao ( 556)

)

(521 ( ( 179)

( 392

) ( 750) Saisiat ( 190) h riN Gha ggae

(196) ( (559) (40)

204) ( Yakan Ma ( lango 632) ( 91)

347) ( (375) Bajo

Talise (230) ( 560)

(648) ( (629) Mapun Tolo

(681) ( (630) SamalSia

h ri Gha s i

(1

9

4) (628) Inabaknon

G ha (195 (620) riNdi

)

SubanunSin u

( Tali

650) (75 s

e (733

3)

M ( 728 )

a )

la ( 362) ( 123 Suba

) no

T ( 651 n

a S

li ) i

( ( o

s 370

e

92 ) M

)

o

l

i W

( e

( 365 s 95

B t ) ( er

ug ) 36 n

( 6

714 ) B

ot uk ) u

a

a Austronesian

( M

393 a M

) ( no

b 43 b

u o

g ) W (741)

h e

o ( 27 s t t

u ) M

D

h ( 614 (363 an Y ( ) oboI

a 3 ) p 7

e 7) li

se a

( ( ( M a V 364

304 79 n i D

tu ) ) o

) boDib

(43 ( M a

( 369 k

0) 3 a

L 67 ) n

a ) o

b

k C oA

a

( l

338 t

a a (

i ( 388 d

) 3

68 Ma au )

N )

a n

k o

a b

n ( oA

a ( 3 ta

i 3 7

B 55 6) M u

i

(

l M (

53 ) 5 a

) 7 no

a ( 6 ( 233

u 405 ) b st

t o u )

ast

T t uT

u ) ( M ig

(33 4

Ba ( 2) a w 9 n

) 118

r

ob se

o

k

( ) M oK ( a

324

316 l

M an a ) (

)

a o

d b

a 80 o

( S k B

674 ) a

L i

L ( nuki r a

) a

i 121

hi

m

r

a ) 673 (

S d

s an

un

)

M a

lau

g

T ar pu

l

ig ( a e u 431

a (507 n k ao

) Ir

Ti a

( 493 ( ( ) nu

a 613

( 253 n 265 n

g ( ( ) (

) 311 ) ) 3 S

N 72 a

717 a

( s aEast (

(49 )

2 69

lik ) ( ) 225 ak

) ( 639 ( ) 3 ( )

495 ( ) B ngamar

K a

( ) 340) (

a ( ) (69 102 210 li

r 52 107)

a ) )

We ) ) Ma T ( uoro

g

un 208 s (635 m

t ) I an

a ) l

M ( o w

g 103 n a T

adar gg g

un )

( Hilig o va

B 662

ng a

a (

1 )

(

n

24

282 a

ka

o

) ( W y n

) 252

K non

i (3 a

k il

(

( )

r a

571 o

289 ayW

( jianB

k

( B

) 71) ) 350

a

K u

k

o t

a

83 u a ( val

k ( r

) ) 641) a a Y 727 T o

(

732

B n y

s a t

a o

l u

i

(

a (

) )

n o

82 640) s ean

eak

b

) u

S

l

(151 u

a g an

B ( u

(

57 J tun n

l

302 r

o san

a ( ig

g

) l n

404 (

b o unaAniw ) ( A ta

eaWe

a ) ( a

187) an L 693)

la 494 k

G ( o pia

755

a

) l n

n ) a

ghu (

(

o aMeleM ) g 47 n

( C u

( n

Z

a o

157) ) e

380 a nB

( b

S

75 547)

b ( (595

u

( Niue F )

a T T

K i a )

a 477) Uv e a E

a s nuiEas

M

m

(

a Toke

na n

452

) l P nth

a

a 136) o

Luangiua

a u

e

) s r

K (

Ma g

( Nu

i

( 349 Ngg n ia a v 381)

Kapi

ge n

(379

( 615)

ns Sikaiana

448 ( ) T

L M

a (

a a T

416 633

e o

) ( g k a

l

) P

( 245 Takuu

(

) a a r B

128 o

i

M 341 lo

n

a

) ik V

ro )

i

) g (

a gAnt

o a

415)

T u e

ri F

C ) ( lNa

a llo

T

n

h 126)

nu aii

( gb a U

g

700

ekeH g t

T

e

B (410) a a

) agb K

Ifir

moan

il ( n C

m 686 (

ur wa

215 B

M

ol

Fut

a

) a

( ( a 414 Ka p

o 403 an

o t

a n u

( ( )

n Pal

Ur

( w Rennell

) 736)

( 554) ( k q

( 337 )

o 267 aAb

413

470 Pa

u

F

(

( ( B a A 610

a ) )

a (

) Torau

327 137 w la

334 a v

(36 ur

n a iti

) ) ( a ) w (129

SW Emae ) ) n

o Mo 464 gg

B

) ( i a ) Ha Tiko

( Pal u

( ) a

n

M t

37 ( 56)

n

243)

o (

) 735 nun

o 396 ( aw A

Pal Be (

( L

34

n (407

l

631 270)

un ( u ) o t ot

anMo

a iti

( S o ) )

705 a

n h B

( i o Sa

g ) 279 n u o

a 125

)

(

a )

D a

35 g )

b (

B

L ( 400 ) a h n )

R a )

a 206

)

(

un ya 538 (

584 ( Ka i ta

b ( )

Ba ( 398

a )

g

)

n k

469

( D t w Ha t

)

a in

a

Va ) Ba b 487

a 167 a

(

K (

a ) 710

n ( ( y ga

440 )

a ) 455 331 ka

ghu K t

Mangare a (

B a

a

( )

)

679 (231 a ) L

711 n ab 333)

) k

n ) M d t o

hy 483)

( )

S

( ( N

a (

)

207)

38 o a

138) 264) Mar ) 48 )

a e e g

Maa r

) ( an ta )

n ( ( r i a

R

597) i

( 78 348 n h ju gg ah

438

i ( T

r

Ta

Va a

( (66 ) 399 647

i

(

a )

un (

55 o (217

) nyaM

a (605 ( 704)

Av (

) K r

Va 278 )

) a am

)

(694

i

572 e j ( 181 Rarotong

(499 si un

511 M r la )

B ( n

r i

(113 ) (

) ) 99 n i

130) e ( 602 ( a

g

s ) )

458 M c

(

( S )

l

T b

) nr

i a 468 )

(568) i

G ) (

439 (

( i

a e )

O te

( ) y

s 447 Ha 224

h

154

t ( l

(668 g u

i (

a a

o

508

u

)

n )

Lo Ruru

(218

611 )

) a S

Ni

y hiti ( n k

n ) (203

( (

gg (423 )

(

( (

( a 646) )

n u (

a

u In 746 ( 180)

) 466

ss 51 46

418 619 w

Ne r

T a ) (592)

a

)

( B d M Manihik (537)

u )

) T

163

397

a )

( o )

ha ) )

) a s

) ( a

) 391

e

(

n

303 ( M n (13) Tai (706 ( ( n

(

(

677) l 675) ) o )

382

110 es a T

n ( j

462) (

M

p

276 a a

( y

S )

359

(

H (

63) )

117 l 730)

of r

(209

o e ia

( ayB a M 480) ) (502 e (

( ) (703) ) Mb

( (

193 ( (241

) l 169 l

p s

(168 65 n Pe

o

) ay

i ( P

M ) (

e n e

696

) 62 )

s

)

( h

) (594 ah

a

294 ( C a M

(

Va ) u ar

( ( 498 )

(281) ) a

)

342 (

644

re ( h ng 579)

B Ta (

(383

H n

(

G 752 as 336 (

ov r

) r

ngunu

539

k 226 ) a u R

) u

h M k

335)

Ug (

575

( 534 ) ) e

in a

( ) n

)

an

o a

) ( )

212 o I 463) Maori

)

) n

K

478 b b (

( a

(213) (

hele k

o

) g

u 296 G a

) 723

on a (

Lu

451 n 643

( (

( e

(

C

501

b

437 255 a n )

343 I C

Lun

n We )

(

( daan h

)

o

y

gga ( 216)

293 h

593 ( 627

q ) 590 (

) B ) Ho

(

471

)

k o

( )

) a

( ( ( a ) 544

( 529) (

( un

Ti

( ) ( ) 754 64

o ( 488

655 u a Kus m

) (

(344 (

141

gg

a ( 94 (

) (

(

) 242 R

t ( 561 K

( )

Nd m 261) 637 497 (

a ( 4)

263 d (546)

(

) va ( 250) ( (

) 476

17) ( ) 237 e )

(

50 ) (

( B

)

201 (

( 84 666

96

) Si u

a 271 691 39

( 739 ) 702

( ug

( ( l

292) ) (

i ) 421 uk )

482 ) )

(712 ) B a D 104 234

R 219 ) n

) (

) ) )

(361) m

ghe ( 238

)

b u ( (713 er B ) )

205) ) o

Pa ( t o

) 90 (409)

)

)

( u )

(

e 235 ) 162)

) (240 i s

) K (

b e

)

v 41 n

183

(

(256 t (680

K ) ) a l

( B u 280 (

en o l M M u 505 i

t 228

Si ) ) w

u

272) a )

an

574 (

p a n

) ) L

( a T

a i u

) e

a

( r

a

305 ) y t

(

720 E ( ah n

an r nua

(

K ( (

a ( l (459) 642 ar ta (682

(

395

r ah

a

) 232

16 E ( ngga Lo

)

)

M

a

(

( 541

) n

( n 384 ) 251 N an ) (626)

ga

295 M

nd

(

353) 227) ) Lo (

a (

ag ( n

401 ( T gg V i 374

) 358)

143 (

685)

684 a u

o 61 ) )

249 ( (88 M o a 182

Lal ili

(

)

a

(

(

s n

( M

) I (689) or t

) ( a

M b no n

) 558 (31

(

(524)

624) v

) s

u 262 (257

a g rup 412) ) )

( I aB n

R ( 254 u

( at i

140 ( t d

ek B a 239

Kuni ) )

(

S ( 184 (220 222 ba o M o k

) )

( ( (484) (

30 I u

( (715) ( a

185) ( D

ou ( 236 ) 221 a a B ) r

r

u 2 258 I a

144 )

695) a r eo ( G

o ) ) ) t b n o ) a

y e t

) I (

K ) b

l

a

(202 ( ) ) r (

abadi u

s (

t ( B ( n

u ) I a s

749

M a

93

)

iliv a

( 354 v k

y

( Im a as (67)

( D r 72) t e

567 l

( ( 566 ) y

a

565 Ya as 562 (

600

( a ( i

475) a

( m

W 248 )

500 ( ( o 661 284) (

743) 518 s )

85 (

424 S 747 a

G (7) ( 89 y n

) c

o ) il

bu K

) 87

i

Ub ) ) o

e a t

)

a ) m a )

) m ) I (

) r e ( Molim

a ) a

) )

) Kall f r

d m o

p y

D K u o n

an p i

a G

i a d a

iod I n

r gao

M a a b

Pa n (645 um

u S (

p a g

Kak

S ll

i m

ah u a I aiw

A 585 ( b

( lo a lB

alib

( If

A

(20

i n

387

a

i I p

(

W 188 uh (

o a a B

(268 s

a Bo fug u h 394) Ki

l

73) g an o

u B n a K

i l a wa

a

in

) i

422) ( K g a

Riwo

o as o 309) ) t

B go d n K o

s airi e yn

K nt n o a S ) a a

g I i Ka

) ga

) a u Ta u

ayupula n a law G (533

n t

Ko a oA K ob Att n i eo

Ma n

I l t o oB t g M n

a A a a K e

Bili D sn lin k o r

M I e k en r ed ed ge da Ge a )

u dd a

v Yo lok g n

M m p O c e a a Bili u e aPam

W g

K G ata

B n

Sengse l

atukar ga k

e M t e g G m ra Ama

i M l n i ma ngen

Lamo

e I

Ar T a B

aulo S uin g au Mouk

egiar

a d u ami ib iS m ba Num Sa u an

S e a

Yabem g g

Wampar B a u

PunanKelai K Bukat

u B G i

a anUmaJu Kay a Wo a P a

B a an

M B

M ak J

B T T

bil W gi n w i a s y G i an a D y

a

o o a e l n

o a n

o g a o o g ngi o n o t a i u No

un v o un i n

l r S er o b

p r gi

n a a p n n

n d n i

aa o K gilS h ng li

e t a

r e T a

a i ss t l 122 e

i gg t s gaiMul i

n n o r e k n C o a

e se a p r o g li T

ra

e

ng n t M A e as K m a a r

a a ar a a

g a a u s te a S n loH bu a

b i

Mo j V r

t o

W g a a layo

o

o

b

a d

n )

u

n i (563)

P o (

lyn

e 683

s

i a

n

)

156) )

(

)

(Fig. S4) (Fig. S3) 734 ( (

543

) ( i

(481

ii

177

( )

3 Fraction of substitution errors

Fig. 2. Analysis of the output of our system in more depth. (A) An Austronesian phylogenetic tree from ref. 29 used in our analyses. Each quadrant is available in a larger format in SI Appendix, Figs. S2–S5, along with a detailed table of sound changes (SI Appendix, Table S5). The numbers in parentheses attached to each branch correspond to rows in SI Appendix, Table S5. The colors and numbers in parentheses encode the most prominent sound change along each branch, as inferred automatically by our system in SI Appendix, Section 4.(B) The most supported sound changes across the phylogeny, with the width of links proportional to the support. Note that the standard organization of the IPA chart into columns and rows according to place, manner, height, and backness is only for visualization purposes: This information was not encoded in the model in this experiment, showing that the model can recover realistic cross-linguistic sound change trends. All of the arcs correspond to sound changes frequently used by historical linguists: sonorizations /p/ > /b/ (1) and /t/ > /d/ (2), voicing changes (3, 4), debuccalizations /f/ > /h/ (5) and /s/ > /h/ (6), spirantizations /b/ > /v/ (7) and /p/ > /f/ (8), changes of place of articulation (9, 10), and vowel changes in height (11) and backness (12) (1). Whereas this visualization depicts sound changes as undirected arcs, the sound changes are actually represented with directionality in our system. (C) Zooming in a portion of the Oceanic languages, where the Nuclear Polynesian family (i) and Polynesian family (ii) are visible. Several attested sound changes such as debuccalization to Maori and place of articulation change /t/ > /k/ to Hawaiian (30) are suc- cessfully localized by the system. (D) Most common substitution errors in the PAn reconstructions produced by our system. The first phoneme in each pair ðx; yÞ represents the reference phoneme, followed by the incorrectly hypothesized one. Most of these errors could be plausible disagreements among human experts. For example, the most dominant error (p, v) could arise over a disagreement over the phonemic inventory of Proto-Austronesian, whereas vowels are common sources of disagreement.

Because we are ultimately interested in reconstruction, we then investigation was based on just four languages and found little compared our reconstruction system’s ability to reconstruct words evidence to support the hypothesis. This conclusion was criti- COMPUTER SCIENCES given these automatically determined cognates. Specifically, we cized by several authors (31, 32) on the basis of the small number took every cognate group found by our system (run on the Oceanic of languages and sound changes considered, although they pro- subclade) with at least two words in it. Then, we automatically vided no positive counterevidence. reconstructed the Proto-Oceanic ancestor of those words, using Using the output of our system, we collected sound change our system. For evaluation, we then looked at the average Lev- statistics from our reconstruction of 637 Austronesian languages, enshtein distance from our reconstructions to the known recon- including the probability of a particular change as estimated by our structions described in the previous sections. This time, however, system. These statistics provided the information needed to give COGNITIVE SCIENCES we average per modern word rather than per cognate group, to a more comprehensive quantitative evaluation of the FLH, using PSYCHOLOGICAL AND provide a fairer comparison. (Results were not substantially dif- a much larger sample than previous work (details in SI Appendix, ferent when averaging per cognate group.) Compared with re- Section 2.4). We show in Fig. 3 A and B that this analysis provides construction from manually labeled cognate sets, automatically clear quantitative evidence in favor of the FLH. The revealed identified cognates led to an increase in error rate of only 12.8% pattern would not be apparent had we not been able to reconstruct and with a significant reduction in the cost of curating linguistic large numbers of protolanguages and supply probabilities of dif- databases. See SI Appendix,Fig.S1for the fraction of words with ferent kinds of change taking place for each pair of languages. each Levenshtein distance for these reconstructions. Discussion Functional Load. To demonstrate the utility of large-scale recon- We have developed an automated system capable of large-scale struction of protolanguages, we used the output of our system to reconstruction of protolanguage word forms, cognate sets, and investigate an open question in historical linguistics. The func- sound change histories. The analysis of the properties of hun- tional load hypothesis (FLH), introduced 1955 (6), claims that dreds of ancient languages performed by this system goes far the probability that a sound will change over time is related to beyond the capabilities of any previous automated system and the amount of information provided by a sound. Intuitively, if would require significant amounts of manual effort by linguists. two phonemes appear only in words that are differentiated from Furthermore, the system is in no way restricted to applications one another by at least one other sound, then one can argue that like assessing the effects of functional load: It can be used as no information is lost if those phonemes merge together, be- a tool to investigate a wide range of questions about the structure cause no new ambiguous forms can be created by the merger. and dynamics of languages. A first step toward quantitatively testing the FLH was taken in In developing an automated system for reconstructing ancient 1967 (7). By defining a statistic that formalizes the amount of languages, it is by no means our goal to replace the careful information lost when a language undergoes a certain sound reconstructions performed by linguists. It should be emphasized change—on the basis of the proportion of words that are dis- that the reconstruction mechanism used by our system ignores criminated by each pair of phonemes—it became possible to many of the phenomena normally used in manual recon- evaluate the empirical support for the FLH. However, this initial structions. We have mentioned limitations due to the transducer

Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4227 Fig. 3. Increasing the number of languages we can 1 B 1 106 reconstruct gives new ways to approach questions in A historical linguistics, such as the effect of functional load on the probability of merging two sounds. The 0.8 0.8 104 plots shown are heat maps where the color encodes the log of the number of sound changes that fall 0.6 0.6 into a given two-dimensional bin. Each sound 102 > change x y is encoded as a pair of numbers in the 0.4 0.4 unit square, ðl; mÞ, as explained in Materials and Merger posterior Methods. To convey the amount of noise one could Merger posterior 1 expect from a study with the number of languages 0.2 0.2 that King previously used (7), we first show in A the 0 heat map visualization for four languages. Next, we 0 0 0 2x10-5 4x10-5 6x10-5 8x10-5 1x10-4 0 2x10-5 4x10-5 6x10-5 8x10-5 1x10-4 show the same plot for 637 Austronesian languages Functional load Functional load in B. Only in this latter setup is structure clearly visible: Most of the points with high probability of merging can be seen to have comparatively low functional load, providing evidence in favor of the functional load hypothesis introduced in 1955. See SI Appendix, Section 2.4 for details. formalism but other limitations include the lack of explicit mod- decisions taken in this process depend on a context to be specified shortly). eling of changes at the level of the phoneme inventories used by If t is not deleted, we choose a single substitution character in the bottom ​ a language and the lack of morphological analysis. Challenges word. We write S = Σ∪ fζg for this set of outcomes, where ζ is the special specific to the cognate inference task, for example difficulties with outcome indicating deletion. Importantly, the probabilities of this multino- polymorphisms, are also discussed in more detail in SI Appendix. mial can depend on both the previous character generated so far (i.e., the Another limitation of the current approach stems from the as- rightmost character p of yi−1) and the current character in the previous sumption that languages form a phylogenetic tree, an assumption generation string (t), providing a way to make changes context sensitive. violated by borrowing, variation, and creole languages. This multinomial decision acts as the initial distribution of the mutation Markov chain. We consider insertions only if a deletion was not selected in However, we believe our system will be useful to linguists in the first step. Here, we draw from a multinomial over S , where this time the several ways, particularly in contexts where there are large num- special outcome ζ corresponds to stopping insertions, and the other ele- bers of languages to be analyzed. Examples might include using ments of S correspond to symbols that are appended to yi . In this case, the the system to propose short lists of potential sound changes and conditioning environment is t = xi and the current rightmost symbol p in yi. correspondences across highly divergent word forms. Insertions continue until ζ is selected. We use θS;t;p;ℓ and θI;t;p;ℓ to denote the An exciting possible application of this work is to use the probabilities over the substitution and insertion decisions in the current model described here to infer the phylogenetic relationships branch ℓ′ → ℓ. A similar process generates the word at the root ℓ of a tree or between languages jointly with reconstructions and cognate sets. when an innovation happens at some language ℓ, treating this word as This will remove a source of circularity present in most previous a single string y1 generated from a dummy ancestor t = x1. In this case, only computational work in historical linguistics. Systems for inferring the insertion probabilities matter, and we separately parameterize these θ phylogenies such as ref. 13 generally assume that cognate sets are probabilities with R;t;p;ℓ. There is no actual dependence on t at the root or given as a fixed input, but cognacy as determined by linguists is in innovative languages, but this formulation allows us to unify the parame- θ ∈ RjΣj+1 ω ∈ ; ; turn motivated by phylogenetic considerations. The phylogenetic terization, with each ω;t;p;ℓ , where fR S Ig. During cognate in- tree hypothesized by the linguist is therefore affecting the tree ference, the decision to innovate is controlled by a simple Bernoulli random built by systems using only these cognates. This problem can be variable ngℓ for each language in the tree. When known cognate groups are avoided by inferring cognates at the same time as a phylogeny, assumed, ncℓ is set to 0 for all nonroot languages and to 1 for the root language. These Bernoulli distributions have parameters νℓ. something that should be possible using an extended version of fi our probabilistic model. Mutation distributions con ned in the family of transducers miss certain phylogenetic phenomena. For example, the process of reduplication (as in Our system is able to reconstruct the words that appear in “ ” ancient languages because it represents words as sequences of bye-bye , for example) is a well-studied mechanism to derive morpholog- ical and lexical forms that is not explicitly captured by transducers. The same sounds and uses a rich probabilistic model of sound change. This situation arises in metatheses (e.g., Old English frist > English first). How- is an important step forward from previous work applying com- ever, these changes are generally not regular and therefore less informative putational ideas to historical linguistics. By leveraging the full (1). Moreover, because we are using a probabilistic framework, these events sequence information available in the word forms in modern can still be handled in our system, even though their costs will simply not be languages, we hope to see in historical linguistics a breakthrough as discounted as they should be. similar to the advances in evolutionary biology prompted by the Note also that the generative process described in this section does not transition from morphological characters to molecular sequences allow explicit dependencies to the next character in ℓ. Relaxing this as- in phylogenetic analysis. sumption can be done in principle by using weighted transducers, but at the cost of a more computationally expensive inference problem (caused by the Materials and Methods transducer normalization computation) (34). A simpler approach is to use This section provides a more detailed specification of our probabilistic the next character in the parent ℓ′ as a surrogate for the next character in ℓ. model. See SI Appendix, Section 1.2 for additional content on the algo- Using the context in the parent word is also more aligned to the standard rithm and simulations. representation of sound change used in historical linguistics, where the context is defined on the parent as well. Distributions. The conditional distributions over pairs of evolving strings are More generally, dependencies limited to a bounded context on the parent specified using a lexicalized stochastic string transducer (33). string can be incorporated in our formalism. By bounded, we mean that it Consider a language ℓ′ evolving to ℓ for cognate set c. Assume we have a should be possible to fix an integer k beforehand such that all of the word form x = wcℓ′. The generative process for producing y = wcℓ works as modeled dependencies are within k characters to the string operation. The follows. First, we consider x to be composed of characters x1x2 ...xn, with caveat is that the computational cost of inference grows exponentially in k. the first and last ones being a special boundary symbol x1 = # ∈ Σ, which is We leave open the question of handling computation in the face of un- never deleted, mutated, or created. The process generates y = y1y2 ...yn in n bounded dependencies such as those induced by harmony (35). * chunks yi ∈ Σ ; i ∈ f1; ...; ng, one for each xi . The yi s may be a single char- acter, multiple characters, or even empty. To generate yi ,wedefine a mu- Parameterization. Instead of directly estimating the transition probabilities of tation Markov chain that incrementally adds zero or more characters to an the mutation Markov chain (which could be done, in principle, by taking them initially empty yi . First, we decide whether the current phoneme in the top to be the parameters of a collection of multinomial distributions) we express word t = xi will be deleted, in which case yi = e (the probabilities of the them as the output of a multinomial logistic regression model (36). This

4228 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al. model specifies a distribution over transition probabilities by assigning Using these features and parameter sharing, the logistic regression model weights to a set of features that describe properties of the sound changes defines the transition probabilities of the mutation process and the root SEE COMMENTARY involved. These features provide a more coherent representation of the language model to be transition probabilities, capturing regularities in sound changes that reflect Æλ; ω; ; ; ℓ; ξ æ the underlying linguistic structure. θ = θ ξ; λ = expf fð t p Þ g × μ ω; ; ξ ; ω;t;p;ℓ ω;t;p;ℓð Þ ω; ; ; ℓ; λ ð t Þ [1] We used the following feature templates: OPERATION, which identifies Zð t p Þ whether an operation in the mutation Markov chain is an insertion, a de- letion, a substitution, a self-substitution (i.e., of the form x > y, x = y), or the where ξ ∈ S , f : fS; I; Rg × Σ × Σ × L × S → Rk is the feature function (which end of an insertion event; MARKEDNESS, which consists of language-specific indicates which features apply for each event), Æ · ; · æ denotes inner product, n-gram indicator functions for all symbols in Σ (during reconstruction, only and λ ∈ Rk is a weight vector. Here, k is the dimensionality of the feature space unigram and bigram features are used for computational reasons; for cognate of the logistic regression model. In the terminology of exponential families, Z inference, only unigram features are used); FAITHFULNESS, which consists of and μ are the normalization function and the reference measure, respectively: indicators for mutation events of the form 1 [ x > y ], where x ∈ Σ, y ∈ S . X Feature templates similar to these can be found, for instance, in the work of Zðω; t; p; ℓ; λÞ = exp Æλ; f ω; t; p; ℓ; ξ′ æ refs. 37 and 38, in the context of string-to-string transduction models used in ξ′∈S computational linguistics. This approach to specifying the transition proba- 8 bilities produces an interesting connection to stochastic optimality theory (39, > ω = ; = #; ξ ≠ # <> 0 if S t ω = ; ξ = ζ 40), where a logistic regression model mediates markedness and faithfulness μ ω; ; ξ = 0 if R ð t Þ > ω ≠ ; ξ = # of the production of an output form from an underlying input form. :> 0 if R : : Data sparsity is a significant challenge in protolanguage reconstruction. 1 o w Although the experiments we present here use an order of magnitude more languages than previous computational approaches, the increase in observed Here, μ is used to handle boundary conditions, ensuring that the resulting data also brings with it additional unknowns in the form of intermediate probability distribution is well defined. protolanguages. Because there is one set of parameters for each language, During cognate inference, the innovation Bernoulli random variables νgℓ adding more data is not sufficient to increase the quality of the recon- are similarly parameterized, using a logistic regression model with two kinds struction; it is important to share parameters across different branches in the of features: a global innovation feature κglobal ∈ R and a language-specific tree to benefit from having observations from more languages. We used the feature κℓ ∈ R. The likelihood function for each νgℓ then takes the form following technique to address this problem: We augment the parameteri- zation to include the current language (or language at the bottom of the 1 νgℓ = : [2] current branch) and use a single, global weight vector instead of a set of 1 + exp −κglobal − κℓ branch-specific weights. Generalization across branches is then achieved by using features that ignore ℓ, whereas branch-specific features depend on ℓ. Similarly, all of the features in OPERATION, MARKEDNESS, and ACKNOWLEDGMENTS. This work was supported by Grant IIS-1018733 from FAITHFULNESS have universal and branch-specific versions. the National Science Foundation.

1. Hock HH (1991) Principles of Historical Linguistics (Mouton de Gruyter, The Hague, 21. Lynch J, ed (2003) Issues in Austronesian (Pacific Linguistics, Canberra, Australia). Netherlands). 22. Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word 2. Ross M, Pawley A, Osmond M (1998) The Lexicon of Proto Oceanic: The Culture and lists in four daughter languages. J Quant Linguist 7:233–244. Environment of Ancestral Oceanic Society (Pacific Linguistics, Canberra, Australia). 23. Kondrak G (2002) Algorithms for Language Reconstruction. PhD thesis (Univ of Tor-

3. Diamond J (1999) Guns, Germs, and Steel: The Fates of Human Societies (WW Norton, onto, Toronto). COMPUTER SCIENCES New York). 24. Ellison TM (2007) Bayesian identification of cognates and correspondences. Assoc 4. Nichols J (1999) Archaeology and Language: Correlating Archaeological and Linguistic Comput Linguist 45:15–22. Hypotheses, eds Blench R, Spriggs M (Routledge, London). 25. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum like- 5. Ventris M, Chadwick J (1973) Documents in Mycenaean Greek (Cambridge Univ Press, lihood alignment of DNA sequences. J Mol Evol 33(2):114–124. Cambridge, UK). 26. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data 6. Martinet A (1955) Économie des Changements Phonétiques [Economy of phonetic via the EM algorithm. J R Stat Soc B 39:1–38. sound changes] (Maisonneuve & Larose, Paris). 27. Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel 7. King R (1967) Functional load and sound change. Language 43:831–852. trees. Adv Neural Inf Process Syst 21:177–184. 8. Holmes I, Bruno WJ (2001) Evolutionary HMMs: A Bayesian approach to multiple 28. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian basic vocabulary database: COGNITIVE SCIENCES alignment. Bioinformatics 17(9):803–820. From bioinformatics to lexomics. Evol Bioinform Online 4:271–283. PSYCHOLOGICAL AND 9. Miklós I, Lunter GA, Holmes I (2004) A “Long Indel” model for evolutionary sequence 29. Lewis MP, ed (2009) Ethnologue: Languages of the World (SIL International, Dallas, alignment. Mol Biol Evol 21(3):529–540. TX, 16th Ed). 10. Suchard MA, Redelings BD (2006) BAli-Phy: Simultaneous Bayesian inference of 30. Lyovin A (1997) An Introduction to the Languages of the World (Oxford Univ Press, alignment and phylogeny. Bioinformatics 22(16):2047–2048. Oxford, UK). 11. Liberles DA, ed (2007) Ancestral Sequence Reconstruction (Oxford Univ Press, Ox- 31. Hockett CF (1967) The quantification of functional load. Word 23:320–339. ford, UK). 32. Surendran D, Niyogi P (2006) Competing Models of Linguistic Change. Evolution and 12. Paten B, et al. (2008) Genome-wide nucleotide-level mammalian ancestor recon- Beyond (Benjamins, Amsterdam). struction. Genome Res 18(11):1829–1843. 33. Varadarajan A, Bradley RK, Holmes IH (2008) Tools for simulating evolution of aligned 13. Gray RD, Jordan FM (2000) Language trees support the express-train sequence of genomic regions with integrated parameter estimation. Genome Biol 9(10):R147. Austronesian expansion. Nature 405(6790):1052–1055. 34. Mohri M (2009) Handbook of Weighted Automata, Monographs in Theoretical 14. Ringe D, Warnow T, Taylor A (2002) Indo-European and computational cladistics. Computer Science, eds Droste M, Kuich W, Vogler H (Springer, Berlin). Trans Philol Soc 100:59–129. 35. Hansson GO (2007) On the evolution of consonant harmony: The case of secondary 15. Evans SN, Ringe D, Warnow T (2004) Inference of Divergence Times as a Statistical In- articulation agreement. Phonology 24:77–120. verse Problem, McDonald Institute Monographs, eds Forster P, Renfrew C (McDonald 36. McCullagh P, Nelder JA (1989) Generalized Linear Models (Chapman & Hall, London). Institute, Cambridge, UK). 37. Dreyer M, Smith JR, Eisner J (2008) Latent-variable modeling of string transductions 16. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian with finite-state methods. Empirical Methods on Natural Language Processing 13: theory of Indo-European origin. Nature 426(6965):435–439. 1080–1089. 17. Nakhleh L, Ringe D, Warnow T (2005) Perfect phylogenetic networks: A new meth- 38. Chen SF (2003) Conditional and joint models for grapheme-to-phoneme conversion. odology for reconstructing the evolutionary history of natural languages. Language Eurospeech 8:2033–2036. 81:382–420. 39. Goldwater S, Johnson M (2003) Learning OT constraint rankings using a maximum 18. Bryant D (2006) Phylogenetic Methods and the Prehistory of Languages, eds Forster P, entropy model. Proceedings of the Workshop on Variation Within Optimality Theory Renfrew C (McDonald Institute for Archaeological Research, Cambridge, UK), pp eds Spenader J, Eriksson A, Dahl Ö (Stockholm University, Stockholm) pp 113–122. 111–118. 40. Wilson C (2006) Learning phonology with substantive bias: An experimental and 19. Daumé H III, Campbell L (2007) A Bayesian model for discovering typological im- computational study of velar palatalization. Cogn Sci 30(5):945–982. plications. Assoc Comput Linguist 45:65–72. 41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion 20. Dunn M, Levinson S, Lindstrom E, Reesink G, Terrill A (2008) Structural phylogeny pulses and pauses in Pacific settlement. Science 323(5913):479–483. in historical linguistics: Methodological explorations applied in Island Melanesia. 42. Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian Language 84:710–759. comparative linguistics. Inst Linguist Acad Sinica 1:31–94.

Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4229 Supporting Information: Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Cotˆ e´ David Hall Thomas L. Griffiths Dan Klein

This Supporting Information describes the learning and inference algorithms used by our system (Section 1), details of simulations supporting the accuracy of our system and analyzing the effects of functional load (Section 2), lists of reconstructions (Section 3), lists of sound changes (Section 4), and analysis of frequent errors (Section 5).

1 Learning and inference

The generative model introduced in the Materials and Methods section of the main paper sets us up with two problems to solve: estimating the values of the parameters characterizing the distribution on sound changes on each branch of the tree, and inferring the optimal values of the strings representing words in the unobserved protolanguages. Section 1.1 introduces the full objective function that we need to optimize in order to estimate these quantities. Section 1.2 describes the Monte Carlo Expectation-Maximization algorithm we used for solving the learning problem. We present the algorithm for inferring ancestral word forms in Section 1.4.

1.1 Full objective function The generative model specified in the Materials and Methods section of the main paper defines an objective func- tion that we can optimize in order to find good protolanguage reconstructions. This objective function takes the form of a regularized log-likelihood, combining the probability of the observed languages with additional con- straints intended to deal with data sparsity. This objective function can be written concisely if we let Pλ(·), Pλ(·|·) denote the root and branch probability models described in the Materials and Methods section of the paper (with transition probabilities given by the above logistic regression model), I(c), the set of internal (non-leaf) nodes in τ(c), pa(`), the parent of language `, r(c), the root of τ(c) and W (c) = (Σ∗)|I(c)|. The full objective function is then

C X X Y ||λ||2 + ||κ||2 Li(λ, κ) = log (w ) (w |w , n ) (n |ν ) − 2 2 (1) Pλ c,r(c) Pλ c,` c,pa(`) c` Pκ c` c` 2σ2 c=1 ~w∈W (c) `∈I(c) where the second term is a standard L2 regularization penalty intended to reduce over-fitting due to data sparsity (we used σ2 = 1) [1]. The goal of learning is to find the value of λ, the parameters of the logistic regression model for the transition probabilities, that maximizes this function.

1 1.2 A Monte Carlo Expectation-Maximization algorithm for reconstruction Optimization of the objective function given in Equation 1 is done using a Monte Carlo variant of the Expectation- Maximization (EM) algorithm [2]. This algorithm breaks down into two steps, an E step in which the objective function is approximated and an M step in which this approximate objective function is optimized. The M step is convex and computed using L-BFGS [3] but the E step is intractable [4], in part because it requires solving the problem of inferring the words in the protolanguages. We approximate the solution to this inference problem using a Markov chain Monte Carlo (MCMC) algorithm [5]. This algorithm repeatedly samples words from the protolanguages until it converges on the distribution implied by our generative model. Since this procedure is guaranteed to find a local maximum of the objective, we ran it with several different random initializations of the model parameters. The next two subsections provide the details of these two parts of our system.

1.2.1 E step: Inferring the posterior over string reconstructions In the E step, the inference problem is to compute an expectation under the posterior over strings in a protolanguage given observed word forms at the leaves of the tree.1 The typical approach in biological InDel models [6] is to use Gibbs sampling, where the entire string at each node in the tree is repeatedly resampled, conditioned on its parent and children. We will call this method Single Sequence Resampling (SSR). While conceptually simple, this approach suffers from mixing problems in large trees, since it can take a long time for information to propagate from one region of the tree to another [6]. Consequently, we use a different MCMC procedure, called Ancestry Resampling (AR) that alleviates these mixing problems. This method was originally introduced for biological applications [7], but commonalities between the biological and linguistic cases make it possible to use it in our model. Concretely, the problem with SSR arises when the tree under consideration is large or unbalanced. In this case, it can take a long time for information from the observed languages to propagate to the root of the tree. Indeed, samples at the root will initially be independent of the observations. AR addresses this problem by resampling one thin vertical slice of all sequences at a time, called an ancestry (for the precise definition of the algorithm, see [7]). Slices condition on observed data, avoiding mixing problems, and can propagate information rapidly across the tree. We ran the ancestry resampling algorithm for a number of iterations that increased linearly with the number of iterations of the EM algorithm that had been completed, resulting in an approximation regime that could allow the EM algorithm to converge to a solution [8]. To speed-up the large experiments, we also used an approximation in AR. This approximation is based on fixing the value of a set-valued auxiliary variables zg, where zg = {wg,`}, and ` ranges over the set of all languages (both internal and at the leaves). Conditioning on these variables, sampling wg,`|zg can be done exactly using dynamic programming and rejection sampling.

1.2.2 M step: Convex optimization of the approximate objective In the M step, we individually update the parameters λ and κ as specified in Equation 1. We show how λ is updated in this section, κ can be optimized similarly. Let C = (ω, t, p, `) denote local transducer contexts from the space C = {S, I, R} × Σ × Σ × L of all such contexts. Let N(C, ξ) be the expected number of times the transition ξ was used in context C in the preceding E-step. Given these sufficient statistics, the estimate of λ is given by optimizing the expected complete (regularized) log-likelihood O(λ) derived from the original objective function given Equation [1] in the Materials and Methods section of the main paper (ignoring terms that do not involve λ),

X X h X i ||λ||2 O(λ) = N(C, ξ) hλ, f(C, ξ)i − log exp{hλ, f(C, ξ0)i} − . 2σ2 C∈C ξ∈S ξ0

1To be precise: the posterior is over both protolanguage strings and the derivations between these strings and the modern words.

2 We use L-BFGS [3] to optimize this convex objective function. L-BFGS requires the partial derivatives

∂O(λ) X X h X 0 0 i λj = N(C, ξ) fj(C, ξ) − θC(ξ ; λ)fj(C, ξ ) − 2 ∂λj σ C∈C ξ∈S ξ0 X X λj = Fˆ − N(C, ·)θ (ξ0; λ)f (C, ξ0) − , j C j σ2 C∈C ξ∈S ˆ P P P where Fj = C∈C ξ∈S N(C, ξ)fj(C, ξ) is the empirical feature vector and N(C, ·) = ξ N(C, ξ) is the number of times context C was used. Fˆj and N(C, ·) do not depend on λ and thus can be precomputed at the beginning of the M-step, thereby speeding up each L-BFGS iteration.

1.3 Approximate Expectation-Maximization for cognate inference The procedure for cognate inference is similar: we again operate within the Expectation-Maximization framework, and the M-steps are identical. However, because of different characteristics of the cognate inference problem, we make a few different choices for approximations for the E-step. The Monte Carlo inference algorithm described in the preceding section works exceedingly well for reconstructing words for known cognates. However, a Monte Carlo approach to determining which words are cognate requires resampling the innovation variables ng`, which is likely to lead to slow mixing of the Markov chain. We therefore take a different approach when doing cognate inference. Here, we restrict our system to not perform inference over all possible reconstructions for all words, but to only use words that correspond to some observed modern word with the same meaning. The simplification is of course false, but it works well in practice. Specifically, we perform inference on the tree for each gloss using message passing [9], also known as pruning in the computational biology literature [10], where each message µ(wg`) has a non-zero score only when wg` is one of the observed modern word forms in gloss g. The result of inference yields expected alignments counts for each character in each language along with the expected number of innovations that occur at each language. These expectations can then be used in the convex M-step.

1.4 Ancestral word form reconstruction In the E step described in the preceding section, a posterior distribution π over ancestral sequences given observed forms is approximated by a collection of samples X1,X2,...XS. In this section, we describe how this distribution is summarized to produce a single output string for each cognate set. This algorithm is based on a fundamental Bayesian decision theoretic concept: Bayes estimators. Given a loss function over strings Loss : Σ∗ × Σ∗ → [0, ∞), an estimator is a Bayes estimator if it belongs to the random set:

π X argmin E Loss(x, X) = argmin Loss(x, y)π(y). x∈Σ∗ x∈Σ∗ y∈Σ∗ Bayes estimators are not only optimal within the Bayesian decision framework, but also satisfy frequentist opti- mality criteria such as admissibility [11]. In our case, the loss we used is the Levenshtein [12] distance, denoted Loss(x, y) = Lev(x, y) (we discuss this choice in more detail in Section 2). Since we do not have access to π, but rather to an approximation based on S samples, the objective function we use for reconstruction rewrites as follows: S X 1 X argmin Lev(x, y)π(y) ≈ argmin Lev(x, Xs) x∈Σ∗ x∈Σ∗ S y∈Σ∗ s=1 S X = argmin Lev(x, Xs). ∗ x∈Σ s=1

3 The raw samples contain both derivations and strings for all protolanguages, whereas we are only interested in reconstructing words in a single protolanguage. This is addressed by marginalization, which is done in sampling representations by simply discarding the irrelevant information. Hence, the random variables Xs in the above equation can be viewed as being string-valued random variables. Note that the optimum is not changed if we restrict the minimization to be taken on x ∈ Σ∗ such that m ≤ |x| ≤ M where m = mins |Xs|,M = maxs |Xs|. However, even with this simplification, optimization is intractable. As an approximation, we considered only strings built by at most k contiguous substrings taken from the word forms in X1,X2,...,XS. If k = 1, then it is equivalent to taking the min over {Xs : 1 ≤ s ≤ S}. At the other end of the spectrum, if k = S, it is exact. This scheme is exponential in k, but since words are relatively short, we found that k = 2 often finds the same solution as higher values of k.

1.5 Finding Cognate Groups Our cognate model finds cuts to the phylogeny in order to determine which words are cognate with one another. However, this approach cannot straightforwardly handle instances where the evolution of the words did not follow strictly treelike behavior. This limitation applies to polymorphisms—where multiple words for a given meaning are available in a language—and also for borrowing—where the words did not evolve according to the phylogeny. However, we can modify the inference procedure to capture these kinds of behaviors using a post hoc agglom- erative “merge” procedure. Specifically, we can run our procedure to find an initial set of cognate groups, and then merge those cognate groups that produce an increase in model score. That is, we create several initial small subtrees containing some cognates, and then stitch them together into one or more larger trees. Thus, non-treelike behaviors like borrowing will be represented as multiple trees that “overlap.” For instance, if two languages each have two words for one meaning (say A and B), then the initial stage might find that the two A’s are cognate, leaving the two B’s as singleton cognate groups. However, merging these two words into a single group will likely produce a gain in likelihood, and so we can merge them. Note this procedure is unlikely to work well for long distance borrowings, as the sound changes involved in long distance borrowing are likely to be very different from those according to the phylogeny. Nevertheless, we found this procedure to be effective in practice.

2 Experiments

In this section, we give more details on the results in the main paper, namely those concerning validation using reconstruction error rate, and those measuring the effect of the tree topology and the number of languages. We also include additional comparisons to other reconstruction methods, as well as cognate inference results. In Section 2.1, we analyze in isolation the effects of varying the set of features, the number of observed languages, the topology, and the number of iterations of EM. In Section 2.2 we compare performance to an oracle and to two other systems. Evaluation of all methods was done by computing the Levenshtein distance [12] (uniform-cost edit distance) between the reconstruction produced by each method and the reconstruction produced by linguists. The Leven- shtein distance is the minimum number of substitutions, insertions, or deletions of a phoneme required to transform one word to another. While the Levenshtein distance misses important aspects of phonology (all phoneme substi- tutions are not equal, for instance), it is parameter-free and still correlates to a large extent with linguistic quality of reconstruction. It is also superior to held-out log-likelihood, which fails to penalize errors in the modeling assumptions, and to measuring the percentage of perfect reconstructions, which ignores the degree of correctness of each reconstructed word. We averaged this distance across reconstructed words to report a single number for each method. The statistical significance of all performance differences are assessed using a paired t-test with significance level of 0.05.

4 2.1 Evaluating system performance We used the Austronesian Basic Vocabulary Database (ABVD) [13] as the basis for a series of experiments used to evaluate the performance of our system and the factors relevant to its success. The database, downloaded from http://language.psy.auckland.ac.nz/austronesian/ on August 7, 2010, includes partial cognacy judgments and IPA transcriptions,2 as well as a several reconstructed protolanguages. In our main experiments, we used the tree topology induced by the Ethnologue classification [14] in order to facilitate interpretability of the results (i.e. so that we can give well-known names to clades in the tree in our analyses). Our method does not require specifying branch lengths since our unsupervised learning procedure provides a more flexible way of estimating the amount of change between points of the phylogenetic tree. The first claim we verified experimentally is that having more observed languages aids reconstruction of pro- tolanguages. In the results in this section, we used the subset of the languages under Proto-Oceanic (POc) to speed-up the computations. To test this hypothesis we added observed modern languages in increasing order of distance dc to the target reconstruction of POc so that the languages that are most useful for POc reconstruction are added first. This prevents the effects of adding a close language after several distant ones being confused with an improvement produced by increasing the number of languages. The results are reported in Figure 1(c) of the main paper. They confirm that large-scale inference is desirable for automatic protolanguage reconstruction: reconstruction improved statistically significantly with each increase except from 32 to 64 languages, where the average edit distance improvement was 0.05. We then conducted a number of experiments intended to identify the contribution made by different factors it incorporates. We found that all of the following ablations significantly hurt reconstruction: using a flat tree (in which all languages are equidistant from the reconstructed root and from each other) instead of the consensus tree, dropping the markedness features, dropping the faithfulness features, and disabling sharing across branches. The results of these experiments are shown in Table S.1. For comparison, we also included in the same table the performance of a semi-supervised system trained by K-fold validation. The system was run K = 5 times, with 1 − K−1 of the POc words given to the system as observations in the graphical model for each run. It is semi-supervised in the sense that target reconstructions for many internal nodes are not available in the dataset, so they are still not filled.3

2.2 Comparisons against other methods

The first competing method, PRAGUE was introduced in [15]. In this method, the word forms in a given protolan- guage are reconstructed using a Viterbi multi-alignment between a small number of its descendant languages. The alignment is computed using hand-set parameters. Deterministic rules characterizing changes between pairs of observed languages are extracted from the alignment when their frequency is higher than a threshold, and a proto- phoneme inventory is built using linguistically motivated rules and parsimony. A reconstruction of each observed word is first proposed independently for each language. If at least two reconstructions agree, a majority vote is taken, otherwise no reconstruction is proposed. This approach has several limitations. First, it is not tractable for larger trees, since the time complexity of their multi-alignment algorithm grows exponentially in the number of languages. Second, deterministic rules, while elegant in theory, are not robust to noise: even in experiments with only four daughter languages, a large fraction of the words could not be reconstructed. Since PRAGUE does not scale to large datasets, we also built a second, more tractable baseline. This new baseline system, CENTROID, computes the centroid of the observed word forms in Levenshtein distance. Let

2While most word forms in ABVD are encoded using IPA, there are a few exceptions, for example specific conventions such as the usage of the okina symbol (’) for glottal stops (P) in some or ad hoc conventions such as using the bigram ‘ng’ for the agma symbol (N). We have preprocessed as many of these exceptions as possible, and probabilist methods are generally robust to reasonable amounts of encoding glitches. 3We also tried a fully supervised system where a flat topology is used so that all of these latent internal nodes are avoided; but it did not perform as well—this is consistent with the -Topology experiment of Table S.1.

5 Lev(x, y) denote the Levenshtein distance between word forms x and y. Ideally, we would like the baseline system to return: X argmin Lev(x, y), x∈Σ∗ y∈O where O = {y1, . . . , y|O|} is the set of observed word forms. This objective function is motivated by Bayesian decision theory [11], and shares similarity to the more sophisticated Bayes estimator described in Section 2. How- ever it replaces the samples obtained by MCMC sampling by the set of observed words. Similarly to the algorithm of Section 2, we also restrict the minimization to be taken on x ∈ Σ(O)∗ such that m ≤ |x| ≤ M and to strings built by at most k contiguous substrings taken from the word forms in O, where m = mini |yi|,M = maxi |yi| and Σ(O) is the set of characters occurring in O. Again, we found that k = 2 often finds the same solution as higher values of k. The difference was in all the cases not statistically significant, so we report the approximation k = 2 in what follows. We also compared against an oracle, denoted ORACLE, which returns

argmin Lev(y, x∗), y∈O where x∗ is the target reconstruction. We will denote it by ORACLE. This is superior to picking a single closest language to be used for all word forms, but it is possible for systems to perform better than the oracle since it has to return one of the observed word forms. Of course, this scheme is only available to assess system performance on held-out data: It cannot make new predictions. We performed the comparison against a previous system proposed in [15] on the same dataset and experimental conditions as used in [15]. The PMJ dataset was compiled by [16], who also reconstructed the corresponding protolanguage. Since PRAGUE is not guaranteed to return a reconstruction for each cognate set, only 55 word forms could be directly compared to our system. We restricted comparison to this subset of the data. This favors PRAGUE since the system only proposes a reconstruction when it is certain. Still, our system outperformed PRAGUE, with an average distance of 1.60 compared to 2.02 for PRAGUE. The difference is marginally significant, p = 0.06, partly due to the small number of word forms involved. To get a more extensive comparison, we considered the hybrid system that returns PRAGUE’s reconstruction when possible and otherwise back off to the Sundanese (Snd.) modern form, then Madurese (Mad.), Malay (Mal.) and finally Javanic (Jv.) (the optimal back-off order). In this case, we obtained an edit distance of 1.86 using our system against 2.33 for PRAGUE, a statistically significant difference. We also compared against ORACLE and CENTROID in a large-scale setting. Specifically, we compare to the experimental setup on 64 modern languages used to reconstruct POc described before. Encouragingly, while the system’s average distance (1.49) does not attain that of the ORACLE (1.13), we significantly outperform the CENTROID baseline (1.79).

2.3 Cognate recovery To test the effectiveness of our cognate recovery system, we ran our system on all of the Oceanic languages in the ABVD, which comprises roughly half of the Austronesian languages. We then evaluated the pairwise precision, recall, F1, and purity scores, defined as follows. Let G = {G1,G2,...,Gg} denote the known partitions of the forms into cognates, and let F = {F1,F2,...,Ff } denote the inferred partitions. Let pairs(F ) denote the set of unordered pairs of indices in the same partition in F : pairs(F ) = {{i, j} : ∃k s.t. i, j ∈ Fk, i 6= j}, and similarly

6 for pairs(G). The metrics are defined as follows:

|pairs(G) ∩ pairs(F )| precision = , |pairs(F )| |pairs(G) ∩ pairs(F )| recall = , |pairs(G)| precision · recall F1 = 2 , precision + recall 1 X purity = max |Gg ∩ Ff |. N g f

Using these metrics, we found that our system achieved a precision of 84.4, recall of 62.1, F1 of 71.5, and cluster purity of 91.8. Thus, over 9 out of 10 words are correctly grouped, and our system errs on the side of under- grouping words rather than clustering words that are not cognates. Since the null hypothesis in historical linguistics is to deem words to be unrelated unless proven otherwise, a slight under-grouping is the desired behavior. Since we are ultimately interested in reconstruction, we then compared our reconstruction system’s ability to reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found by our system (run on the Oceanic subclade) with at least two words in it. That is, we excluded words that our system found to be isolates. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words using our system (using the auxiliary variables z described in Section 1.2.1). For evaluation, we then looked at the average edit distance from our reconstructions to the known reconstruc- tions described in the previous sections. This time, however, we average per modern word rather than per cognate group, to provide a fairer comparison. (Results were not substantially different averaging per cognate group.) Using known cognates from the ABVD, there was an average reconstruction error of 2.19, versus 2.47 for the automatically reconstructed cognates, or an increase in error rate of 12.8%. The fraction of words with each Leven- shtein distance for these reconstructions is shown in Figure S.1. While the plots are similar, the automatic cognates exhibit a slightly longer tail. Thus, even with automatic cognates, the reconstruction system can reconstruct words faithfully in many cases, only failing in a few instances.

2.4 Computation of functional loads To measure functional load quantitatively, we used the same estimator as the one used in [17]. This definition is based on associating a context vector cx,` to each phoneme and language. For a given language with N(`) phoneme tokens, these context vectors are defined as follows: first, fix an enumeration order of all the contexts found in the corpus, where the context is defined as a pair of phonemes, one at the left and one at the right of a position. Element i in this enumeration will correspond to component i of the context vectors. Then, the value of component i in context vector cx,` is set to be the number of time phoneme x occurred in context i and language l. Finally, King’s definition of functional load FL`(x, y) is the dot product of the two induced context vectors:

1 1 X FL (x, y) = hc , c i = c (i) × c (i), ` N(`)2 x,` y,` N(`)2 x,` y,` i where the denominator is simply a normalization that insures FL`(x, y) ≤ 1. Note that if x and y are in comple- mentary distribution in language `, then the two vectors cx,` and cy,` are orthogonal. The functional load is indeed zero in this case. In Figure 3 of the main paper, we show heat maps where the color encodes the log of the number of sound changes that fall into a given 2-dimensional bin. Each sound change x > y is encoded as pair of numbers in the unit interval, (ˆl, mˆ ), where ˆl is an estimate of the functional load of the pair and mˆ is the posterior fraction of the instances of the phoneme x that undergo a change to y. We now describe how ˆl, mˆ were estimated. The posterior

7 fraction mˆ for the merger x > y between languages pa(`) → ` is easily computed from the same expected sufficient statistics used for parameter estimation: P p∈Σ N(S, x, p, `, y) mˆ `(x > y) = P P 0 0 . p0∈Σ y0∈Σ N(S, x, p , `, y )

The estimate of the functional load requires additional statistics, i.e. the expected context vectors cˆx,` and expected phoneme token counts Nˆ(`), but these can be readily extracted from the output of the MCMC sampler. The estimate is then:

ˆ 1 l`(x, y) = hcˆx,`, cˆy,`i. Nˆ(`)2

Finally, the set of points used to construct the heat map is:

nˆ  o lpa(`)(x, y), mˆ `(x > y) : ` ∈ L − {root}, x ∈ Σ, y ∈ Σ, x 6= y .

3 Reconstruction lists

In Table S.2–S.4, we show the lists of consensus reconstructions produced by our system (the ‘Automatic’ column). For comparison, we also include a baseline (randomly picking one modern word), and the edit distances (when it is greater than zero). In Proto-Oceanic, since two manual reconstructions are available, we include the distances of the automatic reconstruction to both manual reconstructions (‘P-A’ and ‘B-A’) and the distance between the two manual reconstructions (‘B-P’).

4 Sound changes

In Figures S.2–S.5, we zoom and rotate each quadrant of the tree shown in Figure 2 of the main paper. For information on the most frequent change of each branch, refer to the row in Table S.5 corresponding to the code in parenthesis attached to each branch. By most frequent, we mean the change with the highest expected count in the last EM iteration, collapsing contexts for simplicity. In cases where no change is observed with an expected count of more than 1.0, we skip the corresponding entry—this can be caused for example by languages where cognacy information was too sparse in the dataset. The functional load, normalized to be in the interval [0, 1] is also shown for reference.

5 Frequent errors

In Figure S.6, we analyze the frequent discrepancy between the PAn reconstruction from our system with those from [18]. The most frequent problematic substitutions, insertions, and deletions are shown. These frequencies were obtained by aligning, after running the system, the reconstructions with the references. The aligner is a pair- HMM trained via EM, using only phoneme identity and gap information [19]. The frequencies were extracted from the posterior alignments.

References and Notes

[1] Hastie T, Tibshiranim R, Friedman J (2009) The Elements of Statistical Learning (Springer). [2] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39:1–38.

8 [3] Liu DC, Nocedal J, Dong C (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming 45:503–528. [4] Lunter GA, Miklos´ I, Song YS, Hein J (2003) An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comp. Biol. 10:869–889. [5] Tierney L (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22:1701–1728. [6] Holmes I, Bruno WJ (2001) Evolutionary HMM: a Bayesian approach to multiple alignment. Bioinformatics 17:803–820. [7] Bouchard-Cotˆ e´ A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Advances in Neural Information Processing Systems 21:177–184. [8] Caffo B, Jank W, Jones G (2005) Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society - Series B 67:235–252. [9] Judea P (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. (Morgan Kaufmann). [10] Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368– 376. [11] Robert CP (2001) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer). [12] Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10. [13] Greenhill S, Blust R, Gray R (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4:271–283. [14] Lewis MP, ed (2009) Ethnologue: Languages of the World, Sixteenth edition. (SIL International). [15] Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quanti- tative Linguistics 7:233–244. [16] Nothofer B (1975) The reconstruction of Proto-Malayo-Javanic (M. Nijhoff). [17] King R (1967) Functional load and sound change. Language 43:831–852. [18] Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst. Linguist. Acad. Sinica 1:31–94. [19] Berg-Kirkpatrick T, Bouchard-Cotˆ e´ A, DeNero J, Klein D (2010) Painless unsupervised learning with features. Proceedings of the North American Conference on Computational Linguistics pp 582–590.

Condition Edit dist. Unsupervised full system 1.87 -FAITHFULNESS 2.02 -MARKEDNESS 2.18 -Sharing 1.99 -Topology 2.06 Semi-supervised system 1.75

Table S.1: Effects of ablation of various aspects of our unsupervised system on mean edit distance to POc. -Sharing corresponds to the restriction to the subset of the features in OPERATION, FAITHFULNESS and MARKEDNESS that are branch-specific, -Topology corresponds to using a flat topology where the only edges in the tree connect modern languages to POc. The semi-supervised system is described in the text. All differences (compared to the unsupervised full system) are statistically significant.

9 Table S.2. Proto-Austronesian reconstructions Gloss Reference Baseline Distance Automatic Distance tohold(1) *gemgem *higem 3 *gemgem smoke(1) *qebel *q1v1l 3 *qebel toscratch(1) *ka Raw *kagaw 1 *ka Raw and(2) *mah *ma 1 *ma 1 leg/foot(1) *qaqay *ai 4 *qaqay shoulder(1) *qaba Ra *vala 4 *qaba Ra woman/female(1) *bahi *babinay 4 *vavaian 5 left(1) *kawi Ri *kayli 3 *kawi Ri day(1) *qalejaw *andew 5 *qalejaw mother(1) *tina *qina 1 *tina tosuck(1) *sepsep *sipsip 2 *sepsep small(2) *kedi *kedhi 1 *kedi night(1) *be RNi *beyi 2 *be RNi tosqueeze(1) *pe Req *perah 3 *pereq 1 tohit(1) *palu *mipalo 3 *palu rat(1) *labaw *lapo 3 *kulavaw 3 you(1) *ikamu *kamo 2 *kamu 1 toplant(1) *mula *h1mula 2 *mula toswell(1) *ba Req *ba Req *ba Req tosee(1) *kita *kita *kita one(1) *isa *sa 1 *isa tosleep(1) *tudu R *maturug 4 *tudu R dog(1) *wasu *kahu 2 *vatu 2 topound,beat(20) *tutuh *nutu 2 *tutu 1 stone(1) *batu *batu *batu green(1) *mataq *mataq *mataq father(1) *tama *qama 1 *tama this(1) *ini *eni 1 *ani 1 tooth(1) *nipen *lipon 2 *nipen tochoose(1) *piliq *piliP 1 *piliq star(1) *bituqen *bitun 2 *bituqen tobuy(23) *baliw *taiw 2 *taiw 2 tovomit(1) *utaq *mutjaq 2 *utaq towork(1) *qumah *quma 1 *quma 1 wide(1) *malawas *Gawig 5 *malabe R 3 tocut,hack(1) *ta Raq *ta Raq *ta Raq tofear(1) *matakut *takuP 3 *matakut tolive,bealive(1) *maqudip *maPuNi 3 *maqudip thunder(3) *de RuN *zuN 3 *deruN 1 tofly(2) *layap *layap *layap toshoot(1) *panaq *fanak 2 *panaq name(1) *Najan *ngaza 4 *Najan tobuy(1) *beli *poli 2 *beli and(1) *ka *kae 1 *ka when?(1) *ijan *pirang 3 *pijan 1 todig(1) *kalih *kali 1 *kali 1 ash(1) *qabu *abu 1 *qabu big(1) *ma Raya *ma Raya *ma Raya tocut,hack(3) *tektek *tutek 2 *tektek road/path(1) *zalan *dalan 1 *zalan tostand(1) *di Ri *diri 1 *di Ri sand(1) *qenay *one 4 *qenay continued on next page

10 continued from previous page Gloss Reference Baseline Distance Automatic Distance meat/flesh(31) *isi *ici 1 *isi what?(2) *nanu *anu 1 *anu 1 toswell(26) * Ribawa *malifawa 4 *abeh 5 bird(2) *qayam *ayam 1 *qayam hand(1) *lima *lime 1 *lima in,inside(1) *idalem *dalale 4 *idalem if(2) *nu *no 1 *nu toknow,beknowledgeable(2) *bajaq *mafanaP 5 *mafanaP 5 at(20) *di *ri 1 *di wind(2) *bali *feli 2 *beliu 2 new(1) *mabaqe Ru *baPru 5 *vaquan 6 blood(1) *da Raq *da Raq *da Raq breast(1) *susu *soso 2 *susu i(1) *iaku *ako 2 *iaku salt(1) *qasi Ra *sie 4 *qasi Ra toflow(1) *qalu R *ilir 4 *qalu R five(1) *lima *lima *lima at(1) *i *i *i other(1) *duma *duma *duma leaf(2) *bi Raq *bela 3 *bela 3 all(1) *amin *kemon 3 *amin rotten(1) *mabu Raq *mavuk 4 *mabu Ruk 2 tocook(1) *tanek *tan@k 1 *tanek head(1) *qulu *PuGuh 3 *qulu mouth(2) *Nusu *Nutu 1 *Nuju 1 house(1) * Rumaq *uma 2 * Rumaq if(1) *ka *ke 1 *ka neck(1) *liqe R *liq1g 2 *liqe R needle(1) *za Rum *dag1m 3 *za Rum he/she(1) *siia *ia 2 *siia fruit(1) *buaq *buaq *buaq back(1) *likud *likudeP 2 *likud tochew(2) *qelqel *qmelqel 1 *qmelqel 1 salt(2) *timus *timus *timus we(2) *kami *sikami 2 *kami long(1) *inaduq *nandu 3 *anaduq 1 we(1) *ikita *itam 3 *kita 1 three(1) *telu *t1lu 1 *telu lake(1) *danaw *ranu 3 *danaw toeat(1) *kaen *kman 2 *kman 2 no,not(3) *ini *ini *ini where?(1) *inu *sadinno 5 *ainu 1 how?(1) *kuja *gagua 4 *kua 1 tothink(34) *nemnem *n1mn1m 2 *kinemnem 2 who?(2) *siima *cima 2 *tima 2 tobite(1) *ka Rat *kagat 1 *ka Rat tail(1) *iku R *ikog 2 *iku R

11 Table S.3. Proto-Oceanic reconstructions Reconstructions Pairwise distances Gloss Blust (B) Pawley (P) Automatic (A) B-P P-A B-A fish(1) *ikan *ikan *ikan five(1) *lima *lima *lima what?(1) *sapa *saa *sava 1 1 1 meat/flesh(1) *pisiko *pisako *kiko 1 3 3 star(1) *pituqun *pituqon *vetuqu 1 4 3 fog(1) *kaput *kaput *kabu 2 2 toscratch(44) *karu *kadru *kadru 1 1 shoulder(1) *pa Ra *qapa Ra *vara 2 4 2 where?(3) *pai *pea *vea 2 1 3 toclimb(2) *sake *sake *cake 1 1 toeat(1) *kani *kani *kani two(1) *rua *rua *rua dry(11) *maca *masa *mamasa 1 2 3 narrow(1) *kopit *kopit *kapi 2 2 todig(1) *keli *keli *keli bone(2) *su Ri *su Ri *sui 1 1 stone(1) *patu *patu *patu left(1) *mawi Ri *mawi Ri *mawii 1 1 they(1) *ira *ira *sira 1 1 toliedown(1) *qinop *qeno *eno 2 1 3 tohide(1) *puni *puni *vuni 1 1 rope(1) *tali *tali *tali smoke(2) *qasu *qasu *qasu when?(1) *Naican *Naijan *Nisa 1 3 3 we(2) *kamami *kami *kami 2 2 this(1) *ne *ani *eni 2 1 2 egg(1) *qatolu R *katolu R *tolu 1 3 3 stick/wood(1) *kayu *kayu *kai 2 2 tosit(16) *nopo *nopo *nofo 1 1 toshoot(1) *panaq *pana *pana 1 1 liver(1) *qate *qate *qate needle(1) *sa Rum *sa Rum *sau 2 2 feather(1) *pulu *pulu *vulu 1 1 topound,beat(2) *tutuk *tuki *tutuk 3 3 near(9) *tata *tata *tata heavy(1) *mamat *mapat *mamava 1 3 2 year(1) *taqun *taqun *taqu 1 1 old(1) *matuqa *matuqa *matuqa fire(1) *api *api *avi 1 1 tochoose(1) *piliq *piliq *vili 2 2 rain(1) *qusan *qusan *usa 2 2 togrow(1) *tubuq *tubuq *tubu 1 1 tosee(1) *kita *kita *kita tohear(1) *roNo R *roNo R *roNo 1 1 tochew(1) *mamaq *mamaq *mama 1 1 louse(1) *kutu *kutu *kutu wind(1) *aNin *mataNi *mataNi 4 4 bad,evil(1) *saqat *saqat *saqa 1 1 hand(1) *lima *lima *lima toflow(1) *tape *tape *tave 1 1 night(1) *boNi *boNi *boNi continued on next page

12 continued from previous page Reconstructions Pairwise distances Gloss Blust (B) Pawley (P) Automatic (A) B-P P-A B-A day(5) *qaco *qaco *qaso 1 1 tospit(14) *qanusi *qanusi *aNusu 3 3 person/humanbeing(1) *taumataq *tamwata *tamata 3 1 2 tovomit(1) *mumutaq *mumuta *muta 1 2 3 name(1) *Najan *qajan *qasa 1 2 3 snake(12) *mwata *mwata *mwata man/male(1) *mwa Ruqane *taumwaqane *mwane 5 5 4 tobreathe(1) *manawa *manawa *manawa far(1) *sauq *sauq *sau 1 1 tobuy(1) *poli *poli *voli 1 1 tovomit(8) *luaq *luaq *lua 1 1 tocook(9) *tunu *tunu *tunu thick(3) *matolu *matolu *matolu leg/foot(1) *waqe *waqe *waqe tobite(1) *ka Rat *ka Rati *karat 1 2 1 leaf(1) *raun *rau *dau 1 1 2 sky(1) *laNit *laNit *laNi 1 1 todrink(1) *inum *inum *inum tostand(2) *tuqur *taqur *tuqu 1 2 1 i(1) *au *au *yau 1 1 warm(1) *mapanas *mapanas *mavana 2 2 moon(1) *pulan *pulan *vula 2 2 how?(1) *kua *kuya *kua 1 1 three(1) *tolu *tolu *tolu toplant(2) *tanum *tanom *tan@m 1 1 1 mosquito(1) *namuk *namuk *namu 1 1 bird(1) *manuk *manuk *manu 1 1 four(1) *pani *pat *vati 2 2 2 water(2) *wai R *wai R *wai 1 1 one(1) *sakai *tasa *sa 3 2 3 skin(1) *kulit *kulit *kulit toyawn(1) *mawap *mawap *mawa 1 1 he/she(1) *ia *ia *ia nose(1) *isuN *ijuN *isu 1 2 1 thatch/roof(1) *qatop *qatop *qato 1 1 towalk(2) *pano *pano *vano 1 1 flower(1) *puNa *puNa *vuNa 1 1 dust(1) *qapuk *qapuk *avu 3 3 neck(18) * Ruqa * Ruqa *ua 2 2 eye(1) *mata *mata *mata father(1) *tama *tamana *tama 2 2 tofear(1) *matakut *matakut *matakut root(2) *waka Ra *waka R *waka 1 1 2 tostab,pierce(8) *soka *soka *soka breast(1) *susu *susu *susu tolive,bealive(1) *maqurip *maqurip *maquri 1 1 head(1) *qulu *qulu *qulu thou(1) *ko *iko *kou 1 2 1 fruit(1) *puaq *puaq *vua 2 2

13 Table S.4. Proto-Polynesian reconstructions Gloss Reference Baseline Distance Automatic Distance rope(9) *taula *taura 1 *taula short(11) *pukupuku *puPupuPu 2 *pukupuku strikewithfist *moto *moko 1 *moto coral *puNa *puNa *puNa cook *tafu *tahu 1 *tafu painful,sick(1) *masaki *mahaki 1 *masaki downwards *hifo *ifo 1 *hifo dive *ruku *uku 1 *ruku toface *haNa *haNa *haNa grasp *kapo *Papo 1 *kapo bay *faNa *hana 2 *faNa tail *siku *siPu 1 *siku toblow(6) *pupusi *pupuhi 1 *pupusi channel *awa *ava 1 *awa pandanus *fara *fala 1 *fara urinate *mimi *mimi *mimi navel *pito *pito *pito gall *Pahu *au 2 *Pahu wave *Nalu *Nalu *Nalu sleep *mohe *moe 1 *mohe warm(1) *mafanafana *mafanafana *mafanafana beak *Nutu *Nutu *Nutu tooth *nifo *niho 1 *nifo dance *saka *haka 1 *saka leg *waPe *wae 1 *waPe drown *lemo *lemo *lemo water *wai *vai 1 *wai nose *isu *isu *isu taro *talo *kalo 1 *talo tohide(1) *funi *huna 2 *suna 2 upwards *hake *aPe 2 *hake overripe *pePe *pee 1 *pePe tosit(16) *nofo *noho 1 *nofo red *kula *Pula 1 *kula tochew(1) *mama *mama *mama three(1) *tolu *tolu *tolu tosleep(10) *mohe *moe 1 *mohe nine *hiwa *iwa 1 *hiwa dawn *ata *ata *ata feather(1) *fulu *fulu *fulu canoe *waka *vaPa 2 *waka left(11) *sema *hema 1 *sema ashes *refu *lehu 2 *refu dew *sau *sau *sau small *riki *riki *riki house *fale *hale 1 *fale voice *lePo *leo 1 *lePo octopus *feke *hePe 2 *feke toclimb(2) *kake *aPe 2 *ake 1 sea *tahi *kai 2 *tahi day *Paho *ao 2 *Paho branch *maNa *maNa *maNa continued on next page

14 continued from previous page Gloss Reference Baseline Distance Automatic Distance lovepity *PaloPofa *aroha 5 *PaloPofa flyrun *lele *rere 2 *lele thick(3) *matolutolu *matolutolu *matolutoru 1 tostrippeel *hisi *hihi 1 *hisi sitdwell *nofo *noho 1 *nofo black *kele *Pele 1 *kele torch *rama *ama 1 *rama

Table S.5. Sound changes

Code Parent Child Functional load Number of occurrences

(1) v (ProtoOcean) > p (AdmiraltyIslands) 0.09884 5.67800 (2) a (Northern Dumagat) > 2 (Agta) 5.132e-05 110.26986 (3) q (Bisayan) > P (AklanonBis) 0.03289 46.31015 (4) a (Schouten) > @ (Ali) 7.158e-05 5.45255 (5) e (UlatInai) > a (Alune) 1.00000 1.31832 (6) e (SeramStraits) > o (Amahai) 0.05802 6.91108 (7) a (SouthwestNewBritain) > e (Amara) 0.19090 15.52212 (8) a (CentralWestern) > e (AmbaiYapen) 0.21003 5.74372 (9) i (SeramStraits) > a (Ambon) 0.27139 3.04082 (10) a (EastVanuatu) > e (AmbrymSout) 0.09085 14.42495 (11) s (BimaSumba) > h (Anakalang) 0.03743 10.01598 (12) a (Vanuatu) > e (AnejomAnei) 0.08559 14.43646 (13) f (Futunic) > p (Anuta) 0.09282 19.96156 (14) n (Wetar) > N (Aputai) 0.04788 7.10435 (15) t (WestSanto) > r (ArakiSouth) 0.21996 20.80080 (16) Not enough data available for reliable sound change estimates (17) d (NorthPapuanMainlandDEntrecasteaux) > t (AreTaupota) 0.06738 2.92590 (18) O (Southern Malaita) > o (AreareMaas) 0.01151 1.99913 (19) O (Southern Malaita) > o (AreareWaia) 0.01151 1.99971 (20) a (Bibling) > e (Aria) 0.29446 4.99706 (21) Not enough data available for reliable sound change estimates (22) O (SanCristobal) > o (ArosiOneib) 0.00562 1.99971 (23) Not enough data available for reliable sound change estimates (24) b (ProtoCentr) > p (Aru) 0.13711 9.35892 (25) a (RajaAmpat) > E (As) 0.05156 4.74243 (26) o (Utupua) > u (Asumboa) 0.07562 3.70741 (27) a (Central Manobo) > o (AtaTigwa) 0.00291 7.29063 (28) s (Formosan) > h (Atayalic) 0.04098 4.31225 (29) l (West NuclearTimor) > n (Atoni) 0.23175 7.02772 (30) t (Ibanagic) > q (AttaPamplo) 0.47297 7.87084 (31) a (Suauic) > e (Auhelawa) 0.18692 7.98362 (32) N (MalekulaCentral) > n (Avava) 0.10747 15.49566 (33) a (ProtoCentr) > e (Babar) 0.04891 5.61192 (34) O (Choiseul) > o (Babatana) 0.01953 8.94227 (35) O (Choiseul) > o (BabatanaAv) 0.01953 1.99952 (36) O (Choiseul) > o (BabatanaKa) 0.01953 2.99952 (37) Not enough data available for reliable sound change estimates (38) O (Choiseul) > o (BabatanaTu) 0.01953 2.99961 (39) v (Ivatan) > b (Babuyan) 0.01574 16.97554 (40) a (BorneoCoastBajaw) > e (Bajo) 0.04774 21.59109 (41) a (NuclearCordilleran) > 2 (Balangaw) 0.00387 37.89097 (42) N (BaliSasak) > n (Bali) 0.19166 13.91349 (43) e (ProtoMalay) > @ (BaliSasak) 6.064e-04 26.22332 (44) h (BimaSumba) > s (Baliledo) 0.03743 7.64100 (45) p (East CentralMaluku) > f (BandaGeser) 0.05559 3.52479 (46) a (Sulawesi) > o (BanggaiWdi) 0.06991 8.61615 (47) a (Palawano) > o (Banggi) 6.735e-04 7.69053 (48) e (LocalMalay) > a (BanjareseM) 0.10667 36.24790 (49) a (SouthNewIrelandNorthwestSolomonic) > o (Banoni) 0.16106 5.69268 (50) r (Sangiric) > d (Bantik) 0.01606 6.99267 (51) b (Sulawesi) > w (Baree) 0.05030 10.71565 (52) q (ProtoMalay) > P (Barito) 0.04087 15.54090 (53) e (Madak) > o (Barok) 0.04576 1.91240 (54) u (Northern EastFormosan) > o (Basai) 0.00130 5.88492 (55) u (BashiicCentralLuzonNorthernMindoro) > o (Bashiic) 0.02423 63.00884 (56) r (NorthernPhilippine) > y (BashiicCentralLuzonNorthernMindoro) 0.01665 3.54936 (57) N (Palawano) > n (BatakPalaw) 0.04752 4.94491 (58) O (SanCristobal) > o (BauroBaroo) 0.00562 1.99981 (59) Not enough data available for reliable sound change estimates (60) Not enough data available for reliable sound change estimates (61) p (Vitiaz) > f (Bel) 0.01771 1.06548 (62) u (BerawanLowerBaram) > o (Belait) 0.03125 12.34195 (63) f (Futunic) > h (Bellona) 0.01009 15.97130 (64) t (Pangasinic) > s (Benguet) 0.09759 1.35377 continued on next page

15 continued from previous page

Code Parent Child Functional load Number of occurrences

(65) u (BerawanLowerBaram) > o (BerawanLon) 0.03125 13.55252 (66) k (NorthSarawakan) > P (BerawanLowerBaram) 0.15797 10.19426 (67) e (SouthwestNewBritain) > i (Bibling) 0.03719 2.29657 (68) O (RajaAmpat) > o (BigaMisool) 0.03843 5.00642 (69) k (CentralPhilippine) > c (BikolNagaC) 9.303e-05 12.87548 (70) a (Blaan) > O (BilaanKoro) 0.01132 23.88863 (71) N (Blaan) > n (BilaanSara) 0.08320 18.08416 (72) Not enough data available for reliable sound change estimates (73) u (Northern NuclearBel) > o (Bilibil) 0.03687 2.95970 (74) a (ProtoMalay) > 1 (Bilic) 0.00777 19.43024 (75) O (SouthNewIrelandNorthwestSolomonic) > o (Bilur) 0.00242 1.97890 (76) t (BimaSumba) > d (Bima) 0.08028 9.97797 (77) b (ProtoCentr) > w (BimaSumba) 0.08312 11.67021 (78) a (NorthSarawakan) > e (Bintulu) 0.07029 9.74568 (79) 1 (Manobo) > u (Binukid) 0.03862 2.90829 (80) 1 (CentralPhilippine) > u (Bisayan) 0.01651 18.53014 (81) 1 (Bilic) > a (Blaan) 0.08598 17.06499 (82) Not enough data available for reliable sound change estimates (83) Not enough data available for reliable sound change estimates (84) t (GorontaloMongondow) > s (BolaangMon) 0.03159 4.79938 (85) o (TukangbesiBonerate) > O (Bonerate) 0.02521 18.76480 (86) u (Seram) > i (Bonfia) 0.14937 11.37310 (87) u (Bontok) > o (BontocGuin) 0.04335 58.88829 (88) a (BontokKankanay) > 1 (Bontok) 0.10466 3.41602 (89) q (Bontok) > P (BontokGuin) 8.294e-04 49.64130 (90) u (NuclearCordilleran) > o (BontokKankanay) 0.03806 9.39059 (91) u (SuluBorneo) > o (BorneoCoastBajaw) 0.05043 3.53556 (92) v (GelaGuadalcanal) > p (Bughotu) 0.09530 1.99015 (93) a (Bugis) > @ (BugineseSo) 0.00654 17.32980 (94) N (SouthSulawesi) > k (Bugis) 0.10991 2.98167 (95) s (Bughotu) > h (Bugotu) 0.03395 8.55895 (96) e (KayanMurik) > a (Bukat) 0.06056 4.02644 (97) a (SouthHalmahera) > e (Buli) 0.05583 3.50307 (98) o (Vanikoro) > O (Buma) 0.13019 23.27248 (99) a (Sabahan) > o (BunduDusun) 0.05640 1.58162 (100) u (Formosan) > o (Bunun) 2.254e-04 11.98634 (101) u (CentralMaluku) > o (BuruNamrol) 0.01570 19.92112 (102) o (South Bisayan) > u (ButuanTausug) 0.03947 3.20776 (103) u (ButuanTausug) > o (Butuanon) 0.03983 30.32254 (104) r (NorthPapuanMainlandDEntrecasteaux) > l (Bwaidoga) 0.10631 8.92246 (105) a (NewCaledonian) > E (Canala) 0.05974 4.64394 (106) N (ProtoChuuk) > n (Carolinian) 0.04042 23.82083 (107) u (Bisayan) > o (Cebuano) 0.03656 8.68282 (108) l (SouthHalmaheraWestNewGuinea) > r (CenderawasihBay) 0.11907 11.14761 (109) u (EastFormosan) > o (CentralAmi) 4.190e-04 12.82795 (110) u (SouthCentralCordilleran) > o (CentralCordilleran) 0.02941 3.43974 (111) e (ProtoMalay) > @ (CentralEastern) 6.064e-04 32.49321 (112) j (ProtoOcean) > s (CentralEasternOceanic) 0.00410 2.00058 (113) e (BashiicCentralLuzonNorthernMindoro) > i (CentralLuzon) 0.01063 2.14101 (114) R (ProtoCentr) > r (CentralMaluku) 0.02692 12.80932 (115) u (MaselaSouthBabar) > E (CentralMas) 0.04503 4.21108 (116) N (RemoteOceanic) > g (CentralPacific) 0.00869 11.59225 (117) k (Peripheral PapuanTip) > g (CentralPapuan) 0.11515 5.29705 (118) u (MesoPhilippine) > o (CentralPhilippine) 0.01374 38.98219 (119) x (NortheastVanuatuBanksIslands) > k (CentralVanuatu) 0.02455 3.46857 (120) i (CenderawasihBay) > u (CentralWestern) 0.12684 1.99006 (121) u (Bisayan) > o (Central Bisayan) 0.03656 7.64073 (122) l (East Nuclear Polynesian) > r (Central East Nuclear Polynesian) 0.02709 2.35911 (123) a (Manobo) > 1 (Central Manobo) 0.09378 18.50241 (124) a (SantaIsabel) > u (Central SantaIsabel) 0.27278 1.22175 (125) t (Chamic) > k (ChamChru) 0.34269 7.73602 (126) y (Malayic) > i (Chamic) 0.00237 1.73963 (127) b (ProtoMalay) > p (Chamorro) 0.15451 10.03113 (128) 7 (East SantaIsabel) > g (ChekeHolo) 0.03879 10.10942 (129) o (SouthNewIrelandNorthwestSolomonic) > O (Choiseul) 0.00242 8.56855 (130) a (ChamChru) > @ (Chru) 0.02139 31.97181 (131) a (ProtoChuuk) > e (Chuukese) 0.11108 36.12356 (132) a (ProtoChuuk) > e (ChuukeseAK) 0.11108 59.55135 (133) @ (Atayalic) > a (CiuliAtaya) 0.02870 6.00504 (134) e (North Babar) > o (Dai) 0.02409 5.70633 (135) n (North Babar) > l (DaweraDawe) 0.16784 23.27929 (136) N (LandDayak) > n (DayakBakat) 0.26785 7.13031 (137) N (South West Barito) > n (DayakNgaju) 0.16918 29.67016 (138) a (NorthSarawakan) > e (Dayic) 0.07029 5.15706 (139) a (LoyaltyIslands) > e (Dehu) 0.20861 4.46889 (140) g (Bwaidoga) > y (Diodio) 0.08008 12.32717 (141) l (NorthPapuanMainlandDEntrecasteaux) > r (Dobuan) 0.10631 12.30865 (142) p (Southern Malaita) > b (Dorio) 8.555e-04 1.99973 (143) l (Nuclear WestCentralPapuan) > r (Doura) 0.13489 5.63596 (144) u (Northern Dumagat) > o (DumagatCas) 0.01544 18.63984 (145) s (SouthwestMaluku) > h (EastDamar) 0.02479 6.75499 (146) a (CentralPacific) > o (EastFijianPolynesian) 0.23158 1.54042 continued on next page

16 continued from previous page

Code Parent Child Functional load Number of occurrences

(147) c (Formosan) > t (EastFormosan) 0.04224 6.87583 (148) b (MaselaSouthBabar) > v (EastMasela) 0.00417 5.01447 (149) i (BimaSumba) > u (EastSumban) 0.17438 28.36485 (150) b (NortheastVanuatuBanksIslands) > v (EastVanuatu) 0.16115 2.78838 (151) e (Barito) > i (East Barito) 0.02495 4.21938 (152) p (CentralMaluku) > f (East CentralMaluku) 0.09093 7.47879 (153) u (Manus) > o (East Manus) 0.01231 4.11061 (154) g (NewGeorgia) > h (East NewGeorgia) 0.02220 5.62094 (155) a (NuclearTimor) > e (East NuclearTimor) 0.14170 3.51835 (156) l (Nuclear Polynesian) > r (East Nuclear Polynesian) 0.04761 79.14382 (157) O (SantaIsabel) > o (East SantaIsabel) 0.02383 4.65059 (158) @ (CentralEastern) > o (EasternMalayoPolynesian) 0.00303 18.76994 (159) a (RemoteOceanic) > e (EasternOuterIslands) 0.14981 13.43554 (160) a (AdmiraltyIslands) > e (Eastern AdmiraltyIslands) 0.16811 7.58136 (161) a (BandaGeser) > o (ElatKeiBes) 0.05446 15.50349 (162) r (SamoicOutlier) > l (Ellicean) 0.03331 5.08687 (163) a (Futunic) > e (Emae) 0.20786 3.97652 (164) e (SouthwestBabar) > E (Emplawas) 0.18331 11.05576 (165) Not enough data available for reliable sound change estimates (166) i (BimaSumba) > e (EndeLio) 0.04706 5.22061 (167) a (Sumatra) > @ (Enggano) 0.00149 1.61296 (168) Not enough data available for reliable sound change estimates (169) Not enough data available for reliable sound change estimates (170) f (Wetar) > h (Erai) 0.03346 7.93007 (171) a (Vanuatu) > e (Erromanga) 0.08559 16.54778 (172) O (SanCristobal) > o (Fagani) 0.00562 1.99961 (173) O (SanCristobal) > o (FaganiAguf) 0.00562 1.99952 (174) O (SanCristobal) > o (FaganiRihu) 0.00562 1.99690 (175) Not enough data available for reliable sound change estimates (176) u (WesternPlains) > o (Favorlang) 0.01234 20.09559 (177) s (EastFijianPolynesian) > c (FijianBau) 0.04333 9.93879 (178) f (Timor) > w (FloresLembata) 0.01025 7.38989 (179) l (ProtoAustr) > r (Formosan) 0.01978 16.25315 (180) r (Futunic) > l (FutunaAniw) 0.05835 2.94909 (181) r (Futunic) > l (FutunaEast) 0.05835 48.57376 (182) l (SamoicOutlier) > r (Futunic) 0.03331 74.90293 (183) l (WestCentralPapuan) > r (Gabadi) 0.13055 5.24097 (184) u (Ibanagic) > o (Gaddang) 0.01362 18.75014 (185) e (Are) > i (Gapapaiwa) 0.06186 4.23744 (186) a (BimaSumba) > o (GauraNggau) 0.13543 29.26141 (187) a (ProtoMalay) > @ (Gayo) 0.00259 24.49089 (188) r (Northern NuclearBel) > z (Gedaged) 0.02021 7.64916 (189) c (GelaGuadalcanal) > s (Gela) 0.01050 2.08324 (190) k (SoutheastSolomonic) > g (GelaGuadalcanal) 0.01749 19.76693 (191) b (GeserGorom) > w (Geser) 0.06536 3.91631 (192) f (BandaGeser) > w (GeserGorom) 0.05946 4.12917 (193) Not enough data available for reliable sound change estimates (194) g (Guadalcanal) > G (Ghari) 4.995e-04 9.42542 (195) Not enough data available for reliable sound change estimates (196) h (Guadalcanal) > G (GhariNggae) 5.670e-05 2.00000 (197) O (Guadalcanal) > o (GhariNgger) 0.00601 2.99863 (198) Not enough data available for reliable sound change estimates (199) Not enough data available for reliable sound change estimates (200) a (SouthHalmahera) > o (Giman) 0.11966 11.83686 (201) a (GorontaloMongondow) > o (Gorontalic) 0.12679 8.78982 (202) k (Gorontalic) > P (GorontaloH) 0.00973 18.65349 (203) a (Sulawesi) > o (GorontaloMongondow) 0.06991 26.41042 (204) s (GelaGuadalcanal) > c (Guadalcanal) 0.01050 3.50693 (205) Not enough data available for reliable sound change estimates (206) Not enough data available for reliable sound change estimates (207) N (NehanNorthBougainville) > n (Haku) 0.10208 6.26107 (208) 1 (MesoPhilippine) > u (Hanunoo) 0.02599 24.61094 (209) t (Marquesic) > k (Hawaiian) 0.31942 56.48048 (210) u (Peripheral Central Bisayan) > o (Hiligaynon) 0.02028 1.95625 (211) r (Ambon) > l (HituAmbon) 0.03275 16.21990 (212) g (West NewGeorgia) > G (Hoava) 0.00152 7.69893 (213) a (NorthNewGuinea) > e (HuonGulf) 0.13866 4.84173 (214) a (LoyaltyIslands) > e (Iaai) 0.20861 8.25767 (215) u (Malayic) > o (Iban) 0.00621 12.93856 (216) k (NorthernCordilleran) > q (Ibanagic) 0.40890 11.09862 (217) N (Sabahan) > n (Idaan) 0.13834 5.84917 (218) e (Futunic) > a (IfiraMeleM) 0.20786 4.95522 (219) 1 (NuclearCordilleran) > o (Ifugao) 0.01254 19.55025 (220) k (Ifugao) > q (IfugaoAmga) 0.34057 4.76075 (221) k (Ifugao) > q (IfugaoBata) 0.34057 17.99754 (222) 1 (Kallahan) > o (IfugaoBayn) 0.01101 17.75466 (223) n (Wetar) > N (Iliun) 0.04788 10.14036 (224) u (NorthernLuzon) > o (Ilokano) 0.02660 29.66945 (225) a (Peripheral Central Bisayan) > u (Ilonggo) 0.12642 4.07172 (226) a (SouthernCordilleran) > 1 (Ilongot) 0.07296 10.35510 (227) N (Ilongot) > n (IlongotKak) 0.06631 9.13131 (228) a (Ivatan) > e (Imorod) 0.08734 6.03794 continued on next page

17 continued from previous page

Code Parent Child Functional load Number of occurrences

(229) n (SouthwestBabar) > m (Imroing) 0.23281 10.35889 (230) u (SamaBajaw) > o (Inabaknon) 0.02973 19.36631 (231) a (LocalMalay) > e (Indonesian) 0.10667 3.03924 (232) u (Benguet) > o (Inibaloi) 0.03849 64.30170 (233) y (MaranaoIranon) > i (Iranun) 0.00959 4.31410 (234) a (Ivatan) > e (Iraralay) 0.08734 11.08683 (235) l (Ivatan) > d (Isamorong) 0.01340 14.95905 (236) t (Ibanagic) > s (IsnegDibag) 0.05600 2.91874 (237) h (Ivatan) > x (Itbayat) 0.00445 31.26009 (238) o (Ivatan) > u (Itbayaten) 5.820e-04 67.10087 (239) u (KalingaItneg) > o (ItnegBinon) 0.02352 60.52923 (240) l (Ivatan) > d (Ivasay) 0.01340 14.96335 (241) g (Bashiic) > y (Ivatan) 0.02574 5.02873 (242) e (Ivatan) > 1 (IvatanBasc) 0.00723 32.77267 (243) q (ProtoMalay) > h (Javanese) 0.07129 15.44980 (244) a (Northern NewCaledonian) > e (Jawe) 0.13215 8.47484 (245) e (West Barito) > o (Kadorih) 7.673e-05 3.80522 (246) O (SanCristobal) > o (Kahua) 0.00562 1.99971 (247) O (SanCristobal) > o (KahuaMami) 0.00562 1.99952 (248) l (Gorontalic) > r (Kaidipang) 0.01053 21.87631 (249) k (KairiruManam) > q (Kairiru) 0.00897 10.55125 (250) u (Schouten) > i (KairiruManam) 0.17315 1.37804 (251) n (Ilongot) > N (KakidugenI) 0.06631 9.38577 (252) r (Mansakan) > l (Kalagan) 0.04611 3.61546 (253) i (MesoPhilippine) > 1 (Kalamian) 0.02195 3.10459 (254) 1 (KalingaItneg) > o (KalingaGui) 0.01076 22.89422 (255) a (CentralCordilleran) > i (KalingaItneg) 0.13957 2.25237 (256) s (Benguet) > h (Kallahan) 0.02667 15.50239 (257) s (Kallahan) > h (KallahanKa) 0.00566 1.92678 (258) 1 (Kallahan) > e (KallahanKe) 0.00403 38.49811 (259) n (BimaSumba) > N (Kambera) 0.00600 25.28750 (260) b (Tsouic) > v (Kanakanabu) 0.01715 4.07361 (261) v (PatpatarTolai) > w (Kandas) 0.03857 3.91962 (262) u (BontokKankanay) > o (KankanayNo) 0.04380 49.14539 (263) n (CentralLuzon) > N (Kapampanga) 0.01456 6.07915 (264) t (Ellicean) > d (Kapingamar) 3.974e-05 35.94977 (265) v (LavongaiNalik) > f (KaraWest) 0.02487 4.21517 (266) a (SouthHalmahera) > e (KasiraIrah) 0.05583 5.86174 (267) e (South West Barito) > E (Katingan) 0.00426 27.03064 (268) I (Pasismanua) > i (KaulongAuV) 0.01894 6.17672 (269) a (Northern EastFormosan) > i (Kavalan) 0.08596 4.99960 (270) b (ProtoMalay) > v (KayanMurik) 1.449e-07 6.64852 (271) a (KayanMurik) > e (KayanUmaJu) 0.06056 5.93656 (272) a (SarmiJayapuraBay) > e (KayupulauK) 0.32087 3.93441 (273) a (FloresLembata) > e (Kedang) 0.10748 14.89342 (274) b (KeiTanimbar) > B (KeiTanimba) 4.198e-05 4.13306 (275) a (SoutheastMaluku) > e (KeiTanimbar) 0.17157 1.74318 (276) a (Dayic) > e (KelabitBar) 0.07691 10.95832 (277) e (East NuclearTimor) > E (Kemak) 2.255e-04 5.94197 (278) i (NorthSarawakan) > e (KenyahLong) 0.03423 6.39511 (279) a (LocalMalay) > o (Kerinci) 0.01029 37.28182 (280) a (KilivilaLouisiades) > e (Kilivila) 0.14778 5.83519 (281) o (Peripheral PapuanTip) > a (KilivilaLouisiades) 0.17286 2.23572 (282) o (Central SantaIsabel) > O (KilokakaYs) 0.02707 5.60509 (283) N (MicronesianProper) > n (Kiribati) 0.04469 10.23912 (284) P (Manam) > k (Kis) 0.07162 3.84299 (285) t (KisarRoma) > k (Kisar) 0.05336 24.56857 (286) f (SouthwestMaluku) > w (KisarRoma) 0.03321 7.75331 (287) a (BimaSumba) > o (Kodi) 0.13543 22.04728 (288) l (ProtoCentr) > r (KoiwaiIria) 0.06793 11.95360 (289) O (Central SantaIsabel) > o (Kokota) 0.02707 31.07342 (290) e (Pesisir) > o (Komering) 0.00325 8.58447 (291) a (Blaan) > O (KoronadalB) 0.01132 30.81323 (292) s (NgeroVitiaz) > r (Kove) 0.06561 7.13972 (293) u (PatpatarTolai) > a (Kuanua) 0.16803 1.98361 (294) O (West NewGeorgia) > o (Kubokota) 0.00573 2.00145 (295) v (Nuclear WestCentralPapuan) > b (Kuni) 0.11533 6.21112 (296) Not enough data available for reliable sound change estimates (297) a (MicronesianProper) > e (Kusaie) 0.12251 14.33579 (298) O (Northern Malaita) > o (Kwai) 0.00348 1.99952 (299) a (Northern Malaita) > o (Kwaio) 0.18875 8.31055 (300) l (Tanna) > r (Kwamera) 0.05929 25.71039 (301) N (Northern Malaita) > n (KwaraaeSol) 0.03728 9.80996 (302) Not enough data available for reliable sound change estimates (303) e (MelanauKajang) > @ (Lahanan) 0.00231 14.95868 (304) i (Willaumez) > e (Lakalai) 0.11589 1.41234 (305) r (Nuclear WestCentralPapuan) > l (Lala) 0.13489 4.70309 (306) u (FloresLembata) > o (LamaholotI) 0.02895 10.53359 (307) Not enough data available for reliable sound change estimates (308) s (BimaSumba) > h (Lamboya) 0.03743 9.36744 (309) o (Bibling) > u (LamogaiMul) 0.12203 5.67124 (310) a (Pesisir) > @ (Lampung) 0.00929 12.80317 continued on next page

18 continued from previous page

Code Parent Child Functional load Number of occurrences

(311) e (ProtoMalay) > a (LandDayak) 0.04499 11.18206 (312) Not enough data available for reliable sound change estimates (313) k (Northern Malaita) > g (Lau) 0.07399 10.37078 (314) Not enough data available for reliable sound change estimates (315) Not enough data available for reliable sound change estimates (316) r (NewIreland) > l (LavongaiNalik) 0.13377 4.12434 (317) a (East Manus) > e (Leipon) 0.13133 15.77294 (318) a (Tanna) > e (Lenakel) 0.05393 9.10010 (319) Not enough data available for reliable sound change estimates (320) h (Gela) > G (LengoGhaim) 5.895e-05 1.98182 (321) Not enough data available for reliable sound change estimates (322) f (SouthwestMaluku) > w (Letinese) 0.03321 8.50360 (323) n (West Manus) > N (Levei) 0.18931 43.18724 (324) a (LavongaiNalik) > e (LihirSungl) 0.07781 10.60106 (325) i (West Manus) > e (Likum) 0.08687 4.08584 (326) e (EndeLio) > @ (LioFloresT) 0.00290 10.17585 (327) a (Malayan) > e (LocalMalay) 0.09530 18.43396 (328) Not enough data available for reliable sound change estimates (329) r (Manus) > P (Loniu) 0.00619 8.93647 (330) a (SoutheastIslands) > e (Lou) 0.09902 12.46015 (331) i (LocalMalay) > e (LowMalay) 0.03770 2.99404 (332) a (RemoteOceanic) > e (LoyaltyIslands) 0.14981 15.79651 (333) t (Ellicean) > k (Luangiua) 0.47404 39.73227 (334) e (MonoUruava) > a (LungaLunga) 0.13896 3.85273 (335) O (West NewGeorgia) > o (Lungga) 0.00573 2.00155 (336) O (West NewGeorgia) > o (Luqa) 0.00573 2.00145 (337) b (East Barito) > w (Maanyan) 0.07537 6.13215 (338) a (NewIreland) > e (Madak) 0.16211 8.90472 (339) k (Madak) > g (MadakLamas) 0.05218 9.51948 (340) l (LavongaiNalik) > r (Madara) 0.10006 10.95189 (341) u (ProtoMalay) > o (Madurese) 0.01585 39.16274 (342) l (CentralPapuan) > r (MagoriSout) 0.12605 3.95178 (343) a (Nuclear PapuanTip) > i (Maisin) 0.22933 2.17692 (344) n (SouthSulawesi) > N (Makassar) 0.18370 10.70075 (345) k (MalaitaSanCristobal) > P (Malaita) 0.09152 5.02310 (346) v (SoutheastSolomonic) > f (MalaitaSanCristobal) 0.06368 25.62607 (347) Not enough data available for reliable sound change estimates (348) N (LocalMalay) > n (MalayBahas) 0.17169 34.31949 (349) p (Malayic) > m (Malayan) 0.17398 3.33605 (350) q (ProtoMalay) > h (Malayic) 0.07129 31.25704 (351) a (Vanuatu) > e (MalekulaCentral) 0.08559 25.93669 (352) a (NortheastVanuatuBanksIslands) > e (MalekulaCoastal) 0.08702 31.59187 (353) a (Vitiaz) > o (Maleu) 0.11223 11.90975 (354) e (Bugis) > a (Maloh) 0.07175 4.70755 (355) u (CentralPhilippine) > o (Mamanwa) 0.01883 38.86565 (356) u (East NuclearTimor) > a (Mambai) 0.34369 4.99285 (357) e (BimaSumba) > a (Mamboru) 0.13262 7.46411 (358) k (KairiruManam) > P (Manam) 0.01548 9.58872 (359) a (Marquesic) > e (Mangareva) 0.26245 3.72483 (360) s (BimaSumba) > c (Manggarai) 2.420e-05 8.94154 (361) r (Tahitic) > l (Manihiki) 9.361e-04 13.67126 (362) N (SouthernPhilippine) > n (Manobo) 0.07976 13.17836 (363) 1 (AtaTigwa) > o (ManoboAtad) 0.03723 66.54667 (364) 1 (AtaTigwa) > o (ManoboAtau) 0.03723 66.57068 (365) 1 (Central Manobo) > a (ManoboDiba) 0.12386 3.76411 (366) g (West Central Manobo) > h (ManoboIlia) 0.03983 13.28241 (367) a (South Manobo) > 1 (ManoboKala) 0.12673 7.79055 (368) 1 (South Manobo) > 2 (ManoboSara) 2.271e-04 58.20252 (369) o (AtaTigwa) > 1 (ManoboTigw) 0.03723 11.93633 (370) a (West Central Manobo) > 1 (ManoboWest) 0.13743 4.54113 (371) l (Mansakan) > r (Mansaka) 0.04611 18.44830 (372) o (CentralPhilippine) > u (Mansakan) 0.01883 21.46241 (373) a (Eastern AdmiraltyIslands) > e (Manus) 0.13652 4.75105 (374) v (Tahitic) > w (Maori) 0.00255 12.29480 (375) l (BorneoCoastBajaw) > w (Mapun) 0.03973 7.82813 (376) u (MaranaoIranon) > o (Maranao) 0.00377 54.21909 (377) 1 (SouthernPhilippine) > e (MaranaoIranon) 0.00422 7.67944 (378) Not enough data available for reliable sound change estimates (379) Not enough data available for reliable sound change estimates (380) Not enough data available for reliable sound change estimates (381) Not enough data available for reliable sound change estimates (382) s (East NewGeorgia) > c (Marovo) 0.00144 4.87889 (383) a (Marquesic) > e (Marquesan) 0.26245 10.48977 (384) f (Central East Nuclear Polynesian) > h (Marquesic) 0.05235 2.06222 (385) a (MicronesianProper) > e (Marshalles) 0.12251 42.07442 (386) i (South Babar) > E (MaselaSouthBabar) 0.04148 3.02753 (387) e (Northern NuclearBel) > i (Matukar) 0.04103 3.86380 (388) t (Willaumez) > f (Maututu) 0.00143 5.97137 (389) Not enough data available for reliable sound change estimates (390) Not enough data available for reliable sound change estimates (391) Not enough data available for reliable sound change estimates (392) Not enough data available for reliable sound change estimates continued on next page

19 continued from previous page

Code Parent Child Functional load Number of occurrences

(393) Not enough data available for reliable sound change estimates (394) e (Northern NuclearBel) > i (Megiar) 0.04103 3.96950 (395) b (Nuclear WestCentralPapuan) > p (Mekeo) 0.01057 10.53926 (396) e (Northwest) > a (MelanauKajang) 0.05269 4.10012 (397) a (MelanauKajang) > e (MelanauMuk) 0.05542 8.30760 (398) N (LocalMalay) > n (Melayu) 0.17169 15.57376 (399) e (LocalMalay) > a (MelayuBrun) 0.10667 57.81798 (400) Not enough data available for reliable sound change estimates (401) r (Vitiaz) > l (Mengen) 0.09236 7.76084 (402) a (WestSanto) > e (Merei) 0.12570 4.40581 (403) u (East Barito) > o (MerinaMala) 0.00538 45.42570 (404) p (WesternOceanic) > b (MesoMelanesian) 0.06671 3.25436 (405) e (ProtoMalay) > 1 (MesoPhilippine) 0.00192 27.69856 (406) s (ProtoMicro) > t (MicronesianProper) 0.14436 10.32542 (407) e (Malayan) > a (Minangkaba) 0.09530 36.65853 (408) b (RajaAmpat) > p (Minyaifuin) 0.08190 5.44290 (409) r (KilivilaLouisiades) > l (Misima) 0.09569 2.85853 (410) u (Malayic) > o (Moken) 0.00621 18.67524 (411) a (Ponapeic) > O (Mokilese) 0.00863 20.21510 (412) k (Bwaidoga) > P (Molima) 0.02226 6.39306 (413) r (MonoUruava) > l (Mono) 0.18018 7.90990 (414) Not enough data available for reliable sound change estimates (415) Not enough data available for reliable sound change estimates (416) N (SouthNewIrelandNorthwestSolomonic) > n (MonoUruava) 0.07198 4.50934 (417) t (CenderawasihBay) > P (Mor) 0.00191 7.90698 (418) a (Sulawesi) > o (Mori) 0.06991 15.10951 (419) d (ProtoChuuk) > t (Mortlockes) 0.10821 17.95820 (420) v (EastVanuatu) > w (Mota) 0.04447 4.78867 (421) v (SinagoroKeapara) > h (Motu) 0.01870 13.39891 (422) r (Bibling) > x (Mouk) 3.176e-05 16.89672 (423) u (Sulawesi) > o (MunaButon) 0.04575 3.73293 (424) N (Western Munic) > n (MunaKatobu) 2.731e-04 6.02727 (425) d (UlatInai) > r (MurnatenAl) 0.00181 5.95393 (426) r (ProtoOcean) > l (Mussau) 0.16171 3.95147 (427) a (EastVanuatu) > E (Mwotlap) 0.01111 27.83527 (428) r (EndeLio) > z (Nage) 0.02129 1.96583 (429) i (MalekulaCoastal) > e (Nahavaq) 0.05885 9.63004 (430) n (Willaumez) > l (NakanaiBil) 0.00412 1.02690 (431) a (LavongaiNalik) > @ (Nalik) 0.00211 12.96772 (432) u (CentralVanuatu) > i (Namakir) 0.18127 21.52549 (433) a (MalekulaCentral) > e (Naman) 0.13582 33.25250 (434) b (MalekulaCoastal) > p (Nati) 0.01049 19.54611 (435) r (SoutheastIslands) > l (Nauna) 0.10938 5.60587 (436) a (ProtoMicro) > e (Nauru) 0.13661 5.67814 (437) Not enough data available for reliable sound change estimates (438) s (NehanNorthBougainville) > h (Nehan) 0.05608 13.16549 (439) N (Nehan) > n (NehanHape) 0.11803 18.93949 (440) a (SouthNewIrelandNorthwestSolomonic) > o (NehanNorthBougainville) 0.16106 4.34745 (441) e (Northern NewCaledonian) > a (Nelemwa) 0.13215 10.43158 (442) o (Utupua) > œ (Nembao) 0.01621 4.98208 (443) a (LoyaltyIslands) > e (Nengone) 0.20861 5.30862 (444) m (Vanuatu) > n (Nese) 0.35703 18.91500 (445) a (MalekulaCentral) > e (Neveei) 0.13582 34.11813 (446) e (LoyaltyIslands) > a (NewCaledonian) 0.20861 1.71479 (447) k (SouthNewIrelandNorthwestSolomonic) > g (NewGeorgia) 0.08381 5.00348 (448) e (MesoMelanesian) > i (NewIreland) 0.06281 3.31247 (449) w (BimaSumba) > v (Ngadha) 4.053e-05 14.09575 (450) i (Aru) > e (NgaiborSAr) 0.02299 6.77784 (451) f (NorthNewGuinea) > w (NgeroVitiaz) 0.01667 3.53470 (452) Not enough data available for reliable sound change estimates (453) s (Gela) > h (Nggela) 0.05008 17.07009 (454) b (CentralVanuatu) > p (Nguna) 0.06113 6.26520 (455) p (Sumatra) > f (Nias) 0.00205 8.90278 (456) f (NilaSerua) > h (Nila) 0.01125 4.39065 (457) t (TeunNilaSerua) > l (NilaSerua) 0.13648 1.88492 (458) u (Nehan) > w (Nissan) 0.02937 8.95963 (459) a (Tongic) > e (Niue) 0.18204 4.31952 (460) l (North Babar) > n (NorthBabar) 0.16784 6.84401 (461) R (ProtoCentr) > r (NorthBomberai) 0.02692 10.12083 (462) v (WesternOceanic) > p (NorthNewGuinea) 0.12924 8.71798 (463) a (Nuclear PapuanTip) > e (NorthPapuanMainlandDEntrecasteaux) 0.27192 3.13375 (464) a (Northwest) > e (NorthSarawakan) 0.05269 5.44048 (465) e (Babar) > E (North Babar) 0.03978 1.27682 (466) b (Sulawesi) > w (North Minahasan) 0.05030 3.03447 (467) u (Vanuatu) > o (NortheastVanuatuBanksIslands) 0.06513 4.83269 (468) r (NorthernLuzon) > g (NorthernCordilleran) 0.01688 4.17784 (469) m (NorthernPhilippine) > n (NorthernLuzon) 0.15955 5.87812 (470) N (ProtoMalay) > n (NorthernPhilippine) 0.10751 18.96525 (471) o (NorthernCordilleran) > u (Northern Dumagat) 0.01648 1.25949 (472) l (EastFormosan) > n (Northern EastFormosan) 0.14981 5.83985 (473) p (Malaita) > b (Northern Malaita) 0.00727 4.99257 (474) a (NewCaledonian) > o (Northern NewCaledonian) 0.17212 3.00312 continued on next page

20 continued from previous page

Code Parent Child Functional load Number of occurrences

(475) b (Bel) > p (Northern NuclearBel) 0.08564 2.04429 (476) N (Sangiric) > n (Northern Sangiric) 0.10504 18.92173 (477) q (ProtoMalay) > P (Northwest) 0.04087 30.80231 (478) l (CentralCordilleran) > k (NuclearCordilleran) 0.14865 1.01516 (479) b (Timor) > f (NuclearTimor) 0.08459 2.80832 (480) l (PapuanTip) > n (Nuclear PapuanTip) 0.10099 4.22343 (481) N (Polynesian) > n (Nuclear Polynesian) 0.01742 5.34206 (482) r (WestCentralPapuan) > l (Nuclear WestCentralPapuan) 0.13055 1.52585 (483) t (Ellicean) > d (Nukuoro) 3.974e-05 48.84748 (484) f (HuonGulf) > w (NumbamiSib) 0.03605 5.99986 (485) t (CenderawasihBay) > k (Numfor) 0.18043 10.67064 (486) f (Seram) > h (Nunusaku) 0.02050 9.64666 (487) a (LocalMalay) > @ (Ogan) 0.00113 22.90283 (488) N (Javanese) > n (OldJavanes) 0.12968 22.81888 (489) t (CentralVanuatu) > r (Orkon) 0.18595 20.59925 (490) O (Southern Malaita) > o (Oroha) 0.01151 1.99952 (491) r (EastVanuatu) > l (PaameseSou) 0.17094 18.23342 (492) s (Formosan) > t (Paiwan) 0.10317 11.06529 (493) a (ProtoMalay) > e (Palauan) 0.04499 11.70163 (494) 1 (Palawano) > @ (PalawanBat) 4.146e-04 36.88428 (495) 1 (MesoPhilippine) > u (Palawano) 0.02599 5.66511 (496) Not enough data available for reliable sound change estimates (497) u (Pangasinic) > o (Pangasinan) 0.03899 25.70031 (498) Not enough data available for reliable sound change estimates (499) a (WesternOceanic) > o (PapuanTip) 0.15356 3.56136 (500) N (SouthwestNewBritain) > n (Pasismanua) 0.09977 3.11319 (501) v (PatpatarTolai) > h (Patpatar) 0.00182 8.75307 (502) a (SouthNewIrelandNorthwestSolomonic) > e (PatpatarTolai) 0.15474 6.20403 (503) e (SeramStraits) > i (Paulohi) 0.18832 3.98409 (504) u (Formosan) > o (Pazeh) 2.254e-04 6.81642 (505) f (Tahitic) > h (Penrhyn) 0.02693 5.16875 (506) n (Wetar) > N (Perai) 0.04788 4.13934 (507) u (Central Bisayan) > o (Peripheral Central Bisayan) 0.02827 1.93754 (508) e (PapuanTip) > a (Peripheral PapuanTip) 0.19334 3.54732 (509) q (ProtoMalay) > h (Pesisir) 0.07129 14.80076 (510) v (EastVanuatu) > f (PeteraraMa) 0.01242 17.18743 (511) a (ChamChru) > 1 (PhanRangCh) 0.00636 12.19166 (512) Not enough data available for reliable sound change estimates (513) v (EastFijianPolynesian) > f (Polynesian) 0.05871 17.52595 (514) a (Ponapeic) > e (Ponapean) 0.16942 10.52294 (515) a (PonapeicTrukic) > e (Ponapeic) 0.13099 19.78210 (516) t (MicronesianProper) > d (PonapeicTrukic) 0.04291 17.55398 (517) e (BimaSumba) > a (Pondok) 0.13262 6.53158 (518) B (TukangbesiBonerate) > F (Popalia) 0.00989 10.98695 (519) e (CentralEastern) > @ (ProtoCentr) 0.00248 3.97165 (520) o (PonapeicTrukic) > a (ProtoChuuk) 0.09341 2.83945 (521) N (ProtoAustr) > n (ProtoMalay) 0.13485 12.99231 (522) v (RemoteOceanic) > f (ProtoMicro) 0.04980 13.02087 (523) a (EasternMalayoPolynesian) > o (ProtoOcean) 0.09981 7.99270 (524) f (SamoicOutlier) > w (Pukapuka) 0.00204 8.90305 (525) a (NorthBomberai) > e (PulauArgun) 0.13318 12.93777 (526) u (ProtoChuuk) > U (PuloAnna) 8.595e-06 28.37331 (527) d (ProtoChuuk) > t (PuloAnnan) 0.10821 26.92081 (528) d (ProtoChuuk) > t (Puluwatese) 0.10821 32.33646 (529) u (KayanMurik) > o (PunanKelai) 0.00695 11.47865 (530) b (Formosan) > v (Puyuma) 0.02080 5.97573 (531) s (EastVanuatu) > h (Raga) 0.03550 19.03806 (532) r (CenderawasihBay) > l (RajaAmpat) 0.10632 10.89494 (533) f (East Nuclear Polynesian) > h (RapanuiEas) 0.05254 6.51440 (534) a (Tahitic) > e (Rarotongan) 0.23947 2.99725 (535) a (ProtoMalay) > @ (RejangReja) 0.00259 33.94479 (536) i (CentralEasternOceanic) > a (RemoteOceanic) 0.30671 1.87121 (537) r (Futunic) > g (Rennellese) 0.01560 46.78448 (538) O (Choiseul) > o (Ririo) 0.01953 8.10758 (539) r (NorthNewGuinea) > z (Riwo) 0.01094 1.23968 (540) r (KisarRoma) > R (Roma) 0.00324 9.86839 (541) k (Nuclear WestCentralPapuan) > h (Roro) 0.04267 7.79458 (542) f (West NuclearTimor) > b (RotiTerman) 0.05510 4.89359 (543) t (WestFijianRotuman) > f (Rotuman) 0.07237 18.85436 (544) O (West NewGeorgia) > o (Roviana) 0.00573 6.86288 (545) u (Formosan) > o (Rukai) 2.254e-04 9.26163 (546) k (Tahitic) > P (Rurutuan) 0.00737 41.17228 (547) u (Palawano) > o (SWPalawano) 7.014e-04 10.93716 (548) f (Southern Malaita) > h (Saa) 0.00833 23.36029 (549) Not enough data available for reliable sound change estimates (550) O (Southern Malaita) > o (SaaSaaVill) 0.01151 2.00000 (551) O (Southern Malaita) > o (SaaUkiNiMa) 0.01151 1.99961 (552) Not enough data available for reliable sound change estimates (553) u (Tsouic) > o (Saaroa) 0.00593 29.89765 (554) a (Northwest) > o (Sabahan) 0.01254 7.65235 (555) d (ProtoChuuk) > t (SaipanCaro) 0.10821 30.24687 (556) u (Formosan) > o (Saisiat) 2.254e-04 32.49754 continued on next page

21 continued from previous page

Code Parent Child Functional load Number of occurrences

(557) a (Vanuatu) > E (SakaoPortO) 1.415e-05 7.37912 (558) v (Suauic) > h (Saliba) 0.01548 5.59439 (559) e (ProtoMalay) > a (SamaBajaw) 0.04499 11.37100 (560) a (SuluBorneo) > 1 (SamalSiasi) 0.03121 4.55207 (561) u (CentralLuzon) > o (SambalBoto) 0.03570 51.44942 (562) k (SamoicOutlier) > P (Samoan) 7.913e-05 40.57180 (563) r (Nuclear Polynesian) > l (SamoicOutlier) 0.04761 2.52276 (564) Not enough data available for reliable sound change estimates (565) h (Northern Sangiric) > r (SangilSara) 0.01859 10.57112 (566) q (Northern Sangiric) > P (Sangir) 0.17663 33.66747 (567) P (Northern Sangiric) > q (SangirTabu) 0.17663 5.11841 (568) a (Sulawesi) > e (Sangiric) 0.06360 9.41251 (569) l (SanCristobal) > r (SantaAna) 0.20803 20.89020 (570) O (SanCristobal) > o (SantaCatal) 0.00562 1.99971 (571) s (SouthNewIrelandNorthwestSolomonic) > h (SantaIsabel) 0.01828 6.35726 (572) l (NehanNorthBougainville) > n (SaposaTinputz) 0.13005 8.23827 (573) a (Blaan) > u (SaranganiB) 0.10239 1.65720 (574) o (SarmiJayapuraBay) > a (Sarmi) 0.30507 3.74591 (575) l (NorthNewGuinea) > r (SarmiJayapuraBay) 0.09033 9.81639 (576) a (BaliSasak) > @ (Sasak) 0.07763 7.22450 (577) a (ProtoChuuk) > e (Satawalese) 0.11108 19.48892 (578) a (BimaSumba) > e (Savu) 0.13262 10.66529 (579) n (NorthNewGuinea) > N (Schouten) 0.12461 3.55253 (580) a (Atayalic) > u (Sediq) 0.15623 4.94540 (581) f (Western AdmiraltyIslands) > h (Seimat) 0.03838 5.52135 (582) u (NorthBomberai) > i (Sekar) 0.07251 14.57543 (583) b (SoutheastMaluku) > h (Selaru) 0.00154 5.27609 (584) Not enough data available for reliable sound change estimates (585) E (Pasismanua) > e (Sengseng) 0.00635 6.51904 (586) r (East CentralMaluku) > l (Seram) 0.17834 9.26050 (587) l (Nunusaku) > r (SeramStraits) 0.08036 17.65607 (588) e (MaselaSouthBabar) > E (Serili) 0.03391 27.97054 (589) f (NilaSerua) > w (Serua) 0.09929 7.71214 (590) N (PatpatarTolai) > n (Siar) 0.11079 7.91352 (591) n (FloresLembata) > N (Sika) 0.12345 10.33270 (592) f (Ellicean) > h (Sikaiana) 0.01500 18.73830 (593) O (West NewGeorgia) > o (Simbo) 0.00573 3.68466 (594) a (CentralPapuan) > o (SinagoroKeapara) 0.25340 1.00322 (595) a (LandDayak) > o (Singhi) 0.01654 17.32791 (596) u (EastFormosan) > o (Siraya) 4.190e-04 8.28489 (597) O (Choiseul) > o (Sisingga) 0.01953 13.70360 (598) Not enough data available for reliable sound change estimates (599) r (BimaSumba) > z (Soa) 0.00401 6.96601 (600) a (Sarmi) > e (Sobei) 0.23073 10.06460 (601) k (CentralMaluku) > P (Soboyo) 0.00549 7.46123 (602) a (NehanNorthBougainville) > e (Solos) 0.10574 4.14147 (603) Not enough data available for reliable sound change estimates (604) l (Ambon) > r (SouAmanaTe) 0.03275 3.58352 (605) r (NorthernLuzon) > l (SouthCentralCordilleran) 0.03231 7.74272 (606) n (MaselaSouthBabar) > l (SouthEastB) 0.25859 6.79647 (607) a (CentralVanuatu) > e (SouthEfate) 0.06242 9.08441 (608) v (SouthHalmaheraWestNewGuinea) > p (SouthHalmahera) 0.04413 6.86021 (609) u (EasternMalayoPolynesian) > i (SouthHalmaheraWestNewGuinea) 0.15907 2.91215 (610) i (NewIreland) > u (SouthNewIrelandNorthwestSolomonic) 0.17058 3.74428 (611) u (Sulawesi) > o (SouthSulawesi) 0.04575 8.90368 (612) t (Babar) > k (South Babar) 0.08233 5.96986 (613) l (Bisayan) > y (South Bisayan) 0.06356 5.93967 (614) a (Manobo) > 1 (South Manobo) 0.09378 14.76928 (615) y (West Barito) > i (South West Barito) 0.00602 4.16231 (616) o (Eastern AdmiraltyIslands) > u (SoutheastIslands) 0.01939 2.52762 (617) R (ProtoCentr) > r (SoutheastMaluku) 0.02692 16.51716 (618) r (CentralEasternOceanic) > l (SoutheastSolomonic) 0.18698 6.93482 (619) t (SouthCentralCordilleran) > b (SouthernCordilleran) 0.17112 1.97141 (620) e (ProtoMalay) > 1 (SouthernPhilippine) 0.00192 17.69275 (621) l (Malaita) > n (Southern Malaita) 0.09412 3.02851 (622) o (South Babar) > u (SouthwestBabar) 0.05720 5.18676 (623) b (Timor) > w (SouthwestMaluku) 0.05085 6.39151 (624) i (Vitiaz) > u (SouthwestNewBritain) 0.23570 3.84871 (625) u (Atayalic) > o (SquliqAtay) 0.00681 13.92923 (626) k (Suauic) > P (Suau) 0.02509 20.06806 (627) r (Nuclear PapuanTip) > l (Suauic) 0.04206 7.42557 (628) 1 (Subanun) > o (SubanonSio) 0.03138 42.58704 (629) a (SouthernPhilippine) > 1 (Subanun) 0.05979 13.92604 (630) g (Subanun) > d (SubanunSin) 0.15626 6.97087 (631) e (ProtoMalay) > a (Sulawesi) 0.04499 11.63780 (632) u (SamaBajaw) > o (SuluBorneo) 0.02973 3.89781 (633) e (ProtoMalay) > o (Sumatra) 0.00512 23.81901 (634) N (ProtoMalay) > n (Sunda) 0.10751 21.82474 (635) u (South Bisayan) > o (Surigaonon) 0.03947 24.45484 (636) a (Erromanga) > e (SyeErroman) 0.11628 12.15031 (637) t (SouthSulawesi) > P (TaeSToraja) 0.06897 3.10286 (638) N (Tboli) > n (Tagabili) 0.10214 23.38929 continued on next page

22 continued from previous page

Code Parent Child Functional load Number of occurrences

(639) u (CentralPhilippine) > o (TagalogAnt) 0.01883 10.16369 (640) k (Kalamian) > q (TagbanwaAb) 0.42879 5.41452 (641) q (Kalamian) > k (TagbanwaKa) 0.42879 17.73591 (642) u (Tahitic) > o (Tahiti) 0.19238 6.98869 (643) e (Tahitic) > a (TahitianMo) 0.23947 3.68949 (644) a (Tahitic) > e (Tahitianth) 0.23947 2.71372 (645) f (Central East Nuclear Polynesian) > h (Tahitic) 0.05235 6.11830 (646) v (SaposaTinputz) > f (Taiof) 0.04497 6.79178 (647) l (Ellicean) > r (Takuu) 0.01827 30.88407 (648) Not enough data available for reliable sound change estimates (649) Not enough data available for reliable sound change estimates (650) Not enough data available for reliable sound change estimates (651) Not enough data available for reliable sound change estimates (652) h (Guadalcanal) > G (TalisePole) 5.670e-05 1.97576 (653) f (Wetar) > h (Talur) 0.03346 6.74401 (654) e (Vanikoro) > a (Tanema) 0.28248 10.81075 (655) a (PatpatarTolai) > e (Tanga) 0.10990 12.52910 (656) e (Utupua) > u (Tanimbili) 0.10880 6.71919 (657) a (Vanuatu) > @ (Tanna) 0.01056 13.80984 (658) r (Tanna) > l (TannaSouth) 0.05929 26.23479 (659) a (Vanuatu) > e (Tape) 0.08559 36.62961 (660) Not enough data available for reliable sound change estimates (661) f (Sarmi) > p (Tarpia) 0.09285 4.76042 (662) o (ButuanTausug) > u (TausugJolo) 0.03983 38.80244 (663) Not enough data available for reliable sound change estimates (664) a (Bilic) > O (Tboli) 0.01846 18.79298 (665) O (Tboli) > o (TboliTagab) 0.02446 2.78683 (666) O (Vanikoro) > o (Teanu) 0.13019 24.94671 (667) l (SouthwestBabar) > n (TelaMasbua) 0.17076 14.35748 (668) s (SaposaTinputz) > h (Teop) 0.06456 11.64051 (669) N (East NuclearTimor) > n (TetunTerik) 0.02632 3.88184 (670) t (TeunNilaSerua) > P (Teun) 0.01477 8.79107 (671) e (SouthwestMaluku) > E (TeunNilaSerua) 0.00128 6.36734 (672) b (WesternPlains) > f (Thao) 0.01723 9.66201 (673) u (LavongaiNalik) > a (Tiang) 0.23417 7.41175 (674) a (LavongaiNalik) > o (Tigak) 0.10194 2.85222 (675) r (Futunic) > l (Tikopia) 0.05835 3.75175 (676) R (ProtoCentr) > r (Timor) 0.02692 17.70656 (677) e (Dayic) > o (TimugonMur) 0.00467 20.83722 (678) s (Northern Malaita) > 8 (Toambaita) 0.01236 5.00367 (679) k (Sumatra) > h (TobaBatak) 0.01209 10.18636 (680) s (SamoicOutlier) > h (Tokelau) 0.02620 9.71242 (681) g (Guadalcanal) > h (Tolo) 0.04461 9.43634 (682) a (Tongic) > o (Tongan) 0.28401 6.49634 (683) s (Polynesian) > h (Tongic) 0.04938 24.85227 (684) l (North Minahasan) > d (Tonsea) 0.05604 2.89546 (685) b (North Minahasan) > w (Tontemboan) 0.12446 9.61430 (686) n (MonoUruava) > l (Torau) 0.17242 4.80739 (687) a (Tsouic) > o (Tsou) 0.00357 26.52983 (688) d (Formosan) > c (Tsouic) 0.02777 7.43830 (689) n (Tahitic) > N (Tuamotu) 0.00257 17.01166 (690) a (Wetar) > i (Tugun) 0.34504 3.34154 (691) a (MunaButon) > o (TukangbesiBonerate) 0.19256 6.52913 (692) a (LavongaiNalik) > e (TungagTung) 0.07781 8.41577 (693) u (Barito) > o (Tunjung) 0.00125 6.26453 (694) s (Ellicean) > h (Tuvalu) 0.01029 12.42513 (695) p (Are) > f (Ubir) 0.00167 1.99937 (696) Not enough data available for reliable sound change estimates (697) p (Aru) > f (UjirNAru) 0.01141 10.59153 (698) h (Nunusaku) > b (UlatInai) 0.07007 6.57134 (699) o (Erromanga) > e (Ura) 0.05761 9.45115 (700) l (MonoUruava) > r (Uruava) 0.18018 8.71608 (701) a (EasternOuterIslands) > o (Utupua) 0.19429 6.30089 (702) s (SamoicOutlier) > h (UveaEast) 0.02620 28.16233 (703) r (Futunic) > l (UveaWest) 0.05835 27.69390 (704) r (Futunic) > l (VaeakauTau) 0.05835 41.42212 (705) u (Choiseul) > @ (Vaghua) 3.654e-05 19.13794 (706) Not enough data available for reliable sound change estimates (707) a (EasternOuterIslands) > e (Vanikoro) 0.33088 4.43031 (708) o (Vanikoro) > e (Vano) 0.10677 4.13392 (709) o (RemoteOceanic) > u (Vanuatu) 0.11711 14.34763 (710) o (Choiseul) > O (Varisi) 0.01953 6.43868 (711) Not enough data available for reliable sound change estimates (712) r (SinagoroKeapara) > l (Vilirupu) 0.17037 8.15153 (713) a (NgeroVitiaz) > e (Vitiaz) 0.13267 4.47748 (714) s (MesoMelanesian) > d (Vitu) 0.02619 6.27238 (715) t (HuonGulf) > r (Wampar) 0.06174 5.81284 (716) s (BimaSumba) > h (Wanukaka) 0.03743 14.17816 (717) Not enough data available for reliable sound change estimates (718) v (CenderawasihBay) > w (Waropen) 0.08865 4.74200 (719) r (GeserGorom) > l (Watubela) 0.17521 13.68799 (720) l (AreTaupota) > r (Wedau) 0.04316 3.56623 continued on next page

23 continued from previous page

Code Parent Child Functional load Number of occurrences

(721) i (BimaSumba) > e (WejewaTana) 0.04706 11.31057 (722) u (Seram) > v (Werinama) 3.739e-05 7.78342 (723) t (CentralPapuan) > k (WestCentralPapuan) 0.13745 14.10036 (724) a (ProtoCentr) > o (WestDamar) 0.02891 9.89554 (725) f (CentralPacific) > h (WestFijianRotuman) 0.00402 4.95883 (726) k (NortheastVanuatuBanksIslands) > h (WestSanto) 0.01801 2.59859 (727) a (Barito) > e (West Barito) 0.06510 2.67923 (728) a (Central Manobo) > 1 (West Central Manobo) 0.12386 21.47034 (729) a (Manus) > e (West Manus) 0.16506 7.10396 (730) O (NewGeorgia) > o (West NewGeorgia) 0.00631 2.83443 (731) r (NuclearTimor) > l (West NuclearTimor) 0.14621 3.15074 (732) Not enough data available for reliable sound change estimates (733) 1 (West Central Manobo) > e (WesternBuk) 4.647e-04 76.71920 (734) s (WestFijianRotuman) > c (WesternFij) 0.00926 4.99871 (735) Not enough data available for reliable sound change estimates (736) a (ProtoOcean) > e (WesternOceanic) 0.12139 2.58873 (737) d (Formosan) > s (WesternPlains) 0.04913 4.48311 (738) s (AdmiraltyIslands) > h (Western AdmiraltyIslands) 0.01688 4.34207 (739) a (MunaButon) > o (Western Munic) 0.19256 11.48592 (740) t (SouthwestMaluku) > k (Wetar) 0.09657 3.25493 (741) r (MesoMelanesian) > l (Willaumez) 0.14060 7.57978 (742) b (CentralWestern) > v (WindesiWan) 0.16193 3.04357 (743) P (Manam) > k (Wogeo) 0.07162 8.86144 (744) k (ProtoChuuk) > g (Woleai) 0.00179 35.44518 (745) a (ProtoChuuk) > e (Woleaian) 0.11108 60.39379 (746) a (Sulawesi) > o (Wolio) 0.06991 8.71005 (747) w (Western Munic) > v (Wuna) 0.00508 2.73532 (748) t (Western AdmiraltyIslands) > P (Wuvulu) 0.00203 11.89389 (749) a (HuonGulf) > e (Yabem) 0.13370 6.51629 (750) a (SamaBajaw) > e (Yakan) 0.04299 27.73964 (751) a (KeiTanimbar) > e (Yamdena) 0.18822 17.50706 (752) o (Bashiic) > u (Yami) 0.00368 5.79971 (753) a (ProtoOcean) > i (Yapese) 0.27352 4.64937 (754) a (Javanese) > e (Yogya) 0.08208 4.09979 (755) o (West SantaIsabel) > O (ZabanaKia) 0.02351 13.69682

24 Figure S.1: Percentage of words with varying levels of Levenshtein distance. Known Cognates (gold) were hand- annotated by linguists, while Automatic Cognates were found by our system.

25 m ( n - +

p b t d k g q ! 4

, ' f v s z x " 0 h #

m ( n - +

p b t d k g q ! 4

FijianBau

Niue Teanu

Tongan Vano

veaEast s a E a e Uv Tanema

Tokelau Buma

Pukapuka umboa As

Luangiua Tanimbili

u uoro Nuk o ba m Ne

Kapingamar Seimat

Sikaiana Wuvulu

Tuvalu Lou

Takuu una Na

VaeakauTau Leipon

FutunaEast SivisaTita

veaWest s e W a e Uv Likum

IfiraMeleM Levei

FutunaAniw Loniu

Rennellese Mussau

Anuta Kas iraIrah

Emae Gima n

Tikopia Buli

Bellona Mor

Samoan Minyaifuin

RapanuiEas BigaMisool aw i n iia wa Ha , ' f v s z x " 0 h #

As

Mangareva Waropen

Marquesan Num for

Tahitianth Marau

Rarotongan Amba iYa pen

TahitianMo WindesiWan

Rurutuan (333)

(483)

(264) NorthB baa r

(592) Manihiki

(694)

(647) Da we ra Da we

(704) Tuamotu

(181)

(703) Da i

Penrhyn

(218)

(180) TelaMasbua

Tahiti

(537)

(13) Imroing

Maori

(163)

(209)

(675) Emplawas

WesternFij

(359) (63)

(383) EastMasela

Rotuman (317)

(644) (598) Serili ehu De (325)

(534) (323) Ce ntra lM a s (643) engone Ne

(546) (459)

(682) SouthEastB Iaai

(361) i y % / &u

WestDamar

(666) an la na Ca (708)

(654)

(702) (98)

(680)

(689) (26) (524) (656)

(442) TaranganBa (162) (505) Jawe

(642) UjirNAru el wa m le Ne

(384)

(374) Nga iborS Ar auru Na (182) (330)

(435) Ge se r

Marshalles Watubela u aie Kus

(562) (408) ElatKeiBes

(645) (68)

Kiribati (25) HituAm bon

Ponapean

(533) (378) SouAmanaTe e ø 5 * $ o Pingilapes (8)

(742) Ama ha i

Mokilese (581)

(748) Paulohi

Woleai (153) (563) (667)

(729) Alune (122) PuloAnna (329) (229)

(164) MurnatenAl

SaipanCaro (148) (266)

(200) (588) Bonfia

Sonsoroles (97)

(244) (115) Werinama

huese s Chuuke (606)

(441) (211) BuruNamrol (683) arlnan rolinia Ca (604) 6 œ 7 8 . 3

(460) Soboyo Mortlockes (156)

(514) (135)

(707) (134) (191) Mamboru (701) hueseAK e s Chuuke (512)

(734) (719) (481)

(543) Manggarai (411) Satawalese

(744) Nga dha

Woleaian

(526) Pondok

PuloAnnan (177)

(555) (417) (5) Kodi (616) Puluwatese

(603) (532)

(373) (9) (425) WejewaTana a ) 1 2

(718) aman a m Na

(131) (485) (105)

(513) (6) Bima

(106) (120) eveei e e v Ne

(474) (503)

(419) Soa

(515)

vava v a Av (622)

(132) Savu

SakaoPortO

(577) (386) Wanukaka

AnejomAnei

(745) Ga ura Ngga u

Tape

(527) (146) (660)

(697) EastSumban

(528) ese s Ne

(385) (450) (192) (587)

(297) Baliledo

ah q a v ha Na (161) (283)

(520)

(139) Lamboya (443)

ati Na (159) (698)

(214) (738) LioFloresT (160) amakir k a m Na

(446) Ende (116)

Nguna Na ge

(725) r n o Ork (608)

(433) Kambera

(436) SouthEfate (516)

(445) (108) (465)

(486) PalueNitun Raga

(32) Ana ka la ng

Mwotlap (612)

(429) (86) Sekar PaameseSou

(434) (722) PulauArgun (432)

PeteraraMa (45)

(332)

(406) (454) RotiTerman

Mota (1) (489) Atoni

mr Sout mS Ambry (426)

(607) (586) (326) Mambai

Merei

(531) (165) TetunTerik

r i outh kiS Ara

(536) (427) (352) (357) (428)

(360) Kemak

Ura

(491) (449)

(351)

(522) (517) Lamalerale (510)

SyeErroman (287)

(119)

(557) (420) LamaholotI Lenakel (721)

(609) (152) (542)

(12) (33) (76)

(10) Sika

Kwamera

(659) (599) (29)

(402) (444) (578) Kedang TannaSouth (356)

(724) (716)

(15) (24) (101) (186) (669) Erai

Tawaroga (601)

(150) (149) (277) Talur SantaAna (44)

(308) (582) Aputa i

FaganiRihu (525)

(699) (114) (166) Perai FaganiAguf

(636) (726)

(467) Tugun

BauroPawaV (259) (318) (731)

(496) (307) Iliun (300)

Kahua (11) (306) (170)

(658) Serua rsi Aros (155) (591)

(663) (653)

(273) Nila (569)

rsiaw t wa iTa Aros (14)

(709) (174) Teun rsiOneib Aros (506)

(173)

(112) (690) (589) Kis ar SantaCatal

(171) (77)

(60) (223) (456) Roma

KahuaMami

(246)

(657) EastDamar

Fagani

(21) (479) Letinese

BauroHaunu

(23) (457) Yamdena BauroBaroo

(22) (519) (740) (285) KeiTanimba (570) (670)

SaaAuluVil (461) (178) (540)

(247) Selaru rarWaia a reW Area

(172) KoiwaiIria

(564) rarMaas a a reM Area

(59) (671) Cha morro orio Do (751)

(549)

(58) (274) Lampung Saa

(19) (286) Komering (18)

SaaUlawa (623) (145)

(111) Sunda (142)

r a h Oro (676) (322)

(548) RejangReja SaaSaaVill

(552) TboliTagab SaaUkiNiMa (275)

(490)

(621) Tagabili

Longgu (583)

(550) KoronadalB

LauNorth

(551) (310) SaranganiB

LauWalade (290) (665)

(346) BilaanKoro

Fataleka (638)

(314) (617) BilaanSara (328) i y % / &u KwaraaeSol (291)

(315) (288) (573) CiuliAta y a

Mbaelelea

(175) (70) SquliqAtay

Lau (301)

(345) (71)

(664) Sediq

(389)

Mbaengguu (133) Bunun (313) Kwaio (625)

(127) (81)

(390) (580) Thao (473) e ø 5 * $ o Langalanga (509)

(299) (634) Favorlang

Kwai (535) (672)

(312) Rukai (618) (28)

Toambaita (176)

(298) (74) Paiwan gela Ngge

(158) (100)

(678) Kanakanabu

LengoParip (737)

(453) (260) Saaroa

LengoGhaim (545) 6 œ 7 8 . 3

(321) (553)

(523) (492) Tsou Lengo

(320) (687)

(189) Puyuma

h riNgini Gha (688)

(319) (269) Kav alan TaliseKoo

(198) (530)

(472) (54) Basai h iger riNgge Gha

(649) (147) (109) Ce ntra lAmi TalisePole (197) a ) 1 2

(596) Siraya (652)

h ianda riTa Gha (504) (199) Pazeh

Mbirao (556)

(521) (179)

(392) (750) Saisiat (190) h igae riNgga Gha

(196) (559) Yakan (204) (40)

Malango (632) (91)

(347) (375) Bajo

Talise (230) (560)

(648) Mapun

Tolo (629)

(681) (630) SamalSiasi

h ri Gha

(194) (628) Inabaknon

h riNdi Gha

(195) (620) SubanunSin (650) TaliseMala

(753) (728) (733)

(362) (123) SubanonSio (651)

TaliseMoli (370)

(92) Figure S.2: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-

(95) (365) WesternBuk

Bugotu (366) (714)

(393) (43) ManoboWest

MbughotuDh (27)

(741) referenced with the code in parenthesis attached to each branch. (377) (614) (363) ManoboIlia

Yapese (364)

(304) (79) ManoboDiba Vitu

(430) (367) (369) ManoboAtad

Lakalai

(338)

(388) (368) ManoboAtau

akan Bil iB na a k Na (355) (376)

(53) (405) (576) ManoboTigw Maututu (233)

(42)

(339) (118) ManoboKala Barok

(324)

(316) ManoboSara

MadakLamas (80)

(674) (121) Binukid

LihirSungl

(673) Maranao

Tigak

(431) (493) (507)

(253) (613) Iranun Tiang (265) (311)

(372) (717) Sasak (692) (225) alik Na (49) (495) (639) (3)

(52)

(340) (69) (107) (102) (210) Bali aa st es KaraW (208)

(635) Mamanwa

TungagTung (103) Ilonggo

Madara (662)

(124) 26 Hiliga y non

Banoni

(282) (252)

(350) WarayWaray (571) (289)

KilokakaYs (371) Butuanon

(83)

Kokota (727) (641)

(732) (640) TausugJolo (82)

BlablangaG (151) (57)

(302) (187) Surigaonon

(404)

Blablanga (693) (494)

(755) Akla nonBis

LaghuSamas (47)

(157)

(380) (477)

(75) (547) Ce bua no

ZabanaKia (595)

(452) (136) Kalagan

MaringeLel (349)

(381) (615) Mansaka (448) (633) gao oro oP Ngga

(379)

(416) (341) (245) TagalogAnt

MaringeTat

(128) BikolNagaC MaringeKma

(415) (126) TagbanwaKa (700) h eHolo ke Che (410)

(215)

(686) TagbanwaAb Bilur

(414) (403)

(470) BatakPalaw (736) (554)

MonoFauro (337) (267)

(413) PalawanBat (610)

rava v Urua (327) (137)

(334) (464) Banggi

Torau

(129)

(36) (56)

(243) SWPalawano

(735) MonoAlu

(37) (396)

(631) (270) Ha nunoo

Mono

(34) (407) Palauan

(705)

LungaLunga (125) (279) Singhi

(35)

BabatanaKa (206) (400)

(584) Da y a k B a k a t (469)

BabatanaLo (398)

(538) (167) (487) Katingan

Babatana (455)

(710)

(440) (679) (331) Da y a k Nga ju

Vaghua

(711) (231) Kadorih

BabatanaAv (138)

(207)

(38) (48) MerinaMala

Sengga (78)

(597) (348)

(438) (55) (66) (399) Maanyan

Ririo (605)

(278) Tunjung ai i Varis

(572) (217)

(499) (511) Kerinc i ai iGhon Varis (113) (99)

(602) (130)

(468)

(458) MelayuSara

BabatanaTu (224)

(447)

(439) Melayu (154)

Sisingga (611)

(508) (568)

(668) (203) Og a n

(418) (423)

(746)

(51) (46) (466)

aku k Ha (619)

(646) LowMalay

i n a s Nis (397) Indonesian eh H pe nHa ha Ne

(391) (303)

(110) (677) BanjareseM Teop (382)

(462) (276)

(117) MalayBahas (480)

(706) Taiof (730)

(502) (169) MelayuBrun (193) (241) Solos (65)

(168)

(696) (62) Minangkaba

Mbareke (498)

(294) (579) (342) (281) PhanRangCh

Marovo (752)

(539)

(336)

(594) (226) Chru

Vangunu (575)

(335) Ha ina nCha m (463)

h nongga Gha (478)

(212) Moken

gele Ughe (451)

(296)

(723) (255)

(343) Iban

(501) Kubokota

(437) (216)

(213)

(627)

(293) Ga y o

(593) Luqa (471)

(590) (754)

(544) Idaan (488) (64) Lungga

(655) (141) (94)

(344) (561) BunduDusun

(637) (242) o a v Hoa (4)

(261) (476) (497) (263) (50)

(17) (201) (237)

(529) (84)

(96) TimugonMur (250) (691) (271)

(739) u aghe Kus

(292) (39)

(482) (421) (219) KelabitBar

(104) dke Nduk (234)

(712)

(205) (90) (238) Bintulu

Simbo (713) (235)

(41) BerawanLon (183)

Roviana (240)

(280) (228) Belait (272) Patpatar

(409)

(574) Keny ahLong

Kuanua (256)

(305)

(720) MelanauMuk Siar

(395) (232)

(16) Lahanan Tanga

(541) (251) EngganoMal

(353) (295)

Kandas (227)

(401)

(358) (685) (684)

(143) EngganoBan

MagoriSout (61)

(249) Nia s (626) (88) Motu

(558)

(624) (262) TobaBatak Vilirupu (31)

(412) (254) Madurese (140)

Lala (239)

(184) (222) IvatanBasc Mekeo (30) (220)

(236) (257)

(185) (221) (2) Itbayat

Roro (258)

(695) (144) Babuyan

Kuni

(484) Iraralay (749) oura Do (715) (93)

(354) Itbayaten ab di ba Ga (72) (567)

(566)

(565)

(600) Isamorong ii ila Kiliv

(475) (248)

(202)

(661) (518)

(85) (747) (743) (424) Ivasay

Misima (500) (89) (284)

(7) (87) Imorod

oban bua Do (67) Yami

Wedau SambalBoto

ap aiwa pa pa Ga Kapampanga

Ubir IfugaoBayn

Molima KallahanKa

i dio Dio KallahanKe

uaw na wa Guma Inibaloi

Maisin

(387) Pangasinan

Suau

(188)

(394) KakidugenI

Saliba

(73)

(268) IlongotKak

ueawa Auhela (585) IfugaoAmga (309)

Ali

(20)

(422) IfugaoBata

Wogeo BontokGuin

Kis BontocGuin

Kairiru Kankanay No

Riwo Balangaw

a upulauK Kay KalingaGui

Sobei ItnegBinon

Tarpia Ga dda ng

o e Kov Atta P a mplo

Maleu IsnegDibag

Mengen Agta

Biliau Dum a ga tCa s

Matukar Ilokano

eda ge d Ge Yogya

Megiar Ol d Ja v a n e s

Bilibil WesternMalayoPolynesian

KaulongAuV BugineseSo

Sengseng Maloh

m ra Ama Makassar

LamogaiMul TaeSToraja

Aria SangirTabu

Mouk Sangir

Num ba m iS ib SangilSara

Yabem Bantik

Wampar Kaidipang

PunanKelai Goronta loH

Bukat BolaangMon

Kay anUmaJu Popalia

Mori Bonerate

Wolio Wuna

Baree MunaKatobu BanggaiWdi Tonsea Tontemboan m ( n - +

p b t d k g q ! 4

, ' f v s z x " 0 h #

m ( n - +

p b t d k g q ! 4

Saisiat

Pazeh Yakan

Siraya Bajo

enr lAmi ntra Ce Mapun

Basai SamalSiasi

a alan Kav Inabaknon

Puyuma SubanunSin

Tsou SubanonSio

Saaroa WesternBuk

Kanakanabu ManoboWest

Paiwan ManoboIlia

Rukai ManoboDiba

Favorlang ManoboAtad

Thao ManoboAtau

Bunun ManoboTigw

Sediq ManoboKala

SquliqAtay ManoboSara

ilAaya y CiuliAta Binukid

BilaanSara Maranao

BilaanKoro , ' f v s z x " 0 h #

Iranun

SaranganiB Sasak

KoronadalB Bali

Tagabili Mamanwa

TboliTagab Ilonggo

RejangReja nonHiliga y

Sunda WarayWaray

Komering Butuanon

Lampung TausugJolo

h morro Cha Surigaonon

KoiwaiIria Akla nonBis

Selaru Ce bua no

KeiTanimba Kalagan

Yamdena Mansaka

Letinese TagalogAnt

EastDamar BikolNagaC

Roma TagbanwaKa

i ar Kis TagbanwaAb

Teun BatakPalaw Nila i y % / &u

PalawanBat

Serua Banggi

Iliun SWPalawano

Tugun (733) (370)

(366) Ha nunoo Perai (363)

(364)

(369)

(54) (269) Palauan pt i Aputa (225)

(210) Singhi

Talur (103) Da ya k B ak at Erai (40)

(375) (662) Katingan

Kedang Da ya k Nga ju

Sika Kadorih e ø 5 * $ o

LamaholotI MerinaMala

Lamalerale Maanyan

Kemak Tunjung

TetunTerik (367)

(368) Kerinc i Mambai

(456)

(687) (553)

(260) MelayuSara (589)

Atoni (176)

(672)

(580) Melayu

RotiTerman (625)

(133)

(71) Og an

PulauArgun (70)

(573)

(274) 6 œ 7 8 . 3

(291) (540) LowMalay (751)

Sekar (638) (507) (285)

(665) (717) Indonesian n al ng la ka Ana (630)

(628) (102)

(728) BanjareseM (365)

PalueNitun (27) (635)

(223) MalayBahas

Kambera (376) (252) (690) (233)

(371) MelayuBrun

(596) (109) age Na (472) (506)

(14) Minangkaba

Ende

(653) PhanRangCh

LioFloresT

(670)

(170) (267) a ) 1 2

(137) Chru (457)

Lamboya Ha ina nCha m

(91) (560) Baliledo (277)

(669) Moken

EastSumban

(356) Iban auaNg u Ngga ura Ga (279)

(29) (576) (641)

(42) (640) (400) Ga y o

(542) Wanukaka

(273) (121) (57)

(494) (398) (591) Idaan

Savu (290) (613) (47) (487)

(306)

(310) (547)

(3) (331) BunduDusun Soa

(307) (107)

(322) (231) TimugonMur

Bima (145) (48)

(286) (403) (123) KelabitBar

WejewaTana (337) (348)

(614)

(583) (79)

(671) (399) Bintulu

Kodi (275) BerawanLon

Pondok (511)

(740) Belait (556) (504) (147) (428)

g dha Nga (530) (130)

(688)

(492)

(545) (595)

(737)

(165) Keny ahLong

(100) (136)

Manggarai (28)

(326)

(155) (615) MelanauMuk Mamboru (81)

(355) (245)

(525)

(664) Lahanan (731) Soboyo

(582) (80) EngganoMal

BuruNamrol (372) EngganoBan

(750) (632) (230)

Werinama (639) (677)

(69) Nia s

Bonfia (276)

(623) (327) TobaBatak MurnatenAl

(629) (65) Madurese

Alune (362) (217) (62) IvatanBasc (377)

Paulohi (407) (99) (11)

(178) Itbayat m ai ha Ama

(496) (125)

(259) Babuyan

SouAmanaTe (206)

(288)

(479)

(166) Iraralay (617)

iumbon HituAm (118) Itbayaten (308) ElatKeiBes

(425)

(44) (727) Isamorong Watubela (149) (5) (253)

(151) (397) (169)

(186) (495) (138) (242) Ivasay eser e s Ge (693) (303)

(716)

(604) (168)

(676) (208) (237)

(578) (78) Imorod g br Ar iborS Nga (211)

(599) (39)

(503) (66) Yami (76)

UjirNAru (349) (234) (6)

(721) (278) SambalBoto

TaranganBa (559)

(287) (238)

(620) (698) (9) (517) Kapampanga

WestDamar (235)

(43) (74) (449)

(461) IfugaoBayn (535)

(719) (240) SouthEastB

(360)

(634) (126)

(509) (722)

(127)

(357)

(191) (228) KallahanKa (86) (587) enr Mas a lM ntra Ce (410)

(405) (215) KallahanKe

Serili (222)

(554) Inibaloi EastMasela (257)

(486) (493)

(311) (241) Pangasinan

(161) (258) Emplawas

(77) (464)

(52) KakidugenI

(192)

Imroing (167) (561)

(606) (256)

(586) IlongotKak (601) (455) (263)

TelaMasbua

(115)

(101) (396) (679)

(450) (752) IfugaoAmga ai Da (588)

(697) (232)

(350) IfugaoBata (148) aw aD we Da ra we Da (660) (519)

(164) (64) BontokGuin

otBab r ba a NorthB (251) (45)

(229) (55) (220)

(152) (227) BontocGuin

WindesiWan (89)

(667) (187) (221) (386) (497) Kankanay No maiapen iYa Amba (87)

(498) Balangaw Marau

(114) (477) (113) (88) KalingaGui u for Num

(134)

(622) (219) ItnegBinon

(135) Waropen

(742) (226) (262)

(460) Ga dda ng As (619) (8) (633)

(56) (90)

(612) Atta P a mplo (378) (254) BigaMisool (341)

(24) (41) (239) IsnegDibag

Minyaifuin (478)

(724) (184) Agta

Mor

(120) (30)

(25) Dum a ga tCa s (465)

Buli (236) (68)

(485) (605) (110) (255) Ilokano

ian Gima (2) (33) (408)

(718) Yogya a iraIrah Kas (144)

(216) Ol d Ja v a n e s

(532)

Mussau WesternMalayoPolynesian

Loniu (470)

(97) (108)

(417) (469) (471) i y % / BugineseSo &u (200) Levei

(266) Maloh

Likum (468) (754) (93) Makassar

SivisaTita

(323) (488) (354)

(329)

(609) TaeSToraja (325) Leipon

(608) (224)

(729)

(598) (94) SangirTabu auna Na

(317) (344) (567) Sangir Lou

(373) e ø 5 * $ o (153)

(426) (637) (566) SangilSara

Wuvulu

(435) (243) (611) (565) Bantik (160) (330)

Seimat (735) (476)

(616) Kaidipang emb o ba m Ne

(748) (248)

(1) (568) (50) Goronta loH Tanimbili (581) (202)

(442) (201) BolaangMon (738) sumboa As

(656) 6 œ 7 8 . 3

(84) (518) Popalia (701)

Buma (203)

(26) (691) (85) Bonerate

Tanema

(98) (631) (747) Wuna

(159) Vano

(654) (423) (739) (424)

(707) MunaKatobu (708)

Teanu (685)

(666) (466) Tontemboan

FijianBau (684)

(179) (521)

(177) (46) a ) Tonsea 1 2

Niue (51)

(459) BanggaiWdi (683)

Tongan (746)

(682) (418) Baree

veaEast s a E a e Uv (270)

(702) (271) Wolio

Tokelau

(680) (96) Mori

Pukapuka

(524) (529) Kay anUmaJu

Luangiua

(333) (213) (715) Bukat u uoro Nuk

(483) (749) PunanKelai Kapingamar

(162)

(264) (484) Wampar

Sikaiana

(592) Figure S.3: Branch-specific, most frequent estimated changes. See Table S.5 for more information,Yabem cross-

Tuvalu

(694) (67) (422) Num ba m iS ib Takuu

(647)

(563) (624) (20) Mouk VaeakauTau

(704) referenced with the code in parenthesis attached to each branch. (451) (7) (309) Aria (181) (116) FutunaEast

(146) (462) (713)

(500) LamogaiMul (703)

veaWest s e W a e Uv

(513) (585) Ama ra (218)

IfiraMeleM (268)

(182)

(481) (61) (180) Sengseng

FutunaAniw (475) (73)

(537) (394) KaulongAuV Rennellese

(13) (188) Bilibil

Anuta (401)

(575) (292) (72)

(163) (387) Megiar

Emae (353)

(675) (539) Ge da ge d

Tikopia

(562)

(63) (574)

(533) Matukar

Bellona (579) (272) Biliau

Samoan (661) Mengen

RapanuiEas (250) (600)

(384)

(209) Maleu aw i n iia wa Ha

(359)

(156) (4) (249) Kov e Mangareva

(536)

(383) (627) (358) Tarpia Marquesan

(725)

(122)

(644) (480) Sobei Tahitianth

(332) (343)

(534) 27 (499) (284) Kay upulauK Rarotongan

(643) (743)

(139) Riwo (645) TahitianMo (31)

(546)

(443) (558) Kairiru Rurutuan

(214) (205) (361) (463)

(626) Kis

Manihiki

(734)

(689) (104)

(446) Wogeo

(543) Tuamotu

(505)

(436)

(111) Ali

Penrhyn

(642) (17)

(508) (281) Auhela wa Tahiti (374)

(105) (140)

(522) (141) Saliba

Maori (16) (412)

(474) (736) Suau

WesternFij

(385) (720) (297) Maisin

Rotuman (117) (695) (283)

(112) Guma wa na

ehu De (409) (185)

(406) (244) Dio dio

engone Ne (723) (280)

(441) Molima

Iaai (183) Ubir

an la na Ca Ga pa pa iwa

Jawe (594)

(515) (482) Wedau el wa m le Ne

(516) (342) (143) Do bua n auru Na

(158) (295) Misima Marshalles

(514) (502) (541) Kiliv ila

(512) u aie Kus

(351) (712) (395) (411) Ga ba di Kiribati

(709) (421) (305) (744)

(557) Do ura

Ponapean

(12)

(520)

(526) Kuni

Pingilapes

(659) (523) (261) (555)

(444) Roro

Mokilese (655)

(603) (404) Mekeo

Woleai (590)

(131) Lala PuloAnna (293)

(106) (447)

(501) Vilirupu

SaipanCaro

(419) (753) Motu Sonsoroles (132)

(467)

(433) (448) MagoriSout (577)

huese s Chuuke

(445) (730) (544)

(745) Kandas (32)

arlnan rolinia Ca (593)

(527) Tanga (352) (618)

Mortlockes (437)

(528) (610) Siar

hueseAK e s Chuuke (296)

(171) (440) Kuanua Satawalese

(119) (212)

(657) Patpatar

Woleaian (335)

(154) Roviana PuloAnnan (336)

(294) Simbo

Puluwatese (602) (150)

(346) (696) Nduk e aman a m Na

(429) (572)

(193) Kus aghe eveei e e v Ne

(434) (129) (438) (706) Hoa v a (432)

vava v a Av

(726) (382)

(741)

(454) (207) Lungga (714)

SakaoPortO (391) (564)

(489) (190) Luqa

AnejomAnei

(607) Kubokota Tape (646)

(531) Ughe le (427) (668)

ese s Ne (416) (597)

(491) (439) Gha nongga ah q a v ha Na

(699) (38)

(510) (458)

(636) Vangunu (316) ati Na (711)

(420) (75)

(318) (710) Marovo amakir k a m Na (338) (10)

(300) (538) Mbareke

(658) (402) (345)

Nguna (584)

(15) Solos r n o Ork (35)

(571)

(705) Taiof (663) SouthEfate

(569) (34) Teop

Raga

(174) (37)

(49) Ne ha nHa pe (173)

Mwotlap (36)

(60) (334) Nis s a n

PaameseSou

(246) (92) (413)

(189) Ha k u (21)

PeteraraMa (204) (414)

(23) Sisingga

Mota (157) (686) (22)

(621) (700) (570) BabatanaTu

mr Sout mS Ambry

(247) (415) Varis iGhon

(172)

Merei (732)

(328) (59) Varis i r i outh kiS Ara

(58) (124) Ririo

Ura (473) (388)

(430) Sengga (304)

SyeErroman (340)

(692) BabatanaAv

Lenakel (265)

(431) (128) Vaghua Kwamera (673)

(674) (379) Babatana (324)

TannaSouth (339) (381)

(53) BabatanaLo Tawaroga (452)

(549) (380) (19) BabatanaKa

SantaAna

(18) (755) LungaLunga

FaganiRihu

(142) (302)

(548) (82) Mono

FaganiAguf

(552) (83)

(490) MonoAlu

BauroPawaV (289)

(550) (393)

(551) (282) Torau (95) (453)

Kahua (651)

(321) (650)

(320) (195)

(319) (194)

(198) (681)

(649) (648)

(314) (197) (347) Urua v a

(652) (196) (199) (392) rsi Aros

(315)

(175) MonoFauro rsiawat iTa Aros

(301)

(389) Bilur

(313) rsiOneib Aros

(390)

(299) Che ke Holo

SantaCatal (312) (298)

(678) MaringeKma

KahuaMami MaringeTat

Fagani Ngga oP oro

BauroHaunu MaringeLel

BauroBaroo ZabanaKia

SaaAuluVil LaghuSamas

rarWaia a reW Area Blablanga

rarMaas a a reM Area BlablangaG

orio Do Kokota

Saa KilokakaYs

SaaUlawa Banoni

r h a Oro Madara

SaaSaaVill TungagTung

SaaUkiNiMa KaraW es t

Longgu Na lik

LauNorth Tiang

LauWalade Tigak

Fataleka LihirSungl

KwaraaeSol MadakLamas

Mbaelelea Barok

Lau Maututu

Mbaengguu Na k a na iB il

Kwaio Lakalai

Langalanga Vitu

Kwai Yapese

Toambaita MbughotuDh

Ngge la Bugotu

LengoParip TaliseMoli

LengoGhaim TaliseMala

Lengo Gha riNdi

Gha riNgini Gha ri

TaliseKoo Tolo

Gha riNgge r Talise

TalisePole Malango Gha riTa nda Mbirao Gha riNgga e m ( n - +

p b t d k g q ! 4

, ' f v s z x " 0 h #

m ( n - +

p b t d k g q ! 4

Tonsea

Tontemboan BanggaiWdi

MunaKatobu Baree

Wuna Wolio

Bonerate Mori

Popalia anUmaJu Kay

BolaangMon Bukat

oot loH Goronta PunanKelai

Kaidipang Wampar

Bantik Yabem

SangilSara ib iS m ba Num

Sangir Mouk

SangirTabu Aria

TaeSToraja LamogaiMul

Makassar Ama ra

Maloh Sengseng

BugineseSo KaulongAuV

WesternMalayoPolynesian Bilibil

ldJ s e n a v Ja d Ol Megiar

Yogya Ge dge da

Ilokano Matukar

u atas tCa ga a Dum Biliau

Agta , ' f v s z x " 0 h # Mengen

IsnegDibag Maleu

taPamplo a P Atta Kov e

adang dda Ga Tarpia

ItnegBinon Sobei

KalingaGui Kay upulauK

Balangaw Riwo

aknyNo Kankanay Kairiru

BontocGuin Kis

BontokGuin Wogeo IfugaoBata (422)

(20) Ali (309)

IfugaoAmga (585) Auhela wa

IlongotKak (268)

(73) Saliba KakidugenI (394)

(188) Suau

Pangasinan (387) Maisin

Inibaloi Guma wa na

KallahanKe Dio dio

KallahanKa Molima IfugaoBayn i y % / &u

Ubir

Kapampanga Ga pa pa iwa

SambalBoto Wedau Yami

(67) Do bua n (87)

Imorod (7) (284) (89) (500) Misima

Ivasay (424) (743) (747) (85)

(518) (661)

(202) (248) (475) Kiliv ila

Isamorong (600)

(565)

(566)

(567) (72) Ga ba di

Itbayaten

(354)

(93) (715) Do ura

(749)

Iraralay (484) Kuni e ø 5 * $ o

Babuyan

(144) (695)

(258) Roro

(2)

Itbayat

(221) (185) (236) (257)

(30)

(220) Mekeo

IvatanBasc (222)

(184)

(239) (140) Lala

Madurese (254) (31) (412) Vilirupu

TobaBatak (262) (624) (558) Motu (88) (626)

i s Nia (249) (61) MagoriSout

EngganoBan (143) (684) (685) (358)

(401)

(227) Kandas (353) (295)

EngganoMal 6 œ 7 8 . 3

(251) (541) Tanga

Lahanan (16)

(232) (395) Siar

MelanauMuk (720) (305)

(256) Kuanua

eyahLong Keny (574) (409)

(272) Patpatar (228)

Belait (280)

(240) (183) Roviana BerawanLon

(41)

(235) (713) Simbo

(90)

Bintulu

(238) (205) (712)

(234) (104) Nduk e KelabitBar

(219) a ) 1 2 (482) (421)

(39) (292) Kus aghe (739)

(271) (691) (250)

TimugonMur (96) (84) (529)

(237) (201) (17)

(50)

(263)

(497) (476) (4) (261) Hoa v a

(242) (637)

BunduDusun

(561) (344)

(94) (141) (655) Lungga (64) (488)

Idaan (544)

(754) (590)

(471) (593) Luqa ayo y Ga (627) (293) (213)

(216) (501) (437) Kubokota Iban (343)

(255) (723) (296) (451) Ughe le

Moken (212)

(478) (463) Gha nongga aianh m nCha ina Ha (335)

(575) Vangunu

Chru (226) (594)

(539) (336)

(752) Marovo

PhanRangCh (579) (281) (342) (294)

(498) Mbareke

Minangkaba (62) (696)

(168)

(65) Solos (241) (193) MelayuBrun (169) (502)

(480) (730) (706) Taiof

MalayBahas (117)

(276) (462) (382) Teop

BanjareseM

(677) (110)

(303) (391) Ne ha nHa pe Indonesian

(397) Nis s a n

LowMalay (646)

(619) Ha k u

(466) (46) (51)

(746) (423) (418) gan a Og (203) (668)

(568) (508) (611) (154) Sisingga

Melayu (447) (439)

(224) BabatanaTu

MelayuSara (458) (468)

(130) (602)

(113)

(99) Varis iGhon eici Kerinc

(511) (499)

(217) (572) Varis i Tunjung

(278)

(605) Ririo (399) (66)

(55)

Maanyan (438)

(348) (597)

(78) Sengga MerinaMala

(48) (207) (38) (138) BabatanaAv

Kadorih

(231) (711) Vaghua (679) ayakNaju Nga k a y Da

(331) (440) (710)

(455) Babatana

(487) Katingan

(167) (538)

(398) BabatanaLo

(469) ayakBakat a k a B k a y Da (584)

(400)

(206) (35) BabatanaKa Singhi

(279)

(125) (705) LungaLunga

Palauan

(407) (34) Mono

anunoo Ha (270) (631)

(396) (37) MonoAlu (735)

SWPalawano (243)

(56) (129) (36) Torau

Banggi

(464) (334) (137)

(327) (610) Urua v a

PalawanBat (413) (267)

(337) MonoFauro (554) (736) (470)

BatakPalaw

(403) (414) Bilur

TagbanwaAb (686) (215)

(410) (700) Che ke Holo TagbanwaKa

(126) (415) MaringeKma

BikolNagaC (128) MaringeTat TagalogAnt (245)

(341) (416) (379) Ngga oP oro

(633) (448) Mansaka

(615) (381)

(349) MaringeLel Kalagan

(136) (452)

(595) ZabanaKia

ebano bua Ce

(547) (75)

(477) (157) (380)

(47) LaghuSamas kanonBis Akla (755)

(494)

(693) (404) Blablanga Surigaonon (187) (302)

(57)

(151) (82) BlablangaG TausugJolo

(640) (732)

(641)

(727) (83) Kokota Butuanon

(371) (571) (289) KilokakaYs (350) WarayWaray

(252) (282) Banoni

iiaynon y Hiliga (124)

(662) Madara

Ilonggo

(103) TungagTung (635)

Mamanwa

(208) KaraW es t (107)

(210) Bali (69) (102) (340)

(52)

(3)

(495)

(639) (49) Na lik (225) (692) Sasak

(717)

(372)

(311) (265) Tiang

Iranun (613)

(253)

(493)

(507) (431) Tigak

Maranao (673) LihirSungl

Binukid

(121) (674)

(80) MadakLamas

ManoboSara i y (316) (324)% / Barok &u

ManoboKala

(118) (339) (42)

(233) Maututu (405) ManoboTigw

(576) (53)

(355) (376) Na k a na iB il

ManoboAtau

(368) (338) (388) Lakalai ManoboAtad

(367)

(369) (430) Vitu ManoboDiba

(79) (304)

(364) e ø 5 * Yapese $ o

(377) ManoboIlia (363)

(614) (741)

(27) MbughotuDh (43)

ManoboWest (393)

(714) (366) Bugotu WesternBuk (365) (95)

(92)

(370) (651) TaliseMoli SubanonSio

(123)

(362)

(733)

(728) (753) (650) TaliseMala

SubanunSin 6 œ 7 8 . 3 (620) (195) Gha riNdi

Inabaknon

(628) (194) Gha ri (630)

SamalSiasi (681)

(629) Tolo

Mapun (648)

(230)

(560) Talise Bajo

(375) (347)

(632)

(91) Malango (40) (204) Yakan (559) (196)

(190) a ) Gha riNgga1 e2 (750)

Saisiat (392)

(179) (521) (556) Mbirao

Pazeh (199)

(504) (652) Gha riTa nda

Siraya

(596) (197) TalisePole (147) (109)

enr lAmi ntra Ce (649) Gha riNgge r

Basai (472)

(54)

(530) (198) TaliseKoo a alan Kav

(269) (319) (688) Gha riNgini

Puyuma (189)

(687) (320) Lengo

Tsou

(492) (523) (553) (321)

(545) LengoGhaim Saaroa

(260) (453)

(737) Figure S.4: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-LengoParip

Kanakanabu (678)

(100) (158) Ngge la

Paiwan

(74) (298)

(176) Toambaita

(28) (618)

Rukai referenced with the code in parenthesis attached to each branch. (312) (672)

(535) Kwai

Favorlang (634) (299)

(509) (473) Langalanga (580) Thao (390)

(81)

(127)

(625) (313) Kwaio Bunun

(133) (389) Mbaengguu

(664) Sediq

(71) (345) (301) Lau

SquliqAtay

(70) (175) Mbaelelea (288)

ilAaya y CiuliAta

(573) (315)

(291) (328) KwaraaeSol (617)

BilaanSara (314)

(638) Fataleka

BilaanKoro (346)

(665)

(290) LauWalade

SaranganiB

(310) (551) LauNorth

KoronadalB (550)

(583) Longgu

Tagabili (621) (490)

(275) SaaUkiNiMa

TboliTagab (552) SaaSaaVill RejangReja (548)

(676)

(322) (142) Oro h a

Sunda (111) 28

(145)

(623) (18) SaaUlawa Komering

(286) (19) Saa (274)

Lampung (58) (549)

(751) Do rio h morro Cha (671) (59)

(564) Area reM a a s

KoiwaiIria (172) Area reW a ia

Selaru (247)

(461)

(178)

(540) SaaAuluVil (670) (570) KeiTanimba

(740)

(519) (285) (22) BauroBaroo Yamdena

(457) (23) BauroHaunu

Letinese

(479) (21) Fagani

EastDamar (657) (246) KahuaMami

Roma

(223)

(456) (60)

(77) (171) SantaCatal (690)

(589) i ar Kis (112) (173)

(506) Aros iOneib

Teun (709) (174)

(14) (569) Aros iTa wat (273) Nila

(653) (663) (591)

(155) Aros i

Serua (658)

(170)

(306)

(11) (300) Kahua

Iliun

(496)

(307)

(731) (318)

(259) BauroPawaV

Tugun (467) (726) (636) FaganiAguf

(166) Perai

(114) (699)

(525) FaganiRihu pt i Aputa

(582) (308)

(44) SantaAna

Talur (277)

(149) (150)

(601) Tawaroga

(186) (24) (669) Erai

(101) (15)

(716) (724)

(356) TannaSouth Kedang

(578) (444) (402)

(29) (599) (659) Kwamera

Sika (10) (33)

(76) (12)

(542)

(152) (609) (721) Lenakel LamaholotI (420)

(557) (119)

(287) (510) SyeErroman (517)

Lamalerale (522) (351)

(449) (491) Ura Kemak (360)

(428) (357) (536) (352) (427) Ara kiS outh

TetunTerik

(165) (531) Merei

Mambai (326)

(586) (607)

(426) Ambry mS out

Atoni (489)

(1) Mota RotiTerman (406) (454)

(332)

(45) (432) PeteraraMa PulauArgun

(722) (434) PaameseSou Sekar

(86) (429)

(612) Mwotlap

n al ng la ka Ana (32) Raga

PalueNitun (486)

(465)

(108) (445)

(436) (516) SouthEfate

Kambera (433)

(608) (725) Ork o n

age Na Nguna (116)

Ende (446) Na m a k ir (160)

LioFloresT (738) (214)

(698) (159) (443) Na ti

Lamboya (139) (283) (520) (161) Na ha v a q

Baliledo (297)

(587) (450)

(192) (385) (528) Ne s e (697)

EastSumban

(660) (146) (527) Tape

auaNg u Ngga ura Ga (745) AnejomAnei

Wanukaka (386) (577) SakaoPortO

Savu (132)

(622) (515) Av a v a

Soa (419)

(503) (474) Ne v e e i

(120) (106) Bima

(6) (513) (105)

(485) (131) Na m a n

(718) WejewaTana

(425)

(9)

(373)

(532) (603) Puluwatese (616) (5) Kodi (417) (555)

(177) PuloAnnan

Pondok (526) Woleaian g dha Nga (744)

(411) Satawalese Manggarai (481) (543)

(719) (734) (512) Chuuke s e AK (701) (191) (134)

Mamboru (707)

(135) (156) (514) Mortlockes

(460) Soboyo

(604) Ca rolinia n

(683) BuruNamrol

(211) (441)

(606) Chuuke s e Werinama

(115) (244)

(97) Sonsoroles

(588)

Bonfia

(200)

(266)

(148) SaipanCaro MurnatenAl

(164)

(329)

(229) (122) PuloAnna

Alune (729)

(667) (563)

(153) Woleai

Paulohi (748)

(581) Mokilese (742)

m ai ha Ama

(8) Pingilapes

(378)

SouAmanaTe (533) Ponapean iumbon HituAm

(25) Kiribati (68) (645) ElatKeiBes

(408) (562) Kus aie

Watubela Marshalles

eser e s Ge (435)

(330) (182) Na uru g br Ar iborS Nga (374)

(384) Ne le m wa UjirNAru (642)

(505) Jawe

(162) TaranganBa

(442)

(656) (524) (26) (689)

(680)

(98) (702)

(654)

(708) Ca na la (666)

WestDamar (361) Iaai SouthEastB (682) (459) (546)

(643) Ne ngone

enr Mas a lM ntra Ce

(323) (534) (325) De hu

Serili (598) (644)

(317) Rotuman EastMasela (383)

(63) (359) WesternFij

Emplawas (675) (209)

(163) Maori Imroing (13)

(537) Tahiti

TelaMasbua (180)

(218) Penrhyn ai Da (703)

(181) Tuamotu (704) aw aD we Da ra we Da (647)

(694) Manihiki (592) otBab r a ba NorthB (264) (483)

(333) Rurutuan

WindesiWan TahitianMo

maiapen iYa Amba Rarotongan

Marau Tahitianth

u for Num Marquesan

Waropen Mangareva

As Ha wa iia n

BigaMisool RapanuiEas

Minyaifuin Samoan

Mor Bellona

Buli Tikopia

ian Gima Emae

a iraIrah Kas Anuta

Mussau Rennellese

Loniu FutunaAniw

Levei IfiraMeleM

Likum Uv e a W e s t

SivisaTita FutunaEast

Leipon VaeakauTau

Na una Takuu

Lou Tuvalu

Wuvulu Sikaiana

Seimat Kapingamar

Ne m ba o Nuk uoro

Tanimbili Luangiua

As umboa Pukapuka

Buma Tokelau

Tanema Uv e a E a s t

Vano Tongan Teanu FijianBau Niue m ( n - +

p b t d k g q ! 4

, ' f v s z x " 0 h #

m ( n - +

p b t d k g q ! 4

Mbirao

h igae riNgga Gha nda riTa Gha

Malango TalisePole

Talise r riNgge Gha

Tolo TaliseKoo

h ri Gha riNgini Gha

h riNdi Gha Lengo

TaliseMala LengoGhaim

TaliseMoli LengoParip

Bugotu la Ngge

MbughotuDh Toambaita

Yapese Kwai

Vitu Langalanga

Lakalai Kwaio

akan Bil iB na a k Na Mbaengguu

Maututu Lau

Barok Mbaelelea

MadakLamas KwaraaeSol

LihirSungl Fataleka

Tigak LauWalade

Tiang LauNorth

alik Na Longgu

aa st es KaraW SaaUkiNiMa

TungagTung SaaSaaVill

Madara Oro ah

Banoni SaaUlawa

KilokakaYs Saa Kokota , ' f v s z x " 0 h #

Do rio

BlablangaG Area reM a s

Blablanga Area reW a ia

LaghuSamas SaaAuluVil

ZabanaKia BauroBaroo

MaringeLel BauroHaunu

gao oro oP Ngga Fagani

MaringeTat KahuaMami

MaringeKma (678)

(298) (312) SantaCatal h eHolo ke Che (299)

(390)

(313) Aros iOneib

Bilur (389)

(301) Aros iTa wa t MonoFauro (175)

(315) Aros i

(392) (199) (196) (652)

rava v Urua (347) (197) (314)

(648) (649)

(681) (198)

(194) (319)

(195) (320)

(650) (321)

(651) Kahua (453) (95) Torau (282) (551)

(393) (550) (289) BauroPawaV

MonoAlu (490)

(83) (552) FaganiAguf (82)

Mono (548) i y % / &u

(302) (142) FaganiRihu

LungaLunga

(755) (18) SantaAna BabatanaKa (19)

(380) (549)

(452) Tawaroga

BabatanaLo

(53)

(381)

(339) TannaSouth (324) Babatana

(379) (674)

(673) Kwamera

(128) Vaghua

(431)

(265) Lenakel (692) BabatanaAv

(340) SyeErroman

(304)

Sengga (430)

(388) (473) Ura

Ririo (124) e ø 5 * $ o (58) Ara kiS outh

ai i Varis (328) (59)

(732) (172) Merei

ai iGhon Varis

(415) (247) Ambry mS out

BabatanaTu (570)

(700) (621) (22) (157) (686) Mota

Sisingga (23)

(414) (204) (21) PeteraraMa aku k Ha (189) (413)

(92) (246) PaameseSou

(334)

i n a s Nis (60)

(36) (173) Mwotlap (49) eh H pe nHa ha Ne

(37) (174) Raga 6 œ 7 8 . 3 Teop

(34) (569)

(663) SouthEfate (705) Taiof

(571)

(35) Ork o n

Solos (15)

(584) (345) (658) (402) Nguna Mbareke

(538) (300) (10)

(338) Na m a k ir

Marovo

(710) (318)

(75) (420)

(711) Na ti (316) Vangunu (636)

(458) (510)

(38) (699) Na ha v a q h nongga Gha (439) (491)

(416) a ) 1 2

(597) Ne s e (668) (427)

gele Ughe (531)

(646) Tape

Kubokota (607) AnejomAnei

Luqa (190) (564) (489)

(391) SakaoPortO (714) (207)

Lungga (454)

(741)

(382) (726) (432) Av a v a o a v Hoa (438)

(706)

(129) (434) Ne v e e i

u aghe Kus

(193)

(572) (429) Na m a n dke Nduk

(696) (346) (150)

(602) Puluwatese (294) Simbo

(336) PuloAnnan

Roviana

(154)

(335) Woleaian

Patpatar (657)

(212) (119) Satawalese (440)

Kuanua (171)

(296) Chuuke s e AK Siar

(610) (528)

(437) Mortlockes (618) (352)

Tanga (527)

(593) (32) Ca rolinia n

Kandas (745)

(544)

(730) (445) (577) Chuuke s e

MagoriSout (448) (467) (433) (132) Sonsoroles Motu

(753) (419) SaipanCaro (501) Vilirupu

(447) (106)

(293) PuloAnna

Lala (131)

(590) Woleai

Mekeo

(404) (603)

(655) Mokilese

Roro (444) (555)

(261) (523) (659) Pingilapes

Kuni (12) (520) (526) Ponapean

oura Do (557) (744) (421)

(305) (709) Kiribati ab di ba Ga (411) (712)

(395) (351)

(512) Kus aie ii ila Kiliv

(502)

(541) (514) Marshalles Misima

(295) (158) Na uru

(143) oban bua Do

(342) (516) Ne le m wa Wedau

(482) (515)

(594) Jawe

ap aiwa pa pa Ga Ca na la Ubir

(183) Iaai

Molima (441)

(280)

(723) Ne ngone i dio Dio (406) (244)

(409)

(185) De hu

uaw na wa Guma (112) (283)

(695)

(117) Rotuman

Maisin (297)

(720) (385) WesternFij (736)

Suau (474)

(16)

(412) Maori (141)

Saliba (522)

(140) (105) (374) Tahiti ueawa Auhela (508)

(281)

(17) (642) Penrhyn

Ali (111)

(436) (543) (505) Tuamotu

Wogeo (446)

(104) (734) (689) Manihiki

Kis (626)

(463) (361)

(205) (214) Rurutuan (558)

Kairiru (443) (546)

(31) (645) TahitianMo

Riwo (139)

(743) (643) Rarotongan a upulauK Kay

(499)

(284) (534)

(343) (332) Tahitianth

Sobei

(480) (725) (122) (644) Marquesan

(358) (627)

Tarpia (536) (383) Mangareva

(4) o e Kov

(249) (156) (359) Ha wa iia n

Maleu (384) (209) (600)

(250) RapanuiEas

Mengen

(661) Samoan

Biliau

(272)

(579) Bellona

Matukar (533)

(574) (562) (63) Tikopia ed ed ge da Ge

(539) (675)

(353) Emae

(387)

Megiar (163) (72)

(575) (292)

(401) Anuta (188)

Bilibil (13) Rennellese (394)

KaulongAuV (537)

(475) i y % / &u

(73) FutunaAniw

Sengseng (180)

(61) (481) (182)

(268) (218) IfiraMeleM m ra Ama

(585) (513) (703) Uv e a W e s t LamogaiMul (500)

(462)

(713) (116) (146) (181) FutunaEast Aria

(451) (7)

(309) (704) VaeakauTau Mouk

(20) e ø 5 * $ o

(624) (563) (647) Takuu u ami ib iS m ba Num

(67)

(422) (694) Tuvalu

Yabem (592) Sikaiana

Wampar

(484) (162) (264) Kapingamar PunanKelai

(749) (483) Nuk uoro

(715)

Bukat

(213) 6 œ 7 8(333) Luangiua. 3 a anUmaJu Kay

(529) (524) Pukapuka

Mori

(96) (680) Tokelau (271)

Wolio (702)

(270) Uv e a E a s t

Baree

(418) (682)

(746) (683) Tongan

BanggaiWdi (459)

(51) a ) Niue 1 2 (46)

Tonsea (177)

(521) (179)

(684) FijianBau (466)

Tontemboan (666)

(685) (708) Teanu

MunaKatobu (707)

(424) (739)

(423) (159) (654) Vano Wuna

(747)

(631) (98) Tanema

(691) Bonerate

(85) (26)

(203) (701) Buma

Popalia (518)

(84) (656)

(738) As umboa BolaangMon

(201) (442)

(202) (581) Tanimbili (568) oot loH Goronta

(50) (1)

(248) (748) Ne m ba o

Kaidipang (616)

(735)

(476) (160) (330) Seimat Bantik

(243)

(611)

(565) (435) Wuvulu

SangilSara

(637) (566) (426)

Figure S.5: Branch-specific, most frequent estimated changes. See Table S.5 for (373) more information,(153) cross-Lou (344) Sangir

(567) (317) Na una (94)

SangirTabu (729) (598)

(224) (608) (325) Leipon

TaeSToraja referenced with the code in parenthesis attached to each branch. (609) (329)

(354)

(488) (323) SivisaTita Makassar

(468)

(754) (93) Likum Maloh (266)

(200) Levei BugineseSo

(471)

(469) (108) (417) (97)

(470) Loniu

WesternMalayoPolynesian (532) Mussau ldJ s e n a v Ja d Ol (216)

(144) Kas iraIrah

Yogya (33) (718) (408)

(2) Gima n (255)

Ilokano (110)

(605) (485) (68)

(236) (465) Buli

u atas tCa ga a Dum (25)

(30) (120) Mor

Agta

(184) (724)

(478) Minyaifuin (41)

IsnegDibag

(239) (24)

(341) BigaMisool (254) (378)

taPamplo a P Atta (612) (56)

(90)

(633) (8) (619) 29 As adang dda Ga (460)

(226)

(262) (135) (742) Waropen

ItnegBinon

(219) (622) (134) Num for (88) (113)

KalingaGui

(477) (114) Marau

(498) Balangaw

(87) Amba iYa pen

aknyNo Kankanay

(497) (386)

(221) (187) (667)

(89) WindesiWan (227)

BontocGuin (152) (220)

(55) (229)

(45)

(251) NorthB a ba r (64) BontokGuin (164)

(519) (660) (148) Da we ra Da we (350) IfugaoBata

(232) (697) (588) Da i

(752)

IfugaoAmga (450) (396)

(679) (101) (115) TelaMasbua

(455)

(263) (601) IlongotKak (586)

(256) (606)

(561) (167) (192) Imroing

KakidugenI

(52)

(464) (77) Emplawas

(258) (161)

Pangasinan

(311) (241)

(493) (486)

(257) EastMasela Inibaloi

(554)

(222) Serili KallahanKe

(215)

(405) (410) (86) (587) Ce ntra lM a s (228) KallahanKa (191)

(127) (357)

(509) (722)

(126) (634) (360) SouthEastB (240) (535) (719) IfugaoBayn (461)

(74) (449) (43)

(235) WestDamar

Kapampanga (517) (698) (9) (620)

(238) (287)

(559) TaranganBa

SambalBoto

(278) (721) (6) (234)

(349) (76) UjirNAru (66)

Yami (503)

(39) (599) (211) Nga iborS Ar (78)

Imorod (578) (237)

(208) (676)

(168) (716) (604)

(303) (693) Ge s e r (138) (495)

(242)

Ivasay (186) (151)

(169)

(397)

(253) (149) (5) Watubela

Isamorong

(727) (44) (425)

(308) ElatKeiBes Itbayaten

(118) HituAm bon (617)

Iraralay (479) (166)

(288)

(206) SouAmanaTe

Babuyan (259)

(125) (496) Ama ha i

Itbayat (178) (11)

(99)

(407) Paulohi (377) IvatanBasc (217)

(62)

(362) Alune Madurese

(65) (629) MurnatenAl

TobaBatak

(327) (623)

(276) Bonfia i s Nia

(69)

(677)

(639) Werinama

(230)

(632) (750)

EngganoBan (372) BuruNamrol EngganoMal

(80) (731) (582) Soboyo

Lahanan (664) (525)

(355)

(245) (81) Mamboru

MelanauMuk

(615) (155) (326)

(28) Manggarai

(136) (100) eyahLong Keny (165) (737)

(595) (545) (492)

(688)

(130) (530) (428) Nga dha (147)

(504)

(556) Belait (740)

(511) Pondok

BerawanLon (275) Kodi (399)

Bintulu (671)

(79) (583)

(614)

(337)

(348) WejewaTana

KelabitBar (123)

(403) (286)

(48) (145) Bima TimugonMur

(231) (322)

(107) (307) Soa

(331)

(3) BunduDusun

(547) (310) (306)

(487) (613)

(47) (290) Savu

Idaan (591)

(494)

(398)

(121)

(57) (273) (542) Wanukaka

ayo y Ga (400)

(640)

(42)

(641) (576) (29)

(279) Ga ura Ngga u

Iban (356) EastSumban Moken (669)

(277) Baliledo

(560)

(91)

aianh m nCha ina Ha (457) Lamboya Chru (137)

(267) (670) (170) LioFloresT

PhanRangCh (653) Ende Minangkaba (14)

(506) (472) Na ge (109) (596)

(371)

MelayuBrun

(233) (690) (252)

(376) Kambera

MalayBahas (223)

(27)

(635) PalueNitun (365)

BanjareseM (728)

(102)

(628)

(630) Ana ka la ng (717)

Indonesian (665) (285)

(507) (638) (751) Sekar LowMalay (291) (540)

(573) (274)

(70) PulauArgun gan a Og (71)

(133) (625) RotiTerman Melayu (580)

(672)

(176) (589) Atoni

MelayuSara (260) (553)

(687) (456) Mambai eici Kerinc (368)

(367) TetunTerik

Tunjung Kemak

Maanyan Lamalerale

MerinaMala LamaholotI

Kadorih Sika

ayakNaju Nga k a y Da Kedang

Katingan

(662) (375) (40) Erai ayakBakat a k a B k a y Da

(103) Talur

Singhi

(210)

(225) Aputa i

Palauan (269)

(54)

(369)

(364) (363) Perai

anunoo Ha

(366)

(370)

(733) Tugun

SWPalawano Iliun

Banggi Serua

PalawanBat Nila

BatakPalaw Teun

TagbanwaAb Kis ar

TagbanwaKa Roma

BikolNagaC EastDamar

TagalogAnt Letinese

Mansaka Yamdena

Kalagan KeiTanimba

ebano bua Ce Selaru

kanonBis Akla KoiwaiIria

Surigaonon Cha morro

TausugJolo Lampung

Butuanon Komering

WarayWaray Sunda

iiayHiliga non RejangReja

Ilonggo TboliTagab

Mamanwa Tagabili

Bali KoronadalB

Sasak SaranganiB

Iranun BilaanKoro

Maranao BilaanSara

Binukid CiuliAta y a

ManoboSara SquliqAtay

ManoboKala Sediq

ManoboTigw Bunun

ManoboAtau Thao

ManoboAtad Favorlang

ManoboDiba Rukai

ManoboIlia Paiwan

ManoboWest Kanakanabu

WesternBuk Saaroa

SubanonSio Tsou

SubanunSin Puyuma

Inabaknon Kav alan

SamalSiasi Basai

Mapun Ce ntra lAmi

Bajo Siraya Yakan Saisiat Pazeh (p, v) a q A (o, ə) B i C a (r, d) m ʀ (ʀ, r) e n (p, b) t k (w, o) r s (i, e) s t (c, s) k p (u, i) u u (e, i) l m (b, p) ŋ ŋ (o, w) p r (i, u) q w (k, g) v i (j, s) n o others others others 0 0.1 0.2 0.3 0.4 0 0.075 0.150 0.225 0.300 0 0.075 0.150 0.225 0.300 Fraction of substitution errors Fraction of insertion errors Fraction of deletion errors

Figure S.6: Most common substitution errors, insertion errors, and deletion errors in the PAn reconstructions produced by our system. In (A), the first phoneme in each pair (x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. In (B), each phoneme corresponds to a phoneme present in the automatic reconstruction but not in the reference, and vice-versa in (C).

30