Subgrouping the -Alor-Pantar languages using systematic Bayesian inference

Gereon A. Kaiping1 and Marian Klamer1

1Leiden University Centre for Linguistics, Universiteit Leiden, NL

April 10, 2019

This paper refines the subgroupings of the 1. Introduction Timor-Alor-Pantar (TAP) family of , using a systematic Bayesian phy- The Timor-Alor-Pantar (TAP) languages are logenetics study. While recent work indi- a family of some 25 non-Austronesian or cates that the TAP family comprises a Timor “Papuan” languages spoken on the islands (T) branch and an Alor Pantar (AP) branch of Timor, Alor and Pantar, as well as on (Holton et al. 2012, Schapper et al. 2017), neighbouring smaller islets, located some the internal structure of the AP branch has 1,000 kilometres west of New Guinea (Holton proven to be a challenging issue, and earlier et al. 2012, Schapper et al. 2017, Holton & studies leave large gaps in our understand- Robinson 2017, Klamer 2017). The TAP fam- ing. Our Bayesian inference study is based ily is split into three branches: Bunak, the on an extensive set of TAP lexical data from group comprising the languages the online LexiRumah database (Kaiping & Makasae, Fataluku, and Oirata, and the Alor- Klamer 2018). Systematically testing differ- Pantar (AP) branch, which comprises the re- ent approaches to cognate coding, loan ex- maining languages. For an alphabetical list clusion, and explicit modelling choices, we of the TAP languages and their locations, see arrive at a subgrouping structure of the TAP Table 1 and Fig. 1 on the next pages. family that is based on features of the phylo- Makasae, Fataluku, and Bunak are spo- genies shared across the different analyses. ken on the island of Timor, the language Our TAP tree differs from all earlier propos- Oirata is spoken on the neighboring island of als by inferring the East-Alor subgroup as an Kisar, the languages Kafoa, Kui, Klon, Adang, early split-off from all other AP languages, Hamap, Kabola, Abui, Kamang, Kula, Wers- instead of the most deeply embedded sub- ing, and Sawila are spoken on Alor; Kaera, group inside that branch. Klamu, Teiwa, Western Pantar, Sar and most Blagar varieties are spoken on the island of Pantar, while Blagar-Pura and Reta are spo-

1 Language ISO Population Dialect Island Abbreviation Abui abz 17000 Fuimelang Alor Ab Petleng Alor Takalelang Alor Atimelang Alor Ulaga Alor Adang adn 7000 Lawahing Alor Ad Otvai Alor Blagar beu 10000 Bakalang Pantar Bl Bama Pantar Kulijahi Pantar Nule Pantar Tuntuli Pantar Warsalelang Pantar Pura Pura Bunak bfn 80000 Bobonaro Timor Maliana Timor Suai Timor Deing twe 1000 Deing Pantar Fataluku ddg 30000 Fataluku Timor Hamap hmu 1300 Moru Alor Kabola klz 3900 Monbang Alor Kaera jka 5500 Abangiwang Pantar Ke Kafoa kpu 1000 Probur Alor Kamang woi 6000 Kamang-Atoitaa Alor Km Kiraman kvd 1900 Kiraman Alor Klon kyo 5000 Bring Alor Kl Hopter Alor Kui kvd 1900 Labaing Alor Ki Kula tpg 5000 Lantoka Alor Makasae mkz 70000 Makasae Timor Klamu nec 1380 Klamu Pantar Nd Oirata oia 1220 Oirata Kisar Reta ret *2500 Reta Pura Teiwa twe 4000 Adiabang Pantar Tw Nule Pantar Lebang Pantar Sawila swt 3000 Sawila Alor Sw Wersing kvw 3700 Maritaing Alor We Taramana Alor Western Pantar lev 10300 Tubbe Pantar WP

Table 1: Languages of the TAP family with the island where they are spoken, and number of speakers. Speaker numbers are taken from Klamer (2017: Table 1), except for Reta (starred), which is taken from Willemsen (to appear).

2 ken on the island of Pura in the Straits be- both groups together form the TAP family tween Alor and Pantar. (Schapper et al. 2017). Most of the TAP languages have speaker If and how the TAP family relates to other communities ranging between 1 000 and families is examined by Holton & Robinson 10 000 speakers. The average number of (2014) who conclude that there is no lexi- speakers per language is 11 852 due to the cal evidence to support an affiliation with three big languages Bunak (80 000 speakers), any other family in the world, including the Fataluku (30 000), and Makasae (70 000), spo- Trans New Guinea family, in contrast to what ken on Timor. One of the smallest languages had been assumed previously (Wurm et al. is Klamu (200 speakers) (Holton 2004), see Ta- 1975, Ross 2005). ble 1. Robinson & Holton (2012) compare the The relatively small size of most of these tree of Holton et al. (2012), shown in Fig. 2, speaker communities and the limited geo- with a Bayesian phylogenetic inference tree, graphical area in which they live has resulted shown in Fig. 3. Both of these trees were cre- in a situation where traditionally neighbour- ated using data collected up to 2009. ing groups had much contact through cul- Over the last few years much additional tural exchanges and intermarriage. Con- comparative lexical data has been collected tact with Austronesian donor languages goes on the TAP languages, increasing the size back a long way, as evidenced by the Aus- of the sample from 12 to 40 TAP language tronesian loans that were borrowed into the varieties (i.e. dialects or languages). The AP proto-language before it split up, per- new word lists have been combined with the haps 3000 years ago (see Section 2.5; Holton earlier body of data and made available in et al. (2012)). The contact between TAP the LexiRumah database (Kaiping et al. 2019, and continues un- Kaiping & Klamer 2018). On the methodolog- til today and has resulted in Austronesian ical side, recent years have seen a significant loan words in the individual TAP languages. improvement of computational tools, models Three recent donor languages of Austrone- and algorithms for inferring language histo- sian loan words are Indonesian, the domi- ries from word lists. For example, there are nant national language of and/or now new methods for automatic cognate de- its basilect Alor Malay, Alorese, the lingua tection using state-of-the-art pairwise pho- franca used on (parts of) Pantar and West netic alignment algorithms (List 2012a,b, List Alor before the advent of Malay/Indonesian, & Greenhill & Tresoldi, et al. 2018, List et and Tetun, one of the national languages of al. 2017, Jäger et al. 2017) that reach nearly Timor Leste. 90% accuracy (B-Cubed F-score) in determin- Over the last 15 years, a body of work ing the cognate sets to be used in phyloge- on TAP languages has appeared, including netic analyses (Rama et al. 2018). Bayesian grammatical descriptions, typological com- phylogenetic inference research has empiri- parisons and historical comparative work. cally shown which models are useful for lex- Historical reconstructions used the tradi- ical data (Chang et al. 2015, Kolipakam et tional comparative method based on regular al. 2018) and many of these models have sound changes, first demonstrating the exis- been made accessible (Maurits et al. 2017) tence of an Alor Pantar group (Holton et al. for easy use with linguistic data that is con- 2012) and an East Timor group (Schapper et forming to cross-linguistic standards (Forkel al. 2012), followed by a demonstration that 2017, Forkel & List, et al. 2018). In addi-

3 ���°E ���°E Alorese Wersing Wersing Reta Kabola Kroku Adang Kula Teiwa Kamang Klamu Blagar Kui cluster Alorese Alorese Hamap Kaera Abui Suboo Sawila Kafoa Sar Reta Klon Blagar Alor Wersing Western Papuna Kiramang Pantar Kui N Pantar ���� Deing Austronesian

� km �� Timor-Alor-Pantar © Owen Edwards Edwards © Owen ���°E ���°E Wetar ���°E cluster Roma

Hresuk Leti Luang �°S Kisar Kisar Kedang see Alor-Pantar map Galolen Oirata Sika Habun Tetun Dadu'a Waimaha Dili Makasae Fataluku Lamaholot Tokodede Mambae Hewa cluster Kemak Makalero Tetun Naueti Kairui-Midiki Bunak Idaté Lakalei Language Family Austronesian Tetun Uab Meto Austronesian-based Creole cluster Timor-Alor-Pantar

��°S Kupang Malay N BRUNEI

East Rote

Dengka Helong � km �� ���� Dhao East-Central Rote Central Rote TIMOR-LESTE Dela- AUSTRALIA

Oenale Tii-Lole Edwards © Owen

Figure 1: Map of the Timor-Alor-Pantar languages

4 tion, a number of new models have been sug- L.C. Robinson, G. Holton / Language Dynamics and Change 2 (2012) 123–149 129 gested that might better match how lexical replacement takes place in reality (Bouck- aert & Robbeets 2017, Kelly & Nicholls 2016). The earlier reconstructions of the TAP family still leave large gaps in our under- standing of the linguistic history of the TAP family, especially in its subgrouping. Apply- ing the traditional comparative method in subgrouping is inherently subjective (we ex- Figure 4. Subgrouping of Alor-Pantar based plain this in Section 4), and is widely recog- on shared phonological innovations (H2012). Figure 2: Subgroupings tree based on shared nized as a fraught task, especially for closely With this backgroundphonological it is natural to ask innovations, why linguists choose accord- to rely on such related languages. In this paper we re- ad-hoc methodologying for to determining and reproduced linguistic subgroups. from Of course Holton the pri- mary answer is that, in many cases, the work of the comparative method remains address the subgrouping of the TAP family, to be completed,et so weal. do (2012) not have knowledge (Figure of regular HPhon2012). historical changes with a particular focus on the internal struc- which could be used to discern shared linguistic history. But even when we do have knowledgeThe of historical language sound changes, abbreviations it is not always easy to determine are ture of the AP subfamily, while making all subgroups with alisted high degree in of Table certainty. 1.Rather than proceeding in a neat hier- archical fashion, sound changes ofen cross-cut each other in a wave-like pattern the analytical steps involved as visible and that is not well-described by a tree. Tis is the case in the history of the Alor- explicit as possible. We reduce the subjec- Pantar languages. While it is possible to identify many phonological innovations in the history of Alor-Pantar, only two of those innovations occur in the same tiveness of the subgrouping task by applying subset of languages (*g > ʔ and *k > Ø / _#, both in Blagar and Adang). Each of the other phonological innovations delineates a distinct subset of languages. recently developed computational tools and Table 1 lists all of the sound changes which have occurred in at least two of the models. Our data is the now extensive lexi- daughter languages. Changes which are restricted to final (_#) or medial (V_V) positionL.C. are Robinson, followed G. Holton by a slash / Language and an Dynamics indication and ofChange which 2 (2012) position 123–149 they occur143 cal dataset of TAP languages that is publicly in. available in LexiRumah. We focus on the lex- H2012 use a subset of these changes to delineate four subgroups within the Alor-Pantar family (see Fig. 4). An Alor subgroup is defined by the merger of icon as a data type that is quantitatively well- *k and *q. Within the Alor subgroup, a West Alor subgroup is defined by the change *s > h, and an East Alor subgroup is defined by the two changes *b > understood and for which models of evolu- p and *s > t. Te former change is also shared with Kamang (Km); the latter tion exist. with Abui (Ab). So while it is tempting to expand this group, only Sawila (Sw) and Wersing (We) share both of these innovations, defining the subgroup we These computational models of lexical refer to as East Alor. Finally, within West Alor, a Straits subgroup is defined by evolution currently rely on cognate-coded word list data. Each language variety is rep- resented by its forms (words) that best rep- resent a list of pre-defined meanings or con- cepts. Forms are grouped in cognate classes according to common origin. An example for a list with the four given concepts ‘arm’, ‘ar- Figure 10. Bayesian MCMC consensus tree for lexical data (relaxed Dollo model). Figure 3: Non-ultrametric tree from row’, ‘bamboo’, and ‘to blaze; to shine’ in two Tis model is particularlyBayesian appropriate phylogenetic to linguistic data since inference it assumes that language varieties can be found in Table 2. innovations may arise only once, but may be lost multiple times independently (Pagel, 2009). Tewith majority a consensus stochastic tree for this Dollo model is model, shown in Fig. as 10, For this article, we use state-of-the-art with the pAP nodedescribed used as an outgroup in and (not shown) reprodruced to root the tree. from Te clade automatic cognate detection algorithms to credibility values listed below each node indicate the percentage of sampled trees which are compatibleRobinson with that & node. Holton Most of these(2012) values are(Figure either at or approximate the incremental process of se- near one hundred percent, indicating that this consensus tree is compatible with lecting cognate sets and identifying regular almost all of theRDol trees sampled 2012). in the analysis. The language abbrevi- Before drawingations any conclusions are listed from this intree, Table a word of 1. caution is in order. sound changes, see Section 2.2. We system- Te fact that the Bayesian tree has high clade credibility values should not be interpreted as evidence that it is somehow the “right” or “correct” tree. Rather, atically exclude loan words from our data the conclusion to be drawn is that additional searching is unlikely to reveal a tree set at various different levels, using different which better fits our data. With this caution, we can compare the Bayesian tree to the split graph (Fig. 5). Toa large extent the groupings in the Bayesian tree are quantitative methods of loan exclusion, and compatible with those in the split graph. First, Sawila (Sw) and Wersing (We)are shown to be closely related, a grouping which was also present in the classification we compare the results of each method. yielded similar results. In contrast to Dunn et al. (2011) we found a strict unidirectional model actually performed slightly better than the gamma covarion model (though still not as well as the Dollo model). 5 Concept Abui-Fuimelang Blagar-Bakalang English Gloss C Form C Form arm 1 tatang 1 atang arrow 2 kaike 3 bulit bamboo 4 mai 5 betuŋ 4 maijan to blaze; to shine 6 hiede 7 bolor 6 ede

Table 2: Example word list for two varieties and four concepts, with forms coded for cognacy (columns C). Cognate forms are given the same code and the same color.

The weighting of the subgrouping evi- ogy used in our study: we discuss the under- dence is done through explicit, stochastic lying data set in Section 2.1, the automatic models for the evolution of the lexicon. At cognate coding method in Section 2.2, and their core, these models assume that a mean- our strategy for dealing with borrowed lex- ing keeps being expressed by words derived ical items in Section 2.3. We then deal with from a particular root for some time, but that the inference procedure, describing the evo- a stochastic process removes existing associ- lutionary model in Section 2.4, the shape of ations between meanings and roots and cre- the tree (in terms of priors, clocks, and cali- ates new associations. In order to be com- brations) in Section 2.5, and methods for our putationally tractable, the models make dif- Bayesian inference in Sections 2.4 and 2.6. ferent assumptions about the independence This is summarized in Section 2.7, where we of the stochastic processes. For example, re-iterate the Base (B) model which we use they generally assume that the words ex- in this study, as well as the alternative mod- pressing different concepts evolve indepen- elling options we explore to check the ro- dently, which is a reasonable assumption for bustness of our results. The alternative mod- forms with separate meanings (eg. ‘to sit’ els involve different methods of coding Cog- and ‘sacrifice’), but a necessary simplifica- nates (C1, C2, C3, C4), treating Loans (L1, L2), tion for forms that express semantically sim- different evolutionary Models (M1, ML), and ilar meanings (eg. ‘to sit’ and ’to dwell’). We different Tree shapes (T1, T2, T5, T6) (see the explicitly compare several different state-of- summary in Table 3). the-art models that have been suggested for language evolution in the Bayesian phyloge- We provide these results in Section 3, sum- netics literature. marizing them in a tree that reflects the con- sensus between the different Bayesian poste- All our parameter values and the imple- rior samples. In Section 4 we compare our mentations of our methods in unambiguous tree with previous work, including replica- computer algorithms are provided open for tion studies (R1, R2, MR) using comparable inspection, and the robustness of all our re- data. Our tree differs from the earlier pro- sults is investigated under several alterna- posals in a number of ways. One difference tive models and/or parameter choices. is that our tree does not confirm the first or- The structure of the paper is as follows. In der split between a Timor and an Alor-Pantar Section 2 we describe the data and methodol- branch which was found in earlier studies,

6 and instead has the languages spoken on lutionary histories in the model, given the Timor attach to the TAP node with two dif- cognate-coded data, must be calculated. In ferent branches. We leave this difference for the following subsections we discuss each of future research. Instead, we focus on the in- these steps. ternal structure of the AP branch, where we The results of a Bayesian inference is a find the position of the East-Alor subgroup as posterior probability distribution of trees of- an early split-off from all the other languages ten characterized by a DensiTree visualiza- in this branch. We discuss these differences tion of trees (Bouckaert 2010, Bouckaert & and address the question how they can be ex- Heled 2014) sampled from the distribution, plained. In Section 5 we discuss our results or by a consensus tree summarizing proper- and check the reliability of our automatically ties of the trees in the distribution. generated tree structure by a qualitative in- For example, Fig. 4 shows the probabil- spection of the innovations that support the ity distribution corresponding to “80% cer- early split off of the East-Alor group. Sec- tainty that language A and language B are tion 6 contains final remarks and a conclu- more closely related to each other than to sion. Apart from filling in some extant gaps language C, and no other information” (cf. about the internal structure of the TAP fam- Fig. 4a), and visualizes it using DensiTree ily, the paper aims to demonstrate how com- (Fig. 4b) and a maximum clade credibility putational methods and tools can be consis- summary tree (Fig. 4c). All prior knowl- tently applied to infer the phylogenetics of a edge and all results of Bayesian phylogenet- group of languages, and, moreover, how they ics are in principle such quantitative cer- can assist the linguist in making the fraught tainty statements, generated as a long se- task of subgrouping more objective and less quence of trees, each with frequency accord- trying. ing to its certainty (Fig. 4d). Data and methods used to produce the results shown in this paper are available as 2.1. Data online supplementary material and through https://osf.io/gpnxm/?view_only= As mentioned above, the data for this study 004ff2889b1640ed96f2c77bb83ec6fa. is taken from the open-access LexiRumah database (Kaiping & Klamer 2018). The most recent version v2.1.2 (Kaiping et al. 2019) of 2. Data and Methods the database contains 128 language varieties (or “lects”), 40 of which are extant Timor- Running a Bayesian phylogenetic inference Alor-Pantar languages and 3 comparative procedure on lexical data has the following reconstructions of TAP proto-languages, requirements: (1) The lexical data must be namely proto-Timor-Alor-Pantar, proto- in a comparable format (phonetic or phone- Alor-Pantar and proto-East Timor-Bunak. mic), with clean meaning-form mappings, The forms of the proto-languages are not and (2) coded for cognacy; (3) loan words independent of the other data points (be- must be handled such that the phylogenetic cause some of them have been used in the signal is not dominated by the contact signal reconstruction procedure), so we did not of the borrowings; (4) a computational model include them in our analyses. The remaining of lexical evolution must be specified; and (5) 40 word lists of TAP languages have been the posterior probability of the possible evo- aggregated from various recent sources

7 and are not all equally long. Originally, their sources may have used different glossing conventions, but LexiRumah p(T) provides manually normalized Concep- 100% ticon mappings (List et al. 2016) for all its concepts. In this way, meanings that are glossed differently across different sources (e. g. the forms glossed as ‘saya’ by Keraf (1978) and as ‘I’ in Blust (1999)) are linked to the same underlying concept (in this case 0% Tree T ‘1sg; https://concepticon.clld.org/ parameters/1209’). Each lexeme in the ABC ACB BCA database is presented with a transcription as (a) Probability distribution of trees supplied by the original source, and a nor- malized, segmented phonetic transcription in IPA (using diacritics only in those cases where that level of detail is available in the original source). Some words occur in sets that are highly correlated in shape. For example, in a lan- guage with a decimal numeral system, the number words above ‘ten’ will typically have forms reflecting the morpheme for ‘ten’; and C A B when the form for ‘ten’ has been replaced in (b) Densitree visualization of the sample. The a language (by e. g. a borrowing), it is very node ages were jittered to better show the likely that the forms for ‘twenty’, ‘thirty’, distribution. etc., or the forms for ‘nineteen’, ‘eighteen’, C C etc., also change. So we excluded the 25 C B decimally composite numeral concepts from 1 B B our word lists. Some but not all of the TAP 0.795 A A languages have quinary numeral systems, so A that the numbers ‘six’ to ‘ten’ are multi- (c) Consensus tree with clade probabilities morphemic in some languages but not in oth- … ABC ABC BCA ABC ABC ABC ABC ACB ABC ABC ACB ABC ABC ABC ABC ABC ers. Therefore we kept the numeral concepts (d) Random posterior sample of trees up to ‘ten’ throughout. Figure 4: Various representations of the tree After removing 25 numeral concepts, each distribution “80% certainty that A TAP lect in our data is attested through and and B are more closely related a word list containing between 99 (Abui- to each other than to C, and no Petleng variety) and 576 (Makasae) forms other information” spanning 597 concepts. The mean number of forms for each TAP word lists is 461, rep- resenting on average 408 different concepts, placing the average synonym count at 1.128 forms per concept. Apart from the very short

8 word list of Abui-Petleng, mutual coverage from applying our standard cognate coding between any two word lists in the database method (see below) on the network. is at least 129 items. In most of our analyses, we use the Lex- Stat algorithm (List 2012a), which has been 2.2. Cognate coding shown to perform very well for the purpose of cognate coding (List et al. 2017), and is In order to get full control over the estab- easily available in the LingPy software pack- lishment of cognate classes and to investi- age for historical linguistics (List & Green- gate the robustness of results under differ- hill & Tresoldi, et al. 2018). LexStat requires ent methods, we ran an automatic cognate at least 100 shared concepts to work well detection algorithm on our data. (List 2014), so we expect the Abui-Petleng Automatic cognate detection algorithms wordlist, which is below this number, to not calculate a score of formal similarity or be coded particularly well. “cognate-ness” for each pair of word forms In oder to reduce the complexity and that have the same meaning. The cur- therefore the number of parameters it needs rently most established type of scoring func- to estimate, LexStat does not operate directly tions converts the phonetic or phonemic on the IPA-transcribed forms, but instead transcriptions of forms into broader sound uses transcriptions that follow a sound class classes, aligns them and calculates a distance model that groups similar sounds together score based on that alignment.1 This gives a into a small number of classes. The SCA network, in which the connections between (List 2012b) sound class model has been es- the forms (‘edges’ in the graph literature) tablished useful for automatic cognate de- are “shorter” (representing a stronger con- tection, but the ASJP sound classes from nection) for similar forms and “longer” (rep- the Automatic Similarity Judgement Pro- resenting weaker connection) for different gram database (Wichmann et al. 2016) might forms. Figure 5 shows the resulting network be a useful alternative, so we use ASJP- for the forms meaning ‘sacrifice’. Note that transcribed data (analysis C4) in addition to the length of the connections in the figure our baseline, which uses SCA sound classes. is only an approximation of the calculated LexStat (List 2012a) first groups together distances, made necessary by embedding the very similar forms that express the same graph in a 2-dimensional plane. The actual concept across different languages. Then value for the cognate-ness distance is instead it calculates systematic sound correspon- reflected in the colour and thickness of the dences by using these very similar forms to lines. bootstrap a table of scores for systematic A second part of the algorithm then tries sound correspondence between each pair of to find clusters in this weighted network. languages. To account for random similari- That is, it tries to find non-overlapping sets ties, these scores are normalized against ran- of forms that all have a short distance to dom correspondences obtained from com- each other, but have a long cognate-ness dis- paring the words for random concepts in the tance to words outside the set. The colour of word lists. The cognate-ness distance of two nodes in Fig. 5 indicates the classes resulting forms can then be thought of as the effec- tive number of changes (discounting system- 1Promising new scoring functions are currently being developed (Jäger & Sofroniev 2016, Jäger et al. 2017) atic sound changes and over-counting unex- but not yet practically available. pected correspondences, as derived from the

9 Abui-Ulaga: Bunak-Suai: mini kasa g r Wersing-Maritaing: Teiwa-Lebang: k rban Bunak-Bobonaro: derma kasa g r Makasae: Kula-Lantoka: kulu'ba: 'gwajam 'pajam Kaera-Abangiwang: Blagar-Pura: go wa samu amea Fataluku: Abui-Takalelang: Bunak-Bobonaro: ku'lupa dei.'pu kasa gum Teiwa-Lebang: Sawila: Bunak-Suai: sembahan p mb 'rian kasa gum Abui-Takalelang: Klon-Hopter: Reta: dei.'pü g 'luk g 'h l sadaka

Figure 5: Network of cognate-ness for the forms meaning ‘sacrifice’, calculated using LexStat with the SCA sound class model. Thicker, darker lines represent a smaller inferred distance between forms. Different colours of the forms indicate different inferred cognate sets, see Table 3

Language Form θ = 0.55 θ = 0.35 θ = 0.75 Abui-Takalelang dei.puŋ 1 1 1 Abui-Takalelang dei.pü 1 1 1 Abui-Ulaga mini 2 2 2 Blagar-Pura ameaŋ 8 9 2 Bunak-Bobonaro kasaʔ gumɛ 3 3 3 Bunak-Bobonaro kasaʔ gɔrɔ 3 4 3 Bunak-Suai kasaʔ gumɛ 3 3 3 Bunak-Suai kasaʔ gɔrɔ 3 4 3 Fataluku kulupa 4 5 4 Kaera-Abangiwang goŋ waŋ samuŋ 5 6 5 Klon-Hopter gɔluk gɔhɔl 6 7 3 Kula-Lantoka gwajam pajam 7 8 3 Makasae kuluba 4 5 4 Reta sadaka 9 10 6 Sawila pəmbərian 10 11 7 Teiwa-Lebang derma 11 12 8 Teiwa-Lebang sembahan 12 13 2 Wersing-Maritaing kɔrban 13 14 4

Table 3: Results of automatic cognate detection on the forms meaning ‘sacrifice’, for the net- works shown in Figs. 5, 6a and 6b

10 sis. However, for comparison we also present results generated from a threshold based on an approach that is more “splitting” (C1: θ = 0.35) and more “lumping” (C2: θ = 0.75). In Fig. 6, we show the same underlying data as in Fig. 5, but transformed into a non- weighted graph using those different thresh- old values. The resulting cognate classes are (a) θ = 0.35 reproduced in Table 3. Note that in Fig. 5, the four forms in different dialects of Bunak (left, dark blue) are grouped together in a single cognate class. With a threshold of θ = 0.35 (Fig. 6a), this class is split in two, whereas for θ = 0.75 (Fig. 6b) the Kula and Klon forms are grouped together with the Bunak forms. An alternative to LexStat is the Online PMI method published by Rama et al. (2017). (b) θ = 0.75 Where LexStat uses three passes over the whole dataset to find sound correspondences Figure 6: Networks of cognate-ness for the in the data and how systematic they are, forms meaning ‘sacrifice’, calcu- the Online PMI method updates its similar- lated using LexStat with the SCA ity scores contiuously (‘online’ in the ma- sound class model, clustered using chine learning jargon) by going through the infomap with different thresholds dataset in small batches and adapting the θ . The forms are the same as those scores used for future alignments after each in Fig. 5. such batch. It uses a statistical measure for the co-occurence of sound segments known as ‘Pointwise Mutual Information’ to gener- bootstrap step) to transform one form into ate these scores. the other, normalized by the chance for ran- OnlinePMI performed better than LexStat dom similarities and the length of the forms. in identifying cognate classes (Rama et al. The network of pairwise cognate-ness dis- 2017), but it gave worse results when re- tances is clustered into disjoint cognate constructing trees using Bayesian phyloge- classes using the infomap graph commu- netics on the bases of these classes (Rama nity structure detection algorithm (Rosvall et al. 2018).2 Compared to LexStat, the On- & Bergstrom 2008). This method has been line PMI method is better able to handle data shown to outperform other methods (List et with lower mutual coverage. This comes at al. 2017). We follow List et al. and trans- form the weighted graph into an unweighted 2It is not particularly surprising that an algorithm graph by connecting exactly those pairs of may perform better in one of these tasks, but worse forms that have a cognate-ness distance less in the other task. The measure used for assess- than than a threshold θ. ing the quality of cognate coding may, for example, barely punish a small systematic bias in the cog- In general, we use the threshold of θ = nate classes, which however is important when us- 0.55 that performed best in List et al.’s analy- ing these classes to reconstruct trees

11 the cost of a significant random component, All languages, however, have loans that re- making it harder to optimize OnlinePMI for sult from borrowing events that occurred so a particular application. recently that the loans entered the languages Due to these advantages and disadvan- as independent events, i.e. they were not in- tages, we chose LexStat as our Baseline herited from the ancestor language. For ex- method, but included Online PMI for a com- ample, most languages in LexiRumah express parative investigation (C3). In that anal- the concept ‘church’, a concept which be- ysis we cognate code out data using On- came widely known in the region only in the line PMI, with parameters α = 0.75, last 300 years, with forms like [geredʒa] or initial cutoff c = 0.5 and batch size [igredʒa], which are obvious recent Indone- m = 256. We use the implementa- sian viz. Portuguese loans. Such recent loans tion available from https://github.com/ do not carry a phylogenetic signal. Yet, the evolaemp/online_cognacy_ident in the presence of a similar word makes the donor version from commit 3b998ae. language and all the recipient languages ap- pear closer to each other than their ac- 2.3. Lexical borrowing tual genealogical relationship would suggest. Therefore, it makes sense to treat each of As mentioned above, the history of the these recently borrowed forms as an isolate Timor-Alor-Pantar languages is strongly in- cognate class, distinct from each other and fluenced by language contact, and multi- any other cognate class in the data. In our linguality is usual in the region (Klamer 2017, binary model of cognate evolution (see Sec- Holton & Klamer 2017). At present, standard tion 2.4), in which each pair of a cognate class Indonesian and local Malay have a strong in- and a meaning is coded separately as either fluence on the vocabulary of all the TAP lan- present or absent, this is functionally equiva- guages, and Tetun has influenced those spo- lent to only providing presence/absence fea- ken in Timor Leste. tures for non-loanword cognate classes, and Some studies have argued that the results coding all cognate classes for a concept ex- of phylogenetic analyses will be reasonably pressed using a recent loan as “absent”. We accurate even if 15% of the vocabulary con- do this as follows. stitute undetected loanwords Greenhill et al. LexiRumah contains word lists for Indone- (2009), Chang et al. (2015). Tree models can sian and different Tetun varieties. Indone- by their nature not encompass borrowing, sian and the locally used variety of Malay so that undetected loans and chance resem- are lexically very similar, and both are the blances in the data are problematic for them. source for the majority of recent loans in the Therefore, some studies exclude loans by TAP languages (Klamer 2017: 11-12). In or- representing a loan word that is connected der to exclude recent loans from our dataset, to a specific meaning as a ’missing’ or ’un- we therefore set all cognate classes to be “ab- known’ form. However, as Chang et al. (2015) sent” if a form appears to be in the same cog- point out, loans represent one case of ex- nate class as an Indonesian or Tetun form. actly the type of lexical replacement that To test the robustness of our trees against the phylogenetic inference models purport other possible treatments of loans, we then to model, so it would be ill-advised to treat compare the results of baseline inference un- them as missing or unknown data when ap- der loan exclusion (B) with an inference un- plying such models. der inclusion of all forms, including those

12 LexiRumah WoLD Borrowed score sugar the sugar 0.794 hour the hour 0.758 mosque the mosque 0.733 market the market 0.675 tobacco the tobacco 0.664 church the church 0.658 lamp the lamp or torch 0.649 candle the candle 0.627 banana the banana 0.622 priest_modern the priest 0.621 … yellow yellow 0.154 mud the mud 0.154 pestle the pestle 0.154 wound the wound or sore 0.152 to_buy to buy 0.152 fowl the fowl 0.152 rope the rope 0.152 salt the salt 0.152 hand the hand 0.151 left_side left 0.150

Table 4: Concepts in LexiRumah with borrowing scores of 0.15 or more according to WoLD. The full list is in the Appendix in appendix B.

13 cognate with the Indonesian form (L1). linguistic perspective, we might expect that We also compare our baseline inference ideally, a cognate class for a particular mean- B with another inference (L2) where we ex- ing should come into existence only once in clude even more potential loans by remov- history, and every attested instance of such ing all the forms from the data that express a cognate class should be explainable as de- “highly borrowable” concepts. Taking the rived from that single origin through inheri- threshold from Gray et al. (2009), we exclude tance or borrowing. In the phylogenetics lit- all concepts with a borrowed score of more erature, this is known as the Dollo assump- than 0.15 according to Haspelmath & Tad- tion (after Dollo (1893)). mor (2009). Table 4 shows a list of con- In linguistic reality, however, this idealisa- cepts we excluded for analysis L2 because tion does not hold. In addition to chance sim- they closely match a concept in WoLD with ilarities and borrowings, as discussed above, a borrowed score of 0.15 or more, the full list there are universal patterns in lexical se- is in appendix B. mantics (List & Greenhill & Anderson, et al. 2018, Zalizniak 2018) and universal tenden- 2.4. Model of language cies in how forms change their lexical se- evolution mantics over time (Traugott & Dasher 2002, Heine & Kuteva 2005). As a result, it is possi- The basic principle of Bayesian phylogenetic ble to find formally related words filling one inference is the application of Bayes’ Theo- and the same meaning slot in different lan- rem guages, even though their most recent com- P (Data | Tree) · P (Tree) mon ancestor may have used an unrelated P (Tree | Data) = P (Data) form for that meaning. This effect is known (1) as homoplasy in biology (Jäger & List 2018). Stochastic Dollo models are computation- to calculate a posterior probability distribu- ally intensive, because they cannot be com- tion over all possible language trees. Trees puted for part of a tree independent of every have a higher posterior probability when other part. Combining this with the effects they are more compatible with either (i) the of imperfections in cognate coding and ho- data (that is, trees with a higher likelihood moplasy implies that stochastic Dollo mod- P (Data | Tree)) or (ii) our knowledge about els cannot be usefully applied to lexical data how language phyologies look in general in most cases (but see Bowern & Atkinson (that is, trees with a higher prior probability (2012), Kolipakam et al. (2018), Michael et al. P (Tree)). We therefore need to (i) assume a (2015) for cases where they can be applied). stochastic model that describes the evolution Besides stochastic Dollo models, several of cognate classes on any given tree and (ii) alternative models of cognate class evolu- specify our prior knowledge about the shape tion are used in studies of Bayesian phylo- of the Timor-Alor-Pantar tree. genetics. Studies comparing different mod- els find that in most cases (Gray et al. 2009, Evolutionary model The stochastic Lee & Hasegawa 2011, Kolipakam et al. 2018) model will be evaluated many millions of the best performing model is Binary covarion times, so it needs to be computationally very (although sometimes, as in (Bowern & Atkin- simple, while still reflecting the evolution of son 2012) it is found that Binary covarion cognate classes as well as possible. From a performs second-best after a Dollo model).

14 1 present” can be relatively fast and frequent. (1+α)π π π0 0 1 A simplified example from our data to il- (D) hot hot lustrate some of the transitions in Fig. 7 absent present would be the following. The proto-TAP form *lamV ’walk’ was inherited in proto-AP as 1 *lam(ar), and a modern reflex of it is Klon lam (1+α)π π π1 0 1 ’walk’. Teiwa also has a word lam, but it has (E) s s s s (C) shifted its meaning to ’follow (a path/river)’, 1 and the word tewar is now the word used for απ0 (1+α)π0π1 ’walk’. Thus, the pair <*lam(ar)-’walk’> did (B) not undergo a transition in Klon, but it has cold cold become absent in Teiwa, thus changing from absent present (A) “cold present” to “cold absent”, cf. transi- 1 tion A in Fig. 7. απ1 (1+α)π0π1 However, the model does not only account for transitions between present and absent Figure 7: Possible transitions in the binary at a very low rate; it also contains two transi- covarion model of cognate evolu- tory states: “hot present” and “hot absent”. tion with their parameterized tran- Applied to our example, the pair <*lam(ar)- sition rates. Letters in brackets ’walk’> would go from “cold present” to “hot mark the transitions discussed in present” (Fig. 7, C) in a language when the the text. word is still in use but speakers also start to use another word as an alternative or syn- onym for it. If over time, this synonym re- In the Binary covarion model, the presence places the <*lam(ar)-’walk’> form, the latter or absence of every cognate-class–meaning will fall into disuse and will only be known by pair is assumed to evolve independently. The some (older) speakers as the archaic form for Binary covarion model allows the 8 transi- ’walk’. This change would illustrate a tran- tions shown in Fig. 7, with rates parameter- sition from “hot present” to “hot absent” ized as shown. A particular pair of a cog- (Fig. 7, D). The next stage might be that the nate class and a meaning may originally have word completely disappears, thus switching been present but be lost over time, or it was to the “cold absent” state (Fig. 7, E). Transi- originally absent but was innovated. Such tions between the intermediate states of “hot transitions between a cognate class-meaning present” and “hot absent” are assumed to oc- pair being present and absent are possible at cur at a much higher rate than any of the a very low (“cold”) rate, see transition A and other transitions. So, the most likely path for B in Fig. 7. the loss of a cognate class does not go from The hot/cold metaphor here comes from “present” to “absent” but from “present” to models of physical processes, where in- “hot present” to “hot absent” to “absent”. creasing the temparature de-stabilizes a sys- Similarly the most likely patch for an inno- tem (eg. melting, de-magnetizing); the vation would not be to simply go from “ab- covarion model’s, “cold absent” and “cold sent” to “present” but first to become “hot present” states are each very stable, whereas absent” (e.g. an innovative word introduced transitions between “hot absent” and “hot by a few speakers), then “hot present” (the

15 new word used in variation with the exist- Because applying the Dollo model is expected ing word), and then become “present”. In to be strongly influenced by undetected bor- the model, the switches between “hot” and rowings, we also perform another analysis “cold” states can reflect different speeds of (ML) using the pseudo-Dollo covarion model evolution of cognate-class–meaning pairs in on a version of the dataset where we re- different parts of the tree. moved the highly borrowable concepts with Because a model with transitions of dif- a borrowed score of more than 0.15 in WoLD ferent types and rates approaches linguis- (Haspelmath & Tadmor 2009), like we did for tic reality more than simpler models that as- analysis L2 (see Section 2.3 and Table 4). sume switches between present and absence of cognate sets, we choose the Binary covar- 2.5. Tree prior, clocks and ion model for our Baseline analysis (B), even calibrations though it is still a highly abstract idealisation of real language evolution. In addition to deciding on the evolutionary The technical details are as follows. The model, we also need to specify our prior transition rates are chosen such that the knowledge of the TAP tree. One type of prior marginal probability to be in a “present” knowledge involves the rate in which lan- state is π1 and the equilibrium probability for guages split from their immediate ancestor. “absent” is π0 = 1 − π1. The vertical tran- A uniform tree prior assumes that all the lan- sition rates between hot and cold states are guages in the tree split at a constant rate, and balanced such that the system is expected to that the internal nodes of the tree have ages be “hot” and “cold” each half of the time. that are uniformly distributed between the The switch rate s is inferred using a Γ prior tips and the root of the tree (i.e., that all our with parameters α = 0.05 and β = 10, subgroups are of the same age). The advan- the slower rate α is distributed uniformly tage of such a prior is that it is computation- between 10−4 and 1, and the underlying ally very simple, but the disadvantage is that present/absent frequencies π0 and π1 are in- it does not have any basis in how languages ferred assuming a flat Dirichlet prior. Rates evolve in reality (Barido-Sottani et al. 2018, are normalised to an expected 1 present/ab- du Plessis 2018). sent switch per time unit, and re-scaled with In biology, a well-established prior on the log-normally distributed clock rate, and a trees is the coalescent tree prior. It describes Γ-distributed rate depending on the concept. the possible ancestries of two genes in a large This Γ rate variation between concepts mixing population: Tracing the gene back in reflects that not all concepts change forms the family tree of different individuals from at equal speeds; instead, the replacement the population, when do the ancestries of the speeds between different concepts vary gene coalesce to a single gene in a single an- widely (Greenhill et al. 2010, 2017, Chang cestor? Due to the assumption of a large, et al. 2015). mixing population, this tree prior does not In addition, we compare our Baseline anal- appear particularly suited to our data. ysis with an alternative analysis using the A third type of tree prior is the Fossilised recently developed pseudo-Dollo covarion Birth-Death (FBD) prior, which assumes that model (M1) (Bouckaert & Robbeets 2017) that languages split from their ancestor at a con- combines a relaxed Dollo assumption with stant birth rate, and go extinct at a different, advantages of the Binary covarion model. but also constant death rate, while leaving

16 fossils behind to the present day with a low Consequently, we do include them in our chance. (This chance is 0 for our data, which baseline analysis (B), which is intended to re- does not contain independent evidence for flect our best knowledge, but run an infer- historical languages.) ence using only the tree calibration on Oirata Rama (2018) has compared the uniform, for comparison (T2). coalescent and FBD priors on Indo-European One observation that has been made is that data, and found that the the trees inferred the level of lexical and grammatical similar- using the FBD prior and the uniform prior ity within the AP sub-branch does not sup- both closely match the traditionally estab- port an age of more than several millennia lished Indo-European tree, while the coales- (Klamer 2017: 10). The reconstructed vocab- cent prior performs much worse. Rama con- ulary of proto-AP appears to contain Malayo- cludes that “any future phylogenetic exper- Polynesian (MP) loan words such as proto-AP iment should test uniform tree prior as a base- *baj ’pig’, cf. proto-MP *babuy; and proto- line before testing more parameter-rich pri- AP *bui ’betelnut, cf. proto-MP *buaq ’fruit’ ors such as FBD or coalescent priors.” (Rama (Holton et al. 2012). These ancient MP loans 2018). We therefore use the FBD prior tree are found across the Alor-Pantar branch, and for our baseline inference B and compare it they follow regular sound changes, which in- with a uniform prior tree in analysis T1. dicates that they were borrowed at the level Our prior knowledge about the TAP tree of proto-AP. We use them to approximately also contains information about the fact that date when proto-AP split from proto-TAP. certain subgroups split, and about times Proto MP has been dated at 4 000 years BP, when that happened. While dating the nodes (Pawley 2002), and speakers of MP languages in the tree is not our primary goal here, it are commonly assumed to have arrived in is known that including multiple calibration the Timor Alor Pantar area around 3 000 BP dates in a tree can influence its topology, that (Pawley 2005, Spriggs 2011). The MP borrow- is, change its subgrouping. Since we are in- ings in proto-AP would thus give the AP fam- terested in the subgrouping of TAP, we in- ily a maximum age of around 3 000 years. vestigate the effects of including calibration Human genetic studies support a connec- dates on subgrouping in the TAP tree. tion between populations of the Lesser Sun- The migration of speakers of the ances- das with Papuan populations of New Guinea tors of today’s Oirata speakers from east and Austronesian-Malayo-Polynesians from Timor to the adjacent island of Kisar, i. e. Asia (Lansing et al. 2011, Xu et al. 2012). The the date when Oirata split from its rela- admixture between Papuan and (’Melane- tives Makasae/Makalero and Fataluku, can sian’-)Asian is estimated to have begun about be dated with high confidence to the year 5 000 years BP in the western part of east- 1721 or shortly before that (Hägerdal 2012: ern Indonesia, decreasing to 3 000 years BP in 337). We therefore include a uniformly dis- the eastern part. This associates the Papuan- tributed calibration on the split that is di- Asian admixture with Austronesian-MP ex- rectly ancestral to Oirata in all our infer- pansion (Xu et al. 2012). This is the only ences permitting the 6 years until 1721, i. e date range with quantitative confidence, and the range 279–286 yBP. gives the confidence interval used in our Calibration dates for other splits in the analysis (given in the configuration file in tree are based on circumstantial evidence terms of the 90% confidence interval re- only, and thus should be taken with caution. quired by BEASTling).

17 The dating is consistent with the scarce ible with the data. There are, however, linguistic and archeological evidence for the 680 425 371 729 975 800 390 different tree contact time frame around 3 000–3 500 BP, so topologies for our 40 TAP languages, and we assume that proto-Timor-Alor-Pantar ex- only a very small proportion of these to- isted as a single language definitely no later gether have any relevant posterior probabil- than 3 000 yBP. ity. The number of trees is so immense that Unlike proto-AP, proto-TAP does not have even if the computation of a single posterior any reconstructible MP borrowings. There probability took only a single operation on a are significant lexical and grammatical dif- Gigahertz-computer, it would still take more ferences between the languages of the AP than 20 000 years to iterate through all trees branch and those of Timor (Holton et al. and find the best one. (2012: 115) and references cited there), sug- In order to avoid this feasibility prob- gesting a long period of separation for the lem, we resort to the Markov Chain Monte branches. We estimate that proto-TAP was Carlo (MCMC) sampling method and make spoken around 4 000 years ago (Bellwood use of the fact that similar trees have some- 1997: 123, Ross 2005: 42, Pawley 2005). what similar probabilities. Instead of draw- In summary, our data does not contain ing trees entirely at random and calculating many calibration points. We therefore ex- the their probabilities (a process known as pect that we do not have sufficient data to Monte Carlo sampling), the MCMC sampling distinguish between a strict clock, which im- method constructs a random walk through plies the same speed of evolution to be the the space of trees. It has a memory of one same throughout the whole tree, and a ran- previous state (which is the characteristic dom clock, which allows the rate of evolution property of a Markov chain) and uses small to vary between different parts of the tree by random modifications, such as swapping two assigning a different relative rate of evolu- subtrees, to explore the search space of all tion to each branch in the tree. trees in a more structured manner. If the In order to test this hypothesis, we infer modification leads to a more probable tree, a tree with a strict clock as baseline (B), and the algorithm accepts it and takes it as its compare this with two further inference se- new state. If the posterior probability of the tups (T5, T6) which use a local random clock. new state is worse than the posterior proba- In these runs, we take the clock rates to bility of the old state, there is still a chance be Gamma distributed. Given the weak ev- that the algorithm accepts it and moves to idence from our calibrations, we expect the it, otherwise it is rejected, and a different tree prior to play a major role in these analy- change away from the same original location ses, so we run one analysis (T5) using a Birth- is tried. In this way, trees that are the state Death prior (like B) and the other analysis of the Markov chain more often are exactly (T6) using a uniform tree prior (like T1). the more probable trees. In fact, the rejec- tion probability is calculated such that a tree 2.6. Inference procedure that is twice as probable will be the state of using MCMC the MCMC twice as often, so by tracking the state of the chain at regular intervals, we de- The prior, the likelihood, and the data rive a sample of trees that reflects the poste- together are in theory sufficient to know rior probability distribution. the posterior distribution of trees compat- The BEAST2 software package, in its most

18 [admin] basename = tap-phylogeny-B log_params = True [MCMC] chainlength = 5000000 [languages] # Do not include the Austronesian languages in the tree languages = buna1278-bobon, buna1278-suai, buna1278-malia, fata1247, maka1316, oira1263, abui1241-fuime, abui1241-petle, abui1241-takal, abui1241-ulaga, adan1251-otvai, adan1251-lawah, blag1240-bama, blag1240-kulij, blag1240-nule, pura1258, blag1240-tuntu, blag1240-warsa, atim1239, baka1276, dein1238, hama1240, kabo1247, kaer1234, kafo1240, kama1365, kira1248, kelo1247-bring, kelo1247-hopte, kuii1253, kula1280-lanto, nede1245, rett1240, sawi1256, teiw1235, teiw1235-adiab, teiw1235-nule, wers1238-taram, wers1238-marit, lamm1241-westp # Excluded: p-east2519, p-alor1249, p-timo1261 tree_prior = birthdeath [language_groups] # Define some language groups so we can refer to them later, eg. for # calibrations and monophyly east2519 = buna1278-bobon, buna1278-suai, buna1278-malia, fata1247, maka1316, oira1263 abui = abui1241-fuime, abui1241-petle, abui1241-takal, abui1241-ulaga adang = adan1251-otvai, adan1251-lawah teiwa = teiw1235, teiw1235-adiab, teiw1235-nule blagar = blag1240-bama, blag1240-kulij, blag1240-nule, pura1258, blag1240-tuntu, blag1240-warsa eastalor = wers1238-taram, wers1238-marit, sawi1256, kula1280-lanto, kama1365 nuclearalor = atim1239, abui, adang, baka1276, blagar, dein1238, hama1240, kabo1247, kaer1234, kafo1240, kira1248, kelo1247-bring, kelo1247-hopte, kuii1253, nede1245, rett1240, teiwa, lamm1241-westp alor1249 = nuclearalor, eastalor timo1261 = alor1249, east2519 [calibration] originate(oira1263) = 279-286 # Hagerdal (2012), p. 336: 1714–1721 alor1249 = normal(3137.5, 162.6) # Alor-Pantar split up after contact with AN; Genetic dating for earliest # admixture between ‘Asian’ and ‘PNG Hl’ peoples in the AP-Region gives this 90% # confidence interval (and is the only date range with quantitative confidence). # The dating is consistent with the scarce linguistic and archeological evidence # for the contact time frame around 3000–3500 BP. For a normal distribution, a # 90% confidence intervall of 535 means a standard deviation of 162.6 timo1261 = 3000-12000 # TAP does not have reconstructible AN borrowings. If Austronesians were # established in the Timor region by 3000 BP according to Spriggs, the split of # TAP must have happened before that date. # There are some strange starting effects going on that otherwise put the root # at 10^17yBP. Avoid those by putting an extreme high upper bound on the root. [clock default] # We don't have detailed calibrations scattered throughout the tree, so a strict # clock will be more useful: It assumes all branches change at roughly the same # speed. type = strict_with_prior # Reasonable rates should be between one change in 100 years and one change in # 10000 years. We take a (logarithmic) average of one change per 1000 years # (which has log 6.9). prior = lognormal(-6.9077552789821368, 2.3025850929940459) [model vocabulary] # Build a standard binary covarion model with rate variation model = covarion rate_variation = True frequencies = estimate data = ../data/autocode-infomap-055-donor-excluded/forms.csv file_format = cldf value_column = Cognateset_ID # Ignore numbers with a high chance of being multi-morphemic. exclusions = eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty_one, twenty_two, twenty_three, twenty_four, twenty_five, thirty, forty, fifty, sixty, seventy, eighty, ninety, one_hundred_and_twenty_three, two_hundred, one_hundred_thousand,

Figure 8: BEASTling configuration file for the baseline analysis (B)

19 recent version 2.5.0 (Bouckaert et al. 2014, and with clean meaning-form mapping; the Drummond & Bouckaert 2015, Bouckaert data is automatically cognate-coded using 2018a), provides an extensible implemen- LexStat to obtain similarity scores between tation of MCMC for tree inferences. The forms, from which disjoint cognate classes pseudo-Dollo covarion model is provided by are generated by running the informap clus- the Babel add-on for BEAST2, which we use tering algorithm on the graph of all pairs of in version 0.2.0 (Bouckaert 2019). forms with a distance less than θ = 0.55. Because of the vast number of possible ex- We treat all forms that appear to be cog- tensions and model choices, BEAST2 config- nate with Indonesian as recent loans and uration files tend to be unwieldy large and exclude them from our data set. We infer complex. In order to generate the config- trees from this data set (using the BEASTling uration files for the analyses, we therefore and BEAST2 software packages) by means of use the Beastling application, version 1.4.0 Bayesian phylogenetic inference, with a bi- (Maurits et al. 2018, 2017). Beastling allows nary covarion model with Γ rate variation the user to specify a phylogentic analysis for between different concepts and trees follow- linguistics in a concise format, taking rea- ing a Birth-Death tree prior, calibrated at sonable defaults for options not specified ex- the ancestors of Oirata (279–286 yBP), the plicitly. The Beastling configuration for our ancestor of all AP languages (3137.5±162.6 baseline analysis (B) can be seen in Fig. 8. All yBP) and the TAP root (before 3000 yBP). The Beastling configuration files, as well as the BEASTling configuaration file describing this BEAST2 xml configurations generated from analysis can be seen in Fig. 8. them, can be found in the Supplementary In- We then run and compare a further ten formation. different analyses, which are summarized in We run the inference for 5 000 000 steps, Table 5. logging results every 500 steps to avoid a high correlation between subsequent sam- ples. For the baseline analysis, we also ex- 3. Results ecute a run with 50 000 000 steps, sampled every 5 000 steps. The first samples reflect The complete results of the phylogenetic the starting point of the analysis and not inferences are available in the supplemen- its posterior distribution, so we discard the tary material. Using BEAST2’s MODEL- first 10% of samples, and we check using SELECTION 1.4.0 package (Bouckaert 2018b), Tracer 1.7.1 (Tracer: Posterior Summarisation we ran a stepping-stone analysis (Xie et al. 3 in Bayesian Phylogenetics. 2019, Rambaut et al. 2011, Lartillot & Philippe 2006) with 20 steps 2018) whether the chain appears to have con- for each analysis. This method estimates the verged and whether the resulting 9 000 sam- marginal likelihood of a model, or, equiva- ples are dissimilar enough of each other to lently, the evidence in favour of that model. reflect an effective sample size of at least 200. The marginal likelihoods of our analyses are summarized in Tables 6 and 7. The overview shows that the models T5 and T6, which use 2.7. Summary of model a relaxed clock model, perform best, with choices 3Comparison between 8, 20, 50, 70 and 100 steps in- To summarize, our baseline analysis (B) uses dicated that 20 steps are sufficient for a reasonable LexiRumah TAP data in a comparable format estimate.

20 ID Cognate Coding Loanword Handling Evolutionary Model Tree Model Clock model B LexStat-Infomap Exclude Indonesian and Binary covarion with Γ Birth-Death Strict clock (θ = 0.55) Tetun loans rate variation with 4 split calibrations C1 LexStat-Infomap (as B) (as B) (as B) (as B) (θ = 0.35) C2 LexStat-Infomap (as B) (as B) (as B) (as B) (θ = 0.70) C3 Online-PMI (as B) (as B) (as B) (as B) C4 LexStat-Infomap (as B) (as B) (as B) (as B) (θ = 0.55), but with ASJP instead of SCA as sound class model L1 (as B) None (loan inclusion) (as B) (as B) (as B) L2 (as B) Exclude Indonesian and (as B) (as B) (as B) Tetun loans and WoLD highly borrowable con- cepts M1 (as B) (as B) Binary pseudo-Dollo (as B) (as B) covarion with Γ rate variation ML (as B) Exclude Indonesian and Binary pseudo-Dollo (as B) (as B) Tetun loans and WoLD covarion with Γ rate highly borrowable con- variation cepts T1 (as B) (as B) (as B) Uniform with (as B) 4 split calibra- tions T2 (as B) (as B) (as B) Birth-Death (as B) with 1 split calibration T5 (as B) (as B) (as B) (as B) Correlated random clock with Γ rates per branch T6 (as B) (as B) (as B) Uniform with Correlated 4 split calibra- random clock tions with Γ rates per branch

Table 5: Summary of the phylogenetic analyses considered in this article. ID = names of mod- els of analysis as referred to in the text.

21 iue9 Figure al 6 Table tap-phylogeny-B.log -40250 -40200 -40150 -40100 -40050 L2 ML M1 B T2 T1 T6 log-lk T5 Marg. Analysis 0 : : dnei aoro htmodel. that of favour in idence ev- is data the that implies likelihood log- marginal directly higher a are – blocks comparable these of each analyses in The L2. with as second set the data the same the B, with as set analyses first data same the The lists block analyses. different the of log-likelihood marginal Estimated o h ape rm5 0 000 000 50 (B). analysis from baseline of samples steps the for likelihood logarithmic the of Trace 10000000 2E7 − − − − − − − − State 25826 25603 42044 41938 40632 40590 40311 40309 3E7 ∆ obest to − − 4E7 − − − 1735 1629 223 323 281 − 2 5E7 22 B,tepseirpoaiiytae(shown trace probability posterior the (B), prior tree best. performs Birth-Death otherwise, used the of tree instead Uniform a prior uses which strict T1, the models, Of clock likelihood. in difference minor 7 Table ssaesoni i.1.Ms fteclades the of Most 10. Fig. in shown are ysis outliers. as samples L2 effective and 59 samples R2 having effective and 10 MR than with less runs, the having of most for 50 un- and 20 are between analyses ranging lower, other much fortunately the all of for sizes runs sample effective shorter The 200. tar- of our get above is which samples, of dependent size sample effective an represent distri- remaining 9000 posterior = the 50000000/5000 The we that the suggests sure, in bution ‘burn-in’. be samples as of To 10% chain first the steps. discarding 000 000 1 converged model the within that suggests 9) Fig. in 4 l ain nlssse oas aerahda reached have also to seem analyses variant All o h 00000se aeieanalysis baseline step 000 000 50 the For h re ape yorBsln B anal- (B) Baseline our by sampled trees The u scretyotieorcmuainlcapac- computational ity. our size, outside currently sample is effective but sufficient all at show Repeating results likely would similar steps time given. 000 000 results 50 to the analyses in makes which confident steps, us 000 000 1 around after plateau : eetteeiec o htmodel. that for evidence the resent rep- directly likeli- not therefore log do hoods marginal estimated same the the with model done analyses All C1 L1 B C2 C4 C3 log-lk L2 Marg. Analysis sB u ihdfeetdata; different with but B, as ape ftechain the of samples − − − − − − − 53485 43842 41938 40608 40603 30004 25826 244 90 4 % in- · bunak-malia bunak-suai bunak-bobon oirata fata E Timor luku makasae Timor Alor Pantar wersing-marit wersing-taram sawila kula-lanto E Alor (EA) kamang abui-fuimelang abui-atim abui-takal abui-petle abui-ulaga kafoa kiraman Alor Pantar (AP) kui klon-bring klon-hopte kabola Alor adang-lawah adang-otvai hamap kaera reta Nuclear Alor Pantar (NAP) Straits blagar-pura blagar-kulij blagar-baka blagar-nule blagar-bama blagar-tuntu blagar-warsa Pantar klamu teiwa-leban teiwa-nule teiwa-adiab deing westpantar

Figure 10: Densitree plot of the trees in the posterior sample of the baseline analysis (B), with a burn-in of 10%.

23 (subgroups) show up in every tree of the sam- group of Alor languages (West and Central), ple. The most striking feature of this tree, (b) a group of Pantar and Straits languages. compared with the one that has been estab- The split of proto-AP into its two major sub- lished manually (Fig. 2, HPhon2012), is the groups is consistently dated later than the very clear second-order split immediately original separation of the Timor languages. below the proto-Alor-Pantar node between Zooming in on the East-Alor subgroup, the languages of east Alor (Kamang, Wersing, there Kamang is the first language to split Sawila, Kula) on one side, and the languages off, in more than 90% of the tree sample. of west Alor, Pantar and the Straits on the However, sometimes Kamang shows up as a other side. NAP language. This happens most strongly The overview of all inference runs, Figs. 11 in analysis C1 and M1, which generally have and 12, shows that the resulting tree topolo- a badly resolved topology. gies are vastly similar for all model choices. The split between east Alor and the other The two subgroups of NAP are robust in AP languages, in particular, appears in all of all analyses. In Alor, the C Alor subgroup in- them. The only exception is analysis T5. T5 cludes Abui, Kui and Kiraman in all analyses, is identical to T6 apart from the shape of the and also includes Kafoa in all analyses except tree prior, which suggests that it is the birth- for analysis C2. Most analyses group Klon death tree prior which enforces the particu- with Central Alor in the majority of posterior lar shape of T5, and that its shape is not so samples, but a minority of cases groups Klon much a feature of the data as it is a feature with the W Alor languages. This is the case of the highly parametric (and therefore mod- for more than one third of the posterior sam- ifyable) clock with a relatively strong tree ple in the analysis that included loans (L1). prior. In the analysis where all Indonesian loans Another interesting feature of our B tree and highly borrowable concepts from WoLD is that some of the earlier established sub- were excluded (L2), only 1.1% of the poste- groups do not appear as such in our tree. Ac- rior trees group Klon with W Alor. Kafoa is cording to our analysis, the most recent com- grouped as the closest relative of either Abui mon ancestor of the varieties of Adang is also or Klon. the ancestor of Hamap and Kabola, and not of The W Alor group of Adang, Hamap, and Blagar. This is consistent among all the anal- Kabola forms a distinct clade. Interestingly, yses we ran, but in the analysis of Holton et the variety Otvai (which LexiRumah lists as al 2012, Blagar forms a lowel-level subgroup dialect of Adang, and as a dialect with Adang (see Fig. 2, HPhon2012). of Kabola) always groups with Hamap before Overall, our analyses show a very consis- another dialect of Adang or Kabola, and the tent picture of the initial splits of the fam- Otvai-Hamap clade is inferred with a time ily. They indicate the following subgrouping depth that is similar to the time depth of lan- structure, also shown in Fig. 13. The first- guages in the tree. order splits are between Bunak, the E Timor languages, and the AP languages. The AP lan- Other subgroups that were inferred con- guages then split into two major branches: sistently in all our analyses are the Pantar (i) an East-Alor (EA) subgroup (Wersing, Saw- subgroup, comprising the Straits languages ila, Kula, Kamang) opposed to (ii) the Nu- (various dialects of Blagar together with Reta clear AP (NAP) languages comprising (a) a and Kaera) and the Pantar languages.

24 buna1278-malia buna1278-malia buna1278-suai buna1278-suai buna1278-bobon buna1278-bobon oira1263 oira1263 fata1247 fata1247 maka1316 maka1316 wers1238-marit wers1238-marit wers1238-taram wers1238-taram sawi1256 sawi1256 kula1280-lanto kula1280-lanto kama1365 kama1365 abui1241-fuime abui1241-fuime atim1239 atim1239 abui1241-takal abui1241-takal abui1241-petle abui1241-petle abui1241-ulaga abui1241-ulaga kafo1240 kafo1240 kira1248 kira1248 kuii1253 kuii1253 kelo1247-bring kelo1247-bring kelo1247-hopte kelo1247-hopte kabo1247 kabo1247 adan1251-lawah adan1251-lawah adan1251-otvai adan1251-otvai hama1240 hama1240 kaer1234 kaer1234 rett1240 rett1240 pura1258 pura1258 blag1240-kulij blag1240-kulij baka1276 baka1276 blag1240-nule blag1240-nule blag1240-bama blag1240-bama blag1240-tuntu blag1240-tuntu blag1240-warsa blag1240-warsa nede1245 nede1245 teiw1235 teiw1235 teiw1235-nule teiw1235-nule teiw1235-adiab teiw1235-adiab dein1238 dein1238 lamm1241-westp lamm1241-westp

(a) C1: More splitting cognate coding (b) C2: More lumping cognate coding

buna1278-malia buna1278-malia buna1278-suai buna1278-suai buna1278-bobon buna1278-bobon oira1263 oira1263 fata1247 fata1247 maka1316 maka1316 wers1238-marit wers1238-marit wers1238-taram wers1238-taram sawi1256 sawi1256 kula1280-lanto kula1280-lanto kama1365 kama1365 abui1241-fuime abui1241-fuime atim1239 atim1239 abui1241-takal abui1241-takal abui1241-petle abui1241-petle abui1241-ulaga abui1241-ulaga kafo1240 kafo1240 kira1248 kira1248 kuii1253 kuii1253 kelo1247-bring kelo1247-bring kelo1247-hopte kelo1247-hopte kabo1247 kabo1247 adan1251-lawah adan1251-lawah adan1251-otvai adan1251-otvai hama1240 hama1240 kaer1234 kaer1234 rett1240 rett1240 pura1258 pura1258 blag1240-kulij blag1240-kulij baka1276 baka1276 blag1240-nule blag1240-nule blag1240-bama blag1240-bama blag1240-tuntu blag1240-tuntu blag1240-warsa blag1240-warsa nede1245 nede1245 teiw1235 teiw1235 teiw1235-nule teiw1235-nule teiw1235-adiab teiw1235-adiab dein1238 dein1238 lamm1241-westp lamm1241-westp

(c) C3: Online-PMI as cognate coding method (d) C4: ASJP sound class based cognate coding

buna1278-malia buna1278-suai buna1278-bobon oira1263 fata1247 maka1316 wers1238-marit wers1238-taram sawi1256 kula1280-lanto kama1365 abui1241-fuime atim1239 abui1241-takal abui1241-petle abui1241-ulaga kafo1240 kira1248 kuii1253 kelo1247-bring kelo1247-hopte kabo1247 adan1251-lawah adan1251-otvai hama1240 kaer1234 rett1240 pura1258 blag1240-kulij baka1276 blag1240-nule blag1240-bama blag1240-tuntu blag1240-warsa nede1245 teiw1235 teiw1235-nule teiw1235-adiab dein1238 lamm1241-westp 25 (e) L1: No loan exclusion (f) L2: Zealous loan exclusion

Figure 11: Overview over the trees sampled by the baseline model on different datasets buna1278-malia buna1278-suai buna1278-bobon oira1263 fata1247 maka1316 wers1238-marit wers1238-taram sawi1256 kula1280-lanto kama1365 abui1241-fuime atim1239 abui1241-takal abui1241-petle abui1241-ulaga kafo1240 kira1248 kuii1253 kelo1247-bring kelo1247-hopte kabo1247 adan1251-lawah adan1251-otvai hama1240 kaer1234 rett1240 pura1258 blag1240-kulij baka1276 blag1240-nule blag1240-bama blag1240-tuntu blag1240-warsa nede1245 teiw1235 teiw1235-nule teiw1235-adiab dein1238 lamm1241-westp

(a) M1: pseudo-Dollo covarion model (b) ML: pseudo-Dollo covarion model with zeal- ous loan exclusion

buna1278-malia buna1278-suai buna1278-bobon oira1263 fata1247 maka1316 wers1238-marit wers1238-taram sawi1256 kula1280-lanto kama1365 abui1241-fuime atim1239 abui1241-takal abui1241-petle abui1241-ulaga kafo1240 kira1248 kuii1253 kelo1247-bring kelo1247-hopte kabo1247 adan1251-lawah adan1251-otvai hama1240 kaer1234 rett1240 pura1258 blag1240-kulij baka1276 blag1240-nule blag1240-bama blag1240-tuntu blag1240-warsa nede1245 teiw1235 teiw1235-nule teiw1235-adiab dein1238 lamm1241-westp

(c) T1: Uniform tree prior (d) T2: Calibration only on Oirata

buna1278-malia buna1278-malia buna1278-suai buna1278-suai buna1278-bobon buna1278-bobon oira1263 oira1263 fata1247 fata1247 maka1316 maka1316 wers1238-marit wers1238-marit wers1238-taram wers1238-taram sawi1256 sawi1256 kula1280-lanto kula1280-lanto kama1365 kama1365 abui1241-fuime abui1241-fuime atim1239 atim1239 abui1241-takal abui1241-takal abui1241-petle abui1241-petle abui1241-ulaga abui1241-ulaga kafo1240 kafo1240 kira1248 kira1248 kuii1253 kuii1253 kelo1247-bring kelo1247-bring kelo1247-hopte kelo1247-hopte kabo1247 kabo1247 adan1251-lawah adan1251-lawah adan1251-otvai adan1251-otvai hama1240 hama1240 kaer1234 kaer1234 rett1240 rett1240 pura1258 pura1258 blag1240-kulij blag1240-kulij baka1276 baka1276 blag1240-nule blag1240-nule blag1240-bama blag1240-bama blag1240-tuntu blag1240-tuntu blag1240-warsa blag1240-warsa nede1245 nede1245 teiw1235 teiw1235 teiw1235-nule teiw1235-nule teiw1235-adiab teiw1235-adiab dein1238 dein1238 lamm1241-westp 26 lamm1241-westp

(e) T5: Random local clock, birth-death tree prior (f) T6: Random local clock, uniform tree prior

Figure 12: Overview over the trees sampled using models differing from the baseline model Bunak Makasae 0.998 Oirata 0.998 1 Fataluku Kamang Kula 0.982 0.751 0.994 Sawila 0.999 Wersing Adang-Lawahing 0.988 Kabola 0.999 1 Hamap 0.997 Adang-Otvai 0.986 Klon Kiraman 0.676 0.997 Kui 0.987 0.992 Kafoa 0.986 Abui West Pantar Klamu 0.992 0.995 Teiwa 0.987 Kaera 0.996 Reta 0.989 Blagar

Figure 13: Maximum clade credibility consensus tree of all analyses.

27 4. Comparison with concern of that method, which is also explic- previous work itly mentioned in Holton et al. (2012), is the fact that the trees that are constructed on the basis of regular sound changes do not always Here we outline the differences between our converge, and we lack an explicit and ob- tree and the ones proposed earlier by Holton jective theory to weigh which regular sound et al. (2012), Robinson & Holton (2012), and changes count as evidence to construct sub- how these differences can be explained. groupings and which ones do not. Holton Holton et al. (2012) propose a family tree et al. (2012) observe eleven regular sound for AP (reproduced in Fig. 2, HPhon2012) changes in their body of lexical data (see Ta- that is based on shared phonological innova- ble 8), but most of these sound changes de- tions in twelve AP languages. The method- lineate language subgroups that are overlap- ology used involved manually coded cognate ping rather than discrete (e.g. *b > f groups classes, manual removal of loans, and man- Abui with Teiwa and Klamu, while *d > r ual investigation of regular sound changes. groups Abui with Kui). The linguist thus has This tree is also the one represented by Glot- to make a choice which of the sound changes tolog (Hammarström et al. 2018). represent actual stages in the evolution of The tree in Holton et al. (2012) shows rel- the family and hence define subgroups; and atively little resolution and no delineation which ones do not. Holton et al. (2012) chose of a Pantar subgroup, so that all Pantar lan- six out of the fifteen observed regular sound guages are considered as primary split-offs changes as innovations in the evolution of from proto-AP. Also, they have both Blagar as the family tree, and marked them besides the well as the East-Alor languages fairly deeply corresponding nodes in Fig. 2 (HPhon2012), embedded in the tree. They suggest an ori- dismissing the remaining nine. gin and original break-up of the AP family in Overlapping sound changes such as those East Pantar (Holton et al. 2012: 118). observed by Holton et al. (2012) can have In contrast, all our analyses consider the different causes. Sound changes may dif- E Alor languages a first order split off from fuse across languages that are in contact proto-AP and view the Pantar languages as with each other; or they may be unmarked forming a node within the higher Nuclear AP changes (in the sense that they are com- subgroup. In none of our analyses we found monly attested in genealogically and geo- evidence for a Straits-W Alor subgroup, com- graphically distant or unrelated languages) prising Adang, Blagar and Klon as proposed which can occur as parallel but indepen- in Holton et al. (2012: 112). Instead, Adang dent innovations in different subgroups. It and Blagar are each part of different lower- is likely that both sound change diffusion and level subgroups (W Alor vs. Straits), while parallel sound changes are obscuring the his- Klon generally goes in the C Alor group. torical signal in the AP family, so that regular How can these differences be explained? sound changes may not be the best evidence One possible explanation is that Holton et to use for defining its subgroups. If sound al. (2012) analysed phonological innovations, changes are used for that purpose, choices while we analysed lexical changes. In the have to be made as to which of them define classic Comparative Method, phonological subgroups and which are considered to be innovations are the primary evidence used the result of either diffusion or parallel inno- for subgrouping. However, a methodological vation. It is almost impossible to make such

28 Sound changes Languages *b > f Teiwa, Klamu, Abui (in Teiwa and Klamu only non- initially) *b > p Kamang, Sawila, Wersing *d > r Abui, Kui (in Kui only finally) *g > ʔ Blagar, Adang *k > ∅ Blagar, Adang *q > k W Pantar, Blagar, Adang, Klon, Kui, Abui, Kamang, Sawila, Wersing (Adang ʔ < k < *q) *s > h Blagar, Adang, Klon *s > t Abui, Sawila, Wersing *h > ∅ everywhere but Teiwa and W Pantar *m > ŋ / _# W Pantar, Blagar, Adang *n > ŋ / _# Klamu, Kaera, W Pantar, Blagar, Adang, Abui, Kamang, Sawila, Wersing *l > i / _# Teiwa, Kaera, Adang, Kamang *l > ∅ / _# Klamu, W Pantar, Abui *r > l / V_V Klamu, W Pantar, Adang, Kamang *r > i / _# Blagar, Kui, Abui

Table 8: Sound changes in Alor Pantar languages observed by Holton et al. (2012: 113). choices based on a limited set of synchronic group, rather than embedded in the tree as lexical data, so they ultimately become sub- it is in HPhon2012. The East-Alor languages jective choices of researchers. Kamang, Wersing and Sawila form a group The second earlier proposal by Robinson & that is deeply embedded inside the tree, as in Holton (2012) proposes the RDol2012 family HPhon2012. It is likely that the deep embed- tree for AP reproduced in Fig. 3. That tree ding of East-Alor is a consequence of the fact is based on lexical innovation in the same 12 that the tree was rooted by taking out the languages used in Holton et al. (2012). The proto-AP node, thus forcing the subgroup to dataset is available as CLDF-dataset (Forkel & be embedded much deeper. Tresoldi, et al. 2018). A set of proto-AP recon- We ran a replication (MR) of RH2012 on structed forms is included in the dataset. The their original dataset (including the proto- methodology used involved manual removal AP forms), and otherwise following our setup of loans, manually coded cognate classes, M1, which is closest to their configuration a computational Bayesian MCMC tree infer- but forces an ultrametric tree, in which all ence, using a stochastic Dollo model on an recent languages are contemporary, and re- unconstrained tree. Importantly, the proto- places the stochastic Dollo model with the AP node was used as an outgroup to root the pseudo-Dollo covarion model. We calibrated tree (Robinson & Holton 2012: 143). the age of proto-AP to follow a normal dis- In the RDol2012 tree, proto-AP splits into a tribution around 3137.5 yBP, but we did not Pantar subgroup and an Alor subgroup. Bla- constrain it to be the outgroup to the tree. gar is an early split-off from the Alor sub- The result is shown in Fig. 14a.

29 Ad

Kl

Ki

Bl

WP

Nd

Tw

Ke

Ab

Km

We

Sw

(a) Densitree visualization of our replication (a) Densitree visualization of our replication study MR, using the same model as M1 on the study R2, using the same model as B on the data from Robinson & Holton (2012). data from Robinson & Holton (2012). Km Ab 1 Sw 0.712 Km 1 We 0.985 We 1 pAP 1 Sw Ab 1 Ad Ad 0.757 0.999 Kl 1 Kl 0.883 0.907 Ki Ki 0.998 0.784 Bl Bl WP 1 WP 0.999 0.895 Nd 0.587 Nd 1 Tw 1 Ke 1 Ke 0.994 Tw (b) Summary tree of the same study MR. (b) Summary tree of the same study R2.

Figure 14: Posterior distributions of our Figure 15: Posterior distributions of our replication study MR replication study R2

30 While there are similarities between ing our baseline binary covarion model (B). RDol2012 and our replication MR, there are This posterior sample also shows various also some major differences. One difference differences to RDol2012, such as (i) the E concerns the position of Klon (going with Alor languages as an early subgroup of the Kiraman rather than with Adang in the ma- AP branch, sometimes together with Abui, jority). A major difference is the position (ii) the position of Blagar inside the Pantar of the East-Alor subgroup as a primary split subgroup instead of the Alor subgroup, and from the top node, while it was embedded (iii) the clustering of Klon with Kui instead of inside the tree in RDol2012. Adang. We suggest that the cognate classes This is due to the fact that, unlike used in RDol2012 may have been a major RDol2012, we did not force proto-AP to be the reason for the differences. outgroup. Doing that would result in more In sum, the tree in RDol2012 and our repli- subgroupings in especially the Alor side of cations of it (MR and R2) as well as our base- the tree. However, positing proto-AP as the line tree B agree in recognizing a subgroup outgroup implies that it is an unbiased re- of Pantar languages. Major differences again construction. Any minor biases in the proto- concern the position of the East-Alor sub- AP reconstruction are exacerbated by us- group as deeply embedded vs. early split off, ing it as an independent data source of par- as well as the position of Blagar as going with ticular importance. The fact that proto-AP the Alor group vs. the Pantar group. To con- ends up as a sister language to Abui and the clude, the differences between the manually other nuclear Alor-Pantar languages in our constructed HPhon2012 tree, the RDol2012 MR analysis (see Fig. 14a) suggests that the tree and our baseline tree B mainly concern reconstructed proto-AP forms do not take (i) the position of the East-Alor group deeply the East-Alor languages into account enough, inside the tree vs. early split off, and (ii) the and that a tree with that reconstructed lan- position of Blagar as grouping with the lan- guage as root fails to address some properties guages of Alor or with the languages of Pan- of the data. The fact that the Abui-East-Alor- tar. clade in Fig. 3 has a support of only 0.4 may The tree in HPhon2012 has less resolution be a further indication of that issue; note that than all the other trees, for example by not in contrast all clades in Fig. 14b have a sup- recognizing a subgroup comprising the Pan- port value of more than 0.75 (except for the tar languages. This lack of resolution can be Pantar clade, where the affiliation of West- explained by the confusing signal of overlap- ern Pantar is unclear). ping sound changes, which also suggests a In general it is problematic to use the same possible point of contention of that tree. The data for both the inference and the recon- differences between the RDol2012 tree and struction of a phylogenetic tree. Even when our baseline tree B, as well as our replica- included only for visualization (such as in our tions of RDol2012 are likely largely due to (i) replication MR), reconstructed forms can in- methodological issues of forcing the proto- troduce biases and should be handled with language as outgroup and using the same care. data for both tree inference and reconstruc- We consider a second replication of tion, with (ii) a smaller contribution from the RDol2012, which we denote R2, again us- quality and breadth of the underlying data, ing their data (now with proto-AP removed with fewer language varieties and fewer con- from the dataset) and codings, but now us- cepts per variety than the data used here.

31 5. Discussion nate classes showing this pattern are ’bat’ and ’sun’ (see the data in Kaiping et al. Our analysis shows a very strong signal of (2019)). subgrouping of the Timor-Alor-Pantar fam- In sum, the primary split observed in all ily. The first-order splits between Bunak and our automatic inferences is supported by a all other Timor-Alor-Pantar languages is a number of innovations that are established recurrent feature in nearly all of our sampled independently, using the classic comparative trees, and the AP languages are consistently method. inferred with an early split between East- Early split-offs are often taken to reflect Alor languages and Nuclear-AP languages. the greatest age, so that the location of This split was not found in any of the earlier such a split may point to an area where the proposals. It is supported by regular innova- proto-language began to diversify, c.f. Sapir tions, of which we will mention three types in (1916)’s “centre of gravity principle”. An the following (there may be more). (Example early East-Alor primary subgroup of AP may cognate sets are provided in appendix A.) suggest that the proto-AP speakers first ar- Proto-Nuclear Alor Pantar saw a lexical in- rived on the (south-)eastern coast of Alor and novation, adding a lexeme *-om ’inside’ to spread westwards over the course of sub- the expression used for ’(be) inside’, while sequent centuries. This scenario contrasts proto-East-Alor retained the original form with earlier proposals where the homeland reflecting proto-TAP *mi ’be in, at’; which is of AP was hypothesized to be on Pantar, with also found in reconstructions for the Timor a subsequent direction of movement of the branches (Klamer 2018), see Fig. 16. 5 languages from west to east (Holton et al. Here we mention two phonological inno- 2012), or where the homeland was on an is- vations that support the split. Proto-TAP land in the Straits between Alor and Pan- intervocalic *d changed into a liquid *r in tar (Robinson & Holton 2012: 143), with lan- proto-Nuclear AP, but was retained in proto- guages moving west to Pantar, and east to East-Alor; see Fig. 17, where this is illustrated Alor. A phylogeographic inference, which for proto-TAP *hada ‘fire’. (Other cognate takes the cultural adaptation necessary to classes showing this pattern are ’sugarcane’, move between different biomes (eg. from ’tongue’ and ’new’; see the data in Kaiping et coast to mountains) into account, may be al. (2019)). able to give deeper insight into the migration The opposite pattern of innovation cum history of the region. retention is found for proto-TAP intervocalic Within the East-Alor group, the languages *b. This obstruent became voiceless *p in Kula, Sawila and Wersing are grouped to- proto-East-Alor but was retained in proto- gether in every tree sampled in every anal- Nuclear AP, see Fig. 18, where this is illus- ysis, which gives conclusive support for this trated for proto-TAP *habi ’fish’. Other cog- subgroup. In addition, most analyses include Kamang in the group. We hypothesize that 5Note that the figures in this section assume a proto- Kamang is an East-Alor language by origin, Timor node (using the reconstructions of Schapper but may have had significant contact with et al. (2017)), while the node is not supported by one or more Central Alor languages (Abui is our analyses. The observation of innovations that currently its big neighbouring language), so separate nuclear Alor-Pantar from East-Alor lan- guages is however independent of the existence of that it appears more similar to the Central this node. Alor languages now. An opposite hypothe-

32 Figure 16: History of expressions for ’be in(side)’ in the TAP family.

Figure 17: History of reconstructions for ’fire’ in the TAP family.

Figure 18: History of reconstructions for ’fish’ in the TAP family.

33 sis would assume that Kamang belongs to the east of Alor, followed by a subsequent west- Nuclear AP branch but had contact to East- ward movement of the languages. Alor languages and thus became more sim- In addition, the data suggests an early ilar to them. However, because Kamang is split between Bunak and the other Timor- classified as an East-Alor language by 90% of Alor-Pantar languages. This split is not in- the trees in analysis ML, which is our most ferred with as much consistency as the nu- zealous effort in reducing the influence of clear AP/East-Alor split and warrants further loans, we consider the first hypothesis more investigation. likely than its opposite. The historical connections inferred for the The Nuclear AP languages split into an TAP languages depend on both the granular- Alor and Pantar group. The Alor group splits ity of data and the methodological choices of into a Central-Alor subgroup, formed of Abui, the researchers. We have seen that in clas- Kui, Kiraman, Kafoa and Klon, and a W Alor sical models of reconstruction the (subjec- group. The Pantar group splits into a Pantar tive) weighting of conflicting sound change and a Straits group. Our inferences suggest evidence determines the subgrouping, while that, contrary to earlier findings, there is in Bayesian inference, it is the choice of no separate Straits-West-Alor group. The af- evolutionary model and rooting methodol- filiation of Western Pantar remains unclear, ogy (including implicit rooting). It is impor- and the language appears to be a linchpin tant to take these methodological consider- in several inferences. This suggests that a ations into account when the inferred ge- systematic comparative study of the Western nealogies are used as windows on the linguis- Pantar language might be helpful in also dis- tic past, and even more so when they are cerning the immediate subgrouping struc- combined with evidence to reconstruct the ture of the Alor-Pantar Straits subfamily. non-linguistic past.

6. Conclusion Supplementary Material Data and methods used to produce the re- In conclusion, our lexicon-based analysis in- sults shown in this paper are available as dicates the subgrouping structure of the online supplementary material at Timor-Alor-Pantar language family shown in https://osf.io/gpnxm/?view_only= Fig. 19. 004ff2889b1640ed96f2c77bb83ec6fa. The major features of this tree are consis- tently inferred using various state-of-the-art methods of cognate detection and tree infer- References ence. The newly found early split of the Alor- Pantar branch into an East-Alor subgroup Baird, Louise. 2008. A grammar of Klon: A non- and a Nuclear Alor-Pantar subgroup robustly Austronesian language of Alor, Indonesia (Pacific Linguistics 596). OCLC: 278468217. Canberra: appears from all our inferences as well as Pacific Linguistics, Research School of Pacific from the replication of an earlier Bayesian and Asian Studies, Australian National Univer- study which was based on a smaller data set. sity. 242 pp. This early split can be supported by lexical and phonological innovations. It suggests that the Alor-Pantar speakers arrived in the

34 TAP

AP E Timor

Nuclear AP E Alor

Pantar-Straits Alor

Pantar W Alor C Alor

Straits Adang-Otvai Kafoa Abui Makasae etPantar West Teiwa Klamu Blagar Reta Kaera Hamap Adang-Lawahing Klon Kui Kamang Kula Sawila Fataluku Oirata Bunak Kabola Kiraman Wersing

Figure 19: Subgrouping structure of the TAP language family.

35 Baird, Louise. 2017. Kafoa. In Antoinette Schap- / / doi . org / 10 . 1371 / journal . pcbi . per (ed.), The Papuan Languages of Timor, Alor 1003537. and Pantar: Volume 2, 55–108. Berlin / New Bouckaert, Remco & Robbeets, Martine. 2017. York: De Gruyter Mouton. Pseudo Dollo models for the evolution of bi- Barido-Sottani, Joëlle & Bošková, Veronika & nary characters along a tree. bioRxiv. 207571. Plessis, Louis Du & Kühnert, Denise & Mag- https://doi.org/10.1101/207571. nus, Carsten & Mitov, Venelin & Müller, Nicola Bowern, Claire & Atkinson, Quentin. 2012. Com- F. & PečErska, Jūlija & Rasmussen, David A. & putational phylogenetics and the internal Zhang, Chi & Drummond, Alexei J. & Heath, structure of Pama-Nyungan. Language 88(4). Tracy A. & Pybus, Oliver G. & Vaughan, Tim- 817–845. https://doi.org/10.1353/lan. othy G. & Stadler, Tanja. 2018. Taming the 2012.0081. BEAST—A Community Teaching Material Re- Chang, Will & Cathcart, Chundra & Hall, David & source for BEAST 2. Systematic Biology 67(1). Garrett, Andrew. 2015. Ancestry-constrained 170–174. https : / / doi . org / 10 . 1093 / phylogenetic analysis supports the Indo- sysbio/syx060. European steppe hypothesis. Language 91(1). Bellwood, Peter. 1997. The prehistory of the Indo- 194–244. https://doi.org/10.1353/lan. Pacific archipelago. 2nd. Honolulu: University of 2015.0005. Hawai’i Press. Dollo, Louis. 1893. The laws of evolution. Bull. Soc. Blust, Robert. 1999. Proto-Austronesian. In Bel. Geol. Paleontol 7. 164–166. Robert Blust & Russell D. Gray & Simon J. Drummond, Alexei J. & Bouckaert, Remco R. Greenhill & Cordelia Nickelsen (eds.), Aus- 2015. Bayesian Evolutionary Analysis with BEAST tronesian Basic Vocabulary Database. https : 2. Cambridge University Press. 263 pp. / / abvd . shh . mpg . de / austronesian / du Plessis, Louis. 2018. FBD-Tutorial: Dating Species language.php?id=280 (28 January, 2019). Divergences with the Fossilized Birth-Death Pro- Bouckaert, Remco. 2010. DensiTree: making cess. Taming the BEAST. https : / / github . sense of sets of phylogenetic trees. Bioinfor- com / Taming - the - BEAST / FBD - tutorial matics 26(10). 1372–1373. https://doi.org/ (17 July, 2018). 10.1093/bioinformatics/btq110. Forkel, Robert. 2017. CLDF: Cross-linguistic Bouckaert, Remco. 2018a. BEAST 2.5.0. Ver- Data Formats. Zenodo. Version 1.0rc1 DOI sion 2.5.0. https://github.com/CompEvol/ 10.5281/zenodo.835502. https : / / doi . beast2/releases/tag/v2.5.0. org/https://doi.org/10.5281/zenodo. Bouckaert, Remco. 2018b. BEAST2-Dev/Model- 835502. Selection. BEAST2-Dev. https : / / github . Forkel, Robert & List, Johann-Mattis & Greenhill, com / BEAST2 - Dev / model - selection Simon J. & Rzymski, Christoph & Bank, Sebas- (15 February, 2019). tian & Cysouw, Michael & Hammarström, Har- Bouckaert, Remco. 2019. Babel: BEAST Analy- ald & Haspelmath, Martin & Kaiping, Gereon sis Backing Effective Linguistics. Version 0.2.0. A. & Gray, Russell D. 2018. Cross-Linguistic https://github.com/rbouckaert/Babel Data Formats, advancing data sharing and re- (13 February, 2019). use in comparative linguistics. Scientific Data 5. Bouckaert, Remco & Heled, Joseph. 2014. Den- 180205. https://doi.org/10.1038/sdata. siTree 2: Seeing Trees Through the Forest. 2018.205. bioRxiv. 012401. https://doi.org/10.1101/ Forkel, Robert & Tresoldi, Tiago & Greenhill, Si- 012401. mon J. 2018. Lexibank/Robinsonap: Internal Clas- Bouckaert, Remco & Heled, Joseph & Kühnert, sification of the Alor-Pantar Language Family Us- Denise & Vaughan, Tim & Wu, Chieh-Hsi & Xie, ing Computational Methods Applied to the Lexicon. Dong & Suchard, Marc A. & Rambaut, Andrew https : / / doi . org / 10 . 5281 / zenodo . & Drummond, Alexei J. 2014. BEAST 2: A Soft- 1312828. ware Platform for Bayesian Evolutionary Anal- Gray, R. D. & Drummond, A. J. & Greenhill, S. J. ysis. PLoS Comput Biol 10(4). e1003537. https: 2009. Language phylogenies reveal expansion

36 pulses and pauses in Pacific settlement. Science Holton, Gary. 2014. Western Pantar. In Antoinette 323(5913). 479–483. https://doi.org/10. Schapper (ed.), Papuan languages of Timor, Alor 1126/science.1166858. and Pantar: Sketch grammars, vol. 1, 23–96. Greenhill, Simon J. & Atkinson, Q. D. & Meade, A. Berlin: Mouton de Gruyter. & Gray, R. D. 2010. The shape and tempo of lan- Holton, Gary & Klamer, Marian. 2017. The Papuan guage evolution. Proceedings of the Royal Society languages of East Nusantara and the Bird’s of London B: Biological Sciences 277(1693). 2443– Head. In Bill Palmer (ed.), The Languages and 2450. https://doi.org/10.1098/rspb. Linguistics of the New Guinea Area. 569–640. 2010.0051. Berlin/New York: Mouton de Gruyter. Greenhill, Simon J. & Currie, Thomas E. & Gray, Holton, Gary & Klamer, Marian & Kratochvíl, Russell D. 2009. Does horizontal transmission František & Robinson, Laura C. & Schapper, invalidate cultural phylogenies? Proceedings of Antoinette. 2012. The historical relations of the Royal Society of London B: Biological Sciences the Papuan languages of Alor and Pantar. 276(1665). 2299–2306. https://doi.org/10. Oceanic Linguistics 51(1). 86–122. https : / / 1098/rspb.2008.1944. www.jstor.org/stable/23321848 (25 Jan- Greenhill, Simon J. & Wu, Chieh-Hsi & Hua, Xia uary, 2019). & Dunn, Michael & Levinson, Stephen C. & Holton, Gary & Robinson, Laura C. 2014. The lin- Gray, Russell D. 2017. Evolutionary dynam- guistic position of the Timor-Alor-Pantar lan- ics of language systems. Proceedings of the Na- guages. In Marian Klamer (ed.), The Alor-Pantar tional Academy of Sciences 114(42). E8822–E8829. languages: History and typology, 155–198. Berlin: https : / / doi . org / 10 . 1073 / pnas . Language Science Press. 1700388114. Holton, Gary & Robinson, Laura C. 2017. The lin- Haan, Johnson Welem. 2001. The grammar of guistic position of the Timor-Alor-Pantar lan- Adang: A Papuan language spoken on the island of guages. In Marian Klamer (ed.), The Alor-Pantar Alor - Indonesia. Sydney: Uni- languages: History and typology (2nd ed.) Sec- versity of Sydney PhD Thesis. http://www- ond edition., 155–198. Berlin: Language Sci- personal.arts.usyd.edu.au/jansimps/ ence Press. haan/. Huber, Juliette. 2011. A grammar of Makalero : A Hägerdal, Hans. 2012. Lords of the land, lords of the Papuan language of East Timor. Utrecht: LOT sea; Conflict and adaptation in early colonial Timor, Publications: Leiden University PhD Thesis. 1600-1800. Brill. https : / / doi . org / 10 . Huber, Juliette. 2017. Makalero and Makasae. 26530/OAPEN_408241. In Antoinette Schapper (ed.), The Papuan lan- Hammarström, Harald & Bank, Sebastian & guages of Timor, Alor, and Pantar: Volume 2, 268– Forkel, Robert & Haspelmath, Martin (eds.). 351. Berlin: Mouton de Gruyter. 2018. Glottolog 3.2. Jena: MPI for the Science of Jäger, Gerhard & List, Johann-Mattis. 2018. Us- Human History. http : / / glottolog . org/ ing ancestral state reconstruction methods for (7 May, 2018). onomasiological reconstruction in multilin- Haspelmath, Martin & Tadmor, Uri (eds.). 2009. gual word lists. Language Dynamics and Change The World Loanword Database (WOLD). Leipzig: 8(1). 22–54. https : / / doi . org / 10 . 1163 / MPI for Evolutionary Anthropology. http:// 22105832-00801002. wold.clld.org/. Jäger, Gerhard & List, Johann-Mattis & Sofroniev, Heine, Bernd & Kuteva, Tania. 2005. Language con- Pavel. 2017. Using support vector machines tact and grammatical change. Cambridge: Cam- and state-of-the-art algorithms for phonetic bridge University Press. alignment to identify cognates in multi-lingual Holton, Gary. 2004. Report on Recent Linguistic wordlists. In Proceedings of the 15th Conference of Fieldwork on Pantar Island, Eastern Indonesia. the European Chapter of the Association for Compu- https : / / scholarworks . alaska . edu / tational Linguistics: Volume 1, Long Papers, vol. 1, xmlui / bitstream / handle / 11122 / 6808 / 1205–1216. holton-2004-pantar_report.pdf.

37 Jäger, Gerhard & Sofroniev, Pavel. 2016. Auto- going Austronesian expansion in island South- matic cognate classification with a Support east Asia. Journal of Anthropological Archaeology Vector Machine. In Proceedings of the 13th Con- 30(3). 262–272. https://doi.org/10.1016/ ference on Natural Language Processing, vol. 16, j.jaa.2011.06.004. 128–134. Lartillot, Nicolas & Philippe, Hervé. 2006. Com- Kaiping, Gereon A. & Edwards, Owen & Klamer, puting Bayes factors using thermodynamic Marian (eds.). 2019. LexiRumah 2.1.2. https:// integration. Systematic Biology 55(2). 195–207. doi.org/10.5281/zenodo.2540954. https : / / doi . org / 10 . 1080 / Kaiping, Gereon A. & Klamer, Marian. 2018. 10635150500433722. LexiRumah: An online lexical database of Lee, Sean & Hasegawa, Toshikazu. 2011. Bayesian the . PLOS ONE 13(10). phylogenetic analysis supports an agricultural e0205250. https : / / doi . org / 10 . 1371 / origin of Japonic languages. Proceedings of the journal.pone.0205250. Royal Society of London B: Biological Sciences. Kelly, Luke J. & Nicholls, Geoff K. 2016. Lateral rspb20110518. https://doi.org/10.1098/ transfer in Stochastic Dollo models. http:// rspb.2011.0518. arxiv.org/abs/1601.07931. List, Johann-Mattis. 2012a. LexStat: Auto- Keraf, Gregorius. 1978. Morfologi dialek Lamalera. matic detection of cognates in multilingual OCLC: 24990120. Universitas Indonesia Jakarta. wordlists. In Proceedings of the EACL 2012 Joint 610 pp. Workshop of LINGVIS & UNCLH (EACL 2012), 117– Klamer, Marian. 2017. The Alor-Pantar lan- 125. Stroudsburg, PA, USA: Association for guages: Linguistic context, history and typol- Computational Linguistics. http://dl.acm. ogy. In Marian Klamer (ed.), The Alor-Pantar org/citation.cfm?id=2388655.2388671 languages: History and typology, 2nd edn. (Stud- (12 August, 2015). ies in Diversity Linguistics 3), 1. Berlin: Lan- List, Johann-Mattis. 2012b. SCA: Phonetic Align- guage Science Press. ment based on sound classes. In New Direc- Klamer, Marian. 2018. Typology and grammati- tions in Logic, Language and Computation, 32–51. calization in the Papuan languages of Timor, Springer. Alor, and Pantar. In Heiko Narrog & Bernd List, Johann-Mattis. 2014. Investigating the im- Heine (eds.), Grammaticalization from a Typologi- pact of sample size on cognate detection. Jour- cal Perspective, 235–262. Oxford: Oxford Univer- nal of Language Relationship • 11. 91–101. sity Press. https : / / cyberleninka . ru / article / n / Kolipakam, Vishnupriya & Jordan, Fiona M. & investigating-the-impact-of-sample- Dunn, Michael & Greenhill, Simon J. & Bouck- size-on-cognate-detection-1. aert, Remco & Gray, Russell D. & Verkerk, An- List, Johann-Mattis & Cysouw, Michael & Forkel, nemarie. 2018. A Bayesian phylogenetic study Robert (eds.). 2016. Concepticon. Jena: Max of the Dravidian language family. Royal Society Planck Institute for the Science of Human His- Open Science 5(3). 171504. https://doi.org/ tory. https://doi.org/10.5281/zenodo. 10.1098/rsos.171504. 51259. Kratochvíl, František. 2007. A grammar of Abui: List, Johann-Mattis & Greenhill, Simon J. & Gray, A Papuan language of Alor. PhD Dissertation. Russell D. 2017. The Potential of Automatic Utrecht: LOT Publications: Leiden University Word Comparison for Historical Linguistics. PhD Thesis. PLOS ONE 12(1). e0170046. https://doi.org/ Kratochvíl, František. 2014. Sawila. In Antoinette 10.1371/journal.pone.0170046. Schapper (ed.), Papuan languages of Timor, Alor List, Johann-Mattis & Greenhill, Simon J & and Pantar: Sketch grammars, vol. 1, 351–438. Tresoldi, Tiago & Forkel, Robert. 2018. LingPy. Berlin: Mouton de Gruyter. A Python Library for Quantitative Tasks in Histor- Lansing, John Stephen & Murray, P. Cox & De ical Linguistics. Version 2.6.4. Zenodo. https:// Vet, Therese A. & Downey, Sean S. & Hall- doi.org/10.5281/zenodo.1544172. mark, Brian & Sudoyo, Herawati. 2011. An on-

38 List, Johann-Mattis & Greenhill, Simon & An- supervised methods for multilingual cognate derson, Cormac & Mayer, Thomas & Tresoldi, clustering. http://arxiv.org/abs/1702. Tiago & Forkel, Robert (eds.). 2018. CLICS: 04938 (13 July, 2018). Database of Cross-Linguistic Colexifications. Jena: Rambaut, Andrew & Drummond, Alexei J. & Xie, MPI for the Science of Human History. https: Dong & Baele, Guy & Suchard, Marc A. 2018. //clics.clld.org/. Posterior Summarization in Bayesian Phyloge- Maurits, Luke & Forkel, Robert & Kaiping, Gereon netics Using Tracer 1.7. Systematic Biology 67(5). A. 2018. BEASTling. Version 1.4.0. https : 901–904. https : / / doi . org / 10 . 1093 / / / github . com / lmaurits / BEASTling sysbio/syy032. (10 September, 2018). Robinson, Laura C. & Holton, Gary. 2012. Inter- Maurits, Luke & Forkel, Robert & Kaiping, Gereon nal Classification of the Alor-Pantar Language A. & Atkinson, Quentin D. 2017. BEASTling: A Family Using Computational Methods Applied software tool for linguistic phylogenetics us- to the Lexicon. Language Dynamics and Change ing BEAST 2. PLOS ONE 12(8). e0180908. https: 2(2). 123–149. https://doi.org/10.1163/ / / doi . org / 10 . 1371 / journal . pone . 22105832-20120201. 0180908. Ross, Malcolm. 2005. Pronouns as a preliminary Michael, Lev & Chousou-Polydouri, Natalia & Bar- diagnostic for grouping Papuan languages. In tolomei, Keith & Donnelly, Erin & Meira, Sérgio Andrew K. Pawley & Robert Attenborough & Wauters, Vivian & O’Hagan, Zachary. 2015. A & Jack Golson & Robin Hide (eds.), Papuan Bayesian Phylogenetic Classification of Tupí- pasts: Cultural, linguistic and biological histories Guaraní. LIAMES: Línguas Indígenas Americanas of Papuan-speaking peoples (Pacific Linguistics 15(2). 193–221. https : / / doi . org / 10 . 572), 15–65. Canberra: Research School of Pa- 20396/liames.v15i2.8642301. cific & Asian Studies, Australian National Uni- Pawley, Andrew. 2002. The Austronesian dis- versity. http : / / www . academia . edu / persal: people, language, technology. In Peter download/31189925/Ross_Papuan_Pasts. Bellwood & Colin Renfrew (eds.), Examining the pdf. language/farming dispersal hypothesis. 251–271. Rosvall, Martin & Bergstrom, Carl T. 2008. Maps Cambridge: McDonald Institute for Archaeo- of random walks on complex networks re- logical Research. veal community structure. Proceedings of the Pawley, Andrew. 2005. The chequered career National Academy of Sciences 105(4). 1118–1123. of the Trans New Guinea hypothesis: Re- https : / / doi . org / 10 . 1073 / pnas . cent research and its implications. In Papuan 0706851105. pasts: Cultural, linguistic and biological histories Sapir, E. 1916. Time Perspective in Aboriginal Amer- of Papuan-speaking peoples. Pacific Linguistics. ican Culture: A Study in Method. 90. https : / / https://openresearch-repository.anu. doi.org/10.4095/103486. edu . au / handle / 1885 / 84324 (6 October, Schapper, Antoinette & Hendery, Rachel. 2014. 2017). Wersing. In Antoinette Schapper (ed.), Papuan Rama, Taraka. 2018. Three tree priors and five languages of Timor, Alor and Pantar: Sketch datasets: A study of the effect of tree priors grammars, vol. 1, 439–504. Berlin: Mouton de in Indo-European phylogenetics. http : / / Gruyter. arxiv.org/abs/1805.03645 (17 July, 2018). Schapper, Antoinette & Huber, Juliette & van Rama, Taraka & List, Johann-Mattis & Wahle, Jo- Engelenhoven, Aone. 2017. The relatedness hannes & Jäger, Gerhard. 2018. Are Automatic of Timor-Kisar and Alor-Pantar languages: A Methods for Cognate Detection Good Enough preliminary demonstration. In Marian Klamer for Phylogenetic Reconstruction in Historical (ed.), The Alor-Pantar languages: History and ty- Linguistics? http://arxiv.org/abs/1804. pology, 2nd edn., vol. 3, 99–154. Berlin: Lan- 05416 (13 July, 2018). guage Science Press. https : / / epub . uni - Rama, Taraka & Wahle, Johannes & Sofroniev, regensburg.de/31377/ (7 November, 2017). Pavel & Jäger, Gerhard. 2017. Fast and un-

39 Schapper, Antoinette & Huber, Juliette & van En- Zalizniak, Anna A. 2018. Database of Semantic gelenhoven, Aone. 2012. The historical rela- Shifts in Languages of the World. http : / / tion of the Papuan languages of Timor and datsemshift.ru/shifts (11 February, 2019). Kisar. Language and linguistics in Melanesia. 192– 240. Spriggs, Matthew. 2011. Archaeology and the Austronesian expansion: where are we now? Antiquity 85. 510–528. Tracer: Posterior Summarisation in Bayesian Phylo- genetics. 2019. Bayesian Evolutionary Analysis Sampling Trees. https : / / github . com / beast-dev/tracer (22 February, 2019). Traugott, Elizabeth Closs & Dasher, Richard B. 2002. Regularity in semantic change. Cambridge: Cambridge University Press. Wichmann, Søren & Brown, Cecil H. & Holman, Eric W. (eds.). 2016. The ASJP Database. Jena: Max Planck Institute for the Science of Human History. http://asjp.clld.org/ (6 October, 2017). Willemsen, Jeroen. to appear. Reta. In Antoinette Schapper (ed.), The Papuan Languages of Timor, Alor and Pantar: Volume 3. Berlin / New York: De Gruyter Mouton. Windschuttel, Glenn & Shiohara, Asako. 2017. Kui. In Antoinette Schapper (ed.), The Papuan Languages of Timor, Alor and Pantar: Volume 2, 109–183. Berlin / New York: De Gruyter Mou- ton. Wurm, Stephen A. & Voorhoeve, C. L. & McEl- hanon, Kenneth A. 1975. The trans-New Guinea phylum in general. In Stephen A. Wurm (ed.), New Guinea area languages and language study, vol. 1 (Pacific Linguistics C 38), 299–322. Can- berra: Research School of Pacific & Asian Stud- ies, Australian National University. Xie, Wangang & Lewis, Paul O. & Fan, Yu & Kuo, Lynn & Chen, Ming-Hui. 2011. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biol- ogy 60(2). 150–160. https://doi.org/10. 1093/sysbio/syq085. Xu, Shuhua & Pugach, Irina & Stoneking, Mark & Kayser, Manfred & Jin, Li & The HUGO & Pan-Asian Consortium. 2012. Genetic dat- ing indicates that the asian-Papuan admixture through Eastern Indonesia corresponds to the Austronesian expansion. PNAS 109(12). 4574– 4579.

40 Appendix A Cognate sets

The following table lists the cognate sets of proto-TAP *mi ‘be in, at’, proto-Nuclear Alor Pantar *om mi ‘(be) inside’, and reconstructions of intermediate forms. Sources for the re- flexes of proto-TAP *mi ‘be in, at’ etc. are Kaiping et al. (2019), Klamer (2018: 242–246), and specifically for Adang: Haan (2001: 145), Makalero and Makasai: Huber (2011, 2017), Kaip- ing et al. (2019), Klon-Bring: Baird (2008), Kafoa: Baird (2017), Kui: Windschuttel & Shiohara (2017), Reta: Willemsen (to appear), Abui-Takalelang: Kratochvíl (2007), Sawila: Kratochvíl (2014), Wersing: Schapper & Hendery (2014), W Pantar: Holton (2014).

Language Forms reflecting *mi Forms reflecting Reconstructions *om mi proto-TAP *mi ‘be in, at’ proto-Tim *mi Bunak mil ‘inside’ Makasai mi- ‘APPL’, (mutuɁu) Makalero mi- ‘APPL’, (mutuʔ) Oirata (muwainani) Fataluku (mutune) proto-AP *mi Kamang-Atoitaa mi-‘APPL’, mi ‘inside’ proto-East Alor *mi Wersing mi- ‘APPL’, mira ‘in- side’, min ‘be at’ Kula m ‘be located’, məra ‘inside’ Sawila ming ‘be located’, mirea ‘inside’ proto-Nuclear Alor Pantar *-om mi ‘(be) inside’ Klon-Bring mi ‘be at; to place’, mi -omi proto-Central Alor *-om mi ‘LOC’, mi- ‘APPL’ Kui mi- ‘APPL’, mi ‘be in, at’, mare ‘inside’ Kafoa mi ‘be at’ -ommi Abui-Takalelang mi ‘be in’ -o:mi Abui-Ulaga mia ‘be in’ -oni Adang mi ‘be in, at’, mi ‘in, at’ Ɂommi proto-W Alor Pantar Straits *-om mi Blagar-Pura =mi, mi ‘in; to; into; -omi from’ Reta mi ‘be in’ -o:mi Kaera-Abangiwang ming ‘be in, at’, mi ‘in; -ommi at; to; with’ Teiwa meɁ ‘be in’ -ommeɁ WPantar me ‘LOC’, migang ‘to -ume set’

41 Language Forms reflecting *hada ‘fire’ Reconstructions proto-TAP *hada ‘fire’ Bunak hɔtɔ hɔtɔ proto-Timor *hata Makasai ata Oirata aa Fataluku at proto-AP *hada Kamang-Atoitaa ɑt proto-East Alor *ada Wersing ada Kula ada Sawila ada proto-Nuclear AP *hara Klon-Bring ədɑ proto-Central Alor *ara Klon-Hopter ada (wer) Kiraman ar Kui ar Kafoa ɑrɑ Abui ara Adang (awai,aɁfai) proto-W Alor Pantar Straits *hara Kabola (awal) Hamap (afail) Blagar-Pura ad Reta ad Kaera-Abangiwang ad wasing Teiwa ar WPantar ra

42 Language Forms reflecting *habi ‘fish’ Reconstructions proto-TAP *habi ‘fish’ Bunak (ikan) proto-Timor *hapi Makasae afi Oirata ahi Fataluku api proto-AP *habi Kamang-Atoitaa api proto-East Alor *api Wersing api Kula api, apu Sawila api proto-Nuclear AP *habi Klon-Bring əbi proto-Central Alor *habi Klon-Hopter ʔəbi Kiraman ɛb Kui eb Kafoa-Probur ɑfʊi Abui afu Adang-Lawahing ab proto-W Alor Pantar *hab Adang-Otvai hab Hamap-Moru ʔab Kabola-Monbang hab Blagar (all dialects) aba:b Reta ab Reta ʔab Sar haf Teiwa-Adiabang haf Teiwa-Lebang af Teiwa-Nule haf Deing haf WesternPantar-Tubbe hap keʔe

Appendix B Highly borrowable items

The following table lists concepts in LexiRumah that correspond to concepts with borrowing scores of 0.15 or more according to WoLD.

LexiRumah WoLD Borrowed score

sugar the sugar 0.794 hour the hour 0.758 mosque the mosque 0.733 market the market 0.675 tobacco the tobacco 0.664 church the church 0.658 lamp the lamp or torch 0.649 candle the candle 0.627 banana the banana 0.622 priest_modern the priest 0.621 penalty the penalty or punishment 0.613 thousand a thousand 0.581 to_die the soldier 0.577 window the window 0.568

43 LexiRumah WoLD Borrowed score trousers the trousers 0.565 dolphin the porpoise or dolphin 0.554 judge the judge 0.545 money the money 0.537 crocodile the crocodile or alligator 0.533 king_ruler the king 0.529 shirt the shirt 0.506 coconut the coconut 0.492 to_count the country 0.480 to_hear the scissors or shears 0.478 chair the chair 0.470 sea the sea 0.462 garden the garden 0.462 animal the animal 0.456 clan the clan 0.449 poor poor 0.446 enemy the enemy 0.446 canoe the canoe 0.436 to_guard the guard 0.432 price the price 0.431 whale the whale 0.430 shark the shark 0.427 cheap cheap 0.427 cassava the cassava/manioc 0.426 to_command to command or order 0.422 one_hundred a hundred 0.420 friend the friend 0.413 bean the bean 0.406 to_sail the sail 0.404 sail the sail 0.404 corn_maize the maize/corn 0.392 grains the grain 0.391 stick_pole the post or pole 0.389 island the island 0.388 bed the bed 0.386 knife the knife(2) 0.384 to_pray to pray 0.379 slave the slave 0.379 to_rule to rule or govern 0.376 needle the needle(1) 0.374 chicken the chicken 0.372 adultery the adultery 0.370 easy easy 0.367 God the god 0.366 to_eat the weather 0.363 sacrifice the sacrifice 0.363 to_work the work 0.362 to_accuse to accuse 0.360 bamboo the bamboo 0.359 guilty guilty 0.355 because because 0.354 axe the axe/ax 0.349 sweet_potato the sweet potato 0.347

44 LexiRumah WoLD Borrowed score floor the floor 0.346 chieftain the chieftain 0.344 battle the war or battle 0.341 wall the wall 0.340 vagina the vagina 0.336 cookhouse the cookhouse 0.328 scorpion the scorpion 0.324 person the person 0.323 to_promise to promise 0.321 stomach_belly the stomach 0.318 innocent innocent 0.312 milk the milk 0.312 honey the honey 0.308 fever the fever 0.307 language the language 0.307 pig the pig 0.303 expensive expensive 0.297 mat the mat 0.295 millet the millet or sorghum 0.294 penis the penis 0.291 flat flat 0.290 nine nine 0.290 seven seven 0.290 to_bless to bless 0.289 to_sit the need or necessity 0.289 village the village 0.289 holy holy 0.286 to_dance to dance 0.286 rich rich 0.285 moon the moon 0.284 thread the thread 0.283 fathers_sister the father’s sister 0.277 ten ten 0.276 sky the sky 0.274 grandmother the grandmother 0.271 face the face 0.269 fence the fence 0.268 grandfather the grandfather 0.267 correct right(2) 0.266 the voice 0.265 to_go the gourd 0.264 fishnet the fishnet 0.263 vein the vein or artery 0.263 guest the guest 0.263 lungs the lung 0.262 to_trade to trade or barter 0.262 to_learn to learn 0.260 eight eight 0.256 six six 0.256 fruit the fruit 0.255 meeting_house the meeting house 0.254 word the word 0.254 sugarcane the sugar cane 0.254

45 LexiRumah WoLD Borrowed score spear the spear 0.253 to_worship to worship 0.253 spider the spider 0.253 tree the tree 0.252 father the father 0.252 flower the flower 0.251 earthquake the earthquake 0.250 difficult difficult 0.250 turtle the turtle 0.245 to_walk the walking stick 0.245 blue blue 0.240 necklace the necklace 0.239 coral_reef the reef 0.238 snow the snow 0.237 lazy lazy 0.235 twelve twelve 0.235 dust the dust 0.234 lizard the lizard 0.233 eleven eleven 0.232 two two 0.231 ice the ice 0.231 sorcerer the sorcerer or witch 0.230 to_plant_rice to plant 0.230 to_plant_yam to plant 0.230 old_people the old man 0.230 fish_trap the fish trap 0.229 five five 0.228 door the door or gate 0.227 twenty twenty 0.226 frog the frog 0.225 river_stream the river or stream 0.224 heart the heart 0.223 bat the bat 0.223 stranger the stranger 0.222 to_vomit to vomit 0.221 three three 0.219 to_pound to pound 0.219 to_help to help 0.219 to_answer to answer 0.216 angry the anger 0.216 to_crouch to crouch 0.216 mosquito the mosquito 0.216 finished to finish 0.215 lake the lake 0.215 to_surrender to surrender 0.213 testicles the testicles 0.213 four four 0.212 mothers_brother the mother’s brother 0.211 earring the earring 0.210 road the path 0.207 to_think to think(2) 0.206 head the head 0.204 to_refuse to refuse 0.201

46 LexiRumah WoLD Borrowed score husband the husband 0.200 seed the seed 0.199 to_invite to invite 0.197 clothing the clothing or clothes 0.197 to_defend the defendant 0.196 mother the mother 0.195 cold_influenza the cold 0.195 arrow the arrow 0.191 and and 0.191 field the field 0.191 nephew_niece the nephew 0.189 body the body 0.188 all all 0.187 year the year 0.186 to_divide to divide 0.186 fishing_hook the fishhook 0.185 wife the wife 0.183 to_cook to cook 0.183 mortar the mortar(1) 0.182 round round 0.182 to_squeeze to squeeze 0.182 fat the grease or fat 0.180 sick_painful sick/ill 0.178 broom the broom 0.177 rat the mouse or rat 0.177 fifteen fifteen 0.177 smooth smooth 0.174 mountain the mountain or hill 0.174 grass the grass 0.174 to_sell to sell 0.173 to_push to push 0.173 chest the chest 0.172 wind the wind 0.172 branch the branch 0.172 to_come to become 0.171 younger_sibling the younger sibling 0.171 star the star 0.170 to_curse to curse 0.169 green green 0.168 kidney the kidney 0.168 to_rub to rub 0.168 comb the comb 0.167 roof_rafter the rafter 0.165 guts the intestines or guts 0.165 loom the loom 0.165 cloud the cloud 0.165 forest the woods or forest 0.165 dew the dew 0.164 young_people young 0.163 if if 0.162 dog the dog 0.162 rice_field the paddy 0.161 rough rough(2) 0.161

47 LexiRumah WoLD Borrowed score to_bark the bark 0.161 bark_of_tree the bark 0.161 wrong wrong 0.161 lightning the lightning 0.160 horn the horn 0.160 waist the waist 0.160 white white 0.159 bird the bird 0.158 initiation_ceremony the initiation ceremony 0.158 to_whisper to whisper 0.158 deaf deaf 0.158 bow the bow 0.158 woman the woman 0.157 rotten rotten 0.157 snake the snake 0.154 yellow yellow 0.154 mud the mud 0.154 pestle the pestle 0.154 wound the wound or sore 0.152 to_buy to buy 0.152 fowl the fowl 0.152 rope the rope 0.152 salt the salt 0.152 hand the hand 0.151 left_side left 0.150

48