
Dated language phylogenies shed light on the ancestry of Sino-Tibetan

Laurent Sagart (a,1), Guillaume Jacques (a,1), Yunfan Lai (b), Robin J. Ryder (c), Valentin Thouzeau (c), Simon J. Greenhill (b,d), and Johann-Mattis List (b,2)

(a) Centre de Recherches Linguistiques sur l'Asie Orientale, CNRS, Institut National des Langues et Civilisations Orientales, Ecole des Hautes Etudes en Sciences Sociales, 75006 Paris, France; (b) Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena 07743, Germany; (c) Centre de Recherches en Mathématiques de la Décision, CNRS, Université Paris-Dauphine, PSL University, 75775 Paris, France; and (d) Australian Research Council Center of Excellence for the Dynamics of Language, Australian National University, Canberra, ACT 0200, Australia

Edited by Balthasar Bickel, University of Zurich, Zurich, Switzerland, and accepted by Editorial Board Member Richard G. Klein April 8, 2019 (received for review October 19, 2018)

The Sino-Tibetan language family is one of the world's largest and most prominent families, spoken by nearly 1.4 billion people. Despite the importance of the Sino-Tibetan languages, their prehistory remains controversial, with ongoing debate about when and where they originated. To shed light on this debate we develop a database of comparative linguistic data, and apply the linguistic comparative method to identify sound correspondences and establish cognates. We then use phylogenetic methods to infer the relationships among these languages and estimate the age of their origin and homeland. Our findings point to Sino-Tibetan originating with north Chinese millet farmers around 7200 B.P. and suggest a link to the late Cishan and the early Yangshao cultures.

Sino-Tibetan languages | human prehistory | East Asia | peopling | computer-assisted language comparison

Significance

Given its size and geographical extension, Sino-Tibetan is of the highest importance for understanding the prehistory of East Asia, and of neighboring language families. Based on a dataset of 50 Sino-Tibetan languages, we infer phylogenies that date the origin of the language family to around 7200 B.P., linking the origin of the language family with the late Cishan and the early Yangshao cultures.

Author contributions: L.S., G.J., and J.-M.L. initiated the study [...]; L.S., G.J., Y.L., R.J.R., V.T., S.J.G., and J.-M.L. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. B.B. is a guest editor invited by the Editorial Board.
This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
1. L.S. and G.J. contributed equally to this work.
2. To whom correspondence should be addressed. Email: [email protected]
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1817972116/-/DCSupplemental.
PNAS Latest Articles. www.pnas.org/cgi/doi/10.1073/pnas.1817972116. Downloaded by guest on September 24, 2021.

The past 10,000 y have seen the rise, at the western and eastern extremities of Eurasia, of the world's two largest language families. Together, these families account for nearly 60% of the world's population: Indo-European (3.2 billion speakers) and Sino-Tibetan (1.4 billion). The Sino-Tibetan family comprises about 500 languages (1) spoken across a wide geographic range, from the west coast of the Pacific Ocean, across China, and extending to countries beyond the Himalayas, such as Nepal, India, Bangladesh, and Pakistan (map, Fig. 1). Speakers of these languages have played a major role in human prehistory, giving rise to several of the world's great cultures in China, Tibet, Burma, and Nepal. However, while the debate on Indo-European origins has recently been renewed by archaeogeneticists, phylogeneticists, and linguists (2–5), the circumstances of the formation of Sino-Tibetan remain shrouded in obscurity.

While Sino-Tibetan languages have been studied from the beginning of the 19th century (6), our knowledge of the history of this family is still severely limited, since it is structurally one of the most diverse families in the world, including all of the gradation of morphological complexity from isolating (Lolo-Burmese, Tujia) to polysynthetic (Gyalrongic, Kiranti) languages (7, 8). Knowledge of Sino-Tibetan sound correspondences is improving (SI Appendix, section 2), yet important aspects of its phonological and grammatical history remain poorly understood, e.g., the voicing and aspiration of stops, or the correspondences between tones and modern nontonal elements. These difficulties place some uncertainty on cognate identification and, in turn, affect our ability to identify shared innovations. This complexity has led to claims that Sino-Tibetan is one of the greatest challenges that comparative-historical linguistics currently faces (ref. 9, p. 422).

Where did these languages originate and when? The vast majority of Sino-Tibetan speakers speak a Sinitic language. The Chinese, or Sinitic, languages, whose ancestor was spoken about 2,000 y ago, form a homogeneous group in the eastern half of the Sino-Tibetan area. The earliest paleographical inscriptions in Chinese date to before 1400 BCE, and Chinese has an abundant and well-studied literature dating back to the early first millennium BCE. The Shāng Kingdom, the Chinese polity associated with these inscriptions, was centered on the lower Yellow River valley. Gradual annexation of neighboring regions and shift of their peoples to the Chinese language led to the numerical predominance of Chinese speakers today and, consequently, to the striking lack of linguistic diversity in the eastern part of the Sino-Tibetan domain. Tibetan, Tangut, Newar, and Burmese, the family's other early literary languages, were reduced to script considerably more recently: The oldest texts in these languages date from 764 CE, 1070 CE, 1114 CE, and 1113 CE, respectively. The area with the most diverse Sino-Tibetan languages is in northeastern India and Nepal. This has suggested to some authors that the family's homeland was located there (10). However, Sino-Tibetan diversity in India and Nepal may have been boosted by intimate contact with very divergent and mostly extinct non–Sino-Tibetan languages, in much the same way that Austronesian diversity in northwest Melanesia was boosted by contact with Papuan languages (11) despite their homeland in Taiwan (12).

Due to these difficulties, no consensus exists about phylogenetic relationships within the family. The position of Chinese, in particular, is in dispute. A first group of proposals recognizes a two-branch structure: One branch leads to Chinese, and the other leads to a node labeled "Tibeto-Burman," or "Tibeto-Karen," out of which all other languages proceed (13, 14). A second group presents Sino-Tibetan basal topology as a rake, with Chinese being one of several primary branches (10). A third

ANTHROPOLOGY

group places Chinese in a lower-level subgroup with Tibetan (15, 16). Apart from the second group, which relies on lexicostatistic methodology, the tree topologies in these proposals are based on an investigator's perception of relative proximities between branches, with no quantification of uncertainty. A search for linguistic innovations uniting several branches of the family is ongoing; the limited results so far are consistent with the first group of hypotheses (9, 17). SI Appendix, section 2 summarizes different proposals.

Here we combine classical historical linguistics with cutting-edge computational methods and domestication studies. First, we develop a lexical database of 180 basic concepts from 50 languages. The data were either directly collected in the field by ourselves or gathered from the literature with verification by external specialists whenever possible. The list of most appropriate concepts was established through careful evaluation of concept lists used in similar studies (SI Appendix, section 3), and lexical cognates were identified by experts in Sino-Tibetan historical linguistics using the comparative method supported by state-of-the-art annotation techniques. Second, we apply Bayesian phylogenetic methods to these data to estimate the most probable tree, outgroup, and timing of Sino-Tibetan under a range of models of cognate evolution; similar methods have been applied to several other families of languages, including Indo-European (18–20), Austronesian (12), Semitic (21), and Bantu (22). Third, we examine Sino-Tibetan expansion under the two most probable phylogenetic scenarios through a consideration of the family's plant and animal domesticates, the regions where they are earliest attested archaeologically, and the distribution of the corresponding cognate sets across the family's branches.

Results

Cognate Set Distribution. Of the 3,333 cognate sets distributed over 9,160 lexical items, 90% are shared by fewer than five languages. The majority of these low-frequency sets are singletons for which no related word in any other language was found (2,189, 66%). Four cognate sets are found in all or almost all languages in our sample, reflecting well-known Sino-Tibetan cognates ("three," "four," "dream," and "name"). These numbers compare well with the data obtained for other challenging language families (see SI Appendix, section 3 for details).

Tree Topologies and Dating. We present tree topologies and ages inferred using a relaxed-clock covarion model with BEAST, a phylogenetic software package performing Bayesian evolutionary analysis; see Fig. 2 for a summary. Apart from the Sinitic group, which was constrained in the priors, the posterior distribution provides strong evidence (>95% probability) for six subgroups: (i) Tibeto-Gyalrongic (possibly including Dulong in a Tibeto-Dulong clade), (ii) Kiranti, (iii) West-Himalayish, (iv) Tani-Yidu, (v) Kuki-Tangkhul (possibly including Karbi), and (vi) Sal. Tshangla and Chepang are isolated branches. Within the Tibeto-Gyalrongic group, there is also support for Tibetan, Lolo-Burmese, Gyalrongic, and Burmo-Gyalrongic. The more recent part of the tree is thus well resolved.

On the other hand, the results do not allow us to unambiguously resolve the root of the tree. The most plausible outgroups, judging from posterior probabilities, are Sinitic (33%), West-Himalayish (15%), Tani-Yidu (9%), a Sinitic-Sal group (8%), and Sal (6%). The mean root age estimated with the relaxed-clock model is 7184 B.P., with 95% highest posterior density interval (HPD) [5093–9568] B.P.

In addition to this main analysis, we also analyzed the data using two more constrained models: a strict-clock covarion model and a Stochastic Dollo model. These more constrained models lead to less uncertainty in the deep tree topology. In particular, under the strict-clock model, Sinitic is the only possible outgroup; the Stochastic Dollo model gives outgroup probabilities similar to the relaxed-clock model. The differences are discussed further in SI Appendix, section 4. Repeating the analyses on a smaller sample representing each of the major subgroups yielded similar results, further discussed in SI Appendix, section 4. Tests of the adequacy of the tree model are further discussed in Adequacy of the Tree Model.

Discussion

Tree Topology and Subgrouping Hypotheses. Although our study remains preliminary until further key languages of the family, such as Newar, are sufficiently analyzed and added, our results consistently support two nontrivial subgrouping hypotheses previously proposed by historical linguists on the basis of lexical innovations: The clade comprising Garo, Rabha, and Jingpho in the sample is compatible with the Sal subgroup (23), and the clade including Burmish, Lisu, Gyalrongic (Japhug, Situ, Tangut, Stau, and Khroskyabs), and Zhaba corresponds to the Eastern Tibeto-Burman or Burmo-Gyalrongic subgroup (24, 25). Our results also indicate that the Burmo-Gyalrongic group belongs to a larger Tibeto-Gyalrongic clade comprising Tibetan and also possibly Dulong, a hypothesis that had not been explicitly proposed before.

The results are inconsistent with a certain number of subgrouping proposals, in particular, Sino-Bodic [grouping together Chinese, Tibetan, and Kiranti, excluding Lolo-Burmese (26)]; Post and Blench's hypothesis that subgroups in northeastern India such as Tani (Bokar) and Mishmi (Yidu and Deng) are among the first branches of the family, while Sinitic is closer to Lolo-Burmese and Tibetan (16); the Central Trans-Himalayan hypothesis [a clade comprising Sal and Kuki-Chin (27)]; and the Rungic hypothesis (28), according to which the morphology-rich subgroups (Gyalrongic, Kiranti, and Dulong) constitute a clade to the exclusion of Lolo-Burmese and Tibetan (SI Appendix, section 4). The last two hypotheses are exclusively based on verbal morphology. The fact that these subgroups are not confirmed by our results suggests that the commonalities in verbal morphology adduced by these authors to support these subgroups are more likely to reflect a combination of retentions from a common ancestor and parallel innovations.

Since the common origin of person agreement morphology among Gyalrongic, Dulong, and Kiranti is not controversial (28, 29), and since Kiranti is outside of the Tibeto-Dulong clade, phylogenetic inference supports the idea that the absence of person inflexion in Lolo-Burmese and Tibetan is due to a massive loss of morphology (7), a hypothesis also supported by potential traces of these inflexions in Tibetan (30). The proximity of a set of isolating (Lolo-Burmese) and polysynthetic (Japhug and Situ) languages in our results supports the idea that the rate of change of structural features can be much more volatile than that of basic vocabulary (31), and provides an additional example of abrupt loss of inflectional morphology, comparable to the case of Goemai in Chadic (32).

Although the likely Urheimat of the family lies in northern China and Sinitic may be the first group to branch off, the diversity of the subgroups of Sino-Tibetan is highly skewed toward northern India and Nepal. Of the nine subgroups supported by our results (Sinitic, Tibeto-Dulong, Sal, Kiranti, Kuki-Karbi, Tani-Yidu, West-Himalayish, and the isolated languages Chepang and Tshangla), only two groups (Sinitic and Tibeto-Dulong) are well represented in China. Three other branches (Sal, Tani-Yidu, and Tshangla) are mainly spoken in Burma, India, and Bhutan but straddle the border with China. This geographical distribution suggests that the historical success of a few subgroups (Sinitic, Tibetan, and Lolo-Burmese, the latter two belonging to the Tibeto-Dulong group) has eroded linguistic diversity in China, including on the Tibetan plateau, whereas the

Himalayan area has served as a refuge zone (27, 33, 34), resulting in a higher diversity.

Fig. 1. Languages in our sample, contrasted with ancient sites reflecting early stages of domestication and the estimated spread of non-Sinitic languages (see Homeland, Archaeology, and Agriculture). Ancient language locations reflect supposed political and cultural epicenters of the varieties.

Homeland, Archaeology, and Agriculture. It is claimed that language families arise through demographic processes driven by favorable changes in food procurement (35); thus, any account of the origins of a language family should pay attention to its domesticates. We identify six domesticate names forming cognate sets with regular sound correspondences in at least two of the branches identified in our phylogenetic analysis: foxtail millet, pig, sheep, rice plant, cattle, and horse (SI Appendix, section 5). The fact that all of these first appear archaeologically in northern China, even those with cognate sets lacking a Chinese member, is a strong indication that, under our root date, the Sino-Tibetan family, in its early expansion phase, was located in the broad area (Fig. 1) that is now occupied by Sinitic languages. In particular, foxtail millet, rice, pigs, and sheep are early enough to have played a demography-boosting role in the early stages of Sino-Tibetan expansion, although we do not think rice was known to ancestral Sino-Tibetan speakers. The northeastern part of the Sino-Tibetan domain is thus the family's most likely homeland. Alternative proposals such as Sichuan (26), eastern India (16), and the Tibetan plateau (36) lack an archaeologically and demographically supported account of the family's expansion.

Our recognition of the family's most likely outgroups (SI Appendix, section 4) suggests two possible scenarios of expansion of Sino-Tibetan languages: a Chinese outgroup scenario and a West-Himalayish outgroup scenario. Under the first scenario (33% probability, median root date: 7400 B.P.), the Sino-Tibetan homeland was located in the eastern half of the north Chinese loess plateau, during the final stages of the Cishan culture or the initial stages of Yangshao culture (Fig. 1). The Sinitic homeland is located just to the south: No significant Sinitic migration needs to be assumed. The non-Sinitic group would have individualized as a result of its expansion to the western half of the plateau, inside the northern bend of the Yellow River.

Both the Cishan and Yangshao cultures derived their subsistence principally from broomcorn and foxtail millet and pigs; in addition, domesticated sheep have been identified at the northern edge of early Yangshao culture (37). These four domesticates were important to Sino-Tibetan speakers in early China and are still widely relied upon elsewhere. Rice, horses, cattle, wheat, and barley are archaeologically absent in Cishan and Yangshao: Under the Sinitic outgroup scenario, a Sino-Tibetan homeland in the Cishan–Yangshao region predicts that cognate sets, reflected in and outside of Sinitic, should exist for the two millets, pig, and sheep, but not for rice, horses, cattle, wheat, or barley. The individual cognate set distributions, including two sets identified for rice and pig, support these predictions (SI Appendix, section 5).

The secondary Sino-Tibetan domesticates (rice, cattle, horses, wheat, and barley) would have entered the non-Sinitic branch of Sino-Tibetan in the early times of its westward and southward expansion, through contact with neighboring groups. First, rice was adopted: Indirect evidence (phytoliths) of the cultivation of Japonica rice, presumably spread from the southeast (Henan), appears in the late Yangshao culture, ca. 5690 BP, in the

southwest part of the Yangshao area (Wei River valley) (38). Rice, by then outside of its wild habitat, is well established in Xishanping, at the far western end of the loess plateau in southeastern Gansu (Majiayao culture), where pigs and the two millets are also found, by 5070 BP. Millet-and-rice agriculture then expands south along the eastern edge of the Tibetan plateau, entering Sichuan (Baodun, 4700 BP), although d'Alpoim Guedes (39) sees Baodun rice as an extension of Yangtze irrigated cultivation rice. This complex subsistence strategy further expands south into Yunnan (e.g., Baiyangcun, 4500 BP; Haimenkou, 3600 BP). The earliest archaeological evidence for cattle and horses, domesticated west of China, comes from Gansu, at the eastern edge of the Tibetan plateau, in the period 5400 BP to 4200 BP; we assume that the horse entered the non-Sinitic branch in that region, and that its name was later transferred to Sinitic; for the name of cattle, see SI Appendix, section 5.

The archaeology of Sino-Tibetan–speaking regions in Burma, northeast India, and the Himalayan area is very limited, but radiation from Yunnan along the main rivers which flow out of the Himalayas would bring non-Sinitic speakers and their domesticates to many of their current locations. Genetics supports the idea (40, 41) that a second route carried Sino-Tibetan speakers southwest from Gansu across the Tibetan plateau: foxtail millet, but not rice, was cultivated at Changdu Karuo on the Mekong River in eastern Tibet between 4700 BP and 4300 BP, later at Changguogou along the Yarlung Tsangpo in southern Tibet beginning ca. 3400 BP, and, by ca. 1700 BP, farther west, near the Indus River, at Kyung-lung Mesa (39). This route would bring Sino-Tibetan speakers to the area occupied by modern West-Himalayish speakers, the western-most Sino-Tibetan group. Modern speakers of these languages have replaced the millets and rice with the more frost-tolerant cereals barley and wheat (42), indicative of an adaptation to the Tibetan plateau.

Alternatively, under the West-Himalayish outgroup scenario (15%, 7200 BP) and assuming a homeland in the eastern loess plateau, the two groups West-Himalayish and non–West-Himalayish expanded westward in parallel, reaching the western end of the loess plateau at similar places and times, in the late sixth millennium BP, speaking distinct languages. One group then moved southwestward across the plateau, while the other expanded south following the plateau's foothills. Depending on the timing of subsequent splits, the group ancestral to Sinitic might face a lengthy eastward back-migration.

Taken together, these two scenarios account for nearly half of the tree output behind our phylogeny. However, with 33% posterior probability and a more straightforward pattern of expansion, the Chinese outgroup scenario is the better supported of the two.

Materials and Methods

Lexical Data. Sino-Tibetan languages differ considerably as to their syllabic structure. Some languages only allow consonant–vowel-type syllables, while others have complex clusters and final consonants. The former type of languages can be shown to be highly innovative, and a series of specific sound changes generally make cognate judgments very difficult, except for a few well-investigated cases. For this reason, it was decided to exclude from the sample all languages having lost the final stops -p, -t, and -k, unless published sources on the sound laws necessary to recover the lost segments were available (as in the case of Lisu, for which ref. 43 was used). This resulted in a collection of 50 Sino-Tibetan languages (see SI Appendix, section 3), which reflects the major particularly well-studied subgroups of the language family, including modern Chinese dialects. The present concept list is based on a larger set of 250 concepts (see SI Appendix, section 3), reduced to 180 on the basis of the following criteria: (i) availability of data, (ii) avoiding pairs of concepts with high polysemy ("hand" vs. "arm"; in such cases, only one concept was chosen), and (iii) avoiding words prone to have nursery forms ("father" and "mother").

Particular importance was attributed to assembling a dataset of high average mutual coverage, defined as the proportion of the overlap of concepts for which a word form exists in each language pair (44). With low coverage, the uncertainty of phylogenetic analyses increases, as well as the difficulty for experts to identify reliable cognates. To avoid this problem, we made sure that all languages have words for at least 85% of the concepts in our questionnaire. Due to our strict procedure, some potentially important languages, such as Newar, are missing from our sample (SI Appendix, section 3). This means that our results are preliminary, but we think they are interesting and important enough to be shared.

When preparing the language data, we selected the translations for our concept list semiautomatically and used publicly available software libraries (45) to convert the language-specific orthographies into standard phonetic transcriptions (46), to ease the task of cognate assignment. At all stages of data preparation, we maintained a computer-assisted as opposed to a solely computer-based or solely manual workflow: We made use of automatic tools and custom scripts for data preprocessing, but we made sure that all data were always checked again by an expert to avoid errors.

Fig. 2. The Maximum Clade Credibility tree from the best-fitting model (relaxed clock with covarion). Branches with less than 0.8 posterior probability are dashed; other branches have posterior probabilities >0.8. For densitrees of the data, see SI Appendix, section 4.

Cognate Judgments. During cognate coding, particular care was devoted to the identification of loanwords in the dataset. Loanwords from Chinese, Tibetan, and Indic languages (in particular, Nepali) are widespread in languages of the sample. They can be identified and discriminated from inherited cognates using the methodology described in refs. 47 and 48. More difficult to detect are borrowings among closely related languages. In the case of Sinitic and [...] languages, knowledge of sound laws is precise enough, and allows recognition of a considerable number of such borrowings. For nonliterary language families such as Gyalrongic and Kiranti, identification of borrowings, although possible in exceptional cases, cannot be systematically undertaken. Identified loanwords were coded as cognates in the database, with a specific tag, allowing them to be disregarded when performing analyses.

Since the Sino-Tibetan languages exhibit rich patterns of compounding (14) and derivation (49), a particular problem of cognate assessment in Southeast Asian languages is to deal with words that are cognate in some of their parts, but not entirely (50). Words for plural personal pronouns, for example, are often compounds of the singular pronoun and a plural suffix, and are related only in the pronoun (compare Chinese wǒ "I" vs. wǒ-men "we"). If words are only cognate in some of their parts, it may be difficult to make a judgment on their overall cognacy (50). We tried to circumvent this problem by (i) excluding items of high compoundhood, (ii) using multiple alignments to identify the main meaning-bearing unit, and (iii) using transparent annotation techniques to show which parts of words are cognate (SI Appendix, section 3).

The cognate judgments were carried out with help of a new annotation framework (51), reflecting expert decisions with high transparency. Where possible, we also link to the original sources, especially those taken from the Sino-Tibetan Etymological Dictionary and Thesaurus (STEDT) database (36). To facilitate reuse of our data, we further provide the data in the form suggested by the Cross-Linguistic Data Formats initiative (52), which facilitates data reuse by linking to public reference catalogs, like Glottolog for languages (1) and Concepticon for concepts (53).

Phylogenetic Reconstruction and Dating. The cognate sets can be coded as a binary matrix with missing values. We used three models of cognate change along a tree to reconstruct the tree topology and internal node ages in a Bayesian framework. For each model, we performed Markov Chain Monte Carlo (MCMC) sampling to obtain a sample of plausible trees from the posterior distribution. Each model defines how cognates may switch from being present (1) to absent (0) in a language, and vice versa. Under the binary covarion model (54), there is a latent variable indicating whether cognates switch between 0 and 1 at the "fast" or "slow" rate, transitions from 0 to 1 and 1 to 0 being equally probable. In addition to the slow and fast rates, this model also includes a parameter governing the transition rate of the latent parameter. We used two different models with the covarion: The strict-clock model forces cognates to change at the same rate over all branches on the tree; the relaxed-clock model allows each branch on the tree to have its own clock rate, drawn in a lognormal distribution. Under the Stochastic Dollo model (55), cognates appear and disappear at constant rates, with the additional constraint that each cognate may only be born once on the tree; parallel innovations are thus not allowed under this model. We specified calibrations as follows: Old Chinese, [2,800 to 2,300] yBP, in a uniform prior; Old Burmese, 800 yBP; Old Tibetan, 1,200 yBP; and Tangut, 900 yBP (the range for Old Chinese corresponds to the period of the great Classical Chinese texts; the other dates correspond to the date of the earliest text rounded to the nearest century; see SI Appendix, section 4). In addition to the main consensus tree in Fig. 2, we also present densitrees in [...]tions are also uncertain. To make sure that our sampling of language varieties does not influence the results, we also analyze a subset of the data (see SI Appendix, section 4).

Analyses Under the Covarion Models. We used the software BEAST2 v2.4.7 (56). We fitted a Fossilized Birth-Death model (57), which allows us to include extinct languages, with both a strict-clock and a relaxed-clock model. We performed 10[...] iterations, reaching convergence with effective sample size (ESS) [...] each parameter. We thinned trees every 10[...] [...] correlation. We sampled 10[...] [...] to produce a maximum clade credibility tree using the software TreeAnnotator v2.4.7. Verification of convergence and ESS computation were produced using Tracer v.1.6 (58). We then used a Nested Sampling algorithm (59) to compare the marginal likelihood of a strict-clock model and a relaxed-clock model (60). In each case, we used 40 particles to compute the marginal likelihood. The log Bayes factor was estimated at 85 (SD = 23) in favor of the relaxed-clock model, indicating decisive evidence against the strict-clock model. The results presented in Fig. 2 therefore correspond to the analysis with the relaxed clock covarion model with BEAST, and were plotted using the R package ggtree (61).

Analyses Under the Stochastic Dollo Model. We [...] the Stochastic Dollo model implemented in TraitLab (62). The results we present [...] 10[...]; the Markov Chain had reached stationarity and mixed well. We included catastrophic rate heterogeneity as proposed by ref. 19, but the results are the same whether we include or exclude rate heterogeneity.

Adequacy of the Tree Model. We assessed the adequacy of the tree model by reanalyzing a subset of the data under the Stochastic Dollo Lateral Transfer model of ref. 63, which allows for non–tree-like evolution. For computational reasons, this was done on a subset of 15 randomly chosen leaves representing all of the major subfamilies. Following ref. 63, we assessed model fit by estimating the model on a randomly selected subset of 75% of the traits in the data, then computing the posterior predictive probability for the remaining 25% of the data. This gave a Bayes factor of 4 [...] network model; the evidence is not decisive, and using the tree model should not lead to issues in the inference. Furthermore, the estimated level of lateral transfer in a network model is [...] 0.162]), which is within the range shown by section 4.2.1 of ref. 64 not to be liable to systematic bias when estimating the root age and topology under a tree model.

ACKNOWLEDGMENTS. We thank H. Hammarström for data; H. Sell for the geographic map; R. D. Gray for comments on the draft; and M.-S. Wu for comments on the map. J.-M.L. and Y.L. are supported by European Research Council Starting Grant 715618. S.J.G. is supported by Australian Research Council (ARC) Discovery Project Grant DE 120101954 and ARC Center of Excellence Grant CE140100041. V.T. is supported by a Paris Sciences et Lettres Data Science, Science Data scholarship.

1. Hammarström H, Forkel R, Haspelmath M (2018) Glottolog (MPI-SHH, Jena, Germany).
2. Haak W, et al. (2015) Massive migration from the steppe is a source for Indo-European languages in Europe. Nature 522:207–211.
3. Allentoft ME, et al. (2015) Population genomics of Bronze Age Eurasia. Nature 522:167–172.
4. Bouckaert R, et al. (2012) Mapping the origins and expansion of the Indo-European language family. Science 337:957–960.
5. Garnier R, Sagart L, Sagot B (2017) Milk and the Indo-Europeans. Language Dispersal Beyond Farming (John Benjamins, Amsterdam), pp 291–311.
6. Leyden J (1808) On the languages and literature of the Indo-Chinese nations. Asiatick Res 3:158–289.
7. DeLancey S (2015) The historical dynamics of morphological complexity in Trans-Himalayan. Ling Discov 13:37–56.
8. Jacques G (2016) Tangut, Gyalrongic, Kiranti and the nature of person indexation in Sino-Tibetan/Trans-Himalayan. Ling Vanguard 2:1–13.
9. Handel Z (2008) What is Sino-Tibetan? Snapshot of a field and a language family in flux. Lang Linguist Compass 2/3:422–441.
10. Peiros I (1998) Comparative Linguistics in Southeast Asia (Australian National Univ, Canberra, ACT, Australia).
11. Pawley A (2006) Explaining the aberrant Austronesian languages of southeast Melanesia: 150 years of debate.
12. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement.
13. Benedict PK (1972) [...]
14. Matisoff JA (2003) [...] University of California Publications in Linguistics (Univ California Press, Berkeley), Vol 135.
15. van Driem G (2003) Review of Thurgood and LaPolla 2003. [...] 66:282–284.
16. Blench R, Post MW (2014) Rethinking Sino-Tibetan phylogeny from the perspective of north east Indian languages. [...] (Mouton de Gruyter, Berlin), pp 71–104.
17. Sagart L (2017) A candidate for a Tibeto-Burman innovation. [...] 46:101–119.
18. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin.
19. Ryder RJ, Nicholls GK (2011) Missing data in a stochastic Dollo model for binary trait data, and its application to the dating of Proto-Indo-European. [...] 71–92.
10 n icrigtefis 0 sbr-n iulcek niae htthe that indicated checks Visual burn-in. as 10% first the discarding and BF = IAppendix SI nfvro h ewr oe;atog hsfvr h net- the favors this although model; network the of favor in 1.8 m n .LPlafrhl n nomto nlnugsand languages on information and help for LaPolla R. and om, ¨ ioTbtn Conspectus A Sino-Tibetan: nvriyo California of University Proto-Tibeto-Burman, of Handbook IAppendix SI etakL onrh .Wde,N .Hl,K Ma, K. Hill, W. N. Widmer, M. Konnerth, L. thank We orsodt u f10 of run a to correspond d ilN,Oe-mt T Owen-Smith NW, Hill eds Linguistics, Trans-Himalayan 4 evrfidteaeuc fate oe by model tree a of adequacy the verified We re rmtepseirte itiuinto distribution tree posterior the from trees oy Soc Polyn J Nature 8 CCieain,wt uni f10 of burn-in a with iterations, MCMC oso hc set ftereconstruc- the of aspects which show to Science 426:435–439. ecntutdtesuigtesoft- the using trees constructed We 115:215–257. β/µ ˆ eas nlzdtedt under data the analyzed also We CmrdeUi rs,Cambridge). Press, Univ (Cambridge 323:479–483. 4 eeain ormv auto- remove to generations = NSLts Articles Latest PNAS 0 9%HD[.5 to [0.058 HPD (95% 0.104 8 trtostinn every thinning iterations ulShOin f Stud Afr Orient Sch Bull a igitAi Orient Asie Linguist Cah ttScB Soc Stat R J > | 0 for 300 f6 of 5 60: SI 7

20. Chang W, Cathcart C, Hall D, Garrett A (2015) Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language 91:194–244.
21. Nicholls G, Ryder R (2011) Phylogenetic models for Semitic vocabulary. Proceedings of the 26th International Workshop on Statistical Modelling (Copiformes, Valencia, Spain), p 26.
22. Currie TE, Meade A, Guillon M, Mace R (2013) Cultural phylogeography of the Bantu languages of Sub-Saharan Africa. Proc R Soc B 280:20130695.
23. Burling R (1983) The Sal languages. Linguist Tibeto-Burman Area 7:1–32.
24. Bradley D (1997) Tibeto-Burman languages and classification. Papers in Southeast Asian Linguistics, ed Bradley D (Pacific Linguistics, Canberra, ACT, Australia), pp 1–72.
25. Jacques G, Michaud A (2011) Approaching the historical phonology of three highly eroded Sino-Tibetan languages: Naxi, Na and Laze. Diachronica 28:468–498.
26. van Driem G (1997) Sino-bodic. Bull Sch Orient Afr Stud 60:455–488.
27. DeLancey S (2015) Morphological evidence for a central branch of Trans-Himalayan (Sino-Tibetan). Cah Linguist Asie Orient 44:122–149.
28. Thurgood G (2017) Sino-Tibetan: Genetic and areal subgroups. The Sino-Tibetan Languages, eds Thurgood G, LaPolla R (Routledge, London), pp 3–39.
29. van Driem G (1993) The Proto-Tibeto-Burman verbal agreement system. Bull Sch Orient Afr Stud 61:292–334.
30. Jacques G (2010) A possible trace of verbal agreement in Tibetan. Himalayan Linguist 9:41–49.
31. Greenhill SJ, et al. (2017) Evolutionary dynamics of language systems. Proc Natl Acad Sci USA 114:E8822–E8829.
32. Hellwig B (2011) A Grammar of Goemai (Mouton de Gruyter, Berlin).
33. Nichols J (1992) Language Diversity in Space and Time (Univ Chicago Press, Chicago).
34. Bickel B, Nichols J (2005) Inclusive/exclusive as person vs. number categories worldwide. Clusivity, eds Haspelmath M, Dryer MS, Gil D, Comrie B (Oxford Univ Press, Oxford), pp 94–97.
35. Diamond J, Bellwood P (2003) Farmers and their languages: The first expansions. Science 300:597–603.
36. Matisoff JA (2015) The Sino-Tibetan Etymological Dictionary and Thesaurus Project (Univ California, Berkeley).
37. Dodson J, et al. (2014) Oldest directly dated remains of sheep in China. Sci Rep 4:7170.
38. Zhang J, et al. (2010) Phytolith evidence for rice cultivation and spread in Mid-Late Neolithic archaeological sites in central North China. Boreas 39:592–602.
39. d'Alpoim Guedes J, et al. (2013) Moving agriculture onto the Tibetan plateau: The archaeobotanical evidence. Archaeol Anthropol Sci 6:255–269.
40. Kang L, et al. (2012) Y-chromosome O3 haplogroup diversity in Sino-Tibetan populations reveals two migration routes into the eastern Himalayas. Ann Hum Genet 76:92–99.
41. Wang LX, et al. (2018) Reconstruction of Y-chromosome phylogeny reveals two neolithic expansions of Tibeto-Burman populations. Mol Genet Genomics 293:1293–1300.
42. d'Alpoim Guedes JA, Lu H, Hein AM, Schmidt AH (2015) Early evidence for the use of wheat and barley as staple crops on the margins of the Tibetan Plateau. Proc Natl Acad Sci USA 112:5625–5630.
43. Bradley D (1979) Proto-Loloish (Curzon, London).
44. Rama T, List JM, Wahle J, Jäger G (2018) Are automatic methods for cognate detection good enough for phylogenetic reconstruction in historical linguistics? Proceedings of the North American Chapter of the ACL (N Am Assoc Computational Linguistics, Stroudsburg, PA), pp 393–400.
45. List JM, Greenhill S, Forkel R (2017) LingPy. A Python Library for Quantitative Tasks in Historical Linguistics (Max Planck Inst Sci Human History, Jena).
46. List JM, et al. (2019) Cross-Linguistic Transcription Systems (Max Planck Institute for the Science of Human History, Jena, Germany).
47. Sagart L, Xu S (2001) History through loanwords: The loan correspondences between Hani and Chinese. Cah Linguist Asie Orient 30:3–54.
48. Jacques G (2004) Phonologie et Morphologie du Japhug (rGyalrong). PhD thesis (Univ Paris VII–Denis Diderot, Paris).
49. Jacques G (2017) A reconstruction of Proto-Kiranti verb roots. Folia Linguist Hist 38:177–215.
50. List JM (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. J Lang Evol 1:119–136.
51. List JM (2017) A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. Proceedings of the EACL (European Assoc Computational Linguistics, Barcelona), pp 9–12.
52. Forkel R, et al. (2018) Cross-Linguistic data formats, advancing data sharing and re-use in comparative linguistics. Sci Data 5:1–10.
53. List JM, Cysouw M, Greenhill S, Forkel R (2018) Concepticon. A Resource for the Linking of Concept Lists (Max Planck Inst Sci Human History, Jena).
54. Huelsenbeck JP (2002) Testing a covariotide model of DNA substitution. Mol Biol Evol 19:698–707.
55. Nicholls GK, Gray RD (2008) Dated ancestral trees from binary trait data and their application to the diversification of languages. J R Stat Soc B 70:545–566.
56. Bouckaert R, et al. (2014) BEAST 2: A software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10:e1003537.
57. Gavryushkina A, Welch D, Stadler T, Drummond AJ (2014) Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration. PLoS Comput Biol 10:e1003919.
58. Rambaut A, Drummond A, Suchard M (2014) Tracer (v. 1.6). Available at beast.community/. Accessed April 5, 2018.
59. Maturana P, Brewer BJ, Klaere S, Bouckaert R (2017) Model selection and parameter inference in phylogenetics using nested sampling. arXiv:1703.05471. Preprint, posted March 16, 2017.
60. Drummond AJ, Ho SY, Phillips MJ, Rambaut A (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol 4:e88.
61. Yu G, Smith DK, Zhu H, Guan Y, Lam TTY (2017) ggtree: An R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol 8:28–36.
62. Nicholls GK, Ryder RJ, Welch D (2013) TraitLab: A MatLab package for fitting and simulating binary tree-like data.
63. Kelly LJ, Nicholls GK (2017) Lateral transfer in stochastic Dollo models. Ann Appl Stat 11:1146–1168.
64. Ryder R (2010) Phylogenetic Models of Language Diversification. DPhil dissertation (Univ Oxford, Oxford).
