
Linguistic Phylogeny with Bayesian Markov Chain Monte Carlo: The Case of Indo-European

by

Stephen Tyndall

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy () in the University of Michigan 2019

Doctoral Committee:

Professor Sarah Thomason, Chair
Associate Professor Steven Abney
Professor H. Baxter
Professor Benjamin Fortson

©Stephen Tyndall
[email protected]
ORCID iD: 0000-0001-7276-6695
2019

ACKNOWLEDGMENTS

First and foremost, I would like to thank Sally Thomason, without whose encouragement and support I’d never have completed this thesis. And thank you as well to the rest of the committee for providing advice and ideas. Their suggestions and help have been invaluable.

Thanks as well to my friends and colleagues, particularly Kate Sherwood, Terry Szymanski, Rob Gillezeau, Lauren Squires, Tayfun Bilgin, Jon Yip, and Kevin McGowan, who provided support, friendship, and motivation through this long process. Further thanks to the members of GLaM and GEO for all the social and professional aid, and to Bu and Ping for their calming presence through the final stages of this thesis.

Finally, thank you to my family, Karen Brichetto, Eric Tyndall, David Richardson, and Doug Tyndall, for all your support during graduate school. I could not have done this without you.

TABLE OF CONTENTS

Acknowledgments
List of Figures
List of Tables
Abstract

Chapter

1 Introduction
  1.1 Background Information
  1.2 Statistical Methods in Linguistics
  1.3 The Particular Problem of Indo-European (Computational) Phylogeny
  1.4 Dissertation Objectives
  1.5 Scope and Limitations
  1.6 Structure of the Dissertation
    1.6.1 Literature Review
    1.6.2 Methodology
    1.6.3 Results
    1.6.4 Conclusion
2 Literature Review
  2.1 Computational Phylogeny in Linguistics
    2.1.1 Prior Computational Phylogeny Work in Linguistics
  2.2 Models of the spread of the Indo-European Family
    2.2.1 Tradition Family Tree(s) of IE
    2.2.2 Non-Computational Dating Methods for the Breakup of PIE
    2.2.3 Models of the Breakup and Spread of Proto-Indo-European
    2.2.4 The Farming Model
    2.2.5 The Pastoralist Model
  2.3 Computational Phylogeny of Proto-Indo-European
    2.3.1 Gray and Atkinson
    2.3.2 Chang et al. 2005
    2.3.3 The Ringe and Warnow Method
    2.3.4 Computational Phylogeny and Estimating the Dates of Language Splits
3 Methodology
  3.1 Specifying Retentions with MRBAYES
  3.2 Capturing Phonological Changes as Characters for Bayesian MCMC
  3.3 Data Coding
4 Results
  4.1 A Brief Survey of the Results
  4.2 Summary of the Results
    4.2.1 Characteristics and Subdivisions of the Set of Swadesh Lists
    4.2.2 Kitchen Sink Results
  4.3 Six-Way Results
    4.3.1 Phonological Characters Only
    4.3.2 Lexical Characters Only
    4.3.3 Lexical and Phonological Characters
    4.3.4 Overall Consideration of the Six Conditions
  4.4 Replication of Gray and Atkinson 2003
    4.4.1 Replication without Root Node Specification
    4.4.2 Replication with Root Node Specification
  4.5 Discussion
    4.5.1 Root Node Selection
  4.6 Implications and Limitations
5 Conclusion
  5.1 Primary Benefits
  5.2 Secondary Benefits
  5.3 Future Work

Appendix

Bibliography

LIST OF FIGURES

2.1 Network from [Holden and Gray, 2006].
2.2 Network from [Greenhill and Gray, 2005]. This splitstree network graph of Austronesian demonstrates a strong tree-like signal in the data used to create the graph.
2.3 Network from [Bakker et al., 2011]. Splitstree network showing very weak treelike signal in source data.
2.4 Tree from Gray and Atkinson (2003)

4.1 Entire IELEX Lexical Tree
4.2 Ancient Languages, Phonology Only, No Retentions Specified
4.3 Ancient Languages, Phonology Only, Retentions Specified
4.4 Ancient Languages, Lexical Characters Only, No Retentions Specified
4.5 Ancient Languages, Lexical Characters Only, Retentions Specified
4.6 Ancient Languages, Lexical and Phonological Characters, No Retentions Specified
4.7 Ancient Languages, Lexical and Phonological Characters, Retentions Specified
4.8 Dyen List, No Specification of Root
4.9 Dyen List, Specification of Root

5.1 A splitstree network showing the relationships among the Mande languages

LIST OF TABLES

3.1 Manner Outcomes of PIE *p
3.2 Manner Outcomes of PIE *t
3.3 Manner Outcomes of PIE *ḱ

ABSTRACT

This study aims to improve the application of computational phylogeny to language families by testing a series of possible improvements in the structure and use of input data, particularly in order to help reconcile computational techniques with traditional historical linguistic family tree creation. The method used is Bayesian Markov Chain Monte Carlo (Bayesian MCMC), with the MRBAYES software package, replicating several prior computational phylogenetic studies of the Indo-European language family. The study examines two hypotheses about changes to the input data for the method: distinguishing innovations from retentions in the input data will improve the resulting trees, and providing a bias-free mechanism for incorporating sound change data in addition to lexical data will further improve both the output and the usability of Bayesian MCMC. The results of the study show small improvements over baseline under both hypotheses, demonstrating the utility of both kinds of additional data. Further work will include application of these new data types to other language families.

CHAPTER 1

Introduction

In this dissertation, I aim to improve the application of computational phylogeny to language families by testing a series of possible improvements in the structure and use of input data, particularly in order to help reconcile computational techniques with traditional historical linguistic family tree creation. The use of computational phylogeny for linguistics is both recent and controversial, particularly in its use by Gray & Atkinson (2003) to date the breakup of the Indo-European language family. I extend the method used by Gray & Atkinson in two ways: by including sound change data without pre-selecting particular changes, and by carefully distinguishing innovations from retentions in both lexical and sound change data.

1.1 Background Information

The process of constructing family trees (phylogenies) for languages is very similar to constructing phylogenies for biological species. As far back as the 19th century, Darwin observed parallels between languages and species: “The formation of different languages and of distinct species, and the proofs that both have been developed through a gradual process, are curiously parallel... We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation...” [Darwin, 1871, 57-58]. A number of techniques have been developed by biologists to produce species family trees and to date the breakup of populations. It should come as no surprise, then, that some of these techniques have been adapted to producing language phylogenies from linguistic data. In particular, a number of scholars, primarily non-linguists, have applied a computational and statistical technique called the Bayesian Markov Chain Monte Carlo (hereafter Bayesian MCMC) method to language families. This technique has been used with considerable success in biology, and some uses within linguistics have shown great promise in generating accurate family trees and dating the breakup of linguistic groups. Recently, the use of the technique to date the breakup of a particularly well-studied language family, Indo-European, has generated significant controversy among historical linguists.

The Indo-European language family has one of the longest histories of study in linguistic science. The family contains ten major sub-families, the earliest of which, Anatolian, appears on the stage of history with documents written in Hittite, ca. 1800 BCE, and the most recently attested of which, Albanian, is attested only from the 16th century CE. The languages of the family are widely distributed over Europe and Asia, ranging from Iceland in the west to western China.

1.2 Statistical Methods in Linguistics

In recent decades, the use of statistical methods in linguistics has allowed linguists to use and analyze large volumes of data in small amounts of time. The advent of these methods has been somewhat abrupt, as Steve Abney describes: “In the space of the last ten years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. In 1996, no one can profess to be a computational linguist without a passing knowledge of statistical methods... anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL [Association for Computational Linguistics] banquet” [Abney, 1996]. Writing in 1996, Abney discusses the resistance of other subdisciplines of linguistics to these techniques. More recently, however, statistical methods have caught on, even within historical linguistics.

In particular, the application of statistical methods to generating language family trees from vocabulary data has seen considerable use in the last decade, sometimes by linguists, often by biologists. The methods, however, are complex, and some of the discussion of these methods comes from scholars who lack full control or understanding of the techniques in question. In a discussion of some phylogenies of the Indo-European language family, Razib Khan suggests “At some point in the future I suspect all of this research will make recourse to Bayesian phylogenetics, but at this stage of the game even most people who use Bayesian phylogenetic packages don’t really understand how they work” [Khan, 2012].

1.3 The Particular Problem of Indo-European (Computational) Phylogeny

The Indo-European language family is one of the best-studied families in the world, but its origins are still under some debate. In particular, the spread of Indo-European languages through their eventual territory is a complex problem. Indeed, Diamond and Bellwood consider the breakup of Proto-Indo-European to be “the most intensively studied, yet still the most recalcitrant, problem of historical linguistics” [Diamond and Bellwood, 2003]. While this statement is somewhat hyperbolic, a significant amount of ink has been spilled over the problem in recent years.

The attested time depth of the Indo-European language family accounts for some of this uncertainty. Few language families are attested from as early as Indo-European, and fewer still have the quantity and variety of ancient documentation that Indo-European scholars have access to. Among major language families, only Semitic has more ancient documentation, but the languages of that family are much more similar to each other than the Indo-European languages are.

Indo-European is one of the largest language families, comprising some 445¹ living languages and numerous extinct languages, some with records dating back to the second millennium BCE. The family is divided into ten traditional subfamilies, presented here in the order of appearance in the historical record:

1. Anatolian - attested from ca. 1700 BCE

2. Greek - from ca. 1400 BCE

3. Indo-Iranian - from ca. 1000 BCE

4. Italic - from ca. 700 BCE

5. Celtic - from ca. 500 BCE

6. Germanic - from ca. 100 BCE

7. Tocharian - from ca. 500 CE

8. Armenian - from ca. 600 CE

9. Balto-Slavic - from ca. 800 CE

10. Albanian - from ca. 1500 CE

¹ http://www.ethnologue.com/subgroups/indo-european

There are a few Indo-European languages, such as Phrygian and Thracian, that may or may not belong to any of these subfamilies. The scant remaining material from them makes their position in the family uncertain.

Standard trees of Indo-European have Proto-Indo-European at the root, and ten independent branches corresponding to the subfamilies above. Apart from the two joined subfamilies, i.e. Indo-Iranian and Balto-Slavic, no larger subgroupings of the IE subfamilies listed above have been solidly established. Some major isoglosses do seem to split the family, such as the centum-satem division, in which the subfamilies are split according to their treatment of the Proto-Indo-European palatal *ḱ, but few such splits have significant evidence that is not contradicted by other possible splits.

Two larger groupings have gotten considerable traction in the past. The first, Italo-Celtic, a subgroup based upon some phonological and morphological similarities, is increasingly accepted by historical linguists. The second, Indo-Hittite, postulates that the Anatolian subgroup is a sister-subgroup to Proto-Indo-European, both of which are descended from Proto-Indo-Hittite. However, under the presumption that Anatolian was the first branch to break off from Proto-Indo-European, this new subgrouping differs in little other than terminology from the standard model. The tree

Proto-Indo-European
├── Anatolian
└── Non-Anatolian PIE

only differs from

Proto-Indo-Hittite
├── Anatolian
└── Proto-Indo-European

in the names of the nodes in the tree. This new nomenclature does emphasize the differences between Anatolian and the other known branches, however.

The most accepted model of Indo-European, then, is a ten-subfamily model with flat branching, with all subfamilies descended from Proto-Indo-European.

1.4 Dissertation Objectives

In this dissertation, I examine computational approaches to linguistic phylogeny with respect to both tree structure and dating methods. With this dissertation, I aim to accomplish two major tasks:

• The primary task is to improve the application of Bayesian MCMC techniques for creating phylogenies of language families, using Indo-European as a test case. To date, Bayesian MCMC phylogenies of the Indo-European language family have given problematic results, as discussed below. I put forth two hypotheses regarding methodological improvements to the Bayesian MCMC methods currently in practice:

1. Carefully distinguishing innovations from retentions in the input data will increase the accuracy of Bayesian MCMC methods.

2. Using phonological change data in a manner free of selection bias will prove less vulnerable to errors due to borrowing than using strictly vocabulary.

These two hypotheses require six distinct conditions:

1. Control: cognate-tagged vocabulary only, no distinction between innovations and retentions.

2. Cognate-tagged vocabulary only, retentions specified in the input.

3. Phonological data only, no distinction between retentions and innovations.

4. Phonological data only, retentions specified in the input.

5. Vocabulary and phonological data, no distinction between retentions and innovations.

6. Vocabulary and phonological data, retentions specified in the input.

These tests will demonstrate the utility of this novel method of structuring phonological changes for computational phylogeny and provide greater linguistic sophistication to methods that have so far been somewhat primitive. These tests should improve the Bayesian MCMC method’s use in the study of Indo-European, and test both the resulting structure of the family tree and the dates for the breakup of the Proto-Indo-European ancestral language against the well-established results of the comparative method and linguistic archaeology.

• The secondary, more service-oriented task is to explain the details of the Bayesian Markov Chain Monte Carlo (Bayesian MCMC) method in a way that permits a clearer analysis of the method and its results than has been available in the literature. The technique itself falls outside the traditional purview of most historical linguists, and much of the discussion of the method is problematic. An accessible explanation of the method aimed at an audience of historical linguists will be a useful contribution to the field.

A third benefit of this dissertation is the availability, on GITHUB, of a number of pieces of software, including scripts that take CSV files and output NEXUS files for use with the MRBAYES, Neighbornet, and BEAST phylogeny packages. Such tools are not currently available elsewhere, so each time someone new attempts a computational phylogeny project for linguistics, they must write their own custom solution to an already-solved problem. By making these tools available, I will at the very least save time for future projects, and perhaps even permit some linguists with weak or non-existent programming skills to engage in computational phylogeny research.
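To give a sense of what a CSV-to-NEXUS conversion involves, the following is a minimal sketch and not the software released with this dissertation. The input layout (a header row, then one row per language with cells 1, 0, or ?), the function name csv_to_nexus, and the choice of the restriction datatype are all assumptions made for illustration.

```python
import csv
import io


def csv_to_nexus(csv_text: str) -> str:
    """Convert 'language,char1,char2,...' rows into a NEXUS data block.

    Assumed (hypothetical) input layout: a header row, then one row per
    language whose cells are '1' (present), '0' (absent) or '?' (missing).
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    nchar = len(header) - 1
    for row in data:
        if len(row) - 1 != nchar:
            raise ValueError(f"wrong number of characters for {row[0]}")
    width = max(len(row[0]) for row in data)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(data)} nchar={nchar};",
        # Binary presence/absence data; the datatype is an assumption here.
        "  format datatype=restriction missing=? gap=-;",
        "  matrix",
    ]
    for row in data:
        # Pad language names so the character strings line up.
        lines.append(f"    {row[0].ljust(width)}  {''.join(row[1:])}")
    lines += ["  ;", "end;"]
    return "\n".join(lines)


# Made-up example data: three languages, three binary characters.
example = """language,cog_1,cog_2,cog_3
Hittite,1,0,?
Latin,1,1,0
Gothic,0,1,1
"""
nexus = csv_to_nexus(example)
print(nexus)
```

Language names containing spaces would need quoting or replacement before being written to the matrix; the sketch does not handle that case.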

1.5 Scope and Limitations

In order to make the project herein directly comparable to prior studies, the data under consideration is largely identical to that used in Gray & Atkinson (2003) and later studies, i.e. the large tagged lexicon compiled by Dyen et al. (described in detail in the Methods chapter below), in combination with the data compiled by the IELEX project. The primary goal in this choice of data is ease of replication. This study will then be directly comparable not only to the Gray & Atkinson project, but also to a number of other studies. The Dyen et al. list of Indo-European cognates has become something of a standard choice for comparing distinct methods of computational phylogeny [Nichols and Warnow, 2008]. The data used in this dissertation is from 87 Indo-European languages, including all those listed in Dyen et al., plus additional Anatolian and Tocharian material from Ringe et al. 2002.

1.6 Structure of the Dissertation

The dissertation is organized as follows:

1.6.1 Literature Review

Chapter 2 first reviews the development and use of computational phylogeny for linguistics, with a special focus on prior attempts at phylogenies of the Indo-European family. The chapter describes how the Bayesian MCMC method works to produce highly probable phylogenetic trees, and reviews a number of other, related uses of computational phylogeny within linguistics. This chapter also lays out the two major competing models of the breakup of Proto-Indo-European, the Farming Model and the Pastoralist Model, and presents the main evidence for each. I then discuss prior computational phylogenies of Indo-European. Additionally, I discuss a few formerly popular methods for dating language family splits which, like computational phylogenetic methods, are statistical in nature. These are presented along with the evidence that eventually led to their widespread abandonment.

1.6.2 Methodology

Chapter 3 sets out the methodology of the current study. In this chapter, I describe the details of the Bayesian MCMC method and the software packages that implement it. Also here, the selection procedures and structures of the source data are discussed. I describe the use of corresponding phonemes as a way of capturing sound changes as input data for Bayesian MCMC while avoiding the cherry-picking of individual sound changes that has been used in the past. I introduce distinctions between innovations and retentions in the input data for Bayesian MCMC phylogeny. Methods and code for selecting subsets of the cognate data set (Dyen et al.) are also provided here, as is code for converting these data sources into formats suitable for computational phylogeny.

1.6.3 Results

Chapter 4 sets out the results of the current study: the consensus trees from all six tests laid out above. I also discuss the fit of the resulting consensus trees to the competing models of the breakup of Proto-Indo-European.

This chapter also presents a discussion of how this computational phylogeny experiment fits into the larger picture of linguistic phylogeny, and how the examination of innovations and retentions in computational methods adds to the large number of computational tools available to linguists. I also discuss the tension between specificity and generalizability, both in this project and across computational phylogeny.

1.6.4 Conclusion

Chapter 5 concludes the dissertation and presents other possible uses for the methods presented herein. Further, I propose a number of ways this work may be extended in the future.

CHAPTER 2

Literature Review

This chapter first reviews the development and use of computational phylogeny for linguistics, with a special focus on prior attempts at phylogenies of the Indo-European family. The chapter describes how the Bayesian MCMC method works to produce highly probable phylogenetic trees, and reviews a number of other, related uses of computational phylogeny within linguistics. This chapter also lays out the two major competing models of the breakup of Proto-Indo-European and presents the main evidence for each. Additionally, defunct lexicostatistical methods for dating language splits are presented along with the evidence that eventually led to their abandonment. The chapter contains three sections:

1. Computational Phylogeny in Linguistics

2. Models of the Breakup of Proto-Indo-European

3. Prior Computational Phylogenies of Indo-European

2.1 Computational Phylogeny in Linguistics

While historical linguistics has not traditionally gotten on particularly well with computers, a number of recent studies have used computational phylogenetic methods to help date language families, build family trees, and resolve other problems in the study of language change. Advances in the study of phylogeny, primarily from biology and bioinformatics, have led to a number of important new tools and methods for understanding the structure of the relationships between related languages. These methods fall generally into two broad categories, based on the resulting structure they produce to represent the data: tree methods and network methods. Both of these produce graphs, which are sets of nodes connected by edges. Graphs can be directed, in which case each edge only points in one direction, from one node to another, or undirected, if nodes are connected without regard to direction. Directed graphs can be rooted, i.e. all nodes in the graph are descended by directed edges from a single node, the root.

Network methods produce graphs that indicate relationships among languages, but without specifying a particular root. When used for historical linguistics, these methods produce graphs where individual languages are endpoints, and a series of edges connects these endpoints. Cycles, also called reticulations, in the graphs indicate conflicting data. These methods are often used when evidence is expected to be ambiguous, since the output presents ambiguities or conflicts in an easy-to-see way. See figure 2.1 for an example of such a network.

Figure 2.1: Network from [Holden and Gray, 2006].

In figure 2.1, the cycle in the center shows that there are two possible ways to divide the languages, both of which show two subgroups of two languages. The edges labeled A show one possible division, into a Swati-Ndebele subgroup and a Ngoni-Zulu subgroup, and the B edges show the other division. As the A edges are longer, more of the evidence points to the first subgrouping.

Network methods can provide useful insights into the structure of language families or into the nature of the signal in the data used to construct phylogenies of these families. A graph with fewer and smaller cycles (i.e. boxes, as in the above diagram) demonstrates a more treelike signal than a graph with many large cycles. A large number of cycles may indicate a large amount of borrowing among members of the family, or that the language family itself developed in a manner not particularly treelike. Compare figures 2.2 and 2.3 below for examples of networks showing strong and weak treelike signals, respectively.

Figure 2.2: Network from [Greenhill and Gray, 2005]. This splitstree network graph of Austronesian demonstrates a strong tree-like signal in the data used to create the graph.
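The contrast between a tree and a network with cycles can be made concrete with a small sketch. It counts the independent cycles in an undirected graph (edges minus nodes plus connected components), which is zero for a tree. This count is used here only as an illustration; it is not the measure computed by the splitstree software, and the node labels and edge lists are made up.

```python
def count_independent_cycles(edges):
    """Return E - V + C for an undirected graph given as (u, v) pairs."""
    nodes = {n for edge in edges for n in edge}
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    components = len(nodes)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return len(edges) - len(nodes) + components


# A tree over four languages: two internal nodes, no cycles.
tree_edges = [("Swati", "x"), ("Ndebele", "x"), ("x", "y"),
              ("Ngoni", "y"), ("Zulu", "y")]

# A network with one central box: four internal nodes joined in a square,
# each carrying one language (a hypothetical arrangement).
box_edges = [("p", "q"), ("q", "r"), ("r", "s"), ("s", "p"),
             ("Swati", "p"), ("Ndebele", "q"), ("Zulu", "r"), ("Ngoni", "s")]

print(count_independent_cycles(tree_edges))
print(count_independent_cycles(box_edges))
```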

Figure 2.3: Network from [Bakker et al., 2011]. Splitstree network showing very weak treelike signal in source data.

Tree methods produce rooted trees with branches. When these methods are used for the purposes of historical linguistics, the root of the resulting tree is the proto-language, and the leaf nodes of the tree are the individual languages of the language family. Many of these trees represent time as well, with the length of the edge between two points proportional to the amount of time between them. These methods most closely approximate the results of traditional historical linguistic science, and are more common. The work discussed below uses these methods, as does my study.

Once tree methods have been chosen, the two major issues left are the tree creation method and the selection of the characters used to evaluate the created trees. The most common method used to create phylogenies of language families is called Bayesian Markov Chain Monte Carlo, which is one of a large class of algorithms used for sampling over a probability distribution. The process of creating a phylogenetic tree is not typically understood as ‘sampling,’ which implies selection from a large number of pre-existing trees, rather than the construction of a single, correct tree. Bayesian MCMC recasts the problem as one of selecting a large sample of possible trees and computing the probability of these trees, given some data.

These methods, in which the optimal result is slowly approximated by evaluating many sub-optimal instances instead of computing the optimal result directly, have a long history within statistics and computing, beginning just after World War II. Monte Carlo methods, where a large number of random trials is used to obtain results, were the first of this class of methods to be invented. Initially, these methods were used primarily in particle physics, notably by Enrico Fermi, Stan Ulam, and John von Neumann [Andrieu et al., 2003]. Ulam is typically credited as the inventor of Monte Carlo simulation methods, and he claims to have invented them while playing cards and recovering from an illness:

“The first thoughts and attempts I made to practice [the Monte Carlo method] were suggested by a question which occurred to me in 1946 as I was convalescing from an illness and playing solitaires. The question was what are the chances that a Canfield solitaire laid out with 52 cards will come out successfully? After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method than “abstract thinking” might not be to lay it out say one hundred times and simply observe and count the number of successful plays. This was already possible to envisage with the beginning of the new era of fast computers, and I immediately thought of problems of neutron diffusion and other questions of mathematical physics...” quoted in [Eckhardt, 1987].
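The idea Ulam describes, replacing a combinatorial calculation with repeated random trials, can be illustrated with a much simpler card question than Canfield solitaire. The sketch below is only an illustration with an assumed question: it estimates how often at least one card of a shuffled 52-card deck ends up in its original position. The function name and the number of trials are arbitrary choices.

```python
import random


def estimate_fixed_point_probability(trials: int, seed: int = 0) -> float:
    """Estimate P(at least one card stays in place after a shuffle)."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    deck = list(range(52))
    hits = 0
    for _ in range(trials):
        shuffled = deck[:]
        rng.shuffle(shuffled)
        # A "success" is any card that is still at its original index.
        if any(card == position for position, card in enumerate(shuffled)):
            hits += 1
    return hits / trials


estimate = estimate_fixed_point_probability(20_000)
print(round(estimate, 3))
```

For this particular question the exact answer is known to be very close to 1 - 1/e, roughly 0.632, so the quality of the estimate can be checked; for questions like Ulam's, the simulation is the practical route.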

Bayesian MCMC is one such method, where a large number of random trials is used to calculate an optimal result. The name Bayesian Markov Chain Monte Carlo has three parts:

• Bayesian - the use of Bayes’s theorem:

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$

to evaluate the probability of trees

• Markov Chain - the probabilistic transition from each sampled tree to the next, which depends only on the current tree

• Monte Carlo - the selection of a large number of random initial trees from which the method proceeds
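As a worked illustration of the Bayesian component, write T for a candidate tree and D for the observed character data (these symbols are introduced here for illustration only). Bayes’s theorem gives the probability of the tree given the data, and comparing two candidate trees T and T′ then only requires a ratio:

```latex
P(T \mid D) = \frac{P(T)\,P(D \mid T)}{P(D)},
\qquad
\frac{P(T' \mid D)}{P(T \mid D)}
  = \frac{P(T')\,P(D \mid T')}{P(T)\,P(D \mid T)}
```

The term P(D), which would require summing over every possible tree, cancels in the ratio. This is why a method that only ever compares the current tree to a proposed one never needs to compute it.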

The steps below follow a tree through the initial iteration of the Bayesian MCMC method:

1. A random tree A is created.

2. Bayes’s rule is used to evaluate the probability of tree A, i.e. P(A)

3. A new tree B is created by modifying tree A in some small, random way.

4. Bayes’s rule is used to calculate P(B).

5. • If P(B) is greater than P(A) (i.e. the new tree is more likely to represent the data than the old tree), then B replaces A as the current tree, and the process repeats from step 3.

• If P(A) is greater than P(B) (i.e. the new tree is less likely to represent the data than the old tree), then there are two possible outcomes: continue with B or keep A. The choice is made randomly; in the standard Metropolis scheme, B is accepted with probability P(B)/P(A), and A is kept otherwise.
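The steps above can be sketched on a toy problem. With four languages there are only three possible unrooted tree shapes, so the sampler below walks among three states instead of a realistic tree space. The scores are made-up numbers standing in for tree probabilities, and the acceptance rule is the standard Metropolis rule (accept a worse proposal with probability equal to the ratio of the two scores), which is an assumption of this sketch and not a description of the MRBAYES internals.

```python
import random
from collections import Counter

# Hypothetical unnormalised scores for the three possible unrooted
# topologies of four languages A, B, C, D (made-up numbers).
SCORES = {"AB|CD": 7.0, "AC|BD": 2.0, "AD|BC": 1.0}


def metropolis(steps: int, seed: int = 0) -> Counter:
    rng = random.Random(seed)
    current = rng.choice(sorted(SCORES))  # step 1: random starting tree
    visits = Counter()
    for _ in range(steps):
        # Step 3: propose a different topology (a symmetric proposal).
        proposal = rng.choice([t for t in sorted(SCORES) if t != current])
        # Steps 2 and 4: compare the scores of the two trees; only the
        # ratio matters, so no normalising constant is needed.
        ratio = SCORES[proposal] / SCORES[current]
        # Step 5: always accept an improvement; otherwise accept the
        # worse tree with probability equal to the ratio.
        if ratio >= 1 or rng.random() < ratio:
            current = proposal
        visits[current] += 1
    return visits


visits = metropolis(50_000)
share = {tree: count / 50_000 for tree, count in visits.items()}
print(share)
```

Over a long run, the share of steps spent on each topology approaches its share of the total score (here 0.7, 0.2, and 0.1), which is the sense in which the chain samples from the target distribution.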

As this process is repeated over and over again, the trees generated grow increasingly likely to represent the data. The final step provides a way for these algorithms to move from a higher-probability tree to a lower-probability tree so as to avoid becoming trapped in local maxima and thereby missing an area of the search space with even greater probability.

The visualization below demonstrates the rapid convergence of three random independent starting points.

The three independent starting points (all to the right) begin to converge on the eventual distribution. The three chains begin to be indistinguishable as they converge, at which point they begin sampling only from the high probability points in the search space.

Not all points are kept - the initial points are very far from the target distribution. In most Markov Chain Monte Carlo methods, Bayesian included, some large number of the initial samples, as many as a half million in many cases, is discarded, as these sample points are not probable samples from the target distribution. This discarding process is called ‘burn in’. Once this burn-in process is completed, the remaining trees are combined into a final consensus tree, in which the edges are the mode of the edges in the remaining post-burn-in trees.
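Burn-in and consensus building can be sketched as follows. The sketch substitutes a simple majority-rule summary for the consensus procedure described above: it discards the first samples and keeps every grouping of languages that appears in more than half of the remaining trees. Representing a tree as a set of clades, the function name, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def majority_clades(samples, burn_in, threshold=0.5):
    """Return clades found in more than `threshold` of post-burn-in trees.

    Each sampled tree is represented (an assumption of this sketch) as a
    set of clades, and each clade as a frozenset of language names.
    """
    kept = samples[burn_in:]  # discard the burn-in samples
    if not kept:
        raise ValueError("burn-in discards every sample")
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: n / len(kept) for clade, n in counts.items()
            if n / len(kept) > threshold}


# Made-up sample of six trees over three languages.
greek_latin = frozenset({"Greek", "Latin"})
latin_irish = frozenset({"Latin", "Old Irish"})
greek_irish = frozenset({"Greek", "Old Irish"})
samples = [{greek_irish}, {greek_latin}, {latin_irish},
           {latin_irish}, {latin_irish}, {greek_latin}]

support = majority_clades(samples, burn_in=2)
print(support)
```

With the first two samples discarded, the Latin and Old Irish grouping appears in three of the four remaining trees, so it is the only clade reported, with a support of 0.75.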

2.1.1 Prior Computational Phylogeny Work in Linguistics

A number of phylogenetic studies have been used to test hypotheses about language diver- gence in history. Many such studies, however, are performed by statisticians and biologists, with little regard for the particulars of linguistic theory. As Nichols and Warnow state “Phy- logeny estimation methods based on unrealistic models of language evolution are unlikely to produce accurate estimations of evolutionary history, and simulation studies based on

15 unrealistic models of language evolution are unlikely to be informative of performance on real data. Thus, a critical understanding of the models underlying estimation methods and simulation studies is essential to a proper interpretation of phylogenetic analyses and simulation studies”[Nichols and Warnow, 2008]. To this end, I examine here the relevant studies of linguistic phylogeny and language breakup and spread, organizing them the types of characters used in the analyses, starting with lexical characters, the most common type, then phonological and morphological characters, which are much rarer, and finally other types of characters, such as typological features. Most computational phylogeny in linguistics uses lexical items as the characters in question, for a number of reasons. First, such data is usually readily available and inter- pretable by non-specialists, as dictionaries are frequently easy to access online. Second, there is reasonable agreement as to what as a shared character. When two languages share a cognate word, they’re said to have the same character for the purpose of discerning their phylogeny, i.e. cognate words are the analogs to shared genes. Greenhill and Gray use lexical characters to examine two major different expansion hypotheses of the Austronesian family, the ‘Express Train’ model, which has Austrone- sian spreading from Taiwan south to the Philippines and thence west to Java and Sumatra and east to New Guinea and the Solomon islands, and the ‘Entangled Bank’ model, which considers Austronesian to have appeared in Melanesia and then expanded sporadically in all directions, with much continuous contact, as well as a few models intermediate be- tween these[Greenhill and Gray, 2005]. The first model predicts a strong treelike structure to the language family, and very distinct language groups, as the ‘express train’ precludes significant contact effects, while the Entangled Bank predicts a much less tree-like signal. 
Greenhill and Gray used lexical characters, more than 5000 cognate sets from 77 Austronesian languages, for two analyses, the first a Bayesian MCMC analysis aimed at finding a traditional tree, and the second a SplitsTree network analysis intended to examine the degree to which the data was treelike. The tree they produced agrees closely with the traditional phylogeny of the family [Blust, 1999]. The strength of the treelike signal, they claim, rules out the ‘Entangled Bank’ model of the spread of the family. However, some weaknesses in the data prevent them from making strong claims about the intermediate hypotheses they discuss.

Holden uses such methods to show that the spread of Bantu languages across sub-Saharan Africa matched archaeological evidence of the spread of farming techniques in the same area from the 3rd millennium BCE to the first millennium CE [Holden, 2002]. Likewise, other such studies have examined the internal structure of the Bantu language family [Rexová et al., 2006].

A few studies have used phonological or morphological characters as input for computational phylogeny, such as [Ringe et al., 2002], discussed extensively below. Sicoli and Holton use phonological, morphological, and typological characters to examine the Dene-Yeniseian hypothesis and the hypothesized spread of the source culture through Russia, Alaska, and Canada [Sicoli and Holton, 2014].

2.1.1.1 Phylogeny for other Linguistic Goals

Phylogenetic methods have been applied to examine methodological problems within language phylogeny. One such study relates the likelihood of replacement of a word over time to its frequency of use [Pagel et al., 2007]. Greenhill et al. (2009) examine the effects of borrowing on the results of computational methods for phylogeny [Greenhill et al., 2009]. Another study by Greenhill et al. (2010) examines the rate of change of typological characteristics in language families [Greenhill et al., 2010].

Within Indo-European, a few early computational phylogenetic studies have attempted to find evidence for internal subgrouping in the family. Rexová et al. found a few possible larger subgroups, e.g. a Germano-Celto-Italic grouping [Rexová et al., 2003]. Ringe et al. propose a method for computational phylogeny and test it using archaic Indo-European languages with lexical, phonological, and morphological data, a project discussed in more detail below [Ringe et al., 2002]. The use of these methods, in particular by Gray and Atkinson, continues to be very controversial [Gray et al., 2011].

2.2 Models of the spread of the Indo-European Language Family

When discussing the date of a language, it’s important to distinguish between the time period of the existence of the language itself and the dates of the separations of the various daughter languages. Proto-Indo-European must have existed as a unitary language for some period of time, and no evidence about the earliest time the language existed is available. Thus, for the purposes of this study, the date of Proto-Indo-European will refer to the last date before the first known subfamily splits off from the proto-language. This is somewhat artificial, of course, since there was no specific day on which the peoples who would eventually be speaking Anatolian (presumed to be the first subfamily to break off) suddenly switched from Proto-Indo-European to Proto-Anatolian. This definition, however, is the best available, and is implicitly or explicitly used by most scholars.

2.2.1 Traditional Family Tree(s) of IE

The traditional phylogeny of the Indo-European language family has Proto-Indo-European as the root node, and ten daughter branches, representing the ten traditional sub-families of Indo-European.

2.2.2 Non-Computational Dating Methods for the Breakup of PIE

2.2.2.1 Linguistic Archaeology

Dating by linguistic archaeology is a technique that relates the presence of reconstructable terms in a proto-language to known natural or archaeological facts. As Calvert Watkins states:

When we have reconstructed a proto-language, we have also necessarily established the existence of a prehistoric society, a speech community that used that proto-language. The existence of Proto-Indo-European presupposes the existence, in some fashion, of a society of Indo-Europeans. Language is intimately linked to culture in a complex fashion; it is at once the expression of culture and a part of it. Especially the lexicon of a language - its dictionary - is a face turned toward culture. Though by no means a mirror, the lexicon of a language remains the single most effective way of approaching and understanding the culture of its speakers. As such, the contents of the Indo-European lexicon provide a remarkably clear view of the whole culture of an otherwise unknown prehistoric society. The evidence that archaeology can provide is limited to material remains. But human culture is not confined to material artifacts. The reconstruction of vocabulary can offer a fuller, more interesting view of the culture of a prehistoric people than archaeology precisely because it includes nonmaterial culture [Watkins, 1969].

In arguments about the location of the Indo-European homeland, plant and animal vocabulary items have often been used as a source of evidence. For instance, the fact that a word for ‘oak tree’, *perkʷ-, must be reconstructed for Proto-Indo-European means that Proto-Indo-European must have been spoken in an area in which oak trees were present. Likewise, the presence of the word for ‘wolf’, *wl̥kʷos, in Proto-Indo-European indicates that the Proto-Indo-European speakers must have lived where wolves were also present.

A large list of reconstructed vocabulary points to aspects of Proto-Indo-European culture. Agricultural terms, such as *puhₓro- ‘wheat’ and *rughi- ‘rye’, show familiarity with agricultural techniques, and terms for domestic animals and their products, such as *ghaid- ‘goat’ and *g(a)lakt- ‘milk’, point to the pastoral nature of Proto-Indo-European society¹ [Campbell, 2004, 381-384]. This method, in concert with archaeological data about the dates of domestication for horses and cattle and the invention of wheeled vehicles and associated technologies (both ca. 4000 BCE), has traditionally provided the most convincing and most widely-accepted date for the breakup of Proto-Indo-European.

2.2.2.2 Glottochronology

Glottochronology was one of the earliest rigorous attempts to apply mathematical methods to dating language breakup. The method proposes a constant average rate of turnover in core vocabulary, i.e. a Swadesh list from a particular language at time t1 will have some different vocabulary from a Swadesh list of the same language at time t2, given that t1 and t2 are far enough apart [Lees, 1953]. Once the rate of turnover has been found, simply looking at two Swadesh lists, one from a daughter language and one from its proto-language, and counting the number of cognate forms in these two lists should provide enough data to decide how much time has passed between the breakup of the proto-language and the daughter language in question.

A Swadesh list is a list of basic terms, such as sun, water, man, woman, stone, blood, and fire, believed to be universally present in natural languages and less likely to be borrowed than other terms; see [Swadesh, 1952] and [Swadesh, 1955] for more information. While Swadesh lists have proven useful for historical linguistics and other comparative work, there is no theoretical background for their construction, and there are a number of different versions.

Lees uses the following equation to calculate the time depth of a split between a daughter language and its proto-language:

¹ For a much larger list of terms, see Campbell (2004), pp. 381-393.

\[ t = \frac{\log F_s}{2 \log k} \]

In the equation, t is the time depth in millennia of the divergence between two languages, F_s is the fraction of cognate vocabulary shared between the two languages, and k is the assumed universal rate of vocabulary retention per millennium. To calculate k, Lees used 13 languages² and prepared a pair of Swadesh lists for each language: one list from an older stage of the language, and one from a more recent stage. He then observed the fraction of words that were replaced in the time between the older stage and the more recent stage. From these lists, he concluded that on average languages retain about 81% of their core vocabulary after a millennium. Thus, with that constant and the equation above, it was in principle possible to calculate the date of divergence for any daughter language from its proto-language using just a pair of Swadesh lists.

There are a significant number of problems with this method, however. First, there is no reason to assume that lexical replacement happens at a fixed rate, or that the rate is universal. Studies have shown that there’s significant variation in the rates of vocabulary replacement, e.g. [Bergsland and Vogt, 1962]. Bergsland and Vogt showed that within North Germanic, retention rates of vocabulary are both higher and more variable than the glottochronological method allows. As a consequence, glottochronological dating of language divergence is unlikely to produce accurate results.

Glottochronology has therefore been largely abandoned as a method for dating language split. Some other lexicostatistical techniques have persisted, but these are outside the scope of this discussion.
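As a worked illustration of the glottochronological equation above, the following minimal Python sketch computes a time depth from a shared-cognate fraction. The function name and the 60% input are hypothetical choices made for illustration; only the formula and the 81% retention figure come from the discussion above. The factor of 2 in the denominator reflects the assumption that both lineages lose vocabulary independently at the same rate.

```python
import math


def time_depth_millennia(shared_fraction: float, retention_rate: float = 0.81) -> float:
    """Glottochronological time depth: t = log(F_s) / (2 * log(k)).

    shared_fraction -- F_s, fraction of Swadesh items that are cognate
    retention_rate  -- k, fraction of core vocabulary retained per millennium
    """
    if not 0.0 < shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0.0 < retention_rate < 1.0:
        raise ValueError("retention_rate must be in (0, 1)")
    return math.log(shared_fraction) / (2.0 * math.log(retention_rate))


# Hypothetical input: two languages sharing 60% of their Swadesh items.
print(round(time_depth_millennia(0.60), 2), "millennia")
```

Because the result depends entirely on k, even a small change in the assumed retention rate shifts the estimated date substantially, which is one way to see why variable retention rates undermine the method.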

² English, Spanish, French, German, Coptic, Athenian, Cypriote, Chinese, Swedish, Italian, Portuguese, Rumanian, and Catalan

2.2.3 Models of the Breakup and Spread of Proto-Indo-European

I follow [Gray and Atkinson, 2003], [Atkinson and Gray, 2006], and [Garrett, 2012] in collapsing the many conceptions of the spread of Indo-European into Europe and Asia into two main models, the farming model and the pastoralist model:

2.2.4 The Farming Model

The farming model connects the spread of Indo-European with the spread of agriculture. In particular, farming techniques came from Anatolia (modern Turkey) into Europe ca. 7000-6000 BCE. Renfrew has proposed that Indo-European speakers brought these farming techniques, providing much comparative evidence, and showing that the spread of language with farming is a known process [Renfrew, 1990].

Farming techniques and communities entered Europe from Anatolia by the 6th millennium BCE. In particular, the earliest farming communities in Europe appeared in Greece and Macedonia and appear to have been established by colonists from Anatolia, since these new communities are very similar to the early Anatolian farming communities at Çatal Hüyük and Hacilar, both in southern Turkey. These communities cultivated emmer and einkorn wheat and raised pigs, just as the later sites in Greece would [Bogucki, 1996]. Indeed, these new communities in Greece were probably not created by local populations, as evidence shows that they were “established by colonists from points to the east and that indigenous peoples’ involvement in these communities was minimal” [Bogucki, 1996, 244]. From Greece, these techniques spread through the remainder of Europe throughout the Neolithic period.

Under this model, Anatolia is taken to be the homeland of the Indo-European speaking peoples, or at least the point of entry for Indo-European speaking peoples westward into Europe and then north and east, and south into the Indian subcontinent. The language family, according to this hypothesis, spread with both colonization and indigenous adoption of these crops and techniques.

Some typological data shows similar spread of languages with farming methods [Diamond and Bellwood, 2003]; below are three of their fifteen examples:

• The Bantu languages spread with agriculture in western Africa from ca. 2000 BCE. See also [Huffman, 1982].

• Arawakan languages spread with farmers’ colonies from the Orinoco river through the Bahamas from ca. 400 BCE.

• Uto-Aztecan languages spread north with Mesoamerican crops and farming techniques, though this example is somewhat confounded by the fact that these Uto-Aztecan speakers became hunter-gatherers, i.e. gave up farming, when they reached the southwestern landscape that was less suitable for farming than their homelands.

Typological evidence, however, can only point to parallels and cannot provide evidence for what happened in any particular historical instance. The fact that some language families have spread with farming techniques does not mean that Indo-European must have spread with farming techniques. For similar reasons, typological arguments about the structure of Indo-European, for instance the Glottalic Theory of Proto-Indo-European stop consonants, have failed to gain acceptance among historical linguists. In general, most historical linguists are skeptical of purely typological arguments.

Some computational techniques, discussed in more detail below, have seemed to support this model of the spread of Indo-European.

2.2.5 The Pastoralist Model

The pastoralist model of Indo-European expansion takes the view that Indo-Europeans had horses, cattle, and wheeled vehicles, and these technologies allowed them to expand from their original homeland³ [Mallory, 1989]. Proponents of this model of Indo-European expansion usually assume a breakup of Proto-Indo-European at ca. 4500-4000 BCE, i.e. at least 2000 years after the spread of farming into Europe. Evidence for this model is strong. Many words referring to herding, horsemanship, and vehicles must be reconstructed for Proto-Indo-European:

2.2.5.1 Words and Roots for Animal-Keeping

• *Hₓwl̥H₁neH₂ ‘wool’

• *H₂eǵ- ‘drive or lead’ yields Lat. agō, Gk. agō, both ‘I lead (an animal)’

• *eḱwos ‘horse’ yields Skt aśvas, Lat. equus, Gk. hippos, all ‘horse’

• *gʷow- ‘cattle’ yields Gk. bous, Skt gaus, ‘cow’

• *yug-´ ‘yoke’ yields Gk zugon, Skt yugam, Lat. iugum, English yoke

2.2.5.2 Words for Vehicles

• *kʷekʷlos ‘wheel’ yields Skt cakra, Gk kuklos, English wheel, all meaning ‘wheel’

• *H₂eḱs ‘axle’ yields English axle

• *weǵh- ‘ by vehicle’ yields English wagon, Skt vahati ‘transports’, Latin vehō ‘I go by vehicle’

• *HₓiHₓso ‘harness pole, thill’

• *dhrHₓ ‘harness’

³ The homeland is itself somewhat contentious even among supporters of the pastoralist model, but most sources assume an area near the Black Sea.

To be reconstructable, a word must have been present in the proto-language, and so these large domesticated animals and these technologies must have been present in the culture at the time when Proto-Indo-European was still a unitary language. The dates of the development of these features of Indo-European culture provide the earliest possible date for the breakup of Proto-Indo-European, or at least of those branches that inherit these terms.

The presence of a word for ‘wool’ requires a later date for Proto-Indo-European as well; early wild sheep have hair instead of wool, and the earliest indications of woolly sheep in the historical record are from the mid fifth millennium BCE [Darden, 2001, 196]. Ox-dragged plows and sledges appear in the historical record sometime after the fifth millennium BCE [Darden, 2001, 192]. The presence of words for ‘harness pole’ and ‘yoke’ therefore requires that Proto-Indo-European have been a unitary language at some point just prior to that date.

The case of the wheel is similar. Words for ‘wheel’, organized by the PIE root from which they’re derived, are presented below:

• PIE *kʷel ‘go round’: Greek kúklos, Sanskrit cakrá, English wheel, Avestan čakras, Tocharian A kukäl (meaning ‘wagon’ in this case)

• PIE *werg ‘twist, spin’: Hittite hūrkiš, Tocharian A wärkānt, Tocharian B yerkwantai

• PIE *ret ‘run, roll’: Latin rota, German Rad

Of these, the Greek, Sanskrit, Avestan, Tocharian A, and English forms are cognate. The Hittite and Tocharian forms from *werg, while derived from the same root, do not appear to come from the same complete Proto-Indo-European word, as they have different suffix morphology. Thus, if one were to take these various words for ‘wheel’ as the entirety of the data for dating the breakup of the Indo-European family, it would be possible to conclude that the Anatolian, Tocharian, and Italic branches had split from PIE before the development or spread of wheeled vehicles, ca. 4000 BCE. However, Greek, Sanskrit, and English, since their words for ‘wheel’ are direct cognates, must still have been part of the proto-language at the time of the development of the wheel. The fact that these forms are all descended from PIE *kʷekʷlos means that the direct ancestor of all three branches must have both had wheels and have called the wheel *kʷekʷlos.

This model also shows much agreement with archaeological data [Anthony, 2007]. The spread of domestic horses through Europe corresponds well with this model, as do the dates of vehicles and vehicle-related technology discovered at archaeological sites throughout Europe. By contrast, the farming model does not fit with this evidence.

2.3 Computational Phylogeny of Proto-Indo-European

A number of prior studies have examined the structure and breakup of the Indo-European language family. The two major attempts are discussed in significant detail below.

2.3.1 Gray and Atkinson

Prior computational attempts to put a date on the breakup of Proto-Indo-European have been at odds with archaeological evidence and other linguistic dating methods that support the pastoralist model discussed above. Atkinson and Gray have suggested a date for the breakup of Proto-Indo-European at ca. 7000 BCE [Atkinson and Gray, 2006].

Gray and Atkinson used a piece of phylogenetic treebuilding software, MrBayes, originally intended for creating trees that describe species relationships [Huelsenbeck et al., 2001]. MrBayes uses Bayesian MCMC to infer likely tree structures from individual species genomes. To do this, it begins by generating a large number of random trees, and then evaluates each for likelihood based on the data provided. Some of these higher scoring trees are then modified, and the new trees are evaluated for likelihood. This process is repeated many times, and so as the trees get more likely, i.e. as the number of iterations grows, the process moves from sampling random trees to sampling highly probable trees.

Gray and Atkinson used a database of Swadesh lists for Indo-European languages tagged with cognate data. They drew their data from the 200 word database by Dyen et al., then added Anatolian and Tocharian A and B wordlists of their own [Dyen et al., 1992]. A portion of the entry for ‘and’ from the Dyen et al. wordlist is presented below.

    a 002 AND
    b 001
    002 55 Gypsy Gk DA
    002 41 Latvian UN
    002 56 Singhalese SAHA
    002 08 Rumanian List IAR
    002 79 Wakhi ET, SE, WOZ
    002 09 Vlach SE
    002 73 Ossetic AEMAE
    002 31 Swedish VL A
    b 002
    002 36 Faroese OG
    002 33 Danish OG
    002 32 Swedish List OCH
    002 34 Riksmal OG
    002 30 Swedish Up OCH
    002 35 Icelandic ST OG
    b 003
    002 60 Panjabi ST TE
    002 57 Kashmiri TA
    002 61 Lahnda TE
    b 004
    002 62 OR
    002 63 Bengali AR, EBON
    002 65 Khaskura ARU, RA
    002 64 Nepali List AU, RA
    002 74 Afghan AU
    002 75 Waziri AU
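The sampling idea behind the MCMC procedure described above can be illustrated with a toy Metropolis sampler. This is only a sketch under strong simplifying assumptions and is not MrBayes's implementation: the three tree labels and their scores are hypothetical, the "trees" are fixed candidates rather than structures modified by local rearrangements, and the scores stand in for likelihood times prior. What it does show is the core accept/reject rule by which a chain comes to visit trees in proportion to their posterior probability.

```python
import random


def metropolis_over_trees(scores, steps, seed=0):
    """Visit tree labels in proportion to their unnormalized scores.

    scores -- dict mapping a (hypothetical) tree label to an unnormalized
              posterior score; all scores must be positive
    steps  -- number of MCMC iterations
    Returns the fraction of iterations spent at each tree.
    """
    rng = random.Random(seed)
    labels = sorted(scores)
    current = rng.choice(labels)
    counts = {label: 0 for label in labels}
    for _ in range(steps):
        # Symmetric proposal: pick any other tree uniformly at random.
        proposal = rng.choice([label for label in labels if label != current])
        # Metropolis rule: always accept a better tree, and accept a worse
        # one with probability equal to the ratio of the two scores.
        if rng.random() < min(1.0, scores[proposal] / scores[current]):
            current = proposal
        counts[current] += 1
    return {label: counts[label] / steps for label in labels}


# Three hypothetical candidate topologies for four languages A-D.
toy_scores = {"((A,B),(C,D))": 6.0, "((A,C),(B,D))": 3.0, "((A,D),(B,C))": 1.0}
frequencies = metropolis_over_trees(toy_scores, steps=50_000)
for tree, freq in frequencies.items():
    print(tree, round(freq, 2))
```

With these toy scores the chain should spend roughly 60%, 30%, and 10% of its time at the three trees. Real phylogenetic samplers differ mainly in how proposals are generated and in also sampling branch lengths and model parameters.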

For each lexical item in the 200 item list, they prepared a list of every possible pairing of two languages. Each pair was coded as follows:

• 1 if the two languages had cognate forms for that Swadesh item.

• 0 if the two languages had forms from different etyma for that Swadesh item.
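The pairwise coding described above can be sketched in a few lines of Python. The function name and data layout are assumptions made for illustration; the cognate class labels for ‘and’ are taken from the Dyen et al. excerpt shown earlier, restricted to five languages to keep the example small.

```python
from itertools import combinations


def code_pairs(cognate_class_by_language):
    """Code every pair of languages for one Swadesh meaning.

    Returns a dict mapping (lang_a, lang_b) to 1 if the two languages'
    forms belong to the same cognate class, and 0 otherwise.
    """
    coding = {}
    for lang_a, lang_b in combinations(sorted(cognate_class_by_language), 2):
        same_class = (cognate_class_by_language[lang_a]
                      == cognate_class_by_language[lang_b])
        coding[(lang_a, lang_b)] = 1 if same_class else 0
    return coding


# Cognate classes for the meaning 'and', from the excerpt above.
and_classes = {
    "Danish": "002",
    "Faroese": "002",
    "Panjabi ST": "003",
    "Lahnda": "003",
    "Bengali": "004",
}
and_coding = code_pairs(and_classes)
for pair, value in and_coding.items():
    print(pair, value)
```

Note that the coding records only whether two forms are cognate. It carries no information about whether the shared form is inherited from the proto-language or is a later innovation.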

With this coding, retentions of lexical items from the proto-language are coded in the same way as shared innovations, i.e. two languages that are coded 1 for a particular semantic slot could share an innovated replacement, but they could also both simply retain the original Proto-Indo-European word for that particular lexical item. For instance, English and Sanskrit would have a 1 in the slot for ‘father’, since both words, father and pitār, respectively, descend from PIE *pH̥₂tēr, also ‘father’. This is a shared retention, and is therefore useless for subgrouping. On the other hand, English and German would have a 1 in the slot for ‘hand’, since their terms, hand and Hand, are cognate. This is an example of a shared innovation, and is therefore useful for subgrouping, as the PIE lexical item for ‘hand’, *man, is the root of neither the English nor the German form.

Using lexical data as input for Bayesian MCMC phylogeny has some benefits and some drawbacks. In the benefits column, the first and most important characteristics are availability and ease of use. Large word lists exist for most extant Indo-European languages, since Indo-European languages include most of the best-studied languages in the world. Indo-European also has the most extensive history of reconstruction of any language family (cf. Sir William Jones’s observation in the 18th century that Greek, Latin, and Sanskrit are too similar to be unrelated).

The drawbacks include some errors created by contact situations. The trees produced by the Atkinson & Gray method occasionally yield some results at odds with most standard Indo-European tree structures. Their results place Dutch and Frisian close together, while the longstanding and accepted tree of the Germanic languages places English as Frisian’s closest relative. See Figure 2.4 below for the tree generated by Gray and Atkinson. They found a likely breakup of Indo-European at ca. 7000 BCE, a date that is consistent with the farming model discussed above, but not with the pastoralist model or the conclusions drawn from linguistic archaeology.

Figure 2.4: Tree from Gray and Atkinson (2003)

2.3.2 Chang et al. 2015

Chang et al. (2015) used computational phylogeny to investigate the dating of the breakup of Proto-Indo-European, using a modified version of the Dyen wordlist called IELEX. This wordlist contains the Swadesh lists for the languages in the Dyen list, together with Swadesh lists for a number of ancient languages. Additionally, the list is carefully marked for borrowings in those places where careful philology can distinguish a borrowing from a retained word.

They carefully specified the shape of the permissible trees, requiring that all known subgroups of Indo-European be present in the data. In essence, their project specifies a tree and uses a number of different substitution models to examine the time required to replace the items in the Swadesh lists between the root node of the tree and the leaf nodes (i.e. Proto-Indo-European and the various modern Indo-European languages). They used 35 constraints, first splitting Indo-European into Anatolian and Nuclear Indo-European, then Nuclear Indo-European into Tocharian and Inner (i.e. non-Tocharian, non-Anatolian) Indo-European. Their resulting tree looks precisely like contemporary trees of Indo-European.

The major difference in the method used by Chang et al. is in the fact that the ancestor nodes of most tree nodes are directly specified. Where many methods will suggest that the Romance languages and Latin share a common ancestry, Chang et al.’s trees force the Bayesian MCMC runs to return trees where Latin is the direct ancestor of each of the Romance branches. This does two things: first, it eliminates homoplasy as a possible influence on tree structure, and second, it prevents what the authors term jogging in the rate of replacement. In the instance of a run without such specifications, a possible resulting tree might be either:

        Proto-Latin-Romance
           /          \
       Latin       Old French

as well as:

       Latin
         |
     Old French

In the first instance, the result requires that there must have been at least a few changes between the common ancestor of Latin and Old French and Latin itself, while in the second, Latin and Old French are in a parent-child relationship. The number of changes between the two languages is constant, but in the first instance, there are fewer changes in the time between the common ancestor and Old French than there are in the time between Latin and Old French in the second instance. This alteration, the shifting of a number of changes to a time prior to the usually-recognized ancestor of a language, may show fewer changes in a longer period of time, and so suggest a longer time between changes, and thus a deeper date for any split of a proto-language. By making strong claims about the ancestry of every language in their sample, Chang et al. eliminate the possibility of time expansion through jogging, and their discovered date accords well with the pastoralist model of Indo-European expansion detailed above.

This does, however, eliminate the possibility of any of their specified language ancestries having a sister, rather than daughter, relationship. For Latin and Proto-Romance, this is well-supported, but particularly outside Indo-European, such facts are much less well-known, and a more general method would be helpful.

2.3.3 The Ringe and Warnow Method

Ringe and Warnow (2002) have also used computational methods to establish dated trees for Indo-European [Ringe et al., 2002]. In addition to lexical data similar to that used by Gray and Atkinson (see above), they use a set of 22 phonological changes. Instead of the larger set of languages studied by Gray and Atkinson, they use a subset of Indo-European languages, a selection of the older-attested languages from each of the ten branches of the family. The languages used, organized by subgroup, are as follows:

1. Anatolian: Hittite, Luvian, Lycian

2. Indo-Iranian: Vedic, Avestan, Old Persian

3. Greek: Greek

4. Italic: Umbrian, Oscan, Latin

5. Germanic: Gothic, Old Norse, Old English, Old High German

6. Armenian: Armenian

7. Celtic: Old Irish, Welsh

8. Tocharian: Tocharian A, Tocharian B

9. Balto-Slavic: Old Church Slavic, Old Prussian, Lithuanian, Latvian

10. Albanian: Albanian

They do not discuss the selection of their sound changes in the paper, but they do present them in a separate document, available on Tandy Warnow’s website:⁴

The characters discussed here were chosen because they seemed unusual enough or complex enough to be probative. It can be seen that they do not validate all the clades that our methodology has found, and that should cause no surprise: purely phonological evidence for some IE subgroups is much better than for others. Most of them define uncontroversial subgroups of the family also recognizable on other grounds; we were able to discover only three (P1 through P3) that might validate higher clades.

⁴ http://www.cs.utexas.edu/~tandy/histling.html, accessed ca. March 2015

This suggests, then, that the sound changes used were not chosen according to any easily replicable procedure, and were probably chosen because they point to known subgroups. This test, then, does not do much to test the use of sound changes in computational phylogeny at all, since the sound changes selected simply reinforce pre-existing ideas about the structure of the language family in question.

1. The first change, a sequence change of *p...kʷ to *kʷ...kʷ, appears only in Latin, Old Irish, and Welsh, and is present to suggest a much-discussed Italo-Celtic subgroup.

2. The second change, the satem shift, wherein PIE labiovelars and velars merge, while palatals become sibilants or affricates, occurs in Vedic, Avestan, Old Church Slavic, Lithuanian, Old Persian, Old Prussian, and Latvian.

3. The third change, usually called ruki, is a retraction of *s after *r, *u, *k, or *i, and serves to connect Indo-Iranian and Balto-Slavic, as it appears in Vedic, Avestan, Old Church Slavic, Lithuanian, and Old Persian.

The first three of their chosen changes, as they say, point to identified (though either controversial or discarded) higher-order subgroups among the branches of Indo-European. The remaining changes serve to distinguish the individual branches.

1. Lenition of stops after long or unstressed vowels, present in Hittite, Luvian, and Lycian, marks off the Anatolian subgroup.

2. Medial *kʷ becomes *gʷ unless *s follows, present in Hittite, Luvian, and Lycian, also marking off Anatolian.

3. Čop’s Law, present in Hittite, Luvian, and Lycian, marking off Anatolian.

4. Initial *ye- becomes *e-, present in Hittite, Luvian, and Lycian, marking off Anatolian.

5. Merger of *i, *e, and *u, and a merger of *a and *ō, present in Toch. A and Toch. B, marking off the Tocharian subgroup.

6. *mbh becomes *m, present in Tocharian A and B, marking off Tocharian.

7. This change encompasses a number of things: “loss of preconsonantal *d, affrication of remaining *d, and merger of palatalized *d with palatalized dorsals”, present only in Toch. A and Toch. B, marking off Tocharian.

8. *tsk becomes *tk, *kst becomes *kəst, present in Toch. A and Toch. B, marking off Tocharian.

9. “merger of all nonhigh vowels and syllabic nasals”, present in Vedic, Avestan, and Old Persian, marking off Indo-Iranian.

10. Bartholomae’s Law, present in Vedic and Avestan, marking off Indo-Iranian.

11. “merger of voiceless aspirated stops and preconsonantal voiceless stops as fricatives,” present in Avestan and Old Persian, marking off Indo-Iranian.

12. “development of intonation contrast (acute vs. circumflex) in nonfinal heavy sylla- bles,” present in Old Church Slavic, Lithuanian, Old Prussian, and Latvian, marking off Balto-Slavic.

13. “(a) Grimm’s Law; (b) Verner’s Law; (c) shift of stress to initial syllables; (d) merger of unstressed *e with *i unless *r follows immediately,” present in Old English, Old Norse, Old High German, and Gothic, marking off Germanic.

14. “(a) loss of intervocalic * unless *i precedes and does not follow immediately; (b1) *əi > *ai, and (b2) *ōV > overlong *ō,” present in Old English, Gothic, Old Norse, and Old High German, marking off Germanic.

15. “merger of word-final nonnasalized *ō with short *u; lowering of *ē to *ā in stressed syllables, but merger of *ē with *ai in unstressed syllables,” present in Old English, Old Norse, and Old High German, marking off Northwest Germanic.

16. “merger of *ðw and *zw with *ww,” present in Old English and Old High German, marking off West Germanic.

17. “merger of *ē with *ī; merger of *ō with *ū in final syllables (including monosyllables), but with *ā elsewhere,” present in Old Irish and Welsh, marking off Celtic.

18. “*p > *k before obstruents, *b before liquids, * before nasals and after *s, ∅ elsewhere,” present in Old Irish and Welsh, marking off Celtic.

19. “syncope of short vowels in final syllables next to *s and after ,” present in Oscan and Umbrian, marking off Italic.

Given this data, no reasonable method could produce a tree at odds with the standard Indo-European phylogenies, and no real justification or principled method of deciding which sound changes to include is presented.

2.3.4 Computational Phylogeny and Estimating the Dates of Language Splits

The problem of computing the date of the breakup of a language family (or the breakup of a biological species into daughter species) is most often viewed as a problem of rate estimation for character substitutions. The characters, in the case of languages, are most often the Swadesh list characters discussed above, but may also include other language features. The problem, as formalized by Tavaré (1986), represents the characters as

\[ (X_i(t), Y_i(t)) \tag{2.1} \]

where X_i and Y_i represent the i-th character of the sequence in question at time t in the two languages that have undergone a breakup, respectively [Tavaré, 1986]. Thus, at time t = 0, before the two languages have diverged, X_i = Y_i for all i, i.e. for each character in the Swadesh list.
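To make the notation in (2.1) concrete, the following Python sketch simulates two lineages descending from a common ancestor and counts the characters at which they differ after time t. It rests on simplifying assumptions that are mine rather than Tavaré’s: a single replacement rate shared by all characters, replacements arriving as a Poisson process, and no homoplasy (a replaced form is never identical to any other form). The parameter values are hypothetical.

```python
import math
import random


def evolve(n_chars, rate, t, rng, lineage):
    """Evolve n_chars characters along one lineage for time t.

    Each character starts in the shared ancestral state ("anc", i). Under a
    Poisson process with the given rate, the chance that no replacement has
    occurred by time t is exp(-rate * t); otherwise the character ends in a
    state unique to this lineage.
    """
    p_retained = math.exp(-rate * t)
    return [("anc", i) if rng.random() < p_retained else (lineage, i)
            for i in range(n_chars)]


def count_differences(x, y):
    """Number of indices i for which X_i(t) != Y_i(t)."""
    return sum(1 for x_i, y_i in zip(x, y) if x_i != y_i)


rng = random.Random(1)
n_chars, rate, t = 20_000, 0.2, 1.0  # hypothetical values, not estimates
X = evolve(n_chars, rate, t, rng, "X")
Y = evolve(n_chars, rate, t, rng, "Y")
observed = count_differences(X, Y) / n_chars
expected = 1.0 - math.exp(-2.0 * rate * t)
print(round(observed, 3), round(expected, 3))
```

Under these assumptions the expected fraction of differing characters is 1 - exp(-2rt), so an observed fraction can in principle be inverted to estimate t once the rate r is known. Estimating that rate is the difficult part.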

The problem, then, may be formalized as estimating the number of changes over a given time period, that is, the number of characters $i$ for which $X_i(t) \neq Y_i(t)$ for a given time period $t$. The most general such method in common use today, and the one used by the Gray and Atkinson (2003) study discussed above, is a generalized time-reversible substitution model (GTR), a model that presumes changes in individual lineages happen as a time-invariant Poisson process, but also permits variable rates of replacement for different types or categories of features [Tavaré, 1986]. Each tree, then, is created with a given rate of character change, and the degree to which this rate matches the 14 dates chosen by Gray and Atkinson is part of the evaluation of the dates attached to the tree.

Other, newer methods, such as relaxed clock phylogeny, attempt a more accurate model in which the more closely-related members of the family have more closely-related substitution rates. In order to model this relationship, the rates on particular branches of family trees are chosen “among lineages as varying in an autocorrelated manner, with the rate in each branch being drawn (a priori) from a parametric distribution whose mean is a function of the rate on the parent branch” [Drummond et al., 2006].

The other component of their evaluation was a process called smoothing. Smoothing is a measure that penalizes trees with adjacent edges that vary in length. For example, trees that propose a very different number of years between Proto-Indo-Iranian and Proto-Indic than between Proto-Indic and Sanskrit are considered less likely than trees that show a similar number of years on these adjacent edges. While this does not make a direct claim about the rate of change, it does imply a claim about the meta-rate of change, i.e. the rate at which language change itself changes. The smoothing procedure claims that this meta-rate is low, and gives preference to trees where adjacent branches are similar in length. This implication is not supported by any data, and I have found no work that attempts to test such a hypothesis.

Later computational work on the problem of dating Indo-European has resulted in a number of distinct dates for the breakup of PIE, in particular Bouckaert et al. and Chang et al.

To replicate prior experiments as closely as possible, I will use the same 14 dates used by Gray and Atkinson, and perform a similar evaluation based on those dates. However, the assumption made by smoothing will not be implemented. Instead, trees will be selected based on those dates alone. The GTR method of dating permits different types of characters to change at different rates, which means that rates of change for the sounds in my data will be permitted to differ from the rates of change for the vocabulary characters.
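The effect of the smoothing assumption can be illustrated with a small sketch. The function name, the squared-difference form of the penalty, and the edge lengths are all assumptions made for illustration; this is not the procedure Gray and Atkinson implemented, only one simple way to penalize adjacent edges that differ in length.

```python
def smoothing_penalty(edge_lengths, adjacent_pairs):
    """Sum of squared length differences over pairs of adjacent edges."""
    return sum((edge_lengths[a] - edge_lengths[b]) ** 2 for a, b in adjacent_pairs)


# One pair of adjacent edges: Proto-Indo-Iranian -> Proto-Indic -> Sanskrit.
adjacent = [("PII->PIndic", "PIndic->Skt")]

# Hypothetical edge lengths in years, invented for illustration only.
even_tree = {"PII->PIndic": 500, "PIndic->Skt": 600}
uneven_tree = {"PII->PIndic": 100, "PIndic->Skt": 1000}

even_penalty = smoothing_penalty(even_tree, adjacent)      # (500 - 600)^2
uneven_penalty = smoothing_penalty(uneven_tree, adjacent)  # (100 - 1000)^2
```

Under a penalty of this kind the tree with similar adjacent edges is preferred, which is the implicit claim about a low meta-rate of change discussed above.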

CHAPTER 3

Methodology

In order to replicate Gray and Atkinson (2003) as closely as possible (with the above-mentioned experimental variations, of course), I will use the updated Dyen et al. dataset compiled and annotated by the IELEX consortium. This provides strong comparanda to G & A, while still using careful philological judgments about cognacy and borrowing in the languages in question.

The two variables under consideration in this study are 1) the inclusion of lexical or phonological characters in the input data and 2) the specification of retentions in the input data. The first variable has three values: lexical characters only, phonological characters only, and both lexical and phonological characters. The second variable has just two values: specification of retentions and no specification of retentions. These two variables require six experimental conditions:

1. Lexical characters only, no specification of retentions (i.e. the control, a replication of Gray and Atkinson 2003)

2. Lexical characters only, retentions specified in the input.

3. Phonological characters only, no specification of retentions.

4. Phonological characters only, retentions specified in the input.

5. Phonological and lexical characters, no specification of retentions.

6. Phonological and lexical characters, retentions specified in the input.
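The six conditions are simply the cross product of the two variables. A minimal sketch, with variable names chosen here for illustration:

```python
from itertools import product

# Variable 1: which characters are included in the input data.
character_sets = ["lexical", "phonological", "lexical+phonological"]
# Variable 2: whether retentions are specified (PIE included as an outgroup).
retention_settings = [False, True]

conditions = [
    {"characters": chars, "retentions_specified": spec}
    for chars, spec in product(character_sets, retention_settings)
]
```

The resulting list follows the same order as the numbered conditions above, with the control (lexical characters, no retentions specified) first.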

This chapter lays out first the mechanism for each of these experiments, Bayesian Markov Chain Monte Carlo performed by the MRBAYES software package, and then the method of data coding for each of the data types and variables.

3.1 Specifying Retentions with MRBAYES

The MRBAYES software package lacks any mechanism to directly specify the structure of any tree node other than leaves (i.e. the attested languages). Even those runs that use ancient languages that are, in actuality, direct ancestors of modern languages will produce trees in which the ancient languages and modern languages are connected by a parent node, with few to no changes between the parent and the ancient language in question (cf. the discussion of ‘jogging’ in Chang et al. in chapter 2). As such, directly specifying retentions (i.e. the PIE language) as the root node of the tree is impossible.

What is possible, however, is specifying a particular language or set of languages as an outgroup, a language or set of languages that is known to be related to every language in the sample in question, i.e. a sister to the common ancestor. The algorithm then works to minimize the distance between the root node language and the outgroup language (or the calculated structure of the group of outgroup languages, if more than one is specified). Because of this, specifying Proto-Indo-European as an outgroup to all its daughter languages forces the algorithm to generate a root node for the daughter languages that is as close as possible to Proto-Indo-European. While the possibility exists that some small number of changes may be presumed between the root of the daughter languages and specified PIE (the ‘jogging’ problem identified by Chang et al.), the number of changes is minimized by the algorithm, resulting in an effective root node that is identical to, or at least nearly identical to, PIE. Thus, specifying PIE as an outgroup to all the daughter languages effectively roots the tree at PIE. In three of the experimental conditions, PIE has been set as an outgroup in the MRBAYES run.

3.2 Capturing Phonological Changes as Characters for Bayesian MCMC

Using sound changes as characters in Bayesian MCMC has been attempted before, as discussed in Chapter 2, but never in a principled way. To avoid bias, I will include the consonant phonemes of the languages in question, broken down into place, manner, and voicing. For each proto-phoneme of Proto-Indo-European, I use the phoneme in the daughter languages that descends from it. In this way, I will capture changes in place, manner, and voicing from Proto-Indo-European to its daughter languages without having to choose to include or exclude any particular sound changes. In short, this selection mechanism includes cognate phonemes as input data, rather than sound changes themselves.

For example, the Proto-Indo-European phoneme *bh in initial position results in b in Old English, bh in Sanskrit, and f in Latin. I split these into place, manner, and voice, then code them in the same way that the vocabulary is coded, i.e. a column for each characteristic and type in the data. For example, the column for +voiced will have ones in the rows for PIE, Old English, and Sanskrit and a zero in the row for Latin, while the column for -voiced will include a one in the row for Latin and zeros elsewhere.

In order to use sound changes while avoiding issues of cherry-picking, I will include strictly the basic outcomes, that is, the outcomes initially before a vowel, of all the Proto-Indo-European consonants in a sample of languages identical to that used by Ringe and Warnow (2002). Using sound changes, which are always innovations, as input data for the Bayesian MCMC method should improve its accuracy and reliability.

Table 3.1: Manner Outcomes of PIE *p

*p      OE   Goth.  ON   OCS  Skt
OE      x    1      1    0    0
Goth.   1    x      1    0    0
ON      1    1      x    0    0
OCS     0    0      0    x    0
Skt     0    0      0    0    x

Table 3.2: Manner Outcomes of PIE *t

*t      OE   Goth.  ON   OCS  Skt
OE      x    1      1    0    0
Goth.   1    x      1    0    0
ON      1    1      x    0    0
OCS     0    0      0    x    0
Skt     0    0      0    0    x

To code sound changes, I first collect the outcomes of the Proto-Indo-European stop series and then break each outcome into place, manner, and voice features. As with vocabulary, for every Proto-Indo-European sound I prepared three separate groupings of language pairs, one for place, one for manner, and one for voice. For a given pair of languages, the coding will go as follows:

• 0 if the feature is different in the two languages, or if it is the same in both languages and the proto-language.

• 1 if the feature is the same in both languages but different from the proto-language.

This coding captures shared innovations while excluding any retentions from the proto-language. Further, choosing the consonants of Proto-Indo-European avoids the trap of selective data. For an example, see Tables 3.1, 3.2, and 3.3, which present the coded data for the manner outcomes of PIE *p, *t, and *ḱ, respectively, in Old English, Gothic, Old Norse, Old Church Slavic, and Sanskrit.
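The pairwise coding rule can be sketched in a few lines. The function names and the dictionary layout are assumptions made for illustration; the input reproduces the manner outcomes of PIE *p (a stop), which surfaces as the fricative f in the Germanic languages and remains a stop in Old Church Slavic and Sanskrit.

```python
def code_pair(value_a, value_b, proto_value):
    """Return 1 for a shared innovation, 0 for a difference or a shared retention."""
    return int(value_a == value_b and value_a != proto_value)


def pairwise_matrix(outcomes, proto_value):
    """Code every pair of languages; the diagonal is marked 'x'."""
    langs = list(outcomes)
    return {
        a: {b: "x" if a == b else code_pair(outcomes[a], outcomes[b], proto_value)
            for b in langs}
        for a in langs
    }


# Manner of the initial reflex of PIE *p in each language.
manner_of_p = {
    "OE": "fricative", "Goth": "fricative", "ON": "fricative",
    "OCS": "stop", "Skt": "stop",
}
matrix = pairwise_matrix(manner_of_p, proto_value="stop")
```

Here the Germanic pairs receive a 1, while the pair Old Church Slavic and Sanskrit receives a 0, since their shared stop is a retention from the proto-language.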

Table 3.3: Manner Outcomes of PIE *ḱ

*ḱ      OE   Goth.  ON   OCS  Skt
OE      x    1      1    1    1
Goth.   1    x      1    1    1
ON      1    1      x    1    1
OCS     1    1      1    x    1
Skt     1    1      1    1    x

3.3 Data Coding

The two types of data are encoded in NEXUS, a format used primarily in computational biology. Each character, i.e. each cognate group and phoneme-characteristic pairing, is represented as a column, and each language is a row. The cells are filled with zeros or ones, depending on whether that particular cognate form or phoneme feature is present in the language. An abbreviated example appears below. This file includes 16 languages, or taxa in NEXUS terminology, and 93 characters, i.e. cognate feature slots.

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=16 NCHAR=93;
FORMAT DATATYPE=STANDARD MISSING=?;
MATRIX
Latin                1010101011
Proto-Indo-European  1010101011
Umbrian              1010101011
Old_English          011001011
Classical_Armenian   1001010111
Ancient_Greek        1010101011
Avestan              1010101011
Gothic               0110010111
Oscan                1010101011
Old_Norse            0110010111
Vedic_Sanskrit       1010101011
Old_High_German      0110010111
Tocharian_B          1010100111
Tocharian_A          1010100111
Hittite              1010100111
Old_Church_Slavonic  101010101

;
END;
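A file of this shape can be generated from a table of coded characters. The following is a minimal sketch, assuming the coded data are held as one string of 0, 1, and ? per language; the function name `to_nexus` and the toy matrix are hypothetical and are not part of the project's actual tooling.

```python
def to_nexus(matrix):
    """Serialize {taxon: '0101...'} as a minimal NEXUS DATA block."""
    lengths = {len(row) for row in matrix.values()}
    if len(lengths) != 1:
        # Every taxon must supply a value (or '?') for every character.
        raise ValueError("all rows must have the same number of characters")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix) + 2
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        "FORMAT DATATYPE=STANDARD MISSING=?;",
        "MATRIX",
    ]
    lines += [f"{name.ljust(width)}{row}" for name, row in matrix.items()]
    lines += [";", "END;"]
    return "\n".join(lines)


# Toy data, invented for illustration only.
toy = {"Latin": "1010", "Gothic": "0110", "Hittite": "10?0"}
nexus_text = to_nexus(toy)
```

Checking that every row has the same length before writing is worthwhile, since NCHAR must match the number of characters supplied for each taxon.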

MRBAYES then loads the file, takes some configuration options, then begins the Bayesian MCMC process. For this project, I used one million MCMC generations, six Markov chains (i.e. six distinct initial random trees), and default values for the sampling frequency and burn-in period. From the MRBAYES prompt, the following commands are executed to load the file, set the model options, and start the run. For those runs that specify Proto-Indo-European as an outgroup, i.e. for those experiments that include information about retentions in the input data, the third command in the list is included. It is otherwise omitted.

MrBayes > execute ancientphon.nex

MrBayes > lset nst=6 rates=invgamma

MrBayes > outgroup Proto-Indo-European

MrBayes > mcmc nchains=6
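For running all six conditions without retyping the commands, the same sequence can be written out as a MRBAYES command block. This is a sketch under stated assumptions: the helper `mrbayes_block` is hypothetical, and the `set autoclose=yes nowarn=yes` and `sumt` lines are additions for unattended batch use that do not appear in the interactive session above.

```python
def mrbayes_block(nexus_file, outgroup=None, ngen=1_000_000, nchains=6):
    """Build a MrBayes command block; the outgroup line is optional."""
    commands = [
        "set autoclose=yes nowarn=yes",  # assumption: suppress interactive prompts
        f"execute {nexus_file}",
        "lset nst=6 rates=invgamma",
    ]
    if outgroup is not None:
        # Included only in the runs that specify retentions.
        commands.append(f"outgroup {outgroup}")
    commands.append(f"mcmc ngen={ngen} nchains={nchains}")
    commands.append("sumt")  # assumption: summarize the sampled trees
    body = "\n".join(f"  {command};" for command in commands)
    return f"#NEXUS\nbegin mrbayes;\n{body}\nend;\n"


without_pie = mrbayes_block("ancientphon.nex")
with_pie = mrbayes_block("ancientphon.nex", outgroup="Proto-Indo-European")
```

The two strings differ only in the outgroup line, which mirrors the difference between the conditions with and without specified retentions.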

After one million generations (which take 10 minutes of running time on the test machine), the trees are generated and summarized. The following pair of ASCII trees is displayed, the first representing confidence in the branches, the second the resulting phylogeny with branch lengths proportional to the number of changes between nodes.

Clade credibility values:

/--------------------------------------------------------------- Proto-Indo-Euro~ (2)
|
|                                               /--------------- Classical_Armen~ (5)
|                                               |
|                                       /---94--+       /------- Avestan (7)
|                                       |       \---83--+
|------------------95-------------------+               \------- Old_Church_Sla~ (16)
|                                       |
|                                       \----------------------- Vedic_Sanskrit (11)
|
+                                               /--------------- Latin (1)
|                                               |
|       /------------------100------------------+       /------- Umbrian (3)
|       |                                       \---52--+
|       |                                               \------- Oscan (9)
|       |
|       |                                       /--------------- Old_English (4)
|       |                                       |
|       |                               /---89--+       /------- Gothic (8)
|       |                               |       \---92--+
\---89--+                       /--100--+               \------- Old_Norse (10)
        |                       |       |
        |               /---68--+       \----------------------- Old_High_German (12)
        |               |       |
        |               |       \------------------------------- Hittite (15)
        |       /---97--+
        |       |       |                               /------- Tocharian_B (13)
        |       |       \--------------100--------------+
        \---96--+                                       \------- Tocharian_A (14)
                |
                \----------------------------------------------- Ancient_Greek (6)

The above tree shows the confidence in the branches, from 0 to 100. Note the low confidence (68) in the branch that joins Hittite with the Germanic languages in this sample, and the complete confidence in the branch joining the two Tocharian languages.
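A credibility value of this kind is the percentage of sampled trees in which a given grouping appears. The sketch below is a simplification that counts rooted clades in trees written as nested tuples. MRBAYES itself works from its own tree samples, so the function names, the tree representation, and the four sampled trees are all assumptions made for illustration.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set, list of clades) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found


def clade_support(sampled_trees):
    """Percentage of sampled trees that contain each clade."""
    counts = Counter()
    for tree in sampled_trees:
        counts.update(set(clades(tree)[1]))
    return {clade: round(100 * n / len(sampled_trees)) for clade, n in counts.items()}


# Four hypothetical sampled trees over four taxa, invented for illustration.
samples = [
    (("Toch_A", "Toch_B"), ("Hittite", "Gothic")),
    (("Toch_A", "Toch_B"), ("Hittite", "Gothic")),
    ((("Toch_A", "Toch_B"), "Hittite"), "Gothic"),
    ((("Toch_A", "Toch_B"), "Gothic"), "Hittite"),
]
support = clade_support(samples)
```

In this toy sample the Tocharian pair appears in every tree, while the pairing of Hittite with Gothic appears in only half of them.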

The tree below is the resulting consensus tree, i.e. a rooted average of the trees sampled after the burn-in period.

Phylogram (based on average branch lengths):

/------ Proto-Indo-Euro~ (2)
|
|          /------ Classical_Armen~ (5)
|          |
|     /----+ /- Avestan (7)
|     |    \-+
|-----+      \- Old_Church_Sla~ (16)
|     |
|     \------ Vedic_Sanskrit (11)
|
+             /---- Latin (1)
|             |
|      /------+/- Umbrian (3)
|      |      \+
|      |       \- Oscan (9)
|      |
|      |                         /--- Old_English (4)
|      |                         |
|      |                      /--+ /--- Gothic (8)
|      |                      |  \-+
\------+               /------+    \- Old_Norse (10)
       |               |      |
       |             /-+      \- Old_High_German (12)
       |             | |
       |             | \-- Hittite (15)
       |      /------+
       |      |      |    /- Tocharian_B (13)
       |      |      \----+
       \------+           \- Tocharian_A (14)
              |
              \- Ancient_Greek (6)

The tree may then be visualized with FigTree, a common tree visualization package (available at https://github.com/rambaut/figtree/releases). All resulting trees below are drawn via FigTree.

The input data files for these experiments are available by emailing the author or requesting access to the git repository at https://github.com/styndall/iephylo.

CHAPTER 4

Results

4.1 A Brief Survey of the Results

This chapter primarily presents the results of the experiment detailed in the Methodology chapter above, followed by the results of a modified replication of Gray and Atkinson (2003). The chapter is organized into five major parts:

1. Short summary of the results

2. Characteristics of the Swadesh Lists in use and their effects on the Bayesian MCMC process

3. Detailed Results of the Six Experimental Conditions

4. Results of the Replication of Gray and Atkinson

5. Discussion

4.2 Summary of the Results

A quick review of the hypotheses is necessary. The first hypothesis, that providing the system with a way of distinguishing innovations from retentions will improve the resulting trees, is weakly upheld by the results presented below, primarily in the experimental conditions that use both cognate and phonology data. The second hypothesis, that the method of including sound changes detailed in chapter 3 will improve the tree structures resulting from the Bayesian MCMC process, was upheld for some situations but not for others. The experimental conditions using both lexical and phonological data showed results more in line with traditional IE reconstructions than either lexical or phonological characters alone. Including sound changes (i.e. phonological characters) in the analysis turns out to be useful primarily when one is working with sparse lexical data. This makes the method detailed above useful for future studies of languages and language families where the Swadesh lists are lacking.

I started by running the conditions on a subset of ancient IE languages: Latin, Oscan, Umbrian, Old English, Old High German, Old Norse, Gothic, Tocharian A, Tocharian B, Old Church Slavic, Hittite, Ancient Greek, Classical Armenian, Avestan, and Vedic Sanskrit. To evaluate the resulting trees, I checked them for the presence of several well-established subgroups:

• Germanic (Old English, Old High German, Old Norse, and Gothic)

• Italic (Latin, with Oscan and Umbrian as a subgroup within Italic)

• Tocharian (Tocharian A and B)

• Indo-Iranian (Vedic Sanskrit and Avestan)
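Checking a tree for these subgroups amounts to asking whether some node's descendants are exactly the members of the subgroup. A minimal sketch, assuming trees are written as nested tuples; the function names and the toy tree are hypothetical and do not correspond to any result in this chapter.

```python
def leaf_set(tree):
    """All tip labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def has_clade(tree, members):
    """True if some node's leaves are exactly `members`."""
    members = set(members)
    if leaf_set(tree) == members:
        return True
    if isinstance(tree, str):
        return False
    return any(has_clade(child, members) for child in tree)


# Hypothetical toy tree, invented for illustration only.
toy_tree = (
    (("Old_English", "Gothic"), ("Tocharian_A", "Tocharian_B")),
    (("Avestan", "Old_Church_Slavonic"), "Vedic_Sanskrit"),
)
subgroups = {
    "Germanic": ["Old_English", "Gothic"],
    "Tocharian": ["Tocharian_A", "Tocharian_B"],
    "Indo-Iranian": ["Vedic_Sanskrit", "Avestan"],
}
present = {name: has_clade(toy_tree, taxa) for name, taxa in subgroups.items()}
```

In the toy tree, Avestan groups with Old Church Slavonic before joining Vedic Sanskrit, so the Indo-Iranian subgroup is reported as absent while the other two are present.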

The experimental conditions for the ancient languages used in this study were as follows:

1. Phonological Characters Only, No Added Information About Proto-Language

2. Phonological Characters Only, Specification of the Proto-Language as Outgroup

3. Lexical Characters Only, No Added Information About Proto-Language

4. Lexical Characters Only, Specification of the Proto-Language as Outgroup

5. Combined Phonological and Lexical Characters, No Added Information About Proto-Language

6. Combined Phonological and Lexical Characters, Specification of the Proto-Language as Outgroup

Note that the languages used here have Swadesh lists that vary from completely (or over-)full to very small, and this variation has some important effects on how the MCMC method places these languages in the resulting trees.

4.2.1 Characteristics and Subdivisions of the Set of Swadesh Lists

The characteristics of the set of Swadesh lists compiled by IELEX and used for this study are important to understanding the output of the processes described in the methodology chapter. IELEX contains 207 Swadesh entries and 148 languages. These languages do not all have a word in every Swadesh entry, however, due to differences in both attestation and in current documentation, research, or technical expertise. Palaic, for instance, a poorly-attested Anatolian language, only has words for 16 of the 207 Swadesh items, and Elfdalian, a minor North Germanic language, rather bizarrely has merely a single entry in the IELEX database. The sparsity of cognate data for some of these languages has important ramifications for phylogenetic projects using cognates as input data. To test this, I prepared two sets of ancient IE languages, one set containing all languages listed below, and the other without Old Cornish (see explanation below). The number following each language is the number of Swadesh entries present for that language.

1. Hittite 114

2. Vedic Sanskrit 249

3. Avestan 188

4. Ancient Greek 248

5. Latin 226

6. Gothic 198

7. Classical Armenian 197

8. Tocharian A 143

9. Sogdian 172

10. Old High German 235

11. Old Church Slavonic 235

12. Old English 240

13. Old Prussian 135

14. Old Cornish 55
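The degree of sparsity in this list can be summarized with a short calculation. The counts are those listed above; the cutoff of half the mean is a hypothetical threshold chosen only to illustrate the idea, not a criterion used in this study.

```python
# Number of Swadesh entries present for each language, from the list above.
entries = {
    "Hittite": 114, "Vedic Sanskrit": 249, "Avestan": 188, "Ancient Greek": 248,
    "Latin": 226, "Gothic": 198, "Classical Armenian": 197, "Tocharian A": 143,
    "Sogdian": 172, "Old High German": 235, "Old Church Slavonic": 235,
    "Old English": 240, "Old Prussian": 135, "Old Cornish": 55,
}

mean_entries = sum(entries.values()) / len(entries)

# Hypothetical cutoff: flag any language with fewer than half the mean.
threshold = 0.5 * mean_entries
sparse = sorted(name for name, count in entries.items() if count < threshold)
```

With this cutoff only Old Cornish is flagged, which matches the decision to prepare a second set of languages without it.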

Figure 4.1 shows the results of a 1-million generation Bayesian MCMC run on the complete set of data. Deciding how sparse a language’s database entry can be and still be useful is a non-trivial problem. Clearly, given a mean number of data points near 200, 55 is insufficient, but Tocharian and Hittite, with roughly two to three times as much data, seem to fit in appropriately. Fine-grained tests may reveal a sort of tipping point where the utility flips, but that degree of testing is outside the scope of this project.

[Chang et al., 2015] use three different sets of languages and cognates for their study:

• Broad: 94 languages and 204 cognate classes (i.e. Swadesh entries), excluding blow, mother, and father

• Medium: 78 languages and 143 cognate classes, chosen so as to keep sparse data to a minimum (though the particulars of the selection go undiscussed)

• Narrow: 52 languages, selected by removing any modern language that did not have a direct ancestor also present in the dataset. The 143 cognate classes remain unchanged.

Note that in every instance, they mix both ancient and modern languages and specify closely the tree structure that must result. The complete list of their specifications is available on page 214 of the paper. This specification obviates entirely the problem of aberrant tree structures based on sparse data, as the MCMC simulation is never required to select the most highly-probable tree structure from a number of options, unlike in most other phylogenetic studies. Under such constraints, sparsity may prove to be less problematic, though that possibility remains unstudied.

• Ancient Languages: Strictly those branches that are attested early in the history of IE and that have languages in the IELEX dataset. The languages are

– Hittite

– Tocharian A and B

– Vedic Sanskrit

– Avestan

– Ancient Greek

– Latin

– Oscan

– Umbrian

– Gothic

– Old High German

– Old Norse

– Old Church Slavonic

– Old English

Of these languages, Oscan, Umbrian, and Old Cornish have fewer than 60 items in the IELEX database, and were in some cases excluded, as noted in the relevant sections below.

• The set of languages included in the small dataset in [Chang et al., 2015]. These 52 languages include both modern and ancient languages, and all have full or nearly-full Swadesh lists.

4.2.2 Kitchen Sink Results

Simply using the entirety of the IELEX dataset yields a few unsurprising results, given the above observations about the relationship between sparse data and tree structure. For instance, Oscan and Umbrian, which have very sparse Swadesh lists (28 and 33 Swadesh items in IELEX, respectively), are misplaced in the results of this MCMC run, and their placement is largely driven by their poor connection with the other languages in the sample: the accident of vocabulary attestation is the primary driver of their position in the tree. The tree is in Figure 4.1 below.

Note also the position of Elfdalian, which appears in the tree as part of the branch that broke away from the proto-language earliest. This position makes most sense from the perspective of the algorithm, as Elfdalian appears to have only a single cognate lexical item present in the IELEX dataset. This extreme example shows that languages with sparser data can be misplaced easily, and suggests that sparser Swadesh lists may place languages earlier in the history of the language family than is warranted by what is known from traditional historical linguistic investigation. In fact, the branch that contains Elfdalian is entirely composed of languages with very small Swadesh lists: Old Breton (79), Old Cornish (55), Old Welsh (36), Elfdalian (1), Magahi (62), Oscan (28), Umbrian (33), Lycian (17), and Palaic (16). However, even within this aberrant branch, the languages are most closely associated with their known relatives, showing that even under very bad conditions, the algorithm is quite robust.

The various subfamilies are largely kept together, and the early structure of the tree shows very few strong branches, consistent with the ten-way split reconstructed by traditional historical linguistics. A few strange choices appear, such as the connection of Albanian and Sardinian. A full investigation of the effects of various configurations of Swadesh lists, while desirable, is outside the purview of this study, which examines the effects of phonological characters and the specification of the proto-language on MCMC studies.

Figure 4.1: Entire IELEX Lexical Tree

4.3 Six-Way Results

In this section, the results of the six-way comparison are presented, first in general terms for phonological characters, lexical characters, and then both, and then in terms of the individual trees themselves.

4.3.1 Phonological Characters Only

Using phonological characters alone appears to be a poor choice, probably because there are simply too few characters. Additionally, the space of possible changes is small, and the probability of homoplastic changes based on articulatory, acoustic, and/or auditory factors is rather high, at least compared to the kinds of variations that can be present in lexical data.

Both trees in these phonology-only subconditions show combined sub-branches of Proto-Indo-European, a result not consistent with traditional phylogeny. Again, the small space of input and output for sound changes probably causes this problem, since many languages will share innovations simply by drift.

However, closely-related languages do generally appear together in these trees, so at least some phylogenetic signal is detectable in this data. Additionally, languages that were misplaced in the Kitchen Sink results above are placed together, showing that this variety of data can add useful information for MCMC.

4.3.1.1 Not Specifying Retentions

The tree in figure 4.2 is remarkably bad by the standards of traditional IE philology and linguistics. The relationships shown are as often due to accidental similarity (e.g. Old Church Slavonic with Avestan) as to actual descent. A few of the test subgroups are present, however - the Germanic subgroup and the Tocharian.

4.3.1.2 Specifying Retentions

Figure 4.3 shows the result of the phonology-only tree with PIE specified as an outgroup. The tree does not represent the traditional tree of IE particularly well, though the Italic and Germanic branches are clearly placed together, which improves on the results without the PIE specification in the prior experiment presented above. This demonstrates that, in situations with sparse lexical data, phonology can provide useful information for MCMC, even though it’s clear that these characters, on their own, are not appropriate or reliable input data for Bayesian MCMC experiments.

[Tree diagram. Tip labels: Latin, Umbrian, Oscan, Old_English, Gothic, Old_Norse, Old_High_German, Hittite, Tocharian_B, Tocharian_A, Ancient_Greek, Classical_Armenian, Avestan, Old_Church_Slavonic, Vedic_Sanskrit]

Figure 4.2: Ancient Languages, Phonology Only, No Retentions Specified

[Tree diagram. Tip labels: Proto-Indo-European, Classical_Armenian, Avestan, Old_Church_Slavonic, Vedic_Sanskrit, Latin, Umbrian, Oscan, Old_English, Gothic, Old_Norse, Old_High_German, Hittite, Tocharian_B, Tocharian_A, Ancient_Greek]

Figure 4.3: Ancient Languages, Phonology Only, Retentions Specified

4.3.2 Lexical Characters Only

As I discussed in Chapter 2, tagged lexical characters are the most usual input data for MCMC phylogeny studies in linguistics. In both experimental subconditions here, the trees show much more of the individual-branch-like signal that characterizes traditional IE family trees, i.e. the nodes showing the early breaks from the source language are very close to each other. This is significantly closer to the ten-branch IE language reconstruction than the prior trees, and so must better represent the history of the languages in question. The languages with sparse Swadesh lists, however, are badly misplaced, an error that is entirely unsurprising, given the discussion of the effects of sparseness at the beginning of this chapter.

Further, the position of Hittite and Tocharian is aberrant with respect to prior MCMC studies of Indo-European. Gray and Atkinson, however, specified the position of Anatolian and Tocharian in their input, such that their algorithm automatically assigned zero probability to any tree which did not have the structure:

Proto-Indo-European
    Anatolian
    Non-Anatolian PIE
        Tocharian
        Non-Tocharian PIE

Likewise, Chang et al. specified the entirety of their tree structures. The presence of Anatolian and Tocharian in the center of the trees in the following experimental runs is not surprising, given that their placement as the earliest branches to break from PIE is largely based on non-lexical data.

4.3.2.1 Not Specifying Retentions

The tree in figure 4.4 shows very strongly that the branches of the family evolved independently over the history of the family. The existence of non-traditional subfamilies is very brief, essentially equivalent to an immediate split. The placement of Oscan and Umbrian, as mentioned above, is entirely wrong. The subfamilies present, on the other hand, are well-formed with respect to traditional reconstruction, and better in a number of significant ways than the phonology-only trees presented above. In particular, the Germanic subfamily is improved here over the prior trees, correctly splitting the family into East, North, and West Germanic. Avestan and Vedic Sanskrit are likewise grouped together correctly.

[Tree diagram. Tip labels: Latin, Gothic, Old_Norse, Old_High_German, Old_English, Tocharian_B, Tocharian_A, Hittite, Umbrian, Oscan, Classical_Armenian, Ancient_Greek, Avestan, Vedic_Sanskrit, Old_Church_Slavonic. Scale bar: 0.02]

Figure 4.4: Ancient Languages, Lexical Characters Only, No Retentions Specified

4.3.2.2 Specifying Retentions

Specifying the proto-language with this dataset provides a few benefits over the run without this specification. First, the central trunk (the branch from which all the languages but Latin descend) is more unitary, representing better the independence of the branches of PIE. The position of Latin in the tree is quite surprising: in both lexical-only runs, Latin is a sister to all other branches. However, the distance between Latin and proto-Everything Else is not great, so it does not suggest a large split between Latin and the other languages, and it is consistent with the independence of the branches traditionally reconstructed for Indo-European. This surprising result better shows off the improvements provided by the addition of the phonological (i.e. sound change) data in the following two experiments.

[Tree diagram. Tip labels: Proto-Indo-European, Latin, Tocharian_B, Tocharian_A, Hittite, Umbrian, Oscan, Classical_Armenian, Ancient_Greek, Avestan, Vedic_Sanskrit, Old_Church_Slavonic, Gothic, Old_Norse, Old_High_German, Old_English]

Figure 4.5: Ancient Languages, Lexical Characters Only, Retentions Specified

4.3.3 Lexical and Phonological Characters

These runs, in which both lexical and phonological characters were included, are most similar to the traditional subgrouping of PIE. In particular, the run that specified the proto-language created the tree that most closely follows the traditional reconstruction of the Indo-European language family.

4.3.3.1 Not Specifying Retentions

In this run (figure 4.6), the result is very similar to the lexical-only runs. The independent branching is present, but Oscan and Umbrian are shown to be different enough from the other languages to put them further up in the tree structure, rather than down with Tocharian and Hittite. However, without the specification of the root node, the algorithm cannot see that these two languages share a significant number of innovations with Latin. As evaluated against the set list of subgroups, all are present save Italic. This run has selected a root node that is quite close to Italic, which leaves the subgroup split.

[Tree diagram. Tip labels: Latin, Tocharian_B, Tocharian_A, Hittite, Gothic, Old_Norse, Old_High_German, Old_English, Classical_Armenian, Ancient_Greek, Avestan, Vedic_Sanskrit, Old_Church_Slavonic, Umbrian, Oscan. Scale bar: 0.02]

Figure 4.6: Ancient Languages, Lexical and Phonological Characters, No Retentions Specified

4.3.3.2 Specifying Retentions

This tree, figure 4.7, best represents the traditional Indo-European reconstruction, showing that the combination of sound changes and the specification of the proto-language provides a useful benefit over the exclusive use of lexical characters. The major benefit appears in the languages poorly represented by their Swadesh lists, a finding unsurprising, given the tests at the beginning of this chapter. All the subgroups that appear in traditional reconstructions also appear here: Germanic, Indo-Iranian, Italic, and Tocharian. This tree therefore represents the best result of the six conditions, upholding the initial hypotheses.

[Tree diagram. Tip labels: Proto-Indo-European, Latin, Umbrian, Oscan, Tocharian_B, Tocharian_A, Hittite, Gothic, Old_Norse, Old_High_German, Old_English, Classical_Armenian, Ancient_Greek, Avestan, Vedic_Sanskrit, Old_Church_Slavonic. Scale bar: 0.02]

Figure 4.7: Ancient Languages, Lexical and Phonological Characters, Retentions Specified

4.3.4 Overall Consideration of the Six Conditions

The improvement in the trees (in specific sections in the case of phonology-only, and overall in the case of phonology+specification of proto-language) shows that both hypotheses are upheld, though perhaps more weakly than I initially suspected they would be. The weaknesses in the phonological character-only conditions demonstrate that phonology-only is not a reasonable choice for Bayesian MCMC phylogeny, though it can reveal connections that a lexical character-only examination might miss. The addition of phonology-only runs on top of the usual lexical studies may provide occasional useful information, and may be valuable, not for the trees they produce, but for showing potential variants in subtrees that may require consideration. The inclusion of the proto-language at the root of the tree proved useful in every case, and crucial in the final case, where the Italic subgroup was produced correctly by the MCMC run.

4.4 Replication of Gray and Atkinson 2003

This project would be incomplete without a replication of Gray and Atkinson (2003), as their study provided the initial impetus for this dissertation. However, I present a number of differences in the initial data preprocessing and the assumptions built into the model used to select trees. The input data for their project consisted of the living Indo-European languages listed in the Dyen et al. wordlists, along with Hittite and Tocharian A and B. Further, they specified, in the input to MRBAYES, that the living languages formed a monophyletic group, which in turn formed a monophyletic group with Tocharian A and B, with Hittite as an outgroup. The diagram below illustrates the tree structure their input specified.

PIE
+-- Hittite
+-- Non-Hittite PIE
    +-- Tocharian
    |   +-- Tocharian A
    |   +-- Tocharian B
    +-- Living IE Languages

This tree structure builds in two important assumptions:

• That Anatolian was the earliest group to depart the IE homeland.

• That the root of the PIE tree is more similar to Hittite than to any of the living IE languages.

This structure also provides the system with strong constraints on possible proto-languages at the root of the tree. The restriction precludes any relationship between Hittite and Tocharian, or between either of these language families and any of the living languages. Thus, in terms of structure, Gray and Atkinson’s project was limited to finding subgrouping within the living IE languages in the Dyen list. Their tree is appended to this section.

Their results line up well with traditional IE reconstructions, with early splits among the major groups and close relationships among the well-known groupings. Indo-Iranian is well-represented, and the split between the Indic and Iranian languages lines up with results obtained without Bayesian MCMC. The Germanic languages are likewise split along traditional lines. Overall, their results are strongly consistent with the traditional, ten-branch model of the Indo-European family. The larger subgroupings early in the tree disappear quickly, leaving the branches to innovate independently for most of their histories.

For this study, however, I’ve chosen to remove the restrictions Gray and Atkinson placed on the tree structures, so that the system is completely unconstrained in placing Hittite and Tocharian A and B in the tree. I ran this replication twice: first with no additional information, and second with the PIE Swadesh list included as an outgroup (as discussed above and in the Methods chapter). Thus, the system is free to infer a root node for all families, unconstrained by the presumption of Anatolian having departed first.

4.4.1 Replication without Root Node Specification

A completely naive run produces a tree that, with the exception of the Slavic languages, which are presumed to be very similar to the root of the entire tree, still closely adheres to the traditional structure of the IE family. The major trouble is the selection of the root node. The root, which Gray and Atkinson specified as very close to Hittite, came out as essentially the ancestor of the Slavic languages. The remainder of the branches are well-represented and largely consistent with the ten-branch structure of Indo-European.

This root creates problems for the system and breaks up the traditional PIE reconstruction incorrectly, at least early on. The Baltic languages, however, are placed reasonably as a sister to the Slavic languages, and the Indo-Iranian, Germanic, Italic, and Celtic subgroups are each present and grouped together.

4.4.2 Replication with Root Node Specification

Specifying Proto-Indo-European as an outgroup to the system improves the structure of the tree, placing the Slavic languages in their traditional position as one of the sub-branches of Balto-Slavic. Anatolian and Tocharian share a common ancestor, though the split lies far enough back in the tree that it does not amount to an important claim about shared history.

4.5 Discussion

The trees presented above, particularly the six-way comparison of conditions, demonstrate that including the proto-language as an outgroup in the input data improves the results of the Bayesian MCMC method. This improvement, however, is only available when at least some amount of historical reconstruction has already been done for the language family in question. For Indo-European languages, the long history of non-computational historical scholarship permits easy access to, and a reasonable certainty of accuracy of, the reconstructed proto-language.

Another major issue in the construction of these trees is the process of selecting a root node. The trees in question are rooted trees, but without the specification of the outgroup, the root node appears to be selected somewhat randomly. A longer discussion of this issue appears below.

[Tree figure; tip labels as extracted: Russian RUSSIAN_P Welsh_N Welsh_C Breton_List Breton_ST Irish_A Irish_B Faroese Icelandic_ST Swedish_Up Danish Afrikaans Dutch_List Flemish Frisian Italian Brazilian Portuguese_ST Spanish Catalan Ladin French Provencal Walloon Sardinian_N Sardinian_C Sardinian_L Vlach Rumanian_List Marathi Gujarati Hindi Panjabi_ST Lahnda Bengali Khaskura Kashmiri Gypsy_Gk Singhalese Ossetic Wakhi Tadzik Baluchi Waziri ALBANIAN Albanian_G Albanian_T Albanian_Top Albanian_K Albanian_C Tocharian_B Tocharian_A Hittite Greek_Mod Greek_D Greek_K Armenian_List Armenian_Mod Lithuanian_O Lithuanian_ST Latvian Slovenian Bulgarian Macedonian BULGARIAN_P MACEDONIAN_P Serbocroatian SERBOCROATIAN_P SLOVENIAN_P CZECH_P Czech_E Slovak Czech SLOVAK_P POLISH_P Byelorussian Ukrainian Polish BYELORUSSIAN_P UKRAINIAN_P]

Figure 4.8: Dyen List, No Specification of Root

[Tree figure; tip labels as extracted: Proto-Indo-European Italian Brazilian Portuguese_ST Spanish Catalan Ladin French Provencal Walloon Sardinian_N Sardinian_C Sardinian_L Vlach Rumanian_List Russian RUSSIAN_P Byelorussian Ukrainian Polish BYELORUSSIAN_P UKRAINIAN_P CZECH_P Czech_E Slovak Czech SLOVAK_P POLISH_P Bulgarian Macedonian BULGARIAN_P MACEDONIAN_P Serbocroatian SERBOCROATIAN_P SLOVENIAN_P Slovenian Lithuanian_O Lithuanian_ST Latvian Marathi Gujarati Hindi Panjabi_ST Lahnda Bengali Khaskura Kashmiri Gypsy_Gk Singhalese Ossetic Wakhi Tadzik Baluchi Waziri ALBANIAN Albanian_G Albanian_T Albanian_Top Albanian_K Albanian_C Tocharian_B Tocharian_A Hittite Greek_Mod Greek_D Greek_K Armenian_List Armenian_Mod Welsh_N Welsh_C Breton_List Breton_ST Irish_A Irish_B Faroese Icelandic_ST Swedish_Up Danish Afrikaans Dutch_List Flemish Frisian]

Figure 4.9: Dyen List, Specification of Root

4.5.1 Root Node Selection

One of the most consistent characteristics of the trees resulting from the study is the difficulty of selecting an appropriate root node. From an algorithmic perspective, the optimal root node, the one that provides for the smallest number of independent changes in the tree, will be one with features similar to the proto-language of one of the subfamilies of the tree. For instance, in Figure 4.8 above, the algorithm selected a root node that fits very closely with Common Slavic. This choice produces a tree that minimizes the common innovations required, but represents the Slavic languages as very basal branches, with few innovations away from the proto-language. Even then, however, the closely-related Slavic languages are grouped together, and the Baltic languages are a sister branch to all Slavic.

This selection, while clearly incorrect with respect to actual history, does usefully minimize the number of independent changes along the branches of the resulting tree. This demonstrates clearly that

1. efficient structure and historical accuracy are not strictly related, and

2. information about the actual structure of the root node (i.e. the proto-language) is critical for accurately identifying the subfamilies appropriately.

There are a few solutions to the root-node selection problem. The first, and the one that I tested here, is simply to specify the root node. This obviates the problem entirely, and, as seen above, results in significant improvement over the naive use of Bayesian MCMC phylogenetic methods.

Another option that may improve root node selection in naive runs is a process of re-selecting a branch as a new root node. For any tree structure, it is possible to select an arbitrary edge as the new root of the tree. A new node is created on that edge. The nodes above that new node then become children of the new root, and the nodes below remain the same. See below for an example, where the tree is re-rooted on the edge between nodes A and E.

The initial state of the tree:

A
+-- B
|   +-- C
|   +-- D
+-- E
    +-- F
    +-- G

The new node is created on the edge between A and E, and the tree is re-rooted there:

AE
+-- A
|   +-- B
|       +-- C
|       +-- D
+-- E
    +-- F
    +-- G
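The re-rooting step illustrated above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the tree is stored as a hypothetical mapping from each internal node to its list of children, and the function name `reroot_on_edge` is introduced here for illustration only. It is not the procedure used by any particular phylogenetics package.

```python
from collections import defaultdict


def reroot_on_edge(children, parent, child, new_root):
    """Return a children-map for the tree re-rooted on the edge parent-child.

    `children` maps each internal node to a list of its children
    (an assumed, illustrative representation).
    """
    # Forget the direction of the edges: build an undirected adjacency list.
    adj = defaultdict(list)
    for node, kids in children.items():
        for kid in kids:
            adj[node].append(kid)
            adj[kid].append(node)
    if child not in adj[parent]:
        raise ValueError("no edge between the given nodes")

    # Split the chosen edge by placing the new root node on it.
    adj[parent].remove(child)
    adj[child].remove(parent)
    adj[new_root] = [parent, child]
    adj[parent].append(new_root)
    adj[child].append(new_root)

    # Re-orient every edge so that it points away from the new root.
    rerooted = {}
    stack = [(new_root, None)]
    while stack:
        node, came_from = stack.pop()
        kids = [n for n in adj[node] if n != came_from]
        if kids:
            rerooted[node] = kids
        stack.extend((kid, node) for kid in kids)
    return rerooted


# The example from the text: re-root on the edge between A and E.
tree = {"A": ["B", "E"], "B": ["C", "D"], "E": ["F", "G"]}
rerooted = reroot_on_edge(tree, "A", "E", "AE")
```

In the result, the new node AE has the children A and E, A keeps only B, and the subtrees below B and E are unchanged, matching the second diagram.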

Starting from the root and iteratively moving down a naively-created tree until the leaves are all at similar distances from the root may well help to resolve the root selection issue. For example, in the tree in Figure 4.9 above, the iterative process would move down from the current root to the edge between Slovenian and the common ancestor of the Baltic languages and the remainder of Indo-European. This single example is insufficient to show the general usefulness of the rerooting method for root-node selection, but it does show that naive runs may in principle be improved drastically with a simple presumption of similar rates of change.

Even with these limitations, the remainder of the tree (those languages outside the subfamily selected as close to the root) is largely placed in accordance with traditional reconstructions, demonstrating that these methods are quite robust.
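The criterion of similar root-to-leaf distances can be made concrete with a small sketch. Two simplifications are assumed and should be read as such: the search below tries every edge exhaustively instead of walking down from the current root, and it places the candidate root at the midpoint of each edge. The edge lengths and node names are invented for illustration.

```python
def tip_distances(adj, start, came_from, dist):
    """Distances from a root candidate to every tip on the `start` side of an edge."""
    onward = [(n, w) for n, w in adj[start].items() if n != came_from]
    if not onward:  # `start` is a tip
        return [dist]
    found = []
    for nxt, weight in onward:
        found.extend(tip_distances(adj, nxt, start, dist + weight))
    return found


def spread_if_rooted_on(adj, u, v):
    """Max minus min root-to-tip distance with the root at the midpoint of u-v."""
    half = adj[u][v] / 2
    dists = tip_distances(adj, u, v, half) + tip_distances(adj, v, u, half)
    return max(dists) - min(dists)


def best_root_edge(edges):
    """Pick the edge whose midpoint makes the tips most nearly equidistant."""
    adj = {}
    for u, v, weight in edges:
        adj.setdefault(u, {})[v] = weight
        adj.setdefault(v, {})[u] = weight
    return min(((u, v) for u, v, _ in edges),
               key=lambda edge: spread_if_rooted_on(adj, *edge))


# Hypothetical unrooted tree: tips a, b hang off X; tips c, d hang off Y.
edges = [("X", "a", 1.0), ("X", "b", 1.0), ("X", "Y", 4.0),
         ("Y", "c", 1.0), ("Y", "d", 1.0)]
chosen = best_root_edge(edges)
```

For this toy tree the search selects the long internal edge between X and Y, where all four tips sit at the same distance from the root. Real trees would rarely balance exactly, and a fuller version would also need to slide the root along the chosen edge.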

4.6 Implications and Limitations

The results of the phonological characters introduced here suggest that the inclusion of more phonological characters may be better still. Including environments where changes are likely to occur, such as before a high vowel, may also provide a useful phylogenetic signal to benefit Bayesian MCMC runs. However, in selecting such environments, caution must be taken to avoid choosing those environments that confirm pre-existing notions about the structure of the family. If environments are chosen in such a way as to give evidence about known sound changes, this methodology would be in essence no different from simply selecting already-known sound laws and coding languages as 1 or 0 for their presence or absence. I know of no existing hierarchy of phonological environments for sound changes that could provide a reasonable basis for selection, but such a thing may exist. Further research here could well provide a principled selection procedure and permit a much greater number of phonological characters without selection bias on the part of the researcher.

Further, the somewhat kludgy method of specifying the proto-language in the data could be much improved. As discussed earlier, the languages in question are specified to be a monophyletic in-group, and the proto-language is specified as the outgroup. This approximates, but does not actually force, the root node of the tree to represent the exact vectorized version of the proto-language. At the time of writing, this was the best phylogenetic packages could do. However, Chang et al. 2015, as part of their study, have produced a custom version of the BEAST phylogeny package that permits ancestor nodes to be specified exactly. This package is exceedingly new, and using it may allow me to implement an exact root node specification, rather than the workaround necessary here.
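Both data-coding ideas mentioned in this section, 1/0 coding for the presence or absence of a sound law and a proto-language row added as an extra taxon, can be sketched as follows. This is an illustration only: the function `to_nexus`, the choice of two laws, and the five-taxon matrix are assumptions made for the example and are not the data files used in the experiments. The output is a generic NEXUS data block; whether a given package accepts these exact format options should be checked against its documentation.

```python
def to_nexus(matrix, characters):
    """Render {taxon: {character: 0, 1, or None}} as a generic NEXUS data block."""
    width = max(len(name) for name in matrix)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(matrix)} nchar={len(characters)};",
        "  format datatype=standard missing=? gap=-;",
        "  matrix",
    ]
    for taxon, row in matrix.items():
        # A missing or None value is written as the NEXUS missing symbol.
        states = "".join("?" if row.get(c) is None else str(row[c])
                         for c in characters)
        lines.append(f"    {taxon.ljust(width)}  {states}")
    lines += ["  ;", "end;"]
    return "\n".join(lines)


# Illustrative characters: 1 = the law applied in that language, 0 = it did not.
characters = ["Grimm", "Grassmann"]
matrix = {
    # The proto-language row is all zeros: no law has applied yet.
    # Underscores replace hyphens because NEXUS treats "-" as punctuation.
    "Proto_Indo_European": {"Grimm": 0, "Grassmann": 0},
    "Gothic": {"Grimm": 1, "Grassmann": 0},
    "Old_English": {"Grimm": 1, "Grassmann": 0},
    "Ancient_Greek": {"Grimm": 0, "Grassmann": 1},
    "Vedic_Sanskrit": {"Grimm": 0, "Grassmann": 1},
}
nexus_text = to_nexus(matrix, characters)
```

Coding known laws this way is exactly the selection-bias risk described above, so the sketch shows the mechanics of the encoding and is not a recommendation of these particular characters.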

CHAPTER 5

Conclusion

It should come as no surprise that sound changes carry a phylogenetic signal. Such changes are, after all, how family trees have been constructed in the past. The experiments above, however, have shown that carefully encoding phonemic data in a way that captures sound changes without introducing bias by choosing specific changes can provide useful extra information for computational phylogeny, as can adding a reconstruction of the proto-language as an outgroup.

These techniques presuppose the existence of some amount of scholarly work reconstructing the family in question. Indeed, Bayesian MCMC phylogeny is completely impossible without data points that are known to be cognate, i.e. descended from the same initial form, which already requires that some historical work has been done. Such work need not necessarily be as thoroughgoing as the Indo-European data presented here, but at least basic cognacy data must be present. For a completely unstudied language family, no tree construction will be possible, computationally or otherwise.

5.1 Primary Benefits

This work has two primary benefits: a mechanism for including sound change data without directly selecting sound changes, and a demonstration that adding the reconstructed proto-language as an outgroup to the rest of the languages is useful for tree accuracy. In both cases, the benefits are mostly available where significant reconstruction has already been done on a language family. Tests of these new data mechanisms on language families with a shorter history of historical scholarship have yet to be done but may still show some benefits.

5.2 Secondary Benefits

This work further includes both easily-accessible data files for computational phylogeny and instructions for running the exact experiments with MrBayes. The technical barriers to reproducing this work are considerable, and I hope that this reduces them somewhat and permits other scholars to experiment with computational phylogeny with other language families and other kinds of data.

5.3 Future Work

Several immediate expansions of this work present themselves, both in Indo-European and further afield. Including sound changes more directly, but in an extremely complete fashion, as characters for computational phylogeny would be a useful experiment. Collections of sound changes exist, such as The Laws of Indo-European [Collinge, 1985]. Including every law in this volume, and coding the Indo-European languages for whether the law applies to them, would at the very least provide a larger dataset of sound changes than was applied in Ringe and Warnow (1999) or in this work. This sort of data, of course, will only be available in the most-studied language families.

Applying these methods to other language families is another obvious next step. Preliminary work is already ongoing for the Mande language family, as seen in the initial network diagram in Figure 5.1. Mande is a much less studied language family than Indo-European, and initial work so far has used only word lists tagged for cognacy. As more data becomes available, I will expand this initial work to include more data and the techniques tested in this thesis.

Figure 5.1: A SplitsTree network showing the relationships among the Mande languages

APPENDIX A

Source Data

For access to the input data files for the experiments in this work, please email the author, or request access to the GitHub repository at https://github.com/styndall/iephylo. The data files are too large to include in plain text here.

The following is the list of languages compiled by IELEX, presented in alphabetical order. The capitalization and underscoring are present in the databases and serve to differentiate different sources. The languages in all caps are from the Dyen et al. wordlist, while the others are later additions. The number following the language is the number of Swadesh entries present for each language. The number may be larger than 207, as some languages may have more than one word in a particular Swadesh slot. Note that the numbers vary widely: Palaic is very sparsely attested and has attested words for only 16 Swadesh entries, while Ancient Greek (much better attested) has 248 distinct entries.

1. ALBANIAN 207

2. Afrikaans 223

3. Albanian C 168

4. Albanian G 207

5. Albanian K 181

6. Albanian Standard 197

7. Albanian T 223

8. Albanian Top 212

9. Ancient Greek 248

10. Armenian List 172

11. Armenian Mod 217

12. Assamese 207

13. Avestan 188

14. BULGARIAN P 193

15. BYELORUSSIAN P 194

16. Baluchi 194

17. Bengali 199

18. Bihari 223

19. Brazilian 223

20. Breton List 217

21. Breton ST 219

22. Breton Se 199

23. Bulgarian 194

24. Byelorussian 198

25. CZECH P 192

26. Catalan 232

27. Classical Armenian 197

28. Cornish 222

29. Czech 226

30. Czech E 208

31. Danish 204

32. Digor Ossetic 189

33. Dolomite Ladino 207

34. Dutch List 227

35. Elfdalian 1

36. English 199

37. Faroese 227

38. Flemish 216

39. French 216

40. French Creole 208

41. French Creole 194

42. Frisian 193

43. Friulian 232

44. Gaulish 55

45. German 218

46. German Munich 200

47. Gothic 198

48. Greek D 204

49. Greek K 201

50. Greek Md 210

51. Greek Ml 199

52. Greek Mod 216

53. Gujarati 193

54. Gypsy Gk 178

55. Hindi 224

56. Hittite 114

57. Icelandic ST 213

58. Irish A 200

59. Irish B 214

60. Iron Ossetic 195

61. Italian 231

62. Kashmiri 193

63. Kati 107

64. Khaskura 198

65. Kurdish 117

66. Ladin 212

67. Lahnda 192

68. Latin 226

69. Latvian 231

70. Lithuanian O 195

71. Lithuanian ST 217

72. Lower Sorbian 190

73. Luvian 46

74. Luxembourgish 201

75. Lycian 17

76. MACEDONIAN P 193

77. Macedonian 224

78. Magahi 62

79. Manx 111

80. Marathi 228

81. Marwari 182

82. Nepali 244

83. Norwegian 207

84. Old Breton 79

85. Old Church 235

86. Old Cornish 55

87. Old English 240

88. Old High 235

89. Old Irish 187

90. Old Norse 270

91. Old Persian 72

92. Old Prussian 135

93. Old Welsh 36

94. Oriya 241

95. Oscan 28

96. Ossetic 206

97. POLISH P 194

98. Palaic 16

99. Panjabi ST 198

100. Pashto 210

101. Pennsylvania Dutch 185

102. Persian 205

103. Polish 220

104. Portuguese ST 245

105. Proto-Indo-European 164

106. Provencal 239

107. RUSSIAN P 195

108. Romansh 229

109. Rumanian List 214

110. Russian 205

111. SERBOCROATIAN P 234

112. SLOVAK P 192

113. SLOVENIAN P 192

114. Sardinian C 191

115. Sardinian L 195

116. Sardinian N 183

117. Sariqoli 189

118. Schwyzerdutsch 199

119. Scots Gaelic 196

120. Serbian 229

121. Serbocroatian 203

122. Shughni 223

123. Sindhi 189

124. Singhalese 135

125. Slovak 217

126. Slovenian 179

127. Sogdian 172

128. Spanish 233

129. Sranan 163

130. Swedish 236

131. Swedish Up 222

132. Swedish Vl 209

133. Tadzik 236

134. Tocharian A 143

135. Tocharian B 154

136. UKRAINIAN P 193

137. Ukrainian 225

138. Umbrian 33

139. 196

140. Urdu 213

141. Vedic Sanskrit 249

142. Vlach 160

143. Wakhi 189

144. Walloon 195

145. Waziri 204

146. Welsh C 198

147. Welsh N 214

148. Zazaki 182

BIBLIOGRAPHY

[Abney, 1996] Abney, S. (1996). Statistical methods and linguistics. The balancing act: Combining symbolic and statistical approaches to language, pages 1–26.

[Andrieu et al., 2003] Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. (2003). An introduction to MCMC for machine learning. Machine learning, 50(1):5–43.

[Anthony, 2007] Anthony, D. W. (2007). The horse, the wheel and language: How Bronze-Age riders from the steppes shaped the modern world.

[Atkinson and Gray, 2006] Atkinson, Q. and Gray, R. (2006). How old is the Indo-European language family? Illumination or more moths to the flame? Phylogenetic methods and the prehistory of languages, pages 91–109.

[Bakker et al., 2011] Bakker, P., Daval-Markussen, A., Parkvall, M., and Plag, I. (2011). Creoles are typologically distinct from non-creoles. Journal of Pidgin and Creole Languages, 26(1):5–42.

[Bergsland and Vogt, 1962] Bergsland, K. and Vogt, H. (1962). On the validity of glottochronology. Current Anthropology, 3(2):115–153.

[Blust, 1999] Blust, R. (1999). Subgrouping, circularity and extinction: some issues in Austronesian comparative linguistics. In Selected papers from the Eighth International Conference on Austronesian Linguistics, volume 1, pages 31–94. Symposium Series of the Institute of Linguistics (Preparatory Office), Academia Sinica.

[Bogucki, 1996] Bogucki, P. (1996). The spread of early farming in Europe. American Scientist, pages 242–253.

[Campbell, 2004] Campbell, L. (2004). Historical linguistics: an introduction. MIT Press.

[Chang et al., 2015] Chang, W., Cathcart, C., Hall, D., and Garrett, A. (2015). Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91(1):194–244.

[Collinge, 1985] Collinge, N. (1985). The Laws of Indo-European, volume 35. John Benjamins Publishing.

[Darden, 2001] Darden, B. (2001). On the question of the Anatolian origin of Indo-Hittite. Greater Anatolia and the Indo-Hittite language family, ed. by Drews, pages 184–228.

[Darwin, 1871] Darwin, C. (1871). The descent of man.

[Diamond and Bellwood, 2003] Diamond, J. and Bellwood, P. (2003). Farmers and their languages: the first expansions. Science, 300(5619):597–603.

[Drummond et al., 2006] Drummond, A. J., Ho, S. Y., Phillips, M. J., and Rambaut, A. (2006). Relaxed phylogenetics and dating with confidence. PLoS biology, 4(5):e88.

[Dyen et al., 1992] Dyen, I., Kruskal, J., and Black, P. (1992). An Indoeuropean classification: A lexicostatistical experiment. Transactions of the American Philosophical Society, 82(5).

[Eckhardt, 1987] Eckhardt, R. (1987). Stan Ulam, John von Neumann, and the Monte Carlo method. Los Alamos Science, 15:131–136.

[Garrett, 2012] Garrett, A. (2012). The chronology of Proto-Indo-European: Reassessing Gray and Atkinson (2003). Conference Presentation, 14th Biennial Reconstruction Workshop, Ann Arbor, Michigan.

[Gray and Atkinson, 2003] Gray, R. and Atkinson, Q. (2003). Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435.

[Gray et al., 2011] Gray, R. D., Atkinson, Q. D., and Greenhill, S. J. (2011). Language evolution and human history: what a difference a date makes. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1567):1090–1100.

[Greenhill and Gray, 2005] Greenhill, S. and Gray, R. (2005). Testing population dispersal hypotheses: Pacific settlement, phylogenetic trees and Austronesian languages. The evolution of cultural diversity: A phylogenetic approach, pages 31–52.

[Greenhill et al., 2010] Greenhill, S. J., Atkinson, Q. D., Meade, A., and Gray, R. D. (2010). The shape and tempo of language evolution. Proceedings of the Royal Society B: Biological Sciences, 277(1693):2443–2450.

[Greenhill et al., 2009] Greenhill, S. J., Currie, T. E., and Gray, R. D. (2009). Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society B: Biological Sciences, 276(1665):2299–2306.

[Holden, 2002] Holden, C. (2002). Bantu language trees reflect the spread of farming across sub-Saharan Africa: a maximum-parsimony analysis. Proceedings of the Royal Society of London. Series B: Biological Sciences, 269(1493):793–799.

[Holden and Gray, 2006] Holden, C. and Gray, R. (2006). Rapid radiation, borrowing and dialect continua in the Bantu languages. Phylogenetic Methods and the Prehistory of Languages, page 19.

[Huelsenbeck et al., 2001] Huelsenbeck, J., Ronquist, F., et al. (2001). MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755.

[Huffman, 1982] Huffman, T. N. (1982). Archaeology and ethnohistory of the African Iron Age. Annual review of Anthropology, 11:133–150.

[Khan, 2012] Khan, R. (2012). There are more things in prehistory than are dreamt of in our urheimat. http://blogs.discovermagazine.com/gnxp/2012/08/there-are-more-things-in-prehistory-than-are-dreamt-of-in-our-urheimat/. [Online; Accessed 26-Oct-2012].

[Lees, 1953] Lees, R. (1953). The basis of glottochronology. Language, pages 113–127.

[Mallory, 1989] Mallory, J. (1989). In search of the Indo-Europeans: Language, archaeology and myth. London: Thames and Hudson.

[Nichols and Warnow, 2008] Nichols, J. and Warnow, T. (2008). Tutorial on computational linguistic phylogeny. Language and Linguistics Compass, 2(5):760–820.

[Pagel et al., 2007] Pagel, M., Atkinson, Q., and Meade, A. (2007). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature, 449(7163):717–720.

[Renfrew, 1990] Renfrew, C. (1990). Archaeology and language: the puzzle of Indo-European origins. Cambridge University Press.

[Rexová et al., 2006] Rexová, K., Bastin, Y., and Frynta, D. (2006). Cladistic analysis of Bantu languages: a new tree based on combined lexical and grammatical data. Naturwissenschaften, 93(4):189–194.

[Rexová et al., 2003] Rexová, K., Frynta, D., and Zrzavý, J. (2003). Cladistic analysis of languages: Indo-European classification based on lexicostatistical data. Cladistics, 19(2):120–127.

[Ringe et al., 2002] Ringe, D., Warnow, T., and Taylor, A. (2002). Indo-European and computational cladistics. Transactions of the philological society, 100(1):59–129.

[Sicoli and Holton, 2014] Sicoli, M. A. and Holton, G. (2014). Linguistic phylogenies support back-migration from Beringia to Asia. PLoS ONE, 9(3):e91722.

[Swadesh, 1952] Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts: with special reference to North American Indians and Eskimos. Proceedings of the American philosophical society, pages 452–463.

[Swadesh, 1955] Swadesh, M. (1955). Towards greater accuracy in lexicostatistic dating. International journal of American linguistics, pages 121–137.

[Tavaré, 1986] Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci, 17:57–86.

[Watkins, 1969] Watkins, C. (1969). Indo-European and the Indo-Europeans. The American Heritage College Dictionary.