Linguistic Phylogeny with Bayesian Markov Chain Monte Carlo: the Case of Indo-European
Linguistic Phylogeny with Bayesian Markov Chain Monte Carlo: The Case of Indo-European

by
Stephen Tyndall

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Linguistics) in the University of Michigan
2019

Doctoral Committee:
Professor Sarah Thomason, Chair
Associate Professor Steven Abney
Professor William H. Baxter
Professor Benjamin Fortson

© Stephen Tyndall 2019
[email protected]
ORCID iD: 0000-0001-7276-6695

ACKNOWLEDGMENTS

First and foremost, I would like to thank Sally Thomason, without whose encouragement and support I’d never have completed this thesis. And thank you as well to the rest of the committee for providing advice and ideas. Their suggestions and help have been invaluable. Thanks as well to my friends and colleagues, particularly Kate Sherwood, Terry Szymanski, Rob Gillezeau, Lauren Squires, Tayfun Bilgin, Jon Yip, and Kevin McGowan, who provided support, friendship, and motivation through this long process. Further thanks to the members of GLaM and GEO for all the social and professional aid, and to Bu and Ping for their calming presence through the final stages of this thesis. Finally, thank you to my family, Karen Brichetto, Eric Tyndall, David Richardson, and Doug Tyndall, for all your support during graduate school. I could not have done this without you.

TABLE OF CONTENTS

Acknowledgments
List of Figures
List of Tables
Abstract
Chapter
1 Introduction
  1.1 Background Information
  1.2 Statistical Methods in Linguistics
  1.3 The Particular Problem of Indo-European (Computational) Phylogeny
  1.4 Dissertation Objectives
  1.5 Scope and Limitations
  1.6 Structure of the Dissertation
    1.6.1 Literature Review
    1.6.2 Methodology
    1.6.3 Results
    1.6.4 Conclusion
2 Literature Review
  2.1 Computational Phylogeny in Linguistics
    2.1.1 Prior Computational Phylogeny Work in Linguistics
  2.2 Models of the spread of the Indo-European Language Family
    2.2.1 Traditional Family Tree(s) of IE
    2.2.2 Non-Computational Dating Methods for the Breakup of PIE
    2.2.3 Models of the Breakup and Spread of Proto-Indo-European
    2.2.4 The Farming Model
    2.2.5 The Pastoralist Model
  2.3 Computational Phylogeny of Proto-Indo-European
    2.3.1 Gray and Atkinson
    2.3.2 Chang et al. 2005
    2.3.3 The Ringe and Warnow Method
    2.3.4 Computational Phylogeny and Estimating the Dates of Language Splits
3 Methodology
  3.1 Specifying Retentions with MrBayes
  3.2 Capturing Phonological Changes as Characters for Bayesian MCMC
  3.3 Data Coding
4 Results
  4.1 A Brief Survey of the Results
  4.2 Summary of the Results
    4.2.1 Characteristics and Subdivisions of the Set of Swadesh Lists
    4.2.2 Kitchen Sink Results
  4.3 Six-Way Results
    4.3.1 Phonological Characters Only
    4.3.2 Lexical Characters Only
    4.3.3 Lexical and Phonological Characters
    4.3.4 Overall Consideration of the Six Conditions
  4.4 Replication of Gray and Atkinson 2003
    4.4.1 Replication without Root Node Specification
    4.4.2 Replication with Root Node Specification
  4.5 Discussion
    4.5.1 Root Node Selection
  4.6 Implications and Limitations
5 Conclusion
  5.1 Primary Benefits
  5.2 Secondary Benefits
  5.3 Future Work
Appendix
Bibliography

LIST OF FIGURES

2.1 Network from [Holden and Gray, 2006].
2.2 Network from [Greenhill and Gray, 2005]. This splitstree network graph of Austronesian demonstrates a strong tree-like signal in the data used to create the graph.
2.3 Network from [Bakker et al., 2011]. Splitstree network showing very weak treelike signal in source data.
2.4 Tree from Gray and Atkinson (2003).
4.1 Entire IELEX Lexical Tree.
4.2 Ancient Languages, Phonology Only, No Retentions Specified.
4.3 Ancient Languages, Phonology Only, Retentions Specified.
4.4 Ancient Languages, Lexical Characters Only, No Retentions Specified.
4.5 Ancient Languages, Lexical Characters Only, Retentions Specified.
4.6 Ancient Languages, Lexical and Phonological Characters, No Retentions Specified.
4.7 Ancient Languages, Lexical and Phonological Characters, Retentions Specified.
4.8 Dyen List, No Specification of Root.
4.9 Dyen List, Specification of Root.
5.1 A splitstree network showing the relationships among the Mande languages.

LIST OF TABLES

3.1 Manner Outcomes of PIE *p
3.2 Manner Outcomes of PIE *t
3.3 Manner Outcomes of PIE *ḱ

ABSTRACT

This study aims to improve the application of computational phylogeny to language families by testing a series of possible improvements in the structure and use of input data, particularly in order to help reconcile computational techniques with traditional historical linguistic family tree creation. The method used is Bayesian Markov Chain Monte Carlo (Bayesian MCMC), with the MrBayes software package, replicating several prior computational phylogenetic studies of the Indo-European language family. The study examines two hypotheses about changes to the input data for the method: that distinguishing innovations from retentions in the input data will improve the resulting trees, and that providing a bias-free mechanism for incorporating sound change data in addition to lexical data will further improve both the output and the usability of Bayesian MCMC. The results show small improvements over baseline under both hypotheses, demonstrating the utility of both kinds of additional data. Further work will include application of these new data types to other language families.

CHAPTER 1

Introduction

In this dissertation, I aim to improve the application of computational phylogeny to language families by testing a series of possible improvements in the structure and use of input data, particularly in order to help reconcile computational techniques with traditional historical linguistic family tree creation.
The use of computational phylogeny for linguistics is both recent and controversial, particularly in its use by Gray & Atkinson (2003) to date the breakup of the Indo-European language family. I extend the method used by Gray & Atkinson in two ways: by including sound change data without pre-selecting particular changes, and by carefully distinguishing innovations from retentions in both lexical and sound change data.

1.1 Background Information

The process of constructing family trees (phylogenies) for languages is very similar to constructing phylogenies for biological species. Even as far back as the 19th century, Darwin observed parallels between languages and species: “The formation of different languages and of distinct species, and the proofs that both have been developed through a gradual process, are curiously parallel. We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation.” [Darwin, 1871, 57-58]. A number of techniques have been developed by biologists to produce species family trees and to date the breakup of populations. It should come as no surprise, then, that some of these techniques have been adapted to producing language phylogenies from linguistic data.

In particular, a number of scholars, primarily non-linguists, have applied a computational and statistical technique called the Bayesian Markov Chain Monte Carlo (hereafter Bayesian MCMC) method to language families. This technique has been used with considerable success in biology, and some uses within linguistics have shown great promise in generating accurate family trees and dating the breakup of linguistic groups. Recently, the use of the technique to date the breakup of a particularly well-studied language family, Indo-European, has generated significant controversy among historical linguists.

The Indo-European language family has one of the longest histories of study in linguistic science.
The family contains ten major sub-families, the earliest of which,