Phonotactics in Historical Linguistics: Quantitative Interrogation of a Novel Data Source
Jayden Macklin-Cordes
BA (Hons)
0000-0003-1910-3969

A thesis submitted for the degree of Doctor of Philosophy at The University of Queensland in 2021
School of Languages and Cultures

Abstract

Historical linguistics has been uncovering historical patterns of shared linguistic descent with considerable success for nearly two centuries. In recent decades, historical linguistics has been augmented with quantitative phylogenetic methods, increasing the scope and scale of linguistic phylogenetic trees that can be inferred and, by extension, the kinds of historical questions that can be investigated. Throughout this period of rapid methodological development, lexical cognate data has remained the main currency of interest. However, there is an immense array of structural variation throughout the world’s languages (the outcome of thousands of years of linguistic evolution), raising the prospect of additional sources of historical signal which could help infer shared linguistic pasts.

This thesis investigates the prospect of making linguistic phylogenetic inferences using quantitative phonotactic data, semi-automatically extracted from wordlists. The primary research question motivating the study is as follows. Can a linguistic phylogeny be inferred with greater confidence when lexical cognate data is combined with phonotactic data? Underlying this is the more general question of how to approach and evaluate novel empirical data sources in linguistic phylogenetic research.

I begin with a literature review, situating current linguistic phylogenetic research within historical linguistics broadly and elucidating the motivations for looking further into phonotactics as a data source. I also introduce Australian languages and their phonological profiles, as these provide the language sample for the rest of the thesis.

Four papers follow. The first re-evaluates phoneme frequencies.
The deceptively simple question of which statistical distribution best characterises the frequencies of phonemes in a language is crucial, because particular distribution types can be suggestive of particular causal processes that generated them. Using a maximum likelihood framework, I find modest evidence that more frequent phonemes are characterised by a power law distribution while less frequent phonemes are characterised by an exponential distribution. This is followed by discussion of theoretical implications.

The second paper reviews the place of phylogenetic methods in linguistics. It considers the scientific task of comparison and the potential confound of phylogenetic autocorrelation. I discuss comparative approaches from within linguistics and other domains of science and I advocate for the increased uptake of methods which control for phylogeny directly over various balanced sampling methods commonly employed in linguistic typology.

The third paper draws on the phylogenetic comparative methods discussed above to quantify the degree of phylogenetic signal in phonotactic data. Three simple data representations of phonotactics are extracted from wordlists: binary biphone data, coding whether or not any given sequence of two phonological segments is present; biphone frequency data, coding the relative frequencies of a segment being preceded or followed by another segment; and sound class biphone data, coding the relative frequencies of a natural sound class being preceded or followed by another sound class. I find some degree of phylogenetic signal in all datasets, with the strongest signal in the sound class dataset.

Finally, the fourth paper tests the primary research question of this thesis.
Using a Bayesian computational approach, I infer and compare phylogenies of 44 western Pama-Nyungan languages using two models: one in which the tree is inferred using binary biphone data, sound class data and pre-existing cognate data together, versus one in which two trees are inferred, one using the phonotactic data and the other using the cognate dataset alone. Within the bounds of current computational limits, I do not find evidence that phonotactic data strengthen support for the resulting phylogeny. Although marginal likelihood estimates seem to offer prima facie support for the hypothesis, there is evidence that the evolutionary model for phonotactic data is insufficient and the resulting calculations are unreliable.

I conclude the thesis with a discussion, drawing together the previous papers and detailing a methodological scaffold for investigating novel sources of linguistic data in quantitative historical linguistics in future. There are clear steps for future research to pursue with phonotactics. More generally, however, I present an empirical, step-wise process for methodological development in quantitative historical linguistics, from the initial link between language change processes and observable data distributions, to implementation within complex, expressive phylogenetic models.

Declaration by author

This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.

I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, financial support and any other original research work used or reported in my thesis.
The content of my thesis is the result of work I have carried out since the commencement of my higher degree by research candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.

I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.

I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis and have sought permission from co-authors for any jointly authored works included in the thesis.

Publications included in this thesis

Macklin-Cordes, Jayden L., Claire Bowern & Erich R. Round. 2021. Phylogenetic signal in phonotactics. Diachronica 38. https://doi.org/10.1075/dia.20004.mac.

Macklin-Cordes, Jayden L. & Erich R. Round. 2020a. Re-evaluating phoneme frequencies. Frontiers in Psychology 11. https://doi.org/10.3389/fpsyg.2020.570895.

Submitted manuscripts included in this thesis

Macklin-Cordes, Jayden L. & Erich R. Round. Submitted. Challenges of sampling and how phylogenetic comparative methods help: With a case study of the Pama-Nyungan laminal contrast. Preprint. https://doi.org/10.5281/zenodo.4796981.

Other publications during candidature

Round, Erich R., T. Mark Ellison, Jayden L. Macklin-Cordes & Sacha Beniamine. 2020. Automated parsing of interlinear glossed text from page images of grammatical descriptions.
In 12th language resources and evaluation conference (LREC-2020), 2871–2876. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.350.

Conference presentations

Hollis, Jordan, Genevieve C. Richards, Jayden L. Macklin-Cordes & Erich R. Round. 2016. The Cape York lexical records of Bruce Sommer. In Australian Linguistics Society (ALS) annual conference. Monash University, Caulfield, Australia. https://doi.org/10.6084/m9.figshare.4299377.

Macklin-Cordes, Jayden L. 2017. High-definition phonotactic typology in Sahul: Choosing the data-rich approach. In Association of linguistic typology (ALT-12). Australian National University, Canberra, Australia. https://doi.org/10.6084/m9.figshare.5861067.

Macklin-Cordes, Jayden L., Nathaniel L. Blackbourne, Thomas J. Bott, Jacqueline Cook, T. Mark Ellison, Jordan Hollis, Edith E. Kirlew, Genevieve C. Richards, Sanle Zhao & Erich R. Round. 2017. Robots who read grammars [poster]. In CoEDL fest. Alexandra Park Conference Centre, Alexandra Headlands, QLD, Australia. https://doi.org/10.6084/m9.figshare.4625248.

Macklin-Cordes, Jayden L., Steven P. Moran & Erich R. Round. 2016. Evaluating information loss from phonological dimensionality reduction. In Speech science and technology (SST) satellite symposium: The role of predictability in shaping human language sound patterns. Western Sydney University, Australia. https://doi.org/10.6084/m9.figshare.4465976.

Macklin-Cordes, Jayden L. & Erich R. Round. 2016a. High-definition phonotactic data contain phylogenetic signal [poster]. In 90th annual meeting of the Linguistic Society of America (LSA). Washington, DC. https://doi.org/10.6084/m9.figshare.4534238.

Macklin-Cordes, Jayden L. & Erich R. Round. 2016b. High-definition phonotactics reflect linguistic pasts. Yale Linguistics Friday Lunch Talk. Yale University, New Haven, CT.

Macklin-Cordes, Jayden L. & Erich R. Round. 2016c. Reflections of linguistic history in quantitative phonotactics.
In Australian Linguistics Society