Taraka Rama Studies in Computational Historical Linguistics

i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page 1 — #1 i i i Taraka Rama Studies in computational historical linguistics i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page 2 — #2 i i i Data linguistica <http://www.svenska.gu.se/publikationer/data-linguistica/> Editor: Lars Borin Språkbanken Department of Swedish University of Gothenburg 27 • 2015 i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page 3 — #3 i i i Taraka Rama Studies in computational historical linguistics Models and analysis Gothenburg 2015 i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page 4 — #4 i i i Data linguistica 27 ISBN ISSN 0347-948X Printed in Sweden by Reprocentralen, Campusservice Lorensberg, University of Gothenburg 2015 Typeset in LATEX 2e by the author Cover design by Kjell Edgren, Informat.se Front cover illustration: Author photo on back cover by Kristina Holmlid i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page i — #5 i i i ABSTRACT Computational analysis of historical and typological data has made great progress in the last fifteen years. In this thesis, we work with vocabulary lists for ad- dressing some classical problems in historical linguistics such as discriminating related languages from unrelated languages, assigning possible dates to splits in a language family, employing structural similarity for language classification, and providing an internal structure to a language family. In this thesis, we compare the internal structure inferred from vocabulary lists with the family trees inferred given in Ethnologue. We also explore the ranking of lexical items in the widely used Swadesh word list and compare our ranking to another quantitative reranking method and short word lists com- posed for discovering long-distance genetic relationships. We also show that the choice of string similarity measures is important for internal classification and for discriminating related from unrelated languages. The dating system presented in this thesis can be used for assigning age estimates to any new language group and overcomes the criticism of constant rate of lexical change assumed by glottochronology. We also train and test a linear classifier based on gap-weighted subsequence features for the purpose of cognate identification. An important conclusion from these results is that n-gram approaches can be used for different historical linguistic purposes. i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page ii — #6 i i i i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page iii — #7 i i i SAMMANFATTNING i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page iv — #8 i i i i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page v — #9 i i i ACKNOWLEDGEMENTS I thank my supervisors, family, friends, collaborators, teachers, opponents, and colleagues for everything. This thesis would not have been possible without them. The paper presentations would not have been possible without the gener- ous funding of Graduate School in Language Technology, Centre for Language Technology, and Digital Areal Linguistics project. i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page vi — #10 i i i i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page vii — #11 i i i CONTENTS Abstract i Sammanfattning iii Acknowledgements v 1 Introduction 1 1.1 Computational historical linguistics . .1 1.1.1 Historical linguistics . .1 1.1.2 What is computational historical linguistics? . .2 1.2 Problems and contributions . .6 1.2.1 Problems and solutions . .6 1.2.2 Contributions . .8 1.3 Overview of the thesis . .9 I Background 13 2 Computational historical linguistics 15 2.1 Differences and diversity . 15 2.2 Language change . 19 2.2.1 Sound change . 19 2.2.2 Semantic change . 26 2.3 How do historical linguists classify languages? . 30 2.3.1 Ingredients in language classification . 30 2.3.2 The comparative method and reconstruction . 33 2.4 Complementary techniques in language classification . 43 2.4.1 Lexicostatistics . 43 2.4.2 Beyond lexicostatistics . 44 2.5 A language classification system . 45 2.6 Tree evaluation . 46 2.6.1 Tree comparison measures . 46 2.6.2 Beyond trees . 47 2.7 Dating and long-distance relationship . 49 i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page viii — #12 i i i viii Contents 2.8 Linguistic diversity and prehistory . 51 2.8.1 Diversity from a non-historical linguistics perspective . 51 2.8.2 Diversity from a historical linguistics perspective . 53 2.9 Genetics and linguistic prehistory . 54 2.9.1 Early studies and problems . 55 2.9.2 Genetic studies: A world-wide scenario . 55 2.10 Conclusion . 57 3 Databases 59 3.1 Cognate databases . 59 3.1.1 Dyen’s Indo-European database . 59 3.1.2 Ancient Indo-European database . 60 3.1.3 Intercontinental Dictionary Series (IDS) . 60 3.1.4 World loanword database . 61 3.1.5 List’s database . 61 3.1.6 Austronesian Basic Vocabulary Database (ABVD) . 61 3.2 Typological databases . 62 3.2.1 Syntactic Structures of the World’s Languages . 62 3.2.2 Jazyki Mira . 62 3.2.3 AUTOTYP . 62 3.3 Other comparative linguistic databases . 62 3.3.1 ODIN . 63 3.3.2 PHOIBLE . 63 3.3.3 World phonotactic database . 63 3.3.4 WOLEX . 63 3.4 Conclusion . 64 II Prolegomenon 65 4 Cognates, n-grams, and trees 67 4.1 Some questions . 67 4.2 Linguistic features for probably related languages . 68 4.3 Cognate identification and language classification . 69 4.3.1 N-grams for historical linguistics . 73 4.3.2 Language classification with cognate identification . 75 4.3.3 Language classification without cognate identification . 77 5 Linking time-depth to phonotactic diversity 81 6 Summary and future work 87 6.1 Summary . 87 i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page ix — #13 i i i Contents ix 6.2 Future work . 88 III Papers on application of computational techniques to vocabulary lists for automatic language classification 91 7 Phonological diversity, word length, and population sizes across languages: The ASJP evidence 93 7.1 Introduction . 93 7.2 SR as predictor of phonological inventory sizes . 94 7.3 SR and word length . 101 7.4 SR and population sizes . 107 7.5 SR and geography . 111 7.6 Discussion and conclusion . 113 8 Typological distances and language classification 117 8.1 Introduction . 117 8.2 Related Work . 118 8.3 Contributions . 120 8.4 Database . 120 8.4.1 WALS . 120 8.4.2 ASJP . 121 8.4.3 Binarization . 122 8.5 Measures . 122 8.5.1 Internal classification accuracy . 123 8.5.2 Lexical distance . 124 8.6 Results . 125 8.6.1 Internal classification . 126 8.6.2 Lexical divergence . 126 8.7 Conclusion . 126 9 N-gram approaches to the historical dynamics of basic vocabulary 129 9.1 Introduction . 129 9.2 Background and related work . 132 9.2.1 Item stability and Swadesh list design . 132 9.2.2 The ASJP database . 133 9.2.3 Earlier n-gram-based approaches . 134 9.3 Method . 135 9.4 Results and discussion . 138 9.5 Conclusions . 140 i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page x — #14 i i i x Contents 10 Phonotactic diversity and time depth of language families 141 10.1 Introduction . 141 10.1.1 Related work . 142 10.2 Materials and Methods . 144 10.2.1 ASJP Database . 144 10.2.2 ASJP calibration procedure . 145 10.2.3 Language group size and dates . 145 10.2.4 Calibration procedure . 148 10.2.5 N-grams and phonotactic diversity . 148 10.3 Results and Discussion . 149 10.3.1 Worldwide date predictions . 155 10.4 Conclusion . 157 11 Linguistic landscaping of South Asia using digital language re- sources: Genetic vs. areal linguistics 159 11.1 Introduction . 160 11.2 Towards a language resource from Grierson’s LSI . 162 11.3 Some preliminary experiments . 164 11.4 Discussion, conclusions and outlook . 166 12 Evaluation of similarity measures for automatic language classification 169 12.1 Introduction . 169 12.2 Related Work . 171 12.2.1 Cognate identification . 172 12.2.2 Distributional similarity measures . 174 12.3 Is LD the best string similarity measure for language classification? 175 12.4 Database and language classifications . 175 12.4.1 Database . 175 12.5 Similarity measures . 177 12.5.1 String similarity measures . 178 12.5.2 N-gram similarity . 180 12.6 Evaluation measures . 181 12.6.1 Distinctiveness measure (dist) . 182 12.6.2 Correlation with WALS . 182 12.6.3 Agreement with Ethnologue . 183 12.7 Results and discussion . 184 12.8 Conclusion . 187 13 Automatic cognate identification with gap-weighted string sub- sequences. 189 i i i i i “slut_seminar_thesis” — 2015/5/7 — 17:04 — page xi — #15 i i i Contents xi 13.1 Introduction . 189 13.2 Related work . 190 13.3 Cognate identification . 191 13.3.1 String.

Taraka Rama Studies in Computational Historical Linguistics

Mission: New Guinea]

The Languages of Melanesia: Quantifying the Level of Coverage

Wipi Grammar Essentials

Library of Congress Subject Headings for the Pacific Islands

Papua: the ‘Unknown’

A Linguistic Description of Lockhart River Creole

Indigenous and Minority Placenames

Library of Congress Subject Headings for the Pacific Islands

LCSH Section J

Torres Strait Islander People in Qld: a Brief Human Rights History

Advances in the Study of Siouan Languages and Linguistics

Language Ecology and Language Planning in Chiang Rai Province, Thailand