Taraka Rama Vocabulary Lists in Computational Historical Linguistics

i “mylic_thesis” — 2013/12/19 — 20:14 — page 1 — #1 i i i Taraka Rama Vocabulary lists in computational historical linguistics i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page 2 — #2 i i i Data linguistica <http://www.svenska.gu.se/publikationer/data-linguistica/> Editor: Lars Borin Språkbanken Department of Swedish University of Gothenburg 25 • 2014 i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page 3 — #3 i i i Taraka Rama Vocabulary lists in computational historical linguistics Gothenburg 2014 i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page 4 — #4 i i i Data linguistica 25 ISBN 978-91-87850-52-3 ISSN 0347-948X Printed in Sweden by Reprocentralen, Campusservice Lorensberg, University of Gothenburg 2014 Typeset in LATEX 2e by the author Cover design by Kjell Edgren, Informat.se Front cover illustration: A network representation of relations between Dravidian languages. by Rama and Kolachina (2013) c Author photo on back cover by Kristina Holmlid i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page i — #5 i i i ABSTRACT Computational analysis of historical and typological data has made great progr- ess in the last fifteen years. In this thesis, we work with vocabulary lists for addressing some classical problems in historical linguistics such as discrimi- nating related languages from unrelated languages, assigning possible dates to splits in a language family, employing structural similarity for language classification, and providing an internal structure to a language family. In this thesis, we compare the internal structure inferred from vocabulary lists to the family tree structure inferred through the comparative method. We also explore the ranking of lexical items in the widely used Swadesh word list and compare our ranking to another quantitative reranking method and short lists composed for discovering long-distance genetic relationships. We also show that the choice of string similarity measures is important for internal classification and for dis- criminating related from unrelated languages. The dating system presented in this thesis can be used for assigning age estimates to any new language group and overcomes the criticism of constant rate of lexical change assumed by glottochronology. An important conclusion from these results is that n-gram approaches can be used for different historical linguistic purposes. The field is undergoing a shift from – the application of computational methods to – short, hand-crafted vocabulary lists to automatically extracted word lists from corpora. Thus, we also experiment with parallel corpora for automatically ex- tracting cognates to infer a family tree from the cognates. i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page ii — #6 i i i i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page iii — #7 i i i SAMMANFATTNING Datorbaserad analys av historiska data har gjort stora framsteg under det sen- aste decenniet. I denna licentiatuppsats använder vi ordlistor för att ta oss an några klassiska problem inom historisk lingvistik. Exempel på sådana problem är hur man avgör vilka språk som är släkt med varandra och vilka som inte är det, hur man tidsbestämmer rekonstruerade urspråk, hur man klassificerar språk på grundval av strukturella likheter och skillnader, samt hur man sluter sig till den interna strukturen i en språkfamilj (dess ‘familjeträd’). I uppsatsen jämför vi metoder som använts för att postulera språkliga fa- miljeträd. Specifikt jämför vi ordlistebaserade metoder med den traditionella komparativa metoden, som även använder andra språkliga drag för jämförelsen. Med fokus på ordlistor jämför vi den ofta använda Swadesh-listan med al- ternativa listor föreslagna i litteraturen eller framtagna i vår egen forskning, med avseende på deras användbarhet för att angripa de nämnda historisk- lingvistiska problemen. Vi visar också i experiment att valet av likhetsmått är mycket betydelsefullt när strängjämförelser används för att bestämma den interna strukturen i en språkfamilj eller för att skilja besläktade och obesläktade språk åt. Ett viktigt resultat av dessa experiment är att n-grambaserade metoder lämpar sig mycket väl för flera olika språkhistoriska ändamål. De metoder för språklig datering som presenteras här kan användas för att tidsbestämma nya språkfamiljer, dock utan att vara beroende av antagandet att förändringen av ett språks basordförråd är konstant över tid, ett hårt kritiserat antagande som ligger till grund för glottokronologin som den ursprungligen formulerades. Metodologiskt har man inom området nu börjat utforska möjligheten att övergå från att arbeta med korta, på förhand givna ordlistor till att tillämpa språkteknologiska metoder på stora språkliga material, t ex hela (traditionella) lexikon eller strukturerat språkligt material som extraheras ur flerspråkiga korpusar. I uppsatsen utforskas användningen av parallella korpusar för att auto- matiskt finna ord med ett gemensamt ursprung (kognater) och därefter härleda ett språkligt familjeträd från kognatlistorna. i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page iv — #8 i i i i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page v — #9 i i i ACKNOWLEDGEMENTS I thank my supervisor, Lars Borin for taking me in at the beginning of my stay in Sweden and also providing me with a place to live. We had interest- ing discussions on historical linguistics in the meetings. I thank my assistant supervisor, Søren Wichmann for inviting me to Leipzig and for all the inter- esting discussions on historical linguistics. He always read any piece of work I wrote and helped me improve it. I thank my assistant supervisor, Markus Fors- berg for all the support. I thank all my collaborators Eric Holman, Prasanth Kolachina, and Sudheer Kolachina for all the critic on the work done and pub- lished during my stay in Sweden. I thank all the colleagues (past and present) at Språkbanken and Department of Swedish for the nice environment. The pa- per presentations would not have been possible without the generous funding of Graduate School in Language Technology, Centre for Language Technol- ogy, and Digital Areal Linguistics project. I thank all the colleagues (past and present) in these places for the environment. I thank Harald Hammarström for comments on my papers. I thank Anju Saxena for all the support. I also thank the colleagues at MPI-EVA, Leipzig and Alpha-Informatica, Groningen for the nice environment. Finally, this work would not have been possible without the support of my family and friends across the world. I thank my family and friends for all the support through my stay in Sweden. i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page vi — #10 i i i i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page vii — #11 i i i CONTENTS Abstract i Sammanfattning iii Acknowledgements v I Introduction to the thesis 1 1 Introduction 3 1.1 Computational historical linguistics . .3 1.1.1 Historical linguistics . .4 1.1.2 What is computational historical linguistics? . .4 1.2 Questions, answers, and contributions . .9 1.3 Overview of the thesis . 11 2 Computational historical linguistics 15 2.1 Differences and diversity . 15 2.2 Language change . 19 2.2.1 Sound change . 19 2.2.2 Semantic change . 27 2.3 How do historical linguists classify languages? . 31 2.3.1 Ingredients in language classification . 31 2.3.2 The comparative method and reconstruction . 34 2.4 Alternative techniques in language classification . 44 2.4.1 Lexicostatistics . 44 2.4.2 Beyond lexicostatistics . 45 2.5 A language classification system . 47 2.6 Tree evaluation . 47 2.6.1 Tree comparison measures . 48 2.6.2 Beyond trees . 49 2.7 Dating and long-distance relationship . 50 2.7.1 Non-quantitative methods in linguistic paleontology . 52 2.8 Conclusion . 54 i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page viii — #12 i i i viii Contents 3 Databases 55 3.1 Cognate databases . 55 3.1.1 Dyen’s Indo-European database . 56 3.1.2 Ancient Indo-European database . 56 3.1.3 Intercontinental Dictionary Series (IDS) . 56 3.1.4 World loanword database . 57 3.1.5 List’s database . 57 3.1.6 Austronesian Basic Vocabulary Database (ABVD) . 58 3.2 Typological databases . 58 3.2.1 Syntactic Structures of the World’s Languages . 58 3.2.2 Jazyki Mira . 58 3.2.3 AUTOTYP . 58 3.3 Other comparative linguistic databases . 59 3.3.1 ODIN . 59 3.3.2 PHOIBLE . 59 3.3.3 World phonotactic database . 59 3.3.4 WOLEX . 59 3.4 Conclusion . 60 4 Summary and future work 61 4.1 Summary . 61 4.2 Future work . 62 II Papers on application of computational techniques to vocabulary lists for automatic language classification 65 5 Estimating language relationships from a parallel corpus 67 5.1 Introduction . 67 5.2 Related work . 69 5.3 Our approach . 70 5.4 Dataset . 71 5.5 Experiments . 73 5.6 Results and discussion . 74 5.7 Conclusions and future work . 76 6 N-gram approaches to the historical dynamics of basic vocabulary 79 6.1 Introduction . 79 6.2 Background and related work . 82 6.2.1 Item stability and Swadesh list design . 82 i i i i i “mylic_thesis” — 2013/12/19 — 20:14 — page ix — #13 i i i Contents ix 6.2.2 The ASJP database . 83 6.2.3 Earlier n-gram-based approaches . 84 6.3 Method . 85 6.4 Results and discussion . 87 6.5 Conclusions . 89 7 Typological distances and language classification 91 7.1 Introduction . 91 7.2 Related Work . 92 7.3 Contributions . 94 7.4 Database . 94 7.4.1 WALS . 94 7.4.2 ASJP . 95 7.4.3 Binarization . 95 7.5 Measures . 96 7.5.1 Internal classification accuracy .

Taraka Rama Vocabulary Lists in Computational Historical Linguistics

PAPERS in NEW GUINEA LINGUISTICS No. 18

Mission: New Guinea]

The Languages of Melanesia: Quantifying the Level of Coverage

Information to Users

Wipi Grammar Essentials

Languages of the World--Indo-Pacific

Computing a World Tree of Languages from Word Lists

Demonstratives in Discourse

Geographical Origins of Language Structures∗

Library of Congress Subject Headings for the Pacific Islands

Highly Complex Syllable Structure

The Ndu Language Family (Sepik District, New Guinea)