
Investigating the potential of ancestral state reconstruction algorithms in historical linguistics

Gerhard Jäger & Johann-Mattis List
Tübingen University & CRLAO / Team AIRE, Paris
Capturing Phylogenetic Algorithms for Linguistics, Leiden, October 28, 2015

Introduction: What is Ancestral State Reconstruction?

While tree-building methods seek branching diagrams that explain how a language family has evolved, ASR methods use those branching diagrams to explain what, concretely, has evolved.

Ancestral state reconstruction is very common in evolutionary biology but only sporadically practiced in computational historical linguistics (Bouchard-Côté et al. 2013).

In classical historical linguistics, on the other hand, the reconstruction of proto-forms and proto-meanings is very common and is one of the main goals of the classical comparative method (Fox 1995).

Introduction: ASR of Lexical Replacement Patterns

If we look at the words corresponding to one meaning in a wordlist and know which of them are cognate, we may ask which of the word forms was the most likely candidate to be used in the proto-language of all descendant languages.

This question resembles the task of "semantic reconstruction", but in contrast to classical semantic reconstruction we operate within a single concept slot only, disregarding all words with a different meaning that may also be cognate with the words in our sample.

As a result of this restriction, it is quite likely that we cannot recover the original form from our data.

It is nevertheless interesting to see to what degree we can propose a good candidate word form (cognate set) for the proto-language.
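To make the task concrete, here is a minimal sketch of one of the simplest ASR techniques, Fitch parsimony, applied to cognate classes within a single concept slot. This excerpt does not say which algorithms the talk evaluates, so Fitch parsimony is chosen purely for illustration. The tree shape and the language names L1 to L6 are hypothetical, and the function only handles strictly binary trees.

```python
# Minimal sketch of Fitch parsimony for the root state (illustration only;
# the tree and the language names L1..L6 are hypothetical, not from the talk).

def fitch_root_states(tree, leaf_states):
    """Bottom-up Fitch pass on a binary tree given as nested 2-tuples.

    Returns the set of cognate classes that are most parsimonious at the
    root of `tree`.
    """
    if isinstance(tree, str):            # a leaf: its observed cognate class
        return {leaf_states[tree]}
    left, right = (fitch_root_states(child, leaf_states) for child in tree)
    common = left & right
    # Intersection if the two children share a state, otherwise their union.
    return common if common else left | right


# Six doculects, one concept, cognate classes A, B and C at the leaves.
leaf_states = {"L1": "A", "L2": "A", "L3": "B", "L4": "C", "L5": "C", "L6": "B"}
tree = ((("L1", "L2"), "L3"), (("L4", "L5"), "L6"))

root_states = fitch_root_states(tree, leaf_states)
print(root_states)
```

On this toy tree the only most-parsimonious root state is class B, although B is attested in just two of the six leaves. A returned set with several members would mean that parsimony alone cannot decide between them.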
Introduction: ASR of Lexical Replacement Patterns

[Figure, built up in steps: a tree over six words for "head" (Kopf, kop, head, tête, testa, cap). Internal nodes are labelled *kop and testa, then *haubud- and caput; the root is labelled *kaput-. All nodes are glossed "head".]

Introduction: This talk

reconstruction of cognate class at the root

[Figure: a tree with leaf cognate classes A, A, B, C, C, B; the root is first shown as "?" and then as B.]

Materials and Methods: Materials: Data

IELex
  153 Indo-European doculects
  207 concepts
  entries for Proto-Indo-European for 135 concepts → used as gold standard
  arbitrarily split into training set and test set:
    training set: 67 concepts, 1127 cognate classes (83 occur in PIE)
    test set: 68 concepts, 957 cognate classes (79 from PIE)

ABVD
  743 Austronesian doculects → 100 were selected at random
  210 concepts; for 154 of them entries for Proto-Austronesian
  split into training set and test set:
    training set: 81 concepts, 1695 cognate classes (88 occur in PAn)
    test set: 74 concepts, 1584 cognate classes (79 occur in PAn)

Materials and Methods: Methods: Prerequisites: Trees

Trees
  trees were inferred with full data set (training + test data) via Bayesian inference
  IELex outgroup: Anatolian
  ABVD outgroup: Malayo-Polynesian
  random samples of 1000 trees from posterior distributions
  maximum clade credibility trees

[Figure: trees over the ABVD and IELex doculects, with scale bars 0.06 and 600.0; leaf labels omitted.]

Materials and Methods: Methods: Phylogenetic uncertainty

proper way to deal with it: work with posterior sample rather than with a single tree

poor man's method:
  remove all short branches (shorter than some threshold)
  do ASR with resulting multifurcating tree

[Figure: tree over the IELex doculects, with scale bar 100.0; leaf labels omitted.]

Materials and Methods: Methods: Coding

[Figure: the same tree shown under two codings. Multi-state: each node carries one of the cognate classes A, B, C. Binarized: one binary character per cognate class, with states A / non-A, B / non-B and C / non-C.]

Materials and Methods: Methods: Polymorphisms (a.k.a. synonyms)

problem for multistate coding

possible representations:
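The binarized coding from the Coding slide can be sketched as follows, assuming the data for one concept are available as a mapping from doculect to the set of cognate classes attested there. The names `binarize`, `wordlist` and L1 to L3 are illustrative and not taken from the talk, and this sketch is not necessarily one of the representations the slide goes on to list.

```python
# Minimal sketch: turn multi-state cognate data for one concept into
# binary presence/absence characters (one character per cognate class).
# All names and the toy data are hypothetical.

def binarize(wordlist):
    """Map {doculect: set of cognate classes} to {doculect: {class: 0 or 1}}."""
    classes = sorted(set().union(*wordlist.values()))
    return {
        doculect: {c: int(c in cognates) for c in classes}
        for doculect, cognates in wordlist.items()
    }


# L2 has two synonyms for the concept, belonging to classes A and B.
wordlist = {"L1": {"A"}, "L2": {"A", "B"}, "L3": {"C"}}

matrix = binarize(wordlist)
for doculect, row in matrix.items():
    print(doculect, row)
```

A doculect with synonyms simply receives a 1 in more than one character, which is why polymorphisms are awkward for multi-state coding but unproblematic for the binarized one.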