Statistical Inference for the Linguistic and Non-Linguistic Past
Igor Yanovich
DFG Center for Advanced Study “Words, Bones, Genes and Tools”, Universität Tübingen
July 6, 2017

Overview of the course

1. Today: trees, as a description and as a process
2. Classes 2-3: simple inference of language-family trees
3. Classes 4-5: computational statistical inference of trees and evolutionary parameters
4. Class 6: histories of languages and of genes
5. Class 7: simple spatial statistics
6. Class 8: regression taking into account linguistic relationships; synthesis of the course

Learning outcomes

By the end of the course, you should be able to:
1. read and engage with the current literature in linguistic phylogenetics and in spatial statistics for linguistics
2. run phylogenetic and basic spatial analyses on linguistic data
3. proceed further in the subject matter on your own, towards further advances in the field

Today’s class

1. Overview of the course
2. Language families and their structures
3. Trees as classifications and as process depictions
4. Linguistic phylogenetics
5. Worries about phylogenetics in linguistics vs. biology
6. Quick overview of the homework
7. Summary of Class 1

Language families and their structures

Dravidian

(Map: a modification of [Krishnamurti, 2003, Map 1.1], from Wikipedia.)

Dravidian

[Krishnamurti, 2003]’s classification:

Dravidian family
  South Dravidian
    SD I
      Tamil (70M)
      Malayāḷam (38M)
      Kannaḍa (40M)
    SD II
      Telugu (75M)
  Central Dravidian
    Kolami (0.1M)
  North Dravidian
    Brahui (4M)

(Numbers of speakers from Wikipedia)

Dravidian: similarities and differences

Proto-Dr. | Tamil | Malayāḷam | Kannada | Telugu | Kolami | Brahui
*yāṭu ‘goat, sheep’ | yāṭu, āṭu | āṭu | āḍu | ēṭa ‘ram’ | — | hēṭ
*nīr ‘water’ | nīr | nīr | nīr, nīru | nīru | īr | dīr
*kay ‘hand’ | kai | kai, kayyi | kayi, kayyi, key | cēyi | key | —

From [Krishnamurti, 2003] and [Burrow and Emeneau, 1984], http://dsal.uchicago.edu/dictionaries/burrow/

(Figure from [Krishnamurti, 2003].)

Dravidian: similarities and differences

(Figure from [Krishnamurti, 2003].)

Trees as classifications and as process depictions

Two interpretations of the tree: (i) classification

Tree as classification: captures the similarity between different languages within the family.
The closer A and B are in the tree, the more similar they are.

(Same classification tree as above.)

A bit of tree terminology: daughter, mother, sister; clade; root; internal node, leaf/tip/terminal node; polytomy, binary branching

Quick exercise: closeness in a tree?

Exercise 1: how to define closeness in a tree reasonably?

(Same classification tree as above.)

Two interpretations of the tree: (ii) depiction of the history

Proto-Dravidian
  Proto-South Dr.
    Proto-SD I
      Tamil (70M)
      Malayāḷam (38M)
      Kannaḍa (40M)
    Proto-SD II
      Telugu (75M)
  Proto-Central Dr.
    Kolami (0.1M)
  Proto-North Dr.
    Brahui (4M)

(Axis label: time)

Presuppositions of the “history tree”

At the root, we have a single language.
It develops as time goes by, and sometimes divides into several.
How to connect the tree and innovations?
- Perspective 1: nodes are distinct languages ⇒ once many new features appear, we have many languages instead of one
- Perspective 2: nodes are linguistic communities ⇒ communities get isolated, and gradually accumulate different innovations

The comparative method: mapping innovations to a tree

1. Assume that the modern diversity of a family is created by a tree-like process of community splitting
2. Identify regular correspondences between identical-by-descent items:
   - roots
   - morphemes
   - phonemes
3. A constraints problem: find a tree such that subsequent changes starting from the common state at the root can result in the modern languages

⇒ the result is a labelled tree where changes are associated with specific branches

The comparative method: mapping innovations to a tree

Selected changes reconstructed by [Krishnamurti, 2003]:

[Figure: the history tree above (Proto-Dravidian; Proto-South Dr., Proto-Central Dr., Proto-North Dr.; Proto-SD I, Proto-SD II; Tamil, Malayāḷam, Kannaḍa, Telugu, Kolami, Brahui; axis label: time) with changes marked on branches. Branch labels: “*e → *i*u”; “novel ‘she’”; “*n- → *∅-”.]

The comparative method: mapping innovations to a tree

SouthD
  SD I
    TaMa
      Tamil
      Malayāḷam
    Kannaḍa
  SD II
    Telugu
    Gondi

[Branch labels in the figure: “*c- → ∅-”; “apical loss, etc.”]
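
The idea of a labelled tree whose branches carry innovations can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the tree shape follows the [Krishnamurti, 2003] classification given on the slides, language names are spelled without diacritics, and the innovation labels (`innovation_A`, `innovation_B`) and their placement on branches are hypothetical placeholders, not reconstructed changes. It computes which modern languages inherit each innovation, and checks whether a set of languages sharing a feature forms a clade.

```python
# Hypothetical sketch: a family tree as nested dicts {node name: children}.
# Tree shape follows the classification on the slides; innovations are invented.
TREE = {
    "Dravidian": {
        "South Dravidian": {
            "SD I": {"Tamil": {}, "Malayalam": {}, "Kannada": {}},
            "SD II": {"Telugu": {}},
        },
        "Central Dravidian": {"Kolami": {}},
        "North Dravidian": {"Brahui": {}},
    }
}

# Innovations keyed by the node at the lower end of the branch on which
# they are assumed to have happened (placeholder labels, not real changes).
INNOVATIONS = {
    "South Dravidian": ["innovation_A"],
    "SD I": ["innovation_B"],
}


def leaves(subtree):
    """Return the set of leaf names under a {name: children} mapping."""
    result = set()
    for name, children in subtree.items():
        result |= leaves(children) if children else {name}
    return result


def inherited(subtree, carried=()):
    """Map each leaf to the innovations on the path from the root to it."""
    out = {}
    for name, children in subtree.items():
        here = list(carried) + INNOVATIONS.get(name, [])
        if children:
            out.update(inherited(children, here))
        else:
            out[name] = here
    return out


def clades(subtree):
    """Yield the leaf set of every node in the tree; each one is a clade."""
    for name, children in subtree.items():
        yield leaves({name: children})
        yield from clades(children)


def is_clade(languages):
    """True if exactly these languages descend from some single node."""
    return set(languages) in list(clades(TREE))


if __name__ == "__main__":
    for language, changes in inherited(TREE).items():
        print(language, changes)
    print(is_clade({"Tamil", "Malayalam", "Kannada"}))
    print(is_clade({"Kannada", "Telugu"}))
```

Under these assumptions, a feature shared by exactly Tamil, Malayalam and Kannada can be placed on the single branch leading to SD I, whereas a feature shared by exactly Kannada and Telugu matches no clade and cannot be explained by one innovation on one branch of this tree.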

Complications: lateral transfer

[Diagram of descent and lateral transfer. Node labels:]
- PGermanic
- Old Norse: they; ?3sg -s
- Old English: hī; 3sg -th
- Northern Middle English: they; 3sg -s
- Southern Middle English: hī; 3sg -th
- Present-Day English: they; 3sg -s

Innovations on a tree

In principle, the tree model assumes that innovations happen on specific branches of the tree.
In practice, this is unrealistic:
- Some innovations span sets of dialects/languages not forming a clade
- There exists secondary lateral transfer

[Bammesberger, 1992]: “Since linguistic subgrouping can be carried out only on the basis of shared innovations, some of the traits which are peculiarly characteristic of Germanic and set Germanic off from all the related languages must be listed here. It is probably true to say that none of these characteristics is limited to Germanic; but the sum total of the traits to be mentioned is peculiar to Germanic.”

Exercise

Unrealistic position: languages develop in a tree-like process, and innovations always apply to clades.

Exercise 2: formulate two reasonable alternative positions on how languages develop through time (3-4 sentences each)
- Tree-with-exceptions position:
  The challenge: explain the exceptions to the tree-like pattern
- Something-other-than-a-tree position:
  The challenge: explain why a large part of innovation patterns is often consistent with the simple-tree model

Linguistic phylogenetics

Linguistic phylogenetics

Practical compromise: work with the tree model, but with the understanding of its limitations.

Linguistic phylogenetics: the study of genetic (i.e. descent-based) relationships between languages

By-hand phylogenetics:
- uses the classical comparative method to identify innovations
- employs assumptions about dialect contact to explain deviations from the tree structure
- aims to rely on the complete set of available data: phonology, morphology, lexis, syntax, semantics

Computational phylogenetics:
- employs methods from computational biology (sometimes modified to better fit linguistic data)
- relies on classical historical linguistics for gold standard phylogenies and often identity-by-descent categorization
- is based in practice on circumscribed datasets

By-hand vs. computational phylogenetics

Humans:
- are good at incorporating different types of evidence
- are good at constructing scientific, logical proofs
- are bad at processing many similar datapoints (because boring and resource-intensive)
- are bad at correctly estimating probabilities

Computers:
- are good at processing very, very many similar datapoints
- are good at estimating probabilities given a set of assumptions
- cannot by themselves incorporate different types of evidence
- cannot interpret the numerical results they produce

By-hand vs. computational phylogenetics

[Krishnamurti, 2003]: by-hand analysis of the Dravidian family
- painstaking linguistic-reconstruction work
- weighing information from many types of data
- explicitly noting deviations from the strict tree model (e.g. Telugu)

A simple computational analysis of the Dravidian family:
1. compile a dataset coding the presence of cognate lexemes (= lexemes identical by descent) in 18 Dravidian languages. (The data are by [Rama, 2017], published by him on GitHub.)
2. run computational phylogenetic algorithms on these data
3. interpret the results

Computational phylogenetics: example

[Figure: tree based on Maximum Parsimony on [Rama, 2017]’s data]

Computational