
Statistical inference for the linguistic and non-linguistic past

Igor Yanovich

DFG Center for Advanced Study “Words, Bones, Genes and Tools” Universität Tübingen

July 6, 2017

Overview of the course

1 Today: trees, as a description and as a process

2 Classes 2-3: simple inference of language-family trees

3 Classes 4-5: computational statistical inference of trees and evolutionary parameters

4 Class 6: histories of languages and of genes

5 Class 7: simple spatial statistics

6 Class 8: regression taking into account linguistic relationships; synthesis of the course

Learning outcomes

By the end of the course, you should be able to:

1 read and engage with the current literature in linguistic phylogenetics and in spatial statistics for linguistics

2 run phylogenetic and basic spatial analyses on linguistic data

3 proceed further in the subject matter on your own, towards further advances in the field

Today’s class

1 Overview of the course

2 Language families and their structures

3 Trees as classifications and as process depictions

4 Linguistic phylogenetics

5 Worries about phylogenetics in linguistics vs. biology

6 Quick overview of the homework

7 Summary of Class 1

Language families and their structures

Dravidian

[Map] A modification of [Krishnamurti, 2003, Map 1.1], from Wikipedia

Dravidian

[Krishnamurti, 2003]’s classification:

Dravidian family
  South Dravidian
    SD I
      Tamil (70M)
      Malayāḷam (38M)
      Kannaḍa (40M)
    SD II
      Telugu (75M)
  Central Dravidian
    Kolami (0.1M)
  North Dravidian
    Brahui (4M)

(Numbers of speakers from Wikipedia)

Dravidian: similarities and differences

Proto-Dr. | Tamil | Malayāḷam | Kannada | Telugu | Kolami | Brahui
*yāṭu ‘goat, sheep’ | yāṭu, āṭu | āṭu | āḍu | ēṭa ‘ram’ | — | hēṭ
*nīr ‘water’ | nīr | nīr | nīr, nīru | nīru | īr | dīr
*kay ‘hand’ | kai | kai, kayyi | kayi, kayyi, key | cēyi | key | —

From [Krishnamurti, 2003] and [Burrow and Emeneau, 1984] http://dsal.uchicago.edu/dictionaries/burrow/

From [Krishnamurti, 2003]

Dravidian: similarities and differences

From [Krishnamurti, 2003]

Trees as classifications and as process depictions

Two interpretations of the tree: (i) classification

Tree as classification: captures the similarity between different languages within the family

The closer A and B are in the tree, the more similar they are

Dravidian family
  South Dravidian
    SD I
      Tamil (70M)
      Malayāḷam (38M)
      Kannaḍa (40M)
    SD II
      Telugu (75M)
  Central Dravidian
    Kolami (0.1M)
  North Dravidian
    Brahui (4M)

A bit of tree terminology: daughter, mother, sister; ; root; internal node, leaf/tip/terminal node; polytomy, binary branching
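The terminology can be made concrete with a small sketch in Python. The encoding is an assumption made for illustration (each node is a `(label, children)` pair), and the grouping of SD I as Tamil, Malayalam and Kannada, with Telugu under SD II, is read off the classification tree above; diacritics are dropped in the labels.

```python
# Toy encoding of the Dravidian classification: a node is a
# (label, children) pair; a tip has an empty list of children.
TREE = ("Dravidian", [
    ("South Dravidian", [
        ("SD I", [("Tamil", []), ("Malayalam", []), ("Kannada", [])]),
        ("SD II", [("Telugu", [])]),
    ]),
    ("Central Dravidian", [("Kolami", [])]),
    ("North Dravidian", [("Brahui", [])]),
])


def tips(node):
    """Return the labels of all tips (leaves, terminal nodes) below a node."""
    label, children = node
    if not children:
        return [label]
    return [tip for child in children for tip in tips(child)]


def internal_nodes(node):
    """Return the labels of all internal nodes, the root included."""
    label, children = node
    if not children:
        return []
    return [label] + [n for child in children for n in internal_nodes(child)]


def polytomies(node):
    """Return the labels of nodes with more than two daughters."""
    label, children = node
    found = [label] if len(children) > 2 else []
    return found + [p for child in children for p in polytomies(child)]


print(tips(TREE))            # the six modern languages
print(internal_nodes(TREE))  # root and the five subgroup nodes
print(polytomies(TREE))      # nodes that are not binary-branching
```

In this encoding the root and SD I are polytomies, because each has three daughters.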

Quick exercise: closeness in a tree?

Exercise 1: how to define closeness in a tree reasonably?

Dravidian family
  South Dravidian
    SD I
      Tamil (70M)
      Malayāḷam (38M)
      Kannaḍa (40M)
    SD II
      Telugu (75M)
  Central Dravidian
    Kolami (0.1M)
  North Dravidian
    Brahui (4M)

Two interpretations of the tree: (ii) depiction of the history

Proto-Dravidian
  Proto-South Dr.
    Proto-SD I
      Tamil (70M)
      Malayāḷam (38M)
      Kannaḍa (40M)
    Proto-SD II
      Telugu (75M)
  Proto-Central Dr.
    Kolami (0.1M)
  Proto-North Dr.
    Brahui (4M)

[Axis label: time]

Presuppositions of the “history tree”

At the root, we have a single language

It develops as time goes by, and sometimes divides into several

How to connect the tree and innovations?

Perspective 1: nodes are distinct languages ⇒ once many new features appear, we have many languages instead of one

Perspective 2: nodes are linguistic communities ⇒ communities get isolated, and gradually accumulate different innovations

The comparative method: mapping innovations to a tree

1 Assume that the modern diversity of a family is created by a tree-like process of community splitting

2 Identify regular correspondences between identical-by-descent items: roots, morphemes, phonemes

3 A constraint-satisfaction problem: find a tree such that successive changes, starting from the common state at the root, can result in the modern languages
  ⇒ the result is a labelled tree where changes are associated with specific branches
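One part of step 3 can be sketched in Python: given a candidate tree, an innovation can be placed on a single branch only if the languages showing it form exactly one clade. The tree encoding, the helper names and the example sets of languages are assumptions made for illustration; the innovations are hypothetical, and the subgrouping follows the South Dravidian tree from [Krishnamurti, 2003] used in this class (diacritics dropped).

```python
# A node is a (label, children) pair; a tip has no children.
SOUTH_DRAVIDIAN = ("SouthD", [
    ("SD I", [
        ("TaMa", [("Tamil", []), ("Malayalam", [])]),
        ("Kannada", []),
    ]),
    ("SD II", [("Telugu", []), ("Gondi", [])]),
])


def clade_tips(node):
    """Return the set of tips dominated by a node."""
    label, children = node
    if not children:
        return frozenset([label])
    return frozenset().union(*(clade_tips(child) for child in children))


def all_clades(node):
    """Map every node label to the set of tips it dominates."""
    label, children = node
    result = {label: clade_tips(node)}
    for child in children:
        result.update(all_clades(child))
    return result


def branch_for(tree, attested_in):
    """Return the node whose clade is exactly the set of languages showing
    the innovation, or None if those languages do not form a clade."""
    target = frozenset(attested_in)
    for label, members in all_clades(tree).items():
        if members == target:
            return label
    return None


# Hypothetical innovation shared by Tamil, Malayalam and Kannada:
print(branch_for(SOUTH_DRAVIDIAN, {"Tamil", "Malayalam", "Kannada"}))  # SD I
# Hypothetical innovation shared by Kannada and Telugu only:
print(branch_for(SOUTH_DRAVIDIAN, {"Kannada", "Telugu"}))              # None
```

A `None` result means the innovation cannot be placed on one branch of this tree, so either the tree or the single-innovation assumption needs revisiting.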

The comparative method: mapping innovations to a tree

Selected changes reconstructed by [Krishnamurti, 2003]:

Proto-Dravidian
  Proto-South Dr.
    Proto-SD I
      Tamil (70M)
      Malayāḷam (38M)
      Kannaḍa (40M)
    Proto-SD II
      Telugu (75M)
  Proto-Central Dr.
    Kolami (0.1M)
  Proto-North Dr.
    Brahui (4M)

[Axis label: time. Change labels attached to branches in the figure: *e → *i*u; novel ‘she’; *n- → *∅-]

The comparative method: mapping innovations to a tree

SouthD
  SD I
    TaMa
      Tamil
      Malayāḷam
    Kannaḍa
  SD II
    Telugu
    Gondi

*c- → ∅-, apical loss, etc.

Complications: lateral transfer

[Diagram: node labels and features]

PGermanic
Old Norse: they; ?3sg -s
Old English: hī; 3sg -th
Northern Middle English: they; 3sg -s
Southern Middle English: hī; 3sg -th
Present-Day English: they; 3sg -s

Innovations on a tree

In principle, the tree model assumes that innovations happen on specific branches of the tree.

In practice, this is unrealistic:
- Some innovations span sets of dialects/languages not forming a clade
- There exists secondary lateral transfer

[Bammesberger, 1992]: “Since linguistic subgrouping can be carried out only on the basis of shared innovations, some of the traits which are peculiarly characteristic of Germanic and set Germanic off from all the related languages must be listed here. It is probably true to say that none of these characteristics is limited to Germanic; but the sum total of the traits to be mentioned is peculiar to Germanic.”

Exercise

Unrealistic position: languages develop in a tree-like process, and innovations always apply to whole clades.

Exercise 2: formulate two reasonable alternative positions on how languages develop through time (3-4 sentences each)

Tree-with-exceptions position:
  The challenge: explain the exceptions to the tree-like pattern

Something-other-than-a-tree position:
  The challenge: explain why a large part of innovation patterns is often consistent with the simple-tree model

Linguistic phylogenetics

Practical compromise: work with the tree model, but with the understanding of its limitations.

Linguistic phylogenetics: the study of genetic (i.e. descent-based) relationships between languages

By-hand phylogenetics:
- uses the classical comparative method to identify innovations
- employs assumptions about dialect contact to explain deviations from the tree structure
- aims to rely on the complete set of available data: phonology, morphology, lexis, syntax, semantics

Computational phylogenetics:
- employs methods from computational biology (sometimes modified to better fit linguistic data)
- relies on classical historical linguistics for gold-standard phylogenies and often for identity-by-descent categorization
- is based in practice on circumscribed datasets

By-hand vs. computational phylogenetics

Humans:
- are good at incorporating different types of evidence
- are good at constructing scientific, logical proofs
- are bad at processing many similar datapoints (because boring and resource-intensive)
- are bad at correctly estimating probabilities

Computers:
- are good at processing very, very many similar datapoints
- are good at estimating probabilities given a set of assumptions
- cannot by themselves incorporate different types of evidence
- cannot interpret the numerical results they produce

By-hand vs. computational phylogenetics

[Krishnamurti, 2003]: by-hand analysis of the Dravidian family
- painstaking linguistic-reconstruction work
- weighing information from many types of data
- explicitly noting deviations from the strict tree model (e.g. Telugu)

A simple computational analysis of the Dravidian family:

1 compile a dataset coding the presence of cognate lexemes (= lexemes identical by descent) in 18 Dravidian languages. (The data are by [Rama, 2017], published by him on GitHub.)

2 run computational phylogenetic algorithms on these data

3 interpret the results
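To give a feel for step 2, here is a minimal Python sketch of one classical criterion, maximum parsimony, using Fitch's algorithm to count the minimum number of changes a cognate-presence character needs on a fixed binary tree. Everything in it is an illustrative assumption: the four languages A to D and their 0/1 codings are invented toy data, not [Rama, 2017]'s dataset, and a real analysis would also search over many candidate trees rather than compare two.

```python
def fitch(tree, states):
    """Return (possible states at this node, minimum changes below it).

    A tree is either a tip name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The daughters can agree: no extra change is needed here.
        return common, left_cost + right_cost
    # The daughters disagree: one change on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, characters):
    """Sum the Fitch change counts over all characters."""
    return sum(fitch(tree, states)[1] for states in characters)


# Invented toy data: 1 = cognate class present, 0 = absent.
CHARACTERS = [
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 1, "D": 0},
    {"A": 1, "B": 0, "C": 0, "D": 0},
]

tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(parsimony_score(tree_1, CHARACTERS))  # 3
print(parsimony_score(tree_2, CHARACTERS))  # 4
```

On these toy data the first tree needs fewer changes, so parsimony prefers it. Interpreting such a preference is step 3.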

Computational phylogenetics: example

Tree based on Maximum Parsimony on [Rama, 2017]’s data:

Computational phylogenetics

Strong points:
- allows us to derive correct probabilistic inferences given our assumptions
- can easily process vast amounts of data
- requires us to use explicit assumptions about language change

Challenges:
- easy to misinterpret the results
- preparing the data for analysis is a challenge in itself
- relative lack of certain knowledge about the past hinders validation

⇒ at least some of the challenges can be met by understanding the internal logic of the computational-phylogenetic methods (our goal in this class!)

Worries about phylogenetics in linguistics vs. biology

Worries about statistical phylogenetics in linguistics

Is it even appropriate to use statistics for language-family-tree inference?

Some common worries:
- The underlying mechanisms of language change are still poorly understood, in contrast to DNA in biology
- Rates of language change are not constant
- Linguistic innovations do not always align with clades
- Language change often features unique events

In fact, the same problems arise in genetics as well. Computational biologists and statisticians definitely consider them challenges, but do not think we should give up the whole enterprise.

Biological mutation is also a mess

Base-pair mutation (between single DNA letters A, C, G, T):
- Synonymous (there are 64 three-letter sequences for 20 amino acids)
- Non-synonymous neutral mutations (without selective consequences)
- Non-synonymous non-neutral mutations
Pretty trivially, observed non-neutral mutations will have a different rate than neutral ones.

In humans, we have at least the following (review: [Ségurel et al., 2014]):
- Local scale: (i) mutations in -CG- are much more common; (ii) G→A and C→T are twice as common as the reverse; (iii) G↔A (two-ring molecules) and C↔T (one-ring) are more common than the rest.
- Intermediate: a number of poorly understood effects (e.g., more mutations in repetitive chunks of DNA)
- Global: differences based on the timing in the replication process (beginning vs. end); rate differences between chromosomes

Biological mutation is also a mess

But single-letter mutations are not the only ones!
- Single nucleotide: on the order of 10^-8 per site per generation in humans
- Insertions, deletions, insertions combined with deletions: slightly slower, but same order in humans as nuclear [Sjödin et al., 2010]; however, three orders (!!!) of variation across the tree of life [Sung et al., 2016]
- Microsatellite mutations: 10^-3 to 10^-4 in humans [Sun et al., 2012]
- Mitochondrial DNA mutations: on average about two orders faster than nuclear DNA, but there are 100-fold (!!) rate differences between mammals for neutral mutations [Nabholz et al., 2008]

[Baer et al., 2007]: “Thus, despite a century of investigation and much theoretical progress, our understanding of mutational variation at the empirical level — the extent to which mutation rates vary within and between species, the forces responsible for mutational variation, relationships between µ and U, the influence of mutation rate variation on the rate of — remains primitive, especially for multicellular .”

Biological mutation is also a mess

We might not understand the empirical differences in mutation rates.

But we can explicitly account for the rate uncertainty while doing phylogenetic inference.

Practical solutions for biological phylogenetics:
- Allowing different sites in the genome to evolve at different rates
- Allowing the “base mutation rate” to change throughout the phylogeny, varying in different lineages

In linguistic terms, this means we can allow change-rate variation between features as well as between different language clades. We will learn how to apply this in practice in Class 3.
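As a rough illustration of what "allowing rate variation between features" can mean, the Python sketch below averages the likelihood of one observation over several rate categories. The model is an assumption chosen for simplicity, not necessarily the one used later in the course: a binary feature that flips between its two states at the same rate in both directions. The rates, weights and branch length are made-up numbers.

```python
import math


def p_change(rate, branch_length):
    """Probability that a binary feature differs at the two ends of a branch,
    under a symmetric two-state Markov model with the given flip rate."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate * branch_length))


def likelihood(changed, branch_length, rates, weights):
    """Likelihood of one observation (feature changed or not), averaged over
    rate categories because the feature's true rate is unknown."""
    total = 0.0
    for rate, weight in zip(rates, weights):
        p = p_change(rate, branch_length)
        total += weight * (p if changed else 1.0 - p)
    return total


# Hypothetical setup: half of the features are "slow", half are "fast".
RATES = [0.1, 2.0]
WEIGHTS = [0.5, 0.5]

print(likelihood(changed=True, branch_length=1.0, rates=RATES, weights=WEIGHTS))
print(likelihood(changed=False, branch_length=1.0, rates=RATES, weights=WEIGHTS))
```

Variation between clades can be sketched in the same way, by multiplying `rate` by a clade-specific factor before calling `p_change`.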

Innovations do not always align with clades

Discordance between species and gene trees is actually expected

Species tree: the true tree representing the evolutionary process on the population level
Gene tree: the tree representing the coalescence history of a particular character (e.g. a site in the genome)

There are three main reasons for that:
- Genetic variation within populations
- Multiple instances of the same mutation
- Lateral gene transfer

Innovations do not always align with clades

Genetic variation leads to so-called incomplete lineage sorting:

The picture is from a good popular explanation of the phenomenon by Dennis Venema at the BioLogos blog: part 1, part 2
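How often incomplete lineage sorting produces a discordant gene tree can be sketched for the simplest case of three species. The Python example below assumes the standard neutral coalescent with one sampled lineage per species, and measures the internal branch of the species tree in coalescent units; under those assumptions the probability of a discordant gene tree is (2/3)·e^(-T). The function names and the simulation are illustrative, not taken from the sources cited in this class.

```python
import math
import random


def p_discordant(internal_branch):
    """Probability that a gene tree disagrees with a three-species tree,
    given the internal branch length in coalescent units."""
    return (2.0 / 3.0) * math.exp(-internal_branch)


def simulate(internal_branch, n_genes, seed=0):
    """Estimate the same probability by simulating gene histories."""
    rng = random.Random(seed)
    discordant = 0
    for _ in range(n_genes):
        # Waiting time until the two sister lineages coalesce (rate 1).
        if rng.expovariate(1.0) > internal_branch:
            # They failed to coalesce on the internal branch, so three
            # lineages enter the root population and any of the three
            # pairs is equally likely to coalesce first.
            if rng.randrange(3) != 0:
                discordant += 1
    return discordant / n_genes


for branch in (0.1, 1.0, 3.0):
    print(branch, round(p_discordant(branch), 3), round(simulate(branch, 20000), 3))
```

The shorter the internal branch, the closer the share of discordant gene trees gets to two thirds, so discordance by itself is not evidence against the species tree.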

Innovations do not always align with clades

The same change may occur independently in different lineages
- Standard genetic models assume that the chance for identical mutations is negligibly small.
- [Harpak et al., 2016] draw on data from humans and other primates to demonstrate this is incorrect even for the nuclear genome. There are hotspots of mutation where independent, but identical mutations may easily happen in humans and another primate species.

Lateral gene transfer: interbreeding between species
- [de Manuel et al., 2016]: ancient admixture from bonobos to chimpanzees, accounting for < 1% of the chimpanzee genome

Unique events

The worry: history often features unique, and sometimes quite improbable, events, while statistics goes for the most probable ones.

Some form of this worry will always be legitimate. Nature can trick the researcher.

But this is true for any science, including classical historical linguistics and even linguistics as a whole:
- E.g., historical linguistics relies on the principle of uniformity, which basically says “no really unique events”

But sometimes unique events can be reconstructed

A known discrepancy between the nuclear DNA and the mtDNA of Neanderthals:
- The nuclear genome points to the tree ((Neanderthals, Denisovans), Moderns), with early divergence from modern humans
- All 18 existing mitochondrial Neanderthal genomes point to the tree ((Neanderthals, Moderns), Denisovans)

The logical explanation for that would be inter-breeding, with the observed Neanderthal mtDNA coming from a human line. [Posth et al., 2017] estimate the probability of complete mtDNA replacement dependent on the share of moderns joining the Neanderthals:

The bottom line

Biological phylogenetics faces many fundamental challenges, similarly to linguistic phylogenetics:
- Rates of change vary drastically
- Innovations do not align neatly with clades
- Truly unique events clearly do happen

As a consequence, reconstructing trees is no easy walk. But no reason to lose hope, either!

Quick overview of the homework

Homework 1

Assignment 1.1: tree exercises; installing and using software (20 points overall)

Assignment 1.2: reading and commenting online on [François, 2015] (5 points overall)

...in Assignment 1.1:

- Newick notation: simple bracketed tree notation: ((English, Dutch), Spanish)
- Nexus file format: a general format for phylogenetics; can contain a tree or set of trees in the Newick notation
- Two programs for visualizing trees: Dendroscope and FigTree
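To see how little machinery the bracket notation needs, here is a minimal Python sketch that reads a Newick string into nested tuples. It is a teaching sketch under stated assumptions, not a full parser: it handles only tip names, parentheses and commas (no branch lengths, internal node labels or quoted names), and `parse_newick` is a name made up for this example. The homework itself uses Dendroscope, FigTree and R.

```python
def parse_newick(text):
    """Parse a minimal Newick string into nested tuples of tip names."""
    text = text.strip().rstrip(";").strip()
    pos = 0

    def skip_spaces():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_node():
        nonlocal pos
        skip_spaces()
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ")"
            skip_spaces()
            return tuple(children)
        # Otherwise this is a tip: read its name up to the next bracket or comma.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        name = text[start:pos].strip()
        if not name:
            raise ValueError("expected a name at position %d" % start)
        return name

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected text at position %d" % pos)
    return tree


print(parse_newick("((English, Dutch), Spanish)"))
# (('English', 'Dutch'), 'Spanish')
```

The nested result mirrors the brackets directly: English and Dutch are sisters, and Spanish is the sister of their mother node.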

A tree visualized in FigTree

...in Assignment 1.1:

R: the free software R is a powerful programming language, and one of the best choices for statistical work. If you have never programmed before: R statements are commands that tell R to perform certain operations. Simplifying slightly, there are two basic types of commands:

    a = 5 + 7                  # variable assignment
    plot(1:100, log(1:100))    # function call with side effects

In the next class, you will need R to generate trees in in-class exercises. The current assignment asks you to install it and run some very simple operations. No independent programming is needed at this point!

If you have technical problems...

If you cannot install some software, or it does not seem to work:
  ⇒ post your problem to the discussion “Technical problems” on Canvas

Summary of Class 1

Main points of Class 1

1 Trees can be viewed as a description of the true process of linguistic-community evolution

2 Evolution of linguistic communities is not necessarily tree-like
  - Evolution of linguistic features is expected to be non-tree-like even when the communities evolve in a tree-like way

3 By-hand and computational phylogenetic work complement each other:
  - Humans are smart, but bad at tedious work
  - Computers are stupid, but good at tedious work

4 Reconstructing reasonable trees and checking whether evolution was really tree-like is challenging in both biology and linguistics
  - The good news: biologists have been working on those challenges, and we can borrow many things from them

References

Baer, C. F., Miyamoto, M. M., and Denver, D. R. (2007). Mutation rate variation in multicellular eukaryotes: causes and consequences. Nature Reviews Genetics, 8:619–631.

Bammesberger, A. (1992). The place of English in Germanic and Indo-European. In The Cambridge History of the English Language. Volume 1: The Beginnings to 1066, chapter 2, pages 26–66. Cambridge University Press.

Burrow, T. and Emeneau, M. B. (1984). A Dravidian etymological dictionary. Clarendon Press, 2nd edition.

de Manuel, M., Kuhlwilm, M., Frandsen, P., Sousa, V. C., Desai, T., Prado-Martinez, J., Hernandez-Rodriguez, J., Dupanloup, I., Lao, O., Hallast, P., Schmidt, J. M., Heredia-Genestar, J. M., Benazzo, A., Barbujani, G., Peter, B. M., Kuderna, L. F. K., Casals, F., Angedakin, S., Arandjelovic, M., Boesch, C., Kühl, H., Vigilant, L., Langergraber, K., Novembre, J., Gut, M., Gut, I., Navarro, A., Carlsen, F., Andrés, A. M., Siegismund, H. R., Scally, A., Excoffier, L., Tyler-Smith, C., Castellano, S., Xue, Y., Hvilsom, C., and Marques-Bonet, T. (2016). Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science, 354(6311):477–481.

François, A. (2015). Trees, waves and linkages. Models of language diversification. In Bowern, C. and Evans, B., editors, The Routledge handbook of historical linguistics, pages 161–189. Routledge.

Harpak, A., Bhaskar, A., and Pritchard, J. K. (2016). Mutation rate variation is a primary determinant of allele frequencies in humans. PLoS Genetics, 12(2):e1006489.

Krishnamurti, B. (2003). The Dravidian languages. Cambridge University Press, Cambridge.

Nabholz, B., Glémin, S., and Galtier, N. (2008). Strong variations of mitochondrial mutation rate across mammals — the longevity hypothesis. Molecular Biology and Evolution, 25(1):120–130.

Posth, C., Wißing, C., Kitagawa, K., Pagani, L., van Holstein, L., Racimo, F., Wehrberger, K., Conard, N. J., Kind, C. J., Bocherens, H., and Krause, J. (2017). Deeply divergent archaic mitochondrial genome provides lower time boundary for African gene flow into Neanderthals. Nature Communications, 8:16046.

Rama, T. (2017). Dating the Dravidian language family: A preliminary investigation. https://github.com/PhyloStar/dravidian-dating/wiki/Dravidian-Dating-with-LSI.

Ségurel, L., Wyman, M. J., and Przeworski, M. (2014). Determinants of mutation rate variation in the human germline. The Annual Review of Genomics and Human Genetics, 15:47–70.

Sjödin, P., Bataillon, T., and Schierup, M. H. (2010). Insertion and deletion processes in recent human history. PLoS One, 5(1):e8650.

Sun, J. X., Helgason, A., Masson, G., Ebenesersdóttir, S. S., Li, H., Mallick, S., Gnerre, S., Patterson, N., Kong, A., Reich, D., and Stefansson, K. (2012). A direct characterization of human mutation based on microsatellites. Nature Genetics, 44:1161–1165.

Sung, W., Ackerman, M. S., Dillon, M. M., Platt, T. G., Fuqua, C., Cooper, V. S., and Lynch, M. (2016). Evolution of the insertion-deletion mutation rate across the Tree of Life. G3, 6(8):2583–2591.
