Opportunities in quantitative historical linguistics

Eric Smith

Work with Tanmoy, Jon, Bill, Ian, Dan Hruschka, Logan Sutton, Hyejin Youn

In collaboration with Murray Gell-Mann, George Starostin

Outline

• Brief history of languages at SFI

• The steps toward quantitative phylogenetics
• Example I: lexical reconstruction
• Example II: inferring models of word-order change

• New opportunities for empirical discovery

The Evolution of Human Languages Program at SFI

Goals: Deep reconstruction of language history and connection with genes and the archaeological record

Murray Gell-Mann, George Starostin, Ilia Peiros

Deep reconstruction: the Nostratic hypothesis

http://starling.rinet.ru/maps/maps.php?lan=en

• Coined by Holger Pedersen (1903)
• Modern form of the hypothesis by Vladislav Illich-Svitych and Aharon Dolgopolsky (1960s -- present)
• Estimated 12,000-15,000 BCE

http://en.wikipedia.org/wiki/Nostratic_languages

Methods

• Assign any sound map without penalty, but require regularity
• Exclude borrowed items in either language from consideration
• Identify fraction of preserved cognates; convert to separation time (penalty)
• Attempt to fit separation times to an ultrametric structure (tree)

(Figure: German-English word list with sound map: EINS ONE, ZWEI TWO, DREI THREE, VIER FOUR, FÜNF FIVE; TIER ANIMAL, REH DEER, HUND DOG, VATER FATHER, MUTTER MOTHER, MILCH MILK, FINGER FINGER, ARM ARM, BEIN LEG)

$$P_{\mathrm{preserve}}(\mathrm{word}) = e^{-t/\tau} \qquad\qquad \Delta t_{\mathrm{sep}} = -\tau \log(\text{frac. preserved})$$
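A minimal Python sketch of the clock above, converting a preserved-cognate fraction into a separation time. The retention constant TAU (here the commonly cited ~86% core-vocabulary survival per millennium) and the word counts are illustrative assumptions, not EHL calibrations.

```python
import math

# Glottochronological clock from this slide:
#   P_preserve(word) = exp(-t / tau)
#   delta_t_sep      = -tau * log(fraction of cognates preserved)
# TAU is an assumed retention constant: chosen so that ~86% of a core
# list survives per millennium (a commonly cited figure, not an EHL value).
TAU = -1000.0 / math.log(0.86)  # years

def separation_time(n_cognate: int, n_compared: int, tau: float = TAU) -> float:
    """Separation time implied by the fraction of preserved cognates."""
    frac = n_cognate / n_compared
    return -tau * math.log(frac)

# Hypothetical example: 62 of 100 core meanings preserved as cognates.
print(round(separation_time(62, 100)))  # ~3170 years
```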

EHL today

• The Dene-Caucasian hypothesis
• Borean? Relic of the last ice age?
• Reconstructions of Dravidian, Khoisan, ...

• The EHL database and Starling

http://starling.rinet.ru/cgi-bin/response.cgi?root=config&morpho=0&basename=\data\alt\turcet&first=1

Concepts and steps in quantitative phylogenetics

• Roles of Likelihood and Bayesian methods
• Frequent-pattern versus rare-feature innovations
• Bayes’s theorem and prior prejudice
• Typological constraints and Bayesian priors
• Information criteria and significance of parameters
• The likelihood part of a phylogenetic algorithm
• Overall structure of sound and meaning change
• Alignment, sound correspondence, and errors
• Context discovery, detection of borrowings
• Classification and reconstruction

Rare innovations versus clusters of common innovations

• Rare innovations: single features with ~0 probability to occur by chance (go in Bayesian priors)
  • Imply common descent or borrowing, even w/o mathematics
  • Only seen once: hard to assign probabilities from frequencies
  • Common in morpho-syntactic features
  • Useless for dating; do not support induction
• Common variations (estimate with likelihood)
  • Examples: sound shift and meaning shift in core lexicon
  • Individually uninformative, but can assign probabilities from data
  • Require math to handle, but do support induction, and can be informative about dates if change processes are regular

Bayes’s Theorem and model comparison

• Represent both data and models with a joint probability $P(m, d)$:
$$P(m) \equiv \sum_{d} P(m, d) \quad \text{(sum over data)}, \qquad P(d) \equiv \sum_{m} P(m, d) \quad \text{(sum over models)}$$
• Split the joint probability into conditionals either of two ways:
$$P(m, d) = P(m \,|\, d)\, P(d) = P(d \,|\, m)\, P(m)$$
• Bayes’s theorem: priors and likelihoods

$$P(m \,|\, d) = \frac{P(d \,|\, m)\, P(m)}{P(d)} = \frac{P(d \,|\, m)\, P(m)}{\sum_{m'} P(d \,|\, m')\, P(m')}$$

($P(m)$ is the Bayesian prior; $P(m \,|\, d)$ is the Bayesian posterior.)

Typological constraints are a natural domain for priors (Ian’s lecture)

• Three roles for priors
  • Include frequency evidence from outside this sample
  • Include non-frequency evidence (rare innovations)
  • Represent out-of-field evidence (molecular phylogenies)
• On states
  • phoneme inventory, word order, ...
  • implicational relations (pronouns, time, color, aspect)
• On transitions
  • phoneme contexts, intermediate word-order states (NDO)
  • geometric models of phonology or semantics?
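The posterior above, with priors of the kinds just listed, takes only a few lines to compute. A sketch: the two candidate trees, the rare-innovation prior, and the likelihood values are all invented for illustration.

```python
# Bayesian model comparison as on the preceding slides: posteriors P(m|d)
# from priors P(m) and likelihoods P(d|m).  The candidate trees and all
# numbers are invented for illustration.

def posterior(priors, likelihoods):
    """P(m|d) = P(d|m) P(m) / sum over m' of P(d|m') P(m')."""
    joint = {m: likelihoods[m] * priors[m] for m in priors}
    evidence = sum(joint.values())  # P(d)
    return {m: j / evidence for m, j in joint.items()}

# Prior: a rare shared innovation favors tree A (non-frequency evidence).
# Likelihood: frequent lexical changes scored under each tree.
priors = {"tree_A": 0.8, "tree_B": 0.2}
likelihoods = {"tree_A": 1.2e-6, "tree_B": 3.0e-6}
print(posterior(priors, likelihoods))
# {'tree_A': 0.615..., 'tree_B': 0.384...}
```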

Akaike and Bayesian Information Criteria

$$\mathrm{AIC} \equiv -2 \log(L) + 2k \qquad\qquad \mathrm{BIC} \equiv -2 \log(L) + k \log(n)$$

Multinomial sampling distribution:
$$P(n_1, \ldots, n_K) = \frac{N!}{n_1! \cdots n_K!} \prod_{i=1}^{K} p_i^{n_i} = e^{-N D\left(\frac{\vec n}{N} \middle\| \vec p\right)}$$

$$D\left(\frac{\vec n}{N} \middle\| \vec p\right) \equiv \sum_{i=1}^{K} \frac{n_i}{N} \log \frac{n_i / N}{p_i} \qquad \text{(Kullback-Leibler divergence, or Relative Entropy)}$$

Density of maximum-likelihood estimates:
$$\rho(\tilde p) = \sum_{\vec n} e^{-N D\left(\frac{\vec n}{N} \middle\| \vec p\right)} \, \delta\!\left(\tilde p - \frac{\vec n}{N}\right)$$

Likelihoods and their maxima:
$$\mathcal{L}(\tilde n_1, \ldots, \tilde n_k) = \prod_{j=1}^{k} \tilde p_j^{\tilde n_j}, \qquad \max_{\tilde p} \log \mathcal{L}(\tilde n_1, \ldots, \tilde n_k) = N \sum_{j=1}^{k} \frac{\tilde n_j}{N} \log \frac{\tilde n_j}{N} \quad \text{at } \tilde p_i \to \frac{\tilde n_i}{N}$$

Distribution of max-likelihoods:
$$\sum_{\vec n} e^{-N D\left(\frac{\vec n}{N} \middle\| \vec p\right)} N \sum_i \tilde p_i \log \tilde p_i \approx N \sum_i \tilde p_i(\vec p) \log \tilde p_i(\vec p) + \frac{k}{2}$$

Maximizing the average likelihood of real samples over average models gives
$$\sim N \sum_i \tilde p_i \log \tilde p_i - k,$$
an unbiased sample estimator.

More detail on Akaike derivation

Likelihood of any data given a particular model:
$$\mathcal{L}(\tilde n_1, \ldots, \tilde n_k) = \prod_{j=1}^{k} \tilde p_j^{\tilde n_j}$$

Average log-likelihood of actual data, from a model produced with fixed estimated parameters $\tilde p$:
$$\sum_{\vec n} e^{-N D\left(\frac{\vec n}{N} \middle\| \vec p\right)} \log \mathcal{L}\left(\tilde n(\vec n) \,\middle|\, \tilde p\right) = \sum_i \tilde n_i(\vec p) \log \tilde p_i$$

Now this averaged log-likelihood, averaged over estimated models; 2nd-order Taylor expansion:
$$\sum_{\vec n} e^{-N D\left(\frac{\vec n}{N} \middle\| \vec p\right)} \sum_i \tilde n_i(\vec p) \log \tilde p_i(\vec n) \approx N \sum_i \tilde p_i(\vec p) \log \tilde p_i(\vec p) - \sum_{\vec n} e^{-N D\left(\frac{\vec n}{N} \middle\| \vec p\right)} \frac{N}{2} \sum_{j=1}^{k} \frac{\left(\tilde p_j(\vec n) - \tilde p_j(\vec p)\right)^2}{\tilde p_j(\vec p)} = N \sum_i \tilde p_i(\vec p) \log \tilde p_i(\vec p) - \frac{k}{2}$$

But we don’t have the ideal $\tilde p$’s; the best we can do is obtain an unbiased estimator from any single sample; for this we need to identify the bias typical of samples:
$$\sum_{\vec n} e^{-N D\left(\frac{\vec n}{N} \middle\| \vec p\right)} N \sum_i \tilde p_i \log \tilde p_i \approx N \sum_i \tilde p_i(\vec p) \log \tilde p_i(\vec p) + \frac{k}{2}$$
(left: sample ML; right: ideal ML plus variance correction)

Use this to replace the ideal ML with an unbiased estimator from samples, recovering the previous slide.
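Both criteria, and the multinomial maximum likelihood they penalize, can be computed directly from counts. A minimal sketch with invented counts:

```python
import math

# Information criteria as defined above; log_L is the maximized
# log-likelihood, k the number of free parameters, n the sample size.
def aic(log_L: float, k: int) -> float:
    return -2.0 * log_L + 2 * k

def bic(log_L: float, k: int, n: int) -> float:
    return -2.0 * log_L + k * math.log(n)

# Multinomial max log-likelihood N * sum_i (n_i/N) log(n_i/N), as in the
# derivation above; counts here are invented for illustration.
def max_log_likelihood(counts):
    N = sum(counts)
    return sum(c * math.log(c / N) for c in counts if c > 0)

counts = [40, 35, 25]                # hypothetical outcome counts
log_L = max_log_likelihood(counts)
k, n = len(counts) - 1, sum(counts)  # free parameters of the multinomial
print(aic(log_L, k), bic(log_L, k, n))
```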

Word lists are the starting point for lexical (= phonological / semantic) reconstruction

http://starling.rinet.ru/cgi-bin/response.cgi?root=config&morpho=0&basename=\data\alt\turcet&first=1

Objects that must be modeled are joint histories of sound and meaning for a collection of words

• n.b., word forms are attested; meanings are indirectly inferred, and often ambiguous
• easy to trace a form; but inadequate to infer history from forms alone

(Figure: joint sound/meaning history of ‘deer’)
PIE *dhus- ‘breath, blow; destruction, dissipation’
PG *deuza ‘wild animal’; Lat. bestia ‘animal, wild’
Ger. Tier ‘animal’; Eng. deer ‘deer’ (n.b.: red deer); It. bestia ‘animal, beast’; Fr. bête ‘animal, beast’ (n.b.: bête noire)

Representing sound and meaning “innovation” in the lexicon (Bill’s lecture)

• Suppose that some stable meaning categories can be identified
• Identify primary words for each meaning
• Try to exclude “borrowed” terms; suppose that what is left has been transmitted through vertical descent
• Identify systematic sound relations and try to infer historical sound changes
• Associate semantic innovations with in-language substitutions within meaning categories

(Figure: the German-English word list and sound map from the Methods slide)

Preserved meanings suggest sound maps

(Figure: numeral cognates for ‘one’ through ‘five’)
PIE: *Hoi-wo/-ko/-no, *duoh, *trei, *kwetuor, ...
PG: ainaz, twai, ðrjiz, fidwor, finfi; Lat.: unus, duo, tres, quattuor, quinque
Ger.: eins, zwei, drei, vier, fünf; Eng.: one, two, three, four, five; It.: uno, due, tre, quattro, cinque; Fr.: un, deux, trois, quatre, cinq

(Numbers give an example in which we can treat primary meanings as language-universal and historically relatively stable.)

Use preservation of meaning to infer regular relations of sounds: here, e.g., thr <> tr. (Figure links are annotated as innovation vs. sound similarity.)

Sound maps help identify meaning shifts

(Figure: sound maps and meaning shifts for ‘animal’ words)
PIE *dhus-, *ane-, *kwen-
PG *deuza, *raicaz, *hunda-; Lat. animal, canis
Ger. Tier, Reh, Hund; Eng. animal, deer, dog (hound); It. animale, cane; Fr. animal, chien

The “alignment problem”: what to compare w/ what?

Maximum-likelihood estimation of history and process (phonological only, here)

• Suppose we have proposed an alignment of positions in the daughter languages
• Propose phoneme assignments to aligned positions in the ancestor (with probabilities)
• Estimate regular correspondence of ancestor to daughter phonemes (w/ or w/o probabilities)
• Estimate random violations (with probabilities)

(Figure: schematic alignment/tree for the problem above; daughter forms bük, büjük, mijik, bejik, bijik, bedik in six languages, with unknown ancestral phonemes at internal nodes)
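A toy sketch of the correspondence-estimation step in the bullets above: given an assumed alignment, tabulate ancestor-to-daughter phoneme counts and normalize them into conditional probabilities. The proto-forms, daughter forms, and alignment below echo the schematic figure but are invented.

```python
from collections import Counter, defaultdict

# Given an assumed position-by-position alignment of cognates, count
# ancestor -> daughter phoneme correspondences; the regular part becomes
# P(daughter phoneme | ancestor phoneme), the rest is candidate noise.
aligned = [
    # (hypothetical proto-form, daughter form), aligned position by position
    ("büjük", "bejik"),
    ("büjük", "bijik"),
    ("bük",   "bik"),
]

counts = defaultdict(Counter)
for proto, daughter in aligned:
    for a, d in zip(proto, daughter):
        counts[a][d] += 1

# Regular-correspondence probabilities per ancestral phoneme
for a, c in sorted(counts.items()):
    total = sum(c.values())
    print(a, {d: n / total for d, n in c.items()})
# Rare off-diagonal entries (here ü -> e) are candidate random violations,
# or evidence that the alignment / context model needs refinement.
```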

Sound correspondences among the languages

S.A. Starostin, A.V. Dybo and O.A. Mudrak 2003. An Etymological Dictionary of the Altaic Languages. Leiden: Brill.

(Figure: sound-correspondence table for “Proto-Turkic” and daughter languages, including Old Turkic, Chulym, “Western Yenisei Turkic”, Yakut, Dolgan, Tuva, Tofa, Khakas, Shor, Altai/Oirot, Chuvash, Uighur, and other close languages)

Typology I: sound relations and features inferred from sound changes

(Figure: aligned forms mijik and bejik in two languages)

Context dependence: an Artificial Intelligence problem

Split of Old English k:
Stage I: katt keaff kinn
Stage II: katt tʃeaff tʃinn
Stage III: katt tʃaff tʃinn

Split of Latin s:
Stage I: kara flos floses
Stage II: kara flos flozes
Stage III: kara flos flores

(From R.L. Trask, “Language change”)

• Sound change can be regular, but not at single-phoneme level
• Conditioning contexts can be lost; must be guessed
• AIC and BIC can be used to judge guesses (see the sketch below)
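A small sketch of using BIC to judge a guessed conditioning context, in the spirit of the last bullet. The site counts (a k > tʃ split before front vowels, echoing the Trask example) are invented.

```python
import math
from collections import Counter

# Judge a guessed conditioning context with BIC: compare a model that
# ignores context against one that conditions outcomes on it.
# Each record is (context, outcome) for one sound-change site (invented).
data = [("front", "tS")] * 18 + [("front", "k")] * 2 + \
       [("back", "k")] * 19 + [("back", "tS")] * 1

def log_L(records, key):
    """Max log-likelihood of outcomes, optionally conditioned by context."""
    total = 0.0
    groups = Counter(r[0] if key else None for r in records)
    for g, n_g in groups.items():
        outcomes = Counter(o for c, o in records if (c if key else None) == g)
        total += sum(n * math.log(n / n_g) for n in outcomes.values())
    return total

def bic(log_l, k, n):
    return -2.0 * log_l + k * math.log(n)

n = len(data)
# No-context model: 1 free parameter (two outcomes); context model: 2.
print("no context:  ", bic(log_L(data, key=False), 1, n))
print("with context:", bic(log_L(data, key=True), 2, n))
# The conditioned model wins (lower BIC) when the context is predictive.
```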

Vowel harmony in the Turkic language family: a predictive supra-segmental context

Front/back agreement in roots and affixes:
nom ‘a book’, nomnar ‘books’
inek ‘a cow’, inekter ‘cows’

(Figure: AIC and BIC against number of cognate sets in dataset (25 to 300), for models with (SC) and without (NoSC) the supra-segmental context)

Identification of borrowings: words that do not fit the system pattern

GER              ENG
Stadt            city
unterschiedlich  different
auftauchend      emergent
nachlässig       negligent
Auftrag          mission
Neigung          passion
Durchschnitt     intersection
Übergang         transition
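Borrowing identification can be read as a binary classifier over misfit scores, which the ROC curve on this slide summarizes. A sketch with invented scores and borrowed/inherited labels:

```python
# Score each word by how poorly it fits the regular sound correspondences,
# threshold the score, and trace (false-positive rate, true-positive rate).
# Scores and labels are invented; label 1 = truly borrowed.
def roc_points(scores, labels):
    pairs = sorted(zip(scores, labels), reverse=True)  # most suspicious first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, is_borrowed in pairs:
        if is_borrowed:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical misfit scores for eight words (higher = fits the system worse)
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
print(roc_points(scores, labels))
```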

(Figure: Receiver Operating Characteristic curve for borrowing identification)

New conceptual domain: mathematical likelihood modeling of semantic shift

• Phonological and semantic constraints interact with polysemy and synonymy to structure sound and meaning change
• Semantic categories split, join, and move in some “space” which we do not know

(Figure: semantic space for PIE and Eng, with nodes Breath, Spirit, Wild animal, Domestic animal, Deer; PIE *dhus- ‘breath’ and *ane- shift toward Eng deer ‘deer’ and animal)

Phylogeny and reconstruction

• Modern glottochronology applied to expert linguists’ judgments of cognate classifications
• Presence/absence data format modeled after genes (a minimal coding sketch follows)
• Not yet a model of processes of sound and meaning change
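A minimal sketch of that presence/absence coding: each cognate class within a meaning becomes a binary character, as in gene presence/absence matrices. The cognate judgments below are invented.

```python
# Convert cognate-class judgments into a binary character matrix:
# character (meaning, class) is 1 iff the language's primary word for that
# meaning falls in that cognate class.  Judgments are invented.
cognates = {
    # meaning -> {language: cognate class}
    "TWO":    {"Ger": "A", "Eng": "A", "It": "B", "Fr": "B"},
    "ANIMAL": {"Ger": "A", "Eng": "B", "It": "B", "Fr": "B"},
}

langs = ["Ger", "Eng", "It", "Fr"]
matrix = {}
for meaning, classes in sorted(cognates.items()):
    for cls in sorted(set(classes.values())):
        for lang in langs:
            matrix.setdefault(lang, []).append(int(classes.get(lang) == cls))

for lang, row in matrix.items():
    print(lang, row)
# Characters in order: ANIMAL:A, ANIMAL:B, TWO:A, TWO:B
# Ger [1, 0, 1, 0]   Eng [0, 1, 1, 0]   It [0, 1, 0, 1]   Fr [0, 1, 0, 1]
```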

Maslova: how much can you do with incomplete reconstructions of the past?

Two quantities, for a binary feature gained at rate $p$ and lost at rate $q$:

stationary frequency
$$\bar f = \frac{p}{p+q}$$

typical number of pairs
$$\bar h = \frac{p+q}{2}\,\bigl[2-(p+q)\bigr]\,2\bar f\,\bigl(1-\bar f\bigr)$$

(Diagram: two states exchanging at rates $p$ and $q$.)

Regression model for observed frequencies estimates $p/q$:
$$f_i = \bar f + \epsilon_i$$

Fit of number of pairs against mean, in children of a common family ancestor, puts bounds on $(p+q)$ and $p/q$:
$$h_i - \bar h = \bigl(f_i - \bar f\bigr)\bigl[2-(p+q)\bigr]\bigl(1-2\bar f\bigr) + \epsilon_i$$
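A quick simulation sketch of the two-state dynamics behind these estimates: with per-millennium gain and loss rates p and q (illustrative values only), the cross-language frequency relaxes to f̄ = p/(p+q) on the timescale 1/(p+q).

```python
import random

# Two-state dynamics behind the estimates above: a feature is gained with
# probability p and lost with probability q per millennium, so the
# cross-linguistic frequency relaxes to f_bar = p / (p + q) on a timescale
# of 1 / (p + q).  Rates are illustrative assumptions, not fitted values.
def simulate(p, q, steps, n_langs=2000, seed=1):
    rng = random.Random(seed)
    states = [0] * n_langs
    for _ in range(steps):
        for i, s in enumerate(states):
            if s == 0 and rng.random() < p:
                states[i] = 1
            elif s == 1 and rng.random() < q:
                states[i] = 0
    return sum(states) / n_langs

p, q = 0.022, 0.020     # per-millennium rates; 1/(p+q) is roughly 24 ky
print("stationary f :", p / (p + q))          # ~0.524
print("simulated  f :", simulate(p, q, 300))  # ~0.52 after 300 ky
```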

Three partitions of standard word order

OV / VO: $f(OV) \approx 0.53$; $1/(p+q) \approx 24\,\mathrm{ky}$
SO / OS: $f(SO) \approx 0.96$; $1/(p+q) \leq 11\,\mathrm{ky}$; $p/q \geq 15$
SV / VS: $f(SV) \approx 0.86$; $1/(p+q) \in (10\,\mathrm{ky}, 30\,\mathrm{ky})$; $p/q \geq 3$

New opportunities for quantitative characterization of regularities

• Cross-linguistic regularities in frequency of use, and relations to rates of change
• Punctuated equilibrium and correlations of the “clock” of language change with culture
• Polysemy, synonymy, and semantics
• Full speaker-corpus archives (Norquist et al.), formant-based analysis (Labov), ...

Typology II: frequency of use and rates of change

Linguistic punctuated equilibrium: do cultural phenomena like splitting drive language divergence?

Toward a likelihood model of semantic shift: Inferring polysemy with English as a meta-language

(Figure: inferring polysemy with English as the meta-language; in languages $L_1$, $L_2$, a word xun covers SUN and DAY, a word ai covers MOON and MONTH, so the sense nodes Sun, Day, Moon, Month become linked through shared word forms)

Network of polysemes in 81 diverse languages

Data and graph courtesy of Bill Croft, Logan Sutton, and Hyejin Youn
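A minimal sketch of the network construction: link two meta-language senses whenever some language covers both with a single word. The three-language lexicon is invented; the real graph draws on the 81-language data set.

```python
from collections import defaultdict

# Build a polysemy network: senses (glossed in the English meta-language)
# are linked whenever one word in some language expresses both.
# The lexicon below is invented for illustration.
lexicon = {
    # language -> word -> senses
    "L1": {"w1": {"SUN", "DAY"}, "w2": {"MOON", "MONTH"}},
    "L2": {"w3": {"SUN"}, "w4": {"MOON", "MONTH"}},
    "L3": {"w5": {"DAY", "SUN"}, "w6": {"MONTH"}},
}

edges = defaultdict(int)   # (sense, sense) -> attestation count
for lang, words in lexicon.items():
    for word, senses in words.items():
        ordered = sorted(senses)
        for i, s in enumerate(ordered):
            for t in ordered[i + 1:]:
                edges[(s, t)] += 1

for (s, t), weight in sorted(edges.items()):
    print(f"{s} -- {t}: {weight}")
# DAY -- SUN: 2,  MONTH -- MOON: 2   (edge weights = attestation counts)
```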

Summary comments

• Much linguistics traditionally modeled on logic; can we root it in probability?
• Language evolution is not molecular evolution; same principles lead to different models
• New and refined quantitative signatures: both methodological and conceptual opportunities