Improved Reconstruction of Protolanguage Word Forms

Alexandre Bouchard-Côté, Thomas L. Griffiths, Dan Klein

Oceanic languages

[Map: the Oceanic language family, with Hawaiian, Samoan, Tongan, and Maori labeled.]

Shared ancestry

    'fish': Hawaiian iʔa, Samoan iʔa, Tongan ika, Maori ika

All four modern forms descend from a single Proto-Oceanic (POc) word. Two reconstructions are consistent with the data:

    POc *ika 'fish', with a sound change *k > ʔ on the branch leading to Hawaiian and Samoan, or
    POc *iʔa 'fish', with a sound change *ʔ > k on the branch leading to Tongan and Maori.

Can we harness more languages?

    'fish': Geser ikan, Rapanui ika, Nukuoro iga, ...

[Map: "Welcome to Oceanic Park" — the geographic spread of the Oceanic languages.]

Outline

    1. Motivation
    2. Computational model
    3. Learning and inference
    4. Experiments on Proto-Oceanic

Why reconstruct?

- Can answer a large number of questions about our past
- Learn about ancient populations' migrations
- Decipherment of ancient scripts

How linguists do reconstruction

- Direct diachronic evidence, sometimes
- Often not available (prehistorical cultures)
- The comparative method
- Unsupervised setup

               'fish'   'fear'
    Hawaiian   iʔa      makaʔu
    Samoan     iʔa      mataʔu
    Tongan     ika      manavahē
    Maori      ika      mataku

Computational Model

Input

[Figure: phylogenetic tree over hundreds of Oceanic languages (Nggela, Bugotu, Tape, Avava, ..., Hawaiian, Maori), paired with cognate tables like the one above; many entries are missing.]

- 512 languages × 6856 cognate sets
- IPA format
- Density: 60K entries
- Missing data

Graphical model

               'fish'   'fear'
    Hawaiian   iʔa      makaʔu
    Samoan     iʔa      mataʔu
    Tongan     ika      —
    Maori      ika      mataku

[Figure: a tree-shaped graphical model with the POc word form at the root and the Hawaiian (H.), Samoan (S.), Tongan (T.), and Maori (M.) word forms at the leaves; each cognate set instantiates one copy of the tree.]

- The edges of the tree model diachronic word mutation

Modeling string mutation

- What kind of string mutations need to be captured?
- Substitution: *k > ʔ
- Insertion (and deletion): *h > wh

      'break': Hawaiian haki, Samoan fati, Tongan fasi, Maori whati

- Context: *h > wh / # _ (only word-initially)

      'aloha': Hawaiian aloha, Samoan alofa, Tongan ʔalofa, Maori aroha
      (NOT *arowha: word-internal h is unaffected)

String transducer

    Example: 'to cry', POc *taŋis > taŋi > aŋi > angi

The mutation model along each branch is a string transducer that reads the parent string (e.g. # t a ŋ i s #) and generates the child string (e.g. # a n g i #) one symbol at a time. It has two parameter families:

- θS : substitution/deletion parameters (e.g. ŋ substituted by n; t and s deleted)
- θI : insertion parameters (e.g. g inserted after n)
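The two parameter families can be sketched as a toy derivation scorer: the probability of one edit sequence is the product of its substitution/deletion and insertion probabilities. All symbol inventories and probability values below are hypothetical illustrations, not the model's learned parameters.

```python
# Toy branch transducer: score one derivation of POc *tangis -> angi
# as a product of theta_S (substitution/deletion) and theta_I
# (insertion) probabilities. All values are made up for illustration.

# theta_S[x][y]: probability that parent symbol x surfaces as y
# ('' means deletion).
theta_S = {
    "t": {"t": 0.7, "": 0.3},
    "a": {"a": 0.9, "": 0.1},
    "ŋ": {"ŋ": 0.5, "n": 0.4, "": 0.1},
    "i": {"i": 0.9, "": 0.1},
    "s": {"s": 0.6, "": 0.4},
}

# theta_I[y]: probability of inserting y after the current output
# symbol ('' means no insertion).
theta_I = {"": 0.8, "g": 0.2}

def score_derivation(ops):
    """ops: a list of ('sub', parent, child) or ('ins', child) steps."""
    p = 1.0
    for op in ops:
        if op[0] == "sub":           # substitution or deletion
            _, x, y = op
            p *= theta_S[x][y]
        else:                        # insertion
            p *= theta_I[op[1]]
    return p

# One derivation of tangis -> angi: delete t, copy a, substitute
# ŋ > n, insert g, copy i, delete s.
ops = [("sub", "t", ""), ("sub", "a", "a"), ("sub", "ŋ", "n"),
       ("ins", "g"), ("sub", "i", "i"), ("sub", "s", "")]
score = score_derivation(ops)
```

The full model sums over all derivations with dynamic programming; this sketch only scores a single fixed derivation.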

Parameters

- Global parameters (one θ = (θS, θI) shared by all branches)? Cannot explicitly represent sound changes!
- Branch-specific parameters (one θ per branch)? Parameter proliferation!
- Solution: learn cross-linguistic trends

Cross-linguistic trends

- Some sound changes are unlikely cross-linguistically:
  - Velar stop to vowel: k > a
- Some sound changes are frequent cross-linguistically:
  - Consonant place change: k > ʔ
  - Debuccalization: f > h
  - Identity (faithfulness): x > x

Learning cross-linguistic trends

- How to learn these universals: express the transducer parameters as the output of a log-linear model

      θS(ŋ > n | context a _ i, branch to Maori) ∝ exp ⟨λ, f(ŋ > n, a _ i, to Maori)⟩

- The feature vector contains universal features, which ignore the name of the current branch, and branch-specific features, which add the branch name to the context:

        λ(ŋ > n)                          [universal]
      + λ(ŋ > n & mutation to Maori)      [branch-specific]
      + ...
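The log-linear parameterization above can be sketched as a softmax over candidate outcomes, with each outcome scored by a universal feature weight plus a branch-specific one. Feature names and weights here are illustrative, not learned values.

```python
import math

# Illustrative weights: a universal feature per operation, plus a
# branch-specific copy of the same feature. Missing features weigh 0.
lam = {
    ("op", "ŋ>n"): 1.5,                  # universal
    ("op", "ŋ>ŋ"): 2.0,                  # universal (faithfulness)
    ("op", "ŋ>"): -1.0,                  # universal (deletion)
    ("op-branch", "ŋ>n", "Maori"): 0.8,  # branch-specific
}

def features(op, branch):
    """A universal feature plus its branch-specific copy."""
    return [("op", op), ("op-branch", op, branch)]

def theta_S(candidates, branch):
    """Normalized probabilities over candidate outcomes (softmax)."""
    scores = {op: math.exp(sum(lam.get(f, 0.0)
                               for f in features(op, branch)))
              for op in candidates}
    z = sum(scores.values())
    return {op: s / z for op, s in scores.items()}

probs = theta_S(["ŋ>ŋ", "ŋ>n", "ŋ>"], branch="Maori")
```

On the Maori branch the branch-specific weight boosts ŋ > n past the faithful outcome; on a branch with no such feature, the universal weights alone decide.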

Second improvement

- Response to a concrete problem: sound changes are not exceptionless in real data
- Example: tension between a sound change and a morphological paradigm

      Passive marker -tia: whaka-maori-tia ('translate into Maori')
      vs. vowel sound change: ia > ie

  Which one wins? If the sound change wins, the marked form with ending -tie becomes an exception to the paradigm.

Adding markedness features

- Add dependencies in the string transducer model: each decision can condition on the previously generated output symbol

[Figure: the transducer of the previous section, with an added dependency from the preceding output symbol to the current substitution/insertion decision.]

- Also add new features over the output string, e.g.:
  - word has /a #/
  - word has /C V V/ & mutation to Maori
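The two output-string features named above can be sketched as indicator functions; the segment classes and patterns are simplified illustrations (a five-vowel inventory, single-character segments), not the model's actual feature templates.

```python
import re

VOWELS = "aeiou"  # simplified vowel inventory for illustration

def has_a_at_word_end(word):
    """Feature: word has /a #/, i.e. ends in a."""
    return word.endswith("a")

def has_cvv(word):
    """Feature: word has /C V V/, a consonant followed by two vowels.

    In the model this indicator is also conjoined with the branch
    identity (e.g. '& mutation to Maori'); that part is omitted here.
    """
    pattern = "[^" + VOWELS + "][" + VOWELS + "]{2}"
    return re.search(pattern, word) is not None

result_end = has_a_at_word_end("aroha")   # True: ends in a
result_cvv = has_cvv("whati")             # False: no C V V run
```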

Learning and Inference

Learning λ while reconstructing

- Monte Carlo EM
  - M step: not analytic, but convex
  - E step: challenging; use MCMC
- Hardness of inference (E step):
  - Horizontal links ⟹ inference is at least as hard as non-planar Ising inference
  - Insertions and deletions ⟹ non-standard setup
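The alternation between a sampled E step and a closed-form M step can be illustrated on a much simpler model. This is a generic Monte Carlo EM sketch (latent coin choice, binomial observations), not the paper's transducer model; all constants are made up.

```python
import random

random.seed(0)

# Toy data: each of 200 sessions secretly picks coin A (bias 0.8) or
# coin B (bias 0.3), then reports heads out of n = 20 tosses. The
# coin identity is the latent variable.
n = 20
true_bias = {"A": 0.8, "B": 0.3}
heads_per_session = []
for _ in range(200):
    p = true_bias[random.choice("AB")]
    heads_per_session.append(sum(random.random() < p for _ in range(n)))

pA, pB = 0.6, 0.5  # initial guesses
for _ in range(50):
    # E step (Monte Carlo): instead of exact expected counts, sample
    # the latent coin for each session from its posterior.
    heads = {"A": 0.0, "B": 0.0}
    tosses = {"A": 0.0, "B": 0.0}
    for h in heads_per_session:
        wA = pA ** h * (1 - pA) ** (n - h)
        wB = pB ** h * (1 - pB) ** (n - h)
        coin = "A" if random.random() < wA / (wA + wB) else "B"
        heads[coin] += h
        tosses[coin] += n
    # M step: maximize the completed-data likelihood in closed form.
    if tosses["A"]:
        pA = heads["A"] / tosses["A"]
    if tosses["B"]:
        pB = heads["B"] / tosses["B"]
```

In the reconstruction model the E step is the hard part (MCMC over ancestral strings) and the M step is a convex but non-analytic optimization over λ; here both steps are trivial, which is what makes the loop visible.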

Our previous work

- Single Sequence Resampling (SSR): a Gibbs sampler that resamples one internal word form at a time, given its neighbors in the tree

[Figure: Gibbs sampling on the 'grass' cognate set; observed modern forms include buburu, bubure, buruburu, buuburu, vuluvulu, and the internal reconstructions are repeatedly resampled one at a time.]

- Problems with the single-sequence Gibbs sampler:
  - Extremely slow in phylogenetic trees with high branching (most linguistic trees)
  - Slow mixing in large trees

Slow mixing

[Figure: a sequence of SSR states on the 'grass' example. How can the sampler jump to a state where the liquids /r/ (buruburu) and /l/ (vuluvulu) have a common ancestor? Resampling one sequence at a time forces the chain through many low-probability intermediate states.]

Solution: taking vertical slices

- Ancestry resampling: rather than resampling one whole sequence at a time (SSR), jointly resample a thin "vertical slice" through all the ancestral words of a given modern form

[Figure: SSR resamples one horizontal row of the tree at a time; ancestry resampling resamples a vertical slice, and can introduce the /r/ ~ /l/ correspondence into all the ancestors in a single move.]

Experiments

Comparison to other methods

- Evaluation: edit distance from a reconstruction made by a linguist (lower is better)
- Oakes 2000: uses exact inference and deterministic rules
- Task: reconstruction of Proto-Malayo-Javanic (reconstructed in Nothofer 1975)

[Bar chart: mean edit distance for Oakes vs. this work, axis 0 to 2.20; our system scores lower.]
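The evaluation metric, edit distance between a predicted reconstruction and the linguist's form, is plain Levenshtein distance; a minimal sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-symbol
    substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

# e.g. a predicted *ika against a reference *iʔa differs by one
# substitution:
dist = edit_distance("ika", "iʔa")
```

The reported numbers are this distance averaged over all reconstructed cognate sets.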

Comparison in large phylogenies

- Centroid: a novel heuristic based on an approximation to the minimum Bayes risk
- Task: reconstruction of Proto-Oceanic (reconstructed in Blust 1993)
- Both algorithms use 64 modern languages

[Bar chart: mean edit distance for Centroid vs. this work, axis 0 to 1.80; our system scores lower.]

Back to the initial puzzle

- Can we harness more modern languages to improve reconstructions?
- Using our previous model (NIPS 2008): NO (!)
  - No sharing across branches
- Performance of our model:

[Plot: mean edit distance to Blust's reconstruction (y axis, 1.6 to 2.4) against the number of modern languages used (x axis, 0 to 60, ordered from close to POc to far, i.e. less useful); the error decreases as more languages are added.]

Visualization of learned universals

- For each pair of phonemes, there is a link with grayscale value proportional to the weight of that transition
- Organized in the shape of an IPA chart (place of articulation × manner) for convenience

[Figure: learned universal transition weights drawn over IPA consonant and vowel charts. Highlighted patterns include voicing changes, place changes, and debuccalization.]

*The model did not have features encoding natural classes.

Conclusion

- We proposed three improvements:
  - Markedness of internal reconstructions
  - Cross-linguistic universals
  - A new inference method to scale up
- Results:
  - We outperform previous approaches
  - We show that using more languages improves reconstructions
- Current work: using the model to attack open questions in historical linguistics

Thank you!

Acknowledgments

- Simon Greenhill, Robert Blust, and Russell Gray for sharing their Austronesian dataset
- Michael Oakes for sharing his dataset and results
- Anna Rafferty and our anonymous reviewers for their comments
- Research funded by NSERC and NSF BCS-0631518