Arxiv:1908.02477V3 [Cs.CL] 9 May 2021 Language, a “Proto-Language”
Total Page:16
File Type:pdf, Size:1020Kb
Ab Antiquo: Neural Proto-language Reconstruction Carlo Meloni∗1 Shauli Ravfogel∗2,3 Yoav Goldberg2,3 1Department of Comparative Language Science, University of Zurich 2Computer Science Department, Bar Ilan University 3Allen Institute for Artificial Intelligence [email protected], fshauli.ravfogel, [email protected] Abstract nates in daughter languages is possible since his- torical sound changes within a language family Historical linguists have identified regularities are not random. Rather, the phonological change in the process of historic sound change. The is characterized by regularities that are the result comparative method utilizes those regularities to reconstruct proto-words based on observed of constraints imposed by the human articulatory forms in daughter languages. Can this pro- and cognitive faculties (Millar, 2013). For exam- cess be efficiently automated? We address the ple, we can find such regular change—commonly task of proto-word reconstruction, in which called “systematic correspondence”—by looking the model is exposed to cognates in contempo- at the evolution of the first phoneme of Latin’s rary daughter languages, and has to predict the word for “sky”:1 proto word in the ancestor language. We pro- vide a novel dataset for this task, encompass- ing over 8,000 comparative entries, and show that neural sequence models outperform con- ventional methods applied to this task so far. Error analysis reveals a variability in the abil- ity of neural model to capture different phono- logical changes, correlating with the complex- ity of the changes. Analysis of learned embed- dings reveals the models learn phonologically Figure 1: the evolution of Latin word for “sky” is sev- meaningful generalizations, corresponding to eral Romance languages. well-attested phonological shifts documented The Spanish word’s first sound is [T], while the by historical linguistics. Italian word begins with [tS], the French word 1 Introduction with [s], Romansh with [ts] and Sardinian with [k]. This pattern is systematic, and will be found Historical linguists seek to identify and explain the throughout the languages. Working this way, his- various ways in which languages change through torical linguists reconstruct words in the proto- time. Research in historical linguistics has re- language from existing cognates in the daughter vealed that groups of languages (language fami- languages, and determine how words in the proto- lies) can often be traced into a common, ancestral language may have sounded. arXiv:1908.02477v3 [cs.CL] 9 May 2021 language, a “proto-language”. Large-scale lexical To what extent can a machine-learning model comparison of words across different languages learn to reconstruct proto-words from examples in enables linguists to identify cognates: words shar- this way? And what generalizations of phonetic ing a common proto-word. Comparing cognates change will it learn? We focus on the task of makes it possible to identify rules of phonetic his- proto-word reconstruction: the model is trained on toric change, and by back-tracing those rules one sets of cognates and their known proto-word, and can identify the form of the proto-word, which is then tasked with predicting the proto-word for is often not documented. That methodology is an unseen set of cognates. Our study concentrate called the comparative method (Anttila, 1989), on the romance language family2 and the model is and is the main tool used to reconstruct the lex- trained to reconstruct the Latin origin. We show icon and phonology of extinct languages. Infer- ring the form of proto-words from existing cog- 1The words as transcribed with International Phonetic Alphabet (IPA) characters. ∗Equal contribution 2All the languages that derived from Latin. Accepted as a long paper in NAACL 2021 2 RELATED WORK that a recurrent neural-network model can learn to ing approaches to this task has been focused on perform this task well (outperforming previous at- distance-based methods, which quantify the dis- tempts).3 tance (according to some metric), or the similarity, More interesting than the raw performance between a given candidate of cognates. The sim- numbers are the learned generalizations and error ilarity can be either static (e.g. Levenshtein dis- patterns. The Romance languages are widely stud- tance) or learned. Once the metric is established, ied (Ernst(2003); Ledgeway and Maiden(2016); a classification can be performed either based on Holtus et al.(1989) among others), and their hard-decision (words below a certain threshold phonological evolution from Latin is well mapped. are considered cognates) or by learning a clas- The existence of this comprehensive knowledge sifier over the distance measures and other fea- allows exploring to what extent neural models in- tures (Kondrak, 2001; Mann and Yarowsky, 2001; ternalize and capture the documented rules of lan- Inkpen et al., 2005; Ciobanu and Dinu, 2014a; List guage change, and where do they deviate from it. et al., 2016); Mulloni and Pekar(2006) have eval- We provide an extensive error analysis, relating uated an alternative approach, in which explicit errors patterns to knowledge in historical linguis- rules of transformation are derived based on edit tics. This is often not possible in common NLP operations. See Rama et al.(2018) for a recent tasks, such as parsing or semantic inference, in evaluation of the performance of several cognates which the rules governing linguistic phenomena— detection algorithms. or even the suitable framework to describe them— Several studies have gone beyond the stage are still in dispute among linguists. of cognates extraction, and used resulted list Contributions Inspection of existing datasets of of cognates to reconstruct the lexicon of proto- cognates in Romance languages has revealed in- languages. Most studies in this direction borrowed herent problems. We thus have collected a new techniques from computational phylogeny, draw- comprehensive dataset for performing the recon- ing a parallel between the hypothesized branch- struction task (§4). Besides the dataset, our main ing of (latent) proto words into their (observed) contribution is the extensive analysis of what is be- current forms and the gradual change of genes ing captured by the models, both on orthographic during evolution. Bouchard-Cotˆ e´ et al.(2007) and phonetic versions of the dataset (§6). We find has applied such a model to the development that the error patterns are not random, and they of the Romance languages, based on a dataset correlate with the relative opacity of the historic composed of aligned-translations. Bouchard-Cotˆ e´ change. These patterns were divided in different et al.(2009, 2013) used an extensive dataset of categories, each one motivated by a sound phono- Austronesian languages and their reconstructed logical explanation. Moreover, in order to further proto-languages, and built a parameterized graph- evaluate the learning of rules of phonetic change, ical model which models the probability of a pho- we evaluated models on a synthetic dataset (§6.3), netic change between a word and its ancestral showing that the model is able to correctly cap- form; the probability is branch-dependent, allow- ture several phonological change rules. Finally, ing for the learning of different trends of change we analyze the learned inner representations of the across lineages. While achieving impressive per- model, and show it learns phonologically mean- formance, even without necessitating a cognates ingful properties of phonemes (§6.4) and attributes lists as an input, their model is based on a given different importance to different daughter lan- phylogeny tree that accurately represents the de- guages (§6.5). velopment of the languages in question. Wu and Yarowsky(2018) have automatically 2 Related Work constructed cognate datasets for several lan- The related task of cognates detection has been guages, including Romance languages, and used extensively studied. In this task, a set of cog- a character-level NMT system to complete miss- nates should be extracted from word lists in dif- ing entries (not necessarily the proto-form). Sev- ferent languages. Most effort in Machine learn- eral works studied the induction of multilingual dictionaries from partial data in related languages. 3We note that the role of the ML model is easier than Wu et al.(2020) reconstruct cognates in Austrone- that of the historical linguist, as it is trained on sets of words that it took the historical linguistics discipline a considerable sian languages (where the proto-language is not effort to acquire. attested). Lewis et al.(2020) employ a mixture- 2 Accepted as a long paper in NAACL 2021 5 EXPERIMENTAL SETUP of-experts approach for lexical translation induc- freely available resource, Wiktionary, whose data tion, combining neural and probabilistic meth- were manually checked against DIEZ and Donkin ods, and Nishimura et al.(2020) translate from (1864) to ensure their etymological relatedness a multi-source input that contains partial trans- with the Latin source. The entries were transcribed lations to different languages, concatenated. Fi- into IPA using the transcription module of the eS- nally, Ciobanu and Dinu(2018) have applied a peak library5, which offers transcriptions for all CRF model with alignment to a dataset of Ro- languages in our dataset, including Latin. The mance cognates, created from automatic align- final dataset contains 8,799 cognate sets (not all ment of translations (Ciobanu