Restoring the Sister: Reconstructing a Lexicon from Sister Languages using Neural Machine Translation

Remo Nitschke
The University of Arizona
[email protected]

Abstract

The historical comparative method has a long history in historical linguistics. It describes a process by which historical linguists aim to reverse-engineer the historical developments of language families in order to reconstruct proto-forms and familial relations between languages. In recent years, there have been multiple attempts to replicate this process through machine learning, especially in the realm of cognate detection (List et al., 2016; Ciobanu and Dinu, 2014; Rama et al., 2018). So far, most of the experiments aimed at actual reconstruction have attempted the prediction of a proto-form from the forms of the daughter languages (Ciobanu and Dinu, 2018; Meloni et al., 2019). Here, we propose a reimplementation that instead uses modern related languages, or sisters, to reconstruct the vocabulary of a target language. In particular, we show that we can reconstruct the lexicon of a target language by using a fairly small data set of parallel cognates from different sister languages and a neural machine translation (NMT) architecture with a standard encoder-decoder setup. This effort is directly in furtherance of the goal of using machine learning tools to help under-served language communities in their efforts at reclaiming, preserving, or reconstructing their own languages.

1 Introduction

Historical linguistics has long employed the historical comparative method to establish familial connections between languages and to reconstruct proto-forms (cf. Klein et al., 2017b; Meillet, 1967). More recently, the comparative method has been employed by revitalization projects for the reconstruction of lost lexical items (cf. Delgado et al., 2019). In the particular case of Delgado et al. (2019), lost lexical items of the target language are reconstructed by using equivalent cognates of still-spoken modern sister languages, i.e., languages in the same family that share some established common ancestor language and a significant number of cognates with the target language. By reverse-engineering the historical phonological processes that happened between the target language and the sister languages, one can predict what the lexical item in the target language should be. This is essentially a twist on the comparative method, using the same principles, but to reconstruct a modern sister, as opposed to a proto-antecedent.

While neural net systems have been used to emulate the historical comparative method[1] to reconstruct proto-forms (Meloni et al., 2019; Ciobanu and Dinu, 2018) and for cognate detection (List et al., 2016; Ciobanu and Dinu, 2014; Rama et al., 2018), there have not, to the best of our knowledge, been any attempts to use neural nets to predict or reconstruct lexical items of a sister language for revitalization or reconstruction purposes.

Meloni et al. (2019) report success on a similar task (reconstructing proto-forms) by using multilingual cognate pattern lists as training input: instead of reconstructing Latin proto-forms from only Italian roots, they use Italian, Spanish, Portuguese, Romanian and French cognates of Latin, i.e., mapping from many languages to one. As our intended use-case (see Section 1.1) is one that suffers from data sparsity, we explicitly explore the degree to which expanding the list of sister languages in the many-to-one mapping can compensate for fewer available data points. Since the long-term goal of this project is to aid revitalization efforts, the question of available data is of utmost importance. Machine learning often requires vast amounts of data, and languages which are undergoing revitalization usually have very sparse amounts of data available.

[1] Due to the nature of neural nets, we do not know whether these systems actually emulate the historical comparative method or not. What is meant here is that they were used for the same tasks.

Hence, the goal for a machine learning approach here is not necessarily the highest possible accuracy, but rather the ability to operate with as little data as possible, while still retaining a reasonable amount of accuracy.

Our particular contributions are:

1. We demonstrate an approach for reframing the historical comparative method to reconstruct a target language from its sisters using a neural machine translation framework. We show that this can be done with easily accessible open-source frameworks such as OpenNMT (Klein et al., 2017a).

2. We provide a detailed analysis of the degree to which inputs from additional sister languages can overcome issues of data sparsity. We find that adding more related languages allows for higher accuracy with fewer data points. However, we also find that blindly adding languages to the input stream does not always yield said higher accuracy. The results suggest that the added input language needs to share a significant number of cognates with the target language.

1.1 Intended Use-Case and Considerations

This experiment was designed with a specific use-case in mind: lexical reconstruction for language revitalization projects. Specifically, the situation where this type of model may be most applicable would be a language reclamation project in the definition of Leonard (2007) or a language revival process in the definition of McCarty and Nicholas (2014); in essence, a language where there is some need to recover or reconstruct a lexicon. An example of such a case might be the Wampanoag language reclamation project (https://www.wlrp.org/), or comparable projects using the methods outlined in Delgado et al. (2019).

As this is a proof of concept, we use the Romance language family, specifically the non-endangered languages French, Spanish, Italian, Portuguese and Romanian, and operate under the assumption that these results can inform how one can use this approach with other languages of interest. However, we are aware that Romance morphology may be radically different from that of some of the languages in the scope of this use-case, such as agglutinative and polysynthetic languages, and that we cannot fully predict the performance of this type of system for such languages from the Romance example. Regardless, some insights gained here will still be applicable in those cases, such as the question of compensating for a lack of data by using multiple languages.

Languages that are the focus of language revitalization projects are typically not targets for deep learning projects. One of the reasons for this is that these languages usually do not have large amounts of data available for training state-of-the-art neural approaches. These systems need large amounts of data, and neural machine translation systems, such as the one used in this project, are no exception. For example, Cho et al. (2014) use data sets varying between 5.5 million and 348 million words. However, the task of proto-form reconstruction, which is really a task of cognate prediction, can be achieved with fairly small datasets if parallel language input is used. This was shown by Meloni et al. (2019), whose system predicted 84% of items within an edit distance of 1, meaning that 84% of the predictions were so accurate that only one edit or no edits were necessary to achieve the true target. For example, if the target output is "grazie", the machine might predict "grazia" (one edit) or "grazie" (0 edits). Within a language revitalization context, this level of accuracy would actually be a very good outcome. In this scenario, a linguist or speaker familiar with the language would vet the output regardless, so small edit distances should not pose a big problem. Further, all members of a language revitalization project or language community would ultimately vet the output, as they would make a decision on whether to accept or reject the output as a lexical item of the language.

This begs the question of why a language revitalization project would want to go through the trouble of using such an algorithm in the first place: if they have someone available to vet the output, then that person may as well do the reconstructive work themselves, as proposed in Delgado et al. (2019). This all depends on two factors. First, how high is the volume of lexical items that need to be reconstructed or predicted? The effort may not be worth it for 10 or even 100 lexical items, but beyond this a neural machine translation model can potentially outperform the manual labor. Once trained, the model can make thousands of predictions in minutes, as long as input data is available.

Second, and potentially more importantly, it will depend on how well the historical phonological relationships between the languages are understood.

      Spanish   French   Portuguese   Romanian   Italian (target)   status
  1   -         -esque   -e:scere     -          -                  removed, no target
  2   mosto     moût     mosto        must       mosto
  3   -         -        -            -          lugano             removed, no input
  4   párrafo   -        -            -          paragrafo
  5   -edad     -        -idade       -itate     -ità

Table 1: Examples of data patterns, including types of data removed during cleanup (e.g., rows 1 and 3).

For a family like Romance, we have a very good understanding of the historical genesis of the languages and of the different phonological processes they underwent; see, for example, Maiden et al. (2013). However, there are many language families in the world where these relationships and histories are far less clear. In such situations, a machine learning approach would be beneficial, because the algorithm learns[2] the relationships for us and gives predictions that only need to be vetted. Under this perspective, the best model might not necessarily be the one that produces the most accurate output, but perhaps the one that produces the fewest incorrigible mistakes. An incorrigible mistake here would be the algorithm predicting an item that is completely unrelated to the target (e.g., predicting "cinque" for a target of "grazie").

Further, ease of use and accessibility will be another factor for this kind of use-case, as not every project of this type will have a computational linguist to call on. Hence, another aim should be a low threshold for reproducibility and the utilization of easy-to-use open-source frameworks. In the spirit of the latter, all data and code necessary to reproduce the results are open-source and freely available. This paper is intended for computational linguists as well as linguists and/or community members who are involved with projects surrounding languages which might benefit from this approach. As such, it is written with both audiences in mind, with Section 6 ("Warning Labels for Interested Linguists") specifically aimed at linguists and community members interested in a potential application of this method.

[2] Or, rather, it interprets.

2 The Dataset

The data set used for this experiment was provided by Shauli Ravfogel of Meloni et al. (2019). The initial set consisted of 5420 lines of cognate sextuples from the Romance language family, specifically Romanian, French, Spanish, Portuguese, Italian and Latin. As the aim of this experiment was to reconstruct from sister languages to a sister language, the Latin items were removed from the set, and Italian was instead chosen as the target language, since it had the most complete patterns with respect to the other languages in the set. Table 1 illustrates the types of lines present in the initial dataset.

Lines with no target and lines with no input were removed: lines where there was a target but no input (row 3 of Table 1) as well as lines where there was input but no target (row 1). After the removal of all lines which led to empty patterns in the Italian set, and of all lines which were empty patterns in the input, 3527 lines remained. Of these, 2466 lines were taken as training data, 345 for validation, and 717 were set aside for testing.
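For illustration, this cleanup and split can be sketched as follows. This is a minimal sketch under our own assumptions (a CSV file with one column per language and "-" marking an empty pattern, as in Table 1; the file name and column names are hypothetical), not the project's released code.

    import csv

    LANGS = ["Spanish", "French", "Portuguese", "Romanian"]

    def present(cell):
        # "-" marks an empty pattern (see Table 1); genuine suffix
        # entries such as "-esque" or "-ità" are longer than a bare hyphen.
        return cell is not None and cell.strip() not in {"", "-"}

    def load_and_clean(path, target="Italian"):
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        # Keep only rows with a target form and at least one input cognate,
        # dropping rows like 1 and 3 of Table 1.
        return [r for r in rows
                if present(r[target]) and any(present(r[l]) for l in LANGS)]

    rows = load_and_clean("romance_cognates.csv")  # hypothetical file name
    n = len(rows)  # 3527 in our case
    # Approximately the 2466/345/717 split used here (about 70%/10%/20%):
    train = rows[: int(0.7 * n)]
    valid = rows[int(0.7 * n) : int(0.8 * n)]
    test = rows[int(0.8 * n) :]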
Meloni et al. (2019) use both an orthographic and an IPA data set, and show that the orthographic set yields more accurate results. Here, we use only orthographic representations, which we prefer not for accuracy, but because orthographic datasets are more easily acquired for most languages, particularly those of interest to language reclamation projects. If both an IPA set and an orthographic set are available, one may attempt to use both to boost the accuracy of the results; Chen (2018) showed that this is possible with glossing data in the case of sentence-level neural machine translation. We will discuss this in Section 5.2.

See Figure 1 for a very simplified phylogenetic tree representation of the familial relations of the Romance languages used in this dataset. This tree was constructed using data from Glottolog (Hammarström et al., 2020) and is included for illustrative purposes only, not as a statement about the phylogeny of the Romance languages.[3]

    Continental Romance
    ├── Italo-Western Romance
    │   ├── Italian
    │   └── Western Romance
    │       ├── West-Ibero Romance: Spanish, Portuguese
    │       └── Gallo-Rhaetian: French
    └── Eastern Romance
        └── Balkan Romance: Romanian

Figure 1: An abridged family tree of the relevant Romance languages. Adapted from Glottolog (Hammarström et al., 2020).

[3] We also acknowledge that tree representations are not necessarily the most accurate way to represent these relationships (Kalyan and François, 2019).

3 Experimental Setup

This experiment was run using the OpenNMT-py neural machine translation framework (Klein et al., 2017a) with its default settings (a 2-layer LSTM with 500 hidden units on both the encoder and the decoder). The OpenNMT-py default setup was chosen intentionally: the envisioned use-case requires an easily reproducible approach for interested users or communities who might profit from using this method for their own purposes, but who do not necessarily have deep expertise in machine learning or in tuning neural models. A publicly available toolkit like OpenNMT and a no-configuration setup help lower the bar to entry for these parties.

Neural machine translation (NMT) frameworks are designed to translate sentences from one language to another, but they can be used for a number of sequential data tasks (Neubig, 2017). One such task is the prediction of a cognate from a set of input words, as done here. These frameworks typically use an encoder-decoder setup, where both the encoder and the decoder are often implemented as LSTM (Long Short-Term Memory) networks (Hochreiter and Schmidhuber, 1997), which have the advantage of effectively capturing long-distance dependencies (Neubig, 2017). In such a setup, the encoder reads in the character-based input representation and transforms it into a vector representation; the decoder takes this vector representation and transforms it into a character-based output representation (Cho et al., 2014).

NMT frameworks also employ a "vocabulary" set, which contains the vocabulary of the language being translated from and the vocabulary of the language being translated to. The size of this vocabulary is often an issue for the effectiveness of NMT models (Hirschmann et al., 2016). In our case, the source vocabulary simply contains all the characters that occur in the input language examples, and the target vocabulary contains the characters that occur in the target language examples. To illustrate: if this task were about predicting English words, the target vocabulary would contain all the letters of the English alphabet.

3.1 Input Concatenation

Since the input in our case is a list of cognates from different languages, we need to consider how we feed this input to the machine. There are two obvious options: we can feed the cognates one by one, or we can merge the cognates before feeding them to the machine. In this experiment, we merge the words character by character to construct the input lines: for every line in the input, the first characters of all words are concatenated, then the second characters, and so on. For an illustration:

(1) patterns in the input: aille, alha, al, aie
(2) target pattern: aglia
(3) input: aaaaillilhelae
(4) target: aglia

This merging delivered marginally better results than simple concatenation in early testing, which is why it was selected. It is unclear why this is the case. We suspect that the merged input makes it easier for the model to recognize when the same characters appear in the same position of the input, as is the case with "a" in the initial position in the above example. However, we are cautious about recommending this input representation in general, because different morphologies may be better represented by a plain concatenation.
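Stated as code, the merge is a character-wise interleaving. The following minimal sketch reproduces example (1)-(4); the function name and the space-separated formatting at the end (NMT toolkits such as OpenNMT-py read whitespace-separated tokens, so character-level input is commonly written with spaces between characters) are our own choices rather than the paper's released implementation.

    from itertools import zip_longest

    def merge_cognates(words):
        # Interleave cognates character by character: all first characters,
        # then all second characters, and so on; shorter words simply run out.
        return "".join(ch for column in zip_longest(*words, fillvalue="")
                       for ch in column)

    merged = merge_cognates(["aille", "alha", "al", "aie"])
    assert merged == "aaaaillilhelae"  # example (3) above

    # One plausible way to write a character-level training line:
    print(" ".join(merged))  # "a a a a i l l i l h e l a e"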
3.2 Different Training Setups

To determine the performance gains from simply having more data versus having data from more languages, we create several training scenarios, each using the same aforementioned 2-layer LSTM. To understand the benefit of additional languages, we first train with the entire training set with all four input languages, then successively remove languages from the input set until only one remains. Next, to compare this to the impact of simply having fewer data points, but from all languages, we generate several impoverished versions of the data set. For these impoverished versions, lines were removed randomly[4] from the set, reducing the data by 70%, 50%, 30% and 10% respectively.

[4] This was done by simply removing every nth line, depending on how much reduction was needed.
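Both kinds of ablation are straightforward to derive from the cleaned rows sketched above. The following is again a minimal sketch; the deterministic thinning function is one possible reading of footnote 4's every-nth-line removal.

    def drop_languages(rows, keep, target="Italian"):
        # Language ablation: retain only some input columns plus the target.
        return [{k: v for k, v in r.items() if k in keep or k == target}
                for r in rows]

    def downsample(rows, keep_fraction):
        # Row ablation: keep a line whenever the running count of kept
        # lines falls behind the desired fraction; this approximates
        # removing every nth line.
        kept, out = 0, []
        for i, row in enumerate(rows, start=1):
            if kept < keep_fraction * i:
                out.append(row)
                kept += 1
        return out

    span_fre = drop_languages(train, keep={"Spanish", "French"})
    reduced_30 = downsample(train, keep_fraction=0.7)  # "30% reduced input"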

4 Evaluation Measures

Machine translation is usually evaluated using the BLEU score (Papineni et al., 2002), but BLEU is designed with sentence-level translations in mind. We instead evaluate the output according to edit distance, in the style of Meloni et al. (2019), by calculating the percentage of the output which is within a given edit distance of the target. In addition to this metric, we also use a custom evaluation metric designed to emphasize the usability of the output for the intended use-case, i.e., as predictions to be vetted by an expert to save time over doing the entire analysis manually. To calculate this score, we compute the Damerau-Levenshtein edit distance to the target for each word and weight the predictions by their edit distance. That is:

    score = (a + 0.9b + 0.8c + 0.7d + 0.6e) / t

where a is the number of predictions with distance 0, b the number with distance 1, c the number with distance 2, d the number with distance 3, e the number with distance 4, and t the total number of predictions. As an example, consider a scenario with three predicted cognates. If system 1 produces three output patterns at an edit distance of 2, it receives a score of 0.8. If system 2 produces two output patterns with edit distance 0 and one at a distance of 5, this results in a score of 0.67.

The logic behind this metric is that any prediction with an edit distance larger than 4 is essentially useless for the proposed task, since such a large edit distance essentially constitutes an incorrigible mistake, as mentioned in Section 1.1. The cut-off at an edit distance of 4 is arbitrary to a degree, but it allows for a simple and informative evaluation metric for our use case. This metric will rank a model that has a large number of items at distance 0 but also a large number of items beyond 4 edits lower than a model with items mostly in the 1-3 edit range. Presumably, the latter is more useful for the task, as small errors can be adjusted by linguists or language users. Using this metric, we can rank different input combinations according to their assumed usefulness for the task of lexical reconstruction for revitalization purposes.
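The metric is simple to implement. Below is a minimal sketch; we use the common optimal-string-alignment formulation of the Damerau-Levenshtein distance, and the function names are our own.

    def damerau_levenshtein(a, b):
        # Optimal string alignment distance: insertions, deletions,
        # substitutions, and adjacent transpositions each cost one edit.
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    WEIGHTS = {0: 1.0, 1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6}  # distance > 4 scores 0

    def usability_score(predictions, targets):
        # score = (a + 0.9b + 0.8c + 0.7d + 0.6e) / t
        total = sum(WEIGHTS.get(damerau_levenshtein(p, t), 0.0)
                    for p, t in zip(predictions, targets))
        return total / len(targets)

    # System 2 from the example: two exact hits, one prediction 5 edits off.
    print(round(usability_score(["grazie", "grazie", "cinque"],
                                ["grazie", "grazie", "grazie"]), 2))  # 0.67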

5 Results

Table 2 shows the edit distance percentages and scores of the different runs at 10,000 steps of training.[5] We can compare the difference in outcome between using fewer languages in the input versus using fewer input lines overall. This addresses the question of whether adding multiple languages to the input helps compensate for fewer data points (cognate pairs). The runs with successively reduced numbers of languages (top half of the table) are all trained with all available input lines (2466), but with specific columns/languages excluded. The "reduced input" runs (bottom half of the table), on the other hand, use all four languages but fewer cognates, by excluding rows. These runs had the following amounts of training input: 10% reduced: 2220 lines; 30% reduced: 1793 lines; 50% reduced: 1345 lines; 70% reduced: 896 lines (recall that the total number of input lines available for training was 2466). All runs were tested on the same testing data.

    Edit distance        0        ≤1       ≤2       ≤3       ≤4       score
    Span-Fre-Port-Ro     44.63%   57.74%   69.6%    80.33%   88.42%   0.82
    Span-Fre-Port        42.68%   53.27%   68.34%   77.68%   84.94%   0.78
    Span-Port-Ro         42.54%   53.28%   66.39%   74.76%   81.59%   0.75
    Span-Fre             39.9%    50.9%    63.88%   74.62%   83.4%    0.76
    Spanish only         35.6%    47.98%   60.25%   69.03%   74.76%   0.68
    10% Reduced Input    40.17%   54.25%   69.6%    81.31%   87.59%   0.8
    30% Reduced Input    39.75%   50.91%   66.11%   73.36%   83.12%   0.77
    50% Reduced Input    33.19%   45.61%   60.95%   71.27%   82.4%    0.75
    70% Reduced Input    17.02%   26.08%   41%      50.77%   65.97%   0.59

Table 2: Edit distance percentages at 10,000 training steps. Shown are the results from using all data points with different combinations of languages (top), as well as using all languages but with random downsampling of the data (bottom). All scores are calculated from the testing data.

In Table 2, we can observe that, unsurprisingly, the training sample with the most languages and data (Span-Fre-Port-Ro) performs best. 44.6% within edit distance 0 means that almost half the predictions the machine makes are exactly correct. In terms of accuracy, this is not incredible: Meloni et al. (2019) report 64.1% within edit distance 0. However, considering that we are using a data set approximately a third the size of theirs for training (2466 cognates compared with 7038), the performance is surprisingly good. The more important measure for the intended use-case is the fact that over 80% of items are within an edit distance of 3, meaning that of the output produced, 80% needs only three edits or fewer to meet the target.

We can also observe that the performance successively drops as we remove languages, with Spanish only[6] performing worst. However, the way in which this performance drops is not entirely transparent. It appears that in terms of scoring, the Spanish-French (Span-Fre) sample actually performs better than the Spanish-Portuguese-Romanian sample. Further, while Span-Port-Ro has significantly better values in the 0-2 edit range, it is outperformed by Span-Fre in terms of score because Span-Fre has more items in the ≤4 edit range.

The noticeable difference between Span-Fre-Port and Span-Port-Ro is surprising and warrants some examination. The likely explanation is twofold. First, the Romanian set is the one with the most empty patterns: the Romanian training data includes only 930 filled patterns; in comparison, Portuguese includes 1905, French 1790, and Spanish 2125. It may be the case that the Romanian data is too small in comparison with the others to have a significant impact on the outcome. The other factor may be that Romanian is phylogenetically the most distant from the target language, Italian (Figure 1).

This becomes even more apparent in Figure 2, which shows the performance of the different models over time.[7] Here we can observe that there is hardly any difference between the performance of Span-Fre-Port-Ro and Span-Fre-Port over time, and it is only at 10,000 steps that they start to diverge. This divergence at the 10,000 step mark is likely random; the graph suggests that their overall performance is almost identical in terms of scoring. Another point in this direction is the seemingly convergent graphs of Span-Fre and Span-Port-Ro, suggesting that there is no difference between using two or three languages as input if the third language is Romanian.

Figure 2: Performance at different training steps for models with different combinations of input languages, plotted by custom score. All scores are calculated from the testing data.

Discounting the exclusion/inclusion of Romanian, we can observe that performance overall tends to increase with each parallel language added. This is especially evident from the obvious drop-off in performance of the Spanish-only input. If we assume that Romanian has no impact, then we can see that three languages (blue and orange in Figure 2) perform similarly and two languages (red and green) perform similarly, and there is an obvious drop-off between those two patterns. This suggests that using parallel language input can compensate for smaller datasets.

Due to the small dataset, the scores plateau fairly early, around the 3,000-step mark for most models. This suggests that it would be sufficient to run these models for 3,000 steps, which would save some time on low-end hardware. However, with these small datasets, training time should rarely exceed 5 hours on consumer-grade PCs.[8]

[5] One step of training means that the algorithm has gone through one batch of input lines. The default batch size for OpenNMT is 64.
[6] Spanish only was trained for only 5,000 steps, as the model plateaus around 1,000 steps. The performance of the Spanish-only model was measured every 500 steps for Figure 2.
[7] This can give a better representation of the performance, because a neural net constantly adjusts its weights, so looking at just one point in time can be deceiving.
[8] These were trained on an i5-5200 CPU at 2.2 GHz, and training took anywhere between 4 and 7 hours for 10,000 steps.

5.1 Parallel Languages vs. Input Reduction

Let us now consider the second question of this paper: can parallel language input compensate for small dataset size? We know that performance drops if we reduce the number of languages in the input mix. Now we compare this drop-off to the reduction in performance caused by reducing the overall amount of input data. This can be seen in Figure 3, which shows the performance at different training steps for models trained on decreasing amounts of data. Included for comparison are models trained on all data using all four (Span-Fre-Port-Ro), three (Span-Port-Ro), and one (Span) input languages.[9]

Figure 3: Performance of models trained on all four languages, but with varying levels of downsampled data. Included for comparison are models trained with all data on different language combinations. Plotted is the custom score over steps. Scores are calculated every 1,000 training steps. All models were run on OpenNMT-py default parameters.

First, we observe that a 10% reduction in training data (grey) does not seem to have a strong impact, as this performs mostly on par with Span-Fre-Port-Ro. Further, we can see that the 30% reduced case performs marginally better than Span-Port-Ro. This is a good result, as it suggests that we can compensate for a fair amount of missing data by using additional languages. Essentially, in this case we can observe that removing a language from the input can be equivalent to removing 30% of the input or more. Even the 50% reduced case (brown) still performs better than using just one language (Spanish only). The extreme fall-off between the 50% reduction and the 70% reduction suggests that there is some point beyond which even multiple languages cannot compensate for a lack of data points. Where this fall-off point lies exactly will likely fluctuate depending on the data set.

[9] Since Figure 2 shows that Span-Port-Ro and Span-Fre perform quite similarly, and Span-Fre-Port performs similarly to Span-Fre-Port-Ro, we remove Span-Fre and Span-Fre-Port from this graph to make it easier to read.

5.2 Potential Improvements

Chen (2018) shows that neural machine translation tasks can be greatly improved by adding glossing data to the input mix (we will gloss over the technical details of that implementation here). While there is no direct equivalent to the gloss-sentence relationship for our task, there might be a close analog for words: phonetic transcriptions. Orthography may be conservative and often misleading, but phonetic representations are not.

Meloni et al. (2019) use a phonetic dataset in their experiment, but they map from phonetic representations to phonetic representations, so their input and their target items are both represented in IPA, and this performs worse than the orthographic task. An interesting further experiment would be to blend orthographic and phonetic representations in the input, in the style of Chen (2018), and map that to an orthographic output. This would be a close analog to the sentence-gloss-to-sentence mapping that Chen (2018) reports success with.

One thing to consider is that this may not be ideal for the use-case. Phonetic datasets are not easy to produce, and orthography is often more readily available. While blending might improve performance, needing a phonetic as well as an orthographic dataset would likely raise the threshold of reproducibility for interested parties.
6 Warning Labels for Interested Linguists

There are some important aspects of this kind of approach that linguists, or community members who are interested in utilizing it for their purposes, should be aware of.

There are certain things that this type of approach can and cannot do for a community or project. The model does not so much reconstruct a word for the community as propose what the word could be, according to the data it has been fed. The model makes these recommendations on the basis of an abstract notion of what the historical phonological and morphological differences are between languages A, B, C and language D. This does not necessarily mean that the model learns or understands the historical phonological and morphological processes that separate the input sister languages from the target language. It has simply learned a way to generalize from the input to the output with some degree of accuracy. What is learned need not overlap with what linguists believe to have happened.

Therefore, this type of model will only ever generate cognates of the input; it cannot generate novel items. This is an important factor to consider for any community or linguist planning on using this approach. Consider the following case: imagine we are trying to use this approach to reconstruct English from other Germanic languages. A large part of the English lexicon is not of Germanic ancestry. However, any lexicon we would try to reconstruct using this trained algorithm would give us approximations of a Germanic-derived form for each word we are trying to reconstruct. This is a potentially undesirable effect of the way the model was trained. Linguists and interested community members need to be aware of this and implement their own quality control.

However, this approach can potentially be useful for any language project where a community and/or linguists are working with an incomplete lexicon for a language. The prerequisite for this being a useful tool in such a scenario is the assumption that the sister languages of the target language are somewhat well documented and have at least dictionaries available from which data can be extracted. A final prerequisite is the presence of minimally a small dictionary of the target language. The model would then be trained using the sister languages as input and the target language list as the target output. Once training confirms a reasonable accuracy, the model can be fed other known words in the sister languages to get predictions of those words in the target language.

After producing said output, the linguist or language community needs to subject the output to quality control and decide on a series of questions: Do the output patterns match what we know of the target language? Can we assume that these words are cognates in the target language, or is there some evidence that other forms were present? Finally, if this is used by a community to fill in empty patterns in their language, the community needs to decide whether the output is something that the community wants in their language. The algorithm is not infallible, and it only proposes. Ultimately, a language community using this tool must make the decision whether to accept or reject the algorithm's recommendations.

7 Conclusions

In this paper, we have shown that NMT frameworks can be used to predict cognates of a target language from cognates of its sister languages. We have further shown that adding or removing input languages has interesting effects on the accuracy of the model. This indicates that we can use additional sister languages to compensate for a lack of data in a given situation, though, as demonstrated in the case of Romanian, we cannot blindly add sister languages, nor assume that all additions are equally useful. This might be a promising method for situations where not a lot of data is present, but there are multiple well-documented languages related to the target language.

The next step for this line of research is to move from a proof of concept to an implementation in an actual language revitalization scenario. This is something we are currently working on. A further question that needs to be addressed is how well this approach performs with languages that exhibit a different morphology from the Romance languages, such as agglutinative and polysynthetic languages.

All code and data used for this project are open-source and can be found here, in order to reproduce these results.

Something we would like to address in these final paragraphs is that machine learning is a potential tool. Like every tool, it has its uses and cases where it is not useful. The decision to use such a tool to expand the lexicon of a language is a decision of that language community, and not of a linguist.

Acknowledgements

The author would like to thank the anonymous reviewers for their comments and give special thanks to Becky Sharp for helping with last-minute edits.

References

Yuan Lu Chen. 2018. Improving Neural Net Machine Translation Systems with Linguistic Information. PhD thesis, University of Arizona.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

Alina Maria Ciobanu and Liviu P. Dinu. 2014. Automatic detection of cognates using orthographic alignment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 99–105, Baltimore, Maryland. Association for Computational Linguistics.

Alina Maria Ciobanu and Liviu P. Dinu. 2018. Ab initio: Automatic Latin proto-word reconstruction. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1604–1614, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Leighton Delgado, Irene Navas, Conor Quinn, Tina Tarrant, Wunetu Tarrant, and Harry Wallace. 2019. Digital documentation training for Long Island Algonquian community language researchers: a new paradigm for community linguistics. Presented at the 51st Algonquian Conference, McGill University, Montréal, QC, 24–27 October.

Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2020. Glottolog 4.3. Jena.

Fabian Hirschmann, Jinseok Nam, and Johannes Fürnkranz. 2016. What makes word-level neural machine translation hard: A case study on English-German translation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3199–3208, Osaka, Japan. The COLING 2016 Organizing Committee.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9.

Siva Kalyan and Alexandre François. 2019. Freeing the comparative method from the tree model: A framework for historical glottometry. In Let's Talk About Trees: Genetic Relationships of Languages and Their Phylogenetic Representation. Cambridge University Press, online.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017a. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.

Jared Klein, Brian Joseph, and Matthias Fritz. 2017b. Handbook of Comparative and Historical Indo-European Linguistics: An International Handbook. De Gruyter, Berlin.

Wesley Y. Leonard. 2007. Miami Language Reclamation in the Home: A Case Study. PhD thesis, University of California, Berkeley.

Johann-Mattis List, Philippe Lopez, and Eric Bapteste. 2016. Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 599–605.

M. Maiden, J. Smith, and A. Ledgeway, editors. 2013. The Cambridge History of the Romance Languages, volume 2. Cambridge University Press, Cambridge.

Teresa L. McCarty and Sheilah E. Nicholas. 2014. Reclaiming indigenous languages: A reconsideration of the roles and responsibilities of schools. Review of Research in Education, 31.

Antoine Meillet. 1967. The Comparative Method in Historical Linguistics. Librairie Honoré Champion, Paris.

Carlo Meloni, Shauli Ravfogel, and Yoav Goldberg. 2019. Ab antiquo: Proto-language reconstruction with RNNs.

Graham Neubig. 2017. Neural machine translation and sequence-to-sequence models: A tutorial. CoRR, abs/1703.01619.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Taraka Rama, Johann-Mattis List, Johannes Wahle, and Gerhard Jäger. 2018. Are automatic methods for cognate detection good enough for phylogenetic reconstruction in historical linguistics? CoRR, abs/1804.05416.