Finding Sami Cognates with a Character-Based NMT Approach

Finding Sami Cognates with a Character-Based NMT Approach Mika Hämäläinen Jack Rueter Department of Digital Humanities Department of Digital Humanities University of Helsinki University of Helsinki [email protected] [email protected] Abstract The main motivation for this work is to extend the known cognate information available in the We approach the problem of expanding the Online Dictionary of Uralic Languages (Hämäläi- set of cognate relations with a sequence-to- nen and Rueter, 2018). This dictionary, at its cur- sequence NMT model. The language pair of rent stage, only has cognate relations recorded in interest, Skolt Sami and North Sami, has too limited a set of parallel data for an NMT model the Álgu database. as such. We solve this problem on the one Dealing with true cognates in a non-attested hy- hand, by training the model with North Sami pothetical proto-language presupposes adherence cognates with other Uralic languages and, on to a set of sound correlations posited by a given the other, by generating more synthetic train- school of thought. Since Proto-Samic is one such ing data with an SMT model. The cognates language, we have taken liberties to interpret the found using our method are made publicly term cognate in the context of this paper more available in the Online Dictionary of Uralic broadly, i.e. not only words that share the same Languages. hypothesized origin in Proto-Samic are considered 1 Introduction cognates (hence forth: true cognates), but also items that might be deemed loan words acquired Sami languages have received a fair share of in- from another language at separate points in the terest in purely linguistic study of cognate rela- temporal-spatial dimensions. This more permis- tions. Although various schools of Finno-Ugric sive definition makes it possible to tackle the prob- studies have postulated contrastive interpretations lem computationally easier given the limitation of where the Sami languages should be located imposed by the scarcity of linguistic resources. within the language family, there is strong ev- Our approach does not presuppose a seman- idence demonstrating regular sound correspon- tic similarity of the meaning of the cognate can- dence between Samic and Balto-Finnic, on the one didates, but rather explores cognate possibilities hand, and Samic and Mordvin, on the other. The based on grapheme changes. The key idea is that importance of this correspondence is accentuated the system can learn what kinds of changes are by the fact that the Samic might provide insight possible and typical for North Sami cognates with for second syllable vowel quality, as not all Samic- other Uralic languages in general. Taking leverage Mordvin vocabulary is attested in Balto-Finnic (cf. from this more general level knowledge, the model Korhonen, 1981). The Sami languages themselves can learn the cognate features between North Sami (there are seven written languages) also exhibit and Skolt Sami more specifically. regular sound correspondence, even though cog- We assimilate this problem with that of normal- nates, at times, may be opaque to the layman. ization of historical spelling variants. On a higher One token of cognate relation studies is the Álgu level, historical variation within one language can database (Kotus, 2006), which contains a set of be seen as discovering cognates of different tem- inter-Sami cognates. Cognates have applicabil- poral forms of the language. Therefore, we want ity in NLP research for low-resource languages to take the work done in that vein for the first time as they can, for instance, be used to induce the in the context of cognate detection. Using NMT predicate-argument structures from bilingual vec- (neural machine translation) on a character level tor spaces (Peirsman and Padó, 2010). has been shown to be the single most accurate 39 Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 39–45, Honolulu, Hawai‘i, USA, February 26–27, 2019. method in normalization by a recent study with 3.1 The Data historical English (Hämäläinen et al., 2018). Our training data consists of Álgu (Kotus, 2006), In this paper, we use NMT in a similar character which is an etymological database of the Sami lan- level fashion for finding cognates. Furthermore, guages. From this database, we use all the cognate due to the limited availability of training data, we relations recorded for North Sami to all the other present an SMT (statistical machine translation) Finno-Ugric languages in the database. This pro- method for generating more data to boost the per- duces a parallel dataset of North Sami words and formance of the NMT model. their cognates in other languages. The North Sami to other languages parallel 2 Related Work dataset consists of 32905 parallel words, of which 2633 items represent the correlations between Automatic identification of cognates has received North Sami and Skolt Sami. a fair share of interest in the past from different We find cognates for nouns, adjectives, verbs methodological stand points. In this section, we and adverbs recorded in the Giellatekno dictionar- will go through some of these approaches. ies (Moshagen et al., 2013) for North Sami and Ciobanu and Dinu (2014) propose a method Skolt Sami. These dictionaries serve as an input based on orthographic alignment. This means a for the trained NMT model and for filtering the character level alignment of cognate pairs. After output produced by the model. the alignment, the mismatches around the aligned pairs are used as features for the machine learning 3.2 The NMT Model algorithm. For the purpose of our research we use OpenNMT Another take on cognate detection is that of (Klein et al., 2017) to train a character based NMT Rama (2016). This approach employs Siamese model that will take a Skolt Sami word as its in- convolutional networks to learn phoneme level put and produce a potential North Sami cognate as representation and language relatedness of words. its output. We use the default settings for Open- They based the study on Swadesh lists and used NMT1. hand-written phonetic features and 1-hot encoding We train a sequence to sequence model with the for the phonetic representation. list of known cognates in other languages as the Cognate detection has also been done by look- source data and their North Sami counterparts as ing at features such as semantics, phonetics and the target data. In this way, the system learns a regular sound correspondences (St. Arnaud et al., good representation of the target language, North 2017). Their approach implements a general Sami, and can learn what kind of changes are model and language specific models using support possible between cognates in general. Thus, the vector machine (SVM). model can learn additional information about cog- Rama et al. (2017) present an unsupervised nates that would not be present in the North Sami- method for cognate identification. The method Skolt Sami parallel data. consists of extracting suitable cognate pairs with In order to make the model adapt more to normalized Levenshtein distance, aligning the the North Sami-Skolt Sami pair in particular, we pairs and counting a point-wise mutual informa- continue training the model with only the North tion score for the aligned segments. New sets Sami-Skolt Sami parallel data for an additional of alignments are generated and the process of 10 epochs. The idea behind this is to bring the aligning and scoring is repeated until there are no model closer to the language pair of interest in changes in the average similarity score. this research, while still maintaining the additional knowledge it has learned about cognates in general from the larger dataset. 3 Finding Cognates 3.3 Using SMT to Generate More Data In this section, we describe our proposed approach in finding cognates between North Sami and Skolt Research in machine translation has shown that Sami. We present the dataset used for the training generating more synthetic parallel data that can be and an SMT approach in generating more training 1Version from the project’s master branch on the 13 April data. of 2018 40 noisy in the source language end but is not noisy there were a cognate for that word in North Sami. in the target end, can improve the overall transla- The approach produces many non-words which tions of an NMT model (Sennrich et al., 2015). In we filter out with the North Sami dictionary. The light of this finding, we will try a similar idea in resulting list of translated words that are actually our cognate detection task as well. found in the North Sami dictionary are considered Due to the limited amount of North Sami-Skolt to be potential cognates found by the method. Sami training data available, we use SMT instead of NMT to train a model that will produce plausi- 4 Results and Evaluation ble but slightly irregular Skolt Sami cognates for In this section, we present the results of both of the word list of North Sami words obtained from the NMT models, the one without SMT generated the Giellatekno dictionaries. data and the one with generated data. The results We use Moses (Koehn et al., 2007) baseline2 to shown in Table 1 indicate that the model with the train a translation model to the opposite direction additional SMT generated data outperformed the of the NMT model with the same parallel data. other model. The evaluation is based on a 200 ran- This means translating from North Sami to Skolt domly selected cognate pairs output by the mod- Sami. We use the same parallel data as for the els.

Load more