
Finding Sami Cognates with a Character-Based NMT Approach

Mika Hämäläinen and Jack Rueter
Department of Digital Humanities, University of Helsinki
[email protected], [email protected]

Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 39–45, Honolulu, Hawai‘i, USA, February 26–27, 2019.

Abstract

We approach the problem of expanding the set of cognate relations with a sequence-to-sequence NMT model. The language pair of interest, Skolt Sami and North Sami, has too limited a set of parallel data for an NMT model as such. We solve this problem, on the one hand, by training the model with North Sami cognates with other Uralic languages and, on the other, by generating more synthetic training data with an SMT model. The cognates found using our method are made publicly available in the Online Dictionary of Uralic Languages.

1 Introduction

The Sami languages have received a fair share of interest in the purely linguistic study of cognate relations. Although various schools of Finno-Ugric studies have postulated contrastive interpretations of where the Sami languages should be located within the language family, there is strong evidence demonstrating regular sound correspondence between Samic and Balto-Finnic, on the one hand, and Samic and Mordvin, on the other. The importance of this correspondence is accentuated by the fact that Samic might provide insight into second-syllable vowel quality, as not all Samic-Mordvin vocabulary is attested in Balto-Finnic (cf. Korhonen, 1981). The Sami languages themselves (there are seven written languages) also exhibit regular sound correspondence, even though cognates, at times, may be opaque to the layman.

One token of cognate relation studies is the Álgu database (Kotus, 2006), which contains a set of inter-Sami cognates. Cognates have applicability in NLP research for low-resource languages as they can, for instance, be used to induce predicate-argument structures from bilingual vector spaces (Peirsman and Padó, 2010).

The main motivation for this work is to extend the known cognate information available in the Online Dictionary of Uralic Languages (Hämäläinen and Rueter, 2018). This dictionary, at its current stage, only has cognate relations recorded in the Álgu database.

Dealing with true cognates in a non-attested hypothetical proto-language presupposes adherence to a set of sound correlations posited by a given school of thought. Since Proto-Samic is one such language, we have taken the liberty of interpreting the term cognate more broadly in the context of this paper, i.e. not only words that share the same hypothesized origin in Proto-Samic are considered cognates (henceforth: true cognates), but also items that might be deemed loan words acquired from another language at separate points in the temporal-spatial dimensions. This more permissive definition makes the problem computationally easier to tackle given the limitation imposed by the scarcity of linguistic resources.

Our approach does not presuppose semantic similarity between the cognate candidates, but rather explores cognate possibilities based on grapheme changes. The key idea is that the system can learn what kinds of changes are possible and typical for North Sami cognates with other Uralic languages in general. Leveraging this more general knowledge, the model can learn the cognate features between North Sami and Skolt Sami more specifically.

We liken this problem to that of normalizing historical spelling variants. On a higher level, dealing with historical variation within one language can be seen as discovering cognates of different temporal forms of the language. Therefore, we want to apply the work done in that vein, for the first time, in the context of cognate detection. Using NMT (neural machine translation) on a character level has been shown to be the single most accurate method in normalization by a recent study with historical English (Hämäläinen et al., 2018).

In this paper, we use NMT in a similar character-level fashion for finding cognates. Furthermore, due to the limited availability of training data, we present an SMT (statistical machine translation) method for generating more data to boost the performance of the NMT model.

2 Related Work

Automatic identification of cognates has received a fair share of interest in the past from different methodological standpoints. In this section, we will go through some of these approaches.

Ciobanu and Dinu (2014) propose a method based on orthographic alignment. This means a character-level alignment of cognate pairs. After the alignment, the mismatches around the aligned pairs are used as features for the machine learning algorithm.

Another take on cognate detection is that of Rama (2016). This approach employs Siamese convolutional networks to learn phoneme-level representation and language relatedness of words. The study is based on Swadesh lists and uses hand-written phonetic features and 1-hot encoding for the phonetic representation.

Cognate detection has also been done by looking at features such as semantics, phonetics and regular sound correspondences (St. Arnaud et al., 2017). Their approach implements a general model and language-specific models using a support vector machine (SVM).

Rama et al. (2017) present an unsupervised method for cognate identification. The method consists of extracting suitable cognate pairs with normalized Levenshtein distance, aligning the pairs and counting a point-wise mutual information score for the aligned segments. New sets of alignments are generated and the process of aligning and scoring is repeated until there are no changes in the average similarity score.

3 Finding Cognates

In this section, we describe our proposed approach to finding cognates between North Sami and Skolt Sami. We present the dataset used for the training and an SMT approach to generating more training data.

3.1 The Data

Our training data consists of Álgu (Kotus, 2006), which is an etymological database of the Sami languages. From this database, we use all the cognate relations recorded for North Sami to all the other Finno-Ugric languages in the database. This produces a parallel dataset of North Sami words and their cognates in other languages.

The parallel dataset of North Sami and the other languages consists of 32905 parallel words, of which 2633 items represent the correlations between North Sami and Skolt Sami.

We find cognates for nouns, adjectives, verbs and adverbs recorded in the Giellatekno dictionaries (Moshagen et al., 2013) for North Sami and Skolt Sami. These dictionaries serve as an input for the trained NMT model and for filtering the output produced by the model.

3.2 The NMT Model

For the purpose of our research we use OpenNMT (Klein et al., 2017) to train a character-based NMT model that will take a Skolt Sami word as its input and produce a potential North Sami cognate as its output. We use the default settings for OpenNMT¹.

¹ Version from the project's master branch on 13 April 2018.

We train a sequence-to-sequence model with the list of known cognates in other languages as the source data and their North Sami counterparts as the target data. In this way, the system learns a good representation of the target language, North Sami, and can learn what kinds of changes are possible between cognates in general. Thus, the model can learn additional information about cognates that would not be present in the North Sami-Skolt Sami parallel data.

In order to make the model adapt more to the North Sami-Skolt Sami pair in particular, we continue training the model with only the North Sami-Skolt Sami parallel data for an additional 10 epochs. The idea behind this is to bring the model closer to the language pair of interest in this research, while still maintaining the additional knowledge it has learned about cognates in general from the larger dataset.

3.3 Using SMT to Generate More Data

Research in machine translation has shown that generating more synthetic parallel data that can be
noisy in the source language end but is not noisy in the target end, can improve the overall translations of an NMT model (Sennrich et al., 2015). In light of this finding, we will try a similar idea in our cognate detection task as well.

Due to the limited amount of North Sami-Skolt Sami training data available, we use SMT instead of NMT to train a model that will produce plausible but slightly irregular Skolt Sami cognates for the word list of North Sami words obtained from the Giellatekno dictionaries.

We use the Moses (Koehn et al., 2007) baseline² to train a translation model in the opposite direction of the NMT model with the same parallel data. This means translating from North Sami to Skolt Sami. We use the same parallel data as for the NMT model, meaning that on the source side, we have North Sami and on the target side we have all the possible cognates in other languages. The parallel data is aligned with GIZA++ (Och and Ney, 2003).

² As described in http://www.statmt.org/moses/?n=moses.baseline

Since we are training an SMT model, there are two ways we can make the noisy target of all the other languages resemble Skolt Sami more. One is by using a language model. For this, we build a 10-gram language model with KenLM (Heafield et al., 2013) from Skolt Sami words recorded in the Giellatekno dictionaries.

The other way of making the model more aware of Skolt Sami in particular is to tune the SMT model after the initial training. For the tuning, we use the Skolt Sami-North Sami parallel data exclusively so that the SMT model will go more towards Skolt Sami when producing cognates.

We use the SMT model to translate all of the words extracted from the North Sami dictionary into Skolt Sami. This results in a parallel dataset of real, existing North Sami words and words that resemble Skolt Sami. We then use this data to continue the training of the previously explained NMT model for 10 additional epochs.

3.4 Using the NMT Models

We use both of the NMT models, i.e. the one without SMT generated additional data and the one with the data, separately to assess the difference in their performance. We feed in the extracted Skolt Sami words from the dictionary and translate each word to a North Sami word as it would look if there were a cognate for that word in North Sami. The approach produces many non-words, which we filter out with the North Sami dictionary. The translated words that are actually found in the North Sami dictionary are considered to be potential cognates found by the method.

4 Results and Evaluation

In this section, we present the results of both of the NMT models, the one without SMT generated data and the one with generated data. The results shown in Table 1 indicate that the model with the additional SMT generated data outperformed the other model. The evaluation is based on 200 randomly selected cognate pairs output by the models. These pairs have then been checked by an expert linguist according to the principles outlined in Section 4.1.

            NMT      NMT + SMT
accuracy    67.5%    83%

Table 1: Percentage of correctly found cognates

Table 2 gives more insight into the number of cognates found and how they are represented in the original Álgu database. The results show that while the models have poor performance in finding the cognates in the training data, they work well in extending the cognates outside of the known cognate list.

                                                           NMT    NMT + SMT
Same as in Álgu                                             75       61
North Sami word in Álgu but no cognates with Skolt Sami    211      226
North Sami word in Álgu with other Skolt Sami cognates     646      577
North Sami word not in Álgu                                848      936
Cognates found in total                                   1780     1800

Table 2: Distribution of cognates in relation to Álgu

As one of the purposes of our work is to help evaluate and develop etymological research, we will conduct a more qualitative analysis of the correctly and incorrectly identified cognates for the better working model, i.e. the one with SMT generated data. This means that the developers should
be aware not only of the comparative-linguistic rules etymologists use when assessing the regularity of cognate candidates but semantics as well.

In an introduction to Samic language history (Korhonen, 1981: 110–114), the proto-language is divided into 4 separate phases. The first phase involves vowel changes in the first and second syllables (ä–e » e–e, u–e » o–e, e–ä » e–a˙, i–e » e–e, etc.) followed by vowel rotation in the first syllable dependent on the quality of the second-syllable vowel (e–e » e–e but e–ä » ε–a˙). The second phase entails the loss of quantitative distinction for first-syllable vowels such that high vowels are normalized as short, and non-high first-syllable vowels are normally long (e–e » e–e¯, ε–a˙ » ε¯–e¯, etc.). The third phase involves a slight vowel shift in the first syllable and vowel split in the second. And finally, the fourth phase introduces diphthongization of non-high vowels in the first syllable (e–e¯ » ie–e, ε¯–e¯ » eä–e¯, etc.).

In Table 3, below, we provide an approximation of a few hypothesized sound changes for three words that are attested in Balto-Finnic and Mordvin alike. *käte ‘hand; arm’ has true cognates in Finnish käsi, Northern Sami giehta, Skolt Sami ǩiõtt, Erzya ked’ and Moksha käd’, *tule ‘fire’ is represented by Finnish tuli, Northern Sami dolla, Skolt Sami toll, and Mordvin tol, while *pesä ‘nest’ is attested in Finnish pesä, Northern Sami beassi, Skolt Sami pie′ss, Erzya pize, and Moksha piza. The Roman numerals in the table correspond to four separate phases in Proto-Samic.

               I       II        III        IV
‘hand; arm’    käte    kete      kete¯      kiete
‘fire’         tule    toleˆ     toleˆ      toleˆ
‘nest’         pesä    pˆεsˆa˙   pˆε¯sˆe¯   peäsˆe¯

Table 3: Illustration of some vowel correlations in 4 phases of Proto-Samic

In the evaluation, our attention was drawn to the adherence of 143 items to accepted sound correlations, while there were 57 candidates that failed in this respect (cf. Korhonen, 1981; Lehtiranta, 2001; Aikio, 2009). Irregular sound correlation can be exemplified in the North Sami word bierdna ‘bear’ and its counterpart, the Skolt Sami word peä′rnn ‘bear cub’ (the ′ prime indicates palatalization in Skolt Sami orthography). The former appears to represent the word type found in ‘hand’ North Sami giehta and Skolt Sami ǩiõtt, whereas the latter represents the word type found in ‘nest’ North Sami beassi and Skolt Sami pie′ss and ‘swamp’ North Sami jeaggi and Skolt Sami jeä′ǧǧ. Hence, on the basis of the North Sami word bierdna ‘bear’, one would posit a Skolt Sami form *piõrnn, whereas the Skolt Sami word peä′rnn ‘bear cub’ would presuppose a North Sami form *beardni. Both types have firm representation in both languages, so it would seem that these borrowings have entered the languages at separate points in the spatio-temporal dimensions.

4.1 Analysis of the Correct Cognates

Correct cognates were selected according to two simple principles of similarity. On the one hand, there was the principle of conceptual similarity in their referential denotations (i.e., this refers to future work in semantic relations). On the other hand, a feasible cognate pair candidate should demonstrate adherence or near adherence to accepted sound law theory. The question of adherence versus near adherence indicated here can be directed to concepts of sound law theory, where conceivable irregularities may further be attributed to points in spatio-temporal dimensions (i.e., when and where a particular word was introduced into the lexica of the two languages involved in the investigation).

In the investigation of 200 random cognate pair candidates, 166 candidates exhibited conceptual similarity, which in some instances surpassed what might have been discovered using a bilingual dictionary. Of the 166 acceptable cognate pairs, 131 demonstrated regular correlation to received sound law theory.

Adherence to concepts of sound law theory can be observed in the alignment of the North Sami words čuoika ‘mosquito’ and ađa ‘marrow’ with their Skolt Sami counterparts čuõškk and õõđ, respectively. Although these words may appear opaque to the layman, and thus this alignment might be deemed dubious at first, awareness of cognate candidates in the Erzya Mordvin śeśke ‘mosquito’ and ud′em ‘marrow’ helps to alleviate initial misgivings.

As may be observed above, North Sami frequently has two-syllable words where Skolt Sami attests to single-syllable words. This relative length correlation between North Sami and Skolt Sami is described through measurement in
prosodic feet (cf. Koponen and Rueter, 2016). While North Sami exhibits retention of the theoretical stem-final vowel in two-syllable words, Skolt Sami appears to have lost it. In fact, stem-final vowels in Skolt Sami are symptomatic of proper nouns and (derivations). In contrast, two-syllable words with consonant-final stems appear in both North Sami and Skolt Sami, which means we can expect a number of basic verbs attesting to two-syllable infinitives (North Sami bidjat ‘put’ and Skolt Sami piijjâd) and contract-stem nouns (North Sami guoppar and Skolt Sami kuõbbâr ‘mushroom’). Upon inspection of longer words, it will appear that Skolt Sami words are at least one syllable shorter than their cognates in North Sami, which can be attributed to variation in foot development.

Cognate word correlations between North Sami and Skolt Sami can be approached by counting syllables in the lemmas (dictionary forms). Through this approach, we can attribute some word lengths automatically to parts of speech, i.e. there is only one single-syllable verb in Skolt Sami, lee′d ‘be’, and it correlates to a single-syllable verb in North Sami, leat. Other verbs are therefore two or more syllables in length in both languages. While two-syllable verbs in North Sami correlate with two-syllable verbs in Skolt Sami, multiple-syllable verbs in North Sami usually correlate to Skolt Sami counterparts in an X<=>X-1 relationship (number of syllables in the language forms, respectively), where North Sami is one orthographical syllable longer.

Short word pairs demonstrate a clear correlation between two-syllable base words in North Sami and single-syllable base words in Skolt Sami, which with the exception of 4 words were all nouns (54 all told). The high concentration of noun attestation for single-syllable cognate nouns in the two languages of investigation is counterbalanced by the representation of verbs in other word-length correlation groups.

Cognate pairs where both North Sami and Skolt Sami attested to two-syllable lemmas were over-represented by verbs. There were 47 verbs, 22 nouns, 10 adjectives and 1 adverb. This result is symptomatic of lemma-based research. Surprisingly enough, however, the three-to-two syllable correlation between North Sami and Skolt Sami also showed a similar representation: verbs (13), nouns (2) and adverbs (1).

There was one attested correlation for a 5-syllable word eŋgelasgiella ‘English language’ in North Sami and its 3-syllable counterpart in Skolt Sami eŋgglõsǩiõll. Since we are looking at a compound word with 3-to-2 and 2-to-1 correlations, we can assume that our model is recognizing individual adjacent segments within a larger unit.

Correct cognates do not necessarily require etymologically identical source forms or structure. The recognized cognate pairs represent both recent loan words or possible irregularities in sound law theory (35) and presumably older mutual lexicon (131) (see 4.2, below). They also attest to differing structure and length (i.e., this may also include derivation and compounding). While a majority of the cognate candidate pairs linked words sharing the same derivational level, 11 represented instances of additional derivation in either the North Sami or Skolt Sami word, and 3 recognized instances where one of the languages was represented by a compound word.

4.2 Analysis of the Incorrect Cognates

Incorrect cognates often offer vital input for cognate detection development. There are, of course, word pairs that diverge in regard to both accepted sound law theory and semantic cohesion. These pairs have not yet been applied to development. In contrast, word pairs that appear to adhere to sound law theory yet are not matched semantically might be regarded as false friends. These pairs can be potentially useful in further development.

Of 34 semantically non-feasible candidates, 12 stood out as false friends. One such example pair is observed in the North Sami álgu ‘beginning’ and the Skolt Sami älgg ‘piece of firewood’. These two words, it should be noted, can be associated with the Finnish cognates alku ‘beginning’ and halko ‘piece of split firewood [NB! there is a loss of the word initial h]’, respectively. Since the theoretically expected vowel equivalent of the first syllable a in Finnish is uo and ue in North Sami and Skolt Sami, respectively, we might assume that neither word comes from a mutual Samic-Finnic proto-language.

We do not know to what extent random selection has affected our results. Had the first North Sami noun álgu been replaced with its paradigmatic verb álgit ‘begin’, the Skolt Sami ä′lǧǧed, also translated as ‘begin’, would have shown direct correlation for á and ä in North Sami and
Skolt Sami, respectively. The second noun, meaning ‘piece of split firewood’, is actually hálgu in North Sami, which simply demonstrates h retention and the possible recognition problems faced in the absence of semantic knowledge.

4.3 Summary of the Analyses

The cognate candidates were evaluated according to two criteria: one query checked for conceptual similarity (correct vs incorrect), and the other checked for regularity according to received sound law theory. While the majority (65%) of the word pairs evaluated were both conceptually similar and correlated to received sound law theory, an additional 17% of the candidates were conceptually similar but represented irregular sound correlation, as indicated by the figures 131 and 35 below, respectively.

             Similar    Dissimilar
Regular      131        12
Irregular    35         22

Table 4: Cognate candidate evaluation

The presence of an 11% negative score for both sound law regularity and conceptual similarity indicates an improvement requirement of at least 6% before the machine can be considered relevant (95%). The 6% attestation of false friend discovery, however, displays an already existing accuracy in our algorithm.

5 Conclusions and Future Work

In this paper, we have shown that using a character-based NMT is a feasible way of expanding a list of cognates by training the model mostly on the cognate pairs for North Sami words in languages other than Skolt Sami. Furthermore, we have shown that an SMT model can be used to generate synthetic parallel data by pushing the model more towards the direction of Skolt Sami by introducing a Skolt Sami language model and tuning the model with Skolt Sami-North Sami parallel data.

In our evaluation, we have only considered the best cognate produced by the NMT model with the idea of one-to-one mapping. However, it is possible to make the NMT model output more than one possible translation. In the future, we can conduct more evaluation for a list of top candidates to see whether the model is able to find more than one cognate for a given word and whether the overall recall can be improved for the words where the top candidate has been rejected by the dictionary check as a non-word.

We have currently limited our research to cognates between Skolt Sami and North Sami where the translation direction of the NMT model has been towards North Sami. An interesting future direction would be to change the translation direction. In addition to that, we are also interested in trying this method out on other languages recorded in the Álgu database.

We are also interested in conducting research that is more linguistic in its nature based on the cognate list produced in this paper. This will shed more light on the current linguistic knowledge of cognates in the Sami languages. The current results of the better working NMT model are released in the Online Dictionary for Uralic Languages³.

³ http://akusanat.com/

References

Ante Aikio. 2009. The Saami Loan Words in Finnish and Karelian, first edition. Faculty of Humanities of the University of Oulu, Oulu. Dissertation.

Alina Maria Ciobanu and Liviu P Dinu. 2014. Automatic detection of cognates using orthographic alignment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 99–105.

Mika Hämäläinen and Jack Rueter. 2018. Advances in Synchronized XML-MediaWiki Dictionary Development in the Context of Endangered Uralic Languages. In Proceedings of the Eighteenth EURALEX International Congress, pages 967–978.

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2018. Normalizing early english letters to present-day english spelling. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 87–96.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 690–696.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proc. ACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.

Eino Koponen and Jack Rueter. 2016. The first complete scientific skolt sami grammar in english. FUF 63: 254344, pages 254–266. Pp. 261–265.

Mikko Korhonen. 1981. Johdatus lapin kielen historiaan, volume 370 of Suomalaisen Kirjallisuuden Seuran toimituksia. Suomalaisen Kirjallisuuden Seura, Helsinki.

Kotus. 2006. Álgu-tietokanta. Saamelaiskielten etymologinen tietokanta. http://kaino.kotus.fi/algu/.

Juhani Lehtiranta. 2001. Yhteissaamelainen sanasto, second edition, volume 200 of Suomalais-Ugrilaisen Seuran toimituksia. Suomalais-Ugrilainen Seura, Helsinki.

Sjur N. Moshagen, Tommi A. Pirinen, and Trond Trosterud. 2013. Building an open-source development infrastructure for language technology projects. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013); May 22-24; 2013; Oslo University; Norway. NEALT Proceedings Series 16, 85, pages 343–352. Linköping University Electronic Press.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Yves Peirsman and Sebastian Padó. 2010. Cross-lingual induction of selectional preferences with bilingual vector spaces. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 921–929. Association for Computational Linguistics.

Taraka Rama. 2016. Siamese convolutional networks for cognate identification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1018–1027.

Taraka Rama, Johannes Wahle, Pavel Sofroniev, and Gerhard Jäger. 2017. Fast and unsupervised methods for multilingual cognate clustering. arXiv preprint arXiv:1702.04938.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Adam St. Arnaud, David Beck, and Grzegorz Kondrak. 2017. Identifying cognate sets across dictionaries of related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2519–2528.
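
Illustrative sketch (not part of the original system). The character-level input of Section 3.2 and the dictionary filter of Section 3.4 might be approximated as below. The space-separated character format, the function names, the toy lexicon and the example model outputs are all assumptions made for illustration; the paper does not state its exact preprocessing, and the outputs shown are invented rather than produced by the trained model.

```python
import unicodedata
from typing import Dict, Set


def to_char_tokens(word: str) -> str:
    """Return the word as space-separated characters (assumed model input format)."""
    # NFC keeps letters such as "ǩ" or "õ" as single code points, so a
    # diacritic is not split off from its base letter.
    return " ".join(unicodedata.normalize("NFC", word))


def from_char_tokens(line: str) -> str:
    """Join a space-separated character sequence back into a word."""
    return "".join(line.split())


def filter_candidates(outputs: Dict[str, str], lexicon: Set[str]) -> Dict[str, str]:
    """Keep only the outputs that are attested words in the target lexicon."""
    return {src: tgt for src, tgt in outputs.items() if tgt in lexicon}


# Hypothetical character-level outputs for three Skolt Sami inputs
# (invented for illustration, not real system output).
raw_outputs = {
    "toll": "d o l l a",
    "pie′ss": "b e a s s i",
    "kuõbbâr": "g u o b b r",  # invented non-word, rejected by the filter
}
# Toy stand-in for the North Sami dictionary.
north_sami_lexicon = {"dolla", "beassi", "guoppar"}

decoded = {src: from_char_tokens(tgt) for src, tgt in raw_outputs.items()}
candidates = filter_candidates(decoded, north_sami_lexicon)
print(candidates)
```

Filtering against a lexicon trades recall for precision: a correct cognate whose form is missing from the dictionary is discarded together with the non-words, which is why evaluating more than the single best output is listed as future work in Section 5.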
