Morphosyntactic Disambiguation in an Endangered Language Setting

Jeff Ens♠, Mika Hämäläinen♦, Jack Rueter♦, Philippe Pasquier♠
♠School of Interactive Arts & Technology, Simon Fraser University
♦Department of Digital Humanities, University of Helsinki
[email protected], [email protected], [email protected], [email protected]

Abstract

Endangered Uralic languages present a high variety of inflectional forms in their morphology. This results in a high number of homonyms in inflections, which introduces a great deal of morphological ambiguity into sentences. Previous research has employed constraint grammars to address this problem; however, CGs are often unable to fully disambiguate a sentence, and their development is labour intensive. We present an LSTM based model for automatically ranking morphological readings of sentences based on their quality. This ranking can be used to evaluate the existing CG disambiguators or to directly morphologically disambiguate sentences. Our approach works on a morphological abstraction and can be trained with a very small dataset.

1 Introduction

Most of the languages in the Uralic language family are endangered. The low number of speakers, the limited linguistic resources and the vast morphological complexity typical of these languages make their computational processing quite a challenge. Over the past years, a great deal of work related to language technology for endangered Uralic languages has been released openly on the Giellatekno infrastructure (Moshagen et al., 2014). This includes lexicographic resources, FST (finite-state transducer) based morphological analyzers and CG (constraint grammar) disambiguators.

Despite being a great resource, the Giellatekno infrastructure has tools and data originating from different sources and authors. Recent research conducted with the resources for Komi-Zyrian, Skolt Sami, Erzya and Moksha has identified a need for proper evaluation of the resources available in the infrastructure, as they are not free of errors (Hämäläinen et al., 2018; Hämäläinen, 2018).

This paper presents a method to learn the morphosyntax of a language on an abstract level by learning patterns of possible morphologies within sentences. The resulting models can be used to evaluate the existing rule-based disambiguators, as well as to directly disambiguate sentences. Our work focuses on languages belonging to the Finno-Permic language family: Finnish (fin), Northern Sami (sme), Erzya (myv) and Komi-Zyrian (kpv). The vitality of the latter three languages is classified as definitely endangered (Moseley, 2010).

2 Motivation

There are two main factors motivating this research. First of all, data is often very scarce when dealing with endangered Uralic languages. Apart from Northern Sami, endangered Uralic languages may have a very small set of annotated samples at best, and no gold standard data at worst. As a result, evaluating disambiguated sentences can often only be done by consulting native speakers of the language or by relying on the researcher's own linguistic intuition.

Secondly, canonical approaches involving Part-of-Speech (POS) tagging will not suffice in this context due to the rich morphology of Uralic languages. For example, the Finnish word form voita can be lemmatized as voi (the singular partitive of butter), vuo (the plural partitive of fjord), voittaa (the imperative of win) or voitaa¹ (the connegative form of spread butter).

The approach described in this paper addresses these two issues, as we use a generalized sentence representation based on morphological tags to capture morphological patterns. Moreover, our models can be trained on low-resource languages, and models that have been trained on high-resource languages can be applied to low- or no-resource languages with reasonable success.

¹A non-standard form produced by the Finnish analyzer.

3 Related Work

The problem of morphological tagging in the context of low-resource languages has been approached using parallel text (Buys and Botha, 2016). From the aligned parallel sentences, their Wsabie-based model learns to tag the low-resource language based on the morphological tags of the high-resource language sentences in the training data. A limitation of this approach is that it depends on the morphological relatedness of the high-resource and low-resource languages.

A method for POS tagging of low-resource languages has been proposed by Andrews et al. (2017). They use a bilingual dictionary between a low- and a high-resource language together with monolingual data to build cross-lingual word embeddings. The POS tagger is trained as an LSTM neural network, and their approach performs consistently better than the other benchmarks they report.

Lim et al. (2018) present work on syntactically parsing Komi-Zyrian and Northern Sami using multilingual word embeddings. They use pretrained word embeddings for Finnish and Russian, and train word embeddings for the low-resource languages from small corpora. These individual word embeddings are then projected into a single space by using bilingual dictionaries. The parser was implemented as an LSTM model, and it performed better in a POS tagging task than in predicting syntactic relations. The key finding for our purposes is that including a related high-resource language (Finnish in this case) improved the accuracy.

DsDs (Plank and Agić, 2018) is a neural network based POS tagger for low-resource languages. The idea is to use a bi-LSTM model to project POS tags from one language to another with the help of word embeddings and lexical information. In a low-resource setting, they find that adding word embeddings boosts the model, but lexical information can also help to a smaller degree.

Much of the related work deals with POS tagging. However, as the Uralic languages are morphologically rich, full morphological disambiguation is needed in order to improve the performance of higher-level NLP tools. In addition, we do not want to assume bilingual parallel data or access to word embeddings, as we want our approach to be applicable to truly endangered languages with extremely limited resources.

4 The Rule-based Tools and Data

We use the morphological FST analyzers in the Giellatekno infrastructure to produce morphological readings, accessed through UralicNLP (Hämäläinen, 2019). They operate on the word level: for an input word form, they produce all the possible lemmas together with their parts-of-speech and morphological readings, without any weights to indicate which reading is the most probable one.

The existing CG disambiguators take the morphological readings produced by the FST for each word in a sentence and apply their rules to remove impossible readings. In some cases, a CG disambiguator might produce a fully disambiguated sentence; however, these models are often unable to resolve all morphological ambiguity.

In this paper, we use the UD treebanks for our languages of interest. For Finnish, we use the Turku Dependency Treebank (Haverinen et al., 2014) with 202K tokens (14K sentences). The Northern Sami treebank (Sheyanova and Tyers, 2017) is the largest one for the endangered languages, with 26K tokens (3K sentences). For Komi-Zyrian, we use the Komi-Zyrian Lattice Treebank (Partanen et al., 2018) of 2K tokens (189 sentences), representing standard written Komi. The Erzya treebank (Rueter and Tyers, 2018) is the second largest endangered-language treebank we use in our research, with 15K tokens (1,500 sentences).
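As a concrete illustration of the analyzer output described above, a minimal sketch of querying readings through UralicNLP might look as follows; it assumes the Finnish models have been downloaded with uralicApi.download("fin"), and the exact reading strings depend on the installed transducers.

# Listing the unweighted FST readings of the ambiguous word form "voita"
# discussed in Section 2 (a sketch, not the exact pipeline used in the paper).
from uralicNLP import uralicApi

for analysis in uralicApi.analyze("voita", "fin"):
    # Each item corresponds to one possible reading (lemma plus morphological tags).
    print(analysis)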
5 Sentence Representation

We represent each word as a non-empty set of morphological tags. This representation contains neither the word form itself nor its lemma, as we aim for a more abstract morphological representation. It is meant to capture the possible morphologies following each other in a sentence, so that the model can learn morphosyntactic inter-dependencies such as agreement rules. This level of abstraction makes it possible to apply the learned structures to other morphosyntactically similar languages.

As we are looking into morphosyntax, we train our model only with the morphosyntactically relevant morphological tags: case, number, voice, mood, person, tense, connegative and verb form. This means that morphological tags such as clitics and derivational morphology are not taken into account. We also ignore the dependency information in the UD treebanks, as dependencies are not available for text and languages outside of the treebanks, there being no robust dependency parsers for many of the endangered Uralic languages.

Each sentence is simply a sequence of morphological tag sets, represented as a sequence of integers with a special token SP demarcating spaces between words. For example, the sentence "Nyt on lungisti ottamisen aika." (now it is time to relax) is encoded as [150, SP, 121, 138, 168, 178, 205, 214, 221, SP, 150, SP, 25, 138, 158, SP, 31, 138, 158, SP, 165].
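The encoding itself can be sketched as follows; the tag names and integer ids below are illustrative only, as the actual ids in the paper are derived from the treebank data.

# A sketch of the sentence encoding in Section 5: each word is a set of
# morphosyntactically relevant tags, and a sentence becomes a flat sequence of
# integer ids with a special SP id marking word boundaries.
from typing import Dict, FrozenSet, List

def encode(sentence: List[FrozenSet[str]], vocab: Dict[str, int]) -> List[int]:
    ids: List[int] = []
    for i, tags in enumerate(sentence):
        if i > 0:
            ids.append(vocab["SP"])              # word boundary
        ids.extend(sorted(vocab[tag] for tag in tags))
    return ids

# Toy vocabulary and a two-word example (illustrative tag names and ids).
vocab = {"SP": 1, "Case=Nom": 2, "Number=Sing": 3, "Person=3": 4, "Tense=Pres": 5}
sentence = [frozenset({"Case=Nom", "Number=Sing"}),
            frozenset({"Person=3", "Number=Sing", "Tense=Pres"})]
print(encode(sentence, vocab))   # [2, 3, 1, 3, 4, 5]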

Equation 1 is used to measure the distance between two sentences a and b containing n words each, where x_i denotes the set of morphological tags associated with the ith word of x, ||·|| denotes the number of elements in a set, and △ denotes the symmetric difference of two sets:

    distance(a, b) = Σ_{i=1}^{n} ||a_i △ b_i||    (1)

This distance measure is used to approximate the quality of different readings, based on the assumption that the quality of a reading decreases as its distance from the gold standard sentence increases.
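A direct implementation of this distance is a one-liner; the sketch below (ours, with illustrative tag names) computes Equation 1 for two readings of a two-word sentence.

# Equation (1): sum over word positions of the size of the symmetric difference
# between the two readings' tag sets.
from typing import FrozenSet, Sequence

def distance(a: Sequence[FrozenSet[str]], b: Sequence[FrozenSet[str]]) -> int:
    assert len(a) == len(b), "sentences must contain the same number of words"
    return sum(len(ai ^ bi) for ai, bi in zip(a, b))

gold = [frozenset({"Case=Par", "Number=Sing"}), frozenset({"Mood=Ind", "Tense=Pres"})]
alt  = [frozenset({"Case=Par", "Number=Plur"}), frozenset({"Mood=Ind", "Tense=Pres"})]
print(distance(gold, alt))   # 2: the number tags differ at the first word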
6 Model

We implement our models using Keras (Chollet et al., 2015); they are trained to rank two sentences encoded as described in Section 5. The model is comprised of a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) layer π and a feed-forward layer φ. Given two sentences a and b, the LSTM layer is used to produce the n-dimensional vectors π(a) and π(b), which are concatenated and passed through the feed-forward layer to produce a single scalar value φ(π(a), π(b)) indicating the preferred sentence.

We train each model with early stopping based on the validation accuracy with a patience of 10 epochs. We use the Adam optimizer (Kingma and Ba, 2014), train the model with batches of size 32, and set n = 128.
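A minimal sketch of such a pairwise ranker in today's Keras API is given below. The embedding layer, the sigmoid output and the binary cross-entropy objective are our assumptions, as are the vocabulary size and sequence length; the paper specifies the LSTM layer π (which we assume is shared between the two inputs), the feed-forward layer φ, Adam, batches of size 32, n = 128 and early stopping with a patience of 10.

# Sketch of the pairwise sentence ranker from Section 6 (not the authors' code).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 300   # number of distinct tag ids plus SP (assumption)
MAX_LEN = 60       # padded sequence length (assumption)
N = 128            # LSTM size, as in the paper

# Shared layers: the same embedding and LSTM (pi) encode both sentences.
embed = layers.Embedding(VOCAB_SIZE, 32, mask_zero=True)
pi = layers.LSTM(N)

inp_a = keras.Input(shape=(MAX_LEN,), dtype="int32")
inp_b = keras.Input(shape=(MAX_LEN,), dtype="int32")
enc_a, enc_b = pi(embed(inp_a)), pi(embed(inp_b))

# Feed-forward layer phi over the concatenated encodings yields a single scalar,
# read here as the probability that sentence a is preferred over sentence b.
score = layers.Dense(1, activation="sigmoid")(layers.Concatenate()([enc_a, enc_b]))
model = keras.Model([inp_a, inp_b], score)
model.compile(optimizer=keras.optimizers.Adam(), loss="binary_crossentropy",
              metrics=["accuracy"])

# Training setup matching the paper: batch size 32 and early stopping on the
# validation accuracy with a patience of 10 epochs (data arrays are placeholders).
early = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10,
                                      restore_best_weights=True)
# model.fit([train_a, train_b], train_y, validation_data=([val_a, val_b], val_y),
#           batch_size=32, epochs=200, callbacks=[early])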

7 Evaluation

We produce all the morphological readings for each word in a gold standard sentence (GSS) using the FST analyzers, and construct incorrect sentences (INS) of varying quality by randomly selecting a reading for each word. In order to provide a detailed evaluation, we categorize each sentence based on its distance from the GSS using the ranges [0, 1), [1, 10), [10, 20) and [20, +∞), which we will refer to as categories G, 1, 2 and 3. By construction, category G only contains GSS. These ranges were chosen so that each bin contains approximately the same number of sentences. We measure the accuracy of the model for each of the (4 choose 2) = 6 possible types of comparisons between sentence categories. To create training, validation and testing data, the set of GSS is randomly split before generating INS. In cases where two languages are used to train the model, the training data consists of an even number of comparisons from each language, to ensure that a larger language does not dominate a smaller one.

Since we are interested in exploring the viability of using high-resource languages to disambiguate low-resource languages, we evaluate the models by training on each language and each possible pair of languages, resulting in 4 + (4 choose 2) = 10 distinct models.

8 Results

In order to ensure that our results are not an artifact of a particular data split, we train each model on 10 random splits of the data. The average accuracy across data splits is shown in Table 1, where the accuracy of a single model with respect to a single comparison type is calculated based on 1000 comparisons. The mean standard error was 0.008, and the maximum standard error was 0.054 for these measurements. Figure 1 shows the percentage of times the GSS is ranked in the top-k sentences, given a set of 20 sentences containing 19 randomly selected INS. The (20 choose 2) = 190 pairwise rankings are aggregated using the iterative Luce Spectral Ranking algorithm (Maystre and Grossglauser, 2015).

                     kpv                        sme                        fin                        myv
model     Gv1 Gv2 Gv3 1v2 1v3 2v3    Gv1 Gv2 Gv3 1v2 1v3 2v3    Gv1 Gv2 Gv3 1v2 1v3 2v3    Gv1 Gv2 Gv3 1v2 1v3 2v3
kpv       .93 .97 .97 .79 .95 .77    .53 .62 .66 .56 .60 .54    .62 .65 .68 .59 .64 .58    .65 .72 .77 .64 .71 .62
myv       .59 .68 .68 .66 .76 .62    .18 .13 .10 .40 .32 .40    .66 .65 .68 .56 .61 .56    .95 .99 .99 .78 .92 .76
sme       .65 .70 .70 .55 .56 .52    .93 .98 .99 .73 .89 .71    .57 .59 .61 .56 .58 .56    .22 .14 .10 .39 .31 .40
fin       .49 .60 .66 .60 .70 .60    .44 .58 .68 .58 .67 .57    .88 .95 .98 .72 .85 .70    .70 .74 .74 .62 .67 .59
kpv+myv   .92 .97 .99 .79 .96 .79     -   -   -   -   -   -      -   -   -   -   -   -     .92 .98 .99 .80 .93 .77
kpv+fin   .90 .95 .99 .77 .95 .78     -   -   -   -   -   -     .87 .94 .97 .72 .85 .69     -   -   -   -   -   -
kpv+sme   .91 .95 .97 .73 .89 .69    .91 .97 .99 .73 .86 .69     -   -   -   -   -   -      -   -   -   -   -   -
myv+fin    -   -   -   -   -   -      -   -   -   -   -   -     .83 .90 .94 .69 .82 .69    .93 .98 .99 .79 .91 .75
myv+sme    -   -   -   -   -   -     .89 .95 .97 .73 .86 .70     -   -   -   -   -   -     .86 .94 .96 .75 .87 .72
sme+fin    -   -   -   -   -   -     .90 .96 .98 .73 .88 .71    .86 .93 .97 .73 .85 .71     -   -   -   -   -   -

Table 1: Model accuracy averaged over 10 data splits with 1000 trials per model. Columns are grouped by the test language; rows give the training language(s). Dashes mark combinations that were not evaluated.

[Figure 1: The frequency with which the gold standard sentence is ranked in the top-k with 1000 trials per model, averaged over 10 data splits.]
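The aggregation step can be sketched as follows, assuming the choix package (an implementation of Maystre and Grossglauser's spectral ranking) and a trained pairwise model as in Section 6; the helper functions are hypothetical and the padded numpy encodings are assumed to be prepared elsewhere.

# Turning the model's pairwise preferences over 20 candidate readings of one
# sentence into a full ranking, then checking whether the gold standard is in
# the top k (the quantity reported in Figure 1).
from itertools import combinations
import numpy as np
import choix

def rank_candidates(model, encoded):
    """`encoded` is a list of padded integer encodings (numpy arrays), index 0
    being the gold standard sentence; returns candidate indices, best first."""
    pairs = []
    for i, j in combinations(range(len(encoded)), 2):
        p = model.predict([encoded[i][None, :], encoded[j][None, :]], verbose=0)[0, 0]
        pairs.append((i, j) if p >= 0.5 else (j, i))      # (winner, loser)
    # Small regularization keeps the estimate well-defined even if one candidate
    # wins every comparison.
    strengths = choix.ilsr_pairwise(len(encoded), pairs, alpha=0.01)
    return list(np.argsort(-strengths))

def gss_in_top_k(model, encoded, k):
    return 0 in rank_candidates(model, encoded)[:k]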
9 Discussion and Future Work

The results in Table 1 demonstrate that our models are as effective for extremely low-resource languages like Komi-Zyrian (kpv) as they are for high-resource languages like Finnish (fin). Furthermore, there is evidence that training on a higher-resource language that is genealogically related to a low-resource language is a viable option. For example, the models trained on Finnish (fin) data performed relatively well when tested on the Erzya (myv) data. In cases where languages are not genealogically close to each other, such as Northern Sami (sme) and Erzya (myv), models perform very poorly when trained on one of these languages and tested on the other.

According to the results, the most difficult comparisons are 1v2 and 2v3. Since Equation 1 is only a proxy for sentence quality, it is possible that in some comparisons category 1 sentences are actually of lower quality than category 2 sentences. In contrast, Gv1, Gv2 and Gv3 are comparisons against GSS, which are guaranteed to be correct. Consequently, it seems reasonable to conclude that this decrease in performance is partially due to deficiencies in measuring sentence quality.

Figure 1 demonstrates that pairwise rankings can be aggregated to reliably rank sentences based on their quality, as the GSS was frequently in the top-k sentences for small values of k. For example, the kpv, myv and sme models ranked the GSS in the top 3 roughly 86 percent of the time.

Future work may involve experiments with very closely related languages. For instance, out of the nine Sami languages, Northern Sami is the only one with a UD treebank. Testing the performance of our system on the other Sami languages while training on Northern Sami is one of our goals for future research. However, due to the lack of gold annotated data, we would need to recruit native or near-native speakers with linguistic knowledge to evaluate our system. This is a time-consuming task and it is outside the scope of this paper.

10 Conclusion

Uralic languages exhibit a high degree of morphological ambiguity, and resources for these languages are often limited, posing difficulties for traditional methods that have been employed successfully on other languages. In order to mitigate these issues, we proposed a representation based on the morphological tags associated with each word in a sentence.

Our experimental results demonstrate that an LSTM based model can accurately rank alternate readings of a single sentence, even when the model is trained on an extremely low-resource language. This technique requires much less effort than developing complex rule-based grammar models for an endangered language, as our model can be trained on a small set of gold-standard examples. Furthermore, a trained model can be used to disambiguate the morphological readings produced by an FST analyzer or to evaluate the output of a CG model.

References

Nicholas Andrews, Mark Dredze, Benjamin Van Durme, and Jason Eisner. 2017. Bayesian modeling of lexical resources for low-resource settings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1029–1039. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1095

Jan Buys and Jan A. Botha. 2016. Cross-lingual morphological tagging for low-resource languages. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1954–1964. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1184

François Chollet et al. 2015. Keras. https://keras.io.

Mika Hämäläinen. 2018. Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In Proceedings of the Logic and Engineering of Natural Language Semantics 15 (LENLS15).

Mika Hämäläinen. 2019. UralicNLP: An NLP library for Uralic languages. Journal of Open Source Software, 4(37):1345. https://doi.org/10.21105/joss.01345

Mika Hämäläinen, Liisa Lotta Tarvainen, and Jack Rueter. 2018. Combining Concepts and Their Translations from Structured Dictionaries of Uralic Minority Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA).

Katri Haverinen, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter. 2014. Building the essential resources for Finnish: the Turku Dependency Treebank. Language Resources and Evaluation, 48(3):493–531. https://doi.org/10.1007/s10579-013-9244-1

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

KyungTae Lim, Niko Partanen, and Thierry Poibeau. 2018. Multilingual dependency parsing for low-resource languages: Case studies on North Saami and Komi-Zyrian. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Lucas Maystre and Matthias Grossglauser. 2015. Fast and accurate inference of Plackett-Luce models. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 172–180, Cambridge, MA, USA. MIT Press. http://dl.acm.org/citation.cfm?id=2969239.2969259

Christopher Moseley, editor. 2010. Atlas of the World's Languages in Danger, 3rd edition. UNESCO Publishing. Online version: http://www.unesco.org/languages-atlas/.

Sjur Moshagen, Jack Rueter, Tommi Pirinen, Trond Trosterud, and Francis M. Tyers. 2014. Open-Source Infrastructures for Collaborative Work on Under-Resourced Languages. In The LREC 2014 Workshop "CCURL 2014 - Collaboration and Computing for Under-Resourced Languages in the Linked Open Data Era". http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014-Workshop-CCURL2014-Proceedings.pdf

Niko Partanen, Rogier Blokland, KyungTae Lim, Thierry Poibeau, and Michael Rießler. 2018. The first Komi-Zyrian Universal Dependencies treebanks. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 126–132.

Barbara Plank and Željko Agić. 2018. Distant supervision from disparate sources for low-resource part-of-speech tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 614–620. Association for Computational Linguistics. http://aclweb.org/anthology/D18-1061

Jack Rueter and Francis Tyers. 2018. Towards an Open-Source Universal-Dependency Treebank for Erzya. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, pages 106–118.

Mariya Sheyanova and Francis M. Tyers. 2017. Annotation schemes in North Sámi dependency parsing. In Proceedings of the 3rd International Workshop for Computational Linguistics of Uralic Languages, pages 66–75.