Automatic Interlinear Glossing for Otomi

Diego Barriga (1,3), Victor Mijangos (1,3), Ximena Gutierrez-Vasques (2,3)
1 Universidad Nacional Autónoma de México (UNAM)
2 URPP Language and Space, University of Zürich
3 Comunidad Elotl
{dbarriga, vmijangosc}@ciencias.unam.mx, [email protected]

Abstract

In linguistics, interlinear glossing is an essential procedure for analyzing the morphology of languages. This type of annotation is useful for language documentation, and it can also provide valuable data for NLP applications. We perform automatic glossing for Otomi, an under-resourced language. Our work also comprises the pre-processing and annotation of the corpus. We implement different sequential labelers. CRF models represented an efficient and good solution for our task (accuracy above 90%). Two main observations emerged from our work: 1) models with a higher number of parameters (RNNs) performed worse in our low-resource scenario; and 2) the information encoded in the CRF feature function plays an important role in the prediction of labels; however, even in cases where POS tags are not available it is still possible to achieve competitive results.

Sentence      hí tó=tsogí
Glossing      NEG 3.PRF=leave
Translation   'I have not left it'

Table 1: Example of morpheme-by-morpheme glosses for Otomi

1 Introduction

One of the important steps of linguistic documentation is to describe the grammar of a language. Morphological analysis constitutes one of the stages for building this description. Traditionally, this is done by means of interlinear glossing. This is an annotation task where linguists analyze sentences in a given language and segment each word with the aim of annotating the morphosyntactic categories of the morphemes within this word (see the example in Table 1).

This type of linguistically annotated data is a valuable resource not only for documenting a language, but it can also enable NLP technologies, e.g., by providing training data for automatic morphological analyzers, taggers, morphological segmentation, etc.

However, not all languages have this type of annotated corpora readily available. Glossing is a time-consuming task that requires linguistic expertise. In particular, low-resource languages lack documentation and language technologies (Mager et al., 2018).

Our aim is to successfully produce automatic glossing annotation in a low-resource scenario. We focus on Otomi of Toluca, an indigenous language spoken in Mexico (Oto-Manguean family). It is a morphologically rich language with a fusional tendency. Moreover, it suffers from a scarcity of digital resources, e.g., monolingual and parallel corpora.

Our initial resource is a small corpus transcribed into a phonetic alphabet. We pre-process it and perform manual glossing. Once we have this dataset, we use it for training an automatic glossing system for Otomi.

By using different variations of Conditional Random Fields (CRFs), we were able to achieve good accuracy in the automatic glossing task (above 90%), regardless of the low-resource scenario. Furthermore, computationally more expensive methods, i.e., neural networks, did not perform as well.

We also performed an analysis of the results from the linguistics perspective. We explored the automatic glossing performance for a subset of labels to understand the errors that the model makes.

Our work can be a helpful tool for reducing the workload when manually glossing. This would have an impact on language documentation. It can also lead to an increment of annotated resources for Otomi, which could be a starting point for developing NLP technologies that nowadays are not yet available for this language.
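As a concrete illustration (ours, not part of the original paper), the three tiers of an interlinear gloss such as the one in Table 1 can be held in a small record and aligned token by token:

```python
# The interlinear gloss of Table 1 as a simple record; the surface and
# gloss tiers are aligned word by word, with '=' marking clitic
# boundaries. An illustrative sketch only, not the paper's data format.
gloss_entry = {
    "sentence": "hí tó=tsogí",
    "glossing": "NEG 3.PRF=leave",
    "translation": "I have not left it",
}

# Pair each surface token with its gloss token.
pairs = list(zip(gloss_entry["sentence"].split(),
                 gloss_entry["glossing"].split()))
print(pairs)  # [('hí', 'NEG'), ('tó=tsogí', '3.PRF=leave')]
```

Each pair can then be split further at the '=' and '-' morpheme boundaries, which is the level of analysis the rest of the paper works at.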

Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 34–43, June 11, 2021. ©2021 Association for Computational Linguistics

2 Background

As we have mentioned before, glossing comprises describing the morphological structure of a sentence by associating every morpheme with a morphological label or gloss. In a linguistic gloss, there are usually three levels of analysis: a) the segmentation by morphemes; b) the glosses describing these morphemes; and c) the translation or lexical correspondences in a reference language.

Several works have tried to automatize this task by using computational methods. Snoek et al. (2014) use a rule-based approach (a Finite State Transducer) to obtain glosses for Plains Cree, an Algonquian language. They focus only on the analysis of nouns. Samardzic et al. (2015) propose a method for glossing the Chintang language; they divide the task into grammatical and lexical glossing. Grammatical glossing is approached as supervised part-of-speech tagging, while for lexical glossing they use a dictionary. A fully automatized procedure is not performed since word segmentation is not addressed.

Some other works have approached the whole pipeline of automatic glossing as a supervised tagging task using machine learning sequential models, and they have particularly focused on under-resourced languages (Moeller and Hulden, 2018; Anastasopoulos et al., 2018; Zhao et al., 2020). Anastasopoulos et al. (2018) make use of neural-based models with dual sources; they leverage easy-to-collect translations.

Moeller and Hulden (2018) perform automatic glossing for Lezgi (Nakh-Daghestanian family) under challenging low-resource conditions. They implement different methods, i.e., CRF, CRF+SVM, and a Seq2Seq neural network. The best results are obtained with a CRF model that leverages POS tags. The glossing is mainly focused on tagging grammatical (functional) morphemes, while the lexical items are tagged simply as stems.

This latter approach especially influences our work. In fact, Moeller and Hulden (2018) highlight the importance of testing these models on other languages, particularly polysynthetic languages with fusion and complex morphophonology. Our case of study, Otomi, is precisely a highly fusional language with complex morphophonological patterns, as we will discuss in Section 3.

Finally, automatic glossing is not only crucial for aiding linguistic research and language documentation. This type of annotation is also a valuable source of morphological information for several NLP tasks. For instance, it could be used to train state-of-the-art morphological segmentation systems for low-resource languages (Kann and Schütze, 2018). The information contained in the glosses is also helpful for training morphological reinflection systems (Cotterell et al., 2016); this consists in predicting the inflected form of a word given its lemma. It can also help in the automatic generation of morphological paradigms (Moeller et al., 2020).

These morphological tools can then be used to build downstream applications, e.g., machine translation and text generation. It is noteworthy that these are language technologies that are not yet available for all languages, especially for under-resourced ones.

3 Methodology

3.1 Corpus

Otomi is considered a group of languages spoken in Mexico (around 300,000 speakers). It belongs to the Oto-Pamean branch of the Oto-Manguean family (Barrientos López, 2004). It is a morphologically rich language that shows particular phenomena (Baerman et al., 2019; Lastra, 2001):

• fusional patterns for the inflection of the verb (it fuses person, aspect, tense and mood in a single affix);

• a complex system of inflectional classes;

• stem alternation, e.g., dí=pädi 'I know' and bi=mbädi 'He knew';

• complex morphophonological patterns, e.g., dí=pädi 'I know', dí=pä-hu 'We know';

• complex inflectional patterns.

Furthermore, digital resources are scarce for this language.

We focus on the Otomi of Toluca variety.1 Our starting point is the corpus compiled by Lastra (1992), which is comprised of narrations and dialogues. The corpus was originally transcribed into a phonetic alphabet. We pre-processed this corpus, i.e., we performed digitization and orthographic normalization.2 We used the orthographic standard proposed by INALI (INALI, 2014), although we had problems in processing the appropriate UTF-8 representations for some of the vowels (Otomi has a wide range of vowels).

The corpus, then, was manually tagged,3 i.e., interlinear glossing and Part Of Speech (POS). We followed the Leipzig glossing rules (Comrie et al., 2008).

Domain                  Count
Narrative               32
Dialogues               4
Total sentences         1769
Total words (tokens)    8550

Table 2: General information about the Otomi corpus

In addition to this corpus, we included 81 extra short sentences that a linguist annotated; these examples contained particularly difficult phenomena, e.g., stem alternation, reduction of the stem and others. Table 2 contains general information about the final corpus size.

We also show in Table 3 the ten most common POS tags and gloss labels in the corpus. We can see that the size of our corpus is small compared to the magnitude of resources usually used for doing NLP in other languages.

POS tags    freq      Gloss      freq
V           2579      stem       7501
OBL         2443      DET        733
DET         973       3.CPL      444
CNJ         835       PSD        413
DEM         543       LIM        370
UNKWN       419       PRAG       357
NN          272       3.ICP      341
NEG         176       LIG        287
P.LOC       81        1.ICP      270
PRT         49        DET.PL     269

Table 3: Most frequent POS tags and glosses in the corpus

1 An Otomi language spoken in the region of San Andrés Cuexcontitlán, Toluca. Usually regarded as ots (ISO 639).
2 The digitized corpus, without any type of annotation, can be consulted at https://tsunkua.elotl.mx/.
3 The manual glossing of this corpus was part of a linguistics PhD dissertation (Mijangos, 2021).

3.2 Automatic glossing

We focus on the first two levels of glossing, i.e., given an Otomi sentence, our system will be able to morphologically segment each word and gloss each of the morphemes within the word, as shown in Example (1). Translation implies a different level of analysis and, due to the scarce digital resources, it is not addressed here.

Similar to previous works, we use a closed set of labels, i.e., we have labels for all the grammatical (functional) morphemes and a single label for all the lexical morphemes (stem). We can see in Example (1) that morphemes like tsogí ('leave') are labeled as stem.

(1) hí  tó=tsogí
    NEG 3.PRF=stem

Once we have a gloss label associated to each morpheme, we prepare the training data, i.e., we pair each letter with a BIO-label. BIO-labeling consists of associating each original label with a Beginning-Inside-Outside (BIO) label. This means that each position of a morpheme is declared either as the beginning (B) or inside (I). We neglected O (outside). BIO-labels include the morpheme category (e.g., B-stem) or affix glosses (e.g., B-PST, for past tense). For example, the labeled representation of the word tótsogí would be as follows:

(2) t        ó        t       s       o       g       í
    B-3.PRF  I-3.PRF  B-stem  I-stem  I-stem  I-stem  I-stem

As we can see, BIO-labels help to mark the boundaries of the morphemes within a word, and they also assign a gloss label to each morpheme. We followed this procedure from Moeller and Hulden (2018).

Once we have this labeling, we can train a model, i.e., predict the labels that indicate the morphological segmentation and the associated gloss of each morpheme. In this task, the input is a string of characters c_1, ..., c_N and the output is another string g_1, ..., g_N which corresponds to a sequence of labels (from a finite set of labels), i.e., the glossing. In order to perform automatic glossing, we need to learn a mapping between the input and the output.
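The BIO conversion described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' released code; the input format (a list of (morpheme, gloss) pairs) is an assumption:

```python
def to_bio(morphemes):
    """Map a morpheme-segmented word to per-character BIO labels.

    `morphemes` is a list of (form, gloss) pairs, e.g. the word tótsogí
    segmented as [("tó", "3.PRF"), ("tsogí", "stem")]. Each character
    gets B-<gloss> at a morpheme start and I-<gloss> inside it; the O
    (outside) label is never used, as in the paper.
    """
    labels = []
    for form, gloss in morphemes:
        for i, _ in enumerate(form):
            labels.append(("B-" if i == 0 else "I-") + gloss)
    return labels

# The word tótsogí from Example (2):
word = [("tó", "3.PRF"), ("tsogí", "stem")]
print(to_bio(word))
# ['B-3.PRF', 'I-3.PRF', 'B-stem', 'I-stem', 'I-stem', 'I-stem', 'I-stem']
```

Reversing this mapping after prediction recovers both the morpheme boundaries and the gloss of each morpheme.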

3.2.1 Conditional Random Fields

We approach the task of automatic glossing as supervised structured prediction. We use CRFs for predicting the sequence of labels that represents the interlinear glossing. In particular, we used a linear-chain CRF.

The CRF needs to represent each of the characters from the input sentence as a vector. This is done by means of a feature function. In order to map the input sequence into vectors, the feature function needs to take into account relevant information about the input and output sequences (features).

Feature functions play a major role in the performance of CRF models. In our case, we build these vectors by taking into account information about the current letter; the current, previous and next POS tags; the beginning/end of words and sentences; the letter position; and others (see Section 4.1).

Let X = (c_1, ..., c_N) be a sequence of characters representing the input of our model (a sentence), and Y = (g_1, ..., g_N) the output (a sequence of BIO-labels). The CRF model estimates the probability:

p(Y|X) = (1/Z) ∏_{i=1}^{N} exp{w^T φ(Y, X, i)}    (1)

Here Z is the partition function and w is the weight vector. φ(Y, X, i) is the vector representing the i-th element in the input sentence; this vector is extracted by the feature function φ. The features taken into account by the feature function depend on the experimental settings; we specify them below (Section 4.1). Training the model consists in learning the weights contained in w.

Following Moeller and Hulden (2018), we used CRFsuite (Okazaki, 2007). This implementation uses the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization algorithm in order to learn the parameters of the CRF. Elastic Net regularization (consisting of L1 and L2 regularization terms) was incorporated in the optimization procedure.

3.2.2 Other sequential models

We explored three additional sequential models: 1) a traditional Hidden Markov Model; 2) a vanilla Recurrent Neural Network (RNN); and 3) a biLSTM model.

Hidden Markov Model: A Hidden Markov Model (HMM) (Baum and Petrie, 1966; Rabiner, 1989) is a classical generative graphical model which factorizes the joint distribution into the product of connected components:

p(g_1, ..., g_N, c_1, ..., c_N) = ∏_{t=1}^{N} p(c_t|g_t) p(g_t|g_{t-1})    (2)

This method calculates the probabilities using Maximum Likelihood Estimation. Likewise, the tagging of the test set is made with the Viterbi algorithm (Forney, 1973).4

Recurrent Neural Networks: In contrast with the HMM, Recurrent Neural Networks are discriminative models which estimate the conditional probability p(g_1, ..., g_N | c_1, ..., c_N) using recurrent layers. We used two types of recurrent networks:

1. Vanilla RNN: For the vanilla RNN (Elman, 1990) the recurrent layers were defined as:

   h^(t) = g(W[h^(t-1); x^(t)] + b)    (3)

   Here, x^(t) is the embedding vector representing the character c_t, t = 1, ..., N, in the sequence, and [h^(t-1); x^(t)] is the concatenation of the previous recurrent layer with this embedding vector.

2. biLSTM RNN: The bidirectional LSTM (Hochreiter and Schmidhuber, 1997), or biLSTM, uses different gates to process the recurrent information. However, it requires a higher number of parameters to train. Each biLSTM layer is defined by:

   h^(t) = biLSTM(h^(t-1), x^(t))    (4)

   where h^(t-1) = [h_fwd^(t-1); h_bwd^(t-1)] is the concatenation of the forward and backward recurrent layers.

In each RNN model we used one embedding layer previous to the recurrent layers in order to obtain vector representations of the input characters.

4 We used the Natural Language Toolkit (NLTK) for the HMM model.

4 Experiments

4.1 Experimental Setup

For the CRFs we propose three different experimental settings.5 Each setting varies in the type of features that are taken into account. We defined a general set of features that capture different types of information:

5 The code is available at https://github.com/umoqnier/otomi-morph-segmenter/

1. the current character in the input sentence;
We defined a 1989) is a classical generative graphical model general set of features that capture different type of which factorizes the joint distribution function into information: the product of connected components: 1. the current character in the input sentence; N Y 4We used Natural Language Toolkit (NLTK) for the HMM p(g1, ..., gN , c1, ..., cN ) = p(ct|gt)p(gt|gt−1) model. t=1 5The code is available on https://github.com/ (2) umoqnier/otomi-morph-segmenter/ 37 2. indication if the character is the beginning/end recurrent layer; the activation for the vanilla RNN of word; was the hyperbolic tangent. The dimension of the vanilla and LSTM hidden layers was 200.9 3. indication if the word containing the character The features used in the CRF settings are implic- is the beginning/end of a sentence; itly taken into account by the neural-based models. 4. the position of the character in the word; Except for the POS tags, we did not include that in- formation in the neural settings. In this sense, these 5. previous and next characters (character win- last neural methods contain the same information dow); as the CRFP OSLess setting.

6. the current word POS tag, and also the previ- 4.2 Results ous and the next one; and We evaluated our CRF-based automatic glossing 7. a bias term.6 models by using k-Fold Cross-Validation. We used k = 3 due to the small dataset size. To sum up, the CRF takes the information of the For the other sequential methods, we performed current character as input; but in order to obtain a hold-out evaluation.10 In all cases we preserved contextual information, we also take into consider- the same proportion between training and test ation the previous and next character. Grammatical datasets (see Table4). information is provided by the POS tag of the word in which the character appears. In addition to this, Instances (sentences) we add the POS tag of the previous and next words. Train 1180 These are our CRF settings: Test 589
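A feature function of this kind can be sketched as a dictionary builder, the input format expected by CRF toolkits such as the Python wrappers around CRFsuite. This is our illustration, not the authors' code; the feature names and the per-character POS alignment are assumptions:

```python
def char_features(chars, pos_tags, i):
    """Feature dict for the i-th character of a sentence.

    `chars` is the character sequence of the whole sentence (spaces
    included) and `pos_tags` gives, for each position, the POS tag of
    the word that character belongs to. Feature names are illustrative.
    """
    n = len(chars)
    feats = {
        "bias": 1.0,                                 # captures label priors
        "char": chars[i],                            # current character
        "pos": pos_tags[i],                          # POS of current word
        "word_start": i == 0 or chars[i - 1] == " ",  # word boundary flags
        "word_end": i == n - 1 or chars[i + 1] == " ",
        "sent_start": i == 0,                        # sentence boundary flags
        "sent_end": i == n - 1,
    }
    if i > 0:                                        # character window
        feats["char-1"] = chars[i - 1]
        feats["pos-1"] = pos_tags[i - 1]
    if i < n - 1:
        feats["char+1"] = chars[i + 1]
        feats["pos+1"] = pos_tags[i + 1]
    return feats

# Features for the first character of "hí tó=tsogí" (POS tags assumed):
sent = "hí tó=tsogí"
tags = ["NEG"] * 2 + ["V"] * 9   # one tag per character position
feats = char_features(sent, tags, 0)
```

Dropping the pos* entries yields a configuration analogous to CRF_POSLess; the full feature set described above also includes the character's position within its word.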

These are our CRF settings:

• CRF_linear: This setting considers all the information available, i.e., the features mentioned above.

• CRF_POSLess: In this setting we excluded the POS tags.

• CRF_HMMLike: This setting takes into account the minimum information, i.e., information about the current letter and the immediately preceding one. We use this name because this configuration contains similar information as the HMMs, but uses CRFs to build them.7

As previously mentioned, we included other sequential methods for the sake of comparison, i.e., a simple Hidden Markov Model, which can be seen as the baseline since it is the simplest model, and two neural-based models: a vanilla RNN and a biLSTM model.

The embedding dimension was 100 units for both the vanilla RNN and the biLSTM models.8 In both neural-based models we used one hidden recurrent layer; the activation for the vanilla RNN was the hyperbolic tangent. The dimension of the vanilla and LSTM hidden layers was 200.9

The features used in the CRF settings are implicitly taken into account by the neural-based models. Except for the POS tags, we did not include that information in the neural settings. In this sense, these neural methods contain the same information as the CRF_POSLess setting.

6 The bias feature captures the proportion of a given label in the training set, i.e., it is a way to express that some labels are rare while others are not.
7 The maximum number of iterations in all cases was 50.
8 Both RNN models were trained in similar environments: 150 iterations, with a learning rate of 0.1 and Stochastic Gradient Descent (SGD) as the optimization method.
9 The code for the neural-based models is available at https://github.com/VMijangos/Glosado_neuronal

4.2 Results

We evaluated our CRF-based automatic glossing models by using k-fold cross-validation. We used k = 3 due to the small dataset size. For the other sequential methods, we performed a hold-out evaluation.10 In all cases we preserved the same proportion between training and test datasets (see Table 4).

        Instances (sentences)
Train   1180
Test    589

Table 4: Dataset information for every model

10 We took this decision due to computational cost.

We report the accuracy, and we also calculated the precision, recall and F1-score for every label in the corpus. Table 5 contains the results for all settings.

We can see that the CRF-based models outperformed the other methods in the automatic glossing task. Among the CRF settings, CRF_HMMLike was the one with the lowest accuracy (and also precision and recall); this CRF used the least information/features, i.e., the current character of the input sentence and the previous emitted label. This is probably related to the fact that Otomi has a rich morphological system (with prefixes and suffixes); therefore, the lack of information about previous and subsequent characters affects the accuracy in the prediction of gloss labels.

The CRF settings CRF_POSLess and CRF_linear are considerably better. The variation between these two settings is small, although the accuracy of CRF_linear is higher. Interestingly, the lack of POS tags does not seem to affect the accuracy that much. If the glossing is still accurate (above 90%) after excluding POS tags, this could be convenient, especially in low-resource scenarios, where this type of annotation may not always be available for training the model.

              Accuracy  Precision (avg)  Recall (avg)  F1-score (avg)
CRF_linear    0.962     0.910            0.880         0.870
CRF_POSLess   0.948     0.909            0.838         0.856
CRF_HMMLike   0.880     0.790            0.791         0.754
HMM           0.878     0.877            0.851         0.858
Vanilla RNN   0.741     0.504            0.699         0.583
biLSTM        0.563     0.399            0.654         0.489

Table 5: Results for the different experimental setups

We do not know if this observation could be generalized to all languages. In the case of Otomi, the information encoded in the features could be good enough for capturing the morphological structure and word order that is important for predicting the correct label.

Additionally, we tried several variations of the hyperparameters of the Elastic Net regularization (CRFs); however, we did not obtain significant improvements (see Appendix A).

The model that we took as the baseline, the HMM, obtained a lower performance compared to the CRF settings (0.878). However, if we take into consideration that the HMM was the simplest model, its performance is surprisingly good.

The performance of CRF_HMMLike is very similar to that of the HMM. As we mentioned before, these two settings make use of the same information, but their approach is different: CRFs are discriminative while HMMs are generative.

The neural approaches that we implemented were not the most suitable for our task. They obtained the lowest accuracy: 0.741 for the vanilla RNN and 0.563 for the biLSTM. This result might seem striking, especially since neural approaches are by far the most popular nowadays in NLP.

5 Discussion

5.1 CRFs vs RNNs

We have several conjectures that could explain why neural approaches were not the most accurate for our particular task. For instance, we observed that the performance of the RNN models (vanilla and biLSTM) was highly sensitive to the frequency of the labels. Both neural models performed better for high-frequency labels (such as stem).

In principle, the models that we used for automatic glossing have conceptual differences. HMMs are generative models, while CRFs and neural models are discriminative. This distinction, however, does not seem to influence the results. The HMM performed better than the neural-based models but was outperformed by the CRFs.

CRFs and neural networks differ mainly in the way they process the input data. While CRFs depend on the initial features selected by an expert, neural networks process a simple representation of the input data (one-hot vectors) through a series of hidden layers which rely on a large number of parameters. The number of parameters is a key factor in neural networks; they usually have a large number of parameters that allows them to generalize well in complex tasks. For example, the biLSTM model has the highest number of parameters, while the vanilla RNN has a considerably reduced number of parameters.

However, theoretically, a model with higher capacity will also require a larger number of examples to generalize adequately (Vapnik, 1998). The capacity of neural-based models depends on the number of parameters (Shalev-Shwartz and Ben-David, 2014). This could be problematic in low-resource scenarios. In fact, in our experiments, the model with the highest number of parameters, the biLSTM, performed the worst. Models with fewer parameters, such as the HMM and the CRFs, outperformed the neural-based models by a large margin.

It is worth mentioning that we are aware that hyperparameters and other factors can strongly influence a neural model's performance. There could be variations that result in more suitable solutions for this task. However, overall, this would probably represent a more expensive solution than using CRFs (or even an HMM).

Our results seem consistent with previous works for the same task, where neural approaches fail to outperform CRFs in low-resource scenarios (Moeller and Hulden, 2018). Complex models with many parameters might not be the most efficient solution in these types of low-resource scenarios. However, we leave this as an interesting research question for the future.

Finally, our proposed models, CRF_linear and CRF_POSLess, seemed to be the best alternative for the task of automatic glossing of Otomi.

5.2 Linguistic perspective

In this section we focus on the results from a more qualitative point of view. We discuss some linguistic particularities of Otomi and how they affected the performance of the models. We also present an analysis of how the best evaluated method, i.e., CRF_linear, performed for a selected subset of gloss labels.

As we mentioned in previous sections, the information comprised in the features seems to be decisive in the performance of the CRF models. When some of these features were removed, performance tended to decay.

For the correct labeling of Otomi morphology, contextual information (previous and next characters in the sentence) did have an impact on performance. This may be attributed to the presence of both prefixes and suffixes in Otomi words. Stem alternation, for example, is conditioned by the prefixes in the word. Stem reduction is conditioned by the suffixes. In order to correctly label both stems and affixes, the system must consider the previous and next elements.

There exist morphological or syntactic elements in the sentence that contribute to identifying word categories. For example, most of the nouns are preceded by a determiner (ri, singular, or yi, plural). This kind of information is captured in the features and can help the performance of the automatic glossing task.

Frequency of labels is a factor that influences the performance of the models. Labels with high frequency are better evaluated. For the neural-based models the impact of frequency was more pronounced. However, despite the low-resource scenario, we were able to achieve good results with the CRFs (above 90%).

Languages exhibit a wide range of complexity in their morphological systems. Otomi has several phenomena that may seem difficult to capture by the automatic models. However, even when languages have complex morphological systems, there are frequent and informative patterns (e.g., inflectional affixes) that can help the recognition of them. This hypothesis is reflected in the low conditional entropy conjecture (Ackerman and Malouf, 2013), which concerns the organization of morphological patterns in order to make morphology learnable. This hypothesis points out that morphological organization seeks to reduce uncertainty.

Label    Precision  Recall  F1-score  Instances
DET      1          0.99    1         228
DET.PL   0.99       0.99    0.99      91
3.CPL    0.96       1       0.98      144
PRAG     0.97       0.99    0.98      116
stem     0.96       0.97    0.96      2396
CTRF     0.95       0.97    0.96      89
3.ICP    0.93       0.94    0.94      118
3.PLS    1          0.86    0.92      28
3.PSS    0.80       1       0.89      8
PRT      0.50       0.22    0.31      18

Table 6: Results from the CRF_linear model on a subset of the glossing labels

Table 6 presents the evaluation results for a subset of the labels used for the automatic glossing. These labels are linguistically interesting, as there is a contrast between productive and unproductive elements.

We can observe that labels like stem, 3.CPL (third person completive) or CTRF (counterfactual) were correctly labeled most of the time, as they were systematic and very frequent.

Items like PRT (particle) had a lower frequency, a lower recall and a lower precision. The lower recall could be attributed to the fact that PRT is not systematic, i.e., multiple forms can take the same label. Therefore, it is more difficult to discriminate.

PRAG (pragmatic mark) appears only in verbs, and always in the same position (at the end of the word); this probably made this mark easier to discriminate, and thus easier to predict by the model. It is interesting that this morpheme was relatively frequent, but it did not bear semantic information, as it only provided discursive nuances (it can be translated as the filler word 'well').

The 3.ICP (third person incompletive) label represents an aspect morpheme which is used very often, since it is applied in progressive and habitual situations. As it always appears before the verb and in the same position, it seemed easier to predict. Therefore, this label has a high precision and recall.

The 3.PLS (third person pluperfect) label also shows a systematic use before the verb; however, it had a lower frequency in the corpus, which seems to have caused a lower recall.

Otomi has two determiner morphemes: one for singular number (DET) and one for plural number (DET.PL). The plural one is clearly distinguished from other morphemes, as it has the form yi. However, for the singular number the form is ri, which is the same as the form for the third person possessive (3.PSS). We believe that this fact made the label 3.PSS more prone to be incorrectly labeled (it showed a lower precision). In some cases, the model tended to incorrectly label the form ri by preferring the most frequent label DET. This could explain the lower accuracy of 3.PSS compared to DET.

In general, productive affixes were correctly labeled by our automatic system. This may represent a significant advantage in terms of aiding linguistic manual annotation. Productive and frequent morphemes may represent a repetitive annotation task that can be easily substituted by an automatic glossing system.

Even in the understanding that the glossing system is not 100% accurate, it is probably easier for a human annotator to correct problematic mislabels than to do all the process from scratch. In this sense, automatic glossing can simplify the task of manually glossing, and, therefore, it can help in the process of language documentation.

6 Conclusion

We focused on the task of automatic glossing for Otomi of Toluca, an indigenous language with complex morphological phenomena. We faced a low-resource scenario where we had to digitize, normalize and annotate a corpus available for this language.

We applied a CRF-based labeler with different variations in regard to the features that were taken into account by the model. Moreover, we included other sequential models: a HMM (baseline) and two RNN models.

The CRFs outperformed the baseline (HMM) but also the RNN models (vanilla RNN and biLSTM). The CRF setting that took into account more information (encoded by the feature function) had the best performance. We also noticed that excluding POS tags does not seem to harm the system's performance that much. This could be an advantage, since automatic POS tagging is a resource not always available for under-resourced languages.

Furthermore, we provided a linguistically motivated insight into which labels were easier to predict by our system.

Our automatic glossing labeler was able to achieve an accuracy of 96.2% (and 94.8% without POS tags). This sounds promising for reducing the workload when manually glossing. This can represent a middle step not only for strengthening language documentation but also for facilitating the creation of language technologies that can be useful for the speakers of Otomi.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. We also thank the members of Comunidad Elotl for supporting this work. This work has been partially supported by SNSF grant no. 176305.

References

Farrell Ackerman and Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language, pages 429–464.

Antonios Anastasopoulos, Marika Lekakou, Josep Quer, Eleni Zimianiti, Justin DeBenedetto, and David Chiang. 2018. Part-of-speech tagging on an endangered language: a parallel Griko-Italian resource. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2529–2539, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Matthew Baerman, Enrique Palancar, and Timothy Feist. 2019. Inflectional class complexity in the Oto-Manguean languages. Amerindia, 41:1–18.

Guadalupe Barrientos López. 2004. Otomíes del Estado de México. Comisión Nacional para el Desarrollo de los Pueblos Indígenas.

Leonard E. Baum and Ted Petrie. 1966. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.

Bernard Comrie, Martin Haspelmath, and Balthasar Bickel. 2008. The Leipzig glossing rules: Conventions for interlinear morpheme-by-morpheme glosses. Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology & the Department of Linguistics of the University of Leipzig. Retrieved January, 28:2010.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—Morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Tanja Samardzic, Robert Schikowski, and Sabine Stoll. 2015. Automatic interlinear glossing as two-level sequence classification. In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pages 68–72.

G David Forney. 1973. The viterbi algorithm. Proceed- Shai Shalev-Shwartz and Shai Ben-David. 2014. Un- ings of the IEEE, 61(3):268–278. derstanding machine learning: From theory to algo- rithms. Cambridge university press. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, Conor Snoek, Dorothy Thunder, Kaidi Loo, Antti 9(8):1735–1780. Arppe, Jordan Lachler, Sjur Moshagen, and Trond Trosterud. 2014. Modeling the noun morphology of INALI. 2014. Njaua nt’ot’i ra hñähñu. Norma de es- plains cree. In Proceedings of the 2014 Workshop critura de la lengua hñähñu (Otomí). INALI. on the Use of Computational Methods in the Study of Endangered Languages, pages 34–42. Katharina Kann and Hinrich Schütze. 2018. Neural Vladimir Vapnik. 1998. Statistical Learning Theory. transductive learning and beyond: Morphological John Wiley & Sons. generation in the minimal-resource setting. In Pro- ceedings of the 2018 Conference on Empirical Meth- Xingyuan Zhao, Satoru Ozaki, Antonios Anastasopou- ods in Natural Language Processing, pages 3254– los, Graham Neubig, and Lori Levin. 2020. Auto- 3264, Brussels, Belgium. Association for Computa- matic interlinear glossing for under-resourced lan- tional Linguistics. guages leveraging translations. In Proceedings of the 28th International Conference on Computational Yolanda Lastra. 1992. El otomí de Toluca. Instituto de Linguistics, pages 5397–5408. Investigaciones Antropológicas, UNAM.

Yolanda Lastra. 2001. Unidad y Diversidad de la Lengua: Relatos otomíes. UNAM.

Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, and Ivan Meza-Ruiz. 2018. Challenges of language technologies for the indigenous languages of the Americas. In Proceedings of the 27th Inter- national Conference on Computational Linguistics, pages 55–69, Santa Fe, New Mexico, USA. Associ- ation for Computational Linguistics.

Víctor Mijangos. 2021. Análisis de la flexión verbal del español y del otomí de Toluca a partir de un modelo implicacional de palabra y paradigma. Ph.D. the- sis, Instituto de Investigaciones Filológicas, UNAM, .

Sarah Moeller and Mans Hulden. 2018. Automatic glossing in a low-resource setting for language docu- mentation. In Proceedings of the Workshop on Com- putational Modeling of Polysynthetic Languages, pages 84–93.

Sarah Moeller, Ling Liu, Changbing Yang, Katharina Kann, and Mans Hulden. 2020. Igt2p: From interlin- ear glossed texts to paradigms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5251–5262.

Naoaki Okazaki. 2007. Crfsuite: a fast implementation of conditional random fields (crfs).

Lawrence R Rabiner. 1989. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257– 286. 42 A Appendix The following are the detailed results of the three different settings for CRF models. We report average accuracy score. The prefixes in the model names mean whether regularization terms L1 and/or L2 were configured. For example, the prefix reg means that both terms were present and conversely noreg means that no term is considered. Finally, l1_zero and l2_zero means if L1 or L2 term is equal to zero. The variation of regularization parameters probed slight improvements between models of the same setting as can be showed in tables7,8 and9.
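The three settings can be pictured as different feature functions over the morpheme sequence: an HMM-like setting restricted to the current morpheme, a POS-less setting that adds context morphemes, and a full linear setting that also sees POS tags. The following is a minimal sketch with hypothetical feature templates, not the paper's actual configuration; the POS tags in the example are likewise invented for illustration.

```python
# Sketch of per-token feature extraction for three CRF settings, loosely
# mirroring CRFHMMLike, CRFPOSLess, and CRFlinear. The feature templates
# here are assumptions for illustration only.

def features(sent, i, setting="linear"):
    """sent is a list of (morpheme, pos) pairs; returns features for position i."""
    morph, pos = sent[i]
    feats = {"morph": morph}                  # emission-style feature
    if setting == "hmm_like":
        return feats                          # current morpheme only
    # context features shared by the richer settings
    feats["prev_morph"] = sent[i - 1][0] if i > 0 else "<BOS>"
    feats["next_morph"] = sent[i + 1][0] if i < len(sent) - 1 else "<EOS>"
    if setting == "linear":                   # full setting also sees POS tags
        feats["pos"] = pos
    return feats

# Example sentence from Table 1; the POS tags are hypothetical.
sent = [("hí", "ADV"), ("tó", "V"), ("tsogí", "V")]
print(features(sent, 1, "hmm_like"))  # {'morph': 'tó'}
print(features(sent, 1, "pos_less"))
print(features(sent, 1, "linear"))
```

In a CRFsuite-style toolkit, dictionaries like these would be emitted once per token and fed to the trainer; the three settings then differ only in which keys are produced.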

Model                 Accuracy
CRFHMMLike_l2_zero    0.8800
CRFHMMLike_reg        0.8760
CRFHMMLike_noreg      0.8710
CRFHMMLike_l1_zero    0.8707

Table 7: CRFHMMLike setting results
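The name CRFHMMLike suggests a configuration restricted to HMM-style information, so an ordinary HMM decoded with the Viterbi algorithm (Forney, 1973) is the natural point of comparison. Below is a minimal Viterbi sketch over a toy label set; the probabilities are made up for illustration.

```python
# Minimal Viterbi decoding for an HMM baseline (toy probabilities).

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s]: probability of the best path ending in state s at step t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

states = ["A", "B"]
start = {"A": 0.9, "B": 0.1}
trans = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.5, "B": 0.5}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
print(viterbi(["x", "x", "y"], states, start, trans, emit))  # ['A', 'A', 'B']
```

A CRF generalizes this picture by replacing the transition and emission probabilities with arbitrary weighted features, which is what allows the richer settings above to add context and POS information.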

Model                 Accuracy
CRFPOSLess_reg        0.9482
CRFPOSLess_l2_zero    0.9472
CRFPOSLess_l1_zero    0.9442
CRFPOSLess_noreg      0.9407

Table 8: CRFPOSLess setting results

Model                 Accuracy
CRFlinear_reg         0.9624
CRFlinear_l2_zero     0.9598
CRFlinear_l1_zero     0.9586
CRFlinear_noreg       0.9586

Table 9: CRFlinear setting results
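One convenient way to read Tables 7–9 is in terms of remaining error rather than accuracy. A quick back-of-the-envelope sketch using the best model of each setting:

```python
# Best accuracy per setting, taken from Tables 7-9.
best = {"CRFHMMLike": 0.8800, "CRFPOSLess": 0.9482, "CRFlinear": 0.9624}

def rel_error_reduction(acc_from, acc_to):
    """Fraction of remaining errors removed when moving between settings."""
    return ((1 - acc_from) - (1 - acc_to)) / (1 - acc_from)

# Adding context features removes over half of the HMM-like errors:
print(round(rel_error_reduction(best["CRFHMMLike"], best["CRFPOSLess"]), 3))  # 0.568
# Adding POS tags on top removes roughly another quarter of the errors:
print(round(rel_error_reduction(best["CRFPOSLess"], best["CRFlinear"]), 3))   # 0.274
```

This framing matches the conclusion that the feature function matters most, while POS tags, though helpful, contribute a comparatively smaller share of the improvement.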
