
Automating Gloss Generation in Interlinear Glossed Text

Angelina McMillan-Major
University of Washington, Seattle, USA
[email protected]

Abstract

Interlinear Glossed Text (IGT) is a rich data type produced by linguists for the purposes of presenting an analysis of a language's semantic and grammatical properties. I combine linguistic knowledge and statistical machine learning to develop a system for automatically annotating low-resource language data. I train a generative system for each language using on the order of 1000 IGT. The input to the system is the morphologically segmented source language phrase and its English translation. The system outputs the predicted linguistic annotation for each morpheme of the source phrase. The final system is tested on held-out IGT sets for Abui [abz], Chintang [ctn], and Matsigenka [mcb] and achieves 71.7%, 80.3%, and 84.9% accuracy, respectively.

1 Introduction

While language documentation has a long history, warnings from linguists such as Hale et al. (1992) and Krauss (1992) concerning language extinction have revitalized and expanded documentation efforts by communities and linguists, though there is still much work to be done (Seifart et al., 2018). According to Seifart et al. (2018), it can take between 40 and 100 hours to transcribe an hour of recorded material, and even more time is required to analyze the language as a whole before annotating a single segment of the data collected. Given the decreasing language diversity in the world, there is an identified and immediate need for automated systems to assist in reducing the human hours spent on the documentation process.

While costly to produce, the glosses in IGT allow linguistic generalizations that are implicitly present in natural text to be explicitly available for natural language processing. In addition to supporting field linguists in collecting data, better and more easily produced IGT would also benefit end-stage projects such as machine translation between low-resource languages by improving the accuracy of the pre-processing modules (Xia and Lewis, 2008). Georgi et al. (2012) used IGT corpora to improve dependency parsing on low-resource languages using bootstrapping methods, while Bender et al. (2014) and Zamaraeva et al. (2019) used IGT to build high-precision grammars. Furthermore, language communities with trained IGT generators would be able to produce IGT for any new text found or created to aid with either language learning, documentation, or future translation efforts.

IGT consist of a source language phrase, a translation of that phrase into the language of the target audience, such as English, and glosses for each source morpheme. The glosses highlight the morphological and syntactic features of the source language. Ex. 1 shows an IGT from the Kazakh dataset in the Online Database of INterlinear text (ODIN) (Lewis and Xia, 2010), modified from Vinnitskaya et al. (2003).

(1) Kyz bolme-ge kir-di
    girl.NOM room-DAT enter-PAST
    (A/the) girl entered (a/the) room. [ISO 639-3: kaz]

In Ex. 1, the first line is the source line, the second is the gloss line, and the third is the translation line. The strings girl, NOM, room, etc. are all glosses, but glosses that refer to grammatical information, such as NOM, will be referred to as grams, and the glosses that refer to semantically contentful information, such as girl, will be referred to as stems.

In this paper I describe a system for producing the gloss line of IGT automatically. I restrict my system to producing just the gloss line, given a morphologically segmented source line and its translation line.
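To make the data type concrete, the sketch below shows one minimal way an IGT instance such as Ex. 1 might be represented in code. The class and field names are illustrative choices of mine, not the system's actual data structures (the system stores IGT in the Xigt format described in §3).

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class IGT:
        """One interlinear glossed text instance (illustrative only)."""
        source: List[str]      # whitespace-separated source words;
                               # morphemes within a word are '-'-separated
        glosses: List[str]     # one gloss token per source word
        translation: str       # free translation into English

    # Ex. 1 (Kazakh) as data:
    kaz = IGT(
        source=["Kyz", "bolme-ge", "kir-di"],
        glosses=["girl.NOM", "room-DAT", "enter-PAST"],
        translation="(A/the) girl entered (a/the) room.",
    )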
Morphological segmentation packages such as Morfessor (Creutz and Lagus, 2007) are widely available, and in the documentation setting the segmentation may be provided by a native speaker consultant. This system could be used in combination with such resources. The input to the system at test time includes the morphemes in the segmented source line and the translation in the bottom line, and the target output is the gloss line.

This system does not, however, produce new analyses of the source language. Rather, it is assumed that the linguistic analyses at all levels and the transliteration are already formalized by the documentary team. The system is then learning patterns from the analyses in the training data and reproducing the patterns when given new data. While the system can be trained on one set of analyses and tested on another, the performance will depend on the amount of variation between the analyses. This is especially significant in the low-resource setting, where each data instance contributes a relatively large amount of information as compared to each data instance in a high-resource setting.

A survey of the literature on IGT curation, augmentation, and automation is provided in §2. In §3, I present the data used for developing and testing the system. §4 describes both the machine learning methods and the rule-based methods of this particular system, where the rule-based methods provide an implementation for handling out-of-vocabulary (OOV) tokens. This section also includes an explanation of the evaluation metrics. §5 presents the results on the development and test languages, as well as a systematic error analysis. Finally, §6 discusses the challenges and limitations inherent in casting annotation as a classification task while exploring possible improvements to the current method for predicting OOV tokens.

2 Related Work

Approaches to IGT creation tools range in terms of how much input is required from the human annotator to yield the finished product. A widely used tool for documentation is FieldWorks Language Explorer (FLEx) (Baines, 2009). FLEx includes functionality for manually annotating interlinear text in addition to creating dictionaries and other language resources. The annotation software assists the user by retaining source-gloss pairs previously entered by the user and suggesting these glosses when the source morpheme appears again. The suggestions are not automatically constrained, however, so FLEx will suggest all previously seen glosses regardless of their likelihood given the local context unless the user explicitly provides the constraint information. By contrast, the system presented here calculates the likelihood of a source morpheme being labeled with each possible gloss given the current sequence of morphemes and selects the most likely gloss automatically.

Palmer et al. (2009) (see also Baldridge and Palmer 2009 and Palmer et al. 2010) approached the task of IGT glossing within an active learning framework. In an active learning framework, annotators label the first small batch of input data, which is incorporated into the model in a new training phase, and then the next batch of data is labeled by the model and corrected by the annotators before being incorporated back into the model. They trained a maximum entropy classifier to predict a gloss given a morpheme and a context window of two morphemes before and after the morpheme in question. They had two annotators label IGT for Uspanteko [usp] (Mayan, Guatemala), using data from the OKMA corpus (Pixabaj et al., 2007). This corpus contains 32 glossed and 35 unglossed texts for a total of approximately 75,000 glossed tokens. They restrict the number of labels in the annotation schema by labeling stem morphemes with their part of speech (POS) tags, as provided in the corpus. Palmer et al. found that the expert annotator was more efficient and performed better when presented with the model's most uncertain predictions, but the naive annotator annotated more accurately when presented with random IGT rather than the most uncertain.
These results suggest that active learning strategies must take the annotator into account in order to be optimally efficient, whereas automatic annotation does not have this constraint. Fully automated classification approaches provide an alternative method to IGT glossing when IGT have already been completed.

Samardžić et al. (2015) took a classification approach to IGT generation for the Chintang [ctn] (Kiranti, Nepal) Language Corpus dataset (Bickel et al., 2009). This corpus is significantly larger than the average documentation project with approximately 955,000 glossed tokens and a lexicon with POS tags. Samardžić et al. used two classifiers to generate their labels. The first classifier was based on Shen et al.'s (2007) version of Collins and Roark's (2004) Perceptron learning algorithm and jointly learns the order in which to tag the sequence and the predicted tags. It annotated grammatical morphemes with their appropriate label and contentful morphemes with their POS tags, as in Palmer et al. (2009), to limit the total number of labels. The final step replaces the POS labels with an appropriate English lemma using the provided lexicon which maps English lemmas to Chintang morphemes. Samardžić et al. trained a trigram language model on the lexicon IDs to predict the most likely ID when multiple lemmas are possible, and back-off methods are used when labeling a previously unseen morpheme.
This paper attempts to add to the body of research on IGT generation by developing a machine learning framework that can apply to languages with fewer resources. Whereas these previous implementations rely on linguists' input or language-specific resources, such as source language POS tags, to produce the final output, the system presented here runs using only what is given in the IGT training data. The following experiments attempt to answer the question of how much linguistic information statistical machine learning techniques are able to acquire from the linguistic patterns that are made explicit in IGT without any additional resources.

3 Data

The Online Database of INterlinear text (ODIN) is a repository of IGT examples collected from PDFs of linguistic publications (Lewis and Xia, 2010). ODIN contains 158,007 IGT from across 1,496 languages and 2,027 documents. The ODIN IGT datasets are stored in the XML-linearization of the Xigt format (Goodman et al., 2015), which includes a Python API.1 A second version of ODIN has been released with POS tags, dependency parses, and word alignments provided by the INterlinear Text ENrichment Toolkit (INTENT) system (Georgi, 2016).2

1 http://github.com/xigt/xigt
2 Available at http://depts.washington.edu/uwcl/odin/

I selected six languages from ODIN for developing the system based on set size: Turkish [tur], Russian [rus], Korean [kor], Japanese [jpn], Italian [ita], and Norwegian [nob]. I use a further three languages from language documentation projects as held-out test languages. Poor results on held-out languages compared to development languages would suggest that the system is inherently biased towards one language or one typological feature, such as word order; comparable results between the held-out and development languages provide evidence that the system performance is not dependent on language-specific features. The datasets for Chintang [ctn] (Kiranti, Nepal; Bickel et al. 2009), Abui [abz] (Trans-New Guinea, Indonesia; Kratochvíl 2017), and Matsigenka [mcb] (Maipurean, Peru; Michael et al. 2013) have been collected as part of language documentation projects and thus provide the opportunity to model system behavior in that setting. This setting typically includes consistent glossing schemes and native speaker consultants to provide translation information. In order for the system to produce models for these datasets in the same way as the ODIN datasets, preprocessing included converting the resources to the Xigt format and then enriching the data using the INTENT system (Georgi, 2016).

After filtering for IGT with identical source lines and IGT that were not fully annotated by INTENT, the Japanese and Korean sets have slightly more than 2000 IGT each, the Russian set has just under 1500 IGT, the Norwegian and Turkish sets have around 1000 IGT each, and the Italian set has around 800 IGT. Of the held-out datasets, Matsigenka is the smallest, with just under 450 IGT due to a large portion of the corpus having Spanish rather than English translations. The Abui and Chintang sets are much larger with approximately 4700 IGT and 7000 IGT.3 For each language the system is trained using 90% of the given language's IGT and tested on the remaining 10%. Table 1 shows the number of IGT in each language's train and test sets from ODIN, while Table 2 shows the numbers for the held-out languages.

3 This is a subsample of the nearly 1 million word Chintang dataset (see §2).

4 Methodology

I built one glossing system trained separately on each language dataset. Upon loading each dataset, the system removes IGT with source lines that appear multiple times in the dataset and IGT with missing or incomplete label references to the glosses and source morphemes. The system then formats the information in the remaining IGT to be fed into two Conditional Random Field (CRF) models (Lafferty et al., 2001). One model predicts the gloss line from the source line, hereafter referred to as the source model or SRC model, while the second model predicts the gloss line from the translation line, hereafter translation model or TRS model. Finally, the system incorporates the predictions of both models into the final output.
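The loading step just described can be pictured as a small filtering and splitting pipeline. The sketch below is my own minimal reading of it, reusing the illustrative IGT class from the sketch in §1; the function names are hypothetical, and the real system's duplicate handling and split details may differ.

    from typing import List, Tuple

    def filter_corpus(corpus: List[IGT]) -> List[IGT]:
        """Drop IGT whose source line occurs more than once (one simple
        reading of the duplicate filtering above; the real system also
        drops IGT with missing or incomplete label references)."""
        counts: dict = {}
        for igt in corpus:
            key = " ".join(igt.source)
            counts[key] = counts.get(key, 0) + 1
        return [igt for igt in corpus if counts[" ".join(igt.source)] == 1]

    def split_90_10(corpus: List[IGT]) -> Tuple[List[IGT], List[IGT]]:
        """Train on 90% of a language's IGT, test on the remaining 10%."""
        cut = int(len(corpus) * 0.9)
        return corpus[:cut], corpus[cut:]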
translation line. The gold labels for the transla- I use the Japanese example in (2), originally tion to gloss line predictions are provided by IN- from Harley (1995), as a running example to show TENT, which has automatically labeled the bilin- the steps in the system. gual alignments between one gloss and one trans- lation word. As a result, multi-word expressions (2) yakko-ga wakko-o butai-ni agar-ase-ta yakko-n wakko-a stage-on rise-cause-past are not considered in the TRS model unless they yakko made wakko rise onto the stage [jpn] are explicit in the glosses. Many of the words in the translation line are not aligned with a gloss, The source line, gold glosses, and the translation so an additional null label is included. The fea- line are as they appear in the corpus. tures for each label include the translation word, its lemma as provided by the StanfordNLP API 4.1 Modeling (Manning et al., 2014), and the POS tag and de- Conditional Random Fields (CRF) are able to clas- pendency structure for the translation word as pro- sify sequences of tokens with a large number of vided by INTENT (again, see Appendix A for an possible labels while being sensitive to the con- example). The TRS model then outputs the fol- text in which the tokens appear (Lafferty et al., lowing predicted sequence for the translation line 2001) and have been shown to be effective in in Ex. 2: low-resource settings (Ruokolainen et al., 2013). The CRF models were built using sklearn-crfsuite (4) yakko NA NA NA NA NA NA 4 v0.3.6. The training algorithm uses stochastic NA stands for Not Aligned and is the most likely gradient descent with L2 regularization and a max- tag for the model to output. The content words that imum of 50 iterations. would be expected to be aligned in the translation The SRC model predicts a gloss for each mor- line, wakko, rise, and stage, are not aligned in this pheme in the source line. When training, the sys- case due to wakko and rise being OOV items, and tem takes in complete IGT and uses the glosses stage having only been seen once in the training provided as the gold training labels. The first data. For further discussion of the TRS model’s whitespace-separated token in the source line behavior, see 6. For both models, tokens that § is assumed to align with the first whitespace- contain only punctuation are labeled with the gloss separated token in the gloss line, the second source PUNC. Additionally, a dummy label is included in token with the second gloss, and so forth. While case of reference errors while accessing the data the SRC model is able to take advantage of the or when the features are not available. This may context provided by adjacent morphemes, it must be the case with punctuation or with non-English also be provided with explicit features for source words that the StanfordNLP lemmatizer is not able word boundaries. The features for each label in- to process. clude the source morpheme, the current source word, the previous and following words, and 4.2 Integrating Model Hypotheses whether or not the previous and following mor- At test time the given source line and its transla- phemes are included in the current word (see Ap- tion line are processed by their respective mod- pendix A for an example). No processing of the els. The output of each model is then assessed by source language, such as POS tags or dependency the system. 
4.2 Integrating Model Hypotheses

At test time the given source line and its translation line are processed by their respective models. The output of each model is then assessed by the system. The system first checks whether the source tokens and their predicted glosses have co-occurred in the training data and whether the translation tokens and their predicted glosses have co-occurred in the training data. If a gloss is predicted by both models and is supported by the training data, it is saved as a final prediction. If the SRC and TRS models disagree and the TRS model's prediction is supported by the training data, the TRS model's prediction is saved as the final prediction. If the original source token has been seen in the training data, but an exact match was not predicted by the translation line, the SRC model's prediction takes precedence. This is motivated by the fact that source tokens that are labeled with grams may not be aligned with a token in the translation. If the source morpheme has not previously been seen, it is assumed to be a stem, and the lemma of an aligned translation lemma is used as the gloss (see §6 for further discussion). If the source token is both unseen and unaligned, the system first checks to see if there is an exact match between the morpheme and a translation word. Otherwise, the system separates the predicted grams, as identified by the gram list, from the SRC model's predicted gloss. Based on the grams, the system attempts to match the morpheme with a translation lemma with the same POS tag or argument role, using the grams to predict the morpheme's POS tag and the INTENT metadata to identify the translation words' POS tags or dependency structure. For example, if a case marker such as nominative is predicted, the system will look for a noun marked as the subject in the translation tokens. This process is implemented for nouns and verbs since OOV items are most likely to be in those categories. Finally, if the model is still unsure of the final prediction, the system selects the lemma of an unaligned translation word or the word itself if it cannot be lemmatized.

[Figure 1: Visualization of the system]

Continuing with the example from the previous section, the system now has the prediction information from Ex. 3 and 4. The system confirms that it has seen yakko, ga, o, ni, ase, and ta glossed as yakko, n, acc, dat, cause, and past, respectively, so it keeps those as final predictions. The system has seen butai in the training data but not glossed as taro, so it replaces the SRC model's prediction with the previously seen gloss, stage. The token wakko is an OOV item, but an exact match is found in the translation line, so the token itself is used as the gloss, replacing pizza. The token agar is also an OOV item, but because no grams were predicted by the TRS model, the system does not make any assumptions about the source POS tag and defaults to the token predicted by the SRC model. The resulting final prediction is:

(5) yakko-n wakko-acc stage-dat sit-cause-past
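The cascade just described can be summarized per source token. The following is a simplified sketch under my own reading of §4.2: it assumes the TRS prediction has already been mapped to the aligned source token (or None if unaligned), and it omits the gram-based POS/argument-role matching step; all names are hypothetical.

    def choose_gloss(src_tok, src_gloss, trs_gloss,
                     seen_pairs, seen_tokens, trs_words):
        """seen_pairs: (source morpheme, gloss) pairs attested in training;
        seen_tokens: source morphemes attested in training; trs_gloss:
        gloss predicted for the aligned translation word, or None."""
        if trs_gloss == src_gloss and (src_tok, src_gloss) in seen_pairs:
            return src_gloss   # models agree and the pair is attested
        if trs_gloss and (src_tok, trs_gloss) in seen_pairs:
            return trs_gloss   # disagreement: attested TRS gloss wins
        if src_tok in seen_tokens:
            return src_gloss   # seen token: SRC prediction takes precedence
        if src_tok in trs_words:
            return src_tok     # unseen, but matches a translation word exactly
        # (real system: gram-based POS/argument-role matching here)
        return trs_gloss or src_gloss

    # Ex. 2: 'wakko' is OOV and unaligned but appears in the translation:
    print(choose_gloss("wakko", "pizza", None, set(), set(),
                       {"yakko", "made", "wakko", "rise", "onto", "the", "stage"}))
    # -> 'wakko'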
4.3 Evaluation

The system's performance is evaluated by comparing each gloss in each test IGT's final output to the gold standard glosses provided in the datasets. The system produces a label for each morpheme, so the recall provides no additional information. Comparing the final output in Ex. 5 with the gold gloss in Ex. 2, yakko, n, wakko, stage, cause, and past are correct for a total of 6/9. The system precision is given in terms of the micro-average over all tokens in all the IGT in each language's test dataset.

I further analyze the system output by breaking down the system performance in terms of stems and grams. Labels are identified as grams or stems during the scoring process using a list of grams collected during the development of ODIN. The ODIN gram list covers many frequently used categories such as person, gender, and case and has multiple realizations for each category's values.

There may be morpheme labels that contain multiple glosses, each separated by a period. In these cases, the predicted label is evaluated as a whole when scoring the system accuracy. When determining the system performance over stems and grams, however, the predicted label is split on each period and each gloss is checked against the ODIN gram list to determine if it is a gram or not. The gold label is also split if it contains at least one period. For each gloss in the gold label, if it is seen in the predicted label, it is considered correct, regardless of the order. Because the system may predict a label that has more or fewer glosses than the gold label, both the precision and recall are calculated. Each metric is presented in terms of the micro-average over all the stem tokens and the micro-average over all the gram tokens.
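A minimal sketch of this scoring scheme, assuming a small stand-in for the ODIN gram list (the real list is far larger and was compiled during ODIN's development):

    GRAMS = {"nom", "acc", "dat", "past", "pass", "cause"}  # stand-in gram list

    def is_gram(g):
        return g.lower() in GRAMS

    def score(pred, gold):
        """Whole-label accuracy plus period-split stem/gram precision and
        recall counts (micro-averaged over all test IGT in the real system)."""
        acc = sum(p == g for p, g in zip(pred, gold)) / len(gold)
        hits_p = {"stem": 0, "gram": 0}; tot_p = {"stem": 0, "gram": 0}
        hits_r = {"stem": 0, "gram": 0}; tot_r = {"stem": 0, "gram": 0}
        for p_label, g_label in zip(pred, gold):
            pp, gp = p_label.split("."), g_label.split(".")
            for x in pp:                          # precision: predicted glosses
                k = "gram" if is_gram(x) else "stem"
                tot_p[k] += 1
                hits_p[k] += x in gp              # order-insensitive match
            for x in gp:                          # recall: gold glosses
                k = "gram" if is_gram(x) else "stem"
                tot_r[k] += 1
                hits_r[k] += x in pp
        return acc, hits_p, tot_p, hits_r, tot_r

    # Ex. 5 vs. the gold glosses of Ex. 2, with morphemes flattened:
    pred = ["yakko", "n", "wakko", "acc", "stage", "dat", "sit", "cause", "past"]
    gold = ["yakko", "n", "wakko", "a", "stage", "on", "rise", "cause", "past"]
    print(score(pred, gold)[0])  # 0.666..., i.e. 6/9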

Ex. 2 does not contain any instances of a single label containing multiple glosses, so the combined score for the stems and grams is not different from the morpheme score. In a more complicated example from the Japanese dataset originally from Bobaljik (n.d.), there are two instances of multi-gloss labels, last.night and by.dat.

(6) yuube kuruma-ga doroboo-ni nusum-are-ta
    last.night car-nom robber-by.dat steal-pass-past
    Last night, cars were stolen by a thief. [jpn]

The SRC model predicts the sequence japanese car-nom thief-by steal-pass-past. The TRS model predicts that last, night, and thief are glosses. The rest of the words are not predicted to be aligned, and the final output is determined to be last car-nom thief-by steal-pass-past. In this output, the predicted label for yuube is missing a stem, night, thief is predicted instead of robber, and the predicted label for ni is missing a gram, dat. The morpheme score is 5/8, but the stem precision is 3/4, the gram precision is 4/4, the stem recall is 3/5, and the gram recall is 4/5.

5 Results

The results of all the development languages vary greatly, ranging from 53.2% to 77.8% accuracy.5 There is a noticeable trend in which the relative model accuracy is predictable from the number of test IGT, with the exception of the Russian dataset. Table 1 shows the number of test IGT, training IGT, and system accuracy per development language. Table 2 shows the same information for the held-out languages, with an increasing number of training IGT over the same test set for Abui and Chintang. In addition to training on the full training sets, I also train the system on the initial 25%, 50%, and 75% of the training data for Abui and Chintang to see the effect of training set size on the system accuracy and train again on a random 25%, 50%, and 75% of the training data to see the effect of vocabulary overlap. These datasets include IGT from different documentation sessions, so the assumption is that consecutive IGT are more likely to have been created at the same time and therefore contain repeated words. These sets are all tested using the same IGT in the test set for the full training data experiment.

5 Code and instructions for reproducing these results are available at https://github.com/mcmillanmajora/IGTautoglossing.

Lang. [ISO 639-3]   Train   Test   Acc
Japanese [jpn]       2062    229   77.8%
Korean [kor]         1956    217   75.6%
Norwegian [nob]       958    107   63.1%
Turkish [tur]         894     99   60.3%
Italian [ita]         732     81   59.9%
Russian [rus]        1322    147   53.2%

Table 1: Development languages, number of IGT training and test instances for each model, and test accuracy.

Lang. [ISO 639-3]   Train   Test   Acc
Matsigenka [mcb]      388     43   84.9%
Chintang [ctn]       6589    677   80.3%
  initial 75%        4941    677   74.6%
  random 75%         4941    677   74.3%
  initial 50%        3294    677   72.6%
  random 50%         3294    677   72.5%
  initial 25%        1646    677   68.7%
  random 25%         1646    677   69.0%
Abui [abz]           4295    447   71.7%
  initial 75%        3224    447   69.9%
  random 75%         3224    447   70.4%
  initial 50%        2149    447   68.7%
  random 50%         2149    447   69.1%
  initial 25%        1076    447   66.1%
  random 25%         1076    447   64.9%

Table 2: Held-out languages, number of training and test IGT, and test accuracy. Training instances were selected randomly if random or from the beginning of the dataset if initial. Test IGT were held constant.

5.1 Development Languages

Among the development languages, the system had the highest accuracies with the Korean and Japanese datasets at 75.6% and 77.8%. The Japanese training set had just over 2000 IGT and the Korean training set had just under 2000 IGT. Both sets had slightly more than 200 test IGT. The system performed less well over the Italian, Turkish, and Norwegian datasets at 59.9%, 60.3%, and 63.1%, respectively. These datasets had less than half the data of the Japanese and Korean datasets. The system performed worst over the Russian dataset, at 53.2% accuracy on 1322 training instances, almost a third more than the Norwegian dataset.

A clearer pattern in the system's performance over the development languages arises when the labels are broken down into stems and grams, as seen in Table 3.

              Prec.           Rec.
Lang.         Stem    Gram    Stem    Gram
jpn           73.3%   88.2%   71.6%   85.4%
kor           72.1%   83.0%   70.5%   80.5%
nob           63.8%   73.5%   62.7%   65.8%
tur           61.7%   63.8%   61.1%   56.1%
ita           63.6%   60.6%   62.6%   48.8%
rus           60.9%   67.4%   59.9%   49.2%

Table 3: Analysis of system performance on development languages with precision and recall for stems and grams.
For stems, precision scores range between 60.9% and 73.3% and recall scores range between 59.9% and 71.6%, whereas the precision scores for grams range between 60.6% and 88.2% and the recall scores range between 48.8% and 85.4%. Japanese, Korean, and Norwegian all have higher scores for grams than stems in both precision and recall. That trend reverses for Turkish, Italian, and Russian, where the recall for grams is lower than for stems. Japanese, Korean, and Turkish have much lower ratios of stems to grams, each having about 3 stem morphemes for every 2 grams. Russian, Italian, and Norwegian have about 5, 7, and 10 stems, respectively, for every 2 grams. Norwegian's high ratio is likely due to the syntactic similarity between it and English, which makes glossing with inflected English words easier. Because grams are often not annotated as separate morphemes, poor recall on grams would contribute to overall lower scores on morpheme accuracy even if the stem is correctly predicted, because the evaluation considers the predicted label as a whole. This is particularly true in the ODIN data, which also suffers from errors introduced when extracting IGT from linguistic papers and from what Lewis and Xia (2008) call IGT bias. IGT are most likely presented for a specific phenomenon that is unique within the language and is overly represented in the paper compared to broader contexts. As a result, the set of IGT pulled from a single paper are likely skewed and the glossing may reflect the focus on a particular portion of the sentence, if a full sentence is given.

5.2 Held-out Languages

The system achieved higher accuracies over the Matsigenka and Chintang datasets than the development sets and comparable accuracies for the Abui dataset. The system achieved a higher accuracy for Matsigenka, 84.9%, than it did for any of the development datasets, which all had at least twice as much training data. The system was also trained for randomized and initial subsets of the training data for Abui and Chintang, resulting in 7 total experiments for each language. Table 2 shows the results on the various splits. The Abui results range from 66.1% to 71.7% on 447 test IGT, and the Chintang results range from 69% to 80.3% on 677 test IGT.
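The initial-versus-random subset construction used in these experiments is straightforward to reproduce. A sketch under the stated setup (the seed handling is my own choice):

    import random

    def training_subset(train, fraction, mode, seed=0):
        """Return the first N IGT of the corpus ('initial') or a random
        sample of the same size ('random'); the test IGT are held
        constant across all subsets, as in Table 2."""
        n = int(len(train) * fraction)
        if mode == "initial":
            return train[:n]
        return random.Random(seed).sample(train, n)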
The held-out languages do pattern with the well-performing development datasets in terms of higher precision and recall for grams than stems. Table 4 shows the gram precision ranging from 79.6% to 96.0% and the gram recall ranging from 81.4% to 95.8% over all of the datasets. The stem scores have greater ranges, from 51% to 73.5% for precision and 51% to 72.5% for recall. The Chintang and Abui subsets do not differ more than 2% accuracy between the randomized and the non-randomized training set pairs. The Chintang stem precision and recall increase the most between the 75% and full sets, but the Abui stems see the biggest increase between the 25% and 50% subsets.

              Prec.           Rec.
Lang.         Stem    Gram    Stem    Gram
mcb           73.5%   96.0%   70.3%   95.8%
ctn           71.2%   92.5%   69.9%   92.9%
  init. 75%   60.7%   92.2%   60.4%   92.9%
  rand. 75%   60.5%   92.0%   59.9%   92.7%
  init. 50%   57.2%   91.1%   56.2%   92.2%
  rand. 50%   57.3%   91.1%   56.2%   92.1%
  init. 25%   51.0%   88.7%   51.0%   91.1%
  rand. 25%   51.1%   89.0%   51.3%   91.3%
abz           70.3%   83.4%   72.5%   85.8%
  init. 75%   68.4%   81.9%   70.6%   84.5%
  rand. 75%   69.0%   82.7%   71.1%   85.1%
  init. 50%   66.9%   81.4%   68.8%   83.4%
  rand. 50%   67.8%   82.1%   69.5%   84.5%
  init. 25%   63.4%   79.6%   65.6%   82.9%
  rand. 25%   63.0%   79.9%   65.0%   81.4%

Table 4: Analysis of system performance on held-out languages with precision and recall for stems and grams.

Samardžić et al. (2015) achieve 96% accuracy on 200,000 test word tokens in the Chintang dataset using approximately 800,000 word tokens for training. My system is maximally tested on 7250 Chintang morphemes using only 55,000 training morphemes and achieves 80.3% accuracy. My system also does not assume any language-specific metadata, while Samardžić et al. make use of a Chintang lexicon containing high-quality source POS tags. They also provide an analysis of their system's performance over lexical labels (stems) and functional labels (grams). In general, their model's performance over grams increases with the training set size, while the performance over stems remains fairly constant. Samardžić et al. attribute this pattern to the sequential inclusion of IGT collected from source texts that differ lexically or stylistically as well as differing annotation schema over these sources.

6 Error Analysis

In investigating the predictions made by the models and the final output glosses, a number of inconsistencies in the ODIN datasets became apparent. Processing errors occur when there are a mismatched number of source morphemes and gloss labels, such as when a multi-word expression is used as a single gloss and contains whitespace or when a coindexation variable is included in the source line as a separate token. Some instances also include additional punctuation indicating clausal boundaries. Authors of linguistic papers use IGT to illustrate syntactic and semantic properties of languages and these additional annotations are often included to highlight the relevant information for the audience.

Due to the wide range of authors from which the ODIN IGT originate, many grams may refer to the same grammatical concept, as shown in Ex. 2 and 6 from the Japanese dataset. The morpheme ga indicates the nominative case, but is labeled as n in Ex. 2 and nom in Ex. 6. The system treats these labels as unique though they are intended to be synonymous. In contrast to the unintended ambiguity, Ex. 7 and Ex. 9 below both contain the Chintang morpheme lo, but in Ex. 7 it is labeled as okay and in Ex. 9 it is labeled surp, as the morpheme indicates the speaker's surprise.

(7) lo sat na maha na
    okay seven top not top
    okay, not seven [ctn] (Bickel et al., 2009)

Furthermore, lo can also appear as a nominative suffix for the interrogative pronoun sa, meaning who (Paudyal, 2015). While these functions are difficult for the system to differentiate, it can learn the contexts for each function given enough examples and consistent annotation. Multiple labels for the same function, however, will cause the system to try to discriminate between instances of the same context, as in the case of the truly ambiguous morphemes. Furthermore, the high accuracy over the test languages suggests that the consistency of the annotations has a stronger effect on the system performance than dataset size.

The system also contributes a number of consistent errors. For example, in this IGT from the Korean dataset the system relies too heavily on the source line, ignoring the correct TRS model predictions.

(8) emeni-ka us-usi-ess-up-nita
    mother-nom smile-sh-pst-pol-dec
    mother smiled [kor] (Yang, 1994)

The SRC model predicts the sequence mother-nom miss-hon-pst-pol-dec and the TRS model predicts that mother and smile are glosses; however, the system keeps the incorrect gloss miss from the SRC model because us and miss co-occurred in the training data. This suggests that overall system performance might improve if the source predictions were preferred for grams and the translation predictions for stems.

However, across all of the languages, the TRS model frequently predicts only the null label, as seen in Ex. 4. The training data alignments sometimes do not include alignments between grams and English function words, so a significant portion of the information in the translation line is not incorporated into the model. Including a preprocessing step to supplement the INTENT alignments by aligning English function words with likely glosses, such as was and past, may improve the TRS model accuracy by decreasing the likelihood of the null label.
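One way to realize the alignment supplement suggested above would be a lookup from English function words to the grams they typically realize. The mapping and function below are hypothetical, offered only to make the proposal concrete; they are not part of the implemented system.

    # Hypothetical function-word-to-gram lookup; a real table would be
    # derived from the training data or a linguist-provided list.
    FUNCTION_WORD_GRAMS = {"was": "past", "were": "past", "will": "fut"}

    def supplement_alignments(trs_words, gold_glosses, alignments):
        """Add (word index, gloss) pairs for unaligned function words
        whose likely gram actually occurs in the gold gloss line."""
        glosses = {g for label in gold_glosses for g in label.split(".")}
        aligned = {i for i, _ in alignments}
        extras = [(i, FUNCTION_WORD_GRAMS[w.lower()])
                  for i, w in enumerate(trs_words)
                  if i not in aligned
                  and FUNCTION_WORD_GRAMS.get(w.lower()) in glosses]
        return alignments + extras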
The system often fails to find Chintang morpheme lo, but in Ex. 6 it is labeled the correct stem, and even when it does find the as okay and in Ex. 8 it is labeled surp as in the stem, it may not be a direct match with gold gloss. morpheme indicates the speaker’s surprise. (9) yo-ni terso lo nang (7) lo sat na maha na dem.across-dir straight surp but okay seven top not top there straightly [ctn] (Bickel et al., 2009) okay, not seven [ctn] (Bickel et al., 2009) In predicting the glosses for the source line in Furthermore, lo can also appear as a nominative Ex. 9, the SRC model outputs the sequence suffix for the interrogative pronoun sa, meaning dem.across-dir really surp but. The system identi- who (Paudyal, 2015). While these functions are fies terso and straightly as OOV items, but fails to difficult for the system to differentiate, it can learn lemmatize straightly to straight. the contexts for each function given enough ex- This example also shows that the stem and gram amples and consistent annotation. Multiple labels scores for the held-out languages are not entirely for the same function, however, will cause the sys- accurate, as the non-ODIN annotations contain tem to try to discriminate between instances of the grams like surp not covered by the ODIN gram list. While this doesn’t affect the the overall mor- English translations, the submodules for POS tag- pheme score, it may indicate that the patterns seen ging and dependency parsing could be modified in the held-out data stem and gram scores don’t re- to support documentation efforts using other high- flect the system’s true performance as reliably as resource languages. Further modification of the the patterns over the development data. Allowing feature input system would allow users to make for project-specific gram lists may improve and use of any additional resources available to their provide more confidence in gram and stem scores. project. Confidence scores on all output labels The differing annotation schemata also make would also help the end user in quickly identify- it difficult to draw cross-linguistic conclusions as ing possible OOV or ambiguous tokens.6 Once each annotation schema is founded in a different the model performance has been optimized over set of theoretical assumptions. These experiments, the available datasets, the true test of the system however, do show some of the challenges that ma- would be to monitor usability and its effect on the chine learning techniques have with language as a number of human hours required in an ongoing data type as opposed to other sequential data. Be- documentation project, as in Palmer et al. (2009). cause of the learning algorithm’s reliance on the surrounding context of each label to make predic- 8 Conclusion tions, the linguistic properties that introduce more This work outlines an initial supervised system for possible answers to a morpheme’s label due to am- automatically annotating IGT given a morpheme- biguous contexts make the predictions more diffi- segmented source phrase and its translation. The cult. For example, non-concatenative morphology, system uses CRFs to predict the glosses from the highly polysemous source morphemes, and irreg- source and translation lines individually and com- ularities in word order will all compound to make bines the information in a heuristic fashion to form the information that the algorithm is able to learn a final prediction. The system was developed on from the training data more sparse. 
All languages contain these complexities to some degree, but the amount that is present in the training data will have a large effect on the system performance.

7 Future Work

Over all the languages, the system performance would improve by modifying how the system balances the information from the SRC and TRS models. Providing confidence scores for each predicted gloss and reducing the influence of the SRC model are immediate steps toward better accuracies. A pretrained TRS model over multiple language datasets may also minimize the number of OOV items in the model, thereby increasing the confidence of non-null glosses. Georgi (2016) saw a boost in the precision of alignments between the gloss line and the translation line using this technique with a statistical aligner, though the heuristic approach ultimately had a better F1 score due to higher recall. Georgi proposed that this was due to the variable word order of the gloss line when combining data from across languages, which suggests that the classification approach may be more robust to this variation as the model is learning the mapping from the translation word to the gloss rather than the alignment itself.

While the current implementation focuses on English translations, the submodules for POS tagging and dependency parsing could be modified to support documentation efforts using other high-resource languages. Further modification of the feature input system would allow users to make use of any additional resources available to their project. Confidence scores on all output labels would also help the end user in quickly identifying possible OOV or ambiguous tokens.6 Once the model performance has been optimized over the available datasets, the true test of the system would be to monitor usability and its effect on the number of human hours required in an ongoing documentation project, as in Palmer et al. (2009).

6 Thank you to an anonymous reviewer for this suggestion.

8 Conclusion

This work outlines an initial supervised system for automatically annotating IGT given a morpheme-segmented source phrase and its translation. The system uses CRFs to predict the glosses from the source and translation lines individually and combines the information in a heuristic fashion to form a final prediction. The system was developed on six languages from ODIN, and tested on held-out languages. The held-out language datasets were provided by linguists and native speaker collaborators, modeling the intended use case of a documentation project. An intrinsic evaluation shows that the system performs better on the held-out language datasets than the development data from ODIN, but the error analysis suggests that this is due to differences in annotation practices. Further work is needed to improve the system's final prediction selection, particularly with regards to OOV items.

Acknowledgments

Thank you to Emily Bender, Fei Xia, Michael Goodman, Ryan Georgi, David Inman, Olga Zamaraeva, Kristen Howell, and the anonymous reviewers for their comments and contributions to this work.

References

David Baines. 2009. FieldWorks Language Explorer (FLEx). eLEX2009, page 27.

Jason Baldridge and Alexis Palmer. 2009. How well does active learning actually work? Time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of EMNLP 2009.

Emily M. Bender, Joshua Crowgey, Michael Wayne Goodman, and Fei Xia. 2014. Learning grammar specifications from IGT: A case study of Chintang. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 43–53.

Balthasar Bickel, Goma Banjade, Toya N. Bhatta, Martin Gaenzle, Netra P. Paudyal, Manoj Rai, Novel Kishore Rai, Ichchha Purna Rai, and Sabine Stoll. 2009. Audiovisual corpus of the Chintang language, including a longitudinal corpus of language acquisition by six children, plus a trilingual dictionary, paradigm sets, grammar sketches, ethnographic descriptions, and photographs. DoBeS, Universität Leipzig, Nijmegen, Leipzig.

Jonathan David Bobaljik. n.d. 321 Syntax I Lecture Notes: Class 4: NP-Movement.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, ACL '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3:1–3:34.

Ryan Georgi. 2016. From Aari to Zulu: Massively multilingual creation of language tools using interlinear glossed text. Ph.D. thesis, University of Washington, Seattle, WA, USA.

Ryan Georgi, Fei Xia, and William Lewis. 2012. Improving dependency parsing with interlinear glossed text and syntactic projection. In Proceedings of COLING 2012: Posters, pages 371–380. The COLING 2012 Organizing Committee.

Michael Wayne Goodman, Joshua Crowgey, Fei Xia, and Emily M. Bender. 2015. Xigt: Extensible interlinear glossed text for natural language processing. Language Resources and Evaluation, 49(2):455–485.

Ken Hale, Michael Krauss, Lucille J. Watahomigie, Akira Y. Yamamoto, Colette Craig, LaVerne Masayesva Jeanne, and Nora C. England. 1992. Endangered languages. Language, 68(1):1–42.

Heidi Britton Harley. 1995. Subjects, events, and licensing. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, USA.

Frantisek Kratochvíl. 2017. Abui corpus. Electronic database: 162,000 words of natural speech, and 37,500 words of elicited material.

Michael Krauss. 1992. The world's languages in crisis. Language, 68(1):4–10.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

William D. Lewis and Fei Xia. 2008. Automatically identifying computationally relevant typological features. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II, pages 685–690, Hyderabad, India.

William D. Lewis and Fei Xia. 2010. Developing ODIN: A multilingual repository of annotated language data for hundreds of the world's languages. Literary and Linguistic Computing, 25(3):303–319.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Lev Michael, Christine Beier, Zachary O'Hagan, Harold Vargas Pereira, and Jose Vargas Pereira. 2013. Matsigenka text written by Matsigenka authors.

Alexis Palmer, Taesun Moon, and Jason Baldridge. 2009. Evaluating automation strategies in language documentation. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pages 36–44, Boulder, Colorado. Association for Computational Linguistics.

Alexis Palmer, Taesun Moon, Jason Baldridge, Katrin Erk, Eric Campbell, and Telma Can. 2010. Computational strategies for reducing annotation effort in language documentation. Linguistic Issues in Language Technology, 3(4):1–42.

Netra P. Paudyal. 2015. Aspects of Chintang syntax. Ph.D. thesis, University of Zurich, Zurich, Switzerland.

Telma Can Pixabaj, Miguel Angel Vicente Méndez, María Vicente Méndez, and Oswaldo Ajcot Damián. 2007. Text Collections in Four Mayan Languages. Archived in The Archive of the Indigenous Languages of Latin America.

Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 29–37, Sofia, Bulgaria. Association for Computational Linguistics.

Tanja Samardžić, Robert Schikowski, and Sabine Stoll. 2015. Automatic interlinear glossing as two-level sequence classification. In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pages 68–72.

Frank Seifart, Nicholas Evans, Harald Hammarström, and Stephen C. Levinson. 2018. Language documentation twenty-five years on. Language, 94(4):E324–E345.

Libin Shen, Giorgio Satta, and Aravind Joshi. 2007. Guided learning for bidirectional sequence classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 760–767, Prague, Czech Republic. Association for Computational Linguistics.

Inna Vinnitskaya, Suzanne Flynn, and Claire Foley. 2003. The acquisition of relative clauses in a third language: Comparing adults and children. In Proceedings of the 6th Generative Approaches to Second Language Acquisition Conference, pages 340–345.

Fei Xia and William D. Lewis. 2008. Repurposing theoretical linguistic data for tool development and search. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I.

Byong-Seon Yang. 1994. Morphosyntactic phenomena of Korean in role and reference grammar: Psych-verb constructions, inflectional verb morphemes, complex sentences, and relative clauses. Ph.D. thesis, State University of New York at Buffalo, Buffalo, NY, USA.

Olga Zamaraeva, Kristen Howell, and Emily M. Bender. 2019. Handling cross-cutting properties in automatic inference of lexical classes: A case study of Chintang. In Proceedings of the Workshop on Computational Methods for Endangered Languages, volume 1, page 5.

A Model Features

Using Ex. 2 as an illustration of the tagging process at test time, the system takes the source line as input then formats it to be fed into the SRC model. The representations for the first three morphemes can be seen in Table 5, where i is the current position in the sequence, m_i is the current morpheme, w_i is the current word, w_{i-1} is the previous word, m_{i+1} in w_i is the following morpheme if it occurs within the same word as m_i, and so on. The value BOS refers to the beginning of the sentence, and the value for the w_{i+1} feature for phrase-final morphemes is EOS, which refers to the end of the sentence.

feat. name       i = m1      i = m2      i = m3 ...
m_i              yakko       ga          wakko
w_i              yakko-ga    yakko-ga    wakko-o
w_{i-1}          BOS         BOS         yakko-ga
w_{i+1}          wakko-o     wakko-o     butai-ni
m_{i-1} in w_i   NONE        yakko       NONE
m_{i+1} in w_i   ga          NONE        o

Table 5: Feature representation of the source line.
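A sketch of how these feature dicts might be computed from a segmented source line; the helper names (word_of, src_token_features) are mine, not the system's.

    def src_token_features(morphemes, words, word_of, i):
        """Feature dict for the i-th morpheme, mirroring Table 5. words is
        the list of whitespace-separated source words; word_of[i] is the
        index of the word containing morpheme i."""
        w = word_of[i]
        same_prev = i > 0 and word_of[i - 1] == w
        same_next = i + 1 < len(morphemes) and word_of[i + 1] == w
        return {
            "m_i": morphemes[i],
            "w_i": words[w],
            "w_i-1": words[w - 1] if w > 0 else "BOS",
            "w_i+1": words[w + 1] if w + 1 < len(words) else "EOS",
            "m_i-1 in w_i": morphemes[i - 1] if same_prev else "NONE",
            "m_i+1 in w_i": morphemes[i + 1] if same_next else "NONE",
        }

    # For Ex. 2:
    morphemes = ["yakko", "ga", "wakko", "o", "butai", "ni", "agar", "ase", "ta"]
    words = ["yakko-ga", "wakko-o", "butai-ni", "agar-ase-ta"]
    word_of = [0, 0, 1, 1, 2, 2, 3, 3, 3]
    # src_token_features(morphemes, words, word_of, 0) reproduces the
    # i = m1 column of Table 5.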

Again using Ex. 2, the TRS model would take the translation line as input and format it to be fed into the model. The representations for the first three words can be seen in Table 6, where i is the current position in the sequence, tw_i is the current translation word, ds_i is the dependency structure tag of the current word as given by the INTENT system, ps_i is the POS tag as given by INTENT, and lem_i is the lemma of the word as given by the StanfordNLP lemmatizer.

feat. name   i = tw1   i = tw2   i = tw3 ...
tw_i         yakko     made      wakko
ds_i         nsubj     root      dobj
ps_i         nnp       vbd       nnp
lem_i        yakko     make      wakko

Table 6: Feature representation of the translation line.