From Interlinear Glossed Texts to Paradigms
IGT2P: From Interlinear Glossed Texts to Paradigms

Sarah Moeller, Ling Liu, Changbing Yang, Katharina Kann, Mans Hulden
University of Colorado
[email protected]

Abstract

An intermediate step in the linguistic analysis of an under-documented language is to find and organize inflected forms that are attested in natural speech. From this data, linguists generate unseen inflected word forms in order to test hypotheses about the language's inflectional patterns and to complete inflectional paradigm tables. To get the data, linguists spend many hours manually creating interlinear glossed texts (IGTs). We introduce a new task that speeds this process and automatically generates new morphological resources for natural language processing systems: IGT-to-paradigms (IGT2P). IGT2P generates entire morphological paradigms from IGT input. We show that existing morphological reinflection models can solve the task with 21% to 64% accuracy, depending on the language. We further find that (i) having a language expert spend only a few hours cleaning the noisy IGT data improves performance by as much as 21 percentage points, and (ii) POS tags, which are generally considered a necessary part of NLP morphological reinflection input, have no effect on the accuracy of the models considered here.

Figure 1: Inflected word forms attested in interlinear glossed texts (IGT) train a transformer encoder-decoder to generalize morphological paradigmatic patterns and generate word forms when given known morphosyntactic features of missing paradigm cells. Noisy paradigms are automatically constructed from IGT and a language expert creates "cleaned" paradigms. Both sets are tested on the same missing word forms and the results are compared.
1 Introduction

Over the last few years, multiple shared tasks have encouraged the development of systems for learning morphology, including generating inflected forms of the canonical form—the lemma—of a word. NLP systems that account for morphology can reduce data sparsity caused by an abundance of individual word forms in morphologically rich languages (Cotterell et al., 2016, 2017a, 2018; McCarthy et al., 2019; Vylomova et al., 2020) and help mitigate bias in training data for natural language processing (NLP) systems (Zmigrod et al., 2019). However, such systems have often been limited to languages with publicly available structured data, i.e., languages for which tables containing inflectional patterns can be found, for example, in online dictionaries like Wiktionary.¹ This limits the development of NLP systems for morphology to languages for which morphological information can be easily extracted.

Here, we propose to instead make use of a resource which is much more common, especially for low-resource languages: we explore how to leverage interlinear glossed text (IGT)—a common artifact of linguistic field research—to generate unseen forms of inflectional paradigms, as illustrated in Figure 1. This task, which we call IGT-to-paradigms (IGT2P), differs from the existing morphological inflection task (Yarowsky and Wicentowski, 2000; Faruqui et al., 2016) in three aspects: (1) inflected forms extracted from IGT are noisier than curated training data for morphological generation, (2) since lemmas are not explicitly identified in IGT, systems cannot be trained on typical lemma-to-form mappings and, instead, must be trained on form-to-form mappings, and (3) part-of-speech (POS) tags are often unavailable in IGT. IGT2P can thus be seen as a noisy version of morphological reinflection (Cotterell et al., 2016), but without explicit POS information. Our experiments show that morphological reinflection systems following preprocessing are strong baselines for this task.

We further perform two analyses:

(i) Part-of-speech (POS) tags are usually considered necessary inputs for learning morphological generation. However, they are frequently missing from IGT, since they result from a later step in a linguist's pipeline. Thus, we ask: are POS tags necessary for morphological generation? Surprisingly, we find that POS tags are of little use for morphological generation systems.

(ii) How much does manual cleaning of IGT data by a domain expert improve performance? As expected, cleaning the data improves performance across the board with a transformer model: by 1.27% to 16.32%, depending on the language.

We examine which inflection model performs better on noisy and cleaned IGT data and how the performance varies across languages and data quality or size.

¹ https://www.wiktionary.org

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5251–5262, November 16–20, 2020. © 2020 Association for Computational Linguistics

2 A New Morphological Task: IGT2P

2.1 Background: Morphological Generation

An inflectional paradigm is illustrated in tables such as Table 1. Paradigms can be large; for example, Polish verb paradigms can have up to 30 cells, and other languages may have several more. Here we define the notation related to morphological inflection systems for the remainder of this paper. We denote the paradigm of a lemma ℓ as:

    π(ℓ) = ⟨ f(ℓ, t⃗_γ) ⟩_{γ ∈ Γ(ℓ)}    (1)

where f : Σ* × T → Σ* defines a mapping from a tuple consisting of the lemma and a vector t⃗_γ ∈ T of morphological features to the corresponding inflected form. Σ is an alphabet of discrete symbols, i.e., the characters used in the natural language. Γ(ℓ) is the set of slots in lemma ℓ's paradigm. We will abbreviate f(ℓ, t⃗_γ) as f_γ(ℓ) for simplicity. Using this notation, we now describe the most important generation tasks from the computational morphology literature.

               present          past
            sing.    pl.    sing.    pl.
  1 person   am      are    was     were
  2 person   are     are    were    were
  3 person   is      are    was     were

Table 1: The inflectional paradigm of the English verb "to be". This verb has more inflected forms than any other English lemma, but is quite small compared to paradigms in many other languages.

Morphological inflection. The task of morphological inflection consists of generating unknown inflected forms, given a lemma ℓ and a feature vector t⃗_γ. Thus, it corresponds to learning the mapping f : Σ* × T → Σ*.

Morphological reinflection. Morphological reinflection is a generalized version of the previous task. Here, instead of having a lemma as input, systems are given some inflected form f(ℓ, t⃗_γ1) – optionally together with t⃗_γ1 – and a target feature vector t⃗_γ2. The goal is then to produce the inflected form f(ℓ, t⃗_γ2).

Paradigm completion. The task of paradigm completion consists of, given a partial paradigm π_P(ℓ) = ⟨ f(ℓ, t⃗_γ) ⟩_{γ ∈ Γ_P(ℓ)} of a lemma ℓ, generating all inflected forms for all slots γ ∈ Γ(ℓ) − Γ_P(ℓ). Training data for this task consists of entire paradigms.

Unsupervised morphological paradigm completion. For the unsupervised version of the paradigm completion task, systems are given a corpus D = w_1, …, w_|D| with a vocabulary V of word types {w_i} and a lexicon L = {ℓ_j} with |L| lemmas belonging to the same part of speech. However, no explicit paradigms are observed during training. The task of unsupervised morphological paradigm completion then consists of generating the paradigms {π(ℓ)}_{ℓ ∈ L} of all lemmas ℓ ∈ L.
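To make the notation concrete, a paradigm π(ℓ) can be modeled as a mapping from feature vectors t⃗_γ to surface forms f_γ(ℓ). The Python sketch below is illustrative only, not code from the paper; the feature labels ("1", "SG", "PRS", …) and helper names are assumptions for this example. It encodes the "to be" paradigm from Table 1 and derives both the lemma-to-form examples used in morphological inflection and the form-to-form examples used in morphological reinflection:

```python
# Illustrative sketch: a paradigm pi(lemma) as a mapping from
# morphosyntactic feature vectors t_gamma to inflected forms f(lemma, t_gamma).
# Feature labels ("1", "SG", "PRS", ...) are assumed for this example.
PARADIGM_BE = {
    ("1", "SG", "PRS"): "am",   ("1", "PL", "PRS"): "are",
    ("2", "SG", "PRS"): "are",  ("2", "PL", "PRS"): "are",
    ("3", "SG", "PRS"): "is",   ("3", "PL", "PRS"): "are",
    ("1", "SG", "PST"): "was",  ("1", "PL", "PST"): "were",
    ("2", "SG", "PST"): "were", ("2", "PL", "PST"): "were",
    ("3", "SG", "PST"): "was",  ("3", "PL", "PST"): "were",
}

def inflection_pairs(lemma, paradigm):
    """Lemma-to-form examples, as in morphological inflection:
    (lemma, target features) -> target form."""
    return [((lemma, feats), form) for feats, form in paradigm.items()]

def reinflection_pairs(paradigm):
    """Form-to-form examples, as in morphological reinflection:
    (source form, source features, target features) -> target form.
    No lemma is needed, only the knowledge that the forms share one."""
    items = list(paradigm.items())
    return [((src_form, src_feats, tgt_feats), tgt_form)
            for src_feats, src_form in items
            for tgt_feats, tgt_form in items
            if src_feats != tgt_feats]
```

For this 12-slot paradigm, `inflection_pairs` yields 12 lemma-to-form examples while `reinflection_pairs` yields 12 × 11 = 132 form-to-form examples; the form-to-form view is what makes training possible when, as in IGT, lemmas are never explicitly identified.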
2.2 IGT-to-Paradigms

The task we propose, IGT-to-paradigms (IGT2P), can be described as the paradigm completion problem above, with an additional step of inference regarding which of the attested forms is associated with which lemma.

Formally, systems are given IGTs consisting of words with – potentially empty – morphological feature vectors: D = (w_1, t⃗_1), …, (w_|D|, t⃗_|D|), and a list U = {u_j} with |U| inflected words, u_j = f(ℓ_j, t⃗_γj). The goal of IGT2P is to generate the paradigms {π(ℓ_j)}_{f(ℓ_j, t⃗_γj) ∈ U}.

Similar to unsupervised paradigm completion, we do not assume information about the lemma to be explicit. Similar to morphological reinflection, the input includes word forms with features, and a system has to learn to generate inflections from other word forms and morphological feature vectors. IGT2P is further similar to paradigm completion in that we aim at generating all inflected forms for each lemma.²

2.3 Why IGT2P?

Descriptive linguistics aims to objectively analyze primary language data in new languages and publish descriptions of their structure. This work informs our understanding of human language and provides resources for NLP development through academic literature, which informs projects such as UniMorph (Kirov et al., 2016), or through crowdsourced effort such as Wiktionary.

IGTs encode higher levels of linguistic information. They are often archived in long-term repositories, and openly accessible for non-commercial purposes, yet they are rarely utilized in NLP.

IGT2P has potential benefits not only for NLP (by increasing available resources in low-resource languages) but also for linguistic inquiry. First, since machine assistance has been shown to increase the speed and accuracy of manual linguistic annotation with just 60% model accuracy (Felt, 2012), such a model could assist the initial analysis of morphological patterns in IGT. Second, by quickly learning morphological patterns from word forms attested in IGT, IGT2P generates forms that fill empty cells in a lemma's paradigm. Since IGTs are unlikely to contain complete paradigms of lemmas, an accompanying step in fieldwork is the elicitation of inflectional paradigms for selected lemmas. Presenting candidate words to a native speaker for acceptance or rejection is often easier than asking the speaker to grasp the abstract concept of a paradigm and to generate the missing cells in a table. With the help of IGT2P, linguists could use the machine-generated word forms to support this elicitation process. IGT2P then becomes a tool for the discovery of morphological patterns in under-described and endangered languages.

3 Related Work

IGT for NLP. The AGGREGATION project (Bender, 2014) has used IGT to automatically con-