Fortification of Neural Morphological Segmentation Models For
Total Page:16
File Type:pdf, Size:1020Kb
Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages Katharina Kann∗ Manuel Mager∗ Center for Information and Instituto de Investigaciones en Language Processing Matematicas´ Aplicadas y en Sistemas LMU Munich, Germany Universidad Nacional Autonoma´ de Mexico´ [email protected] [email protected] Ivan Meza-Ruiz Hinrich Schutze¨ Instituto de Investigaciones en Center for Information and Matematicas´ Aplicadas y en Sistemas Language Processing Universidad Nacional Autonoma´ de Mexico´ LMU Munich, Germany [email protected] [email protected] Abstract 1 Introduction Morphological segmentation for polysyn- Due to the advent of computing technologies thetic languages is challenging, because to indigenous communities all over the world, a word may consist of many individual natural language processing (NLP) applications morphemes and training data can be ex- for languages with limited computer-readable tremely scarce. Since neural sequence- textual data are getting increasingly important. to-sequence (seq2seq) models define the This contrasts with current research, which fo- state of the art for morphological seg- cuses strongly on approaches which require large mentation in high-resource settings and amounts of training data, e.g., deep neural net- for (mostly) European languages, we first works. Those are not trivially applicable to show that they also obtain competitive per- minimal-resource settings with less than 1; 000 formance for Mexican polysynthetic lan- available training examples. We aim at closing this guages in minimal-resource settings. We gap for morphological surface segmentation, the then propose two novel multi-task train- task of splitting a word into the surface forms of its ing approaches—one with, one without smallest meaning-bearing units, its morphemes. need for external unlabeled resources—, Recovering morphemes provides information and two corresponding data augmentation about unknown words and is thus especially im- methods, improving over the neural base- portant for polysynthetic languages with a high line for all languages. Finally, we explore morpheme-to-word ratio and a consequently large cross-lingual transfer as a third way to for- overall number of words. To illustrate how seg- tify our neural model and show that we can mentation helps understanding unknown multiple- train one single multi-lingual model for morpheme words, consider an example in this pa- related languages while maintaining com- per’s language of writing: even if the word uncon- parable or even improved performance, ditionally did not appear in a given training corpus, thus reducing the amount of parameters by its meaning could still be derived from a combina- close to 75%. We provide our morphologi- tion of its morphs un, condition, al and ly. cal segmentation datasets for Mexicanero, Due to its importance for down-stream tasks Nahuatl, Wixarika and Yorem Nokki for (Creutz et al., 2007; Dyer et al., 2008), segmenta- future research. tion has been tackled in many different ways, con- *The∗ first two authors contributed equally. sidering unsupervised (Creutz and Lagus, 2002), supervised (Ruokolainen et al., 2013) and semi- Mexicanero Nahuatl Wixarika Yorem N. supervised settings (Ruokolainen et al., 2014). frq. m. frq. m. frq. m. frq. m. Here, we add three new questions to this line of re- 136 ni 155 o 327 p+ 102 k search: (i) Are data-hungry neural network models 128 ki 99 ni 230 ne 88 m applicable to segmentation of polysynthetic lan- 114 ti 84 ti 173 p 87 ne guages in minimal-resource settings? (ii) How can 105 u 81 k 169 ti 83 ka the performance of neural networks for surface 70 s 61 tl 167 ka 79 ta segmentation be improved if we have only unla- 44 mo 59 mo 98 u 54 po beled or no external data at hand? (iii) Is cross- 42 ka 55 s 97 ta 50 e’ lingual transfer for this task possible between re- 39 a 52 ki 95 a 36 ye lated languages? The last two questions are cru- 31 nich 48 i 92 pe 36 su cial: While for many languages it is difficult to 31 $i 43 tla 91 e 36 ri obtain the number of annotated examples used in 24 ta 39 ’ke 80 r 34 a earlier work on (semi-)supervised methods, a lim- 24 l 34 nech 74 wa 31 me ited amount might still be obtainable. 22 tahtanili 31 no 69 me 30 wa We experiment on four polysynthetic Mexican 21 no 27 ya 68 ni 30 re languages: Mexicanero, Nahuatl, Wixarika and 17 ya 27 tli 68 ke 27 na Yorem Nokki (details in x2). The datasets we use 17 t 24 x 66 eu 24 wi are, as far as we know, the first computer-readable 17 ke 23 tlanilia 58 ye 24 a datasets annotated for morphological segmenta- 17 ita 23 e 57 ri 23 te tion in those languages. 16 piya 21 tika 52 tsi 20 si Our experiments show that neural seq2seq mod- 15 an 21 n 52 te 16 ’wi els perform on par with or better than other strong Table 1: The most frequent morphs (m.) together baselines for our polysynthetic languages in a with their frequencies (frq.) in our datasets. minimal-resource setting. However, we further propose two novel multi-task approaches and two new data augmentation methods. Combining them 2 Polysynthetic Languages with our neural model yields up to 5:05% abso- lute accuracy or 3:40% F1 improvements over our Polysynthetic languages are morphologically rich strongest baseline. languages which are highly synthetic, i.e., sin- gle words can be composed of many individual Finally, following earlier work on cross-lingual morphemes. In extreme cases, entire sentences knowledge transfer for seq2seq tasks (Johnson consist of only one single token, whereupon “ev- et al., 2017; Kann et al., 2017), we investigate ery argument of a predicate must be expressed training one single model for all languages, while by morphology on the word that contains that as- sharing parameters. The resulting model performs signer” (Baker, 2006). This property makes sur- comparably to or better than the individual mod- face segmentation of polysynthetic languages at els, but requires only roughly as many parameters the same time complex and particularly relevant as one single model. for further linguistic analysis. In this paper, we experiment on four polysyn- Contributions. To sum up, we make the follow- thetic languages of the Yuto-Aztecan family ing contributions: (i) we confirm the applicability (Baker, 1997), with the goal of improving the of neural seq2seq models to morphological seg- performance of neural seq2seq models. The lan- mentation of polysynthetic languages in minimal- guages will be described in the rest of this section. resource settings; (ii) we propose two novel multi-task training approaches and two novel data Mexicanero is a Western Peripheral Nahuatl augmentation methods for neural segmentation variant, spoken in the Mexican state of Durango models; (iii) we investigate the effectiveness of by approximately one thousand people. This di- cross-lingual transfer between related languages; alect is isolated from the rest of the other branches and (iv) we provide morphological segmentation and has a strong process of Spanish stem incorpo- datasets for Mexicanero, Nahuatl, Wixarika and ration, while also having borrowed some suffixes Yorem Nokki. from that language (Vanhove et al., 2012). It is common to see Spanish words mixed with Nahu- Mexicanero Nahuatl Wixarika Yorem N. atl agglutinations. In the following example we train 427 540 665 511 can see an intrasentencial mixing of Spanish (in dev 106 134 176 127 uppercases) and Mexicanero: test 355 449 553 425 total 888 1123 1394 1063 ujnijye MALO – I was sick Table 2: Number of examples in the final data splits for all languages. Nahuatl is a large subgroup of the Yuto- Aztecan language family, and, including all of its Yorem Nokki is part of Taracachita subgroup of variants, the most spoken native language in Mex- the Yuto-Aztecan language family. Its Southern ico. Its almost two million native speakers live dialect is spoken by close to forty thousand people mainly in Puebla, Guerrero, Hidalgo, Veracruz, in the Mexican states of Sinaloa and Sonora, while and San Luis Potosi, but also in Oaxaca, Durango, its Northern dialect has about twenty thousand Modelos, Mexico City, Tlaxcala, Michoacan, Na- speakers. In this work, we consider the South- yarit and the State of Mexico. Three dialectical ern dialect. The nominal morphology of Yorem groups are known: Central Nahuatl, Occidental Nokki is rather simple, but, like in the other Yuto- Nahuatl and Oriental Nahuatl. The data collected Aztecan languages, the verb is highly complex. Its for this work belongs to the Oriental branch spo- alphabet consists of 28 characters and contains 8 ken by 70 thousand people in Northern Puebla. different vowels. An example verb is: Like all languages of the Yuto-Aztecan family, ko’korejyejne – I was sick Nahuatl is agglutinative and one word can consist of a combination of many different morphemes. 3 Morphological Segmentation Datasets Usually, the verb functions as the stem and gets extended by morphemes specifying, e.g., subject, To create our datasets, we make use of both seg- patient, object or indirect object. The most com- mentable (i.e., consisting of multiple morphemes) mon syntax sequence for Nahuatl is SOV. An ex- and non-segmentable (i.e., consisting of one single ample word is: morpheme) words described in books of the col- lection Archive of Indigenous Languages in Mexi- ojnejmojkokowajya – I was sick canero (Canger, 2001), Nahuatl (Lastra de Suarez´ , 1980), Wixarika (Gomez´ and Lopez´ , 1999), and Wixarika is a language spoken in the states of Yorem Nokki (Freeze, 1989). Statistics about the Jalisco, Nayarit, Durango and Zacatecas in Cen- data in the four languages are displayed in Ta- tral West Mexico by approximately fifty thousand bles1,2 and3.