Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages

Katharina Kann* (Center for Information and Language Processing, LMU Munich, Germany) [email protected]
Manuel Mager* (Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México) [email protected]

Ivan Meza-Ruiz (Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México) [email protected]
Hinrich Schütze (Center for Information and Language Processing, LMU Munich, Germany) [email protected]

Abstract

Morphological segmentation for polysynthetic languages is challenging, because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performance for Mexican polysynthetic languages in minimal-resource settings. We then propose two novel multi-task training approaches, one with and one without the need for external unlabeled resources, as well as two corresponding data augmentation methods, improving over the neural baseline for all languages. Finally, we explore cross-lingual transfer as a third way to fortify our neural model and show that we can train one single multi-lingual model for related languages while maintaining comparable or even improved performance, thus reducing the amount of parameters by close to 75%. We provide our morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for future research.

1 Introduction

Due to the advent of computing technologies in indigenous communities all over the world, natural language processing (NLP) applications for languages with limited computer-readable textual data are getting increasingly important. This contrasts with current research, which focuses strongly on approaches that require large amounts of training data, e.g., deep neural networks. Those are not trivially applicable to minimal-resource settings with less than 1,000 available training examples. We aim at closing this gap for morphological surface segmentation, the task of splitting a word into the surface forms of its smallest meaning-bearing units, its morphemes.

* The first two authors contributed equally.

Recovering morphemes provides information about unknown words and is thus especially important for polysynthetic languages with a high morpheme-to-word ratio and a consequently large overall number of words. To illustrate how segmentation helps understanding unknown multiple-morpheme words, consider an example in this paper's language of writing: even if the word unconditionally did not appear in a given training corpus, its meaning could still be derived from a combination of its morphs un, condition, al and ly.

Due to its importance for downstream tasks (Creutz et al., 2007; Dyer et al., 2008), segmentation has been tackled in many different ways, considering unsupervised (Creutz and Lagus, 2002), supervised (Ruokolainen et al., 2013) and semi-supervised settings (Ruokolainen et al., 2014). Here, we add three new questions to this line of research: (i) Are data-hungry neural network models applicable to segmentation of polysynthetic languages in minimal-resource settings? (ii) How can the performance of neural networks for surface segmentation be improved if we have only unlabeled or no external data at hand? (iii) Is cross-lingual transfer for this task possible between related languages? The last two questions are crucial: while for many languages it is difficult to obtain the number of annotated examples used in earlier work on (semi-)supervised methods, a limited amount might still be obtainable.

We experiment on four polysynthetic Mexican languages: Mexicanero, Nahuatl, Wixarika and Yorem Nokki (details in §2). The datasets we use are, as far as we know, the first computer-readable datasets annotated for morphological segmentation in those languages.

Our experiments show that neural seq2seq models perform on par with or better than other strong baselines for our polysynthetic languages in a minimal-resource setting. We further propose two novel multi-task approaches and two new data augmentation methods. Combining them with our neural model yields up to 5.05% absolute accuracy or 3.40% F1 improvements over our strongest baseline.

Finally, following earlier work on cross-lingual knowledge transfer for seq2seq tasks (Johnson et al., 2017; Kann et al., 2017), we investigate training one single model for all languages, while sharing parameters. The resulting model performs comparably to or better than the individual models, but requires only roughly as many parameters as one single model.

Contributions. To sum up, we make the following contributions: (i) we confirm the applicability of neural seq2seq models to morphological segmentation of polysynthetic languages in minimal-resource settings; (ii) we propose two novel multi-task training approaches and two novel data augmentation methods for neural segmentation models; (iii) we investigate the effectiveness of cross-lingual transfer between related languages; and (iv) we provide morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki.

    Mexicanero      Nahuatl          Wixarika        Yorem N.
    frq.  m.        frq.  m.         frq.  m.        frq.  m.
    136   ni        155   o          327   p+        102   k
    128   ki         99   ni         230   ne         88   m
    114   ti         84   ti         173   p          87   ne
    105   u          81   k          169   ti         83   ka
     70   s          61   tl         167   ka         79   ta
     44   mo         59   mo          98   u          54   po
     42   ka         55   s           97   ta         50   e'
     39   a          52   ki          95   a          36   ye
     31   nich       48   i           92   pe         36   su
     31   $i         43   tla         91   e          36   ri
     24   ta         39   'ke         80   r          34   a
     24   l          34   nech        74   wa         31   me
     22   tahtanili  31   no          69   me         30   wa
     21   no         27   ya          68   ni         30   re
     17   ya         27   tli         68   ke         27   na
     17   t          24   x           66   eu         24   wi
     17   ke         23   tlanilia    58   ye         24   a
     17   ita        23   e           57   ri         23   te
     16   piya       21   tika        52   tsi        20   si
     15   an         21   n           52   te         16   'wi

Table 1: The most frequent morphs (m.) together with their frequencies (frq.) in our datasets.

2 Polysynthetic Languages

Polysynthetic languages are morphologically rich languages which are highly synthetic, i.e., single words can be composed of many individual morphemes. In extreme cases, entire sentences consist of only one single token, whereupon "every argument of a predicate must be expressed by morphology on the word that contains that assigner" (Baker, 2006). This property makes surface segmentation of polysynthetic languages at the same time complex and particularly relevant for further linguistic analysis.

In this paper, we experiment on four polysynthetic languages of the Yuto-Aztecan family (Baker, 1997), with the goal of improving the performance of neural seq2seq models. The languages are described in the rest of this section.

Mexicanero is a Western Peripheral Nahuatl variant, spoken in Mexico by approximately one thousand people. This dialect is isolated from the other branches and has a strong process of Spanish stem incorporation, while also having borrowed some suffixes from that language (Vanhove et al., 2012). It is common to see Spanish words mixed with Nahuatl agglutinations. In the following example we can see an intrasentential mixing of Spanish (in uppercase) and Mexicanero:

u|ni|ye MALO – I was sick

Nahuatl is a large subgroup of the Yuto-Aztecan language family and, including all of its variants, the most spoken native language in Mexico. Its almost two million native speakers live mainly in Puebla, Guerrero, Hidalgo, Veracruz, and San Luis Potosi, but also in Oaxaca, Durango, Morelos, Mexico City, Tlaxcala, Michoacan, Nayarit and the State of Mexico. Three dialectical groups are known: Central Nahuatl, Occidental Nahuatl and Oriental Nahuatl. The data collected for this work belongs to the Oriental branch, spoken by 70 thousand people in Northern Puebla. Like all languages of the Yuto-Aztecan family, Nahuatl is agglutinative, and one word can consist of a combination of many different morphemes. Usually, the verb functions as the stem and gets extended by morphemes specifying, e.g., subject, patient, object or indirect object. The most common syntax sequence for Nahuatl is SOV. An example word is:

o|ne|mo|kokowa|ya – I was sick

Wixarika is a language spoken in several states of Central West Mexico, among them Durango, by approximately fifty thousand people. It belongs to the Coracholan group of languages within the Yuto-Aztecan family. Wixarika has five vowels {a, e, i, +, u}¹ with long and short variants. An example for a word in the language is:

ne|p+|ti|kuye|kai – I was sick

Like Nahuatl, it has an SOV syntax, with heavy agglutination on the verb. Wixarika is morphologically more complex than other languages from the same family, because it incorporates more information into the verb (Leza and López, 2006). This leads to a higher number of morphemes per word, as can also be seen in Table 3.

¹ While linguists often use a dashed i (ɨ) to denote this vowel, in practice almost all native speakers use a plus symbol (+). In this work, we choose to use the latter.

Yorem Nokki is part of the Taracachita subgroup of the Yuto-Aztecan language family. Its Southern dialect is spoken by close to forty thousand people in the Mexican states of Sinaloa and Sonora, while its Northern dialect has about twenty thousand speakers. In this work, we consider the Southern dialect. The nominal morphology of Yorem Nokki is rather simple, but, like in the other Yuto-Aztecan languages, the verb is highly complex. Its alphabet consists of 28 characters and contains 8 different vowels. An example verb is:

ko'kore|ye|ne – I was sick

3 Morphological Segmentation Datasets

To create our datasets, we make use of both segmentable (i.e., consisting of multiple morphemes) and non-segmentable (i.e., consisting of one single morpheme) words described in books of the collection Archive of Indigenous Languages in Mexicanero (Canger, 2001), Nahuatl (Lastra de Suárez, 1980), Wixarika (Gómez and López, 1999), and Yorem Nokki (Freeze, 1989). Statistics about the data in the four languages are displayed in Tables 1, 2 and 3. We include segmentable as well as non-segmentable words into our datasets in order to ensure that our methods can correctly decide against splitting up single morphemes. The phrases in all languages are mostly parallel, such that the corpora are roughly equivalent. Therefore, we can compare the morphology of translated words (cf. Table 3), noticing that the language with most agglutination is Wixarika, with an average rate of 3.25 morphemes per word; the other languages have an average of close to 2.2 morphemes per word. This higher morphological complexity naturally produces data sparsity at the token level. Also, we can notice that Wixarika has more unique words than the rest of our studied languages. However, Nahuatl has, with 810, the highest number of unique morphemes.

Final splits. In order to make follow-up work on minimal-resource settings for morphological segmentation easily comparable, we provide predefined splits of our datasets.² 40% of the data constitute the test sets. Of the remaining data, we use 20% for development and the rest for training. The final numbers of words per dataset and language are shown in Table 2.

² Our datasets can be found together with the code of our models at http://turing.iimas.unam.mx/wix/MexSeg.

         Mexicanero  Nahuatl  Wixarika  Yorem N.
train        427       540       665       511
dev          106       134       176       127
test         355       449       553       425
total        888      1123      1394      1063

Table 2: Number of examples in the final data splits for all languages.

             Mex.   Nahuatl  Wixarika  Yorem N.
Words         888     1123      1385      1063
SegWords      539      746      1131       774
Morphs       1889     2467      4502      2266
UniMorphs     602      810       653       662
Seg/W       0.606    0.664     0.816     0.728
Morphs/W    2.127    2.196     3.250     2.131
MaxMorphs       7        6        10        10

Table 3: Number of words, segmentable words (SegWords), total morphs (Morphs), and unique morphs (UniMorphs) in our datasets. Seg/W: proportion of words consisting of more than one morpheme; Morphs/W: morphemes per word; MaxMorphs: maximum number of morphemes found in one word.

4 Neural Seq2seq Models for Segmentation

In this section, we introduce our neural architecture for segmentation. In the next, we first describe our two proposed multi-task training approaches and second our data augmentation methods, and finally elaborate on expected differences between the two.

4.1 Character-Based Encoder-Decoder RNN

Following work on segmentation by Kann et al. (2016) for high-resource settings, our approach is based on the neural seq2seq model introduced by Bahdanau et al. (2015) for machine translation.

Encoder. The first part of our model is a bidirectional recurrent neural network (RNN) which encodes the input sequence, i.e., the sequence of characters of a given word w = w_1, w_2, ..., w_{T_v}, represented by the corresponding embedding vectors v_{w_1}, ..., v_{w_{T_v}}. In particular, our encoder consists of one gated recurrent unit (GRU) network which processes the input in forward direction and a second GRU which processes the input from the opposite side. Encoding with this bidirectional GRU yields the forward hidden state →h_i = f(→h_{i-1}, v_i) and the backward hidden state ←h_i = f(←h_{i+1}, v_i), for a non-linear activation function f. Their concatenation h_i = [→h_i; ←h_i] is passed on to the decoder.

Decoder. The second part of our network, the decoder, is a single GRU, defining a probability distribution over strings in (Σ ∪ S)*, for an alphabet Σ and a separation symbol S:

    p_ED(c | w) = ∏_{t=1}^{T_c} p(c_t | c_1, ..., c_{t-1}, w),

where p(c_t | c_1, ..., c_{t-1}, w) is computed using an attention mechanism and an output softmax layer over Σ ∪ S. A more detailed description of the general attention-based encoder-decoder architecture can be found in the original paper by Bahdanau et al. (2015).
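To make the data format concrete, the sketch below (our own illustration, not the authors' released code) builds the character-level source and target sequences such an encoder-decoder is trained on: the input is the plain character sequence of a word, and the target is the same sequence with the separation symbol S inserted at morph boundaries.

```python
# Character-level seq2seq examples for surface segmentation.
# SEP plays the role of the separation symbol S in the decoder's
# output alphabet (Sigma union {SEP}).
SEP = "|"

def make_example(morphs):
    """Turn a gold morph sequence into (source, target) character lists."""
    word = "".join(morphs)       # surface form, e.g. "nep+tikuyekai"
    target = SEP.join(morphs)    # segmented form, e.g. "ne|p+|ti|kuye|kai"
    return list(word), list(target)

src, tgt = make_example(["ne", "p+", "ti", "kuye", "kai"])
print("".join(src))   # nep+tikuyekai
print("".join(tgt))   # ne|p+|ti|kuye|kai
```

Non-segmentable words simply map to themselves under this scheme, which is one reason the datasets include them.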

5 Improving Neural Models for Segmentation

5.1 Multi-Task Training

In order to leverage unlabeled data or even random strings during training, we define an autoencoding auxiliary task, which consists of encoding the input and decoding an output which is identical to the original string. Our multi-task training objective is then to maximize the joint log-likelihood of this auxiliary task and our segmentation main task:

    L(θ) = Σ_{(w,c) ∈ T} log p_θ(c | e(w)) + Σ_{a ∈ A} log p_θ(a | e(a))    (1)

T denotes the segmentation training data, with examples consisting of a word w and its segmentation c. A denotes either a set of words in the language of the system or a set of random strings. The function e describes the encoder and depends on the model parameters θ, which are shared across the two tasks. For training, we use data from both sets at the same time and mark each example with an additional, task-specific input symbol.

We treat the size of A as a hyperparameter which we optimize on the development set separately for each language. Values we experiment with are m times the amount of instances in the original training set, with m ∈ {1, 2, 4, 8}.³

There are multiple reasons why we expect multi-task training to improve the performance of the final model. First, multi-task training should act as a regularizer. Second, for our models, the segmentation task consists in large parts of learning to copy the input character sequence to the output. This, however, can be learned from any string and does not require annotated segmentation boundaries. Third, in the case of unlabeled data (i.e., not for random strings), we expect the character language model in the decoder to improve, since it is trained on additional data.

We denote models trained with multi-task training using unlabeled corpus data as MTT-U and models trained with multi-task training using random strings as MTT-R.

5.2 Data Augmentation

A second option to make use of unlabeled data or random strings is to extend the available training data with new examples made from those. The main question to answer here is how to include the new data into the existing datasets. We do this by building new training examples in a fashion similar to the multi-task setup. All newly created instances are of the form

    w ↦ w,    (2)

where either w ∈ V, with V being the observed vocabulary of the language, e.g., words in a given unlabeled corpus, or w ∈ R, with R being a set of sequences of random characters from the alphabet Σ of the language.

Again, we treat the amount of additional training examples as a hyperparameter which we optimize on the development set separately for each language. We explore m times the amount of instances in the original training set, with m ∈ {1, 2, 4, 8}. The reasons why we expect our data augmentation methods to lead to better segmentation models are similar to those for multi-task training.

We call models trained on datasets augmented with unlabeled corpus data or random strings DA-U or DA-R, respectively.

5.3 Differences Between Multi-Task Training and Data Augmentation

The difference between MTT-U (resp. MTT-R) and DA-U (resp. DA-R) is a single element in the input sequence (the one representing the task). However, this information enables the model to handle each given instance correctly at inference time. As a result, it gets more robust against noisy data, which seems crucial for our way of using unlabeled corpora. Consider, for example, the Nahuatl word onemokokowaya. Training on

    onemokokowaya ↦ onemokokowaya

will make the model learn not to segment words which consist of the morphemes o, ne, mo, kokowa, ya, which should ultimately hurt performance. The multi-task approach, in contrast, mitigates this problem.

In conclusion, we expect the data augmentation approach with unlabeled data not to obtain outstanding performance, but rather consider it an important and informative baseline for the corresponding multi-task approach. Using random strings, the difference between the multi-task and the data augmentation approaches is less obvious: real morphemes should appear rarely enough in the created random character sequences to avoid the negative effect which we expect for corpus words. We thus assume that the performances of MTT-R and DA-R should be similar.

6 Experiments

6.1 Data

We apply our models to the datasets described in §3. For multi-task training and data augmentation using unlabeled data, we use (unsegmented) words from a parallel corpus collected by Gutierrez-Vasques et al. (2016) for Nahuatl and the closely related Mexicanero. For Wixarika we use data from Mager et al. (2018), and for Yorem Nokki we use text from Maldonado Martínez et al. (2010).
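As a concrete illustration of §5.1 and §5.2 (our own sketch, not the authors' released code), the snippet below builds the extra training material both methods share: identity pairs w ↦ w taken either from an unlabeled vocabulary or from random strings over the alphabet Σ. For multi-task training, each example is additionally prefixed with a task-marker symbol; for plain data augmentation it is not. The marker names <AUTO> and <SEG> are our assumptions for illustration.

```python
import random

def random_strings(alphabet, n, min_len=3, max_len=10, seed=0):
    """Sample n random character sequences over the given alphabet."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet)
                    for _ in range(rng.randint(min_len, max_len)))
            for _ in range(n)]

def auxiliary_examples(extra_words, multi_task=False):
    """Build w -> w identity pairs; optionally prefix a task marker (MTT)."""
    marker = ["<AUTO>"] if multi_task else []
    return [(marker + list(w), list(w)) for w in extra_words]

def segmentation_examples(pairs, multi_task=False):
    """Gold (word, segmentation) pairs; optionally prefix a task marker."""
    marker = ["<SEG>"] if multi_task else []
    return [(marker + list(w), list(c)) for w, c in pairs]

# DA-R style: mix unmarked identity pairs into the training set.
aug = auxiliary_examples(random_strings(list("aeikmnptu"), n=4))
# MTT-U style: the same kind of pairs, but marked as the auxiliary task.
mtt = auxiliary_examples(["onemokokowaya"], multi_task=True)
print(mtt[0][0][0])   # <AUTO>
```

The size of the extra set (m times the labeled data) is the hyperparameter tuned on the development set in the paper.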
6.2 Baselines

We now describe the baselines we use to evaluate the overall performance of our approaches.

Supervised seq2seq RNN (S2S). As a first baseline, we employ a fully supervised neural model without data augmentation or multi-task training, i.e., an attention-based encoder-decoder RNN (Bahdanau et al., 2015) which has been trained only on the available annotated data.

Semi-supervised MORFESSOR (MORF). We further compare to the semi-supervised version of MORFESSOR (Kohonen et al., 2010), a well-known morphological segmentation system. During training, we tune the hyperparameters for each language on the respective development set. The best performing model is applied to the test set.

FlatCat (FC). Our next baseline is FlatCat (Grönroos et al., 2014), a variant of MORFESSOR. It consists of a hidden Markov model for segmentation. The states of the model correspond either to a word boundary or to one of the four morph categories stem, prefix, suffix, and non-morpheme. It can work in an unsupervised way but, similar to the previous baseline, can make effective use of small amounts of labeled data.

CRF. We further compare to a conditional random field (CRF) (Lafferty et al., 2001) model, in particular a strong discriminative model for segmentation by Ruokolainen et al. (2014). It reduces the task to a classification problem with four classes: beginning of a morph, middle of a morph, end of a morph, and single-character morph. Training is again semi-supervised, and the model was previously reported to obtain good results for small amounts of unlabeled data (Ruokolainen et al., 2014), which makes it very suitable for our minimal-resource setting.

³ An exception is Yorem Nokki, for which we do not have enough unlabeled data available, such that we experiment only with m ∈ {1, 2}.

6.3 Hyperparameters

Neural network parameters. All GRUs in both the encoder and the decoder have 100-dimensional hidden states. All embeddings are 300-dimensional. For training, we use ADADELTA (Zeiler, 2012) with a minibatch size of 20. We initialize all weights to the identity matrix and biases to zero (Le et al., 2015). All models are trained for a maximum of 200 epochs, but we evaluate after every 5 epochs and apply the best performing model at test time. Our final reported results are averaged accuracies over 5 single training runs.

Optimizing the amount of auxiliary task data. The performance of our neural segmentation model in dependence of the amount of auxiliary task training data can be seen in Figure 1. As a general tendency across all languages, adding more data seems better, particularly for the autoencoding task with random strings. The only exception is Wixarika. The final configurations we choose for m (cf. §5.1) in the case of multi-task training with the auxiliary task of autoencoding corpus data are m = 4 for Mexicanero, Nahuatl and Wixarika and m = 1 for Yorem Nokki. For multi-task training with autoencoding of random strings we select m = 8 for Mexicanero, Nahuatl and Yorem Nokki and m = 4 for Wixarika.

Optimizing the amount of artificial training data for data augmentation. Figure 2 shows the performance of the encoder-decoder depending on the amount of added artificial training data. In the case of random strings, again, adding more training data seems to help more. However, using corpus data seems to hurt performance, and the more such examples we use, the worse accuracy we obtain. Thus, we conclude that (as expected) data augmentation with corpus data is not a good way to improve the model's performance. We will discuss this in more detail in §6.5. Even though the final conclusion should be not to add much corpus data, we apply what gives best results on the development set. The final configurations we thus choose for DA-U are m = 1 for Mexicanero, Wixarika and Yorem Nokki and m = 2 for Nahuatl. For DA-R, we select m = 4 for Mexicanero, Wixarika and Yorem Nokki and m = 8 for Nahuatl.

6.4 Evaluation Metrics

Accuracy. First, we evaluate using accuracy on the token level. Thus, an example counts as correct if and only if the output of the system matches the reference solution exactly, i.e., if all output symbols are predicted correctly.

F1.
Our second evaluation metric is border F1, which measures how many segment boundaries are predicted correctly by the model. While we use this metric because it is common for segmentation tasks, it is not ideal for our models, since those are not guaranteed to preserve the input character sequence. We handle this problem as follows: in order to compare borders, we identify them by the position of their preceding letter, i.e., if in both the model's guess and the gold solution a segment border appears after the second character, it counts as correct. Wrong characters are ignored. Note that this comes with the disadvantage of erroneously inserted characters leading to all subsequent segment borders being counted as incorrect.

[Figure 1: Accuracy on the development set in dependence of the amount of auxiliary task training data for multi-task learning. Two panels (MTT with corpus data, MTT with random strings); x-axis: 1, 2, 4 and 8 times the labeled data; y-axis: % accuracy; curves for Mexicanero, Nahuatl, Wixarika and Yorem Nokki.]

[Figure 2: Accuracy on the development set in dependence of the amount of additional training data. Two panels (DA with corpus data, DA with random strings); same axes and languages as Figure 1.]

6.5 Test Results and Discussion

Table 4 shows that accuracy and F1 seem to be highly correlated for our task. The test results also give an answer to our first research question: the neural model S2S performs on par with CRF, the strongest baseline, for all languages but Nahuatl. Further, S2S and CRF both outperform MORF and FC by a wide margin. We may thus conclude that neural models are indeed applicable to segmentation of polysynthetic languages in a low-resource setting.

Second, we can see that all our proposed methods except for DA-U improve over S2S, the neural baseline: the accuracy of MTT-U is between 0.0141 (Wixarika) and 0.0547 (Mexicanero) higher than S2S's. MTT-R improves between 0.0380 (Wixarika) and 0.0532 (Yorem Nokki). Finally, DA-R outperforms S2S by 0.0367 to 0.0479 accuracy for Yorem Nokki and Mexicanero, respectively. The overall picture when considering F1 looks similar. Comparing our approaches to each other, there is no clear winner. This might be due to differences in the unlabeled data we use: the corpus we use for Mexicanero and Nahuatl is from dialects different from both respective test sets. Assuming that the effects of training a language model using unlabeled data and of erroneously learning to not segment words are working against each other for MTT-U, this might explain why MTT-U is best for Mexicanero and why the gap between MTT-U and MTT-R is smaller for Nahuatl than for Yorem Nokki and Wixarika.
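The two metrics of §6.4 can be sketched as follows (our own reading of the description, not the authors' evaluation script): accuracy is exact match on the full output string, and border F1 compares boundary positions identified by the index of the preceding character.

```python
def accuracy(predictions, references):
    """Token-level accuracy: the whole output string must match exactly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def borders(segmentation, sep="|"):
    """Border positions, identified by the preceding character's index in
    the unsegmented word, e.g. 'ne|mo' -> {2}."""
    out, i = set(), 0
    for ch in segmentation:
        if ch == sep:
            out.add(i)
        else:
            i += 1
    return out

def border_f1(predictions, references, sep="|"):
    """Border F1 over a corpus of (predicted, gold) segmentations."""
    tp = fp = fn = 0
    for p, r in zip(predictions, references):
        bp, br = borders(p, sep), borders(r, sep)
        tp += len(bp & br)
        fp += len(bp - br)
        fn += len(br - bp)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Note that this simplified sketch assumes the predicted characters equal the input; as discussed in §6.4, an erroneously inserted or deleted character shifts all subsequent border positions.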
As mentioned before (cf. §5.3), a simple data augmentation method using unlabeled data should hurt performance. This is indeed the result of our experiments: DA-U performs worse than S2S for all languages except for Mexicanero, where the unlabeled corpus is from another language: the closely related Nahuatl. We thus conclude that multi-task training (instead of simple data augmentation) is crucial for the use of unlabeled data.

Finally, our methods compare favorably to all baselines, with the exception of CRF for Nahuatl. While CRF is overall the strongest baseline for our considered languages, our methods outperform it by up to 0.0214 accuracy or 0.0147 F1 for Mexicanero, 0.0322 accuracy or 0.0229 F1 for Wixarika, and 0.0505 accuracy or 0.0340 F1 for Yorem Nokki. This shows the effectiveness of our fortified neural models for minimal-resource morphological segmentation.

Accuracy:
           MTT-U  MTT-R  DA-U   DA-R   S2S    MORF   CRF    FC
Mex.       .8051  .7955  .7611  .7983  .7504  .3364  .7837  .5420
Nahuatl    .6004  .6027  .5541  .6018  .5585  .4044  .6444  .4888
Wixarika   .5895  .6134  .5425  .6188  .5754  .3989  .5866  .4523
Yorem N.   .6856  .7101  .6212  .6936  .6569  .4812  .6596  .5781

F1:
           MTT-U  MTT-R  DA-U   DA-R   S2S    MORF   CRF    FC
Mex.       .8786  .8694  .6715  .8683  .8618  .5121  .8639  .5621
Nahuatl    .7388  .7367  .6865  .7328  .7266  .4154  .7487  .5185
Wixarika   .7949  .8024  .7109  .8161  .7961  .4426  .7932  .5568
Yorem N.   .7887  .8076  .7133  .7923  .7730  .3528  .7736  .6139

Table 4: Performances of our multi-task and data augmentation approaches compared to all baselines described in the text. The reported results for neural models are averages over 5 training runs.

           M-Lang  S-Lang  BestMTT  BestDA
Mex.       .6858   .7504   .8051    .7983
Nahuatl    .5955   .5585   .6027    .6018
Wixarika   .6021   .5754   .6134    .6188
Yorem N.   .6223   .6569   .7101    .6936

Table 5: Accuracies of our model trained on all languages (M-Lang) and the models trained on single languages (S-Lang). The highest multi-task and data augmentation accuracies are repeated for an easy comparison.

7 Cross-Lingual Transfer Learning

We now want to investigate the performance of one single model trained on all languages at once. This is done in analogy to the multi-task training described in §5.1. We treat segmentation in each language as a separate task and train an attention-based encoder-decoder model on maximizing the joint log-likelihood:

    L(θ) = Σ_{L_i ∈ L} Σ_{(w,c) ∈ T_{L_i}} log p_θ(c | e(w))    (3)

T_{L_i} denotes the segmentation training data in language L_i, and L is the set of our languages. As before, each training example consists of a word w and its segmentation c.

7.1 Experimental Setup

We keep all model parameters and the training regime as described in §6.3. However, our training data now consists of a combination of all available training data for all 4 languages. In order to enable the model to differentiate between the tasks, we prepend one language-specific input symbol to each instance. This corresponds to having one embedding in the input which marks the task. An example training instance for Yorem Nokki is

    L=YN ko'koreyene ↦ ko'kore|ye|ne,

where L=YN indicates the language.

Due to the previously observed high correlation between accuracy and F1, we only use accuracy on the word level as the evaluation metric for this experiment.

7.2 Results and Discussion

In Table 5, we show the results of the multi-lingual model, which was trained on all languages, compared to all individual models, as well as to each respective best multi-task approach and data augmentation method. The results differ among languages: most remarkably, for both Wixarika and Nahuatl, the accuracy of the multi-lingual model is higher than the one of the single-language model. This might be related to them being the languages with most training data available (cf. Table 3). Note, however, that even for the remaining two languages, Mexicanero and Yorem Nokki, we hardly lose accuracy when comparing the multi-lingual to the individual models. Since we only use one model (instead of four), without increasing its size significantly, we thus reduce the amount of parameters by nearly 75%.

8 Related Work

Work on morphological segmentation started more than six decades ago (Harris, 1951). Since then, many approaches have been developed. In the realm of unsupervised methods, two important systems are LINGUISTICA (Goldsmith, 2001) and MORFESSOR (Creutz and Lagus, 2002). The latter was later extended to a semi-supervised version (Kohonen et al., 2010) in order to make use of the abundance of unlabeled data which is available for many languages.

Ruokolainen et al. (2013) focused explicitly on low-resource scenarios and applied CRFs to morphological segmentation in several languages. They reported better results than earlier work, including semi-supervised approaches. In the following year, they extended their approach to be able to use unlabeled data as well, further improving performance (Ruokolainen et al., 2014).

Cotterell et al. (2015) trained a semi-Markov CRF (semi-CRF) (Sarawagi and Cohen, 2005) jointly on morphological segmentation, stemming and tagging. For the similar problem of Chinese word segmentation, Zhang and Clark (2008) trained a model jointly on part-of-speech tagging. However, we are not aware of any prior work on multi-task training or data augmentation for neural segmentation models.

In fact, the only two neural seq2seq approaches for morphological segmentation we know of focused on canonical segmentation (Cotterell et al., 2016), which differs from the surface segmentation task considered here in that it restores changes to the surface form of morphemes which occurred during word formation. Kann et al. (2016) also used an encoder-decoder RNN and combined it with a neural reranker. While our model architecture was inspired by them, their model was purely supervised. Additionally, they did not investigate the applicability of their neural seq2seq model in low-resource settings or for polysynthetic languages. Ruzsics and Samardzic (2017) extended the standard encoder-decoder architecture for canonical segmentation to contain a language model over segments and improved results. However, a big difference to our work is that they still used more than ten times as much training data as we have available for the indigenous Mexican languages we are working on here.

Another neural approach, this time for surface segmentation, was presented by Wang et al. (2016). The authors, instead of using seq2seq models, treat the task as a sequence labeling problem and use LSTMs to classify every character either as the beginning, middle or end of a morpheme, or as a single-character morpheme.

Cross-lingual knowledge transfer via language tags was proposed for neural seq2seq models before, both for tasks that handle sequences of words (Johnson et al., 2017) and tasks that work on sequences of characters (Kann et al., 2017). However, to the best of our knowledge, we are the first to try such an approach for a morphological segmentation task. In many other areas of NLP, cross-lingual transfer has been applied successfully, e.g., in entity recognition (Wang and Manning, 2014), language modeling (Tsvetkov et al., 2016), or parsing (Cohen et al., 2011; Søgaard, 2011; Ammar et al., 2016).

9 Conclusion and Future Work

We first investigated the applicability of neural seq2seq models to morphological surface segmentation for polysynthetic languages in minimal-resource settings, i.e., for considerably less than 1,000 training instances. Although they are generally thought to require large amounts of training data, neural networks obtained an accuracy comparable to or higher than several strong baselines.

Subsequently, we proposed two novel multi-task training approaches and two novel data augmentation methods to further increase the performance of our neural models. Adding those, we improved over the neural baseline for all languages, and for Mexicanero, Wixarika and Yorem Nokki our final models outperformed all baselines by up to 5.05% absolute accuracy or 3.40% F1. Furthermore, we explored cross-lingual transfer between our languages and reduced the amount of necessary model parameters by about 75%, while improving performance for some of the languages.

We publicly release our datasets for morphological surface segmentation of the polysynthetic minimal-resource languages Mexicanero, Nahuatl, Wixarika and Yorem Nokki.

Acknowledgments

We would like to thank Paulina Grnarova, Rodrigo Nogueira and Ximena Gutierrez-Vasques for their helpful feedback.

References

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah Smith. 2016. Many languages, one parser. TACL 4:431–444.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Mark C. Baker. 1997. Complex predicates and agreement in polysynthetic languages. Complex Predicates, pages 247–288.

Mark C. Baker. 2006. On zero agreement and polysynthesis. In Arguments and Agreement, pages 289–320.

Una Canger. 2001. Mexicanero de la sierra madre occidental. El Colegio de México.

Shay B. Cohen, Dipanjan Das, and Noah A. Smith. 2011. Unsupervised structure prediction with non-parallel multilingual guidance. In EMNLP.

Ryan Cotterell, Thomas Müller, Alexander Fraser, and Hinrich Schütze. 2015. Labeled morphological segmentation with semi-Markov models. In CoNLL.

Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016. A joint model of orthography and morphological segmentation. In NAACL-HLT.

Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraçlar, and Andreas Stolcke. 2007. Morph-based speech recognition and modeling of out-of-vocabulary words across languages. TSLP 5(1):3:1–3:29.

Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Workshop on Morphological and Phonological Learning.

Christopher Dyer, Smaranda Muresan, and Philip Resnik. 2008. Generalizing word lattice translation. In ACL-HLT.

Ray A. Freeze. 1989. Mayo de Los Capomos, Sinaloa.

Zellig Sabbettai Harris. 1951. Methods in Structural Linguistics. Chicago University Press.

Melvin Johnson, Mike Schuster, Quoc Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. TACL 5:339–351.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural morphological analysis: Encoding-decoding canonical segments. In EMNLP.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017. One-shot neural cross-lingual transfer for paradigm completion. In ACL.

Oskar Kohonen, Sami Virpioja, and Krista Lagus. 2010. Semi-supervised learning of concatenative morphology. In SIGMORPHON.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

Yolanda Lastra de Suárez. 1980. Náhuatl de Acaxochitlán (Hidalgo). Colegio de México.

José Luis Iturrioz Leza and Paula Gómez López. 2006. Gramática wixarika. Lincom Europa.

Manuel Mager, Dionico Carrillo, and Ivan Meza. 2018. Probabilistic finite-state morphological segmenter for the Wixarika language. Journal of Intelligent & Fuzzy Systems (Special Issue).

Juan Pedro Maldonado Martínez, Crescencio Buitimea Valenzuela, and Ángel Macochini Alonzo. 2010. Vivimos en un pueblo yaqui. SEP.

Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In CoNLL.
cal segmentation in a low-resource learning setting El Colegio de Mexico.´ using conditional random fields. In CoNLL.

John Goldsmith. 2001. Unsupervised learning of the Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, morphology of a natural language. Computational and Mikko Kurimo. 2014. Painless semi-supervised Linguistics 27(2):153–198. morphological segmentation using conditional ran- dom fields. In EACL. Paula Gomez´ and Paula Gomez´ Lopez.´ 1999. Huichol de San Andres´ Cohamiata, Jalisco. El Colegio de Tatyana Ruzsics and Tanja Samardzic. 2017. Neu- Mexico.´ ral sequence-to-sequence learning of internal word Stig-Arne Gronroos,¨ Sami Virpioja, Peter Smit, and structure. In CoNLL. Mikko Kurimo. 2014. Morfessor FlatCat: An HMM-based method for unsupervised and semi- Sunita Sarawagi and William W Cohen. 2005. Semi- supervised learning of morphology. In COLING. Markov conditional random fields for information extraction. In NIPS. Ximena Gutierrez-Vasques, Gerardo Sierra, and Isaac Hernandez Pompa. 2016. Axolotl: A web Anders Søgaard. 2011. Data point selection for cross- accessible parallel corpus for Spanish-Nahuatl. In language adaptation of dependency parsers. In ACL- LREC. HLT. Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W Black, Lori Levin, and Chris Dyer. 2016. Polyglot neural language models: A case study in cross-lingual phonetic representation learning. In NAACL-HLT. Martine Vanhove, Thomas Stolz, Aina Urdze, and Hit- omi Otsuka. 2012. Morphologies in contact. Walter de Gruyter. Linlin Wang, Zhu Cao, Yu Xia, and Gerard de Melo. 2016. Morphological segmentation with window LSTM neural networks. In AAAI. Mengqiu Wang and Christopher D Manning. 2014. Cross-lingual pseudo-projected expectation regular- ization for weakly supervised learning. TACL 2:55– 66. Yue Zhang and Stephen Clark. 2008. Joint word seg- mentation and POS tagging using a single percep- tron. In ACL.