
Neural Graphical Models over Strings for Principal Parts Morphological Paradigm Completion

Ryan Cotterell1,2  John Sylak-Glassman2  Christo Kirov2
1Department of Computer Science  2Center for Language and Speech Processing
Johns Hopkins University
[email protected]  [email protected]  [email protected]

Abstract

Many of the world's languages contain an abundance of inflected forms for each lexeme. A major task in processing such languages is predicting these inflected forms. We develop a novel statistical model for the problem, drawing on graphical modeling techniques and recent advances in deep learning. We derive a Metropolis-Hastings algorithm to jointly decode the model. Our Bayesian network draws inspiration from principal parts morphological analysis. We demonstrate improvements on 5 languages.

1 Introduction

Inflectional morphology modifies the form of words to convey grammatical distinctions (e.g. tense, case, and number), and is an extremely common and productive phenomenon throughout the world's languages (Dryer and Haspelmath, 2013). For instance, the Spanish poner may transform into one of over fifty unique inflectional forms depending on context, e.g. the 1st person present form is pongo, but the 2nd person present form is pones. These variants cause data sparsity, which is problematic for machine learning since many word forms will not occur in training corpora. Thus, a necessity for improving NLP on morphologically rich languages is the ability to analyze all inflected forms for any lexical entry. One way to do this is through paradigm completion, which generates all the inflected forms associated with a given lemma.

Until recently, paradigm completion has been narrowly construed as the task of generating a full paradigm (e.g. noun declension, verb conjugation) based on a single privileged form—the lemma (i.e. the citation form, such as poner). While recent work (Durrett and DeNero, 2013; Hulden, 2014; Nicolai et al., 2015; Ahlberg et al., 2015; Faruqui et al., 2016) has made tremendous progress on this narrower task, paradigm completion is not only broader in scope, but is better solved without privileging the lemma over other forms. By forcing string-to-string transformations from one inflected form to another to go through the lemma, the transformation problem is often made more complex than by allowing transformations to happen directly or through a different intermediary form. This interpretation is inspired by ideas from linguistics and language pedagogy, namely principal parts morphology, which argues that forms in a paradigm are best derived using a set of citation forms rather than a single form (Finkel and Stump, 2007a; Finkel and Stump, 2007b).

Directed graphical models provide a natural formalism for principal parts morphology since a graph topology can represent relations between inflected forms and principal parts. Specifically, we apply string-valued graphical models (Dreyer and Eisner, 2009; Cotterell et al., 2015) to the problem. We develop a novel, neural parameterization of string-valued graphical models where the conditional probabilities in the Bayesian network are given by a sequence-to-sequence model (Sutskever et al., 2014). However, under such a parameterization, exact inference and decoding are intractable. Thus, we derive a sampling-based decoding algorithm. We experiment on 5 languages: Arabic, German, Latin, Russian, and Spanish, showing that our model outperforms a baseline approach that privileges the lemma form.

2 A Generative Model of Principal Parts

We first formally define the task of paradigm completion and relate it to research in principal parts morphology. Let Σ be a discrete alphabet of characters in a language. Formally, for a given lemma ℓ ∈ Σ*, we define the complete paradigm of that lemma as π(ℓ) = ⟨m_1, ..., m_N⟩, where each m_i ∈ Σ* is an inflected form.¹ For example, the paradigm for the English lemma to run is defined as π(RUN) = ⟨run, runs, ran, running⟩. While the size of a typical English verbal paradigm is comparatively small (|π| = N = 4), in many languages the size of the paradigms can be very large (Kibrik, 1998). The task of paradigm completion is to predict all elements of the tuple π given one or more forms m_i. Paradigm completion solely from the lemma (m_ℓ), however, largely ignores the linguistic structure of the paradigm. Given certain inflected word forms, termed principal parts, the construction of a set of other word forms in the same paradigm is fully deterministic. Latin verbs are famous for having four such principal parts (Finkel and Stump, 2009).

¹We constrain the task such that the number of forms in a paradigm (|π| = N) is fixed, and each possible form of a paradigm is assumed to have consistent semantics.

Inspired by the concept of principal parts, we present a solution to the paradigm completion task in which target inflected forms are predicted from other forms in the paradigm, rather than only from the lemma. We implement this solution in the form of a generative probabilistic model of the paradigm. We define a joint probability distribution over the entire paradigm:

    p(π) = ∏_i p(m_i | m_{pa_T(i)})    (1)

where pa_T(·) is a function that returns the parent of the node i with respect to the tree T, which encodes the source form from which each target form is predicted. In terms of graphical modeling, this p(π) is a Bayesian network over string-valued variables (Cotterell et al., 2015). Trees provide a natural formalism for encoding the intuition behind principal parts theory, and provide a fixed paradigm structure prior to inference. We construct a graph with nodes for each cell in the paradigm, as in Figure 1. The parent of each node is another form in the paradigm that best predicts that node.

[Figure 1 appears here: two paradigm trees drawn over Spanish forms of poner (pongo, pongas, ponga, pongan, pondría, pondrías, pondrían, pondríais). (a) Lemma paradigm tree. (b) Principal parts paradigm tree.]
Figure 1: Two potential graphical models for the paradigm completion task. The topology in (a) encodes the network where all forms are predicted from the lemma. The topology in (b) is a principal-parts-inspired topology introduced here.
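To make the factorization in Equation (1) concrete, the short Python sketch below scores a toy English paradigm under a fixed tree. It is only an illustration: the cell names, the paradigm_tree dictionary, and the edge_log_prob function are hypothetical stand-ins for the LSTM conditionals described in §3.

# Toy paradigm for the English lemma RUN; keys name the paradigm cells.
forms = {"lemma": "run", "3sg": "runs", "past": "ran", "gerund": "running"}

# A paradigm tree: each cell's parent pa_T(i); the root has parent None.
paradigm_tree = {"lemma": None, "3sg": "lemma", "past": "3sg", "gerund": "past"}

def edge_log_prob(target, source):
    # Stand-in for log p(m_i | m_{pa_T(i)}); the paper parameterizes this with
    # a sequence-to-sequence LSTM.  Here we simply reward shared prefixes.
    shared = sum(1 for a, b in zip(source, target) if a == b)
    return float(shared - len(target))

def log_p_paradigm(forms, tree):
    # log p(pi) = sum_i log p(m_i | m_{pa_T(i)}), skipping the root (Eq. 1).
    return sum(edge_log_prob(forms[cell], forms[parent])
               for cell, parent in tree.items() if parent is not None)

print(log_p_paradigm(forms, paradigm_tree))

Swapping in a different tree (e.g. the lemma-centered tree of Figure 1a) changes only the parent map, not the scoring code, which is the sense in which the topology is a fixed structural prior.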
2.1 Paradigm Trees

Baseline Network. Predicting inflected forms only from the lemma involves a particular graphical model in which all the forms are leaves attached to the lemma. This network is treated as a baseline, and is depicted in Figure 1a.

Heuristic Network. We heuristically induce a paradigm tree with the following procedure. For each ordered pair of forms in a paradigm π, we compute the number of distinct edit scripts that convert one form into the other. The edit script procedure is similar to that described in Chrupała et al. (2008). For each ordered pair (i, j) of inflected forms in π, we count the number of distinct edit paths mapping from m_i to m_j, which serves as a weight on the edge w_{i→j}. Empirically, w_{i→j} is a good proxy for how deterministic a mapping is. We use Edmonds' algorithm (Edmonds, 1967) to find the minimal directed spanning tree. The intuition behind this procedure is that the number of deterministic arcs should be maximized.

Gold Network. Finally, for Latin verbs we consider a graph that matches the classic pedagogical derivation of Latin verbs from four principal parts.
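The sketch below illustrates one way the heuristic induction could be implemented; it is not the authors' code. It counts minimum-cost edit paths as a simplified proxy for the Chrupała et al. (2008)-style edit scripts, and it assumes the networkx library for Edmonds' minimum spanning arborescence.

from functools import lru_cache
import networkx as nx   # assumed dependency; provides Edmonds' algorithm

def num_min_edit_paths(src, tgt):
    # Count the distinct minimum-cost edit paths (unit-cost insert/delete/
    # substitute) from src to tgt; a rough proxy for how deterministic the
    # mapping between two paradigm cells is.
    @lru_cache(maxsize=None)
    def best(i, j):
        # (cost, number of optimal paths) for src[:i] -> tgt[:j]
        if i == 0 and j == 0:
            return 0, 1
        options = []
        if i > 0:
            c, n = best(i - 1, j)
            options.append((c + 1, n))                          # deletion
        if j > 0:
            c, n = best(i, j - 1)
            options.append((c + 1, n))                          # insertion
        if i > 0 and j > 0:
            c, n = best(i - 1, j - 1)
            options.append((c + (src[i - 1] != tgt[j - 1]), n)) # copy/substitute
        lo = min(c for c, _ in options)
        return lo, sum(n for c, n in options if c == lo)
    return best(len(src), len(tgt))[1]

def induce_tree(paradigm):
    # paradigm: dict cell -> form; edge weight w_{i -> j} = number of distinct
    # edit paths from m_i to m_j.  Lower total weight = more deterministic arcs.
    g = nx.DiGraph()
    for i, src in paradigm.items():
        for j, tgt in paradigm.items():
            if i != j:
                g.add_edge(i, j, weight=num_min_edit_paths(src, tgt))
    return sorted(nx.minimum_spanning_arborescence(g).edges())

print(induce_tree({"lemma": "run", "3sg": "runs", "past": "ran", "gerund": "running"}))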
3 Inflection Generation with RNNs

RNNs have recently achieved state-of-the-art results for many sequence-to-sequence mapping problems and paradigm completion is no exception. Given the success of LSTM-based (Hochreiter and Schmidhuber, 1997) and GRU-based (Cho et al., 2014) morphological inflectors (Faruqui et al., 2016; Cotterell et al., 2016), we choose a neural parameterization for our Bayesian network, i.e. the conditional probability p(m_i | m_{pa_T(i)}) is computed using an RNN. Our graphical modeling approach as well as the inference algorithms subsequently discussed in §4.2 are agnostic to the minutiae of any one parameterization, i.e. the encoding p(m_i | m_{pa_T(i)}) is a black box.

3.1 LSTMs with Hard Monotonic Attention

We define the conditional distributions in our Bayesian network p(m_i | m_{pa_T(i)}) as LSTMs with hard monotonic attention (Aharoni et al., 2016; Aharoni and Goldberg, 2016), which we briefly overview. These networks map one inflection to another, e.g. mapping the English gerund running to the past tense ran, using an encoder-decoder architecture (Sutskever et al., 2014) run over an augmented alignment alphabet, consisting of copy, substitution, deletion and insertion, as in Dreyer et al. (2008). For strings x, y ∈ Σ*, the alignment is extracted from the minimal weight edit path using the BioPython toolkit (Cock et al., 2009). Crucially, as the model is locally normalized we may sample strings from the conditional p(m_i | m_{pa_T(i)}) efficiently using forward sampling. This network stands in contrast to attention models (Bahdanau et al., 2015) in which the alignments are soft and not necessarily monotonic. We refer the reader to Aharoni et al. (2016) for exact implementation details as we use their code out-of-the-box.²

²https://github.com/roeeaharoni/morphological-reinflection
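As an illustration of the augmented alignment alphabet, the following sketch derives an action sequence (copy, substitution, deletion, insertion) for a source/target pair. It uses Python's standard difflib as a stand-in for the minimal-weight BioPython alignment used in the actual system, so the extracted actions may differ from those the model is trained on.

from difflib import SequenceMatcher

def edit_actions(source, target):
    # Convert a source/target inflection pair into a sequence over the
    # augmented alphabet of copy, substitution, deletion and insertion
    # actions; difflib stands in for the minimal-weight edit-path alignment.
    actions = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, source, target).get_opcodes():
        if op == "equal":
            actions += ["COPY"] * (i2 - i1)
        elif op == "delete":
            actions += ["DEL"] * (i2 - i1)
        elif op == "insert":
            actions += ["INS(" + c + ")" for c in target[j1:j2]]
        else:  # "replace": align the spans position by position, pad the rest
            for k in range(max(i2 - i1, j2 - j1)):
                if k < i2 - i1 and k < j2 - j1:
                    actions.append("SUB(" + target[j1 + k] + ")")
                elif k < i2 - i1:
                    actions.append("DEL")
                else:
                    actions.append("INS(" + target[j1 + k] + ")")
    return actions

print(edit_actions("running", "ran"))   # gerund -> past tense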
4 Neural Graphical Models over Strings

Our Bayesian network defined in Equation (1) is a graphical model defined over multiple string-valued random variables, a framework formalized in Dreyer and Eisner (2009). In contrast to previous work, e.g. Cotterell and Eisner (2015) and Peng et al. (2015), which considered conditional distributions encodable by finite-state machines, we offer the first neural parameterization for such graphical models. With the increased expressivity come computational challenges—inference becomes intractable. Thus, we fashion an efficient sampling algorithm.

4.1 Parameter Estimation

Following previous work (Faruqui et al., 2016), we train our model in the fully observed setting with complete paradigms as training data. As our model is directed, this makes parameter estimation relatively straightforward. We may estimate the parameters of each LSTM independently without performing joint inference during training. We follow the training procedure of Aharoni et al. (2016), using a maximum of 300 epochs of SGD.
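The decomposition into independent per-edge training problems can be pictured as follows; the EdgeInflector class is a toy placeholder (it merely memorizes training pairs) for the hard-monotonic-attention LSTM of §3.1, and the tree and data are illustrative.

# Each edge (parent -> child) of the paradigm tree gets its own training set of
# (source, target) string pairs, extracted from fully observed paradigms, and
# its own model trained independently of the others.

paradigm_tree = {"lemma": None, "3sg": "lemma", "past": "3sg", "gerund": "past"}

training_paradigms = [
    {"lemma": "run",  "3sg": "runs",  "past": "ran",    "gerund": "running"},
    {"lemma": "walk", "3sg": "walks", "past": "walked", "gerund": "walking"},
]

class EdgeInflector:
    # Toy placeholder for one hard-monotonic-attention LSTM; a real model
    # would be fit with (up to 300 epochs of) SGD rather than memorization.
    def fit(self, pairs):
        self.memory = dict(pairs)
    def predict(self, source):
        return self.memory.get(source, source)

def train_edge_models(tree, paradigms):
    models = {}
    for child, parent in tree.items():
        if parent is None:
            continue                       # the root has no incoming edge
        pairs = [(p[parent], p[child]) for p in paradigms]
        models[child] = EdgeInflector()
        models[child].fit(pairs)           # independent estimation per edge
    return models

models = train_edge_models(paradigm_tree, training_paradigms)
print(models["past"].predict("runs"))      # -> 'ran'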

4.2 Approximate Joint Decoding

In a Bayesian network, the maximum-a-posteriori (MAP) inference problem refers to finding the most probable configuration of the variables given some evidence. In our case, this requires finding the best set of inflections to complete the paradigm given an observed set of inflected forms. Returning to the English verbal paradigm, given the lemma run and the 3rd person present singular runs, the goal of MAP inference is to return the most probable assignment to the past tense and gerund forms (the correct assignment is ran and running). In many Bayesian networks, e.g. models with finite support, exact MAP inference can be performed efficiently with the sum-product belief propagation algorithm (Pearl, 1988) when the model has a tree structure. Despite the tree structure, the LSTM makes exact inference intractable. Thus, we resort to an approximate scheme.

4.3 Simulated Annealing

The Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) is a popular Markov chain Monte Carlo (MCMC) (Robert and Casella, 2013) algorithm for approximate sampling from intractable distributions. As with all MCMC algorithms, the goal is to construct a Markov chain whose stationary distribution is the target distribution. Thus, after having mixed, taking a random walk on the Markov chain is equivalent to sampling from the intractable distribution. Here, we are interested in sampling from p(π), where part of π may be observed.

Simulated annealing (Kirkpatrick et al., 1983; Andrieu et al., 2003) is a slight modification of the Metropolis-Hastings algorithm suitable for MAP inference. We add the temperature hyperparameter τ, which we decrease on a schedule. We achieve the MAP estimate as τ → 0. The algorithm works as follows: Given a paradigm with tree T, we sample a latent node i in the tree uniformly at random. We then sample a new string m'_i from the proposal distribution q_i (see §4.4), which we accept (replacing m_i) with probability

    a = min{1, (p(m'_i | m_{pa_T(i)}) / p(m_i | m_{pa_T(i)}))^(1/τ) · (q_i(m_i) / q_i(m'_i))}.    (2)

We iterate until convergence and accept the final configuration of values as our approximate MAP estimate. We give pseudocode in Algorithm 1 for clarity.

Algorithm 1 Decoding by Simulated Annealing
1: procedure SIMULATED-ANNEALING(T, d, ε)
2:   τ ← 10.0 ; m ← [ε, ..., ε]
3:   repeat
4:     i ∼ uniform({1, 2, ..., |m|})   ▷ sample latent node in T
5:     m'_i ∼ q_i   ▷ sample string from proposal distribution
6:     a ← min{1, (p(m'_i | m_{pa_T(i)}) / p(m_i | m_{pa_T(i)}))^(1/τ) · (q_i(m_i) / q_i(m'_i))}
7:     if uniform(0, 1) ≤ a then
8:       m_i ← m'_i   ▷ update string to new value if accepted
9:     τ ← τ · d   ▷ decay temperature where d ∈ (0, 1)
10:   until τ ≤ ε   ▷ repeat until convergence; see Spall (2003, Ch. 8)
11:   return m
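The following sketch renders Algorithm 1 as executable Python under toy stand-ins: edge_log_prob plays the role of log p(m_i | m_{pa_T(i)}), and propose stands in for the mixture proposal of §4.4. It assumes the root (the lemma) is observed, as in our experiments; none of the names below come from the released code.

import math
import random

def simulated_annealing_decode(tree, observed, edge_log_prob, propose,
                               tau=10.0, decay=0.95, eps=1e-3):
    # tree: cell -> parent cell (the root, assumed observed, has parent None)
    # observed: cell -> fixed string (the evidence)
    # edge_log_prob(child, parent): stand-in for log p(m_i | m_{pa_T(i)})
    # propose(cell, current): returns (new_string, log q(new), log q(current))
    latent = [c for c in tree if c not in observed]
    m = dict(observed)
    for c in latent:
        m[c] = ""                                  # initialise latent cells to epsilon
    while tau > eps:
        i = random.choice(latent)                  # sample a latent node in T
        new, log_q_new, log_q_old = propose(i, m[i])
        parent = tree[i]
        log_ratio = (edge_log_prob(new, m[parent])
                     - edge_log_prob(m[i], m[parent])) / tau
        # a = min(1, (p'/p)^(1/tau) * q(old)/q(new)); capping the exponent at 0
        # is equivalent and avoids overflow as tau -> 0.
        a = math.exp(min(0.0, log_ratio + log_q_old - log_q_new))
        if random.random() <= a:
            m[i] = new                             # accept the proposed string
        tau *= decay                               # decay the temperature
    return m

# Toy usage with a hypothetical two-leaf tree and a uniform proposal.
tree = {"lemma": None, "3sg": "lemma", "past": "lemma"}
observed = {"lemma": "run"}
vocabulary = ["run", "runs", "ran", "runned"]

def toy_edge_log_prob(child, parent):
    return -float(sum(a != b for a, b in zip(child, parent))
                  + abs(len(child) - len(parent)))

def toy_propose(cell, current):
    new = random.choice(vocabulary)
    return new, -math.log(len(vocabulary)), -math.log(len(vocabulary))

print(simulated_annealing_decode(tree, observed, toy_edge_log_prob, toy_propose))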

4.4 Proposal Distribution

We define a tractable proposal distribution for our neural graphical model over strings using a procedure similar to the stochastic inverse method of Stuhlmüller et al. (2013) for probabilistic programming. In addition to estimating the parameters of an LSTM defining the distribution p(m_i | m_{pa_T(i)}), we also estimate parameters of an LSTM to define the inverse distribution p(m_{pa_T(i)} | m_i). As we observe only complete paradigms at training time, we train networks as in §4.1. First, we define the neighborhood of a node i as all those nodes adjacent to i (connected by an ingoing or outgoing edge). We define the proposal distribution as a mixture model of all conditional distributions in the neighborhood N(i), i.e.

    q_i(m_i) = |N(i)|^(-1) ∑_{j ∈ N(i)} p(m_i | m_j).    (3)

Crucially, some of the distributions are stochastic inverses. Sampling from q_i is tractable: We sample a mixture component uniformly and then sample a string.
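A sketch of the proposal follows, with a ToyConditional class standing in for the forward and inverse LSTMs; picking a neighbor uniformly and then sampling from its conditional is the two-stage procedure described above, while proposal_log_prob evaluates Equation (3). The cell names and probability values are invented for illustration.

import math
import random

class ToyConditional:
    # Stand-in for one LSTM conditional p(m_i | m_j); a real model would
    # condition on the source string rather than ignore it.
    def __init__(self, candidates):                    # dict: string -> prob
        self.candidates = candidates
    def sample(self, source):
        values, weights = zip(*self.candidates.items())
        return random.choices(values, weights=weights)[0]
    def log_prob(self, value, source):
        return math.log(self.candidates.get(value, 1e-9))

def sample_from_proposal(i, neighbors, conditionals, current):
    # q_i is a uniform mixture over the neighborhood N(i): pick an adjacent
    # cell j uniformly, then sample a value for cell i from p(m_i | m_j).
    # For the parent of i this conditional is the trained stochastic inverse.
    j = random.choice(neighbors[i])
    return conditionals[(j, i)].sample(current[j])

def proposal_log_prob(i, value, neighbors, conditionals, current):
    # log q_i(value), with q_i(m_i) = |N(i)|^(-1) sum_{j in N(i)} p(m_i | m_j).
    total = sum(math.exp(conditionals[(j, i)].log_prob(value, current[j]))
                for j in neighbors[i])
    return math.log(total / len(neighbors[i]))

# Tiny worked example: cell "past" neighbours the lemma (its parent) and the
# gerund (its child) under some hypothetical tree.
neighbors = {"past": ["lemma", "gerund"]}
conditionals = {
    ("lemma", "past"):  ToyConditional({"ran": 0.8, "runned": 0.2}),
    ("gerund", "past"): ToyConditional({"ran": 0.9, "run": 0.1}),
}
current = {"lemma": "run", "gerund": "running", "past": "runned"}
print(sample_from_proposal("past", neighbors, conditionals, current))
print(proposal_log_prob("past", "ran", neighbors, conditionals, current))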

5 Related Work

Our effort is closest to Faruqui et al. (2016), who proposed the first neural paradigm completer. Many neural solutions were also proposed in the SIGMORPHON shared task on morphological reinflection (Cotterell et al., 2016). Notably, the winning system used an encoder-decoder architecture (Kann and Schütze, 2016). Neural networks have been used in other areas of computational morphology, e.g. morpheme segmentation (Wang et al., 2016; Kann et al., 2016; Cotterell and Schütze, 2017), morphological tagging (Heigold et al., 2016), and language modeling (Botha and Blunsom, 2014).

6 Experiments and Results

Our proposed model generalizes previous efforts in paradigm completion since all previously proposed models take the form of Figure 1a, i.e. a graphical model where all leaves connect to the lemma. Unfortunately, in that configuration, observing additional forms cannot help at test time since information must flow through the lemma, which is always observed. We conjecture that principal parts-based topologies will outperform the baseline topology for that reason. We propose a controlled experiment in which we consider identical training and testing conditions and vary only the topology.

Data. Data for training, development, and testing is randomly sampled from the UniMorph dataset (Sylak-Glassman et al., 2015).³ We run experiments on Arabic, German, and Russian nominal paradigms and Latin and Spanish verbal paradigms. The sizes of the resulting data splits are given in Table 2. For the development and test splits we always include the lemma (as is standard) while sampling additional observed forms. On average one third of all forms are observed.

³http://www.unimorph.org

Table 1: Accuracy on the paradigm completion task comparing Bayesian network topologies over 5 languages.

Language   Baseline   Heuristic Tree   Gold Tree
Arabic     70.3%      92.7%            N/A
German     93.3%      98.8%            N/A
Latin      92.8%      98.3%            98.9%
Russian    84.2%      84.4%            N/A
Spanish    99.2%      99.2%            N/A

Table 2: Lemmata per dataset.

Language (POS)   Train   Dev    Test
Arabic (N)       632     79     79
German (N)       1723    200    200
Latin (V)        2660    333    333
Russian (N)      8266    1032   1033
Spanish (V)      2973    372    372

Evaluation. Evaluation of the held-out sets proceeds as follows: Given the observed forms in the paradigm, we jointly decode the remaining forms as discussed in §4.2; joint decoding is performed without Algorithm 1 for the baseline—instead, we decode as in Aharoni et al. (2016). We measure accuracy (macro-averaged) on the held-out forms.
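For concreteness, the snippet below shows one way the macro-averaged accuracy could be computed: exact-match accuracy over the held-out cells of each test paradigm, averaged over paradigms. The averaging granularity is our assumption here, and the exact computation in the released evaluation scripts may differ.

def paradigm_accuracy(predicted, gold, held_out_cells):
    # exact-match accuracy over the held-out cells of one paradigm
    correct = sum(predicted[c] == gold[c] for c in held_out_cells)
    return correct / len(held_out_cells)

def macro_accuracy(results):
    # results: list of (predicted, gold, held_out_cells) triples, one per test
    # paradigm; the macro average is the mean of the per-paradigm accuracies.
    scores = [paradigm_accuracy(p, g, h) for p, g, h in results]
    return sum(scores) / len(scores)

example = [({"past": "ran", "gerund": "running"},
            {"past": "ran", "gerund": "running"},
            ["past", "gerund"]),
           ({"past": "walked", "gerund": "walkking"},
            {"past": "walked", "gerund": "walking"},
            ["past", "gerund"])]
print(macro_accuracy(example))   # -> 0.75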
Results. In general, we find that our principal parts-inspired networks outperform lemma-centered baseline networks. In Arabic, German, and Latin, we find the largest gains (for Latin, our heuristic topology closely matches that of the gold tree, validating the heuristics we use). We attribute the gains to the ability to use knowledge from attested forms that are otherwise difficult to predict, e.g. forms based on the Arabic broken plural, the German plural, and any of the Latin present perfect forms. In the case of paradigms with portions which are difficult to predict without knowledge of a representative form, knowing multiple principal parts will be a boon given a proper tree topology. For Spanish we see no improvement—we attribute this to the fact that almost all of the test examples were regular -ar verbs and, thus, fully predictable. Finally, in the case of Russian we see only minor improvements—this stems from the need to maintain a different optimal topology for each declension. Because our model assumes a fixed paradigmatic structure in the form of a tree, using multiple topologies is not possible.

7 Conclusion

We have presented a directed graphical model over strings with an RNN parameterization for principal-parts-inspired morphological paradigm completion. This paradigm gives us the best of two worlds. We can exploit state-of-the-art neural morphological inflectors while injecting linguistic insight into the structure of the graphical model itself. Due to the expressivity of our parameterization, exact decoding becomes intractable. To solve this, we derive an efficient MCMC approach to approximately decode the model. We validate our model experimentally and show gains over a baseline which represents the topology used in nearly all previous research.
Acknowledgements

The first author was supported by a DAAD Long-Term Research Grant and an NDSEG fellowship.

References

Roee Aharoni and Yoav Goldberg. 2016. Sequence to sequence transduction with hard monotonic attention. arXiv preprint arXiv:1611.01487.

Roee Aharoni, Yoav Goldberg, and Yonatan Belinkov. 2016. Improving sequence to sequence learning for morphological inflection generation: The BIU-MIT systems for the SIGMORPHON 2016 shared task for morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 41–48, Berlin, Germany, August. Association for Computational Linguistics.

Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1024–1029, Denver, Colorado, May–June. Association for Computational Linguistics.

Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. 2003. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In ICML, pages 1899–1907.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Grzegorz Chrupała, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In LREC, volume 8.

Peter Cock, Tiago Antao, Jeffrey Chang, Brad Chapman, Cymon Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. 2009. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423.

Ryan Cotterell and Jason Eisner. 2015. Penalized expectation propagation for graphical models over strings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 932–942, Denver, Colorado, May–June. Association for Computational Linguistics.

Ryan Cotterell and Hinrich Schütze. 2017. Joint semantic synthesis and morphological analysis of the derived word. CoRR, abs/1701.00946.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2015. Modeling word forms using latent underlying morphs and phonology. Transactions of the Association for Computational Linguistics, 3.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany, August. Association for Computational Linguistics.

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 101–110, Singapore, August. Association for Computational Linguistics.

Markus Dreyer, Jason Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1080–1089, Honolulu, Hawaii, October. Association for Computational Linguistics.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195, Atlanta, Georgia, June. Association for Computational Linguistics.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B, 71(4):233–240.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California, June. Association for Computational Linguistics.

Raphael Finkel and Gregory Stump. 2007a. Principal parts and degrees of paradigmatic transparency (No. TR 470-07). Technical report, Department of Computer Science, University of Kentucky, Lexington, KY.

Raphael Finkel and Gregory Stump. 2007b. Principal parts and morphological typology. Morphology, 17:39–75, November.

Raphael Finkel and Gregory Stump. 2009. What your teacher told you is true: Latin verbs have four principal parts. Digital Humanities Quarterly (DHQ), 3(1).

Wilfred Keith Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Georg Heigold, Guenter Neumann, and Josef van Genabith. 2016. Neural morphological tagging from characters for morphologically rich languages. CoRR, abs/1606.06640.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Mans Hulden. 2014. Generalizing inflection tables into paradigms with finite-state operations. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 29–36, Baltimore, Maryland. Association for Computational Linguistics.

Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 62–70, Berlin, Germany, August. Association for Computational Linguistics.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural morphological analysis: Encoding-decoding canonical segments. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 961–967, Austin, Texas, November. Association for Computational Linguistics.

Aleksandr E. Kibrik. 1998. Archi. In Andrew Spencer and Arnold M. Zwicky, editors, The Handbook of Morphology, pages 455–476. Blackwell, Oxford.
Scott Kirkpatrick, C. Daniel Gelatt, Mario P. Vecchi, et al. 1983. Optimization by simulated annealing. Science, 220(4598):671–680.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21(6):1087–1092.

Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922–931, Denver, Colorado, May–June. Association for Computational Linguistics.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Nanyun Peng, Ryan Cotterell, and Jason Eisner. 2015. Dual decomposition inference for graphical models over strings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 917–927, Lisbon, Portugal, September. Association for Computational Linguistics.

Christian Robert and George Casella. 2013. Monte Carlo Statistical Methods. Springer Science & Business Media.

James C. Spall. 2003. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, volume 65. John Wiley & Sons.

Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. 2013. Learning stochastic inverses. In NIPS, pages 3048–3056.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215.

John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 674–680, Beijing, China, July. Association for Computational Linguistics.

Linlin Wang, Zhu Cao, Yu Xia, and Gerard de Melo. 2016. Morphological segmentation with window LSTM neural networks. In AAAI, pages 2842–2848.