Neural Graphical Models Over Strings for Principal Parts Morphological Paradigm Completion
Ryan Cotterell¹,²  John Sylak-Glassman²  Christo Kirov²
¹Department of Computer Science  ²Center for Language and Speech Processing
Johns Hopkins University
[email protected]  [email protected]  [email protected]

Abstract

Many of the world's languages contain an abundance of inflected forms for each lexeme. A major task in processing such languages is predicting these inflected forms. We develop a novel statistical model for the problem, drawing on graphical modeling techniques and recent advances in deep learning. We derive a Metropolis-Hastings algorithm to jointly decode the model. Our Bayesian network draws inspiration from principal parts morphological analysis. We demonstrate improvements on 5 languages.

1 Introduction

Inflectional morphology modifies the form of words to convey grammatical distinctions (e.g. tense, case, and number), and is an extremely common and productive phenomenon throughout the world's languages (Dryer and Haspelmath, 2013). For instance, the Spanish verb poner may transform into one of over fifty unique inflectional forms depending on context, e.g. the 1st person present form is pongo, but the 2nd person present form is pones. These variants cause data sparsity, which is problematic for machine learning since many word forms will not occur in training corpora. Thus, a necessity for improving NLP on morphologically rich languages is the ability to analyze all inflected forms for any lexical entry. One way to do this is through paradigm completion, which generates all the inflected forms associated with a given lemma.

Until recently, paradigm completion has been narrowly construed as the task of generating a full paradigm (e.g. noun declension, verb conjugation) based on a single privileged form, the lemma (i.e. the citation form, such as poner). While recent work (Durrett and DeNero, 2013; Hulden, 2014; Nicolai et al., 2015; Ahlberg et al., 2015; Faruqui et al., 2016) has made tremendous progress on this narrower task, paradigm completion is not only broader in scope, but is better solved without privileging the lemma over other forms. By forcing string-to-string transformations from one inflected form to another to go through the lemma, the transformation problem is often made more complex than by allowing transformations to happen directly or through a different intermediary form. This interpretation is inspired by ideas from linguistics and language pedagogy, namely principal parts morphology, which argues that forms in a paradigm are best derived using a set of citation forms rather than a single form (Finkel and Stump, 2007a; Finkel and Stump, 2007b).

Directed graphical models provide a natural formalism for principal parts morphology, since a graph topology can represent relations between inflected forms and principal parts. Specifically, we apply string-valued graphical models (Dreyer and Eisner, 2009; Cotterell et al., 2015) to the problem. We develop a novel, neural parameterization of string-valued graphical models in which the conditional probabilities in the Bayesian network are given by a sequence-to-sequence model (Sutskever et al., 2014). However, under such a parameterization, exact inference and decoding are intractable. Thus, we derive a sampling-based decoding algorithm. We experiment on 5 languages: Arabic, German, Latin, Russian, and Spanish, showing that our model outperforms a baseline approach that privileges the lemma form.
2 A Generative Model of Principal Parts

We first formally define the task of paradigm completion and relate it to research in principal parts morphology. Let Σ be a discrete alphabet of characters in a language. Formally, for a given lemma ℓ ∈ Σ*, we define the complete paradigm of that lemma π(ℓ) = ⟨m_1, …, m_N⟩, where each m_i ∈ Σ* is an inflected form.¹ For example, the paradigm for the English lemma to run is defined as π(RUN) = ⟨run, runs, ran, running⟩. While the size of a typical English verbal paradigm is comparatively small (|π| = N = 4), in many languages the size of the paradigms can be very large (Kibrik, 1998). The task of paradigm completion is to predict all elements of the tuple π given one or more forms (m_i). Paradigm completion solely from the lemma (m_ℓ), however, largely ignores the linguistic structure of the paradigm. Given certain inflected word forms, termed principal parts, the construction of a set of other word forms in the same paradigm is fully deterministic. Latin verbs are famous for having four such principal parts (Finkel and Stump, 2009).

¹We constrain the task such that the number of forms in a paradigm (|π| = N) is fixed, and each possible form of a paradigm is assumed to have consistent semantics.

Inspired by the concept of principal parts, we present a solution to the paradigm completion task in which target inflected forms are predicted from other forms in the paradigm, rather than only from the lemma. We implement this solution in the form of a generative probabilistic model of the paradigm. We define a joint probability distribution over the entire paradigm:

  p(π) = ∏_i p(m_i | m_{pa_T(i)})        (1)

where pa_T(·) is a function that returns the parent of the node i with respect to the tree T, which encodes the source form from which each target form is predicted. In terms of graphical modeling, this p(π) is a Bayesian network over string-valued variables (Cotterell et al., 2015). Trees provide a natural formalism for encoding the intuition behind principal parts theory, and provide a fixed paradigm structure prior to inference. We construct a graph with nodes for each cell in the paradigm, as in Figure 1. The parent of each node is another form in the paradigm that best predicts that node.

[Figure 1: Two potential graphical models for the paradigm completion task, shown with inflected forms of Spanish poner (pongo, pongas, ponga, pongan, pondría, pondrías, pondríais, pondrían). (a) Lemma paradigm tree: the topology in which all forms are predicted from the lemma. (b) Principal parts paradigm tree: a principal-parts-inspired topology introduced here.]

2.1 Paradigm Trees

Baseline Network. Predicting inflected forms only from the lemma involves a particular graphical model in which all the forms are leaves attached to the lemma. This network is treated as a baseline, and is depicted in Figure 1a.

Heuristic Network. We heuristically induce a paradigm tree with the following procedure. For each ordered pair of forms in a paradigm π, we compute the number of distinct edit scripts that convert one form into the other. The edit script procedure is similar to that described in Chrupała et al. (2008). For each ordered pair (i, j) of inflected forms in π, we count the number of distinct edit paths mapping from m_i to m_j, which serves as a weight on the edge w_{i→j}. Empirically, w_{i→j} is a good proxy for how deterministic a mapping is. We use Edmonds' algorithm (Edmonds, 1967) to find the minimal directed spanning tree. The intuition behind this procedure is that the number of deterministic arcs should be maximized, since a mapping with fewer distinct edit scripts is closer to deterministic; a brief sketch of the procedure is given below.

Gold Network. Finally, for Latin verbs we consider a graph that matches the classic pedagogical derivation of Latin verbs from four principal parts.
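To make the heuristic network induction concrete, the following sketch builds the weighted graph over paradigm cells and extracts a minimum spanning arborescence. It is only an illustrative sketch under stated assumptions: edit scripts are approximated with opcode sequences from Python's difflib rather than the Chrupała et al. (2008) scripts used in the paper, Edmonds' algorithm is taken from networkx, and all function and variable names are ours.

```python
# Minimal sketch of the heuristic paradigm-tree induction (assumptions noted above).
from difflib import SequenceMatcher
import networkx as nx


def edit_script(src: str, tgt: str) -> tuple:
    """A simplified edit script: the sequence of non-copy operations that
    rewrite src into tgt, ignoring absolute positions (difflib stand-in)."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, src, tgt).get_opcodes():
        if tag != "equal":
            ops.append((tag, src[i1:i2], tgt[j1:j2]))
    return tuple(ops)


def induce_tree(paradigms: list[dict[str, str]]) -> nx.DiGraph:
    """Weight edge (i, j) by the number of distinct edit scripts mapping cell i
    to cell j across the training paradigms, then return the minimum spanning
    arborescence of the resulting complete directed graph."""
    cells = sorted(paradigms[0].keys())
    graph = nx.DiGraph()
    for i in cells:
        for j in cells:
            if i == j:
                continue
            scripts = {edit_script(p[i], p[j]) for p in paradigms if i in p and j in p}
            # Fewer distinct scripts = more deterministic mapping = cheaper edge.
            graph.add_edge(i, j, weight=len(scripts))
    # Edmonds' algorithm for the minimum-weight spanning arborescence.
    return nx.minimum_spanning_arborescence(graph)


if __name__ == "__main__":
    # Toy English verb paradigms (cells: LEMMA, 3SG, PAST, GERUND).
    data = [
        {"LEMMA": "walk", "3SG": "walks", "PAST": "walked", "GERUND": "walking"},
        {"LEMMA": "jump", "3SG": "jumps", "PAST": "jumped", "GERUND": "jumping"},
        {"LEMMA": "run", "3SG": "runs", "PAST": "ran", "GERUND": "running"},
    ]
    tree = induce_tree(data)
    print(sorted(tree.edges()))  # the induced parent -> child arcs
```

Because an edge's weight is the count of distinct scripts observed for that cell pair, the minimum-weight arborescence is exactly the tree that favors the most deterministic mappings.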
3 Inflection Generation with RNNs

RNNs have recently achieved state-of-the-art results for many sequence-to-sequence mapping problems, and paradigm completion is no exception. Given the success of LSTM-based (Hochreiter and Schmidhuber, 1997) and GRU-based (Cho et al., 2014) morphological inflectors (Faruqui et al., 2016; Cotterell et al., 2016), we choose a neural parameterization for our Bayesian network, i.e. the conditional probability p(m_i | m_{pa_T(i)}) is computed using an RNN. Our graphical modeling approach, as well as the inference algorithms subsequently discussed in §4.2, are agnostic to the minutiae of any one parameterization, i.e. the encoding p(m_i | m_{pa_T(i)}) is a black box.

3.1 LSTMs with Hard Monotonic Attention

We define the conditional distributions in our Bayesian network p(m_i | m_{pa_T(i)}) as LSTMs with hard monotonic attention (Aharoni et al., 2016; Aharoni and Goldberg, 2016), which we briefly overview.

4.1 Parameter Estimation

Following previous work (Faruqui et al., 2016), we train our model in the fully observed setting with complete paradigms as training data. As our model is directed, this makes parameter estimation relatively straightforward. We may estimate the parameters of each LSTM independently without performing joint inference during training. We follow the training procedure of Aharoni et al. (2016), using a maximum of 300 epochs of SGD.

4.2 Approximate Joint Decoding

In a Bayesian network, the maximum-a-posteriori (MAP) inference problem refers to finding the most probable configuration of the variables given some evidence. In our case, this requires finding the best set of inflections to complete the paradigm given the observed forms. Algorithm 1 summarizes our simulated annealing procedure for this approximate decoding.

Algorithm 1 Decoding by Simulated Annealing
 1: procedure SIMULATED-ANNEALING(T, d, ϵ)
 2:     τ ← 10.0;  m ← [ε, …, ε]    ▷ initialize each string to the empty string ε
 3:     repeat
 4:         i ∼ uniform({1, 2, …, |m|})    ▷ sample latent node in T
 5:         m′_i ∼ q_i    ▷ sample string from proposal distribution
 6:         a ← min{1, [p(m′_i | m_{pa_T(i)}) · q_i(m_i) / (p(m_i | m_{pa_T(i)}) · q_i(m′_i))]^{1/τ}}
 7:         if uniform(0, 1) ≤ a then
 8:             m_i ← m′_i    ▷ update string to new value if accepted
 9:         τ ← τ · d    ▷ decay temperature, where d ∈ (0, 1)
10:     until τ ≤ ϵ    ▷ repeat until convergence; see Spall (2003, Ch. 8)
11:     return m
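As a concrete illustration, the sketch below follows Algorithm 1 line by line, with the conditional model p(m_i | m_{pa_T(i)}) and the proposal distributions q_i treated as black boxes, mirroring §3. The callables log_p, propose, and log_q, and all other names, are illustrative stand-ins for the trained inflector and proposal, not part of the paper.

```python
# Minimal sketch of Algorithm 1 (decoding by simulated annealing).
# Assumptions: `parent` encodes the tree T as node -> parent index (None for the
# root); `observed` fixes the evidence strings; `log_p(tgt, src)` is the
# black-box conditional log-probability from Section 3; `propose(i)` samples a
# candidate string from q_i; `log_q(i, s)` scores s under q_i.
import math
import random


def simulated_annealing(parent, observed, log_p, propose, log_q, d=0.99, eps=0.1):
    """Approximate MAP decoding of the string-valued Bayesian network."""
    nodes = list(parent.keys())
    # Line 2: initialize every latent string to the empty string.
    m = {i: observed.get(i, "") for i in nodes}
    latent = [i for i in nodes if i not in observed]

    tau = 10.0
    while tau > eps:                                   # lines 3 and 10
        i = random.choice(latent)                      # line 4: pick a latent node
        cand = propose(i)                              # line 5: draw from q_i
        src = m[parent[i]]                             # the node's parent string
        # Line 6: annealed Metropolis-Hastings acceptance probability,
        # [p(m'_i|pa) q_i(m_i) / (p(m_i|pa) q_i(m'_i))]^(1/tau), capped at 1.
        log_ratio = (log_p(cand, src) + log_q(i, m[i])
                     - log_p(m[i], src) - log_q(i, cand))
        accept = 1.0 if log_ratio >= 0 else math.exp(log_ratio / tau)
        if random.random() <= accept:                  # lines 7-8: accept/reject
            m[i] = cand
        tau *= d                                       # line 9: decay temperature
    return m
```

Working in log space and short-circuiting the acceptance probability at 1 avoids overflow when exponentiating by 1/τ; in the paper's setting, log_p would be given by the hard-monotonic-attention LSTM of §3.1, while q_i is left abstract here.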