
A Distributional and Orthographic Aggregation Model for English Derivational Morphology

Daniel Deutsch∗, John Hewitt∗ and Dan Roth
Department of Computer and Information Science, University of Pennsylvania
{ddeutsch,johnhew,danroth}@seas.upenn.edu

Abstract

Modeling derivational morphology to generate words with particular semantics is useful in many text generation tasks, such as machine translation or abstractive question answering. In this work, we tackle the task of derived word generation. That is, given the word "run," we attempt to generate the word "runner" for "someone who runs." We identify two key problems in generating derived words from root words and transformations: suffix ambiguity and orthographic irregularity. We contribute a novel aggregation model of derived word generation that learns derivational transformations both as orthographic functions using sequence-to-sequence models and as functions in distributional word embedding space. Our best open-vocabulary model, which can generate novel words, and our best closed-vocabulary model show 22% and 37% relative error reductions over current state-of-the-art systems on the same dataset.

Figure 1: Diagram depicting the flow of our aggregation model. Two models generate a hypothesis according to orthogonal information; then one is chosen as the final model generation. Here, the hypothesis from the distributional model is chosen.

1 Introduction

The explicit modeling of morphology has been shown to improve a number of tasks (Seeker and Çetinoğlu, 2015; Luong et al., 2013). In a large number of the world's languages, many words are composed through morphological operations on subword units. Some languages are rich in inflectional morphology, characterized by syntactic transformations like pluralization. Similarly, languages like English are rich in derivational morphology, where the semantics of words are composed from smaller parts. The AGENT derivational transformation, for example, answers the question, what is the word for "someone who runs"? with the answer, a runner.1 Here, AGENT is spelled out as suffixing -ner onto the root run.

We tackle the task of derived word generation. In this task, a root word x and a derivational transformation t are given to the learner. The learner's job is to produce the result of the transformation on the root word, called the derived word y. Table 1 gives examples of these transformations.

Previous approaches to derived word generation model the task as a character-level sequence-to-sequence (seq2seq) problem (Cotterell et al., 2017b). The letters from the root word and some encoding of the transformation are given as input to a neural encoder, and the decoder is trained to produce the derived word, one letter at a time. We identify the following problems with these approaches:

∗ These authors contributed equally; listed alphabetically.
1 We use the verb run as a demonstrative example; the transformation can be applied to most verbs.

First, because these models are unconstrained, they can generate sequences of characters that do not form actual words. We argue that requiring the model to generate a known word is a reasonable constraint in the special case of English derivational morphology, and doing so avoids a large number of common errors.

Second, sequence-based models can only generalize string manipulations (such as "add -ment") if they appear frequently in the training data. Because of this, they are unable to generate derived words that do not follow typical patterns, such as generating truth as the nominative derivation of true. We propose to learn a function for each transformation in a low dimensional vector space that corresponds to mapping from representations of the root word to the derived word. This eliminates the reliance on orthographic information, unlike related approaches to distributional semantics, which operate at the suffix level (Gupta et al., 2017).

We contribute an aggregation model of derived word generation that produces hypotheses independently from two separate learned models: one from a seq2seq model with only orthographic information, and one from a feed-forward network using only distributional semantic information in the form of pretrained word vectors. The model learns to choose between the hypotheses according to the relative confidence of each. This system can be interpreted as learning to decide between positing an orthographically regular form or a semantically salient word. See Figure 1 for a diagram of our model.

We show that this model helps with two open problems with current state-of-the-art seq2seq derived word generation systems, suffix ambiguity and orthographic irregularity (Section 2). We also improve the accuracy of seq2seq-only derived word systems by adding external information through constrained decoding and hypothesis rescoring. These methods provide orthogonal gains to our main contribution.

We evaluate models in two categories: open-vocabulary models, which can generate novel words unattested in a preset vocabulary, and closed-vocabulary models, which cannot. Our best open-vocabulary and closed-vocabulary models demonstrate 22% and 37% relative error reductions over the current state of the art.

x            t         y
wise         ADVERB    → wisely
simulate     RESULT    → simulation
approve      RESULT    → approval
overstate    RESULT    → overstatement
yodel        AGENT     → yodeler
survive      AGENT     → survivor
intense      NOMINAL   → intensity
effective    NOMINAL   → effectiveness
pessimistic  NOMINAL   → pessimism

Table 1: The goal of derived word generation is to produce the derived word, y, given both the root word, x, and the transformation t, as demonstrated here with examples from the dataset.

2 Background: Derivational Morphology

Derivational transformations generate novel words that are semantically composed from the root word and the transformation. We identify two unsolved problems in derived word transformation, each of which we address in Sections 3 and 4.

First, there are many plausible choices of suffix for a single pair of root word and transformation. For example, for the verb ground, the RESULT transformation could plausibly take any of the following forms:2

(ground, RESULT) → grounding
(ground, RESULT) → *groundation
(ground, RESULT) → *groundment
(ground, RESULT) → *groundal

However, only one is correct, even though each suffix appears often in the RESULT transformation of other words. We will refer to this problem as "suffix ambiguity."

Second, many derived words seem to lack a generalizable orthographic relationship to their root words. For example, the RESULT of the verb speak is speech. It is unlikely, given an orthographically similar verb creak, that the RESULT be creech instead of, say, creaking. Seq2seq models must grapple with the problem of derived words that are the result of unlikely or potentially unseen string transformations. We refer to this problem as "orthographic irregularity."

2 The * indicates a non-word.

3 Sequence Models and Corpus Knowledge

In this section, we introduce the prior state-of-the-art model, which serves as our baseline system. Then we build on top of this system by incorporating a dictionary constraint and rescoring the model's hypotheses with token frequency information to address the suffix ambiguity problem.

3.1 Baseline Architecture

We begin by formalizing the problem and defining some notation. For source word x = x_1, x_2, ..., x_m, a derivational transformation t, and target word y = y_1, y_2, ..., y_n, our goal is to learn some function from the pair (x, t) to y. Here, x_i and y_j are the ith and jth characters of the input strings x and y. We will sometimes use x_{1:i} to denote x_1, x_2, ..., x_i, and similarly for y_{1:j}.

The current state-of-the-art model for derived-form generation approaches this problem by learning a character-level encoder-decoder neural network with an attention mechanism (Cotterell et al., 2017b; Bahdanau et al., 2014).

The input to the bidirectional LSTM encoder (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) is the sequence #, x_1, x_2, ..., x_m, #, t, where # is a special symbol to denote the start and end of a word, and the encoding of the derivational transformation t is concatenated to the input characters. The model is trained to minimize the cross entropy of the training data. We refer to our reimplementation of this model as SEQ. For a more detailed treatment of neural sequence-to-sequence models with attention, we direct the reader to Luong et al. (2015).

3.2 Dictionary Constraint

The suffix ambiguity problem poses challenges for models which rely exclusively on input characters for information. As previously demonstrated, words derived via the same transformation may take different suffixes, and it is hard to select among them based on character information alone. Here, we describe a process for restricting our inference procedure to only generate known English words, which we call a dictionary constraint. We believe that for English morphology, a large enough corpus will contain the vast majority of derived forms, so while this approach is somewhat restricting, it removes a significant amount of ambiguity from the problem.

To describe how we implemented this dictionary constraint, it is useful first to discuss how decoding in a seq2seq model is equivalent to solving a shortest path problem. The notation is specific to our model, but the argument is applicable to seq2seq models in general.

The goal of decoding is to find the most probable structure ŷ conditioned on some observation x and transformation t. That is, the problem is to solve

  ŷ = argmax_{y ∈ Y} p(y | x, t)            (1)
    = argmin_{y ∈ Y} − log p(y | x, t)       (2)

where Y is the set of valid structures. Sequential models have a natural ordering y = y_1, y_2, ..., y_n over which − log p(y | x, t) can be decomposed:

  − log p(y | x, t) = Σ_{i=1}^{n} − log p(y_i | y_{1:i−1}, x, t)     (3)

Solving Equation 2 can be viewed as solving a shortest path problem from a special starting state to a special ending state via some path which uniquely represents y. Each vertex in the graph represents some sequence y_{1:i}, and the weight of the edge from y_{1:i} to y_{1:i+1} is given by

  − log p(y_{i+1} | y_{1:i}, x, t)     (4)

The weight of the path from the start state to the end state via the unique path that describes y is exactly equal to Equation 3. When the vocabulary size is too large, the exact shortest path is intractable, and approximate search methods, such as beam search, are used instead.

In derived word generation, Y is an infinite set of strings. Since Y is unrestricted, almost all of the strings in Y are not valid words. Given a dictionary Y_D, the search space is restricted to only those words in the dictionary by searching over the trie induced from Y_D, which is a subgraph of the unrestricted graph. By limiting the search space to Y_D, the decoder is guaranteed to generate some known word. Models which use this dictionary-constrained inference procedure will be labeled with +DICT. Algorithm 1 has the pseudocode for our decoding procedure.

We discuss specific details of the search procedure and interesting observations of the search space in Section 6. Section 5.2 describes how we obtained the dictionary of valid words.

3.3 Word Frequency Knowledge through Rescoring

We also consider the inclusion of explicit word frequency information to help solve suffix ambiguity, using the intuition that "real" derived words are likely to be frequently attested. This permits a high-recall, potentially noisy dictionary.

We are motivated by the very high top-10 accuracy compared to top-1 accuracy, even among dictionary-constrained models. By rescoring the hypotheses of a model using word frequency (a word-global signal) as a feature, we attempt to recover a portion of this top-10 accuracy.

When a model has been trained, we query it for its top-10 most likely hypotheses. The union of all hypotheses for a subset of the training observations forms the training set for a classifier that learns to predict whether a hypothesis generated by the model is correct. Each hypothesis is labelled with its correctness, a value in {±1}. We train a simple combination of two scores: the seq2seq model score for the hypothesis, and the log of the word frequency of the hypothesis.

To permit a nonlinear combination of word frequency and model score, we train a small multi-layer perceptron with the model score and the frequency of a derived word hypothesis as features.

At testing time, the 10 hypotheses generated by a single seq2seq model for a single observation are rescored. The new model top-1 hypothesis, then, is the argmax over the 10 hypotheses according to the rescorer. In this way, we are able to incorporate word-global information, e.g. word frequency, that is ill-suited for incorporation at each character prediction step of the seq2seq model. We label models that are rescored in this way +FREQ.

4 Distributional Models

So far, we have presented models that learn derivational transformations as orthographic operations. Such models struggle by construction with the orthographic irregularity problem, as they are trained to generalize orthographic information. However, the semantic relationships between root words and derived words are the same even when the orthography is dissimilar. It is salient, for example, that the irregular word speech is related to its root speak in about the same way as exploration is related to the word explore.

We model distributional transformations as functions in dense distributional word embedding spaces, crucially learning a function per derivational transformation, not per suffix pair. In this way, we aim to explicitly model the semantic transformation, not the orthographic information.

4.1 Feed-forward derivational transformations

For all source words x and all target words y, we look up static distributional embeddings v_x, v_y ∈ R^d. For each derivational transformation t, we learn a function f_t : R^d → R^d that maps v_x to v_y. f_t is parametrized as a two-layer perceptron, trained using a squared loss,

  L = b^T b              (5)
  b = f_t(v_x) − v_y     (6)

We perform inference by nearest neighbor search in the embedding space. This inference strategy requires a subset of strings for our embedding dictionary, Y_V.

Upon receiving (x, t) at test time, we compute f_t(v_x) and find the most similar embeddings in Y_V. Specifically, we find the top-k most similar embeddings, and take the most similar derived word that starts with the same 4 letters as the root word and is not identical to it. This heuristic filters out highly implausible hypotheses.

We use the single-word subset of the Google News vectors (Mikolov et al., 2013) as Y_V, so the size of the vocabulary is 929k words.

4.2 SEQ and DIST Aggregation

The seq2seq and distributional models we have presented learn with disjoint information to solve separate problems. We leverage this intuition to build a model that chooses, for each observation, whether to generate according to orthographic information via the SEQ model, or produce a potentially irregular form via the DIST model.

To train this model, we use a held-out portion of the training set, and filter it to only observations for which exactly one of the two models produces the correct derived form. Finally, we make the strong assumption that the probability of a derived form being generated correctly according to one model as opposed to the other depends only on the unnormalized model score from each. We model this as a logistic regression (t is omitted for clarity):

  P(· | y_D, y_S, x) = softmax(W_e [DIST(y_D | x); SEQ(y_S | x)] + b_e)

where W_e and b_e are learned parameters, y_D and y_S are the hypotheses of the distributional and seq2seq models, and DIST(·) and SEQ(·) are the models' likelihood functions. We denote this aggregate AGGR in our results.
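A minimal sketch of the distributional model's inference in Section 4.1, written with numpy rather than the paper's DyNet code; the helper names (mlp_forward, embed) and the pre-normalized embedding matrix are assumptions for illustration.

```python
import numpy as np

def mlp_forward(params, v):
    """Two-layer perceptron f_t applied to a root-word embedding v (its squared loss is Eqs. 5-6)."""
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ v + b1) + b2

def dist_predict(root, params_t, vocab, emb_matrix, embed, k=100):
    """Nearest-neighbor inference for one (root, transformation) pair.

    vocab      : list of candidate derived words (the embedding dictionary Y_V)
    emb_matrix : len(vocab) x d matrix of their embeddings, rows L2-normalized
    embed      : assumed lookup from a word to its embedding vector
    """
    query = mlp_forward(params_t, embed(root))
    query = query / np.linalg.norm(query)
    sims = emb_matrix @ query                      # cosine similarity to every candidate
    for idx in np.argsort(-sims)[:k]:              # walk the top-k most similar embeddings
        cand = vocab[idx]
        # heuristic from Section 4.1: share the first 4 letters with the root, but differ from it
        if cand != root and cand[:4] == root[:4]:
            return cand, float(sims[idx])
    return None, 0.0
```

The AGGR chooser of Section 4.2 then treats the two models' unnormalized scores as a two-dimensional feature vector for a logistic regression that decides which hypothesis to emit.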

5 Datasets

In this section we describe the derivational morphology dataset used in our experiments and how we collected the dictionary and token frequencies used in the dictionary constraint and rescorer.

5.1 Derivational Morphology

In our experiments, we use the derived word generation derivational morphology dataset released in Cotterell et al. (2017b). The dataset, derived from NomBank (Meyers et al., 2004), consists of 4,222 training, 905 validation, and 905 test triples of the form (x, t, y). The transformations are from the following categories: ADVERB (ADJ → ADV), RESULT (V → N), AGENT (V → N), and NOMINAL (ADJ → N). Examples from the dataset can be found in Table 1.

5.2 Dictionary and Token Frequency Statistics

The dictionary and token frequency statistics used in the dictionary constraint and frequency reranking come from the Google Books NGram corpus (Michel et al., 2011). The unigram frequency counts were aggregated across years, and any tokens which appear fewer than approximately 2,000 times, do not end in a known possible suffix, or contain a character outside of our vocabulary were removed.

The frequency threshold was determined using development data, optimizing for high recall. We collect a set of known suffixes from the training data by removing the longest common prefix between the source and target words from the target word (a minimal sketch of this construction is given after Section 6). The result is a dictionary with frequency information for around 360k words, which covers 98% of the target words in the training data.3

3 The remaining 2% is mostly words with hyphens or mistakes in the dataset.

6 Inference Procedure Discussion

In many sequence models where the vocabulary size is large, exact inference by finding the true shortest path in the graph discussed in Section 3.2 is intractable. As a result, approximate inference techniques such as beam search are often used, or the size of the search space is reduced, for example, by using a Markov assumption. We, however, observed that exact inference via a shortest path algorithm is not only tractable in our model, but only slightly more expensive than greedy search and significantly less expensive than beam search.

Method          Accuracy  Avg. #States
GREEDY          75.9      11.8
BEAM            76.2      101.2
SHORTEST        76.2      11.8
DICT+GREEDY     77.2      11.7
DICT+BEAM       82.6      91.2
DICT+SHORTEST   82.6      12.4

Table 2: The average accuracies and number of states explored in the search graph of 30 runs of the SEQ model with various search procedures. The BEAM models use a beam size of 10.

To quantify this claim, we measured the accuracy and number of states explored by greedy search, beam search, and shortest path with and without a dictionary constraint on the development data. Table 2 shows the results averaged over 30 runs. As expected, beam search and shortest path have higher accuracies than greedy search and explore more of the search space. Surprisingly, beam search and shortest path have nearly identical accuracies, but shortest path explores significantly fewer hypotheses.

At least two factors contribute to the tractability of exact search in our model. First, our character-level sequence model has a vocabulary size of 63, which is significantly smaller than in token-level models, where a vocabulary of 50k words is not uncommon. The search space of sequence models is dependent upon the size of the vocabulary, so our model's search space is dramatically smaller than that of a token-level model.

Second, the inherent structure of the task makes it easy to eliminate large subgraphs of the search space. The first several characters of the input word and output word are almost always the same, so the model assigns very low probability to any sequence with different starting characters than the input. Then, the rest of the search procedure is dedicated to deciding between suffixes. Any suffix which does not appear frequently in the training data receives a low score, leaving the search to decide between a handful of possible options. The result is that the learned probability distribution is very spiked; it puts very high probability on just a few output sequences. It is empirically true that the top few most probable sequences have significantly higher scores than the next most probable sequences, which supports this hypothesis.

In our subsequent experiments, we decode using exact inference by running a shortest path algorithm (see Algorithm 1). For reranking models, instead of using a beam of size k as is typical, we use the top k most probable sequences.
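The following is a minimal sketch of the dictionary construction referenced in Section 5.2: extracting the known suffix set from training pairs and filtering the unigram counts. The in-memory input format of the n-gram counts is an assumption; the 2,000-count threshold follows the description above.

```python
def known_suffixes(train_pairs):
    """Suffixes observed in training: strip the longest common prefix of (root, derived) from the derived word."""
    suffixes = set()
    for root, derived in train_pairs:
        i = 0
        while i < min(len(root), len(derived)) and root[i] == derived[i]:
            i += 1
        if derived[i:]:                     # ignore the empty suffix
            suffixes.add(derived[i:])
    return suffixes

def build_dictionary(unigram_counts, suffixes, alphabet, min_count=2000):
    """Keep tokens that are frequent enough, end in a known suffix, and use only in-vocabulary characters."""
    kept = {}
    for word, count in unigram_counts.items():
        if (count >= min_count
                and any(word.endswith(s) for s in suffixes)
                and all(ch in alphabet for ch in word)):
            kept[word] = count
    return kept
```

The resulting word set induces the trie used by the +DICT decoder, and the retained counts provide the frequency feature used by the +FREQ rescorer.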

Algorithm 1 The decoding procedure uses a shortest-path algorithm to find the most probable output sequence. The dictionary constraint is (optionally) implemented on line 9 by only considering prefixes that are contained in some trie T.

 1: procedure DECODE(x, t, V, T)
 2:   H ← Heap()
 3:   H.insert(0, #)
 4:   while H is not empty do
 5:     y ← H.remove()
 6:     if y is a complete word then return y
 7:     for c ∈ V do
 8:       y′ ← y + c
 9:       if y′ ∈ T then
10:         s ← FORWARD(x, t, y′)
11:         H.insert(s, y′)

Model                     Accuracy  Edit
Cotterell et al. (2017b)  71.7      0.97
DIST                      54.9      3.23
SEQ                       74.2      0.88
AGGR                      78.0      0.83
SEQ+FREQ                  79.3      0.71
AGGR+FREQ                 82.0      0.64
SEQ+DICT                  80.4      0.72
AGGR+DICT                 81.0      0.78
SEQ+FREQ+DICT             81.2      0.71
AGGR+FREQ+DICT            82.4      0.67

Table 3: The accuracies and edit distances of the models presented in this paper and prior work. For edit distance, lower is better. The dictionary-constrained models are on the lower half of the table.

7 Results

In all of our experiments, we use the training, development, and testing splits provided by Cotterell et al. (2017b) and average over 30 random restarts. Table 3 displays the accuracies and average edit distances on the test set of each of the systems presented in this work and the state-of-the-art model from Cotterell et al. (2017b).

First, we observed that SEQ outperforms the results reported in Cotterell et al. (2017b) by a large margin, despite the fact that the model architectures are the same. We attribute this difference to better hyperparameter settings and improved learning rate annealing.

Then, it is clear that the accuracy of the distributional model, DIST, is significantly lower than that of any seq2seq model. We believe the orthographically informed models perform better because most observations in the dataset are orthographically regular, providing low-hanging fruit.

Open-vocabulary models Our open-vocabulary aggregation model AGGR improves performance by 3.8 points accuracy over SEQ, indicating that the sequence models and the distributional model are contributing complementary signals. AGGR is an open-vocabulary model like Cotterell et al. (2017b) and improves upon it by 6.3 points, making it our best comparable model. We provide an in-depth analysis of the strengths of SEQ and DIST in Section 7.1.

Closed-vocabulary models We now consider closed-vocabulary models that improve upon the seq2seq model in AGGR. First, we see that restricting the decoder to only generate known words is extremely useful, with SEQ+DICT improving over SEQ by 6.2 points. Qualitatively, we note that this constraint helps solve the suffix ambiguity problem, since orthographically plausible incorrect hypotheses are pruned as non-words. See Table 6 for examples of this phenomenon. Additionally, we observe that the dictionary-constrained model outperforms the unconstrained model according to top-10 accuracy (see Table 5).

Rescoring (+FREQ) provides a further improvement of 0.8 points, showing that the decoding dictionary constraint provides a higher-quality beam that still has room for top-1 improvement. Altogether, AGGR+FREQ+DICT provides a 4.4 point improvement over the best open-vocabulary model, AGGR. This shows the disambiguating power of assuming a closed vocabulary.

Edit Distance One interesting side effect of the dictionary constraint appears when comparing AGGR+FREQ with and without the dictionary constraint. Although the accuracy of the dictionary-constrained model is better, the average edit distance is worse. The unconstrained model is free to put invalid words which are orthographically similar to the target word in its top-k, whereas the constrained model can only choose valid words. This means it is easier for the unconstrained model to generate words which have a low edit distance to the ground truth, whereas the constrained model can only do that if such a word exists. The result is a more accurate, yet more orthographically diverse, set of hypotheses.

         Cotterell et al. (2017b)   AGGR           AGGR+FREQ+DICT
         acc     edit               acc     edit   acc     edit
NOMINAL  35.1    2.67               68.0    1.32   62.1    1.40
RESULT   52.9    1.86               59.1    1.83   69.7    1.29
AGENT    65.6    0.78               73.5    0.65   79.1    0.57
ADVERB   93.3    0.18               94.0    0.18   95.0    0.22

Table 4: The accuracies and edit distances of our best open-vocabulary and closed-vocabulary models, AGGR and AGGR+FREQ+DICT, compared to prior work, evaluated at the transformation level. For edit distance, lower is better.

         Cotterell et al. (2017b)   SEQ          SEQ+DICT
         top-10-acc                 top-10-acc   top-10-acc
NOMINAL  70.2                       73.7         87.5
RESULT   72.6                       79.9         90.4
AGENT    82.2                       88.4         91.6
ADVERB   96.5                       96.9         96.9

Table 5: The accuracies of the top-10 best outputs for the SEQ, SEQ+DICT, and prior work demonstrate that the dictionary constraint significantly improves the overall candidate quality.

Results by Transformation Next, we compare our best open-vocabulary and closed-vocabulary models to previous work across each derivational transformation. These results are in Table 4.

The largest improvement over the baseline system is for NOMINAL transformations, on which AGGR has a 49% reduction in error. We attribute most of this gain to the difficulty of this particular transformation. NOMINAL is challenging because there are several plausible endings (e.g. -ity, -ness, -ence) which occur at roughly the same rate. Additionally, NOMINAL examples are the least frequent transformation in the dataset, so it is challenging for a sequential model to learn to generalize. The distributional model, which does not rely on suffix information, does not have this same weakness, so the aggregation model AGGR has better results.

The performance of AGGR+FREQ+DICT is worse than AGGR, however. This is surprising because, in all other transformations, adding dictionary information improves the accuracies. We believe this is due to the ambiguity of the ground truth: many root words have seemingly multiple plausible nominal transformations, such as rigid → {rigidness, rigidity} and equivalent → {equivalence, equivalency}. The dictionary constraint produces a better set of hypotheses to rescore, as demonstrated in Table 5. Therefore, the dictionary-constrained model is likely to have more of these ambiguous cases, which makes the task more difficult.

7.1 Strengths of SEQ and DIST

In this subsection we explore why AGGR improves consistently over SEQ even though it maintains an open vocabulary. We have argued that DIST is able to correctly produce derived words that are orthographically irregular or infrequent in the training data. Figure 2 quantifies this phenomenon, analyzing the difference in accuracy between the two models and plotting it against the frequency of the suffix in the training data. The plot shows that SEQ excels at generating derived words ending in -ly, -ion, and other suffixes that appeared frequently in the training data. The suffixes on which DIST improves over SEQ are generally much less frequent in the training data, or, as in the case of -ment, are less frequent than other suffixes for the same transformation (like -ion). By producing derived words whose suffixes show up rarely in the training data, DIST helps solve the orthographic irregularity problem.

Figure 2: Aggregating across 30 random restarts, we tallied when SEQ and DIST correctly produced derived forms of each suffix. The y-axis shows the logarithm of the difference, per suffix, between the tally for DIST and the tally for SEQ. On the x-axis is the logarithm of the frequency of derived words with each suffix in the training data. A linear regression line is plotted to show the relationship between log suffix frequency and log difference in correct predictions. Suffixes that differ only by the first letter, as with -ger and -er, have been merged and represented by the more frequent of the two.

8 Prior Work

There has been much work on the related task of inflected word generation (Durrett and DeNero, 2013; Rastogi et al., 2016; Hulden et al., 2014).

x              t         DIST          SEQ            AGGR           AGGR+DICT
approve        RESULT    approval      approvation    approval       approval
bankrupt       NOMINAL   bankruptcy    bankruption    bankruptcy     bankruptcy
irretrievable  ADVERB    irreparably   irretrievably  irretrievably  irretrievably
connect        RESULT    connectivity  connection     connection     connection
stroll         AGENT     strolls       stroller       stroller       stroller
emigrate       SUBJECT   emigre        emigrator      emigrator      emigrant
ubiquitous     NOMINAL   ubiquity      ubiquit        ubiquit        ubiquity
hinder         AGENT     hinderer      hinderer       hinderer       hinderer
vacant         NOMINAL   vacance       vacance        vacance        vacance

Table 6: Sample output from a selection of models. The words in bold mark the correct derivations. “Hindrance” and “vacancy” are the expected derived words for the last two rows.

It is a structurally similar task to ours, but it does not pose the same challenges (Cotterell et al., 2017a,b), which we have addressed in our work. The paradigm completion for derivational morphology dataset we use in this work was introduced in Cotterell et al. (2017b). They apply the model that won the 2016 SIGMORPHON shared task on inflectional morphology to derivational morphology (Kann and Schütze, 2016; Cotterell et al., 2016). We use this as our baseline.

Our implementation of the dictionary constraint is an example of a special constraint which can be directly incorporated into the inference algorithm at little additional cost. Roth and Yih (2004, 2007) propose a general inference procedure that naturally incorporates constraints by recasting inference as solving an integer linear program.

Beam or hypothesis rescoring to incorporate an expensive or non-decomposable signal into search has a history in machine translation (Huang and Chiang, 2007). In inflectional morphology, Nicolai et al. (2015) use this idea to rerank hypotheses using orthographic features, and Faruqui et al. (2016) use a character-level language model. Our approach is similar to Faruqui et al. (2016) in that we use statistics from a raw corpus, but at the token level.

There have been several attempts to use distributional information in morphological generation and analysis. Soricut and Och (2015) collect pairs of words related by any morphological change in an unsupervised manner, then select a vector offset which best explains their observations. There has been subsequent work exploring the vector offset method, finding it unsuccessful in capturing derivational transformations (Gladkova et al., 2016). However, we use more expressive, non-linear functions to model derivational transformations and report positive results. Gupta et al. (2017) learn a linear transformation per orthographic rule to solve a word analogy task. Our distributional model learns a function per derivational transformation, not per orthographic rule, which allows it to generalize to unseen orthography.

9 Implementation Details

Our models are implemented in Python using the DyNet deep learning library (Neubig et al., 2017). The code is freely available for download.4

Sequence Model The sequence-to-sequence model uses character embeddings of size 20, which are shared across the encoder and decoder, with a vocabulary size of 63. The hidden states of the LSTMs are of size 40.

For training, we use Adam with an initial learning rate of 0.005, a batch size of 5, and train for a maximum of 30 epochs. If after one epoch of the training data the loss on the validation set does not decrease, we anneal the learning rate by half and revert to the previous best model.

During decoding, we find the top 1 most probable sequence as discussed in Section 6 unless rescoring is used, in which case we use the top 10.

Rescorer The rescorer is a 1-hidden-layer perceptron with a tanh nonlinearity and 4 hidden units. It is trained for a maximum of 5 epochs.

Distributional Model The DIST model is a 1-hidden-layer perceptron with a tanh nonlinearity and 100 hidden units. It is trained for a maximum of 25 epochs.

4 https://github.com/danieldeutsch/acl2018
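The hyperparameters above can be collected into a small DyNet setup. This is a hedged sketch of how such a configuration might look, not the released code; the single-layer LSTM choice and the attention-fed decoder input size are assumptions not stated in the paper.

```python
import dynet as dy

CHAR_VOCAB, EMB_DIM, HID_DIM = 63, 20, 40

pc = dy.ParameterCollection()
char_emb = pc.add_lookup_parameters((CHAR_VOCAB, EMB_DIM))   # embeddings shared by encoder and decoder
enc_fwd = dy.LSTMBuilder(1, EMB_DIM, HID_DIM, pc)            # bidirectional encoder, forward half
enc_bwd = dy.LSTMBuilder(1, EMB_DIM, HID_DIM, pc)            # bidirectional encoder, backward half
dec = dy.LSTMBuilder(1, EMB_DIM + 2 * HID_DIM, HID_DIM, pc)  # decoder input: char embedding + attention context
W_out = pc.add_parameters((CHAR_VOCAB, HID_DIM))             # projection to character logits
b_out = pc.add_parameters((CHAR_VOCAB,))

trainer = dy.AdamTrainer(pc, alpha=0.005)                    # Adam with initial learning rate 0.005
BATCH_SIZE, MAX_EPOCHS = 5, 30
# After any epoch where validation loss does not improve: halve the learning rate
# and reload the previous best model, as described in Section 9.
```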

10 Conclusion

In this work, we present a novel aggregation model for derived word generation. This model learns to choose between the predictions of orthographically- and distributionally-informed models. This ameliorates suffix ambiguity and orthographic irregularity, the salient problems of the generation task. Concurrently, we show that derivational transformations can be usefully modeled as nonlinear functions on distributional word embeddings. The distributional and orthographic models aggregated contribute orthogonal information to the aggregate, as shown by substantial improvements over state-of-the-art results, and qualitative analysis. Two ways of incorporating corpus knowledge – constrained decoding and rescoring – demonstrate further improvements to our main contribution.

Acknowledgements

We would like to thank Shyam Upadhyay, Jordan Kodner, and Ryan Cotterell for insightful discussions about derivational morphology. We would also like to thank our anonymous reviewers for helpful feedback on clarity and presentation.

This work was supported by Contract HR0011-15-2-0025 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017a. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver. Association for Computational Linguistics. http://www.aclweb.org/anthology/K17-2001.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task – morphological reinflection. In ACL 2016, page 10.

Ryan Cotterell, Ekaterina Vylomova, Huda Khayrallah, Christo Kirov, and David Yarowsky. 2017b. Paradigm completion for derivational morphology. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 714–720, Copenhagen, Denmark. Association for Computational Linguistics. https://www.aclweb.org/anthology/D17-1074.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195, Atlanta, Georgia. Association for Computational Linguistics. http://www.aclweb.org/anthology/N13-1138.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California. Association for Computational Linguistics. http://www.aclweb.org/anthology/N16-1077.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In Proceedings of the NAACL Student Research Workshop, pages 8–15, San Diego, California. Association for Computational Linguistics. http://www.aclweb.org/anthology/N16-2002.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.

Arihant Gupta, Syed Sarfaraz Akhtar, Avijit Vajpayee, Arjit Srivastava, Madan Gopal Jhanwar, and Manish Shrivastava. 2017. Exploiting morphological regularities in distributional word representations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 292–297, Copenhagen, Denmark. Association for Computational Linguistics. https://www.aclweb.org/anthology/D17-1028.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 144–151, Prague, Czech Republic. Association for Computational Linguistics. http://www.aclweb.org/anthology/P07-1019.

Mans Hulden, Markus Forsberg, and Malin Ahlberg. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 569–578, Gothenburg, Sweden. Association for Computational Linguistics. http://www.aclweb.org/anthology/E14-1060.

Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 555–560, Berlin, Germany. Association for Computational Linguistics. http://anthology.aclweb.org/P16-2090.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics. http://aclweb.org/anthology/D15-1166.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria. Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-3512.

A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R. Grishman. 2004. The NomBank project: An interim report. In HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotation, pages 24–31, Boston, Massachusetts, USA. Association for Computational Linguistics.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR Workshop Papers, Scottsdale, Arizona. https://arxiv.org/pdf/1301.3781.pdf.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922–931, Denver, Colorado. Association for Computational Linguistics. http://www.aclweb.org/anthology/N15-1093.

Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. 2016. Weighting finite-state transductions with neural context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 623–633, San Diego, California. Association for Computational Linguistics. http://www.aclweb.org/anthology/N16-1076.

D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Hwee Tou Ng and Ellen Riloff, editors, Proceedings of the Conference on Computational Natural Language Learning (CoNLL), pages 1–8. Association for Computational Linguistics. http://cogcomp.org/papers/RothYi04.pdf.

D. Roth and W. Yih. 2007. Global inference for entity and relation identification via a linear programming formulation. http://cogcomp.org/papers/RothYi07.pdf.

Wolfgang Seeker and Özlem Çetinoğlu. 2015. A graph-based lattice dependency parser for joint morphological segmentation and syntactic analysis. Transactions of the Association for Computational Linguistics 3:359–373.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637, Denver, Colorado. Association for Computational Linguistics. http://www.aclweb.org/anthology/N15-1186.
