A Distributional and Orthographic Aggregation Model for English Derivational Morphology
Daniel Deutsch∗, John Hewitt∗ and Dan Roth
Department of Computer and Information Science, University of Pennsylvania
fddeutsch,johnhew,[email protected]

∗ These authors contributed equally; listed alphabetically.

Abstract

Modeling derivational morphology to generate words with particular semantics is useful in many text generation tasks, such as machine translation or abstractive question answering. In this work, we tackle the task of derived word generation. That is, given the word “run,” we attempt to generate the word “runner” for “someone who runs.” We identify two key problems in generating derived words from root words and transformations: suffix ambiguity and orthographic irregularity. We contribute a novel aggregation model of derived word generation that learns derivational transformations both as orthographic functions using sequence-to-sequence models and as functions in distributional word embedding space. Our best open-vocabulary model, which can generate novel words, and our best closed-vocabulary model show 22% and 37% relative error reductions over current state-of-the-art systems on the same dataset.

Figure 1: Diagram depicting the flow of our aggregation model. Two models generate a hypothesis according to orthogonal information; then one is chosen as the final model generation. Here, the hypothesis from the distributional model is chosen.

1 Introduction

The explicit modeling of morphology has been shown to improve a number of tasks (Seeker and Çetinoğlu, 2015; Luong et al., 2013). In a large number of the world's languages, many words are composed through morphological operations on subword units. Some languages are rich in inflectional morphology, characterized by syntactic transformations like pluralization. Similarly, languages like English are rich in derivational morphology, where the semantics of words are composed from smaller parts. The AGENT derivational transformation, for example, answers the question, what is the word for ‘someone who runs’? with the answer, a runner.¹ Here, AGENT is spelled out as suffixing -ner onto the root verb run.

¹ We use the verb run as a demonstrative example; the transformation can be applied to most verbs.

We tackle the task of derived word generation. In this task, a root word x and a derivational transformation t are given to the learner. The learner's job is to produce the result of the transformation on the root word, called the derived word y. Table 1 gives examples of these transformations.

x            t          y
wise         ADVERB   → wisely
simulate     RESULT   → simulation
approve      RESULT   → approval
overstate    RESULT   → overstatement
yodel        AGENT    → yodeler
survive      AGENT    → survivor
intense      NOMINAL  → intensity
effective    NOMINAL  → effectiveness
pessimistic  NOMINAL  → pessimism

Table 1: The goal of derived word generation is to produce the derived word, y, given both the root word, x, and the transformation t, as demonstrated here with examples from the dataset.
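Concretely, each entry of Table 1 can be read as a training triple: a root word x, a transformation t, and the derived word y that the learner must produce. The sketch below only illustrates this interface; the names and types are illustrative and are not taken from the dataset or from any system evaluated later.

```python
from typing import NamedTuple

class Example(NamedTuple):
    root: str            # x: the root word
    transformation: str  # t: one of ADVERB, RESULT, AGENT, NOMINAL, ...
    derived: str         # y: the derived word the learner must produce

# A few of the (x, t, y) triples shown in Table 1.
EXAMPLES = [
    Example("wise", "ADVERB", "wisely"),
    Example("simulate", "RESULT", "simulation"),
    Example("yodel", "AGENT", "yodeler"),
    Example("intense", "NOMINAL", "intensity"),
]

def generate(root: str, transformation: str) -> str:
    """The learning problem: map a pair (x, t) to the derived word y."""
    raise NotImplementedError  # implemented by the models described below
```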
Previous approaches to derived word generation model the task as a character-level sequence-to-sequence (seq2seq) problem (Cotterell et al., 2017b). The letters from the root word and some encoding of the transformation are given as input to a neural encoder, and the decoder is trained to produce the derived word, one letter at a time. We identify the following problems with these approaches:

First, because these models are unconstrained, they can generate sequences of characters that do not form actual words. We argue that requiring the model to generate a known word is a reasonable constraint in the special case of English derivational morphology, and doing so avoids a large number of common errors.

Second, sequence-based models can only generalize string manipulations (such as “add -ment”) if they appear frequently in the training data. Because of this, they are unable to generate derived words that do not follow typical patterns, such as generating truth as the nominative derivation of true. We propose to learn a function for each transformation in a low dimensional vector space that corresponds to mapping from representations of the root word to the derived word. This eliminates the reliance on orthographic information, unlike related approaches to distributional semantics, which operate at the suffix level (Gupta et al., 2017).

We contribute an aggregation model of derived word generation that produces hypotheses independently from two separate learned models: one from a seq2seq model with only orthographic information, and one from a feed-forward network using only distributional semantic information in the form of pretrained word vectors. The model learns to choose between the hypotheses according to the relative confidence of each. This system can be interpreted as learning to decide between positing an orthographically regular form or a semantically salient word. See Figure 1 for a diagram of our model.

We show that this model helps with two open problems with current state-of-the-art seq2seq derived word generation systems, suffix ambiguity and orthographic irregularity (Section 2). We also improve the accuracy of seq2seq-only derived word systems by adding external information through constrained decoding and hypothesis rescoring. These methods provide orthogonal gains to our main contribution.

We evaluate models in two categories: open vocabulary models that can generate novel words unattested in a preset vocabulary, and closed-vocabulary models, which cannot. Our best open-vocabulary and closed-vocabulary models demonstrate 22% and 37% relative error reductions over the current state of the art.

2 Background: Derivational Morphology

Derivational transformations generate novel words that are semantically composed from the root word and the transformation. We identify two unsolved problems in derived word transformation, each of which we address in Sections 3 and 4.

First, many plausible choices of suffix exist for a single pair of root word and transformation. For example, for the verb ground, the RESULT transformation could plausibly take as many forms as²

(ground, RESULT) → grounding
(ground, RESULT) → *groundation
(ground, RESULT) → *groundment
(ground, RESULT) → *groundal

However, only one is correct, even though each suffix appears often in the RESULT transformation of other words. We will refer to this problem as “suffix ambiguity.”

² The * indicates a non-word.

Second, many derived words seem to lack a generalizable orthographic relationship to their root words. For example, the RESULT of the verb speak is speech. It is unlikely, given an orthographically similar verb creak, that the RESULT be creech instead of, say, creaking. Seq2seq models must grapple with the problem of derived words that are the result of unlikely or potentially unseen string transformations. We refer to this problem as “orthographic irregularity.”
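To make the aggregation step of Figure 1 concrete before each component is described, the following is a minimal sketch of the decision structure. It assumes each component model exposes its hypothesis together with a confidence score; the hard comparison shown here is only illustrative, whereas the full system learns the choice from the relative confidence of the two models.

```python
from typing import Callable, Tuple

# (derived word, confidence); the confidence scale is an assumption of this sketch.
Hypothesis = Tuple[str, float]

def aggregate(
    seq2seq_model: Callable[[str, str], Hypothesis],
    distributional_model: Callable[[str, str], Hypothesis],
    root: str,
    transformation: str,
) -> str:
    """Pick between an orthographic and a distributional hypothesis.

    The full system learns this choice from the relative confidence of the
    two models; the hard comparison below only shows the decision structure.
    """
    ortho_word, ortho_conf = seq2seq_model(root, transformation)
    dist_word, dist_conf = distributional_model(root, transformation)
    return dist_word if dist_conf > ortho_conf else ortho_word
```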
3 Sequence Models and Corpus Knowledge

In this section, we introduce the prior state-of-the-art model, which serves as our baseline system. Then we build on top of this system by incorporating a dictionary constraint and rescoring the model's hypotheses with token frequency information to address the suffix ambiguity problem.

3.1 Baseline Architecture

We begin by formalizing the problem and defining some notation. For source word x = x_1, x_2, ..., x_m, a derivational transformation t, and target word y = y_1, y_2, ..., y_n, our goal is to learn some function from the pair (x, t) to y. Here, x_i and y_j are the ith and jth characters of the input strings x and y. We will sometimes use x_{1:i} to denote x_1, x_2, ..., x_i, and similarly for y_{1:j}.

The current state-of-the-art model for derived-form generation approaches this problem by learning a character-level encoder-decoder neural network with an attention mechanism (Cotterell et al., 2017b; Bahdanau et al., 2014).

The input to the bidirectional LSTM encoder (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) is the sequence #, x_1, x_2, ..., x_m, #, t, where # is a special symbol to denote the start and end of a word, and the encoding of the derivational transformation t is concatenated to the input characters. The model is trained to minimize the cross entropy of the training data. We refer to our reimplementation of this model as SEQ. For a more detailed treatment of neural sequence-to-sequence models with attention, we direct the reader to Luong et al. (2015).

The goal of decoding is to find the most probable structure \hat{y} conditioned on some observation x and transformation t. That is, the problem is to solve

    \hat{y} = \arg\max_{y \in Y} p(y \mid x, t)        (1)
            = \arg\min_{y \in Y} -\log p(y \mid x, t)  (2)

where Y is the set of valid structures. Sequential models have a natural ordering y = y_1, y_2, ..., y_n over which -\log p(y \mid x, t) can be decomposed:

    -\log p(y \mid x, t) = \sum_{i=1}^{n} -\log p(y_i \mid y_{1:i-1}, x, t)    (3)

Solving Equation 2 can be viewed as solving a shortest path problem from a special starting state to a special ending state via some path which uniquely represents y. Each vertex in the graph represents some sequence y_{1:i}, and the weight of the edge from y_{1:i} to y_{1:i+1} is given by

    -\log p(y_{i+1} \mid y_{1:i}, x, t)    (4)

The weight of the path from the start state to the end state via the unique path that describes y is exactly equal to Equation 3. When the vocabulary size is too large, the exact shortest path is intractable, and approximate search methods, such as beam search, are used instead.
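The approximate search just described can be made concrete with a short sketch. The following is a minimal beam search over character prefixes, assuming a callable next_log_probs that exposes the decoder's per-character log-probabilities; it illustrates the search procedure and is not the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

END = "#"  # the special symbol that marks the end of a word

def beam_search(
    next_log_probs: Callable[[str], Dict[str, float]],
    beam_size: int = 4,
    max_len: int = 30,
) -> str:
    """Approximate the arg min of Equation 2 by expanding character prefixes.

    next_log_probs(prefix) stands in for the trained decoder: it returns
    log p(next character | prefix, x, t) for each candidate character, so
    extending a prefix by one character adds the edge weight of Equation 4.
    """
    beams: List[Tuple[str, float]] = [("", 0.0)]   # (prefix, path cost so far)
    finished: List[Tuple[str, float]] = []

    for _ in range(max_len):
        candidates: List[Tuple[str, float]] = []
        for prefix, cost in beams:
            for char, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + char, cost - log_p))
        # Keep only the lowest-cost partial paths.
        candidates.sort(key=lambda item: item[1])
        beams = []
        for prefix, cost in candidates[:beam_size]:
            if prefix.endswith(END):
                finished.append((prefix[:-1], cost))  # strip the end symbol
            else:
                beams.append((prefix, cost))
        if not beams:
            break

    best_prefix, _ = min(finished or beams, key=lambda item: item[1])
    return best_prefix
```

With the real decoder plugged in, the returned string is the hypothesis \hat{y}; the dictionary constraint mentioned above would additionally restrict which character extensions are allowed at each step.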