
Recurrent Neural Networks in Linguistic Theory: Revisiting Pinker and Prince (1988) and the Past Tense Debate

Christo Kirov
Johns Hopkins University, Baltimore, MD
[email protected]

Ryan Cotterell
Johns Hopkins University, Baltimore, MD
[email protected]

Transactions of the Association for Computational Linguistics, vol. 6, pp. 651–666, 2018. Action Editor: Sebastian Padó. Submission batch: 2/2018; Revision batch: 5/2018; Published 12/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Can advances in NLP help advance cognitive modeling? We examine the role of artificial neural networks, the current state of the art in many common NLP tasks, by returning to a classic case study. In 1986, Rumelhart and McClelland famously introduced a neural architecture that learned to transduce English verb stems to their past tense forms. Shortly thereafter, in 1988, Pinker and Prince presented a comprehensive rebuttal of many of Rumelhart and McClelland's claims. Much of the force of their attack centered on the empirical inadequacy of the Rumelhart and McClelland model. Today, however, that model is severely outmoded. We show that the Encoder-Decoder network architectures used in modern NLP systems obviate most of Pinker and Prince's criticisms without requiring any simplification of the past tense mapping problem. We suggest that the empirical performance of modern networks warrants a reexamination of their utility in linguistic and cognitive modeling.

1 Introduction

In their famous 1986 opus, Rumelhart and McClelland (R&M) describe a neural network capable of transducing English verb stems to their past tense. The strong cognitive claims in the article fomented a veritable brouhaha in the linguistics community and eventually led to the highly influential rebuttal of Pinker and Prince (1988) (P&P). P&P highlighted the extremely poor empirical performance of the R&M model, and pointed out a number of theoretical issues with the model, which they suggested would apply to any neural network, contemporarily branded connectionist approaches. Their critique was so successful that many linguists and cognitive scientists to this day do not consider neural networks a viable approach to modeling linguistic data and human cognition.

In the field of natural language processing (NLP), however, neural networks have experienced a renaissance. With novel architectures, large new data sets available for training, and access to extensive computational resources, neural networks now constitute the state of the art in many NLP tasks.

However, NLP as a discipline has a distinct practical bent and more often concerns itself with the large-scale engineering applications of language technologies. As such, the field's findings are not always considered relevant to the scientific study of language (i.e., the field of linguistics). Recent work, however, has indicated that this perception is changing, with researchers, for example, probing the ability of neural networks to learn syntactic dependencies like subject–verb agreement (Linzen et al., 2016).

Moreover, in the domains of morphology and phonology, both NLP practitioners and linguists have considered virtually identical problems, seemingly unbeknownst to each other. For example, both computational and theoretical morphologists are concerned with how different inflected forms in the lexicon are related and how one can learn to generate such inflections from data. Indeed, the original R&M network focuses on such a generation task, namely, generating English past tense forms from their stems. R&M's network, however, was severely limited and did not generalize correctly to held-out data. In contrast, state-of-the-art morphological generation networks used in NLP, built from the modern evolution of recurrent neural networks (RNNs) explored by Elman (1990) and others, solve the same problem almost perfectly (Cotterell et al., 2016). This level of performance on a cognitively relevant problem suggests that it is time to consider further incorporating network modeling into the study of linguistics and cognitive science.

Crucially, we wish to sidestep one of the issues that framed the original debate between P&P and R&M—whether or not neural models learn and use "rules." From our perspective, any system that picks up systematic, predictable patterns in data may be referred to as rule-governed. We focus instead on an empirical assessment of the ability of a modern state-of-the-art neural architecture to learn linguistic patterns, asking the following questions: (i) Does the learner induce the full set of correct generalizations about the data? Given a range of novel inputs, to what extent does it apply the correct transformations to them? (ii) Does the behavior of the learner mimic humans? Are the errors human-like?

In this work, we run new experiments examining the ability of the Encoder-Decoder architecture developed for machine translation (Bahdanau et al., 2014; Sutskever et al., 2014) to learn the English past tense. The results suggest that modern nets absolutely meet the first criterion, and often meet the second. Furthermore, they do this given limited prior knowledge of linguistic structure: The networks we consider do not have phonological features built into them and must instead learn their own representations for input phonemes. The design and performance of these networks invalidate many of the criticisms in Pinker and Prince (1988). We contend that, given the gains displayed in this case study, which is characteristic of problems in the morpho-phonological domain, researchers across linguistics and cognitive science should consider evaluating modern neural architectures as part of their modeling toolbox.

This paper is structured as follows. Section 2 describes the problem under consideration, the English past tense. Section 3 lays out the original Rumelhart and McClelland model from 1986 in modern machine-learning parlance, and compares it to a state-of-the-art Encoder-Decoder architecture. A historical perspective on alternative approaches to modeling, both neural and non-neural, is provided in Section 4. The empirical performance of the Encoder-Decoder architecture is evaluated in Section 5. Section 6 provides a summary of which of Pinker and Prince's original criticisms have effectively been resolved, and which ones still require further consideration. Concluding remarks follow.

2 The English Past Tense

Many languages mark words with syntactico-semantic distinctions. For instance, English marks the distinction between present and past tense verbs, for example, walk and walked. English verbal morphology is relatively impoverished, distinguishing maximally five forms for the copula to be and only three forms for most verbs. In this work, we consider learning to conjugate the English verb forms, rendered as phonological strings. As it is the focus of the original R&M study, we focus primarily on the English past tense formation.

Both regular and irregular patterning exist in English. Orthographically, the canonical regular suffix is -ed, which, phonologically, may be rendered as one of three phonological strings: [-Id], [-d], or [-t]. The choice among the three is deterministic, depending only on the phonological properties of the previous segment. English selects [-Id] where the previous phoneme is a [t] or [d], for example, [pæt]→[pætId] (pat→patted) and [pæd]→[pædId] (pad→padded). In other cases, English enforces voicing agreement: It opts for [-d] when the preceding phoneme is a voiced consonant or a vowel (e.g., [sæg]→[sægd] (sag→sagged) and [SoU]→[SoUd] (show→showed)), and for [-t] when the preceding phoneme is an unvoiced consonant (e.g., [sæk]→[sækt] (sack→sacked)). English irregulars are either suppletive (e.g., [goU]→[wEnt] (go→went)), or exist in sub-regular islands defined by processes like ablaut (e.g., sing→sang) that may contain several verbs (Nelson, 2010); see Table 1.

Table 1: Examples of inflected English verbs.

            orthographic                    IPA
  stem    past      part.       stem    past     part.     infl. type
  go      went      gone        goU     wEnt     gon       suppletive
  sing    sang      sung        sIN     sæN      sUN       ablaut
  swim    swam      swum        swIm    swæm     swUm      ablaut
  sack    sacked    sacked      sæk     sækt     sækt      [-t]
  sag     sagged    sagged      sæg     sægd     sægd      [-d]
  pat     patted    patted      pæt     pætId    pætId     [-Id]
  pad     padded    padded      pæd     pædId    pædId     [-Id]
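The deterministic choice among [-Id], [-d], and [-t] just described is simple enough to state as a short program. The following is a minimal sketch, not part of the original study; the ASCII stand-ins for IPA symbols and the segment classes are illustrative assumptions only.

```python
# Sketch: choosing the regular past-tense allomorph from the stem's final
# segment, as described above. The phoneme classes are simplified
# assumptions (ASCII stand-ins for IPA), not a complete inventory.

VOWELS = set("aeiouIUE2@")          # illustrative vowel symbols
VOICED_CONS = set("bdgvzmnlrwjN")   # illustrative voiced consonants

def past_tense_suffix(stem: str) -> str:
    """Return '-Id', '-d', or '-t' for a phonemically transcribed stem."""
    last = stem[-1]
    if last in ("t", "d"):                        # pat -> patted, pad -> padded
        return "-Id"
    if last in VOWELS or last in VOICED_CONS:     # sag -> sagged, show -> showed
        return "-d"
    return "-t"                                   # sack -> sacked

if __name__ == "__main__":
    for stem in ["pæt", "pæd", "sæg", "SoU", "sæk"]:
        print(stem, past_tense_suffix(stem))
```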

Single vs. Dual Route. A frequently discussed cognitive aspect of past tense processing concerns whether or not irregular forms have their own processing pipeline in the brain. Pinker and Prince (1988) proposed separate modules for regular and irregular verbs; regular verbs go through a general, rule-governed transduction mechanism, and exceptional irregulars are produced via simple look-up.¹ While some studies (e.g., Marslen-Wilson and Tyler, 1997; Ullman et al., 1997) provide corroborating evidence from speakers with selective impairments to regular or irregular verb production, others have called these results into doubt (Stockall and Marantz, 2006). From the perspective of this paper, a complete model of the English past tense should cover both regular and irregular transformations. The neural network approaches we advocate for achieve this goal, but do not clearly fall into either the single- or dual-route category—internal computations performed by each network remain opaque, so we cannot at present make a claim whether two separable computation paths are present.

¹ Note that irregular look-up can simply be recast as the application of a context-specific rule.

2.1 Acquisition of the Past Tense

The English past tense is of considerable theoretical interest because of the now well-studied acquisition patterns of children. As first shown by Berko (1958) in the so-called wug-test, knowledge of English morphology cannot be attributed solely to memorization. Indeed, both adults and children are fully capable of generalizing the patterns to novel words (e.g., [w2g]→[w2gd] (wug→wugged)). During acquisition, only a few types of errors are common; children rarely blend regular and irregular forms—for example, the past tense of come is either produced as comed or came, but rarely camed (Pinker, 1999).

Acquisition Patterns for Irregular Verbs. It is widely claimed that children learning the past tense forms of irregular verbs exhibit a "U-shaped" learning curve. At first, they correctly conjugate irregular forms (e.g., come→came), then they regress during a period of overregularization, producing the past tense as comed as they acquire the general past tense formation. Finally, they learn to produce both the regular and irregular forms. Plunkett and Marchman, however, observed a more nuanced form of this behavior. Rather than a macro U-shaped learning process that applies globally and uniformly to all irregulars, they noted that many irregulars oscillate between correct and overregularized productions (Marchman, 1988). These oscillations, which Plunkett and Marchman refer to as a micro U-shape, further apply at different rates for different verbs (Plunkett and Marchman, 1991). Interestingly, although the exact pattern of irregular acquisition may be disputed, children rarely overirregularize, that is, misconjugate a regular verb as if it were irregular, such as ping→pang.

3 1986 vs. Today

In this section, we compare the original R&M architecture from 1986 to today's state-of-the-art neural architecture for morphological transduction, the Encoder-Decoder model.

3.1 Rumelhart and McClelland (1986)

For many linguists, the face of neural networks to this day remains the work of R&M. Here, we describe in detail their original architecture, using modern parlance whenever possible. Fundamentally, R&M were interested in designing a sequence-to-sequence network for variable-length input using a small feed-forward network. From an NLP perspective, this work constitutes one of the first attempts to design a network for a task reminiscent of popular NLP tasks today that require variable-length input (e.g., part-of-speech tagging, parsing, and generation).

Wickelphones and Wickelfeatures. Unfortunately, a fixed-sized feed-forward network is not immediately compatible with the goal of transducing sequences of varying lengths. Rumelhart and McClelland decided to get around this limitation by representing each string as the set of its constituent phoneme trigrams. Each trigram is termed a Wickelphone (Wickelgren, 1969). As a concrete example, the IPA-string [#kæt#], marked with a special beginning- and end-of-string character, contains three distinct Wickelphones: [#kæ], [kæt], [æt#]. In fact, Rumelhart and McClelland went one step further and decomposed Wickelphones into component Wickelfeatures, or trigrams of phonological features, one for each Wickelphone phoneme. For example, the Wickelphone [ipt] is represented by the Wickelfeatures ⟨+vowel, +unvoiced, +interrupted⟩ and ⟨+high, +stop, +stop⟩. Because there are far fewer Wickelfeatures than Wickelphones, words could be represented with fewer units (of key importance for 1986 hardware), and more shared Wickelfeatures potentially meant better generalization.
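The trigram decomposition just described is easy to make concrete. The sketch below (our own illustration, not R&M's code) extracts the Wickelphone set of a boundary-marked string; the further decomposition into Wickelfeatures is omitted.

```python
# Sketch: mapping a phoneme string to its set of Wickelphones (phoneme
# trigrams), as described above. A real R&M-style encoder would further
# decompose each trigram into Wickelfeatures; that step is omitted here.

def wickelphones(phonemes: str, boundary: str = "#") -> set:
    """Return the set of phoneme trigrams of a boundary-marked string."""
    s = boundary + phonemes + boundary
    return {s[i:i + 3] for i in range(len(s) - 2)}

# The lossiness discussed below is easy to reproduce: a reduplicated form
# collapses onto the same trigram set as its base.
print(wickelphones("kæt"))                                 # three trigrams: #kæ, kæt, æt#
print(wickelphones("algal") == wickelphones("algalgal"))   # True
```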

We can describe R&M's representations using the modern linear-algebraic notation standard among researchers in neural networks. First, we assume that the language under consideration contains a fixed set of phonemes Σ, plus an edge symbol # marking the beginning and end of words. Then, we construct the set of all Wickelphones Φ and the set of all Wickelfeatures F by enumeration. The first layer of the R&M neural network consists of two deterministic functions: (i) φ : Σ* → B^|Φ| and (ii) f : B^|Φ| → B^|F|, where we define B = {−1, 1}. The first function φ maps a phoneme string to the set of Wickelphones that fire, as it were, on that string; for example, φ([#kæt#]) = {[#kæ], [kæt], [æt#]}. The output subset of Φ may be represented by a binary vector of length |Φ|, where a 1 means that the Wickelphone appears in the string and a −1 that it does not.² The second function f maps a set of Wickelphones to its corresponding set of Wickelfeatures.

Pattern Associator Network. Here we define the complete network of R&M. We denote strings of phonemes as x ∈ Σ*, where x_i is the i-th phoneme in a string. Given source and target phoneme strings x^(i), y^(i) ∈ Σ*, R&M optimize the following objective, a sum over the individual losses for each of the i = 1, ..., N training items:

  \sum_{i=1}^{N} \left\| \max\left\{0,\; -\pi(y^{(i)}) \odot \left(W \pi(x^{(i)}) + b\right)\right\} \right\|_1    (1)

where max{·} is taken point-wise, ⊙ is point-wise multiplication, W ∈ R^(|F|×|F|) is a projection matrix, b ∈ R^|F| is a bias term, and π = φ ◦ f is the composition of the Wickelphone and Wickelfeature encoding functions. Using modern terminology, the architecture is a linear model for a multi-label classification problem (Tsoumakas and Katakis, 2006): The goal is to predict the set of Wickelfeatures in the target form y^(i) given the input form x^(i) using a point-wise perceptron loss (hinge loss without a margin); that is, a binary perceptron predicts each feature independently, but there is one set of parameters {W, b}. The total loss incurred is the sum of the per-feature loss, hence the use of the L1 norm. The model is trained with stochastic sub-gradient descent (the perceptron update rule) (Rosenblatt, 1958; Bertsekas, 2015) with a fixed learning rate.³ Later work augmented the architecture with multiple layers and nonlinearities (Marcus, 2001, Table 3.3).

² We have chosen −1 instead of the more traditional 0 so that the objective function that Rumelhart and McClelland optimize may be more concisely written.

³ Follow-up work, e.g., Plunkett and Marchman (1991), has speculated that the original experiments in R&M may not have converged. Indeed, convergence may not be guaranteed depending on the fixed learning rate chosen. As Equation (1) is jointly convex in its parameters {W, b}, there exist convex optimization algorithms that will guarantee convergence, albeit often with a decaying learning rate.

Decoding. Decoding the R&M network necessitates solving a tricky optimization problem. Given an input phoneme string x^(i), we then must find the corresponding y′ ∈ Σ* that minimizes

  \left\| \pi(y') - \mathrm{threshold}\left(W \pi(x^{(i)}) + b\right) \right\|_0    (2)

where threshold is a step function that maps all non-positive reals to −1 and all positive reals to 1. In other words, we seek the phoneme string y′ that shares the most features with the maximum a posteriori decoded binary vector. This problem is intractable, and so R&M provide an approximation. For each test stem, they hand-selected a set of likely past-tense candidate forms (for example, good candidates for the past tense of break would be {break, broke, brake, braked}), and chose the form with Wickelfeatures closest to the network's output. This manual approximate decoding procedure is not intended to be cognitively plausible.

Architectural Limitations. R&M used Wickelphones and Wickelfeatures in order to help with generalization and limit their network to a tractable size. However, this came at a significant cost to the network's ability to represent unique strings—the encoding is lossy: Two words may have the same set of Wickelphones or features. The easiest way to see this shortcoming is to consider morphological reduplication, which is common in many of the world's languages. P&P provide an example from the Australian language Oykangand, which distinguishes between algal 'straight' and algalgal 'ramrod straight'; both of these strings have the identical Wickelphone set {[#al], [alg], [lga], [gal], [al#]}. Moreover, P&P point out that phonologically related words such as [slIt] and [sIlt] have disjoint sets of Wickelphones: {[#sl], [slI], [lIt], [It#]} and {[#sI], [sIl], [Ilt], [lt#]}, respectively. These two words differ only by an instance of metathesis, or swapping the order of nearby sounds. The use of Wickelphone representations imposes the strong claim that they have nothing in common phonologically, despite sharing all phonemes. P&P suggest this is unlikely to be the case. As one point of evidence, metathesis of the kind that differentiates [slIt] and [sIlt] is a common diachronic change. In English, for example, [horse] evolved from [hross], and [bird] from [brid] (Jesperson, 1942).
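Returning to the pattern associator, the point-wise perceptron training implied by Equation (1) can be sketched as follows. This is a minimal illustration under the assumption that the encoding function π is already available as a ±1 feature vector; it is not a reconstruction of R&M's original implementation.

```python
# Sketch of the pattern associator as a multi-label perceptron, following
# the objective in Equation (1). Rows of X and Y are assumed to be +/-1
# Wickelfeature vectors pi(x) and pi(y); the encoding is not implemented here.
import numpy as np

def perceptron_epoch(X, Y, W, b, lr=0.1):
    """One pass of the point-wise perceptron update over encoded pairs.

    X, Y: arrays of shape (num_items, num_features) with entries in {-1, +1}.
    W, b: weights (num_features x num_features) and bias (num_features,).
    """
    for x, y in zip(X, Y):
        scores = W @ x + b
        pred = np.where(scores > 0, 1.0, -1.0)
        mistakes = pred != y                       # features predicted with the wrong sign
        # Each mistaken output feature gets an independent perceptron update.
        W[mistakes] += lr * np.outer(y[mistakes], x)
        b[mistakes] += lr * y[mistakes]
    return W, b
```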

3.2 Encoder-Decoder Architectures

The NLP community has recently developed an analogue to the past-tense generation task originally considered by R&M: morphological paradigm completion (Durrett and DeNero, 2013; Ahlberg et al., 2015; Cotterell et al., 2015; Nicolai et al., 2015; Faruqui et al., 2016). The goal is to train a model capable of mapping the lemma (stem in the case of English) to each form in the paradigm. In the case of English, the goal would be to map a lemma, for example, walk, to its past-tense word walked as well as its gerund and third person present singular, walking and walks, respectively. This task generalizes the R&M setting in that it requires learning more mappings than simply lemma to past tense. By definition, any system that solves the more general morphological paradigm completion task must also be able to solve the original R&M task.

As we wish to highlight the strongest currently available alternative to R&M, we focus on the state of the art in morphological paradigm completion: the Encoder-Decoder network architecture (ED) (Cotterell et al., 2016). This architecture consists of two RNNs coupled together by an attention mechanism. The encoder RNN reads each symbol in the input string one at a time, first assigning it a unique embedding, then processing that embedding to produce a representation of the phoneme given the rest of the phonemes in the string. The decoder RNN produces a sequence of output phonemes one at a time, using the attention mechanism to peek back at the encoder states as needed. Decoding ends when a halt symbol is output. Formally, the ED architecture encodes the probability distribution over forms

  p(y \mid x) = \prod_{i=1}^{N} p(y_i \mid y_1, \ldots, y_{i-1}, c_i)    (3)
              = \prod_{i=1}^{N} g(y_{i-1}, s_i, c_i)    (4)

where g is a non-linear function (in our case, a multi-layer perceptron), s_i is the hidden state of the decoder RNN, y = (y_1, ..., y_N) is the output sequence (a sequence of N = |y| characters), and finally c_i is an attention-weighted sum of the encoder RNN hidden states h_k, using attention weights α_k(s_{i−1}) that are computed based on the previous decoder hidden state: c_i = \sum_{k=1}^{|x|} α_k(s_{i−1}) h_k.

In contrast to the R&M network, the ED network optimizes the log-likelihood of the training data, that is, \sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}) for i = 1, ..., M training items. We refer the reader to Bahdanau et al. (2014) for the complete architectural specification of the specific ED model we apply in this paper.

Theoretical Improvements. Although there are a number of possible variants of the ED architecture (Luong et al., 2015),⁴ they all share several critical features that make up for many of the theoretical shortcomings of the feed-forward R&M model. The encoder reads in each phoneme sequentially, preserving identity and order and allowing any string of arbitrary length to receive a unique representation. Despite this encoding, a flexible notion of string similarity is also maintained, as the ED model learns a fixed embedding for each phoneme that forms part of the representation of all strings that share the phoneme. When the network encodes [sIlt] and [slIt], it uses the same phoneme embeddings—only the order changes. Finally, the decoder permits sampling and scoring arbitrary-length fully formed strings in polynomial time (forward sampling), so there is no need to determine which string a non-unique set of Wickelfeatures represents. However, we note that decoding the 1-best string from a sequence-to-sequence model is likely NP-hard (1-best string decoding is hard even for weighted finite-state transducers [Goodman, 1998]).

Multi-Task Capability. A single ED model is easily adapted to multi-task learning (Caruana, 1997; Collobert et al., 2011), where each task is a single transduction (e.g., stem → past). Note that R&M would need a separate network for each transduction (e.g., stem → gerund and stem → past participle). In fact, the current state of the art in NLP is to learn all morphological transductions in a paradigm jointly. The key insight is to construct a single network p(y | x, t) to predict all inflections, marking the transformation in the input string—that is, we feed the network the string "w a l k" followed by a symbol marking the desired inflection as input, augmenting the alphabet Σ to include morphological descriptors. We refer the reader to Kann and Schütze (2016) for the encoding details. Thus, one network predicts all forms; for example, p(y | x=walk, t=past) yields a distribution over past tense forms for walk and p(y | x=walk, t=gerund) yields a distribution over gerunds.
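Concretely, the input tagging just described can be sketched as follows. The tag spellings and helper function are our own illustrative choices; see Kann and Schütze (2016) for the encoding actually used in the cited systems.

```python
# Sketch of the multi-task input tagging scheme: each training pair is a
# character-level source sequence with a task descriptor added to the
# alphabet. Tag names below are illustrative assumptions, not the exact
# symbols used by the systems cited in the text.

def make_example(lemma: str, form: str, task: str):
    source = list(lemma) + [f"<{task}>"]   # e.g., ['w', 'a', 'l', 'k', '<PAST>']
    target = list(form)
    return source, target

training_data = [
    make_example("walk", "walked", "PAST"),
    make_example("walk", "walking", "GERUND"),
    make_example("take", "took", "PAST"),
    make_example("take", "taken", "PASTPART"),
]
```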

⁴ For the experiments in this paper, we use the variant in Bahdanau et al. (2014), which has explicitly been shown to be state of the art in morphological transduction (Cotterell et al., 2016).
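For readers unfamiliar with attention, the context vector c_i used in Equations (3) and (4) can be sketched in a few lines. The dot-product scoring function below is a simplifying assumption; the Bahdanau et al. (2014) variant used here computes the scores with a small feed-forward network.

```python
# Sketch of the attention-weighted context vector c_i: a softmax over scores
# of the previous decoder state against each encoder state, followed by a
# weighted sum of encoder states.
import numpy as np

def context_vector(prev_decoder_state, encoder_states):
    """encoder_states: (src_len, hidden); prev_decoder_state: (hidden,)."""
    scores = encoder_states @ prev_decoder_state   # (src_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # attention weights alpha_k
    return weights @ encoder_states                # c_i, shape (hidden,)
```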

  Type of Model                 Reference                           Input                             Output
  Feedforward Network           Rumelhart and McClelland (1986)     Wickelphones                      Wickelphones
  Feedforward Network           MacWhinney and Leinbach (1991)      Fixed Size Phonological Template  Fixed Size Phonological Template
  Feedforward Network           Plunkett and Marchman (1991)        Fixed Size Phonological Template  Fixed Size Phonological Template
  Attractor Network             Hoeffner (1992)                     Semantics                         Fixed Size Phonological Template
  Feedforward Network           Plunkett & Marchman (1993)          Fixed Size Phonological Template  Fixed Size Phonological Template
                                Cottrell & Plunkett (1994)          Semantics                         Phonological String
  Feedforward Network           Hare, Elman, & Daugherty (1995)     Fixed Size Phonological Template  Inflection Class
  Feedforward Neural Network    Hare & Elman (1995)                 Semantics                         Fixed Size Phonological Template
  Recurrent Neural Network      Westermann & Goebel (1995)          Phonological String               Phonological String
  Feedforward Neural Network    Nakisa & Hahn (1996)                Fixed Size Phonological Template  Inflection Class
  Convolutional Neural Network  Bullinaria (1997)                   Phonological String               Phonological String
  Feedforward Neural Network    Plunkett & Nakisa (1997)            Fixed Size Phonological Template  Inflection Class
  Feedforward Neural Network    Plunkett & Juola (1999)             Fixed Size Phonological Template  Fixed Size Phonological Template
  Feedforward Neural Network    Hahn & Nakisa (2000)                Fixed Size Phonological Template  Inflection Class

Table 2: A curated list of related work, categorized by aspects of the technique. Based on a similar list found in Marcus (2001, page 82).

4 Related Work

In this section, we first describe direct follow-ups to the original 1986 R&M model, using various neural architectures. Then we review competing non-neural systems of context-sensitive rewrite rules in the style of the Sound Pattern of English (SPE) (Halle and Chomsky, 1968), as favored by Pinker and Prince.

4.1 Follow-ups to Rumelhart and McClelland (1986) Over the Years

Following R&M, a cottage industry devoted to cognitively plausible connectionist models of inflection learning sprouted in the linguistics and cognitive science literature. We provide a summary listing of the various proposals, along with broad-brush comparisons, in Table 2.

Although many of the approaches apply more modern feed-forward architectures than R&M, introducing multiple layers connected by nonlinear transformations, most continue to use feed-forward architectures with limited ability to deal with variable-length inputs and outputs and remain unable to produce and assign probability to arbitrary output strings.

MacWhinney and Leinbach (1991), Plunkett and Marchman (1991, 1993), and Plunkett and Juola (1999) map phonological strings to phonological strings using feed-forward networks, but rather than turning to Wickelphones to imprecisely represent strings of any length, they use fixed-size input and output templates, with units representing each possible symbol at each input and output position. For example, Plunkett and Marchman (1991, 1993) simplify the past-tense mapping problem by only considering a language of artificially generated words of exactly three syllables and a limited set of constructed past-tense formation patterns. MacWhinney and Leinbach (1991) and Plunkett and Juola (1999) additionally modify the input template to include extra units marking particular transformations (e.g., past or gerund), enabling their network to learn multiple mappings. Some proposals simplify the problem even further, mapping fixed-size inputs into a small finite set of categories, solving a classification problem rather than a transduction problem. Nakisa and Hahn (1996) and Hahn and Nakisa (2000) classify German noun stems into their appropriate plural inflection classes. Plunkett and Nakisa (1997) do the same for Arabic stems.

Hoeffner (1992), Hare and Elman (1995), and Cottrell and Plunkett (1994) also solve an alternative problem—mapping semantic representations (usually one-hot vectors with one unit per possible word type, and one unit per possible inflection) to phonological outputs. As these networks use unstructured semantic inputs to represent words, they must act as memories—the phonological content of any word must be memorized. This precludes generalization to word types that were not seen during training.

Of the proposals that map semantics to phonology, the architecture in Hoeffner (1992) is unique in that it uses an attractor network rather than a feed-forward network, with the main difference being training using Hebbian learning rather than the standard backpropagation algorithm. Cottrell and Plunkett (1994) present an early use of a simple recurrent network (Elman, 1990) to decode output strings, making their model capable of variable-length output.

Bullinaria (1997) includes one of the few models proposed that can deal with variable-length inputs. They use a derivative of the NETtalk pronunciation model (Sejnowski and Rosenberg, 1987) that would today be considered a convolutional network. Each input phoneme in a stem is read independently along with its left and right context phonemes within a limited context window (i.e., a convolutional kernel). Each kernel is then mapped to zero or more output phonemes within a fixed template. Because each output fragment is independently generated, the architecture is limited to learning only local constraints on output string structure. Similarly, the use of a fixed context window also means that inflectional patterns that depend on long-distance dependencies between input phonemes cannot be captured.

Finally, the model of Westermann and Goebel (1995) is arguably the most similar to a modern ED architecture, relying on simple recurrent networks to both encode input strings and decode output strings. However, the model was intended to explicitly instantiate a dual route mechanism and contains an additional explicit memory component to memorize irregulars. Despite the addition of this memory, the model was unable to fully learn the mapping from German verb stems to their participle forms, failing to capture the correct form for strong training verbs, including the copula sein → gewesen. As the authors note, this may be due to the difficulty of training simple recurrent networks, which tend to converge to poor local minima. Modern RNN varieties, such as the long short-term memory (LSTM) networks in the ED model, were specifically designed to overcome these training limitations (Hochreiter and Schmidhuber, 1997).
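To make the contrast with the ED model concrete, the following sketch shows the kind of fixed-size phonological template encoding used by several of the feed-forward models above. The slot layout, segment inventory, and padding symbol are illustrative assumptions, not those of any particular cited model.

```python
# Sketch of a fixed-size phonological template: every word is forced into a
# fixed number of slots, with one one-hot block per slot. Slot count,
# inventory, and padding are illustrative assumptions only.

SLOTS = 9   # e.g., three syllables worth of segments
INVENTORY = ["_", "b", "d", "g", "k", "p", "t", "s", "n", "a", "e", "i", "o", "u"]

def template_encode(segments):
    """Return a flat one-hot vector for at most SLOTS segments ('_' pads)."""
    padded = (list(segments) + ["_"] * SLOTS)[:SLOTS]
    vec = []
    for seg in padded:
        one_hot = [0] * len(INVENTORY)
        one_hot[INVENTORY.index(seg)] = 1
        vec.extend(one_hot)
    return vec

# Any form longer than the template simply cannot be represented, which is
# the limitation contrasted with the encoder-decoder approach in the text.
print(len(template_encode(["k", "a", "t"])))   # 9 * 14 = 126
```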
4.2 Non-neural Learners

P&P describe several basic ideas that underlie a traditional, symbolic rule learner. Such a learner produces SPE-style rewrite rules that may be applied to deterministically transform the input string into the target. Rule candidates are found by comparing the stem and the inflected form, treating the portion that changes as the rule that governs the transformation. This is typically a set of non-copy edit operations. If multiple stem–past pairs share similar changes, these may be collapsed into a single rule by generalizing over the shared phonological features involved in the changes. For example, if multiple stems are converted to the past tense by the addition of the suffix [-d], the learner may create a collapsed rule that adds the suffix to all stems that end in a [+voice] sound. Different rules may be assigned weights (e.g., probabilities) derived from how many stem–past pairs exemplify the rules. These weights decide which rules to apply to produce the past tense.

Several systems that follow this rule-based template have been developed in NLP. Although the SPE itself does not impose detailed restrictions on rule structure, these systems generate rules that can be compiled into finite-state transducers (Kaplan and Kay, 1994; Ahlberg et al., 2015). These systems generalize well, but even the most successful variants have been shown to perform significantly worse than state-of-the-art neural networks at morphological inflection, often with a >10 percentage point differential in accuracy on held-out data (Cotterell et al., 2016).

In the linguistics literature, the most straightforward, direct, machine-implemented instantiation of the P&P proposal is, arguably, the Minimal Generalization Learner (MGL) of Albright and Hayes (2003) (cf. Allen and Becker, 2015; Taatgen and Anderson, 2002). This model takes a mapping of phonemes to phonological features and makes feature-level generalizations like the post-voice [-d] rule described earlier. For a detailed technical description, see Albright and Hayes (2002). We treat the MGL as a baseline in §5.

Unlike Taatgen and Anderson (2002), who explicitly account for dual-route processing by including both memory retrieval and rule application submodules, Albright and Hayes (2003) and Allen and Becker (2015) rely on discovering and correctly weighting (using weights learned by log-linear regression) highly stem-specific rules to account for irregular transformations.

Within the context of rule-based systems, several proposals focus on the question of rule generalization, rather than rule synthesis. That is, given a set of predefined rules, the systems implement metrics to decide whether rules should generalize to novel forms, depending on the number of exceptions in the data set. Yang (2016) defines the 'tolerance principle,' a threshold for exceptionality beyond which a rule will fail to generalize. O'Donnell (2011) treats the question of whether a rule will generalize as one of optimal Bayesian inference.

5 Evaluation of the ED Learner

We evaluate the performance of the ED architecture in light of the criticisms P&P levied against the original R&M model. We show that, in most cases, these criticisms no longer apply.⁵

⁵ Data sets and code for all experiments are available at https://github.com/ckirov/RevisitPinkerAndPrince.

The most potent line of attack P&P use against the R&M model is that it simply does not learn the English past tense very well. Although the non-deterministic, manual, and non-precise decoding procedure used by R&M makes it difficult to obtain exact accuracy numbers, P&P estimate that the model only prefers the correct past tense form for about 67% of English verb stems. Furthermore, many of the errors made by the R&M network are unattested in human performance. For example, the model produces blends of regular and irregular past-tense formation (e.g., eat → ated) that children do not produce unless they mistake ate for a present stem (Pinker, 1999). Furthermore, the R&M model frequently produces irregular past tense forms when a regular formation is expected (e.g., ping → pang). Humans are more likely to overregularize. These behaviors suggest that the R&M model learns the wrong kind of generalizations. As shown subsequently, the ED architecture seems to avoid these pitfalls, while outperforming a P&P-style non-neural baseline.

5.1 Experiment 1: Learning the Past Tense

In the first experiment, we seek to show: (i) the ED model successfully learns to conjugate both regular and irregular verbs in the training data, and generalizes to held-out data at convergence, and (ii) the pattern of errors the model exhibits is compatible with attested speech errors.

CELEX Data Set. Our base data set consists of 4,039 verb types in the CELEX database (Baayen et al., 1993). Each verb is associated with a present tense form (stem) and past tense form, both in IPA. Each verb is also marked as regular or irregular (Albright and Hayes, 2003). A total of 168 of the 4,039 verb types were marked as irregular. We assigned verbs to train, development, and test sets according to a random 80-10-10 split. Each verb appears in exactly one of these sets once. This corresponds to a uniform distribution over types because every verb has an effective frequency of 1.

In contrast, the original R&M model was trained and tested (data was not held out) on a set of 506 stem/past pairs derived from Kučera and Francis (1967). A total of 98 of the 506 verb types were marked as irregular.

Types vs. Tokens. In real human communication, words follow a Zipfian distribution, with many irregular verbs being exponentially more common than regular verbs. Although this condition is more true to the external environment of language learning, it may not accurately represent the psychological reality of how that environment is processed. A body of psycholinguistic evidence (Bybee, 1995, 2001; Pierrehumbert, 2001) suggests that human learners generalize phonological patterns based on the count of word types they appear in, ignoring the token frequency of those types. Thus, we chose to weigh all verb types equally for training, effecting a uniform distribution over types as described above.

Hyperparameters and Other Details. Our architecture is nearly identical to that used in Bahdanau et al. (2014), with hyperparameters set following Kann and Schütze (2016, §4.1.1). Each input character has an embedding size of 300 units. The encoder consists of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) with two layers. There is a dropout value of 0.3 between the layers. The decoder is a unidirectional LSTM with two layers. Both the encoder and decoder have 100 hidden units. Training was done using the Adadelta procedure (Zeiler, 2012) with a learning rate of 1.0 and a minibatch size of 20. We train for 100 epochs to ensure that all verb forms in the training data are adequately learned. We decode the model with beam search (k = 12). The code for our experiments is derived from the OpenNMT package (Klein et al., 2017). We use accuracy as our metric of performance. We train the MGL as a non-neural baseline, using the code distributed with Albright and Hayes (2003) with default settings.

Results and Discussion. The non-neural MGL baseline unsurprisingly learns the regular past-tense pattern nearly perfectly, given that it is imbued with knowledge of phonological features as well as a list of phonologically illegal phoneme sequences to avoid in its output. However, in our testing of the MGL, the preferred past-tense output for all verbs was never an irregular formulation. This was true even for irregular verbs that were observed by the learner in the training set. One might say that the MGL is only intended to account for the regular route of a dual-route system. However, the intended scope of the MGL seems to be wider. The model is billed as accurately learning "islands of subregularity" within the past tense system, and Albright and Hayes use the model to make predictions about which irregular forms of novel verb stems are preferable to human speakers (see the subsequent discussion of wugs).

Table 3: Results on held-out data in English past tense prediction for single- and multi-task scenarios. The MGL achieves perfect accuracy on regular verbs, and 0 accuracy on irregular verbs. † indicates that a neural model's performance was found to be significantly different (p < 0.05) from the MGL.

                          all                     regular                 irregular
                      train  dev   test      train   dev    test     train  dev    test
  Single-Task (MGL)    96.0  96.0  94.5       99.9  100.0  100.0       0.0   0.0    0.0
  Single-Task (Type)   99.8† 97.4  95.1       99.9   99.2   98.9      97.6† 53.3†  28.6†
  Multi-Task (Type)   100.0† 96.9  95.1      100.0   99.5   99.7      99.2† 33.3†  28.6†

Figure 1: Single-task vs. multi-task. Learning curves for the English past tense. The x-axis is the number of epochs (one complete pass over the training data) and the y-axis is the accuracy on the training data (not the metric of optimization). [The figure shows six curves: Multi-Task Regulars, Single-Task Regulars, Multi-Task All Verbs, Single-Task All Verbs, Multi-Task Irregulars, Single-Task Irregulars.]

In contrast, the ED model, despite no built-in knowledge of phonology, successfully learns to conjugate nearly all the verbs in the training data, including irregulars—no reduction in scope is needed. This capacity to account for specific exceptions to the regular rule does not result in overfitting. We note similarly high accuracy on held-out regular data—98.9% to 99.2% at convergence depending on the condition. We report the full accuracy in all conditions in Table 3. The † indicates when a neural model's performance was found to be significantly different (p < 0.05) from the MGL according to a χ² test. The ED model achieves near-perfect accuracy on regular verbs and irregular verbs seen during training, as well as substantial accuracy on irregular verbs in the dev and test sets. This behavior jointly results in better overall performance for the ED model when all verbs are considered. Figure 1 shows learning curves for regular and irregular verb types in different conditions.

An error analysis of held-out data shows that the errors made by this network do not show any of the problems of the R&M architecture. There are no blend errors of the eat → ated variety. Indeed, the only error the network makes on irregulars is overregularization (e.g., throw → throwed). In fact, the overregularization-caused lower accuracy that we observe for irregular verbs in development and test is expected and desirable; it matches the human tendency to treat novel words as regular, lacking knowledge of irregularity (Albright and Hayes, 2003).

Although most held-out irregulars are regularized, as expected, the ED model does, perhaps surprisingly, correctly conjugate a handful of irregular forms it has not seen during training—five in the test set.
However, three of these are prefixed versions of irregulars that exist in the training set (retell → retold, partake → partook, withdraw → withdrew). One (sling → slung) is an analogy to similar training words (fling, cling). The final conjugation, forsake → forsook, is an interesting combination, with the prefix "for," but an unattested base form "sake" that is similar to "take."⁶

⁶ [s] and [t] are both coronal consonants, a fricative and a stop, respectively.

From the training data, the only regular verb with an error is compartmentalized, whose past tense is predicted to be "compartmentalized," with a spurious vowel change that would likely be ironed out with additional training. Among the regular verbs in the development and test sets, the errors also consisted of single vowel changes (the full set of these errors was "thin" → "thun," "try" → "traud," "institutionalize" → "instititionalized," and "drawl" → "drooled").

Overall then, the ED model performs extremely well, a far cry from the ≈67% accuracy of the R&M model. It exceeds any reasonable standard of empirical adequacy, and shows human-like error behavior.

Acquisition Patterns. R&M made several claims that their architecture modeled the detailed acquisition of the English past tense by children. The core claim was that their model exhibited a macro U-shaped learning curve as in §2. Irregulars were initially produced correctly, followed by a period of overregularization preceding a final correct stage. However, P&P point out that R&M only achieve this pattern by manipulating the input distribution fed into their network. They trained only on irregulars for a number of epochs, before flooding the network with regular verb forms. R&M justify this by claiming that young children's vocabulary consists disproportionately of irregular verbs early on, but P&P cite contrary evidence. A survey of child-directed speech shows that the ratio of regular to irregular verbs a child hears is constant while they are learning their language (Slobin, 1971). Furthermore, psycholinguistic results suggest that there is no early skew towards irregular verbs in the vocabulary children understand or use (Brown, 1973).

Although we do not wish to make a strong claim that the ED architecture accurately mirrors children's acquisition, only that it ultimately learns the correct generalizations, we wanted to see if it would display a child-like learning pattern without changing the training inputs fed into the network over time—that is, in all of our experiments, the data sets remained fixed for all epochs, unlike in R&M. We do not clearly see a macro U-shape, but we do observe Plunkett and Marchman's predicted oscillations for irregular learning—the so-called micro U-shaped pattern. As shown in Table 4, individual verbs oscillate between correct production and overregularization before they are fully mastered.

Table 4: Here we evince the oscillating development of single words in our corpus. For each stem, e.g., CLING, we show the past form produced at change points to show the diversity of alternation. Beyond the last epoch displayed, each verb was produced correctly.

      CLING            MISLEAD              CATCH            FLY
   #  output         #  output            #  output        #  output
   5  [klINd]        8  [mIsli:dId]       7  [kætS]        6  [flaId]
  11  [kl2N]        19  [mIslEd]         31  [kætS]       31  [flu:]
  13  [klIN]        21  [mIslEd]         43  [kOt]        40  [flaId]
  14  [klINd]       23  [mIslEd]         44  [kætS]       42  [fleI]
  18  [kl2N]        24  [mIsli:dId]      51  [kætS]       47  [flaId]
  21  [klINd]       29  [mIslEd]         52  [kOt]        56  [flu:]
  28  [kl2N]        30  [mIsli:dId]      66  [kætS]       62  [flaId]
  40  [kl2N]        41  [mIslEd]         73  [kOt]        70  [flu:]

Wug Testing. As a further test of the MGL as a cognitive model, Albright and Hayes created a set of 74 nonce English verb stems with varying levels of similarity to both regular and irregular verbs. For each stem (e.g., rife), they picked one regular output form (rifed) and one irregular output form (rofe). They used these stems and potential past-tense variants to perform a wug test with human participants. For each stem, they had 24 participants freely attempt to produce a past tense form. They then counted the percentage of participants who produced the pre-chosen regular and irregular forms (production probability). The production probabilities for each pre-chosen regular and irregular form could then be correlated with the predicted scores derived from the MGL. In Table 5, we compare the correlations based on their model scores with correlations comparing the human scores to the output probabilities given by an ED model. As the wug data provided with Albright and Hayes (2003) use a different phonetic transcription than the one we used, we trained a separate ED model for this comparison. Model architecture, training verbs, and hyperparameters remained the same. Only the transcription used to represent input and output strings was changed to match Albright and Hayes (2003). Following the original paper, we correlate the probabilities for regular and irregular transformations separately. We apply Spearman's rank correlation, as we don't necessarily expect a linear relationship. We see that the ED model probabilities are slightly more correlated than the MGL's scores.

Table 5: Spearman's ρ of human wug production probabilities with MG scores and ED probability estimates.

  English                          Network   MG
  Regular (rife ∼ rifed, n=58)      0.48     0.35
  Irregular (rife ∼ rofe, n=74)     0.45     0.36

5.2 Experiment 2: Joint Multi-Task Learning

Another objection levied by P&P is R&M's focus on learning a single morphological transduction: stem to past tense. Many phonological patterns in a language, however, are not restricted to a single transduction—they make up a core part of the phonological system and take part in many different processes. For instance, the voicing assimilation patterns found in the past tense also apply to the third person singular: we see the affix -s rendered as [-s] after voiceless consonants and [-z] after voiced consonants and vowels.

P&P argue that the R&M model would not be able to take advantage of these shared generalizations. Assuming a different network would need to be trained for each transduction (e.g., stem to gerund and stem to past participle), it would be impossible to learn that they have any patterns in common. However, as discussed in §3.2, a single ED model can learn multiple types of mapping, simply by tagging each input–output pair in the training set with the transduction it represents.

A network trained in such a way shares the same weights and phoneme embeddings across tasks, and thus has the capacity to generalize patterns across all transductions, naturally capturing the overall phonology of the language. Because different transductions mutually constrain each other (e.g., English in general does not allow sequences of identical vowels), we actually expect faster learning of each individual pattern, which we test in the following experiment.

We trained a model with an architecture identical to that used in Experiment 1, but this time to jointly predict four mappings associated with English verbs (past, gerund, past participle, third-person singular).

Data. For each of the verb types in our base training set from Experiment 1, we added the three remaining mappings. The gerund, past-participle, and third-person singular forms were identified in CELEX according to their labels in Wiktionary (Sylak-Glassman et al., 2015). The network was trained on all individual stem → inflection pairs in the new training set, with each input string modified with additional characters representing the current transduction (Kann and Schütze, 2016): take → took, but take → taken.⁷

⁷ Without input annotation to mark the different mappings the network must learn, it would treat all input/output pairs as belonging to the same mapping, with each inflected form of a single stem as an equally likely output variant associated with that mapping. It is not within the scope of this network architecture to solve problems other than morphological transduction, such as discovering the range of morphological paradigm slots.

Results. Table 3 and Figure 1 show the results. Overall, accuracy is >99% after convergence on train. Although the difference in final performance is never statistically significant compared to single-task learning, the learning curves are much steeper, so this level of performance is achieved much more quickly. This provides evidence for our intuition that cross-task generalization facilitates individual task learning due to shared phonological patterning (i.e., jointly generating the gerund hastens past-tense learning).

6 Summary of Resolved and Outstanding Criticisms

In this paper, we have argued that the Encoder-Decoder architecture obviates many of the criticisms P&P levied against R&M. Most importantly, the empirical performance of neural models is no longer an issue. The past tense transformation is learned nearly perfectly, compared to an approximate accuracy of 67% for R&M. Furthermore, the ED architecture solves the problem in a fully general setting. A single network can easily be trained on multiple mappings at once (and appears to generalize knowledge across them). No representational kludges such as Wickelphones are required—ED networks can map arbitrary length strings to arbitrary length strings. This permits training and evaluating the ED model on realistic data, including the ability to assign an exact probability to any arbitrary output string, rather than "representative" data designed to fit in a fixed-size neural architecture (e.g., fixed input and output templates). Evaluation shows that the ED model does not appear to display any of the degenerate error-types P&P note in the output of R&M (e.g., regular/irregular blends of the ate → ated variety).

Despite this litany of successes, some outstanding criticisms of R&M still remain to be addressed. On the trivial end, P&P correctly point out that the R&M model does not handle homophones: write → wrote, but right → righted. This is because it only takes the phonological make-up of the input string into account, without concern for its lexical identity. This issue affects the ED models we discuss in this paper as well—lexical disambiguation is outside of their intended scope. However, even the rule learner that P&P propose does not have such functionality. Furthermore, if lexical markings were available, we could incorporate them into the model just as with different transductions in the multi-task set-up (i.e., by adding the disambiguating markings to the input).

More importantly, we need to limit any claims regarding treating ED models as proxies for child language learners. P&P criticized such claims from R&M because they manipulated the input data distribution given to their network over time to effect a U-shaped learning curve, despite no evidence that the manipulation reflected children's perception or production capabilities. We avoid this criticism in our experiments, keeping the input distribution constant. We even show that the ED model captures at least one observed pattern of child language development—Plunkett and Marchman's predicted oscillations for irregular learning, the micro U-shaped pattern.

However, we did not observe a macro U-shape, nor was the micro effect consistent across all irregular verbs. More study is needed to determine the ways in which ED architectures do or do not reflect children's behavior. Even if nets do not match the development patterns of any individual, they may still be useful if they ultimately achieve a knowledge state that is comparable to that of an adult or, possibly, the aggregate usage statistics of a population of adults.

Along this vein, P&P note that the R&M model is able to learn highly unnatural patterns that do not exist in any language. For example, it is trivial to map each Wickelphone to its reverse, effectively creating a mirror-image of the input, for example, brag→garb. Although an ED model could likely learn linguistically unattested patterns as well, some patterns may be more difficult to learn than others—for example, they might require increased time-to-convergence. It remains an open question for future research to determine which patterns RNNs prefer, and which changes are needed to account for over- and underfitting. Indeed, any sufficiently complex learning system (including rule-based learners) would have learning biases that require further study.

There are promising directions from which to approach this study. Networks are in a way analogous to animal models (McCloskey, 1991), in that they share interesting properties with human learners, as shown empirically, but are much easier and less costly to train and manipulate across multiple experiments. Initial experiments could focus on default architectures, as we do in this paper, effectively treating them as inductive baselines (Gildea and Jurafsky, 1996) and measuring their performance given limited domain knowledge. Our ED networks, for example, have no built-in knowledge of phonology or morphology. Failures of these baselines would then point the way towards the biases required to learn human language, and models modified to incorporate these biases could be tested.

7 Conclusion

We have shown that the application of the ED architecture to the problem of learning the English past tense obviates many, though not all, of the objections levied by P&P against the first neural network proposed for the task, suggesting that the criticisms do not extend to all neural models, as P&P imply. Compared with a non-neural baseline, the ED model accounts for both regular and irregular past tense formation in observed training data and generalizes to held-out verbs, all without built-in knowledge of phonology. Although not necessarily intended to act as a proxy for a child learner, the ED model also shows one of the development patterns that has been observed in children, namely, a micro U-shaped (oscillating) learning curve for irregular verbs. The accurate and substantially human-like performance of the ED model warrants consideration of its use as a research tool in theoretical linguistics and cognitive science.
The CELEX lexical data base on CD-ROM. We have shown that the application of the ED ar- chitecture to the problem of learning the English Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua past tense obviates many, though not all, of the Bengio. 2014. Neural machine translation by objections levied by P&P against the first neural jointly learning to align and translate. arXiv network proposed for the task, suggesting that the preprint arXiv:1409.0473. criticisms do not extend to all neural models, as P&P imply. Compared with a non-neural base- Jean Berko. 1958. The child’s learning of English line, the ED model accounts for both regular and morphology. Word, 14(2-3):150–177.

662 Dimitri P. Bertsekas. 2015. Convex Optimization Jeffrey L Elman. 1990. Finding structure in time. Algorithms. Athena Scientific. Cognitive science, 14(2):179–211.

Roger Brown. 1973. A First Language: The Early Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, Stages. Harvard University Press, Cambridge, and Chris Dyer. 2016. Morphological inflection MA. generation using character sequence to sequence learning. In Proceedings of the 2016 Conference John A Bullinaria. 1997. Modeling reading, of the North American Chapter of the Associa- spelling, and past tense learning with arti- tion for Computational Linguistics: Human Lan- ficial neural networks. Brain and Language, guage Technologies, pages 634–643, San Diego, 59(2):236–266. California. Association for Computational Joan Bybee. 1995. Regular morphology and the Linguistics. lexicon. Language and Cognitive Processes, Daniel Gildea and Daniel Jurafsky. 1996. Learning 10:425–455. bias and phonological-rule induction. Computa- Joan Bybee. 2001. Phonology and Language Use. tional Linguistics, 22(4):497–530. Cambridge University Press, Cambridge, UK. Joshua Goodman. 1998. Parsing inside-out. Har- Rich Caruana. 1997. Multitask learning. Machine vard Computer Science Group Technical Report Learning, 28:41–75. TR-07-98.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California. Association for Computational Linguistics.

Daniel Gildea and Daniel Jurafsky. 1996. Learning bias and phonological-rule induction. Computational Linguistics, 22(4):497–530.

Joshua Goodman. 1998. Parsing inside-out. Harvard Computer Science Group Technical Report TR-07-98.

Ulrike Hahn and Ramin Charles Nakisa. 2000. German inflection: Single route or dual route? Cognitive Psychology, 41(4):313–360.

Morris Halle and Noam Chomsky. 1968. The Sound Pattern of English. Harper & Row.

Mary Hare and Jeffrey L. Elman. 1995. Learning and morphological change. Cognition, 56(1):61–98.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

James Hoeffner. 1992. Are rules a thing of the past? The acquisition of verbal morphology by an attractor network. In Proceedings of the 14th Annual Conference of the Cognitive Science Society, pages 861–866.

Otto Jespersen. 1942. A Modern English Grammar on Historical Principles, volume 6. George Allen & Unwin Ltd., London, UK.

Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 62–70, Berlin, Germany. Association for Computational Linguistics.

Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Henry Kučera and Winthrop Nelson Francis. 1967. Computational Analysis of Present-day American English. Dartmouth Publishing Group.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics (TACL), 4:521–535.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Brian MacWhinney and Jared Leinbach. 1991. Implementations are not conceptualizations: Revising the verb learning model. Cognition, 40(1):121–157.

Virginia Marchman. 1988. Rules and regularities in the acquisition of the English past tense. Center for Research in Language Newsletter, 2(4):04.

Gary Marcus. 2001. The Algebraic Mind. MIT Press.

William D. Marslen-Wilson and Lorraine K. Tyler. 1997. Dissociating types of mental computation. Nature, 387(6633):592.

Michael McCloskey. 1991. Networks and theories: The place of connectionism in cognitive science. Psychological Science, 2(6):387–395.

Ramin Charles Nakisa and Ulrike Hahn. 1996. Where defaults don't help: The case of the German plural system. In Proceedings of the 18th Annual Conference of the Cognitive Science Society, pages 177–182, San Diego, CA.

Gerald Nelson. 2010. English: An Essential Grammar. Routledge.

Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922–931, Denver, Colorado. Association for Computational Linguistics.

Timothy J. O'Donnell. 2011. Productivity and Reuse in Language. Ph.D. thesis, Harvard University, Cambridge, MA.

Janet Pierrehumbert. 2001. Stochastic phonology. GLOT, 5:1–13.

Steven Pinker. 1999. Words and Rules: The Ingredients of Language. Basic Books.

Steven Pinker and Alan Prince. 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28(1):73–193.

Kim Plunkett and Patrick Juola. 1999. A connectionist model of English past tense and plural morphology. Cognitive Science, 23(4):463–490.

Kim Plunkett and Virginia Marchman. 1991. U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38(1):43–102.

Kim Plunkett and Virginia Marchman. 1993. From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition, 48(1):21–69.

Kim Plunkett and Ramin Charles Nakisa. 1997. A connectionist model of the Arabic plural system. Language and Cognitive Processes, 12(5-6):807–836.

Frank Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.

David E. Rumelhart and James L. McClelland. 1986. On learning the past tenses of English verbs. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2, pages 216–271. MIT Press.

Terrence J. Sejnowski and Charles R. Rosenberg. 1987. Parallel networks that learn to pronounce English text. Complex Systems, 1(1):145–168.

D. I. Slobin. 1971. On the learning of morphological rules: A reply to Palermo and Eberhart. In The Ontogenesis of Grammar: A Theoretical Symposium. Academic Press, New York.

Linnaea Stockall and Alec Marantz. 2006. A single route, full decomposition model of morphological complexity: MEG evidence. The Mental Lexicon, 1(1):85–123.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Montreal, Quebec, Canada, pages 3104–3112.

John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 674–680, Beijing, China. Association for Computational Linguistics.

Niels A. Taatgen and John R. Anderson. 2002. Why do children learn to say broke? A model of learning the past tense without feedback. Cognition, 86(2):123–155.

Grigorios Tsoumakas and Ioannis Katakis. 2006. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13.

Michael T. Ullman, Suzanne Corkin, Marie Coppola, Gregory Hickok, John H. Growdon, Walter J. Koroshetz, and Steven Pinker. 1997. A neural dissociation within language: Evidence that the mental dictionary is part of declarative memory, and that grammatical rules are processed by the procedural system. Journal of Cognitive Neuroscience, 9(2):266–276.

Gert Westermann and Rainer Goebel. 1995. Connectionist rules of language. In Proceedings of the 17th Annual Conference of the Cognitive Science Society, pages 236–241.

Wayne A. Wickelgren. 1969. Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psychological Review, 76(1):1.

Charles Yang. 2016. The Price of Linguistic Productivity: How Children Learn to Break the Rules of Language. MIT Press.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.
