Lexicon Learning for Few-Shot Neural Sequence Modeling

Ekin Akyürek Jacob Andreas Massachusetts Institute of Technology {akyurek,jda}@mit.edu

Abstract

Sequence-to-sequence transduction is the core problem in language processing applications as diverse as semantic parsing, machine translation, and instruction following. The neural network models that provide the dominant solution to these problems are brittle, especially in low-resource settings: they fail to generalize correctly or systematically from small datasets. Past work has shown that many failures of systematic generalization arise from neural models' inability to disentangle lexical phenomena from syntactic ones. To address this, we augment neural decoders with a lexical translation mechanism that generalizes existing copy mechanisms to incorporate learned, decontextualized, token-level translation rules. We describe how to initialize this mechanism using a variety of lexicon learning algorithms, and show that it improves systematic generalization on a diverse set of sequence modeling tasks drawn from cognitive science, formal semantics, and machine translation.¹

[Figure 1: A fragment of the Colors dataset from Lake et al. (2019), a simple sequence-to-sequence translation task. Train: dax → r; lug → b; wif → g; zup → y; lug fep → b b b; dax fep → r r r; lug blicket wif → b g b; wif blicket dax → g r g; lug kiki wif → g b; dax kiki lug → b r. Test: zup fep → y y y; zup blicket lug → ___; dax blicket zup → ___; zup kiki dax → ___; wif kiki zup → ___. The output vocabulary is only the colored circles r, g, b, y. Humans can reliably fill in the missing test labels on the basis of a small training set, but standard neural models cannot. This paper describes a neural sequence model that obtains improved generalization via a learned lexicon of token translation rules.]

1 Introduction

Humans exhibit a set of structured and remarkably consistent inductive biases when learning from language data. For example, in both natural language acquisition and toy language-learning problems like the one depicted in Fig. 1, human learners exhibit a preference for systematic and compositional interpretation rules (Guasti 2017, Chapter 4; Lake et al. 2019). These inductive biases in turn support behaviors like one-shot learning of new concepts (Carey and Bartlett, 1978). But in natural language processing, recent work has found that state-of-the-art neural models, while highly effective at in-domain prediction, fail to generalize in human-like ways when faced with rare phenomena and small datasets (Lake and Baroni, 2018), posing a fundamental challenge for NLP tools in the low-data regime.

Pause for a moment to fill in the missing labels in Fig. 1. While doing so, which training examples did you pay the most attention to? How many times did you find yourself saying means or maps to? Explicit representations of lexical items and their meanings play a key role in diverse models of syntax and semantics (Joshi and Schabes, 1997; Pollard and Sag, 1994; Bresnan et al., 2015). But one of the main findings in existing work on generalization in neural models is that they fail to cleanly separate lexical phenomena from syntactic ones (Lake and Baroni, 2018). Given a dataset like the one depicted in Fig. 1, models conflate (lexical) information about the correspondence between zup and y with the (syntactic) fact that y appears only in a sequence of length 1 at training time. Longer input sequences containing the word zup in new syntactic contexts cause models to output tokens only seen in longer sequences (Section 5).

¹Our code is released at https://github.com/ekinakyurek/lexical


In this paper, we describe a parameterization for sequence decoders that facilitates (but does not enforce) the learning of context-independent word meanings. Specifically, we augment decoder output layers with a lexical translation mechanism which generalizes neural copy mechanisms (e.g. See et al., 2017) and enables models to generate token-level translations purely attentionally. While the lexical translation mechanism is quite general, we focus here on its ability to improve few-shot learning in sequence-to-sequence models. On a suite of challenging tests of few-shot semantic parsing and instruction following, our model exhibits strong generalization, achieving the highest reported results for neural sequence models on datasets as diverse as COGS (Kim and Linzen 2020, with 24155 training examples) and Colors (Lake et al. 2019, with 14). Our approach also generalizes to real-world tests of few-shot learning, improving BLEU scores (Papineni et al., 2002) by 1.2 on a low-resource English–Chinese machine translation task (2.2 on test sentences requiring one-shot word learning).

In an additional set of experiments, we explore effective procedures for initializing the lexical translation mechanism using lexicon learning algorithms derived from information theory, statistical machine translation, and Bayesian cognitive modeling. We find that both mutual-information- and alignment-based lexicon initializers perform well across tasks. Surprisingly, however, we show that both approaches can be matched or outperformed by a rule-based initializer that identifies high-precision word-level token translation pairs. We then explore joint learning of the lexicon and decoder, but find (again surprisingly) that this gives only marginal improvements over a fixed initialization of the lexicon.

In summary, this work:

• Introduces a new, lexicon-based output mechanism for neural encoder–decoder models.

• Investigates and improves upon lexicon learning algorithms for initializing this mechanism.

• Uses it to solve challenging tests of generalization in instruction following, semantic parsing and machine translation.

A great deal of past work has suggested that neural models come equipped with an inductive bias that makes them fundamentally ill-suited to human-like generalization about language data, especially in the low-data regime (e.g. Fodor et al., 1988; Marcus, 2018). Our results suggest that the situation is more complicated: by offloading the easier lexicon learning problem to simpler models, neural sequence models are actually quite effective at modeling (and generalizing about) syntax in synthetic tests of generalization and real translation tasks.

2 Related Work

Systematic generalization in neural sequence models   The desired inductive biases noted above are usually grouped together as "systematicity" but in fact involve a variety of phenomena: one-shot learning of new concepts and composition rules (Lake and Baroni, 2018), zero-shot interpretation of novel words from context cues (Gandhi and Lake, 2020), and interpretation of known concepts in novel syntactic configurations (Keysers et al., 2020; Kim and Linzen, 2020). What they share is a common expectation that learners should associate specific production or transformation rules with specific input tokens (or phrases), and generalize to uses of these tokens in new contexts.

Recent years have seen a tremendous amount of modeling work aimed at encouraging these generalizations in neural models, primarily by equipping them with symbolic scaffolding in the form of program synthesis engines (Nye et al., 2020), stack machines (Grefenstette et al., 2015; Liu et al., 2020), or symbolic data transformation rules (Gordon et al., 2019; Andreas, 2020). A parallel line of work has investigated the role of continuous representations in systematic generalization, proposing improved methods for pretraining (Furrer et al., 2020) and procedures for removing irrelevant contextual information from word representations (Arthur et al., 2016; Russin et al., 2019; Thrush, 2020). The latter two approaches proceed from similar intuition to ours, aiming to disentangle word meanings from syntax in encoder representations via alternative attention mechanisms and adversarial training. Our approach instead focuses on providing an explicit lexicon to the decoder; as discussed below, this appears to be considerably more effective.

Table 1: Example (input, output) pairs from the COGS, English-to-Chinese machine translation, and SCAN datasets, along with some of the lexicon entries that can be learned by the proposed lexicon learning methods and that support the generalizations required in each dataset.

COGS
  A crocodile blessed William .  →  crocodile(x_1) AND bless.agent(x_2, x_1) AND bless.theme(x_2, William)
  William needed to walk .  →  need.agent(x_1, William) AND need.xcomp(x_1, x_3) AND walk.agent(x_3, William)
  Lexicon entries: blessed ↦ bless, needed ↦ need, William ↦ William

English–Chinese
  Many moons orbit around Saturn  →  許多 衛星 繞著 土星 運行 .
  Earth is a planet .  →  地球 是 一個 行星 .
  Lexicon entries: saturn ↦ 土星, earth ↦ 地球, moon ↦ 衛星

SCAN
  walk around left  →  LTURN IWALK LTURN IWALK LTURN IWALK LTURN IWALK
  turn right  →  RTURN
  turn left  →  LTURN
  jump  →  IJUMP
  jump right twice after look left  →  LTURN ILOOK RTURN IJUMP RTURN IJUMP
  Lexicon entries: walk ↦ IWALK, jump ↦ IJUMP, right ↦ RTURN, left ↦ LTURN, look ↦ ILOOK

Copying and lexicon learning   In neural encoder–decoder models, the clearest example of benefits from special treatment of word-level production rules is the copy mechanism. A great deal of past work has found that neural models benefit from learning a structural copy operation that selects output tokens directly from the input sequence, without requiring token identity to be carried through all neural computation in the encoder and the decoder. These mechanisms are described in detail in Section 3, and are widely used in models for language generation, summarization and semantic parsing. Our work generalizes these models to structural operations on the input that replace copying with general context-independent token-level translation.

As will be discussed, the core of our approach is a (non-contextual) lexicon that maps individual input tokens to individual output tokens. Learning lexicons like this is of interest in a number of communities in NLP and language science more broadly. A pair of representative approaches (Brown et al., 1993; Frank et al., 2007) will be discussed in detail below; other work on lexicon learning for semantics and translation includes Liang et al. (2009), Goldwater (2007), and Haghighi et al. (2008), among numerous others.

Finally, and closest to the modeling contribution in this work, several previous papers have proposed alternative generalized copy mechanisms for tasks other than semantic lexicon learning. Concurrent work by Prabhu and Kann (2020) introduces a similar approach for grapheme-to-phoneme translation (with a fixed functional lexicon rather than a trainable parameter matrix), and Nguyen and Chiang (2018) and Gū et al. (2019) describe less expressive mechanisms that cannot smoothly interpolate between lexical translation and ordinary decoding at the token level. Pham et al. (2018) incorporate lexicon entries by rewriting input sequences prior to ordinary sequence-to-sequence translation. Akyürek et al. (2021) describe a model in which a copy mechanism is combined with a retrieval-based generative model; like the present work, that model effectively disentangles syntactic and lexical information by using training examples as implicit representations of lexical correspondences.

We generalize and extend this previous work in a number of ways, providing a new parameterization of attentive token-level translation and a detailed study of initialization and learning. But perhaps the most important contribution of this work is the observation that many of the hard problems studied as "compositional generalization" have direct analogues in more conventional NLP problems, especially machine translation. Research on systematicity and generalization would benefit from closer attention to the ingredients of effective translation at scale.

3 Sequence-to-Sequence Models With Lexical Translation Mechanisms

This paper focuses on sequence-to-sequence language understanding problems like the ones depicted in Table 1, in which the goal is to map from a natural language input x = [x_1, x_2, . . . , x_n] to a structured output y = [y_1, y_2, . . . , y_m]—a logical form, action sequence, or translation. We assume input tokens x_i are drawn from an input vocabulary V_x, and output tokens from a corresponding output vocabulary V_y.

Neural encoder–decoders   Our approach builds on the standard neural encoder–decoder model with attention (Bahdanau et al., 2014). In this model, an encoder represents the input sequence [x_1, . . . , x_n] as a sequence of representations [e_1, . . . , e_n]:
    e = encoder(x)                                                        (1)

Next, a decoder generates a distribution over output sequences y, modeled sequentially:

    log p(y | x) = Σ_{i=1}^{|y|} log p(y_i | y_{<i}, x)                    (2)

These probabilities are computed by first encoding the prefix y_{<i}, then applying attention over representations of the input sequence:²

    h_i = decode(y_{<i})
    α_i^j ∝ exp(h_i^⊤ W_att e_j)                                           (3)
    c_i = Σ_{j=1}^{|x|} α_i^j e_j                                          (4)

The output distribution over V_y, which we denote p_write,i, is calculated by a final projection layer:

    p(y_i = w | x) = p_write,i(w) ∝ exp(W_write [c_i, h_i])                (5)

[Figure 2: An encoder–decoder model with a lexical translation mechanism applied to English-to-Chinese translation (the figure also shows the lexicon entries saturn ↦ 土星, earth ↦ 地球, moon ↦ 衛星). At decoder step t = 4, attention is focused on the English token Saturn. The lexical translation mechanism is activated by p_gate, and the model outputs the token 土星 directly from the lexicon. 地球 means Earth and appears much more frequently than Saturn in the training set.]

²All experiments in this paper use LSTM encoders and decoders, but our approach could easily be integrated with CNNs or transformers (Gehring et al. 2017; Vaswani et al. 2017). We assume access only to a final layer h_i and final attention weights α_i; their implementation does not matter.
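To make the base model concrete, the following is a minimal NumPy sketch of one decoder step under Eqs. (2)–(5). The names `W_att` and `W_write` and the toy dimensions are illustrative assumptions, not the released implementation; in practice `h_i` would come from an LSTM (or other decoder) run over the prefix y_{<i}.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(h_i, E, W_att, W_write):
    """One decoder step: attention over encoder states E (Eqs. 3-4),
    then projection of [c_i, h_i] to the output vocabulary (Eq. 5)."""
    scores = E @ (W_att.T @ h_i)     # alpha_i^j proportional to exp(h_i^T W_att e_j)
    alpha = softmax(scores)          # attention weights over input positions
    c_i = alpha @ E                  # context vector, Eq. (4)
    logits = W_write @ np.concatenate([c_i, h_i])
    return softmax(logits), alpha    # p_write,i (Eq. 5) and alpha_i

# toy sizes: 5 input tokens, hidden size 8, output vocabulary of 10 types
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))          # encoder states e_1 .. e_n
h = rng.normal(size=8)               # decoder state h_i = decode(y_<i)
p_write, alpha = decoder_step(h, E, rng.normal(size=(8, 8)), rng.normal(size=(10, 16)))
```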

Copying   A popular extension of the model described above is the copy mechanism, in which output tokens can be copied from the input sequence in addition to being generated directly by the decoder (Jia and Liang, 2016; See et al., 2017). Using the decoder hidden state h_i from above, the model first computes a gate probability:

    p_gate = σ(w_gate^⊤ h_i)                                               (6)

and then uses this probability to interpolate between the distribution in Eq. (5) and a copy distribution that assigns to each word in the output vocabulary a probability proportional to that word's weight in the attention vector over the input:

    p_copy(y_i = w | x) = Σ_{j=1}^{|x|} 1[x_j = w] · α_i^j                 (7)

    p(y_i = w | x) = p_gate · p_write(y_i = w | x)
                   + (1 − p_gate) · p_copy(y_i = w | x)                    (8)

(note that this implies V_y ⊇ V_x). Context-independent copying is particularly useful in tasks like summarization and machine translation, where rare words (like names) are often reused between the input and output.

Our model: Lexical translation   When the input and output vocabularies are significantly different, copy mechanisms cannot provide further improvements on a sequence-to-sequence model. However, even for disjoint vocabularies as in Fig. 1, there may be strict correspondences between individual words in the input and output vocabularies, e.g. zup ↦ y in Fig. 1. Following this intuition, the lexical translation mechanism we introduce in this work extends the copy mechanism by introducing an additional layer of indirection between the input sequence x and the output prediction y_i, as shown in Fig. 2. Specifically, after selecting an input token x_j ∈ V_x, the decoder can "translate" it to a context-independent output token ∈ V_y prior to the final prediction. We equip the model with an additional lexicon parameter L, a |V_x| × |V_y| matrix in which Σ_w L_vw = 1, and finally define

    p_lex(y_i = w | x) = Σ_{j=1}^{|x|} L_{x_j, w} · α_i^j                  (9)

    p(y_i = w | x) = p_gate · p_write(y_i = w | x)
                   + (1 − p_gate) · p_lex(y_i = w | x)                     (10)

The model is visualized in Fig. 2. Note that when V_x = V_y and L = I is the identity matrix, this is identical to the original copy mechanism. However, this approach can in general be used to produce a larger set of tokens. As shown in Table 1, coherent token-level translation rules can be identified for many tasks; the lexical translation mechanism allows them to be stored explicitly, using the parameters of the base sequence-to-sequence model to record general structural behavior and more complex, context-dependent translation rules.
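The sketch below shows how Eqs. (6), (9) and (10) combine, assuming the quantities computed in the previous sketch (decoder state, attention weights, write distribution). Names such as `lexical_output` and `w_gate` are illustrative, and the random row-stochastic `L` stands in for an initialized lexicon; with `L` equal to the identity and shared vocabularies the same code reduces to the ordinary copy mechanism.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lexical_output(h_i, alpha, x_ids, L, w_gate, p_write):
    """Mix the write distribution with the lexical translation
    distribution (Eqs. 6, 9, 10).

    alpha  : attention weights over the |x| input positions
    x_ids  : input token ids, used to index rows of the lexicon
    L      : |V_x| x |V_y| lexicon whose rows sum to 1
    """
    p_gate = sigmoid(w_gate @ h_i)                   # Eq. (6)
    p_lex = alpha @ L[x_ids]                         # Eq. (9): sum_j alpha_i^j L[x_j, w]
    return p_gate * p_write + (1 - p_gate) * p_lex   # Eq. (10)

# toy setup: |V_x| = 4, |V_y| = 6, input length 3, hidden size 5
rng = np.random.default_rng(0)
L = rng.random((4, 6))
L /= L.sum(axis=1, keepdims=True)                    # row-stochastic lexicon
p_write = rng.random(6); p_write /= p_write.sum()
alpha = rng.random(3); alpha /= alpha.sum()
p = lexical_output(rng.normal(size=5), alpha, np.array([2, 0, 3]), L,
                   rng.normal(size=5), p_write)
assert np.isclose(p.sum(), 1.0)                      # still a valid distribution
```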

4 Initializing the Lexicon

The lexicon parameter L in the preceding section can be viewed as an ordinary fully-connected layer inside the copy mechanism, and trained end-to-end with the rest of the network. As with other neural network parameters, however, our experiments will show that the initialization of the parameter L significantly impacts downstream model performance, and that it specifically benefits from initialization with a set of input–output mappings learned in an offline lexicon learning step. Indeed, while not widely used in neural sequence models (though c.f. Section 2), lexicon-based initialization was a standard feature of many complex non-neural sequence transduction models, including semantic parsers (Kwiatkowski et al., 2011) and phrase-based machine translation systems (Koehn et al., 2003).

But an important distinction between our approach and these others is the fact that we can handle outputs that are not (transparently) compositional. Not every fragment of an input will correspond to a fragment of an output: for example, thrice in SCAN has no corresponding output token and instead describes a structural transformation. Moreover, the lexicon is not the only way to generate: complex mappings can also be learned by p_write without going through the lexicon at all. Thus, while most existing work on lexicon learning aims for complete coverage of all word meanings, the model described in Section 3 benefits from a lexicon with high-precision coverage of rare phenomena that will be hard to learn in a normal neural model. Lexicon learning is widely studied in language processing and cognitive modeling, and several approaches with very different inductive biases exist. To determine how best to initialize L, we begin by reviewing three algorithms in Section 4.1, and identify ways in which each of them fails to satisfy the high-precision criterion above. In Section 4.2, we introduce a simple new lexicon learning rule that addresses this shortcoming.

4.1 Existing Approaches to Lexicon Learning

Statistical alignment   In the natural language processing literature, the IBM translation models (Brown et al., 1993) have served as some of the most popular procedures for learning token-level input–output mappings. While originally developed for machine translation, they have also been used to initialize lexicons for semantic parsing (Kwiatkowski et al., 2011) and grapheme-to-phoneme conversion (Rama et al., 2009). We initialize the lexicon parameter L using Model 2. Model 2 defines a generative process in which target words y_i are generated from source words x_j via latent alignments a_i. Specifically, given a (source, target) pair with n source words and m target words, the probability that target word i is aligned to source word j is:

    p(a_i = j) ∝ exp(−|i/m − j/n|)                                         (11)

Finally, each target word is generated by its aligned source word via a parameter θ: p(y_i = w) = θ(w, x_{a_i}). Alignments a_i and lexical parameters θ can be jointly estimated using the expectation–maximization algorithm (Dempster et al., 1977).

In neural models, rather than initializing the lexical parameters L directly with the corresponding IBM model parameters θ, we run Model 2 in both the forward and reverse directions, then extract counts by intersecting these alignments and applying a softmax with temperature τ:

    L_vw ∝ exp( τ^{-1} Σ_{(x,y)} Σ_{i=1}^{|y|} 1[x_{a_i} = v] 1[y_i = w] )  (12)

For all lexicon methods discussed in this paper, if an input v is not aligned to any output w, we map it to itself if V_x ⊆ V_y. Otherwise we align it uniformly to any unmapped output words (a mutual exclusivity bias; Gandhi and Lake 2020).
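A sketch of the alignment-based initializer in Eq. (12) is shown below. It assumes the forward and reverse alignments have already been produced (in practice by FastAlign, Dyer et al. 2013, run in each direction; see Appendix C); the argument names and dictionary-based vocabularies are illustrative, not the released implementation.

```python
import numpy as np

def lexicon_from_alignments(pairs, fwd, rev, V_x, V_y, tau=0.1):
    """Build L (|V_x| x |V_y|) from intersected forward/reverse alignments,
    following Eq. (12). fwd[k] and rev[k] are sets of (i, j) links meaning
    x[i] aligns to y[j] in the k-th (x, y) training pair; V_x and V_y map
    tokens to row/column indices."""
    C = np.zeros((len(V_x), len(V_y)))
    for k, (x, y) in enumerate(pairs):
        for i, j in fwd[k] & rev[k]:                  # keep links both directions agree on
            C[V_x[x[i]], V_y[y[j]]] += 1
    L = np.exp((C - C.max(axis=1, keepdims=True)) / tau)
    unmapped_out = C.sum(axis=0) == 0                 # outputs no input aligns to
    for v, r in V_x.items():                          # fallback for unaligned inputs
        if C[r].sum() == 0:
            L[r] = 0.0
            if v in V_y:                              # shared vocabulary: map to self
                L[r, V_y[v]] = 1.0
            elif unmapped_out.any():                  # mutual exclusivity bias
                L[r, unmapped_out] = 1.0
            else:
                L[r] = 1.0
    return L / L.sum(axis=1, keepdims=True)
```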

Mutual information   Another, even simpler procedure for building a lexicon is based on identifying pairs that have high pointwise mutual information. We estimate this quantity directly from co-occurrence statistics in the training corpus:

    pmi(v; w) = log( #(v, w) / (#(v) #(w)) ) + log |D_train|               (13)

where #(v) and #(w) are the number of times the words v and w appear in the training corpus, and #(v, w) is the number of training pairs in which v appears in the input and w appears in the output. Finally, we populate the parameter L via a softmax transformation: L_vw ∝ exp((1/τ) pmi(v; w)).
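The PMI initializer of Eq. (13) reduces to a few lines of counting; the sketch below is a hedged illustration (token strings stand in for ids, and the helper name `pmi_lexicon` is ours, not the released code). On the two Colors-style toy pairs at the bottom, the mass for dax concentrates on r and the mass for lug on b, while the function-like word fep stays uniform.

```python
import numpy as np
from collections import Counter

def pmi_lexicon(pairs, V_x, V_y, tau=0.1):
    """Initialize L from pointwise mutual information of input/output
    token co-occurrence across training pairs (Eq. 13)."""
    n_pairs = len(pairs)
    cx, cy, cxy = Counter(), Counter(), Counter()
    for x, y in pairs:
        xs, ys = set(x), set(y)
        cx.update(xs); cy.update(ys)
        cxy.update((v, w) for v in xs for w in ys)
    pmi = np.full((len(V_x), len(V_y)), -np.inf)
    for (v, w), c in cxy.items():
        pmi[V_x[v], V_y[w]] = np.log(c) - np.log(cx[v]) - np.log(cy[w]) + np.log(n_pairs)
    L = np.exp((pmi - pmi.max(axis=1, keepdims=True)) / tau)   # softmax with temperature
    return L / L.sum(axis=1, keepdims=True)

# toy usage on two Colors-style pairs
pairs = [("dax fep".split(), "r r r".split()), ("lug fep".split(), "b b b".split())]
V_x = {v: i for i, v in enumerate(sorted({t for x, _ in pairs for t in x}))}
V_y = {w: i for i, w in enumerate(sorted({t for _, y in pairs for t in y}))}
print(pmi_lexicon(pairs, V_x, V_y).round(2))
```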

Bayesian lexicon learning   Last, we explore the Bayesian cognitive model of lexicon learning described by Frank et al. (2007). Like IBM Model 2, this model is defined by a generative process; here, however, the lexicon itself is part of the generative model. A lexicon ℓ is an (unweighted, many-to-many) map defined by a collection of pairs (x, y), with a description-length prior p(ℓ) ∝ e^{−|ℓ|} (where |ℓ| is the number of (input, output) pairs in the lexicon). As in Model 2, given a meaning y and a natural-language description x, each x_i is generated independently. We define the probability of a word being used non-referentially as p_NR(x_i | ℓ) ∝ 1 if x_i ∉ ℓ and κ otherwise. The probability of its being used referentially is p_R(x_j | y_i, ℓ) ∝ 1[(x_j, y_i) ∈ ℓ]. Finally,

    p(x_j | y, ℓ) = (1 − γ) p_NR(x_j | ℓ) + γ Σ_{i=1}^{|y|} p_R(x_j | y_i, ℓ)   (14)

To produce a final lexical translation matrix L for use in our experiments, we set L_vw ∝ exp((1/τ) p((v, w) ∈ ℓ)): each entry in L is the posterior probability that the given entry appears in a lexicon under the generative model above. Parameters are estimated using the Metropolis–Hastings algorithm, with details described in Appendix C.

[Figure 3: Learned lexicons for the around right split in SCAN (τ = 0.1), for the IBM Model-2, PMI, Bayesian, and Simple initializers (input words and, thrice, twice, opposite, after, around, right, walk, run, left, look, jump against outputs IRUN, IRIGHT, IWALK, ILEFT, ILOOK, IJUMP). The rule-based lexicon learning procedure (Simple) produces correct alignments, while the other methods fail due to the correlation between around and left in the training data.]

4.2 A Simpler Lexicon Learning Rule

Example lexicons learned by the three models above are depicted in Fig. 3 for the SCAN task shown in Table 1. Lexicons learned for the remaining tasks can be found in Appendix B. It can be seen that all three models produce errors: the PMI and Bayesian lexicons contain too many entries (in both cases, numbers are associated with the turn right action and prepositions are associated with the turn left action). For the IBM model, one of the alignments is confident but wrong, because the around preposition is associated with the turn left action. In order to understand these errors, and to better characterize the difference between the demands of lexical translation model initializers and past lexicon learning schemes, we explore a simple logical procedure for extracting lexicon entries that, surprisingly, matches or outperforms all three baseline methods in most of our experiments.

What makes an effective, precise lexicon learning rule? As a first step, consider a maximally restrictive criterion (which we will call C1) that extracts only pairs (v, w) for which the presence of v in the input is a necessary and sufficient condition for the presence of w in the output:

    nec.(v, w) = ∀(x, y). (w ∈ y) → (v ∈ x)                                (15)
    suff.(v, w) = ∀(x, y). (v ∈ x) → (w ∈ y)                               (16)
    C1(v, w) = nec.(v, w) ∧ suff.(v, w)                                    (17)

C1 is too restrictive: in many language understanding problems, the mapping from surface forms to meanings is many-to-one (in Table 1, both blessed and bless are associated with the logical form bless). Such mappings cannot be learned by the algorithm described above. We can relax the necessity condition slightly, requiring either that v is a necessary condition for w, or that it is part of a group that collectively explains all occurrences of w:

    no-winner(w) = ¬∃v′. C1(v′, w)                                         (18)
    C2(v, w) = suff.(v, w) ∧ (nec.(v, w) ∨ no-winner(w))                   (19)

As a final refinement, we note that C2 is likely to capture function words that are present in most sentences, and we exclude these by restricting the lexicon to words below a certain frequency threshold:

    C3(v, w) = C2(v, w) ∧ |{v′ : suff.(v′, w)}| ≤ ε                        (20)

The lexicon matrix L is computed by taking the word co-occurrence matrix, zeroing out all entries where C3 does not hold, then computing a softmax: L_vw ∝ C3(v, w) exp((1/τ) #(v, w)). Surprisingly, as shown in Fig. 3 and evaluated below, this rule (which we label Simple) produces the most effective lexicon initializer for three of the four tasks we study. The simplicity (and extreme conservativity) of this rule highlights the different demands on L made by our model and by more conventional (e.g. machine translation) approaches: the lexical translation mechanism benefits from a small number of precise mappings rather than a large number of noisy ones.
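The following is a compact reference sketch of the C3 extraction rule as stated in Eqs. (15)–(20); the released code may differ, and the quadratic scan over vocabularies is written for clarity rather than efficiency. A row-wise temperature softmax over the returned masked counts then yields L as described above.

```python
from collections import Counter

def simple_lexicon(pairs, epsilon=3):
    """High-precision entries under the C3 criterion: v must be sufficient
    for w, and either necessary for w or part of the sufficient group when
    no single token is both; w is skipped if more than `epsilon` tokens are
    sufficient for it (Eq. 20). Returns C3-masked co-occurrence counts."""
    V_x = {v for x, _ in pairs for v in x}
    V_y = {w for _, y in pairs for w in y}
    cooc = Counter((v, w) for x, y in pairs for v in set(x) for w in set(y))
    entries = {}
    for w in V_y:
        # sufficiency (Eq. 16): every input containing v has w in its output
        suff = {v for v in V_x if all(w in y for x, y in pairs if v in x)}
        if len(suff) > epsilon:            # frequency cut-off (Eq. 20)
            continue
        # necessity (Eq. 15): w never occurs without v
        nec = {v for v in suff if all(v in x for x, y in pairs if w in y)}
        for v in (nec if nec else suff):   # Eqs. (18)-(19)
            entries[(v, w)] = cooc[(v, w)]
    return entries

# toy check on two Colors-style pairs: keeps (dax, r) and (lug, b), drops fep
print(simple_lexicon([("dax fep".split(), "r r r".split()),
                      ("lug fep".split(), "b b b".split())]))
```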
For the IBM model, one of (nec.(v, w) ∨ no-win.(w)) (19) the alignments is confident but wrong, because the As a final refinement, we note that C2 is likely around preposition is associated with turn left to capture function words that are present in most action. In order to understand these errors, and to sentences, and exclude these by restricting the lexi- better characterize the difference between the de- con to words below a certain frequency threshold: mands of lexical translation model initializers and past lexicon learning schemes, we explore a sim- 0 0 ple logical procedure for extracting lexicon entries C3 = C2 ∧ {v : suff.(v , w)} ≤  (20) 4939 two test examples most frequently predicted incor- The lexicon matrix L is computed by taking the rectly require generalization to longer sequences word co-occurrence matrix, zeroing out all entries than seen during training. More details (includ- where C3 does not hold, then computing a soft- ing example-level model and human accuracies) max: Lvw ∝ C3(v, w) exp((1/τ) #(v, w)). Sur- are presented in the appendix AppendixA). These prisingly, as shown in Fig.3 and and evaluated results show that LSTMs are quite effective at learn- below, this rule (which we label Simple) produces ing systematic sequence transformation rules from the most effective lexicon initializer for three of ≈ 3 examples per function word when equipped the four tasks we study. The simplicity (and ex- with lexical translations. Generalization to longer treme conservativity) of this rule highlight the dif- sequences remains as an important challenge for ferent demands on L made by our model and more future work. conventional (e.g. machine translation) approaches: 5.2 SCAN the lexical translation mechanism benefits from a small number of precise mappings rather than a Task SCAN (Lake and Baroni, 2018) is a larger large number of noisy ones. collection of tests of systematic generalization that pair synthetic English commands (e.g. turn left 5 Experiments twice and jump) to action sequences (e.g. LTURN LTURN IJUMP) as shown in Table1. Following We investigate the effectiveness of the lexical trans- previous work, we focus on the jump and around lation mechanism on sequence-to-sequence mod- right splits, each of which features roughly 15,000 els for four tasks, three focused on compositional training examples, and evaluate models’ ability to generalization and one on low-resource machine perform 1-shot learning of new primitives (jump) translation. In all experiments, we use an LSTM and zero-shot interpretation of composition rules encoder–decoder with attention as the base predic- (around right). While these tasks are now solved by tor. We compare our approach (and variants) with a number of specialized approaches, they remain a two other baselines: GECA (Andreas 2020; a data challenge for conventional neural sequence models, augmentation scheme) and SynAtt (Russin et al. and an important benchmark for new models. 2019; an alternative seq2seq model parameteriza- Results jump tion). Hyper-parameter selection details are given In the split, all initializers improve in the AppendixC. Unless otherwise stated, we use significantly over the base LSTM when combined τ = 0 and do not fine-tune L after initialization. with lexical translation. Most methods achieve 99% accuracy at least once across seeds. 
5 Experiments

We investigate the effectiveness of the lexical translation mechanism on sequence-to-sequence models for four tasks, three focused on compositional generalization and one on low-resource machine translation. In all experiments, we use an LSTM encoder–decoder with attention as the base predictor. We compare our approach (and variants) with two other baselines: GECA (Andreas 2020; a data augmentation scheme) and SynAtt (Russin et al. 2019; an alternative seq2seq model parameterization). Hyper-parameter selection details are given in Appendix C. Unless otherwise stated, we use τ = 0 and do not fine-tune L after initialization.

5.1 Colors

Task   The Colors sequence translation task (see Appendix A for the full dataset) was developed to measure human inductive biases in sequence-to-sequence learning problems. It poses an extreme test of low-resource learning for neural sequence models: it has only 14 training examples that combine four named colors and three composition operations that perform concatenation, repetition and wrapping. Liu et al. (2020) solve this dataset with a symbolic stack machine; to the best of our knowledge, our approach is the first "pure" neural sequence model to obtain non-trivial accuracy.

Results   Both the Simple and IBMM2 initializers produce a lexicon that maps only color words to colors. Both, combined with the lexical translation mechanism, obtain an average test accuracy of 79% across 16 runs, nearly matching the human accuracy of 81% reported by Lake et al. (2019). The two test examples most frequently predicted incorrectly require generalization to longer sequences than seen during training. More details (including example-level model and human accuracies) are presented in Appendix A. These results show that LSTMs are quite effective at learning systematic sequence transformation rules from ≈3 examples per function word when equipped with lexical translations. Generalization to longer sequences remains an important challenge for future work.

5.2 SCAN

Task   SCAN (Lake and Baroni, 2018) is a larger collection of tests of systematic generalization that pair synthetic English commands (e.g. turn left twice and jump) with action sequences (e.g. LTURN LTURN IJUMP), as shown in Table 1. Following previous work, we focus on the jump and around right splits, each of which features roughly 15,000 training examples, and evaluate models' ability to perform 1-shot learning of new primitives (jump) and zero-shot interpretation of composition rules (around right). While these tasks are now solved by a number of specialized approaches, they remain a challenge for conventional neural sequence models, and an important benchmark for new models.

Results   In the jump split, all initializers improve significantly over the base LSTM when combined with lexical translation. Most methods achieve 99% accuracy at least once across seeds. These results are slightly behind GECA (in which all runs succeed) but ahead of SynAtt.³ Again, they show that lexicon learning is effective for systematic generalization, and that simple initializers (PMI and Simple) outperform complex ones.

5.3 COGS

Task   COGS (Compositional Generalization for Semantic Parsing; Kim and Linzen 2020) is an automatically generated English-language semantic parsing dataset that tests systematic generalization in learning language-to-logical-form mappings. It includes 24155 training examples. Compared to the Colors and SCAN datasets, it has a larger vocabulary (876 tokens) and a finer-grained inventory of syntactic generalization tests (Table 3).

Results   Notably, because some tokens appear in both inputs and logical forms in the COGS task, even a standard sequence-to-sequence model with copying significantly outperforms the baseline models in the original work of Kim and Linzen (2020), solving most tests of generalization over syntactic roles for nouns (but performing worse at generalizations over verbs, including passive and dative alternations). As above, the lexical translation mechanism (with any of the proposed initializers) provides further improvements, mostly for verbs that baselines model incorrectly (Table 3).

³SynAtt results here are lower than reported in the original paper, which discarded runs with a test accuracy of 0%.

                          Colors        jump (SCAN)    around right (SCAN)   COGS
LSTM                      0.00 ±0.00    0.00 ±0.00     0.09 ±0.05            0.51 ±0.05
GECA                      0.41 ±0.11    1.00 ±0.00     0.98 ±0.02            0.48 ±0.05
SynAtt                    0.57 ±0.26    0.57 ±0.38     0.28 ±0.26            0.15 ±0.14
LSTM + copy               -             -              -                     0.66 ±0.03
LSTM + Lex.: Simple       0.79 ±0.02    0.92 ±0.17     0.95 ±0.01            0.82 ±0.01
LSTM + Lex.: PMI          0.41 ±0.19    0.95 ±0.08     0.02 ±0.04            0.82 ±0.00
LSTM + Lex.: IBMM2        0.79 ±0.02    0.79 ±0.27     0.00 ±0.00            0.82 ±0.00
LSTM + Lex.: Bayesian     0.51 ±0.21    0.82 ±0.21     0.02 ±0.04            0.70 ±0.04

Table 2: Exact match accuracy results for baselines and lexicon learning models on 4 different compositional generalization splits. Errors are standard deviation among 16 different seeds for Colors and 10 seeds for COGS and SCAN. Unbolded numbers in the original table are significantly (p < 0.01) worse than the best result in the column. Models with lexical translation mechanisms and Simple initialization consistently improve over ordinary LSTMs.

[Table 3: COGS accuracy breakdown according to syntactic generalization types for word usages, for the LSTM, + copy, and + simple models. Categories: primitive → {subj, obj, inf}; active → passive; passive → active; obj PP → subj PP; recursion; unacc → transitive; obj → subj proper; subj → obj common; PP dative ↔ obj dative; all. The label a → b indicates that syntactic context a appears in the training set and b in the test set. (Per-category values are shown graphically in the original figure and are not reproduced here.)]

ENG-CHN                   full           1-shot
LSTM                      24.18 ±0.37    17.47 ±0.64
LSTM + GECA               23.90 ±0.55    17.94 ±0.43
LSTM + Lex.: PMI          24.36 ±0.09    18.46 ±0.13
LSTM + Lex.: Simple       24.35 ±0.09    18.46 ±0.19
LSTM + Lex.: IBMM2        25.49 ±0.42    19.62 ±0.64

Table 4: BLEU scores for English-Chinese translation. full shows results on the full test set, and 1-shot shows results for test examples in which the English text contains a token seen only once during training.

5.4 Machine Translation

Task   To demonstrate that this approach is useful beyond synthetic tests of generalization, we evaluate it on a low-resource English–Chinese translation task (the Tatoeba⁴ dataset processed by Kelly 2021). For our experiments, we split the data randomly into 19222 training and 2402 test pairs.

Results   Results are shown in Table 4. Models with a lexical translation mechanism obtain modest improvements (up to 1.5 BLEU) over the baseline. Notably, if we restrict evaluation to test sentences featuring English words that appeared only once in the training set, BLEU improves by more than 2 points, demonstrating that this approach is particularly effective at one-shot word learning (or fast mapping; Carey and Bartlett 1978). Fig. 2 shows an example from this dataset, in which the model learns to reliably translate Saturn from a single training example. GECA, which makes specific generative assumptions about data distributions, does not generalize to this more realistic low-resource MT problem. However, the lexical translation mechanism remains effective in natural tasks with large vocabularies and complex grammars.

⁴https://tatoeba.org/

5.5 Fine-Tuning the Lexicon

In all the experiments above, the lexicon was discretized (τ = 0) and frozen prior to training. In this final section, we revisit that decision, evaluating whether the parameter L can be learned from scratch, or effectively fine-tuned along with decoder parameters. Experiments in this section focus on the COGS dataset.

Offline initialization of the lexicon is crucial.   Rather than initializing L using any of the algorithms described in Section 4, we initialized L to a uniform distribution for each word and optimized it during training.

This improves over the base LSTM (Uniform in Table 5), but performs significantly worse than pre-learned lexicons.

Benefits from fine-tuning are minimal.   We first increased the temperature parameter τ to 0.1 (providing a "soft" lexicon); this gave a 1% improvement on COGS (Table 5, Soft). Finally, we updated this soft initialization via gradient descent; this provided no further improvement (Table 5, Learned). One important feature of COGS (and other tests of compositional generalization) is that perfect training accuracy is easily achieved; thus, there is little pressure on models to learn generalizable lexicons. This pressure must instead come from inductive bias in the initializer.

                 COGS
LSTM             0.51 ±0.06
Lex.: Uniform    0.56 ±0.07
Lex.: Simple     0.82 ±0.01
Soft             0.83 ±0.00
Learned          0.83 ±0.01

Table 5: Ablation experiments on the COGS dataset. Uniform shows results for a lexicon initialized to a uniform distribution. Soft sets τ = 0.1 with the Simple lexicon learning rule (rather than 0 as in previous experiments). Learned shows results for a soft lexicon fine-tuned during training. Soft lexicons, with or without learning, improve significantly (p < 0.01) but only very slightly over the fixed initialization.

6 Conclusion

We have described a lexical translation mechanism for representing token-level translation rules in neural sequence models. We have additionally described a simple initialization scheme for this lexicon that outperforms a variety of existing algorithms. Together, lexical translation and proper initialization enable neural sequence models to solve a diverse set of tasks—including semantic parsing and machine translation—that require 1-shot word learning and 0-shot compositional generalization. Future work might focus on generalization to longer sequences, learning of atomic but non-concatenative translation rules, and online lexicon learning in situated contexts.

Acknowledgements

This work was supported by the MachineLearningApplications initiative at MIT CSAIL and the MIT–IBM Watson AI lab. Computing resources were provided by a gift from NVIDIA through the NVAIL program and by the Lincoln Laboratory Supercloud.

References

Ekin Akyürek, Afra Feyza Akyürek, and Jacob Andreas. 2021. Learning to recombine and resample data for compositional generalization. In International Conference on Learning Representations.

Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566.

Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1557–1567.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Joan Bresnan, Ash Asudeh, Ida Toivonen, and Stephen Wechsler. 2015. Lexical-functional syntax. John Wiley & Sons.

Peter F Brown, Stephen A Della Pietra, Vincent J Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Susan Carey and Elsa Bartlett. 1978. Acquiring a single new word. Papers and Reports on Child Language Development.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.

Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.

Jerry A Fodor, Zenon W Pylyshyn, et al. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71.

Michael C. Frank, Noah D. Goodman, and J. Tenenbaum. 2007. A Bayesian framework for cross-situational word-learning. In NeurIPS.

Daniel Furrer, Marc van Zee, Nathan Scales, and Nathanael Schärli. 2020. Compositional generalization in semantic parsing: Pre-training vs. specialized architectures. arXiv preprint arXiv:2007.08970.

Kanishk Gandhi and Brenden M Lake. 2020. Mutual exclusivity as a challenge for deep neural networks. Advances in Neural Information Processing Systems, 33.

Jonas Gehring, M. Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In ICML.

Sharon J Goldwater. 2007. Nonparametric Bayesian Models of Lexicon Acquisition. Citeseer.

Jonathan Gordon, David Lopez-Paz, Marco Baroni, and Diane Bouchacourt. 2019. Permutation equivariant models for compositional generalization in language. In International Conference on Learning Representations.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to transduce with unbounded memory. In NIPS.

Jetic Gū, Hassan S Shavarani, and Anoop Sarkar. 2019. Pointer-based fusion of bilingual lexicons into neural machine translation. arXiv preprint arXiv:1909.07907.

Maria Teresa Guasti. 2017. Language acquisition: The growth of grammar. MIT Press.

A. Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and D. Klein. 2008. Learning bilingual lexicons from monolingual corpora. In ACL.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22.

Aravind K Joshi and Yves Schabes. 1997. Tree-adjoining grammars. In Handbook of Formal Languages, pages 69–123. Springer.

Charles Kelly. 2021. [link].

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic data. In ICLR.

Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087–9105.

Philipp Koehn, F. Och, and D. Marcu. 2003. Statistical phrase-based translation. In HLT-NAACL.

T. Kwiatkowski, Luke Zettlemoyer, S. Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In EMNLP.

B. Lake, Tal Linzen, and M. Baroni. 2019. Human few-shot learning of compositional instructions. In CogSci.

Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, pages 2873–2882. PMLR.

Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 91–99.

Qian Liu, Shengnan An, Jian-Guang Lou, Bei Chen, Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, and Dongmei Zhang. 2020. Compositional generalization by learning analytical expressions. Advances in Neural Information Processing Systems, 33.

Gary Marcus. 2018. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631.

Toan Q. Nguyen and David Chiang. 2018. Improving lexical choice in neural machine translation. ArXiv, abs/1710.01329.

Maxwell Nye, Armando Solar-Lezama, Josh Tenenbaum, and Brenden M Lake. 2020. Learning compositional rules via neural program synthesis. In Advances in Neural Information Processing Systems, volume 33, pages 10832–10842. Curran Associates, Inc.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Ngoc-Quan Pham, Jan Niehues, and Alex Waibel. 2018. Towards one-shot learning for rare-word translation with external experts. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 100–109.

Carl Pollard and Ivan A Sag. 1994. Head-driven phrase structure grammar. University of Chicago Press.

Nikhil Prabhu and K. Kann. 2020. Making a point: Pointer-generator transformers for disjoint vocabularies. In AACL.

Taraka Rama, Anil Kumar Singh, and Sudheer Kolachina. 2009. Modeling letter-to-phoneme conversion as a phrase based statistical machine translation problem with minimum error rate training. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, pages 90–95.

Jake Russin, Jason Jo, Randall C O'Reilly, and Yoshua Bengio. 2019. Compositional generalization in a deep seq2seq model by separating syntax and semantics. arXiv preprint arXiv:1904.09708.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083.

Tristan Thrush. 2020. Compositional neural machine translation by removing the lexicon from syntax. arXiv preprint arXiv:2002.08899.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

A Colors Dataset & Detailed Results


Here we present the full dataset in Table 6 from Lake et al. (2019), and detailed comparisons of each model with human results in Table 7.

Table 6: Full Colors dataset with train and test examples (Lake et al., 2019).

TRAIN
dax → r
lug → b
wif → g
zup → y
lug fep → b b b
dax fep → r r r
lug blicket wif → b g b
wif blicket dax → g r g
lug kiki wif → g b
dax kiki lug → b r
lug fep kiki wif → g b b b
wif kiki dax blicket lug → r b r g
lug kiki wif fep → g g g b
wif blicket dax kiki lug → b g r g

TEST
zup fep → y y y
zup kiki dax → r y
wif kiki zup → y g
zup blicket lug → y b y
dax blicket zup → r y r
wif kiki zup fep → y y y g
zup fep kiki lug → b y y y
lug kiki wif blicket zup → g y g b
zup blicket wif kiki dax fep → r r r y g y
zup blicket zup kiki zup fep → y y y y y y

[Figure 5: Learned lexicons (IBM Model-2, PMI, Bayesian, Simple) from the Colors dataset with τ = 0.1; input words fep, blicket, kiki, dax, wif, zup, lug against outputs RED, GREEN, BLUE, YELLOW.]


Test Example                     Simple/IBM-M2   Bayesian      GECA         SyntAtt      Human
zup fep                          1.0 ±0.00       0.88 ±0.33    1.0 ±0.00    0.7 ±0.5     0.88
zup kiki dax                     1.0 ±0.00       0.88 ±0.33    1.0 ±0.00    0.7 ±0.5     0.86
wif kiki zup                     1.0 ±0.00       0.8 ±0.4      1.0 ±0.00    0.8 ±0.4     0.86
dax blicket zup                  1.0 ±0.00       0.88 ±0.33    1.0 ±0.00    0.8 ±0.4     0.88
zup blicket lug                  0.94 ±0.24      0.8 ±0.4      1.0 ±0.00    0.8 ±0.4     0.79
wif kiki zup fep                 1.0 ±0.00       0.3 ±0.5      0.0 ±0.00    0.4 ±0.05    0.85
zup fep kiki lug                 1.0 ±0.00       0.2 ±0.4      0.0 ±0.00    0.8 ±0.4     0.85
lug kiki wif blicket zup         1.0 ±0.00       0.4 ±0.5      0.0 ±0.00    0.4 ±0.5     0.65
zup blicket wif kiki dax fep     0.0 ±0.00       0.0 ±0.00     0.0 ±0.00    0.0 ±0.00    0.70
zup blicket zup kiki zup fep     0.0 ±0.00       0.0 ±0.00     0.0 ±0.00    0.0 ±0.00    0.75

Table 7: Colors dataset exact match breakdown for each individual test example. Human results are taken from Lake et al. (2019), Fig. 2.

B Learned Lexicons

Here we provide lexicons for each model and dataset (see Fig. 2 and Fig. 3 for the remaining datasets). For COGS, we show a representative subset of words.

[Figure 4: Learned lexicons (IBM Model-2, PMI, Bayesian, Simple) from the SCAN dataset, jump split, with τ = 0.1; input words and, thrice, twice, opposite, after, around, right, walk, run, left, look, jump against outputs IRUN, IRIGHT, IWALK, ILEFT, ILOOK, IJUMP.]

[Figure 6: Learned lexicons (IBM Model-2, PMI, Bayesian, Simple) from the COGS dataset with τ = 0.1, for the rare verbs noticed, baked, shattered, blessed, hoped against notice, bake, shatter, bless, hope. We only show important rare words responsible for our model's improvements over the baseline.]

C Hyper-parameter Settings

C.1 Neural Seq2Seq

Most of the datasets we evaluate do not come with an out-of-distribution validation set, making principled hyperparameter tuning difficult. We were unable to reproduce the results of Kim and Linzen (2020) with the hyperparameter settings reported there using our base LSTM setup, and so adjusted them until training was stabilized. Like the original paper, we used a unidirectional 2-layer LSTM with 512 hidden units, an embedding size of 512, gradient clipping of 5.0, a Noam learning rate scheduler with 4000 warm-up steps, and a batch size of 512. Unlike the original paper, we found it necessary to reduce the learning rate to 1.0, increase the dropout value to 0.4, and reduce the maximum step size timeout to 8000.

We use the same parameters for all COGS, SCAN, and machine translation experiments. For SCAN and Colors, we applied additional dropout (p = 0.5) in the last layer of p_write. Since Colors has 14 training examples, we need a different batch size, set to 1/3 of the training set size (= 5). Qualitative evaluation of gradients at training time revealed that stricter gradient clipping was also needed (= 0.5). Similarly, we decreased warm-up steps to 32 epochs. All other hyper-parameters remain the same.

C.2 Lexicon Learning

Simple Lexicon   The only parameter in the simple lexicon is ε, set to 3 in all experiments.

Bayesian   The original work of Frank et al. (2007) did not report hyperparameter settings or sampler details. We found α = 2, γ = 0.95 and κ = 0.1 to be effective. The M–H proposal distribution inserts or removes a word from the lexicon with 50% probability. For deletions, an entry is removed uniformly at random. For insertions, an entry is added with probability proportional to the empirical joint co-occurrence probability of the input and output tokens. Results were averaged across 5 runs, with a burn-in period of 1000 and a sample drawn every 10 steps.

IBM Model 2   We used the FastAlign implementation (Dyer et al., 2013) and experimented with a variety of hyperparameters in the alignment algorithm itself (favoring diagonal alignment, optimizing tension, using Dirichlet priors) and diagonalization heuristics (grow-diag, grow-diag-final, grow-diag-final-and, union). We found that optimizing tension and using the "intersect" diagonalization heuristic works best overall.

D Baseline Results

D.1 GECA

We report the best results for the SCAN dataset from the reproduced results in Akyürek et al. (2021). For the other datasets (COGS and Colors), we performed a hyperparameter search over augmentation ratios of 0.1 and 0.3 and hidden sizes of {128, 256, 512}. We report the best results for each dataset.

D.2 SyntAtt

We used the public GitHub repository of SyntAtt (https://github.com/jlrussin/syntactic_attention) and reproduced the reported results for the SCAN dataset. For the other datasets, we also explored the "syntax action" option, in which both contextualized (syntax) and un-contextualized (semantics) embeddings are used in the final layer (Russin et al., 2019). We additionally performed a search over hidden layer sizes {128, 256, 512} and depths {1, 2}. We report the best results for each dataset.

E Datasets & Evaluation & Tokenization

E.1 Datasets and Sizes

              around_right   jump     COGS     Colors   ENG-CHN
train         15225          14670    24155    14       19222
validation    -              -        3000     -        2402
test          4476           7706     21000    10       2402

E.2 Evaluation

We report exact match accuracies and BLEU scores. In both evaluations we include punctuation. For BLEU we use the NLTK library's default implementation (https://www.nltk.org/).

E.3 Tokenization

We use the Moses library (https://pypi.org/project/mosestokenizer/) for English tokenization, and the jieba library (https://github.com/fxsjy/jieba) for Chinese tokenization. For the other datasets, we use default whitespace tokenization.

F Computing Infrastructure

Experiments were performed on a DGX-2 with NVIDIA 32GB VOLTA-V100 GPUs. Experiments take at most 2.5 hours on a single GPU.
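As a hedged illustration of the BLEU evaluation described in Appendix E.2, the snippet below uses NLTK's corpus-level scorer; the paper states only that NLTK's default implementation was used, so the exact call (corpus_bleu vs. sentence_bleu) and the toy tokenized sentences are assumptions.

```python
from nltk.translate.bleu_score import corpus_bleu

# each hypothesis is a token list; each entry of `references` is a list of
# one or more reference token lists for the corresponding hypothesis
hypotheses = ["地球 是 一個 行星 .".split()]
references = [["地球 是 一個 行星 .".split()]]
print(corpus_bleu(references, hypotheses))  # 1.0 for an exact match
```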
