Lexicon Learning for Few-Shot Neural Sequence Modeling

Lexicon Learning for Few-Shot Neural Sequence Modeling Ekin Akyürek Jacob Andreas Massachusetts Institute of Technology {akyurek,jda}@mit.edu Abstract Train Test dax r lug b wif g zup y zup fep y y y Sequence-to-sequence transduction is the core lexicon zup blicket lug ___? problem in language processing applications lug fep b b b dax blicket zup ___? as diverse as semantic parsing, machine trans- dax fep r r r zup kiki dax ___? lation, and instruction following. The neural lug blicket wif b g b wif kiki zup ___? network models that provide the dominant so- wif blicket dax g r g lution to these problems are brittle, especially lug kiki wif g b in low-resource settings: they fail to generalize dax kiki lug b r correctly or systematically from small datasets. Past work has shown that many failures of sys- Figure 1: A fragment of the Colors dataset from Lake tematic generalization arise from neural mod- et al.(2019), a simple sequence-to-sequence translation els’ inability to disentangle lexical phenomena task. The output vocabulary is only the colored circles from syntactic ones. To address this, we aug- r , g , b , y . Humans can reliably fill in the miss- ment neural decoders with a lexical transla- ing test labels on the basis of a small training set, but tion mechanism that generalizes existing copy standard neural models cannot. This paper describes a mechanisms to incorporate learned, decontex- neural sequence model that obtains improved general- tualized, token-level translation rules. We de- ization via a learned lexicon of token translation rules. scribe how to initialize this mechanism using a variety of lexicon learning algorithms, and show that it improves systematic generaliza- and small datasets (Lake and Baroni, 2018), pos- tion on a diverse set of sequence modeling ing a fundamental challenge for NLP tools in the tasks drawn from cognitive science, formal semantics, and machine translation.1 low-data regime. Pause for a moment to fill in the missing labels 1 Introduction in Fig.1. While doing so, which training examples did you pay the most attention to? How many Humans exhibit a set of structured and remarkably times did you find yourself saying means or maps consistent inductive biases when learning from lan- to? Explicit representations of lexical items and guage data. For example, in both natural language their meanings play a key role diverse models of acquisition and toy language-learning problems syntax and semantics (Joshi and Schabes, 1997; like the one depicted in Fig.1, human learners Pollard and Sag, 1994; Bresnan et al., 2015). But exhibit a preference for systematic and composi- one of the main findings in existing work on gener- tional interpretation rules (Guasti 2017, Chapter 4; alization in neural models is that they fail to cleanly Lake et al. 2019). These inductive biases in turn separate lexical phenomena from syntactic ones support behaviors like one-shot learning of new (Lake and Baroni, 2018). Given a dataset like the concepts (Carey and Bartlett, 1978). But in natural one depicted in Fig.1, models conflate (lexical) language processing, recent work has found that information about the correspondence between zup state-of-the-art neural models, while highly effec- and y with the (syntactic) fact that y appears tive at in-domain prediction, fail to generalize in only in a sequence of length 1 at training time. human-like ways when faced with rare phenomena Longer input sequences containing the word zup 1Our code is released under https://github.com/ in new syntactic contexts cause models to output ekinakyurek/lexical tokens only seen in longer sequences (Section5). 4934 Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4934–4946 August 1–6, 2021. ©2021 Association for Computational Linguistics In this paper, we describe a parameterization human-like generalization about language data, es- for sequence decoders that facilitates (but does pecially in the low-data regime (e.g. Fodor et al., not enforce) the learning of context-independent 1988; Marcus, 2018). Our results suggest that the word meanings. Specifically, we augment decoder situation is more complicated: by offloading the output layers with a lexical translation mecha- easier lexicon learning problem to simpler models, nism which generalizes neural copy mechanisms neural sequence models are actually quite effec- (e.g. See et al., 2017) and enables models to gen- tive at modeling (and generalizing about) about erate token-level translations purely attentionally. syntax in synthetic tests of generalization and real While the lexical translation mechanism is quite translation tasks. general, we focus here on its ability to improve few-shot learning in sequence-to-sequence models. 2 Related Work On a suite of challenging tests of few-shot seman- Systematic generalization in neural sequence tic parsing and instruction following, our model models The desired inductive biases noted above exhibits strong generalization, achieving the high- are usually grouped together as “systematicity” but est reported results for neural sequence models on in fact involve a variety of phenomena: one-shot datasets as diverse as COGS (Kim and Linzen 2020, learning of new concepts and composition rules with 24155 training examples) and Colors (Lake (Lake and Baroni, 2018), zero-shot interpretation et al. 2019, with 14). Our approach also generalizes of novel words from context cues (Gandhi and to real-world tests of few-shot learning, improving Lake, 2020), and interpretation of known concepts BLEU scores (Papineni et al., 2002) by 1.2 on a in novel syntactic configurations (Keysers et al., low-resource English–Chinese machine translation 2020; Kim and Linzen, 2020). What they share is a task (2.2 on test sentences requiring one-shot word common expectation that learners should associate learning). specific production or transformation rules with In an additional set of experiments, we explore specific input tokens (or phrases), and generalize effective procedures for initializing the lexical to use of these tokens in new contexts. translation mechanism using lexicon learning al- Recent years have seen tremendous amount of gorithms derived from information theory, statis- modeling work aimed at encouraging these gener- tical machine translation, and Bayesian cognitive alizations in neural models, primarily by equipping modeling. We find that both mutual-information- them with symbolic scaffolding in the form of pro- and alignment- based lexicon initializers perform gram synthesis engines (Nye et al., 2020), stack ma- well across tasks. Surprisingly, however, we show chines (Grefenstette et al., 2015; Liu et al., 2020), that both approaches can be matched or outper- or symbolic data transformation rules (Gordon formed by a rule-based initializer that identifies et al., 2019; Andreas, 2020). A parallel line of work high-precision word-level token translation pairs. has investigated the role of continuous representa- We then explore joint learning of the lexicon and tions in systematic generalization, proposing im- decoder, but find (again surprisingly) that this gives proved methods for pretraining (Furrer et al., 2020) only marginal improvements over a fixed initializa- and procedures for removing irrelevant contex- tion of the lexicon. tual information from word representations (Arthur In summary, this work: et al., 2016; Russin et al., 2019; Thrush, 2020). The latter two approaches proceed from similar intu- • Introduces a new, lexicon-based output mech- ition to ours, aiming to disentangle word meanings anism for neural encoder–decoder models. from syntax in encoder representations via alterna- • Investigates and improves upon lexicon learn- tive attention mechanisms and adversarial training. ing algorithms for initialising this mechanism. Our approach instead focuses on providing an explicit lexicon to the decoder; as discussed below, • Uses it to solve challenging tests of generaliza- this appears to be considerably more effective. tion in instruction following, semantic parsing and machine translation. Copying and lexicon learning In neural encoder–decoder models, the clearest example A great deal of past work has suggested that of benefits from special treatment of word-level neural models come equipped with an inductive production rules is the copy mechanism. A great bias that makes them fundamentally ill-suited to deal of past work has found that neural models 4935 Inputs Outputs Lexicon Entries blessed 7! bless A crocodile blessed William . crocodile(x_1) AND bless.agent (x_2, x_1) AND bless.theme (x_2, William) needed 7! need William needed to walk . need.agent (x_1 , William) AND need.xcomp(x_1, x_3) AND walk.agent (x_3, William) William 7! William saturn 7! 土星 Many moons orbit around Saturn 1多 [星 ^W 土星 KL. earth 7! 0球 Earth is a planet . 0球 / 一個 L星. moon 7! [星 walk around left LTURN IWALK LTURN IWALK LTURN IWALK LTURN IWALK walk 7! IWALK turn right RTURN jump 7! IJUMP turn left LTURN right 7! RTURN jump IJUMP left 7! LTURN jump opposite right after look left LTURN ILOOK RTRUN IJUMP RTURN IJUMP look 7! ILOOK Table 1: We present example (input,output) pairs from COGS, English-to-Chinese machine translation and SCAN datasets. We also present some of the lexicon entries which can be learned by proposed lexicon learning methods and that are helpful to make generalizations required in each of the datasets. benefit from learning a structural copy operation based generative model; like the present work, that that selects output tokens directly from the input model effectively disentangles syntactic and lexical sequence without requiring token identity to be information by using training examples as implicit carried through all neural computation in the representations of lexical correspondences. encoder and the decoder. These mechanisms We generalize and extend this previous work in a are described in detail in Section3, and are number of ways, providing a new parameterization widely used in models for language generation, of attentive token-level translation and a detailed summarization and semantic parsing.

Lexicon Learning for Few-Shot Neural Sequence Modeling

Lexical Semantics I

LEXADV - a Multilingual Semantic Lexicon for Adverbs

A New Semantic Lexicon and Similarity Measure in Bangla

The Diachronic Semantic Lexicon of Dutch As Linked Open Data

Natural Language Processing

Linking Wordnet to the SIL Semantic Domains

Semanticnet-Perception of Human Pragmatics

Ten Choices for Lexical Semantics

Filling Knowledge Gaps in a Broad-Coverage Machine Translation System*

Approaches for Natural Language Processing (NLP)

Towards Acquiring Case Indexing Taxonomies from Text

A Semantic Approach for Extracting Domain Taxonomies from Text