Neural Shift-Reduce CCG Semantic Parsing

Dipendra K. Misra and Yoav Artzi
Department of Computer Science and Cornell Tech, Cornell University
New York, NY 10011
{dkm,yoav}@cs.cornell.edu

Abstract

We present a shift-reduce CCG semantic parser. Our parser uses a neural network architecture that balances model capacity and computational cost. We train by transferring a model from a computationally expensive log-linear CKY parser. Our learner addresses two challenges: selecting the best parse for learning when the CKY parser generates multiple correct trees, and learning from partial derivations when the CKY parser fails to parse. We evaluate on AMR parsing. Our parser performs comparably to the CKY parser, while doing significantly fewer operations. We also present results for greedy semantic parsing with a relatively small drop in performance.

1 Introduction

Shift-reduce parsing is a class of parsing methods that guarantees a linear number of operations in sentence length. This is a desired property for practical applications that require processing large amounts of text or real-time response. Recently, such techniques were used to build state-of-the-art syntactic parsers, and have demonstrated the effectiveness of deep neural architectures for decision making in linear-time dependency parsing (Chen and Manning, 2014; Dyer et al., 2015; Andor et al., 2016; Kiperwasser and Goldberg, 2016). In contrast, semantic parsing often relies on parsing algorithms with a polynomial number of operations, which results in slow parsing times unsuitable for practical applications. In this paper, we apply shift-reduce parsing to semantic parsing. Specifically, we study transferring a learned Combinatory Categorial Grammar (CCG; Steedman, 1996, 2000) from a dynamic-programming CKY model to a shift-reduce neural network architecture.

We focus on the feed-forward architecture of Chen and Manning (2014), where each parsing step is a multi-class classification problem. The state of the parser is represented using simple feature embeddings that are passed through a multilayer perceptron to select the next action. While simple, the capacity of this model to capture interactions between primitive features, instead of relying on sparse complex features, has led to new state-of-the-art performance (Andor et al., 2016). However, applying this architecture to semantic parsing presents learning and inference challenges.

In contrast to dependency parsing, semantic parsing corpora include sentences labeled with the system response or the target formal representation, and omit derivation information. CCG induction from such data relies on latent-variable techniques and requires careful initialization (e.g., Zettlemoyer and Collins, 2005, 2007). Such feature initialization does not directly transfer to a neural network architecture with dense embeddings, and the use of hidden layers further complicates learning by adding a large number of latent variables. We focus on data that includes sentence-representation pairs, and learn from a previously induced log-linear CKY parser. This drastically simplifies learning, and can be viewed as bootstrapping a fast parser from a slow one. While this dramatically narrows down the number of parses per sentence, it does not eliminate ambiguity. In our experiments, we often get multiple correct parses, up to 49K in some cases. We also observe that the CKY parser generates no parses for a significant number of training sentences. Therefore, we propose an iterative algorithm that automatically selects the best parses for training at each iteration, and identifies partial derivations for best-effort learning, if no parses are available.

[Figure 1: Example CCG tree for the sentence Some old networks remain inoperable, with five lexical entries, three forward applications (>) and a backward application (<).]

CCG parsing largely relies on two types of actions: using a lexicon to map words to their categories, and combining categories to acquire the categories of larger phrases. In most semantic parsing approaches, the number of operations is dominated by the large number of categories available for each word in the lexicon. For example, the lexicon in our experiments includes 1.7M entries, resulting in an average of 146, and up to 2K, applicable actions. Additionally, both operations and parser state have complex structures, for example including both syntactic and semantic information. Therefore, unlike in dependency parsing (Chen and Manning, 2014), we can not treat action selection as multi-class classification, and must design an architecture that can accommodate a varying number of actions. We present a network architecture that considers a variable number of actions, and emphasizes low computational overhead per action, instead focusing computation on representing the parser state.

We evaluate on Abstract Meaning Representation (AMR; Banarescu et al., 2013) parsing. We demonstrate that our modeling and learning contributions are crucial to effectively commit to early decisions during parsing. Somewhat surprisingly, our shift-reduce parser provides equivalent performance to the CKY parser used to generate the training data, despite requiring significantly fewer operations, on average two orders of magnitude less. Similar to previous work, we use beam search, but also, for the first time, report greedy CCG semantic parsing results at a relatively modest 9% decrease in performance, while the source CKY parser with a beam of one demonstrates a 71% decrease. While we focus on semantic parsing, our learning approach makes no task-specific assumptions and has potential for learning efficient models from the output of more expensive ones.¹

¹ The source code and pre-trained models are available at http://www.cs.cornell.edu/~dkm/ccgparser.

2 Task and Background

Our goal is to learn a function that, given a sentence x, maps it to a formal representation of its meaning z with a linear number of operations in the length of x. We assume access to a training set of N examples D = {(x^(i), z^(i))}_{i=1}^N, each containing a sentence x^(i) and a logical form z^(i). Since D does not contain complete derivations, we instead assume access to a CKY parser learned from the same data. We evaluate performance on a test set {(x^(i), z^(i))}_{i=1}^M of M sentences x^(i) labeled with logical forms z^(i). While we describe our approach in general terms, we apply our approach to AMR parsing and evaluate on a common benchmark (Section 6).

To map sentences to logical forms, we use CCG, a linguistically-motivated grammar formalism for modeling a wide range of syntactic and semantic phenomena (Steedman, 1996, 2000). A CCG is defined by a lexicon Λ and sets of unary R_u and binary R_b rules. In CCG parse trees, each node is a category. Figure 1 shows a CCG tree for the sentence Some old networks remain inoperable. For example, S\NP[pl]/(N[pl]/N[pl]) : λf.λx.f(λr.remain-01(r) ∧ ARG1(r, x)) is the category of the verb remain. The syntactic type S\NP[pl]/(N[pl]/N[pl]) indicates that two arguments are expected: first an adjective N[pl]/N[pl] and then a plural noun phrase NP[pl]. The final syntactic type will be S. The forward slash / indicates the argument is expected on the right, and the backward slash \ indicates it is expected on the left. The syntactic attribute pl is used to express the plurality constraint of the verb. The simply-typed lambda calculus logical form in the category represents semantic meaning. The typing system includes atomic types (e.g., entity e, truth value t) and functional types (e.g., ⟨e, t⟩ is the type of a function from e to t). In the example category above, the expression on the right of the colon is a ⟨⟨⟨e, t⟩, ⟨e, t⟩⟩, ⟨e, ⟨e, t⟩⟩⟩-typed function expecting first an adjectival modifier and then an ARG1 modifier. The conjunction ∧ specifies the roles of remain-01. The lexicon Λ maps words to CCG categories. For example, the lexical entry remain ⊢ S\NP[pl]/(N[pl]/N[pl]) : λf.λx.f(λr.remain-01(r) ∧ ARG1(r, x)) pairs the example category with remain. The parse tree in the figure includes four binary operations: three forward applications (>) and a backward application (<).
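As a worked example, consider the forward application (>) in Figure 1 that combines the category of remain with the adjectival category of inoperable, N[x]/N[x]. Abbreviating the adjective's ARG3 argument A(λp.possible(p) ∧ polarity(p, −) ∧ domain(p, A(λo.operate-01(o)))) as P, the application reduces as follows:

```latex
% Forward application (>): remain, S\NP[pl]/(N[pl]/N[pl]), applied to the
% adjectival category of inoperable, N[x]/N[x], yielding the S\NP[pl] node.
\begin{align*}
&\big(\lambda f.\lambda x.\, f(\lambda r.\,\text{remain-01}(r) \wedge \text{ARG1}(r, x))\big)\,
 \big(\lambda f.\lambda x'.\, f(x') \wedge \text{ARG3}(x', P)\big)\\
&\quad = \lambda x.\,\big(\lambda f.\lambda x'.\, f(x') \wedge \text{ARG3}(x', P)\big)\,
 \big(\lambda r.\,\text{remain-01}(r) \wedge \text{ARG1}(r, x)\big)\\
&\quad = \lambda x.\lambda r.\,\text{remain-01}(r) \wedge \text{ARG1}(r, x) \wedge \text{ARG3}(r, P)
\end{align*}
```

The resulting category is S\NP[pl], matching the corresponding node in Figure 1; it then combines with the noun phrase some old networks by backward application (<) to yield the sentence-level logical form.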
3 Neural Shift-Reduce Semantic Parsing

Given a sentence x = ⟨x_1, . . . , x_m⟩ with m tokens x_i and a CCG lexicon Λ, let GEN(x; Λ) be a function that generates CCG parse trees. We design GEN as a shift-reduce parser, and score decisions using embeddings of parser states and candidate actions.

3.1 Shift-Reduce Parsing

Shift-reduce parsers perform a single pass of the sentence from left to right to construct a parse tree.² The parser configuration is defined with a stack and a buffer. The stack contains partial parse trees, and the buffer the remainder of the sentence to be processed. Formally, a parser configuration c is a tuple ⟨σ, β⟩, where the stack σ is a list of CCG trees [s_l ··· s_1], and the buffer β is a list of tokens from x to be processed [x_i ··· x_m].³ For example, the top-left of Figure 2 shows a parsing configuration with two partial trees on the stack and two words on the buffer (remain and inoperable).

² We use the terms parser configuration and parser state interchangeably.
³ The head of the stack σ is the right-most entry, and the head of the buffer β is the left-most entry.

Parsing starts with the configuration ⟨[], [x_1 ··· x_m]⟩, where the stack is empty and the buffer is initialized with x. In each parsing step, the parser either consumes a word from the buffer and pushes a new tree to the stack, or applies a parsing rule to the trees at the top of the stack. For simplicity, we apply CCG rules to trees, where a rule is applied to the root categories of the argument trees to create a new tree with the arguments as children. We treat lexical entries as trees with a single node. There are three types of actions:⁴

  SHIFT(l, ⟨σ, x_i | ··· | x_j | β⟩) = ⟨σ | g, β⟩
  BINARY(b, ⟨σ | s_2 | s_1, β⟩) = ⟨σ | b(s_2, s_1), β⟩
  UNARY(u, ⟨σ | s_1, β⟩) = ⟨σ | u(s_1), β⟩ ,

where b ∈ R_b is a binary rule, u ∈ R_u is a unary rule, and l is a lexical entry x_i, . . . , x_j ⊢ g for the tokens x_i, . . . , x_j and CCG category g. SHIFT creates a tree given a lexical entry for the words at the top of the buffer, BINARY applies a binary rule to the two trees at the head of the stack, and UNARY applies a unary rule to the tree at the head of the stack. A configuration is terminal when no action is applicable.

⁴ We follow the standard notation of L|x indicating a list with all the entries from L and x as the right-most element.

Given a sentence x, a derivation is a sequence of action-configuration pairs ⟨⟨c_1, a_1⟩, . . . , ⟨c_k, a_k⟩⟩, where action a_i is applied to configuration c_i to generate configuration c_{i+1}. The result configuration c_{k+1} is of the form ⟨[s], []⟩, where s represents a complete parse tree, and the logical form z at the root category represents the meaning of the complete sentence. Following previous work with CKY for CCG parsing (Zettlemoyer and Collins, 2005), we disallow consecutive unary actions. We denote the set of actions allowed from configuration c as A(c).
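The transition system can be made concrete with a short sketch. The code below is illustrative rather than the authors' implementation: CCGTree, Configuration, and the rule callables are simplified stand-ins, and lexical lookup and rule applicability checks are elided.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class CCGTree:
    # Simplified stand-in for a CCG tree: a root category, a logical form,
    # and child trees (lexical entries are trees with a single node).
    category: str
    logical_form: str
    children: Tuple["CCGTree", ...] = ()

@dataclass
class Configuration:
    stack: List[CCGTree]   # partial parse trees; the head is the right-most entry
    buffer: List[str]      # tokens of x that remain to be processed

def shift(c: Configuration, entry: CCGTree, span_len: int) -> Configuration:
    # SHIFT(l, <sigma, x_i|...|x_j|beta>) = <sigma|g, beta>: push a single-node
    # tree built from a lexical entry for the next span_len buffer tokens.
    return Configuration(c.stack + [entry], c.buffer[span_len:])

def binary(c: Configuration, b: Callable[[CCGTree, CCGTree], CCGTree]) -> Configuration:
    # BINARY(b, <sigma|s2|s1, beta>) = <sigma|b(s2, s1), beta>.
    s2, s1 = c.stack[-2], c.stack[-1]
    return Configuration(c.stack[:-2] + [b(s2, s1)], c.buffer)

def unary(c: Configuration, u: Callable[[CCGTree], CCGTree]) -> Configuration:
    # UNARY(u, <sigma|s1, beta>) = <sigma|u(s1), beta>.
    return Configuration(c.stack[:-1] + [u(c.stack[-1])], c.buffer)
```

Parsing starts from Configuration([], list(x)) and proceeds until a terminal configuration is reached; a derivation records the sequence of applied configuration-action pairs.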
3.2 Model

Our goal is to balance computation and model capacity. To recover a rich representation of the configuration, we use a multilayer perceptron (MLP) to create expressive interactions between a small number of simple features. However, since we consider many possible actions in each step, computing activations for multiple hidden layers for each action is prohibitively expensive. Instead, we opt for a computationally-inexpensive action representation computed by concatenating feature embeddings. Figure 2 illustrates our architecture.

Given a configuration c, the probability of an action a is:

  p(a | c) = exp{φ(a, c) W_b F(ξ(c))} / Σ_{a′ ∈ A(c)} exp{φ(a′, c) W_b F(ξ(c))} ,

where φ(a, c) is the action embedding, ξ(c) is the configuration embedding, F is an MLP, and W_b is a bilinear transformation matrix. Given a sentence x and a sequence of action-configuration pairs ⟨⟨c_1, a_1⟩, . . . , ⟨c_k, a_k⟩⟩, the probability of a CCG tree y is

  p(y | x) = Π_{i=1...k} p(a_i | c_i) .

The probability of a logical form z is then

  p(z | x) = Σ_{y ∈ Y(z)} p(y | x) ,

where Y(z) is the set of CCG trees with the logical form z at the root.
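The factored score can be written directly from the equations above. The following sketch assumes the configuration embedding ξ(c) and the per-action embeddings φ(a, c) are already assembled as NumPy vectors; dimensions and parameter names are illustrative.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def config_representation(xi_c, W1, b1, W2, b2, W3, b3):
    # F(xi(c)): two ReLU hidden layers followed by a linear layer that reduces
    # the dimensionality of the configuration representation (and hence of W_b).
    h1 = relu(W1 @ xi_c + b1)
    h2 = relu(W2 @ h1 + b2)
    return W3 @ h2 + b3

def action_distribution(phi_actions, xi_c, W_b, mlp_params):
    # p(a | c) proportional to exp{ phi(a, c) W_b F(xi(c)) }, normalized over
    # the variable-size action set A(c).
    h = config_representation(xi_c, *mlp_params)
    scores = np.array([phi @ W_b @ h for phi in phi_actions])
    scores -= scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()
```

Because F(ξ(c)) is computed once per configuration and each action only contributes a cheap bilinear term, the per-action cost stays low even when A(c) is large.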
[Figure 2: Illustration of scoring the next action given the configuration c when parsing the sentence Some old networks remain inoperable. Embeddings of the same feature type are colored the same. The configuration embedding ξ(c) is a concatenation of syntax embeddings (green) and the logical form embedding (blue; computed by ψ) for the top entries in the stack. We then pass ξ(c) through the MLP F. Given the actions A(c), we compute the embeddings φ(a_i, c), i = 1 . . . |A(c)|. The actions and MLP representation are combined with a bilinear softmax layer. The number of concatenated vectors and stack elements used is for illustration. The details are described in Section 3.2.]

MLP Architecture  We use an MLP F with two hidden layers parameterized by {W_1, W_2, b_1, b_2} with a ReLU non-linearity (Glorot et al., 2011). Since the output of F influences the dimensionality of W_b, we add a linear layer parameterized by W_3 and b_3 to reduce the dimensionality of the configuration representation, thereby reducing the dimensionality of W_b.

Configuration Embedding ξ(c)  Given a configuration c = ⟨[s_l ··· s_1], [x_i ··· x_m]⟩, the input to F is a concatenation of syntactic and semantic embeddings, as illustrated in Figure 2. We concatenate embeddings from the top three trees in the stack s_1, s_2, s_3.⁵ When a feature is not present, for example when the stack or buffer are too small, we use a tunable null embedding.

⁵ For simplicity, the figure shows only the top two trees.

Given a tree on the stack s_j, we define two syntactic features: attribute set and stripped syntax. The attribute feature is created by extracting all the syntactic attributes of the root category of s_j. The stripped syntax feature is the syntax of the root category without the syntactic attributes. For example, in Figure 2, we embed the stripped category N and attribute pl for s_1, and NP/N and x for s_2. The attributes are separated from the syntax to reduce sparsity, and the interaction between them is computed by F. The sparse features are converted to dense embeddings using a lookup table and concatenated.

In addition, we also embed the logical form at the root of s_j. Figure 3 illustrates the recursive embedding function ψ.⁶ Using a recursive function to embed logical forms is computationally intensive. Due to the strong correlation between sentence length and logical form complexity, this computation increases the cost of configuration embedding by a factor linear in sentence length. In Section 6, we experiment with including this option, balancing between potential expressivity and speed.

⁶ The algorithm is provided in the supplementary material.

[Figure 3: Illustration of embedding the logical form λx.arg0(x, JOHN) with the recursive embedding function ψ. In each level in ψ, the children nodes are combined with a single-layer neural network parameterized by W_r, δ_r, and the tanh activation function. Computed embeddings are in dark gray, and embeddings from lookup tables are in light gray. Constants are embedded by combining name and type embeddings, literals are unrolled to binary recursive structures, and lambda terms are combinations of variable type and body embeddings. For example, JOHN is embedded by combining the embeddings of its name and type, the literal arg0(x, JOHN) is recursively embedded by first embedding the arguments (x, JOHN) and then combining the predicate, and the lambda term is embedded to create the embedding of the entire logical form.]
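Following the description of ψ in Figure 3, the recursive embedding can be sketched as below. The nested-tuple encoding of logical forms and the use of concatenation before the single-layer combination are assumptions made for illustration only.

```python
import numpy as np

def combine(children, W_r, delta_r):
    # One level of psi: combine child embeddings with a single-layer network
    # parameterized by W_r, delta_r and a tanh non-linearity.
    return np.tanh(W_r @ np.concatenate(children) + delta_r)

def psi(expr, lookup, W_r, delta_r):
    # Illustrative encoding of logical forms:
    #   ("const", name, type)       constants: combine name and type embeddings
    #   ("var", type)               variables: embedded by their type
    #   ("literal", pred, a1, a2)   literals: unrolled to binary structures
    #   ("lambda", var_type, body)  lambda terms: variable type + body
    kind = expr[0]
    if kind == "const":
        _, name, typ = expr
        return combine([lookup[name], lookup[typ]], W_r, delta_r)
    if kind == "var":
        return lookup[expr[1]]
    if kind == "literal":
        _, pred, a1, a2 = expr
        args = combine([psi(a1, lookup, W_r, delta_r),
                        psi(a2, lookup, W_r, delta_r)], W_r, delta_r)
        return combine([lookup[pred], args], W_r, delta_r)
    if kind == "lambda":
        _, var_type, body = expr
        return combine([lookup[var_type],
                        psi(body, lookup, W_r, delta_r)], W_r, delta_r)
    raise ValueError(f"unknown expression kind: {kind}")

# The logical form of Figure 3, lambda x. arg0(x, JOHN), in this encoding:
john = ("const", "JOHN", "e")
example = ("lambda", "e", ("literal", "arg0", ("var", "e"), john))
```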

Action Embedding φ(a, c)  Given an action a ∈ A(c) and the configuration c, we generate the action representation by computing sparse features, converting them to dense embeddings via table lookup, and concatenating. If more than one feature of the same type is triggered, we average their embeddings. When no features of a given type are triggered, we use a tunable placeholder embedding instead. The features include all the features used by Artzi et al. (2015), including all conjunctive features, as well as properties of the action and configuration, such as the POS tags of tokens on the buffer.⁷

⁷ See the supplementary material for feature details.
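A sketch of φ(a, c) under this scheme is shown below; the feature extraction itself is abstracted away, and the grouping of triggered values by feature type is an assumed input format.

```python
import numpy as np

def action_embedding(triggered, lookup, placeholder, feature_types):
    # phi(a, c): for every feature type, average the embeddings of the
    # triggered sparse feature values (table lookup), fall back to a tunable
    # placeholder when none fire, and concatenate across feature types.
    parts = []
    for ftype in feature_types:
        values = triggered.get(ftype, [])
        if values:
            parts.append(np.mean([lookup[ftype][v] for v in values], axis=0))
        else:
            parts.append(placeholder[ftype])
    return np.concatenate(parts)
```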

Discussion  Our use of an MLP is inspired by Chen and Manning (2014). However, their architecture is designed to handle only a fixed number of actions, while we observe a varying number of actions. Therefore, we adopt a probabilistic model similar to Dyer et al. (2015) to effectively combine the benefits of the two approaches.⁸ We factorize the exponent in our objective into action φ(a, c) and configuration embeddings. While every parse step involves a single configuration, the number of actions is significantly higher. With the goal of minimizing the amount of computation per action, we use simple concatenation only for the action embedding. However, this requires retaining sparse conjunctive action features, since they are never combined through hidden layers like the configuration features.

⁸ We experimented with an LSTM parser similar to Dyer et al. (2015). However, performance was not competitive. This direction remains an important avenue for future work.

3.3 Inference

To compute the set of parse trees GEN(x; Λ), we perform beam search to recover the top-k parses. The beam contains configurations. At each step, we expand all configurations with all actions, and keep only the top-k new configurations. To promote diversity in the beam, given two configurations with the same signature, we keep only the highest scoring one. The signature includes the previous configuration in the derivation, the state of the buffer, and the root categories of all stack elements. Since all features are computed from these elements, this optimization does not affect the max-scoring tree. Additionally, since words are assigned structured categories, a key problem is unknown words or word uses. Following Zettlemoyer and Collins (2007), we use a two-pass parsing strategy, and allow skipping words controlled by the term γ in the second pass. The term γ is added to the exponent of the action probability when words are skipped. See the supplementary material for the exact form.
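The inference procedure can be sketched as a beam over configurations with signature-based pruning. Action generation, action application, and scoring are passed in as callables with illustrative names; the two-pass word-skipping mechanism is omitted.

```python
def signature(prev_config, config):
    # Beam-diversity signature: the previous configuration in the derivation,
    # the state of the buffer, and the root categories of all stack elements.
    return (id(prev_config),
            tuple(config.buffer),
            tuple(t.category for t in config.stack))

def beam_search(initial_config, actions, apply_action, log_p_action, k=512):
    beam = [(0.0, None, initial_config)]   # (log probability, parent, configuration)
    finished = []
    while beam:
        expansions = {}
        for score, _, config in beam:
            applicable = actions(config)
            if not applicable:             # terminal configuration
                finished.append((score, config))
                continue
            for a in applicable:
                new_config = apply_action(config, a)
                new_score = score + log_p_action(a, config)
                sig = signature(config, new_config)
                # Keep only the highest-scoring configuration per signature.
                if sig not in expansions or expansions[sig][0] < new_score:
                    expansions[sig] = (new_score, config, new_config)
        # Keep the top-k expanded configurations.
        beam = sorted(expansions.values(), key=lambda e: e[0], reverse=True)[:k]
    return sorted(finished, key=lambda e: e[0], reverse=True)
```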
Complexity Analysis  The shift-reduce parser processes the sentence from left to right with a linear number of operations in sentence length. We define an operation as applying an action to a configuration. Formally, the number of operations for a sentence of length m is bounded by O(4mk(|λ| + |R_b| + |R_u|)), where |λ| is the number of lexical entries per token, k is the beam size, R_b is the set of binary rules, and R_u the set of unary rules. In comparison, the number of operations for the CKY parser, where an operation is applying a rule to a single cell or two adjacent cells in the chart, is bounded by O(m³|λ| + m²k|R_b| + m²|R_b||R_u|). For sentence length 25, the mean in our experiments, the shift-reduce parser performs 100 times fewer operations. See the supplementary material for the full analysis.

4 Learning

We assume access to a training set of N examples D = {(x^(i), z^(i))}_{i=1}^N, each containing a sentence x^(i) and a logical form z^(i). The data does not include information about the lexical entries and CCG parsing operations required to construct the correct derivations. We bootstrap this information from a learned parser. In our experiments we use a learned dynamic-programming CKY parser. We transfer the lexicon Λ directly from the input parser, and focus on estimating the parameters θ, which include feature embeddings, hidden layer matrices, and bias terms. The main challenge is learning from the noisy supervision provided by the input parser. In our experiments, the CKY parser fails to correctly parse 40% of the training data, and returns on average 147 max-scoring correct derivations for the rest. We propose an iterative algorithm that treats the choice between multiple parse trees as latent, and effectively learns from partial analysis when no correct derivation is available.

The learning algorithm (Algorithm 1) starts by processing the data using the CKY parser (lines 3-4). For each sentence x^(i), we collect the max-scoring CCG trees with z^(i) at the root. The CKY parser often returns many correct parses with identical scores, up to 49K parses per sentence. Therefore, we randomly sample and keep up to 1K trees.

Algorithm 1  The learning algorithm.
Input: Training set D = {(x^(i), z^(i))}_{i=1}^N, learning rate µ, regularization parameter ℓ_2, and number of iterations T.
Definitions: GENMAXCKY(x, z) returns the set of max-scoring CKY parses for x with z at the root. SCORE(y, θ) scores a tree y according to the parameters θ (Section 3.2). CONFGEN(x, y) is the sequence of action-configuration pairs that generates y given x (Section 3.1). BP(∇J) takes the objective J and back-propagates the error through the computation graph for the sample used to compute the objective. ADAGRAD(∆) applies a per-feature learning rate to the gradient ∆ (Duchi et al., 2011).
Output: Model parameters θ.
 1: » Get trees from CKY parser.
 2: Y ← []
 3: for i = 1 to N do
 4:   Y[i] ← GENMAXCKY(x^(i), z^(i))
 5: for t = 1 to T do
 6:   » Pick max-scoring trees and create action dataset.
 7:   D_A ← ∅
 8:   for i = 1 to N do
 9:     if Y[i] ≠ ∅ then
10:       A ← CONFGEN(x^(i),
11:                   arg max_{y ∈ Y[i]} SCORE(y, θ))
12:       for ⟨c, a⟩ ∈ A do
13:         D_A ← D_A ∪ {⟨c, a⟩}
14:   » Back-propagate the loss through the network.
15:   for ⟨c, a⟩ ∈ D_A do
16:     J = −log p(a | c) + (ℓ_2/2) θᵀθ
17:     ∆ ← BP(∇J)
18:     θ ← θ − µ ADAGRAD(∆)
19: return θ
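Algorithm 1 amounts to the following loop. The sketch assumes the helpers named in the algorithm (GENMAXCKY, CONFGEN, SCORE) and a differentiable log p(a | c) exposed through a gradient callable; the AdaGrad update is written out explicitly for clarity.

```python
import numpy as np

def train(data, gen_max_cky, confgen, score, grad_neg_log_p, theta,
          T=3, mu=0.05, l2=1e-6):
    # Lines 1-4: parse each training sentence once with the CKY parser and
    # keep the max-scoring correct trees (sampled down to at most 1K).
    Y = [gen_max_cky(x, z) for x, z in data]
    history = {name: np.zeros_like(p) for name, p in theta.items()}  # AdaGrad state

    for _ in range(T):
        # Lines 6-13: choose the max-scoring tree under the current parameters
        # and collect its configuration-action pairs into D_A.
        D_A = []
        for (x, _), trees in zip(data, Y):
            if trees:
                best = max(trees, key=lambda y: score(y, theta))
                D_A.extend(confgen(x, best))

        # Lines 14-18: per-sample updates on the l2-regularized negative
        # log-likelihood J = -log p(a|c) + (l2/2) * theta^T theta.
        for c, a in D_A:
            grads = grad_neg_log_p(a, c, theta)   # gradient of -log p(a|c)
            for name, g in grads.items():
                delta = g + l2 * theta[name]      # gradient of J w.r.t. theta[name]
                history[name] += delta ** 2
                theta[name] -= mu * delta / (np.sqrt(history[name]) + 1e-8)
    return theta
```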
This process is done once, and the algorithm then runs for T iterations. At each iteration, given the sets of parses from the CKY parser Y, we select the max-probability parse according to our current parameters θ (line 10) and add all the shift-reduce decisions from this parse to D_A (line 12), the action data set that we use to estimate the parameters. We approximate the arg max with beam search using an oracle computed from the CKY parses.⁹ CONFGEN aggregates the configuration-action pairs from the highest scoring derivation. Parse selection depends on θ, and this choice will gradually converge as the parameters improve. The action data set is used to compute the ℓ_2-regularized negative log-likelihood objective J (line 16) and back-propagate the error to compute the gradient (line 17). We use AdaGrad (Duchi et al., 2011) to update the parameters θ (line 18).

⁹ Our oracle is non-deterministic and incomplete (Goldberg and Nivre, 2013).

4.1 Learning from Partial Derivations

The input parser often fails to generate correct parses. In our experiments, this occurs for 40% of the training data. In such cases, we can obtain a forest of partial parse trees Y_p. Each partial tree y ∈ Y_p corresponds to a span of tokens in the sentence and is scored by the input parser. In practice, the spans are often overlapping. Our goal is to generate high quality configuration-action pairs ⟨c, a⟩ from Y_p. These pairs will be added to D_A for training. While extracting actions a is straightforward, generating configurations c requires reconstructing the stack σ from an incomplete forest of partial trees Y_p. Figure 4 illustrates our proposed process. Let CKYSCORE(y) be the CKY score of the partial tree y. To reconstruct σ, we select non-overlapping partial trees Y that correspond to the entire sentence by solving arg max_{Y ⊆ Y_p} Σ_{y ∈ Y} CKYSCORE(y) under two constraints: (a) no two trees from Y correspond to overlapping tokens, and (b) for each token in x, there exists y ∈ Y that corresponds to it. We solve the arg max using dynamic programming. The generated set Y approximates an intermediate state of a shift-reduce derivation. However, Y_p often does not contain high quality partial derivations for all spans. To skip low quality partial trees and spans that have no trees, we generate empty trees y_e for every span, where CKYSCORE(y_e) = 0, and add them to Y_p. If the set of selected partial trees Y includes empty trees, we divide the sentence into separate examples and ignore these parts. This results in partial and approximate stack reconstruction. Finally, since Y_p is noisy, we prune from it partial trees with a root that does not match the syntactic type for this span from an automatically generated CCGBank (Hockenmaier and Steedman, 2007) syntactic parse.

[Figure 4: Partial derivation selection for learning (Section 4.1). The dotted triangles represent skipped spans in the sentence, where no high quality partial trees were found. Dark triangles represent the selected partial trees. We identify two contiguous spans, 1-6 and 11-14, and generate two synthetic sentences for training: the tokens are treated as complete sentences, and actions and stack state are generated from the partial trees.]
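The constrained arg max over partial trees reduces to a dynamic program over token positions, sketched below. Candidate trees are assumed to be given as (start, end, score, tree) spans; adding zero-score empty spans for every span (or at least for every single token) guarantees that a full cover exists.

```python
def select_partial_trees(n_tokens, candidates):
    # Select non-overlapping partial trees that together cover tokens [0, n)
    # while maximizing the total CKY score.
    by_end = {}
    for start, end, score, tree in candidates:       # span is [start, end)
        by_end.setdefault(end, []).append((start, score, tree))

    NEG_INF = float("-inf")
    best = [NEG_INF] * (n_tokens + 1)                # best[i]: best cover of [0, i)
    back = [None] * (n_tokens + 1)
    best[0] = 0.0
    for end in range(1, n_tokens + 1):
        for start, score, tree in by_end.get(end, []):
            if best[start] > NEG_INF and best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, tree)

    # Follow back-pointers to recover the selected trees, left to right.
    selected, i = [], n_tokens
    while i > 0:
        start, tree = back[i]
        selected.append(tree)
        i = start
    return list(reversed(selected))
```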
Our complete learning algorithm alternates between epochs of learning with complete parse trees and learning with partial derivations. In epochs where we use partial derivations, we use a modified version of Algorithm 1, where lines 9-10 are updated to use the above process.

5 Related Work

Our approach is inspired by recent results in dependency parsing, specifically by the architecture of Chen and Manning (2014), which was further developed by Weiss et al. (2015) and Andor et al. (2016). Dyer et al. (2015) proposed to encode the parser state using an LSTM recurrent architecture, which has been shown to generalize well across languages (Ballesteros et al., 2015; Ammar et al., 2016). Our network architecture combines ideas from the two threads: we use feature embeddings and a simple MLP to score actions, while our probability distribution is similar to the LSTM parser.

The majority of CCG approaches for semantic parsing rely on CKY parsing with beam search (e.g., Zettlemoyer and Collins, 2005, 2007; Kwiatkowski et al., 2010, 2011; Artzi and Zettlemoyer, 2011, 2013; Artzi et al., 2014; Matuszek et al., 2012; Kushman and Barzilay, 2013). Semantic parsing with other formalisms also often relied on CKY-style algorithms (e.g., Liang et al., 2009; Kim and Mooney, 2012). With a similar goal to ours, Berant and Liang (2015) designed an agenda-based parser. In contrast, we focus on a method with a guaranteed linear number of operations.

Following the work of Collins and Roark (2004) on learning for syntactic parsers, Artzi et al. (2015) proposed an early update procedure for inducing CCG grammars with a CKY parser. Our partial derivations learning method generalizes this method to parsers with global features.

6 Experimental Setup

Task and Data  We evaluate on AMR parsing with CCG. AMR is a general-purpose meaning representation, which has been used in multiple tasks (Pan et al., 2015; Liu et al., 2015; Sawai et al., 2015; Garg et al., 2016). We use the newswire portion of the AMR Bank 1.0 release (LDC2014T12), which displays some of the fundamental challenges in semantic parsing, including long newswire sentences with a broad array of syntactic and semantic phenomena. We follow the standard train/dev/test split of 6603/826/823 sentences. We evaluate with the SMATCH metric (Cai and Knight, 2013). Our parser is incorporated into the two-stage approach of Artzi et al. (2015). The approach includes a bi-directional and deterministic conversion between AMR and lambda calculus. Distant references, for example those introduced by pronouns, are represented using Skolem IDs, globally-scoped existentially-quantified unique IDs. A derivation includes a CCG tree, which maps the sentence to an underspecified logical form, and a constant mapping, which maps underspecified elements to their fully specified form. The key to the approach is the underspecified logical forms, where distant references and most relations are not fully specified, but instead represented as placeholders. Figure 5 shows an example AMR, its lambda calculus conversion, and its underspecified logical form.
[Figure 5: AMR for the sentence the lawyer concluded his arguments late. In Artzi et al. (2015), the AMR (left) is deterministically converted to the logical form (right). The underspecified logical form is the result of the first stage, CCG parsing, and contains two placeholders (bolded): ID for a reference, and REL for a relation. To generate the final logical form, the second stage resolves ID to the identifier of the lawyer (2), and REL to the relation time. We focus on a model for the first stage and use an existing model for the second stage.
  AMR: (c/conclude-02 :ARG0 (l/lawyer) :ARG1 (a/argument :poss l) :time (l2/late))
  Underspecified logical form: A1(λc.conclude-02(c) ∧ ARG0(c, A2(λl.lawyer(l))) ∧ ARG1(c, A3(λa.argument(a) ∧ poss(a, R(ID)))) ∧ REL(c, A4(λl2.late(l2))))
  Logical form: A1(λc.conclude-02(c) ∧ ARG0(c, A2(λl.lawyer(l))) ∧ ARG1(c, A3(λa.argument(a) ∧ poss(a, R(2)))) ∧ time(c, A4(λl2.late(l2))))]

Artzi et al. (2015) use a CKY parser to identify the best CCG tree, and a factor graph for the second stage. We integrate our shift-reduce parser into the two-stage setup by replacing the CKY parser. We use the same CCG configuration and integrate our parser into the joint probabilistic model. Formally, given a sentence x, the probability of an AMR logical form z is

  p(z | x) = Σ_u Σ_{y ∈ Y(u)} p(z | u, x) p(y | x) ,

where u is an underspecified logical form and Y(u) is the set of CCG trees with u at the root. We use our shift-reduce parser to compute p(y | x) and use the pre-trained model from Artzi et al. (2015) for p(z | u, x). Following Artzi et al. (2015), we disallow configurations that will not result in a valid AMR, and design a heuristic post-processing technique to recover a single logical form from terminal configurations that include multiple disconnected partial trees on the stack. We use the recovery technique when no complete parses are available.

Tools  We evaluate with the SMATCH metric (Cai and Knight, 2013). We use EasyCCG (Lewis and Steedman, 2014) for CCGBank categories (Section 4.1). We implement our system using Cornell SPF (Artzi, 2016) and the Deeplearning4j library.¹⁰ The setup of Artzi et al. (2015) also includes the Illinois NER (Ratinov and Roth, 2009) and the Stanford CoreNLP POS Tagger (Manning et al., 2014).

¹⁰ http://deeplearning4j.org/

Parameters and Initialization  We minimize our loss on a held-out 10% of the training data to tune our parameters, and train the final model on the full data. We set the number of epochs T = 3, regularization coefficient ℓ_2 = 10⁻⁶, learning rate µ = 0.05, and skipping term γ = 1.0. We set the dimensionality of feature embeddings based on the vocabulary size of the feature type. The exact dimensions are listed in the supplementary material. We use 65 ReLU units for h_1 and h_2, and 50 units for h_3. We initialize θ with the initialization scheme of Glorot and Bengio (2010), except the bias term for ReLU layers, which we initialize to 0.1 to increase the number of active units on initialization. During test, we use the vector 0 as the embedding for unseen features. We use a beam of 512 for testing and 2 for CONFGEN (Section 4).

Model Ensemble  For our final results, we marginalize the output over three models M using p(z | x, θ, Λ) = (1/|M|) Σ_{m ∈ M} p(z | m, x, θ, Λ).

7 Results

Table 1: Development SMATCH results.

  Parser                            P     R      F
  CKY (Artzi et al., 2015)          67.2  65.1   66.1
  Greedy CKY                        64.1  11.29  19.19
  SR (complete model)               67.0  63.4   65.3
    w/o semantic embedding          67.1  63.3   65.1
    w/o partial derivation learning 66.0  62.2   64.0
  Ensemble SR (syntax)              68.2  64.1   66.0
  Ensemble SR (syntax, semantics)   68.1  63.9   65.9
  SR with CKY model                 52.5  49.36  50.88

Table 2: Test SMATCH results.¹²

  Parser                    P     R     F
  JAMR¹¹                    67.8  59.2  63.2
  CKY (Artzi et al., 2015)  66.8  65.7  66.3
  Shift Reduce              68.1  64.2  66.1
  Wang et al. (2015a)¹³     72.0  67.0  70.0

¹¹ JAMR results are taken from Artzi et al. (2015).
¹² Pust et al. (2015), Flanigan et al. (2014), and Wang et al. (2015b) report results on different sections of the corpus. These results are not comparable to ours.
¹³ Our goal is to study the effectiveness of our model transfer approach and architecture. Therefore, we avoid using any resources used in Wang et al. (2015b) that are not used in the CKY parser we compare to.

Table 1 shows development results. We trained each model three times and report the best performance. We observed a variance of roughly 0.5 in these runs. We experimented with different features for configuration embedding and with removing learning with partial derivations (Section 4.1). The complete model gives the best single-model performance of 65.3 F1 SMATCH, and we observe the benefits of semantic embeddings and learning from partial derivations. Using partial derivations allowed us to learn 370K more features, 22% of observed embeddings. We also evaluate ensemble performance. We observe an overall improvement in performance. However, with multiple models, the benefit of using semantic embeddings vanishes. This result is encouraging since semantic embeddings can be expensive to compute if the logical form grows with sentence length. We also provide results for running a shift-reduce log-linear parser p(a | c) ∝ exp{wᵀ φ_CKY(a, c)} using the input CKY model. We observe a significant drop in performance, which demonstrates the overall benefit of our architecture.

Figure 6 shows the development performance of our best performing ensemble model for different beam sizes. The performance decays slowly with decreasing beam size. Surprisingly, our greedy parser achieves 59.77 SMATCH F1, while the CKY parser with a beam of 1 achieves only 19.2 SMATCH F1 (Table 1). This allows our parser to trade off a modest drop in accuracy for a significant improvement in runtime.

[Figure 6: The effect of beam size on model performance (x-axis: beam size, 1-600; y-axis: SMATCH F1).]

Table 2 shows the test results using our best performing model (ensemble with syntax features). We compare our approach to the CKY parser of Artzi et al. (2015) and JAMR (Flanigan et al., 2014).¹¹,¹² We also list the results of Wang et al. (2015b), who demonstrated the benefit of auxiliary analyzers and represent the current state of the art.¹³ Our performance is comparable to the CKY parser of Artzi et al. (2015), which we use to bootstrap our system. This demonstrates the ability of our parser to match the performance of a dynamic-programming parser, which executes significantly more operations per sentence.

Finally, Figure 7 shows our parser runtime relative to sentence length. In this analysis, we focus on runtime, and therefore use a single model. We compare two versions of our system, including and excluding semantic embeddings, and the CKY parser of Artzi et al. (2015). We run both parsers with 16 cores and 122GB memory. The shift-reduce parser is three times faster on average, and up to ten times faster on long sentences. Since our parser currently uses CPUs, future work focused on GPU porting is likely to see further improvements.

[Figure 7: Wall time (sec.) as a function of sentence length for the shift-reduce parser with only syntax features (blue), with syntax and semantic features (orange), and the CKY parser of Artzi et al. (2015) (black).]

8 Conclusion

Our parser design emphasizes a balance between model capacity and the ability to combine atomic features against the computational cost of scoring actions. We also design a learning algorithm to transfer learned models and learn neural network models from ambiguous and partial supervision. Our model shares many commonalities with transition-based dependency parsers. This makes it a good starting point to study the effectiveness of other dependency parsing techniques for semantic parsing, for example global normalization (Andor et al., 2016) and bidirectional LSTM feature representations (Kiperwasser and Goldberg, 2016).

Acknowledgments

This research was supported in part by gifts from Google and Amazon. The authors thank Kenton Lee for technical advice, and Adam Gibson and Alex Black of Skymind for help with Deeplearning4j. We also thank Tom Kwiatkowski, Arzoo Katiyar, Tianze Shi, Vlad Niculae, the Cornell NLP Group, and the reviewers for helpful advice.

References

Ammar, W., Mulcaire, G., Ballesteros, M., Dyer, C., and Smith, N. A. (2016). Many languages, one parser. Transactions of the Association for Computational Linguistics.

Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., and Collins, M. (2016). Globally normalized transition-based neural networks. CoRR.

Artzi, Y. (2016). Cornell SPF: Cornell semantic parsing framework. ArXiv e-prints.

Artzi, Y., Das, D., and Petrov, S. (2014). Learning compact lexicons for CCG semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Artzi, Y., Lee, K., and Zettlemoyer, L. (2015). Broad-coverage CCG semantic parsing with AMR. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Artzi, Y. and Zettlemoyer, L. S. (2011). Bootstrapping semantic parsers from conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Artzi, Y. and Zettlemoyer, L. S. (2013). Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1.

Ballesteros, M., Dyer, C., and Smith, N. A. (2015). Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., and Schneider, N. (2013). Abstract Meaning Representation for sembanking. In Proceedings of the Linguistic Annotation Workshop.

Berant, J. and Liang, P. (2015). Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics, 3.

Cai, S. and Knight, K. (2013). Smatch: an evaluation metric for semantic feature structures. In Proceedings of the Conference of the Association of Computational Linguistics.

Chen, D. and Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Collins, M. and Roark, B. (2004). Incremental parsing with the perceptron algorithm. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research.

Dyer, C., Ballesteros, M., Ling, W., Matthews, A., and Smith, N. A. (2015). Transition-based dependency parsing with stack long short-term memory. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Flanigan, J., Thomson, S., Carbonell, J., Dyer, C., and Smith, N. A. (2014). A discriminative graph-based parser for the Abstract Meaning Representation. In Proceedings of the Conference of the Association of Computational Linguistics.

Garg, S., Galstyan, A., Hermjakob, U., and Marcu, D. (2016). Extracting biomolecular interactions using semantic parsing of biomedical text. In Proceedings of the Conference on Artificial Intelligence.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics.

Goldberg, Y. and Nivre, J. (2013). Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1.

Hockenmaier, J. and Steedman, M. (2007). CCGBank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics.

Kim, J. and Mooney, R. J. (2012). Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kiperwasser, E. and Goldberg, Y. (2016). Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4.

Kushman, N. and Barzilay, R. (2013). Using semantic unification to generate regular expressions from natural language. In Proceedings of the Human Language Technology Conference of the North American Association for Computational Linguistics.

Kwiatkowski, T., Zettlemoyer, L. S., Goldwater, S., and Steedman, M. (2010). Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kwiatkowski, T., Zettlemoyer, L. S., Goldwater, S., and Steedman, M. (2011). Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Lewis, M. and Steedman, M. (2014). A* CCG parsing with a supertag-factored model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Liang, P., Jordan, M., and Klein, D. (2009). Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.

Liu, F., Flanigan, J., Thomson, S., Sadeh, N., and Smith, N. A. (2015). Toward abstractive summarization using semantic representations. In Proceedings of the North American Association for Computational Linguistics.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Matuszek, C., FitzGerald, N., Zettlemoyer, L. S., Bo, L., and Fox, D. (2012). A joint model of language and perception for grounded attribute learning. In Proceedings of the International Conference on Machine Learning.

Pan, X., Cassidy, T., Hermjakob, U., Ji, H., and Knight, K. (2015). Unsupervised entity linking with Abstract Meaning Representation. In Proceedings of the North American Association for Computational Linguistics.

Pust, M., Hermjakob, U., Knight, K., Marcu, D., and May, J. (2015). Parsing English into Abstract Meaning Representation using syntax-based machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Ratinov, L. and Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the Conference on Computational Natural Language Learning.

Sawai, Y., Shindo, H., and Matsumoto, Y. (2015). Semantic structure analysis of noun phrases using Abstract Meaning Representation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Steedman, M. (1996). Surface Structure and Interpretation. The MIT Press.

Steedman, M. (2000). The Syntactic Process. The MIT Press.

Wang, C., Xue, N., and Pradhan, S. (2015a). Boosting transition-based AMR parsing with refined actions and auxiliary analyzers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Wang, C., Xue, N., and Pradhan, S. (2015b). A transition-based algorithm for AMR parsing. In Proceedings of the North American Association for Computational Linguistics.

Weiss, D., Alberti, C., Collins, M., and Petrov, S. (2015). Structured training for neural network transition-based parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Zettlemoyer, L. S. and Collins, M. (2005). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Zettlemoyer, L. S. and Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.