Neural Shift-Reduce CCG Semantic Parsing

Dipendra K. Misra and Yoav Artzi
Department of Computer Science and Cornell Tech, Cornell University
New York, NY 10011
{dkm,yoav}@cs.cornell.edu

Abstract

We present a shift-reduce CCG semantic parser. Our parser uses a neural network architecture that balances model capacity and computational cost. We train by transferring a model from a computationally expensive log-linear CKY parser. Our learner addresses two challenges: selecting the best parse for learning when the CKY parser generates multiple correct trees, and learning from partial derivations when the CKY parser fails to parse. We evaluate on AMR parsing. Our parser performs comparably to the CKY parser, while doing significantly fewer operations. We also present results for greedy semantic parsing with a relatively small drop in performance.

1 Introduction

Shift-reduce parsing is a class of parsing methods that guarantees a linear number of operations in sentence length. This is a desired property for practical applications that require processing large amounts of text or real-time response. Recently, such techniques were used to build state-of-the-art syntactic parsers, and have demonstrated the effectiveness of deep neural architectures for decision making in linear-time dependency parsing (Chen and Manning, 2014; Dyer et al., 2015; Andor et al., 2016; Kiperwasser and Goldberg, 2016). In contrast, semantic parsing often relies on parsing algorithms with a polynomial number of operations, which results in slow parsing times unsuitable for practical applications. In this paper, we apply shift-reduce parsing to semantic parsing. Specifically, we study transferring a learned Combinatory Categorial Grammar (CCG; Steedman, 1996, 2000) from a dynamic-programming CKY model to a shift-reduce neural network architecture.

We focus on the feed-forward architecture of Chen and Manning (2014), where each parsing step is a multi-class classification problem. The state of the parser is represented using simple feature embeddings that are passed through a multilayer perceptron to select the next action. While simple, the capacity of this model to capture interactions between primitive features, instead of relying on sparse complex features, has led to new state-of-the-art performance (Andor et al., 2016). However, applying this architecture to semantic parsing presents learning and inference challenges.

In contrast to dependency parsing, semantic parsing corpora include sentences labeled with the system response or the target formal representation, and omit derivation information. CCG induction from such data relies on latent-variable techniques and requires careful initialization (e.g., Zettlemoyer and Collins, 2005, 2007). Such feature initialization does not directly transfer to a neural network architecture with dense embeddings, and the use of hidden layers further complicates learning by adding a large number of latent variables. We focus on data that includes sentence-representation pairs, and learn from a previously induced log-linear CKY parser. This drastically simplifies learning, and can be viewed as bootstrapping a fast parser from a slow one. While this dramatically narrows down the number of parses per sentence, it does not eliminate ambiguity. In our experiments, we often get multiple correct parses, up to 49K in some cases. We also observe that the CKY parser generates no parses for a significant number of training sentences. Therefore, we propose an iterative algorithm that automatically selects the best parses for training at each iteration, and identifies partial derivations for best-effort learning, if no parses are available.

[Figure 1: Example CCG tree for the sentence Some old networks remain inoperable, with five lexical entries, three forward applications (>) and a backward application (<).]

CCG parsing largely relies on two types of actions: using a lexicon to map words to their categories, and combining categories to acquire the categories of larger phrases. In most semantic parsing approaches, the number of operations is dominated by the large number of categories available for each word in the lexicon. For example, the lexicon in our experiments includes 1.7M entries, resulting in an average of 146, and up to 2K, applicable actions. Additionally, both operations and parser state have complex structures, for example including both syntactic and semantic information. Therefore, unlike in dependency parsing (Chen and Manning, 2014), we can not treat action selection as multi-class classification, and must design an architecture that can accommodate a varying number of actions. We present a network architecture that considers a variable number of actions, and emphasizes low computational overhead per action, instead focusing computation on representing the parser state.

We evaluate on Abstract Meaning Representation (AMR; Banarescu et al., 2013) parsing. We demonstrate that our modeling and learning contributions are crucial to effectively commit to early decisions during parsing. Somewhat surprisingly, our shift-reduce parser provides equivalent performance to the CKY parser used to generate the training data, despite requiring significantly fewer operations, on average two orders of magnitude less. Similar to previous work, we use beam search, but also, for the first time, report greedy CCG semantic parsing results at a relatively modest 9% decrease in performance, while the source CKY parser with a beam of one demonstrates a 71% decrease. While we focus on semantic parsing, our learning approach makes no task-specific assumptions and has potential for learning efficient models from the output of more expensive ones.¹

¹ The source code and pre-trained models are available at http://www.cs.cornell.edu/~dkm/ccgparser.

2 Task and Background

Our goal is to learn a function that, given a sentence x, maps it to a formal representation of its meaning z with a linear number of operations in the length of x. We assume access to a training set of N examples D = {(x^(i), z^(i))}_{i=1}^N, each containing a sentence x^(i) and a logical form z^(i). Since D does not contain complete derivations, we instead assume access to a CKY parser learned from the same data. We evaluate performance on a test set {(x^(i), z^(i))}_{i=1}^M of M sentences x^(i) labeled with logical forms z^(i). While we describe our approach in general terms, we apply our approach to AMR parsing and evaluate on a common benchmark (Section 6).

To map sentences to logical forms, we use CCG, a linguistically-motivated grammar formalism for modeling a wide range of syntactic and semantic phenomena (Steedman, 1996, 2000). A CCG is defined by a lexicon Λ and sets of unary R_u and binary R_b rules. In CCG parse trees, each node is a category. Figure 1 shows a CCG tree for the sentence Some old networks remain inoperable. For example, S\NP[pl]/(N[pl]/N[pl]) : λf.λx.f(λr.remain-01(r) ∧ ARG1(r, x)) is the category of the verb remain. The syntactic type S\NP[pl]/(N[pl]/N[pl]) indicates that two arguments are expected: first an adjective N[pl]/N[pl] and then a plural noun phrase NP[pl]. The final syntactic type will be S. The forward slash / indicates the argument is expected on the right, and the backward slash \ indicates it is expected on the left. The syntactic attribute pl is used to express the plurality constraint of the verb. The simply-typed lambda calculus logical form in the category represents semantic meaning. The typing system includes atomic types (e.g., entity e, truth value t) and functional types (e.g., ⟨e, t⟩ is the type of a function from e to t). In the example category above, the expression on the right of the colon is a ⟨⟨⟨e, t⟩, ⟨e, t⟩⟩, ⟨e, ⟨e, t⟩⟩⟩-typed function expecting first an adjectival modifier and then an ARG1 modifier. The conjunction ∧ specifies the roles of remain-01. The lexicon Λ maps words to CCG categories. For example, the lexical entry remain ⊢ S\NP[pl]/(N[pl]/N[pl]) : λf.λx.f(λr.remain-01(r) ∧ ARG1(r, x)) pairs the example category with remain. The parse tree in the figure includes four binary operations: three forward applications (>) and a backward application (<).
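As a worked example, consider the forward application (>) in Figure 1 that combines the category of remain with the adjectival category of inoperable, N[x]/N[x]. Abbreviating the adjective's ARG3 argument A(λp.possible(p) ∧ polarity(p, −) ∧ domain(p, A(λo.operate-01(o)))) as P, the application reduces as follows:

```latex
% Forward application (>): remain, S\NP[pl]/(N[pl]/N[pl]), applied to the
% adjectival category of inoperable, N[x]/N[x], yielding the S\NP[pl] node.
\begin{align*}
&\big(\lambda f.\lambda x.\, f(\lambda r.\,\text{remain-01}(r) \wedge \text{ARG1}(r, x))\big)\,
 \big(\lambda f.\lambda x'.\, f(x') \wedge \text{ARG3}(x', P)\big)\\
&\quad = \lambda x.\,\big(\lambda f.\lambda x'.\, f(x') \wedge \text{ARG3}(x', P)\big)\,
 \big(\lambda r.\,\text{remain-01}(r) \wedge \text{ARG1}(r, x)\big)\\
&\quad = \lambda x.\lambda r.\,\text{remain-01}(r) \wedge \text{ARG1}(r, x) \wedge \text{ARG3}(r, P)
\end{align*}
```

The resulting category is S\NP[pl], matching the corresponding node in Figure 1; it then combines with the noun phrase some old networks by backward application (<) to yield the sentence-level logical form.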
3 Neural Shift-Reduce Semantic Parsing

Given a sentence x = ⟨x_1, . . . , x_m⟩ with m tokens x_i and a CCG lexicon Λ, let GEN(x; Λ) be a function that generates CCG parse trees. We design GEN as a shift-reduce parser, and score decisions using embeddings of parser states and candidate actions.

3.1 Shift-Reduce Parsing

Shift-reduce parsers perform a single pass of the sentence from left to right to construct a parse tree.² The parser configuration is defined with a stack and a buffer. The stack contains partial parse trees, and the buffer the remainder of the sentence to be processed. Formally, a parser configuration c is a tuple ⟨σ, β⟩, where the stack σ is a list of CCG trees [s_l ··· s_1], and the buffer β is a list of tokens from x to be processed [x_i ··· x_m].³ For example, the top-left of Figure 2 shows a parsing configuration with two partial trees on the stack and two words on the buffer (remain and inoperable).

² We use the terms parser configuration and parser state interchangeably.
³ The head of the stack σ is the right-most entry, and the head of the buffer β is the left-most entry.

Parsing starts with the configuration ⟨[], [x_1 ··· x_m]⟩, where the stack is empty and the buffer is initialized with x. In each parsing step, the parser either consumes a word from the buffer and pushes a new tree to the stack, or applies a parsing rule to the trees at the top of the stack. For simplicity, we apply CCG rules to trees, where a rule is applied to the root categories of the argument trees to create a new tree with the arguments as children. We treat lexical entries as trees with a single node. There are three types of actions:⁴

  SHIFT(l, ⟨σ, x_i | ··· | x_j | β⟩) = ⟨σ | g, β⟩
  BINARY(b, ⟨σ | s_2 | s_1, β⟩) = ⟨σ | b(s_2, s_1), β⟩
  UNARY(u, ⟨σ | s_1, β⟩) = ⟨σ | u(s_1), β⟩ ,

where b ∈ R_b is a binary rule, u ∈ R_u is a unary rule, and l is a lexical entry x_i, . . . , x_j ⊢ g for the tokens x_i, . . . , x_j and CCG category g. SHIFT creates a tree given a lexical entry for the words at the top of the buffer, BINARY applies a binary rule to the two trees at the head of the stack, and UNARY applies a unary rule to the tree at the head of the stack. A configuration is terminal when no action is applicable.

⁴ We follow the standard notation of L|x indicating a list with all the entries from L and x as the right-most element.

Given a sentence x, a derivation is a sequence of action-configuration pairs ⟨⟨c_1, a_1⟩, . . . , ⟨c_k, a_k⟩⟩, where action a_i is applied to configuration c_i to generate configuration c_{i+1}. The result configuration c_{k+1} is of the form ⟨[s], []⟩, where s represents a complete parse tree, and the logical form z at the root category represents the meaning of the complete sentence. Following previous work with CKY for CCG parsing (Zettlemoyer and Collins, 2005), we disallow consecutive unary actions. We denote the set of actions allowed from configuration c as A(c).
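The transition system can be made concrete with a short sketch. The code below is illustrative rather than the authors' implementation: CCGTree, Configuration, and the rule callables are simplified stand-ins, and lexical lookup and rule applicability checks are elided.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class CCGTree:
    # Simplified stand-in for a CCG tree: a root category, a logical form,
    # and child trees (lexical entries are trees with a single node).
    category: str
    logical_form: str
    children: Tuple["CCGTree", ...] = ()

@dataclass
class Configuration:
    stack: List[CCGTree]   # partial parse trees; the head is the right-most entry
    buffer: List[str]      # tokens of x that remain to be processed

def shift(c: Configuration, entry: CCGTree, span_len: int) -> Configuration:
    # SHIFT(l, <sigma, x_i|...|x_j|beta>) = <sigma|g, beta>: push a single-node
    # tree built from a lexical entry for the next span_len buffer tokens.
    return Configuration(c.stack + [entry], c.buffer[span_len:])

def binary(c: Configuration, b: Callable[[CCGTree, CCGTree], CCGTree]) -> Configuration:
    # BINARY(b, <sigma|s2|s1, beta>) = <sigma|b(s2, s1), beta>.
    s2, s1 = c.stack[-2], c.stack[-1]
    return Configuration(c.stack[:-2] + [b(s2, s1)], c.buffer)

def unary(c: Configuration, u: Callable[[CCGTree], CCGTree]) -> Configuration:
    # UNARY(u, <sigma|s1, beta>) = <sigma|u(s1), beta>.
    return Configuration(c.stack[:-1] + [u(c.stack[-1])], c.buffer)
```

Parsing starts from Configuration([], list(x)) and proceeds until a terminal configuration is reached; a derivation records the sequence of applied configuration-action pairs.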
3.2 Model

Our goal is to balance computation and model capacity. To recover a rich representation of the configuration, we use a multilayer perceptron (MLP) to create expressive interactions between a small number of simple features. However, since we consider many possible actions in each step, computing activations for multiple hidden layers for each action is prohibitively expensive. Instead, we opt for a computationally-inexpensive action representation computed by concatenating feature embeddings. Figure 2 illustrates our architecture.

Given a configuration c, the probability of an action a is:

  p(a | c) = exp{φ(a, c) W_b F(ξ(c))} / Σ_{a′ ∈ A(c)} exp{φ(a′, c) W_b F(ξ(c))} ,

where φ(a, c) is the action embedding, ξ(c) is the configuration embedding, F is an MLP, and W_b is a bilinear transformation matrix. Given a sentence x and a sequence of action-configuration pairs ⟨⟨c_1, a_1⟩, . . . , ⟨c_k, a_k⟩⟩, the probability of a CCG tree y is

  p(y | x) = Π_{i=1...k} p(a_i | c_i) .

The probability of a logical form z is then

  p(z | x) = Σ_{y ∈ Y(z)} p(y | x) ,

where Y(z) is the set of CCG trees with the logical form z at the root.
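The factored score can be written directly from the equations above. The following sketch assumes the configuration embedding ξ(c) and the per-action embeddings φ(a, c) are already assembled as NumPy vectors; dimensions and parameter names are illustrative.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def config_representation(xi_c, W1, b1, W2, b2, W3, b3):
    # F(xi(c)): two ReLU hidden layers followed by a linear layer that reduces
    # the dimensionality of the configuration representation (and hence of W_b).
    h1 = relu(W1 @ xi_c + b1)
    h2 = relu(W2 @ h1 + b2)
    return W3 @ h2 + b3

def action_distribution(phi_actions, xi_c, W_b, mlp_params):
    # p(a | c) proportional to exp{ phi(a, c) W_b F(xi(c)) }, normalized over
    # the variable-size action set A(c).
    h = config_representation(xi_c, *mlp_params)
    scores = np.array([phi @ W_b @ h for phi in phi_actions])
    scores -= scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()
```

Because F(ξ(c)) is computed once per configuration and each action only contributes a cheap bilinear term, the per-action cost stays low even when A(c) is large.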
[Figure 2: Illustration of scoring the next action given the configuration c when parsing the sentence Some old networks remain inoperable. Embeddings of the same feature type are colored the same. The configuration embedding ξ(c) is a concatenation of syntax embeddings (green) and the logical form embedding (blue; computed by ψ) for the top entries in the stack. We then pass ξ(c) through the MLP F. Given the actions A(c), we compute the embeddings φ(a_i, c), i = 1 . . . |A(c)|. The actions and MLP representation are combined with a bilinear softmax layer. The number of concatenated vectors and stack elements used is for illustration. The details are described in Section 3.2.]

MLP Architecture  We use an MLP F with two hidden layers parameterized by {W_1, W_2, b_1, b_2} with a ReLU non-linearity (Glorot et al., 2011). Since the output of F influences the dimensionality of W_b, we add a linear layer parameterized by W_3 and b_3 to reduce the dimensionality of the configuration representation, thereby reducing the dimensionality of W_b.

Configuration Embedding ξ(c)  Given a configuration c = ⟨[s_l ··· s_1], [x_i ··· x_m]⟩, the input to F is a concatenation of syntactic and semantic embeddings, as illustrated in Figure 2. We concatenate embeddings from the top three trees in the stack s_1, s_2, s_3.⁵ When a feature is not present, for example when the stack or buffer are too small, we use a tunable null embedding.

⁵ For simplicity, the figure shows only the top two trees.

Given a tree on the stack s_j, we define two syntactic features: attribute set and stripped syntax. The attribute feature is created by extracting all the syntactic attributes of the root category of s_j. The stripped syntax feature is the syntax of the root category without the syntactic attributes. For example, in Figure 2, we embed the stripped category N and attribute pl for s_1, and NP/N and x for s_2. The attributes are separated from the syntax to reduce sparsity, and the interaction between them is computed by F. The sparse features are converted to dense embeddings using a lookup table and concatenated.

In addition, we also embed the logical form at the root of s_j. Figure 3 illustrates the recursive embedding function ψ.⁶ Using a recursive function to embed logical forms is computationally intensive. Due to the strong correlation between sentence length and logical form complexity, this computation increases the cost of configuration embedding by a factor linear in sentence length. In Section 6, we experiment with including this option, balancing between potential expressivity and speed.

⁶ The algorithm is provided in the supplementary material.

[Figure 3: Illustration of embedding the logical form λx.arg0(x, JOHN) with the recursive embedding function ψ. In each level in ψ, the children nodes are combined with a single-layer neural network parameterized by W_r, δ_r, and the tanh activation function. Computed embeddings are in dark gray, and embeddings from lookup tables are in light gray. Constants are embedded by combining name and type embeddings, literals are unrolled to binary recursive structures, and lambda terms are combinations of variable type and body embeddings. For example, JOHN is embedded by combining the embeddings of its name and type, the literal arg0(x, JOHN) is recursively embedded by first embedding the arguments (x, JOHN) and then combining the predicate, and the lambda term is embedded to create the embedding of the entire logical form.]
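Following the description of ψ in Figure 3, the recursive embedding can be sketched as below. The nested-tuple encoding of logical forms and the use of concatenation before the single-layer combination are assumptions made for illustration only.

```python
import numpy as np

def combine(children, W_r, delta_r):
    # One level of psi: combine child embeddings with a single-layer network
    # parameterized by W_r, delta_r and a tanh non-linearity.
    return np.tanh(W_r @ np.concatenate(children) + delta_r)

def psi(expr, lookup, W_r, delta_r):
    # Illustrative encoding of logical forms:
    #   ("const", name, type)       constants: combine name and type embeddings
    #   ("var", type)               variables: embedded by their type
    #   ("literal", pred, a1, a2)   literals: unrolled to binary structures
    #   ("lambda", var_type, body)  lambda terms: variable type + body
    kind = expr[0]
    if kind == "const":
        _, name, typ = expr
        return combine([lookup[name], lookup[typ]], W_r, delta_r)
    if kind == "var":
        return lookup[expr[1]]
    if kind == "literal":
        _, pred, a1, a2 = expr
        args = combine([psi(a1, lookup, W_r, delta_r),
                        psi(a2, lookup, W_r, delta_r)], W_r, delta_r)
        return combine([lookup[pred], args], W_r, delta_r)
    if kind == "lambda":
        _, var_type, body = expr
        return combine([lookup[var_type],
                        psi(body, lookup, W_r, delta_r)], W_r, delta_r)
    raise ValueError(f"unknown expression kind: {kind}")

# The logical form of Figure 3, lambda x. arg0(x, JOHN), in this encoding:
john = ("const", "JOHN", "e")
example = ("lambda", "e", ("literal", "arg0", ("var", "e"), john))
```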

Action Embedding φ(a, c)  Given an action a ∈ A(c) and the configuration c, we generate the action representation by computing sparse features, converting them to dense embeddings via table lookup, and concatenating. If more than one feature of the same type is triggered, we average their embeddings. When no features of a given type are triggered, we use a tunable placeholder embedding instead. The features include all the features used by Artzi et al. (2015), including all conjunctive features, as well as properties of the action and configuration, such as the POS tags of tokens on the buffer.⁷

⁷ See the supplementary material for feature details.
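A sketch of φ(a, c) under this scheme is shown below; the feature extraction itself is abstracted away, and the grouping of triggered values by feature type is an assumed input format.

```python
import numpy as np

def action_embedding(triggered, lookup, placeholder, feature_types):
    # phi(a, c): for every feature type, average the embeddings of the
    # triggered sparse feature values (table lookup), fall back to a tunable
    # placeholder when none fire, and concatenate across feature types.
    parts = []
    for ftype in feature_types:
        values = triggered.get(ftype, [])
        if values:
            parts.append(np.mean([lookup[ftype][v] for v in values], axis=0))
        else:
            parts.append(placeholder[ftype])
    return np.concatenate(parts)
```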

Discussion  Our use of an MLP is inspired by Chen and Manning (2014). However, their architecture is designed to handle only a fixed number of actions, while we observe a varying number of actions. Therefore, we adopt a probabilistic model similar to Dyer et al. (2015) to effectively combine the benefits of the two approaches.⁸ We factorize the exponent in our objective into action φ(a, c) and configuration embeddings. While every parse step involves a single configuration, the number of actions is significantly higher. With the goal of minimizing the amount of computation per action, we use simple concatenation only for the action embedding. However, this requires retaining sparse conjunctive action features, since they are never combined through hidden layers like the configuration features.

⁸ We experimented with an LSTM parser similar to Dyer et al. (2015). However, performance was not competitive. This direction remains an important avenue for future work.

3.3 Inference

To compute the set of parse trees GEN(x; Λ), we perform beam search to recover the top-k parses. The beam contains configurations. At each step, we expand all configurations with all actions, and keep only the top-k new configurations. To promote diversity in the beam, given two configurations with the same signature, we keep only the highest scoring one. The signature includes the previous configuration in the derivation, the state of the buffer, and the root categories of all stack elements. Since all features are computed from these elements, this optimization does not affect the max-scoring tree. Additionally, since words are assigned structured categories, a key problem is unknown words or word uses. Following Zettlemoyer and Collins (2007), we use a two-pass parsing strategy, and allow skipping words controlled by the term γ in the second pass. The term γ is added to the exponent of the action probability when words are skipped. See the supplementary material for the exact form.
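The inference procedure can be sketched as a beam over configurations with signature-based pruning. Action generation, action application, and scoring are passed in as callables with illustrative names; the two-pass word-skipping mechanism is omitted.

```python
def signature(prev_config, config):
    # Beam-diversity signature: the previous configuration in the derivation,
    # the state of the buffer, and the root categories of all stack elements.
    return (id(prev_config),
            tuple(config.buffer),
            tuple(t.category for t in config.stack))

def beam_search(initial_config, actions, apply_action, log_p_action, k=512):
    beam = [(0.0, None, initial_config)]   # (log probability, parent, configuration)
    finished = []
    while beam:
        expansions = {}
        for score, _, config in beam:
            applicable = actions(config)
            if not applicable:             # terminal configuration
                finished.append((score, config))
                continue
            for a in applicable:
                new_config = apply_action(config, a)
                new_score = score + log_p_action(a, config)
                sig = signature(config, new_config)
                # Keep only the highest-scoring configuration per signature.
                if sig not in expansions or expansions[sig][0] < new_score:
                    expansions[sig] = (new_score, config, new_config)
        # Keep the top-k expanded configurations.
        beam = sorted(expansions.values(), key=lambda e: e[0], reverse=True)[:k]
    return sorted(finished, key=lambda e: e[0], reverse=True)
```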
Complexity Analysis  The shift-reduce parser processes the sentence from left to right with a linear number of operations in sentence length. We define an operation as applying an action to a configuration. Formally, the number of operations for a sentence of length m is bounded by O(4mk(|λ| + |R_b| + |R_u|)), where |λ| is the number of lexical entries per token, k is the beam size, R_b is the set of binary rules, and R_u the set of unary rules. In comparison, the number of operations for the CKY parser, where an operation is applying a rule to a single cell or two adjacent cells in the chart, is bounded by O(m³|λ| + m²k|R_b| + m²|R_b||R_u|). For sentence length 25, the mean in our experiments, the shift-reduce parser performs 100 times fewer operations. See the supplementary material for the full analysis.

4 Learning

We assume access to a training set of N examples D = {(x^(i), z^(i))}_{i=1}^N, each containing a sentence x^(i) and a logical form z^(i). The data does not include information about the lexical entries and CCG parsing operations required to construct the correct derivations. We bootstrap this information from a learned parser. In our experiments we use a learned dynamic-programming CKY parser. We transfer the lexicon Λ directly from the input parser, and focus on estimating the parameters θ, which include feature embeddings, hidden layer matrices, and bias terms. The main challenge is learning from the noisy supervision provided by the input parser. In our experiments, the CKY parser fails to correctly parse 40% of the training data, and returns on average 147 max-scoring correct derivations for the rest. We propose an iterative algorithm that treats the choice between multiple parse trees as latent, and effectively learns from partial analysis when no correct derivation is available.

The learning algorithm (Algorithm 1) starts by processing the data using the CKY parser (lines 3-4). For each sentence x^(i), we collect the max-scoring CCG trees with z^(i) at the root. The CKY parser often returns many correct parses with identical scores, up to 49K parses per sentence. Therefore, we randomly sample and keep up to 1K trees.

Algorithm 1  The learning algorithm.
Input: Training set D = {(x^(i), z^(i))}_{i=1}^N, learning rate µ, regularization parameter ℓ_2, and number of iterations T.
Definitions: GENMAXCKY(x, z) returns the set of max-scoring CKY parses for x with z at the root. SCORE(y, θ) scores a tree y according to the parameters θ (Section 3.2). CONFGEN(x, y) is the sequence of action-configuration pairs that generates y given x (Section 3.1). BP(∇J) takes the objective J and back-propagates the error through the computation graph for the sample used to compute the objective. ADAGRAD(∆) applies a per-feature learning rate to the gradient ∆ (Duchi et al., 2011).
Output: Model parameters θ.
 1: » Get trees from CKY parser.
 2: Y ← []
 3: for i = 1 to N do
 4:   Y[i] ← GENMAXCKY(x^(i), z^(i))
 5: for t = 1 to T do
 6:   » Pick max-scoring trees and create action dataset.
 7:   D_A ← ∅
 8:   for i = 1 to N do
 9:     if Y[i] ≠ ∅ then
10:       A ← CONFGEN(x^(i),
11:                   arg max_{y ∈ Y[i]} SCORE(y, θ))
12:       for ⟨c, a⟩ ∈ A do
13:         D_A ← D_A ∪ {⟨c, a⟩}
14:   » Back-propagate the loss through the network.
15:   for ⟨c, a⟩ ∈ D_A do
16:     J = −log p(a | c) + (ℓ_2/2) θᵀθ
17:     ∆ ← BP(∇J)
18:     θ ← θ − µ ADAGRAD(∆)
19: return θ
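Algorithm 1 amounts to the following loop. The sketch assumes the helpers named in the algorithm (GENMAXCKY, CONFGEN, SCORE) and a differentiable log p(a | c) exposed through a gradient callable; the AdaGrad update is written out explicitly for clarity.

```python
import numpy as np

def train(data, gen_max_cky, confgen, score, grad_neg_log_p, theta,
          T=3, mu=0.05, l2=1e-6):
    # Lines 1-4: parse each training sentence once with the CKY parser and
    # keep the max-scoring correct trees (sampled down to at most 1K).
    Y = [gen_max_cky(x, z) for x, z in data]
    history = {name: np.zeros_like(p) for name, p in theta.items()}  # AdaGrad state

    for _ in range(T):
        # Lines 6-13: choose the max-scoring tree under the current parameters
        # and collect its configuration-action pairs into D_A.
        D_A = []
        for (x, _), trees in zip(data, Y):
            if trees:
                best = max(trees, key=lambda y: score(y, theta))
                D_A.extend(confgen(x, best))

        # Lines 14-18: per-sample updates on the l2-regularized negative
        # log-likelihood J = -log p(a|c) + (l2/2) * theta^T theta.
        for c, a in D_A:
            grads = grad_neg_log_p(a, c, theta)   # gradient of -log p(a|c)
            for name, g in grads.items():
                delta = g + l2 * theta[name]      # gradient of J w.r.t. theta[name]
                history[name] += delta ** 2
                theta[name] -= mu * delta / (np.sqrt(history[name]) + 1e-8)
    return theta
```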
This process is done once, and the algorithm then runs for T iterations. At each iteration, given the sets of parses from the CKY parser Y, we select the max-probability parse according to our current parameters θ (line 10) and add all the shift-reduce decisions from this parse to D_A (line 12), the action data set that we use to estimate the parameters. We approximate the arg max with beam search using an oracle computed from the CKY parses.⁹ CONFGEN aggregates the configuration-action pairs from the highest scoring derivation. Parse selection depends on θ, and this choice will gradually converge as the parameters improve. The action data set is used to compute the ℓ_2-regularized negative log-likelihood objective J (line 16) and back-propagate the error to compute the gradient (line 17). We use AdaGrad (Duchi et al., 2011) to update the parameters θ (line 18).

⁹ Our oracle is non-deterministic and incomplete (Goldberg and Nivre, 2013).

4.1 Learning from Partial Derivations

The input parser often fails to generate correct parses. In our experiments, this occurs for 40% of the training data. In such cases, we can obtain a forest of partial parse trees Y_p. Each partial tree y ∈ Y_p corresponds to a span of tokens in the sentence and is scored by the input parser. In practice, the spans are often overlapping. Our goal is to generate high quality configuration-action pairs ⟨c, a⟩ from Y_p. These pairs will be added to D_A for training. While extracting actions a is straightforward, generating configurations c requires reconstructing the stack σ from an incomplete forest of partial trees Y_p. Figure 4 illustrates our proposed process. Let CKYSCORE(y) be the CKY score of the partial tree y. To reconstruct σ, we select non-overlapping partial trees Y that correspond to the entire sentence by solving arg max_{Y ⊆ Y_p} Σ_{y ∈ Y} CKYSCORE(y) under two constraints: (a) no two trees from Y correspond to overlapping tokens, and (b) for each token in x, there exists y ∈ Y that corresponds to it. We solve the arg max using dynamic programming. The generated set Y approximates an intermediate state of a shift-reduce derivation. However, Y_p often does not contain high quality partial derivations for all spans. To skip low quality partial trees and spans that have no trees, we generate empty trees y_e for every span, where CKYSCORE(y_e) = 0, and add them to Y_p. If the set of selected partial trees Y includes empty trees, we divide the sentence into separate examples and ignore these parts. This results in partial and approximate stack reconstruction. Finally, since Y_p is noisy, we prune from it partial trees with a root that does not match the syntactic type for this span from an automatically generated CCGBank (Hockenmaier and Steedman, 2007) syntactic parse.

[Figure 4: Partial derivation selection for learning (Section 4.1). The dotted triangles represent skipped spans in the sentence, where no high quality partial trees were found. Dark triangles represent the selected partial trees. We identify two contiguous spans, 1-6 and 11-14, and generate two synthetic sentences for training: the tokens are treated as complete sentences, and actions and stack state are generated from the partial trees.]
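The constrained arg max over partial trees reduces to a dynamic program over token positions, sketched below. Candidate trees are assumed to be given as (start, end, score, tree) spans; adding zero-score empty spans for every span (or at least for every single token) guarantees that a full cover exists.

```python
def select_partial_trees(n_tokens, candidates):
    # Select non-overlapping partial trees that together cover tokens [0, n)
    # while maximizing the total CKY score.
    by_end = {}
    for start, end, score, tree in candidates:       # span is [start, end)
        by_end.setdefault(end, []).append((start, score, tree))

    NEG_INF = float("-inf")
    best = [NEG_INF] * (n_tokens + 1)                # best[i]: best cover of [0, i)
    back = [None] * (n_tokens + 1)
    best[0] = 0.0
    for end in range(1, n_tokens + 1):
        for start, score, tree in by_end.get(end, []):
            if best[start] > NEG_INF and best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, tree)

    # Follow back-pointers to recover the selected trees, left to right.
    selected, i = [], n_tokens
    while i > 0:
        start, tree = back[i]
        selected.append(tree)
        i = start
    return list(reversed(selected))
```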
Our complete learning algorithm alternates between epochs of learning with complete parse trees and learning with partial derivations. In epochs where we use partial derivations, we use a modified version of Algorithm 1, where lines 9-10 are updated to use the above process.

5 Related Work

Our approach is inspired by recent results in dependency parsing, specifically by the architecture of Chen and Manning (2014), which was further developed by Weiss et al. (2015) and Andor et al. (2016). Dyer et al. (2015) proposed to encode the parser state using an LSTM recurrent architecture, which has been shown to generalize well across languages (Ballesteros et al., 2015; Ammar et al., 2016). Our network architecture combines ideas from the two threads: we use feature embeddings and a simple MLP to score actions, while our probability distribution is similar to the LSTM parser.

The majority of CCG approaches for semantic parsing rely on CKY parsing with beam search (e.g., Zettlemoyer and Collins, 2005, 2007; Kwiatkowski et al., 2010, 2011; Artzi and Zettlemoyer, 2011, 2013; Artzi et al., 2014; Matuszek et al., 2012; Kushman and Barzilay, 2013). Semantic parsing with other formalisms also often relied on CKY-style algorithms (e.g., Liang et al., 2009; Kim and Mooney, 2012). With a similar goal to ours, Berant and Liang (2015) designed an agenda-based parser. In contrast, we focus on a method with a guaranteed linear number of operations.

Following the work of Collins and Roark (2004) on learning for syntactic parsers, Artzi et al. (2015) proposed an early update procedure for inducing CCG grammars with a CKY parser. Our partial derivations learning method generalizes this method to parsers with global features.

6 Experimental Setup

Task and Data  We evaluate on AMR parsing with CCG. AMR is a general-purpose meaning representation, which has been used in multiple tasks (Pan et al., 2015; Liu et al., 2015; Sawai et al., 2015; Garg et al., 2016). We use the newswire portion of the AMR Bank 1.0 release (LDC2014T12), which displays some of the fundamental challenges in semantic parsing, including long newswire sentences with a broad array of syntactic and semantic phenomena. We follow the standard train/dev/test split of 6603/826/823 sentences. We evaluate with the SMATCH metric (Cai and Knight, 2013). Our parser is incorporated into the two-stage approach of Artzi et al. (2015). The approach includes a bi-directional and deterministic conversion between AMR and lambda calculus. Distant references, for example those introduced by pronouns, are represented using Skolem IDs, globally-scoped existentially-quantified unique IDs. A derivation includes a CCG tree, which maps the sentence to an underspecified logical form, and a constant mapping, which maps underspecified elements to their fully specified form. The key to the approach is the underspecified logical forms, where distant references and most relations are not fully specified, but instead represented as placeholders. Figure 5 shows an example AMR, its lambda calculus conversion, and its underspecified logical form.
[Figure 5: AMR for the sentence the lawyer concluded his arguments late. In Artzi et al. (2015), the AMR (left) is deterministically converted to the logical form (right). The underspecified logical form is the result of the first stage, CCG parsing, and contains two placeholders (bolded): ID for a reference, and REL for a relation. To generate the final logical form, the second stage resolves ID to the identifier of the lawyer (2), and REL to the relation time. We focus on a model for the first stage and use an existing model for the second stage.
  AMR: (c/conclude-02 :ARG0 (l/lawyer) :ARG1 (a/argument :poss l) :time (l2/late))
  Underspecified logical form: A1(λc.conclude-02(c) ∧ ARG0(c, A2(λl.lawyer(l))) ∧ ARG1(c, A3(λa.argument(a) ∧ poss(a, R(ID)))) ∧ REL(c, A4(λl2.late(l2))))
  Logical form: A1(λc.conclude-02(c) ∧ ARG0(c, A2(λl.lawyer(l))) ∧ ARG1(c, A3(λa.argument(a) ∧ poss(a, R(2)))) ∧ time(c, A4(λl2.late(l2))))]

Artzi et al. (2015) use a CKY parser to identify the best CCG tree, and a factor graph for the second stage. We integrate our shift-reduce parser into the two-stage setup by replacing the CKY parser. We use the same CCG configuration and integrate our parser into the joint probabilistic model. Formally, given a sentence x, the probability of an AMR logical form z is

  p(z | x) = Σ_u Σ_{y ∈ Y(u)} p(z | u, x) p(y | x) ,

where u is an underspecified logical form and Y(u) is the set of CCG trees with u at the root. We use our shift-reduce parser to compute p(y | x) and use the pre-trained model from Artzi et al. (2015) for p(z | u, x). Following Artzi et al. (2015), we disallow configurations that will not result in a valid AMR, and design a heuristic post-processing technique to recover a single logical form from terminal configurations that include multiple disconnected partial trees on the stack. We use the recovery technique when no complete parses are available.

Tools  We evaluate with the SMATCH metric (Cai and Knight, 2013). We use EasyCCG (Lewis and Steedman, 2014) for CCGBank categories (Section 4.1). We implement our system using Cornell SPF (Artzi, 2016) and the Deeplearning4j library.¹⁰ The setup of Artzi et al. (2015) also includes the Illinois NER (Ratinov and Roth, 2009) and the Stanford CoreNLP POS Tagger (Manning et al., 2014).

¹⁰ http://deeplearning4j.org/

Parameters and Initialization  We minimize our loss on a held-out 10% of the training data to tune our parameters, and train the final model on the full data. We set the number of epochs T = 3, regularization coefficient ℓ_2 = 10⁻⁶, learning rate µ = 0.05, and skipping term γ = 1.0. We set the dimensionality of feature embeddings based on the vocabulary size of the feature type. The exact dimensions are listed in the supplementary material. We use 65 ReLU units for h_1 and h_2, and 50 units for h_3. We initialize θ with the initialization scheme of Glorot and Bengio (2010), except the bias term for ReLU layers, which we initialize to 0.1 to increase the number of active units on initialization. During test, we use the vector 0 as the embedding for unseen features. We use a beam of 512 for testing and 2 for CONFGEN (Section 4).

Model Ensemble  For our final results, we marginalize the output over three models M using p(z | x, θ, Λ) = (1/|M|) Σ_{m ∈ M} p(z | m, x, θ, Λ).

7 Results

Table 1: Development SMATCH results.

  Parser                            P     R      F
  CKY (Artzi et al., 2015)          67.2  65.1   66.1
  Greedy CKY                        64.1  11.29  19.19
  SR (complete model)               67.0  63.4   65.3
    w/o semantic embedding          67.1  63.3   65.1
    w/o partial derivation learning 66.0  62.2   64.0
  Ensemble SR (syntax)              68.2  64.1   66.0
  Ensemble SR (syntax, semantics)   68.1  63.9   65.9
  SR with CKY model                 52.5  49.36  50.88

Table 2: Test SMATCH results.¹²

  Parser                    P     R     F
  JAMR¹¹                    67.8  59.2  63.2
  CKY (Artzi et al., 2015)  66.8  65.7  66.3
  Shift Reduce              68.1  64.2  66.1
  Wang et al. (2015a)¹³     72.0  67.0  70.0

¹¹ JAMR results are taken from Artzi et al. (2015).
¹² Pust et al. (2015), Flanigan et al. (2014), and Wang et al. (2015b) report results on different sections of the corpus. These results are not comparable to ours.
¹³ Our goal is to study the effectiveness of our model transfer approach and architecture. Therefore, we avoid using any resources used in Wang et al. (2015b) that are not used in the CKY parser we compare to.

Table 1 shows development results. We trained each model three times and report the best performance. We observed a variance of roughly 0.5 in these runs. We experimented with different features for configuration embedding and with removing learning with partial derivations (Section 4.1). The complete model gives the best single-model performance of 65.3 F1 SMATCH, and we observe the benefits of semantic embeddings and learning from partial derivations. Using partial derivations allowed us to learn 370K more features, 22% of observed embeddings. We also evaluate ensemble performance. We observe an overall improvement in performance. However, with multiple models, the benefit of using semantic embeddings vanishes. This result is encouraging since semantic embeddings can be expensive to compute if the logical form grows with sentence length. We also provide results for running a shift-reduce log-linear parser p(a | c) ∝ exp{wᵀ φ_CKY(a, c)} using the input CKY model. We observe a significant drop in performance, which demonstrates the overall benefit of our architecture.

Figure 6 shows the development performance of our best performing ensemble model for different beam sizes. The performance decays slowly with decreasing beam size. Surprisingly, our greedy parser achieves 59.77 SMATCH F1, while the CKY parser with a beam of 1 achieves only 19.2 SMATCH F1 (Table 1). This allows our parser to trade off a modest drop in accuracy for a significant improvement in runtime.

[Figure 6: The effect of beam size on model performance (x-axis: beam size, 1-600; y-axis: SMATCH F1).]

Table 2 shows the test results using our best performing model (ensemble with syntax features). We compare our approach to the CKY parser of Artzi et al. (2015) and JAMR (Flanigan et al., 2014).¹¹,¹² We also list the results of Wang et al. (2015b), who demonstrated the benefit of auxiliary analyzers and represent the current state of the art.¹³ Our performance is comparable to the CKY parser of Artzi et al. (2015), which we use to bootstrap our system. This demonstrates the ability of our parser to match the performance of a dynamic-programming parser, which executes significantly more operations per sentence.

Finally, Figure 7 shows our parser runtime relative to sentence length. In this analysis, we focus on runtime, and therefore use a single model. We compare two versions of our system, including and excluding semantic embeddings, and the CKY parser of Artzi et al. (2015). We run both parsers with 16 cores and 122GB memory. The shift-reduce parser is three times faster on average, and up to ten times faster on long sentences. Since our parser currently uses CPUs, future work focused on GPU porting is likely to see further improvements.

[Figure 7: Wall time (sec.) as a function of sentence length for the shift-reduce parser with only syntax features (blue), with syntax and semantic features (orange), and the CKY parser of Artzi et al. (2015) (black).]

8 Conclusion

Our parser design emphasizes a balance between model capacity and the ability to combine atomic features against the computational cost of scoring actions. We also design a learning algorithm to transfer learned models and learn neural network models from ambiguous and partial supervision. Our model shares many commonalities with transition-based dependency parsers. This makes it a good starting point to study the effectiveness of other dependency parsing techniques for semantic parsing, for example global normalization (Andor et al., 2016) and bidirectional LSTM feature representations (Kiperwasser and Goldberg, 2016).

Acknowledgments

This research was supported in part by gifts from Google and Amazon. The authors thank Kenton Lee for technical advice, and Adam Gibson and Alex Black of Skymind for help with Deeplearning4j. We also thank Tom Kwiatkowski, Arzoo Katiyar, Tianze Shi, Vlad Niculae, the Cornell NLP Group, and the reviewers for helpful advice.

References

Ammar, W., Mulcaire, G., Ballesteros, M., Dyer, C., and Smith, N. A. (2016). Many languages, one parser. Transactions of the Association for Computational Linguistics.

Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., and Collins, M. (2016). Globally normalized transition-based neural networks. CoRR.

Artzi, Y. (2016). Cornell SPF: Cornell semantic parsing framework. ArXiv e-prints.

Artzi, Y., Das, D., and Petrov, S. (2014). Learning compact lexicons for CCG semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Artzi, Y., Lee, K., and Zettlemoyer, L. (2015). Broad-coverage CCG semantic parsing with AMR. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Artzi, Y. and Zettlemoyer, L. S. (2011). Bootstrapping semantic parsers from conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Artzi, Y. and Zettlemoyer, L. S. (2013). Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1.

Ballesteros, M., Dyer, C., and Smith, N. A. (2015). Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., and Schneider, N. (2013). Abstract Meaning Representation for sembanking. In Proceedings of the Linguistic Annotation Workshop.

Berant, J. and Liang, P. (2015). Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics, 3.

Cai, S. and Knight, K. (2013). Smatch: an evaluation metric for semantic feature structures. In Proceedings of the Conference of the Association of Computational Linguistics.

Chen, D. and Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Collins, M. and Roark, B. (2004). Incremental parsing with the perceptron algorithm. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research.

Dyer, C., Ballesteros, M., Ling, W., Matthews, A., and Smith, N. A. (2015). Transition-based dependency parsing with stack long short-term memory. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Flanigan, J., Thomson, S., Carbonell, J., Dyer, C., and Smith, N. A. (2014). A discriminative graph-based parser for the Abstract Meaning Representation. In Proceedings of the Conference of the Association of Computational Linguistics.

Garg, S., Galstyan, A., Hermjakob, U., and Marcu, D. (2016). Extracting biomolecular interactions using semantic parsing of biomedical text. In Proceedings of the Conference on Artificial Intelligence.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics.

Goldberg, Y. and Nivre, J. (2013). Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1.

Hockenmaier, J. and Steedman, M. (2007). CCGBank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics.

Kim, J. and Mooney, R. J. (2012). Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kiperwasser, E. and Goldberg, Y. (2016). Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4.

Kushman, N. and Barzilay, R. (2013). Using semantic unification to generate regular expressions from natural language. In Proceedings of the Human Language Technology Conference of the North American Association for Computational Linguistics.

Kwiatkowski, T., Zettlemoyer, L. S., Goldwater, S., and Steedman, M. (2010). Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kwiatkowski, T., Zettlemoyer, L. S., Goldwater, S., and Steedman, M. (2011). Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Lewis, M. and Steedman, M. (2014). A* CCG parsing with a supertag-factored model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Liang, P., Jordan, M., and Klein, D. (2009). Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.

Liu, F., Flanigan, J., Thomson, S., Sadeh, N., and Smith, N. A. (2015). Toward abstractive summarization using semantic representations. In Proceedings of the North American Association for Computational Linguistics.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Matuszek, C., FitzGerald, N., Zettlemoyer, L. S., Bo, L., and Fox, D. (2012). A joint model of language and perception for grounded attribute learning. In Proceedings of the International Conference on Machine Learning.

Pan, X., Cassidy, T., Hermjakob, U., Ji, H., and Knight, K. (2015). Unsupervised entity linking with Abstract Meaning Representation. In Proceedings of the North American Association for Computational Linguistics.

Pust, M., Hermjakob, U., Knight, K., Marcu, D., and May, J. (2015). Parsing English into Abstract Meaning Representation using syntax-based machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Ratinov, L. and Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the Conference on Computational Natural Language Learning.

Sawai, Y., Shindo, H., and Matsumoto, Y. (2015). Semantic structure analysis of noun phrases using Abstract Meaning Representation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Steedman, M. (1996). Surface Structure and Interpretation. The MIT Press.

Steedman, M. (2000). The Syntactic Process. The MIT Press.

Wang, C., Xue, N., and Pradhan, S. (2015a). Boosting transition-based AMR parsing with refined actions and auxiliary analyzers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Wang, C., Xue, N., and Pradhan, S. (2015b). A transition-based algorithm for AMR parsing. In Proceedings of the North American Association for Computational Linguistics.

Weiss, D., Alberti, C., Collins, M., and Petrov, S. (2015). Structured training for neural network transition-based parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Zettlemoyer, L. S. and Collins, M. (2005). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Zettlemoyer, L. S. and Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.