Exploring Neural Models for Natural Language into First-Order Logic

Hrituraj Singh ∗ Milan Aggrawal Balaji Krishnamurthy Adobe Research Adobe Systems Adobe Systems

Abstract a knowledge base (such as FreeBase (Bollacker et al., 2008)) for retrieving concise answers (Fur- Semantic parsing is the task of obtaining bach et al., 2010). Such representations can be used machine-interpretable representations from to specify instructions to robots (Artzi and Zettle- natural language text. We consider one such formal representation - First-Order Logic moyer, 2013) or conversational agents (Artzi and (FOL) and explore the capability of neural Zettlemoyer, 2011) for executing desired action(s) models in parsing English sentences to FOL. in an environment. Similarly, natural language We model FOL parsing as a sequence to se- queries are transformed into executable database quence mapping task where given a natural programming language instructions (such as SQL) language sentence, it is encoded into an in- to retrieve or generate correct results in a database termediate representation using an LSTM fol- (Sun et al., 2018; Zhong et al., 2017). lowed by a decoder which sequentially gener- ates the predicates in the corresponding FOL A variety of logical forms and meaning represen- formula. We improve the standard encoder- tations have been proposed for text. These include decoder model by introducing a variable align- graph-based formalisms (Banarescu et al., 2013; ment mechanism that enables it to align vari- Abend and Rappoport, 2013; Oepen et al., 2014; ables across predicates in the predicted FOL. Kollar et al., 2018) where text is represented as We further show the effectiveness of predict- a typed graph. The entities and action events are ing the category of FOL entity - Unary, Binary, represented as nodes with labeled edges depicting Variables and Scoped Entities, at each decoder relations between them. Semantic dependency tree step as an auxiliary task on improving the con- sistency of generated FOL. We perform rigor- (Oepen et al., 2014) is a directed graph depicting ous evaluations and extensive ablations. We the syntactic structure of a sentence in the form of also aim to release our code as well as large modifier relations between its words. AMR (Ab- scale FOL dataset along with models to aid fur- stract Meaning Representation) graphs (Banarescu ther research in logic-based parsing and infer- et al., 2013) use variables to annotate nodes fol- ence in NLP. lowing neo-Davidsonian style (Davidson, 1969). Lambda Dependency-based Compositional Seman- 1 Introduction tics (λ-DCS) (Liang, 2013) was proposed as a for- Semantic parsing aims at mapping natural language mal language adapting Dependency-Based Compo- arXiv:2002.06544v1 [cs.CL] 16 Feb 2020 to structured meaning representations. This en- sitional Semantics (Liang et al., 2013) borrowing ables a machine to understand unstructured text the expressiveness of (Barendregt better which is central to many tasks requiring nat- et al., 1984) but aiming to remove explicit use of ural language understanding such as question an- variables. swering (Berant et al., 2013; Pasupat and Liang, In this work, we focus on first-order logic (FOL) 2015), robot navigation (MacMahon et al., 2006; (Smullyan, 2012) as the language formalism for Artzi and Zettlemoyer, 2013), database querying text. FOL represents entities and actions in natu- (Zelle and Mooney, 1996) etc. For question an- ral language through quantified variables and con- swering, natural language question is converted to sists of functions (called predicates) which take formal semantics which facilitates interaction with variables as arguments. The predicates attach se- This∗ work was done when the author was working as a mantics to variables and express relations between Software Developer at Adobe Systems objects (Blackburn, 2005). For instance, a simple sentence - “a man is eating” can be represented neural approaches owing to complexities in its rep- through FOL as resentation. Since it is one of the first such explo- ration for FOL, we treat the popular sequence to ∃A(∃B(man(A) ∧ eat(B) ∧ agent(B,A))) sequence model coupled with attention mechanism Advanced natural language concepts as in sen- (Bahdanau et al., 2014) as our baseline. We pro- tence “the man and woman are seated facing each pose to disentangle the prediction of different types other” can be expressed as of FOL syntactic entities (unary and binary pred- icates, variables etc) while parsing sentences and ∃A(∃B(∃D(∃C(man(A) ∧ woman(B) ∧ show improvements through performing category seat(D) ∧ subset of(A, C) ∧ type prediction as an auxiliary task. We further subset of(B,C) ∧ theme(D,C) ∧ show major improvements by explicitly constrain- not(∃E(other(E) ∧ not(∃F (face(F ) ∧ ing the decoder to align variables across unary and theme(F,E) ∧ agent(F,C))))))))) binary predicates. This restricts the model to main- where “man” and “woman” are represented to- tain consistency while expressing standalone entity gether through shared variableC and “facing each attributes and relations between them. other” is represented by negating the existence of a Our contributions can be enumerated as: 1) We thing E for which C is not facing E holds true. explore and develop an open domain neural se- The success of learning based neural approaches mantic parser to parse natural language sentences in NLP tasks like (Cho et al., to FOL using Seq2Seq framework; 2) We pro- 2014; Sutskever et al., 2014; Vaswani et al., 2017), pose disentangled FOL entity type prediction along paraphrase generation (Prakash et al., 2016; Gupta with FOL parsing under multi-task learning and et al., 2018), dialog modeling (Vinyals and Le, FOL variable alignment through decoder alignment 2015; Kottur et al., 2017), machine comprehension mechanism. We perform extensive ablation studies (Wang et al., 2017), logical inference (Kim et al., to establish the improvements registered; 3) We 2019) has motivated their use for semantic parsing also aim to release our code, models and large (Kociskˇ y` et al., 2016; Buys and Blunsom, 2017; scale dataset used comprising of sentence-FOL Cheng et al., 2017; Liu et al., 2018; Li et al., 2018) mappings to aid further research in FOL based as well. Many such works use the encoder-decoder NLP. framework to model it as a sequence transduction task. Since they were designed for solving specific 2 Background tasks like , such methods (Jia and Liang, 2016; Dong and Lapata, 2016) have Text to FOL Conversion : In this section, we mainly focussed on confined logical formalism for give a brief overview of syntactic-semantic analy- specific domains such as flight reservation, restau- sis pipeline used for obtaining the mappings data rant booking, etc (Wang et al., 2015) capturing through Boxer (Bos, 2008) based on Combina- limited vocabulary and semantic concepts. tory Categorial Grammar (CCG) (Steedman and In this paper, we aim at developing a general- Baldridge, 2011) and Discourse Representation purpose open-domain neural first-order logic parser Theory (DRT) (Kamp et al., 2011). CCG is phrase- for natural language sentences to examine the ca- level grammar which defines rules for generating pabilities of such models. We train our model constituency-based structures. CCG comprises of by obtaining a large corpus of text-FOL pairs for syntactically typed lexical items such that each sentences in SNLI Dataset (Bowman et al., 2015) item is a lambda-expression and uses combinatorial through C&C parser (Clark and Curran, 2007) logic (lambda calculus) to combine them through and Boxer (Bos, 2008) (discussed later in detail).1 the application of combinators. CCG derivation Apart from meaning depiction, parsing sentences guides semantic composition to obtain Discourse to FOL would enable neural models to capture Representation Structures (DRS) from CCG parses. complex relationships between entities resulting in DRS comprises of discourse referents and condi- richer embeddings which might be useful in several tions defined on them which can be recursive. DRS other NLP tasks. Such an examination would help is capable of representing varied linguistic phe- understand challenges in generating FOL through nomena such as anaphora, presupposition, tense and aspect. These DRSs are compatible and can be 1https://github.com/valeriobasile/candcapi converted to FOL through a set of syntactic trans- formations (Bunt et al., 2001). Formally, predi- on the input sentence and the previously generated cates in FOL are atomic formulas that are combined tokens. Our input X consists of a sequence of through logical connectives - logical and (∧), logi- m tokens {x0, x1, ..., xm} which get encoded into cal or (∨) ; and quantifiers. In general, a predicate hidden contextual representations by an Encoder. P (v1, v2, ..., vn) is an n-ary function of variables. The Decoder, then, generates an output sequence There are two types of quantifiers, universal (∀)- of n tokens {y0, y1, ..., yn}. which specifies that sub-formula within its scope n is true for all instances of the variable and exis- Y P (Y |X) = P (y |y ..y ,X) (1) tential (∃) - which asserts existence of at least one i 0 i−1 i=0 instance represented by a variable under which the sub-formula holds true. For example, “All humans 3.1 Encoder eat” can be represented as Our Encoder E is a bidirectional LSTM (biLSTM) ∀A(∃B(human(A) ∧ eat(B) ∧ agent(B,A))) which encodes a sequence of input tokens X : Following generalized De Morgan’s law (John- {x0, x1..., xm} into a sequence of hidden states H : {h , h , ..., h } h ∈ Rdh stone, 1979), universal quantifiers can be repre- e e0 e1 em , ei to capture con- sented through existential quantification and nega- textual information from the input that is eventu- tion (not) preserving the semantics as ally used by the decoder to produce the output FOL sequence. The biLSTM block takes word embed- not(∃A(not(∃B(human(A) ∧ eat(B) ∧ dings for the input tokens Ee : {ee0 , ee1 , ..., eem }, agent(B,A))))) D eei ∈ R as input and processes them to calculate Output and Mapping Format : Given a text sen- the contextual representations tence, we obtain the following FOL output. hei = [hfei ; hbei ] = biLST M(ee0 , ee1 , ..., eei ) Sentence : “three women are traveling by foot” (2) Output FOL : fol(1,some(A,some(B,some(C,and where ; denotes concatenation operation and

(r1by(B,A),and(n1foot(A),and(r1agent(B,C),and hfei , hbei refer to the forward and backward hidden (v1travel(B),and(n1woman(C),some(D,and(card states of the biLSTM. (C,D),and(c3number(D),n1numeral(D))))))))))))) 3.2 Decoder Here, the predicates are prefixed with POS-tags Decoder D consists of an LSTM which uses (Wilks and Stevenson, 1998) and relation types. the outputs of encoder E along with previously Since the output FOL comprises of existential quan- decoded outputs, provided as embeddings E : tifiers and disjunction of atomic formulas only, we d {e , e , ..., e } to it as input, to generate a se- convert it into an equivalent mapping as a sequence d0 d1 dn quence of hidden states H : {h , h , ..., h }. of predicates, argument variables, scoping symbols d d0 d1 dn (such as “fol(”, “)”, “not(”) and train our models to h = LST M(e , e , ..., e ) (3) predict the sequence. We arrange scope symbols in di d0 d1 di accordance to their nesting level (top most appear- Attention (Bahdanau et al., 2014) has now become ing first in the sequence) with further ordering that ubiquitous in sequence to sequence models. We entities that are part of same scope are arranged as consider it to be a part of our baseline model. Fol- sequence of unary predicates, followed by binary lowing (Bahdanau et al., 2014), we calculate the predicates and other nested scoped entities. weights for encoder-decoder attention using Hd as Equivalent Mapping : fol( n1foot A v1travel B queries while He as keys as well as values (eq.4. n1woman C c3number D n1numeral D r1by B A T h he r1agent B C card C D ) e di j αij = T (4) m h he P e di k 3 Proposed Approach k=0 We model parsing a given sentence into FOL as a The encoder-context vector is obtained by taking sequence to sequence transduction problem. Our a weighted sum of encoder’s hidden states He (eq. 5). parser P generates a token in the output FOL rep- m resentation in a sequential manner by greedily sam- X cei = αij ∗ hej (5) pling it from a probability distribution conditioned j=0 Figure 1: Overview of our architecture showing separate heads (red), category prediction (orange) and alignment mechanism (green and pink). Input to Decoder LSTM (blue) depicts the output of last step being fed at next step. Red arrow between Attention Layer and Decoder depicts standard encoder-decoder attention.

The hidden state of the decoder hdi along with a token of category V does not posses semantic encoder-context vector cei is used to predict the meaning that is shared across all sequences from final output token at ith step. the output distribution. Thus, they are defined in the context of an FOL sequence only. We repre-

si = Wo ∗ [hdi ; cei ] (6) sent them through one-hot embeddings to ensure independence between them. VX(d +dc) where Wo ∈ R h is output head and dh is Building on above motivation, we use five dif- the dimension of hidden vector and dc = dh is the ferent heads on top of Decoder LSTM. While one dimension of context vector. head T decides what type of token is being gen- We train the model on the standard cross-entropy erated at a given decoding step, the other heads objective while adopting teacher forcing method- decode the probabilities of different types of to- ology i.e. giving the inputs to the decoder from kens. ground truth instead of previously decoded tokens u oi = softmax(Wu ∗ hdi ) (8) while training

n ob = softmax(W ∗ h ) (9) 1 X i b di L (θ) = − yt ∗ log(P (y |y ..y ,X)) CE n i i 0 i−1 0 v (7) oi = softmax(Wv ∗ hdi ) (10) t where yi is the target token from ground truth at step i and θ refers to the trainable model parame- s oi = softmax(Ws ∗ hdi ) (11) ters.

3.2.1 Separate Heads ti = softmax(Wt ∗ hdi ) (12)

VxXd The output tokens in an FOL sequence do not all where Wx ∈ R h and x : {u, b, v, s}. We also belong to the same token category unlike majority treat different categories as words in a vocabulary VcXd sequence to sequence translation problems which of size Vc and therefore, Wt ∈ R h process words. In particular, the output tokens in We, thus, train the model on an additional auxil- an FOL sequence can be divided into four major iary task of predicting the type of the token being types - Unary Predicates U, Binary Predicates B, generated at each step. Hence, the overall cross- Variables V , and Scoped Entities S. We create entropy objective to decode the correct type at all separate vocabularies of sizes Vu, Vb, Vv, and Vs steps becomes for each category. Apart from variables V which n 1 X have one-hot embedding, all other types of output L (θ, φ) = − tt ∗ log(t )) (13) aux n i i tokens have dense embeddings. This is because 0 t where ti is the target type (from ground truth) of At each variable decoding step, along with de- token to be predicted at this step and the probability coding the type of the token i.e. variable, a lin- x[ti] of generating token yi is given by oi . φ refers ear classifier makes the decision whether this vari- to additional decoder parameters introduced in the able is aligned with any previously decoded to- model. Thus, our overall objective is now a sum of ken/variable or is an entirely new variable being both cross entropy and auxiliary objective: generated at this step

Ad = sigmoid(Wali ∗ hd ) (18) Lsep(θ, φ) = LCE(θ) + Laux(θ, φ) (14) i i where W ∈ R1Xdh . Depending on this decision 3.3 Decoder Self Attention ali by the classifier, an alignment mechanism similar One of the key challenges for the model is to iden- to decoder self-attention performs the relational tify the relationship between the variables it gener- mapping between a previously decoded variable ates. A variable A that is an argument in a binary and the variable which is currently being decoded. predicate should be aligned with the same variable This mapping is performed only for the variables used as an argument in a unary predicate previ- and not for any other category. All the previously ously. One of the ways to achieve such alignment generated hidden states of the decoder are linearly is through decoder self-attention which is an exten- projected into a different space before calculating sion of the regular encoder-decoder attention. In the position of the token with which the variable is this case, queries, keys and values - all are decoder aligned. The projection is performed to reduce the hidden states Hd. Therefore, we determine decoder interference with encoder-decoder attention due to context vector cd along with the encoder-context alignment mechanism training. vector. However, one of the key differences be- ha = W ∗ h tween encoder-decoder and decoder self-attention di proj di (19) is that while encoder-decoder attention can be ap- The probabilities of whether a particular step j plied to the whole input, decoder self-attention can aligns with the current decoding step i is calculated only be applied on the hidden states which have with an attention-like formulation. been decoded so far. Just like encoder-context, de- aT a coder context is calculated by taking a weighted hd hd e i j sum of decoder hidden states γij = aT a (20) Pm hd hd k=0 e i k T hd hdj e i i−1 βij = T (15) m h hd X a P di k c = γ ∗ h (21) k=0 e ai j dj j=0 i−1 X For every other category of tokens but variables, c = β ∗ h (16) di ij dj the decoder heads remain the same. However, for j=0 variables, we first calculate aligned hidden state The linear head on the decoder now uses both value as encoder and decoder contexts along with decoder hali = A ∗ c + (1 − A ) ∗ h (22) hidden state to generate the final output di di ai di di The output is, then, calculated as si = Wo ∗ [hdi ; cei ; cdi ] (17) i ali ov = Wo ∗ [h ; cei ] (23) d +2∗dc di where Wo ∈ R h In order to provide explicit signal during training, 3.3.1 Alignment Mechanism we train γ and Adi on the target alignment positions Through decoder self attention, the model does not and decisions with a cross entropy objective receive any explicit signal on alignment and relies n only on cross-entropy objective to identify such 1 X L (θ, φ, ζ) = − (At ∗ log(A )+ relations between different variables. In order to dec n di di provide an explicit signal to the model, we intro- i=0 (1 − At ) ∗ log(1 − A )) (24) duce Alignment Mechanism. di di 3 from scratch. We used dh = dc = 400, m = 100 and n = 30. n i−1 1 X X t t Lpos(θ, φ, ζ) = − A ∗ γij ∗ log(γij) 4.3 Results and Discussion n di i=0 j=0 (25) 4.3.1 Evaluation Framework where At and γt refer to ground truth decision We evaluate different models through estimating di ij and alignment position values and ζ refers to ad- the accuracy of complete match between gold stan- ditional parameters introduced due to alignment dard FOL and predicted output. Due to the complex mechanism. Therefore our overall loss becomes nature of the task, it is less likely that the model gen- erates exactly the same FOL. To mitigate this, we Lalign(θ, φ, ζ) = Lsep + Ldec + Lpos (26) propose to evaluate the degree of partial match be- tween two FOLs following the intuition behind D- 4 Experiments match and Smatch (Cai and Knight, 2013), which are widely used to evaluate AMR graphs and DRGs. 4.1 Dataset We align two FOLs in bottom up manner beginning We collated a subset of SNLI (Bowman et al., with variables. For aligning two variables, it is re- 2015) corpus by extracting sentences from both quired that the corresponding predicates’ name (in premise and hypothesis for a limited number of which they appear as arguments) and argument po- examples. Eliminating duplicates, we prepared sitions match. Subsequently, while aligning two (Refer Section2) two versions of the dataset - predicates, we check if their arguments are aligned Small and Large to examine if the proposed and their names are same. We continue to follow improvements remain consistent even on small the same process where we align nested scope sym- data. In the smaller version, we prepared 138,346 bols (“not(” etc.). In particular, given an expected instances while in the larger one, we prepared scoped entity, we determine the predicted scoped 255,501 instances for training. We used the entity having maximum alignment with it based on development and test sets of SNLI as provided but the count of other aligned predicates and scoped eliminated the duplicates resulting in evaluation entities that are contained inside them. Given an set having 10,691 instances and test set having FOL, we decompose it into related pairs of the form 10,633 instances. (n1, n2) such that n2 appears inside the scope of n1. For instance, a variable that is an argument in a predicate or a predicate appearing inside a 4.2 Implementation scoped entity. Consequently, we estimate the num- We used Pytorch2 library for implementing an ber of pairs in expected FOL that can be matched auto-differentiable graph of our computations. All with pairs in predicted FOL based on the constraint the models were trained with an Adam Opti- that corresponding entities in the pairs should be mizer(Kingma and Ba, 2014) initialized with a aligned. We select the alignment with maximum learning rate of 0.001 with a decay rate of 10−4. matches and report metrics (precision-recall and We use an embedding size D = 100 for encoder as F1 over pair-matching) as evaluation criteria along well as decoder embeddings in the baseline model. with overall FOL accuracy. In our separated heads model, D remained the 4.3.2 Comparison with Baseline and same for encoder embeddings. However, on the Ablation Studies decoder side, Unary and Binary predicates have an We conduct a range of experiments and evalua- embedding size of 100 each while variable and type tions on different models. We show our results embeddings are one-hot having the number of di- on both development and test sets in Table1,2,3, mensions equal to their respective vocabulary sizes. and4. Our Vanilla (Baseline) model consists of Scoped entities, being very less in number, were a biLSTM Encoder and a plain LSTM decoder as encoded with an embedding size of 50. Our final in- described in Section 3.2 coupled with an encoder- put embedding is a concatenation of Unary, Binary, decoder attention mechanism. Performing disentan- Variable, Scope and type embeddings. All dense glement, our Separate Heads model uses different embeddings are randomly initialized and trained 3We experimented with GloVe embeddings but it did not 2https://pytorch.org give further improvements linear heads on the top of LSTM decoder for differ- Model Precision Recall F-1 Accuracy Vanilla (Baseline) 66.43 65.84 66.13 60.45 ent category of tokens as discussed in Section 3.2.1. + Self Attention 68.56 67.23 67.89 61.18 Our final proposed model Separate Heads + Align + Align Mechanism 61.74 60.60 61.17 56.79 uses our alignment mechanism on the top of Sepa- Separate Heads 69.18 67.63 68.40 61.14 rate Heads and utilises the disentangled variable + Self Attention 66.92 65.69 66.30 60.12 Separate Heads + Align 74.05 72.53 73.28 63.24 prediction mechanism coupled with an alignment mechanism to effectively identify the relationships Table 2: Results showing overall accuracy and F1- between variables in binary predicates and their Scores of different models (trained on Large dataset) unary counterparts. We also conduct ablations on on Test dataset the Vanilla and Separate Heads models by incre- mentally adding both decoder self-attention and Model Precision Recall F-1 Accuracy Vanilla (Baseline) 59.54 58.14 58.83 52.99 alignment mechanisms. Separate Heads 62.00 60.58 61.28 55.20 Evidently, our final model Separate Heads + Separate Heads + Align 65.75 64.36 65.05 55.30 Align convincingly outperforms all described mod- Table 3: Results showing overall accuracy and F1- els and improves the baseline by ∼ 8 F-1 points. Scores of different models (trained on Small Dataset) Decoder self-attention, even though, improves on development dataset Vanilla Model does not provide any improvements when used with Separate Heads. This can be at- Model Precision Recall F-1 Accuracy tributed to its inability to incorporate decoder level Vanilla (Baseline) 59.87 58.41 59.13 52.92 information which probably becomes factorized au- Separate Heads 62.56 61.10 61.82 55.94 Separate Heads + Align 66.45 65.05 65.74 56.10 tomatically during training through using separate heads. However, it provides improvements over Table 4: Results showing overall accuracy and F1- Vanilla by a good margin but still only matches or Scores of different models (trained on Small Dataset remains inferior to the standalone Separate Heads on test dataset model. Align Mechanism manages to provide a huge boost to the Separate Heads model by im- proving it by ∼ 5 F-1 points. However, perfor- mance deteriorates when used with Vanilla model since its ability to align variables only vanishes in this setup which we find critical for its working. We further note that by increasing the size of train- ing data, the performance increases uniformly with our final model achieving the best F-1 of ∼ 73 and an overall accuracy of ∼ 63%.

Model Precision Recall F-1 Accuracy Vanilla (Baseline) 65.48 65.15 65.31 59.26 + Self Attention 67.86 66.70 67.28 60.74 + Align Mechanism 62.13 61.06 61.59 56.98 Figure 2: Variation of output F-1 Score with input Separate Heads 68.48 66.81 67.64 60.77 length on Test Dataset + Self Attention 67.11 65.87 66.48 60.02 Separate Heads + Align 73.68 72.17 72.92 63.26

Table 1: Results showing overall accuracy and F1- Variation with Input Length: Evidently, our pro- Scores of different models (trained on Large dataset) posed models are relatively much more robust to on development dataset increase in length in the input sentence as shown in Fig.2. This can be attributed to many factors - increased model capacities as well as their abilities 4.3.3 Analysis to process different categories of output tokens sep- We perform additional experiments to analyse the arately giving better long range dependencies and results observed. We conduct two sets of analysis less confusion in generating many variables over - Variation of F-1 score with input length and Per- FOL owing to better alignment across the sequence. turbed training to establish the robustness of our proposed method. Perturbed Training: It has been noticed in litera- Model Precision Recall F-1 Accuracy tial parse prediction (Jia and Liang, 2016; Kociskˇ y` Vanilla(Baseline) 63.75 62.32 63.03 57.95 Separate Heads 67.39 65.59 66.48 61.44 et al., 2016) and graph structure decoding which Separate Heads + Align 72.90 71.95 72.42 63.14 tailor network architecture to utilize the syntac- tic structure of meaning representation. (Yin and Table 5: Results on Test set showing both accuracy and Neubig, 2017; Alvarez-Melis and Jaakkola, 2016; F1-Scores with perturbed training Rabinovich et al., 2017). (Dong and Lapata, 2016) proposed SEQ2TREE to generate domain-specific ture ((Jia and Liang, 2017; Niven and Kao, 2019) hierarchical logical form by introducing parenthe- that neural models sometimes exploit trivial pat- sis token and parent connections to recursively gen- terns in outputs/inputs to fool and provide pseudo- erate sub-trees. (Rabinovich et al., 2017) intro- improved results. One such pattern could be pres- duced a dynamic decoder whose components are ence of variables like A and B with some specific composed depending on generated tree parse. (Liu unary and binary predicates. In order to disturb et al., 2018) parse DRSs using dedicated hierar- such patterns, we randomly permute the presence chical decoders to generate partial structure first of such variables in the ground truth during train- before the semantic content. We instead make the ing. Our baseline model indeed shows a significant model disambiguate syntactic types (unary and bi- drop in results (Table5). On the other hand, our nary predicates, variables, scope symbols) through other two main models do not show such large drop performing category type prediction as an auxiliary proving their robustness to such disturbances. task and using separate prediction heads. Con- strained decoding using target language syntax and 5 Related Work grammar rules has been explored (Yin and Neu- big, 2017; Xiao et al., 2016). Copy-mechanism Early semantic parsers were majorly rule based (Gu et al., 2016) has been used to facilitate the (Johnson, 1984; Woods, 1973; Thompson et al., generation of out of vocabulary entities through en- 1969) using grammar systems (Waltz, 1978; Hen- coder attention (Jia and Liang, 2016). However, our drix et al., 1978), employing shallow pattern match- variable alignment mechanism is different since it ing (Johnson, 1984) and parse tree to generate constrains the model to align binary predicate argu- database query language (Woods, 1973). These ments with previously generated unary structures were succeeded by data driven learning tech- (alignment happening at decoder level) through niques which use language data paired with mean- specifying explicit loss on whether to align and ing representations and can be broadly classified where to align. into statistical methods (Thompson, 2003; Zettle- moyer and Collins, 2012; Zelle and Mooney, 1996; Kwiatkowski et al., 2010) and neural approaches 6 Conclusion (Kociskˇ y` et al., 2016; Dong and Lapata, 2016; Jia and Liang, 2016; Buys and Blunsom, 2017; In this work, we examined the capability of neural Cheng et al., 2017; Liu et al., 2018; Li et al., 2018). models on the task of parsing First-Order Logic (Zettlemoyer and Collins, 2012) proposed to use from natural language sentences. We proposed sentences and their lambda calculus expressions to to disentangle the representations of different to- learn a log linear model through probabilistic CCG ken categories while generating FOL output and to rank different parses for a given sentence us- used category prediction as an auxiliary task. We ing simple features such as count of lexical entries. utilized token factorization to build an alignment Additionally, captioned videos have been used to mechanism which effectively manages to capture perform visually grounded semantic parsing (Ross the relationship between variables across different et al., 2018). Feedback based semantic parsing has predicates in FOL. Our analysis showed the diffi- been done to facilitate continuous improvement in culties faced by neural networks in modeling FOL the quality of parse through conversations (Artzi and ways to tackle them. We also experimented by and Zettlemoyer, 2011) and user interaction (Iyer introducing a perturbation in inputs in order to ex- et al., 2017; Lawrence and Riezler, 2018). amine the robustness of different proposed models. Neural approaches alleviate the need for man- In a bid to promote research further in the area, we ually defining lexicons and can further be catego- aim to release our code as well as data publicly. rized based on the structure of parse into sequen- References H Bunt et al. 2001. Patrick blackburn, johan bos, michael kohlhase and hans de nivelle. Omri Abend and Ari Rappoport. 2013. Universal con- ceptual cognitive annotation (ucca). In Proceedings Jan Buys and Phil Blunsom. 2017. Robust incremen- of the 51st Annual Meeting of the Association for tal neural semantic graph parsing. arXiv preprint Computational Linguistics (Volume 1: Long Papers), arXiv:1704.07092. pages 228–238. Shu Cai and Kevin Knight. 2013. Smatch: an evalua- David Alvarez-Melis and Tommi S Jaakkola. 2016. tion metric for semantic feature structures. In Pro- Tree-structured decoding with doubly-recurrent neu- ceedings of the 51st Annual Meeting of the Associa- ral networks. tion for Computational Linguistics (Volume 2: Short Papers), pages 748–752. Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrap- ping semantic parsers from conversations. In Pro- Jianpeng Cheng, Siva Reddy, Vijay Saraswat, and ceedings of the conference on empirical methods in Mirella Lapata. 2017. Learning structured natu- natural language processing, pages 421–432. Asso- ral language representations for semantic parsing. ciation for Computational Linguistics. arXiv preprint arXiv:1704.08387.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly su- Kyunghyun Cho, Bart Van Merrienboer,¨ Caglar Gul- pervised learning of semantic parsers for mapping cehre, Dzmitry Bahdanau, Fethi Bougares, Holger instructions to actions. Transactions of the Associa- Schwenk, and Yoshua Bengio. 2014. Learning tion for Computational Linguistics, 1:49–62. phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- arXiv:1406.1078. gio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint Stephen Clark and James R Curran. 2007. Wide- arXiv:1409.0473. coverage efficient statistical parsing with ccg and log-linear models. Computational Linguistics, Laura Banarescu, Claire Bonial, Shu Cai, Madalina 33(4):493–552. Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Donald Davidson. 1969. The individuation of events. Schneider. 2013. Abstract meaning representation In Essays in honor of Carl G. Hempel, pages 216– for sembanking. In Proceedings of the 7th Linguis- 234. Springer. tic Annotation Workshop and Interoperability with Discourse, pages 178–186. Li Dong and Mirella Lapata. 2016. Language to log- ical form with neural attention. arXiv preprint Hendrik P Barendregt et al. 1984. The lambda calculus, arXiv:1601.01280. volume 3. North-Holland Amsterdam. Ulrich Furbach, Ingo Glockner,¨ Hermann Helbig, and Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Bjorn¨ Pelzer. 2010. Logic-based question answer- Liang. 2013. Semantic parsing on freebase from ing. KI-Kunstliche¨ Intelligenz, 24(1):51–55. question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Lan- Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK guage Processing, pages 1533–1544. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint Patrick Blackburn. 2005. Representation and inference arXiv:1603.06393. for natural language: A first course in . Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim paraphrase generation. In Thirty-Second AAAI Con- Sturge, and Jamie Taylor. 2008. Freebase: a collab- ference on Artificial Intelligence. oratively created graph database for structuring hu- man knowledge. In Proceedings of the 2008 ACM Gary G Hendrix, Earl D Sacerdoti, Daniel Sagalowicz, SIGMOD international conference on Management and Jonathan Slocum. 1978. Developing a natural of data, pages 1247–1250. AcM. language interface to complex data. ACM Transac- tions on Database Systems (TODS), 3(2):105–147. Johan Bos. 2008. Wide-coverage semantic analysis with boxer. In Proceedings of the 2008 Conference Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant on Semantics in Text Processing, pages 277–286. As- Krishnamurthy, and Luke Zettlemoyer. 2017. Learn- sociation for Computational Linguistics. ing a neural semantic parser from user feedback. arXiv preprint arXiv:1704.08760. Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large anno- Robin Jia and Percy Liang. 2016. Data recombina- tated corpus for learning natural language inference. tion for neural semantic parsing. arXiv preprint arXiv preprint arXiv:1508.05326. arXiv:1606.03622. Robin Jia and Percy Liang. 2017. Adversarial exam- Percy Liang, Michael I Jordan, and Dan Klein. 2013. ples for evaluating reading comprehension systems. Learning dependency-based compositional seman- arXiv preprint arXiv:1707.07328. tics. Computational Linguistics, 39(2):389–446.

Tim Johnson. 1984. Natural language computing: the Jiangming Liu, Shay B Cohen, and Mirella Lapata. commercial applications. The Knowledge Engineer- 2018. Discourse representation structure parsing. In ing Review, 1(3):11–23. Proceedings of the 56th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Peter T Johnstone. 1979. Conditions related to de mor- Long Papers), pages 429–439. gan’s law. In Applications of sheaves, pages 479– 491. Springer. Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, Hans Kamp, Josef Van Genabith, and Uwe Reyle. 2011. knowledge, and action in route instructions. Def, Discourse representation theory. In Handbook of 2(6):4. philosophical logic, pages 125–394. Springer.

Seonhoon Kim, Inho Kang, and Nojun Kwak. Timothy Niven and Hung-Yu Kao. 2019. Probing neu- 2019. Semantic sentence matching with densely- ral network comprehension of natural language argu- connected recurrent and co-attentive information. In ments. arXiv preprint arXiv:1907.07355. Proceedings of the AAAI Conference on Artificial In- telligence, volume 33, pages 6586–6593. Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Dan Flickinger, Jan Hajic, Angelina Diederik P Kingma and Jimmy Ba. 2014. Adam: A Ivanova, and Yi Zhang. 2014. Semeval 2014 task method for stochastic optimization. arXiv preprint 8: Broad-coverage semantic dependency parsing. In arXiv:1412.6980. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 63–72. Toma´sˇ Kociskˇ y,` Gabor´ Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Panupong Pasupat and Percy Liang. 2015. Compo- Karl Moritz Hermann. 2016. Semantic parsing with sitional semantic parsing on semi-structured tables. semi-supervised sequential autoencoders. arXiv arXiv preprint arXiv:1508.00305. preprint arXiv:1609.09315. Aaditya Prakash, Sadid A Hasan, Kathy Lee, Vivek Thomas Kollar, Danielle Berry, Lauren Stuart, Datla, Ashequl Qadir, Joey Liu, and Oladimeji Karolina Owczarzak, Tagyoung Chung, Lambert Farri. 2016. Neural paraphrase generation with Mathias, Michael Kayser, Bradford Snow, and Spy- stacked residual lstm networks. arXiv preprint ros Matsoukas. 2018. The alexa meaning represen- arXiv:1610.03098. tation language. In Proceedings of the 2018 Con- ference of the North American Chapter of the As- Ella Rabinovich, Noam Ordan, and Shuly Wintner. sociation for Computational Linguistics: Human 2017. Found in translation: Reconstructing phy- Language Technologies, Volume 3 (Industry Papers), logenetic language trees from translations. arXiv pages 177–184. preprint arXiv:1704.07146.

Satwik Kottur, Xiaoyu Wang, and V´ıtor Carvalho. Candace Ross, Andrei Barbu, Yevgeni Berzak, Bat- 2017. Exploring personalized neural conversational tushig Myanganbayar, and Boris Katz. 2018. models. In IJCAI, pages 3728–3734. Grounding language acquisition by training seman- tic parsers using captioned videos. In Proceedings Tom Kwiatkowski, Luke Zettlemoyer, Sharon Gold- of the 2018 Conference on Empirical Methods in water, and Mark Steedman. 2010. Inducing proba- Natural Language Processing, pages 2647–2656. bilistic ccg grammars from logical form with higher- order unification. In Proceedings of the 2010 con- Raymond R Smullyan. 2012. First-order logic, vol- ference on empirical methods in natural language ume 43. Springer Science & Business Media. processing, pages 1223–1233. Association for Com- putational Linguistics. Mark Steedman and Jason Baldridge. 2011. Combi- Carolin Lawrence and Stefan Riezler. 2018. Improv- natory categorial grammar. Non-Transformational ing a neural semantic parser by counterfactual learn- Syntax: Formal and explicit models of grammar, ing from human bandit feedback. arXiv preprint pages 181–224. arXiv:1805.01252. Yibo Sun, Duyu Tang, Nan Duan, Jianshu Ji, Gui- Zuchao Li, Jiaxun Cai, Shexia He, and Hai Zhao. 2018. hong Cao, Xiaocheng Feng, Bing Qin, Ting Liu, and Seq2seq dependency parsing. In Proceedings of Ming Zhou. 2018. Semantic parsing with syntax- the 27th International Conference on Computational and table-aware SQL generation. In Proceedings Linguistics, pages 3203–3214. of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Percy Liang. 2013. Lambda dependency-based compo- pages 361–372, Melbourne, Australia. Association sitional semantics. arXiv preprint arXiv:1309.4408. for Computational Linguistics. I Sutskever, O Vinyals, and QV Le. 2014. Sequence to Luke S Zettlemoyer and Michael Collins. 2012. Learn- sequence learning with neural networks. Advances ing to map sentences to logical form: Structured in NIPS. classification with probabilistic categorial grammars. arXiv preprint arXiv:1207.1420. Cynthia Thompson. 2003. Acquiring word-meaning mappings for natural language interfaces. Journal Victor Zhong, Caiming Xiong, and Richard Socher. of Artificial Intelligence Research, 18:1–44. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. Frederick B Thompson, PC Lockemann, B Dostert, and arXiv preprint arXiv:1709.00103. RS Deverill. 1969. Rel: A rapidly extensible lan- guage system. In Proceedings of the 1969 24th na- tional conference, pages 399–417. ACM.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro- cessing systems, pages 5998–6008.

Oriol Vinyals and Quoc Le. 2015. A neural conversa- tional model. arXiv preprint arXiv:1506.05869.

David L Waltz. 1978. An english language question an- swering system for a large relational database. Com- munications of the ACM, 21(7):526–539.

Bingning Wang, Kang Liu, and Jun Zhao. 2017. Condi- tional generative adversarial networks for common- sense machine comprehension. In IJCAI, pages 4123–4129.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceed- ings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Interna- tional Joint Conference on Natural Language Pro- cessing (Volume 1: Long Papers), pages 1332–1342.

Yorick Wilks and Mark Stevenson. 1998. The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation. Natural Language Engi- neering, 4(2):135–143.

William A Woods. 1973. Progress in natural lan- guage understanding: an application to lunar geol- ogy. In Proceedings of the June 4-8, 1973, national computer conference and exposition, pages 441–450. ACM.

Chunyang Xiao, Marc Dymetman, and Claire Gardent. 2016. Sequence-based structured prediction for se- mantic parsing. In Proceedings of the 54th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1341– 1350.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696.

John M Zelle and Raymond J Mooney. 1996. Learn- ing to parse database queries using inductive logic programming. In Proceedings of the national con- ference on artificial intelligence, pages 1050–1055.