Joint Syntactic and Semantic Parsing with Combinatory Categorial Grammar

Jayant Krishnamurthy
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
[email protected]

Tom M. Mitchell
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
[email protected]

Abstract

We present an approach to training a joint syntactic and semantic parser that combines syntactic training information from CCGbank with semantic training information from a knowledge base via distant supervision. The trained parser produces a full syntactic parse of any sentence, while simultaneously producing logical forms for portions of the sentence that have a semantic representation within the parser's predicate vocabulary. We demonstrate our approach by training a parser whose semantic representation contains 130 predicates from the NELL ontology. A semantic evaluation demonstrates that this parser produces logical forms better than both comparable prior work and a pipelined syntax-then-semantics approach. A syntactic evaluation on CCGbank demonstrates that the parser's dependency F-score is within 2.5% of state-of-the-art.

1 Introduction

Integrating syntactic parsing with semantics has long been a goal of natural language processing and is expected to improve both syntactic and semantic processing. For example, semantics could help predict the differing prepositional phrase attachments in "I caught the butterfly with the net" and "I caught the butterfly with the spots." A joint analysis could also avoid propagating syntactic parsing errors into semantic processing, thereby improving performance.

We suggest that a large populated knowledge base should play a key role in syntactic and semantic parsing: in training the parser, in resolving syntactic ambiguities when the trained parser is applied to new text, and in its output semantic representation. Using semantic information from the knowledge base at training and test time will ideally improve the parser's ability to solve difficult syntactic parsing problems, as in the examples above. A semantic representation tied to a knowledge base allows for powerful inference operations – such as identifying the possible entity referents of a noun phrase – that cannot be performed with shallower representations (e.g., frame semantics (Baker et al., 1998) or a direct conversion of syntax to logic (Bos, 2005)).

This paper presents an approach to training a joint syntactic and semantic parser using a large background knowledge base. Our parser produces a full syntactic parse of every sentence, and furthermore produces logical forms for portions of the sentence that have a semantic representation within the parser's predicate vocabulary. For example, given a phrase like "my favorite town in California," our parser will assign a logical form like λx.CITY(x) ∧ LOCATEDIN(x, CALIFORNIA) to the "town in California" portion. Additionally, the parser uses predicate and entity type information during parsing to select a syntactic parse.

Our parser is trained by combining a syntactic parsing task with a distantly-supervised relation extraction task. Syntactic information is provided by CCGbank, a conversion of the Penn Treebank into the CCG formalism (Hockenmaier and Steedman, 2002a). Semantics are learned by training the parser to extract knowledge base relation instances from a corpus of unlabeled sentences, in a distantly-supervised training regime. This approach uses the knowledge base to avoid expensive manual labeling of individual sentence semantics. By optimizing the parser to perform both tasks simultaneously, we train a parser that produces accurate syntactic and semantic analyses.

We demonstrate our approach by training a joint syntactic and semantic parser, which we call ASP. ASP produces a full syntactic analysis of every sentence while simultaneously producing logical forms containing any of 61 category and 69 relation predicates from NELL.

Experiments with ASP demonstrate that jointly analyzing syntax and semantics improves semantic parsing performance over comparable prior work and a pipelined syntax-then-semantics approach. ASP's syntactic parsing performance is within 2.5% of state-of-the-art; however, we also find that incorporating semantic information reduces syntactic parsing accuracy by ∼0.5%.

2 Prior Work

This paper combines two lines of prior work: broad coverage syntactic parsing with CCG and semantic parsing.

Broad coverage syntactic parsing with CCG has produced both resources and successful parsers. These parsers are trained and evaluated using CCGbank (Hockenmaier and Steedman, 2002a), an automatic conversion of the Penn Treebank into the CCG formalism. Several broad coverage parsers have been trained using this resource (Hockenmaier and Steedman, 2002b; Hockenmaier, 2003b). The parsing model in this paper is loosely based on C&C (Clark and Curran, 2007b; Clark and Curran, 2007a), a discriminative log-linear model for statistical parsing. Some work has also attempted to automatically derive logical meaning representations directly from syntactic CCG parses (Bos, 2005; Lewis and Steedman, 2013). However, these approaches to semantics do not ground the text to beliefs in a knowledge base.

Meanwhile, work on semantic parsing has focused on producing semantic parsers for answering simple natural language questions (Zelle and Mooney, 1996; Ge and Mooney, 2005; Wong and Mooney, 2006; Wong and Mooney, 2007; Lu et al., 2008; Kate and Mooney, 2006; Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2011). This line of work has typically used a corpus of sentences with annotated logical forms to train the parser. Recent work has relaxed the requisite supervision conditions (Clarke et al., 2010; Liang et al., 2011), but has still focused on simple questions. Finally, some work has looked at applying semantic parsing to answer queries against large knowledge bases, such as YAGO (Yahya et al., 2012) and Freebase (Cai and Yates, 2013b; Cai and Yates, 2013a; Kwiatkowski et al., 2013; Berant et al., 2013). Although this work considers a larger number (thousands) of predicates than we do, none of these systems are capable of parsing open-domain text. Our approach is most closely related to the distantly-supervised approach of Krishnamurthy and Mitchell (2012).

The parser presented in this paper can be viewed as a combination of both a broad coverage syntactic parser and a semantic parser trained using distant supervision. Combining these two lines of work has synergistic effects – for example, our parser is capable of semantically analyzing conjunctions and relative clauses based on the syntactic annotation of these categories in CCGbank. This synergy gives our parser a richer semantic representation than previous work, while simultaneously enabling broad coverage.

3 Parser Design

This section describes the Combinatory Categorial Grammar (CCG) parsing model used by ASP. The input to the parser is a part-of-speech tagged sentence, and the output is a syntactic CCG parse tree, along with zero or more logical forms representing the semantics of subspans of the sentence. These logical forms are constructed using category and relation predicates from a broad coverage knowledge base. The parser also outputs a collection of dependency structures summarizing the sentence's predicate-argument structure. Figure 1 illustrates ASP's input/output specification.

3.1 Knowledge Base

The parser uses category and relation predicates from a broad coverage knowledge base both to construct logical forms and to parametrize the parsing model. The knowledge base is assumed to have two kinds of ontological structure: a generalization/subsumption hierarchy and argument type constraints. This paper uses NELL's ontology (Carlson et al., 2010), which, for example, specifies that the category ORGANIZATION is a generalization of SPORTSTEAM, and that both arguments to the LOCATEDIN relation must have type LOCATION. These type constraints are enforced during parsing. Throughout this paper, predicate names are shown in SMALLCAPS.

3.2 Syntax

ASP uses a lexicalized and semantically-typed Combinatory Categorial Grammar (CCG) (Steedman, 1996). Most grammatical information in CCG is encoded in a lexicon Λ, containing entries such as:

person := N : PERSON : λx.PERSON(x)
London := N : CITY : λx.M(x, "london", CITY)
great := N1/N1 : — : λf.λx.f(x)
bought := (S[dcl]\NP1)/NP2 : ACQUIRED : λf.λg.∃x,y.f(y) ∧ g(x) ∧ ACQUIRED(x, y)

[Figure 1: Example input and output for ASP. Given a POS-tagged sentence, the parser produces a CCG syntactic tree and logical form (top), and a collection of dependency structures (bottom). The example derivation parses "area that includes beautiful London" to the logical form λz.∃x,y.LOCATION(z) ∧ x = z ∧ M(y, "london", CITY) ∧ LOCATEDIN(y, x).]

Each lexicon entry maps a word to a syntactic category, semantic type, and logical form. CCG has two kinds of syntactic categories: atomic and functional. Atomic categories include N for noun and S for sentence. Functional categories are functions constructed recursively from atomic categories; these categories are denoted using slashes to separate the category's argument type from its return type. The argument type appears on the right side of the slash, and the return type on the left. The direction of the slash determines where the argument must appear – / means an argument on the right, and \ means an argument on the left.

Syntactic categories in ASP are annotated with two additional kinds of information. First, atomic categories may have associated syntactic features given in square brackets. These features are used in CCGbank to distinguish variants of atomic syntactic categories, e.g., S[dcl] denotes a declarative sentence. Second, each category is annotated with head and dependency information using subscripts. These subscripts are used to populate predicate-argument dependencies (described below), and to pass head information using unification. For example, the head of the parse in Figure 1 is "area," due to the coindexing of the argument and return categories in the category N1\N1.

In addition to the syntactic category, each lexicon entry has a semantic type and a logical form. The semantic type is a category or relation predicate that concisely represents the word's semantics. The semantic type is used to enforce type constraints during parsing and to include semantics in the parser's parametrization. The logical form gives the full semantics of the word in lambda calculus. The parser also allows lexicon entries with the semantic type "—", representing words whose semantics cannot be expressed using predicates from the ontology.

Parsing in CCG combines adjacent categories using a small number of combinators, such as function application:

X/Y : f    Y : g    ⇒    X : f(g)
Y : g    X\Y : f    ⇒    X : f(g)

The first rule states that the category X/Y can be applied to the category Y, returning category X, and that the logical form f is applied to g to produce the logical form for the returned category. Head words and semantic types are also propagated to the returned category based on the annotated head-passing markup.

3.3 Dependency Structures

Parsing a sentence produces a collection of dependency structures which summarize the predicate-argument structure of the sentence. Dependency structures are 10-tuples of the form:

⟨head word, head POS, head semantic type, head word index, head word syntactic category, argument number, argument word, argument POS, argument semantic type, argument word index⟩

A dependency structure captures a relationship between a head word and its argument. During parsing, whenever a subscripted argument of a syntactic category is filled, a dependency structure is created between the head of the applied function and its argument. For example, in Figure 1, the first application fills argument 1 of "beautiful" with "London," creating a dependency structure.

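To make the combinators of Section 3.2 and the dependency structures of Section 3.3 concrete, the following is a minimal Python sketch. It is illustrative only, not the parser's implementation: the Category class, the forward_apply helper, and the crude handling of category strings are invented for this example, and logical forms are modeled as ordinary Python callables.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Category:
        """A CCG category carrying a head word, a semantic type, and a logical form."""
        syntax: str              # e.g. "N" or "(S[dcl]\NP1)/NP2"
        word: str                # head word
        pos: str                 # head POS tag
        sem_type: Optional[str]  # e.g. "CITY"; None stands for the "--" type
        index: int               # head word index in the sentence
        logical_form: Callable   # lambda-calculus term, modeled as a Python callable

    def forward_apply(func: Category, arg: Category, arg_num: int):
        """X/Y : f applied to Y : g => X : f(g), plus one 10-tuple dependency structure."""
        result_syntax = func.syntax.split("/", 1)[0].strip("()")  # crude: keep the return type
        result = Category(result_syntax, func.word, func.pos, func.sem_type,
                          func.index, func.logical_form(arg.logical_form))
        # Filling a subscripted argument slot instantiates a dependency structure.
        dependency = (func.word, func.pos, func.sem_type, func.index, func.syntax,
                      arg_num, arg.word, arg.pos, arg.sem_type, arg.index)
        return result, dependency

    # "beautiful London": N1/N1 applied to N fills argument 1 of "beautiful".
    beautiful = Category("N1/N1", "beautiful", "JJ", None, 3, lambda f: f)
    london = Category("N", "London", "NNP", "CITY", 4, lambda x: ("M", x, "london", "CITY"))
    parent, dep = forward_apply(beautiful, london, arg_num=1)
    print(dep)  # ('beautiful', 'JJ', None, 3, 'N1/N1', 1, 'London', 'NNP', 'CITY', 4)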
3.4 Logical Forms

ASP performs a best-effort semantic analysis of every parsed sentence, producing logical forms for subspans of the sentence when possible. Logical forms are designed so that the meaning of a sentence is a universally- and existentially-quantified conjunction of predicates with partially shared arguments. This representation allows the parser to produce semantic analyses for a reasonable subset of language, including prepositions, verbs, nouns, relative clauses, and conjunctions.

Figure 1 shows a representative sample of a logical form produced by ASP. Generally, the parser produces a lambda calculus statement with several existentially-quantified variables ranging over entities in the knowledge base. The only exception to this rule is conjunctions, which are represented using a scoped universal quantifier over the conjoined predicates. Entity mentions appear in logical forms via a special mention predicate, M, instead of as database constants. For example, "London" appears as M(x, "london", CITY), instead of as a constant like LONDON. The meaning of this mention predicate is that x is an entity which can be called "london" and belongs to the CITY category. This representation propagates uncertainty about entity references into the logical form, where background knowledge can be used for disambiguation. For example, "London, England" is assigned a logical form that disambiguates "London" to a "London" located in "England."¹

¹ Specifically, λx.∃y.CITYLOCATEDINCOUNTRY(x, y) ∧ M(x, "london", CITY) ∧ M(y, "england", COUNTRY)

Lexicon entries without a semantic type are automatically assigned logical forms based on their head passing markup. For example, in Figure 1, the adjective "beautiful" is assigned λf.f. This approach allows a logical form to be derived for most sentences, but (somewhat counterintuitively) can lose interesting logical forms from constituent subspans. For example, the preposition "in" has syntactic category (N1\N1)/N2, which results in the logical form λf.λg.g. This logical form discards any information present in the argument f. We avoid this problem by extracting a logical form from every subtree of the CCG parse.

3.5 Parametrization

The parser Γ is trained as a discriminative linear model of the following form:

Γ(ℓ, d, t | s; θ) = θ^T φ(d, t, s)

Given a parameter vector θ and a sentence s, the parser produces a score for a syntactic parse tree t, a collection of dependency structures d and a logical form ℓ. The score depends on features of the parse produced by the feature function φ.

φ contains four classes of features: lexicon features, combinator features, dependency features and dependency distance features (Table 1). These features are based on those of C&C (Clark and Curran, 2007b), modified to include semantic types. The features are designed to share syntactic information about a word across its distinct semantic realizations in order to transfer syntactic information from CCGbank to semantic parsing.

The parser also includes a hard type-checking constraint to ensure that logical forms are well-typed. This constraint states that dependency structures with a head semantic type only accept arguments that (1) have a semantic type, and (2) are within the domain/range of the head type.

4 Parameter Estimation

This section describes the training procedure for ASP. Training is performed by minimizing a joint objective function combining a syntactic parsing task and a distantly-supervised relation extraction task. The input training data includes:

1. A collection L of sentences s_i with annotated syntactic trees t_i (e.g., CCGbank).
2. A corpus of sentences S (e.g., Wikipedia).
3. A knowledge base K (e.g., NELL), containing relation instances r(e1, e2) ∈ K.
4. A CCG lexicon Λ (see Section 5.2).

Given these resources, the algorithm described in this section produces parameters θ for a semantic parser. Our parameter estimation procedure constructs a joint objective function O(θ) that decomposes into syntactic and semantic components: O(θ) = O_syn(θ) + O_sem(θ). The syntactic component O_syn is a standard syntactic parsing objective constructed using the syntactic resource L. The semantic component O_sem is a distantly-supervised relation extraction task based on the semantic constraint from Krishnamurthy and Mitchell (2012). These components are described in more detail in the following sections.

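Both objectives below score candidate parses with the linear model of Section 3.5, which reduces to a sparse dot product between the parameter vector θ and indicator-feature counts φ. A schematic sketch follows; the feature-name strings are invented for illustration, and a real implementation would compute φ inside the chart parser.

    from collections import Counter

    def score_parse(theta: dict, phi: Counter) -> float:
        """Gamma(l, d, t | s; theta) = theta . phi(d, t, s), with phi as sparse counts."""
        return sum(theta.get(name, 0.0) * count for name, count in phi.items())

    # phi(d, t, s): a bag of indicator features extracted from the parse (see Table 1).
    phi = Counter({"lex:includes+(S[dcl]\\NP1)/NP2": 1,
                   "dep:includes+LOCATEDIN->CITY": 1,
                   "dist:includes->London:tokens=1": 1})
    theta = {"lex:includes+(S[dcl]\\NP1)/NP2": 0.3, "dep:includes+LOCATEDIN->CITY": 0.7}
    print(score_parse(theta, phi))  # 1.0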
Lexicon features (for a lexicon entry word, POS := X : t : ℓ):
  Word/syntactic category: ⟨word, X⟩
  POS/syntactic category: ⟨POS, X⟩
  Word semantics: ⟨word, X, t⟩

Combinator features (for combinators X Y → Z or X → Z):
  Binary combinator indicator: X Y → Z
  Unary combinator indicator: X → Z
  Root syntactic category: Z

Dependency features (for a dependency structure ⟨hw, hp, ht, hi, s, n, aw, ap, at, ai⟩):
  Predicate-Argument Indicator: ⟨hw, —, ht, —, s, n, aw, —, at, —⟩
  Word-Word Indicator: ⟨hw, —, —, —, s, n, aw, —, —, —⟩
  Predicate-POS Indicator: ⟨hw, —, ht, —, s, n, —, ap, —, —⟩
  Word-POS Indicator: ⟨hw, —, —, —, s, n, —, ap, —, —⟩
  POS-Argument Indicator: ⟨—, hp, —, —, s, n, aw, —, at, —⟩
  POS-Word Indicator: ⟨—, hp, —, —, s, n, aw, —, —, —⟩
  POS-POS Indicator: ⟨—, hp, —, —, s, n, —, ap, —, —⟩

Dependency distance features:
  Token distance: ⟨hw, ht, —, s, n, d⟩, where d = number of tokens between hi and ai: 0, 1, 2 or more
  Token distance word backoff: ⟨hw, —, s, n, d⟩, with d as above
  Token distance POS backoff: ⟨—, —, hp, s, n, d⟩, with d as above
  (The above distance features are repeated using the number of intervening verbs and punctuation marks.)

Table 1: Listing of parser feature templates used in the feature function φ. Each feature template represents a class of indicator features that fire during parsing when lexicon entries are used, combinators are applied, or dependency structures are instantiated.

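As a rough illustration of how the dependency feature templates of Table 1 back off from the full 10-tuple, the sketch below instantiates a few of them for a single dependency structure. The tuple-valued feature names and the exact distance bucketing are assumptions made for this example, not the paper's exact feature encoding.

    def dependency_features(dep):
        """Instantiate a few Table 1 templates for one dependency structure.

        dep = (hw, hp, ht, hi, s, n, aw, ap, at, ai): head word, POS, semantic type and
        index, head syntactic category s, argument number n, then the argument fields.
        """
        hw, hp, ht, hi, s, n, aw, ap, at, ai = dep
        feats = {
            ("pred-arg", hw, ht, s, n, aw, at): 1,  # Predicate-Argument Indicator
            ("word-word", hw, s, n, aw): 1,         # Word-Word Indicator
            ("pos-pos", hp, s, n, ap): 1,           # POS-POS Indicator
        }
        # Distance features bucket the number of intervening tokens into 0, 1, or "2+".
        tokens_between = max(abs(ai - hi) - 1, 0)
        bucket = tokens_between if tokens_between < 2 else "2+"
        feats[("token-distance", hw, ht, s, n, bucket)] = 1
        return feats

    dep = ("includes", "VBZ", "LOCATEDIN", 2, "(S[dcl]\\NP1)/NP2",
           2, "London", "NNP", "CITY", 4)
    for feature in dependency_features(dep):
        print(feature)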
4.1 Syntactic Objective

The syntactic objective is the structured perceptron objective instantiated for a syntactic parsing task. This objective encourages the parser to accurately reproduce the syntactic parses in the annotated corpus L = {(s_i, t_i)}_{i=1}^{n}:

O_syn(θ) = Σ_{i=1}^{n} [ max_{ℓ̂,d̂,t̂} Γ(ℓ̂, d̂, t̂ | s_i; θ) − max_{ℓ*,d*} Γ(ℓ*, d*, t_i | s_i; θ) ]_+

The first term in the above expression represents the best CCG parse of the sentence s_i according to the current model. The second term is the best parse of s_i whose syntactic tree equals the true syntactic tree t_i. In the above equation [·]_+ denotes the positive part of the expression. Minimizing this objective therefore finds parameters θ that reproduce the annotated syntactic trees.

4.2 Semantic Objective

The semantic objective corresponds to a distantly-supervised relation extraction task that constrains the logical forms produced by the semantic parser. Distant supervision is provided by the following constraint: every relation instance r(e1, e2) ∈ K must be expressed by at least one sentence in S(e1,e2), the set of sentences that mention both e1 and e2 (Hoffmann et al., 2011). If this constraint is empirically true and sufficiently constrains the parser's logical forms, then optimizing the semantic objective produces an accurate semantic parser.

A training example in the semantic objective consists of the set of sentences mentioning a pair of entities, S(e1,e2) = {s1, s2, ...}, paired with a binary vector representing the set of relations that the two entities participate in, y(e1,e2). The distant supervision constraint Ψ forces the logical forms predicted for the sentences to entail the relations in y(e1,e2). Ψ is a deterministic OR constraint that checks whether each logical form entails the relation instance r(e1, e2), deterministically setting y_r = 1 if any logical form entails the instance and y_r = 0 otherwise.

Let (ℓ̂, d̂, t̂) represent a collection of semantic parses for the sentences S = S(e1,e2). Let Γ(ℓ, d, t | S; θ) = Σ_{i=1}^{|S|} Γ(ℓ_i, d_i, t_i | s_i; θ) represent the total weight assigned by the parser to a collection of parses for the sentences S. For the pair of entities (e1, e2), the semantic objective is:

O_sem(θ) = [ max_{ℓ̂,d̂,t̂} Γ(ℓ̂, d̂, t̂ | S; θ) − max_{ℓ*,d*,t*} ( Ψ(y(e1,e2), ℓ*, d*, t*) + Γ(ℓ*, d*, t* | S; θ) ) ]_+

4.3 Optimization

Training minimizes the joint objective using the structured perceptron algorithm, which can be viewed as the stochastic subgradient method (Ratliff et al., 2006) applied to the objective O(θ). We initialize the parameters to zero, i.e., θ^0 = 0. On each iteration, we sample either a syntactic example (s_i, t_i) or a semantic example (S(e1,e2), y(e1,e2)). If a syntactic example is sampled, we apply the following parameter update:

ℓ̂, d̂, t̂ ← arg max_{ℓ,d,t} Γ(ℓ, d, t | s_i; θ^t)
ℓ*, d* ← arg max_{ℓ,d} Γ(ℓ, d, t_i | s_i; θ^t)
θ^{t+1} ← θ^t + φ(d*, t_i, s_i) − φ(d̂, t̂, s_i)

This update moves the parameters toward the features of the best parse with the correct syntactic derivation, φ(d*, t_i, s_i). If a semantic example is sampled, we instead apply the following update:

ℓ̂, d̂, t̂ ← arg max_{ℓ,d,t} Γ(ℓ, d, t | S(e1,e2); θ^t)
ℓ*, d*, t* ← arg max_{ℓ,d,t} Γ(ℓ, d, t | S(e1,e2); θ^t) + Ψ(y(e1,e2), ℓ, d, t)
θ^{t+1} ← θ^t + φ(d*, t*, S(e1,e2)) − φ(d̂, t̂, S(e1,e2))

This update moves the parameters toward the features of the best set of parses that satisfy the distant supervision constraint. Training outputs the average of each iteration's parameters, θ̄ = (1/n) Σ_{t=1}^{n} θ^t. In practice, we train the parser by performing a single pass over the examples in the data set.

                                 Labeled Dependencies       Unlabeled Dependencies
                                 P      R      F            P      R      F          Coverage
ASP                              85.58  85.31  85.44        91.75  91.46  91.60      99.63
ASP-SYN                          86.06  85.84  85.95        92.13  91.89  92.01      99.63
C&C (Clark and Curran, 2007b)    88.34  86.96  87.64        93.74  92.28  93.00      99.63
(Hockenmaier, 2003a)             84.3   84.6   84.4         91.8   92.2   92.0       99.83

Table 2: Syntactic parsing results for Section 23 of CCGbank. Parser performance is measured using precision (P), recall (R) and F-measure (F) of labeled and unlabeled dependencies.

All of the maximizations above can be performed exactly using a CKY-style chart parsing algorithm, except for the last one. This maximization is intractable due to the coupling between logical forms in ℓ caused by enforcing the distant supervision constraint. We approximate this maximization in two steps. First, we perform a beam search to produce a list of candidate parses for each sentence s ∈ S(e1,e2). We then extract relation instances from each parse and apply the greedy inference algorithm from Hoffmann et al. (2011) to identify the best set of parses that satisfy the distant supervision constraint. The procedure skips any examples with sentences that cannot be parsed (due to beam search failures) or where the distant supervision constraint cannot be satisfied.

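A compact sketch of the two parameter updates above follows. It is schematic rather than a working trainer: best_parse, best_constrained_parse (standing in for the beam-search-plus-greedy-inference approximation), and features (φ) are assumed black boxes, and the example dictionaries are a hypothetical encoding of the two example types.

    def perceptron_step(theta, example, best_parse, best_constrained_parse, features):
        """One stochastic update on either a syntactic or a semantic example."""
        if example["kind"] == "syntactic":
            s, gold_tree = example["sentence"], example["tree"]
            pred = best_parse(theta, s)                  # unconstrained best parse
            gold = best_parse(theta, s, tree=gold_tree)  # best parse with the gold tree
            delta_pos, delta_neg = features(gold), features(pred)
        else:  # semantic example: sentences mentioning (e1, e2) plus relation labels y
            sents, y = example["sentences"], example["relations"]
            pred = [best_parse(theta, s) for s in sents]
            gold = best_constrained_parse(theta, sents, y)  # parses satisfying Psi
            delta_pos, delta_neg = sum_features(features, gold), sum_features(features, pred)
        for name, value in delta_pos.items():
            theta[name] = theta.get(name, 0.0) + value
        for name, value in delta_neg.items():
            theta[name] = theta.get(name, 0.0) - value
        return theta

    def sum_features(features, parses):
        """Sum phi over a collection of parses (one per sentence)."""
        total = {}
        for parse in parses:
            for name, value in features(parse).items():
                total[name] = total.get(name, 0.0) + value
        return total

As described above, training would call this update once per sampled example for a single pass over the data, then average the per-iteration parameter vectors.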
5 Experiments

The experiments below evaluate ASP's syntactic and semantic parsing ability. The parser is trained on CCGbank and a corpus of Wikipedia sentences, using NELL's predicate vocabulary. The syntactic analyses of the trained parser are evaluated against CCGbank, and its logical forms are evaluated on an information extraction task and against an annotated test set of Wikipedia sentences.

5.1 Data Sets

The data sets for the evaluation consist of CCGbank, a corpus of dependency-parsed Wikipedia sentences, and a logical knowledge base derived from NELL and Freebase. Sections 02-21 of CCGbank were used for training, Section 00 for validation, and Section 23 for the final results. The knowledge base's predicate vocabulary is taken from NELL, and its instances are taken from Freebase using a manually-constructed mapping between Freebase and NELL. Using Freebase relation instances produces cleaner training data than NELL's automatically-extracted instances.

Using the relation instances and Wikipedia sentences, we constructed a data set for distantly-supervised relation extraction. We identified mentions of entities in each sentence using simple string matching, then aggregated these sentences by entity pair. 20% of the entity pairs were set aside for validation. In the remaining training data, we downsampled entity pairs that did not participate in at least one relation. We further eliminated sentences containing more than 30 tokens. The resulting training corpus contains 25k entity pairs (half of which participate in a relation), 41k sentences, and 71 distinct relation predicates.

5.2 Grammar Construction

The grammar for ASP contains the annotated lexicon entries and grammar rules in Sections 02-21 of CCGbank, and additional semantic entries produced using a set of dependency parse heuristics.

The lexicon Λ contains all words that occur at least 20 times in CCGbank. Rare words are replaced by their part of speech. The head passing and dependency markup was generated using the rules of the C&C parser (Clark and Curran, 2007b). These lexicon entries are also annotated with logical forms capturing their head passing relationship. For example, the adjective category N1/N1 is annotated with the logical form λf.f. These entries are all assigned semantic type —.

We augment this lexicon with additional entries mapping words to logical forms with NELL predicates. These entries are instantiated using a set of dependency parse patterns, listed in an online appendix.² These patterns are applied to the training corpus, heuristically identifying verbs, prepositions, and possessives that express relations, and nouns that express categories. The patterns also include special cases for forms of "to be." This process generates ∼4000 entries (not counting entity names), representing 69 relations and 61 categories from NELL. Section 3.2 shows several lexicon entries generated by this process.

² http://rtw.ml.cmu.edu/acl2014_asp/

The parser's combinators include function application, composition, and crossed composition, as well as several binary and unary type-changing rules that occur in CCGbank. All combinators were restricted to only apply to categories that combine in Sections 02-21. Finally, the grammar includes a number of heuristically-instantiated binary rules of the form , N → N\N that instantiate a relation between adjacent nouns. These rules capture appositives and some other constructions.

Sentence: "St. John, a Mexican-American born in San Francisco, California, her family comes from Zacatecas, Mexico."
Extracted logical form: λx.∃y,z.M(x, "st. john") ∧ M(y, "san francisco") ∧ PERSONBORNINLOCATION(x, y) ∧ CITYLOCATEDINSTATE(y, z) ∧ M(z, "california")

Sentence: "The capital and largest city of Laos is Vientiane and other major cities include Luang Prabang, Savannakhet and Pakse."
Extracted logical form: ∃x,y.M(x, "vientiane") ∧ CITY(x) ∧ CITYCAPITALOFCOUNTRY(x, y) ∧ M(y, "laos")

Sentence: "Gellar next played a lead role in James Toback's critically unsuccessful independent "Harvard Man" (2001), where she played the daughter of a mobster."
Extracted logical form: λx.∃y.M(y, "james toback") ∧ DIRECTORDIRECTEDMOVIE(y, x) ∧ M(x, "harvard man")

Figure 2: Logical forms produced by ASP for sentences in the information extraction corpus. Each logical form is extracted from the underlined sentence portion.

[Figure 3: Logical form precision as a function of the expected number of correct extracted logical forms (0 to 900), for ASP, PIPELINE, and K&M-2012. ASP extracts more correct logical forms because it jointly analyzes syntax and semantics.]

5.3 Supertagging

Parsing in practice can be slow because the parser's lexicalized grammar permits a large number of parses for a sentence. We improve parser performance by performing supertagging (Bangalore and Joshi, 1999; Clark and Curran, 2004). We trained a logistic regression classifier to predict the syntactic category of each token in a sentence from features of the surrounding tokens and POS tags. Subsequent parsing is restricted to only consider categories whose probability is within a factor of α of the highest-scoring category. The parser uses a backoff strategy, first attempting to parse with the supertags from α = 0.01, backing off to α = 0.001 if the initial parsing attempt fails.

5.4 Syntactic Evaluation

The syntactic evaluation measures ASP's ability to reproduce the predicate-argument dependencies in CCGbank. As in previous work, our evaluation uses labeled and unlabeled dependencies. Labeled dependencies are dependency structures with both words and semantic types removed, leaving two word indexes, a syntactic category, and an argument number. Unlabeled dependencies further eliminate the syntactic category and argument number, leaving a pair of word indexes. Performance is measured using precision, recall, and F-measure against the annotated dependency structures in CCGbank. Precision is the fraction of predicted dependencies which are in CCGbank, recall is the fraction of CCGbank dependencies produced by the parser, and F-measure is the harmonic mean of precision and recall.

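The dependency-based metrics just defined amount to set operations over dependency tuples. A small sketch (a hypothetical helper, not the evaluation script behind Table 2):

    def dependency_prf(predicted, gold):
        """Precision, recall and F-measure over dependency tuples (labeled or unlabeled)."""
        predicted, gold = set(predicted), set(gold)
        correct = len(predicted & gold)
        p = correct / len(predicted) if predicted else 0.0
        r = correct / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Labeled dependencies keep (head index, syntactic category, argument number,
    # argument index); unlabeled dependencies keep only the pair of word indexes.
    gold = {(2, "(S[dcl]\\NP1)/NP2", 1, 0), (2, "(S[dcl]\\NP1)/NP2", 2, 4)}
    pred = {(2, "(S[dcl]\\NP1)/NP2", 2, 4)}
    print(dependency_prf(pred, gold))  # (1.0, 0.5, 0.666...)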
For comparison, we also trained a syntactic version of our parser, ASP-SYN, using only the CCGbank lexicon and grammar. Comparing against this parser lets us measure the effect of the relation extraction task on syntactic parsing.

Table 2 shows the results of our evaluation. For comparison, we include results for two existing syntactic CCG parsers: C&C, the current state-of-the-art CCG parser (Clark and Curran, 2007b), and the next best system (Hockenmaier, 2003a). Both ASP and ASP-SYN perform reasonably well, within 2.5% of the performance of C&C at the same coverage level. However, ASP-SYN outperforms ASP by around 0.5%, suggesting that ASP's additional semantic knowledge slightly hurts syntactic parsing performance. This performance loss appears to be largely due to poor entity mention detection, as we found that not using entity mention lexicon entries at test time improves ASP's labeled and unlabeled F-scores by 0.3% on Section 00. The knowledge base contains many infrequently-mentioned entities with common names; these entities contribute incorrect semantic type information that confuses the parser.

            Logical Form Accuracy    Extraction Precision    Extraction Recall
ASP         0.28                     0.90                    0.32
K&M-2012    0.14                     1.00                    0.06
PIPELINE    0.2                      0.63                    0.17

Table 3: Logical form accuracy and extraction precision/recall on the annotated test set. The high extraction recall for ASP shows that it produces more complete logical forms than either baseline.

5.5 Semantic Evaluation

We performed two semantic evaluations to better understand ASP's ability to construct logical forms. The first evaluation emphasizes precision over recall, and the second evaluation accurately measures recall using a manually labeled test set.

5.5.1 Baselines

For comparison, we also trained two baseline models. The first baseline, PIPELINE, is a pipelined syntax-then-semantics approach designed to mimic Boxer (Bos, 2005). This baseline first syntactically parses each sentence using ASP-SYN, then produces a semantic analysis by assigning a logical form to each word. We train this baseline using the semantic objective (Section 4.2) while holding fixed the syntactic parse of each sentence. Note that, unlike Boxer, this baseline learns which logical form to assign each word, and its logical forms contain NELL predicates.

The second baseline, K&M-2012, is the approach of Krishnamurthy and Mitchell (2012), representing the state-of-the-art in distantly-supervised semantic parsing. This approach trains a semantic parser by combining distant semantic supervision with syntactic supervision from dependency parses. The best performing variant of this system also uses dependency parses at test time to constrain the interpretation of test sentences – hence, this system also uses a pipelined syntax-then-semantics approach. To improve comparability, we reimplemented this approach using our parsing model, which has richer features than were used in their paper.

5.5.2 Information Extraction Evaluation

The information extraction evaluation uses each system to extract logical forms from a large corpus of sentences, then measures the fraction of extracted logical forms that are correct. The test set consists of 8.5k sentences sampled from the held-out Wikipedia sentences. Each system was run on this data set, extracting all logical forms from each sentence that entailed at least one category or relation instance. We ranked these extractions using the parser's inside chart score, then manually annotated a sample of 250 logical forms from each system for correctness. Logical forms were marked correct if all category and relation instances entailed by the logical form were expressed by the sentence. Note that a correct logical form need not entail all of the relations expressed by the sentence, reflecting an emphasis on precision over recall. Figure 2 shows some example logical forms produced by ASP in the evaluation.

The annotated sample of logical forms allows us to estimate precision for each system as a function of the number of correct extractions (Figure 3). The number of correct extractions is directly proportional to recall, and was estimated from the total number of extractions and precision at each rank in the sample. All three systems initially have high precision, implying that their extracted logical forms express facts found in the sentence. However, ASP produces 3 times more correct logical forms than either baseline because it jointly analyzes syntax and semantics. The baselines suffer from reduced recall because they depend on receiving an accurate syntactic parse as input; syntactic parsing errors cause these systems to fail.

Examining the incorrect logical forms produced by ASP reveals that incorrect mention detection is by far the most common source of mistakes. Approximately 50% of errors are caused by marking common nouns as entity mentions (e.g., marking "coin" as a COMPANY). These errors occur because the knowledge base contains many infrequently mentioned entities with relatively common names. Another 30% of errors are caused by assigning an incorrect type to a common proper noun (e.g., marking "Bolivia" as a CITY). This analysis suggests that performing entity linking before parsing could significantly reduce errors.

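The estimate behind Figure 3 (precision and the expected number of correct extractions, computed from the ranked annotated sample) can be sketched as follows. This is one plausible estimator written for illustration, not necessarily the exact computation used in the paper.

    def precision_curve(sample_correct, total_extractions):
        """Estimate (expected correct extractions, precision) at each sample rank.

        sample_correct: correctness booleans for the annotated sample, in rank order.
        total_extractions: size of the full ranked list of extracted logical forms.
        """
        points, correct = [], 0
        for k, is_correct in enumerate(sample_correct, start=1):
            correct += is_correct
            precision = correct / k
            # Scale the sample rank up to the full list to estimate how many correct
            # logical forms have been extracted by this point in the ranking.
            expected_correct = precision * total_extractions * (k / len(sample_correct))
            points.append((expected_correct, precision))
        return points

    print(precision_curve([True, True, False, True], total_extractions=1000))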
5.5.3 Annotated Sentence Evaluation

A limitation of the previous evaluation is that it does not measure the completeness of predicted logical forms, nor estimate what portion of sentences are left unanalyzed. We conducted a second evaluation to measure these quantities.

The data for this evaluation consists of sentences annotated with logical forms for subspans. We manually annotated Wikipedia sentences from the held-out set with logical forms for the largest subspans for which a logical form existed. To avoid trivial cases, we only annotated logical forms containing at least one category or relation predicate and at least one mention. We also chose not to annotate mentions of entities that are not in the knowledge base, as no system would be able to correctly identify them. The corpus contains 97 sentences with 100 annotated logical forms.

We measured performance using two metrics: logical form accuracy, and extraction precision/recall. Logical form accuracy examines the predicted logical form for the smallest subspan of the sentence containing the annotated span, and marks this prediction correct if it exactly matches the annotation. A limitation of this metric is that it does not assign partial credit to logical forms that are close to, but do not exactly match, the annotation. The extraction metric assigns partial credit by computing the precision and recall of the category and relation instances entailed by the predicted logical form, using those entailed by the annotated logical form as the gold standard. Figure 4 shows the computation of both error metrics on two examples from the test corpus.

Sentence: De Niro and Joe Pesci in "Goodfellas" offered a virtuoso display of the director's bravura cinematic technique and reestablished, enhanced, and consolidated his reputation.
Annotation: LF: λx.∀p ∈ {λd.M(d, "de niro"), λj.M(j, "joe pesci")}.∃y.p(x) ∧ STARREDINMOVIE(x, y) ∧ M(y, "goodfellas")
            Instances: STARREDINMOVIE(de niro, goodfellas), STARREDINMOVIE(joe pesci, goodfellas)
Prediction: LF: λx.∀p ∈ {λd.M(d, "de niro"), λj.M(j, "joe pesci")}.∃y.p(x) ∧ STARREDINMOVIE(x, y) ∧ M(y, "goodfellas")
            Instances: STARREDINMOVIE(de niro, goodfellas), STARREDINMOVIE(joe pesci, goodfellas)
Logical form accuracy: 1 / 1.  Extraction precision: 2 / 2.  Extraction recall: 2 / 2.

Sentence: In addition to the University of Illinois, Champaign is also home to Parkland College.
Annotation: LF: ∃c,p.M(c, "champaign") ∧ CITY(c) ∧ M(p, "parkland college") ∧ UNIVERSITYINCITY(p, c)
            Instances: CITY(champaign), UNIVERSITYINCITY(parkland college, champaign)
Prediction: LF 1: λx.∃y.M(y, "illinois") ∧ M(x, "university") ∧ CITYLOCATEDINSTATE(x, y)
            LF 2: ∃c,p.M(c, "champaign") ∧ CITY(c) ∧ M(p, "parkland college") ∧ UNIVERSITYINCITY(p, c)
            Instances: CITY(champaign), UNIVERSITYINCITY(parkland college, champaign), CITYLOCATEDINSTATE(university, illinois)
Logical form accuracy: 1 / 1.  Extraction precision: 2 / 3.  Extraction recall: 2 / 2.

Figure 4: Two test examples with ASP's predictions and error calculations. The annotated logical forms are for the italicized sentence spans, while the extracted logical forms are for the underlined spans.

Table 3 shows the results of the annotated sentence evaluation. ASP outperforms both baselines in logical form accuracy and extraction recall, suggesting that it produces more complete analyses than either baseline. The extraction precision of 90% suggests that ASP rarely extracts incorrect information. Precision is higher in this evaluation because every sentence in the data set has at least one correct extraction.

6 Discussion

We present an approach to training a joint syntactic and semantic parser. Our parser ASP produces a full syntactic parse of any sentence, while simultaneously producing logical forms for sentence spans that have a semantic representation within its predicate vocabulary. The parser is trained by jointly optimizing performance on a syntactic parsing task and a distantly-supervised relation extraction task. Experimental results demonstrate that jointly analyzing syntax and semantics triples the number of extracted logical forms over approaches that first analyze syntax, then semantics. However, we also find that incorporating semantics slightly reduces syntactic parsing performance. Poor entity mention detection is a major source of error in both cases, suggesting that future work should consider integrating entity linking with joint syntactic and semantic parsing.

Acknowledgments

This work was supported in part by DARPA under award FA8750-13-2-0005. We additionally thank Jamie Callan and Chris Ré's Hazy group for collecting and processing the Wikipedia corpus.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1.

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: an approach to almost parsing. Computational Linguistics, 25(2):237–265.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Johan Bos. 2005. Towards wide-coverage semantic interpretation. In Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6).

Qingqing Cai and Alexander Yates. 2013a. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Qingqing Cai and Alexander Yates. 2013b. Semantic parsing Freebase: Towards open-domain semantic parsing. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM).

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence.

Stephen Clark and James R. Curran. 2004. The importance of supertagging for wide-coverage CCG parsing. In Proceedings of the 20th International Conference on Computational Linguistics.

Stephen Clark and James R. Curran. 2007a. Perceptron training for a wide-coverage lexicalized-grammar parser. In Proceedings of the Workshop on Deep Linguistic Processing.

Stephen Clark and James R. Curran. 2007b. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493–552.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning.

Ruifang Ge and Raymond J. Mooney. 2005. A statistical semantic parser that integrates syntax and semantics. In Proceedings of the Ninth Conference on Computational Natural Language Learning.

Julia Hockenmaier and Mark Steedman. 2002a. Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation.

Julia Hockenmaier and Mark Steedman. 2002b. Generative models for statistical parsing with combinatory categorial grammar. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Julia Hockenmaier. 2003a. Data and Models for Statistical Parsing with Combinatory Categorial Grammar. Ph.D. thesis, University of Edinburgh.

Julia Hockenmaier. 2003b. Parsing with generative models of predicate-argument structure. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke S. Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Rohit J. Kate and Raymond J. Mooney. 2006. Using string-kernels for learning semantic parsers. In 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference.

Jayant Krishnamurthy and Tom M. Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Mike Lewis and Mark Steedman. 2013. Combined distributional and logical semantics. Transactions of the Association for Computational Linguistics, 1:179–192.

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the Association for Computational Linguistics, Portland, Oregon.

Wei Lu, Hwee Tou Ng, Wee Sun Lee, and Luke S. Zettlemoyer. 2008. A generative model for parsing natural language to meaning representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. 2006. (Online) subgradient methods for structured prediction. In Artificial Intelligence and Statistics.

Mark Steedman. 1996. Surface Structure and Interpretation. The MIT Press, Cambridge, MA, USA.

Yuk Wah Wong and Raymond J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL.

Yuk Wah Wong and Raymond J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, Maya Ramanath, Volker Tresp, and Gerhard Weikum. 2012. Natural language questions for the web of data. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In UAI '05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence.
