Semantic Parsing on Freebase from Question-Answer Pairs

Jonathan Berant, Andrew Chou, Roy Frostig, Percy Liang
Computer Science Department, Stanford University
{joberant,akchou}@stanford.edu  {rf,pliang}@cs.stanford.edu

Abstract

In this paper, we train a semantic parser that scales up to Freebase. Instead of relying on annotated logical forms, which is especially expensive to obtain at large scale, we learn from question-answer pairs. The main challenge in this setting is narrowing down the huge number of possible logical predicates for a given question.
We tackle this problem in two ways: First, we build a coarse mapping from phrases to predicates using a knowledge base and a large text corpus. Second, we use a bridging operation to generate additional predicates based on neighboring predicates. On the dataset of Cai and Yates (2013), despite not having annotated logical forms, our system outperforms their state-of-the-art parser. Additionally, we collected a more realistic and challenging dataset of question-answer pairs, on which our system improves over a natural baseline.

Figure 1: Our task is to map questions to answers via latent logical forms (e.g., mapping "Which college did Obama go to?" to Type.University ⊓ Education.BarackObama). To narrow down the space of logical predicates, we use (i) a coarse alignment based on Freebase and a text corpus and (ii) a bridging operation that generates predicates compatible with neighboring predicates.

1 Introduction

We focus on the problem of semantic parsing natural language utterances into logical forms that can be executed to produce denotations. Traditional semantic parsers (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Kwiatkowski et al., 2010) have two limitations: (i) they require annotated logical forms as supervision, and (ii) they operate in limited domains with a small number of logical predicates. Recent developments aim to lift these limitations, either by reducing the amount of supervision (Clarke et al., 2010; Liang et al., 2011; Goldwasser et al., 2011; Artzi and Zettlemoyer, 2011) or by increasing the number of logical predicates (Cai and Yates, 2013). The goal of this paper is to do both: learn a semantic parser without annotated logical forms that scales to the large number of predicates on Freebase.

At the lexical level, a major challenge in semantic parsing is mapping natural language phrases (e.g., "attend") to logical predicates (e.g., Education). While limited-domain semantic parsers are able to learn the lexicon from per-example supervision (Kwiatkowski et al., 2011; Liang et al., 2011), at large scale they have inadequate coverage (Cai and Yates, 2013). Previous work on semantic parsing on Freebase uses a combination of manual rules (Yahya et al., 2012; Unger et al., 2012), distant supervision (Krishnamurthy and Mitchell, 2012), and schema matching (Cai and Yates, 2013). We use a large amount of web text and a knowledge base to build a coarse alignment between phrases and predicates—an approach similar in spirit to Cai and Yates (2013).

However, this alignment only allows us to generate a subset of the desired predicates. Aligning light verbs (e.g., "go") and prepositions is not very informative due to polysemy, and rare predicates (e.g., "cover price") are difficult to cover even given a large corpus. To improve coverage, we propose a new bridging operation that generates predicates based on adjacent predicates rather than on words.

At the compositional level, a semantic parser must combine the predicates into a coherent logical form. Previous work based on CCG requires manually specifying combination rules (Krishnamurthy and Mitchell, 2012) or inducing the rules from annotated logical forms (Kwiatkowski et al., 2010; Cai and Yates, 2013). We instead define a few simple composition rules which over-generate and then use model features to simulate soft rules and categories. In particular, we use POS tag features and features on the denotations of the predicted logical forms.

We experimented with two datasets on Freebase. First, on the dataset of Cai and Yates (2013), we showed that our system outperforms their state-of-the-art system 62% to 59%, despite using no annotated logical forms. Second, we collected a new realistic dataset of questions by performing a breadth-first search using the Google Suggest API; these questions are then answered by Amazon Mechanical Turk workers. Although this dataset is much more challenging and noisy, we are still able to achieve 31.4% accuracy, a 4.5% absolute improvement over a natural baseline. Both datasets as well as the source code for SEMPRE, our semantic parser, are publicly released and can be downloaded from http://nlp.stanford.edu/software/sempre/.

2 Setup

Problem Statement Our task is as follows: Given (i) a knowledge base K, and (ii) a training set of question-answer pairs {(x_i, y_i)}_{i=1}^n, output a semantic parser that maps new questions x to answers y via latent logical forms z and the knowledge base K.

2.1 Knowledge base

Let E denote a set of entities (e.g., BarackObama), and let P denote a set of properties (e.g., PlaceOfBirth). A knowledge base K is a set of assertions (e1, p, e2) ∈ E × P × E (e.g., (BarackObama, PlaceOfBirth, Honolulu)). We use the Freebase knowledge base (Google, 2013), which has 41M non-numeric entities, 19K properties, and 596M assertions.[1]

[1] In this paper, we condense Freebase names for readability (/people/person becomes Person).

2.2 Logical forms
To query the knowledge base, we use a logical language called Lambda Dependency-Based Compositional Semantics (λ-DCS)—see Liang (2013) for details. For the purposes of this paper, we use a restricted subset called simple λ-DCS, which we will define below for the sake of completeness.

The chief motivation of λ-DCS is to produce logical forms that are simpler than lambda calculus forms. For example, λx.∃a.p1(x, a) ∧ ∃b.p2(a, b) ∧ p3(b, e) is expressed compactly in λ-DCS as p1.p2.p3.e. Like DCS (Liang et al., 2011), λ-DCS makes existential quantification implicit, thereby reducing the number of variables. Variables are only used for anaphora and building composite binary predicates; these do not appear in simple λ-DCS.

Each logical form in simple λ-DCS is either a unary (which denotes a subset of E) or a binary (which denotes a subset of E × E). The basic λ-DCS logical forms z and their denotations ⟦z⟧_K are defined recursively as follows:

• Unary base case: If e ∈ E is an entity (e.g., Seattle), then e is a unary logical form with ⟦e⟧_K = {e}.
• Binary base case: If p ∈ P is a property (e.g., PlaceOfBirth), then p is a binary logical form with ⟦p⟧_K = {(e1, e2) : (e1, p, e2) ∈ K}.[2]
• Join: If b is a binary and u is a unary, then b.u (e.g., PlaceOfBirth.Seattle) is a unary denoting a join and project: ⟦b.u⟧_K = {e1 ∈ E : ∃e2. (e1, e2) ∈ ⟦b⟧_K ∧ e2 ∈ ⟦u⟧_K}.
• Intersection: If u1 and u2 are both unaries, then u1 ⊓ u2 (e.g., Profession.Scientist ⊓ PlaceOfBirth.Seattle) denotes set intersection: ⟦u1 ⊓ u2⟧_K = ⟦u1⟧_K ∩ ⟦u2⟧_K.
• Aggregation: If u is a unary, then count(u) denotes the cardinality: ⟦count(u)⟧_K = {|⟦u⟧_K|}.

[2] Binaries can also be built out of lambda abstractions (e.g., λx.Performance.Actor.x), but as these constructions are not central to this paper, we defer to Liang (2013).

As a final example, "number of dramas starring Tom Cruise" in lambda calculus would be represented as count(λx.Genre(x, Drama) ∧ ∃y.Performance(x, y) ∧ Actor(y, TomCruise)); in λ-DCS, it is simply count(Genre.Drama ⊓ Performance.Actor.TomCruise).

It is useful to think of the knowledge base K as a directed graph in which entities are nodes and properties are labels on the edges. Then simple λ-DCS unary logical forms are tree-like graph patterns which pick out a subset of the nodes.
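The recursive definitions above translate almost directly into code. Below is a minimal sketch of evaluating simple λ-DCS unary forms against a toy knowledge base of (e1, p, e2) assertions; it is our own illustration rather than the SEMPRE implementation, and the logical-form classes and toy facts are invented for the example.

```python
from dataclasses import dataclass
from typing import Set

# Toy knowledge base: a set of assertions (e1, p, e2), as in Section 2.1.
KB = {
    ("BarackObama", "PlaceOfBirth", "Honolulu"),
    ("BarackObama", "Education", "OccidentalCollege"),
    ("Honolulu", "ContainedBy", "Hawaii"),
}

@dataclass(frozen=True)
class Entity:          # unary base case, e.g., Honolulu
    name: str

@dataclass(frozen=True)
class Join:            # b.u, e.g., PlaceOfBirth.Honolulu
    prop: str
    unary: object

@dataclass(frozen=True)
class Intersect:       # u1 ⊓ u2
    left: object
    right: object

@dataclass(frozen=True)
class Count:           # count(u)
    unary: object

def denote(z, kb) -> Set:
    """Compute the denotation of a simple λ-DCS unary logical form."""
    if isinstance(z, Entity):
        return {z.name}
    if isinstance(z, Join):
        targets = denote(z.unary, kb)
        return {e1 for (e1, p, e2) in kb if p == z.prop and e2 in targets}
    if isinstance(z, Intersect):
        return denote(z.left, kb) & denote(z.right, kb)
    if isinstance(z, Count):
        return {len(denote(z.unary, kb))}
    raise ValueError(f"unknown logical form: {z}")

# PlaceOfBirth.Honolulu ⊓ Education.OccidentalCollege -> {"BarackObama"}
z = Intersect(Join("PlaceOfBirth", Entity("Honolulu")),
              Join("Education", Entity("OccidentalCollege")))
print(denote(z, KB))
```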
Figure 2: An example of a derivation d of the utterance "Where was Obama born?" and its sub-derivations, each labeled with composition rule and logical form. The derivation d skips the words "was" and "?". (The root logical form is Type.Location ⊓ PeopleBornHere.BarackObama: "Where" maps to Type.Location, "Obama" and "born" map via the lexicon to BarackObama and PeopleBornHere, which combine by a join and then an intersection.)

2.3 Framework

Given an utterance x, our semantic parser constructs a distribution over possible derivations D(x). Each derivation d ∈ D(x) is a tree specifying the application of a set of combination rules that culminates in the logical form d.z at the root of the tree—see Figure 2 for an example.

Composition Derivations are constructed recursively based on (i) a lexicon mapping natural language phrases to knowledge base predicates, and (ii) a small set of composition rules. More specifically, we build a set of derivations for each span of the utterance. We first use the lexicon to generate single-predicate derivations for any matching span (e.g., "born" maps to PeopleBornHere). Then, given any logical form z1 that has been constructed over the span [i1 : j1] and z2 over a non-overlapping span [i2 : j2], we generate the following logical forms over the enclosing span [min(i1, i2) : max(j1, j2)]: intersection z1 ⊓ z2, join z1.z2, aggregation z1(z2) (e.g., if z1 = count), or bridging z1 ⊓ p.z2 for any property p ∈ P (explained more in Section 3.2).[3] Note that the construction of derivations D(x) allows us to skip any words, and in general heavily over-generates. We instead rely on features and learning to guide us away from the bad derivations.

[3] We also discard logical forms that are incompatible according to the Freebase types (e.g., Profession.Politician ⊓ Type.City would be rejected).

Modeling Following Zettlemoyer and Collins (2005) and Liang et al. (2011), we define a discriminative log-linear model over derivations d ∈ D(x) given utterances x:

p_θ(d | x) = exp{φ(x, d)ᵀθ} / Σ_{d′ ∈ D(x)} exp{φ(x, d′)ᵀθ},

where φ(x, d) is a feature vector extracted from the utterance and the derivation, and θ ∈ R^b is the vector of parameters to be learned. As our training data consists only of question-answer pairs (x_i, y_i), we maximize the log-likelihood of the correct answer (⟦d.z⟧_K = y_i), summing over the latent derivation d. Formally, our training objective is

O(θ) = Σ_{i=1}^{n} log Σ_{d ∈ D(x_i) : ⟦d.z⟧_K = y_i} p_θ(d | x_i).   (1)

Section 4 describes an approximation of this objective that we maximize to choose parameters θ.
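To make the modeling section concrete, here is a small sketch of the log-linear distribution over a beam of candidate derivations and the gradient of the marginal log-likelihood in (1), which is the quantity a stochastic gradient method such as AdaGrad (used in Section 4) would ascend. This is our own illustration rather than the SEMPRE code; feature names and the toy beam are invented, and derivations are simplified to (feature dict, denotation) pairs.

```python
import math
from collections import defaultdict

def log_linear(derivations, theta):
    """p_theta(d | x) over a list of (features, denotation) candidates."""
    scores = [sum(theta.get(f, 0.0) * v for f, v in feats.items())
              for feats, _ in derivations]
    m = max(scores)
    exp_scores = [math.exp(s - m) for s in scores]   # subtract max for stability
    Z = sum(exp_scores)
    return [s / Z for s in exp_scores]

def marginal_loglik_gradient(derivations, answer, theta):
    """Gradient of log sum_{d : denotation(d) = answer} p_theta(d | x).
    Equals E[phi | correct derivations] - E[phi], the usual latent-variable update."""
    p = log_linear(derivations, theta)
    correct = {i for i, (_, den) in enumerate(derivations) if den == answer}
    if not correct:
        return None, {}   # the beam contains no derivation with the right answer
    p_correct = sum(p[i] for i in correct)
    grad = defaultdict(float)
    for i, (feats, _) in enumerate(derivations):
        # posterior over correct derivations minus overall model expectation
        w = (p[i] / p_correct if i in correct else 0.0) - p[i]
        for f, v in feats.items():
            grad[f] += w * v
    return math.log(p_correct), dict(grad)

# Two toy candidate derivations for "Where was Obama born?"
beam = [
    ({"lex:born=PeopleBornHere": 1.0, "denotation-size=1": 1.0}, frozenset({"Honolulu"})),
    ({"lex:born=DateOfBirth": 1.0, "denotation-size=1": 1.0}, frozenset({"1961"})),
]
loglik, grad = marginal_loglik_gradient(beam, frozenset({"Honolulu"}), theta={})
print(loglik, grad)
```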
3 Approach

Our knowledge base has more than 19,000 properties, so a major challenge is generating a manageable set of predicates for an utterance. We propose two strategies for doing this. First (Section 3.1), we construct a lexicon that maps natural language phrases to logical predicates by aligning a large text corpus to Freebase, reminiscent of Cai and Yates (2013). Second, we generate logical predicates compatible with neighboring predicates using the bridging operation (Section 3.2). Bridging is crucial when the alignment lexicon fails to cover the needed predicate. The derivations produced by combining these predicates are scored using features that capture lexical, syntactic, and semantic regularities (Section 3.3).

3.1 Alignment

We now discuss the construction of a lexicon L, which is a mapping from natural language phrases to logical predicates accompanied by a set of features. Specifically, for a phrase w (e.g., "born in"), L(w) is a set of entries (z, s), where z is a predicate and s is the set of features. A lexicon is constructed by alignment of a large text corpus to the knowledge base (KB). Intuitively, a phrase and a predicate align if they co-occur with many of the same entities.

Here is a summary of our alignment procedure: We construct a set of typed[4] phrases R1 (e.g., "born in"[Person,Location]) and predicates R2 (e.g., PlaceOfBirth). For each r ∈ R1 ∪ R2, we create its extension F(r), which is a set of co-occurring entity pairs (e.g., F("born in"[Person,Location]) = {(BarackObama, Honolulu), ...}). The lexicon is generated based on the overlap F(r1) ∩ F(r2), for r1 ∈ R1 and r2 ∈ R2.

[4] Freebase associates each entity with a set of types using the Type property.

Figure 3: We construct a bipartite graph over phrases R1 (e.g., "grew up in"[Person,Location], "born in"[Person,Date]) and predicates R2 (e.g., PlaceOfBirth, DateOfBirth, Marriage.StartDate, PlacesLived.Location). Each edge (r1, r2) is associated with alignment features (the figure shows an example edge with log-phrase-count = log(15765), log-predicate-count = log(9182), log-intersection-count = log(6048), KB-best-match = 0).

Typed phrases 15 million triples (e1, r, e2) (e.g., ("Obama", "was also born in", "August 1961")) were extracted from ClueWeb09 using the ReVerb open IE system (Fader et al., 2011). Lin et al. (2012)[5] released a subset of these triples where they were able to substitute the subject arguments with KB entities. We downloaded their dataset and heuristically replaced object arguments with KB entities by walking on the Freebase graph from subject KB entities and performing simple string matching. In addition, we normalized dates with SUTime (Chang and Manning, 2012).

[5] http://knowitall.cs.washington.edu/linked_extractions/

We lemmatize and normalize each text phrase r ∈ R1 and augment it with a type signature [t1, t2] to deal with polysemy ("born in" could either map to PlaceOfBirth or DateOfBirth). We add an entity pair (e1, e2) to the extension F(r[t1, t2]) if the (Freebase) type of e1 (e2) is t1 (t2). For example, (BarackObama, 1961) is added to F("born in"[Person, Date]). We perform a similar procedure that uses a Hearst-like pattern (Hearst, 1992) to map phrases to unary predicates. If a text phrase r ∈ R1 matches the pattern "(is|was a|the) x IN", where IN is a preposition, then we add e1 to F(x). For (Honolulu, "is a city in", Hawaii), we extract x = "city" and add Honolulu to F("city"). From the initial 15M triples, we extracted 55,081 typed binary phrases (9,456 untyped) and 6,299 unary phrases.

Logical predicates Binary logical predicates contain (i) all KB properties[6] and (ii) concatenations of two properties p1.p2 if the intermediate type represents an event (e.g., the married to relation is represented by Marriage.Spouse). For unary predicates, we consider all logical forms Type.t and Profession.t for all (abstract) entities t ∈ E (e.g., Type.Book and Profession.Author). The set of logical predicates considered during alignment is restricted in this paper, but automatic induction of more compositional logical predicates is an interesting direction. Finally, we define the extension of a logical predicate r2 ∈ R2 to be its denotation, that is, the corresponding set of entities or entity pairs.

[6] We filter properties from the domains user and base.

Lexicon construction Given typed phrases R1, logical predicates R2, and their extensions F, we now generate the lexicon. It is useful to think of a bipartite graph with left nodes R1 and right nodes R2 (Figure 3). We add an edge (r1, r2) if (i) the type signatures of r1 and r2 match and (ii) their extensions have non-empty overlap (F(r1) ∩ F(r2) ≠ ∅). Our final graph contains 109K edges for binary predicates and 294K edges for unary predicates.

Naturally, non-zero overlap by no means guarantees that r1 should map to r2. In our noisy data, even "born in" and Marriage.EndDate co-occur 4 times. Rather than thresholding based on some criterion, we compute a set of features, which are used by the model downstream in conjunction with other sources of information.

We compute three types of features (Table 1). Alignment features are unlexicalized and measure association based on argument overlap. Lexicalized features are standard conjunctions of the phrase w and the logical form z. Text similarity features compare the (untyped) phrase (e.g., "born") to the Freebase name of the logical predicate (e.g., "People born here"): given the phrase r1 and the Freebase name s2 of the predicate r2, we compute string similarity features such as whether r1 and s2 are equal, as well as some other measures of token overlap.
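To illustrate the lexicon construction just described, the following simplified sketch intersects toy extensions and records the unlexicalized alignment features of Table 1 on each surviving edge. The extensions and counts are made up, and the real pipeline operates over millions of linked ReVerb triples rather than Python dicts.

```python
import math

# Toy extensions F(r): typed phrases (R1) and logical predicates (R2)
# map to sets of co-occurring entity pairs.
phrase_ext = {
    ("born in", ("Person", "Location")): {("BarackObama", "Honolulu"),
                                          ("MichelleObama", "Chicago")},
    ("born in", ("Person", "Date")):     {("BarackObama", "1961")},
}
predicate_ext = {
    ("PlaceOfBirth", ("Person", "Location")): {("BarackObama", "Honolulu"),
                                               ("RandomPerson", "Seattle")},
    ("DateOfBirth", ("Person", "Date")):      {("BarackObama", "1961")},
}

def build_lexicon(phrase_ext, predicate_ext):
    """Add an edge (r1, r2) when type signatures match and extensions overlap;
    attach unlexicalized alignment features (cf. Table 1)."""
    lexicon = {}
    for (phrase, sig1), f1 in phrase_ext.items():
        # best match: the predicate with the largest overlap for this phrase
        overlaps = {pred: len(f1 & f2)
                    for (pred, sig2), f2 in predicate_ext.items() if sig1 == sig2}
        best = max(overlaps, key=overlaps.get, default=None)
        for (pred, sig2), f2 in predicate_ext.items():
            inter = f1 & f2
            if sig1 != sig2 or not inter:
                continue
            lexicon.setdefault(phrase, []).append((pred, {
                "log-phrase-count": math.log(len(f1)),
                "log-predicate-count": math.log(len(f2)),
                "log-intersection-count": math.log(len(inter)),
                "KB-best-match": 1.0 if pred == best else 0.0,
            }))
    return lexicon

print(build_lexicon(phrase_ext, predicate_ext)["born in"])
```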
Category         Description
Alignment        Log of # entity pairs that occur with the phrase r1 (|F(r1)|)
                 Log of # entity pairs that occur with the logical predicate r2 (|F(r2)|)
                 Log of # entity pairs that occur with both r1 and r2 (|F(r1) ∩ F(r2)|)
                 Whether r2 is the best match for r1 (r2 = arg max_r |F(r1) ∩ F(r)|)
Lexicalized      Conjunction of phrase w and predicate z
Text similarity  Phrase r1 is equal/prefix/suffix of s2
                 Phrase overlap of r1 and s2
Bridging         Log of # entity pairs that occur with the bridging predicate b (|F(b)|)
                 Kind of bridging (# unaries involved)
                 The binary b injected
Composition      # of intersect/join/bridging operations
                 POS tags in join/bridging and skipped words
                 Size of denotation of logical form

Table 1: Full set of features. For the alignment and text similarity features, r1 is a phrase, r2 is a predicate with Freebase name s2, and b is a binary predicate with type signature (t1, t2).

3.2 Bridging

While alignment can cover many predicates, it is unreliable for cases where the predicates are expressed weakly or implicitly. For example, in "What government does Chile have?", the predicate is expressed by the light verb have; in "What actors are in Top Gun?", it is expressed by a highly ambiguous preposition; and in "What is Italy money?" [sic], it is omitted altogether. Since natural language doesn't offer much help here, let us turn elsewhere for guidance. Recall that at this point our main goal is to generate a manageable set of candidate logical forms to be scored by the log-linear model.

In the first example, suppose the phrases "Chile" and "government" are parsed as Chile and Type.FormOfGovernment, respectively, and we hypothesize a connecting binary. The two predicates impose strong type constraints on that binary, so we can afford to generate all the binary predicates that type check (see Table 2). More formally, given two unaries z1 and z2 with types t1 and t2, we generate a logical form z1 ⊓ b.z2 for each binary b whose type signature is (t1, t2).[7] Figure 1 visualizes bridging of the unaries Type.University and Obama.

[7] Each Freebase property has a designated type signature, which can be extended to composite predicates, e.g., sig(Marriage.StartDate) = (Person, Date).

Now consider the example "What is the cover price of X-men?" Here, the binary ComicBookCoverPrice is expressed explicitly, but is not in our lexicon since the language use is rare. To handle this, we allow bridging to generate a binary based on a single unary; in this case, based on the unary X-Men (Table 2), we generate several binaries including ComicBookCoverPrice. Generically, given a unary z with type t, we construct a logical form b.z for any predicate b with type (∗, t).

Finally, consider the question "Who did Tom Cruise marry in 2006?". Suppose we parse the phrase "Tom Cruise marry" into
Marriage.Spouse.TomCruise, or more explicitly, λx.∃e.Marriage(x, e) ∧ Spouse(e, TomCruise). Here, the neo-Davidsonian event variable e is an intermediate quantity, but needs to be further modified (in this case, by the temporal modifier 2006). To handle this, we apply bridging to a unary and the intermediate event (see Table 2). Generically, given a logical form p1.p2.z where p2 has type (t1, ∗), and a unary z′ with type t, bridging injects z′ and

#  Form 1                     Form 2  Bridging
1  Type.FormOfGovernment      Chile   Type.FormOfGovernment ⊓ GovernmentTypeOf.Chile
2  X-Men                              ComicBookCoverPriceOf.X-Men
3  Marriage.Spouse.TomCruise  2006    Marriage.(Spouse.TomCruise ⊓ StartDate.2006)

Table 2: Three examples of the bridging operation. The bridging binary predicate b is in boldface.
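To make the first two bridging cases concrete, here is a small sketch assuming each binary predicate carries a declared type signature (t1, t2); the predicate inventory and signatures below are invented, and this is not how SEMPRE represents logical forms. The third case (injecting a modifier into an intermediate event) follows the same pattern, applied to the event's remaining argument slot.

```python
# Toy inventory of binary predicates with type signatures (t1, t2),
# standing in for the ~19K Freebase properties.
BINARY_SIGNATURES = {
    "GovernmentTypeOf": ("FormOfGovernment", "Country"),
    "ComicBookCoverPriceOf": ("Money", "ComicBook"),
    "PlaceOfBirth": ("Person", "Location"),
}

def bridge_two_unaries(z1, t1, z2, t2):
    """Case 1 (Table 2, row 1): connect two unaries z1, z2 of types t1, t2
    with every binary b whose signature is (t1, t2), producing z1 ⊓ b.z2."""
    return [f"{z1} ⊓ {b}.{z2}"
            for b, sig in BINARY_SIGNATURES.items() if sig == (t1, t2)]

def bridge_single_unary(z, t):
    """Case 2 (Table 2, row 2): generate b.z for every binary b whose second
    argument type is t; used when the binary is expressed too rarely to align."""
    return [f"{b}.{z}"
            for b, (_, t2) in BINARY_SIGNATURES.items() if t2 == t]

print(bridge_two_unaries("Type.FormOfGovernment", "FormOfGovernment",
                         "Chile", "Country"))
# -> ['Type.FormOfGovernment ⊓ GovernmentTypeOf.Chile']
print(bridge_single_unary("X-Men", "ComicBook"))
# -> ['ComicBookCoverPriceOf.X-Men']
```

In the full system, the generated candidates are not filtered further at this stage; the bridging features of Table 1 let the learned model decide among them.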

0 constructs a logical form p1.(p2.z u b.z) for each logical forms z1 and z2 via a join or bridging, we logical predicate b with type (t1, t). include a feature on the POS tag of (the first word In each of the three examples, bridging gener- spanned by) z1 conjoined with the POS tag corre- ates a binary predicate based on neighboring logi- sponding to z2. Rather than using head-modifier in- cal predicates rather than on explicit lexical material. formation from dependency trees (Branavan et al., In a way, our bridging operation shares with bridg- 2012; Krishnamurthy and Mitchell, 2012; Cai and ing anaphora (Clark, 1975) the idea of establishing Yates, 2013; Poon, 2013), we can learn the appro- a novel relation between distinct parts of a sentence. priate relationships tailored for downstream accu- Naturally, we need features to distinguish between racy. For example, the phrase “located” is aligned the generated predicates, or decide whether bridging to the predicate ContainedBy. POS features can de- is even appropriate at all. Given a binary b, features tect that if “located” precedes a noun phrase (“What include the log of the predicate count log |F(b)|, in- is located in Beijing?”), then the noun phrase is the dicators for the kind of bridging, an indicator on the object of the predicate, and if it follows the noun binary b for injections (Table 1). In addition, we add phrase (“Where is Beijing located?”), then it is in all text similarity features by comparing the Free- subject position. base name of b with content words in the question. Note that our three operations (intersection, join, and bridging) are quite permissive, and we rely on 3.3 Composition features, which encode soft, overlapping rules. In So far, we have mainly focused on the generation of contrast, CCG-based methods (Kwiatkowski et al., predicates. We now discuss three classes of features 2010; Kwiatkowski et al., 2011) encode the com- pertaining to their composition. bination preferences structurally in non-overlapping rules; these could be emulated with features with Rule features Each derivation d is the result of ap- weights clamped to −∞. plying some number of intersection, join, and bridg- ing operations. To control this number, we define Denotation features While it is clear that learning indicator features on each of these counts. This is in from denotations rather than logical forms is a draw- contrast to the norm of having a single feature whose back since it provides less information, it is less ob- value is equal to the count, which can only repre- vious that working with denotations actually gives sent one-sided preferences for having more or fewer us additional information. Specifically, we include of a given operation. Indicator features stabilize the four features indicating whether the denotation of model, preferring derivations with a well-balanced the predicted logical form has size 0, 1, 2, or at least inventory of operations. 3. This feature encodes presupposition constraints in a soft way: when people ask a question, usually Part-of-speech tag features To guide the compo- there is an answer and it is often unique. This allows sition of predicates, we use POS tags in two ways. us to favor logical forms with this property. First, we introduce features indicating when a word of a given POS tag is skipped, which could capture 4 Experiments the fact that skipping auxiliaries is generally accept- able, while skipping proper nouns is not. Second, We now evaluate our semantic parser empirically. 
we introduce features on the POS tags involved in a In Section 4.1, we compare our approach to Cai composition, inspired by dependency parsing (Mc- and Yates (2013) on their recently released dataset Donald et al., 2005). Specifically, when we combine (henceforth, FREE917) and present results on a new dataset that we collected (henceforth, WEBQUES- lexicon used by Cai and Yates (2013), which con- TIONS). In Section 4.2, we provide detailed experi- tains 1,100 entries. Because entity disambiguation ments to provide additional insight on our system. is a challenging problem in semantic parsing, the en- tity lexicon simplifies the problem. Setup We implemented a standard beam-based Following Cai and Yates (2013), we held out 30% bottom-up parser which stores the k-best derivations of the examples for the final test, and performed all for each span. We use k = 500 for all our experi- development on the remaining 70%. During devel- ments on FREE917 and k = 200 on WEBQUES- opment, we split the data and used 512 examples TIONS. The root beam yields the candidate set D˜(x) (80%) for training and the remaining 129 (20%) for and is used to approximate the sum in the objective validation. All reported development numbers are function O(θ) in (1). In experiments on WEBQUES- averaged across 3 random splits. We evaluated us- TIONS, D˜(x) contained 197 derivations on average. ing accuracy, the fraction of examples where the pre- We write the approximate objective as O(θ; θ˜) = P P dicted answer exactly matched the correct answer. i log d∈D˜(x ;θ˜): d.z =y p(d | xi; θ) to explic- i K i Our main empirical result is that our system, J K ˜ itly show dependence on the parameters θ used for which was trained only on question-answer pairs, beam search. We optimize the objective by initial- obtained 62% accuracy on the test set, outperform- izing θ0 to 0 and applying AdaGrad (stochastic gra- ing the 59% accuracy reported by Cai and Yates dient ascent with per-feature adaptive step size con- (2013), who trained on full logical forms. trol) (Duchi et al., 2010), so that θt+1 is set based on taking a stochastic approximation of ∂O(θ;θt) . 4.1.2 WEBQUESTIONS ∂θ θ=θt We make six passes over the training examples. Dataset collection Because FREE917 requires We used POS tagging and named-entity recogni- logical forms, it is difficult to scale up due to the tion to restrict what phrases in the utterance could required expertise of annotating logical forms. We be mapped by the lexicon. Entities must be named therefore created a new dataset, WEBQUESTIONS, entities, proper nouns or a sequence of at least two of question-answer pairs obtained from non-experts. tokens. Unaries must be a sequence of nouns, and To collect this dataset, we used the Google Sug- binaries must be either a content word, or a verb fol- gest API to obtain questions that begin with a wh- lowed by either a noun phrase or a particle. In addi- word and contain exactly one entity. We started with tion, we used 17 hand-written rules to map question the question “Where was Barack Obama born?” words such as “where” and “how many” to logical and performed a breadth-first search over questions forms such as Type.Location and Count. (nodes), using the Google Suggest API supplying To compute denotations, we convert a logical the edges of the graph. Specifically, we queried the form z into a SPARQL query and execute it on our question excluding the entity, the phrase before the copy of Freebase using the Virtuoso engine. 
On entity, or the phrase after it; each query generates 5 WEBQUESTIONS, a full run over the training exam- candidate questions, which are added to the queue. ples involves approximately 600,000 queries. For We iterated until 1M questions were visited; a ran- evaluation, we predict the answer from the deriva- dom 100K were submitted to Amazon Mechanical tion with highest probability. Turk (AMT). The AMT task requested that workers answer the 4.1 Main results question using only the Freebase page of the ques- 4.1.1 FREE917 tions’ entity, or otherwise mark it as unanswerable Cai and Yates (2013) created a dataset consist- by Freebase. The answer was restricted to be one of ing of 917 questions involving 635 Freebase rela- the possible entities, values, or list of entities on the tions, annotated with lambda calculus forms. We page. As this list was long, we allowed the user to converted all 917 questions into simple λ-DCS, ex- filter the list by typing. We paid the workers $0.03 ecuted them on Freebase and used the resulting an- per question. Out of 100K questions, 6,642 were swers to train and evaluate. To map phrases to Free- annotated identically by at least two AMT workers. base entities we used the manually-created entity We again held out a 35% random subset of the Dataset # examples # word types System FREE917 WebQ. GeoQuery 880 279 ALIGNMENT 38.0 30.6 ATIS 5,418 936 BRIDGING 66.9 21.2 FREE917 917 2,036 ALIGNMENT+BRIDGING 71.3 32.9 WEBQUESTIONS 5,810 4,525 Table 4: Accuracies on the development set under different Table 3: Statistics on various semantic parsing datasets. Our schemes of binary predicate generation. In ALIGNMENT, bi- new dataset, WEBQUESTIONS, is much larger than FREE917 naries are generated only via the alignment lexicon. In BRIDG- and much more lexically diverse than ATIS. ING, binaries are generated through the bridging operation only. ALIGNMENT+BRIDGING corresponds to the full system. questions for the final test, and performed all devel- opment on the remaining 65%, which was further As a baseline, we omit from our system the main divided into an 80%–20% split for training and val- contributions presented in this paper—that is, we idation. To map entities, we built a Lucene index disallow bridging, and remove denotation and align- over the 41M Freebase entities. ment features. The accuracy on the test set of this Table 3 provides some statistics about the new system is 26.9%, whereas our full system obtains questions. One major difference in the datasets is 31.4%, a significant improvement. the distribution of questions: FREE917 starts from Note that the number of possible derivations for Freebase properties and solicits questions about questions in WEBQUESTIONS is quite large. In the these properties; these questions tend to be tai- question “What kind of system of government does lored to the properties. WEBQUESTIONS starts from the United States have?” the phrase “United States” questions completely independent of Freebase, and maps to 231 entities in our lexicon, the verb “have” therefore the questions tend to be more natural and maps to 203 binaries, and the phrases “kind”, “sys- varied. For example, for the Freebase property tem”, and “government” all map to many different ComicGenre,FREE917 contains the question “What unary and binary predicates. 
Parsing correctly in- genre is Doonesbury?”, while WEBQUESTIONS for volves skipping some words, mapping other words the property MusicGenre contains “What music did to predicates, while resolving many ambiguities in Beethoven compose?”. the way that the various predicates can combine. The number of word types in WEBQUESTIONS is larger than in datasets such as ATIS and GeoQuery 4.2 Detailed analysis (Table 3), making lexical mapping much more chal- We now delve deeper to explore the contributions of lenging. On the other hand, in terms of structural the various components of our system. All ablation complexity WEBQUESTIONS is simpler and many results reported next were run on the development questions contain a unary, a binary and an entity. set (over 3 random splits). In some questions, the answer provided by AMT workers is only roughly accurate, because workers Generation of binary predicates Recall that our are restricted to selecting answers from the Freebase system has two mechanisms for suggesting binaries: page. For example, the answer given by workers to from the alignment lexicon or via the bridging op- the question “What is James Madison most famous eration. Table 4 shows accuracies when only one or for?” is “President of the United States” rather than both is used. Interestingly, alignment alone is better “Authoring the Bill of Rights”. than bridging alone on WEBQUESTIONS, whereas for FREE917, it is the opposite. The reason for this Results AMT workers sometimes provide partial is that FREE917 contains questions on rare pred- answers, e.g., the answer to “What movies does Tay- icates. These are often missing from the lexicon, lor Lautner play in?” is a set of 17 entities, out but tend to have distinctive types and hence can be of which only 10 appear on the Freebase page. We predicted from neighboring predicates. In contrast, therefore allow partial credit and score an answer us- WEBQUESTIONS contains questions that are com- ing the F1 measure, comparing the predicted set of monly searched for and focuses on popular predi- entities to the annotated set of entities. cates, therefore exhibiting larger lexical variation. System FREE917 WebQ. FULL 71.3 32.9 -POS 70.5 28.9 -DENOTATION 58.6 28.0

Table 5: Accuracies on the development set with features removed. POS and DENOTATION refer to the POS tag and denotation features from Section 3.3.

System                 FREE917  WebQ.
ALIGNMENT              71.3     32.9
LEXICALIZED            68.5     34.2
LEXICALIZED+ALIGNMENT  69.0     36.4

Table 6: Accuracies on the development set using either unlexicalized alignment features (ALIGNMENT) or lexicalized features (LEXICALIZED).

Figure 4: Beam of candidate derivations D̃(x) for 50 WEBQUESTIONS examples, shown after 0, 1, and 2 training iterations. In each matrix, columns correspond to examples and rows correspond to beam position (ranked by decreasing model score). Green cells mark the positions of derivations with correct denotations. Note that both the number of good derivations and their positions improve as θ is optimized.

For instance, when training without an alignment lexicon, the system errs on "When did Nathan Smith die?". Bridging suggests binaries that are compatible with the common types Person and Datetime, and the binary PlaceOfBirth is chosen. On the other hand, without bridging, the system errs on "In which comic book issue did Kitty Pryde first appear?", which refers to the rare predicate ComicBookFirstAppearance. With bridging, the parser can identify the correct binary by linking the types ComicBook and ComicBookCharacter. On both datasets, best performance is achieved by combining the two sources of information.

Overall, running on WEBQUESTIONS, the parser constructs derivations that contain about 12,000 distinct binary predicates.

Feature variations Table 5 shows the results of feature ablation studies. Accuracy drops when POS tag features are omitted, e.g., in the question "What number is Kevin Youkilis on the Boston Red Sox" the parser happily skips the NNPs "Kevin Youkilis" and returns the numbers of all players on the Boston Red Sox. A significant loss is incurred without denotation features, largely due to the parser returning logical forms with empty denotations. For instance, the question "How many people were at the 2006 FIFA world cup final?" is answered with a logical form containing the property PeopleInvolved rather than SoccerMatchAttendance, resulting in an empty denotation.

Next we study the impact of lexicalized versus unlexicalized features (Table 6). In the large WEBQUESTIONS dataset, lexicalized features helped, and so we added those features to our model when running on the test set. In FREE917, lexicalized features result in overfitting due to the small number of training examples. Thus, we ran our final parser on the test set without lexicalized features.

Effect of beam size An intrinsic challenge in semantic parsing is to handle the exponentially large set of possible derivations. We rely heavily on the k-best beam approximation in the parser to keep the good derivations that lead to the correct answer. Recall that the set of candidate derivations D̃(x) depends on the parameters θ. In the initial stages of learning, θ is far from optimal, so good derivations are likely to fall below the k-best cutoff of internal parser beams. As a result, D̃(x) contains few derivations with the correct answer. Still, placing these few derivations on the beam allows the training procedure to bootstrap θ into a good solution. Figure 4 illustrates this improvement in D̃(x) across early training iterations.

Figure 5: Accuracy and oracle score as beam size k increases, for (a) FREE917 and (b) WEBQUESTIONS.

Smaller choices of k yield a coarser approxima-
We an- lexicon concerns specific phrases, thus aggregating alyzed WEBQUESTIONS examples and found sev- over facts. On the question answering side, recent eral main causes of error: (i) Disambiguating en- methods have made progress in building semantic tities in WEBQUESTIONS is much harder because parsers for the open domain, but still require a fair the entity lexicon has 41M entities. For example, amount of manual effort (Yahya et al., 2012; Unger given “Where did the battle of New Orleans start?” et al., 2012; Cai and Yates, 2013). Our system re- the system identifies “New Orleans” as the target duces the amount of supervision and has a more ex- entity rather than its surrounding noun phrase. Re- tensive evaluation on a new dataset. call that all FREE917 experiments used a carefully Finally, although Freebase has thousands of prop- chosen entity lexicon. (ii) Bridging can often fail erties, open (Banko et al., when the question’s entity is compatible with many 2007; Fader et al., 2011; Masaum et al., 2012) binaries. For example, in “What did Charles Bab- and associated question answering systems (Fader bage make?”, the system chooses a wrong binary et al., 2013) work over an even larger open-ended compatible with the type Person. (iii) The system set of properties. The drawback of this regime is sometimes incorrectly draws verbs from subordinate that the noise and the difficulty in canonicaliza- clauses. For example, in “Where did Walt Disney tion make it hard to perform reliable composition, live before he died?” it returns the place of death of thereby nullifying one of the key benefits of se- Walt Disney, ignoring the matrix verb live. mantic parsing. An interesting midpoint involves 5 Discussion keeping the structured knowledge base but aug- menting the predicates, for example using random Our work intersects with two strands of work. walks (Lao et al., 2011) or Markov logic (Zhang The first involves learning models of semantics et al., 2012). This would allow us to map atomic guided by denotations or interactions with the world. words (e.g., “wife”) to composite predicates (e.g., Besides semantic parsing for querying databases λx.Marriage.Spouse.(Gender.Femaleux)). Learn- (Popescu et al., 2003; Clarke et al., 2010; Liang ing these composite predicates would drastically in- et al., 2011), previous work has looked at inter- crease the possible space of logical forms, but we preting natural language for performing program- believe that the methods proposed in this paper— ming tasks (Kushman and Barzilay, 2013; Lei et alignment via distant supervision and bridging—can al., 2013), playing computer games (Branavan et al., provide some traction on this problem. 2010; Branavan et al., 2011), following navigational instructions (Chen, 2012; Artzi and Zettlemoyer, Acknowledgments 2013), and interacting in the real world via percep- tion (Matuszek et al., 2012; Tellex et al., 2011; Kr- We would like to thank Thomas Lin, Mausam and ishnamurthy and Kollar, 2013). Our system uses Oren Etzioni for providing us with open IE triples denotations rather than logical forms as a training that are partially-linked to Freebase, and also Arun signal, but also benefits from denotation features, Chaganty for helpful comments. The authors grate- which becomes possible in the grounded setting. fully acknowledge the support of the Defense Ad- The second body of work involves connecting vanced Research Projects Agency (DARPA) Deep natural language and open-domain databases. 
Sev- Exploration and Filtering of Text (DEFT) Program 8Oracle score is the fraction of examples for which D˜(x) under Air Force Research Laboratory (AFRL) prime contains any derivation with the correct denotation. contract no. FA8750-13-2-0040. References A. Fader, L. Zettlemoyer, and O. Etzioni. 2013. Paraphrase-driven learning for open question answer- Y. Artzi and L. Zettlemoyer. 2011. Bootstrapping ing. In Association for Computational Linguistics semantic parsers from conversations. In Empirical (ACL). Methods in Natural Language Processing (EMNLP), D. Goldwasser, R. Reichart, J. Clarke, and D. Roth. pages 421–432. 2011. Confidence driven unsupervised semantic pars- Y. Artzi and L. Zettlemoyer. 2013. Weakly supervised ing. In Association for Computational Linguistics learning of semantic parsers for mapping instructions (ACL), pages 1486–1495. to actions. Transactions of the Association for Com- Google. 2013. Freebase data dumps (2013-06- putational Linguistics (TACL), 1:49–62. 09). https://developers.google.com/ M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, freebase/data. and O. Etzioni. 2007. Open information extraction M. A. Hearst. 1992. Automatic acquisition of hyponyms from the web. In International Joint Conference on from large text corpora. In Interational Conference on Artificial Intelligence (IJCAI), pages 2670–2676. Computational linguistics, pages 539–545. S. Branavan, L. Zettlemoyer, and R. Barzilay. 2010. R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, and Reading between the lines: Learning to map high-level D. S. Weld. 2011. Knowledge-based weak super- instructions to commands. In Association for Compu- vision for information extraction of overlapping rela- tational Linguistics (ACL), pages 1268–1277. tions. In Association for Computational Linguistics S. Branavan, D. Silver, and R. Barzilay. 2011. Learning (ACL), pages 541–550. to win by reading manuals in a Monte-Carlo frame- J. Krishnamurthy and T. Kollar. 2013. Jointly learning work. In Association for Computational Linguistics to parse and perceive: Connecting natural language to (ACL), pages 268–277. the physical world. Transactions of the Association for S. Branavan, N. Kushman, T. Lei, and R. Barzilay. 2012. Computational Linguistics (TACL), 1:193–206. Learning high-level planning from text. In Association J. Krishnamurthy and T. Mitchell. 2012. Weakly super- for Computational Linguistics (ACL), pages 126–135. vised training of semantic parsers. In Empirical Meth- Q. Cai and A. Yates. 2013. Large-scale semantic parsing ods in Natural Language Processing and Computa- via schema matching and lexicon extension. In Asso- tional Natural Language Learning (EMNLP/CoNLL), ciation for Computational Linguistics (ACL). pages 754–765. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. N. Kushman and R. Barzilay. 2013. Using semantic uni- Jr, and T. M. Mitchell. 2010. Toward an architecture fication to generate regular expressions from natural for never-ending language learning. In Association for language. In Human Language Technology and North the Advancement of Artificial Intelligence (AAAI). American Association for Computational Linguistics (HLT/NAACL), pages 826–836. A. X. Chang and C. Manning. 2012. SUTime: A library T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and for recognizing and normalizing time expressions. In M. Steedman. 2010. Inducing probabilistic CCG Language Resources and Evaluation (LREC), pages grammars from logical form with higher-order unifi- 3735–3740. cation. 
In Empirical Methods in Natural Language D. Chen. 2012. Fast online lexicon learning for grounded Processing (EMNLP), pages 1223–1233. language acquisition. In Association for Computa- T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and tional Linguistics (ACL). M. Steedman. 2011. Lexical generalization in CCG H. H. Clark. 1975. Bridging. In Workshop on theoretical grammar induction for semantic parsing. In Empirical issues in natural language processing, pages 169–174. Methods in Natural Language Processing (EMNLP), J. Clarke, D. Goldwasser, M. Chang, and D. Roth. pages 1512–1523. 2010. Driving semantic parsing from the world’s re- N. Lao, T. Mitchell, and W. W. Cohen. 2011. Random sponse. In Computational Natural Language Learn- walk inference and learning in a large scale knowledge ing (CoNLL), pages 18–27. base. In Empirical Methods in Natural Language Pro- J. Duchi, E. Hazan, and Y. Singer. 2010. Adaptive cessing (EMNLP). subgradient methods for online learning and stochas- T. Lei, F. Long, R. Barzilay, and M. Rinard. 2013. tic optimization. In Conference on Learning Theory From natural language specifications to program input (COLT). parsers. In Association for Computational Linguistics A. Fader, S. Soderland, and O. Etzioni. 2011. Identifying (ACL). relations for open information extraction. In Empirical P. Liang, M. I. Jordan, and D. Klein. 2011. Learning Methods in Natural Language Processing (EMNLP). dependency-based compositional semantics. In As- sociation for Computational Linguistics (ACL), pages in Natural Language Processing and Computational 590–599. Natural Language Learning (EMNLP/CoNLL), pages P. Liang. 2013. Lambda dependency-based composi- 379–390. tional semantics. Technical report, ArXiv. M. Zelle and R. J. Mooney. 1996. Learning to parse T. Lin, Mausam, and O. Etzioni. 2012. Entity link- database queries using inductive logic proramming. In ing at web scale. In Workshop Association for the Advancement of Artificial Intelli- (AKBC-WEKEX). gence (AAAI), pages 1050–1055. Masaum, M. Schmitz, R. Bart, S. Soderland, and O. Et- L. S. Zettlemoyer and M. Collins. 2005. Learning to zioni. 2012. Open language learning for informa- map sentences to logical form: Structured classifica- tion extraction. In Empirical Methods in Natural Lan- tion with probabilistic categorial grammars. In Uncer- guage Processing and Computational Natural Lan- tainty in Artificial Intelligence (UAI), pages 658–666. guage Learning (EMNLP/CoNLL), pages 523–534. C. Zhang, R. Hoffmann, and D. S. Weld. 2012. Onto- C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, and logical smoothing for relation extraction with minimal D. Fox. 2012. A joint model of language and percep- supervision. In Association for the Advancement of tion for grounded attribute learning. In International Artificial Intelligence (AAAI). Conference on Machine Learning (ICML). R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In As- sociation for Computational Linguistics (ACL), pages 91–98. H. Poon. 2013. Grounded unsupervised semantic pars- ing. In Association for Computational Linguistics (ACL). A. Popescu, O. Etzioni, and H. Kautz. 2003. Towards a theory of natural language interfaces to databases. In International Conference on Intelligent User Inter- faces (IUI), pages 149–157. S. Riedel, L. Yao, and A. McCallum. 2010. Model- ing relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 148–163. M. Surdeanu, J. 
Tibshirani, R. Nallapati, and C. D. Man- ning. 2012. Multi-instance multi-label learning for relation extraction. In Empirical Methods in Natu- ral Language Processing and Computational Natu- ral Language Learning (EMNLP/CoNLL), pages 455– 465. S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy. 2011. Understand- ing natural language commands for robotic navigation and mobile manipulation. In Association for the Ad- vancement of Artificial Intelligence (AAAI). C. Unger, L. Bhmann, J. Lehmann, A. Ngonga, D. Ger- ber, and P. Cimiano. 2012. Template-based ques- tion answering over RDF data. In World Wide Web (WWW), pages 639–648. Y. W. Wong and R. J. Mooney. 2007. Learning syn- chronous grammars for semantic parsing with lambda calculus. In Association for Computational Linguis- tics (ACL), pages 960–967. M. Yahya, K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp, and G. Weikum. 2012. Natural language questions for the web of data. In Empirical Methods