
Knowledge Graph and Corpus Driven Segmentation and Answer Inference for Telegraphic Entity-seeking Queries

Mandar Joshi∗ (IBM Research)    Uma Sawant (IIT Bombay and Yahoo Labs)    Soumen Chakrabarti (IIT Bombay)

Abstract

Much recent work focuses on formal interpretation of natural question utterances, with the goal of executing the resulting structured queries on knowledge graphs (KGs) such as Freebase. Here we address two limitations of this approach when applied to open-domain, entity-oriented Web queries. First, Web queries are rarely well-formed questions. They are "telegraphic", with missing verbs, prepositions, clauses, case and phrase clues. Second, the KG is always incomplete, unable to directly answer many queries. We propose a novel technique to segment a telegraphic query and assign a coarse-grained purpose to each segment: a base entity e1, a relation type r, a target entity type t2, and contextual words s. The query seeks entities e2 ∈ t2 where r(e1, e2) holds, further evidenced by schema-agnostic words s. Query segmentation is integrated with the KG and an unstructured corpus where mentions of entities have been linked to the KG. We do not trust the best or any specific query segmentation. Instead, evidence in favor of candidate e2s is aggregated across several segmentations. Extensive experiments on the ClueWeb corpus and parts of Freebase as our KG, using over a thousand telegraphic queries adapted from TREC, INEX, and WebQuestions, show the efficacy of our approach. For one benchmark, MAP improves from 0.2–0.29 (competitive baselines) to 0.42 (our system). NDCG@10 improves from 0.29–0.36 to 0.54.

1 Introduction

A majority of Web queries mention an entity or type (Lin et al., 2012), as users increasingly explore the Web of objects using Web search. To better support entity-oriented queries, commercial Web search engines are rapidly building up large catalogs of types, entities and relations, popularly called a "knowledge graph" (KG) (Gallagher, 2012). Despite these advances, robust, Web-scale, open-domain, entity-oriented search faces many challenges. Here, we focus on two.

1.1 "Telegraphic" queries

First, the surface utterances of entity-oriented Web queries are dramatically different from TREC-style factoid question answering (QA), where questions are grammatically well-formed. Web queries are usually "telegraphic": they are short, rarely use function words, punctuation or clausal structure, and use relatively flexible word orders. E.g., the natural utterance "on the bank of which river is the Hermitage Museum located" may be translated to the telegraphic Web query hermitage museum river bank. Even on well-formed question utterances, 50% of interpretation failures are contributed by parsing or structural matching failures (Kwiatkowski et al., 2013). Telegraphic utterances will generally be even more challenging.

Consequently, whereas TREC-QA/NLP-style research has focused on parsing and precise interpretation of a well-formed query sentence into a strongly structured (typically graph-oriented) query language (Kasneci et al., 2008; Pound et al., 2012; Yahya et al., 2012; Berant et al., 2013; Kwiatkowski et al., 2013), the Web search and information retrieval (IR) community has focused on telegraphic queries (Guo et al., 2009; Sarkas et al., 2010; Li et al., 2011; Pantel et al., 2012; Lin et al., 2012; Sawant and Chakrabarti, 2013). In terms of target schema richness, these efforts may

∗ Work done as a Masters student at IIT Bombay.

appear more modest. The act of query 'interpretation' is mainly a segmentation of query tokens by purpose. In the example above, one may report segments "Hermitage Museum" (a located artifact or named entity) and "river bank" (the target type). This is reminiscent of record segmentation in information extraction (IE). Over well-formed utterances, IE baselines are quite competitive (Yao and Van Durme, 2014). But here, we are interested exclusively in telegraphic queries.

1.2 Incomplete knowledge graph

The second problem is that the KG is always work in progress (Pereira, 2013), and connections found between nodes of the KG, between the KG and the query, or between the KG and unstructured text, are often incomplete or erroneous. E.g., even Freebase is rather small compared to what is needed to answer all but the "head" queries. Google's Freebase annotations (Gabrilovich et al., 2013) on ClueWeb (ClueWeb09, 2009) number fewer than 15 per page to ensure precision. Fewer than 2% are to entities in Freebase but not in Wikipedia.

It may also be difficult to harness the KG for answering certain queries. E.g., answering the query fastest odi century batsman, the intent of which is to find the batsman holding the record for the fastest century in One Day International (ODI) cricket, may be too difficult for most KG-only systems, but may be answered quite effectively by a system that also utilizes evidence from unstructured text.

There is a clear need for a "pay-as-you-go" architecture that involves both the corpus and the KG. A query easily served by a curated KG should give accurate results, but it is desirable to have a graceful interpolation supported by the corpus: e.g., if the relation r(e1, e2) is not directly evidenced in the KG, but strongly hinted in the corpus, we still want to use this for ranking.

1.3 Our contributions

Here, we make progress beyond the above frontier of prior work in the following significant ways. We present a new architecture for structural interpretation of a telegraphic query into these segments (some may be empty):
• a mention ê1 of an entity e1,
• a mention r̂ of a relation type r,
• a mention t̂2 of a target type t2, and
• other contextual matching words s (sometimes called selectors),
with the simultaneous intent of finding and ranking entities e2 ∈ t2, such that r(e1, e2) is likely to hold, evidenced near the matching words in unstructured text.

Given the short, telegraphic query utterances, we limit our scope to at most one relation mention, unlike the complex mapping of clauses in well-formed questions to twig and join style queries (e.g., "find an actor whose spouse was an Italian bookwriter"). On the other hand, we need to deal with the unhelpful input, as well as consolidate the KG with the corpus for ranking candidate e2s. Despite the modest specification, our query template is quite expressive, covering a wide range of entity-oriented queries (Yih et al., 2014).

We present a novel discriminative graphical model to capture the entity ranking inference task, with query segmentation as a by-product. Extensive experiments with over a thousand entity-seeking telegraphic queries using the ClueWeb09 corpus and a subset of Freebase show that we can accurately predict the segmentation and intent of telegraphic relational queries, and simultaneously rank candidate responses with high accuracy. We also present evidence that the KG and corpus have synergistic salutary effects on accuracy.

§2 explores related work in more detail. §3 gives some examples fitting our query template, explains why interpreting some of them is nontrivial, and sets up notation. §4 presents our core technical contributions. §5 presents experiments. Data can be accessed at http://bit.ly/Spva49 and http://bit.ly/WSpxvr.
2 Related work

The NLP/QA community has traditionally assumed that question utterances are grammatically well-formed, from which precise clause structure, ground constants, variables, and connective relations can be inferred via semantic parsing (Kasneci et al., 2008; Pound et al., 2012; Yahya et al., 2012; Berant et al., 2013; Kwiatkowski et al., 2013) and translated to lambda expressions (Liang, 2013) or SPARQL style queries (Kasneci et al., 2008), with elaborate schema knowledge. Such approaches are often correlated with the assumption that all usable knowledge has been curated into a KG. The query is first translated to a structured form and then "executed" on the KG. A large corpus may be used to build relation expression models (Yao and Van Durme, 2014), but not as supporting evidence for target entities.

Telegraphic query | ê1 | r̂ | t̂2 | s
first african american nobel prize winner | nobel prize | winner | - | african american first
 | nobel prize | - | winner | first african american
 | - | - | winner | first african american nobel prize
dave navarro first band | dave navarro | band | band | first
 | dave navarro | band | - | first
merril lynch headquarters | merril lynch | headquarters | - | -
 | merril lynch | - | headquarters | -
spanish poet died in civil war | spanish | died in | poet | civil war
 | civil war | died | - | spanish poet
first american in space | - | - | american | first, in space

Figure 1: Example queries and some potential segmentations.

In contrast, the Web and IR community generally assumes a free-form query that is often telegraphic (Guo et al., 2009; Sarkas et al., 2010; Li et al., 2011). Queries being far more noisy, the goal of structure discovery is more modest, and often takes the form of a segmentation of the query regarded as a token sequence, assigning a broad purpose (Pantel et al., 2012; Lin et al., 2012) to each segment, mapping them probabilistically to a relatively loose schema, and ranking responses in conjunction with segmentations (Sawant and Chakrabarti, 2013). To maintain quality in the face of noisy input, these approaches often additionally exploit clicks (Li et al., 2011) or a corpus that has been annotated with entity mentions (Cheng and Chang, 2010; Li et al., 2010). The corpus provides contextual snippets for queries where the KG fails, preventing the systems from falling off the "structure cliff" (Pereira, 2013).

Our work advances the capabilities of the latter class of approaches, bringing them closer to the depth of the former, while handling telegraphic queries and retaining the advantage of corpus evidence over and above the KG. Very recently, Yao et al. (2014) have concluded that for current benchmarks, deep parsing and shallow information extraction give comparable interpretation accuracy. The very recent work of Yih et al. (2014) is similar in spirit to ours, but they do not unify segmentation and answer inference, along with corpus evidence, like we do.

3 Notation and examples

We use e1, r, t2, e2 to represent abstract nodes and edges (MIDs in the case of Freebase) from the KG, and ê1, r̂, t̂2 to represent their textual mentions or hints, if any, in the query. s is a set of uninterpreted textual tokens in the query that are used to match and collect corpus contexts that lend evidence to candidate entities.

Figure 1 shows some telegraphic queries with possible segmentations into the above parts. Consider another example: dave navarro first band. 'Band' is a hint for the type /music/musical_group, so it comprises t̂2. Dave Navarro is an entity, with mention words 'dave navarro' comprising ê1. r̂ is made up of 'band', and represents the relation /music/group_member/membership. Finally, the word first cannot be mapped to any simple KG artifact, so it is relegated to s (which makes the corpus a critical part of answer inference). We use s and ŝ interchangeably.

Generally, there will be enough noise and uncertainty that the search system should try out several of the most promising segmentations, as shown in Figure 1. The accuracy of any specific segmentation is expected to be low in such adversarial settings. Therefore, support for an answer entity is aggregated over several segmentations. The expectation is that by considering multiple interpretations, the system will choose the entity with the best supporting evidence from the corpus and knowledge base.

4 Our Approach

Telegraphic queries are usually short, so we enumerate query token spans (with some restrictions, similar to beam search) to propose segmentations (§4.1). Candidate response entities are lined up for each interpretation, and then scored in a global model along with query segmentations (§4.2). §4.3 describes how model parameters are trained.
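To make the notation of §3 concrete, here is a minimal Python sketch (not from the paper; all field names and the MID are illustrative placeholders) of one candidate interpretation tuple (ê1, r̂, t̂2, s), together with the KG proposals it pairs with, for the query dave navarro first band:

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Interpretation:
    """One candidate segmentation of a telegraphic query plus KG proposals."""
    e1_mention: Optional[str]   # ê1: words mentioning the base entity, if any
    rel_hint: Optional[str]     # r̂: words hinting at the connecting relation
    type_hint: Optional[str]    # t̂2: words hinting at the target entity type
    selectors: FrozenSet[str]   # s: uninterpreted words, matched only in the corpus
    e1_id: Optional[str] = None     # proposed KG entity (Freebase MID)
    rel_id: Optional[str] = None    # proposed KG relation type
    type_id: Optional[str] = None   # proposed KG target type

# One plausible interpretation of "dave navarro first band" (Section 3):
example = Interpretation(
    e1_mention="dave navarro",
    rel_hint="band",
    type_hint="band",
    selectors=frozenset({"first"}),
    e1_id="/m/0xxxx",  # placeholder MID for the Dave Navarro entity
    rel_id="/music/group_member/membership",
    type_id="/music/musical_group",
)
```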

1: input: query token sequence q
2: initialize segmentations I = ∅
3: E1 = (entity, mention) pairs from linker
4: for all (e1, ê1) ∈ E1 do
5:   assign label E1 to mention tokens ê1
6:   for all contiguous spans v ⊂ q \ ê1 do
7:     label each word w ∈ v as T2R
8:     label other words w ∈ q \ ê1 \ v as S
9:     add segments (E1, T2R, S) to I
10:  end for
11: end for
12: return candidate segmentations I

Figure 2: Generating candidate query segmentations.

4.1 Generating candidate query segmentations

Each query token can have four labels, E1, T2, R, S, corresponding to the mentions of the base entity, target type, connecting relation, and context words. We found that segments hinting at T2 and R frequently overlapped (e.g., 'author' in the query zhivago author). In our implementation, we simplified to three labels, E1, T2R, S, where tokens labeled T2R are involved with both t2 and r, the proposed structured target type and connecting relation. Another reasonable assumption was that the base entity mention and type/relation mentions are contiguous token spans, whereas context words can be scattered across multiple segments.

Figure 2 shows how candidate segmentations are generated. For step 3, we use TagMe (Ferragina and Scaiella, 2010), an entity linker backed by an entity gazette derived from our KG.

4.2 Graphical model

Based on the previous discussion, we assume that an entity-seeking query is a sequence of tokens q = q1, q2, ..., and that this can be partitioned into different kinds of subsequences, corresponding to e1, r, t2 and s, and denoted by a structured (vector) labeling z = z1, z2, .... Given sequences q and z, we can separate out (possibly empty) token segments ê1(q, z), t̂2(q, z), r̂(q, z), and ŝ(q, z).

A query segmentation z becomes plausible in conjunction with proposals for e1, r, t2 and e2 from the KG. The probability Pr(z, e1, r, t2, e2 | q) is modeled as proportional to the product of several potentials (Koller and Friedman, 2009) in a graphical model. In subsequent subsections, we will present the design of specific potentials.
• Ψ_R(q, z, r) denotes the compatibility between the relation hint segment r̂(q, z) and a proposed relation type r in the KG (§4.2.1).
• Ψ_T2(q, z, t2) denotes the compatibility between the type hint segment t̂2(q, z) and a proposed target entity type t2 in the KG (§4.2.2).
• Ψ_E1,R,E2,S(q, z, e1, r, e2) is a novel corpus-based evidence potential that measures how strongly e1 and e2 appear in corpus snippets in the proximity of words in ŝ(q, z), and apparently related by relation type r (§4.2.3).
• Ψ_E1(q, z, e1) denotes the compatibility between the query segment ê1(q, z) and the entity e1 that it purportedly mentions (§4.2.4).
• Ψ_S(q, z) denotes selector compatibility. Selectors are a fallback label, so this is pinned arbitrarily to 1; other potentials are balanced against this base value.
• Ψ_E1,R,E2(e1, r, e2) is A if the relation r(e1, e2) exists in the KG, and is B > 0 otherwise, for tuned/learnt constants A > B > 0. Note that this is a soft constraint (B > 0); if the KG is incomplete, the corpus may be able to supplement the required information.
• Ψ_E2,T2(e2, t2) is 1 if e2 belongs to t2 and zero otherwise. In other words, candidate e2s must be proposed to be instances of the proposed t2; this is a hard constraint, but it can be softened if desired, like Ψ_E1,R,E2.

Figure 3 shows the relevant variable states as circled nodes, and the potentials as square factor nodes. To rank candidate entities e2, we pin the node E2 to each entity in turn. With E2 pinned, we perform MAP inference over all other hidden variables and note the score of e2 as the product of the above potentials maximized over choices of all other variables:

score(e2) = max_{z, t2, r, e1} [ Ψ_T2(q, z, t2) · Ψ_R(q, z, r) · Ψ_E1(q, z, e1) · Ψ_S(q, z) · Ψ_E2,T2(e2, t2) · Ψ_E1,R,E2(e1, r, e2) · Ψ_E1,R,E2,S(q, z, e1, r, e2) ].   (1)

We rank candidate e2s by decreasing score, which is estimated by max-product message passing (Koller and Friedman, 2009).
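The following Python sketch (illustrative only, not the authors' code) spells out equation (1) as an exhaustive maximization over enumerated interpretations rather than message passing; `potentials` is an assumed interface exposing the factors listed above.

```python
def score_candidate(e2, interpretations, potentials):
    """Equation (1): score e2 by maximizing the product of potentials over
    candidate interpretations (z, e1, r, t2), e.g. those obtained by expanding
    the segmentations of Figure 2 with KG proposals."""
    best = 0.0
    for z, e1, r, t2 in interpretations:
        score = (potentials.psi_T2(z, t2)
                 * potentials.psi_R(z, r)
                 * potentials.psi_E1(z, e1)
                 * potentials.psi_S(z)
                 * potentials.psi_E2_T2(e2, t2)
                 * potentials.psi_E1_R_E2(e1, r, e2)
                 * potentials.psi_E1_R_E2_S(z, e1, r, e2))
        best = max(best, score)
    return best

# Candidate answers are then ranked by decreasing score_candidate(e2, ...).
```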

[Figure 3 (diagram): circled nodes for the query, its segmentation, the query entity, connecting relation, target type, selectors, and the candidate entity; square factor nodes for the entity, relation, and type language models and the corpus-assisted entity-relation evidence potential.]

Figure 3: Graphical model for query segmentation and entity scoring. Factors/potentials are shown as squares. A candidate e2 is observed and scored using equation (1). Query q is also observed but not shown to reduce clutter; most potentials depend on it.

As noted earlier, any of the relation/type or query entity partitions may be empty. To handle this case, we allow each of the entity, relation, or target type nodes in the graphical model to take the value ⊥ ('null'). To support this, the values of the factors between the query segmentation node Z and Ψ_E1(q, z, e1), Ψ_T2(q, z, t2), and Ψ_R(q, z, r) are set to suitably low values.

Next, we describe the detailed design of some of the key potentials introduced above.

4.2.1 Relation language model for Ψ_R

Potential Ψ_R(q, z, r) captures the compatibility between r̂(q, z) and the proposed relation r. E.g., if the query is steve jobs death reason, and r̂ is (correctly chosen as) death reason, then the correct candidate r is /people/deceased_person/cause_of_death. An incorrect r is /people/deceased_person/place_of_death. An incorrect z may lead to r̂(q, z) being jobs death.

Using the corpus: Considerable variation may exist in how r is represented textually in a query. The relation language model needs to build a bridge between the formal r and the textual r̂, so that (un)likely r's have (small) large potential. Many approaches to this problem have been intensely studied recently (Berant et al., 2013; Berant and Liang, 2014; Kwiatkowski et al., 2013; Yih et al., 2014). Given our need to process billions of Web pages efficiently, we chose a pattern-based approach (Nakashole et al., 2012): for each r, discover the most strongly associated phrase patterns from a reference corpus, then mark these patterns in a much larger payload corpus.

We started with the 2000 (out of approximately 14000) most frequent relation types in Freebase, and the ClueWeb09 corpus annotated with Freebase entities (Gabrilovich et al., 2013). For each triple instance of each relation type, we located all corpus sentences that mentioned both participating entities. We made the crude assumption that if r(e1, e2) holds and e1, e2 co-occur in a sentence, then this sentence is evidence of the relationship. Each such sentence is parsed to obtain a dependency graph using the Malt Parser (Hall et al., 2014). Words in the path connecting the entities are joined together and added to a candidate phrase dictionary, provided the path is at most three hops. (Inspection suggested that longer dependency paths mostly arise out of noisy sentences or botched parses.) 30% of the sentences were thus retained. Finally, we defined

Ψ_R(q, z, r) = n(r, r̂(q, z)) / Σ_{p'} n(r, p'),   (2)

where p' ranges over all phrases that are known to hint at r, and n(r, p) denotes the number of sentences where the phrase p occurred in the dependency path between the entities participating in relation r.

Assuming that entity co-occurrence implies evidence is admittedly simplistic. However, the primary function of the relation model is to retrieve the top-k relations that are compatible with the type(s) of e1 and the given relation hint. Moreover, the remaining noise is further mitigated by the collective scoring in the graphical model. While we may miss relations if they are expressed in the query through obscure hints, allowing the relation to be ⊥ acts as a safety net.
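As a rough sketch of the corpus-derived estimate in equation (2) (illustrative; the sentence index and the Malt-parser path extraction are stubbed out as assumed helpers):

```python
from collections import defaultdict

def build_phrase_counts(triples, sentences_mentioning, dependency_path_phrase):
    """n(r, p): for each relation r, count sentences whose dependency path
    between the two argument entities yields the phrase p (paths of at most
    three hops; longer paths are discarded as noise)."""
    counts = defaultdict(lambda: defaultdict(int))
    for r, e1, e2 in triples:
        for sentence in sentences_mentioning(e1, e2):
            phrase = dependency_path_phrase(sentence, e1, e2)  # None if > 3 hops
            if phrase is not None:
                counts[r][phrase] += 1
    return counts

def psi_R(counts, r, rel_hint):
    """Equation (2): share of r's hint phrases accounted for by r̂(q, z)."""
    total = sum(counts[r].values())
    return counts[r][rel_hint] / total if total else 0.0
```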

Using Freebase relation names: As mentioned earlier, queries may express relations differently as compared to the corpus. A relation model based solely on corpus annotations may not be able to bridge that gap effectively, particularly because of the sparsity of corpus annotations or the rarity of Freebase triples in ClueWeb. E.g., for the relation /people/person/profession, we found very few annotated sentences. One way to address this problem is to utilize relation type names in Freebase to map hints to relation types. Thus, in addition to the corpus-derived relation model, we also built a language model that used Freebase relation type names as lemmas. E.g., the word 'profession' would contribute to the relation type /people/person/profession.

Our relation models are admittedly simple. This is mainly because telegraphic queries may express relations very differently from natural language text. As it is difficult to ensure precision at the query interpretation stage, our models are geared towards recall. The system generates a large number of interpretations and relies on signals from the corpus and KG to bring forth correct interpretations.

4.2.2 Type language model for Ψ_T2

Similar to the relation language model, we need a type language model to measure compatibility between t2 and t̂2(q, z). Estimating the target entity type, without over-generalizing or over-specifying it, has always been important for QA. E.g., when t̂2 is 'city', a good type language model should prefer t2 as /location/citytown over /location/location, while avoiding /location/es_autonomous_city.

A catalog like Freebase suggests a straightforward method to collect a type language model. Each type is described by one or more phrases through the link /common/topic/alias. We can collect these into a micro-'document' and use a standard Dirichlet-smoothed language model from IR (Zhai, 2008). In Freebase, an entity node (e.g., Einstein, /m/0jcx) may be linked to a type node (e.g., /base/scientist/physicist) using an edge with label /type/object/type.

But relation types provide additional clues to the types of the endpoint entities. Freebase relation types have the form /x/y/z, where x is the domain of the relation, and y and z are string representations of the types of the entities participating in the relation. E.g., the (directed) relation type /location/country/capital connects /location/country to /location/citytown. Therefore, "capital" can be added to the set of descriptive phrases of the entity type /location/citytown.

It is important to note that while we use Freebase link nomenclature for relation and type language models, our models are not incompatible with other catalogs. Indeed, most catalogs have established ways of deriving language models that describe their various structures. For example, most YAGO types are derived from WordNet synsets with associated phrasal descriptions (lemmas). YAGO relations also have readable names such as actedIn, isMarriedTo, etc., which can be used to estimate language models. DBpedia relations are mostly derived from (meaningfully) named attributes taken from infoboxes, hence they can be used directly. Furthermore, others (Wu and Weld, 2007) have shown how to associate language models with such relations.
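A sketch of the Dirichlet-smoothed type language model used for Ψ_T2, treating a type's alias phrases (plus words mined from relation type names) as a micro-document; this is standard query-likelihood smoothing (Zhai, 2008), and the alias collection step is assumed to have been done already.

```python
from collections import Counter

def type_language_model(alias_phrases, background_counts, mu=2000.0):
    """Return a scorer for type hints t̂2 against one type's alias micro-document."""
    doc = Counter(w for phrase in alias_phrases for w in phrase.lower().split())
    doc_len = sum(doc.values())
    bg_total = float(sum(background_counts.values()))

    def score(type_hint):
        # Dirichlet-smoothed likelihood of the hint words under the micro-document.
        p = 1.0
        for w in type_hint.lower().split():
            p_bg = background_counts.get(w, 0) / bg_total
            p *= (doc[w] + mu * p_bg) / (doc_len + mu)
        return p

    return score

# e.g. psi_T2_city = type_language_model(aliases["/location/citytown"], bg)("city")
```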

4.2.3 Snippet scoring

The factor Ψ_E1,R,E2,S(q, z, e1, r, e2) should be large if many snippets contain a mention of e1 and e2, relation r, and many high-signal words from s. Recall that we begin with a corpus annotated with entity mentions. Our corpus is not directly annotated with relation mentions. Therefore, we get from relations to documents via high-confidence phrases. Snippets are retrieved using a combined entity + word index, and scored for a given e1, r, e2, and selectors ŝ(q, z).

Given that relation phrases may be noisy and that their occurrence in a snippet may not necessarily mean that the given relation is being expressed, we need a scoring function that is cognizant of the roles of relation phrases and entities occurring in the snippets. In a basic version, e1, p, e2, s are used to probe a combined entity+word index to collect high scoring snippets, with the score being adapted from BM25. The second, refined scoring function used a RankSVM-style (Joachims, 2002) optimization:

min_{λ, ξ} ||λ||² + C Σ_{e+, e−} ξ_{e+, e−}
s.t. ∀ e+, e−:  λ · f(q, D_{e+}, e+) + ξ_{e+, e−} ≥ λ · f(q, D_{e−}, e−) + 1,   (3)

where e+ and e− are positive and negative entities for the query q, and f(q, D_e, e) represents the feature map for the set of snippets D_e belonging to entity e. The assumption here is that all snippets containing e+ are "positive" snippets for the query. f consolidates various signals, such as the number of snippets where e2 occurs near the query entity e1 and a relation phrase, or the number of snippets with a high proportion of the query IDF, hinting that e2 is a positive entity for the given query. A partial list of features used for snippet scoring is given in Figure 4.

Number of snippets with distance(e2, ê1) < k1 (k1 = 5, 10)
Number of snippets with distance(e2, relation phrase) < k2 (k2 = 3, 6)
Number of snippets with relation r = ⊥
Number of snippets with relation phrases as prepositions
Number of snippets covering fraction of query IDF > k3 (k3 = 0.2, 0.4, 0.6, 0.8)

Figure 4: Sample features used for learning weights λ to score snippets.
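One way to realize the pairwise optimization of equation (3) is a linear SVM over feature-difference vectors, a common approximation of RankSVM; this sketch (not the authors' implementation) assumes a feature map f that computes signals like those in Figure 4 over an entity's snippet set.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_snippet_weights(training_queries, feature_map, C=1.0):
    """Approximate equation (3): learn lambda from pairs of positive/negative
    entities, using difference vectors f(q, D_e+, e+) - f(q, D_e-, e-)."""
    X, y = [], []
    for q, positives, negatives, snippets_of in training_queries:
        for e_pos in positives:
            for e_neg in negatives:
                diff = (feature_map(q, snippets_of(e_pos), e_pos)
                        - feature_map(q, snippets_of(e_neg), e_neg))
                X.append(diff);  y.append(1)
                X.append(-diff); y.append(-1)
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(np.array(X), np.array(y))
    return svm.coef_.ravel()   # the snippet-scoring weight vector lambda
```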

4.2.4 Query entity model

Potential Ψ_E1(q, z, e1) captures the compatibility between ê1(q, z) (i.e., the words that mention e1) and the claimed entity e1 mentioned in the query. We used the TagMe entity linker (Ferragina and Scaiella, 2010) for annotating entities in queries. TagMe annotates the query with Wikipedia entities, which we map to Freebase, and we use the annotation confidence scores as the potential Ψ_E1(q, z, e1).

4.3 Discriminative parameter training with latent variables

We first set the potentials in (1) as explained in §4.2 (henceforth called 'Unoptimized'), and got encouraging accuracy. Then we rewrote each potential as

Ψ_•(···) = exp( w_• · φ_•(···) ),  i.e.,  log ∏_• Ψ_•(···) = Σ_• w_• · φ_•(···),   (4)

with w_• being a weight vector for a specific potential •, and φ_• being a corresponding feature vector. During inference, we seek to maximize

max_{z, e1, t2, r} w · φ(q, z, e1, t2, r, e2),   (5)

for a fixed w, to find the score of each candidate entity e2. Here all w_• and φ_• have been collected into unified weight and feature vectors w, φ. During training of w, we are given pairs of correct and incorrect answer entities e2+, e2−, and we wish to satisfy constraints of the form

max_{z, e1, t2, r} w · φ(q, z, e1, t2, r, e2+) + ξ ≥ max_{z, e1, t2, r} w · φ(q, z, e1, t2, r, e2−),   (6)

because collecting e2+, e2− pairs is less work than supervising with values of z, e1, t2, r, e2 for each query. Similar distant supervision problems were posed via the bundle method by (Bergeron et al., 2008) and by (Yu and Joachims, 2009), who used CCCP (Yuille and Rangarajan, 2006). These are equivalent in our setting. We use the CCCP style, and augment the objective with an additional entropy term as in (Sawant and Chakrabarti, 2013). We call this LVDT (latent variable discriminative training) in §5.
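A compact sketch of scoring and training with latent interpretations, following equations (5) and (6); it replaces the bundle/CCCP solver and the entropy term of (Sawant and Chakrabarti, 2013) with a simple subgradient update, so it only conveys the structure of LVDT, not the exact procedure.

```python
import numpy as np

def score_entity(w, interpretations, phi, e2):
    """Equation (5): maximize w · φ over latent interpretations (z, e1, t2, r)."""
    return max(np.dot(w, phi(interp, e2)) for interp in interpretations)

def lvdt_train(w, train_pairs, interpretations_of, phi, lr=0.1, epochs=10):
    """CCCP-flavored alternation for the pairwise constraints of equation (6).

    train_pairs: iterable of (query, e2_pos, e2_neg) with correct/incorrect answers.
    interpretations_of: maps a query to its candidate interpretations.
    """
    for _ in range(epochs):
        for q, e_pos, e_neg in train_pairs:
            interps = interpretations_of(q)
            # Fix the best latent interpretation for the correct answer ...
            best_pos = max(interps, key=lambda i: np.dot(w, phi(i, e_pos)))
            # ... and the most offending one for the incorrect answer.
            best_neg = max(interps, key=lambda i: np.dot(w, phi(i, e_neg)))
            margin = np.dot(w, phi(best_pos, e_pos)) - np.dot(w, phi(best_neg, e_neg))
            if margin < 1.0:   # violated constraint: subgradient step on w
                w = w + lr * (phi(best_pos, e_pos) - phi(best_neg, e_neg))
    return w
```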

5 Experiments

5.1 Testbed

Corpus and knowledge graph: We used the ClueWeb09B (ClueWeb09, 2009) corpus containing 50 million Web documents. This corpus was annotated by Google with Freebase entities (Gabrilovich et al., 2013). The average page contains 15 entity annotations from Freebase. We used the Freebase KG and its links to Wikipedia.

Queries: We report on two sets of entity-seeking queries. A sample of about 800 well-formed queries from WebQuestions (Berant et al., 2013) was converted to telegraphic utterances (such as would be typed into commercial search engines) by volunteers familiar with Web search. We call this WQT (WebQuestions, telegraphic). Queries are accompanied by ground truth entities. The second data set, TREC-INEX, from (Sawant and Chakrabarti, 2013), has about 700 queries sampled from TREC and INEX, available at http://bit.ly/WSpxvr. These come with well-formed and telegraphic utterances, as well as ground truth entities.

There are some notable differences between these query sets. For WQT, queries were generated by using Google's query suggestions interface. Volunteers were asked to find answers using single Freebase pages. Therefore, by construction, the queries retained can be answered using the Freebase KG alone, with a simple r(e1, ?) form. In contrast, TREC-INEX queries provide a balanced mix of t2 and r hints in the queries, and direct answers from triples are relatively less available.

5.2 Implementation details

On average, the pseudocode in Figure 2 generated 13 segmentations per query, with longer queries generating more segmentations than shorter ones.

We used an MG4J (Boldi and Vigna, 2005) based query processor, written in Java, over entity and word indices on ClueWeb09B. The index supplies snippets with a specified maximum width, containing a mention of some entity and satisfying a WAND (Broder et al., 2003) predicate over words in ŝ. In case of phrases in the query, the WAND threshold was computed by adding the IDF of the constituent words. The index returned about 330,000 snippets on average for a WAND threshold of 0.6.

We retained the top 200 candidate entities from the corpus; increasing this horizon did not give benefits. We also considered as candidates for e2 those entities that are adjacent to e1 in the KG via top-scoring r candidates. In order to generate supporting snippets for an interpretation containing entity annotation e, we need to match e with Google's corpus annotations. However, relying solely on corpus annotations fails to retrieve many potential evidence snippets, because entity annotations are sparse. Therefore we probed the token index with the textual mention of e1 in the query; this improved recall.

We also investigated the feasibility of our proposals for interactive search.
There are three major processes involved in answering a query: generating potential interpretations, collecting/scoring snippets, and inference (MAP for Unoptimized and w · φ(·) for LVDT). For the WQT dataset, the average time per query for each stage was approximately 0.2, 16.6 and 1.3 seconds respectively. Our (Java) code did not optimize the bottleneck at all; only 10 hosts and no clever load balancing were used. We believe commercial search engines can cut this down to less than a second.

5.3 Research questions

In the rest of this section we will address these questions:
• For telegraphic queries, is our entity-relation-type-selector segmentation better than the type-selector segmentation of (Sawant and Chakrabarti, 2013)?
• When semantic parsers (Berant et al., 2013; Kwiatkowski et al., 2013) are subjected to telegraphic queries, how do they perform compared to our proposal?
• Are the KG and corpus really complementary as regards their support of accurate ranking of candidate entities?
• Is the prediction of r and t2 from our approach better than a greedy assignment based on local language models?
We also discuss anecdotes of successes and failures of various systems.

5.4 Benefits of relation in addition to type

Figure 5 shows entity-ranking MAP, MRR, and NDCG@10 (n@10) for the two data sets and various systems. "No interpretation" is an IR baseline without any KG. Type+selector is our implementation of (Sawant and Chakrabarti, 2013). Unoptimized and LVDT both beat "no interpretation" and "type+selector" by wide margins. (Boldface implies the best performing formulation.)

There are two notable differences between S&C and our work. First, S&C do not use the knowledge graph (KG) and rely on a noisy corpus. This means S&C fail to answer queries whose answers are found only in the KG. This can be seen from the WQT results; they perform only slightly better than the baseline. Second, even for queries that can be answered through the corpus alone, S&C miss out on two important signals that the query may provide, namely the query entity and the relation. Our framework not only provides a way to use a curated and high-precision knowledge graph but also attempts to provide more reachability to the corpus by the use of relational phrases.

In the case of TREC-INEX, LVDT improves upon the unoptimized graphical model, whereas for WQT it does not. Preliminary inspection suggests this is because WQT has noisy and incomplete ground truth, and LVDT trains to the noise; a non-convex objective makes matters worse. The bias in our unoptimized model circumvents training noise.

Dataset | Formulation | map | mrr | n@10
TREC-INEX | No interpretation | .205 | .215 | .292
 | Type+selector | .292 | .306 | .356
 | Unoptimized | .409 | .419 | .502
 | LVDT | .419 | .436 | .541
WQT | No interpretation | .080 | .095 | .131
 | Type+selector | .116 | .152 | .201
 | Unoptimized | .377 | .401 | .474
 | LVDT | .295 | .323 | .406

Figure 5: 'Entity-relation-type-selector' segmentation yields better accuracy than 'type-selector' segmentation.

5.5 Comparison with semantic parsers

For TREC-INEX, both Unoptimized and LVDT beat SEMPRE (Berant et al., 2013) convincingly, whether it is trained with Free917 or WebQuestions (Figure 6). SEMPRE's relatively poor performance, in this case, is explained by its complete reliance on the knowledge graph. As discussed previously, the TREC-INEX dataset contains a sizable proportion of queries that may be difficult to answer using a KG alone. When SEMPRE is compared with our systems on a telegraphic sample of WebQuestions (WQT), results are mixed. Our Unoptimized model still compares favorably to SEMPRE, but with slimmer gains. As before, LVDT falls behind.

Our smaller gains over SEMPRE in the case of WebQuestions are explained by how WebQuestions was assembled (Berant et al., 2013). Although Google's query suggestions gave an eclectic pool, only those queries survived that could be answered using a single Freebase page, which effectively reduced the role of a corpus. In fact, a large fraction of WQT queries cannot be answered well using the corpus alone, because FACC1 annotations are too sparse and rarely cover common nouns and phrases such as 'democracy' or 'drug overdose' which are needed for some WQT queries.

For WQT, our system also compares favorably with Jacana (Yao and Van Durme, 2014). Given that they subject their input to natural language parsing, their relatively poor performance is not surprising.

Dataset | Formulation | map | mrr | n@10
TREC-INEX | SEMPRE (Free917) | .154 | .159 | .186
 | SEMPRE (WQ) | .197 | .208 | .247
 | Unoptimized | .409 | .419 | .502
 | LVDT | .419 | .436 | .541
WQT | SEMPRE (Free917) | .229 | .255 | .285
 | SEMPRE (WQ) | .374 | .406 | .449
 | Unoptimized | .377 | .401 | .474
 | Jacana | .239 | .256 | .329
 | LVDT | .295 | .323 | .406

Figure 6: Comparison with semantic parsers.

5.6 Complementary benefits of KG & corpus

Figure 7 shows the synergy between the corpus and the KG. In all cases and for all metrics, using the corpus and KG together gives superior performance to using either of them alone. However, it is instructive that in the case of TREC-INEX, corpus-only is better than KG-only, whereas this is reversed for WQT, which also supports the above argument.

Data | Formulation | map | mrr | n@10
TREC-INEX | Unoptimized (KG) | .201 | .209 | .241
 | Unoptimized (Corpus) | .381 | .388 | .471
 | Unoptimized (Both) | .409 | .419 | .502
 | LVDT (KG only) | .255 | .264 | .293
 | LVDT (Corpus) | .267 | .272 | .315
 | LVDT (Both) | .419 | .436 | .541
WQT | Unoptimized (KG) | .329 | .343 | .394
 | Unoptimized (Corpus) | .188 | .228 | .291
 | Unoptimized (Both) | .377 | .401 | .474
 | LVDT (KG only) | .257 | .281 | .345
 | LVDT (Corpus only) | .170 | .210 | .280
 | LVDT (Both) | .295 | .323 | .406

Figure 7: Synergy between KB and corpus.

5.7 Collective vs. greedy segmentation

To judge the quality of interpretations, we asked paid volunteers to annotate queries with an appropriate relation and type, and compared them with the interpretations associated with top-ranked entities. Results in Figure 8 indicate that in spite of noisy relation and type language models, our formulations produce high quality interpretations through collective inference.

Figure 9 demonstrates the benefit of collective inference over greedy segmentation followed by evaluation. Collective inference boosts absolute MAP by as much as 0.2.

Formulation | Type | Relation | Type/Rel
Unoptimized (top 1) | 23 | 49 | 60
Unoptimized (top 5) | 29 | 57 | 68
LVDT (top 1) | 25 | 52 | 61
LVDT (top 5) | 33 | 61 | 69

Figure 8: Fraction of queries (%) with correct interpretations of t2, r, and t2 or r, on TREC-INEX.

Dataset | Formulation | map | mrr | n@10
TREC-INEX | Unoptimized (greedy) | .343 | .347 | .432
 | Unoptimized | .409 | .419 | .502
 | LVDT (greedy) | .205 | .214 | .259
 | LVDT | .419 | .436 | .541
WQT | Unoptimized (greedy) | .246 | .271 | .335
 | Unoptimized | .377 | .401 | .474
 | LVDT (greedy) | .212 | .246 | .317
 | LVDT | .295 | .323 | .406

Figure 9: Collective vs. greedy segmentation.

[Figure 10 (plot): Comparison of various approaches for NDCG at rank 1 to 10, TREC-INEX dataset.]

[Figure 11 (plot): Comparison of various approaches for NDCG at rank 1 to 10, WQT dataset.]

5.8 Discussion

Closer scrutiny revealed that collective inference often overcame errors in earlier stages to produce a correct ranking over answer entities. E.g., for the query automobile company makes spider, the entity disambiguation stage fails to identify the car Alfa Romeo Spider (/m/08ys39). However, the interpretation stage recovers from the error and segments the query with Automobile (/m/0k4j) as the query entity e1, /organization/organization and /business/industry/companies as the target type t2 and relation r respectively (from the relation/type hint 'company'), and spider as a selector, to arrive at the correct answer Alfa Romeo (/m/09c50). The corpus features also play a crucial role for queries which may not be accurately represented with an appropriate logical formula. For the query meg ryan bookstore movie, the textual patterns for the relation ActedIn, in conjunction with the selector word 'bookstore', correctly identify the answer entity You've Got Mail (/m/014zwb).

We also analyzed samples of queries where our system did not perform particularly well. We observed that one of the recurring themes of these queries was that their answer entities had very little corpus support, and the type/relation hint mapped to too many or no candidate types/relations. For example, in the query south africa political system, the relevant type/relation hint 'political system' could not be mapped to /government/form_of_government and /location/country/form_of_government respectively.

6 Conclusion and future work

We presented a technique to partition telegraphic entity-seeking queries into functional segments and to rank answer entities accordingly. While our results are favorable compared to strong prior art, further improvements may result from relaxing our model to recognize multiple e1s and rs. It may also help to deploy more sophisticated paraphrasing models (Berant and Liang, 2014) or word embeddings (Yih et al., 2014) for relation hints. It would also be interesting to supplement entity-linked corpora and curated KGs with extracted triples (Fader et al., 2014). Another possibility is to apply the ideas presented here to well-formed questions.

References

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In ACL Conference.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

Charles Bergeron, Jed Zaretzki, Curt Breneman, and Kristin P. Bennett. 2008. Multiple instance ranking. In ICML, pages 48–55. ACM.

Paolo Boldi and Sebastiano Vigna. 2005. MG4J at TREC 2005. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC, number SP 500-266 in Special Publications. NIST.

Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In CIKM, pages 426–434. ACM.

Tao Cheng and Kevin Chen-Chuan Chang. 2010. Beyond pages: supporting efficient, scalable entity search with dual-inversion index. In EDBT. ACM.

ClueWeb09. 2009. http://www.lemurproject.org/clueweb09.php/.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In SIGKDD Conference.

Paolo Ferragina and Ugo Scaiella. 2010. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). CoRR/arXiv, abs/1006.3498. http://arxiv.org/abs/1006.3498.

Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora. http://lemurproject.org/clueweb12/, June. Version 1 (Release date 2013-06-26, Format version 1, Correction level 0).

Sean Gallagher. 2012. How Google and Microsoft taught search to 'understand' the Web. ArsTechnica article. http://goo.gl/NWs0zT.

Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In SIGIR Conference, pages 267–274. ACM.

Johan Hall, Jens Nilsson, and Joakim Nivre. 2014. MaltParser. http://www.maltparser.org/.

Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In SIGKDD Conference, pages 133–142. ACM.

Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum. 2008. NAGA: Searching and ranking knowledge. In ICDE. IEEE.

Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP Conference, pages 1545–1556.

Xiaonan Li, Chengkai Li, and Cong Yu. 2010. EntityEngine: Answering entity-relationship queries using shallow semantics. In CIKM, October. (demo).

Yanen Li, Bo-Jun Paul Hsu, ChengXiang Zhai, and Kuansan Wang. 2011. Unsupervised query segmentation using clickthrough for information retrieval. In SIGIR Conference, pages 285–294. ACM.

Percy Liang. 2013. Lambda dependency-based compositional semantics. Technical Report arXiv:1309.4408, Stanford University. http://arxiv.org/abs/1309.4408.

Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, and Ariel Fuxman. 2012. Active objects: Actions for entity-centric search. In WWW Conference, pages 589–598. ACM.

Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP-CoNLL '12, pages 1135–1145. ACL.

Patrick Pantel, Thomas Lin, and Michael Gamon. 2012. Mining entity types from query logs via user intent modeling. In ACL Conference, pages 563–571, Jeju Island, Korea, July.

Fernando Pereira. 2013. Meaning in the wild. Invited talk at EMNLP Conference. http://hum.csse.unimelb.edu.au/emnlp2013/invited-talks.html.

Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, and Grant Weddell. 2012. Interpreting keyword queries over Web knowledge bases. In CIKM.

Nikos Sarkas, Stelios Paparizos, and Panayiotis Tsaparas. 2010. Structured annotations of Web queries. In SIGMOD Conference.

Uma Sawant and Soumen Chakrabarti. 2013. Learning joint query interpretation and response ranking. In WWW Conference, Brazil.

Fei Wu and Daniel S. Weld. 2007. Automatically semantifying Wikipedia. In CIKM, pages 41–50.

Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, Maya Ramanath, Volker Tresp, and Gerhard Weikum. 2012. Natural language questions for the Web of data. In EMNLP Conference, pages 379–390, Jeju Island, Korea, July.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In ACL Conference. ACL.

Xuchen Yao, Jonathan Berant, and Benjamin Van Durme. 2014. Freebase QA: Information extraction or semantic parsing? In ACL 2014 Workshop on Semantic Parsing (SP14).

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In ACL Conference. ACL.

Chun-Nam John Yu and Thorsten Joachims. 2009. Learning structural SVMs with latent variables. In ICML, pages 1169–1176. ACM.

A. L. Yuille and Anand Rangarajan. 2006. The concave-convex procedure. Neural Computation, 15(4):915–936.

ChengXiang Zhai. 2008. Statistical language models for information retrieval: A critical review. Foundations and Trends in Information Retrieval, 2(3):137–213, March.
