Natural Language Questions for the Web of Data

Mohamed Yahya(1), Klaus Berberich(1), Shady Elbassuoni(2), Maya Ramanath(3), Volker Tresp(4), Gerhard Weikum(1)
(1) Max Planck Institute for Informatics, Germany
(2) Qatar Computing Research Institute
(3) Dept. of CSE, IIT-Delhi, India
(4) Siemens AG, Corporate Technology, Munich, Germany
{myahya,kberberi,weikum}@mpi-inf.mpg.de, [email protected], [email protected], [email protected]

Abstract

The Linked Data initiative comprises structured data in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources. Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the question translation and the resulting query answering.

1 Introduction

1.1 Motivation

Recently, very large, structured, and semantically rich knowledge bases have become available. Examples are Yago (Suchanek et al., 2007), DBpedia (Auer et al., 2007), and Freebase (Bollacker et al., 2008). DBpedia forms the nucleus of the Web of Linked Data (Heath and Bizer, 2011), which interconnects hundreds of RDF data sources with a total of 30 billion subject-property-object (SPO) triples. The diversity of linked-data sources and their high heterogeneity make it difficult for humans to search and discover relevant information. As the data is in RDF format, the standard approach would be to run structured queries in triple-pattern-based languages like SPARQL, but only expert programmers are able to precisely specify their information needs and cope with the high heterogeneity of the data (and the absence or very high complexity of schema information). For less initiated users, the only option to query this rich data is keyword search (e.g., via services like sig.ma (Tummarello et al., 2010)). None of these approaches is satisfactory. Instead, by far the most convenient approach would be to search in knowledge bases and the Web of linked data by means of natural-language questions.

As an example, consider a quiz question like "Which female actor played in Casablanca and is married to a writer who was born in Rome?". The answer could be found by querying several linked data sources together, like the IMDB-style LinkedMDB movie collection and DBpedia, exploiting that there are entity-level sameAs links between these collections. One can think of different formulations of the example question, such as "Which actress from Casablanca is married to a writer from Rome?". A possible SPARQL formulation, assuming a user familiar with the schema of the underlying knowledge base(s), could consist of the following six triple patterns (joined by shared-variable bindings): ?x hasGender female, ?x isa actor, ?x actedIn Casablanca (film), ?x marriedTo ?w, ?w isa writer, ?w bornIn Rome. This complex query, which involves multiple joins, would yield good results, but it is difficult for the user to come up with the precise choices for relations, classes, and entities.
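To make this fully concrete, the six patterns can be written out as one SPARQL query. The following rendering is illustrative only: it assumes that the predicates hasGender, isa, actedIn, marriedTo, and bornIn and the entity identifiers all live in a single vocabulary, and it omits namespace prefixes and IRI escaping for readability:

    SELECT ?x WHERE {
      ?x hasGender female .
      ?x isa actor .
      ?x actedIn Casablanca_(film) .
      ?x marriedTo ?w .
      ?w isa writer .
      ?w bornIn Rome .
    }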


This would require familiarity with the contents of the knowledge base, which no average user is expected to have. Our goal is to automatically create such structured queries by mapping the user's question into this representation. Keyword search is usually not a viable alternative when the information need involves joining multiple triples to construct the final result, notwithstanding good attempts like that of Pound et al. (2010). In the example, the obvious keyword query "female actress Casablanca married writer born Rome" lacks a clear specification of the relations among the different entities.

1.2 Problem

Given a natural language question qNL and a knowledge base KB, our goal is to translate qNL into a formal query qFL that captures the information need expressed by qNL.

We focus on input questions that put the emphasis on entities, classes, and relations between them. We do not consider aggregations (counting, max/min, etc.) and negations. As a result, we generate structured queries of the form known as conjunctive queries or select-project-join queries in database terminology. Our target language is SPARQL 1.0, where the above focus leads to queries that consist of multiple triple patterns, that is, conjunctions of SPO search conditions. We do not use any pre-existing query templates, but generate queries from scratch, as they involve a variable number of joins with a-priori unknown join structure.

A major challenge is the ambiguity of the phrases occurring in a natural-language question. Phrases can denote entities (e.g., the city of Casablanca or the movie Casablanca), classes (e.g., actresses, movies, married people), or relations/properties (e.g., marriedTo between people, played between people and movies). A priori, we do not know if a phrase should be mapped to an entity, a class, or a relation. In fact, some phrases may denote any of these three kinds of targets. For example, a phrase like "wrote score for" in a question about film music composers could map to the composer-film relation wroteSoundtrackForFilm, to the class movieSoundtracks (a subclass of music pieces), or to an entity like the movie "The Score". Depending on the choice, we may arrive at a structurally good query (with triple patterns that can actually be joined) or at a meaningless and non-executable query (with disconnected triple patterns). This generalized disambiguation problem is much more challenging than the more focused task of named entity disambiguation (NED). It is also different from general word sense disambiguation (WSD), which focuses on the meaning of individual words (e.g., mapping them to WordNet synsets).

1.3 Contribution

In our approach, we introduce new elements towards making the translation of questions into SPARQL triple patterns more expressive and robust. Most importantly, we solve the disambiguation and mapping tasks jointly, by encoding them into a comprehensive integer linear program (ILP): the segmentation of questions into meaningful phrases, the mapping of phrases to semantic entities, classes, and relations, and the construction of SPARQL triple patterns. The ILP harnesses the richness of large knowledge bases like Yago2 (Hoffart et al., 2011b), which has information not only about entities and relations, but also about surface names and textual patterns by which web sources refer to them. For example, Yago2 knows that "Casablanca" can refer to the city or the film, and "played in" is a pattern that can denote the actedIn relation. In addition, we can leverage the rich type system of semantic classes. For example, knowing that Casablanca is a film, for translating "played in" we can focus on relations with a type signature whose range includes films, as opposed to sports teams, for example. Such information is encoded in judiciously designed constraints for the ILP.

Although we intensively harness Yago2, our approach does not depend on a specific choice of knowledge base or language resource for type information and phrase/name dictionaries. Other knowledge bases such as DBpedia can be easily plugged in.

Based on these ideas, we have developed a framework and system, called DEANNA (DEep Answers for maNy Naturally Asked questions), that comprises a full suite of components for question decomposition, mapping constituents into the semantic concept space, generating alternative candidate mappings, and computing a coherent mapping of all constituents into a set of SPARQL triple patterns that can be directly executed on one or more linked data sources.

2 Background

We use the Yago2 knowledge base, with its rich type system, as a semantic backbone. Yago2 is composed of instances of binary relations derived from Wikipedia and WordNet. The instances, called facts, provide both ontological information and instance data. Figure 1 shows sample facts from Yago2. Each fact is composed of semantic items that can be divided into relations, entities, and classes. Entities and classes together are referred to as concepts.

Subject            Predicate   Object
film               subclassOf  production
Casablanca (film)  type        film
"Casablanca"       means       Casablanca (film)
"Casablanca"       means       Casablanca, Morocco
Ingrid Bergman     actedIn     Casablanca (film)

Figure 1: Sample knowledge base

Examples of relations are type, subclassOf, and actedIn. Each relation has a type signature: classes for the relation's domain and range. Classes, such as person and film, group entities. Entities are represented in canonical form, such as Ingrid Bergman and Casablanca (film). A special type of entities are literals, such as strings, numbers, and dates.

3 Framework

Given a natural language question, Figure 2 shows the tasks DEANNA performs to translate a question into a structured query. The first three steps prepare the input for constructing a disambiguation graph for mapping the phrases in a question onto entities, classes, and relations, in a coherent manner. The fourth step formulates this generalized disambiguation problem as an ILP with complex constraints and computes the best solution using an ILP solver. Finally, the fifth and sixth steps together use the disambiguated mapping to construct an executable SPARQL query.

A question sentence is a sequence of tokens, qNL = (t0, t1, ..., tn). A phrase is a contiguous subsequence of tokens (ti, ti+1, ..., ti+l) ⊆ qNL, 0 ≤ i, 0 ≤ l ≤ n. The input question is fed into the following pipeline of six steps:

1. Phrase detection. Phrases are detected that potentially correspond to semantic items, such as 'Who', 'played in', 'movie' and 'Casablanca'.

2. Phrase mapping to semantic items. This includes finding that the phrase 'played in' can either refer to the semantic relation actedIn or to playedForTeam, and that the phrase 'Casablanca' can potentially refer to Casablanca (film) or Casablanca, Morocco. This step merely constructs a candidate space for the mapping. The actual disambiguation is addressed by step 4, discussed below.

3. Q-unit generation. Intuitively, a q-unit is a triple composed of phrases. Their generation and role will be discussed in detail in the next section.

4. Joint disambiguation, where the ambiguities in the phrase-to-semantic-item mapping are resolved. This entails resolving the ambiguity in phrase borders and, above all, choosing the best fitting candidates from the semantic space of entities, classes, and relations. Here, we determine for our running example that 'played in' refers to the semantic relation actedIn and not to playedForTeam, and that the phrase 'Casablanca' refers to Casablanca (film) and not Casablanca, Morocco.

5. Semantic item grouping to form semantic triples. For example, we determine that the relation marriedTo connects the person referred to by 'Who' and writer to form the semantic triple person marriedTo writer. This is done via q-units.

6. Query generation. For SPARQL queries, semantic triples such as person marriedTo writer have to be mapped to suitable triple patterns with appropriate join conditions expressed through common variables: ?x type person, ?x marriedTo ?w, and ?w type writer for the example.

3.1 Phrase Detection

A detected phrase p is a pair <Toks, l>, where Toks is a phrase (a token sequence) and l ∈ {concept, relation} is a label indicating whether the phrase is a relation phrase or a concept phrase. Pr is the set of all detected relation phrases and Pc is the set of all detected concept phrases.

One special type of detected relation phrase is the null phrase, where no relation is explicitly mentioned but can be induced. The most prominent example of this is the case of adjectives, such as 'Australian movie', where we know there is a relation being expressed between 'Australia' and 'movie'.

We use multiple detectors for detecting phrases of different types.
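As a small illustration of the detection step, a minimal concept-phrase detector can enumerate all contiguous token subsequences of the question and keep those found in a phrase-concept dictionary of the kind described below. This is a toy sketch with a stand-in dictionary, not the actual detector:

    # Toy sketch: detect concept phrases by dictionary lookup over all
    # contiguous token subsequences (ti, ..., ti+l) of the question.
    def detect_concept_phrases(tokens, phrase_dict):
        detected = []
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                phrase = " ".join(tokens[i:j])
                if phrase in phrase_dict:
                    # a detected phrase is a pair <Toks, l> with l = concept
                    detected.append(((i, j), phrase, "concept"))
        return detected

    phrase_dict = {"Casablanca": ["Casablanca (film)", "Casablanca, Morocco"],
                   "Rome": ["Rome"]}
    print(detect_concept_phrases("Who played in Casablanca".split(), phrase_dict))
    # -> [((3, 4), 'Casablanca', 'concept')]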

[Figure 2 appears here: the DEANNA architecture, a pipeline of six stages backed by the knowledge base: (1) Phrase Detection, (2) Concept & Relation Phrase Mapping, (3) Q-unit Generation, (4) Joint Disambiguation, (5) Semantic Items Grouping, (6) Query Generation. The knowledge base provides entities & names, classes & subclasses, and relations & patterns, incl. dictionaries & statistics.]

Figure 2: Architecture of DEANNA.

For concept detection, we use a detector that works against a phrase-concept dictionary, which looks as follows:

{'Rome', 'eternal city'} → Rome
{'Casablanca'} → Casablanca (film)

We experimented with using third-party named entity recognizers, but the results were not satisfactory. This dictionary was mostly constructed as part of the knowledge base, independently of the question-to-query translation task, in the form of instances of the means relation in Yago2, an example of which is shown in Figure 1.

For relation detection, we experimented with various approaches. We mainly rely on a relation detector based on ReVerb (Fader et al., 2011) with additional POS tag patterns, in addition to our own detector, which looks for patterns in dependency parses.

3.2 Phrase Mapping

After phrases are detected, each phrase is mapped to a set of semantic items. The mapping of concept phrases also relies on the phrase-concept dictionary. To map relation phrases, we rely on a corpus of textual-pattern-to-relation mappings of the form:

{'play', 'star in', 'act', 'leading role'} → actedIn
{'married', 'spouse', 'wife'} → marriedTo

Distinct phrase occurrences will map to different semantic item instances. We discuss why this is important when we present the construction of the disambiguation graph and variable assignment in the structured query.

3.3 Dependency Parsing & Q-Unit Generation

Dependency parsing identifies triples of tokens, or triploids, <trel, targ1, targ2>, where trel, targ1, targ2 ∈ qNL are seeds for phrases, with the triploid acting as a seed for a potential SPARQL triple pattern. Here, trel is the seed for the relation phrase, while targ1 and targ2 are seeds for the two arguments. At this point, there is no attempt to assign subject/object roles to the arguments.

Triploids are collected by looking for specific dependency patterns in dependency graphs (de Marneffe et al., 2006). The most prominent pattern we look for is a verb and its arguments. Other patterns include adjectives and their arguments, prepositionally modified tokens, and objects of prepositions.

By combining triploids with detected phrases, we obtain q-units. A q-unit is a triple of sets of phrases, <{prel ∈ Pr}, {parg1 ∈ Pc}, {parg2 ∈ Pc}>, where trel ∈ prel and similarly for arg1 and arg2. Conceptually, one can view a q-unit as a placeholder node with three sets of edges, each connecting the same q-node to a phrase that corresponds to a relation or concept phrase in the same q-unit. This notion of nodes and edges will be made more concrete when we present our disambiguation graph construction.
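As a small illustration, and under the assumption that detected phrases carry their (start, end) token spans, a q-unit can be assembled from a triploid by collecting the phrases whose spans contain the respective seed tokens. A sketch, not the authors' code:

    # Sketch of q-unit generation: combine a triploid of seed token
    # positions with the detected phrases that contain those seeds.
    def build_q_unit(triploid, relation_phrases, concept_phrases):
        t_rel, t_arg1, t_arg2 = triploid
        def covering(phrases, t):
            return {p for p in phrases if p[0][0] <= t < p[0][1]}
        return (covering(relation_phrases, t_rel),
                covering(concept_phrases, t_arg1),
                covering(concept_phrases, t_arg2))

    # 'Who(0) played(1) in(2) Casablanca(3)': seeds are tokens 1, 0, 3.
    rel_phrases = {((1, 3), "played in"), ((1, 2), "played")}
    con_phrases = {((0, 1), "Who"), ((3, 4), "Casablanca")}
    print(build_q_unit((1, 0, 3), rel_phrases, con_phrases))
    # -> ({((1, 2), 'played'), ((1, 3), 'played in')},
    #     {((0, 1), 'Who')}, {((3, 4), 'Casablanca')})  (set order may vary)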

3.4 Disambiguation of Phrase Mappings

The core contribution of this paper is a framework for disambiguating phrases into semantic items, covering relations, classes, and entities in a unified manner. This can be seen as a joint task combining named entity disambiguation for entities, word sense disambiguation for classes (common nouns), and relation extraction. The next section presents the disambiguation framework in detail.

3.5 Query Generation

Once phrases are mapped to unique semantic items, we proceed to generate queries in two steps. First, semantic items are grouped into triples. This is done using the triploids generated earlier. The power of using a knowledge base is that we have a rich type system that allows us to tell if two semantic items are compatible or not. Each relation has a type signature, and we check whether the candidate items are compatible with the signature.

We did not assign subject/object roles in triploids and q-units because a natural language relation phrase might express the inverse of a semantic relation; e.g., the natural language expression 'directed by' and the relation isDirectorOf with respect to the movies domain are inverses of each other. Therefore, we check which assignment of arg1 and arg2 is compatible with the semantic relation. If both arrangements are compatible, then we give preference to the assignment given by the dependency parsers.

Once semantic items are grouped into triples, it is an easy task to expand them into SPARQL triple patterns. This is done by replacing each semantic class with a distinct type-constrained variable. Note that this is the reason why each distinct phrase maps to a distinct instance of a semantic class: to ensure correct variable assignment. This becomes clear when we consider the question "Which singer is married to a singer?", which requires two distinct variables, each constrained to bind to an entity of type singer.
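The expansion step can be sketched in a few lines; this is an illustrative rendering (names are ours), in which every class occurrence receives its own fresh, type-constrained variable while entities remain constants:

    # Sketch of expanding a semantic triple into SPARQL triple patterns
    # (Section 3.5): each class occurrence gets a fresh variable, so
    # "Which singer is married to a singer?" yields two distinct variables.
    from itertools import count

    def expand(semantic_triples, classes):
        fresh = (f"?x{i}" for i in count())
        patterns = []
        def term(item):
            if item in classes:               # class occurrence -> new variable
                v = next(fresh)
                patterns.append(f"{v} type {item}")
                return v
            return item                       # entities stay as constants
        for s, r, o in semantic_triples:
            patterns.append(f"{term(s)} {r} {term(o)}")
        return patterns

    print(expand([("singer", "marriedTo", "singer")], {"singer"}))
    # -> ['?x0 type singer', '?x1 type singer', '?x0 marriedTo ?x1']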

4 Joint Disambiguation

The goal of the disambiguation step is to compute a partial mapping of phrases onto semantic items, such that each phrase is assigned to at most one semantic item. This step also resolves the phrase-boundary ambiguity, by enforcing that only non-overlapping phrases are mapped. As the result of disambiguating one phrase can influence the mapping of other phrases, we consider all phrases jointly in one big disambiguation task.

In the following, we construct a disambiguation graph that encodes all possible mappings. We impose a variety of complex constraints (mutual exclusion among overlapping phrases, type constraints among the selected semantic items, etc.), and define an objective function that aims to maximize the joint quality of the mapping. The graph construction itself may resemble similar models used in NED (e.g., (Milne and Witten, 2008; Kulkarni et al., 2009; Hoffart et al., 2011a)). Recall, however, that our task is more complex because we jointly consider entities, classes, and relations in the candidate space of possible mappings. Because of this complication, and to capture our complex constraints, we do not employ graph algorithms, but model the general disambiguation problem as an ILP.

4.1 Disambiguation Graph

Joint disambiguation takes place over a disambiguation graph DG = (V, E), where V = Vs ∪ Vp ∪ Vq and E = Esim ∪ Ecoh ∪ Eq, where:

• Vs is the set of semantic items; vs ∈ Vs is an s-node.
• Vp is the set of phrases; vp ∈ Vp is called a p-node. We denote the set of p-nodes corresponding to relation phrases by Vrp and the set of p-nodes corresponding to concept phrases by Vcp.
• Vq is a set of placeholder nodes for q-units, called q-nodes. They represent phrase triples.
• Esim ⊆ Vp × Vs is a set of weighted similarity edges that capture the strength of the mapping of a phrase to a semantic item.
• Ecoh ⊆ Vs × Vs is a set of weighted coherence edges that capture the semantic coherence between two semantic items. Semantic coherence is discussed in more detail later in this section.
• Eq ⊆ Vq × Vp × {rel, arg1, arg2} is the set of q-edges. Each such edge connects a placeholder q-node to a p-node with a specific role d: as a relation, or as one of the two arguments. A q-unit, as presented earlier, can be seen as a q-node along with its outgoing q-edges.
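The graph can be rendered compactly as plain data structures; the container below is only an illustrative sketch, with nodes and weights to be filled in by the earlier pipeline stages:

    # Illustrative container for DG = (V, E) of Section 4.1.
    from dataclasses import dataclass, field

    @dataclass
    class DisambiguationGraph:
        s_nodes: set = field(default_factory=set)      # semantic items
        p_nodes: set = field(default_factory=set)      # detected phrases
        q_nodes: set = field(default_factory=set)      # q-unit placeholders
        sim_edges: dict = field(default_factory=dict)  # (p, s) -> weight w
        coh_edges: dict = field(default_factory=dict)  # (s1, s2) -> weight v
        q_edges: set = field(default_factory=set)      # (q, p, d), d in {'rel','arg1','arg2'}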

Figure 3 shows the disambiguation graph for our running example (excluding coherence edges between s-nodes).

[Figure 3 appears here: q-nodes q1–q3 connected via q-edges labeled rel, arg1, arg2 to p-nodes, which in turn connect via similarity edges to candidate s-nodes. For instance, 'Rome' maps to e:Rome and e:Sydne_Rome; 'born' to e:Born_(film) and e:Max_Born; 'was born' to r:bornOnDate and r:bornIn; 'a writer' to c:writer; 'Casablanca' to e:White_House, e:Casablanca, and e:Casablanca_(film); 'played' to e:Played_(film); 'played in' to r:actedIn and r:hasMusicalRole; 'Who' to c:person; 'married' to e:Married_(series); 'married to' to c:married_person; 'is married to' to r:marriedTo.]

Figure 3: Disambiguation graph for the running example.

4.2 Edge Weights

We next describe how the weights on similarity edges and semantic coherence edges are defined.

4.2.1 Semantic Coherence

Semantic coherence, Cohsem, captures to what extent two semantic items occur in the same context. This is different from semantic similarity (Simsem), which is usually evaluated using the distance between nodes in a taxonomy (Resnik, 1995). While we expect Simsem(George Bush, Woody Allen) to be higher than Simsem(Woody Allen, Terminator), we would like Cohsem(Woody Allen, Terminator), both of which are from the entertainment domain, to be higher than Cohsem(George Bush, Woody Allen).

For Yago2, we characterize an entity e by its inlinks InLinks(e): the set of Yago2 entities whose corresponding Wikipedia pages link to the entity. To be able to compare semantic items of different semantic types (entities, relations, and classes), we need to extend this notion to classes and relations. For a class c with entities e, its inlinks are defined as follows:

InLinks(c) = ∪_{e ∈ c} InLinks(e)

For relations, we only consider those that map entities to entities (e.g., actedIn, produced), for which we define the set of inlinks as follows:

InLinks(r) = ∪_{(e1,e2) ∈ r} (InLinks(e1) ∩ InLinks(e2))

The intuition behind this is that when the two arguments of an instance of the relation co-occur, then the relation is being expressed. We define the semantic coherence (Cohsem) between two semantic items s1 and s2 as the Jaccard coefficient of their sets of inlinks.
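The inlink-based coherence is straightforward to compute. The following is a small sketch under the assumption that entity inlinks are available as Python sets; it is not the authors' implementation:

    # Sketch of Cohsem (Section 4.2.1); `inlinks` maps an entity to the
    # set of entities whose Wikipedia pages link to it.
    def inlinks_of_class(members, inlinks):
        # InLinks(c) = union of InLinks(e) over entities e in class c
        return set().union(*(inlinks[e] for e in members)) if members else set()

    def inlinks_of_relation(instance_pairs, inlinks):
        # InLinks(r) = union over (e1, e2) in r of InLinks(e1) ∩ InLinks(e2)
        result = set()
        for e1, e2 in instance_pairs:
            result |= inlinks[e1] & inlinks[e2]
        return result

    def coh_sem(links1, links2):
        # Jaccard coefficient of the two inlink sets
        union = links1 | links2
        return len(links1 & links2) / len(union) if union else 0.0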

4.2.2 Similarity Weights

Similarity weights are computed differently for entities, classes, and relations. For entities, we use a normalized prior score based on how often a phrase refers to a certain entity in Wikipedia. For classes, we use a normalized prior that reflects the number of members in a class. Finally, for relations, similarity reflects the maximum n-gram similarity between the phrase and any of the relation's surface forms. We use Lucene for indexing and searching the relation surface forms.

4.3 Disambiguation Graph Processing

The result of disambiguation is a subgraph of the disambiguation graph, yielding the most coherent mappings. We employ an ILP to this end. Before describing our ILP, we state some necessary definitions:

• Triple dimensions: d ∈ {rel, arg1, arg2}.
• Tokens: T = {t0, t1, ..., tn}.
• Phrases: P = {p0, p1, ..., pk}.
• Semantic items: S = {s0, s1, ..., sl}.
• Token occurrences: P(t) = {p ∈ P | t ∈ p}.
• Xi ∈ {0, 1} indicates if p-node i is selected.
• Yij ∈ {0, 1} indicates if p-node i maps to s-node j.
• Zkl ∈ {0, 1} indicates if s-nodes k and l are both selected, so that their coherence edge matters.
• Qmnd ∈ {0, 1} indicates if the q-edge between q-node m and p-node n for dimension d is selected.
• Cj, Ej and Rj are {0, 1} constants indicating if s-node j is a class, entity, or relation, respectively.
• wij is the weight of a p–s similarity edge.
• vkl is the weight of an s–s semantic coherence edge.
• trc ∈ {0, 1} indicates if the relation s-node r is type-compatible with the concept s-node c.

Given the above definitions, our objective function is:

maximize α Σ_{i,j} wij Yij + β Σ_{k,l} vkl Zkl + γ Σ_{m,n,d} Qmnd

subject to the following constraints:

1. A p-node can be assigned to at most one s-node:

Σ_j Yij ≤ 1, ∀i

2. If a p–s similarity edge is chosen, then the respective p-node must be chosen:

Yij ≤ Xi, ∀i, j

3. If s-nodes k and l are chosen (Zkl = 1), then there are p-nodes mapping to each of the s-nodes k and l (Yik = 1 for some i, and Yjl = 1 for some j):

Zkl ≤ Σ_i Yik and Zkl ≤ Σ_j Yjl

4. No token can appear as part of two phrases:

Σ_{i ∈ P(t)} Xi ≤ 1, ∀t ∈ T

5. At most one q-edge is selected for a dimension:

Σ_n Qmnd ≤ 1, ∀m, d

6. If the q-edge mnd is chosen (Qmnd = 1), then p-node n must be selected:

Qmnd ≤ Xn, ∀m, n, d

7. Each semantic triple should include a relation:

Rr ≥ Qmn′d + Xn′ + Yn′r − 2, ∀m, n′, r, with d = rel

8. Each triple should have at least one class:

Cc1 + Cc2 ≥ Qmn″d1 + Xn″ + Yn″c1 + Qmn‴d2 + Xn‴ + Yn‴c2 − 5, ∀m, n″, n‴, c1, c2, with d1 = arg1, d2 = arg2

This constraint is not invoked for existential questions that return Boolean answers and are translated to ASK queries in SPARQL. An example is the question "Did Tom Cruise act in Top Gun?", which can be translated to ASK {Tom Cruise actedIn Top Gun}.

9. Type constraints are respected (through q-edges):

trc1 + trc2 ≥ Qmn′d1 + Xn′ + Yn′r + Qmn″d2 + Xn″ + Yn″c1 + Qmn‴d3 + Xn‴ + Yn‴c2 − 7, ∀m, n′, n″, n‴, r, c1, c2, with d1 = rel, d2 = arg1, d3 = arg2

The above is a sophisticated ILP, and most likely NP-hard. However, even with tens of thousands of variables it is within the regime of modern ILP solvers. In our experiments, we used Gurobi (Gur, 2011) and achieved run-times of typically a few seconds.

[Figure 4 appears here: the computed subgraph for the running example, with 'Rome' mapped to e:Rome, 'was born' to r:bornIn, 'a writer' to c:writer, 'Casablanca' to e:Casablanca, 'played in' to r:actedIn, 'Who' to c:person, and 'is married to' to r:marriedTo.]

Figure 4: Computed subgraph for the running example.

Figure 4 shows the resulting subgraph for the disambiguation graph of Figure 3. Note how common p-nodes between q-units capture joins.
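To give a flavor of the encoding, the fragment below sets up constraints (1)–(4) and the first two objective terms with gurobipy. It is an illustrative sketch, not the authors' implementation: the inputs `phrases` (p-node ids), `sim` and `coh` (edge-weight dicts), and `token_occurrences` (the sets P(t)) are assumed, and the Q-variables with constraints (5)–(9) are omitted for brevity:

    # Sketch: a fragment of the ILP of Section 4.3 in gurobipy.
    from gurobipy import Model, GRB, quicksum

    def build_ilp_fragment(phrases, sim, coh, token_occurrences,
                           alpha=1.0, beta=1.0):
        m = Model("deanna-fragment")
        X = {i: m.addVar(vtype=GRB.BINARY) for i in phrases}
        Y = {e: m.addVar(vtype=GRB.BINARY) for e in sim}   # keys (p, s)
        Z = {e: m.addVar(vtype=GRB.BINARY) for e in coh}   # keys (s1, s2)

        for i in phrases:                      # (1) at most one s-node per p-node
            m.addConstr(quicksum(Y[p, s] for (p, s) in sim if p == i) <= 1)
        for (p, s) in sim:                     # (2) chosen edge implies chosen p-node
            m.addConstr(Y[p, s] <= X[p])
        for (k, l) in coh:                     # (3) coherence needs both s-nodes mapped
            m.addConstr(Z[k, l] <= quicksum(Y[p, s] for (p, s) in sim if s == k))
            m.addConstr(Z[k, l] <= quicksum(Y[p, s] for (p, s) in sim if s == l))
        for occ in token_occurrences.values(): # (4) overlapping phrases are exclusive
            m.addConstr(quicksum(X[i] for i in occ) <= 1)

        m.setObjective(alpha * quicksum(w * Y[e] for e, w in sim.items()) +
                       beta * quicksum(v * Z[e] for e, v in coh.items()),
                       GRB.MAXIMIZE)
        return m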

5 Evaluation

5.1 Datasets

Our experiments are based on two collections of questions: the QALD-1 task for question answering over linked data (QAL, 2011), and a collection of questions used in (Elbassuoni et al., 2011; Elbassuoni et al., 2009) in the context of the NAGA project, for informative ranking of SPARQL query answers. (Elbassuoni et al. (2009) evaluated the SPARQL queries, but the underlying questions are formulated in natural language.) The NAGA collection is based on linking data from IMDB with the Yago2 knowledge base. This is an interesting linked-data case: IMDB provides data about movies, actors, directors, and movie plots (in the form of descriptive keywords and phrases); Yago2 adds semantic types and relational facts for the participating entities. Yago2 provides nearly 3 million concepts and 100 relations, of which 41 lie within the scope of our framework.

Typical example questions for these two collections are: "Which software has been published by Mean Hamster Software?" for QALD-1, and "Which director has won the Academy Award for Best Director and is married to an actress that has won the Academy Award for Best Actress?" for NAGA. For both collections, some questions are out of scope for our setting, because they mention entities or relations that are not available in the underlying datasets, contain date or time comparisons, or involve aggregation such as counting. After removing these questions, our test set consists of 27 QALD-1 training questions out of a total of 50, and 44 NAGA questions out of a total of 87. We used the 19 questions from the QALD-1 test set that are within the scope of our method for tuning the hyperparameters (α, β, γ) in the ILP objective function.

5.2 Evaluation Metrics

We evaluated the output of DEANNA at three stages in the processing pipeline: a) after the disambiguation of phrases, b) after the generation of the SPARQL query, and c) after obtaining answers from the underlying linked-data sources. This way, we could obtain insights into our building blocks, in addition to assessing the end-to-end performance. In particular, we could assess the goodness of the question-to-query translation independently of the actual answer quality, which may depend on particularities of the underlying datasets (e.g., slight mismatches between query terminology and the names in the data).

At each of the three stages, the output was shown to two human assessors who judged whether an output item was good or not. If the two were in disagreement, then a third person resolved the judgment.

For the disambiguation stage, the judges looked at each q-node/s-node pair, in the context of the question and the underlying data schemas, and determined whether the mapping was correct or not and whether any expected mappings were missing. For the query-generation stage, the judges looked at each triple pattern and determined whether the pattern was meaningful for the question or not and whether any expected triple pattern was missing. Note that, because our approach does not use any query templates, the same question may generate semantically equivalent queries that differ widely in terms of their structure. Hence, we rely on evaluation metrics that are based on triple patterns, as there is no gold-standard query for a given question. For the query-answering stage, the judges were asked to identify if the result sets for the generated queries are satisfactory.

With these assessments, we computed overall quality measures by both micro-averaging and macro-averaging. Micro-averaging aggregates over all assessed items (e.g., q-node/s-node pairs or triple patterns) regardless of the questions to which they belong. Macro-averaging first aggregates the items for the same question, and then averages the quality measure over all questions.

For a question q and item set s in one of the stages of evaluation, let correct(q, s) be the number of correct items in s, ideal(q) be the size of the ideal item set, and retrieved(q, s) be the number of retrieved items. We define coverage and precision as follows:

cov(q, s) = correct(q, s) / ideal(q)
prec(q, s) = correct(q, s) / retrieved(q, s)

5.3 Results & Discussion

5.3.1 Disambiguation

Table 1 shows the results for disambiguation in terms of macro and micro coverage and precision. For both datasets, coverage is high, as few mappings are missing. We obtain perfect precision for QALD-1, as no mapping that we generate is incorrect, while for NAGA we generate few incorrect mappings.

5.3.2 Query Generation

Table 2 shows the same metrics for the generated triple patterns. The results are similar to those for disambiguation. Missing or incorrect triple patterns can be attributed to (i) incorrect mappings in the disambiguation stage or (ii) incorrect detection of dependencies between phrases despite having the correct mappings.

5.3.3 Question Answering

Table 3 shows the results for query answering. Here, we attempt to generate answers to questions by executing the generated queries over the datasets. The table shows the number of questions for which the system successfully generated SPARQL queries (#queries) and, among those, how many resulted in satisfactory answers as judged by our evaluators (#satisfactory). Answers were considered unsatisfactory when: 1) the generated SPARQL query was wrong, 2) the result set was empty due to the incompleteness of the underlying knowledge base, or 3) only a small fraction of the result set was relevant to the question. For both sets of questions, most of the queries that were perceived as unsatisfactory were ones that returned no answers. Table 4 shows a set of example QALD questions, the corresponding SPARQL queries, and sample answers.
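Referring back to the coverage and precision definitions of Section 5.2, the two measures are direct ratios; as a trivial executable reference (function names are ours):

    # Coverage and precision from Section 5.2, as executable definitions.
    def cov(correct, ideal):
        return correct / ideal      # fraction of the ideal item set recovered

    def prec(correct, retrieved):
        return correct / retrieved  # fraction of retrieved items that are correct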

Table 1: Disambiguation

Benchmark    QALD-1  NAGA
cov_macro    0.973   0.934
prec_macro   1.000   0.934
cov_micro    0.963   0.945
prec_micro   1.000   0.941

Table 2: Query generation

Benchmark    QALD-1  NAGA
cov_macro    0.975   0.894
prec_macro   1.000   0.941
cov_micro    0.956   0.847
prec_micro   1.000   0.906

Table 3: Query answering

Benchmark      QALD-1  NAGA
#questions     27      44
#queries       20      41
#satisfactory  10      15
#relaxed       +3      +3

Question: 1. Who was the wife of President Lincoln?
Generated Query: ?x marriedTo Abraham Lincoln . ?x type person
Sample Answers: Mary Todd Lincoln

Question: 2. In which films did Julia Roberts as well as Richard Gere play?
Generated Query: ?x type movie . Richard Gere actedIn ?x . Julia Roberts actedIn ?x
Sample Answers: Runaway Bride; Pretty Woman

Question: 3. Which actors were born in Germany?
Generated Query: ?x type actor . ?x bornIn Germany
Sample Answers: NONE

Table 4: Example questions, the generated SPARQL queries, and their answers

Queries that produced no answers, such as the third query in Table 4, were further relaxed using an incarnation of the techniques described in (Elbassuoni et al., 2009): we retain the triple patterns expressing type constraints and relax all other triple patterns. Relaxing a triple pattern was done by replacing all entities with variables and casting entity mentions into keywords that are attached to the relaxed triple pattern. For example, the QALD question "Which actors were born in Germany?" was translated into the following SPARQL query: ?x type actor . ?x bornIn Germany, which produced no answers when run over the Yago2 knowledge base, since the relation bornIn relates people to cities, and not countries, in Yago2. The query was then relaxed into: ?x type actor . ?x bornIn ?z[Germany]. This relaxed (and keyword-augmented) triple-pattern query was then processed the same way as triple-pattern queries without any keywords. The results of such a query were then ranked based on how well they match the keyword conditions specified in the relaxed query, using the ranking model in (Elbassuoni et al., 2009). Using this technique, the top-ranked results for the relaxed query were all actors born in German cities, as shown in Table 5.

Q: ?x type actor . ?x wasBornIn ?z[Germany]
Martin Lawrence type actor . Martin Lawrence wasBornIn Frankfurt am Main
Robert Schwentke type actor . Robert Schwentke wasBornIn Stuttgart
Willy Millowitsch type actor . Willy Millowitsch wasBornIn Cologne
Jerry Zaks type actor . Jerry Zaks wasBornIn Stuttgart

Table 5: Top-4 results for the QALD question "Which actors were born in Germany?" after relaxation

After relaxation, the judges again assessed the results of the relaxed queries and determined whether they were satisfactory or not. The number of additional queries that obtained satisfactory answers after relaxation is shown under #relaxed in Table 3.

The evaluation data, in addition to a demonstration of our system (Yahya et al., 2012), can be found at http://mpi-inf.mpg.de/yago-naga/deanna/.
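A sketch of this relaxation step (our own illustrative rendering, not the (Elbassuoni et al., 2009) implementation): type-constraint patterns are kept, and in every other pattern each entity becomes a fresh variable annotated with the entity name as a keyword:

    # Sketch of triple-pattern relaxation: keep type constraints, turn
    # entities in other patterns into keyword-annotated variables.
    def relax(patterns):
        relaxed, fresh = [], iter("zyxwvu")
        for s, p, o in patterns:
            if p == "type":
                relaxed.append((s, p, o))
            else:
                to_var = lambda t: t if t.startswith("?") \
                                   else "?%s[%s]" % (next(fresh), t)
                relaxed.append((to_var(s), p, to_var(o)))
        return relaxed

    print(relax([("?x", "type", "actor"), ("?x", "bornIn", "Germany")]))
    # -> [('?x', 'type', 'actor'), ('?x', 'bornIn', '?z[Germany]')]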

6 Related Work

Question answering has a long history in NLP and IR research. The Web and Wikipedia have proved to be a valuable resource for answering fact-oriented questions. State-of-the-art methods (Hirschman and Gaizauskas, 2001; Kwok et al., 2001; Zheng, 2002; Katz et al., 2007; Dang et al., 2007; Voorhees, 2003) cast the user's question into a keyword query to a Web search engine (perhaps with phrases for location and person names or other proper nouns). Key to finding good results is to retrieve and rank sentences or short passages that contain all or most keywords and are likely to yield good answers. Together with trained classifiers for the question type (and thus the desired answer type), this methodology performs fairly well for both factoid and list questions.

IBM's Watson project (Ferrucci et al., 2010) demonstrated a new kind of deep QA. A key element in Watson's approach is to decompose complex questions into several cues and sub-cues, with the aim of generating answers from matches for the various cues (tapping into the Web and Wikipedia). Knowledge bases like DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2008), and Yago (Suchanek et al., 2007) are used both for answering parts of questions that can be translated to structured form (Chu-Carroll et al., 2012) and for type-checking possible answer candidates and thus filtering out spurious results (Kalyanpur et al., 2011).

The recent QALD-1 initiative (QAL, 2011) proposed a benchmark task to translate questions into SPARQL queries over linked-data sources like DBpedia and MusicBrainz. FREyA (Damljanovic et al., 2011), the best performing system, relies on interaction with the user to interpret the question. Earlier work on mapping questions into structured queries includes the work by Frank et al. (2007) and Unger and Cimiano (2011). Frank et al. (2007) used lexical-conceptual templates for query generation. However, this work did not address the crucial issue of disambiguating the constituents of the question. In Pythia, Unger and Cimiano (2011) relied on an ontology-driven grammar for the question language, so that questions could be directly mapped onto the vocabulary of the underlying ontology. Such grammars are obviously hard to craft for very large, complex, and evolving knowledge bases. NaLIX is an attempt to bring question answering to XML data (Li, Yang, and Jagadish, 2007) by mapping questions to XQuery expressions, relying on human interaction to resolve possible ambiguity.

Very recently, Unger et al. (2012) developed a template-based approach based on Pythia, where questions are automatically mapped to structured queries in a two-step process. First, a set of query templates is generated for a question, independent of the knowledge base, determining the structure of the query. After that, each template is instantiated with semantic items from the knowledge base. This performs reasonably well for the QALD-1 benchmark: out of 50 test questions, 34 could be mapped, and 19 were correctly answered.

Efforts on user-friendly exploration of structured data include keyword search over relational databases (Bhalotia et al., 2002) and structured keyword search (Pound et al., 2010). The latter is a compromise between full natural language and structured queries, where the user provides the structure and the system takes care of the disambiguation of keyword phrases.

Our joint disambiguation method was inspired by recent work on NED (Milne and Witten, 2008; Kulkarni et al., 2009; Hoffart et al., 2011a) and WSD (Navigli, 2009). In contrast to this prior work on related problems, our graph construction and constraints are more complex, as we address the joint mapping of arbitrary phrases onto entities, classes, or relations. Moreover, instead of graph algorithms or factor-graph learning, we use an ILP for solving the ambiguity problem. This way, we can accommodate expressive constraints, while being able to disambiguate all phrases in a few seconds.

DEANNA uses dictionaries of names and phrases for entities, classes, and relations. Spitkovsky and Chang (2012) recently released a huge dictionary of pairs of phrases and Wikipedia links, derived from Google's Web index. For relations, Nakashole et al. (2012) released PATTY, a large taxonomy of patterns with semantic types.

7 Conclusions and Future Work

We presented a method for translating natural-language questions into structured queries. The novelty of this method lies in modeling several mapping stages as a joint ILP problem. We harness type signatures and other information from large-scale knowledge bases. Although our model, in principle, leads to high combinatorial complexity, we observed that the Gurobi solver could handle our judiciously designed ILP very efficiently. Our experimental studies showed very high precision and good coverage of the query translation, and good results in the actual question answers.

Future work includes relaxing some of the limitations that our current approach still has. First, questions with aggregations cannot be handled at this point. Second, queries sometimes return empty answers although they perfectly capture the original question, because the underlying data sources are incomplete or represent the relevant information in an unexpected manner. We plan to extend our approach of combining structured data with textual descriptions, and generate queries that combine structured search predicates with keyword or phrase matching.

References

Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; and Ives, Z. G. 2007. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC.

Bhalotia, G.; Hulgeri, A.; Nakhe, C.; Chakrabarti, S.; and Sudarshan, S. 2002. Keyword Searching and Browsing in Databases using BANKS. In ICDE.

Bollacker, K. D.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a Collaboratively Created Graph Database for Structuring Human Knowledge. In SIGMOD.

Chu-Carroll, J.; Fan, J.; Boguraev, B. K.; Carmel, D.; Sheinwald, D.; and Welty, C. 2012. Finding needles in the haystack: Search and candidate generation. IBM J. Res. & Dev. 56(3/4).

Damljanovic, D.; Agatonovic, M.; and Cunningham, H. 2011. FREyA: an Interactive Way of Querying Linked Data using Natural Language.

Dang, H. T.; Kelly, D.; and Lin, J. J. 2007. Overview of the TREC 2007 question answering track. In TREC.

de Marneffe, M. C.; Maccartney, B.; and Manning, C. D. 2006. Generating typed dependency parses from phrase structure parses. In LREC.

Elbassuoni, S.; Ramanath, M.; Schenkel, R.; Sydow, M.; and Weikum, G. 2009. Language-model-based ranking for queries on RDF-graphs. In CIKM.

Elbassuoni, S.; Ramanath, M.; and Weikum, G. 2011. Query relaxation for entity-relationship search. In ESWC.

Fader, A.; Soderland, S.; and Etzioni, O. 2011. Identifying relations for open information extraction. In EMNLP.

Ferrucci, D. A.; Brown, E. W.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A.; Lally, A.; Murdock, J. W.; Nyberg, E.; Prager, J. M.; Schlaefer, N.; and Welty, C. A. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine 31(3).

Frank, A.; Krieger, H.-U.; Xu, F.; Uszkoreit, H.; Crysmann, B.; Jörg, B.; and Schäfer, U. 2007. Question Answering from Structured Knowledge Sources. J. Applied Logic 5(1).

Gurobi Optimization, Inc. 2012. Gurobi Optimizer Reference Manual. http://www.gurobi.com/.

Heath, T., and Bizer, C. 2011. Linked Data: Evolving the Web into a Global Data Space. San Rafael, CA: Morgan & Claypool, 1st edition.

Hirschman, L., and Gaizauskas, R. 2001. Natural Language Question Answering: The View from Here. Nat. Lang. Eng. 7.

Hoffart, J.; Mohamed, A. Y.; Bordino, I.; Fürstenau, H.; Pinkal, M.; Spaniol, M.; Taneva, B.; Thater, S.; and Weikum, G. 2011a. Robust Disambiguation of Named Entities in Text. In EMNLP.

Hoffart, J.; Suchanek, F. M.; Berberich, K.; Lewis-Kelham, E.; de Melo, G.; and Weikum, G. 2011b. Yago2: exploring and querying world knowledge in time, space, context, and many languages. In WWW (Companion Volume).

Kalyanpur, A.; Murdock, J. W.; Fan, J.; and Welty, C. A. 2011. Leveraging community-built knowledge for type coercion in question answering. In International Semantic Web Conference.

Katz, B.; Felshin, S.; Marton, G.; Mora, F.; Shen, Y. K.; Zaccak, G.; Ammar, A.; Eisner, E.; Turgut, A.; and Westrick, L. B. 2007. CSAIL at TREC 2007 Question Answering. In TREC.

Kulkarni, S.; Singh, A.; Ramakrishnan, G.; and Chakrabarti, S. 2009. Collective annotation of Wikipedia entities in web text. In KDD.

Kwok, C. C. T.; Etzioni, O.; and Weld, D. S. 2001. Scaling Question Answering to the Web. In WWW.

Li, Y.; Yang, H.; and Jagadish, H. V. 2007. NaLIX: A Generic Natural Language Search Environment for XML Data. ACM Trans. Database Syst. 32(4).

Milne, D. N., and Witten, I. H. 2008. Learning to link with Wikipedia. In CIKM.

Nakashole, N.; Weikum, G.; and Suchanek, F. 2012. PATTY: A Taxonomy of Relational Patterns with Semantic Types. In EMNLP.

Navigli, R. 2009. Word sense disambiguation: A survey. ACM Comput. Surv. 41(2).

Pound, J.; Ilyas, I. F.; and Weddell, G. E. 2010. Expressive and Flexible Access to Web-extracted Data: A Keyword-based Structured Query Language. In SIGMOD.

QALD-1. 2011. 1st Workshop on Question Answering over Linked Data (QALD-1). http://www.sc.cit-ec.uni-bielefeld.de/qald-1.

Resnik, P. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In IJCAI.

Spitkovsky, V. I., and Chang, A. X. 2012. A Cross-Lingual Dictionary for English Wikipedia Concepts. In LREC.

Suchanek, F. M.; Kasneci, G.; and Weikum, G. 2007. Yago: a core of semantic knowledge. In WWW.

Tummarello, G.; Cyganiak, R.; Catasta, M.; Danielczyk, S.; Delbru, R.; and Decker, S. 2010. Sig.ma: Live views on the web of data. J. Web Sem. 8(4).

Unger, C., and Cimiano, P. 2011. Pythia: Compositional Meaning Construction for Ontology-Based Question Answering on the Semantic Web. In NLDB.

Unger, C.; Bühmann, L.; Lehmann, J.; Ngonga Ngomo, A.-C.; Gerber, D.; and Cimiano, P. 2012. Template-based question answering over RDF data. In WWW.

Voorhees, E. M. 2003. Overview of the TREC 2003 question answering track. In TREC.

Yahya, M.; Berberich, K.; Elbassuoni, S.; Ramanath, M.; Tresp, V.; and Weikum, G. 2012. Deep answers for naturally asked questions on the web of data. In WWW.

Zheng, Z. 2002. AnswerBus Question Answering System. In HLT.
