Natural Language Questions for the Web of Data
Mohamed Yahya1, Klaus Berberich1, Shady Elbassuoni2, Maya Ramanath3, Volker Tresp4, Gerhard Weikum1
1 Max Planck Institute for Informatics, Germany
2 Qatar Computing Research Institute
3 Dept. of CSE, IIT-Delhi, India
4 Siemens AG, Corporate Technology, Munich, Germany
{myahya,kberberi,weikum}@mpi-inf.mpg.de [email protected] [email protected] [email protected]

Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 379–390, Jeju Island, Korea, 12–14 July 2012. © 2012 Association for Computational Linguistics

Abstract

The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources.

Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the question translation and the resulting query answering.

1 Introduction

1.1 Motivation

Recently, very large, structured, and semantically rich knowledge bases have become available. Examples are Yago (Suchanek et al., 2007), DBpedia (Auer et al., 2007), and Freebase (Bollacker et al., 2008). DBpedia forms the nucleus of the Web of Linked Data (Heath and Bizer, 2011), which interconnects hundreds of RDF data sources with a total of 30 billion subject-property-object (SPO) triples.

The diversity of linked-data sources and their high heterogeneity make it difficult for humans to search and discover relevant information. As linked data is in RDF format, the standard approach would be to run structured queries in triple-pattern-based languages like SPARQL, but only expert programmers are able to precisely specify their information needs and cope with the high heterogeneity of the data (and absence or very high complexity of schema information). For less initiated users the only option to query this rich data is by keyword search (e.g., via services like sig.ma (Tummarello et al., 2010)). None of these approaches is satisfactory. Instead, the by far most convenient approach would be to search in knowledge bases and the Web of linked data by means of natural-language questions.

As an example, consider a quiz question like “Which female actor played in Casablanca and is married to a writer who was born in Rome?”. The answer could be found by querying several linked data sources together, like the IMDB-style LinkedMDB movie database and the DBpedia knowledge base, exploiting that there are entity-level sameAs links between these collections. One can think of different formulations of the example question, such as “Which actress from Casablanca is married to a writer from Rome?”. A possible SPARQL formulation, assuming a user familiar with the schema of the underlying knowledge base(s), could consist of the following six triple patterns (joined by shared-variable bindings): ?x hasGender female, ?x isa actor, ?x actedIn Casablanca (film), ?x marriedTo ?w, ?w isa writer, ?w bornIn Rome. This complex query, which involves multiple joins, would yield good results, but it is difficult for the user to come up with the precise choices for relations, classes, and entities. This would require familiarity with the contents of the knowledge base, which no average user is expected to have. Our goal is to automatically create such structured queries by mapping the user’s question into this representation. Keyword search is usually not a viable alternative when the information need involves joining multiple triples to construct the final result, notwithstanding good attempts like that of Pound et al. (2010). In the example, the obvious keyword query “female actress Casablanca married writer born Rome” lacks a clear specification of the relations among the different entities.

1.2 Problem

Given a natural language question qNL and a knowledge base KB, our goal is to translate qNL into a formal query qFL that captures the information need expressed by qNL.

We focus on input questions that put the emphasis on entities, classes, and relations between them. We do not consider aggregations (counting, max/min, etc.) and negations. As a result, we generate structured queries of the form known as conjunctive queries or select-project-join queries in database terminology. Our target language is SPARQL 1.0, where the above focus leads to queries that consist of multiple triple patterns, that is, conjunctions of SPO search conditions. We do not use any pre-existing query templates, but generate queries from scratch as they involve a variable number of joins with a-priori unknown join structure.

A major challenge is in the ambiguity of the phrases occurring in a natural-language question. Phrases can denote entities (e.g., the city of Casablanca or the movie Casablanca), classes (e.g., actresses, movies, married people), or relations/properties (e.g., marriedTo between people, played between people and movies). A priori, we do not know if a phrase should be mapped to an entity, a class, or a relation. In fact, some phrases may denote any of these three kinds of targets. For example, a phrase like “wrote score for” in a question about film music composers, could map to the composer-film relation wroteSoundtrackForFilm, to the class of movieSoundtracks (a subclass of music pieces), or to an entity like the movie “The Score”. Depending on the choice, we may arrive at a structurally good query (with triple patterns that can actually be joined) or at a meaningless and non-executable query (with disconnected triple patterns). This generalized disambiguation problem is much more challenging than the more focused task of named entity disambiguation (NED). It is also different from general word sense disambiguation (WSD), which focuses on the meaning of individual words (e.g., mapping them to WordNet synsets).

1.3 Contribution

In our approach, we introduce new elements towards making translation of questions into SPARQL triple patterns more expressive and robust. Most importantly, we solve the disambiguation and mapping tasks jointly, by encoding them into a comprehensive integer linear program (ILP): the segmentation of questions into meaningful phrases, the mapping of phrases to semantic entities, classes, and relations, and the construction of SPARQL triple patterns. The ILP harnesses the richness of large knowledge bases like Yago2 (Hoffart et al., 2011b), which has information not only about entities and relations, but also about surface names and textual patterns by which web sources refer to them. For example, Yago2 knows that “Casablanca” can refer to the city or the film, and “played in” is a pattern that can denote the actedIn relation. In addition, we can leverage the rich type system of semantic classes. For example, knowing that Casablanca is a film, for translating “played in” we can focus on relations with a type signature whose range includes films, as opposed to sports teams, for example. Such information is encoded in judiciously designed constraints for the ILP. Although we intensively harness Yago2, our approach does not depend on a specific choice of knowledge base or language resource for type information and phrase/name dictionaries. Other knowledge bases such as DBpedia can be easily plugged in.

Based on these ideas, we have developed a framework and system, called DEANNA (DEep Answers for maNy Naturally Asked questions), that comprises a full suite of components for question decomposition, mapping constituents into the semantic concept space, generating alternative candidate mappings, and computing a coherent mapping of all constituents into a set of SPARQL triple patterns that can be directly executed on one or more linked data sources.

2 Background

We use the Yago2 knowledge base, with its rich type system, as a semantic backbone. Yago2 is composed of instances of binary relations derived from Wikipedia and WordNet. The instances, called facts, provide both ontological information and instance data. Figure 1 shows sample facts from Yago2. Each fact is composed of semantic items that can be divided into relations, entities, and classes. Entities and classes together are referred to as concepts.

Subject              Predicate     Object
film                 subclassOf    production
Casablanca (film)    type          film
“Casablanca”         means         Casablanca (film)
“Casablanca”         means         Casablanca, Morocco
Ingrid Bergman       actedIn       Casablanca (film)

Figure 1: Sample knowledge base

Examples of relations are type, subclassOf, and actedIn.

‘Who’, ‘played in’, ‘movie’ and ‘Casablanca’.

2. Phrase mapping to semantic items. This includes finding that the phrase ‘played in’ can either refer to the semantic relation actedIn or to playedForTeam and that the phrase ‘Casablanca’ can potentially refer to Casablanca (film) or Casablanca, Morocco. This step merely constructs a candidate space for the mapping. The actual disambiguation is addressed by step 4, discussed below.

3. Q-unit generation. Intuitively, a q-unit is a triple composed of phrases. Their generation and role will be discussed in detail in the next section.

4. Joint disambiguation, where the ambiguities in the phrase-to-semantic-item mapping are resolved. This entails resolving the ambiguity in phrase borders, and above all, choosing the best fitting candidates from the semantic space of entities, classes, and relations. Here, we determine for our running example that ‘played in’ refers to the semantic relation actedIn and not to playedForTeam and the phrase ‘Casablanca’ refers to Casablanca (film)