Template-Based Question Answering Over RDF Data
WWW 2012 – Session: Ontology Representation and Querying: RDF and SPARQL
April 16–20, 2012, Lyon, France

Christina Unger
Universität Bielefeld, CITEC
Universitätsstraße 21–23, 33615 Bielefeld

Lorenz Bühmann
Universität Leipzig, IFI/AKSW
PO 100920, D-04009 Leipzig

Jens Lehmann
Universität Leipzig, IFI/AKSW
PO 100920, D-04009 Leipzig

Axel-Cyrille Ngonga Ngomo
Universität Leipzig, IFI/AKSW
PO 100920, D-04009 Leipzig

Daniel Gerber
Universität Leipzig, IFI/AKSW
PO 100920, D-04009 Leipzig

Philipp Cimiano
Universität Bielefeld, CITEC
Universitätsstraße 21–23, 33615 Bielefeld

ABSTRACT
As an increasing amount of RDF data is published as Linked Data, intuitive ways of accessing this data become more and more important. Question answering approaches have been proposed as a good compromise between intuitiveness and expressivity. Most question answering systems translate questions into triples which are matched against the RDF data to retrieve an answer, typically relying on some similarity metric. However, in many cases, triples are not a faithful representation of the semantic structure of the natural language question, with the result that more expressive queries cannot be answered. To circumvent this problem, we present a novel approach that relies on a parse of the question to produce a SPARQL template that directly mirrors the internal structure of the question. This template is then instantiated using statistical entity identification and predicate detection. We show that this approach is competitive and discuss cases of questions that can be answered with our approach but not with competing approaches.

Categories and Subject Descriptors
H.5.2 [Information systems]: User Interfaces – Natural language, Theory and methods

General Terms
Algorithms, Experimentation, Theory

Keywords
Question Answering, Semantic Web, Natural Language Patterns, SPARQL

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2012, April 16–20, 2012, Lyon, France.
ACM 978-1-4503-1229-5/12/04.

1. INTRODUCTION
As more and more RDF data is published as Linked Data, developing intuitive ways of accessing this data becomes increasingly important. One of the main challenges is the development of interfaces that exploit the expressiveness of the underlying data model and query language, while hiding their complexity. As a good compromise between intuitiveness and expressivity, question answering approaches allow users to express arbitrarily complex information needs¹ in natural language without requiring them to be aware of the underlying schema, vocabulary or query language. Several question answering systems for RDF data have been proposed in the past, for example, Aqualog [14, 23], PowerAqua [24], NLP-Reduce [6] and FREyA [1]. Many of these systems map a natural language question to a triple-based representation. For example, consider the simple question "Who wrote The Neverending Story?". PowerAqua² would map this question to the triple representation

    ⟨[person, organization]; wrote; Neverending Story⟩.

Then, by applying similarity metrics and search heuristics, it would retrieve matching subgraphs from the RDF repository. For the above query, the following triples would be retrieved from DBpedia, from which the answer "Michael Ende" can be derived:

    ⟨Writer; IS A; Person⟩
    ⟨Writer; author; The Neverending Story⟩

¹ At least as complex as can be represented in the query language.
² Accessed via the online demo at http://poweraqua.open.ac.uk:8080/poweraqualinked/jsp/index.jsp.

While this approach works very well in cases where the meaning of the query can be captured easily, it has a number of drawbacks, as in many cases the original semantic structure of the question cannot be faithfully captured using triples. For instance, consider the questions 1a and 2a below. PowerAqua would produce the triple representations in 1b and 2b, respectively. The goal, however, would be SPARQL queries³ like 1c and 2c, respectively.

  1. (a) Which cities have more than three universities?
     (b) ⟨[cities]; more than; universities three⟩
     (c) SELECT ?y WHERE {
           ?x rdf:type onto:University .
           ?x onto:city ?y .
         }
         HAVING (COUNT(?x) > 3)

  2. (a) Who produced the most films?
     (b) ⟨[person, organization]; produced; most films⟩
     (c) SELECT ?y WHERE {
           ?x rdf:type onto:Film .
           ?x onto:producer ?y .
         }
         ORDER BY DESC(COUNT(?x)) OFFSET 0 LIMIT 1

³ Assuming a DBpedia namespace with onto as prefix <http://dbpedia.org/ontology/>.

Such SPARQL queries are difficult to construct on the basis of the above-mentioned triple representations, as aggregation and filter constructs arising from the use of specific quantifiers are not faithfully captured. What would be needed instead is a representation of the information need that is much closer to the semantic structure of the original question. Thus, we propose a novel approach to question answering over RDF data that relies on a parse of the question to produce a SPARQL template that directly mirrors the internal structure of the question and that, in a second step, is instantiated by mapping the occurring natural language expressions to the domain vocabulary. For example, a template produced for Question 2a would be:

  3. SELECT ?x WHERE {
       ?x ?p ?y .
       ?y rdf:type ?c .
     }
     ORDER BY DESC(COUNT(?y)) LIMIT 1 OFFSET 0

In this template, c stands proxy for the URI of a class matching the input keyword films and p stands proxy for a property matching the input keyword produced. In a next step, c has to be instantiated by a matching class (onto:Film in the case of DBpedia), and p has to be instantiated with a matching property, in this case onto:producer. For instantiation, we exploit an index as well as a pattern library that links properties with natural language predicates.

We show that this approach is competitive and discuss specific cases of questions that can be precisely answered with our approach but not with competing approaches. Thus, the main contribution of this paper is a domain-independent question answering approach that first converts natural language questions into queries that faithfully capture the semantic structure of the question and then identifies domain-specific entities combining NLP methods and statistical information.

In the following section we present an overview of the system's architecture, and in Sections 3, 4 and 5 we explain the components of the system in more detail. In Section 6 we report on evaluation results and then point to an online interface to the prototype in Section 7. In Section 8 we compare our approach to existing question answering systems on RDF data, before concluding in Section 9.

2. OVERVIEW
Figure 1 gives an overview of our approach. [...] lexical entries, are used for parsing, which leads to a semantic representation of the natural language query, which is then converted into a SPARQL query template. This process is explained in Section 3. The query templates contain slots, which are missing elements of the query that have to be filled with URIs. In order to fill them, our approach first generates natural language expressions for possible slot fillers from the user question using WordNet expansion. In a next step, sophisticated entity identification approaches are used to obtain URIs for those natural language expressions. These approaches rely both on string similarity and on natural language patterns which are compiled from existing structured data in the Linked Data cloud and text documents. A detailed description is given in Section 4. This yields a range of different query candidates as potential translations of the input question. It is therefore important to rank those query candidates. To do this, we combine string similarity values, prominence values and schema conformance checks into a score value. Details of this mechanism are covered in Section 5. The highest ranked queries are then tested against the underlying triple store and the best answer is returned to the user.

3. TEMPLATE GENERATION
The main assumption of template generation is that the overall structure of the target SPARQL query is (at least partly) determined by the syntactic structure of the natural language question and by the occurring domain-independent expressions. Consequently, our goal is to generate a SPARQL query template such as the one in Example 3 in the introduction, which captures the semantic structure of the user's information need and leaves open only specific slots for resources, classes and properties that need to be determined with respect to the underlying dataset.

3.1 SPARQL templates
SPARQL templates specify the query's select or ask clause, its filter and aggregation functions, as well as the number and form of its triples. Subject, predicate and object of a triple are variables, some of which stand proxy for appropriate URIs. These proxy variables, called slots, are defined as triples of a variable, the type of the intended URI (resource, class or property), and the natural language expression that was used in the user question, e.g. ⟨?x; class; films⟩.

For example, for the question in 4 (taken from the QALD-1 benchmark, cf. Section 6) the two SPARQL templates given in 4a and 4b are built.

  4. How many films did Leonardo DiCaprio star in?
     (a) SELECT COUNT(?y) WHERE {
           ?y rdf:type ?c .
           ?x ?p ?y .
         }
         Slots:
         • ⟨?x; resource; Leonardo DiCaprio⟩
         • ⟨?c; class; films⟩
         • ⟨?p; property; star⟩
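The slot mechanism described in Section 3.1 can be illustrated with a minimal Python sketch. It is not the system's implementation: the names `Slot`, `Template` and `instantiate` are hypothetical, and the URI choices are simply passed in as a dictionary rather than found by entity identification. The sketch uses the template from Example 3 together with the bindings named in the introduction (onto:Film for the class slot, onto:producer for the property slot).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """A slot as described in Section 3.1: variable, URI type, expression."""
    variable: str    # proxy variable, e.g. "?c"
    slot_type: str   # "resource", "class" or "property"
    expression: str  # natural language expression from the question


@dataclass(frozen=True)
class Template:
    """A SPARQL template plus the slots that still need URIs (hypothetical type)."""
    query: str
    slots: tuple


def instantiate(template, bindings):
    """Replace each slot's proxy variable with the URI bound to it.

    `bindings` maps a proxy variable to a URI. In the paper these URIs come
    from entity identification; here they are supplied by hand (assumption).
    """
    query = template.query
    for slot in template.slots:
        uri = bindings[slot.variable]
        # Match the variable as a whole token, so "?c" does not touch "?count".
        pattern = re.escape(slot.variable) + r"(?![A-Za-z0-9_])"
        query = re.sub(pattern, lambda _match, uri=uri: uri, query)
    return query


# Template from Example 3 (Question 2a: "Who produced the most films?").
example_3 = Template(
    query=(
        "SELECT ?x WHERE {\n"
        "  ?x ?p ?y .\n"
        "  ?y rdf:type ?c .\n"
        "}\n"
        "ORDER BY DESC(COUNT(?y)) LIMIT 1 OFFSET 0"
    ),
    slots=(
        Slot("?c", "class", "films"),
        Slot("?p", "property", "produced"),
    ),
)

candidate = instantiate(example_3, {"?c": "onto:Film", "?p": "onto:producer"})
print(candidate)
```

Ordinary query variables such as ?x and ?y are left untouched, since only variables listed as slots are replaced. Each distinct choice of bindings yields one query candidate, which is why the candidates have to be ranked afterwards.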