Sapphire: Querying RDF Data Made Simple
Total Page:16
File Type:pdf, Size:1020Kb
Sapphire: Querying RDF Data Made Simple Ahmed El-Roby Khaled Ammar Ashraf Aboulnaga Jimmy Lin Carleton University University of Qatar Computing University of [email protected] Waterloo Research Institute - Waterloo [email protected] HBKU [email protected] [email protected] ABSTRACT curate answers for simple questions such as \How many peo- RDF data in the linked open data (LOD) cloud is very ple live in New York?". Questions like this one that ask valuable for many different applications. In order to un- about a specific property of an entity are termed factoid lock the full value of this data, users should be able to issue questions. Such questions can be answered by a simple struc- complex queries on the RDF datasets in the LOD cloud. tured search that can be constructed effectively by natural SPARQL can express such complex queries, but construct- language approaches. ing SPARQL queries can be a challenge to users since it re- However, the RDF data that makes up the LOD cloud is quires knowing the structure and vocabulary of the datasets not limited to answering simple questions. This data can being queried. In this paper, we introduce Sapphire, a tool be used to answer complex questions that require complex that helps users write syntactically and semantically cor- structured searches. Natural language approaches are not rect SPARQL queries without prior knowledge of the queried effective at constructing such complex structured searches. datasets. Sapphire interactively helps the user while typing Instead, complex structured searches are better expressed the query by providing auto-complete suggestions based on using SPARQL queries. It is a common practice for data the queried data. After a query is issued, Sapphire provides sources in the LOD cloud to provide SPARQL endpoints that allow users to issue SPARQL queries on the RDF data suggestions on ways to change the query to better match 6 the needs of the user. We evaluated Sapphire based on per- that they contain . formance experiments and a user study and showed it to be To illustrate the need for SPARQL, consider the question superior to competing approaches. \How many scientists graduated from an Ivy League uni- versity?" This question was used in the QALD-5 compe- tition [25]. QALD is an annual competition for Question 1. INTRODUCTION Answering over Linked Data, and this question was not an- In recent years, advances in the field of information extrac- swered by any of the natural language systems that partici- tion have helped in automating the construction of large Re- pated in QALD-5. This is not surprising since the question source Description Framework (RDF) datasets that are pub- involves concepts such as \scientist", \graduated", and \Ivy lished on the web. These datasets could be general-purpose League university" that are not easy to map to a structured such as DBpedia1 [5], a dataset of structured information search over the queried dataset (DBpedia), in addition to extracted from Wikipedia, or they could be specific to par- requiring a count of the results. On the other hand, the ticular domains such as movies2, geographic information3, following query over the SPARQL endpoint of DBpedia will and city data4. These datasets are graph-structured, and find the required answer: are interlinked via edges that point from one dataset to an- other, forming a massive graph known as the Linked Open PREFIX res: <http://dbpedia.org/resource/> Data (LOD) cloud 5. PREFIX dbo: <http://dbpedia.org/ontology/> The LOD cloud contains a wealth of structured informa- SELECT DISTINCT count (?uri) WHERE { arXiv:1805.11728v2 [cs.DB] 13 Sep 2018 tion that can be extremely useful to users and applications ?uri rdf:type dbo:Scientist. in diverse domains. However, utilizing this information re- ?uri dbo:almaMater ?university. quires an effective way to find the answers to questions in ?university dbo:affiliation res:Ivy_League. the datasets that make up this cloud. Answering questions } over RDF data generally follows one of two approaches: (a) To be able to compose a query such as this one, the user natural language queries, and (b) structured querying using needs to know the structure of the dataset, the vocabulary SPARQL [3], the standard query language for RDF. used to represent different concepts, and the literals used in Natural language approaches rely on keyword search or the dataset including their data types and format. For ex- more complex question answering techniques. These ap- ample, the user needs to know that\scientist"is an rdf:type proaches are convenient and easy to use, and they find ac- and \Ivy League" is an affiliation of a university. Achiev- ing this level of knowledge about a dataset can be difficult 1http://dbpedia.org even for experienced users given the massive scale and di- 2http://www.linkedmdb.org verse vocabulary of the LOD cloud. By one recent count7, 3http://www.geonames.org 4http://www.data.gov/opendatasites 6e.g., http://dbpedia.org/sparql for DBpedia. 5http://lod-cloud.net 7http://stats.lod2.eu the LOD cloud has almost 3000 data sources that contain 2. RELATED WORK over 14 billion RDF triples from various domains. To il- Prior work on helping users construct structured queries lustrate the diversity of the vocabulary in the LOD cloud, on RDF data falls into three categories: 1. Natural language consider that DBpedia alone has over 3K distinct predicates approaches. 2. Approximate queries. 3. Query by example. at the time of writing this paper. Thus, it is quite likely Natural Language Approaches: Several prior works that a user would need to construct SPARQL queries on create structured queries based on natural language ap- data whose structure and vocabulary she does not know in proaches [12, 20, 24, 29, 14]. Each of these works focuses full, for example when querying a new dataset. Our goal in on one or more specific query templates, and uses keyword this paper is to help users with this challenging task. search or natural language questions to construct these tem- We present Sapphire, an interactive tool aimed at helping plates and fill in the placeholders they contain. All of these users write syntactically and semantically correct SPARQL approaches suffer from two limitations compared to Sap- queries on RDF datasets they do not have prior knowledge phire: (1) Their expressiveness is limited to specific query about. Sapphire is aimed at users who have a technical templates, and (2) inferring query structure, predicates, and background but are not necessarily SPARQL experts, e.g., literals based only on natural language is inherently am- data scientists or application developers. Thus, Sapphire biguous. In contrast, Sapphire can construct any SPARQL makes no attempt to \shield" its users from the syntax of query, and it removes ambiguity by involving the user di- SPARQL, but rather helps them construct valid SPARQL rectly in query composition. queries with ease. Sapphire achieves this in two ways that In this paper, we compare Sapphire to QAKiS [7] and both rely on a predictive user model that is built for the end- KBQA [10] as representatives of the state of the art in nat- points to be queried in an initialization phase. First, while ural language approaches, and we show that Sapphire out- a user is typing a query, Sapphire interactively provides her performs these two systems. QAKiS [7] is a question an- with data-driven suggestions to complete the predicates and swering system over RDF that automatically extracts from literals in the query, similar to the auto-complete capability Wikipedia different ways of expressing relations in natu- in many user interfaces. Second, when a user completes the ral language (e.g., \a bridge spans a river" and \a bridge query and submits it for execution, Sapphire suggests ways crosses a river" express the same relation). These equivalent to modify the query into one that may be better suited to expressions are used to match fragments of a natural lan- the needs of the user. For example, if the user query re- guage question and construct the equivalent SPARQL query. turns no answers, Sapphire would attempt to modify it into KBQA [10] is a more recent question answering system that a query that does return answers. focuses on factoid questions. KBQA learns question tem- Sapphire's query completion and query suggestion mod- plates from a large Q&A corpus (e.g., Yahoo! Answers), ules rely on natural language techniques. Thus, in the spec- and learns mappings from these templates to RDF predi- trum of approaches for querying RDF, Sapphire bridges the cates in the queried dataset. The templates and mappings gap between the simple but ambiguous natural language are then used to answer user questions. approaches on the one hand, and the powerful but cum- Approximate Queries: This line of work goes beyond bersome SPARQL on the other. The novelty of Sapphire the fixed query templates used by natural language approaches, comes from the need to balance multiple conflicting goals: enabling the user to express approximate structured queries. Sapphire must provide high quality recommendations that That is, the query posed by the user does not have to be ex- actually help the user find the information that she needs, it actly matched with the queried RDF data [19, 18, 30]. These must have fast response time since it is interactive, it must approaches are still limited in the query structure that they run on a reasonably sized machine without placing excessive support, and they require the user to know the vocabulary demands on the machine's resources, and it must not over- and the approximate schema of the queried datasets. In con- load the SPARQL endpoints it queries. These design goals trast, Sapphire enables the user to compose any SPARQL require judicious design choices which we present in the rest query without prior knowledge of the queried datasets.