Semantic Queries by Example
Lipyeow Lim, University of Hawai‘i at Mānoa
Haixun Wang, Microsoft Research Asia
Min Wang, Google, Inc.

ABSTRACT

With the ever-increasing quantities of electronic data, there is a growing need to make sense out of the data. Many advanced database applications are beginning to support this need by integrating domain knowledge encoded as ontologies into queries over relational data. However, it is extremely difficult to express queries against graph-structured ontology in the relational SQL query language or its extensions. Moreover, semantic queries are usually not precise, especially when data and its related ontology are complicated. Users often only have a vague notion of their information needs and are not able to specify queries precisely. In this paper, we address these challenges by introducing a novel method to support semantic queries in relational databases with ease. Instead of casting ontology into relational form and creating new language constructs to express such queries, we ask the user to provide a small number of examples that satisfy the query she has in mind. Using those examples as seeds, the system infers the exact query automatically, and the user is therefore shielded from the complexity of interfacing with the ontology. Our approach consists of three steps. In the first step, the user provides several examples that satisfy the query. In the second step, we use machine learning techniques to mine the semantics of the query from the given examples and related ontologies. Finally, we apply the query semantics on the data to generate the full query result. We also implement an optional active learning mechanism to find the query semantics accurately and quickly. Our experiments validate the effectiveness of our approach.

Categories and Subject Descriptors

H.2 [Database Management]: Query Processing; H.3.3 [Information Search and Retrieval]: Query formulation

General Terms

Algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EDBT ’13, March 18-22 2013, Genoa, Italy
Copyright 2013 ACM 978-1-4503-1597-5/13/03 ...$15.00.

1. INTRODUCTION

A growing number of advanced applications such as product information management (PIM) systems, customer relationship management (CRM) systems, and electronic medical record (EMR) systems are recognizing the need to incorporate ontology into the realm of object relational databases (ORDBMSs) so that the user can query data and its related ontology in a consistent manner [7, 14, 22]. However, it is extremely tedious and time-consuming to understand ontology and to use ontology in database queries [15]. Such queries that leverage the semantic information stored in ontologies to filter and retrieve data from relational tables are called semantic queries. The success of the relational database technology is at least partly due to the spartan simplicity of its data model and query language, which insulate the user from the physical implementation of the database. But for semantic queries, users are often exposed to the full complexity of the ontology. Still, integrating data and its related ontology is a challenge too important to ignore. There are two major approaches in attacking this problem. One is to flatten graph-structured ontologies into relational form [7, 22], and the other is to extend ORDBMSs and SQL to handle non-relational data directly [13, 14]. However, both approaches incur tremendous system cost, but have limited success in taking the tediousness out of handling semantic queries. In many cases the complexity of expressing semantic queries is almost prohibitive to allow for widespread use of such query mechanisms.

The future success of incorporating ontologies into practical database query processing depends on whether we can find automatic or semi-automatic methods to help users express semantic queries. In this paper, we introduce a novel approach that insulates the users from the complexity of the ontology, yet still enables them to ask every possible semantic query. The semi-automatic framework we develop bridges the gap between a query in a user’s mind and the final result of the query. Furthermore, since users do not handle the ontology directly, there is no need to map the ontology into relational form, which means our approach does not incur the expensive cost of extending database engines and query languages.

Before we dive into the details of our approach, we use some examples to illustrate the task at hand. Consider an EMR (electronic medical records) system. Clinicians recording the diagnosis of a patient visit may choose different disease codes for the same symptoms that the patient is exhibiting. One clinician might describe a patient diagnosis using the code for “Tumor of the Uvea”, while another might use the code for “Iris Neoplasm”. In the patient’s EMR a generic term like “Eye Neoplasm” might be recorded instead of the more specific “Tumor of the Uvea” (we will use the more descriptive terms instead of the corresponding codes in this paper in the interest of readability). Hence, in order to obtain meaningful results from querying an EMR database, the query processing system needs to understand the semantics of the query and the data.

    vID  date      patientID  diagnosis
    1    20080201  3243       Brain Neoplasm
    2    20080202  4722       Stomatitis
    3    20080202  2973       Tumor of the Uvea
    4    20080204  9437       Corneal Intraepithelial Neoplasia
    5    20080205  2437       Choroid Tumor
    ...  ...       ...        ...

Figure 1: The table visit recording patient visits

The growing demand for processing data and its meaning has spurred the increasing use of ontology in various applications. Continuing with the EMR example, many ontologies have been developed to capture the semantics of various sub-components of EMR data. For example, the National Cancer Institute (NCI) Thesaurus [17] is a collection of ontologies spanning the following areas: Drugs and Chemicals, Diseases Disorders and Findings, Anatomic Structure System or Substance, Gene, Chemotherapy Regimen etc. Figure 2 shows a fragment of the NCI Thesaurus in graphical form. Moreover, many of these ontologies are well integrated with existing data coding terminologies (e.g., SNOMED [10], ICD9 [9]) used by industry EMR formats such as HL7 [8] and CCR [5]. With the confluence of ontologies, coding terminologies, and data standards, the need for querying relational data together with its related ontology has become even more urgent.

However, there are two fundamental challenges in this task. First, it is extremely difficult to express queries against a graph-structured ontology in SQL, as the following example illustrates.

Example 1 (Running Example). Suppose we have a table of patient visit records as shown in Figure 1, of which the diagnosis column is associated with the NCI Thesaurus ontology (Figure 2). Consider the query to find all patients diagnosed with eye tumor.

Using existing RDF-like data models [22], we could store the ontology as triples in the Thesaurus(src,rel,tgt) relation and attempt to write the query in Example 1 using recursive SQL as follows.

    WITH Traversed (src) AS (
      (SELECT src
       FROM Thesaurus
       WHERE tgt = 'Eye Tumor' AND rel = 'Synonym')
      UNION ALL
      (SELECT CH.tgt

    WITH Traversed (cls, syn) AS (
      (SELECT R.cls, R.syn
       FROM XMLTABLE ('Document("Thesaurus.xml")
            /terminology/conceptDef/properties
            [property/name/text()="Synonym" and
             property/value/text()="Eye Tumor"]
            /property[name/text()="Synonym"]/value'
          COLUMNS
            cls CHAR(64) PATH './parent::*/parent::*
                               /parent::*/name',
            syn CHAR(64) PATH '.') AS R)
      UNION ALL
      (SELECT CH.cls, CH.syn
       FROM Traversed PR,
            XMLTABLE ('Document("Thesaurus.xml")
            /terminology/conceptDef/definingConcepts/
            concept[./text()=$parent]/parent::*/parent::*/
            properties/property[name/text()="Synonym"]/value'
          PASSING PR.cls AS "parent"
          COLUMNS
            cls CHAR(64) PATH './parent::*/
                               parent::*/parent::*/name',
            syn CHAR(64) PATH '.') AS CH))
    SELECT DISTINCT V.*
    FROM Visit V
    WHERE V.diagnosis IN
      (SELECT DISTINCT syn FROM Traversed)

In both instances, it is not straightforward to write the query, and the user needs to have an intimate knowledge of the structure or schema of the ontology, such as the existence of “synonym” and “is a” edges.

The second fundamental challenge is the inherent fuzziness in the semantics of the query. In most practical applications, the data and the ontology behind it are quite complicated and consequently the queries are no longer exact, that is, users often have no more than a vague notion, rather than a clear understanding and definition, of what they are looking for. In other words, even if the users have intimate knowledge of the structure of the ontology, they might not be able to precisely specify what they want to find.

Example 2. Find all patients diagnosed with some disease in the choroid, which is part of the eye.

Intuitively, the user wants to find patients with some disease that affects or is located in the choroid (Figure 3 shows one possible processing workflow for this query). In the NCI Thesaurus, there are 3 separate types of relationship linking disease concepts to anatomic locations:

1. Disease Has Primary Anatomic Site
2. Disease Has Associated Anatomic Site
3. Disease Has Metastatic Anatomic Site

Even if the user knows the structure of the NCI Thesaurus ontology, i.e., he knows about the three types of relationships that are relevant to the query, without looking at the results, the user still may not know whether the query should use one of these relationships to “choroid” or all of them.
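The ambiguity in Example 2 can be made concrete with a small sketch. It assumes the ontology is stored as triples in a Thesaurus(src,rel,tgt) relation, as discussed for Example 1, and uses the visit rows of Figure 1. The three triples below, the underscore spelling of the relationship names, and the helper `patients_at_site` are hypothetical and chosen only for illustration; they are not taken from the NCI Thesaurus. SQLite stands in for the ORDBMS.

```python
import sqlite3

# Relationship names follow Example 2; the underscore spelling is an assumption.
PRIMARY = "Disease_Has_Primary_Anatomic_Site"
ASSOCIATED = "Disease_Has_Associated_Anatomic_Site"
METASTATIC = "Disease_Has_Metastatic_Anatomic_Site"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit (vID INTEGER, date TEXT, patientID INTEGER, diagnosis TEXT);
CREATE TABLE thesaurus (src TEXT, rel TEXT, tgt TEXT);
""")

# Rows copied from Figure 1.
conn.executemany("INSERT INTO visit VALUES (?, ?, ?, ?)", [
    (1, "20080201", 3243, "Brain Neoplasm"),
    (2, "20080202", 4722, "Stomatitis"),
    (3, "20080202", 2973, "Tumor of the Uvea"),
    (4, "20080204", 9437, "Corneal Intraepithelial Neoplasia"),
    (5, "20080205", 2437, "Choroid Tumor"),
])

# Toy triples invented for this sketch, NOT actual NCI Thesaurus content.
conn.executemany("INSERT INTO thesaurus VALUES (?, ?, ?)", [
    ("Choroid Tumor", PRIMARY, "Choroid"),
    ("Tumor of the Uvea", ASSOCIATED, "Choroid"),
    ("Brain Neoplasm", METASTATIC, "Choroid"),
])


def patients_at_site(site, relations):
    """Return sorted patient IDs whose diagnosis is linked to `site`
    through any of the given relationship types (hypothetical helper)."""
    placeholders = ",".join("?" for _ in relations)
    sql = f"""
        SELECT DISTINCT v.patientID
        FROM visit v JOIN thesaurus t ON v.diagnosis = t.src
        WHERE t.tgt = ? AND t.rel IN ({placeholders})
        ORDER BY v.patientID
    """
    return [row[0] for row in conn.execute(sql, [site, *relations])]


# Two plausible readings of "some disease in the choroid".
primary_only = patients_at_site("Choroid", [PRIMARY])
all_three = patients_at_site("Choroid", [PRIMARY, ASSOCIATED, METASTATIC])

print(primary_only)  # [2437]
print(all_three)     # [2437, 2973, 3243]
```

Under these toy triples the two readings return different patient sets, so a user who cannot inspect the results has no principled way to choose between them when writing the query by hand.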