Example-Driven Query Intent Discovery: Abductive Reasoning Using Semantic Similarity

Example-Driven Query Intent Discovery: Abductive Reasoning using Semantic Similarity Anna Fariha Alexandra Meliou College of Information and Computer Sciences College of Information and Computer Sciences University of Massachusetts Amherst University of Massachusetts Amherst [email protected] [email protected] ABSTRACT research academics aid interest Traditional relational data interfaces require precise structured id name 100 algorithms queries over potentially complex schemas. These rigid data re- 100 Thomas Cormen 101 data management 101 Dan Suciu 102 data mining trieval mechanisms pose hurdles for non-expert users, who typi- 102 Jiawei Han 103 data management cally lack language expertise and are unfamiliar with the details of 103 Sam Madden 103 distributed systems the schema. Query by Example (QBE) methods offer an alternative 104 James Kurose 104 computer networks mechanism: users provide examples of their intended query output 105 Joseph Hellerstein 105 data management 105 distributed systems and the QBE system needs to infer the intended query. However, these approaches focus on the structural similarity of the examples Figure 1: Excerpt of two relations of the CS Academics database. and ignore the richer context present in the data. As a result, they Dan Suciu and Sam Madden (in bold), both have research interests typically produce queries that are too general, and fail to capture the in data management. user’s intent effectively. In this paper, we present SQUID, a sys- the provided example tuples in their result set as equally likely to tem that performs semantic similarity-aware query intent discov- represent the desired intent.1 This ignores the richer context in the ery. Our work makes the following contributions: (1) We design data that can help identify the intended query more effectively. an end-to-end system that automatically formulates select-project- join queries in an open-world setting, with optional group-by ag- Example 1.1. In Figure1, the relations academics and research gregation and intersection operators; a much larger class than prior store information about CS researchers and their research inter- QBE techniques. (2) We express the problem of query intent dis- ests. Given the user-provided set of examples fDan Suciu, Sam covery using a probabilistic abduction model, that infers a query as Maddeng, a human can posit that the user is likely looking for data the most likely explanation of the provided examples. (3) We in- management researchers. However, a QBE system, that looks for troduce the notion of an abduction-ready database, which precom- queries based only on the structural similarity of the examples, pro- putes semantic properties and related statistics, allowing SQUID to duces Q1 to capture the query intent, which is too general: achieve real-time performance. (4) We present an extensive empir- Q1: SELECT name FROM academics ical evaluation on three real-world datasets, including user-intent In fact, the QBE system will generate the same generic query Q1 case studies, demonstrating that SQUID is efficient and effective, for any set of names from the relation academics. Even though and outperforms machine learning methods, as well as the state-of- the intended semantic context is present in the data (by associat- the-art in the related query reverse engineering problem. ing academics with research interest information using the relation research), existing QBE systems fail to capture it. The more spe- 1. INTRODUCTION cific query that better represents the semantic similarity among the Database technology has expanded drastically, and its audience example tuples is Q2: has broadened, bringing on a new set of usability requirements. A Q2: SELECT name FROM academics, research significant group of current database users are non-experts, such as WHERE research.aid = academics.id AND arXiv:1906.10322v1 [cs.DB] 25 Jun 2019 data enthusiasts and occasional users. These non-expert users want research.interest = ‘data management’ to explore data, but lack the expertise needed to do so. Traditional database technology was not designed with this group of users in Example 1.1 shows how reasoning about the semantic similarity mind, and hence poses hurdles to these non-expert users. Tradi- of the example tuples can guide the discovery of the correct query tional query interfaces allow data retrieval through well-structured structure (join of the academics and research tables), as well as the queries. To write such queries, one needs expertise in the query discovery of the likely intent (research interest in data management). language (typically SQL) and knowledge of the, potentially com- We can often capture semantic similarity through direct attributes plex, database schema. Unfortunately, occasional users typically of the example tuples. These are attributes associated with a tuple lack both. Query by Example (QBE) offers an alternative retrieval within the same relation, or through simple key-foreign key joins mechanism, where users specify their intent by providing example (such as research interest in Example 1.1). Direct attributes capture tuples for their query output [45]. intent that is explicit, precisely specified by the particular attribute Unfortunately, traditional QBE systems [51, 48, 16] for rela- values. However, sometimes query intent is more vague, and not tional databases make a strong and oversimplifying assumption in expressible by explicit semantic similarity alone. In such cases, modeling user intent: they implicitly treat the structural similarity the semantic similarity of the example tuples is implicit, captured and data content of the example tuples as the only factors specify- 1More nuanced QBE systems exist, but typically place additional requirements or sig- ing query intent. As a result, they consider all queries that contain nificant restrictions over the supported queries (Figure3). 1 Legend query class QBE: Query by Example QRE: Query Reverse Engineering additional DX: Data Exploration requirements KG: Knowledge Graph !: with significant restrictions selection scalable Figure 2: Partial schema of the IMDb database. The schema con- join projection aggregation semi-join implicit property open-world SQUID tains 2 entity relations: movie, person; and a semantic property re- X X X X X X X X Bonifati et al. [12] X X ! X X X user feedback lation: genre. The relations castinfo and movietogenre associate QPlain [16] X X X X X X provenance input Shen et al. [51] entities and semantic properties. relational X X X X FASTTOPK[48] X X X X QBE Arenas et al. [4] X X ! X X X through deeper associations with other entities in the data (e.g., type SPARQLByE [17] X X ! X X X negative examples and quantity of movies an actor appears in). KG GQBE [29] X X ! X X X QBEES [43] X X ! X X X Example 1.2. The IMDb dataset contains a wealth of information PALEO-J [47] X X X X X top-k queries only related to the movies and entertainment industry. We query the SQLSynthesizer [65] X X X X X schema knowledge IMDb dataset (Figure2) with a QBE system, using two different SCYTHE [57] X X X X X schema knowledge Zhang et al. [64] sets of examples: X X X REGAL [53] X X X X ET1=fArnold Schwarzenegger ET2=fEddie Murphy QRE REGAL+ [54] relational X X X X Sylvester Stallone Jim Carrey FASTQRE [33] X X X X Dwayne Johnsong Robin Williamsg QFE [37] X X X user feedback ET1 contains the names of three actors from a public list of “physi- TALOS [55] X X X X X X 2 cally strong” actors ; ET2 contains the names of three actors from a AIDE [18] X X X user feedback rel. public list of “funny” actors3. ET1 and ET2 represent different query DX REQUEST [24] X X X user feedback intents (strong actors and funny actors, respectively), but a standard Figure 3: SQUID captures complex intents and more expressive QBE system produces the same generic query for both: queries than prior work in the open-world setting. Q3: SELECT person.name FROM person Explicit semantic similarity cannot capture these different intents, Example 1.3. We query the IMDb dataset with SQUID, using the as there is no attribute that explicitly characterizes an actor as “strong” example tuples in ET2 (Example 1.2). SQUID discovers the fol- or “funny”. Nevertheless, the database encodes these associations lowing semantic similarities among the examples: (1) all are Male, implicitly, in the number and type of movies an actor appears in (2) all are American, and (3) all appeared in more than 40 Comedy (“strong” actors frequently appear in action movies, and “funny” movies. Out of these properties, Male and American are very com- actors in comedies). mon in the IMDb database. In contrast, a very small fraction of persons in the dataset are associated with such a high number of Standard QBE systems typically produce queries that are too gen- Comedy movies; this means that it is unlikely for this similarity to be eral, and fail to capture nuanced query intents, such as the ones in coincidental, as opposed to the other two. Based on abductive rea- Examples 1.1 and 1.2. Some prior approaches attempt to refine the soning, SQUID selects the third semantic similarity as the best ex- queries based on additional, external information, such as external planation of the observed example tuples, and produces the query: ontologies [38], provenance information of the example tuples [16], Q4: SELECT person.name and user feedback on multiple (typically a large number) system- FROM person, castinfo, movietogenre, genre generated examples [12, 37, 18]. Other work relies on a closed- WHERE person.id = castinfo.person id world assumption4 to produce more expressive queries [37, 57, 65] AND castinfo.movie id = movietogenre.movie id and thus requires complete examples of input databases and output AND movietogenre.genre id = genre.id results.

Example-Driven Query Intent Discovery: Abductive Reasoning Using Semantic Similarity

The Beneficence of Gayface Tim Macausland Western Washington University, [email protected]

Acting Without Bullshit

'2009 MTV Movie Awards' Honors Ben Stiller with 'MTV Generation Award'

The Evolution of Sherlock Holmes: Adapting Character Across Time

A Reappraisal of Three Character Actors from Hollywood’S Golden Age

Katya Thomas - CV

With No Catchy Taglines Or Slogans, Showtime Has Muscled Its Way Into the Premium-Cable Elite

2019 Olden Lobes Ballot

Proto-Cinematic Narrative in Nineteenth-Century British Fiction

SHIRLEY MACLAINE by PETE MARTIN the Star of Ccm-Can and the Apartment Tells of the Lucky L^Reaks That Took Her from a Broadway Chorus Line to Hollywood Stardom

Roger Vadim, BARBARELLA (1968, 98 Minutes)

CASA Continuing Education Approved Movie List