Probabilistic Query Rewriting for Efficient and Effective Keyword Search on Graph Data

Lei Zhang, Thanh Tran, Achim Rettinger
Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany

ABSTRACT

The problem of rewriting keyword search queries on graph data has been studied recently, where the main goal is to clean user queries by rewriting keywords as valid tokens appearing in the data and grouping them into meaningful segments. The main solution to this problem employs heuristics for ranking query rewrites and a dynamic programming algorithm for computing them. Based on a broader set of queries defined by an existing benchmark, we show that the use of these heuristics does not yield good results. We propose a novel probabilistic framework, which enables the optimality of a query rewrite to be estimated in a more principled way. We show that our approach outperforms existing work in terms of effectiveness and efficiency of query rewriting. More importantly, we provide the first results indicating that query rewriting can indeed improve overall keyword search runtime performance and result quality.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 14
Copyright 2013 VLDB Endowment 2150-8097/13/14... $ 10.00.

1. INTRODUCTION

Keyword search on graph data has attracted considerable interest. It has proven to be an intuitive and effective paradigm for accessing information, helping to circumvent the complexity of structured query languages and to hide the underlying data representation. Using simple keyword queries, users can search for complex structured results, including connected tuples from relational databases, XML data, RDF graphs, and general data graphs [8, 5, 18]. Existing work so far focuses on the efficient processing of keyword queries [6, 5], or effective ranking of results [12, 14].

In addition, recent work studies the problem of keyword query cleaning [16, 4]. The motivation is that keyword queries are dirty, often containing words not intended to be part of the query, words that are misspelled, or words that do not directly appear but are semantically equivalent to words in the data. Besides dirty queries, keyword search solutions also face the problem of search space explosion. Searching for results on graph data requires finding matches for the individual keywords as well as considering subgraphs in the data connecting them, which represent final answers covering all query keywords. The space of possible subgraphs is generally exponential in the number of query keywords. By grouping keywords into larger meaningful units (called segments), the number of keywords to be processed and the corresponding search space are reduced.

The two main tasks involved in query cleaning (henceforth also called query rewriting) are token rewriting, where query keywords are rewritten as tokens appearing in the data, and query segmentation, where tokens are grouped together as segments representing compound keywords. Query rewriting helps to improve not only the result quality but also the runtime performance of keyword search. Towards a rewriting solution that enables more effective and efficient keyword search, we provide the following contributions:

Probabilistic Ranking of Query Rewrites and Its Impact on Keyword Search Effectiveness. The optimality of query rewrites has been defined based on heuristics for scoring tokens and segments, including an adoption of TFIDF [16, 4]. However, we show in this work that for ranking query rewrites, existing work based on these heuristics has several conceptual flaws and does not yield high quality results. Instead of using ad-hoc heuristics, we propose a probabilistic framework for keyword query rewriting, which enables the optimality of query rewrites to be studied in a systematic fashion. In particular, optimality is captured in terms of the probability that a query rewrite can be observed given the data, and is estimated using a principled technique (Maximum Likelihood Estimation). Furthermore, while previous work only considers the textual information but neglects the rather rich graph structure, which might be more crucial for keyword search on graph data, our approach takes both textual and structural information in the data into account. In [16, 4], it has been shown that w.r.t. the proposed ad-hoc notion of optimality, computed rewrites are accurate. However, the actual effect of query rewriting on the quality of keyword search results is not clear. Using the recently established benchmark [2] for keyword search, we show that our approach not only yields better query rewrites but, more importantly, also better keyword search results.

Context-based Computation of Query Rewrites and Its Impact on Keyword Search Efficiency. The problem of computing query rewrites has been shown to be NP-hard. A solution [16] based on dynamic programming has been proposed for this, which computes optimal query rewrites by considering all possible combinations of optimal sub-query rewrites. There, the optimality of a rewrite is based on the optimality of all its components, while our probabilistic approach enables optimality to be captured merely based on the previously observed context in an incremental rewriting process. We show that this probabilistic model not only produces higher quality results but also can be exploited by a context-based top-k algorithm that is more efficient than the previous solution. Moreover, while previous work reported the search space reduction resulting from segmentation, its impact on overall keyword search performance is not clear. In this work, we show that the search space reduction can outweigh the overhead incurred through query rewriting, resulting in better overall runtime performance.

Outline. We provide an overview of the problems in Sec. 2. Then, we present our solution for ranking and computing query rewrites along with differences to the most related work in Sec. 3 and Sec. 4, respectively. Experimental results are presented in Sec. 5, followed by more related work in Sec. 6 and conclusions in Sec. 7.

2. OVERVIEW

We first provide an overview of the keyword search problem, then discuss the role of keyword query rewriting.

2.1 Keyword Search on Graph Data

Keyword search solutions have been proposed for dealing with different kinds of data, including relational, XML and RDF data. In the general setting, existing approaches treat these different kinds of data as graphs:

Definition 1 (Data). Data are captured as a directed labeled graph $D(N, E)$ called data graph, where $N = N_R \uplus N_A$ is the disjoint union of resource and attribute value nodes $N_R$ and $N_A$, respectively, and $E = E_R \uplus E_A$ is the set of directed edges, where $E_R$ are edges between two resources called relations, i.e., $e(n_i, n_j) \in E_R$ iff $n_i, n_j \in N_R$, and $E_A$ [...]

[Figure 1: Example data graph. Labels recoverable from the extraction: resource nodes person1, team1, award1, article1, person2, award2 (in $N_R$); attribute values Baseball Player, John McCarty, Kansas City Cowboys, Tuning Award, Car Performance Tuning, Article, John McCarthy, Turing Award, Nobel Prize of Computing (in $N_A$); edge labels type, name, description, team, author, prizes.]

Keyword Query                              Possible Query Rewrites
"Publication John McCarty Tuning Award"    Article  John⊕McCarthy  Turing⊕Award
                                           Article  John⊕McCarthy  Tuning⊕Award
                                           Article  John⊕McCarty   Turing⊕Award
                                           Article  John⊕McCarty   Tuning⊕Award

Table 1: Possible query rewrites

[...] a d-length Steiner graph when paths that connect keyword elements are of length d or less.

Example 1. Given the data graph in Fig. 1, for the keyword query shown in Table 1, there is one matching Steiner graph as highlighted in Fig. 1, namely the one connecting the three nodes Article, John McCarthy and Turing Award (assuming that keywords have already been rewritten so that they match the labels of these three nodes, e.g., "Tuning Award" has been rewritten to match the node Turing Award).

For finding whether some data elements match query keywords, existing solutions typically use an inverted index and treat elements (their labels) as documents (task 1). For finding paths that form a Steiner graph from these elements (task 2), they explore the data as an undirected graph, traversing the edges without taking their direction into account. For pragmatic reasons, existing keyword search solutions [5, 18, 10] apply a maximum path length restriction d, such that only paths of length d or less have to be traversed.

2.2 Keyword Query Rewriting

The label $L(e)$ of each data element $e$ and the query $Q$ can be conceived as a sequence of tokens, e.g., the label Turing Award consists of two tokens, Turing and Award. Query rewriting first maps query keywords (also called query tokens) to tokens appearing in the labels of data elements (token rewriting), and then groups the resulting data tokens into segments to form query rewrites (query segmentation):

Definition 3 (Token Rewrite). Let $Token^D$ be the set of all tokens in the data graph $D$. Token rewriting with factor $m$ is a function $rewrite_m$, which maps a query token $q$ to a list of $m$ data tokens $t \in Token^D$ associated with the respective distance $d$ between $q$ and $t$. Given a keyword query $Q = \{q_1, q_2, \ldots, q_n\}$, a query token rewrite is an $m \times n$ matrix $M$ of tokens $t \in Token^D$, where the $i$-th column is obtained through $rewrite_m(q_i)$.