IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 12, DECEMBER 2011

Materialization and Decomposition of Dataspaces for Efficient Search

Shaoxu Song, Student Member, IEEE, Lei Chen, Member, IEEE, and Mingxuan Yuan

Abstract—Dataspaces consist of large-scale heterogeneous data. A query interface for accessing tuples should be provided as a fundamental facility by practical dataspace systems. Previously, an efficient index has been proposed for queries with keyword neighborhood over dataspaces. In this paper, we study the materialization and decomposition of dataspaces in order to improve query efficiency. First, we study views of items, which are materialized so that they can be reused by queries. When a set of views is materialized, the problem arises of selecting some of them as the optimal plan with the minimum query cost. Efficient algorithms are developed for query planning and view generation. Second, we study partitions of tuples for answering top-k queries. Given a query, we can evaluate the score bounds of the tuples in partitions and prune those partitions whose bounds are lower than the scores of the current top-k answers. We also provide a theoretical analysis of query cost and prove that query efficiency cannot always be improved by increasing the number of partitions. Finally, we conduct an extensive experimental evaluation to illustrate the superior performance of the proposed techniques.

Index Terms—Dataspaces, materialization, decomposition.

1 INTRODUCTION

DATASPACES have recently been proposed [1], [2] to provide a co-existing system of heterogeneous data. The importance of dataspace systems has already been recognized and emphasized in handling heterogeneous data [3], [4], [5], [6], [7]. In fact, examples of interesting dataspaces are now prevalent, especially on the Web [3].

For example, Google Base1 is a very large, self-describing, semistructured, heterogeneous database. We illustrate several dataspace tuples with attribute values in Fig. 1: each entry Ti consists of several attributes with corresponding values and can be regarded as a tuple in dataspaces. Due to the heterogeneity of the data, which are contributed by users around the world, the data set is extremely sparse. According to our observations, there are in total 5,858 attributes in 307,667 tuples (random samples), while most of these tuples individually have fewer than 30 attributes.

Another example of dataspaces comes from Wikipedia,2 where each article usually has a tuple with some attributes and values that describe the basic structured information of the entry. For instance, a tuple describing the Nikon Corporation may contain attributes like {(founded: Tokyo Japan 1917), (industry: imaging), (products: cameras), ...}. Such tuples can not only be found in article entries but also be mined by advanced tools such as [8] in the DBpedia project.3 Again, the attributes of tuples in different entries vary, while each tuple may only contain a limited number of attributes. Thereby, all these tuples from heterogeneous sources form a huge dataspace in Wikipedia.

Due to the heterogeneous data, there exist matching correspondences among attributes in dataspaces. For example, the matching correspondence between the attributes manu and prod could be identified in Fig. 1, since both of them specify similar information about the manufacturer of products. Such attribute correspondences are often recognized by schema mapping techniques [9]. In dataspaces, a pay-as-you-go style [5] is usually applied to gradually identify these correspondences according to users' feedback when necessary.

Once the attribute correspondences are recognized, the keywords in attributes with correspondences are said to be neighbors in the schema level. For example, the keywords Apple in the attributes manu and prod are neighbor keywords, since manu and prod have a correspondence. Consequently, a query with keyword neighborhood in the schema level [10] should not only search the keywords in the attributes specified in the query, but also match the neighbor keywords in the attributes with correspondences. For example, a query predicate (manu : Apple) should search the keyword Apple in both the attributes manu and prod, according to the correspondence between manu and prod.

To support efficient queries on dataspaces, Dong and Halevy [10] utilize the encoding of attribute-keywords as items and extend the inverted index to answer queries. Specifically, each distinct attribute name and value pair is encoded by a unique item. For instance, (manu : Apple) is denoted by the item I1. Then, each tuple can be represented by a set of items. Similarly, the query input can be encoded in the same way. Since the data are extremely sparse, the inverted index can be built on items to support efficient query answering.

1. http://base.google.com/.
2. http://www.wikipedia.org/.
3. http://dbpedia.org/.

The authors are with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. E-mail: {sshaoxu, leichen, mingxuan}@cse.ust.hk.
Manuscript received 4 Dec. 2009; revised 20 Apr. 2010; accepted 11 June 2010; published online 26 Oct. 2010. Recommended for acceptance by N. Bruno. IEEECS Log Number TKDE-2009-12-0821. Digital Object Identifier no. 10.1109/TKDE.2010.213.

1041-4347/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

Fig. 1. Example of dataspaces.

In this paper, from a different aspect of query optimization, we study the materialization and decomposition of dataspaces. The idea of improving the efficiency of queries with keyword neighborhood in the schema level follows two intuitions: 1) the reuse of contents of a query, and 2) the pruning of contents for a query.

Motivated by the neighbor keywords that are queried together, we study the materialization of views of items in order to reuse the computation. Intuitively, due to the correspondence of attributes, keywords in neighborhood in the schema level are always searched together in the same predicate query. For example, a query on (manu : Apple) will always search (prod : Apple) as well. Therefore, we can cache the search results of (manu : Apple) and (prod : Apple) as a materialized view in dataspaces. Such view results can be reused by different queries. When multiple views are available, this leads us to the problem of selecting the optimal query plans on materialized views.

To answer top-k queries, we study the pruning of unqualified partitions of tuples. Specifically, tuples in dataspaces are divided into a set of nonoverlapping groups, namely, partitions. When a query comes, we develop the score bounds of the tuples in partitions. After processing the tuples in some partitions, if the current top-k answers have higher scores than the bounds of the remaining partitions, then we can safely prune these remaining partitions without evaluating their tuples.

1.1 Contribution

To the best of our knowledge, this is the first work studying the materialization and decomposition of dataspaces for efficient search. Following the previous work by Dong and Halevy [10], the attribute-keyword model is also utilized in this study. Although our techniques are motivated by queries with keyword neighborhood in the schema level in dataspaces, the proposed idea of materialization and decomposition is also generally applicable to attribute-keyword search over structured and semi-structured data. Our main contributions in this paper are summarized as follows:

1. We study the query planning on item views that are materialized in dataspaces. The materialization scheme in dataspaces is first introduced, based on which we can select a plan with minimum cost for a query. The optimal planning problem can be formulated as an integer linear programming problem. Thereby, we investigate greedy algorithms to select a near optimal query plan, with relative error bounds on the query cost.

2. We discuss the generation of item views to minimize the query costs. Obviously, the more materialized views there are, the better the query performance is. However, real scenarios usually have a constraint on the maximum available disk space for materialization. Thereby, we also study greedy heuristics to generate views that can possibly provide low cost query plans.

3. We propose the decomposition of dataspaces to support efficient top-k queries. The decomposition scheme in dataspaces is first introduced, where tuples are divided into nonoverlapping partitions. The score bounds for the tuples in a partition with respect to a query are theoretically proved. Safe pruning is then developed based on these score bounds in partitions. It is notable that we are not proposing a new top-k ranking method. Instead, our partitioning technique is regarded as complementary to the previous merge operators. Thereby, advanced merge methods, such as the TA family methods [11], [12], can cooperate with our approaches, as presented in the experiments.

4. We develop a theoretical analysis of the cost of querying with partitions. We provide the analysis of pruning rate and query cost by using the self-similarity property, which is also verified by our experimental observations. According to the cost analysis, we cannot always improve the query efficiency by increasing the number of partitions. The generation of partitions is also discussed according to the cost analysis.

5. We report an extensive experimental evaluation. Both the materialization of item views and the decomposition of tuple partitions are evaluated in querying over real data sets. In particular, the decomposition techniques can significantly improve the query time performance. Moreover, the hybrid approach, which combines views and partitions together, can always achieve the best performance and scales well under large data sizes. In addition, the experimental results also verify the conclusions of our cost analysis, that is, we can improve the query performance by increasing the number of views but not the number of partitions.

The remainder of this paper is organized as follows: first, we introduce the preliminaries of this study in Section 2. Section 3 develops the planning of queries with materialization on views of items. In Section 4, we propose the pruning on partitions for merging and answering top-k queries. Section 5 reports our extensive experimental evaluation. We discuss the related work in Section 6. Finally, Section 7 concludes this paper.

2 PRELIMINARY

In this section, we introduce some preliminary settings of existing work, including the query and index of dataspaces. The notations frequently used in this paper are listed in Table 1.

TABLE 1
Notations

2.1 Data

We first introduce the model used to represent the data. Following the encoding system presented in [10], we use pairs of (attribute : keyword) to represent the content of a tuple. For example, the attribute value (manu : Apple Inc.) can be represented by {(manu : Apple), (manu : Inc.)}, if each word is considered as a keyword. Let item Ii be a unique identifier of a distinct pair. We can then represent each tuple T as a set of items, that is, T = {I1, I2, ..., I|T|}.

Assume that I is the set of all the items in dataspaces. We use the vector space model [13] to logically represent the tuples.

Definition 2.1 (Tuple Vector). Given a tuple T, the corresponding tuple vector is given by

t = (t1, t2, ..., t|I|),    (1)

where ti denotes the weight of item Ii in the tuple T, having 0 ≤ ti ≤ 1.

For example, the weight ti = 1 denotes Ii ∈ T; otherwise, ti = 0 means Ii ∉ T. Advanced weighting schemes, such as term frequency and inverse document frequency [13] in information retrieval, can also be applied. Without loss of generality, we adopt the tf*idf score in this work.

2.2 Attribute Correspondence

The correspondence between two attributes (e.g., manu versus prod) is often recognized by schema mapping techniques [9]. The main principles of such techniques include data instance matching, linguistic matching of the schema element names, schema structural similarities, and domain knowledge including user feedback (see [9] for a survey). In dataspaces, the matching correspondences between attributes are often incrementally recognized in a pay-as-you-go style [5], e.g., gradually identified according to users' feedback when necessary.

Let Ai, Bi be two attributes with matching correspondence, denoted by Ai ↔ Bi. Any keywords wi appearing in Ai, Bi are said to be neighbors. For instance, consider a matching correspondence of the attributes manu ↔ prod. It states that keywords wi appearing in manu and prod are neighbor keywords, e.g., (manu : Apple) and (prod : Apple). Since the correspondence between an attribute and itself is straightforward, a keyword can always be regarded as a neighbor to itself, such as (prod : Apple) and (prod : Apple).

2.3 Query

In this paper, we consider queries with a set of attribute and keyword predicates, e.g., (manu : Apple) and (post : Infinite). Thus, the query inputs can be represented in the same way as tuples in dataspaces. As discussed in [10], a query with keyword neighborhood in the schema level over dataspaces should not only consider tuples with matched keywords on the attributes specified in the query, but also extend to the attributes with correspondence according to the keyword neighborhood. For example, we consider a query

Q = {(manu : Apple), (post : Infinite)}.

The query evaluation searches not only the manu and post attributes specified in the query, but also the attributes prod and addr, according to the attribute correspondences manu ↔ prod and addr ↔ post, respectively.

Definition 2.2. A disjunctive query with keyword neighborhood in the schema level, Q = {(A1 : w1), ..., (A|Q| : w|Q|)}, specifies a set of attribute-keyword predicates. It returns all the tuples T in dataspaces with neighbor keywords to Q with respect to attribute correspondence, i.e., for an attribute Ai of Q, we can find a Bi of T such that Ai ↔ Bi and (Bi : wi) ∈ T.

Obviously, there may exist multiple attributes Bi associated with an attribute Ai according to Ai ↔ Bi. The disjunctive query only needs one of them to be true, i.e., it considers the "OR" logical operator between different attribute matching correspondences:

⋁_{Ai ↔ Bi} (Bi : wi) ∈ T.

For instance, consider again Q = {(manu : Apple), (post : Infinite)}. According to the attribute matching correspondences manu ↔ prod and addr ↔ post, a tuple T is considered as a candidate answer if T either contains the keyword Apple in manu or prod, or contains Infinite in post or addr. Therefore, we can evaluate the query by finding all tuples T such that

((manu : Apple) ∈ T ∨ (prod : Apple) ∈ T) ∨ ((post : Infinite) ∈ T ∨ (addr : Infinite) ∈ T).

Let Q̂ be the neighbor predicates of a query Q, i.e., the set of items with keyword neighborhood in the schema level to the query Q,

Q̂ = {(Bi : wi) | Bi ↔ Ai, (Ai : wi) ∈ Q}.

The disjunctive query returns tuples T that match at least one predicate in Q̂. For example, we have the neighbor predicates Q̂ for the above query Q = {(manu : Apple), (post : Infinite)} as follows:

Q̂ = {(manu : Apple), (prod : Apple), (post : Infinite), (addr : Infinite)}.

Let q be the corresponding tuple vector of the neighbor predicates Q̂ of a query Q, where qi = 1 for Ii ∈ Q̂ and 0 otherwise. Then, the ranking score between any tuple T and the query Q can be computed by score aggregation functions on the neighbor predicates. Without loss of generality, we should support any scoring function that satisfies monotonicity [11]. For example, we can consider the intersection of the vectors of the tuple T and the neighbor predicates Q̂, which is widely used in keyword search studies [14]:

score(Q̂, T) = ‖q ∩ t‖ = Σ_{i=1}^{|I|} qi ti = Σ_{Ii ∈ Q̂} ti.    (2)

Therefore, to evaluate the ranking score between Q and T, we are essentially required to compute Σ_{Ii ∈ Q̂} ti with respect to the neighbor predicates Q̂.

It is notable that different attribute correspondences of an attribute require an OR operator. This "OR" semantics for the query with keyword neighborhood in the schema level is different from CompleteSearch [15]. Specifically, CompleteSearch indicates an "AND" operator over predicates. For example, in CompleteSearch, the query (manu : Apple), (prod : Apple) will return tuples containing both (manu : Apple) and (prod : Apple). Tuples that contain one of the predicates, or none of them, are directly ignored. Instead, in a dataspace query with the OR operator, tuples having only part of the predicates will also be considered and ranked as candidates. Such OR logical semantics are necessary for querying with keyword neighborhood on attributes with correspondence, since more than one attribute may be associated with an attribute according to attribute correspondence, and the query only needs one of them to be matched with respect to neighbor keywords.

2.4 Index

Indexing of dataspaces has been studied by Dong and Halevy [10], who extend the inverted index for dataspaces. The inverted index, also known as inverted files or inverted lists [16], [17], [18], consists of a vocabulary of items I and a set of inverted lists. Each item Ii corresponds to an inverted list of tuple IDs, usually sorted in a certain order, where each ID reports the item weight ti in that tuple.

Fig. 2. Indexing dataspaces.

In Fig. 2, we use an example to illustrate the index framework. The data set consists of 10 tuples (denoted by 1-10) with an item vocabulary I = {(A : a), (B : a), ..., (L : g)}, having |I| = 12. In the inverted lists, for each item (an attribute and keyword pair such as (manu : Apple)), we have a pointer referring to a specific list of tuple IDs where the item appears. For instance, Fig. 2b shows an example of the inverted list of the item (D : d), which indicates that the keyword d appears in the attribute D of tuples 2, 3, 5, 8, 10. In the real implementation, each tuple ID in the list is associated with a weight value ti.

Definition 2.3. Consider the neighbor predicates Q̂ of a query Q. Let L be the set of lists corresponding to the items in the neighbor predicates Q̂, respectively. The merge operator returns a new list of tuples, by merging the lists in L, with score(Q̂, T) on each tuple T.

For example, consider a query Q = {(A : a), (C : c), (D : d)}. Suppose that there exists an attribute correspondence A ↔ B. Thereby, we have the neighbor predicates Q̂ = {(A : a), (B : a), (C : c), (D : d)}. The merge operator computes Σ_{Ii ∈ Q̂} ti for each tuple T that appears in the lists of the items (A : a), (B : a), (C : c), (D : d).

Let ci = O(si) = α si + r, where α is a constant, si is the size of the list of item Ii, and r is a constant random access time. Then, the merging cost c can be estimated by Σ_{Ii ∈ Q̂} ci. Advanced merge methods, such as the threshold algorithm (TA) [19], the combined algorithm (CA) [11], or IO-Top-K [12], can directly be applied to merge inverted lists. When such TA-family methods are utilized, the above Σ_{Ii ∈ Q̂} ci is then an upper bound on the estimated merge cost. In the following, instead of proposing a new merge operator for inverted lists, we focus on advanced techniques that are built upon the available merge operators to further improve the query efficiency.

3 PLANNING WITH MATERIALIZATION

According to the attribute correspondence, e.g., addr ↔ post, each query predicate on the attribute addr has to conduct a search on the attribute post as well. In other words, the neighbor keywords on addr, post are often searched together in predicate queries with keyword neighborhood in the schema level. Intuitively, we would like to cache these query results for reuse.

In relational databases, a view consists of the result set of a query, which can be materialized to optimize queries. Similarly, in this study, we introduce views on sets of items (query predicates) in dataspaces to speed up query processing. Specifically, the merge results of item sets are materialized, and then queries can utilize these materialized views. This raises two questions: 1) how to select the views of items to materialize, and 2) how to utilize the materialized views to minimize the query cost.

Note that the techniques discussed below are also applicable to attribute-keyword search over structured and semi-structured data, since the data in dataspaces are modeled by attribute-keywords as well. However, due to the OR operator over predicates that has to be considered for queries with keyword neighborhood in the schema level, our current techniques can only support general attribute-keyword queries with the OR operator.
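To make the expansion to neighbor predicates and the merge operator of Definition 2.3 concrete, here is a minimal sketch; the data structures and names are illustrative, not the authors' implementation.

```python
# Sketch: expand a query to its neighbor predicates via attribute
# correspondences, then merge the inverted lists, summing item weights
# per tuple as in the score of (2) (OR semantics).
from collections import defaultdict

def neighbor_predicates(query, correspondences):
    """query: set of (attr, keyword); correspondences: attr -> set of matched attrs."""
    q_hat = set(query)
    for attr, word in query:
        for b in correspondences.get(attr, ()):
            q_hat.add((b, word))
    return q_hat

def merge(q_hat, inverted_lists):
    """inverted_lists: item -> list of (tuple_id, weight). Returns tuple_id -> score."""
    scores = defaultdict(float)
    for item in q_hat:
        for tid, w in inverted_lists.get(item, ()):
            scores[tid] += w          # OR semantics: every matched predicate adds
    return scores

lists = {
    ("manu", "Apple"): [(1, 1.0)],
    ("prod", "Apple"): [(2, 1.0)],
    ("post", "Infinite"): [(1, 1.0), (3, 1.0)],
}
corr = {"manu": {"prod"}, "post": {"addr"}}
q = {("manu", "Apple"), ("post", "Infinite")}
print(sorted(merge(neighbor_predicates(q, corr), lists).items()))
# -> [(1, 2.0), (2, 1.0), (3, 1.0)]: tuple 1 matches two predicates
```

In a real system the sequential scan over each list would be replaced by a TA-family merge, as noted above; the sketch only illustrates the score bookkeeping.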

Fig. 3. Materialization on views of items.

3.1 Materialization

We first introduce the concept of materialization. Let a view V be a set of items (attribute-keyword pairs). By applying the merge operator, we obtain a new list of tuples with the corresponding score(V, T). This list is stored on disk as the materialization of the view V.

In Fig. 3, we show an example of materialized lists of item views. For instance, we have neighbor keywords a in (A : a) and (B : a) according to the attribute correspondence A ↔ B. The first view, denoted by V1 = {(A : a), (B : a)}, materializes the merge results of the lists corresponding to the items (A : a) and (B : a) in the example of Fig. 2. Items that frequently appear together may also be materialized, e.g., V4 = {(D : d), (E : e)}, where (D : d) and (E : e) may frequently appear together in tuples or queries.

Let V denote the view scheme, i.e., the set of views that are materialized. Since all the original items are already stored, we can treat each item as a single-item view, that is, I ⊆ V.

3.2 Standard Query Plan

Given a query Q, a query plan P of Q is a set of views, P ⊆ V, which can be used to evaluate score(Q̂, T) for each possible tuple T. Since various views are available in a view scheme V, this leads us to study the selection of the optimal plan P with the minimum query cost.

3.2.1 Formalization

We first formalize the definition of a query plan P. Let vij = 1 denote that view Vj contains item Ii; otherwise, vij = 0, with i = 1, ..., |I| and j = 1, ..., |V|. Let qi = 1 denote that item Ii is contained in the neighbor predicates Q̂ of a query Q; otherwise, qi = 0, with i = 1, ..., |I|.

Definition 3.1. Given a query Q, a feasible standard plan P is a subset of all views, P ⊆ V, having

Σ_j vij xj = qi,  i = 1, ..., |I|,

where xj = 1 if Vj ∈ P and xj = 0 if Vj ∉ P, for j = 1, ..., |V|.

Fig. 4. Plan selection.

For example, consider a query Q = {(A : a), (C : c), (D : d)} in Fig. 4. According to the attribute correspondence A ↔ B, we have Q̂ = {(A : a), (B : a), (C : c), (D : d)}. Suppose that the view V1 = {(A : a), (B : a)} is materialized. A feasible standard plan is then P = {V1, (C : c), (D : d)}.

The constraint Σ_j vij xj = qi semantically denotes that the union of the views Vj ∈ P is exactly the neighbor predicates Q̂ of the query Q, i.e., ⋃_{Vj ∈ P} Vj = Q̂.
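Because a feasible standard plan consists of disjoint views whose union is exactly Q̂, the per-view scores add up to score(Q̂, T). A quick numeric check, with illustrative views and weights:

```python
# Numeric check that a feasible standard plan (disjoint views whose union
# is exactly Q_hat) lets per-view scores add up to score(Q_hat, T).
# The views, the tuple, and its weights below are illustrative.

q_hat = {("A", "a"), ("B", "a"), ("C", "c"), ("D", "d")}
V1 = {("A", "a"), ("B", "a")}                    # a materialized view
plan = [V1, {("C", "c")}, {("D", "d")}]          # feasible: disjoint, union == Q_hat

t = {("A", "a"): 1.0, ("D", "d"): 1.0}           # one tuple T as item -> weight

def score(view, t):
    return sum(w for item, w in t.items() if item in view)

assert set().union(*plan) == q_hat               # the plan covers Q_hat exactly
total = sum(score(v, t) for v in plan)
print(total == score(q_hat, t))                  # -> True
```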

During the query evaluation, the lists corresponding to the views in P are merged using the merge operator in Definition 2.3. It is notable that a feasible plan requires Σ_j vij xj = qi as stated in Definition 3.1, i.e., no duplicate items in P. As presented in the following, this requirement is necessary for computing the ranking scores of tuples. We first prove that the standard plan can evaluate score(Q̂, T) for each possible tuple T.

Lemma 1. Let P be a feasible query plan for a query Q. For any tuple T, we have score(Q̂, T) = Σ_{V ∈ P} score(V, T).

Proof. According to the score function in (2), we have

Σ_{Vj ∈ P} score(Vj, T) = Σ_j xj Σ_i vij ti = Σ_i (Σ_j xj vij) ti = Σ_i qi ti = score(Q̂, T),

where i = 1, ..., |I| and j = 1, ..., |V|. □

In fact, we can further develop the following properties of feasible plans.

Lemma 2. For any view V in a feasible standard plan P, we have V ⊆ Q̂.

Proof. Assume that there exists an item Ii having Ii ∈ V but Ii ∉ Q̂. Thus, we have Σ_j vij xj ≥ 1 > qi = 0, which contradicts the definition of a feasible P. □

Lemma 3. For any two views V1 and V2 in a feasible standard plan P, we have V1 ∩ V2 = ∅.

Proof. Assume that there exists an item Ii having Ii ∈ V1 ∩ V2. Thus, we have Σ_j vij xj ≥ 2 > 1 ≥ qi, which contradicts the definition of a feasible P. □

3.2.2 Optimal Plan

First, recall that all the original items are already materialized, i.e., I ⊆ V. Therefore, given any query Q, a feasible plan always exists, that is, P = Q̂. Next, we study the selection of query plans with the minimum cost. Let cj be the cost of retrieving the materialized list of view Vj, as defined in Section 2. Then, the cost of a plan can be estimated by Σ_j cj xj.

Definition 3.2. The problem of selecting the optimal standard query plan is to determine the x for P, having

minimize Σ_j cj xj
subject to Σ_j vij xj = qi,  i = 1, ..., |I|,
xj ∈ {0, 1},  j = 1, ..., |V|,

which is a 0-1 integer programming or binary integer programming problem.

Unfortunately, the 0-1 integer programming problem with nonnegative data is equivalent to the set cover problem, which is NP-complete [20]. Therefore, we explore approximate solutions by a greedy algorithm. (Advanced approximation approaches for solving the binary integer programming problem can also be adopted [21], which is not the focus of this paper.)

3.2.3 Greedy Algorithm

Intuitively, in each step of adding a view into P, we can greedily select the view Vj with the minimum cost per item unit, i.e., the minimum ratio cj/|Vj|, where cj denotes the cost of Vj and |Vj| denotes the size of the view Vj, e.g., |Vj| = Σ_{Ii ∈ Vj} si.

According to Lemma 2, not all the views Vj ∈ V should be considered for a specific Q. Instead, as presented in line 4 of Algorithm 1, we only need to evaluate the views that are contained in the neighbor predicates Q̂, i.e., Vj ⊆ Q̂. Moreover, since any two views in a feasible plan are nonoverlapping (Lemma 3), we can remove the items of the currently selected view Vk from Q̂ in each step and stop when Q̂ = ∅.

Algorithm 1. Standard Planning SP(Q)
1: P := ∅
2: Q̂ := neighbor predicates of Q according to attribute correspondence
3: while Q̂ ≠ ∅ do
4:   k := arg min_j cj/|Vj|, s.t. Vj ⊆ Q̂
5:   P := P ∪ {Vk}
6:   Q̂ := Q̂ \ Vk
7: return P

Let d be the size of the largest view Vj ⊆ Q̂ and Hd be the dth harmonic number, Hd = Σ_{k=1}^{d} 1/k. Then, the relative error of the greedy approximation is bounded as follows:

Corollary 1 [22]. The cost of the plan returned by the greedy algorithm SP is at most Hd times the cost of the optimal plan.

3.3 General Query Plan

Note that a standard plan only contains views that are subsets of the neighbor predicates Q̂ of a query Q. However, views with items not in Q̂ can be used in the query evaluation as well. For example, in Fig. 4, we can also utilize the view V2 by removing the item (E : e) (not requested by Q̂) from V2. Moreover, if both V2 and V4 are considered, then the item (D : d) will be counted twice in the score function, which contradicts the correctness in Lemma 1. Therefore, we need to deduct items such as the nonrequested (E : e) or the duplicate (D : d). The operator removing the corresponding lists is defined as follows, namely, the negative merge operator.

Definition 3.3. Consider a set of items, e.g., V. Let L be the set of lists corresponding to the items in V, respectively. The negative merge operator returns a new list of tuples, by merging the lists in L, with negative-score(V, T) on each tuple T.

It is notable that the negative merge employs negative score values, which still satisfy the monotonicity of score functions. Therefore, TA family methods [11] can still be utilized for the negative merge. As illustrated in the following, lists with both positive scores and negative scores in a general query plan are merged together using a TA-style algorithm.

We define the general query plan with both the merge operator and the negative merge operator. Let vij and qi have the same semantics as in the standard plan of Definition 3.1.

Definition 3.4. Given a query Q, a general plan P consists of two subsets of all views, P⁺ ⊆ V and P⁻ ⊆ V, having

Σ_j vij (xj⁺ − xj⁻) = qi,  i = 1, ..., |I|,

where xj⁺ = 1 if Vj ∈ P⁺ and 0 otherwise, and xj⁻ = 1 if Vj ∈ P⁻ and 0 otherwise, for j = 1, ..., |V|.

Similar to Lemma 1, we can also prove that score(Q̂, T) = Σ_{V ∈ P⁺} score(V, T) + Σ_{V ∈ P⁻} negative-score(V, T) for the general plan P.
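Going back to Algorithm 1, its greedy selection can be sketched as follows. Views and costs here are illustrative; in particular, since every single item is itself a view, a candidate always exists and the loop makes progress.

```python
# Sketch of the greedy standard planning SP (Algorithm 1): repeatedly
# pick the view fully contained in the remaining Q_hat with the smallest
# cost-per-item ratio. The views and costs below are illustrative.

def sp(q_hat, views, cost):
    """views: list of frozensets of items; cost: dict view -> retrieval cost."""
    q = set(q_hat)
    plan = []
    while q:
        candidates = [v for v in views if v and v <= q]
        if not candidates:          # single-item views guarantee progress
            raise ValueError("item views must be in the view scheme")
        best = min(candidates, key=lambda v: cost[v] / len(v))
        plan.append(best)
        q -= best
    return plan

items = ["Aa", "Ba", "Cc", "Dd"]
views = [frozenset({i}) for i in items] + [frozenset({"Aa", "Ba"})]
cost = {v: 10 * len(v) for v in views}
cost[frozenset({"Aa", "Ba"})] = 12      # one merged list, cheaper than two
plan = sp(set(items), views, cost)
print(sorted(len(v) for v in plan))     # -> [1, 1, 2]: the pair view is chosen
```

The `arg min` of line 4 is realized by the `min(..., key=...)` call; a production version would index views by item to avoid rescanning the whole scheme each iteration.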
During the query evaluation, the merge operator and the negative merge operator are then conducted on P⁺ and P⁻, respectively. For example, a general plan for the query Q can be P = {V2⁺, (C : c)⁺, (E : e)⁻}. The above definition specifies the constraint that the union of the views in P⁺ minus the union of the views in P⁻ is exactly the Q̂ of the query Q, i.e., (⋃_{Vj ∈ P⁺} Vj) \ (⋃_{Vj ∈ P⁻} Vj) = Q̂. In fact, the standard plan is a special case of the general plan, where P⁻ = ∅.

3.3.1 Optimal Plan

We then introduce the problem of selecting the optimal general plan with the minimum cost. The only difference between the two kinds of merge operators is the sign of their output scores, while the cost of the negative merge operator is actually the same as that of the merge operator. Thereby, we can estimate the cost of a general plan P by Σ_j (cj xj⁺ + cj xj⁻).

Definition 3.5. The problem of selecting the optimal general query plan is to determine the x⁺ for P⁺ and the x⁻ for P⁻, having

minimize Σ_j (cj xj⁺ + cj xj⁻)
subject to Σ_j vij (xj⁺ − xj⁻) = qi,  i = 1, ..., |I|,
xj⁺ ∈ {0, 1},  j = 1, ..., |V|,
xj⁻ ∈ {0, 1},  j = 1, ..., |V|,

which is exactly the 0-1 integer programming problem. However, the coefficients of the variables xj⁻ are negative.

Lemma 4. For an optimal general plan P, we have P⁺ ∩ P⁻ = ∅.

Proof. Assume that there exists a view Vj ∈ P⁺ ∩ P⁻. Then, we can build another feasible general plan P1, with P1⁺ = P⁺ \ {Vj} and P1⁻ = P⁻ \ {Vj}, whose cost is less than that of P. □

As proved by Dobson [23], when there are negative entries, it is unlikely that we can guarantee the existence of a polynomial approximation scheme with relative error bounds. Therefore, we study greedy heuristics.
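The positive/negative bookkeeping of a general plan can be checked numerically. The contents of V2 and the weights below are illustrative assumptions chosen to be consistent with the running Fig. 4 example, not values from the paper.

```python
# Check that a general plan's positive and negative parts reproduce the
# neighbor predicates, (union of P+) \ (union of P-) == Q_hat, and that
# subtracting the negative scores cancels the extra item's contribution.

q_hat = {"Aa", "Ba", "Cc", "Dd"}
V2 = frozenset({"Aa", "Ba", "Dd", "Ee"})       # assumed view with extra item Ee
p_plus = [V2, frozenset({"Cc"})]
p_minus = [frozenset({"Ee"})]

t = {"Aa": 1.0, "Ee": 0.5}                     # one tuple T as item -> weight

def score(view, t):
    return sum(w for i, w in t.items() if i in view)

covered = set().union(*p_plus) - set().union(*p_minus)
assert covered == q_hat                        # the constraint of Definition 3.4

total = sum(score(v, t) for v in p_plus) - sum(score(v, t) for v in p_minus)
print(total == score(frozenset(q_hat), t))     # -> True
```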

First, we introduce a virtual (empty) view V0 with cost ^ 3.4.1 Cost-Based Generation c0 ¼ 0. Then each Vj Q can be represented by fVj ;V0 g. We seek the view scheme that can minimize the cost of the Next, we consider possible pairs of fVj ;Vl g that can be used by the query V n V Q^, where j ¼ 1; ...; jVj and query log. Let qik ¼ 1 denote that the item Ii is contained in j l ^ l ¼ 0; ...;j 1;jþ 1; ...; jVj,havingl 6¼ j according to the neighbor predicates Qk of a query Qk; otherwise, not c þc Lemma 4. Let the ratio be j l , which denotes the average containing. Let yjk ¼ 1 denote that the view Vj 2S is jVjnVlj cost of retrieving each unit of items. As presented in expected to be used in the query tuple Qk, no matter Vj is Algorithm 2, similar to the SP algorithm, we can greedily selected in V (xj ¼ 1) or not. select the view pair fVk1 ;Vk2 g with the minimum ratio in Definition 3.6. The problem of generating the optimal view each step. The view Vk1 is considered to be in P , while Vk2 scheme V is to determine a feasible x having, is added into P. X minimize cjxjyjk Algorithm 2. General Planning GP(Q) Xj;k 1: P :¼P :¼; subject to v x y ¼ q ;i¼ 1; ...; jIj;k¼ 1; ...; jQj Q^ : Q ij j jk ik 2: ¼ neighbor predicates of according attribute j correspondence X s x M 3: while Q^ 6¼;do j j j cjþcl ^ 4: ðk1;k2Þ :¼ arg minj;l ;Vj n Vl Q jVjnVlj xj 2f0; 1g;j¼ 1; ...; jSj; 5: P :¼P [ Vk1 yjk 2f0; 1g;j¼ 1; ...; jSj;k¼ 1; ...; jQj: 6: P :¼P [ Vk2 ^ ^ 7: Q :¼ðQ [ Vk ÞnVk 2 1 Then, V¼fVj j xj ¼ 1;Vj 2Sg. 8: return P; P

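The greedy pairing of Algorithm 2 can be sketched as follows. This is a simplified model in which each view is an item set with a retrieval cost; requiring $V_l$ to be a proper subset of $V_j$ is an extra assumption made here to guarantee termination, not a condition stated by the paper:

```python
def general_plan(query_items, views):
    """Greedy general planning (sketch of Algorithm 2).

    views: list of (item_set, cost) pairs. A virtual empty view V0 with
    cost 0 lets a view be selected without a negative counterpart."""
    views = list(views) + [(frozenset(), 0.0)]  # virtual view V0
    remaining = set(query_items)
    plan, plan_minus = [], []
    while remaining:
        best = None
        for vj, cj in views:
            for vl, cl in views:
                if not vl < vj:                  # proper subset to subtract
                    continue
                if not (vj - vl <= remaining):   # Vj \ Vl must lie in Q-hat
                    continue
                ratio = (cj + cl) / len(vj - vl)
                if best is None or ratio < best[0]:
                    best = (ratio, vj, vl)
        if best is None:  # no view pair covers any remaining item
            break
        _, vj, vl = best
        plan.append(vj)
        if vl:
            plan_minus.append(vl)
        remaining = (remaining | vl) - vj        # line 7 of Algorithm 2
    return plan, plan_minus

# query {A, B}: subtracting the cheap single-item view from the large
# view gives the best cost-per-item ratio (2+1)/2 = 1.5
views = [(frozenset("ABC"), 2.0), (frozenset("C"), 1.0),
         (frozenset("A"), 3.0), (frozenset("B"), 3.0)]
plan, plan_minus = general_plan({"A", "B"}, views)
```

Here the plan retrieves the list of {A,B,C} and negatively merges the list of {C}, rather than merging the two expensive single-item lists.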
Corollary 2. The cost of the general plan returned by the GP algorithm is no greater (worse) than the cost of the standard plan returned by the SP algorithm.

Proof. The worst case is $P^- = \emptyset$, i.e., $x_j^- = 0$ for all $j$, which is exactly the solution of the SP algorithm. □

3.4 Generating Views

Now we present how to generate the view scheme $\mathcal{V}$. Obviously, the larger the number of views in $\mathcal{V}$ is, the better the query performance will be. Let $\mathcal{S}$ be the space of all possible views on the item set $I$. The ideal scenario is to materialize all the possible views, i.e., $\mathcal{V} = \mathcal{S}$, when the space for materialization is not limited. However, real applications usually have a constraint on the maximum available disk space, say $M$, for materialization. The problem we address is to determine a $\mathcal{V} \subseteq \mathcal{S}$ with disk space less than $M$.

Similar to the query planning, we also study greedy heuristics to solve this problem. Let $f_j = \sum_k y_{jk}$ be the frequency of $V_j$ in the neighbor predicates of the query log $\mathcal{Q}$. We can then develop the greedy heuristic by the ratio

$$\frac{c_j}{|V_j| \sum_k y_{jk}} = \frac{c_j}{|V_j| f_j}.$$

In each greedy step, we select into $\mathcal{V}$ the view $V_j$ with the minimum ratio, or equivalently, the $V_j$ that can cover the maximum number of items ($|V_j| f_j$) per unit of cost $c_j$. The cost-based view generation algorithm is developed as follows: for the initialization in lines 2-5 of Algorithm 3, we assume that each view $V_j \subseteq \hat{Q}_k$ can possibly be used, i.e., we assign $y_{jk} = 1$. Therefore, we have $f_j{+}{+}$ in line 5 when $V_j \subseteq \hat{Q}_k$. During each greedy step in lines 6-12, once we decide to select a view (say $V_l$) into $\mathcal{V}$, all the other views $V_j$ (having $V_j \subseteq \hat{Q}_k$ and $V_j \cap V_l \neq \emptyset$, according to Lemmas 2 and 3) can no longer be used, i.e., we assign $y_{jk} = 0$. In other words, we should remove such $y_{jk} = 1$ from $f_j$ by $f_j{-}{-}$ in line 12. The program terminates when $\mathcal{Q} = \emptyset$ or the disk space $size(\mathcal{V})$ exceeds the limitation $M$.

Algorithm 3.
View Generation VG(Q)
1: $f_j := 0$
2: for $j : 1 \to |\mathcal{S}|$ do
3:   for $k : 1 \to |\mathcal{Q}|$ do
4:     if $V_j \subseteq \hat{Q}_k$ then
5:       $f_j{+}{+}$
6: while $\mathcal{Q} \neq \emptyset$ and $size(\mathcal{V}) \leq M$ do
7:   $l := \arg\min_j \frac{c_j}{|V_j| f_j}$
8:   $\mathcal{V} := \mathcal{V} \cup V_l$
9:   for $k : 1 \to |\mathcal{Q}|$ do
10:     if $V_l \subseteq \hat{Q}_k$ then
11:       $\hat{Q}_k := \hat{Q}_k \setminus V_l$
12:       $f_j{-}{-}$ for each $V_j$ with $V_j \cap V_l \neq \emptyset$
13: return $\mathcal{V}$

Corollary 3. For any $\mathcal{V}$ generated by the VG algorithm, the total cost of the queries in $\mathcal{Q}$ using the optimal standard plan for each query tuple is no worse than the cost $\sum_{j,k} c_j x_j y_{jk}$.

According to the greedy strategy of selecting views, the first chosen views are the most effective for reuse, while the views ranked lower in the generation may benefit the queries less. Note that some views have extremely low frequency when considering the entire view scheme space. Although the view generation is offline processing, we can select a candidate subset of views from the entire space as $\mathcal{S}$, e.g., those with frequency greater than a threshold in the query log.

3.4.2 Updates

We mainly have two aspects of updates in dataspaces, i.e., updates of tuples and updates of attribute correspondences. For the updates of tuples, we consider inserting and deleting tuples. During the updating, the inverted lists of both the original items and their corresponding materialized views should be addressed. Efficient approaches have already been developed for updating inverted lists [26], which can be applied here as well.

It is notable that the attribute correspondences between attributes in dataspaces are often incrementally recognized in a pay-as-you-go style [5]. The neighbor predicates with respect to the neighbor keywords of the query workload evolve as well. Consequently, this leads to updates of the views in $\mathcal{V}$. For the frequency-based view scheme, it is easy to update the frequency statistics, remove low-frequency item patterns in updated query predicates, and add high-frequency item patterns to $\mathcal{V}$.

Fig. 5. Decomposition on partitions of tuples.

4 MERGING WITH DECOMPOSITION

Instead of searching data in the entire space, we often decompose the data into partitions for efficient top-k queries. During the query processing, those partitions of tuples with low scores to the query are pruned directly without evaluation. In this section, we study the decomposition of dataspaces for efficient querying. Again, we have to address two questions: 1) how to prune the partitions of tuples for top-k answers, and 2) how to generate partitions of tuples that will have less query cost.

Recall that during the evaluation of a query, we rely on the merge operator to merge the lists referred to by the query plan. That is, given a set of lists,5 we study techniques for efficiently merging the lists to return top-k answers. All the tuples are decomposed into a set of partitions, and the merge operator is then applied on each partition of tuples, respectively. Given a query, we can develop the bound of scores of the tuples in each partition. Therefore, those partitions whose score bounds are lower than the top-k answers can be pruned.

It is notable that we utilize previous merge operators, such as the TA family methods [11], to rank the answers in each partition. Instead of proposing a new top-k ranking method, our partitioning technique is regarded as complementary work to the previous merge operators. Thereby, advanced merge methods, such as IO-top-k [12], can be combined with our decomposition.

4.1 Decomposition

Let $\mathcal{H}$ denote a partition scheme, i.e., a set of $m$ nonoverlapping partitions of all the tuples. In other words, each tuple $T$ is assigned to one and only one partition $H_i \in \mathcal{H}, i = 1, \ldots, m$.

Thereby, each list can be decomposed into a set of nonoverlapping sublists of tuples according to the partitions of tuples. For example, as illustrated in Fig. 5, each list can be decomposed into at most $m = 4$ partitions. Some partitions might be empty in a specific list.
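The decomposition of a list into partition sublists can be illustrated with a small sketch (in-memory lists stand in for the contiguous disk blocks the paper uses; `decompose_list` and its data layout are illustrative assumptions):

```python
def decompose_list(inverted_list, assignment, m):
    """Decompose one inverted list of (tuple_id, weight) entries into
    per-partition sublists, and build the head: for each nonempty
    partition, its id, the maximum item weight in it (the bound h_i),
    and the sublist itself."""
    sublists = [[] for _ in range(m)]
    for tuple_id, weight in inverted_list:
        sublists[assignment[tuple_id]].append((tuple_id, weight))
    head = [
        (pid, max(w for _, w in sub), sub)
        for pid, sub in enumerate(sublists)
        if sub  # empty partitions are omitted, like H2 in the example
    ]
    return head

# a list for one item: tuples 0..3 with weights, assigned to 4 partitions
assignment = {0: 0, 1: 2, 2: 0, 3: 3}
head = decompose_list([(0, 0.9), (1, 0.4), (2, 0.7), (3, 0.5)], assignment, m=4)
# partitions 0, 2, 3 are nonempty; partition 1 holds no tuple of this item
```

The per-partition maximum recorded in the head is exactly the weight bound used later to compute partition score bounds for pruning.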
Those views whose frequencies decrease may be replaced by new frequent views. For the cost-based view scheme, however, we can only rely on batch updates, due to the maintenance of $y_{jk}$ in $f_j$ for each specific view $V_j$. Note that if the updates have to be conducted online, the cost of updating should be considered as well. Consequently, there will be a trade-off between the view update cost and the query cost with respect to the workload. As presented, the generation of views has already been shown to be hard. Therefore, it is highly nontrivial to find optimal updates of views with respect to the balance of update cost and query cost.

For instance, the list of item $(A:a)$, say $I_1$ in Fig. 2b, is decomposed into three partitions, $H_1$, $H_3$, and $H_4$, while $H_2$ is empty. This states that the item $I_1$ does not appear in any tuple of partition $H_2$.

In order to compute the score bounds of the tuples in each partition, we introduce a head structure for each list. The head stores the following information: 1) the partition ID, 2) the bound of item weights in the partition, and 3) the pointer to the start and the offset of the partition in the list. Both the head and the lists of tuple partitions of an item are stored in contiguous disk blocks and can be retrieved in one random access, like the original lists.

5. Corresponding to either original items or views. For simplicity, in the remainder of this section, we use the example of original items, which is the same for views.

4.1.1 Updating

After processing the first $g$ partitions with the highest bounds, we obtain a current top-$k$ answer, say $K$. The following theorem specifies the condition for pruning the next, $(g+1)$st, partition.

Theorem 1. Let $K_k$ be the tuple with the $k$th minimum score in the top-$k$ answers of the previous $g$ steps.
For the next, $(g+1)$st, partition, if we have

$$score(\hat{Q}, K_k) \geq score(\hat{Q}, H_{g+1}), \qquad (6)$$

then the partition $H_{g+1}$ can be safely pruned.

Proof. According to Lemma 5, for any tuple $T$ in the partition $H_{g+1}$, we have $score(\hat{Q}, K_k) \geq score(\hat{Q}, H_{g+1}) \geq score(\hat{Q}, T)$. Since $K_k$ is the tuple with the minimum ranking score in the current top-$k$ results, the tuples in partition $H_{g+1}$ will never be ranked higher than $K_k$ and can be pruned safely without further evaluation. □

During the updating (insertion or deletion of tuples), both the lists and the head information should be updated, including the bounds of weights and also the pointers to the partitions.

4.2 Pruning Top-k Answers

4.2.1 Bound

In order to evaluate the score bounds of the tuples in a partition, we first introduce the formal representation of partitions. Note that each partition also describes a set of items, which appear in the tuples of this partition. Therefore, similar to the tuple vector, each partition can be logically represented by a partition vector of items.

Definition 4.1 (Partition Vector). Let $H$ be a partition in $\mathcal{H}$. The corresponding partition vector is defined by

$$h = (h_1, h_2, \ldots, h_{|I|}), \qquad (3)$$

where $h_i$ is the bound of the weight of item $I_i$ in the partition $H$. Specifically, let $T$ be any tuple in the partition $H$. We have

$$h_i = \max_{T \in H}(t_i), \qquad (4)$$

where $t_i$ is the weight of item $I_i$ in the tuple $T$.

For the remaining partitions $H_{g+2}, H_{g+3}, \ldots$, since the partitions are in decreasing order of score bounds, we have

$$score(\hat{Q}, K_k) \geq score(\hat{Q}, H_{g+1}) \geq score(\hat{Q}, H_{g+x}),$$

where $x = 2, 3, \ldots$. Therefore, we can prune all the remaining partitions, starting from $H_{g+1}$.

For example, consider the query $Q$ with neighbor predicates $\hat{Q} = \{I_1, I_2, I_3, I_4\}$ in Fig. 5. Suppose that the partitions are ordered by their bounds as follows: $H_1, H_3, H_4, H_2$.
After processing the first partition $H_1$, if the current $k$th answer $K_k$ has a score higher than the bound of the next partition $H_3$, then we can prune all the remaining partitions $H_3$, $H_4$, and $H_2$ without evaluating their lists of tuples.

4.2.3 Algorithm

Given a query $Q$ and an integer $k$, the query algorithm is described in Algorithm 4 below. During the initialization, the PARTITIONS($\hat{Q}$) function returns the set of partitions $\mathcal{H}$ ranked in descending order of their score bounds to the query $Q$. Recall that the bound $h_i$ of item $I_i$ in each partition $H$ is recorded in the head structure, as illustrated in Fig. 5. Thus, the bound of scores of each partition can be efficiently computed by merging the heads of all the items referred to in the neighbor predicates $\hat{Q}$ of $Q$.

Algorithm 4. Merge Top-k MT(Q, k)
1: $\hat{Q}$ := neighbor predicates of $Q$ according to attribute correspondence
2: $\mathcal{H}$ := PARTITIONS($\hat{Q}$)
3: $K := \emptyset$
4: for $j : 1 \to \mathcal{H}.size$ do
5:   if $score(\hat{Q}, K_k) \geq score(\hat{Q}, H_j)$ then
6:     break
7:   else
8:     $K' :=$ MERGE($H_j.lists$)
9:     $K :=$ RANK($K, K'$)
10: return $K$

Given a query $Q$, we can compute an intersection score between any partition $H$ and the neighbor predicates $\hat{Q}$ of $Q$,

$$score(\hat{Q}, H) = \lVert q, h \rVert = \sum_{I_i \in \hat{Q}} h_i.$$

As presented in the following, this $score(\hat{Q}, H)$ is exactly the upper bound of the scores of the tuples in the partition $H$.

Lemma 5. Let $T$ be any tuple in a partition $H$; we have

$$score(\hat{Q}, H) \geq score(\hat{Q}, T), \qquad (5)$$

where $Q$ is the query.

Proof. According to the definition of the partition vector in (4), for any item $I_i$, we have $h_i \geq t_i$. Therefore,

$$score(\hat{Q}, H) = \sum_{I_i \in \hat{Q}} h_i \geq \sum_{I_i \in \hat{Q}} t_i = score(\hat{Q}, T).$$

In other words, $score(\hat{Q}, H)$ is the bound of the scores of tuples $T \in H$ to the query $\hat{Q}$. □

When the list of an item $I_i$ is manipulated by the negative merge operator, e.g., in a general query plan, we can assign $h_i = 0$ for each partition $H$. For a tuple $T \in H$ containing item $I_i$, we have $t_i > 0$, i.e., $-t_i < 0$ in the negative merge. Moreover, for a tuple $T' \in H$ that does not contain the item $I_i$, we have $t_i = -t_i = 0$.
Therefore, we can assign $h_i = 0$ for this item $I_i$ for simplicity.

4.2.2 Pruning

Next, we can order the partitions in decreasing order of their upper score bounds to the query. Let $score(\hat{Q}, K_k)$ be the $k$th largest score in the current top-$k$ answers $K$. For the partition $H_j$, if $score(\hat{Q}, K_k)$ is larger than the bound $score(\hat{Q}, H_j)$ of the tuple scores of the partition, then the partition $H_j$ and all the remaining partitions can be pruned.

Otherwise, we merge and rank all the tuples in the partition $H_j$. The MERGE() function is the implementation of the merge operator introduced in Definition 2.3 or Definition 3.3.

4.3 Cost Analysis

Suppose that we have $m = |\mathcal{H}|$ partitions on the $n$ tuples in the dataspaces. We analyze the costs of disk space and query time.

4.3.1 Space Cost

Let $O(n)$ be the space cost of the original inverted lists of all the $n$ tuples. We have $\frac{n}{m}$ tuples in each partition on average. Thus, the space cost introduced by the partition information can be estimated by $\frac{m}{n} O(n)$. The total space cost with partitions is $(1 + \frac{m}{n}) O(n)$. Since the number of partitions is always less than the number of tuples, $m \leq n$, the space cost is at most twice the cost of the original inverted lists.

4.3.2 Time Cost

Moreover, if we define the set $Y$ to be the set of partition vectors in dataspaces, then the correlation integral, say $C_h(\epsilon)$, means the probability that a query has high similarity (greater than $\epsilon$) to the bound of a partition $H$. During the computation, if a query workload is provided, then we can directly use the query tuples as $X$. However, if a query workload is not available, then we can only rely on the data itself, i.e., use the data tuples as $X$ as well.

According to the fractal and self-similarity features, which have been observed in various applications including high dimension spaces [28], [29], [30], there exists a constant, known as the correlation dimension [27] $D$,

$$D = \frac{\partial \log C(\epsilon)}{\partial \log \epsilon}. \qquad (8)$$

We observe the $D_t$ corresponding to $C_t(\epsilon)$ of tuples and the $D_h$ corresponding to $C_h(\epsilon)$ of partitions in real data sets of dataspaces. As presented in Figs. 9, 10, and 11, a straight line can be fitted in each plot, respectively, whose slope is exactly the constant $D$ according to the above definition.
Again, let $O(n)$ be the time cost of merging on the entire space of $n$ tuples. Suppose that the pruning is conducted after processing the first $g$ partitions. Then, we can estimate the time cost of the query with pruning as follows.

Lemma 6. The cost of processing the first $g$ partitions can be estimated by

$$\Big( \frac{m}{n} + \frac{g}{m} \Big) O(n).$$

Proof. The merge cost with pruning partitions consists of two aspects, the merge of partitions and the merge of tuples, respectively. First, $\frac{m}{n} O(n)$ denotes the merge of partitions, in order to calculate the bounds of all the $m$ partitions. Moreover, according to the pruning, all the remaining $m - g$ partitions can be ignored without evaluation. That is, the merge cost of tuples would be $\frac{g}{m}$ of the original cost $O(n)$ without partition pruning. □

Now, we study the estimation of the number of partitions $g$ that are processed before pruning, i.e., the $\frac{g}{m}$ value. Intuitively, we want to estimate the probability that a partition is processed in a query, i.e., the probability of a partition having a bound greater than the top-$k$ answers.

According to the constant $D$ in (8), we can represent the relationship among $\epsilon$, $D$, and $C(\epsilon)$ as follows:

$$C(\epsilon) \propto \epsilon^D, \qquad (9)$$

where $\propto$ stands for proportionality, i.e., $C(\epsilon)$ follows the power law. Let $C(\epsilon) = (\lambda \epsilon)^D$, where $\lambda$ is a constant and can be observed together with the slope $D$ in Figs. 9, 10, and 11.

Lemma 7. The number of processed partitions $g$ can be estimated by

$$g \approx \Big( \frac{\lambda_h}{\lambda_t} \Big)^{D_h} \Big( \frac{k}{n} \Big)^{D_h / D_t} m.$$

Proof. Recall that $C(\epsilon)$ denotes the mean probability that the similarity is greater than $\epsilon$. Let $\epsilon_t$ be the minimum similarity of the top-$k$ answers. Then, we have $\frac{k}{n} = C_t(\epsilon_t) = (\lambda_t \epsilon_t)^{D_t}$. Moreover, let $\epsilon_h$ be the bound of the similarity scores of the $g$th partition. Thus, we also have $\frac{g}{m} = C_h(\epsilon_h) = (\lambda_h \epsilon_h)^{D_h}$.

According to the pruning condition of partitions, we have for the similarity scores $\epsilon_t \leq \epsilon_h$, that is,
$$\frac{1}{\lambda_t} \Big( \frac{k}{n} \Big)^{1/D_t} \leq \frac{1}{\lambda_h} \Big( \frac{g}{m} \Big)^{1/D_h}.$$

In other words, we have

$$\frac{g}{m} \leq \Big( \frac{\lambda_h}{\lambda_t} \Big)^{D_h} \Big( \frac{k}{n} \Big)^{D_h / D_t}.$$

The lemma is proved. □

Combining Lemmas 6 and 7, we have the following conclusion. Let $\psi$ be

$$\psi = \Big( \frac{\lambda_h}{\lambda_t} \Big)^{D_h} \Big( \frac{k}{n} \Big)^{D_h / D_t}. \qquad (10)$$

Corollary 4. The cost of merging with pruning on partitions can be estimated by

$$\Big( \frac{m}{n} + \psi \Big) O(n). \qquad (11)$$

Therefore, we introduce the concept of the correlation integral [27] $C(\epsilon)$, which denotes the mean probability that two objects, one from each of two sets, are similar (with similarity greater than $\epsilon$). Let $|X|$ be the size of object set $X$ and $x_i$ be an object in $X$. Let $|Y|$ be the size of object set $Y$ and $y_j$ be an object in $Y$. Then, for a specific similarity value $\epsilon$, the correlation integral $C(\epsilon)$ can be approximated by the correlation sum

$$C(\epsilon) = \frac{1}{|X||Y|} \sum_{i=1}^{|X|} \sum_{j=1}^{|Y|} \theta(\lVert x_i, y_j \rVert - \epsilon), \qquad (7)$$

where $\theta(\cdot)$ is the Heaviside function, $\theta(x) = 1$ for $x \geq 0$ and $0$ otherwise, and $\lVert \cdot \rVert$ is the intersection similarity.

Let $X$ be the set of query tuples and $Y$ be the set of data tuples in dataspaces. Then, the correlation integral is the probability that the similarity between a query and a tuple $T$ is greater than $\epsilon$, denoted by $C_t(\epsilon)$.

TABLE 2. Observations of $\lambda$ and $D$ of $C(\epsilon)$.

Fig. 6. Estimated cost of a query.

Given a data set, the values of $D_t$ and $\lambda_t$ are fixed according to their definitions. For example, as observed in Fig. 9, we can find an ideal line $C_t(\epsilon) = (\lambda_t \epsilon)^{D_t}$ with $D_t = -3.4$ and $\lambda_t = 0.12$, which approximately fits the observed Base data set. Recall that the slope $D_h$ and the corresponding $\lambda_h$ in (10) can also be observed as constants on a large enough partition scheme. For example, in Fig. 10, we can observe $C_h(\epsilon) = (\lambda_h \epsilon)^{D_h}$ with $D_h = -2.5$ and $\lambda_h = 0.15$, which fits the Base data set with random partitions. In other words, $\psi$ will be a constant that is independent of the number of processed partitions $g$ and the total number of partitions $m$. Therefore, according to Corollary 4, we cannot further improve the query efficiency by increasing the number of partitions $m$. Our experimental evaluation also verifies this conclusion.

In fact, as we observe, the feature-based partition scheme has a large $\lambda_h$ value, such as $\lambda_h \approx 3$ in the Wiki data set in Table 2. Thus, given a specific $k$ value, the feature-based partition has a smaller constant $\psi$ than the random approach (see details in Section 5.2). According to Corollary 4, the time cost of queries on feature-based partitions should be small as well. Therefore, in our experiments, the feature-based partition shows better query time performance.

5 EXPERIMENTS

This section reports the experimental evaluation of the proposed techniques. We evaluate the following approaches: the baseline approach with extended inverted lists [10], the planning with materialization of views, the merging with decomposition of partitions, and the hybrid approach with both views and partitions.
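The message of Corollary 4 — once the pruning rate $g/m$ settles to the constant $\psi$, adding partitions only adds head-merging overhead — can be sanity-checked with a toy cost model (the numbers below are illustrative assumptions, not measurements from the paper):

```python
def estimated_merge_cost(n, m, g):
    """Lemma 6 sketch: (m/n) * O(n) for merging the partition heads plus
    (g/m) * O(n) for merging the tuples of the g unpruned partitions,
    with O(n) taken as n itself for illustration."""
    return (m / n + g / m) * n

n, psi = 1_000_000, 0.1   # assume a fixed pruning rate g/m = psi
costs = [estimated_merge_cost(n, m, g=psi * m) for m in (100, 1_000, 10_000)]
# the (g/m) term stays psi * n regardless of m, so the total cost can
# only grow as the number of partitions m increases
```

The tuple-merging term is constant at $\psi \cdot n$, so the head overhead $m$ dominates the trend, matching the conclusion drawn above.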
The decomposition of partitions, and the hybrid approach with partition scheme is preferred which has lower query cost both views and partitions. In the implementation, we use according to the above theoretical analysis. the Combined Algorithm [11] as the state-of-art merge operator of inverted lists. Moreover, the successful idea of 4.4.1 Random Partition inverted block-index in IO-Top-K [12] is also applied, i.e., The straightforward partition scheme is to assign a tuple to divide each inverted list into blocks and use score- a partition at random. The distribution of tuples in different descending order among blocks but keep the tuple entries partitions tends to be the same. In other words, each within each block in the order of tuple IDs. Such block idea partition shows a similar partition vector. Therefore, the in CA is complementary to our proposed materialization prune power is low by conjecture. and decomposition techniques. The main evaluation criter- D < 0 Recall that the correlation dimension has h . ion is query time cost. We run the experiments in two real According to Lemma 7, the smaller the h is, the larger data sets, Google Base (Base), and Wikipedia (Wiki). There the number of processed partitions g would be. In fact, as are 7,432,575 tuples crawled from Google Base web site, in we observed in Table 2, the random partition scheme shows size of 3.06 GB after preprocessing. The data of Wikipedia a small , e.g., 0:3 in the Wiki data set. Thus, h h consists of 3,493,237 tuples, in size of 0.82 GB after theoretically, the query efficiency based on random parti- tions is low. Our experimental evaluation also verifies the preprocessing. Items of attribute-keywords are associated unsuitability of the random partition scheme. with tf*idf weight scores [13]. 
4.4.2 Feature-Based Partition

The feature-based partitioning is developed from the intuition that the tuples in the same partition share similar contents of items (features). There are various algorithms for partitioning tuples according to their similarities [31], which is not the focus of this paper. For example, we can employ a classifier to make decisions based on the item features of partitions. Or we can use clustering algorithms to group the tuples into $m$ clusters without a supervised classifier.

Intuitively, in a feature-based partition scheme, the tuples in the same partition share similar item features, while the tuples in different partitions often have differing item features. Consequently, the partition vectors of different partitions differ as well. For a specific query, the bounds of the partitions will then be more distinguishable. In other words, more irrelevant tuples in the partitions with low bounds can possibly be pruned.

During the evaluation, we randomly select 200 tuples from the data set as a synthetic workload of queries.6 The average response time of these queries is reported when the different approaches are applied. The experiments run on a machine with an Intel Core 2 CPU (2.13 GHz) and 2 GB of memory.

6. We explore the correspondences of attributes in dataspaces by using instance-level matching [9].

5.1 Evaluating Materialization

In this experiment, we mainly test the performance of the different planning approaches under various disk space limitations. Let $s$ be the average size of the lists. Then, the limitation of space $M$ is actually a limitation on the number of views, say $M_s$.
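The feature-based grouping of Section 4.4.2 can be sketched with a simple leader-style clustering. This is an illustrative stand-in for the classifiers or clustering algorithms [31] the paper delegates to; `feature_partition` and its Jaccard threshold are assumptions:

```python
def feature_partition(tuples, m, threshold=0.5):
    """Group tuples into at most m partitions by item overlap: each tuple
    joins the most similar existing partition (Jaccard similarity with
    the partition's accumulated item set) or starts a new one."""
    leaders, partitions = [], []
    for t in tuples:
        items = set(t)
        best, best_sim = None, 0.0
        for i, leader in enumerate(leaders):
            sim = len(items & leader) / len(items | leader)  # Jaccard
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            partitions[best].append(t)
            leaders[best] |= items
        elif len(leaders) < m:
            leaders.append(set(items))
            partitions.append([t])
        else:  # capacity reached: fall back to the most similar partition
            best = best if best is not None else 0
            partitions[best].append(t)
            leaders[best] |= items
    return partitions

tuples = [("A", "B"), ("A", "B", "C"), ("X", "Y"), ("X", "Y", "Z")]
partitions = feature_partition(tuples, m=2)
# tuples sharing items land together, so each partition's weight bounds
# are high for its own features and zero for the others
```

Grouping similar tuples concentrates each item's weight into few partitions, which is exactly what makes the partition bounds distinguishable for pruning.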

Fig. 9. $C(\epsilon)$ observation of tuples.
Fig. 7. Time cost of a query.

Fig. 10. $C(\epsilon)$ observation of random partitions.
Fig. 8. Time cost of a query by different plans.
Fig. 11. $C(\epsilon)$ observation of feature-based partitions.

Since the lists of single-item views (i.e., original items) are already stored by the index, we mainly check the extra space cost introduced by materializing the views with multiple items, i.e., $m = |\mathcal{V}| - |I|$. In the following experiments, the number of views denotes the $m$ multi-item views by default. When the view number is $m = 0$, no multi-item views are materialized, i.e., it is equivalent to the baseline approach.

In Fig. 6, we present the estimated cost of the query plan $P$, i.e., $\sum_j c_j x_j$ in Definition 3.1, when different total numbers of views $m$ are available. With the increase of materialized views $m$, the query plan $P$ can choose more effective views with less estimated cost. Obviously, a randomly generated view has a rare chance of being effective for a specific query. Thus, the query may have to retrieve the original items with higher cost, since the randomly generated views could be useless. The frequency-based view scheme materializes the views with frequent items according to the historical query log. Queries can reuse these views and consequently have query plans with less estimated cost. Finally, we also report the estimated cost of queries on the cost-based view scheme, which is smaller than the other ones.

According to our view selection strategies in query planning and view generation, we always first choose the most effective views, those that can contribute most to the query. Thereby, with the increase of views, e.g., from 150 to 200 in Figs. 6 and 7, the achieved improvement may not be as significant as for the first chosen ones, like 1-50.

In Fig. 8, we evaluate the standard and general query planning. As we presented in Corollary 2, the worst case of an optimal general plan is the corresponding optimal standard plan without any negative merge operation. Therefore, as presented in Fig. 8, in the Wiki data set, the performance of the general plan is generally not worse than that of the standard one.
When a proper negative merge is applicable, e.g., in the Base data set, the general plan can achieve better performance.

Fig. 7 reports the corresponding query time cost of the various view schemes. As shown in the figures, the time cost is roughly proportional to the estimated cost of the corresponding query plans. The more materialized views $m$ there are, the better the query time performance is. According to the above observation of estimated cost, the frequency-based approach has more chances to utilize effective materialization than the random one. Therefore, the time cost of queries on frequency-based views is lower as well. However, the frequency-based approach favors small views, as we mentioned in Section 3.4, while the cost-based scheme can generate useful views according to the cost estimation. Thus, queries on cost-based views have even lower time cost. Note that if there are no proper views available in large size, the cost-based strategy will generate views similar to the frequency-based one. Consequently, the difference between these two generation strategies may not be large, e.g., for 10 views in Fig. 7.

5.2 Evaluating Decomposition

In this experiment, we first observe that the constant $D$ in (8) exists in real dataspace examples, since our cost estimation is based on the assumption of the existence of this $D$. Specifically, we collect the $C(\epsilon)$ values of tuples, random partitions, and feature-based partitions, which are reported in Figs. 9, 10, and 11, respectively. Each point $(\epsilon, C(\epsilon))$ denotes the observed probability $C(\epsilon)$ at similarity score $\epsilon$ in the corresponding data set. We also plot an ideal $C(\epsilon) = (\lambda \epsilon)^D$ for each data set that fits our observations. We record the constants $\lambda$ and $D$ of the ideal $C(\epsilon) = (\lambda \epsilon)^D$ as the estimation of the real observations, which are presented in Table 2. For example, we have $D_h = -2.5$ and $\lambda_h = 0.25$ for feature-based partitions in the Base data set, according to the ideal $C(\epsilon) = (0.25\,\epsilon)^{-2.5}$ that fits the observed data in Fig. 11.
Nevertheless, the cost-based approach will not generate significantly worse views than the frequency-based one. According to the definition of $\psi$ in (10), for the random partitions in Base, we have:

Fig. 12. Retrieved partitions of a query.
Fig. 13. Time cost of a query.

$$\psi_{random} = \Big(\frac{\lambda_h}{\lambda_t}\Big)^{D_h} \Big(\frac{k}{n}\Big)^{D_h/D_t} = \Big(\frac{0.15}{0.12}\Big)^{-2.5} \Big(\frac{k}{n}\Big)^{2.5/3.4} = 0.572\,\Big(\frac{k}{n}\Big)^{2.5/3.4}.$$

For the feature-based scheme, we have

$$\psi_{feature} = \Big(\frac{0.25}{0.12}\Big)^{-2.5} \Big(\frac{k}{n}\Big)^{2.5/3.4} = 0.159\,\Big(\frac{k}{n}\Big)^{2.5/3.4}.$$

Obviously, the constant $\psi_{feature}$ is less than $\psi_{random}$. According to Corollary 4, queries on feature-based partitions should have lower time cost than with the random approach. Similarly, we can also observe the constant $\psi$ in Wiki:

$$\psi_{random} = \Big(\frac{0.3}{8}\Big)^{-1.5} \Big(\frac{k}{n}\Big)^{1.5/1.5} = 137.706\,\Big(\frac{k}{n}\Big)^{1.5/1.5},$$

$$\psi_{feature} = \Big(\frac{3}{8}\Big)^{-1.5} \Big(\frac{k}{n}\Big)^{1.5/1.5} = 4.354\,\Big(\frac{k}{n}\Big)^{1.5/1.5}.$$

That is, we have $\psi_{feature} < \psi_{random}$ as well. So far, according to the above observations and analysis on both data sets, the feature-based partition should have better performance than the random one. Next, we show that this holds for the real query evaluation.

Moreover, with the further increase of the partition number $m$, e.g., 1,000 partitions in Fig. 13, the performance cannot be improved any more, since the fractal property has already been clearly observed with a constant $\psi$. Consequently, the time cost increases with the number of partitions, which confirms our conclusion in Corollary 4.

5.3 Scalability

Finally, we combine our proposed views and partitions together, called the hybrid approach with materialization+decomposition. For the view materialization, we use the cost-based view generation and the standard query plan. The partition decomposition uses the feature-based partition generation. The state-of-the-art CA method is utilized as the baseline approach, where materialization and decomposition are not applied. We mainly test the performance of the approaches under different data sizes, in order to evaluate scalability. As presented in Fig. 14, the time cost of all the approaches increases linearly with the data size. Both our materialization and decomposition approaches can improve the time performance at various data scales, compared with the baseline approach.
the time performance in various data scale, compared with Specifically, we observe the time performance of top-k7 the baseline approach. The hybrid one can always achieve queries under different number of partitions m. Note that the best performance and scales well under large sizes. when the partition number m ¼ 1, it means that all the tuples are in one partition, which is equivalent to the 6RELATED WORK baseline approach. In Fig. 12, we study the number of retrieved partitions g The concept of dataspaces is proposed in [1] and [2], which that have to be processed before the pruning can be applied. provides a coexisting system of heterogeneous data. Due to First, compared with feature-based partition scheme, the the huge amount of increasing data especially from the random approach has much more retrieved partitions g, Web, the importance of dataspace systems has already been which also confirms our analysis of the random scheme in recognized and emphasized [3], [4]. Recent work [5], [6], [7] Section 4.4. Moreover, as we analyzed, the constant is mainly dedicated to offering best-effort answers in a pay- g as-you-go style in view of integration issues. In our study, approximately denotes the rate of processed partitions m .In the above observation, we find that the feature is much less instead, we concern the efficiency issues on accessing 4:354 dataspaces, which also plays a fundamental role in practical than random in Wiki ð137:706Þ, compared with those of Base 0:159 dataspace systems. ð0:572Þ. Thus, in Fig. 12, the pruning power of featured-based partitions on Wiki is stronger than that in Base. The problem of indexing dataspaces is first studied by Fig. 13 shows the time cost of queries with corresponding Dong and Halevy [10] to support efficient query. Based on partitions. When the number of partitions m is small, e.g., the encoding of attribute label and value as items, the 10 partitions in Fig. 
13, the overlap of items (features) among partitions might be large. That is, in terms of our cost analysis, we cannot accurately observe the constants $D$ and $\psi$. Thus, the bounds of the partitions are not distinguishable for a specific query. Consequently, the pruning is not ensured, and the cost will be high. With the increase of $m$, e.g., from 10 to 100 partitions in Fig. 13, the constant $\psi$ can be observed. In other words, the bounds of partitions can effectively identify and prune the low-score tuples, and thereby the cost drops.

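As a numeric check of the $\psi$ calculations in Section 5.2 ($\lambda$ and $D$ are the constants read from Figs. 9, 10, and 11, cf. Table 2; only the constant factor $(\lambda_h/\lambda_t)^{D_h}$ of $\psi$ is computed here):

```python
def psi_constant(lam_h, lam_t, d_h):
    """Constant factor (lam_h / lam_t) ** D_h of psi in (10)."""
    return (lam_h / lam_t) ** d_h

base_random  = psi_constant(0.15, 0.12, -2.5)   # Base, random partitions
base_feature = psi_constant(0.25, 0.12, -2.5)   # Base, feature-based
wiki_random  = psi_constant(0.30, 8.0,  -1.5)   # Wiki, random partitions
wiki_feature = psi_constant(3.00, 8.0,  -1.5)   # Wiki, feature-based
# a larger lambda_h (more distinguishable partition bounds) yields a
# smaller psi, i.e., fewer partitions processed before pruning applies
```

These reproduce the constants 0.572, 0.159, 137.706, and 4.354 quoted above, and in both data sets the feature-based constant is the smaller one.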
7. $k = 5$; similar results are observed with $k = 1, 10$, etc.

Fig. 14. Scalability.

In fact, the inverted index [16], [17] has been studied for decades for efficient access to sparse data. Zobel and Moffat [18] provide a comprehensive introduction to the key techniques in the text indexing literature. Li et al. [32] also study indexing for approximately retrieving sparse data, which extends inverted lists as well. The difference between dataspaces and XML is also discussed by Dong and Halevy [10]. Specifically, since XML techniques rely on encoding the parent-child and ancestor-descendant relationships in an XML tree, which do not fit dataspaces, the query processing of related items in XML [33] is not directly applicable to the query processing with keyword neighborhood in dataspaces [10].

study the k-NN search by avoiding merging all the dimensions referred to by query items. In our study, we first decompose the tuples into a set of partitions; then the above merging techniques can be applied in each partition, respectively. Thus, our focus is the pruning of partitions rather than the merge operator.

Materialized views in relational databases are often utilized to find equivalent view-based rewritings of relational queries [45], such as conjunctive queries or aggregate queries in databases. A similar problem is also studied in index selection [46]. Specifically, given a workload of SQL statements and a user-specified storage constraint, the task is to recommend a set of indexes that have the maximum benefit for the given workload. Greedy heuristics are often used in such selection, e.g., in SQL Server [47] and DB2 [48].
approaches introduce materialization in dataspaces and Chirkova and Li [49] study the generation of a set of views that can compute the answers to the queries, such that the provide (near) optimal query planning based on the size of the view set is minimal. Heeren et al. [50] consider materialized data. For high dimensional data, Sarawagi the index selection with a bound of available space, upon and Kirpal [34] propose the grouping of items to materialize which the average query response time is minimized. their corresponding inverted lists. However, the proposed Instead of considering materialized views and index algorithm is developed for set similarity joins, and does not separately, Aouiche and Darmont [51] take view-index address the generation of optimal query plans based on the interactions into account and achieve efficient storage space available materialized data. sharing. All these previous works focus on rewriting Various strategies for storing sparse data are also relational queries based on materialized views in relational proposed. Chu et al. [35] use a big wide-table to store the databases. In our study, we extend the concept of materi- sparse data and extract the data incrementally [36]. Rather alization to dataspaces and explore the corresponding than the predominant positional storage with a preallocated query optimization. amount of space for each attribute, the wide-table uses an Partitioning-based approaches for efficient access are interpreted storage to avoid allocating spaces to those null studied as well. Lester et al. [52] propose the partitioning values in the sparse data. Agrawal et al. [37] study a vertical index for efficient online index. To make documents format storage of the tuples. Specifically, a 3-ary vertical immediately accessible, the index is divided into a controlled scheme is developed with columns including tuple identi- number of partitions. Nikos Mamoulis [53] studies the fier, attribute name, and attribute value. 
Beckmann et al. efficient joins on set-valued attributes, by using inverted [38] extend the RDBMS attributes to handle the sparse data index. Different from our large number of attributes, the join as interpreted fields. A prototype implementation in the predicates are evaluated between two attributes only. existing RDBMS is evaluated to illustrate the advanced Sarawagi and Kirpal [34] also propose an efficient algorithm performance in dealing with sparse data. Abadi et al. [39], for indexing with a data partitioning strategy. All these [40] provide comprehensive studies of the column-based efficient techniques are dedicated to the single attribute store comparing with the row-based store. The column problem, while the dataspaces contain various attributes. store with vertical partitioning shows advanced perfor- The idea of cracking databases into manageable pieces is mance in many applications, such as the RDF data of the developed recently to organize data in the way users request [39] and the recent Star Schema Benchmark it. Rather than dataspaces, Idreos et al. [54], [55] mainly study of data warehousing [40]. Chaudhuri et al. [41] study a the cracking of relational databases. The cracking approach is similarity join operator (SSJoin [41], [42]) on text attributes, based on the hypothesis that index maintenance should be a which are also organized in a vertical style. Specifically, byproduct of query processing, not of updates. each value of text attributes is converted to a set of tokens (words or q-grams [43]), which are stored separately in different tuples, respectively. Our work is independent with 7CONCLUSIONS the storage of dataspaces, instead we develop the materi- In this paper, we study the materialization and decomposi- alization and decomposition of dataspaces upon the tion of dataspaces in order to improve the efficiency of indexing framework. queries with keyword neighborhood in schema level. 
Since For the merge operator, the inverted lists are usually neighbor keywords are always queried together, we first sorted by the tuple IDs, then efficient merging algorithm can propose the materialization of neighbor keywords as views be applied [18]. When the inverted lists are sorted by the of items. Then, the optimal query planning is studied on the weight of each tuple, the threshold algorithm [19] can item views that are materialized in dataspaces. Due to the return the top-k answers efficiently. Advanced TA family NP-completeness of the problem, we study the greedy methods such as combined algorithm [11] can also be used approaches to generate query plans. Obviously, the more as merge operator of inverted lists. Moreover, inverted materialized views there are, the better the query perfor- block-index is also proposed in IO-Top-K [12], i.e., divide mance is. The generation of views is then discussed with each inverted list into blocks and use score-descending limitation of materialization space. Moreover, we also study order among blocks but keep the tuple entries with in each the decomposition of dataspaces into partitions of tuples for block in tuple ID order. Arjen P. de Vries et al. [44] also top-k queries. Efficient pruning of tuple partitions is 1886 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 12, DECEMBER 2011 developed during the top-k query processing. We propose a [15] H. Bast and I. Weber, “The Completesearch Engine: Interactive, theoretical analysis for the cost of querying with partitions Efficient, and Towards IR& DB Integration,” Proc. Conf. Innovative Data Systems Research (CIDR), pp. 88-95, 2007. and find that the pruning power cannot be improved by [16] R.A. Baeza-Yates and B.A. Ribeiro-Neto, Modern Information increasing the number of partitions. The generation of Retrieval. ACM Press / Addison-Wesley, 1999. partitions is also discussed based on the cost analysis. 
[17] I.H.Witten,A.Moffat,andT.C.Bell,Managing Gigabytes: Compressing and Indexing Documents and Images, second ed. Finally, we report an extensive experiment to illustrate Morgan Kaufmann, 1999. the performance of proposed methods. In the method of [18] J. Zobel and A. Moffat, “Inverted Files for Text Search Engines,” materialization, the general query plans show no worse ACM Computing Surveys, vol. 38, no. 2, pp. 1-55, 2006. performance than the standard query plans. When proper [19] R. Fagin, A. Lotem, and M. Naor, “Optimal Aggregation Algorithms for Middleware,” Proc. 20th ACM SIGMOD-SI- negative merge is applicable, the general plan can achieve GACT-SIGART Symp. Principles of Database Systems (PODS ’01), better performance. The hybrid approach with both views 2001. and partitions can always achieve the best performance. [20] D. Peleg, G. Schechtman, and A. Wool, “Approximating Bounded 0-1 Integer Linear Programs,” Proc. Second Israel Symp. Theory and Furthermore, the experimental results also verify our Computing Systems, pp. 69-77, 1993. conclusions of cost analysis, that is, we can improve the [21] C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: query performance by increasing the number of views but Algorithms and Complexity. Prentice-Hall, Inc., 1982. [22] V. Chvatal, “A Greedy Heuristic for the Set-Covering Problem,” not that of partitions. Math. Operations Research, vol. 4, no. 3, pp. 233-235, 1979. [23] G. Dobson, “Worst Case Analysis of Greedy Heuristics for Integer Programming with Non-Negative Data,” Math. Operations Re- ACKNOWLEDGMENTS search, vol. 7, no. 4, pp. 515-531, 1982. [24] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Funding for this work was provided by Hong Kong RGC Association Rules in Large Databases,” Proc. 20th Int’l Conf. Very GRF 611608, NSFC Grant Nos. 60736013, 60970112, and Large Data Bases (VLDB ’94), pp. 487-499, 1994. 60803105, and Microsoft Research Asia Gift Grant, [25] J. Han, J. Pei, Y. 
Yin, and R. Mao, “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Ap- MRA11EG05. The authors also thank the support from proach,” and Knowledge Discovery, vol. 8, no. 1, HKUST RFID center. pp. 53-87, 2004. [26] L. Lim, M. Wang, S. Padmanabhan, J.S. Vitter, and R.C. Agarwal, “Efficient Update of Indexes for Dynamically Chan- REFERENCES ging Web Documents,” J. , vol. 10, no. 1, pp. 37- 69, 2007. [1] M.J. Franklin, A.Y. Halevy, and D. Maier, “From Databases to [27] P. Grassberger and I. Procaccia, “Measuring the Strangeness of Dataspaces: A New Abstraction for Information Management,” Strange Attractors,” Physica D: Nonlinear Phenomena, vol. 9, SIGMOD Record, vol. 34, no. 4, pp. 27-33, 2005. nos. 1/2, pp. 189-208, 1983. [2] A.Y. Halevy, M.J. Franklin, and D. Maier, “Principles of Dataspace [28] Systems,” Proc. 25th ACM SIGMOD-SIGACT-SIGART Symp. A. Belussi and C. Faloutsos, “Estimating the Selectivity of Spatial Principles of Database Systems (PODS ’06), pp. 1-9, 2006. Queries Using the “Correlation” Fractal Dimension,” Proc. 21th [3] M.J. Franklin, A.Y. Halevy, and D. Maier, “A First Tutorial on Int’l Conf. Very Large Data Bases (VLDB ’95), pp. 299-310, 1995. [29] Dataspaces,” Proc. VLDB Endowment, vol. 1, no. 2, pp. 1516-1517, B.-U. Pagel, F. Korn, and C. Faloutsos, “Deflating the Dimension- 2008. ality Curse Using Multiple Fractal Dimensions,” Proc. 16th Int’l [4] J. Madhavan, S. Cohen, X.L. Dong, A.Y. Halevy, S.R. Jeffery, D. Conf. Data Eng., pp. 589-598, 2000. Ko, and C. Yu, “Web-Scale Data Integration: You can Afford to [30] F. Korn, B.-U. Pagel, and C. Faloutsos, “On the “Dimensionality Pay as You Go,” Proc. Conf. Innovative Data Systems Research Curse” and the “Self-Similarity Blessing,”” IEEE Trans. Knowledge (CIDR), pp. 342-350, 2007. and Data Eng., vol. 13, no. 1, pp. 96-111, Jan./Feb. 2001. [5] S.R. Jeffery, M.J. Franklin, and A.Y. Halevy, “Pay-As-You-Go User [31] J. Han and M. Kamber, Data Mining: Concepts and Techniques. 
Feedback for Dataspace Systems,” Proc. ACM SIGMOD Int’l Conf. Morgan Kaufmann, 2000. Management of Data (SIGMOD ’08), pp. 847-860, 2008. [32] B. Li, M. Hui, J. Li, and H. Gao, “Iva-File: Efficiently Indexing [6] A.D. Sarma, X. Dong, and A.Y. Halevy, “Bootstrapping Pay-As- Sparse Wide Tables in Community Systems,” Proc. IEEE Int’l Conf. You-Go Data Integration Systems,” Proc. ACM SIGMOD Int’l Conf. Data Eng. (ICDE ’09), pp. 210-221, 2009. Management of Data (SIGMOD ’08), pp. 861-874, 2008. [33] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv, “Xsearch: A Semantic [7] M.A.V. Salles, J.-P. Dittrich, S.K. Karakashian, O.R. Girard, and L. Search Engine for Xml,” Proc. 29th Int’l Conf. Very Large Data Bases Blunschi, “Itrails: Pay-As-You-Go Information Integration in (VLDB ’03), pp. 45-56, 2003. Dataspaces,” Proc. 33rd Int’l Conf. Very Large Data Bases (VLDB [34] S. Sarawagi and A. Kirpal, “Efficient Set Joins on Similarity ’07), pp. 663-674, 2007. Predicates,” Proc. ACM SIGMOD Int’l Conf. Management of Data [8] F.M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A Core of (SIGMOD ’04), pp. 743-754, 2004. Semantic Knowledge,” Proc. 16th Int’l Conf. World Wide Web [35] E. Chu, J.L. Beckmann, and J.F. Naughton, “The Case for a Wide- (WWW ’07), pp. 697-706, 2007. Table Approach to Manage Sparse Relational Data Sets,” Proc. [9] E. Rahm and P.A. Bernstein, “A Survey of Approaches to ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’07), Automatic Schema Matching,” Int’l J. Very Large Data Bases, pp. 821-832, 2007. vol. 10, no. 4, pp. 334-350, 2001. [36] E. Chu, A. Baid, T. Chen, A. Doan, and J.F. Naughton, “A [10] X. Dong and A.Y. Halevy, “Indexing Dataspaces,” Proc. ACM Relational Approach to Incrementally Extracting and Querying SIGMOD Int’l Conf. Management of Data (SIGMOD ’07), pp. 43-54, Structure in Unstructured Data,” Proc. 33rd Int’l Conf. Very Large 2007. Data Bases (VLDB ’07), pp. 1045-1056, 2007. [11] R. Fagin, “Combining Fuzzy Information: An Overview,” SIG- [37] R. 
Agrawal, A. Somani, and Y. Xu, “Storage and Querying of MOD Record, vol. 31, no. 2, pp. 109-118, 2002. e-Commerce Data,” Proc. 27th Int’l Conf. Very Large Data Bases [12] H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum, (VLDB ’01), pp. 149-158, 2001. “Io-Top-k: Index-Access Optimized Top-k Query Processing,” [38] J.L. Beckmann, A. Halverson, R. Krishnamurthy, and J.F. Proc. 32nd Int’l Conf. Very Large Data Bases (VLDB ’06), pp. 475-486, Naughton, “Extending RDBMSs to Support Sparse Datasets Using 2006. an Interpreted Attribute Storage Format,” Proc. 22nd Int’l Conf. [13] G. Salton, Automatic Text Processing: The Transformation, Analysis, Data Eng. (ICDE ’06), p. 58, 2006. and Retrieval of Information by Computer. Addison-Wesley, 1989. [39] D.J. Abadi, A. Marcus, S. Madden, and K.J. Hollenbach, “Scalable [14] F. Liu, C.T. Yu, W. Meng, and A. Chowdhury, “Effective Keyword Semantic Web Using Vertical Partitioning,” Search in Relational Databases,” Proc. ACM SIGMOD Int’l Conf. Proc. 33rd Int’l Conf. Very Large Data Bases (VLDB ’07), pp. 411-422, Management of Data (SIGMOD ’06), pp. 563-574, 2006. 2007. SONG ET AL.: MATERIALIZATION AND DECOMPOSITION OF DATASPACES FOR EFFICIENT SEARCH 1887

[40] D. Abadi, S. Madden, and N. Hachem, “Column-Stores Vs. Row- Shaoxu Song is currently working toward the Stores: How Different are They Really?” Proc. ACM SIGMOD Int’l PhD degree in the Department of Computer Conf. Management of Data (SIGMOD ’08), 2008. Science and Engineering, The Hong Kong [41] S. Chaudhuri, V. Ganti, and R. Kaushik, “A Primitive Operator for University of Science and Technology, Hong Similarity Joins in Data Cleaning,” Proc. 22nd Int’l Conf. Data Eng. Kong. His research interests include data quality (ICDE ’06), p. 5, 2006. and data dependency. He is a student member [42] A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-Similarity of the IEEE. Joins,” Proc. 32nd Int’l Conf. Very Large Data Bases (VLDB ’06), pp. 918-929, 2006. [43] E. Ukkonen, “Approximate String Matching with q-Grams and Maximal Matches,” Theoretical Computer Science—Selected Papers of the Combinatorial Pattern Matching School, vol. 92, no. 1, pp. 191- 211, 1992. Lei Chen received the BS degree in computer [44] A.P. de Vries, N. Mamoulis, N. Nes, and M.L. Kersten, “Efficient science and engineering from Tianjin University, k-NN Search on Vertically Decomposed Data,” Proc. ACM China, in 1994, the MA degree from Asian SIGMOD Int’l Conf. Management of Data (SIGMOD ’02), pp. 322- Institute of Technology, Thailand, in 1997, and 333, 2002. the PhD degree in computer science from [45] G. Gou, M. Kormilitsin, and R. Chirkova, “Query Evaluation University of Waterloo, Canada, in 2005. He is Using Overlapping Views: Completeness and Efficiency,” Proc. now an associate professor in the Department of ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 37- Computer Science and Engineering at Hong 48, 2006. Kong University of Science and Technology. His [46] S. Chaudhuri, M. Datar, and V.R. 
Narasayya, “Index Selection for research interests include uncertain databases, Databases: A Hardness Study and a Principled Heuristic graph databases, multimedia and time series databases, and sensor Solution,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, and peer-to-peer databases. He is a member of the IEEE. pp. 1313-1323, Nov. 2004. [47] S. Agrawal, S. Chaudhuri, and V.R. Narasayya, “Automated Mingxuan Yuan received the BSc degree in Selection of Materialized Views and Indexes in Sql Databases,” 2002 and MSc degree in 2006, both in computer Proc. 26th Int’l Conf. Very Large Data Bases (VLDB ’00), pp. 496-505, science and engineering, from Xi’an Jiaotong 2000. University, China. He is currently working toward [48] G. Valentin, M. Zuliani, D.C. Zilio, G.M. Lohman, and A. Skelley, the PhD degree in the Department of Computer “DB2 Advisor: An Optimizer Smart Enough to Recommend Its Science and Engineering at Hong Kong Uni- Own Indexes,” Proc. 16th Int’l Conf. Data Eng., pp. 101-110, 2000. versity of Science and Technology, China. His [49] R. Chirkova and C. Li, “Materializing Views with Minimal Size to research interests include privacy problems of Answer Queries,” Proc. 22nd ACM SIGMOD-SIGACT-SIGART social networks. Symp. Principles of Database Systems (PODS ’03), pp. 38-48, 2003. [50] C. Heeren, H.V. Jagadish, and L. Pitt, “Optimal Indexing Using Near-Minimal Space,” Proc. 22nd ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS ’03), pp. 244-251, 2003. [51] K. Aouiche and J. Darmont, “Data Mining-Based Materialized . For more information on this or any other computing topic, View and Index Selection in Data Warehouses,” J. Intelligent please visit our at www.computer.org/publications/dlib. Information Systems, vol. 33, no. 1, pp. 65-93, 2009. [52] N.Lester,A.Moffat,andJ.Zobel,“FastOn-LineIndex Construction by Geometric Partitioning,” Proc. 14th ACM Int’l Conf. Information and (CIKM ’05), pp. 776- 783, 2005. [53] N. 
Mamoulis, “Efficient Processing of Joins on Set-Valued Attributes,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’03), pp. 157-168, 2003. [54] S. Idreos, M.L. Kersten, and S. Manegold, “Database Cracking,” Proc. Conf. Innovative Data Systems Research (CIDR), pp. 68-78, 2007. [55] S. Idreos, M.L. Kersten, and S. Manegold, “Updating a Cracked Database,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’07), pp. 413-424, 2007.