IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 12, DECEMBER 2011

Materialization and Decomposition of Dataspaces for Efficient Search

Shaoxu Song, Student Member, IEEE, Lei Chen, Member, IEEE, and Mingxuan Yuan

Abstract—Dataspaces consist of large-scale heterogeneous data. A query interface for accessing tuples should be provided as a fundamental facility by practical dataspace systems. Previously, an efficient index has been proposed for queries with keyword neighborhood over dataspaces. In this paper, we study the materialization and decomposition of dataspaces in order to improve query efficiency. First, we study views of items, which are materialized so that they can be reused by queries. When a set of views is materialized, the problem arises of selecting some of them as the optimal plan with the minimum query cost. Efficient algorithms are developed for query planning and view generation. Second, we study partitions of tuples for answering top-k queries. Given a query, we can evaluate the score bounds of the tuples in partitions and prune those partitions whose bounds are lower than the scores of the current top-k answers. We also provide a theoretical analysis of query cost and prove that query efficiency cannot always be improved by increasing the number of partitions. Finally, we conduct an extensive experimental evaluation to illustrate the superior performance of the proposed techniques.

Index Terms—Dataspaces, materialization, decomposition.

1 INTRODUCTION

DATASPACES have recently been proposed [1], [2] to provide a co-existing system of heterogeneous data. The importance of dataspace systems has already been recognized and emphasized in handling heterogeneous data [3], [4], [5], [6], [7]. In fact, examples of interesting dataspaces are now prevalent, especially on the Web [3].

For example, Google Base1 is a very large, self-describing, semistructured, heterogeneous database. We illustrate several dataspace tuples with attribute values in Fig. 1: each entry Ti consists of several attributes with corresponding values and can be regarded as a tuple in dataspaces. Due to the heterogeneity of the data, which are contributed by users around the world, the data set is extremely sparse. According to our observations, there are in total 5,858 attributes in 307,667 tuples (random samples), while most of these tuples individually have fewer than 30 attributes.

Another example of dataspaces comes from Wikipedia,2 where each article usually has a tuple with some attributes and values that describe the basic structured information of the entry. For instance, a tuple describing the Nikon Corporation may contain attributes like {(founded: Tokyo Japan 1917), (industry: imaging), (products: cameras), ...}. Such tuples can not only be found in article entries but also be mined by advanced tools such as [8] in the DBpedia project.3 Again, the attributes of tuples in different entries vary, while each tuple may only contain a limited number of attributes. Thereby, all these tuples from heterogeneous sources form a huge dataspace in Wikipedia.

Due to the heterogeneous data, there exist matching correspondences among attributes in dataspaces. For example, the matching correspondence between the attributes manu and prod could be identified in Fig. 1, since both of them specify similar information about the manufacturer of products. Such attribute correspondences are often recognized by schema mapping techniques [9]. In dataspaces, a pay-as-you-go style [5] is usually applied to gradually identify these correspondences according to users' feedback when necessary.

Once the attribute correspondences are recognized, the keywords in attributes with correspondences are said to be neighbors in the schema level. For example, the keywords Apple in the attributes manu and prod are neighbor keywords, since manu and prod have a correspondence. Consequently, a query with keyword neighborhood in the schema level [10] should not only search the keywords in the attributes specified in the query, but also match the neighbor keywords in the attributes with correspondences. For example, a query predicate (manu : Apple) should search the keyword Apple in both the attributes manu and prod, according to the correspondence between manu and prod.

To support efficient queries on dataspaces, Dong and Halevy [10] utilize the encoding of attribute-keywords as items and extend the inverted index to answer queries. Specifically, each distinct attribute name and value pair is encoded by a unique item. For instance, (manu : Apple) is denoted by the item I1. Then, each tuple can be represented by a set of items. Similarly, the query input can be encoded in the same way. Since the data are extremely sparse, the inverted index can be built on items to support efficient query answering.

1. http://base.google.com/.
2. http://www.wikipedia.org/.
3. http://dbpedia.org/.

The authors are with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. E-mail: {sshaoxu, leichen, mingxuan}@cse.ust.hk.
Manuscript received 4 Dec. 2009; revised 20 Apr. 2010; accepted 11 June 2010; published online 26 Oct. 2010. Recommended for acceptance by N. Bruno. IEEECS Log Number TKDE-2009-12-0821. Digital Object Identifier no. 10.1109/TKDE.2010.213.

1041-4347/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

Fig. 1. Example of dataspaces.

In this paper, from a different aspect of query optimization, we study the materialization and decomposition of dataspaces. The idea of improving the efficiency of queries with keyword neighborhood in the schema level follows two intuitions: 1) the reuse of contents of a query, and 2) the pruning of contents for a query.

Motivated by the neighbor keywords that are queried together, we study the materialization of views of items in order to reuse the computation. Intuitively, due to the correspondence of attributes, keywords in neighborhood in the schema level are always searched together in the same predicate query. For example, a query on (manu : Apple) will always search (prod : Apple) as well. Therefore, we can cache the search results of (manu : Apple) and (prod : Apple) as a materialized view in dataspaces. Such view results can be reused by different queries. When multiple views are available, this leads us to the problem of selecting the optimal query plans on materialized views.

To answer top-k queries, we study the pruning of unqualified partitions of tuples. Specifically, tuples in dataspaces are divided into a set of nonoverlapping groups, namely, partitions. When a query comes, we develop the score bounds of the tuples in partitions. After processing the tuples in some partitions, if the current top-k answers have higher scores than the bounds of the remaining partitions, then we can safely prune these remaining partitions without evaluating their tuples.

1.1 Contribution

To the best of our knowledge, this is the first work studying the materialization and decomposition of dataspaces for efficient search. Following the previous work by Dong and Halevy [10], the attribute-keyword model is also utilized in this study. Although our techniques are motivated by queries with keyword neighborhood in the schema level in dataspaces, the proposed idea of materialization and decomposition is also generally applicable to attribute-keyword search over structured and semi-structured data. Our main contributions in this paper are summarized as follows:

1. We study the query planning on item views that are materialized in dataspaces. The materialization scheme in dataspaces is first introduced, based on which we can select a plan with minimum cost for a query. The optimal planning problem can be formulated as an integer linear programming problem. Thereby, we investigate greedy algorithms to select a near optimal query plan, with relative error bounds on the query cost.

2. We discuss the generation of item views to minimize the query costs. Obviously, the more materialized views there are, the better the query performance is. However, real scenarios usually have a constraint on the maximum available disk space for materialization. Thereby, we also study greedy heuristics to generate views that can possibly provide low cost query plans.

3. We propose the decomposition of dataspaces to support efficient top-k queries. The decomposition scheme in dataspaces is first introduced, where tuples are divided into nonoverlapping partitions. The score bounds for the tuples in a partition with respect to a query are theoretically proved. Safe pruning is then developed based on these score bounds in partitions. It is notable that we are not proposing a new top-k ranking method. Instead, our partitioning technique is regarded as complementary to the previous merge operators. Thereby, advanced merge methods, such as the TA family methods [11], [12], can cooperate with our approaches, as presented in the experiments.

4. We develop a theoretical analysis of the cost of querying with partitions. We provide the analysis of pruning rate and query cost by using the self-similarity property, which is also verified by our experimental observations. According to the cost analysis, we cannot always improve the query efficiency by increasing the number of partitions. The generation of partitions is also discussed according to the cost analysis.

5. We report an extensive experimental evaluation. Both the materialization of item views and the decomposition of tuple partitions are evaluated in querying over real data sets. In particular, the decomposition techniques can significantly improve the query time performance. Moreover, the hybrid approach, which combines views and partitions together, can always achieve the best performance and scales well under large data sizes. In addition, the experimental results also verify the conclusions of our cost analysis, that is, we can improve the query performance by increasing the number of views but not the number of partitions.

The remainder of this paper is organized as follows: first, we introduce the preliminaries of this study in Section 2. Section 3 develops the planning of queries with materialization on views of items. In Section 4, we propose the pruning on partitions for merging and answering top-k queries. Section 5 reports our extensive experimental evaluation. We discuss the related work in Section 6. Finally, Section 7 concludes this paper.

2 PRELIMINARY

In this section, we introduce some preliminary settings of existing work, including the query and index of dataspaces. The notations frequently used in this paper are listed in Table 1.

TABLE 1
Notations

2.1 Data

We first introduce the model used to represent the data. Following the encoding system presented in [10], we use pairs of (attribute : keyword) to represent the content of a tuple. For example, the attribute value (manu : Apple Inc.) can be represented by {(manu : Apple), (manu : Inc.)}, if each word is considered as a keyword. Let item Ii be a unique identifier of a distinct pair. We can then represent each tuple T as a set of items, that is, T = {I1, I2, ..., I|T|}.

Assume that I is the set of all the items in dataspaces. We use the vector space model [13] to logically represent the tuples.

Definition 2.1 (Tuple Vector). Given a tuple T, the corresponding tuple vector is given by

t = (t1, t2, ..., t|I|),    (1)

where ti denotes the weight of item Ii in the tuple T, having 0 ≤ ti ≤ 1.

For example, the weight ti = 1 denotes Ii ∈ T; otherwise, ti = 0 means Ii ∉ T. Advanced weighting schemes, such as term frequency and inverse document frequency [13] in information retrieval, can also be applied. Without loss of generality, we adopt the tf*idf score in this work.

2.2 Attribute Correspondence

The correspondence between two attributes (e.g., manu versus prod) is often recognized by schema mapping techniques [9]. The main principles of such techniques include data instance matching, linguistic matching of the schema element names, schema structural similarities, and domain knowledge including user feedback (see [9] for a survey). In dataspaces, the matching correspondences between attributes are often incrementally recognized in a pay-as-you-go style [5], e.g., gradually identified according to users' feedback when necessary.

Let Ai, Bi be two attributes with matching correspondence, denoted by Ai ↔ Bi. Any keywords wi appearing in Ai, Bi are said to be neighbors. For instance, consider a matching correspondence of the attributes manu ↔ prod. It states that keywords wi appearing in manu and prod are neighbor keywords, e.g., (manu : Apple) and (prod : Apple). Since the correspondence between an attribute and itself is straightforward, a keyword can always be regarded as a neighbor to itself, such as (prod : Apple) and (prod : Apple).

2.3 Query

In this paper, we consider queries with a set of attribute and keyword predicates, e.g., (manu : Apple) and (post : Infinite). Thus, the query inputs can be represented in the same way as tuples in dataspaces. As discussed in [10], a query with keyword neighborhood in the schema level over dataspaces should not only consider tuples with matched keywords on the attributes specified in the query, but also extend to the attributes with correspondence according to the keyword neighborhood. For example, we consider a query

Q = {(manu : Apple), (post : Infinite)}.

The query evaluation searches not only the manu and post attributes specified in the query, but also the attributes prod and addr, according to the attribute correspondences manu ↔ prod and addr ↔ post, respectively.

Definition 2.2. A disjunctive query with keyword neighborhood in the schema level, Q = {(A1 : w1), ..., (A|Q| : w|Q|)}, specifies a set of attribute-keyword predicates. It returns all the tuples T in dataspaces with neighbor keywords to Q with respect to attribute correspondence, i.e., for an attribute Ai of Q, we can find a Bi of T such that Ai ↔ Bi and (Bi : wi) ∈ T.

Obviously, there may exist multiple attributes Bi associated with an attribute Ai according to Ai ↔ Bi. The disjunctive query only needs one of them to be true, i.e., it considers the "OR" logical operator between different attribute matching correspondences:

⋁_{Ai ↔ Bi} (Bi : wi) ∈ T.

For instance, consider again Q = {(manu : Apple), (post : Infinite)}. According to the attribute matching correspondences manu ↔ prod and addr ↔ post, a tuple T is considered as a candidate answer if T either contains the keyword Apple in manu or prod, or contains Infinite in post or addr. Therefore, we can evaluate the query by finding all tuples T such that

((manu : Apple) ∈ T ∨ (prod : Apple) ∈ T) ∨ ((post : Infinite) ∈ T ∨ (addr : Infinite) ∈ T).

Let Q̂ be the neighbor predicates of a query Q, i.e., the set of items with keyword neighborhood in the schema level to the query Q,

Q̂ = {(Bi : wi) | Bi ↔ Ai, (Ai : wi) ∈ Q}.

The disjunctive query returns tuples T that match at least one predicate in Q̂. For example, we have the neighbor predicates Q̂ for the above query Q = {(manu : Apple), (post : Infinite)} as follows:

Q̂ = {(manu : Apple), (prod : Apple), (post : Infinite), (addr : Infinite)}.

Let q be the corresponding tuple vector of the neighbor predicates Q̂ of a query Q, where qi = 1 for Ii ∈ Q̂ and 0 otherwise. Then, the ranking score between any tuple T and the query Q can be computed by score aggregation functions on the neighbor predicates. Without loss of generality, we should support any scoring function that satisfies monotonicity [11]. For example, we can consider the intersection of the vectors of the tuple T and the neighbor predicates Q̂, which is widely used in keyword search studies [14]:

score(Q̂, T) = ‖q ∩ t‖ = Σ_{i=1}^{|I|} qi ti = Σ_{Ii ∈ Q̂} ti.    (2)

Therefore, to evaluate the ranking score between Q and T, we are essentially required to compute Σ_{Ii ∈ Q̂} ti with respect to the neighbor predicates Q̂.

It is notable that different attribute correspondences of an attribute require an OR operator. This "OR" semantics for the query with keyword neighborhood in the schema level is different from CompleteSearch [15]. Specifically, CompleteSearch indicates an "AND" operator over predicates. For example, in CompleteSearch, the query (manu : Apple), (prod : Apple) will return tuples containing both (manu : Apple) and (prod : Apple). Tuples that contain one of the predicates, or none of them, are directly ignored. Instead, in a dataspace query with the OR operator, tuples having only part of the predicates will also be considered and ranked as candidates. Such OR logical semantics are necessary for querying with keyword neighborhood on attributes with correspondence, since more than one attribute may be associated with an attribute according to attribute correspondence, and the query only needs one of them to be matched with respect to neighbor keywords.

2.4 Index

Indexing of dataspaces has been studied by Dong and Halevy [10], who extend the inverted index for dataspaces. The inverted index, also known as inverted files or inverted lists [16], [17], [18], consists of a vocabulary of items I and a set of inverted lists. Each item Ii corresponds to an inverted list of tuple IDs, usually sorted in a certain order, where each ID reports the item weight ti in that tuple.

Fig. 2. Indexing dataspaces.

In Fig. 2, we use an example to illustrate the index framework. The data set consists of 10 tuples (denoted by 1-10) with an item vocabulary I = {(A : a), (B : a), ..., (L : g)}, having |I| = 12. In the inverted lists, for each item (an attribute and keyword pair such as (manu : Apple)), we have a pointer referring to a specific list of tuple IDs where the item appears. For instance, Fig. 2b shows an example of the inverted list of the item (D : d), which indicates that the keyword d appears in the attribute D of tuples 2, 3, 5, 8, 10. In the real implementation, each tuple ID in the list is associated with a weight value ti.

Definition 2.3. Consider the neighbor predicates Q̂ of a query Q. Let L be the set of lists corresponding to the items in the neighbor predicates Q̂, respectively. The merge operator returns a new list of tuples, by merging the lists in L, with score(Q̂, T) on each tuple T.

For example, consider a query Q = {(A : a), (C : c), (D : d)}. Suppose that there exists an attribute correspondence A ↔ B. Thereby, we have the neighbor predicates Q̂ = {(A : a), (B : a), (C : c), (D : d)}. The merge operator computes Σ_{Ii ∈ Q̂} ti for each tuple T that appears in the lists of the items (A : a), (B : a), (C : c), (D : d).

Let ci = O(si) = α si + r, where α is a constant, si is the size of the list of item Ii, and r is a constant random access time. Then, the merging cost c can be estimated by Σ_{Ii ∈ Q̂} ci. Advanced merge methods, such as the threshold algorithm (TA) [19], the combined algorithm (CA) [11], or IO-Top-K [12], can directly be applied to merge inverted lists. When such TA-family methods are utilized, the above Σ_{Ii ∈ Q̂} ci is then an upper bound on the estimated merge cost. In the following, instead of proposing a new merge operator for inverted lists, we focus on advanced techniques that are built upon the available merge operators to further improve the query efficiency.

3 PLANNING WITH MATERIALIZATION

According to the attribute correspondence, e.g., addr ↔ post, each query predicate on the attribute addr has to conduct a search on the attribute post as well. In other words, the neighbor keywords on addr, post are often searched together in predicate queries with keyword neighborhood in the schema level. Intuitively, we would like to cache these query results for reuse.

In relational databases, a view consists of the result set of a query, which can be materialized to optimize queries. Similarly, in this study, we introduce views on sets of items (query predicates) in dataspaces to speed up query processing. Specifically, the merge results of item sets are materialized, and then queries can utilize these materialized views. This raises two questions: 1) how to select the views of items to materialize, and 2) how to utilize the materialized views to minimize the query cost.

Note that the techniques discussed below are also applicable to attribute-keyword search over structured and semi-structured data, since the data in dataspaces are modeled by attribute-keywords as well. However, due to the OR operator over predicates that has to be considered for queries with keyword neighborhood in the schema level, our current techniques can only support general attribute-keyword queries with the OR operator.
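To make the expansion to neighbor predicates and the merge operator of Definition 2.3 concrete, here is a minimal sketch; the data structures and names are illustrative, not the authors' implementation.

```python
# Sketch: expand a query to its neighbor predicates via attribute
# correspondences, then merge the inverted lists, summing item weights
# per tuple as in the score of (2) (OR semantics).
from collections import defaultdict

def neighbor_predicates(query, correspondences):
    """query: set of (attr, keyword); correspondences: attr -> set of matched attrs."""
    q_hat = set(query)
    for attr, word in query:
        for b in correspondences.get(attr, ()):
            q_hat.add((b, word))
    return q_hat

def merge(q_hat, inverted_lists):
    """inverted_lists: item -> list of (tuple_id, weight). Returns tuple_id -> score."""
    scores = defaultdict(float)
    for item in q_hat:
        for tid, w in inverted_lists.get(item, ()):
            scores[tid] += w          # OR semantics: every matched predicate adds
    return scores

lists = {
    ("manu", "Apple"): [(1, 1.0)],
    ("prod", "Apple"): [(2, 1.0)],
    ("post", "Infinite"): [(1, 1.0), (3, 1.0)],
}
corr = {"manu": {"prod"}, "post": {"addr"}}
q = {("manu", "Apple"), ("post", "Infinite")}
print(sorted(merge(neighbor_predicates(q, corr), lists).items()))
# -> [(1, 2.0), (2, 1.0), (3, 1.0)]: tuple 1 matches two predicates
```

In a real system the sequential scan over each list would be replaced by a TA-family merge, as noted above; the sketch only illustrates the score bookkeeping.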

Fig. 3. Materialization on views of items.

3.1 Materialization

We first introduce the concept of materialization. Let a view V be a set of items (attribute-keyword pairs). By applying the merge operator, we obtain a new list of tuples with the corresponding score(V, T). This list is stored on disk as the materialization of the view V.

In Fig. 3, we show an example of materialized lists of item views. For instance, we have neighbor keywords a in (A : a) and (B : a) according to the attribute correspondence A ↔ B. The first view, denoted by V1 = {(A : a), (B : a)}, materializes the merge results of the lists corresponding to the items (A : a) and (B : a) in the example of Fig. 2. Items that frequently appear together may also be materialized, e.g., V4 = {(D : d), (E : e)}, where (D : d) and (E : e) may frequently appear together in tuples or queries.

Let V denote the view scheme, i.e., the set of views that are materialized. Since all the original items are already stored, we can treat each item as a single-item view, that is, I ⊆ V.

3.2 Standard Query Plan

Given a query Q, a query plan P of Q is a set of views, P ⊆ V, which can be used to evaluate score(Q̂, T) for each possible tuple T. Since various views are available in a view scheme V, this leads us to study the selection of the optimal plan P with the minimum query cost.

3.2.1 Formalization

We first formalize the definition of a query plan P. Let vij = 1 denote that view Vj contains item Ii; otherwise, vij = 0, with i = 1, ..., |I| and j = 1, ..., |V|. Let qi = 1 denote that item Ii is contained in the neighbor predicates Q̂ of a query Q; otherwise, qi = 0, with i = 1, ..., |I|.

Definition 3.1. Given a query Q, a feasible standard plan P is a subset of all views, P ⊆ V, having

Σ_j vij xj = qi,  i = 1, ..., |I|,

where xj = 1 if Vj ∈ P and xj = 0 if Vj ∉ P, for j = 1, ..., |V|.

Fig. 4. Plan selection.

For example, consider a query Q = {(A : a), (C : c), (D : d)} in Fig. 4. According to the attribute correspondence A ↔ B, we have Q̂ = {(A : a), (B : a), (C : c), (D : d)}. Suppose that the view V1 = {(A : a), (B : a)} is materialized. A feasible standard plan is then P = {V1, (C : c), (D : d)}.

The constraint Σ_j vij xj = qi semantically denotes that the union of the views Vj ∈ P is exactly the neighbor predicates Q̂ of the query Q, i.e., ⋃_{Vj ∈ P} Vj = Q̂.
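Because a feasible standard plan consists of disjoint views whose union is exactly Q̂, the per-view scores add up to score(Q̂, T). A quick numeric check, with illustrative views and weights:

```python
# Numeric check that a feasible standard plan (disjoint views whose union
# is exactly Q_hat) lets per-view scores add up to score(Q_hat, T).
# The views, the tuple, and its weights below are illustrative.

q_hat = {("A", "a"), ("B", "a"), ("C", "c"), ("D", "d")}
V1 = {("A", "a"), ("B", "a")}                    # a materialized view
plan = [V1, {("C", "c")}, {("D", "d")}]          # feasible: disjoint, union == Q_hat

t = {("A", "a"): 1.0, ("D", "d"): 1.0}           # one tuple T as item -> weight

def score(view, t):
    return sum(w for item, w in t.items() if item in view)

assert set().union(*plan) == q_hat               # the plan covers Q_hat exactly
total = sum(score(v, t) for v in plan)
print(total == score(q_hat, t))                  # -> True
```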

During the query evaluation, the lists corresponding to the views in P are merged using the merge operator in Definition 2.3. It is notable that a feasible plan requires Σ_j vij xj = qi as stated in Definition 3.1, i.e., no duplicate items in P. As presented in the following, this requirement is necessary for computing the ranking scores of tuples. We first prove that the standard plan can evaluate score(Q̂, T) for each possible tuple T.

Lemma 1. Let P be a feasible query plan for a query Q. For any tuple T, we have score(Q̂, T) = Σ_{V ∈ P} score(V, T).

Proof. According to the score function in (2), we have

Σ_{Vj ∈ P} score(Vj, T) = Σ_j xj Σ_i vij ti = Σ_i (Σ_j xj vij) ti = Σ_i qi ti = score(Q̂, T),

where i = 1, ..., |I| and j = 1, ..., |V|. □

In fact, we can further develop the following properties of feasible plans.

Lemma 2. For any view V in a feasible standard plan P, we have V ⊆ Q̂.

Proof. Assume that there exists an item Ii having Ii ∈ V but Ii ∉ Q̂. Thus, we have Σ_j vij xj ≥ 1 > qi = 0, which contradicts the definition of a feasible P. □

Lemma 3. For any two views V1 and V2 in a feasible standard plan P, we have V1 ∩ V2 = ∅.

Proof. Assume that there exists an item Ii having Ii ∈ V1 ∩ V2. Thus, we have Σ_j vij xj ≥ 2 > 1 ≥ qi, which contradicts the definition of a feasible P. □

3.2.2 Optimal Plan

First, recall that all the original items are already materialized, i.e., I ⊆ V. Therefore, given any query Q, a feasible plan always exists, that is, P = Q̂. Next, we study the selection of query plans with the minimum cost. Let cj be the cost of retrieving the materialized list of view Vj, as defined in Section 2. Then, the cost of a plan can be estimated by Σ_j cj xj.

Definition 3.2. The problem of selecting the optimal standard query plan is to determine the x for P, having

minimize Σ_j cj xj
subject to Σ_j vij xj = qi,  i = 1, ..., |I|,
xj ∈ {0, 1},  j = 1, ..., |V|,

which is a 0-1 integer programming or binary integer programming problem.

Unfortunately, the 0-1 integer programming problem with nonnegative data is equivalent to the set cover problem, which is NP-complete [20]. Therefore, we explore approximate solutions by a greedy algorithm. (Advanced approximation approaches for solving the binary integer programming problem can also be adopted [21], which is not the focus of this paper.)

3.2.3 Greedy Algorithm

Intuitively, in each step of adding a view into P, we can greedily select the view Vj with the minimum cost per item unit, i.e., the minimum ratio cj/|Vj|, where cj denotes the cost of Vj and |Vj| denotes the size of the view Vj, e.g., |Vj| = Σ_{Ii ∈ Vj} si.

According to Lemma 2, not all the views Vj ∈ V should be considered for a specific Q. Instead, as presented in line 4 of Algorithm 1, we only need to evaluate the views that are contained in the neighbor predicates Q̂, i.e., Vj ⊆ Q̂. Moreover, since any two views in a feasible plan are nonoverlapping (Lemma 3), we can remove the items of the currently selected view Vk from Q̂ in each step and stop when Q̂ = ∅.

Algorithm 1. Standard Planning SP(Q)
1: P := ∅
2: Q̂ := neighbor predicates of Q according to attribute correspondence
3: while Q̂ ≠ ∅ do
4:   k := arg min_j cj/|Vj|, s.t. Vj ⊆ Q̂
5:   P := P ∪ {Vk}
6:   Q̂ := Q̂ \ Vk
7: return P

Let d be the size of the largest view Vj ⊆ Q̂ and Hd be the dth harmonic number, Hd = Σ_{k=1}^{d} 1/k. Then, the relative error of the greedy approximation is bounded as follows:

Corollary 1 [22]. The cost of the plan returned by the greedy algorithm SP is at most Hd times the cost of the optimal plan.

3.3 General Query Plan

Note that a standard plan only contains views that are subsets of the neighbor predicates Q̂ of a query Q. However, views with items not in Q̂ can be used in the query evaluation as well. For example, in Fig. 4, we can also utilize the view V2 by removing the item (E : e) (not requested by Q̂) from V2. Moreover, if both V2 and V4 are considered, then the item (D : d) will be counted twice in the score function, which contradicts the correctness in Lemma 1. Therefore, we need to deduct items such as the nonrequested (E : e) or the duplicate (D : d). The operator removing the corresponding lists is defined as follows, namely, the negative merge operator.

Definition 3.3. Consider a set of items, e.g., V. Let L be the set of lists corresponding to the items in V, respectively. The negative merge operator returns a new list of tuples, by merging the lists in L, with negative-score(V, T) on each tuple T.

It is notable that the negative merge employs negative score values, which still satisfy the monotonicity of score functions. Therefore, TA family methods [11] can still be utilized for the negative merge. As illustrated in the following, lists with both positive scores and negative scores in a general query plan are merged together using a TA-style algorithm.

We define the general query plan with both the merge operator and the negative merge operator. Let vij and qi have the same semantics as in the standard plan of Definition 3.1.

Definition 3.4. Given a query Q, a general plan P consists of two subsets of all views, P⁺ ⊆ V and P⁻ ⊆ V, having

Σ_j vij (xj⁺ − xj⁻) = qi,  i = 1, ..., |I|,

where xj⁺ = 1 if Vj ∈ P⁺ and 0 otherwise, and xj⁻ = 1 if Vj ∈ P⁻ and 0 otherwise, for j = 1, ..., |V|.

Similar to Lemma 1, we can also prove that score(Q̂, T) = Σ_{V ∈ P⁺} score(V, T) + Σ_{V ∈ P⁻} negative-score(V, T) for the general plan P.
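Going back to Algorithm 1, its greedy selection can be sketched as follows. Views and costs here are illustrative; in particular, since every single item is itself a view, a candidate always exists and the loop makes progress.

```python
# Sketch of the greedy standard planning SP (Algorithm 1): repeatedly
# pick the view fully contained in the remaining Q_hat with the smallest
# cost-per-item ratio. The views and costs below are illustrative.

def sp(q_hat, views, cost):
    """views: list of frozensets of items; cost: dict view -> retrieval cost."""
    q = set(q_hat)
    plan = []
    while q:
        candidates = [v for v in views if v and v <= q]
        if not candidates:          # single-item views guarantee progress
            raise ValueError("item views must be in the view scheme")
        best = min(candidates, key=lambda v: cost[v] / len(v))
        plan.append(best)
        q -= best
    return plan

items = ["Aa", "Ba", "Cc", "Dd"]
views = [frozenset({i}) for i in items] + [frozenset({"Aa", "Ba"})]
cost = {v: 10 * len(v) for v in views}
cost[frozenset({"Aa", "Ba"})] = 12      # one merged list, cheaper than two
plan = sp(set(items), views, cost)
print(sorted(len(v) for v in plan))     # -> [1, 1, 2]: the pair view is chosen
```

The `arg min` of line 4 is realized by the `min(..., key=...)` call; a production version would index views by item to avoid rescanning the whole scheme each iteration.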
During the query evaluation, the merge operator and the negative merge operator are then conducted on P⁺ and P⁻, respectively. For example, a general plan for the query Q can be P = {V2⁺, (C : c)⁺, (E : e)⁻}. The above definition specifies the constraint that the union of the views in P⁺ minus the union of the views in P⁻ is exactly the Q̂ of the query Q, i.e., (⋃_{Vj ∈ P⁺} Vj) \ (⋃_{Vj ∈ P⁻} Vj) = Q̂. In fact, the standard plan is a special case of the general plan, where P⁻ = ∅.

3.3.1 Optimal Plan

We then introduce the problem of selecting the optimal general plan with the minimum cost. The only difference between the two kinds of merge operators is the sign of their output scores, while the cost of the negative merge operator is actually the same as that of the merge operator. Thereby, we can estimate the cost of a general plan P by Σ_j (cj xj⁺ + cj xj⁻).

Definition 3.5. The problem of selecting the optimal general query plan is to determine the x⁺ for P⁺ and the x⁻ for P⁻, having

minimize Σ_j (cj xj⁺ + cj xj⁻)
subject to Σ_j vij (xj⁺ − xj⁻) = qi,  i = 1, ..., |I|,
xj⁺ ∈ {0, 1},  j = 1, ..., |V|,
xj⁻ ∈ {0, 1},  j = 1, ..., |V|,

which is exactly the 0-1 integer programming problem. However, the coefficients of the variables xj⁻ are negative.

Lemma 4. For an optimal general plan P, we have P⁺ ∩ P⁻ = ∅.

Proof. Assume that there exists a view Vj ∈ P⁺ ∩ P⁻. Then, we can build another feasible general plan P1, with P1⁺ = P⁺ \ {Vj} and P1⁻ = P⁻ \ {Vj}, whose cost is less than that of P. □

As proved by Dobson [23], when there are negative entries, it is unlikely that we can guarantee the existence of a polynomial approximation scheme with relative error bounds. Therefore, we study greedy heuristics.
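The positive/negative bookkeeping of a general plan can be checked numerically. The contents of V2 and the weights below are illustrative assumptions chosen to be consistent with the running Fig. 4 example, not values from the paper.

```python
# Check that a general plan's positive and negative parts reproduce the
# neighbor predicates, (union of P+) \ (union of P-) == Q_hat, and that
# subtracting the negative scores cancels the extra item's contribution.

q_hat = {"Aa", "Ba", "Cc", "Dd"}
V2 = frozenset({"Aa", "Ba", "Dd", "Ee"})       # assumed view with extra item Ee
p_plus = [V2, frozenset({"Cc"})]
p_minus = [frozenset({"Ee"})]

t = {"Aa": 1.0, "Ee": 0.5}                     # one tuple T as item -> weight

def score(view, t):
    return sum(w for i, w in t.items() if i in view)

covered = set().union(*p_plus) - set().union(*p_minus)
assert covered == q_hat                        # the constraint of Definition 3.4

total = sum(score(v, t) for v in p_plus) - sum(score(v, t) for v in p_minus)
print(total == score(frozenset(q_hat), t))     # -> True
```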

First, we introduce a virtual (empty) view V0 with cost ^ 3.4.1 Cost-Based Generation c0 ¼ 0. Then each Vj Q can be represented by fVj ;V0 g. We seek the view scheme that can minimize the cost of the Next, we consider possible pairs of fVj ;Vl g that can be used by the query V n V Q^, where j ¼ 1; ...; jVj and query log. Let qik ¼ 1 denote that the item Ii is contained in j l ^ l ¼ 0; ...;j 1;jþ 1; ...; jVj,havingl 6¼ j according to the neighbor predicates Qk of a query Qk; otherwise, not c þc Lemma 4. Let the ratio be j l , which denotes the average containing. Let yjk ¼ 1 denote that the view Vj 2S is jVjnVlj cost of retrieving each unit of items. As presented in expected to be used in the query tuple Qk, no matter Vj is Algorithm 2, similar to the SP algorithm, we can greedily selected in V (xj ¼ 1) or not. select the view pair fVk1 ;Vk2 g with the minimum ratio in Definition 3.6. The problem of generating the optimal view each step. The view Vk1 is considered to be in P , while Vk2 scheme V is to determine a feasible x having, is added into P. X minimize cjxjyjk Algorithm 2. General Planning GP(Q) Xj;k 1: P :¼P :¼; subject to v x y ¼ q ;i¼ 1; ...; jIj;k¼ 1; ...; jQj Q^ : Q ij j jk ik 2: ¼ neighbor predicates of according attribute j correspondence X s x M 3: while Q^ 6¼;do j j j cjþcl ^ 4: ðk1;k2Þ :¼ arg minj;l ;Vj n Vl Q jVjnVlj xj 2f0; 1g;j¼ 1; ...; jSj; 5: P :¼P [ Vk1 yjk 2f0; 1g;j¼ 1; ...; jSj;k¼ 1; ...; jQj: 6: P :¼P [ Vk2 ^ ^ 7: Q :¼ðQ [ Vk ÞnVk 2 1 Then, V¼fVj j xj ¼ 1;Vj 2Sg. 8: return P; P

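The greedy pairing of Algorithm 2 can be sketched as follows. This is a simplified model in which each view is an item set with a retrieval cost; requiring $V_l$ to be a proper subset of $V_j$ is an extra assumption made here to guarantee termination, not a condition stated by the paper:

```python
def general_plan(query_items, views):
    """Greedy general planning (sketch of Algorithm 2).

    views: list of (item_set, cost) pairs. A virtual empty view V0 with
    cost 0 lets a view be selected without a negative counterpart."""
    views = list(views) + [(frozenset(), 0.0)]  # virtual view V0
    remaining = set(query_items)
    plan, plan_minus = [], []
    while remaining:
        best = None
        for vj, cj in views:
            for vl, cl in views:
                if not vl < vj:                  # proper subset to subtract
                    continue
                if not (vj - vl <= remaining):   # Vj \ Vl must lie in Q-hat
                    continue
                ratio = (cj + cl) / len(vj - vl)
                if best is None or ratio < best[0]:
                    best = (ratio, vj, vl)
        if best is None:  # no view pair covers any remaining item
            break
        _, vj, vl = best
        plan.append(vj)
        if vl:
            plan_minus.append(vl)
        remaining = (remaining | vl) - vj        # line 7 of Algorithm 2
    return plan, plan_minus

# query {A, B}: subtracting the cheap single-item view from the large
# view gives the best cost-per-item ratio (2+1)/2 = 1.5
views = [(frozenset("ABC"), 2.0), (frozenset("C"), 1.0),
         (frozenset("A"), 3.0), (frozenset("B"), 3.0)]
plan, plan_minus = general_plan({"A", "B"}, views)
```

Here the plan retrieves the list of {A,B,C} and negatively merges the list of {C}, rather than merging the two expensive single-item lists.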
Corollary 2. The cost of the general plan returned by the GP algorithm is no greater (worse) than the cost of the standard plan returned by the SP algorithm.

Proof. The worst case is $P^- = \emptyset$, i.e., $x_j^- = 0$ for all $j$, which is exactly the solution of the SP algorithm. □

3.4 Generating Views

Now we present how to generate the view scheme $\mathcal{V}$. Obviously, the larger the number of views in $\mathcal{V}$ is, the better the query performance will be. Let $\mathcal{S}$ be the space of all possible views on the item set $I$. The ideal scenario is to materialize all the possible views, i.e., $\mathcal{V} = \mathcal{S}$, when the space for materialization is not limited. However, real applications usually have a constraint on the maximum available disk space, say $M$, for materialization. The problem we address is to determine a $\mathcal{V} \subseteq \mathcal{S}$ with disk space less than $M$.

Similar to the query planning, we also study greedy heuristics to solve this problem. Let $f_j = \sum_k y_{jk}$ be the frequency of $V_j$ in the neighbor predicates of the query log $\mathcal{Q}$. We can then develop the greedy heuristic by the ratio

$$\frac{c_j}{|V_j| \sum_k y_{jk}} = \frac{c_j}{|V_j| f_j}.$$

In each greedy step, we select into $\mathcal{V}$ the view $V_j$ with the minimum ratio, or equivalently, the $V_j$ that can cover the maximum number of items ($|V_j| f_j$) per unit of cost $c_j$. The cost-based view generation algorithm is developed as follows: for the initialization in lines 2-5 of Algorithm 3, we assume that each view $V_j \subseteq \hat{Q}_k$ can possibly be used, i.e., we assign $y_{jk} = 1$. Therefore, we have $f_j{+}{+}$ in line 5 when $V_j \subseteq \hat{Q}_k$. During each greedy step in lines 6-12, once we decide to select a view (say $V_l$) into $\mathcal{V}$, all the other views $V_j$ (having $V_j \subseteq \hat{Q}_k$ and $V_j \cap V_l \neq \emptyset$, according to Lemmas 2 and 3) can no longer be used, i.e., we assign $y_{jk} = 0$. In other words, we should remove such $y_{jk} = 1$ from $f_j$ by $f_j{-}{-}$ in line 12. The program terminates when $\mathcal{Q} = \emptyset$ or the disk space $size(\mathcal{V})$ exceeds the limitation $M$.

Algorithm 3.
View Generation VG(Q)
1: $f_j := 0$
2: for $j : 1 \to |\mathcal{S}|$ do
3:   for $k : 1 \to |\mathcal{Q}|$ do
4:     if $V_j \subseteq \hat{Q}_k$ then
5:       $f_j{+}{+}$
6: while $\mathcal{Q} \neq \emptyset$ and $size(\mathcal{V}) \leq M$ do
7:   $l := \arg\min_j \frac{c_j}{|V_j| f_j}$
8:   $\mathcal{V} := \mathcal{V} \cup V_l$
9:   for $k : 1 \to |\mathcal{Q}|$ do
10:     if $V_l \subseteq \hat{Q}_k$ then
11:       $\hat{Q}_k := \hat{Q}_k \setminus V_l$
12:       $f_j{-}{-}$ for each $V_j$ with $V_j \cap V_l \neq \emptyset$
13: return $\mathcal{V}$

Corollary 3. For any $\mathcal{V}$ generated by the VG algorithm, the total cost of the queries in $\mathcal{Q}$ using the optimal standard plan for each query tuple is no worse than the cost $\sum_{j,k} c_j x_j y_{jk}$.

According to the greedy strategy of selecting views, the first chosen views are the most effective for reuse, while the views ranked lower in the generation may benefit the queries less. Note that some views have extremely low frequency when considering the entire view scheme space. Although the view generation is offline processing, we can select a candidate subset of views from the entire space as $\mathcal{S}$, e.g., those with frequency greater than a threshold in the query log.

3.4.2 Updates

We mainly have two aspects of updates in dataspaces, i.e., updates of tuples and updates of attribute correspondences. For the updates of tuples, we consider inserting and deleting tuples. During the updating, the inverted lists of both the original items and their corresponding materialized views should be addressed. Efficient approaches have already been developed for updating inverted lists [26], which can be applied here as well.

It is notable that the attribute correspondences between attributes in dataspaces are often incrementally recognized in a pay-as-you-go style [5]. The neighbor predicates with respect to the neighbor keywords of the query workload evolve as well. Consequently, this leads to updates of the views in $\mathcal{V}$. For the frequency-based view scheme, it is easy to update the frequency statistics, remove low-frequency item patterns in updated query predicates, and add high-frequency item patterns to $\mathcal{V}$.

Fig. 5. Decomposition on partitions of tuples.

4 MERGING WITH DECOMPOSITION

Instead of searching data in the entire space, we often decompose the data into partitions for efficient top-k queries. During the query processing, those partitions of tuples with low scores to the query are pruned directly without evaluation. In this section, we study the decomposition of dataspaces for efficient querying. Again, we have to address two questions: 1) how to prune the partitions of tuples for top-k answers, and 2) how to generate partitions of tuples that will have less query cost.

Recall that during the evaluation of a query, we rely on the merge operator to merge the lists referred to by the query plan. That is, given a set of lists,5 we study techniques for efficiently merging the lists to return top-k answers. All the tuples are decomposed into a set of partitions, and the merge operator is then applied on each partition of tuples, respectively. Given a query, we can develop the bound of scores of the tuples in each partition. Therefore, those partitions whose score bounds are lower than the top-k answers can be pruned.

It is notable that we utilize previous merge operators, such as the TA family methods [11], to rank the answers in each partition. Instead of proposing a new top-k ranking method, our partitioning technique is regarded as complementary work to the previous merge operators. Thereby, advanced merge methods, such as IO-top-k [12], can be combined with our decomposition.

4.1 Decomposition

Let $\mathcal{H}$ denote a partition scheme, i.e., a set of $m$ nonoverlapping partitions of all the tuples. In other words, each tuple $T$ is assigned to one and only one partition $H_i \in \mathcal{H}, i = 1, \ldots, m$.

Thereby, each list can be decomposed into a set of nonoverlapping sublists of tuples according to the partitions of tuples. For example, as illustrated in Fig. 5, each list can be decomposed into at most $m = 4$ partitions. Some partitions might be empty in a specific list.
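The decomposition of a list into partition sublists can be illustrated with a small sketch (in-memory lists stand in for the contiguous disk blocks the paper uses; `decompose_list` and its data layout are illustrative assumptions):

```python
def decompose_list(inverted_list, assignment, m):
    """Decompose one inverted list of (tuple_id, weight) entries into
    per-partition sublists, and build the head: for each nonempty
    partition, its id, the maximum item weight in it (the bound h_i),
    and the sublist itself."""
    sublists = [[] for _ in range(m)]
    for tuple_id, weight in inverted_list:
        sublists[assignment[tuple_id]].append((tuple_id, weight))
    head = [
        (pid, max(w for _, w in sub), sub)
        for pid, sub in enumerate(sublists)
        if sub  # empty partitions are omitted, like H2 in the example
    ]
    return head

# a list for one item: tuples 0..3 with weights, assigned to 4 partitions
assignment = {0: 0, 1: 2, 2: 0, 3: 3}
head = decompose_list([(0, 0.9), (1, 0.4), (2, 0.7), (3, 0.5)], assignment, m=4)
# partitions 0, 2, 3 are nonempty; partition 1 holds no tuple of this item
```

The per-partition maximum recorded in the head is exactly the weight bound used later to compute partition score bounds for pruning.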
Those views whose frequencies decrease may be replaced by new frequent views. For the cost-based view scheme, however, we can only rely on batch updates, due to the maintenance of $y_{jk}$ in $f_j$ for each specific view $V_j$. Note that if the updates have to be conducted online, the cost of updating should be considered as well. Consequently, there will be a trade-off between the view update cost and the query cost with respect to the workload. As presented, the generation of views has already been shown to be hard. Therefore, it is highly nontrivial to find optimal updates of views with respect to the balance of update cost and query cost.

For instance, the list of item $(A:a)$, say $I_1$ in Fig. 2b, is decomposed into three partitions, $H_1$, $H_3$, and $H_4$, while $H_2$ is empty. This states that the item $I_1$ does not appear in any tuple of partition $H_2$.

In order to compute the score bounds of the tuples in each partition, we introduce a head structure for each list. The head stores the following information: 1) the partition ID, 2) the bound of item weights in the partition, and 3) the pointer to the start and the offset of the partition in the list. Both the head and the lists of tuple partitions of an item are stored in contiguous disk blocks and can be retrieved in one random access, like the original lists.

5. Corresponding to either original items or views. For simplicity, in the remainder of this section, we use the example of original items, which is the same for views.

4.1.1 Updating

After processing the first $g$ partitions with the highest bounds, we obtain a current top-$k$ answer, say $K$. The following theorem specifies the condition for pruning the next, $(g+1)$st, partition.

Theorem 1. Let $K_k$ be the tuple with the $k$th minimum score in the top-$k$ answers of the previous $g$ steps.
For the next, $(g+1)$st, partition, if we have

$$score(\hat{Q}, K_k) \geq score(\hat{Q}, H_{g+1}), \qquad (6)$$

then the partition $H_{g+1}$ can be safely pruned.

Proof. According to Lemma 5, for any tuple $T$ in the partition $H_{g+1}$, we have $score(\hat{Q}, K_k) \geq score(\hat{Q}, H_{g+1}) \geq score(\hat{Q}, T)$. Since $K_k$ is the tuple with the minimum ranking score in the current top-$k$ results, the tuples in partition $H_{g+1}$ will never be ranked higher than $K_k$ and can be pruned safely without further evaluation. □

During the updating (insertion or deletion of tuples), both the lists and the head information should be updated, including the bounds of weights and also the pointers to the partitions.

4.2 Pruning Top-k Answers

4.2.1 Bound

In order to evaluate the score bounds of the tuples in a partition, we first introduce the formal representation of partitions. Note that each partition also describes a set of items, which appear in the tuples of this partition. Therefore, similar to the tuple vector, each partition can be logically represented by a partition vector of items.

Definition 4.1 (Partition Vector). Let $H$ be a partition in $\mathcal{H}$. The corresponding partition vector is defined by

$$h = (h_1, h_2, \ldots, h_{|I|}), \qquad (3)$$

where $h_i$ is the bound of the weight of item $I_i$ in the partition $H$. Specifically, let $T$ be any tuple in the partition $H$. We have

$$h_i = \max_{T \in H}(t_i), \qquad (4)$$

where $t_i$ is the weight of item $I_i$ in the tuple $T$.

For the remaining partitions $H_{g+2}, H_{g+3}, \ldots$, since the partitions are in decreasing order of score bounds, we have

$$score(\hat{Q}, K_k) \geq score(\hat{Q}, H_{g+1}) \geq score(\hat{Q}, H_{g+x}),$$

where $x = 2, 3, \ldots$. Therefore, we can prune all the remaining partitions, starting from $H_{g+1}$.

For example, consider the query $Q$ with neighbor predicates $\hat{Q} = \{I_1, I_2, I_3, I_4\}$ in Fig. 5. Suppose that the partitions are ordered by their bounds as follows: $H_1, H_3, H_4, H_2$.
After processing the first partition $H_1$, if the current $k$th answer $K_k$ has a score higher than the bound of the next partition $H_3$, then we can prune all the remaining partitions $H_3$, $H_4$, and $H_2$ without evaluating their lists of tuples.

4.2.3 Algorithm

Given a query $Q$ and an integer $k$, the query algorithm is described in Algorithm 4 below. During the initialization, the PARTITIONS($\hat{Q}$) function returns the set of partitions $\mathcal{H}$ ranked in descending order of their score bounds to the query $Q$. Recall that the bound $h_i$ of item $I_i$ in each partition $H$ is recorded in the head structure, as illustrated in Fig. 5. Thus, the bound of scores of each partition can be efficiently computed by merging the heads of all the items referred to in the neighbor predicates $\hat{Q}$ of $Q$.

Algorithm 4. Merge Top-k MT(Q, k)
1: $\hat{Q}$ := neighbor predicates of $Q$ according to attribute correspondence
2: $\mathcal{H}$ := PARTITIONS($\hat{Q}$)
3: $K := \emptyset$
4: for $j : 1 \to \mathcal{H}.size$ do
5:   if $score(\hat{Q}, K_k) \geq score(\hat{Q}, H_j)$ then
6:     break
7:   else
8:     $K' :=$ MERGE($H_j.lists$)
9:     $K :=$ RANK($K, K'$)
10: return $K$

Given a query $Q$, we can compute an intersection score between any partition $H$ and the neighbor predicates $\hat{Q}$ of $Q$,

$$score(\hat{Q}, H) = \lVert q, h \rVert = \sum_{I_i \in \hat{Q}} h_i.$$

As presented in the following, this $score(\hat{Q}, H)$ is exactly the upper bound of the scores of the tuples in the partition $H$.

Lemma 5. Let $T$ be any tuple in a partition $H$; we have

$$score(\hat{Q}, H) \geq score(\hat{Q}, T), \qquad (5)$$

where $Q$ is the query.

Proof. According to the definition of the partition vector in (4), for any item $I_i$, we have $h_i \geq t_i$. Therefore,

$$score(\hat{Q}, H) = \sum_{I_i \in \hat{Q}} h_i \geq \sum_{I_i \in \hat{Q}} t_i = score(\hat{Q}, T).$$

In other words, $score(\hat{Q}, H)$ is the bound of the scores of tuples $T \in H$ to the query $\hat{Q}$. □

When the list of an item $I_i$ is manipulated by the negative merge operator, e.g., in a general query plan, we can assign $h_i = 0$ for each partition $H$. For a tuple $T \in H$ containing item $I_i$, we have $t_i > 0$, i.e., $-t_i < 0$ in the negative merge. Moreover, for a tuple $T' \in H$ that does not contain the item $I_i$, we have $t_i = -t_i = 0$.
Therefore, we can assign $h_i = 0$ for this item $I_i$ for simplicity.

4.2.2 Pruning

Next, we can order the partitions in decreasing order of their upper score bounds to the query. Let $score(\hat{Q}, K_k)$ be the $k$th largest score in the current top-$k$ answers $K$. For the partition $H_j$, if $score(\hat{Q}, K_k)$ is larger than the bound $score(\hat{Q}, H_j)$ of the tuple scores of the partition, then the partition $H_j$ and all the remaining partitions can be pruned.

Otherwise, we merge and rank all the tuples in the partition $H_j$. The MERGE() function is the implementation of the merge operator introduced in Definition 2.3 or Definition 3.3.

4.3 Cost Analysis

Suppose that we have $m = |\mathcal{H}|$ partitions on the $n$ tuples in the dataspaces. We analyze the costs of disk space and query time.

4.3.1 Space Cost

Let $O(n)$ be the space cost of the original inverted lists of all the $n$ tuples. We have $\frac{n}{m}$ tuples in each partition on average. Thus, the space cost introduced by the partition information can be estimated by $\frac{m}{n} O(n)$. The total space cost with partitions is $(1 + \frac{m}{n}) O(n)$. Since the number of partitions is always less than the number of tuples, $m \leq n$, the space cost is at most twice the cost of the original inverted lists.

4.3.2 Time Cost

Moreover, if we define the set $Y$ to be the set of partition vectors in dataspaces, then the correlation integral, say $C_h(\epsilon)$, means the probability that a query has high similarity (greater than $\epsilon$) to the bound of a partition $H$. During the computation, if a query workload is provided, then we can directly use the query tuples as $X$. However, if a query workload is not available, then we can only rely on the data itself, i.e., use the data tuples as $X$ as well.

According to the fractal and self-similarity features, which have been observed in various applications including high dimension spaces [28], [29], [30], there exists a constant, known as the correlation dimension [27] $D$,

$$D = \frac{\partial \log C(\epsilon)}{\partial \log \epsilon}. \qquad (8)$$

We observe the $D_t$ corresponding to $C_t(\epsilon)$ of tuples and the $D_h$ corresponding to $C_h(\epsilon)$ of partitions in real data sets of dataspaces. As presented in Figs. 9, 10, and 11, a straight line can be fitted in each plot, respectively, whose slope is exactly the constant $D$ according to the above definition.
Again, let $O(n)$ be the time cost of merging on the entire space of $n$ tuples. Suppose that the pruning is conducted after processing the first $g$ partitions. Then, we can estimate the time cost of the query with pruning as follows.

Lemma 6. The cost of processing the first $g$ partitions can be estimated by

$$\Big( \frac{m}{n} + \frac{g}{m} \Big) O(n).$$

Proof. The merge cost with pruning partitions consists of two aspects, the merge of partitions and the merge of tuples, respectively. First, $\frac{m}{n} O(n)$ denotes the merge of partitions, in order to calculate the bounds of all the $m$ partitions. Moreover, according to the pruning, all the remaining $m - g$ partitions can be ignored without evaluation. That is, the merge cost of tuples would be $\frac{g}{m}$ of the original cost $O(n)$ without partition pruning. □

Now, we study the estimation of the number of partitions $g$ that are processed before pruning, i.e., the $\frac{g}{m}$ value. Intuitively, we want to estimate the probability that a partition is processed in a query, i.e., the probability of a partition having a bound greater than the top-$k$ answers.

According to the constant $D$ in (8), we can represent the relationship among $\epsilon$, $D$, and $C(\epsilon)$ as follows:

$$C(\epsilon) \propto \epsilon^D, \qquad (9)$$

where $\propto$ stands for proportionality, i.e., $C(\epsilon)$ follows the power law. Let $C(\epsilon) = (\lambda \epsilon)^D$, where $\lambda$ is a constant and can be observed together with the slope $D$ in Figs. 9, 10, and 11.

Lemma 7. The number of processed partitions $g$ can be estimated by

$$g \approx \Big( \frac{\lambda_h}{\lambda_t} \Big)^{D_h} \Big( \frac{k}{n} \Big)^{D_h / D_t} m.$$

Proof. Recall that $C(\epsilon)$ denotes the mean probability that the similarity is greater than $\epsilon$. Let $\epsilon_t$ be the minimum similarity of the top-$k$ answers. Then, we have $\frac{k}{n} = C_t(\epsilon_t) = (\lambda_t \epsilon_t)^{D_t}$. Moreover, let $\epsilon_h$ be the bound of the similarity scores of the $g$th partition. Thus, we also have $\frac{g}{m} = C_h(\epsilon_h) = (\lambda_h \epsilon_h)^{D_h}$.

According to the pruning condition of partitions, we have for the similarity scores $\epsilon_t \leq \epsilon_h$, that is,
$$\frac{1}{\lambda_t} \Big( \frac{k}{n} \Big)^{1/D_t} \leq \frac{1}{\lambda_h} \Big( \frac{g}{m} \Big)^{1/D_h}.$$

In other words, we have

$$\frac{g}{m} \leq \Big( \frac{\lambda_h}{\lambda_t} \Big)^{D_h} \Big( \frac{k}{n} \Big)^{D_h / D_t}.$$

The lemma is proved. □

Combining Lemmas 6 and 7, we have the following conclusion. Let $\psi$ be

$$\psi = \Big( \frac{\lambda_h}{\lambda_t} \Big)^{D_h} \Big( \frac{k}{n} \Big)^{D_h / D_t}. \qquad (10)$$

Corollary 4. The cost of merging with pruning on partitions can be estimated by

$$\Big( \frac{m}{n} + \psi \Big) O(n). \qquad (11)$$

Therefore, we introduce the concept of the correlation integral [27] $C(\epsilon)$, which denotes the mean probability that two objects, one from each of two sets, are similar (with similarity greater than $\epsilon$). Let $|X|$ be the size of object set $X$ and $x_i$ be an object in $X$. Let $|Y|$ be the size of object set $Y$ and $y_j$ be an object in $Y$. Then, for a specific similarity value $\epsilon$, the correlation integral $C(\epsilon)$ can be approximated by the correlation sum

$$C(\epsilon) = \frac{1}{|X||Y|} \sum_{i=1}^{|X|} \sum_{j=1}^{|Y|} \theta(\lVert x_i, y_j \rVert - \epsilon), \qquad (7)$$

where $\theta(\cdot)$ is the Heaviside function, $\theta(x) = 1$ for $x \geq 0$ and $0$ otherwise, and $\lVert \cdot \rVert$ is the intersection similarity.

Let $X$ be the set of query tuples and $Y$ be the set of data tuples in dataspaces. Then, the correlation integral is the probability that the similarity between a query and a tuple $T$ is greater than $\epsilon$, denoted by $C_t(\epsilon)$.

TABLE 2. Observations of $\lambda$ and $D$ of $C(\epsilon)$.

Fig. 6. Estimated cost of a query.

Given a data set, the values of $D_t$ and $\lambda_t$ are fixed according to their definitions. For example, as observed in Fig. 9, we can find an ideal line $C_t(\epsilon) = (\lambda_t \epsilon)^{D_t}$ with $D_t = -3.4$ and $\lambda_t = 0.12$, which approximately fits the observed Base data set. Recall that the slope $D_h$ and the corresponding $\lambda_h$ in (10) can also be observed as constants on a large enough partition scheme. For example, in Fig. 10, we can observe $C_h(\epsilon) = (\lambda_h \epsilon)^{D_h}$ with $D_h = -2.5$ and $\lambda_h = 0.15$, which fits the Base data set with random partitions. In other words, $\psi$ will be a constant that is independent of the number of processed partitions $g$ and the total number of partitions $m$. Therefore, according to Corollary 4, we cannot further improve the query efficiency by increasing the number of partitions $m$. Our experimental evaluation also verifies this conclusion.

In fact, as we observe, the feature-based partition scheme has a large $\lambda_h$ value, such as $\lambda_h \approx 3$ in the Wiki data set in Table 2. Thus, given a specific $k$ value, the feature-based partition has a smaller constant $\psi$ than the random approach (see details in Section 5.2). According to Corollary 4, the time cost of queries on feature-based partitions should be small as well. Therefore, in our experiments, the feature-based partition shows better query time performance.

5 EXPERIMENTS

This section reports the experimental evaluation of the proposed techniques. We evaluate the following approaches: the baseline approach with extended inverted lists [10], the planning with materialization of views, the merging with decomposition of partitions, and the hybrid approach with both views and partitions.
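The message of Corollary 4 — once the pruning rate $g/m$ settles to the constant $\psi$, adding partitions only adds head-merging overhead — can be sanity-checked with a toy cost model (the numbers below are illustrative assumptions, not measurements from the paper):

```python
def estimated_merge_cost(n, m, g):
    """Lemma 6 sketch: (m/n) * O(n) for merging the partition heads plus
    (g/m) * O(n) for merging the tuples of the g unpruned partitions,
    with O(n) taken as n itself for illustration."""
    return (m / n + g / m) * n

n, psi = 1_000_000, 0.1   # assume a fixed pruning rate g/m = psi
costs = [estimated_merge_cost(n, m, g=psi * m) for m in (100, 1_000, 10_000)]
# the (g/m) term stays psi * n regardless of m, so the total cost can
# only grow as the number of partitions m increases
```

The tuple-merging term is constant at $\psi \cdot n$, so the head overhead $m$ dominates the trend, matching the conclusion drawn above.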
The decomposition of partitions, and the hybrid approach with partition scheme is preferred which has lower query cost both views and partitions. In the implementation, we use according to the above theoretical analysis. the Combined Algorithm [11] as the state-of-art merge operator of inverted lists. Moreover, the successful idea of 4.4.1 Random Partition inverted block-index in IO-Top-K [12] is also applied, i.e., The straightforward partition scheme is to assign a tuple to divide each inverted list into blocks and use score- a partition at random. The distribution of tuples in different descending order among blocks but keep the tuple entries partitions tends to be the same. In other words, each within each block in the order of tuple IDs. Such block idea partition shows a similar partition vector. Therefore, the in CA is complementary to our proposed materialization prune power is low by conjecture. and decomposition techniques. The main evaluation criter- D < 0 Recall that the correlation dimension has h . ion is query time cost. We run the experiments in two real According to Lemma 7, the smaller the h is, the larger data sets, Google Base (Base), and Wikipedia (Wiki). There the number of processed partitions g would be. In fact, as are 7,432,575 tuples crawled from Google Base web site, in we observed in Table 2, the random partition scheme shows size of 3.06 GB after preprocessing. The data of Wikipedia a small , e.g., 0:3 in the Wiki data set. Thus, h h consists of 3,493,237 tuples, in size of 0.82 GB after theoretically, the query efficiency based on random parti- tions is low. Our experimental evaluation also verifies the preprocessing. Items of attribute-keywords are associated unsuitability of the random partition scheme. with tf*idf weight scores [13]. 
4.4.2 Feature-Based Partition

The feature-based partitioning is developed from the intuition that the tuples in the same partition share similar contents of items (features). There are various algorithms for partitioning tuples according to their similarities [31], which is not the focus of this paper. For example, we can employ a classifier to make decisions based on the item features of partitions. Or we can use clustering algorithms to group the tuples into $m$ clusters without a supervised classifier.

Intuitively, in a feature-based partition scheme, the tuples in the same partition share similar item features, while the tuples in different partitions often have differing item features. Consequently, the partition vectors of different partitions differ as well. For a specific query, the bounds of the partitions will then be more distinguishable. In other words, more irrelevant tuples in the partitions with low bounds can possibly be pruned.

During the evaluation, we randomly select 200 tuples from the data set as a synthetic workload of queries.6 The average response time of these queries is reported when the different approaches are applied. The experiments run on a machine with an Intel Core 2 CPU (2.13 GHz) and 2 GB of memory.

6. We explore the correspondences of attributes in dataspaces by using instance-level matching [9].

5.1 Evaluating Materialization

In this experiment, we mainly test the performance of the different planning approaches under various disk space limitations. Let $s$ be the average size of the lists. Then, the limitation of space $M$ is actually a limitation on the number of views, say $M_s$.
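The feature-based grouping of Section 4.4.2 can be sketched with a simple leader-style clustering. This is an illustrative stand-in for the classifiers or clustering algorithms [31] the paper delegates to; `feature_partition` and its Jaccard threshold are assumptions:

```python
def feature_partition(tuples, m, threshold=0.5):
    """Group tuples into at most m partitions by item overlap: each tuple
    joins the most similar existing partition (Jaccard similarity with
    the partition's accumulated item set) or starts a new one."""
    leaders, partitions = [], []
    for t in tuples:
        items = set(t)
        best, best_sim = None, 0.0
        for i, leader in enumerate(leaders):
            sim = len(items & leader) / len(items | leader)  # Jaccard
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            partitions[best].append(t)
            leaders[best] |= items
        elif len(leaders) < m:
            leaders.append(set(items))
            partitions.append([t])
        else:  # capacity reached: fall back to the most similar partition
            best = best if best is not None else 0
            partitions[best].append(t)
            leaders[best] |= items
    return partitions

tuples = [("A", "B"), ("A", "B", "C"), ("X", "Y"), ("X", "Y", "Z")]
partitions = feature_partition(tuples, m=2)
# tuples sharing items land together, so each partition's weight bounds
# are high for its own features and zero for the others
```

Grouping similar tuples concentrates each item's weight into few partitions, which is exactly what makes the partition bounds distinguishable for pruning.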

Fig. 9. $C(\epsilon)$ observation of tuples.
Fig. 7. Time cost of a query.

Fig. 10. $C(\epsilon)$ observation of random partitions.
Fig. 8. Time cost of a query by different plans.
Fig. 11. $C(\epsilon)$ observation of feature-based partitions.

Since the lists of single-item views (i.e., original items) are already stored by the index, we mainly check the extra space cost introduced by materializing the views with multiple items, i.e., $m = |\mathcal{V}| - |I|$. In the following experiments, the number of views denotes the $m$ multi-item views by default. When the view number is $m = 0$, no multi-item views are materialized, i.e., it is equivalent to the baseline approach.

In Fig. 6, we present the estimated cost of the query plan $P$, i.e., $\sum_j c_j x_j$ in Definition 3.1, when different total numbers of views $m$ are available. With the increase of materialized views $m$, the query plan $P$ can choose more effective views with less estimated cost. Obviously, a randomly generated view has a rare chance of being effective for a specific query. Thus, the query may have to retrieve the original items with higher cost, since the randomly generated views could be useless. The frequency-based view scheme materializes the views with frequent items according to the historical query log. Queries can reuse these views and consequently have query plans with less estimated cost. Finally, we also report the estimated cost of queries on the cost-based view scheme, which is smaller than the other ones.

According to our view selection strategies in query planning and view generation, we always first choose the most effective views, those that can contribute most to the query. Thereby, with the increase of views, e.g., from 150 to 200 in Figs. 6 and 7, the achieved improvement may not be as significant as for the first chosen ones, like 1-50.

In Fig. 8, we evaluate the standard and general query planning. As we presented in Corollary 2, the worst case of an optimal general plan is the corresponding optimal standard plan without any negative merge operation. Therefore, as presented in Fig. 8, in the Wiki data set, the performance of the general plan is generally not worse than that of the standard one.
When a proper negative merge is applicable, e.g., in the Base data set, the general plan can achieve better performance.

Fig. 7 reports the corresponding query time cost of the various view schemes. As shown in the figures, the time cost is roughly proportional to the estimated cost of the corresponding query plans. The more materialized views $m$ there are, the better the query time performance is. According to the above observation of estimated cost, the frequency-based approach has more chances to utilize effective materialization than the random one. Therefore, the time cost of queries on frequency-based views is lower as well. However, the frequency-based approach favors small views, as we mentioned in Section 3.4, while the cost-based scheme can generate useful views according to the cost estimation. Thus, queries on cost-based views have even lower time cost. Note that if there are no proper views available in large size, the cost-based strategy will generate views similar to the frequency-based one. Consequently, the difference between these two generation strategies may not be large, e.g., for 10 views in Fig. 7.

5.2 Evaluating Decomposition

In this experiment, we first observe that the constant $D$ in (8) exists in real dataspace examples, since our cost estimation is based on the assumption of the existence of this $D$. Specifically, we collect the $C(\epsilon)$ values of tuples, random partitions, and feature-based partitions, which are reported in Figs. 9, 10, and 11, respectively. Each point $(\epsilon, C(\epsilon))$ denotes the observed probability $C(\epsilon)$ at similarity score $\epsilon$ in the corresponding data set. We also plot an ideal $C(\epsilon) = (\lambda \epsilon)^D$ for each data set that fits our observations. We record the constants $\lambda$ and $D$ of the ideal $C(\epsilon) = (\lambda \epsilon)^D$ as the estimation of the real observations, which are presented in Table 2. For example, we have $D_h = -2.5$ and $\lambda_h = 0.25$ for feature-based partitions in the Base data set, according to the ideal $C(\epsilon) = (0.25\,\epsilon)^{-2.5}$ that fits the observed data in Fig. 11.
Nevertheless, the cost-based approach will not generate significantly worse views than the frequency-based one. According to the definition of $\psi$ in (10), for the random partitions in Base, we have:

Fig. 12. Retrieved partitions of a query.
Fig. 13. Time cost of a query.

$$\psi_{random} = \Big(\frac{\lambda_h}{\lambda_t}\Big)^{D_h} \Big(\frac{k}{n}\Big)^{D_h/D_t} = \Big(\frac{0.15}{0.12}\Big)^{-2.5} \Big(\frac{k}{n}\Big)^{2.5/3.4} = 0.572\,\Big(\frac{k}{n}\Big)^{2.5/3.4}.$$

For the feature-based scheme, we have

$$\psi_{feature} = \Big(\frac{0.25}{0.12}\Big)^{-2.5} \Big(\frac{k}{n}\Big)^{2.5/3.4} = 0.159\,\Big(\frac{k}{n}\Big)^{2.5/3.4}.$$

Obviously, the constant $\psi_{feature}$ is less than $\psi_{random}$. According to Corollary 4, queries on feature-based partitions should have lower time cost than with the random approach. Similarly, we can also observe the constant $\psi$ in Wiki:

$$\psi_{random} = \Big(\frac{0.3}{8}\Big)^{-1.5} \Big(\frac{k}{n}\Big)^{1.5/1.5} = 137.706\,\Big(\frac{k}{n}\Big)^{1.5/1.5},$$

$$\psi_{feature} = \Big(\frac{3}{8}\Big)^{-1.5} \Big(\frac{k}{n}\Big)^{1.5/1.5} = 4.354\,\Big(\frac{k}{n}\Big)^{1.5/1.5}.$$

That is, we have $\psi_{feature} < \psi_{random}$ as well. So far, according to the above observations and analysis on both data sets, the feature-based partition should have better performance than the random one. Next, we show that this holds for the real query evaluation.

Moreover, with the further increase of the partition number $m$, e.g., 1,000 partitions in Fig. 13, the performance cannot be improved any more, since the fractal property has already been clearly observed with a constant $\psi$. Consequently, the time cost increases with the number of partitions, which confirms our conclusion in Corollary 4.

5.3 Scalability

Finally, we combine our proposed views and partitions together, called the hybrid approach with materialization+decomposition. For the view materialization, we use the cost-based view generation and the standard query plan. The partition decomposition uses the feature-based partition generation. The state-of-the-art CA method is utilized as the baseline approach, where materialization and decomposition are not applied. We mainly test the performance of the approaches under different data sizes, in order to evaluate scalability. As presented in Fig. 14, the time cost of all the approaches increases linearly with the data size. Both our materialization and decomposition approaches can improve the time performance at various data scales, compared with the baseline approach.
the time performance in various data scale, compared with Specifically, we observe the time performance of top-k7 the baseline approach. The hybrid one can always achieve queries under different number of partitions m. Note that the best performance and scales well under large sizes. when the partition number m ¼ 1, it means that all the tuples are in one partition, which is equivalent to the 6RELATED WORK baseline approach. In Fig. 12, we study the number of retrieved partitions g The concept of dataspaces is proposed in [1] and [2], which that have to be processed before the pruning can be applied. provides a coexisting system of heterogeneous data. Due to First, compared with feature-based partition scheme, the the huge amount of increasing data especially from the random approach has much more retrieved partitions g, Web, the importance of dataspace systems has already been which also confirms our analysis of the random scheme in recognized and emphasized [3], [4]. Recent work [5], [6], [7] Section 4.4. Moreover, as we analyzed, the constant is mainly dedicated to offering best-effort answers in a pay- g as-you-go style in view of integration issues. In our study, approximately denotes the rate of processed partitions m .In the above observation, we find that the feature is much less instead, we concern the efficiency issues on accessing 4:354 dataspaces, which also plays a fundamental role in practical than random in Wiki ð137:706Þ, compared with those of Base 0:159 dataspace systems. ð0:572Þ. Thus, in Fig. 12, the pruning power of featured-based partitions on Wiki is stronger than that in Base. The problem of indexing dataspaces is first studied by Fig. 13 shows the time cost of queries with corresponding Dong and Halevy [10] to support efficient query. Based on partitions. When the number of partitions m is small, e.g., the encoding of attribute label and value as items, the 10 partitions in Fig. 
13, the overlap of items (features) among partitions might be large. That is, in terms of our cost analysis, we cannot accurately observe the constants $D$ and $\psi$. Thus, the bounds of the partitions are not distinguishable for a specific query. Consequently, the pruning is not ensured, and the cost will be high. With the increase of $m$, e.g., from 10 to 100 partitions in Fig. 13, the constant $\psi$ can be observed. In other words, the bounds of partitions can effectively identify and prune the low-score tuples, and thereby the cost drops.

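As a numeric check of the $\psi$ calculations in Section 5.2 ($\lambda$ and $D$ are the constants read from Figs. 9, 10, and 11, cf. Table 2; only the constant factor $(\lambda_h/\lambda_t)^{D_h}$ of $\psi$ is computed here):

```python
def psi_constant(lam_h, lam_t, d_h):
    """Constant factor (lam_h / lam_t) ** D_h of psi in (10)."""
    return (lam_h / lam_t) ** d_h

base_random  = psi_constant(0.15, 0.12, -2.5)   # Base, random partitions
base_feature = psi_constant(0.25, 0.12, -2.5)   # Base, feature-based
wiki_random  = psi_constant(0.30, 8.0,  -1.5)   # Wiki, random partitions
wiki_feature = psi_constant(3.00, 8.0,  -1.5)   # Wiki, feature-based
# a larger lambda_h (more distinguishable partition bounds) yields a
# smaller psi, i.e., fewer partitions processed before pruning applies
```

These reproduce the constants 0.572, 0.159, 137.706, and 4.354 quoted above, and in both data sets the feature-based constant is the smaller one.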
7. $k = 5$; similar results are observed with $k = 1, 10$, etc.

Fig. 14. Scalability.

In fact, the inverted index [16], [17] has been studied for decades for efficient access to sparse data. Zobel and Moffat [18] provide a comprehensive introduction to the key techniques in the text indexing literature. Li et al. [32] also study indexing for approximately retrieving sparse data, which extends inverted lists as well. The difference between dataspaces and XML is also discussed by Dong and Halevy [10]. Specifically, since XML techniques rely on encoding the parent-child and ancestor-descendant relationships in an XML tree, which do not fit dataspaces, the query processing of related items in XML [33] is not directly applicable to the query processing with keyword neighborhood in dataspaces [10].

study the k-NN search by avoiding merging all the dimensions referred to by query items. In our study, we first decompose the tuples into a set of partitions; then the above merging techniques can be applied in each partition, respectively. Thus, our focus is the pruning of partitions rather than the merge operator.

Materialized views in relational databases are often utilized to find equivalent view-based rewritings of relational queries [45], such as conjunctive queries or aggregate queries in databases. A similar problem is also studied in index selection [46]. Specifically, given a workload of SQL statements and a user-specified storage constraint, the task is to recommend a set of indexes that have the maximum benefit for the given workload. Greedy heuristics are often used in such selection, e.g., in SQL Server [47] and DB2 [48].
approaches introduce materialization in dataspaces and Chirkova and Li [49] study the generation of a set of views that can compute the answers to the queries, such that the provide (near) optimal query planning based on the size of the view set is minimal. Heeren et al. [50] consider materialized data. For high dimensional data, Sarawagi the index selection with a bound of available space, upon and Kirpal [34] propose the grouping of items to materialize which the average query response time is minimized. their corresponding inverted lists. However, the proposed Instead of considering materialized views and index algorithm is developed for set similarity joins, and does not separately, Aouiche and Darmont [51] take view-index address the generation of optimal query plans based on the interactions into account and achieve efficient storage space available materialized data. sharing. All these previous works focus on rewriting Various strategies for storing sparse data are also relational queries based on materialized views in relational proposed. Chu et al. [35] use a big wide-table to store the databases. In our study, we extend the concept of materi- sparse data and extract the data incrementally [36]. Rather alization to dataspaces and explore the corresponding than the predominant positional storage with a preallocated query optimization. amount of space for each attribute, the wide-table uses an Partitioning-based approaches for efficient access are interpreted storage to avoid allocating spaces to those null studied as well. Lester et al. [52] propose the partitioning values in the sparse data. Agrawal et al. [37] study a vertical index for efficient online index. To make documents format storage of the tuples. Specifically, a 3-ary vertical immediately accessible, the index is divided into a controlled scheme is developed with columns including tuple identi- number of partitions. Nikos Mamoulis [53] studies the fier, attribute name, and attribute value. 
Beckmann et al. efficient joins on set-valued attributes, by using inverted [38] extend the RDBMS attributes to handle the sparse data index. Different from our large number of attributes, the join as interpreted fields. A prototype implementation in the predicates are evaluated between two attributes only. existing RDBMS is evaluated to illustrate the advanced Sarawagi and Kirpal [34] also propose an efficient algorithm performance in dealing with sparse data. Abadi et al. [39], for indexing with a data partitioning strategy. All these [40] provide comprehensive studies of the column-based efficient techniques are dedicated to the single attribute store comparing with the row-based store. The column problem, while the dataspaces contain various attributes. store with vertical partitioning shows advanced perfor- The idea of cracking databases into manageable pieces is mance in many applications, such as the RDF data of the developed recently to organize data in the way users request [39] and the recent Star Schema Benchmark it. Rather than dataspaces, Idreos et al. [54], [55] mainly study of data warehousing [40]. Chaudhuri et al. [41] study a the cracking of relational databases. The cracking approach is similarity join operator (SSJoin [41], [42]) on text attributes, based on the hypothesis that index maintenance should be a which are also organized in a vertical style. Specifically, byproduct of query processing, not of updates. each value of text attributes is converted to a set of tokens (words or q-grams [43]), which are stored separately in different tuples, respectively. Our work is independent with 7CONCLUSIONS the storage of dataspaces, instead we develop the materi- In this paper, we study the materialization and decomposi- alization and decomposition of dataspaces upon the tion of dataspaces in order to improve the efficiency of indexing framework. queries with keyword neighborhood in schema level. 
Since For the merge operator, the inverted lists are usually neighbor keywords are always queried together, we first sorted by the tuple IDs, then efficient merging algorithm can propose the materialization of neighbor keywords as views be applied [18]. When the inverted lists are sorted by the of items. Then, the optimal query planning is studied on the weight of each tuple, the threshold algorithm [19] can item views that are materialized in dataspaces. Due to the return the top-k answers efficiently. Advanced TA family NP-completeness of the problem, we study the greedy methods such as combined algorithm [11] can also be used approaches to generate query plans. Obviously, the more as merge operator of inverted lists. Moreover, inverted materialized views there are, the better the query perfor- block-index is also proposed in IO-Top-K [12], i.e., divide mance is. The generation of views is then discussed with each inverted list into blocks and use score-descending limitation of materialization space. Moreover, we also study order among blocks but keep the tuple entries with in each the decomposition of dataspaces into partitions of tuples for block in tuple ID order. Arjen P. de Vries et al. [44] also top-k queries. Efficient pruning of tuple partitions is 1886 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 12, DECEMBER 2011 developed during the top-k query processing. We propose a [15] H. Bast and I. Weber, “The Completesearch Engine: Interactive, theoretical analysis for the cost of querying with partitions Efficient, and Towards IR& DB Integration,” Proc. Conf. Innovative Data Systems Research (CIDR), pp. 88-95, 2007. and find that the pruning power cannot be improved by [16] R.A. Baeza-Yates and B.A. Ribeiro-Neto, Modern Information increasing the number of partitions. The generation of Retrieval. ACM Press / Addison-Wesley, 1999. partitions is also discussed based on the cost analysis. 
[17] I.H.Witten,A.Moffat,andT.C.Bell,Managing Gigabytes: Compressing and Indexing Documents and Images, second ed. Finally, we report an extensive experiment to illustrate Morgan Kaufmann, 1999. the performance of proposed methods. In the method of [18] J. Zobel and A. Moffat, “Inverted Files for Text Search Engines,” materialization, the general query plans show no worse ACM Computing Surveys, vol. 38, no. 2, pp. 1-55, 2006. performance than the standard query plans. When proper [19] R. Fagin, A. Lotem, and M. Naor, “Optimal Aggregation Algorithms for Middleware,” Proc. 20th ACM SIGMOD-SI- negative merge is applicable, the general plan can achieve GACT-SIGART Symp. Principles of Database Systems (PODS ’01), better performance. The hybrid approach with both views 2001. and partitions can always achieve the best performance. [20] D. Peleg, G. Schechtman, and A. Wool, “Approximating Bounded 0-1 Integer Linear Programs,” Proc. Second Israel Symp. Theory and Furthermore, the experimental results also verify our Computing Systems, pp. 69-77, 1993. conclusions of cost analysis, that is, we can improve the [21] C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: query performance by increasing the number of views but Algorithms and Complexity. Prentice-Hall, Inc., 1982. [22] V. Chvatal, “A Greedy Heuristic for the Set-Covering Problem,” not that of partitions. Math. Operations Research, vol. 4, no. 3, pp. 233-235, 1979. [23] G. Dobson, “Worst Case Analysis of Greedy Heuristics for Integer Programming with Non-Negative Data,” Math. Operations Re- ACKNOWLEDGMENTS search, vol. 7, no. 4, pp. 515-531, 1982. [24] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Funding for this work was provided by Hong Kong RGC Association Rules in Large Databases,” Proc. 20th Int’l Conf. Very GRF 611608, NSFC Grant Nos. 60736013, 60970112, and Large Data Bases (VLDB ’94), pp. 487-499, 1994. 60803105, and Microsoft Research Asia Gift Grant, [25] J. Han, J. Pei, Y. 
Yin, and R. Mao, “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Ap- MRA11EG05. The authors also thank the support from proach,” and Knowledge Discovery, vol. 8, no. 1, HKUST RFID center. pp. 53-87, 2004. [26] L. Lim, M. Wang, S. Padmanabhan, J.S. Vitter, and R.C. Agarwal, “Efficient Update of Indexes for Dynamically Chan- REFERENCES ging Web Documents,” J. , vol. 10, no. 1, pp. 37- 69, 2007. [1] M.J. Franklin, A.Y. Halevy, and D. Maier, “From Databases to [27] P. Grassberger and I. Procaccia, “Measuring the Strangeness of Dataspaces: A New Abstraction for Information Management,” Strange Attractors,” Physica D: Nonlinear Phenomena, vol. 9, SIGMOD Record, vol. 34, no. 4, pp. 27-33, 2005. nos. 1/2, pp. 189-208, 1983. [2] A.Y. Halevy, M.J. Franklin, and D. Maier, “Principles of Dataspace [28] Systems,” Proc. 25th ACM SIGMOD-SIGACT-SIGART Symp. A. Belussi and C. Faloutsos, “Estimating the Selectivity of Spatial Principles of Database Systems (PODS ’06), pp. 1-9, 2006. Queries Using the “Correlation” Fractal Dimension,” Proc. 21th [3] M.J. Franklin, A.Y. Halevy, and D. Maier, “A First Tutorial on Int’l Conf. Very Large Data Bases (VLDB ’95), pp. 299-310, 1995. [29] Dataspaces,” Proc. VLDB Endowment, vol. 1, no. 2, pp. 1516-1517, B.-U. Pagel, F. Korn, and C. Faloutsos, “Deflating the Dimension- 2008. ality Curse Using Multiple Fractal Dimensions,” Proc. 16th Int’l [4] J. Madhavan, S. Cohen, X.L. Dong, A.Y. Halevy, S.R. Jeffery, D. Conf. Data Eng., pp. 589-598, 2000. Ko, and C. Yu, “Web-Scale Data Integration: You can Afford to [30] F. Korn, B.-U. Pagel, and C. Faloutsos, “On the “Dimensionality Pay as You Go,” Proc. Conf. Innovative Data Systems Research Curse” and the “Self-Similarity Blessing,”” IEEE Trans. Knowledge (CIDR), pp. 342-350, 2007. and Data Eng., vol. 13, no. 1, pp. 96-111, Jan./Feb. 2001. [5] S.R. Jeffery, M.J. Franklin, and A.Y. Halevy, “Pay-As-You-Go User [31] J. Han and M. Kamber, Data Mining: Concepts and Techniques. 
Feedback for Dataspace Systems,” Proc. ACM SIGMOD Int’l Conf. Morgan Kaufmann, 2000. Management of Data (SIGMOD ’08), pp. 847-860, 2008. [32] B. Li, M. Hui, J. Li, and H. Gao, “Iva-File: Efficiently Indexing [6] A.D. Sarma, X. Dong, and A.Y. Halevy, “Bootstrapping Pay-As- Sparse Wide Tables in Community Systems,” Proc. IEEE Int’l Conf. You-Go Data Integration Systems,” Proc. ACM SIGMOD Int’l Conf. Data Eng. (ICDE ’09), pp. 210-221, 2009. Management of Data (SIGMOD ’08), pp. 861-874, 2008. [33] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv, “Xsearch: A Semantic [7] M.A.V. Salles, J.-P. Dittrich, S.K. Karakashian, O.R. Girard, and L. Search Engine for Xml,” Proc. 29th Int’l Conf. Very Large Data Bases Blunschi, “Itrails: Pay-As-You-Go Information Integration in (VLDB ’03), pp. 45-56, 2003. Dataspaces,” Proc. 33rd Int’l Conf. Very Large Data Bases (VLDB [34] S. Sarawagi and A. Kirpal, “Efficient Set Joins on Similarity ’07), pp. 663-674, 2007. Predicates,” Proc. ACM SIGMOD Int’l Conf. Management of Data [8] F.M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A Core of (SIGMOD ’04), pp. 743-754, 2004. Semantic Knowledge,” Proc. 16th Int’l Conf. World Wide Web [35] E. Chu, J.L. Beckmann, and J.F. Naughton, “The Case for a Wide- (WWW ’07), pp. 697-706, 2007. Table Approach to Manage Sparse Relational Data Sets,” Proc. [9] E. Rahm and P.A. Bernstein, “A Survey of Approaches to ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’07), Automatic Schema Matching,” Int’l J. Very Large Data Bases, pp. 821-832, 2007. vol. 10, no. 4, pp. 334-350, 2001. [36] E. Chu, A. Baid, T. Chen, A. Doan, and J.F. Naughton, “A [10] X. Dong and A.Y. Halevy, “Indexing Dataspaces,” Proc. ACM Relational Approach to Incrementally Extracting and Querying SIGMOD Int’l Conf. Management of Data (SIGMOD ’07), pp. 43-54, Structure in Unstructured Data,” Proc. 33rd Int’l Conf. Very Large 2007. Data Bases (VLDB ’07), pp. 1045-1056, 2007. [11] R. Fagin, “Combining Fuzzy Information: An Overview,” SIG- [37] R. 
Agrawal, A. Somani, and Y. Xu, “Storage and Querying of MOD Record, vol. 31, no. 2, pp. 109-118, 2002. e-Commerce Data,” Proc. 27th Int’l Conf. Very Large Data Bases [12] H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum, (VLDB ’01), pp. 149-158, 2001. “Io-Top-k: Index-Access Optimized Top-k Query Processing,” [38] J.L. Beckmann, A. Halverson, R. Krishnamurthy, and J.F. Proc. 32nd Int’l Conf. Very Large Data Bases (VLDB ’06), pp. 475-486, Naughton, “Extending RDBMSs to Support Sparse Datasets Using 2006. an Interpreted Attribute Storage Format,” Proc. 22nd Int’l Conf. [13] G. Salton, Automatic Text Processing: The Transformation, Analysis, Data Eng. (ICDE ’06), p. 58, 2006. and Retrieval of Information by Computer. Addison-Wesley, 1989. [39] D.J. Abadi, A. Marcus, S. Madden, and K.J. Hollenbach, “Scalable [14] F. Liu, C.T. Yu, W. Meng, and A. Chowdhury, “Effective Keyword Semantic Web Using Vertical Partitioning,” Search in Relational Databases,” Proc. ACM SIGMOD Int’l Conf. Proc. 33rd Int’l Conf. Very Large Data Bases (VLDB ’07), pp. 411-422, Management of Data (SIGMOD ’06), pp. 563-574, 2006. 2007. SONG ET AL.: MATERIALIZATION AND DECOMPOSITION OF DATASPACES FOR EFFICIENT SEARCH 1887

[40] D. Abadi, S. Madden, and N. Hachem, “Column-Stores Vs. Row- Shaoxu Song is currently working toward the Stores: How Different are They Really?” Proc. ACM SIGMOD Int’l PhD degree in the Department of Computer Conf. Management of Data (SIGMOD ’08), 2008. Science and Engineering, The Hong Kong [41] S. Chaudhuri, V. Ganti, and R. Kaushik, “A Primitive Operator for University of Science and Technology, Hong Similarity Joins in Data Cleaning,” Proc. 22nd Int’l Conf. Data Eng. Kong. His research interests include data quality (ICDE ’06), p. 5, 2006. and data dependency. He is a student member [42] A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-Similarity of the IEEE. Joins,” Proc. 32nd Int’l Conf. Very Large Data Bases (VLDB ’06), pp. 918-929, 2006. [43] E. Ukkonen, “Approximate String Matching with q-Grams and Maximal Matches,” Theoretical Computer Science—Selected Papers of the Combinatorial Pattern Matching School, vol. 92, no. 1, pp. 191- 211, 1992. Lei Chen received the BS degree in computer [44] A.P. de Vries, N. Mamoulis, N. Nes, and M.L. Kersten, “Efficient science and engineering from Tianjin University, k-NN Search on Vertically Decomposed Data,” Proc. ACM China, in 1994, the MA degree from Asian SIGMOD Int’l Conf. Management of Data (SIGMOD ’02), pp. 322- Institute of Technology, Thailand, in 1997, and 333, 2002. the PhD degree in computer science from [45] G. Gou, M. Kormilitsin, and R. Chirkova, “Query Evaluation University of Waterloo, Canada, in 2005. He is Using Overlapping Views: Completeness and Efficiency,” Proc. now an associate professor in the Department of ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 37- Computer Science and Engineering at Hong 48, 2006. Kong University of Science and Technology. His [46] S. Chaudhuri, M. Datar, and V.R. 
Narasayya, “Index Selection for research interests include uncertain databases, Databases: A Hardness Study and a Principled Heuristic graph databases, multimedia and time series databases, and sensor Solution,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, and peer-to-peer databases. He is a member of the IEEE. pp. 1313-1323, Nov. 2004. [47] S. Agrawal, S. Chaudhuri, and V.R. Narasayya, “Automated Mingxuan Yuan received the BSc degree in Selection of Materialized Views and Indexes in Sql Databases,” 2002 and MSc degree in 2006, both in computer Proc. 26th Int’l Conf. Very Large Data Bases (VLDB ’00), pp. 496-505, science and engineering, from Xi’an Jiaotong 2000. University, China. He is currently working toward [48] G. Valentin, M. Zuliani, D.C. Zilio, G.M. Lohman, and A. Skelley, the PhD degree in the Department of Computer “DB2 Advisor: An Optimizer Smart Enough to Recommend Its Science and Engineering at Hong Kong Uni- Own Indexes,” Proc. 16th Int’l Conf. Data Eng., pp. 101-110, 2000. versity of Science and Technology, China. His [49] R. Chirkova and C. Li, “Materializing Views with Minimal Size to research interests include privacy problems of Answer Queries,” Proc. 22nd ACM SIGMOD-SIGACT-SIGART social networks. Symp. Principles of Database Systems (PODS ’03), pp. 38-48, 2003. [50] C. Heeren, H.V. Jagadish, and L. Pitt, “Optimal Indexing Using Near-Minimal Space,” Proc. 22nd ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS ’03), pp. 244-251, 2003. [51] K. Aouiche and J. Darmont, “Data Mining-Based Materialized . For more information on this or any other computing topic, View and Index Selection in Data Warehouses,” J. Intelligent please visit our at www.computer.org/publications/dlib. Information Systems, vol. 33, no. 1, pp. 65-93, 2009. [52] N.Lester,A.Moffat,andJ.Zobel,“FastOn-LineIndex Construction by Geometric Partitioning,” Proc. 14th ACM Int’l Conf. Information and (CIKM ’05), pp. 776- 783, 2005. [53] N. 
Mamoulis, “Efficient Processing of Joins on Set-Valued Attributes,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’03), pp. 157-168, 2003. [54] S. Idreos, M.L. Kersten, and S. Manegold, “Database Cracking,” Proc. Conf. Innovative Data Systems Research (CIDR), pp. 68-78, 2007. [55] S. Idreos, M.L. Kersten, and S. Manegold, “Updating a Cracked Database,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’07), pp. 413-424, 2007.