Similarity Group-by*

Yasin N. Silva1, Walid G. Aref1, Mohamed H. Ali2
1Department of Computer Science, Purdue University, Indiana, USA
{ysilva,aref}@cs.purdue.edu
2Microsoft Corporation, Washington, USA
[email protected]

* This work was partially supported by NSF Grant Number IIS-0811954.

Abstract— Group-by is a core database operation that is used extensively in OLTP, OLAP, and decision support systems. In many application scenarios, it is required to group similar but not necessarily equal values. In this paper we propose a new SQL construct that supports similarity-based Group-by (SGB). SGB is not a new clustering algorithm, but rather is a practical and fast similarity grouping query operator that is compatible with other SQL operators and can be combined with them to answer similarity-based queries efficiently. In contrast to expensive clustering algorithms, the proposed similarity group-by operator maintains low execution times while still generating meaningful groupings that address many application needs. The paper presents a general definition of the similarity group-by operation and gives three instances of this definition. The paper also discusses how optimization techniques for the regular group-by can be extended to the case of SGB. The proposed operators are implemented inside PostgreSQL. The performance study shows that the proposed similarity-based group-by operators have good scalability properties with at most only 25% increase in execution time over the regular group-by.

I. INTRODUCTION

One of the most important paradigm shifts in data management is the move from systems that focus on exact semantics of data and Boolean semantics of queries to systems that focus on imprecise and approximate semantics of data and queries. Among the areas driving this paradigm shift are probabilistic databases, the integration of information retrieval and database systems, and similarity query processing. Previous work on similarity query processing has focused on solving the problem of joining two sets of data using similarity join predicates that match tuples with similar or approximate values. The goal of this paper is to extend similarity-based processing to another core database operation, the group-by, to group objects with similar or approximate values.

Grouping capabilities have been extensively studied and implemented in data management systems. In DBMSs, the most common support for grouping is implemented through the standard group-by operator. This operator has relatively good execution time and scalability properties. However, while the semantics of group-by is simple, it is also limited because it is based only on equality, i.e., all the tuples in a group have exactly the same values of the grouping attributes. Grouping has also been studied in data warehouses, where several facilities have been implemented to enrich the output of the regular group-by operator with multiple levels of subtotals. Additionally, analysis techniques, e.g., OLAP and data mining, provide advanced features for grouping. OLAP provides several tools to summarize data organized under the dimensional model but the formation of groups is still based on the equality semantics. Data mining clustering tools employ complex and sophisticated algorithms to discover groups naturally formed in the data. Unfortunately, the use of these analysis techniques requires the definition of elaborate data mining models; is not integrated into the regular query processing engine, and is not standard among the different commercial database systems.

Emerging applications, e.g., biological databases and data streaming, require the identification of sets of approximate values. Moreover, many business application scenarios, especially those handling large datasets, can benefit tremendously from SQL constructs that identify groups of similar values to process them further. Current database systems do not provide fast mechanisms that are integrated into the query engine to generate and process groups of similar objects using complex TPC-H-like queries [29]. Figure 1 presents a comparison of several approaches to support similarity-based grouping. The implementation of similarity-based grouping at the DB engine level has the following key advantages: (1) the execution time of the Similarity Group-by operator (SGB) is comparable to that of the regular group-by and is superior to the performance of the other user-level definitions; (2) SGB can be interleaved with other regular operators and its results pipelined for further processing; and (3) important optimization techniques, e.g., pre-aggregation and the use of materialized views can be extended to the new operator.

Similarity Grouping Implementation Approach

| | Integrated in DB Engine | Using Basic SQL Operators | Using User-Defined Aggr. | As Stored Procedures |
|---|---|---|---|---|
| Supported grouping strategies | All (Supervised & Unsupervised) | Only Supervised (partially) | Only Supervised (partially) | All (Sup. & Unsup.) |
| Implementation complexity | Low (reuses and extends DB operators and structures) | Queries use a complex mix of joins and aggregations | UDAs require the support of vectors for reference points, spilling mechanisms, etc. | SPs require the support of large hash tables, spilling mechanisms, etc. |
| Execution time | Very Low | High (see Fig. 13) | Medium | Low |
| Composable with other operators (pipelining) | Yes | Yes | Yes | No |
| Take advantage of query optimizer | Yes (pre-aggr. and use of MVs) | Not directly | No | No |

Fig. 1 Comparison of similarity group-by implementation approaches

The contributions of this paper are as follows:
• We introduce the similarity group-by (SGB) operator
which extends standard group-by to allow the formation of groups based on similarity rather than equality of the data.
• We present a generic definition of the SGB operator and three instances to support: (1) the formation of groups based on fundamental group properties, e.g., group compactness and group size, (2) the formation of groups around points of interest, and (3) the formation of groups delimited by a set of limiting points. The proposed instances support similarity grouping of one or more independent one-dimensional attributes.
• We extend the standard optimization techniques for regular aggregations to the case of SGB. In particular, we introduce the main theorem of Eager and Lazy similarity aggregations, an extension of the corresponding regular aggregation based theorem; and the requirements that a materialized view must satisfy to be used to answer a similarity aggregation query.
• We implement the proposed SGB operators in PostgreSQL and study their performance and scalability properties. We use SGB in modified TPC-H queries to answer interesting business questions and show that the execution time of all implemented SGB instances is at most only 25% larger than that of the regular group-by.

The rest of this paper proceeds as follows. Section II discusses the related work. Section III presents the general definition of SGB and three instances of this definition. Section IV studies optimization techniques applicable to the new operators. Section V presents implementation guidelines based on a prototype realization of the operators within PostgreSQL. Section VI reports on the performance evaluation of the similarity group-by operators and Section VII presents the conclusions and directions for future research.

II. RELATED WORK

The work on similarity-based query processing has focused on similarity joins. Similarity joins use special types of join predicates to match tuples that have approximate values. Different types of similarity join have been proposed, e.g., range distance join (retrieves all pairs whose distances are smaller than a pre-defined threshold) [1], [7], [8], k-distance join (retrieves the k most-similar pairs) [2], and knn-join (retrieves, for each tuple in one table, the k nearest-neighbours in the other table) [3], [5], [6]. Some similarity join techniques have been employed as building blocks to implement common clustering algorithms [4]. Kriegel et al. extend the work on similarity join to uncertain data [9].

The clustering problem has been studied extensively, e.g., in pattern recognition, machine learning, physiology, biology, statistics, and data mining. In some of these application scenarios, finding the groups with certain similarity properties is the goal of data analysis while in others finding the groups is just the first step for other operations, e.g., for data compression or discovery of hidden patterns or relationships among the data items. Jain et al. present an overview of clustering from a statistical perspective [10]. Berkhin surveys clustering techniques used in data mining [11]. These techniques consider the special data mining computational requirements due to very large datasets and many attributes of different types. Given that the result of the clustering process depends on the specific clustering algorithm and its parameter settings, it is important to assess the quality of the results. This evaluation process is termed cluster validity [12], [13]. Of special interest is the work on clustering of very large datasets. Single scan versions of the well-known clustering algorithms K-means and Cobweb for large datasets are proposed in [14] and [15]. CURE [16] and BIRCH [17] are two alternative clustering algorithms based on sampling and summaries, respectively. They use only one pass over the data and hence notably reduce the execution time of clustering. However, their execution times are still significantly longer than that of the standard group-by. The main differences of the proposed similarity group-by from these algorithms are: (1) the execution times of the proposed similarity grouping operators are very close to that of the regular group-by; (2) similarity group-by operators are fully integrated with the query engine allowing the direct use of their results in complex query pipelines for further analysis; and (3) the computation of aggregation functions is integrated in the grouping process and considers all the tuples in each group, not a summary or a subset based on sampling. The last feature allows for fast generation of cluster representatives with the exact values of the aggregation functions that can be used immediately by other operators in the query pipeline. Algorithms similar to CURE or BIRCH would require extra steps to evaluate aggregation functions or to make their results available to SQL queries.

Several clustering algorithms have been implemented in data mining systems. In general, the use of clustering is via a complex data mining model and the implementation is not integrated with the standard query processing engine. The work in [18] proposes some SQL constructs to make clustering facilities available from SQL in the context of spatial data. Basically, these constructs act as wrappers of conventional clustering algorithms but no further integration with database systems is studied. Li et al. extend the group-by operator to approximately cluster all the tuples in a pre-defined number of clusters [28]. Their framework makes use of conventional clustering algorithms, e.g., K-means; and employs summaries and bitmap indexes to integrate clustering and ranking into database systems. Our study differs from [28] in that (1) we focus on similarity grouping operators independent of the support and tight coupling to ranking; (2) we introduce a framework that does not depend on possibly costly conventional clustering algorithms, but rather allows the specification of the desired grouping using descriptive properties such as group size and compactness; and (3) we consider optimization techniques of the proposed similarity group-by operators. In the context of data reconciliation, Schallehn et al. propose SQL extensions to allow the use of user-defined similarity functions for grouping purposes [25] and similarity grouping predicates [26], [27]. They focus on string similarity and similarity predicates to reconcile records. Although they can be used for this purpose, the proposed similarity group-by operators in this paper are more general and are designed to be part of a DBMS’s query engine.

The optimization techniques of similarity grouping presented in this paper build on previous work on optimization of regular aggregation queries. Larson et al. study pull-up and push-down techniques that enable the query optimizer to move aggregation operators up and down the query tree [19], [20]. These techniques allow complete [19] or partial [20] pre-aggregation that can reduce the input size of a join and consequently decrease significantly the execution time of an aggregation query. Galindo-Legaria proposes a general framework for optimization of queries with subqueries and aggregations [21]. Another technique that can provide substantial improvements in query processing is the use of materialized views to answer aggregation queries. This technique is presented in [22] for the case of sum and count aggregation functions, and is extended in [23] and [24] to arbitrary aggregation functions.

III. SIMILARITY GROUP-BY: DEFINITION

This section presents the general definition of the similarity group-by operator along with three instances that enable: (1) grouping tuples based on desired group properties, e.g., size and compactness, (2) grouping around points of interest, and (3) segmenting the tuples based on given limiting values.

A. Generic Definition

We define the similarity group-by operator as follows:

$$ {}_{(G_1,S_1),\ldots,(G_n,S_n)}\Gamma_{F_1(A_1),\ldots,F_m(A_m)}(R) $$

where R is a relation name, Gi is an attribute of R that is used to generate the groups, i.e., a similarity grouping attribute, Si is a segmentation of the domain of Gi in non-overlapping segments, Fi is an aggregation function, and Ai is an attribute of R.

The formation of groups has two steps:
1. For each tuple t, each value vi of t.Gi is replaced by the identifier of the segment (member of Si) that contains vi. If no segment contains vi, t is dismissed.
2. The resulting tuples are merged to form the similarity groups. Two tuples are in the same group if their new G1,…,Gn values are the same.

The aggregation functions Fi are applied over each group similar to a standard aggregation operation. Figure 2 illustrates an example segmentation S1 that groups a two-dimensional data set into three segments S1,1, S1,2, and S1,3 based on some notion of similarity. Let the dots in the figure represent the tuples of a relation R(G1, A1), where the value of G1 is the position of the dot and the value of A1 is the value next to the dot. The result of: [...]

Fig. 2 Example usage of the generic SGB

[...] modeled as instances of this generic definition. The generic definition is useful for reasoning with the new SGB operation and for deriving equivalences that allow the optimization of queries (as in Section IV). Naturally, this generic form of SGB is not to be implemented directly. Below, we present three implementable instances of the generic SGB. The main factors considered in the selection of the proposed instances are: (1) the ability to generate meaningful and useful groups, e.g., around a set of points of interest or groups that satisfy key properties such as group size and group compactness; (2) the viability of a fast implementation, e.g., using a single-pass plane-sweep approach; and (3) the usefulness of the instances in practical scenarios; the specific scenarios considered in this paper are: business decision support systems (Section VI-B.3) and sensor networks (Section III-B). The proposed instances represent a middle ground between the regular group-by and standard clustering algorithms. The proposed similarity group-by instances are intended to be much faster than regular clustering algorithms and generate groupings that capture similarities on the data not captured by regular group-by. On the other hand, the quality of the generated groupings is not expected to be always as high as the ones generated by more complex and costly clustering algorithms. The presentation in this section focuses on the case of one or multiple independent grouping attributes (multiple independent dimensions).

1) Unsupervised Similarity Group-by (SGB-U): This operator groups a set of tuples in an unsupervised fashion, i.e., with no extra data provided to guide the process. The SGB-U operator uses the following two clauses to control the group size and the group compactness:
• MAXIMUM_ELEMENT_SEPARATION s: If the distance between two neighbor elements (consecutive elements, for the one-dimensional case) is greater than s, then these elements belong to different groups.
• MAXIMUM_GROUP_DIAMETER d: For each formed group, the distance between the extreme elements of a group should be less than or equal to d.

The SQL syntax of the SGB-U operator is: