MSQL+: A Plugin Toolkit for Similarity Search under Metric Spaces in Distributed Relational Database Systems

Wei Lu†, Xinyi Zhang†, Zhiyu Shui†, Zhe Peng†, Xiao Zhang†, Xiaoyong Du†, Hao Huang‡, Xiaoyu Wang§, Anqun Pan§, Haixiang Li§
†School of Information and DEKE, MOE, Renmin University of China, Beijing, China
‡School of Computer Science, Wuhan University, Wuhan, China
§Tencent Inc., China
Contact: {luwei, [email protected], [email protected]

ABSTRACT
Similarity search is a primitive operation in various database applications. Thus far, a large number of access methods have been proposed to accelerate similarity query processing. Nonetheless, these methods mostly focus on developing standalone systems by proposing new indices. Given that existing RDBMS merely support traditional indices, it is of great necessity and practical importance to develop an approach that speeds up query processing using only the standard built-in indices of an RDBMS. In this demonstration, we introduce MSQL+, a plugin toolkit that enables users to answer similarity queries in metric spaces simply using standard SQL statements. This toolkit can help existing RDBMS handle big data effectively and efficiently due to the following three advantages. First, MSQL+ enables users to find similar objects by submitting SELECT-FROM-WHERE statements so that it can be easily integrated into existing RDBMS. Second, MSQL+ works in a more general data space. Objects of any type can be indexed by B+-trees and the query processing can be boosted by using index seeks, as long as the similarity function is metric. Third, MSQL+ supports the parallelization of both pre-processing and query processing in distributed RDBMS.

PVLDB Reference Format:
Wei Lu, Xinyi Zhang, Zhiyu Shui, Zhe Peng, Xiao Zhang, Xiaoyong Du, Hao Huang, Xiaoyu Wang, Anqun Pan, Haixiang Li. MSQL+: A Plugin Toolkit for Similarity Search under Metric Spaces in Distributed Relational Database Systems. PVLDB, 11 (12): 1970-1973, 2018.
DOI: https://doi.org/10.14778/3229863.3236237

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected].
Proceedings of the VLDB Endowment, Vol. 11, No. 12
Copyright 2018 VLDB Endowment 2150-8097/18/8.

1. INTRODUCTION
Similarity search works as a primitive operation in many database applications, such as approximate string search in text databases [1], location based services in spatial databases [11], face recognition in multimedia databases [10], and structural motif discovery in protein databases [9]. Given a query object $q$ and a collection of objects $R$, similarity search returns the set of objects from $R$ whose distances to $q$ are no greater than a user-defined threshold $\theta$. A naive approach to answering similarity queries is to sequentially scan each object $r \in R$ and compute the similarity between $r$ and $q$. As this naive approach is inefficient, a large number of access methods have been proposed to speed up query processing. Nevertheless, these methods still suffer from at least one of the following three drawbacks.

• Standalone. Most existing access methods focus on developing standalone systems by proposing new indices, such as M-Tree [4], D-Index [5], kd-tree, Quadtree, and Tries [2], to improve the efficiency. However, integrating these new indices into RDBMS is difficult, since existing RDBMS merely support built-in indices, typically including B+-tree, R-tree, and hash index. Some other solutions [6, 12, 7, 3] index the data with B+-trees and answer similarity queries by probing B+-trees. Nonetheless, these solutions require new index probing mechanisms which are not supported by existing RDBMS unless new APIs are implemented. Furthermore, as discussed in [8], even if these solutions can be integrated into RDBMS with newly introduced APIs, their performance may be degraded to table scans.

• Working in restricted data spaces. Many existing access methods try to improve their efficiency with new pruning rules. Nevertheless, each of these methods works in a specific data space, and extending them to other data spaces is often infeasible. For example, methods that are proposed to answer string similarity queries [12, 7] work in text spaces only, and cannot be extended to Euclidean spaces or protein spaces. As a primitive operation, a similarity search approach should be general enough to deal with various database applications in RDBMS.

• Running on centralized systems only. In the era of big data, it is imperative to utilize distributed systems to manage the ever-increasing data. In many big internet enterprises nowadays like Tencent, data are split across multiple compute nodes, and both OLTP and OLAP queries are executed over the distributed data directly. Hence, using distributed similarity processing approaches is an inevitable trend, while existing methods work in centralized systems only.

To avoid the above three drawbacks, we propose MSQL+, a plugin toolkit based on our previous work [8] that is able to answer similarity queries in distributed RDBMS fully using SQL statements. MSQL+ consists of two main phases.

• Index building. As long as the similarity function is metric, MSQL+ generates pair-wise comparable signatures for objects, and builds B+-trees to index objects. Objects with signatures within a set of intervals are taken as candidates for similarity search.

• Query processing. MSQL+ enables users to find similar objects by merely submitting SELECT-FROM-WHERE statements with two predicates. One predicate involves the similarity function, which is implemented as a user-defined function. The other predicate specifies signatures in a certain set of ranges. The latter predicate triggers multiple index seeks to filter out false positives, while the former predicate verifies the candidates.

Compared with existing solutions, MSQL+ has the following three advantages. (1) MSQL+ answers similarity queries simply using SQL statements. (2) MSQL+ works in a more general data space. (3) MSQL+ can run on both centralized and distributed RDBMS.

2. TECHNICAL BACKGROUND

2.1 Similarity Search in Metric Spaces
MSQL+ adopts the divide-and-conquer paradigm to process similarity queries. The rationale of MSQL+ is to first select $m$ objects as pivots and assign each object $r \in R$ to one and only one pivot according to a certain strategy (e.g., the pivot leading to a minimal distance). Then, the data space is split into $m$ disjoint partitions. Let $P$ be the set of selected pivots. $\forall p_i \in P$, $P_i^R$ denotes the partition whose objects take $p_i$ as their pivot. The distance $|r, p_i|$ of each object $r \in P_i^R$ to its pivot is also maintained. Then, similarity search is conducted by checking each partition $P_i^R$ individually. Following the filter-and-verify paradigm, according to Theorem 1, objects in $P_i^R$ with their distances to $p_i$ within the interval $[LB_i, UB_i]$ are taken as candidates. In this way, all the candidates are verified and similar objects are obtained.

Theorem 1. Given a partition $P_i$, $\forall r \in P_i$, the necessary condition for $|q, r| \le \theta$ is as follows.
\[
LB_i = |p_i, q| - \theta \le |p_i, r| \le |p_i, q| + \theta = UB_i.
\]

Given $r \in R$, let $r[A]$ be the set of attribute values of $r$ over $A$. To support B+-tree boosted similarity queries, MSQL+ executes in two stages, namely (1) the offline index building and (2) the online query processing. The first stage generates a B+-tree index over attributes $A$, and the second stage runs the query processing using index seeks.

2.3.1 Index Building
We build the index over the attributes with two requirements. First, the attribute values must be comparable so that they can be indexed by B+-trees. Second, the candidates for a similarity query must be identifiable by simply comparing the attribute values. An index built under these two requirements can answer similarity queries by probing the index alone. For this purpose, we propose a signature generation scheme with which we generate a signature $S(r[A])$ for each record $r \in R$. Recall that in Section 2.1, $R$ is split into $|P|$ partitions, i.e., $R = \bigcup_{i=1}^{|P|} P_i^R$, where $P_i^R$ denotes the $i$-th partition with $p_i$ as the pivot. Given a partition $P_i^R$ in $R$, $\forall r \in P_i^R$, $S(r[A])$ is defined as a pair shown below:
\[
S(r[A]) = \langle i, |r, p_i| \rangle \qquad (1)
\]
where $i$ is the partition ID and $|r, p_i|$ is the distance between $r$ and $p_i$. Given two signatures $\langle i, d \rangle$ and $\langle j, d' \rangle$, the comparison rule is as follows.
\[
\begin{cases}
\langle i, d \rangle > \langle j, d' \rangle, & \text{if } i > j \text{ or } (i = j \text{ and } d > d'); \\
\langle i, d \rangle = \langle j, d' \rangle, & \text{if } i = j \text{ and } d = d'; \\
\langle i, d \rangle < \langle j, d' \rangle, & \text{otherwise.}
\end{cases}
\]

Instead of directly building a B+-tree over $A$, we append a new attribute $I$ (i.e., signatures) to $R$ and build a B+-tree over $I$. $\forall r \in P_i^R$, we update $r[I]$ to its corresponding signature $\langle i, |r, p_i| \rangle$. Clearly, our index satisfies the above two requirements. (1) $\forall r_1, r_2 \in R$, $S(r_1[A])$ and $S(r_2[A])$ are comparable. (2) Records with their signatures in an interval, i.e., $\forall r \in R$ with $S(r[A]) \in [LB, UB]$, are taken as candidates. Additionally, because there is no intersection between ranges of signatures from different partitions, duplicated traversal of the index is avoided during the candidate identification.
