BrePartition: Optimized High-Dimensional kNN Search With Bregman Distances
Yang Song, Yu Gu, Rui Zhang, Senior Member, IEEE, and Ge Yu, Member, IEEE

Y. Song, Y. Gu, and G. Yu are with the School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China. E-mail: [email protected], {guyu, yuge}@mail.neu.edu.cn.
R. Zhang is with the School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia. E-mail: [email protected].
Manuscript received 17 Oct. 2019; revised 15 Mar. 2020; accepted 30 Apr. 2020. (Corresponding author: Ge Yu.) Recommended for acceptance by G. Chen. Digital Object Identifier no. 10.1109/TKDE.2020.2992594.
Abstract—Bregman distances (also known as Bregman divergences) are widely used in machine learning, speech recognition and signal processing, and kNN searches with Bregman distances have become increasingly important with the rapid advances of multimedia applications. Data in multimedia applications such as images and videos are commonly transformed into spaces of hundreds of dimensions. Such high-dimensional spaces have posed significant challenges for existing kNN search algorithms with Bregman distances, which can only handle data of medium dimensionality (typically less than 100). This paper addresses the urgent problem of high-dimensional kNN search with Bregman distances. We propose a novel partition-filter-refinement framework. Specifically, we propose an optimized dimensionality partitioning scheme that solves several non-trivial issues. First, we derive an effective bound for each partitioned subspace so that exact kNN results can be obtained. Second, we conduct an in-depth analysis of the optimized number of partitions and devise an effective partitioning strategy. Third, we design an efficient integrated index structure over all the subspaces together to accelerate the search processing. Moreover, we extend our exact solution to an approximate version via a trade-off between accuracy and efficiency. Experimental results on four real-world datasets and two synthetic datasets show the clear advantage of our method in comparison to state-of-the-art algorithms.
Index Terms—Bregman distance, high-dimensional, kNN search, dimensionality partitioning
1 INTRODUCTION
BREGMAN distances (also called Bregman divergences), as a generalization of a wide range of non-metric distance functions (e.g., Squared Mahalanobis Distance and Itakura-Saito Distance), play an important role in many multimedia applications such as image and video analysis and retrieval, speech recognition and time series analysis [1], [2], [3]. Metric measurements (such as Euclidean distance) satisfy the basic properties of metric space, namely non-negativity, symmetry and the triangle inequality. Although empirically proved successful, the metric measurements represented by Euclidean distance are actually inconsistent with human perception of similarity [4], [5]. Examples from [6] and [4] illustrate that the distance measurement is not metric when comparing images. As can be seen in Fig. 1a, the moon and the apple are similar in shape, and the pentagram and the apple are similar in color, but there is no similarity between the moon and the pentagram. In this case, our perception of similarity violates the triangle inequality and illustrates that human beings are often comfortable with non-metric dissimilarity measurements instead of metric ones, especially on complex data types [6], [7]. Likewise, as shown in Fig. 1b [4], both the man and the horse are perceptually similar to their composition, but the two obviously differ from each other. Therefore, it is not appropriate to employ Euclidean distance as the distance measurement in many practical scenarios.

Fig. 1. Examples.

Since Bregman distances have the capability of exploring the implicit correlations of data features [8], they have been widely used in recent decades in a variety of applications, including image retrieval, image classification and sound processing [7], [9], [10]. Over the last several years, they have also been used in many practical applications. For example, they are employed to measure the closeness between Hermitian Positive-Definite (HPD) matrices to realize target detection in clutter [11]. They are also used as the similarity measurements of the registration functional to combine various types of image characterization as well as spatial and statistical information in image registration [12], and to apply multi-region information to express the global information in image segmentation [13]. In addition, they are applied in graph embedding, matrix factorization and tensor factorization in the field of social network analysis [14]. Among the operations that employ Bregman distances, kNN queries are demanded as a core primitive or a fundamental step in the aforementioned applications [15], [16], [17].

In addition, data in multimedia applications such as images and videos are commonly transformed into spaces of hundreds of dimensions, such as the commonly used datasets (Audio, Fonts, Deep, SIFT, etc.) illustrated in our experimental evaluation. Existing approaches [6], [18] for kNN searches with Bregman distances focus on designing index structures for Bregman distances, but these index structures perform poorly in high-dimensional space due to either large overlaps between clusters or intensive computation.
Aiming at these situations, this paper addresses the urgent problem of high-dimensional kNN search with Bregman distances, and we propose a partition-filter-refinement framework. We first partition a high-dimensional space into many low-dimensional subspaces. Then, range queries in all the subspaces are performed to generate candidates. Finally, the exact kNN results are evaluated by refinement from the candidates. However, realizing such a framework requires solving the following three challenges:

Bound: In metric space, bounds are usually derived based on the triangle inequality, but in non-metric space the triangle inequality does not hold. Therefore, it is challenging to derive an efficient bound for Bregman distances, which are not metric.

Partition: How to partition the dimensions to obtain the best efficiency is a challenge. We need to work out both how many partitions to use and at which dimensions to partition.

Index: It is challenging to design an I/O-efficient index that handles all the dimension partitions in a unified manner. Specifically, by effectively organizing the data points on disks, the index should adapt to our proposed partition strategy and facilitate the data's reusability across partitions.

To address the above challenges, we make the following contributions in this paper:

- We derive the upper bounds between the given query point and an arbitrary data point in each subspace mathematically based on the Cauchy inequality, and the proper upper bounds are selected as the searching bounds for these subspaces. The final candidate set is the union of the candidate subsets of all the subspaces. We theoretically prove that the kNN candidate set obtained following our bound contains the kNN results.

- For dimensionality partitioning, we observe a trade-off between the number of partitions and the search time. Therefore, we derive the algorithm's time complexity and the optimized number of partitions in theory. Furthermore, a strategy named Pearson Correlation Coefficient-based Partition (PCCP) is proposed to reduce the size of the candidate set by partitioning highly-correlated dimensions into different subspaces.

- After dimensionality partitioning, we employ BB-trees in the partitioned low-dimensional subspaces and design an integrated and disk-resident index structure, named BB-forest. BB-trees can handle low-dimensional data efficiently, so they work well in our framework since the data has been partitioned into low-dimensional subspaces. In addition, the data points are well-organized on the disks based on our proposed PCCP to improve the data's reusability across the subspaces, so that the I/O cost can be reduced.

- In order to improve the search efficiency while ensuring comparable accuracy with a probability guarantee, we make a trade-off between the efficiency and the accuracy and propose a solution to approximate kNN search based on the data distribution.

- Extensive experimental evaluations demonstrate the high efficiency of our approach. Our algorithm, named BrePartition, clearly outperforms state-of-the-art algorithms in running time and I/O cost.

The rest of the paper is structured as follows. Section 2 presents the related works. The preliminaries and overview are discussed in Section 3. We present the derivation of the upper bound in Section 4 and the dimensionality partitioning scheme in Section 5. The index structure, BB-forest, is described in Section 6. We present the overall framework in Section 7. The extended solution to approximate kNN search is presented in Section 8. Experimental results are discussed in Section 9. Finally, we conclude our work in Section 10.

2 RELATED WORKS

kNN search is a fundamental problem in many application domains. Here we review existing works on the kNN search problem in both metric and non-metric spaces.

2.1 Metric Similarity Search

The metric similarity search problem is a classic topic and a plethora of methods exist for speeding up nearest neighbor retrieval. Existing methods include tree-based data structures such as the KD-tree [19], R-tree [20] and B+-tree variations [21], [22], [23], as well as transformation-based methods including Product Quantization (PQ)-based methods [24], [25], the Locality Sensitive Hashing (LSH) family [26], [27], [28], [29] and other similarity search methods based on various data embedding or dimensionality reduction techniques [30], [31], [32]. These methods cannot be utilized in non-metric space, where the metric postulates, such as symmetry and the triangle inequality, are not followed.

2.2 Bregman Distance-Based Similarity Search

Due to the widespread use of Bregman distances in multimedia applications, a growing body of work is tailored for the kNN search with Bregman distances. The prime technique is the Bregman Voronoi diagram derived by Nielsen et al. [33]. Soon after that, the Bregman Ball tree (BB-tree) was introduced by Cayton [18]. BB-trees are built by a hierarchical space decomposition via k-means, sharing a similar flavor with the KD-tree. In a BB-tree, the clusters appear as Bregman balls, and the filtering condition on the Bregman distance from a query to a Bregman ball is computed in the dual space for pruning out portions of the search space. Nielsen et al. [34] extend the BB-tree to symmetrized Bregman distances. Cayton [35] explores an algorithm to solve range queries based on the BB-tree. Nevertheless, in higher dimensions considerable overlap between clusters is incurred and too many nodes have to be traversed during a kNN search. Therefore, the efficiency of the BB-tree degrades dramatically, sometimes becoming even worse than a linear search. Zhang et al. [6] devise a novel solution to handle the class of Bregman distances by expanding and mapping data points in the original space to a new extended space. It employs typical index structures, the R-tree and the VA-file, to solve exact similarity search problems. But it is also inefficient for more than 100 dimensions, because too many returned candidates in a filter-refinement model lead to intensive computation overhead of Bregman distances.
Towards more efficient search processing, increasing attention has been paid to approximate search methods for Bregman distances [4], [18], [34], [36], [37], [38]. These approximate methods achieve their efficiency gains at the price of losing accuracy. For example, the state-of-the-art approximate solution [34], which is designed for high-dimensional space, exploits the data distribution and employs a variational approximation to estimate the explored nodes during backtracking in the BB-tree. Nevertheless, none of these methods can provide precision guarantees, and some of them cannot be applied to high-dimensional space [18], [36], [37], [38].

TABLE 1
Non-Metric Search Methods

Name                          NM  BDS  HD  Exact
BrePartition (Our solution)   √   √    √   √
Zhang et al. [6]              √   √        √
BB-tree [18], [35]            √   √        √
Bregman voronoi diagram [33]  √   √        √
BB-tree variants [34]         √   √
Ackermann et al. [36]         √   √
Abdullah et al. [37]          √   √
Coviello et al. [34]          √   √    √
Ferreira et al. [38]          √   √
Non-metric LSH [4]            √   √    √
FastMap [39]                  √        √
Wang et al. [40]              √        √
Boostmap [41]                 √
Athitsos et al. [42]          √        √
TriGen [43]                   √        √   √
NM-tree [44]                  √
Liu et al. [45]               √
LCE [46]                      √
DBH [47]                      √        √
Dyndex [48]                   √        √
HNSW [49]                     √        √
2.3 Other Non-Metric Similarity Search Methods

There also exist many works in the context of non-metric similarity search that are not limited to Bregman distances. Space-embedding techniques [39], [40], [41], [42] embed non-metric spaces into a Euclidean one, where two points that are close to each other in the original space are more likely to be close to each other in the new space. Distance-mapping techniques transform the non-metric distance by modifying the distance function itself while preserving the original distance orderings. Skopal [43] develops the TriGen algorithm to derive an efficient mapping function among concave functions by using the distance distribution of the database. The NM-tree [44] combines the M-tree and the TriGen algorithm for non-metric search. Liu et al. [45] propose a simulated-annealing-based technique to derive optimized transform functions while preserving the original similarity orderings. Chen et al. [46] employ constant shifting embedding with a suitable clustering of the dataset for more effective lower bounds. Recently, a representative technique based on Distance-Based Hashing (DBH) [47] has been presented, in which a general similarity indexing method for non-metric distance measurements is designed by optimizing hashing functions. In addition, Dyndex [48], the most impressive classification-based technique, answers similarity search by categorizing points into classes and classifying the query point. These methods degrade dramatically in performance when dealing with high-dimensional data. There also exists an approximate solution called the Navigable Small World graph with controllable Hierarchy (HNSW) [49], which can be extended to non-metric space. However, it is not a disk-resident solution, while we mainly focus on disk-resident solutions in this paper.

We summarize the properties of representative non-metric search methods in Table 1. In Table 1, NM means that the method adopts the distance functions of non-metric space instead of metric space, BDS means that the method is designed specifically for Bregman distances, HD means that the method works well in high-dimensional space (more than 100 dimensions), and Exact means that the method returns exact kNN results. Our proposed solution BrePartition is the first algorithm that possesses all four desired properties.

3 PRELIMINARIES AND OVERVIEW

3.1 Bregman Distance

Given a d-dimensional vector space S, a query y = (y_1, y_2, ..., y_d) and an arbitrary data point x = (x_1, x_2, ..., x_d), the Bregman distance between x and y is defined as D_f(x, y) = f(x) - f(y) - <∇f(y), x - y>, where f(·) is a convex function mapping points in S to real numbers, ∇f(y) is the gradient of f(·) at y, and <·, ·> denotes the dot product between two vectors. When different convex functions are employed, Bregman distances define several well-known distance functions. Some representatives are:

Squared Euclidean Distance: The given f(x) = Σ_i x_i^2 generates D_f(x, y) = Σ_i (x_i - y_i)^2, i.e., the squared Euclidean distance.

Squared Mahalanobis Distance: The given f(x) = (1/2) x^T Q x generates D_f(x, y) = (1/2)(x - y)^T Q (x - y), which can be considered as a generalization of the above squared Euclidean distance.

Itakura-Saito Distance (ISD): When the given f(x) = -Σ_i log x_i, the distance is the IS distance, denoted by D_f(x, y) = Σ_i (x_i/y_i - log(x_i/y_i) - 1).

Exponential Distance (ED): When the given f(x) = e^x, the Bregman distance is represented as D_f(x, y) = e^x - (x - y + 1)e^y. In this paper, we name it the exponential distance.

In addition, our method can be applied to most measures belonging to Bregman distances, such as Shannon entropy, Burg entropy, l_p-quasi-norm and l_p-norm, except the KL-divergence, since it is not cumulative after the dimensionality partitioning.
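To make these definitions concrete, the following minimal Python sketch (ours, for illustration only; the helper names are not from the paper) evaluates D_f(x, y) = f(x) - f(y) - <∇f(y), x - y> for three of the generators above, applying f coordinate-wise and summing, which is the cumulative form the dimensionality partitioning relies on.

    import numpy as np

    def bregman(f, grad_f, x, y):
        # D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>
        return f(x) - f(y) - np.dot(grad_f(y), x - y)

    f_se = lambda v: np.sum(v ** 2)        # squared Euclidean generator
    g_se = lambda v: 2.0 * v
    f_is = lambda v: -np.sum(np.log(v))    # Itakura-Saito generator (v > 0)
    g_is = lambda v: -1.0 / v
    f_ed = lambda v: np.sum(np.exp(v))     # exponential-distance generator
    g_ed = lambda v: np.exp(v)

    x = np.array([0.5, 1.0, 2.0])
    y = np.array([1.0, 1.0, 1.0])
    print(bregman(f_se, g_se, x, y))  # 1.25, equals sum_i (x_i - y_i)^2
    print(bregman(f_is, g_is, x, y))  # equals sum_i (x_i/y_i - log(x_i/y_i) - 1)
    print(bregman(f_ed, g_ed, x, y))  # equals sum_i (e^{x_i} - (x_i - y_i + 1) e^{y_i})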
TABLE 2
Frequently Used Symbols
Symbol     Explanation
S          dataset
n          number of data points
d          dimensionality of each data point
k          number of returned points that users require
x          data point
y          query point
P(x)       transformed data point
Q(y)       transformed query point
D_f(x, y)  the Bregman distance between x and y
UB(x, y)   the upper bound of the Bregman distance between x and y
M          number of partitions

3.2 Overview

Our method consists of the precomputation and the search processing. In the precomputation, we first partition the dimensions (described in Section 5). Second, we construct BB-trees in the partitioned subspaces and integrate them to form a BB-forest (described in Section 6). Third, we transform the data points into tuples for computing the searching bound (described in Section 4). During the search processing, we first transform the query point into a triple and compute the bound used for the range query (described in Section 4). Second, we perform the range query for the candidates. Finally, the kNN results are evaluated from these candidates. The whole process is illustrated in Fig. 2. We summarize the frequently used symbols in Table 2.

Fig. 2. Overview.

4 DERIVATION OF BOUND

For kNN queries, it is crucial to avoid an exhaustive search by deriving an effective bound as a pruning condition. We exploit the properties of Bregman distances, and an upper bound is derived from the Cauchy inequality.

Given a dataset D, suppose x = (x_1, ..., x_d)^T and y = (y_1, ..., y_d)^T are two d-dimensional vectors in D. After dimensionality partitioning, x is partitioned into M disjoint subspaces, represented by M subvectors:

    x_i = (x_{⌈d/M⌉(i-1)+1}, ..., x_{⌈d/M⌉·i})^T = (x_{i1}, ..., x_{i⌈d/M⌉})^T,

where 1 ≤ i ≤ M. Vector y is partitioned in the same manner, and each part is denoted by y_i (1 ≤ i ≤ M). By the Cauchy inequality, we can prove Theorem 1 below, which can be used to derive the upper bound between arbitrary x_i and y_i (1 ≤ i ≤ M) in the same subspace.

Theorem 1. The upper bound between x_i and y_i (1 ≤ i ≤ M) is derived as

    D_f(x_i, y_i) ≤ α_x^(i) + α_y^(i) + β_yy^(i) + sqrt(γ_x^(i) · δ_y^(i)).

For simplicity, α_x^(i), α_y^(i), β_xy^(i), β_yy^(i), γ_x^(i) and δ_y^(i) are used to mark these formulas:

    α_x^(i) = Σ_{j=1}^{⌈d/M⌉} f(x_{ij}),    α_y^(i) = -Σ_{j=1}^{⌈d/M⌉} f(y_{ij}),

    β_xy^(i) = Σ_{j=1}^{⌈d/M⌉} (∂f(y)/∂y_{ij}) x_{ij},    β_yy^(i) = Σ_{j=1}^{⌈d/M⌉} (∂f(y)/∂y_{ij}) y_{ij},

    γ_x^(i) = Σ_{j=1}^{⌈d/M⌉} x_{ij}^2,    δ_y^(i) = Σ_{j=1}^{⌈d/M⌉} (∂f(y)/∂y_{ij})^2.

Proof. Please see Section 1 in the supplementary file, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2020.2992594.
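As a quick numerical sanity check of Theorem 1 (our own sketch, not code from the paper), the snippet below computes the quantities above for the Itakura-Saito generator on a single subspace and confirms that the Cauchy-Schwarz step, -β_xy^(i) ≤ sqrt(γ_x^(i) · δ_y^(i)), makes the stated bound hold.

    import numpy as np

    rng = np.random.default_rng(0)
    xi = rng.uniform(0.5, 2.0, size=8)   # one subspace of a data point
    yi = rng.uniform(0.5, 2.0, size=8)   # the matching subspace of the query

    # Itakura-Saito generator, applied per coordinate: f(t) = -log t, f'(t) = -1/t
    f  = lambda t: -np.log(t)
    fp = lambda t: -1.0 / t

    alpha_x = np.sum(f(xi))
    alpha_y = -np.sum(f(yi))
    beta_yy = np.sum(fp(yi) * yi)
    gamma_x = np.sum(xi ** 2)
    delta_y = np.sum(fp(yi) ** 2)

    # Exact Bregman distance in this subspace
    d_exact = np.sum(f(xi)) - np.sum(f(yi)) - np.sum(fp(yi) * (xi - yi))

    ub = alpha_x + alpha_y + beta_yy + np.sqrt(gamma_x * delta_y)
    assert d_exact <= ub + 1e-12   # Cauchy-Schwarz guarantees the inequality
    print(d_exact, ub)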
Based on Theorem 1, we can derive the upper bound between two arbitrary data points in each subspace. Furthermore, we can obtain the upper bound between two data points in the original high-dimensional space by summing up the upper bounds from all the M subspaces. Theorem 2 proves it.

Theorem 2. The Bregman distance between x and y is bounded by the sum of the upper bounds from the corresponding subspaces. Formally, D_f(x, y) ≤ UB(x, y), where UB(x, y) = Σ_{i=1}^{M} UB(x_i, y_i).

Proof. Please see Section 2 in the supplementary file, available online.

Therefore, when dealing with a kNN search, we first compute the upper bounds between the given query point and every data point in the dataset. Then the kth smallest upper bound is selected, and its components are used as the searching bounds to perform range queries in their corresponding subspaces as the filter processing. The union of the search result sets of all the subspaces is the final candidate set for the refinement processing. Theorem 3 proves that all the kNN points are in the final candidate set.

Theorem 3. The candidate set of each subspace is denoted by C_i (1 ≤ i ≤ M), and the final candidate set is

    C = C_1 ∪ C_2 ∪ ... ∪ C_M.

The kNN points of the given query point y exist in C.

Proof. Please see Section 3 in the supplementary file, available online.

Based on the upper bound derived in Theorem 1, the two components α_x and γ_x of the data points in all the subspaces can be computed offline. This can be considered as a precomputation in which each partitioned multi-dimensional data point in each subspace is transformed into a two-dimensional tuple denoted by P(x) = (α_x, γ_x) (see Fig. 3 for an illustration). Once the query point is given, we only need to compute the values of α_y, β_yy and δ_y, which can be considered as a triple denoted by Q(y) = (α_y, β_yy, δ_y), to obtain the upper bounds, and this computation overhead is very small. Relying on this precomputation, the search processing is dramatically accelerated.

Fig. 3. Transformation.
Algorithm 1 describes the process of computing the upper bound between two arbitrary data points from their corresponding transformed vectors. Algorithms 2 and 3 describe the transformations of the partitioned data point and the partitioned query point, respectively. Algorithm 4 describes the process of determining the searching bound from each subspace.

Algorithm 1. UBCompute (P(x), Q(y))
Input: the transformed vectors of x and y, P(x) = (α_x, γ_x) and Q(y) = (α_y, β_yy, δ_y).
Output: the upper bound of x and y, ub.
1: ub = α_x + α_y + β_yy + sqrt(γ_x · δ_y);
2: return ub;

Algorithm 2. PTransform (X)
Input: the subvectors' set of a partitioned data point X = {x_1, x_2, ..., x_M}.
Output: the transformed tuples' set of a partitioned data point P = {P(x_1), P(x_2), ..., P(x_M)}.
1: for i = 1 to M do
2:   Compute α_x^(i) and γ_x^(i);
3:   P(x_i) = (α_x^(i), γ_x^(i));
4: end for
5: P = {P(x_1), P(x_2), ..., P(x_M)};
6: return P;

Algorithm 3. QTransform (Y)
Input: the subvectors' set of the partitioned query point Y = {y_1, y_2, ..., y_M}.
Output: the transformed tuples' set of the partitioned query point Q = {Q(y_1), Q(y_2), ..., Q(y_M)}.
1: for i = 1 to M do
2:   Compute α_y^(i), β_yy^(i) and δ_y^(i);
3:   Q(y_i) = (α_y^(i), β_yy^(i), δ_y^(i));
4: end for
5: Q = {Q(y_1), Q(y_2), ..., Q(y_M)};
6: return Q;

Algorithm 4. QBDetermine (S_t, Q)
Input: the transformed dataset S_t = {P_1, P_2, ..., P_n}, the transformed query point Q.
Output: the set containing the M subspaces' searching bounds, QB.
1: UB = ∅, QB = ∅;
2: for i = 1 to n do
3:   ub_i = 0, QB_i = ∅;
4:   for j = 1 to M do
5:     ub_ij = UBCompute(P_ij, Q_j); // Algorithm 1
6:     ub_i = ub_i + ub_ij;
7:     QB_i = QB_i ∪ {ub_ij};
8:   end for
9:   UB = UB ∪ {ub_i};
10: end for
11: Sort UB and find the kth smallest ub_t (1 ≤ t ≤ n);
12: QB = QB_t = {ub_t1, ub_t2, ..., ub_tM};
13: return QB;
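For concreteness, the sketch below (ours, illustrative; it assumes the Itakura-Saito generator, and the function names are hypothetical) implements the transformation and bound-determination steps of Algorithms 1-4 with numpy, vectorized over all n points and M subspaces at once.

    import numpy as np

    # Itakura-Saito generator applied per coordinate: f(t) = -log t, f'(t) = -1/t
    f  = lambda t: -np.log(t)
    fp = lambda t: -1.0 / t

    def p_transform(X, M):
        # X: (n, d) data matrix; returns the tuples (alpha_x, gamma_x), each of shape (n, M)
        parts = np.array_split(X, M, axis=1)
        alpha = np.stack([f(p).sum(axis=1) for p in parts], axis=1)
        gamma = np.stack([(p ** 2).sum(axis=1) for p in parts], axis=1)
        return alpha, gamma

    def q_transform(y, M):
        # y: (d,) query; returns the triples (alpha_y, beta_yy, delta_y), each of shape (M,)
        parts = np.array_split(y, M)
        alpha = np.array([-f(p).sum() for p in parts])
        beta  = np.array([(fp(p) * p).sum() for p in parts])
        delta = np.array([(fp(p) ** 2).sum() for p in parts])
        return alpha, beta, delta

    def qb_determine(alpha_x, gamma_x, q_triple, k):
        a_y, b_yy, d_y = q_triple
        # UBCompute for every point and subspace at once: (n, M) matrix of ub_ij
        ub = alpha_x + a_y + b_yy + np.sqrt(gamma_x * d_y)
        totals = ub.sum(axis=1)                     # UB(x, y) per point (Theorem 2)
        t = np.argpartition(totals, k - 1)[k - 1]   # point with the kth smallest total
        return ub[t]                                # searching bound per subspace

    X = np.random.default_rng(1).uniform(0.5, 2.0, size=(1000, 32))
    y = np.random.default_rng(2).uniform(0.5, 2.0, size=32)
    alpha_x, gamma_x = p_transform(X, M=4)
    qb = qb_determine(alpha_x, gamma_x, q_transform(y, M=4), k=10)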
5 DIMENSIONALITY PARTITIONING

According to the Cauchy inequality, we can prove that

    Σ_{i=1}^{M_1} sqrt(γ_x^(i) · δ_y^(i)) > Σ_{i=1}^{M_2} sqrt(γ_x^(i) · δ_y^(i)),

when M_1 < M_2, where the γ_x^(i) and δ_y^(i) on the two sides are computed over the M_1-way and M_2-way partitionings, respectively. This indicates an exponential relationship between the upper bound and the number of partitions: more partitions yield a tighter bound. However, more partitions lead to more online computation overhead. Therefore, it is an unresolved issue to determine the number of partitions. Besides, how to partition the dimensions may also affect the efficiency. In this section, the optimized number of partitions contributing to the best efficiency is derived theoretically, and a heuristic strategy is devised for the purpose of reducing the candidate set by partitioning.

5.1 Number of Partitions

More partitions lead to a tighter bound, which intuitively indicates a smaller candidate set. But more partitions also incur more computation overhead when calculating the upper bounds from more subspaces. There is thus a trade-off between the number of partitions and the efficiency. In this part, we theoretically analyze the online processing to derive the optimized number of partitions for the best efficiency. The online processing mainly includes two parts: one is to compute the upper bounds between the given query point and every data point in the dataset and sort these upper bounds to determine the searching bound; the other is to obtain the candidate points by performing range queries and refine the candidate points for the results. Below we derive the time complexities of the online kNN search.

As a prerequisite, we transform the given query point into its triples Q(y_i) = (α_y^(i), β_yy^(i), δ_y^(i)) (1 ≤ i ≤ M); these triples in the M partitions can be computed in O(d) time. Since we have already transformed the multi-dimensional data points in each subspace into tuples, we can compute the upper bounds in O(Mn) time over all M subspaces. The time complexity of summing up all n points' upper bounds is also O(Mn), and the time complexity of sorting them to find the kth smallest one is O(n log k). Therefore, the whole time complexity of transforming the query point, computing the upper bounds, summing them up and sorting them to find the kth smallest one is O(d + Mn + n log k).
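To give a feel for the filter-stage cost O(d + Mn + n log k), the toy calculation below (ours; the values of n, d and k are arbitrary assumptions, not measurements from the paper) tabulates the operation count for a range of partition numbers M. The Mn term grows linearly in M, which is exactly the overhead that the tighter bound, and hence the smaller candidate set, must offset, so the optimized M balances the two effects.

    import math

    n, d, k = 1_000_000, 400, 50
    for M in (1, 2, 4, 8, 16, 32):
        filter_ops = d + M * n + n * math.log2(k)   # d + Mn + n log k (base irrelevant)
        print(f"M={M:2d}  filter cost ~ {filter_ops:.3e} operations")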