Dimensional Testing for Reverse k-Nearest Neighbor Search
Published in: Proceedings of the VLDB Endowment, 10(7), 769–780, 2017. DOI: 10.14778/3067421.3067426. Document license: CC BY-NC-ND 4.0.

Guillaume Casanova (ONERA-DCSD, France); Elias Englmeier (LMU Munich, Germany); Michael E. Houle (NII, Tokyo, Japan); Peer Kröger (LMU Munich, Germany); Michael Nett (Google Japan); Erich Schubert (Heidelberg University, Germany); Arthur Zimek (SDU, Odense, Denmark)

ABSTRACT

Given a query object q, reverse k-nearest neighbor (RkNN) search aims to locate those objects of the database that have q among their k-nearest neighbors. In this paper, we propose an approximation method for solving RkNN queries, where the pruning operations and termination tests are guided by a characterization of the intrinsic dimensionality of the data. The method can accommodate any index structure supporting incremental (forward) nearest-neighbor search for the generation and verification of candidates, while avoiding impractically high preprocessing costs. We also provide experimental evidence that our method significantly outperforms its competitors in terms of the tradeoff between execution time and the quality of the approximation. Our approach thus addresses many of the scalability issues surrounding the use of previous methods in data mining.

1. INTRODUCTION

The reverse k-nearest neighbor (RkNN) similarity query — that is, the computation of all objects of a data set that have a given query object amongst their respective k-nearest neighbors (kNNs) — is a fundamental operation in data mining. Even though RkNN and kNN queries may seem to be equivalent at first glance, they require different algorithms and data structures for their implementation. Intuitively speaking, RkNN queries attempt to determine those data objects that are most ‘influenced’ by the query object. Such notions of influence arise in many important applications. In graph mining, the degree of hubness of a node [46] can be computed by means of RkNN queries. RkNN queries are also crucial for many existing data mining models, particularly in the areas of clustering and outlier detection [18, 27, 37]. For dynamic scenarios such as data warehouses and data streams, RkNN queries serve as a basic operation for determining those objects that would potentially be affected by a particular data update operation, particularly for those applications that aim to track the changes of clusters and outliers [1, 36, 35].

Bichromatic RkNN queries have also been proposed, in which the data objects are divided into two types, where queries on the objects of the first type are to return reverse neighbors drawn from the set of objects of the second type [29, 48, 50]. Typical applications of bichromatic queries arise in situations where one object type represents services, and the other represents clients. The methods suggested in [49] and [9] provide ways of computing monochromatic or bichromatic RkNN queries for 2-dimensional location data.
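To make the asymmetry between forward and reverse neighborhoods concrete, the following minimal brute-force sketch computes a monochromatic RkNN query directly from the definition. It is a baseline illustration only, not the method proposed in this paper; the function names are ours, and the query object is assumed not to be part of the data set.

    import numpy as np

    def knn_dist(data, i, k):
        # Distance from data[i] to its k-th nearest neighbor (excluding itself).
        d = np.linalg.norm(data - data[i], axis=1)
        d[i] = np.inf
        return np.partition(d, k - 1)[k - 1]

    def rknn_query(data, q, k):
        # v is a reverse k-nearest neighbor of q iff q falls within v's kNN
        # distance; unlike a kNN result, the answer set has no fixed size
        # and may even be empty.
        return [i for i, v in enumerate(data)
                if np.linalg.norm(q - v) <= knn_dist(data, i, k)]

    # Example: 100 random points in 3 dimensions, k = 5.
    rng = np.random.default_rng(0)
    X = rng.random((100, 3))
    print(rknn_query(X, rng.random(3), k=5))

Each query here costs O(n^2) distance computations, since every object's kNN distance is recomputed from scratch; this is precisely the cost that the index-based and approximation-based strategies surveyed below try to avoid.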
Virtually all existing RkNN query strategies aggregate information from the kNN sets (the ‘forward’ neighborhoods) of data objects in order to determine and verify RkNN candidate objects. Some strategies require the computation of exact or approximate kNN distances for all data objects [29, 51, 3, 44, 2, 4], whereas others apply distance-based pruning techniques with index structures such as R-trees [17] for Euclidean data, or M-trees [10] for general metric data [41, 33, 40, 43]. Although exact kNN sets are required for the determination of exact RkNN query results, their use generally involves significant overheads for precomputation and storage, especially if the value of k is not known beforehand.

In recent years, data sets have grown considerably in complexity, with the feature spaces often being of high dimensionality. For the many data mining applications that depend on indices for similarity search, overall performance can be sharply limited by the curse of dimensionality. For reverse neighborhood search, the impact of the curse of dimensionality is felt in the high cost of computing the forward kNN distances needed for the generation and verification of RkNN candidates. Most practical RkNN approaches work with approximations to the true kNN distance values. The quality of the approximate RkNN results, and thus the quality of the method as a whole, naturally depends heavily on the accuracy of the kNN distance approximation.
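The dependence on approximation accuracy can be seen in the standard filter-and-refine pattern. The sketch below is a generic illustration, not code from any of the cited methods; it assumes that lower and upper bounds on each object's true kNN distance are available (as maintained, in different forms, by several of the approaches above), and the bound names are illustrative.

    def classify_candidate(dist_qv, kdist_lower, kdist_upper):
        # Three-way filter test for an object v against query q, given
        # bounds kdist_lower <= kdist(v) <= kdist_upper on v's true
        # kNN distance.
        if dist_qv <= kdist_lower:
            return "hit"     # certainly an RkNN of q: no refinement needed
        if dist_qv > kdist_upper:
            return "miss"    # certainly not an RkNN of q: prune v
        return "refine"      # undecided: v's exact kNN distance must be computed

The tighter the interval between the two bounds, the fewer objects fall into the expensive "refine" case; with exact kNN distances the test collapses to the criterion formalized in Section 2.1 below.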
Theoreticians and practitioners have put much effort into finding workarounds for the difficulties caused by the curse of dimensionality. One approach to the problem has been to characterize the difficulty of datasets not in terms of the representational dimension of the data, but instead in terms of a more compact model of the data. Such models have often (explicitly or implicitly) relied on such parameters as the dimension of the surface or manifold that best approximates the data, or the minimum number of latent variables needed to represent the data. The numbers of such parameters can be regarded as an indication of the intrinsic dimensionality (ID) of the data set, which is typically much lower than the representational dimension. Intrinsic dimensionality thus serves as an important natural measure of the complexity of data. For a general reference on intrinsic dimensional measures, see for example [13].

Recently, intrinsic dimensionality has been considered in the design and analysis of similarity search applications, in the form of the expansion dimension of Karger and Ruhl [28] and its variant, the generalized expansion dimension (GED) [22]. The GED is a measure of dimensionality based on the rate of growth in the cardinality of a neighborhood as its radius increases — it can be regarded as an estimator of the extreme-value-theoretic local intrinsic dimension (LID) of distance distributions [20, 5], which assesses the rate of growth of the probability measure (‘intrinsic volume’) captured by a ball with expanding radius centered at a location of interest. LID is in turn related to certain measures of the overall intrinsic dimensionality of data sets, including the classical Hausdorff dimension [34], correlation dimension [16, 42], and other forms of fractal dimension [13]. Originally formulated solely for the analysis of similarity search indices [6, 28], the expansion dimension and its variants have recently been used in the analysis of the performance of a projection-based heuristic for outlier detection [12], and of approximation algorithms for multistep similarity search with lower-bounding distances [23, 24, 25].

[…] terminology. The algorithm is presented in Section 4, followed by its analysis in Section 5. Section 7 describes the experimental framework used for the comparison of methods, and the results of the experimentation are discussed in Section 8. The presentation concludes in Section 9.

2. RELATED WORK

2.1 Methods with Heavy Precomputation

The basic observation motivating all RkNN methods is that if the distance of an object v to the query object q is smaller than the actual kNN distance of v, then v is an RkNN of q. This relationship between RkNN queries and the kNN distances of data objects was first formalized in [29]. The authors proposed a solution (originally for the special case k = 1) in which each database object is transformed into the smallest possible hypersphere covering the k-neighborhood of the object. For this purpose, the (exact) kNN distance of each object must be precomputed. By indexing the obtained collection of hyperspheres in an R-Tree [17], the authors reduced the RkNN problem to that of answering point containment queries for hyperspheres. The authors of the RdNN-Tree [51] extend this idea: at each index node, the maximum of the kNN distances of the points (the hypersphere radii) within the subtree rooted at that node is aggregated. In this way, smaller bounding rectangles can be maintained at the R-Tree nodes, with the potential of higher pruning power when processing queries.

An obvious limitation of the aforementioned solutions is that an independent R-Tree would be required for each possible value of k — which is generally infeasible in practice. Approaches such as [3, 44, 2, 4] attempt to overcome this limitation.
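To illustrate the subtree pruning that the RdNN-Tree's aggregated kNN distances enable, here is a minimal self-contained sketch under simplified assumptions (hypothetical field names, Euclidean distance, and exact precomputed kNN distances for a single fixed k; this is not the authors' implementation). A subtree can be discarded as soon as the minimum possible distance from the query to its bounding rectangle exceeds the largest kNN distance stored for that subtree.

    from dataclasses import dataclass
    from typing import List
    import math

    @dataclass
    class Point:
        vector: List[float]
        kdist: float              # precomputed exact kNN distance of this point

    @dataclass
    class Node:
        mbr_lo: List[float]       # lower corner of the bounding rectangle
        mbr_hi: List[float]       # upper corner of the bounding rectangle
        max_kdist: float          # max kNN distance over all points in the subtree
        points: List[Point]       # filled at leaves
        children: List["Node"]    # filled at inner nodes

    def mindist(q: List[float], lo: List[float], hi: List[float]) -> float:
        # Minimum Euclidean distance from q to the bounding rectangle.
        return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                             for x, l, h in zip(q, lo, hi)))

    def rknn_search(node: Node, q: List[float], out: List[Point]) -> None:
        if mindist(q, node.mbr_lo, node.mbr_hi) > node.max_kdist:
            return                # prune: no point below has q inside its kNN ball
        if node.children:
            for child in node.children:
                rknn_search(child, q, out)
        else:
            for p in node.points:
                if math.dist(q, p.vector) <= p.kdist:
                    out.append(p) # q lies inside p's kNN hypersphere

Because each kdist value is precomputed for one specific choice of k, the aggregated bound max_kdist is valid only for that k; this is exactly the limitation noted above that motivates the approaches of [3, 44, 2, 4].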