Complex Queries and Complex Data: Challenges in Similarity Search


Complex Queries and Complex Data: Challenges in Similarity Search

Johannes Niedermayer, München 2015

Dissertation at the Faculty of Mathematics, Informatics and Statistics of the Ludwig-Maximilians-Universität München, submitted by Johannes Niedermayer from München, 21 July 2015. First reviewer: PD Dr. Peer Kröger. Second reviewer: Prof. Dr. Michael Gertz. Date of the oral examination: 30 October 2015.

Declaration under oath (see doctoral regulations of 12 July 2011, § 8(2) no. 5): I hereby declare under oath that this dissertation was prepared by me independently and without unauthorized assistance.

To my future wife. To my parents. To my brother.

Contents

Front matter: List of Figures, List of Tables, Table of Notations, Summary

Part I: Preliminaries
1 Introduction
2 Similarity Search
  2.1 Mathematical Definitions
  2.2 Query Types
  2.3 Pipeline
    2.3.1 Feature Extraction
    2.3.2 Indexing and Query Processing
  2.4 Challenges
    2.4.1 Complex Data
    2.4.2 Complex Query Predicates
    2.4.3 Large Volumes
3 Thesis Overview and Contributions
4 Incorporated Publications and Coauthorship

Part II: The RkNN Join
5 Introduction
6 Preliminaries
  6.1 Problem Definition
  6.2 Related Work
  6.3 Classification of Existing RkNN Joins
7 Algorithms
  7.1 The Mutual Pruning Algorithm
    7.1.1 General Idea
    7.1.2 The Algorithm joinEntry
    7.1.3 Refinement: The resolve Routine
  7.2 A Self Pruning Approach
    7.2.1 General Idea
    7.2.2 Implementing the Self-kNN-Join
    7.2.3 Implementing the Varying-Range-Join
  7.3 Extension to Metric Spaces
    7.3.1 Adaptions of the Update List Approach
    7.3.2 Adaptions of the kNN-Based Approach
8 Evaluation
  8.1 Experiments on Synthetic Data
  8.2 Real Data Experiments
  8.3 Comparing CPU Cost and IO Cost
9 Conclusion

Part III: Nearest Neighbor Queries on Uncertain Spatio-Temporal Data
10 Introduction
11 Preliminaries
  11.1 Problem Definition
    11.1.1 Uncertain Trajectory Model
    11.1.2 Nearest Neighbor Queries
    11.1.3 Probabilistic Reverse Nearest Neighbor Queries
  11.2 Related Work
12 Nearest Neighbor Queries
  12.1 Theoretical Analysis
    12.1.1 The P∃NN Query
    12.1.2 The P∀NN Query
    12.1.3 The PCNN Query
  12.2 Sampling Possible Trajectories
    12.2.1 Traditional Sampling
    12.2.2 Efficient and Appropriate Sampling
  12.3 Spatial Pruning
  12.4 Experimental Evaluation
    12.4.1 Evaluation: P∀NNQ and P∃NNQ
    12.4.2 Continuous Queries
13 Reverse Nearest Neighbor Queries
  13.1 PRNN Query Processing
    13.1.1 Temporal and Spatial Filtering
    13.1.2 Verification
  13.2 Experiments
    13.2.1 Evaluation: P∀RNNQ and P∃RNNQ
14 Conclusions

Part IV: kNN Queries for Image Retrieval
15 Introduction
16 Preliminaries
  16.1 Problem Definition
  16.2 Related Work
    16.2.1 Keypoint Reduction
    16.2.2 kNN Indexing
    16.2.3 kNN-based Matching
    16.2.4 Match Expansion
17 Minimizing the Number of Matching Queries for Object Retrieval
  17.1 Pipeline
    17.1.1 Theory
    17.1.2 Practical Considerations
  17.2 Experiments
    17.2.1 Experimental Setup
    17.2.2 Experiments
18 Retrieval of Binary Features in Image Databases: A Study
  18.1 Querying Binary Features with LSH
  18.2 Experimental Evaluation
    18.2.1 Nearest Neighbor Queries
    18.2.2 Range Queries and BoVW
19 Conclusions

Part V: Conclusions
Acknowledgements

List of Figures

2.1 Query Processing Pipeline
2.2 The R-Tree
2.3 The Filter-Refinement Approach
5.1 Application of an RkNN join between two sets of products R and S for product (set) recommendation
6.1 Visualization of the monochromatic RkNN query and the monochromatic RkNN join
7.1 Spatial domination
7.2 Comparison of MS(e_S) and CMBR(e_S) in an (a) average, (b) worst, and (c) best case
8.1 Performance (CPU time in seconds), synthetic dataset
8.2 Performance (page accesses), synthetic dataset
8.3 Given a set S of size |S|, the left figure visualizes the size of the set R̃ for which TPL and the kNN-based joins have the same computational performance on our test sets R and S (note that this might vary with other datasets); the right figure visualizes the results for varying cache size (logarithmic scale on the y-axis)
8.4 Performance (CPU time in seconds, page accesses), real dataset (HSV)
8.5 Performance (CPU time, page accesses), real dataset (post office)
8.6 CPU and IO time in seconds when varying the cache size; left: SSD, right: HDD
10.1 Uncertainty in a spatio-temporal context
11.1 Model visualization (best viewed in color)
11.2 Example setup for PNN queries
11.3 Example setup for PRNN queries
12.1 An example instance of our mapping of the 3-SAT problem to a set of Markov chains
12.2 P∀NN: violation of the Markov assumption
12.3 Traditional MC sampling
12.4 An overview of our forward-backward algorithm
12.5 Spatio-temporal pruning example
12.6 Examples of the models used for synthetic and real data; black lines denote transition probabilities, thicker lines higher probabilities, thinner lines lower probabilities; the synthetic model consists of 10k states
12.7 Varying the number of states N
12.8 Varying the branching factor b
12.9 Varying the number of objects |D|
12.10 Real data: varying the number of objects
12.11 Efficiency of sampling without model adaption
12.12 Effectiveness of sampling, P∀NN and P∃NN
12.13 Real data: effectiveness of the model adaption
12.14 PCNN: varying the number of objects
12.15 PCNN: varying τ
13.1 Spatio-temporal filtering (only leaf nodes are shown)
13.2 Synthetic data, varying |D|
13.3 Synthetic data, varying b
13.4 Synthetic data, varying N = |S|
13.5 Real data, varying |D|
16.1 Object recognition pipeline
17.1 Generation of additional match hypotheses
17.2 Performance for varying k (Hessian-affine SIFT); straight lines show the performance for 10 keypoints, dashed lines for 1000 keypoints; equivalent approaches have equivalent colors
18.1 LSH-based indexing
18.2 Parameter settings and distance distribution
18.3 Population of buckets (log-log space)
18.4 Varying the number of tables (left) and the database size (right)
18.5 Varying the number of probes: recall (left) and distance calculations (right)
18.6 Varying the number of bits
18.7 Recall and false hit rate for range queries
18.8 Applying the hashing functions to a BoVW-based ranking

List of Tables

8.1 Values for the evaluated independent variables (default values in bold)
12.1 Parameters varied during the experimental evaluation (synthetic data); differing parameters for continuous experiments are denoted by a superscript c; default values in bold
17.1 Database statistics
17.2 Parameters for match expansion
17.3 SIFT, Oxford5k, k=100
17.4 SIFT, Paris6k, k=100
17.5 SIFT, Holidays, k=10
17.6 BinBoost, Oxford5k, k=100
17.7 SIFT, Oxford105k, k=100
17.8 BinBoost, Oxford105k, k=100

Table of Notations

General notations:
  M: a metric space
  R^d: the d-dimensional vector space
  B^d: the space of d-dimensional binary vectors
  D: a database of objects
  q: a query object
  o ∈ D: a database object
  dist(x, y): a distance function, usually a metric
  range(q, D, ε): ε-range query
  kNN(q, D, k): kNN query
  RkNN(q, D, k): RkNN query
  MINDIST(A, B): minimum distance between objects A and B
  MAXDIST(A, B): maximum distance between objects A and B
  Dom(A, B, C): spatial domination function of the objects A, B, and C
  H: a family of hash functions
  G: a function family
  Q: a queue

Part II (The RkNN Join):
  R: a set of objects, if not denoted otherwise R ⊂ R^d; corresponds to the left (query) set of the join
  S: a set of objects, usually S ⊂ R^d; corresponds to the right (database) set of the join
  r ∈ R: an element of R
  s ∈ S: an element of S
  R, S (calligraphic): indexes of the sets R and S, respectively

Part III (Nearest Neighbor Queries on Uncertain Spatio-Temporal Data):
  T: a time domain T = {0, ..., n}
  T: a time interval
  t ∈ T: a single point in time
  S: a spatial domain
  s_i ∈ S: a spatial location, i.e. a state
  θ_i ∈ S: a spatial observation
  Θ^o: the representation of object o, i.e. a set of location-time tuples, w.l.o.g. ordered by the timesteps t_i: Θ^o = {⟨t^o_1, θ^o_1⟩, ⟨t^o_2, θ^o_2⟩, ..., ⟨t^o_|Θ^o|, θ^o_|Θ^o|⟩}
  o(t): the realization of the random variable representing object o at time t
  M^o(t): the transition matrix of object o at time t
  s⃗^o(t): the probability distribution over spatial locations of object o at time t
  τ: probability threshold
  l: a literal (∃ query)
  c: a clause (∃ query)
  x: a variable (∃ query)
  F(t), R(t): forward and backward model for sampling

Part IV (kNN Queries for Image Retrieval):
  I_j ∈ D: an image
  p^i_j ∈ I_j: a keypoint, p^i_j = (v^i_j, x^i_j, y^i_j, s^i_j, r^i_j, σ^i_j, A^i_j)
  v^i_j: feature vector of feature i in image j
  x^i_j, y^i_j: interest point location of feature i in image j
  s^i_j: scale of feature i in image j
  σ^i_j: response of feature i in image j
  A^i_j: affine matrix of feature i in image j
  m(x): an arbitrary matching function
  Ψ ⊆ I_j: a …
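Before the list of related work, a concrete rendering of the central query types may help. The following is a minimal linear-scan sketch of the ε-range and kNN queries from the notation table; the Euclidean metric and the function names are illustrative assumptions, not the thesis's implementation:

```python
import heapq
import math

def dist(x, y):
    # Euclidean distance; stands in for the metric dist(x, y) of the notation table
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def range_query(q, D, eps):
    # range(q, D, eps): all objects of D within distance eps of q
    return [o for o in D if dist(q, o) <= eps]

def knn(q, D, k):
    # kNN(q, D, k): the k objects of D closest to q (linear scan)
    return heapq.nsmallest(k, D, key=lambda o: dist(q, o))

D = [(0.1, 0.2), (0.4, 0.4), (0.9, 0.8), (0.35, 0.45)]
q = (0.3, 0.4)
print(range_query(q, D, 0.2))  # [(0.4, 0.4), (0.35, 0.45)]
print(knn(q, D, 2))
```

Index structures such as the R-tree and the filter-refinement pipeline discussed in the thesis exist precisely to avoid this O(|D|) scan per query.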
Recommended publications
  • Two Algorithms for Nearest-Neighbor Search in High Dimensions
    Two Algorithms for Nearest-Neighbor Search in High Dimensions. Jon M. Kleinberg. February 7, 1997.

    Abstract: Representing data as points in a high-dimensional space, so as to use geometric methods for indexing, is an algorithmic technique with a wide array of uses. It is central to a number of areas such as information retrieval, pattern recognition, and statistical data analysis; many of the problems arising in these applications can involve several hundred or several thousand dimensions. We consider the nearest-neighbor problem for d-dimensional Euclidean space: we wish to pre-process a database of n points so that given a query point, one can efficiently determine its nearest neighbors in the database. There is a large literature on algorithms for this problem, in both the exact and approximate cases. The more sophisticated algorithms typically achieve a query time that is logarithmic in n at the expense of an exponential dependence on the dimension d; indeed, even the average-case analysis of heuristics such as k-d trees reveals an exponential dependence on d in the query time. In this work, we develop a new approach to the nearest-neighbor problem, based on a method for combining randomly chosen one-dimensional projections of the underlying point set. From this, we obtain the following two results. (i) An algorithm for finding ε-approximate nearest neighbors with a query time of O((d log² d)(d + log n)). (ii) An ε-approximate nearest-neighbor algorithm with near-linear storage and a query time that improves asymptotically on linear search in all dimensions.
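As a rough illustration of what combining random one-dimensional projections means, here is a toy sketch of the comparison primitive only, not Kleinberg's actual data structure or its guarantees: x is judged closer to q than y if it wins a majority vote over random projection directions.

```python
import random

def random_unit_vector(d):
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    n = sum(c * c for c in v) ** 0.5
    return [c / n for c in v]

def closer_by_projections(q, x, y, vs):
    # Majority vote over 1-D projections: x "beats" y if |v.(q-x)| < |v.(q-y)|
    # for most random directions v. This mirrors the projection-based
    # comparison idea; the full algorithm organizes such tests far more cleverly.
    wins = 0
    for v in vs:
        px = abs(sum(vi * (qi - xi) for vi, qi, xi in zip(v, q, x)))
        py = abs(sum(vi * (qi - yi) for vi, qi, yi in zip(v, q, y)))
        wins += 1 if px < py else 0
    return wins * 2 > len(vs)

d = 128
vs = [random_unit_vector(d) for _ in range(25)]
q = [random.random() for _ in range(d)]
x = [random.random() for _ in range(d)]
y = [qi + 0.01 * random.gauss(0, 1) for qi in q]  # y lies near q
print(closer_by_projections(q, y, x, vs))  # usually True
```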
  • Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions
    Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions. Parikshit Ram, Dongryeol Lee, Hua Ouyang and Alexander G. Gray. Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332. {p.ram@, dongryel@cc., houyang@, agray@cc.}gatech.edu

    Abstract: The long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+ε) distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving the large speedups over exact NN. Experiments on high-dimensional datasets show that our algorithm often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior.

    1 Introduction: In this paper, we address the problem of nearest-neighbor (NN) search in large datasets of high dimensionality. It is used for classification (k-NN classifier [1]), categorizing a test point on the basis of the classes in its close neighborhood. Non-parametric density estimation uses NN algorithms when the bandwidth at any point depends on the kth NN distance (NN kernel density estimation [2]). NN algorithms are present in and often the main cost of most non-linear dimensionality reduction techniques (manifold learning [3, 4]) to obtain the neighborhood of every point, which is then preserved during the dimension reduction.
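The baseline intuition behind rank approximation can be sketched with plain uniform sampling: the nearest neighbor within a random sample of m points has true rank at most τ with probability 1 - (1 - τ/n)^m. The paper's algorithm is a tree-based, stratified refinement of this; the sketch below shows only the naive version, with illustrative parameter values:

```python
import math
import random

def sample_size_for_rank(n, tau, delta):
    # Smallest m with P(best-of-sample has true rank > tau) <= delta:
    # (1 - tau/n)^m <= delta  =>  m >= log(delta) / log(1 - tau/n)
    return math.ceil(math.log(delta) / math.log(1.0 - tau / n))

def rank_approx_nn(q, D, tau, delta, dist):
    # With probability >= 1 - delta, the returned point is among the
    # tau nearest neighbors of q in D.
    m = min(len(D), sample_size_for_rank(len(D), tau, delta))
    sample = random.sample(D, m)
    return min(sample, key=lambda o: dist(q, o))

n = 1_000_000
print(sample_size_for_rank(n, tau=100, delta=0.05))  # ~30k points instead of 10^6
```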
  • Nearest Neighbor Search in Google Correlate
    Nearest Neighbor Search in Google Correlate. Dan Vanderkam, Robert Schonberger, Henry Rowley (Google Inc, New York / Mountain View), Sanjiv Kumar (Google Inc, New York).

    Abstract: This paper presents the algorithms which power Google Correlate [8], a tool which finds web search terms whose popularity over time best matches a user-provided time series. Correlate was developed to generalize the query-based modeling techniques pioneered by Google Flu Trends and make them available to end users. Correlate searches across millions of candidate query time series to find the best matches, returning results in less than 200 milliseconds. Its feature set and requirements present unique challenges for Approximate Nearest Neighbor (ANN) search techniques. In this paper, we present Asymmetric Hashing (AH), the technique used by Correlate, and show …

    … queries are then noted, and a model that estimates influenza incidence based on present query data is created. This estimate is valuable because the most recent traditional estimate of the ILI rate is only available after a two week delay. Indeed, there are many fields where estimating the present [1] is important. Other health issues and indicators in economics, such as unemployment, can also benefit from estimates produced using the techniques developed for Google Flu Trends. Google Flu Trends relies on a multi-hour batch process to find queries that correlate to the ILI time series. Google Correlate allows users to create their own versions of Flu Trends in real time.
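Why correlation search reduces to nearest-neighbor search: after z-normalization, Pearson correlation is a dot product of unit vectors, so maximizing correlation is equivalent to minimizing Euclidean distance on the normalized series. A sketch of that reduction (the normalization step only; Asymmetric Hashing itself is not shown, and the toy data is illustrative):

```python
import math

def znorm(ts):
    # Zero mean, then scale so the vector has unit length; assumes a
    # non-constant series (sd > 0).
    n = len(ts)
    mu = sum(ts) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in ts) / n)
    return [(x - mu) / (sd * math.sqrt(n)) for x in ts]

def pearson(a, b):
    # After z-normalization, correlation is a plain dot product, so
    # "most correlated series" becomes a maximum-inner-product /
    # nearest-neighbor problem over unit-length vectors.
    return sum(x * y for x, y in zip(znorm(a), znorm(b)))

target = [3, 5, 8, 13, 21, 34]
candidates = {"up": [1, 2, 3, 5, 7, 11], "down": [9, 8, 6, 5, 3, 1]}
best = max(candidates, key=lambda k: pearson(target, candidates[k]))
print(best)  # "up"
```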
  • VoR-Tree: R-Trees with Voronoi Diagrams for Efficient Processing of Spatial Nearest Neighbor Queries
    VoR-Tree: R-trees with Voronoi Diagrams for Efficient Processing of Spatial Nearest Neighbor Queries. Mehdi Sharifzadeh (Google), Cyrus Shahabi (University of Southern California).

    Abstract: An important class of queries, especially in the geospatial domain, is the class of nearest neighbor (NN) queries. These queries search for data objects that minimize a distance-based function with reference to one or more query objects. Examples are k Nearest Neighbor (kNN) [13, 5, 7], Reverse k Nearest Neighbor (RkNN) [8, 15, 16], k Aggregate Nearest Neighbor (kANN) [12] and skyline queries [1, 11, 14]. The applications of NN queries are numerous in geospatial decision making, location-based services, and sensor networks. The introduction of R-trees [3] (and their extensions) for indexing multi-dimensional data marked a new era in developing novel R-tree-based algorithms for various forms …

    A very important class of spatial queries consists of nearest-neighbor (NN) query and its variations. Many studies in the past decade utilize R-trees as their underlying index structures to address NN queries efficiently. The general approach is to use R-tree in two phases. First, R-tree's hierarchical structure is used to quickly arrive to the neighborhood of the result set. Second, the R-tree nodes intersecting with the local neighborhood (Search Region) of an initial answer are investigated to find all the members of the result set. While R-trees are very efficient for the first phase, they usually result in the unnecessary investigation of many nodes that none or only a small subset of their including points belongs to the actual result set.
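The two-phase R-tree traversal described above is commonly implemented as a single best-first search ordered by the MINDIST lower bound. A self-contained toy sketch (plain R-tree, no Voronoi augmentation; the Node layout is an assumption for illustration):

```python
import heapq
import itertools

class Node:
    # Minimal R-tree node: leaves hold points, inner nodes hold children;
    # every node knows its bounding rectangle rect = (lo, hi).
    def __init__(self, rect, children=None, points=None):
        self.rect, self.children, self.points = rect, children or [], points or []

def mindist(rect, q):
    # Minimum Euclidean distance from point q to an axis-aligned rectangle:
    # the classic MINDIST bound used to order R-tree entries.
    lo, hi = rect
    return sum(max(l - c, 0.0, c - h) ** 2 for c, l, h in zip(q, lo, hi)) ** 0.5

def nn(root, q):
    # Best-first search: always expand the entry with the smallest lower
    # bound. A point can be returned as soon as it surfaces, since its exact
    # distance is then <= every remaining MINDIST in the queue.
    tie = itertools.count()  # tie-breaker so heap never compares Nodes
    heap = [(mindist(root.rect, q), next(tie), root, None)]
    while heap:
        d, _, node, point = heapq.heappop(heap)
        if point is not None:
            return point, d
        for p in node.points:
            pd = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            heapq.heappush(heap, (pd, next(tie), None, p))
        for c in node.children:
            heapq.heappush(heap, (mindist(c.rect, q), next(tie), c, None))
    return None, float("inf")

leaf1 = Node(((0, 0), (1, 1)), points=[(0.2, 0.2), (0.9, 0.1)])
leaf2 = Node(((2, 2), (3, 3)), points=[(2.5, 2.5)])
root = Node(((0, 0), (3, 3)), children=[leaf1, leaf2])
print(nn(root, (0.0, 0.0)))  # ((0.2, 0.2), ~0.283)
```

The VoR-Tree replaces the second, refinement-heavy phase with precomputed Voronoi neighbors, which this sketch deliberately omits.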
  • Dimensional Testing for Reverse K-Nearest Neighbor Search
    University of Southern Denmark: Dimensional testing for reverse k-nearest neighbor search. Guillaume Casanova, Elias Englmeier, Michael E. Houle, Peer Kröger, Michael Nett, Erich Schubert, Arthur Zimek. Published in: Proceedings of the VLDB Endowment. DOI: 10.14778/3067421.3067426. Publication date: 2017. Document version: final published version. Document license: CC BY-NC-ND.

    Citation (APA): Casanova, G., Englmeier, E., Houle, M. E., Kröger, P., Nett, M., Schubert, E., & Zimek, A. (2017). Dimensional testing for reverse k-nearest neighbor search. Proceedings of the VLDB Endowment, 10(7), 769-780. https://doi.org/10.14778/3067421.3067426
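The query being accelerated can be stated directly from its definition: o is a reverse k-nearest neighbor of q iff q lies within o's k-nearest-neighbor distance. A naive quadratic-cost sketch (illustrative only, with none of the paper's dimensional testing or filtering):

```python
import math

def rknn(q, D, k, dist):
    # o is a reverse k-NN of q iff q is at least as close to o as o's
    # k-th nearest neighbor in D \ {o}. Naive version: one kNN scan per
    # object, i.e. O(|D|^2) distance computations; this is the cost that
    # filter techniques such as dimensional testing try to avoid.
    result = []
    for o in D:
        kth = sorted(dist(o, p) for p in D if p is not o)[k - 1]
        if dist(o, q) <= kth:
            result.append(o)
    return result

dist = math.dist  # Euclidean distance, Python 3.8+
D = [(0, 0), (1, 0), (4, 0), (5, 0)]
print(rknn((2, 0), D, k=1, dist=dist))  # [(1, 0)]
```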
  • Efficient Nearest Neighbor Searching for Motion Planning
    Efficient Nearest Neighbor Searching for Motion Planning. Anna Atramentov (Dept. of Computer Science, Iowa State University, Ames, IA 50011 USA), Steven M. LaValle (Dept. of Computer Science, University of Illinois, Urbana, IL 61801 USA).

    Abstract: We present and implement an efficient algorithm for performing nearest-neighbor queries in topological spaces that usually arise in the context of motion planning. Our approach extends the Kd tree-based ANN algorithm, which was developed by Arya and Mount for Euclidean spaces. We argue the correctness of the algorithm and illustrate its efficiency through computed examples. We have applied the algorithm to both probabilistic roadmaps (PRMs) and Rapidly-exploring Random Trees (RRTs). Substantial performance improvements are shown for motion planning examples.

    1 Introduction: … topologies of configuration spaces. The topologies that usually arise are Cartesian products of R, S1, and projective spaces. In these cases, metric information must be appropriately processed by any data structure designed to perform nearest neighbor computations. The problem is further complicated by the high dimensionality of configuration spaces. In this paper, we extend the nearest neighbor algorithm and part of the ANN package of Arya and Mount [16] to handle general topologies with very little computational overhead. Related literature is discussed in Section 3. Our approach is based on hierarchically searching nodes in a Kd tree while handling the appropriate constraints induced by the metric and topology. We have proven the correctness of our approach, and we have demonstrated the efficiency of the algorithm experimentally.
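The crux is that the metric must respect the topology: on S^1, distances wrap around at 2π, and a configuration space such as R^2 × S^1 needs a product metric. A small sketch (the weight between translation and rotation is an assumed parameter, not a value from the paper):

```python
import math

def dist_s1(a, b):
    # Angular distance on the circle S^1 (radians): the wrap-around at 2*pi
    # is exactly the kind of topology a Kd tree must be taught to respect.
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def dist_se2(p, q, w=1.0):
    # Configuration space of a planar rigid body: R^2 x S^1, combined as a
    # weighted product metric. The weight w trades off meters vs radians
    # (an illustrative assumption).
    dxy = math.hypot(p[0] - q[0], p[1] - q[1])
    return math.hypot(dxy, w * dist_s1(p[2], q[2]))

print(dist_s1(0.1, 2 * math.pi - 0.1))            # 0.2, not ~6.08
print(dist_se2((0, 0, 0.1), (1, 0, 2 * math.pi - 0.1)))
```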
  • A Fast All Nearest Neighbor Algorithm for Applications Involving Large Point-Clouds (Jagan Sankaranarayanan, Hanan Samet, Amitabh Varshney)
    Best Journal Paper of 2007 Award. Computers & Graphics 31 (2007) 157–174. www.elsevier.com/locate/cag

    A fast all nearest neighbor algorithm for applications involving large point-clouds. Jagan Sankaranarayanan, Hanan Samet, Amitabh Varshney. Department of Computer Science, Center for Automation Research, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA.

    Abstract: Algorithms that use point-cloud models make heavy use of the neighborhoods of the points. These neighborhoods are used to compute the surface normals for each point, mollification, and noise removal. All of these primitive operations require the seemingly repetitive process of finding the k nearest neighbors (kNNs) of each point. These algorithms are primarily designed to run in main memory. However, rapid advances in scanning technologies have made available point-cloud models that are too large to fit in the main memory of a computer. This calls for more efficient methods of computing the kNNs of a large collection of points, many of which are already in close proximity. A fast kNN algorithm is presented that makes use of the locality of successive points whose k nearest neighbors are sought to reduce significantly the time needed to compute the neighborhood needed for the primitive operation, as well as enable it to operate in an environment where the data is on disk. Results of experiments demonstrate an order of magnitude improvement in the time to perform the algorithm and several orders of magnitude improvement in work efficiency when compared with several prominent existing methods. © 2006 Elsevier Ltd. All rights reserved.
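The all-kNN primitive plus the normal-estimation use case mentioned above can be sketched in a few lines with an in-memory kd-tree. This stands in for, but does not reproduce, the paper's locality-aware, disk-resident algorithm; SciPy's cKDTree and all parameter values are assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

# All-kNN on a point cloud, then a per-point normal estimate from the
# covariance of its neighborhood (eigenvector of the smallest eigenvalue).
# Batched kd-tree queries benefit from the same effect the paper exploits:
# successive nearby points share most of their neighbors.
rng = np.random.default_rng(0)
pts = rng.random((10_000, 3))

k = 8
tree = cKDTree(pts)
_, idx = tree.query(pts, k=k + 1)      # each point is its own 0th neighbor
neighbors = pts[idx[:, 1:]]            # shape (n, k, 3)

centered = neighbors - neighbors.mean(axis=1, keepdims=True)
cov = np.einsum('nki,nkj->nij', centered, centered) / k
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
normals = eigvecs[:, :, 0]              # smallest-eigenvalue eigenvector
print(normals.shape)                    # (10000, 3)
```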
  • Fast Algorithm for Nearest Neighbor Search Based on a Lower Bound Tree
    In Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, Jul. 2001, Volume 1, pages 446–453. Fast Algorithm for Nearest Neighbor Search Based on a Lower Bound Tree. Yong-Sheng Chen†‡, Yi-Ping Hung†‡, Chiou-Shann Fuh‡. †Institute of Information Science, Academia Sinica, Taipei, Taiwan; ‡Dept. of Computer Science and Information Engineering, National Taiwan University, Taiwan.

    Abstract: This paper presents a novel algorithm for fast nearest neighbor search. At the preprocessing stage, the proposed algorithm constructs a lower bound tree by agglomeratively clustering the sample points in the database. Calculation of the distance between the query and the sample points can be avoided if the lower bound of the distance is already larger than the minimum distance. The search process can thus be accelerated because the computational cost of the lower bound, which can be calculated by using the internal node of the lower bound tree, is less than that of the distance.

    In the past, Bentley [2] proposed the k-dimensional binary search tree to speed up the nearest neighbor search. This method is very efficient when the dimension of the data space is small. However, as reported in [13] and [3], its performance degrades exponentially with increasing dimension. Fukunaga and Narendra [8] constructed a tree structure by repeating the process of dividing the set of sample points into subsets. Given a query point, they used a branch-and-bound search strategy to efficiently find the closest point by using the tree structure. Djouadi and Bouktache [5] partitioned the underlying space of the sample …
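The pruning rule here is the triangle inequality: for any point p in a cluster with center c and radius r, dist(q, p) ≥ dist(q, c) - r, so an entire cluster can be skipped once this lower bound exceeds the best distance found so far. A one-level sketch of the idea (the agglomerative tree itself is omitted; the quadrant "clustering" is purely illustrative):

```python
import math, random

def dist(a, b):
    return math.dist(a, b)  # Euclidean, Python 3.8+

def nn_with_cluster_bounds(q, clusters):
    # Each cluster is (center, radius, points). Visit clusters in order of
    # their lower bound dist(q, center) - radius; skip a cluster entirely
    # when that bound already exceeds the best distance seen.
    best, best_d = None, float("inf")
    for center, radius, points in sorted(
            clusters, key=lambda c: dist(q, c[0]) - c[1]):
        if dist(q, center) - radius >= best_d:
            continue  # no point inside this cluster can win
        for p in points:
            d = dist(q, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(1000)]
clusters = []
for qx in (0, 1):           # toy clustering: bucket points by quadrant
    for qy in (0, 1):
        group = [p for p in pts if (p[0] >= 0.5) == qx and (p[1] >= 0.5) == qy]
        cx = (sum(p[0] for p in group) / len(group),
              sum(p[1] for p in group) / len(group))
        r = max(dist(cx, p) for p in group)
        clusters.append((cx, r, group))
print(nn_with_cluster_bounds((0.1, 0.1), clusters))
```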
  • Efficient Similarity Search in High-Dimensional Data Spaces
    New Jersey Institute of Technology, Digital Commons @ NJIT: Dissertations, Electronic Theses and Dissertations, Spring 5-31-2004. Efficient similarity search in high-dimensional data spaces. Yue Li, New Jersey Institute of Technology.

    Recommended Citation: Li, Yue, "Efficient similarity search in high-dimensional data spaces" (2004). Dissertations. 633. https://digitalcommons.njit.edu/dissertations/633
  • A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space
    A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space. Stefan Berchtold, Christian Böhm, Daniel A. Keim, Hans-Peter Kriegel (University of Munich, Germany).

    Abstract: In this paper, we present a new cost model for nearest neighbor search in high-dimensional data space. We first analyze different nearest neighbor algorithms, present a generalization of an algorithm which has been originally proposed for Quadtrees [13], and show that this algorithm is optimal. Then, we develop a cost model which, in contrast to previous models, takes boundary effects into account and therefore also works in high dimensions. The advantages of our model are in particular: our model works for data sets with an arbitrary number of dimensions and an arbitrary number of data …

    … multimedia indexing [9], CAD [17], molecular biology (for the docking of molecules) [24], string matching [1], etc. Most applications use some kind of feature vector for an efficient access to the complex original data. Examples of feature vectors are color histograms [23], shape descriptors [16, 18], Fourier vectors [26], text descriptors [15], etc. Nearest neighbor search on the high-dimensional feature vectors may be defined as follows: given a data set DS of d-dimensional points, find the data point NN from the data set which is closer to the given query point Q than any other point in the data set.
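The boundary effects the model accounts for are easy to see numerically: equating the expected number of uniform points in a d-dimensional ball with 1 and solving for the radius shows that, in high dimensions, the "NN sphere" no longer fits inside the unit cube, so models ignoring the boundary mispredict. A small computation using the standard ball-volume formula (this is not the paper's cost model itself):

```python
import math

def nn_radius_no_boundary(n, d):
    # Radius r at which a d-dimensional ball around the query is expected
    # to contain one of n uniform points in [0,1]^d, ignoring boundary
    # effects: n * V_d(r) = 1 with V_d(r) = pi^(d/2) * r^d / Gamma(d/2 + 1).
    return (math.gamma(d / 2 + 1) / n) ** (1 / d) / math.sqrt(math.pi)

for d in (2, 10, 30, 100):
    print(d, round(nn_radius_no_boundary(1_000_000, d), 3))
# Roughly 0.001 (d=2), 0.23 (d=10), 0.90 (d=30), 2.17 (d=100): already at
# d=30 the radius approaches the cube's side length, and at d=100 the ball
# pokes far outside the data space, which is why boundary effects matter.
```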
  • Improving K-Nearest Neighbor Approaches for Density-Based Pixel Clustering in Hyperspectral Remote Sensing Images
    Remote Sensing, Article: Improving K-Nearest Neighbor Approaches for Density-Based Pixel Clustering in Hyperspectral Remote Sensing Images. Claude Cariou 1,*, Steven Le Moan 2 and Kacem Chehdi 1. 1 Institut d'Électronique et des Technologies du numéRique, CNRS, Univ Rennes, UMR 6164, Enssat, 6 rue de Kerampont, F22300 Lannion, France; 2 Centre for Research in Image and Signal Processing, Massey University, Palmerston North 4410, New Zealand. Received: 30 September 2020; Accepted: 12 November 2020; Published: 14 November 2020.

    Abstract: We investigated nearest-neighbor density-based clustering for hyperspectral image analysis. Four existing techniques were considered that rely on a K-nearest neighbor (KNN) graph to estimate local density and to propagate labels through algorithm-specific labeling decisions. We first improved two of these techniques, a KNN variant of the density peaks clustering method DPC, and a weighted-mode variant of KNNCLUST, so the four methods use the same input KNN graph and only differ by their labeling rules. We propose two regularization schemes for hyperspectral image analysis: (i) a graph regularization based on mutual nearest neighbors (MNN) prior to clustering to improve cluster discovery in high dimensions; (ii) a spatial regularization to account for correlation between neighboring pixels. We demonstrate the relevance of the proposed methods on synthetic data and hyperspectral images, and show they achieve superior overall performances in most cases, outperforming the state-of-the-art methods by up to 20% in kappa index on real hyperspectral images. Keywords: clustering methods; density estimation; nearest neighbor search; deterministic algorithm; unsupervised learning.
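The shared ingredients of the four techniques, a KNN graph, a kNN-based local density estimate, and the mutual-NN (MNN) symmetrization used for regularization, can be sketched as follows. SciPy's kd-tree, the toy data, and all parameter values are assumptions; the labeling rules of DPC/KNNCLUST are not shown:

```python
import numpy as np
from scipy.spatial import cKDTree

# Two Gaussian blobs in 5 dimensions stand in for pixel spectra.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (200, 5)), rng.normal(3, 0.3, (200, 5))])

k = 10
dists, idx = cKDTree(X).query(X, k=k + 1)
knn_idx = idx[:, 1:]                        # drop the self-match

# kNN density estimate: proportional to k / volume of the k-NN ball,
# i.e. inversely proportional to r_k^d (constant factors omitted).
density = k / (dists[:, -1] ** X.shape[1])

# Mutual-NN graph: keep an edge i-j only if each point is among the
# other's k nearest neighbors; this is the MNN regularization idea.
knn = np.zeros((len(X), len(X)), dtype=bool)
for i, nbrs in enumerate(knn_idx):
    knn[i, nbrs] = True
mnn = knn & knn.T
print(density[:3], mnn.sum())
```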
  • Very Large Scale Nearest Neighbor Search: Ideas, Strategies and Challenges
    Int. Jnl. Multimedia Information Retrieval, IJMIR, vol. 2, no. 4, 2013. The final publication is available at Springer via http://dx.doi.org/10.1007/s13735-013-0046-4. Very Large Scale Nearest Neighbor Search: Ideas, Strategies and Challenges. Erik Gast, Ard Oerlemans, Michael S. Lew. Leiden University, Niels Bohrweg 1, 2333CA Leiden, The Netherlands. {gast, aoerlema, mlew}@liacs.nl

    Abstract: Web-scale databases and Big Data collections are computationally challenging to analyze and search. Similarity or more precisely nearest neighbor searches are thus crucial in the analysis, indexing and utilization of these massive multimedia databases. In this work, we begin by reviewing the top approaches from the research literature in the past decade. Furthermore, we evaluate the scalability and computational complexity as the feature complexity and database size vary. For the experiments we used two different data sets with different dimensionalities. The results reveal interesting insights regarding the index structures and their behavior when the data set size is increased. We also summarized the ideas, strategies and challenges for the future.

    1. Introduction: Very large scale multimedia databases are becoming common and thus searching within them has become more important. Clearly, one of the most frequently used searching paradigms is k-nearest neighbor (k-NN), where the k objects that are most similar to the query are retrieved. Unfortunately, this k-NN search is also a very expensive operation. In order to do a k-NN search efficiently, it is important to have an index structure that can efficiently handle k-NN searches on large databases. From a theoretical and technical point of view, finding the k nearest neighbors in less than linear time is challenging and largely unsolved.
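For scale comparisons like those in this survey, the reference point is the brute-force linear scan, which any index structure must beat. A vectorized baseline sketch (NumPy; the sizes are illustrative):

```python
import numpy as np

def knn_bruteforce(Q, X, k):
    # The linear-scan baseline. Squared Euclidean distances via the
    # expansion |q - x|^2 = |q|^2 - 2 q.x + |x|^2, computed for a whole
    # query batch with a single matrix product.
    d2 = (Q * Q).sum(1)[:, None] - 2 * Q @ X.T + (X * X).sum(1)[None, :]
    part = np.argpartition(d2, k, axis=1)[:, :k]        # unordered top-k
    order = np.take_along_axis(d2, part, 1).argsort(1)  # sort those k
    return np.take_along_axis(part, order, 1)

rng = np.random.default_rng(0)
X = rng.random((100_000, 64), dtype=np.float32)  # database
Q = rng.random((10, 64), dtype=np.float32)       # query batch
print(knn_bruteforce(Q, X, k=5))
```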