Efficient Similarity Search in High-Dimensional Data Spaces


Yue Li, New Jersey Institute of Technology
Dissertation, Spring 5-31-2004
Digital Commons @ NJIT, Electronic Theses and Dissertations (https://digitalcommons.njit.edu/dissertations). Part of the Computer Sciences Commons.

Recommended Citation: Li, Yue, "Efficient similarity search in high-dimensional data spaces" (2004). Dissertations. 633. https://digitalcommons.njit.edu/dissertations/633

This Dissertation is brought to you for free and open access by the Electronic Theses and Dissertations at Digital Commons @ NJIT. It has been accepted for inclusion in Dissertations by an authorized administrator of Digital Commons @ NJIT.
Please Note: The author retains the copyright while the New Jersey Institute of Technology reserves the right to distribute this thesis or dissertation. The Van Houten library has removed some of the personal information and all signatures from the approval page and biographical sketches of theses and dissertations in order to protect the identity of NJIT graduates and faculty.

ABSTRACT

EFFICIENT SIMILARITY SEARCH IN HIGH-DIMENSIONAL DATA SPACES
by Yue Li

Similarity search in high-dimensional data spaces is a popular paradigm for many modern database applications, such as content based image retrieval, time series analysis in financial and marketing databases, and data mining. Objects are represented as high-dimensional points or vectors based on their important features. Object similarity is then measured by the distance between feature vectors, and similarity search is implemented via range queries or k-Nearest Neighbor (k-NN) queries.

Implementing k-NN queries via a sequential scan of large tables of feature vectors is computationally expensive. Building multi-dimensional indexes on the feature vectors for k-NN search also tends to be unsatisfactory when the dimensionality is high. This is due to the poor index performance caused by the dimensionality curse.

Dimensionality reduction using the Singular Value Decomposition method is the approach adopted in this study to deal with high-dimensional data. Noting that for many real-world datasets the data distribution tends to be heterogeneous, dimensionality reduction on the entire dataset may cause a significant loss of information. A more efficient representation is sought by clustering the data into homogeneous subsets of points and applying dimensionality reduction to each cluster respectively, i.e., utilizing local rather than global dimensionality reduction.

The thesis deals with the improvement of the efficiency of query processing associated with local dimensionality reduction methods, such as the Clustering and Singular Value Decomposition (CSVD) and the Local Dimensionality Reduction (LDR) methods. Variations in the implementation of CSVD are considered, and the two methods are compared from the viewpoint of the compression ratio, CPU time, and retrieval efficiency.

An exact k-NN algorithm is presented for local dimensionality reduction methods by extending an existing multi-step k-NN search algorithm, which is designed for global dimensionality reduction. Experimental results show that the new method requires less CPU time than the approximate method proposed originally for CSVD at a comparable level of accuracy.

Optimal subspace dimensionality reduction has the intent of minimizing total query cost. The problem is complicated in that each cluster can retain a different number of dimensions. A hybrid method is presented, combining the best features of the CSVD and LDR methods, to find optimal subspace dimensionalities for clusters generated by local dimensionality reduction methods. The experiments show that the proposed method works well for both real-world datasets and synthetic datasets.

EFFICIENT SIMILARITY SEARCH IN HIGH-DIMENSIONAL DATA SPACES
by Yue Li

A Dissertation Submitted to the Faculty of New Jersey Institute of Technology in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science
Department of Computer Science
May 2004

Copyright © 2004 by Yue Li
ALL RIGHTS RESERVED

APPROVAL PAGE

EFFICIENT SIMILARITY SEARCH IN HIGH-DIMENSIONAL DATA SPACES
Yue Li

Alexander Thomasian, Dissertation Advisor, Professor of Computer Science, NJIT
Joseph Leung, Committee Member, Distinguished Professor of Computer Science, NJIT
Wojciech Rytter, Committee Member, Professor of Computer Science, NJIT
Vincent Oria, Committee Member, Assistant Professor of Computer Science, NJIT
Jian Yang, Committee Member, Assistant Professor of Industrial and Manufacturing Engineering, NJIT

BIOGRAPHICAL SKETCH

Author: Yue Li
Degree: Doctor of Philosophy
Date: May 2004

Undergraduate and Graduate Education:
  • Doctor of Philosophy in Computer Science, New Jersey Institute of Technology, Newark, NJ, 2004
  • Master of Science in Computer Science, Shandong University, Jinan, China, 1999
  • Bachelor of Science in Computer Science, Shandong University, Jinan, China, 1993

Major: Computer Science

Presentations and Publications:
  • Y. Li, A. Thomasian, and L. Zhang, "Finding Optimal Subspace Dimensionality for k-NN Search in Clustered Datasets," 15th Int'l Conf. on Database and Expert Systems Applications (DEXA), submitted.
  • L. Zhang, A. Thomasian, and Y. Li, "A Persistent and Dynamic Ordered Partition Index for k Nearest Neighbor Search," 15th Int'l Conf. on Database and Expert Systems Applications (DEXA), submitted.
  • Y. Li, A. Thomasian, and L. Zhang, "An Exact k-NN Search Algorithm for CSVD," IEEE Transactions on Knowledge and Data Engineering (TKDE) journal, in review, 2003.

To the memory of my mother

ACKNOWLEDGMENT

I would like to express my deepest appreciation to Alexander Thomasian, who not only served as my research advisor, providing valuable and countless resources, insight, and intuition, but also constantly gave me support, encouragement, and reassurance. Special thanks are given to Joseph Leung, Wojciech Rytter, Vincent Oria, and Jian Yang for actively participating in my committee.

I would like to thank Byoung-Kee Yi and Vittorio Castelli for their help on my research. This work could not have been completed without their background help. Special thanks are given to Lijuan Zhang, who worked with me for two years and assisted me in implementing clustering and multi-dimensional indexing algorithms.

I would like to thank my other fellow students in the Integrated Systems Laboratory (Chunqi Han, Chang Liu, and Gang Fu) for their friendship, support, and assistance over years.

Finally, I would like to thank my husband, my father, my daughter, and my in-laws for their love and support.

TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Motivation
  1.2 Challenges, Contributions and Outline
    1.2.1 Dimensionality Reduction
    1.2.2 Exact k-NN Search
    1.2.3 Optimal Subspace Dimensionality
    1.2.4 Outline
2 HIGH-DIMENSIONAL INDEXING
  2.1 Similarity Queries
  2.2 Distance Metrics
  2.3 Multi-dimensional Index Structures
    2.3.1 K-D-Trees
    2.3.2 The R-tree and Its Variations
  2.4 Dimensionality Curse
  2.5 Dimensionality Reduction
    2.5.1 Lower Bounding Property
    2.5.2 Normalized Mean Square Error
    2.5.3 Singular Value Decomposition (SVD)
    2.5.4 Discrete Fourier Transform (DFT)
    2.5.5 Discrete Wavelet Transform (DWT)
    2.5.6 Segmentation Mean Method (SMM)
  2.6 Metric Based Index Structures
  2.7 Summary
3 PERFORMANCE OF CSVD
  3.1 Introduction
  3.2 Related Work
    3.2.1 Clustering
    3.2.2 LDR
    3.2.3 MMDR
  3.3 The CSVD Method
    3.3.1 CSVD Implementation Steps
    3.3.2 Approximate k-NN Search Algorithm
  3.4 Performance Study
    3.4.1 Experimental Setup
    3.4.2 Performance Metrics
    3.4.3 Experiments
  3.5 Conclusions
4 K-NEAREST NEIGHBOR SEARCH
  4.1 Introduction
  4.2 Related Work
  4.3 Exact k-NN Search Algorithm for CSVD
  4.4 Experiments
    4.4.1 CPU Cost versus NMSE for the Exact Method
    4.4.2 Exact Method versus Approximate Method
    4.4.3 The Effect of Maintaining an Extra Dimension
  4.5 Conclusion and Discussion
5 OPTIMAL SUBSPACE DIMENSIONALITY
  5.1 Introduction
  5.2 Related Work
  5.3 The Hybrid Method for Clustered Datasets
    5.3.1 Information Loss and Subspace Dimensionality
    5.3.2 The Hybrid Method
    5.3.3 Optimal Subspace Dimensionality
  5.4 Experiments
    5.4.1 NMSE and the Average Subspace Dimensionality
    5.4.2 Query Costs and Subspace Dimensionality
  5.5 Conclusion
6 CONCLUSION
APPENDIX A k-NN ALGORITHM FOR LDR
APPENDIX B CPU COST OF REGULAR K-MEANS AND ELLIPTICAL K-MEANS
APPENDIX C ANALYTIC k-NN QUERY COST MODEL
  C.1 Fractal Dimensions
  C.2 Faloutsos's Query Cost Models
  C.3 Bohm's Query Cost Models
  C.4 The Revised Query Model for Range Queries
  C.5 Experiments
REFERENCES

LIST OF TABLES

1.1 Symbol Table
3.1 Datasets
3.2 MMDR Parameters and Retained Dimensions Compared to CSVD
5.1 Algorithm 1
5.2 Algorithm 2
5.3 Datasets for the Experiments
5.4 Clustered Datasets (CD) Generated by LDR with Different Parameters
A.1 k-NN Algorithm for LDR
C.1 Results of Equation vs. Experiments

LIST OF FIGURES

2.1 A 2-D example of (a) three types of distances, (b) range query of radius ε under the three metrics
2.2 Example of a k-d-B-tree for 2-D space
2.3 Example of a 2-D R-tree
2.4 A circle and its minimum bounding square in 2-D space
2.5 Distances between data points and centers of [...] the datasets: (a) COLH64 (b) TXT55 (c) GABOR64 (d) SYHT64.
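
The abstract refers to a multi-step k-NN search algorithm designed for global dimensionality reduction. The following is only a minimal sketch of that general filter-and-refine idea, not the dissertation's algorithm. It assumes the points are already expressed in an orthonormal basis (for example after an SVD rotation), so that keeping the first `m` coordinates yields a distance that never exceeds the full distance. The names `dist`, `multistep_knn`, `points`, `query`, `k`, and `m` are hypothetical.

```python
import heapq
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def multistep_knn(points, query, k, m):
    """Return indices of the exact k nearest neighbors of `query`.

    Assumption: coordinates are in an orthonormal basis, so truncating
    to the first m coordinates gives a lower bound on the true distance.
    """
    reduced_q = query[:m]
    # Cheap lower-bound distance of every point in the reduced space.
    lower = [dist(p[:m], reduced_q) for p in points]

    # Step 1: take k candidates by lower bound; the largest of their
    # true distances is a radius that must contain the true k-NN.
    seed = heapq.nsmallest(k, range(len(points)), key=lambda i: lower[i])
    radius = max(dist(points[i], query) for i in seed)

    # Step 2: range query in the reduced space with that radius.
    # No true neighbor is lost, because lower[i] <= true distance.
    candidates = [i for i in range(len(points)) if lower[i] <= radius]

    # Step 3: refine the candidates using full-dimensional distances.
    return heapq.nsmallest(k, candidates, key=lambda i: dist(points[i], query))
```

In this sketch the reduced-space pass is a linear scan for brevity; in an index-based setting the same two queries (k-NN, then range) would be issued against a multi-dimensional index built on the reduced vectors.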
Recommended publications
  • Pivot Selection Method for Optimizing Both Pruning and Balancing in Metric Space Indexes
    Pivot Selection Method for Optimizing both Pruning and Balancing in Metric Space Indexes. Hisashi Kurasawa (The University of Tokyo, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan), Daiji Fukagawa (Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe-shi, Kyoto, Japan), Atsuhiro Takasu and Jun Adachi (National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan).
    Abstract. We researched to try to find a way to reduce the cost of nearest neighbor searches in metric spaces. Many similarity search indexes recursively divide a region into subregions by using pivots, and construct a tree structure index. A problem in the existing indexes is that they only focus on the pruning objects and do not take into consideration the tree balancing. The balance of the indexes depends on the data distribution and the indexes don't reduce the search cost for all data. We propose a similarity search index called the Partitioning Capacity Tree (PCTree). PCTree automatically optimizes the pivot selection based on both the balance of the regions partitioned by a pivot and the estimated effectiveness of the search pruning by the pivot. As a result, PCTree reduces the search cost for various data distributions. Our evaluations comparing it with four indexes on three real datasets showed that PCTree successfully reduces the search cost and is good at handling various data distributions.
    Keywords: Similarity Search, Metric Space, Indexing.
    1 Introduction. Finding similar objects in a large dataset is a fundamental process of many applications, such as image completion [7]. These applications can be speeded up by reducing the query execution cost of a similarity search.
  • Peer-To-Peer Similarity Search Based on M-Tree Indexing
    Peer-to-Peer Similarity Search based on M-Tree Indexing. Akrivi Vlachou and Christos Doulkeridis (Dept. of Computer and Information Science, Norwegian University of Science and Technology), Yannis Kotidis (Dept. of Informatics, Athens University of Economics and Business).
    Abstract. Similarity search in metric spaces has several important applications both in centralized and distributed environments. In centralized applications, such as similarity-based image retrieval, usually a server indexes its data with a state-of-the-art centralized metric indexing technique, such as the M-Tree. In this paper, we propose a framework for distributed similarity search, where each participating peer stores its own data autonomously, under the assumption that data is indexed locally by peers using M-Trees. In order to support scalability and efficiency of search, we adopt a super-peer architecture, where super-peers are responsible for query routing. We propose the construction of metric routing indices suitable for distributed similarity search in metric spaces. We study the performance of the proposed framework using both synthetic and real data.
    1 Introduction. Similarity search in metric spaces has received significant attention in centralized settings [1, 6], but also recently in decentralized environments [3, 5, 8]. A prominent application is distributed search for multimedia content, such as images, video or plain text. Existing approaches for P2P metric-based similarity search mainly rely on a structured P2P overlay, which is used to intentionally store objects to peers [5, 8]. The aim is to achieve high parallelism and share the high processing cost over a set of cooperative computers.
  • Two Algorithms for Nearest-Neighbor Search in High Dimensions
    Two Algorithms for Nearest-Neighbor Search in High Dimensions. Jon M. Kleinberg. February 7, 1997.
    Abstract. Representing data as points in a high-dimensional space, so as to use geometric methods for indexing, is an algorithmic technique with a wide array of uses. It is central to a number of areas such as information retrieval, pattern recognition, and statistical data analysis; many of the problems arising in these applications can involve several hundred or several thousand dimensions. We consider the nearest-neighbor problem for d-dimensional Euclidean space: we wish to pre-process a database of n points so that given a query point, one can efficiently determine its nearest neighbors in the database. There is a large literature on algorithms for this problem, in both the exact and approximate cases. The more sophisticated algorithms typically achieve a query time that is logarithmic in n at the expense of an exponential dependence on the dimension d; indeed, even the average-case analysis of heuristics such as k-d trees reveals an exponential dependence on d in the query time. In this work, we develop a new approach to the nearest-neighbor problem, based on a method for combining randomly chosen one-dimensional projections of the underlying point set. From this, we obtain the following two results. (i) An algorithm for finding ε-approximate nearest neighbors with a query time of O((d log^2 d)(d + log n)). (ii) An ε-approximate nearest-neighbor algorithm with near-linear storage and a query time that improves asymptotically on linear search in all dimensions.
  • Similarity Searching in Peer-To-Peer Environment
    Masaryk University in Brno, Faculty of Informatics. Similarity Searching in Peer-to-Peer Environment. Dissertation Proposal. Mgr. David Novák. Supervisor: Prof. Ing. Pavel Zezula, CSc. In Brno, January 2006.
    Contents:
    1 Introduction
    2 Metric Space Indexing: 2.1 Metric Space Model (2.1.1 Metric Space; 2.1.2 Similarity Queries; 2.1.3 Metric Distance Measures); 2.2 Metric Partitioning Principles (2.2.1 Ball Partitioning; 2.2.2 Excluded Middle Partitioning; 2.2.3 Generalized Hyperplane Partitioning; 2.2.4 Voronoi-like Partitioning); 2.3 Search Space Pruning Policies (2.3.1 Object-Pivot Distance Constraint; 2.3.2 Range-Pivot Distance Constraint; 2.3.3 Double-Pivot Distance Constraint; 2.3.4 Pivot Filtering); 2.4 Metric Data Structures (2.4.1 Vantage Point Tree; 2.4.2 Generalized Hyperplane Tree; 2.4.3 M-Tree; 2.4.4 D-Index)
    3 Peer-to-Peer Structures: 3.1 Distributed Hash Tables (3.1.1 Chord; 3.1.2 CAN); 3.2 One-Dimensional Range Queries (3.2.1 P-Grid; 3.2.2 Skip Graphs & SkipNet; 3.2.3 P-Tree); 3.3 Range Queries over Multiple Attributes (3.3.1 SCRAP & HSFC-based; 3.3.2 ZNet; 3.3.3 MAAN; 3.3.4 Mercury; 3.3.5 MURK); 3.4 Nearest Neighbors Queries (3.4.1 pSearch; 3.4.2 Distributed Quadtree; 3.4.3 SWAM); 3.5 Metric Space Structures (3.5.1 GHT*; 3.5.2 MCAN)
    4 Thesis Plan: 4.1 Overview; 4.2 Objectives; 4.3 System Structure Proposal; 4.4 Future Plans
    References
    1 Introduction. The diffusion of computer technology into a wide spectrum of human activities leads to formation of a variety of new complex data types.
  • Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions
    Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions. Parikshit Ram, Dongryeol Lee, Hua Ouyang and Alexander G. Gray. Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332. {p.ram@, dongryel@cc., houyang@, agray@cc.}gatech.edu
    Abstract. The long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+ε) distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving the large speedups over exact NN. Experiments on high-dimensional datasets show that our algorithm often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior.
    1 Introduction. In this paper, we address the problem of nearest-neighbor (NN) search in large datasets of high dimensionality. It is used for classification (k-NN classifier [1]), categorizing a test point on the basis of the classes in its close neighborhood. Non-parametric density estimation uses NN algorithms when the bandwidth at any point depends on the k-th NN distance (NN kernel density estimation [2]). NN algorithms are present in and often the main cost of most non-linear dimensionality reduction techniques (manifold learning [3, 4]) to obtain the neighborhood of every point which is then preserved during the dimension reduction.
  • Prefix Hash Tree an Indexing Data Structure Over Distributed Hash
    Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables. Sriram Ramabhadran (University of California, San Diego), Sylvia Ratnasamy (Intel Research, Berkeley), Joseph M. Hellerstein (University of California, Berkeley and Intel Research, Berkeley), Scott Shenker (International Comp. Science Institute, Berkeley and University of California, Berkeley).
    ABSTRACT. Distributed Hash Tables are scalable, robust, and self-organizing peer-to-peer systems that support exact match lookups. This paper describes the design and implementation of a Prefix Hash Tree - a distributed data structure that enables more sophisticated queries over a DHT. The Prefix Hash Tree uses the lookup interface of a DHT to construct a trie-based structure that is both efficient (updates are doubly logarithmic in the size of the domain being indexed), and resilient (the failure of any given node in the Prefix Hash Tree does not affect the availability of data stored at other nodes).
    Categories and Subject Descriptors: C.2.4 [Comp.
    [...] this lookup interface has allowed a wide variety of system to be built on top DHTs, including file systems [9, 27], indirection services [30], event notification [6], content distribution networks [10] and many others. DHTs were designed in the Internet style: scalability and ease of deployment triumph over strict semantics. In particular, DHTs are self-organizing, requiring no centralized authority or manual configuration. They are robust against node failures and easily accommodate new nodes. Most importantly, they are scalable in the sense that both latency (in terms of the number of hops per lookup) and the local state required typically grow logarithmically in the number of nodes; this is crucial since many of the envisioned scenarios for DHTs
  • Nearest Neighbor Search in Google Correlate
    Nearest Neighbor Search in Google Correlate. Dan Vanderkam (Google Inc, 76 9th Avenue, New York, New York 10011 USA), Robert Schonberger (Google Inc, 76 9th Avenue, New York, New York 10011 USA), Henry Rowley (Google Inc, 1600 Amphitheatre Parkway, Mountain View, California 94043 USA), Sanjiv Kumar (Google Inc, 76 9th Avenue, New York, New York 10011 USA).
    ABSTRACT. This paper presents the algorithms which power Google Correlate[8], a tool which finds web search terms whose popularity over time best matches a user-provided time series. Correlate was developed to generalize the query-based modeling techniques pioneered by Google Flu Trends and make them available to end users. Correlate searches across millions of candidate query time series to find the best matches, returning results in less than 200 milliseconds. Its feature set and requirements present unique challenges for Approximate Nearest Neighbor (ANN) search techniques. In this paper, we present Asymmetric Hashing (AH), the technique used by Correlate, and show
    [...] queries are then noted, and a model that estimates influenza incidence based on present query data is created. This estimate is valuable because the most recent traditional estimate of the ILI rate is only available after a two week delay. Indeed, there are many fields where estimating the present[1] is important. Other health issues and indicators in economics, such as unemployment, can also benefit from estimates produced using the techniques developed for Google Flu Trends. Google Flu Trends relies on a multi-hour batch process to find queries that correlate to the ILI time series. Google Correlate allows users to create their own versions of Flu Trends in real time.
  • Efficient Nearest-Neighbor Computation for GPU-Based Motion
    Efficient Nearest-Neighbor Computation for GPU-based Motion Planning. Jia Pan and Christian Lauterbach and Dinesh Manocha. Department of Computer Science, UNC Chapel Hill. http://gamma.cs.unc.edu/gplanner
    Abstract: We present a novel k-nearest neighbor search algorithm (KNNS) for proximity computation in motion planning algorithm that exploits the computational capabilities of many-core GPUs. Our approach uses locality sensitive hashing and cuckoo hashing to construct an efficient KNNS algorithm that has linear space and time complexity and exploits the multiple cores and data parallelism effectively. In practice, we see magnitude improvement in speed and scalability over prior GPU-based KNNS algorithm. On some benchmarks, our KNNS algorithm improves the performance of overall planner by 20-40 times for CPU-based planner and up to 2 times for GPU-based planner.
    I. INTRODUCTION. Sampling-based motion planning algorithms are widely used in robotics and related areas to compute collision-free motion for the robots.
    [...] nearest neighbor computation, local planning and graph search using parallel algorithms. In practice, the overall planning can be one or two orders of magnitude faster than current CPU-based planners. We present a novel algorithm for GPU-based nearest neighbors computation. Our formulation is based on locality sensitive hashing (LSH) and cuckoo hashing techniques, which can compute approximate k-nearest neighbors in higher dimensions. We present parallel algorithms for LSH computation as well KNN computation that exploit the high number of cores and data parallelism of the GPUs. Our approach can perform the computation using Euclidean as well as non-Euclidean distance metrics. Furthermore, it is memory efficient and can handle millions of sample points within 1GB
  • R-Trees with Voronoi Diagrams for Efficient Processing of Spatial
    VoR-Tree: R-trees with Voronoi Diagrams for Efficient Processing of Spatial Nearest Neighbor Queries. Mehdi Sharifzadeh (Google), Cyrus Shahabi (University of Southern California).
    ABSTRACT. A very important class of spatial queries consists of nearest-neighbor (NN) query and its variations. Many studies in the past decade utilize R-trees as their underlying index structures to address NN queries efficiently. The general approach is to use R-tree in two phases. First, R-tree's hierarchical structure is used to quickly arrive to the neighborhood of the result set. Second, the R-tree nodes intersecting with the local neighborhood (Search Region) of an initial answer are investigated to find all the members of the result set. While R-trees are very efficient for the first phase, they usually result in the unnecessary investigation of many nodes that none or only a small subset of their including points belongs to the actual result set.
    [...] An important class of queries, especially in the geospatial domain, is the class of nearest neighbor (NN) queries. These queries search for data objects that minimize a distance-based function with reference to one or more query objects. Examples are k Nearest Neighbor (kNN) [13, 5, 7], Reverse k Nearest Neighbor (RkNN) [8, 15, 16], k Aggregate Nearest Neighbor (kANN) [12] and skyline queries [1, 11, 14]. The applications of NN queries are numerous in geospatial decision making, location-based services, and sensor networks. The introduction of R-trees [3] (and their extensions) for indexing multi-dimensional data marked a new era in developing novel R-tree-based algorithms for various forms
  • Dimensional Testing for Reverse K-Nearest Neighbor Search
    University of Southern Denmark. Dimensional testing for reverse k-nearest neighbor search. Casanova, Guillaume; Englmeier, Elias; Houle, Michael E.; Kröger, Peer; Nett, Michael; Schubert, Erich; Zimek, Arthur.
    Published in: Proceedings of the VLDB Endowment. DOI: 10.14778/3067421.3067426. Publication date: 2017. Document version: Final published version. Document license: CC BY-NC-ND.
    Citation for published version (APA): Casanova, G., Englmeier, E., Houle, M. E., Kröger, P., Nett, M., Schubert, E., & Zimek, A. (2017). Dimensional testing for reverse k-nearest neighbor search. Proceedings of the VLDB Endowment, 10(7), 769-780. https://doi.org/10.14778/3067421.3067426
    Dimensional Testing for Reverse k-Nearest Neighbor Search. Guillaume Casanova (ONERA-DCSD, France), Elias Englmeier (LMU Munich, Germany), Michael E. Houle (NII, Tokyo,
  • Distributed Computation of the Knn Graph for Large High-Dimensional Point Sets
    J of Parallel and Distributed Computing, 2007, vol. 67(3), pp. 346–359. Distributed Computation of the knn Graph for Large High-Dimensional Point Sets. Erion Plaku, Lydia E. Kavraki. Rice University, Department of Computer Science, Houston, Texas 77005, USA.
    Abstract. High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors.
    Key words: Nearest neighbors; Approximate nearest neighbors; knn graphs; Range queries; Metric spaces; Robotics; Distributed and Parallel Algorithms.
    1 Introduction. The computation of proximity graphs for large high-dimensional data sets in arbitrary metric spaces is often necessary for solving problems arising from robot motion planning [14,31,35,40,45,46,56], biology [3,17,33,39,42,50,52], pattern recognition [21], data mining [18,51], multimedia systems [13], geographic information systems [34,43,49], and other research fields.
  • Efficient Nearest Neighbor Searching for Motion Planning
    Efficient Nearest Neighbor Searching for Motion Planning. Anna Atramentov (Dept. of Computer Science, Iowa State University, Ames, IA 50011 USA), Steven M. LaValle (Dept. of Computer Science, University of Illinois, Urbana, IL 61801 USA).
    Abstract. We present and implement an efficient algorithm for performing nearest-neighbor queries in topological spaces that usually arise in the context of motion planning. Our approach extends the Kd tree-based ANN algorithm, which was developed by Arya and Mount for Euclidean spaces. We argue the correctness of the algorithm and illustrate its efficiency through computed examples. We have applied the algorithm to both probabilistic roadmaps (PRMs) and Rapidly-exploring Random Trees (RRTs). Substantial performance improvements are shown for motion planning examples.
    1 Introduction. [...] gies of configuration spaces. The topologies that usually arise are Cartesian products of R, S^1, and projective spaces. In these cases, metric information must be appropriately processed by any data structure designed to perform nearest neighbor computations. The problem is further complicated by the high dimensionality of configuration spaces. In this paper, we extend the nearest neighbor algorithm and part of the ANN package of Arya and Mount [16] to handle general topologies with very little computational overhead. Related literature is discussed in Section 3. Our approach is based on hierarchically searching nodes in a Kd tree while handling the appropriate constraints induced by the metric and topology. We have proven the correctness of our approach, and we have demonstrated the efficiency of the algorithm experimentally.
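
The last entry above notes that configuration spaces are often Cartesian products of R and S^1, so distances must respect wrap-around on the circular coordinates. The sketch below illustrates only that metric issue, using a brute-force scan rather than the Kd tree-based algorithm the excerpt describes. The function names and the `is_angle` flag tuple are hypothetical, and angles are assumed to be in radians.

```python
import math

TWO_PI = 2.0 * math.pi


def circle_dist(a, b):
    """Shortest arc length between two angles on S^1 (radians)."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def config_dist(p, q, is_angle):
    """Distance on a product of R and S^1 factors.

    is_angle[i] is True when coordinate i lives on the circle and
    False when it is an ordinary real-valued coordinate.
    """
    total = 0.0
    for x, y, angular in zip(p, q, is_angle):
        d = circle_dist(x, y) if angular else abs(x - y)
        total += d * d
    return math.sqrt(total)


def nearest(points, query, is_angle):
    """Brute-force nearest neighbor under the wrap-around metric."""
    return min(points, key=lambda p: config_dist(p, query, is_angle))
```

With a plain Euclidean metric, an angle of 6.0 rad looks far from 0.1 rad; with the wrap-around metric it is closer than 1.0 rad is, which is why an index built for Euclidean space needs adjusting before it can be used on such spaces.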