A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search

Total Page:16

File Type:pdf, Size:1020Kb

A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search Mengzhao Wang1, Xiaoliang Xu1, Qiang Yue1, Yuxiang Wang1,∗ 1Hangzhou Dianzi University, China {mzwang,xxl,yq,lsswyx}@hdu.edu.cn 1 ABSTRACT result vertex Approximate nearest neighbor search (ANNS) constitutes an im- 2 query query 4 portant operation in a multitude of applications, including recom- seed vertex 3 mendation systems, information retrieval, and pattern recognition. (a) Original dataset (b) Graph index In the past decade, graph-based ANNS algorithms have been the Figure 1: A toy example for the graph-based ANNS algorithm. leading paradigm in this domain, with dozens of graph-based ANNS algorithms proposed. Such algorithms aim to provide effective, ef- actual requirements for efficiency and cost61 [ ]. Thus, much of ficient solutions for retrieving the nearest neighbors for agiven the literature has focused on efforts to research approximate NNS query. Nevertheless, these efforts focus on developing and optimiz- (ANNS) and find an algorithm that improves efficiency substantially ing algorithms with different approaches, so there is a real need while mildly relaxing accuracy constraints (an accuracy-versus- for a comprehensive survey about the approaches’ relative perfor- efficiency tradeoff [59]). mance, strengths, and pitfalls. Thus here we provide a thorough ANNS is a task that finds the approximate nearest neighbors comparative analysis and experimental evaluation of 13 represen- among a high-dimensional dataset for a query via a well-designed tative graph-based ANNS algorithms via a new taxonomy and index. According to the index adopted, the existing ANNS algo- fine-grained pipeline. We compared each algorithm in a uniform rithms can be divided into four major types: hashing-based [40, 45]; test environment on eight real-world datasets and 12 synthetic tree-based [8, 86]; quantization-based [52, 74]; and graph-based [38, datasets with varying sizes and characteristics. Our study yields 67] algorithms. Recently, graph-based algorithms have emerged as novel discoveries, offerings several useful principles to improve a highly effective option for ANNS [6, 10, 41, 75]. Thanks to graph- algorithms, thus designing an optimized method that outperforms based ANNS algorithms’ extraordinary ability to express neighbor the state-of-the-art algorithms. This effort also helped us pinpoint relationships [38, 103], they only need to evaluate fewer points of algorithms’ working portions, along with rule-of-thumb recommen- dataset to receive more accurate results [38, 61, 67, 72, 112]. dations about promising research directions and suitable algorithms As Figure 1 shows, graph-based ANNS algorithms build a graph for practitioners in different fields. index (Figure 1(b)) on the original dataset (Figure 1(a)), the vertices in the graph correspond to the points of the original dataset, and PVLDB Reference Format: neighboring vertices (marked as G, ~) are associated with an edge by Mengzhao Wang, Xiaoliang Xu, Qiang Yue, Yuxiang Wang. A evaluating their distance X ¹G,~º, where X is a distance function. In Comprehensive Survey and Experimental Comparison of Graph-Based Figure 1(b), the four vertices (numbered 1–4) connected to the black Approximate Nearest Neighbor Search. PVLDB, 14(1): XXX-XXX, 2021. vertex are its neighbors, and the black vertex can visit its neighbors doi:XX.XX/XXX.XX along these edges. Given this graph index and a query @ (the red star), ANNS aims to get a set of vertices that are close to @. We take PVLDB Artifact Availability: the case of returning @’s nearest neighbor as an example to show The source code, data, and/or other artifacts have been made available at ANNS’ general procedure: Initially, a seed vertex (the black vertex, it https://github.com/Lsyhprum/WEAVESS. can be randomly sampled or obtained by additional approaches [47, 67]) is selected as the result vertex A, and we can conduct ANNS 1 INTRODUCTION from this seed vertex. Specifically, if X ¹=, @º < X ¹A,@º, where = is arXiv:2101.12631v2 [cs.IR] 10 May 2021 Nearest Neighbor Search (NNS) is a fundamental building block in one of the neighbors of A, A will be replaced by =. We repeat this various application domains [7, 8, 38, 67, 70, 80, 108, 117], such as process until the termination condition (e.g., 8=, X ¹=, @º ≥ X ¹A,@º) information retrieval [34, 118], pattern recognition [29, 57], data is met, and the final A (the green vertex) is @’s nearest neighbor. mining [44, 47], machine learning [24, 28], and recommendation Compared with other index structures, graph-based algorithms are systems [69, 82]. With the explosive growth of datasets’ scale and a proven superior tradeoff in terms of accuracy versus efficiency [10, the inevitable curse of dimensionality, accurate NNS cannot meet 38, 61, 65, 67], which is probably why they enjoy widespread use among high-tech companies nowadays (e.g., Microsoft [98, 100], ∗Corresponding author. Alibaba [38, 112], and Yahoo [47, 48, 89]). This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of 1.1 Motivation this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights The problem of graph-based ANNS on high-dimensional and large- licensed to the VLDB Endowment. scale data has been studied intensively across the literature [38]. Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097. doi:XX.XX/XXX.XX Dozens of algorithms have been proposed to solve this problem from different optimizations [36, 37, 43, 54, 65, 67, 72]. For these algorithms, existing surveys [10, 61, 85] provide some meaningful they are vital for comprehensive analysis of the index performance. explorations. However, they are limited to a small subset about From our abundance of experiments (see §5 for details), we gain a algorithms, datasets, and metrics, as well as studying algorithms novel discovery: higher graph quality does not necessarily achieve from a macro perspective, and the analysis and evaluation of intra- better search performance. For instance, HNSW [67] and DPG [61] algorithm components are ignored. For example, [61] includes a yield similar search performances on the GIST1M dataset [1]. How- few graph-based algorithms (only three), [10] focuses on efficiency ever, in terms of graph quality, HNSW (63.3%) is significantly lower vs accuracy tradeoff,85 [ ] only considers several classic graphs. than DPG (99.2%) (§5). Note that DPG spends a lot of time improv- This motivates us to carry out a thorough comparative analysis ing graph quality during index construction, but it is unnecessary; and experimental evaluation of existing graph-based algorithms this is not uncommon, as we also see it in [31, 36–38]. via a new taxonomy and micro perspective (i.e., some fine-grained I4: Diversified datasets are essential for graph-based ANNS components). We detail the issues of existing work that ensued. algorithms’ scalability evaluation. Some graph-based ANNS I1: Lack of a reasonable taxonomy and comparative analysis algorithms are evaluated only on a small number of datasets, which of inter-algorithms. Many studies in other fields show that an in- limits analysis on how well they scale on different datasets. Looking sightful taxonomy can serve as a guideline for promising research in at the evaluation results on various datasets (see §5 for details), this domain[99, 102, 105]. Thus, a reasonable taxonomy needs to be we find that many algorithms have significant discrepancies in established, to point to the different directions of graph-based algo- terms of performance on different datasets. That is, the advantages rithms (§3). The index of existing graph-based ANNS algorithms are of an algorithm on some datasets may be difficult to extend to generally derivatives of four classic base graphs from different per- other datasets. For example, when the search accuracy reaches 0.99, spectives, i.e., Delaunay Graph (DG) [35], Relative Neighborhood NSG’s speedup is 125× more than that of HNSW for each query Graph (RNG) [92], K-Nearest Neighbor Graph (KNNG) [75], and on Msong [2]. However, on Crawl [3], NSG’s speedup is 80× lower Minimum Spanning Tree (MST) [58]. Some representative ANNS al- than that of HNSW when it achieves the same search accuracy of gorithms, such as KGraph [31], HNSW [67], DPG [61], SPTAG [27] 0.99. This shows that an algorithm’s superiority is contingent on can be categorized into KNNG-based (KGraph and SPTAG) and the dataset rather than being fixed in its performance. Evaluating RNG-based (DPG and HNSW) groups working off the base graphs and analyzing different scenarios’ datasets leads to understanding upon which they rely. Under this classification, we can pinpoint performance differences better for graph-based ANNS algorithms differences between algorithms of the same category or different in diverse scenarios, which provides a basis for practitioners in categories, to provide a comprehensive inter-algorithm analysis. different fields to choose the most suitable algorithm. I2: Omission in analysis and evaluation for intra-algorithm fine-grained components. Many studies only compare and ana- lyze graph-based ANNS algorithms from two coarse-grained com- 1.2 Our Contributions ponents, i.e., construction and search [79, 85], which hinders in- Driven by the aforementioned issues, we provide a comprehensive sight into the key components. Construction and search, however, comparative analysis and experimental evaluation of representative can be divided into many fine-grained components such as can- graph-based algorithms on carefully selected datasets of varying didate neighbor acquisition, neighbor selection [37, 38], seed acqui- characteristics. It is worth noting that we try our best to reimple- sition [9, 36], and routing [14, 94] (we discuss the details of these ment all algorithms using the same design pattern, programming components in §4).
Recommended publications
  • The Geometry of Musical Rhythm
    The Geometry of Musical Rhythm Godfried Toussaint? School of Computer Science McGill University Montr´eal, Qu´ebec, Canada Dedicated to J´anos Pach on the occasion of his 50th birthday. Abstract. Musical rhythm is considered from the point of view of ge- ometry. The interaction between the two fields yields new insights into rhythm and music theory, as well as new problems for research in mathe- matics and computer science. Recent results are reviewed, and new open problems are proposed. 1 Introduction Imagine a clock which has 16 hours marked on its face instead of the usual 12. Assume that the hour and minute hands have been broken off so that only the second-hand remains. Furthermore assume that this clock is running fast so that the second-hand makes a full turn in about 2 seconds. Such a clock is illustrated in Figure 1. Now start the clock ticking at \noon" (16 O'clock) and let it keep running for ever. Finally, strike a bell at positions 16, 3, 6, 10 and 12, for a total of five strikes per clock cycle. These times are marked with a bell in Figure 1. The resulting pattern rings out a seductive rhythm which, in a short span of fifty years during the last half of the 20th century, has managed to conquer our planet. It is known around the world (mostly) as the Clave Son from Cuba. However, it is common in Africa, and probably travelled from Africa to Cuba with the slaves [65]. In Africa it is traditionally played with an iron bell.
    [Show full text]
  • Separating Point Sets in Polygonal Environments
    Separating point sets in polygonal environments Erik D. Demaine Jeff Erickson MIT Laboratory Department of Computer Science for Computer Science U. of Illinois at Urbana-Champaign [email protected] jeff[email protected] Ferran Hurtado∗ John Iacono Departement de Matem`atica Aplicada II Dept. of Computer and Information Science Universitat Polit`ecnica de Catalunya Polytechnic University [email protected] [email protected] Stefan Langerman† Henk Meijer D´epartement d’Informatique School of Computing Universit´eLibre de Bruxelles Queen’s University [email protected] [email protected] Mark Overmars Sue Whitesides Institute of Information and Computing Sciences School of Computer Science Utrecht University McGill University [email protected] [email protected] ABSTRACT 1 Introduction We consider the separability of two point sets inside a poly- Let us consider the following basic question: given two point gon by means of chords or geodesic lines. Specifically, given sets R and B in the interior of a polygon (the red points and a set of red points and a set of blue points in the interior of the blue points, respectively), is there a chord separating R a polygon, we provide necessary and sufficient conditions for from B? This is the starting problem we study in this paper, the existence of a chord and for the existence of a geodesic where we also consider several related problems. path which separate the two sets; when they exist we also Problems on separability of point sets and other geomet- derive efficient algorithms for their obtention. We study as ric objects have generated a significant body of research in well the separation of the two sets using a minimum number computational geometry, and many kind of separators have of pairwise non-crossing chords.
    [Show full text]
  • The Rhythm That Conquered the World: What Makes a “Good” Rhythm Good?
    The Rhythm that Conquered the World: What Makes a “Good” Rhythm Good? (to appear in Percussive Notes.) By Godfried T. Toussaint Radcliffe Institute for Advanced Study Harvard University Dedicated to the memory of Bo Diddley. THE BO DIDDLEY BEAT Much of the world’s traditional and contemporary music makes use of characteristic rhythms called timelines. A timeline is a distinguished rhythmic ostinato, a rhythm that repeats throughout a piece of music with no variation, that gives the particular flavor of movement of the piece that incorporates it, and that acts as a timekeeper and structuring device for the musicians. Timelines may be played with any musical instrument, although a percussion instrument is usually preferred. Timelines may be clapped with the hands, as in the flamenco music of Southern Spain, they may be slapped on the thighs as was done by Buddy Holly’s drummer Jerry Allison in the hit song “Everyday”, or they may just be felt rather than sounded. In Sub-Saharan African music, timelines are usually played using an iron bell such as the gankogui or with two metal blades. Perhaps the most quintessential timeline is what most people familiar with rockabilly music dub the Bo Diddley Beat, and salsa dancers call the clave son, which they attribute to much other Cuban and Latin American music. This rhythm is illustrated in box notation in Figure 1. Each box represents a pulse, and the duration between any two adjacent pulses is one unit of time. There are a total of sixteen pulses, the first pulse occurs at time zero and the sixteenth at time 15.
    [Show full text]
  • Efficiently Computing All Delaunay Triangles Occurring Over
    Efficiently Computing All Delaunay Triangles Occurring over All Contiguous Subsequences Stefan Funke Universität Stuttgart, Germany [email protected] Felix Weitbrecht Universität Stuttgart, Germany [email protected] Abstract Given an ordered sequence of points P = {p1, p2, . , pn}, we are interested in computing T , the set of distinct triangles occurring over all Delaunay triangulations of contiguous subsequences within P . We present a deterministic algorithm for this purpose with near-optimal time complexity O(|T | log n). Additionally, we prove that for an arbitrary point set in random order, the expected number of Delaunay triangles occurring over all contiguous subsequences is Θ(n log n). 2012 ACM Subject Classification Theory of computation → Computational geometry Keywords and phrases Computational Geometry, Delaunay Triangulation, Randomized Analysis Digital Object Identifier 10.4230/LIPIcs.ISAAC.2020.28 1 Introduction For an ordered sequence of points P = {p1, p2, . , pn}, we consider for 1 ≤ i < j ≤ n the contiguous subsequences Pi,j := {pi, pi+1, . , pj} and their Delaunay triangulations S Ti,j := DT (Pi,j). We are interested in the set T := i<j{t | t ∈ Ti,j} of distinct Delaunay triangles occurring over all contiguous subsequences. Figure 1 shows a sequence of points 2 n p1, . , pn where |T | = Ω(n ) (collinearities could be perturbed away). For j > 2 , any point p will be connected to all points {p , . , p n } in T , so any such T contains Θ(n) j 1 2 1,j 1,j 0 2 Delaunay triangles not contained in any T1,j0 with j < j, hence |T | = Ω(n ).
    [Show full text]
  • Experimental Results on Quadrangulations of Sets of Fixed Points
    Experimental Results on Quadrangulations of Sets of Fixed Points Prosenjit Bose∗ Suneeta Ramaswamiy Godfried Toussaintz Alain Turkix Abstract We consider the problem of obtaining \nice" quadrangulations of planar sets of points. For many applications \nice" means that the quadrilaterals obtained are convex if possible and as \fat" or as squarish as possible. For a given set of points a quadrangulation, if it exists, may not admit all its quadrilaterals to be convex. In such cases we desire that the quadrangulations have as many convex quadrangles as possible. Solving this problem optimally is not practical. Therefore we propose and experimentally investigate a heuristic approach to solve this problem by converting \nice" triangulations to the desired quadrangulations with the aid of maximum matchings computed on the dual graph of the triangulations. We report experiments on several versions of this approach and provide theoretical justification for the good results obtained with one of these methods. The results of our experiments are particularly relevant for those applications in scattered data interpolation which require quadrangulations that should stay faithful to the original data. 1 Introduction Finite element mesh generation is a problem that has received considerable attention in recent years, due to its numerous applications in a number of areas such as medical imaging, computer graphics, and Geographic Information Systems (GIS). Important applications also arise in the manufacturing industry, where a central problem is the simulation of processes such as fluid flow in injection molding, by solving systems of partial differential equations [11]. To make such tasks easier, the classical approach is to use the method of finite elements, where the bounding surface of the relevant object is partitioned into small pieces by using data points sampled on the object surface.
    [Show full text]
  • Computational Geometric Aspects of Rhythm, Melody, and Voice-Leading
    Computational Geometry 43 (2010) 2–22 Contents lists available at ScienceDirect Computational Geometry: Theory and Applications www.elsevier.com/locate/comgeo Computational geometric aspects of rhythm, melody, and voice-leading Godfried Toussaint 1 School of Computer Science and Center for Interdisciplinary Research in Music Media and Technology, McGill University, Montréal, Québec, Canada article info abstract Article history: Many problems concerning the theory and technology of rhythm, melody, and voice- Received 1 February 2006 leading are fundamentally geometric in nature. It is therefore not surprising that the Accepted 1 January 2007 field of computational geometry can contribute greatly to these problems. The interaction Available online 21 March 2009 between computational geometry and music yields new insights into the theories of Communicated by J. Iacono rhythm, melody, and voice-leading, as well as new problems for research in several areas, Keywords: ranging from mathematics and computer science to music theory, music perception, and Musical rhythm musicology. Recent results on the geometric and computational aspects of rhythm, melody, Melody and voice-leading are reviewed, connections to established areas of computer science, Voice-leading mathematics, statistics, computational biology, and crystallography are pointed out, and Evenness measures new open problems are proposed. Rhythm similarity © 2009 Elsevier B.V. All rights reserved. Sequence comparison Necklaces Convolution Computational geometry Music information retrieval Algorithms Computational music theory 1. Introduction Imagine a clock which has 16 hours marked on its face instead of the usual 12. Assume that the hour and minute hands have been broken off so that only the second-hand remains. Furthermore assume that this clock is running fast so that the second-hand makes a full turn in about 2 seconds.
    [Show full text]
  • Output-Sensitive Algorithms for Computing Nearest-Neighbour Decision Boundaries∗
    Output-Sensitive Algorithms for Computing Nearest-Neighbour Decision Boundaries∗ David Bremner† Erik Demaine‡ Jeff Erickson§ John Iacono¶ Stefan Langermank Pat Morin∗∗ Godfried Toussaint†† ABSTRACT. Given a set R of red points and a set B of blue points, the nearest-neighbour decision rule classifies a new point q as red (respectively, blue) if the closest point to q in R ∪ B comes from R (respectively, B). This rule implicitly partitions space into a red set and a blue set that are separated by a red-blue decision boundary. In this paper we develop output-sensitive algorithms for computing this decision boundary for point sets on the line and in R2. Both algorithms run in time O(n log k), where k is the number of points that contribute to the decision boundary. This running time is the best possible when parameterizing with respect to n and k. 1 Introduction Let S be a set of n points in the plane that is partitioned into a set of red points denoted by R and a set of blue points denoted by B. The nearest-neighbour decision rule classifies a new point q as the color of the closest point to q in S. The nearest-neighbour decision rule is popular in pattern recognition as a means of learning by example. For this reason, the set S is often referred to as a training set. Several properties make the nearest-neighbour decision rule quite attractive, including its intu- itive simplicity and the theorem that the asymptotic error rate of the nearest-neighbour rule is bounded from above by twice the Bayes error rate [6, 8, 15].
    [Show full text]
  • Geometric Decision Rules for High Dimensions
    Geometric Decision Rules for High Dimensions Binay Bhattacharya ∗ Kaustav Mukherjee School of Computing Science School of Computing Science Simon Fraser University Simon Fraser University Godfried Toussaint y School of Computer Science McGill University Abstract In this paper we report on a new approach to the instance-based learning problem. The new approach combines five tools: first, editing the data using Wilson-Gabriel- editing to smooth the decision boundary, second, applying Gabriel-thinning to the edited set, third, filtering this output with the ICF algorithm of Brighton and Mellish, fourth, using the Gabriel-neighbor decision rule to clasify new incoming queries, and fifth, using a new data structure that allows the efficient computation of approximate Gabriel graphs in high dimensional spaces. Extensive experiments suggest that our approach is the best on the market. 1 Introduction In the typical non-parametric classification problem (see Devroye, Gyorfy and Lugosi [3]) we have available a set of d measurements or observations (also called a feature vector) taken from each member of a data set of n objects (patterns) denoted by fX; Y g = f(X1; Y1); :::; (Xn; Yn)g, where Xi and Yi denote, respectively, the feature vector on the ith object and the class label of that object. One of the most attractive decision procedures, conceived by Fix and Hodges in 1951, is the nearest-neighbor rule (1-NN-rule). Let Z be a new pattern (feature vector) to be classified and let Xj be the feature vector in fX; Y g = f(X1; Y1); :::; (Xn; Yn)g closest to Z. The nearest neighbor decision rule classifies the unknown pattern Z into class Yj.
    [Show full text]