A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search
Mengzhao Wang¹, Xiaoliang Xu¹, Qiang Yue¹, Yuxiang Wang¹,∗
¹Hangzhou Dianzi University, China
{mzwang,xxl,yq,lsswyx}@hdu.edu.cn
∗Corresponding author.

ABSTRACT
Approximate nearest neighbor search (ANNS) constitutes an important operation in a multitude of applications, including recommendation systems, information retrieval, and pattern recognition. In the past decade, graph-based ANNS algorithms have been the leading paradigm in this domain, with dozens of graph-based ANNS algorithms proposed. Such algorithms aim to provide effective, efficient solutions for retrieving the nearest neighbors for a given query. Nevertheless, these efforts focus on developing and optimizing algorithms with different approaches, so there is a real need for a comprehensive survey of the approaches' relative performance, strengths, and pitfalls. Thus, here we provide a thorough comparative analysis and experimental evaluation of 13 representative graph-based ANNS algorithms via a new taxonomy and fine-grained pipeline. We compared each algorithm in a uniform test environment on eight real-world datasets and 12 synthetic datasets with varying sizes and characteristics. Our study yields novel discoveries and offers several useful principles for improving these algorithms, from which we design an optimized method that outperforms the state-of-the-art algorithms. This effort also helped us pinpoint the working portions of each algorithm and provide rule-of-thumb recommendations about promising research directions and suitable algorithms for practitioners in different fields.

PVLDB Reference Format:
Mengzhao Wang, Xiaoliang Xu, Qiang Yue, Yuxiang Wang. A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search. PVLDB, 14(1): XXX-XXX, 2021. doi:XX.XX/XXX.XX

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/Lsyhprum/WEAVESS.

Figure 1: A toy example for the graph-based ANNS algorithm. (a) Original dataset; (b) Graph index.

1 INTRODUCTION
Nearest Neighbor Search (NNS) is a fundamental building block in various application domains [7, 8, 38, 67, 70, 80, 108, 117], such as information retrieval [34, 118], pattern recognition [29, 57], data mining [44, 47], machine learning [24, 28], and recommendation systems [69, 82]. With the explosive growth of datasets' scale and the inevitable curse of dimensionality, accurate NNS cannot meet the actual requirements for efficiency and cost [61]. Thus, much of the literature has focused on approximate NNS (ANNS), seeking algorithms that improve efficiency substantially while mildly relaxing accuracy constraints (an accuracy-versus-efficiency tradeoff [59]).

ANNS is the task of finding the approximate nearest neighbors of a query in a high-dimensional dataset via a well-designed index. According to the index adopted, existing ANNS algorithms can be divided into four major types: hashing-based [40, 45], tree-based [8, 86], quantization-based [52, 74], and graph-based [38, 67] algorithms. Recently, graph-based algorithms have emerged as a highly effective option for ANNS [6, 10, 41, 75]. Thanks to graph-based ANNS algorithms' extraordinary ability to express neighbor relationships [38, 103], they need to evaluate fewer points of the dataset to obtain more accurate results [38, 61, 67, 72, 112].

As Figure 1 shows, graph-based ANNS algorithms build a graph index (Figure 1(b)) on the original dataset (Figure 1(a)): the vertices in the graph correspond to the points of the original dataset, and neighboring vertices (marked as x, y) are associated with an edge by evaluating their distance δ(x, y), where δ is a distance function. In Figure 1(b), the four vertices (numbered 1–4) connected to the black vertex are its neighbors, and the black vertex can visit its neighbors along these edges.
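To make the notion of a graph index concrete, the following is a minimal sketch, not the construction routine of any algorithm surveyed here, that builds a brute-force k-nearest-neighbor graph with edges decided purely by the distance δ(x, y); the function name build_knn_graph, the parameter k, and the toy data are illustrative assumptions.

```python
import numpy as np

def build_knn_graph(points: np.ndarray, k: int) -> list[list[int]]:
    """Brute-force kNN graph: connect each vertex to its k closest points.

    This is only an illustrative sketch of a graph index; the algorithms
    surveyed in this paper construct and prune edges far more carefully.
    """
    n = len(points)
    neighbors = []
    for i in range(n):
        # delta(x, y): Euclidean distance from point i to every other point
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf                      # exclude the point itself
        neighbors.append(list(np.argsort(dists)[:k]))
    return neighbors

# Toy usage: 1,000 random 16-dimensional points, 10 edges per vertex
rng = np.random.default_rng(0)
data = rng.random((1000, 16))
graph = build_knn_graph(data, k=10)
```

A brute-force build like this costs O(n²) distance evaluations; the surveyed algorithms rely on far more scalable construction and edge-selection strategies.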
Given this graph index and a query q (the red star), ANNS aims to get a set of vertices that are close to q. We take the case of returning q's nearest neighbor as an example to show ANNS' general procedure: Initially, a seed vertex (the black vertex; it can be randomly sampled or obtained by additional approaches [47, 67]) is selected as the result vertex r, and we conduct ANNS from this seed vertex. Specifically, if δ(n, q) < δ(r, q), where n is one of the neighbors of r, then r is replaced by n. We repeat this process until the termination condition (e.g., ∀n, δ(n, q) ≥ δ(r, q)) is met, and the final r (the green vertex) is q's nearest neighbor.
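Below is a minimal sketch of the greedy routing procedure just described, assuming Euclidean distance as δ and a plain adjacency-list graph; the function name greedy_search, the seed choice, and the toy data are illustrative only and do not correspond to any specific algorithm evaluated in this survey.

```python
import numpy as np

def greedy_search(points: np.ndarray, graph: list[list[int]],
                  query: np.ndarray, seed: int) -> int:
    """Greedy routing: return an (approximate) nearest neighbor of `query`.

    Starting from `seed`, repeatedly replace the current result r with any
    neighbor n satisfying delta(n, q) < delta(r, q); terminate when no
    neighbor is closer, i.e., for all n, delta(n, q) >= delta(r, q).
    """
    r = seed
    best = np.linalg.norm(points[r] - query)        # delta(r, q)
    improved = True
    while improved:
        improved = False
        for n in graph[r]:
            d = np.linalg.norm(points[n] - query)   # delta(n, q)
            if d < best:
                r, best = n, d
                improved = True
    return r

# Toy usage: build a small brute-force kNN graph and route a random query.
rng = np.random.default_rng(0)
data = rng.random((1000, 16))
graph = [list(np.argsort(np.linalg.norm(data - p, axis=1))[1:11]) for p in data]
query = rng.random(16)
nearest = greedy_search(data, graph, query, seed=0)
```

Practical graph-based ANNS implementations typically maintain a candidate pool and return the top-k results rather than a single vertex; this simplification only mirrors the single-nearest-neighbor example above.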
Compared with other index structures, graph-based algorithms achieve a demonstrably superior tradeoff between accuracy and efficiency [10, 38, 61, 65, 67], which is probably why they enjoy widespread use among high-tech companies nowadays (e.g., Microsoft [98, 100], Alibaba [38, 112], and Yahoo [47, 48, 89]).

1.1 Motivation
The problem of graph-based ANNS on high-dimensional and large-scale data has been studied intensively across the literature [38]. Dozens of algorithms have been proposed to solve this problem from different optimization perspectives [36, 37, 43, 54, 65, 67, 72]. For these algorithms, existing surveys [10, 61, 85] provide some meaningful explorations. However, they are limited to a small subset of algorithms, datasets, and metrics; they study algorithms only from a macro perspective; and the analysis and evaluation of intra-algorithm components are ignored. For example, [61] includes only three graph-based algorithms, [10] focuses on the efficiency-vs-accuracy tradeoff, and [85] only considers several classic graphs. This motivates us to carry out a thorough comparative analysis and experimental evaluation of existing graph-based algorithms via a new taxonomy and a micro perspective (i.e., fine-grained components). We detail the issues of existing work below.

I1: Lack of a reasonable taxonomy and comparative analysis of inter-algorithms. Many studies in other fields show that an insightful taxonomy can serve as a guideline for promising research in a domain [99, 102, 105]. Thus, a reasonable taxonomy needs to be established to point to the different directions of graph-based algorithms (§3). The indexes of existing graph-based ANNS algorithms are generally derivatives of four classic base graphs, approached from different perspectives: the Delaunay Graph (DG) [35], the Relative Neighborhood Graph (RNG) [92], the K-Nearest Neighbor Graph (KNNG) [75], and the Minimum Spanning Tree (MST) [58]. Some representative ANNS algorithms, such as KGraph [31], HNSW [67], DPG [61], and SPTAG [27], can be categorized into KNNG-based (KGraph and SPTAG) and RNG-based (DPG and HNSW) groups according to the base graphs upon which they rely. Under this classification, we can pinpoint differences between algorithms of the same category or of different categories, to provide a comprehensive inter-algorithm analysis.

I2: Omission in analysis and evaluation of intra-algorithm fine-grained components. Many studies compare and analyze graph-based ANNS algorithms only in terms of two coarse-grained components, i.e., construction and search [79, 85], which hinders insight into the key components. Construction and search, however, can be divided into many fine-grained components such as candidate neighbor acquisition, neighbor selection [37, 38], seed acquisition [9, 36], and routing [14, 94] (we discuss the details of these components in §4).

They are vital for a comprehensive analysis of index performance. From our abundance of experiments (see §5 for details), we gain a novel discovery: higher graph quality does not necessarily achieve better search performance. For instance, HNSW [67] and DPG [61] yield similar search performance on the GIST1M dataset [1]. However, in terms of graph quality, HNSW (63.3%) is significantly lower than DPG (99.2%) (§5). Note that DPG spends a lot of time improving graph quality during index construction, but this is unnecessary; this is not uncommon, as we also see it in [31, 36–38].

I4: Diversified datasets are essential for evaluating graph-based ANNS algorithms' scalability. Some graph-based ANNS algorithms are evaluated only on a small number of datasets, which limits analysis of how well they scale to different datasets. Looking at the evaluation results on various datasets (see §5 for details), we find that many algorithms show significant discrepancies in performance across datasets. That is, the advantages of an algorithm on some datasets may be difficult to extend to other datasets. For example, when the search accuracy reaches 0.99, NSG's speedup is 125× more than that of HNSW for each query on Msong [2]. However, on Crawl [3], NSG's speedup is 80× lower than that of HNSW at the same search accuracy of 0.99. This shows that an algorithm's superiority depends on the dataset rather than being intrinsic to the algorithm. Evaluating and analyzing datasets from different scenarios leads to a better understanding of the performance differences of graph-based ANNS algorithms in diverse scenarios, which provides a basis for practitioners in different fields to choose the most suitable algorithm.

1.2 Our Contributions
Driven by the aforementioned issues, we provide a comprehensive comparative analysis and experimental evaluation of representative graph-based algorithms on carefully selected datasets of varying characteristics. It is worth noting that we try our best to reimplement all algorithms using the same design pattern, programming