Thesis: Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching, by Leonid Boytsov

CMU-LTI-18-006
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Dr. Eric Nyberg, Carnegie Mellon University (Chair)
Dr. Jamie Callan, Carnegie Mellon University
Dr. Alex Hauptmann, Carnegie Mellon University
Dr. James Allan, University of Massachusetts Amherst
Dr. J. Shane Culpepper, Royal Melbourne Institute of Technology

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2018 Leonid Boytsov

Abstract

In this thesis we advance the state of the art in non-metric k-NN search by carrying out an extensive empirical evaluation (both intrinsic and extrinsic) of generic methods for k-NN search. This work contributes to establishing a collection of strong benchmarks for data sets with generic distances.

We start with intrinsic evaluations and demonstrate that non-metric k-NN search is a practical and reasonably accurate tool for a wide variety of complex distances. Somewhat surprisingly, achieving good performance does not require distance mapping/proxying via metric learning or distance symmetrization. Existing search methods can often work directly with non-metric and non-symmetric distances, and they outperform the filter-and-refine approach that relies on distance symmetrization in the filtering step.

Intrinsic evaluations are complemented with extrinsic evaluations in a realistic text retrieval task. In doing so, we make a step towards replacing or complementing classic term-based retrieval with a generic k-NN search algorithm. To this end, we use a similarity function that takes into account subtle term associations, which are learned from a parallel monolingual corpus.
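As a purely illustrative example (not code from the thesis) of what it means to work directly with a non-metric, non-symmetric distance, the sketch below runs exact brute-force k-NN search under the KL-divergence; the function names and toy data are our own:

```python
import math

def kl_divergence(x, y):
    # Kullback-Leibler divergence between two discrete probability
    # distributions: non-negative, but neither symmetric nor metric.
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

def brute_force_knn(query, data, k, dist=kl_divergence):
    # Exact k-NN: score every data point and keep the k closest indices.
    # Argument order matters: for a non-symmetric distance,
    # dist(query, point) and dist(point, query) generally differ.
    return sorted(range(len(data)), key=lambda i: dist(query, data[i]))[:k]

data = [
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.7, 0.2, 0.1],
]
query = [0.6, 0.3, 0.1]
print(brute_force_knn(query, data, k=2))  # → [2, 1]
```

Such an exhaustive scan is the accuracy baseline against which the approximate methods below are measured; it requires no symmetrization, only a consistent query-to-data evaluation direction.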
An exact brute-force k-NN search using this similarity function is quite slow. We show that an approximate search algorithm can be 100-300 times faster at the expense of only a small loss in accuracy (10%). On one data set, a retrieval pipeline using approximate k-NN search is twice as efficient as the C++ baseline while being as accurate as the Lucene-based fusion pipeline. We note, however, that it is necessary to compare our methods against more recent ranking algorithms.

The thesis concludes with a summary of lessons learned and open research questions relevant to this work. We also discuss potential challenges facing a retrieval system designer.

Acknowledgments

This work would have been impossible without the encouragement, inspiration, help, and advice of many people. Foremost, I would like to thank my advisor Eric Nyberg for his guidance, encouragement, patience, and assistance. I greatly appreciate his participation in writing a grant proposal to fund my research topic. I also thank my thesis committee: Jamie Callan, James Allan, Alex Hauptmann, and Shane Culpepper for their feedback.

I express deep and sincere gratitude to my family. I am especially thankful to my wife Anna, who made this adventure possible, and to my mother Valentina, who encouraged and supported both me and Anna.

I thank my co-authors Bileg Naidan, David Novak, and Yury Malkov, each of whom greatly helped me. Bileg sparked my interest in non-metric search methods and laid the foundation of our NMSLIB library [218]. Yury made key improvements to the graph-based search algorithms [197]. David greatly improved the performance of pivot-based search algorithms, which allowed us to obtain the first strong results for text retrieval [45].
I thank Chris Dyer for the discussion of IBM Model 1; Nikita Avrelin and Alexander Ponomarenko for implementing the first version of SW-graph in NMSLIB; Yubin Kim and Hamed Zamani for the discussion of pseudo-relevance feedback techniques (Hamed also greatly helped with Galago); Chenyan Xiong for the helpful discussion on embeddings and entities; Daniel Lemire for providing the implementation of the SIMD intersection algorithm; Lawrence Cayton for providing the data sets, the bbtree code, and answering our questions; Christian Beecks for answering questions regarding the Signature Quadratic Form Distance; Giuseppe Amato and Eric S. Tellez for help with data sets; Lu Jiang for the helpful discussion of image retrieval algorithms; Vladimir Pestov for the discussion on the curse of dimensionality; Mike Denkowski for the references on BLEU-style metrics; Karina Figueroa Mora for proposing experiments with the metric VP-tree applied directly to non-metric data. I also thank Stacey Young, Jennifer Lucas, and Kelly Widmaier for their help.

This research was primarily supported by the NSF grant #1618159: “Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond.” Furthermore, I was supported by the Open Advancement of Question Answering Systems (OAQA) group as well as by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533.

Contents

1 Introduction ... 1
  1.1 Research Questions and Contributions ... 3
2 Intrinsic Evaluation of k-NN Search Methods for Generic Spaces ... 7
  2.1 Introduction to k-NN Search ... 7
    2.1.1 Problem Formulation ... 7
    2.1.2 Properties of Distance Functions ... 13
    2.1.3 Need for Approximate Search ... 18
  2.2 Search Methods ... 21
    2.2.1 Data Mapping, Distance Learning and Transformation ... 24
    2.2.2 Related Work ... 26
    2.2.3 A Novel Modification of VP-tree ... 44
  2.3 Experiments ... 54
    2.3.1 Cross-Method Comparison with Diverse Data ... 54
    2.3.2 Experiments with Challenging Distances ... 68
    2.3.3 Comparison to TriGen ... 86
  2.4 Summary ... 98
3 Extrinsic Evaluation: Applying k-NN to Text Retrieval ... 101
  3.1 Approach ... 104
    3.1.1 Similarity Models ... 115
    3.1.2 Making k-NN Search Fast ... 119
  3.2 Exploratory Experiments ... 122
    3.2.1 Evaluating Lexical IBM Model 1 ... 123
    3.2.2 Evaluating BM25+Model1 for Linguistically-Motivated Features ... 125
    3.2.3 Evaluating Cosine Similarity between Dense Document Representations ... 130
  3.3 Main Experiments ... 132
    3.3.1 Setup ... 132
    3.3.2 Tuning ... 133
    3.3.3 Results ... 134
    3.3.4 Index Size and Creation Times ... 143
  3.4 Summary ... 150
4 Conclusion ... 153
  4.1 Lessons Learned and Lessons to be Learned ... 153
  4.2 Summary ... 161
Bibliography ... 163

List of Figures

2.1 Key properties and building blocks of search methods ... 28
2.2 An example of the Voronoi diagram/partitioning and its respective Delaunay graph. Thick lines define the boundaries of Voronoi regions, while thin lines denote edges of the Delaunay graph ... 35
2.3 A k-NN search algorithm using SW-graph ... 38
2.4 Voronoi diagram produced by four pivots πi. The data points are a, b, c, and d. The distance is L2 ... 40
2.5 Three types of query balls in VP-tree ... 44
2.6 A k-NN search algorithm using VP-tree ... 46
2.7 The empirically obtained pruning decision function Dπ,R(x) (part I) ... 48
2.8 The empirically obtained pruning decision function Dπ,R(x) (part II) ... 49
2.9 Distance values in the projected space (on the y-axis) plotted against original distance values (on the x-axis) ... 63
2.10 A fraction of candidate records (on the y-axis) that must be retrieved to ensure a desired recall level (on the x-axis) for 10-NN search ... 65
2.11 Improvement in efficiency (on the y-axis) vs. recall (on the x-axis) on various data sets for 10-NN search ... 66
2.12 Effectiveness of distance learning methods for 10-NN search (part I). Best viewed in color ... 71
2.13 Effectiveness of distance learning methods for 10-NN search (part II). Best viewed in color ... 72
2.14 Efficiency/effectiveness trade-offs of symmetrization in 10-NN search (part I). The number of data points is at most 500K. Best viewed in color ... 82
2.15 Efficiency/effectiveness trade-offs of symmetrization in 10-NN search (part II). The number of data points is at most 500K. Best viewed in color ... 83
2.16 Efficiency/effectiveness trade-offs of symmetrization in 10-NN search (part III). The number of data points is at most 500K. Best viewed in color ... 84
2.17 Improvement in efficiency vs. recall for VP-tree based methods in 10-NN search (part I). Best viewed in color ... 89
2.18 Improvement in efficiency vs. recall for VP-tree based methods in 10-NN search (part II). Best viewed in color ... 90
2.19 Improvement in efficiency vs. recall for VP-tree based methods in 10-NN search (part III). Best viewed in color ... 91
2.20 Improvement in efficiency vs. recall for VP-tree based methods in 10-NN search (part IV). Best viewed in color ... 92
2.21 Reduction in the number of distance computations vs. recall for VP-tree based methods in 10-NN search (part I). Best viewed in color ... 93
2.22 Reduction in the number of distance computations vs. recall for VP-tree based methods in 10-NN search (part II). Best viewed in color ... 94
2.23 Reduction in the number of distance computations vs. recall for VP-tree based methods in 10-NN search (part III). Best viewed in color ... 95
2.24 Reduction in the number of distance computations vs. recall for VP-tree based methods in 10-NN search (part IV). Best viewed in color ... 96
3.1 Retrieval pipeline architecture. We illustrate the use of evaluation metrics by dotted lines (metric names are placed inside ovals in the bottom right part of the diagram). Dotted lines also indicate which similarity models are used with which retrieval/re-ranking components. Similarity models are shown using rectangles with round corners and dotted boundaries ... 
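Figure 2.3 refers to a k-NN search algorithm over an SW-graph. The core idea, greedy best-first traversal of a proximity graph, can be sketched as follows; this simplified code and its names (e.g. the `ef` queue-size parameter) are our own illustration under common proximity-graph conventions, not the thesis's exact algorithm:

```python
import heapq

def graph_knn_search(graph, dist_to_query, entry, k, ef=10):
    # Greedy best-first traversal of a proximity graph (SW-graph-style sketch).
    # graph: node -> list of neighbor nodes; dist_to_query: node -> distance.
    d0 = dist_to_query(entry)
    visited = {entry}
    frontier = [(d0, entry)]   # min-heap of nodes still to expand
    best = [(-d0, entry)]      # max-heap (negated) of up to ef closest nodes seen
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= ef and d > -best[0][0]:
            break  # closest unexpanded node is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist_to_query(nb)
            if len(best) < ef or nd < -best[0][0]:
                heapq.heappush(frontier, (nd, nb))
                heapq.heappush(best, (-nd, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst result
    return [node for _, node in sorted((-d, n) for d, n in best)[:k]]

# Toy example: points on a line, chain edges plus one long-range "small world" link.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
graph[0].append(5)
graph[5].append(0)  # shortcut edge
print(graph_knn_search(graph, lambda n: abs(points[n] - 7.2), entry=0, k=2))  # → [7, 8]
```

Because the traversal only ever evaluates the distance from the query to visited nodes, the distance function may be non-metric and non-symmetric, which is what makes graph-based methods attractive in this setting.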