Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching

Leonid Boytsov
CMU-LTI-18-006

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Dr. Eric Nyberg, Carnegie Mellon University, Chair
Dr. Jamie Callan, Carnegie Mellon University
Dr. Alex Hauptmann, Carnegie Mellon University
Dr. James Allan, University of Massachusetts Amherst
Dr. J. Shane Culpepper, Royal Melbourne Institute of Technology

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2018 Leonid Boytsov

Abstract

In this thesis we advance the state of the art of non-metric k-NN search by carrying out an extensive empirical evaluation (both intrinsic and extrinsic) of generic methods for k-NN search. This work contributes to establishing a collection of strong benchmarks for data sets with generic distances.

We start with intrinsic evaluations and demonstrate that non-metric k-NN search is a practical and reasonably accurate tool for a wide variety of complex distances. Somewhat surprisingly, achieving good performance does not require distance mapping/proxying via metric learning or distance symmetrization: existing search methods can often work directly with non-metric and non-symmetric distances, and they outperform the filter-and-refine approach that relies on distance symmetrization in the filtering step.

Intrinsic evaluations are complemented with extrinsic evaluations in a realistic text retrieval task. In doing so, we make a step towards replacing or complementing classic term-based retrieval with a generic k-NN search algorithm. To this end, we use a similarity function that takes into account subtle term associations, which are learned from a parallel monolingual corpus. An exact brute-force k-NN search using this similarity function is quite slow. We show that an approximate search algorithm can be 100-300 times faster at the expense of only a small loss in accuracy (10%). On one data set, a retrieval pipeline using approximate k-NN search is twice as efficient as the C++ baseline while being as accurate as the Lucene-based fusion pipeline. We note, however, that it is necessary to compare our methods against more recent ranking algorithms.

The thesis concludes with a summary of lessons learned and open research questions relevant to this work. We also discuss potential challenges facing a retrieval system designer.
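As a minimal illustration of the two search strategies contrasted in the abstract, namely, applying a non-symmetric distance directly versus filtering with a symmetrized distance and refining with the original one, the sketch below compares both on synthetic data. This is not code from the thesis: the choice of KL-divergence, the candidate-list size, and all other parameter values are illustrative assumptions.

    # A minimal sketch (not thesis code): exact k-NN with a non-symmetric
    # distance vs. a filter-and-refine baseline that symmetrizes the distance
    # in the filtering step. KL-divergence and all parameters are assumptions.
    import numpy as np

    def kl_div(q, x):
        """Non-symmetric 'distance': KL(q || x) for dense probability vectors."""
        return float(np.sum(q * np.log(q / x)))

    def knn_direct(queries, data, k):
        """Exact brute-force k-NN that applies the original distance directly."""
        return [np.argsort([kl_div(q, x) for x in data])[:k] for q in queries]

    def knn_filter_and_refine(queries, data, k, n_candidates=100):
        """Filter with a symmetrized distance, then refine with the original one."""
        sym = lambda q, x: 0.5 * (kl_div(q, x) + kl_div(x, q))
        results = []
        for q in queries:
            cand = np.argsort([sym(q, x) for x in data])[:n_candidates]  # filtering step
            d_orig = [kl_div(q, data[i]) for i in cand]                  # refinement step
            results.append(cand[np.argsort(d_orig)[:k]])
        return results

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        data = rng.dirichlet(np.ones(16), size=1000)   # points on the probability simplex
        queries = rng.dirichlet(np.ones(16), size=5)
        print(knn_direct(queries, data, k=10)[0])
        print(knn_filter_and_refine(queries, data, k=10)[0])

If the symmetrized distance orders points very differently from the original one, the candidate list produced in the filtering step may miss true neighbors, which is one way to see why direct non-metric search can be preferable.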
Acknowledgments

This work would have been impossible without the encouragement, inspiration, help, and advice of many people. Foremost, I would like to thank my advisor Eric Nyberg for his guidance, encouragement, patience, and assistance. I greatly appreciate his participation in writing a grant proposal to fund my research topic. I also thank my thesis committee: Jamie Callan, James Allan, Alex Hauptmann, and Shane Culpepper for their feedback.

I express deep and sincere gratitude to my family. I am especially thankful to my wife Anna, who made this adventure possible, and to my mother Valentina, who encouraged and supported both me and Anna.

I thank my co-authors Bileg Naidan, David Novak, and Yury Malkov, each of whom greatly helped me. Bileg sparked my interest in non-metric search methods and laid the foundation of our NMSLIB library [218]. Yury made key improvements to the graph-based search algorithms [197]. David greatly improved the performance of pivot-based search algorithms, which allowed us to obtain the first strong results for text retrieval [45].

I thank Chris Dyer for the discussion of IBM Model 1; Nikita Avrelin and Alexander Ponomarenko for implementing the first version of SW-graph in NMSLIB; Yubin Kim and Hamed Zamani for the discussion of pseudo-relevance feedback techniques (Hamed also greatly helped with Galago); Chenyan Xiong for the helpful discussion on embeddings and entities; Daniel Lemire for providing the implementation of the SIMD intersection algorithm; Lawrence Cayton for providing the data sets, the bbtree code, and answering our questions; Christian Beecks for answering questions regarding the Signature Quadratic Form Distance; Giuseppe Amato and Eric S. Tellez for help with data sets; Lu Jiang for the helpful discussion of image retrieval algorithms; Vladimir Pestov for the discussion on the curse of dimensionality; Mike Denkowski for the references on BLEU-style metrics; and Karina Figueroa Mora for proposing experiments with the metric VP-tree applied directly to non-metric data. I also thank Stacey Young, Jennifer Lucas, and Kelly Widmaier for their help.

This research was primarily supported by the NSF grant #1618159: “Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond.” Furthermore, I was supported by the Open Advancement of Question Answering Systems (OAQA) group as well as by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533.

Contents

1 Introduction
  1.1 Research Questions and Contributions
2 Intrinsic Evaluation of k-NN Search Methods for Generic Spaces
  2.1 Introduction to k-NN Search
    2.1.1 Problem Formulation
    2.1.2 Properties of Distance Functions
    2.1.3 Need for Approximate Search
  2.2 Search Methods
    2.2.1 Data Mapping, Distance Learning and Transformation
    2.2.2 Related Work
    2.2.3 A Novel Modification of VP-tree
  2.3 Experiments
    2.3.1 Cross-Method Comparison with Diverse Data
    2.3.2 Experiments with Challenging Distances
    2.3.3 Comparison to TriGen
  2.4 Summary
3 Extrinsic Evaluation: Applying k-NN to Text Retrieval
  3.1 Approach
    3.1.1 Similarity Models
    3.1.2 Making k-NN Search Fast
  3.2 Exploratory Experiments
    3.2.1 Evaluating Lexical IBM Model1
    3.2.2 Evaluating BM25+Model1 for Linguistically-Motivated Features
    3.2.3 Evaluating Cosine Similarity between Dense Document Representations
  3.3 Main Experiments
    3.3.1 Setup
    3.3.2 Tuning
    3.3.3 Results
    3.3.4 Index Size and Creation Times
  3.4 Summary
4 Conclusion
  4.1 Lessons Learned and Lessons to be Learned
  4.2 Summary
Bibliography

List of Figures

2.1 Key properties and building blocks of search methods.
2.2 An example of the Voronoi diagram/partitioning and its respective Delaunay graph. Thick lines define the boundaries of Voronoi regions, while thin lines denote edges of the Delaunay graph.
2.3 A k-NN search algorithm using SW-graph.
2.4 Voronoi diagram produced by four pivots π_i. The data points are a, b, c, and d. The distance is L2.
2.5 Three types of query balls in VP-tree.
2.6 A k-NN search algorithm using VP-tree.
2.7 The empirically obtained pruning decision function D_{π,R}(x) (part I).
2.8 The empirically obtained pruning decision function D_{π,R}(x) (part II).
2.9 Distance values in the projected space (on the y-axis) plotted against original distance values (on the x-axis).
2.10 A fraction of candidate records (on the y-axis) that are necessary to retrieve to ensure a desired recall level (on the x-axis) for 10-NN search.
2.11 Improvement in efficiency (on the y-axis) vs. recall (on the x-axis) on various data sets for 10-NN search.
2.12 Effectiveness of distance learning methods for 10-NN search (part I). Best viewed in color.
2.13 Effectiveness of distance learning methods for 10-NN search (part II). Best viewed in color.
2.14 Efficiency/effectiveness trade-offs of symmetrization in 10-NN search (part I). The number of data points is at most 500K. Best viewed in color.
2.15 Efficiency/effectiveness trade-offs of symmetrization in 10-NN search (part II). The number of data points is at most 500K. Best viewed in color.
2.16 Efficiency/effectiveness trade-offs of symmetrization in 10-NN search (part III). The number of data points is at most 500K. Best viewed in color.
2.17 Improvement in efficiency vs. recall for VP-tree based methods in 10-NN search (part I). Best viewed in color.
2.18 Improvement in efficiency vs. recall for VP-tree based methods in 10-NN search (part II). Best viewed in color.
2.19 Improvement in efficiency vs. recall for VP-tree based methods in 10-NN search (part III). Best viewed in color.
2.20 Improvement in efficiency vs. recall for VP-tree based methods in 10-NN search (part IV). Best viewed in color.
2.21 Reduction in the number of distance computations vs. recall for VP-tree based methods in 10-NN search (part I). Best viewed in color.
2.22 Reduction in the number of distance computations vs. recall for VP-tree based methods in 10-NN search (part II). Best viewed in color.
2.23 Reduction in the number of distance computations vs. recall for VP-tree based methods in 10-NN search (part III). Best viewed in color.
2.24 Reduction in the number of distance computations vs. recall for VP-tree based methods in 10-NN search (part IV). Best viewed in color.
3.1 Retrieval pipeline architecture. We illustrate the use of evaluation metrics by dotted lines (metric names are placed inside ovals in the bottom right part of the diagram). Dotted lines also indicate which similarity models are used with which retrieval/re-ranking components. Similarity models are shown using rectangles with round corners and dotted boundaries.
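As a rough, hypothetical illustration of how a retrieval/re-ranking pipeline like the one in Figure 3.1 fits together (a candidate-generation step such as term-based retrieval or k-NN search, followed by re-ranking with a stronger similarity model), consider the sketch below. All function names, signatures, and cutoff values are assumptions made for illustration and do not reproduce the thesis implementation.

    # A hypothetical retrieve-then-re-rank skeleton; component names and
    # parameter values are illustrative assumptions, not the thesis pipeline.
    from typing import Callable, List, Tuple

    def run_pipeline(
        query: str,
        candidate_retrieval: Callable[[str, int], List[Tuple[str, float]]],
        rerank_score: Callable[[str, str], float],
        n_candidates: int = 100,
        top_k: int = 10,
    ) -> List[Tuple[str, float]]:
        """Retrieve a candidate list, then re-rank it with a stronger similarity model."""
        candidates = candidate_retrieval(query, n_candidates)   # e.g., term-based or k-NN retrieval
        rescored = [(doc_id, rerank_score(query, doc_id)) for doc_id, _ in candidates]
        rescored.sort(key=lambda pair: pair[1], reverse=True)    # higher score is better
        return rescored[:top_k]

    if __name__ == "__main__":
        # Toy components: a fixed candidate list and a trivial scoring function.
        toy_retrieval = lambda q, n: [("d1", 1.0), ("doc2", 0.5), ("document3", 0.1)][:n]
        toy_score = lambda q, doc_id: float(len(doc_id))
        print(run_pipeline("sample query", toy_retrieval, toy_score, n_candidates=3, top_k=2))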