Efficient Random Projection Trees for Nearest Neighbor Search and Related Problems

EFFICIENT RANDOM PROJECTION TREES FOR NEAREST NEIGHBOR SEARCH AND RELATED PROBLEMS A Dissertation by Omid Keivani Master of Science, Ferdowsi University of Mashhad, Mashhad, Iran, 2014 Bachelor of Science, Sadjad Institute of Technology, Mashhad, Iran, 2011 Submitted to the Department of Electrical Engineering and Computer Science and the faculty of the Graduate School of Wichita State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy May 2019 © Copyright 2019 by Omid Keivani All Rights Reserved EFFICIENT RANDOM PROJECTION TREES FOR NEAREST NEIGHBOR SEARCH AND RELATED PROBLEMS The following faculty members have examined the final copy of this dissertation for form and content, and recommend that it be accepted in partial fulfillment of the requirement for the degree of Doctor of Philosophy with a major in Computer Science. ________________________________________ Kaushik Sinha, Committee Chair ________________________________________ Krishna Krishnan, Committee Member ________________________________________ Edwin Sawan, Committee Member ________________________________________ Chengzong Pang, Committee Member ________________________________________ Hongsheng He, Committee Member Accepted for the College of Engineering Dennis Livesay, Dean Accepted for the Graduate School Kerry Wilks, Interim Dean iii DEDICATION To my wife, my son, my brother and my parents iv ABSTRACT Nearest neighbor search (NNS) is one of the most well-known problems in the field of computer science. It has been widely used in many different areas such as recommender systems, classification, clustering etc. Given a database 푆 of 푛 objects, a query 푞, and a measure of similarity, the naive way to solve a NNS problem is to perform a linear search over the objects in database 푆 and return an object from 푆, which based on the similarity measure, is most similar to 푞. However, due to the growth of data in recent years, a solution better than linear time complexity is desirable. Locality sensitivity hashing (LSH) and random projection trees (RPT) are two popular methods to solve NNS problem in sublinear time. Earlier works have demonstrated that RPT has superior performance compared to LSH. However, RPT has two major drawbacks, namely, i) its high space complexity ii) if it makes a mistake at any internal node of a single tree, it cannot recover from this mistake and the rest of the search for that tree becomes useless. One of the main contributions of this thesis is to propose new methods to address these two drawbacks. To address the first issue, we design a sparse version of RPT which reduces the space complexity overhead without significantly affecting nearest neighbor search performance. To address the second issue, we develop various strategies that uses auxiliary information and priority function to improve nearest neighbor search performance of original RPT. We support our claims both theoretically and experimentally on many real-world datasets. A second contribution of the thesis is to use the RPT data structure to solve related search problems such as, maximum inner product search (MIPS) and nearest neighbor to query hyperplane (NNQH) search. Both these problems can be reduced to an equivalent NNS problem by applying appropriate transformations. In case of MIPS problem, we establish among many different transformations that reduce a MIPS problem to an equivalent NNS problem, which one is more preferable to be used in conjunction with RPT. In case of NNQH problem, the transformation that reduces NNQH problem to an equivalent NNS problem increases the data dimensionality tremendously and hence space complexity requirement of original RPT. In the latter case, we show that our sparse RPT version comes to rescue. Our NNQH solution which uses space efficient versions of RPT is used to solve active learning problem. We perform extensive empirical evaluations for both these applications on many real world datasets to show superior performance of our proposed methods compare to the state of the art algorithms. v TABLE OF CONTENTS Chapter Page 1 INTRODUCTION 1 1.1 NEAREST NEIGHBOR SEARCH . 1 1.2 LOCAL SENSITIVITY HASHING . 2 1.2.1 HASH FUNCTION . 3 1.2.2 P-STABLE DISTRIBUTION HASH FAMILY . 4 1.3 TREE BASED APPROACHES . 5 1.3.1 KD TREE . 5 1.3.2 SPILL TREE . 6 1.3.3 RANDOM PROJECTION TREE . 7 1.3.4 VIRTUAL SPILL TREE . 8 1.3.5 FAILURE PROBABILITY ANALYSIS . 8 1.4 NEAREST NEIGHBOR SEARCH AND RELATED PROBLEMS . 9 1.4.1 MAXIMUM INNER PRODUCT SEARCH . 10 1.4.2 NEAREST NEIGHBOR TO QUERY HYPERPLANE . 10 1.5 LIMITATIONS OF NEAREST NEIGHBOR SEARCH USING RPT . 11 1.6 OUR CONTRIBUTIONS . 12 2 SPACE EFFICIENT RPT 13 2.1 SPACE COMPLEXITY REDUCTION STRATEGY 1: RPTS . 14 2.2 SPACE COMPLEXITY REDUCTION STRATEGY 2: RPTB . 14 2.3 SPACE COMPLEXITY REDUCTION STRATEGY 3: SPARSE RPT . 16 2.3.1 ANALYSIS OF SPARSE RPT FOR NEAREST NEIGHBOR SEARCH . 17 2.4 EXPERIMENTAL RESULTS . 20 2.4.1 DATASETS . 21 2.4.2 COMPARISON OF SPARSE AND NON-SPARSE RPT . 21 3 IMPROVING THE PERFORMANCE OF A SINGLE TREE 23 3.1 DEFEATIST SEARCH WITH AUXILIARY INFORMATION . 24 3.2 GUIDED PRIORITIZED SEARCH . 27 3.3 COMBINED APPROACH . 30 3.4 EMPIRICAL EVALUATION . 31 3.4.1 EXPERIMENT 1 . 32 3.4.2 EXPERIMENT 2 . 33 3.4.3 EXPERIMENT 3 . 34 3.4.4 EXPERIMENT 4 . 36 3.4.5 EXPERIMENT 5 . 36 3.4.6 EXPERIMENT 6 . 37 3.5 CONCLUSION . 39 4 MAXIMUM INNER PRODUCTS PROBLEM 40 4.1 EXISTING SOLUTIONS FOR MIPS . 41 4.2 MAXIMUM INNER PRODUCT SEARCH WITH RPT . 44 vi TABLE OF CONTENTS (continued) Chapter Page 4.3 EMPIRICAL EVALUATIONS . 46 4.3.1 EXPERIMENT I : POTENTIAL FUNCTION EVALUATION . 46 4.3.2 EXPERIMENT II: PRECISION-RECALL CURVE . 49 4.3.3 EXPERIMENT III: ACCURACY VS INVERSE SPEED-UP . 50 5 ACTIVE LEARNING 53 5.1 WHAT IS ACTIVE LEARNING? . 53 5.2 POOL BASED ACTIVE LEARNING APPROACHES . 54 5.2.1 UNCERTAINTY SAMPLING . 54 5.2.2 QUERY BY COMMITTEE . 54 5.2.3 NEAREST NEIGHBOR TO QUERY HYPERPLANE (NNQH) . 55 5.3 PROPOSED METHOD . 56 5.4 EXPERIMENTAL RESULTS . 57 5.4.1 TOY EXAMPLE . 57 5.4.2 SVM SETTING . 57 5.4.3 ACCURACY VS SPEED-UP TRADE-OFF . 59 5.5 CONCLUSION . 60 6 CONCLUSION AND FUTURE WORK 62 6.1 FUTURE WORK . 62 BIBLIOGRAPHY 64 APPENDIXES 73 A CHAPTER 2 PROOFS 74 A.1 Proof of Lemma1 . 74 A.2 Proof of Lemma3 . 74 A.3 Proof of Theorem4 . 75 A.4 Proof of Corallary5 . 77 A.5 Proof of Corollary6 . 78 A.6 Proof of Lemma7 . 79 A.7 Proof of Lemma8 . 82 A.8 Proof of Lemma9 . 82 B CHAPTER 3 PROOFS 85 B.1 Proof of Theorem10 . 85 B.2 Proof of Theorem11 . 85 B.3 Proof of Theorem12 . 86 C CHAPTER 4 PROOFS 88 C.1 Proof of Theorem13 . 88 C.2 Proof of Corallary14 . 88 C.3 Proof of Theorem15 . 89 vii TABLE OF CONTENTS (continued) Chapter Page C.4 Proof of Theorem16 . ..

Load more