The University of Newcastle, Australia

Doctoral Thesis

Improved Similarity Search for Large Data in Machine Learning and Robotics

Author: Josiah Walker B. Computer Science (Hons), The University of Newcastle

A thesis submitted in partial fulfilment of the requirements

for the degree of Doctor of Philosophy in Computer Science

in the

Interdisciplinary Machine Learning Research Group

School of Electrical Engineering and Computing

The University of Newcastle, Australia

Submitted April 2016

Declaration of Authorship

Statement of Originality

I, Josiah Walker, declare that this thesis titled, ‘Improved Similarity Search for Large Data in Machine Learning and Robotics’ and the work presented in it are my own. I confirm that this thesis contains no material which has been accepted for the award of any other degree or diploma in any university or other tertiary institution and, to the best of my knowledge and belief, contains no material previously published or written by another person, except where due reference has been made in the text. I give consent to the final version of my thesis being made available worldwide when deposited in the University’s Digital Repository, subject to the provisions of the Copyright Act 1968.

Signed:

Date:

Acknowledgements

I wish to thank everybody who has supported and encouraged me to make the completion of this thesis possible. There are too many friends, family members, and colleagues to list, all of whom have been understanding of the difficulties involved and the unusual work hours sometimes required to finish work on time, and have provided encouragement and inspiration along the way.

I want to especially thank my wife Rebecca, for sharing our marriage with my PhD, for being understanding and patient through all the various stages of my candidacy, and for consistently providing support during these times.

This work would not have been possible without the support, advice, and instruction of my wonderful supervisors Stephan Chalup and Ljiljana Brankovic. Thank you both for being supportive, patient, and encouraging during this time.

Dedicated to my great-grandmother Mary Bade. I will always remember your faith and courage.

Contents

Declaration of Authorship i

Acknowledgements ii

Contents iv

List of Figures vii

List of Tables ix

Abstract x

1 Overview 1
1.1 Applications of Nearest-Neighbour Search ...... 2
1.1.1 Motivation: Improving NNS for Machine Learning Applications ...... 4
1.2 Contributions ...... 4
1.2.1 Subsampled Locality Sensitive Hashing (Chapter 3) ...... 5
1.2.2 Approximate Boosted Similarity-Sensitive Collections (Chapter 4) ...... 5
1.2.3 Approximate Cover Tree Search (Chapter 5) ...... 6
1.2.4 Exact Streaming All-Nearest-Neighbours (Chapter 6) ...... 6
1.3 Thesis Organization ...... 7

2 Introduction 8
2.1 The Nearest-Neighbour Search Problem ...... 8
2.1.1 Measures of Similarity ...... 9
2.1.2 Feature Extraction for Similarity Search ...... 10
2.2 Exact Methods for k-Nearest-Neighbour Search ...... 12
2.2.1 KD-trees ...... 12
2.2.2 Metric trees ...... 15
2.2.3 Cover trees ...... 18
2.2.4 Empirical Comparison on Large Benchmark Data ...... 22
2.3 Approximate Nearest-Neighbour Tree Search ...... 25
2.3.1 Spill-trees ...... 27


2.3.2 Low-dimensional Projections ...... 30
2.4 Locality-Sensitive Hashing ...... 30
2.4.1 LSH Query Methods ...... 32
2.4.1.1 Naive Multi-Table Search ...... 33
2.4.1.2 Hamming Ball Search ...... 34
2.4.1.3 Multiprobe Search ...... 36
2.4.1.4 LSH Forest Search ...... 37
2.4.2 Query Method Performance Comparison ...... 38
2.4.3 Randomly Constructed Hash Families ...... 40
2.4.3.1 Gaussian Random Projections ...... 40
2.4.3.2 P-Stable projections ...... 41
2.4.3.3 Double Hadamard Projections ...... 42
2.4.3.4 Shift Invariant Kernels ...... 43
2.4.4 Data Driven Hash Families ...... 44
2.4.4.1 PCA Hashing with Random Rotations ...... 44
2.4.4.2 Iterative Quantization ...... 45
2.4.4.3 SPLH ...... 45
2.4.4.4 K-Means Hashing ...... 46
2.4.4.5 Product Quantization ...... 47
2.4.4.6 Boosted Similarity-Sensitive Coding ...... 48
2.4.4.7 Spectral Hashing and Manifold Methods ...... 49
2.4.5 Hierarchical Hashing ...... 50
2.4.5.1 Beyond LSH ...... 50
2.4.5.2 Locally Optimized Product Quantization ...... 51
2.5 Optimizing Approximate Search Collections ...... 52
2.5.0.1 Boosted Search Forests ...... 52
2.5.0.2 Data-Sensitive Hashing ...... 53
2.5.0.3 Reciprocal Dominant Hash Functions ...... 54
2.6 Chapter Summary ...... 55

3 Subsampling for fast Locality Sensitive Hashing 56
3.1 Locality-Sensitive Hash Functions ...... 58
3.2 Sub-sampled Locality Sensitive Hashing (SLSH) ...... 60
3.2.1 Sub-sampling on Sparse Data ...... 61
3.3 Experimental Protocols ...... 62
3.3.1 Test Data ...... 63
3.3.2 Test Procedure ...... 64
3.4 Precision and Recall Results ...... 68
3.5 Timed Performance Results ...... 70
3.6 Chapter Summary ...... 73

4 Approximate Boosting for Locality Sensitive Hashing 75
4.1 Introduction ...... 75
4.2 Related Work ...... 77

4.2.1 Locality-sensitive Hash-Function Families ...... 79
4.2.2 Optimising Locality Sensitive Hash Collections ...... 80
4.3 Approximate Boosting For Hash Function Collections ...... 82
4.3.1 Approximate Boosting for RDHF ...... 84
4.3.2 Simple Boosted Hash-Function Collections ...... 88
4.4 Methodology ...... 92
4.5 Results ...... 94
4.5.1 Effects of Training Data Size ...... 94
4.5.2 Comparative Performance ...... 95
4.6 Chapter Summary ...... 98

5 Improved Approximate Euclidean Search with Cover Trees 99
5.1 s-Approximate Cover Tree Search ...... 100
5.1.1 Children within the Cover Tree are Approximately Uniformly Distributed ...... 101
5.1.2 A Spherical Cap Bounds Intersection Volume ...... 103
5.1.3 Local Intrinsic Dimensionality Matters ...... 106
5.2 s-Approximate Cover Tree Search ...... 107
5.3 Experimental Results ...... 108
5.3.1 Comparison to Theoretical Results ...... 110
5.3.2 Performance for Varying Values of k ...... 112
5.3.3 Comparison to Locality-sensitive Hashing and Leaf-search ...... 115
5.4 Chapter Summary ...... 118

6 Creating Large-Scale Incremental K-Nearest-Neighbour Graphs 120
6.1 Constructing k-NN Graphs Incrementally Using max(ε, k)-NN queries ...... 123
6.2 Incremental k-NN Graphs with Cover Trees ...... 125
6.2.1 Complexity ...... 130
6.2.2 Experimental Results ...... 131
6.3 Chapter Summary ...... 134

7 Conclusion 136
7.1 Subsampled Locality-Sensitive Hashing (Chapter 3) ...... 136
7.2 Approximate Boosting for Locality-Sensitive Hashing (Chapter 4) ...... 137
7.3 Approximate Cover Tree Search (Chapter 5) ...... 138
7.4 Incremental k-NN Graph Construction (Chapter 6) ...... 138
7.5 Further Work ...... 139
7.6 Final Remarks ...... 140

A Additional Publications During PhD Candidacy 142

Bibliography 144

List of Figures

2.1 Visualisation of KD-Tree Queries ...... 15
2.2 Visualisation of Metric Tree Queries ...... 18
2.3 Visualisation of Cover Tree Queries ...... 19
2.4 PCA-projected standard deviations of BigANN datasets ...... 23
2.5 Exact tree search time comparison ...... 25
2.6 Leaf-limited approximate tree search times ...... 26
2.7 Leaf-limited approximate tree search recalls ...... 27
2.8 Spill-tree timed performance ...... 28
2.9 Spill-tree memory performance ...... 29
2.10 Visualisation of LSH naive multi-table search ...... 33
2.11 Visualisation of LSH Hamming ball search ...... 35
2.12 Visualisation of LSH Multiprobe search ...... 36
2.13 LSH-forest search visualisation ...... 38
2.14 LSH query method search times ...... 39

3.1 Sparse summed LSH precision for varying sum sizes ...... 62
3.2 SLSH Precision-Recall on Reuters and TDT-2 ...... 65
3.3 SLSH Precision-Recall on CIFAR-10 and LabelMe ...... 66
3.4 SLSH Precision-Recall on BigANN GIST and SIFT ...... 67
3.5 Performance of SLSH with increasing data size ...... 69
3.6 Tuned SLSH performance comparison ...... 71
3.7 Tuned SLSH construction performance ...... 72

4.1 SimpleBoost optimisation gains with varying training data and candidate size ...... 95
4.2 LSH optimisation performance on BigANN SIFT ...... 96
4.3 LSH optimisation performance on BigANN GIST ...... 97

5.1 Variations in cover tree density ...... 102
5.2 Cover tree query volume and spherical cap volume ...... 104
5.3 Cover tree 1-NN performance on MNIST digits ...... 110
5.4 Cover tree 1-NN performance on BigANN SIFT ...... 111
5.5 Cover tree k-NN performance on LabelMe ...... 112
5.6 Cover tree k-NN performance on CIFAR-10 ...... 113
5.7 Cover tree k-NN performance on BigANN SIFT ...... 114
5.8 Cover tree k-NN performance on BigANN GIST ...... 114
5.9 s-approximate search precision-recall comparison ...... 117

6.1 Using global k-NN graph properties to determine search radius ...... 124
6.2 Using local k-NN graph properties to determine search radius ...... 129
6.3 Streaming 10-NN graph comparison on BigANN SIFT ...... 132
6.4 Streaming 10-NN graph comparison on BigANN GIST ...... 133

List of Tables

3.1 Mean Area Under the Precision-Recall Curve (AUPRC) for all tests, best results bolded. SLSH is competitive on smaller datasets and 32 hash-bit sizes, but does not maintain performance as the hash and data sizes are increased...... 68

4.1 Comparison of Locality Sensitive Search Optimisation Algorithms ...... 89
4.2 Evaluating SimpleBoost objective functions ...... 90
4.3 LSH optimisation runtimes (in seconds) ...... 98

6.1 Streaming 10-NN graph total runtimes ...... 134

THE UNIVERSITY OF NEWCASTLE

Abstract

Faculty of Engineering and the Built Environment

School of Electrical Engineering and Computing

Doctor of Philosophy

Improved Similarity Search for Large Data in Machine Learning and Robotics

by Josiah Walker

This thesis presents techniques for accelerating similarity search methods on large datasets. Similarity search has applications in clustering, image segmentation, classification, robotic control and many other areas of machine learning and data analysis. While database and data analysis applications of similarity search have traditionally been oriented toward search throughput, in the areas of online classification and robotic control it is also important to consider total memory usage, scalability, and sometimes the construction cost of the search structures.

This thesis presents a locality-sensitive hash (LSH) code generation method which has a lower computational and technical cost than baseline methods, while maintaining performance across a range of datasets. This method has an exceptionally fast search structure construction time, which makes it suitable for accelerating even a small number of queries without a prior search structure.

A simplified boosting framework for locality-sensitive hash collections is also presented. Applying this framework speeds up existing LSH boosting algorithms without loss of performance. A simplified boosting algorithm is given which improves performance over a state-of-the-art method while also being more efficient.

This thesis also presents an approximate search method for the cover tree search structure. The performance of this method is shown to be superior to existing approximate search methods for cover trees, and significantly better than LSH based search at high recall levels.

Finally, an improved method for k-NN graph construction is presented. This method uses cover trees to achieve superior performance to other algorithms at a wide range of scales, demonstrating high performance even on datasets of one million items.

These techniques provide a number of improved methods for similarity search, which are particularly useful in situations where analysis or construction cannot be done offline. The methods provided improve upon the state-of-the-art in several cases, and can be applied to many data analysis and machine learning algorithms.

Chapter 1

Overview

In many areas of computer science and machine learning research, querying a database for similar items is a foundational step of data analysis. Establishing connections between similar pieces of data allows important insights to be gained, and is applicable in areas of machine learning including clustering and data analysis, classification, dimensionality reduction and visualisation, planning, machine vision, and robotic control. As the size of data increases, it becomes even more important to find the relevant information for a given situation. The task of finding relevant data is often called nearest-neighbour search (NNS), or similarity search.

Fast NNS is particularly relevant to areas of research around autonomous robotics. In the area of machine vision, NNS can be used to classify, label and associate objects (Dean et al., 2013, García-Pedrajas and Ortiz-Boyer, 2009, Siagian and Itti, 2007). In the area of control algorithms, NNS can be used to build direct models from the robot’s experience (Hafez and Loo, 2015, Martín H. et al., 2009). As robotic systems become more autonomous and run for longer periods of time, more data can be collected and stored.

NNS can be seen as an important underlying component or building block, which is used in many algorithms and applications in the areas of machine learning and robotics. The non-accelerated solution for a single NNS query has a running time of O(n), making it unsuitable for fast queries on large datasets. Improving accelerated search methods for NNS is an actively advancing and relevant topic. Addressing the scalability of the underlying NNS algorithms then directly affects the scalability of a multitude of algorithms and applications which depend on it.

Motivated by the importance of NNS in machine learning and data analysis and the growing speed of data collection, an important question arises: how can the scalability of NNS be improved? In this work, methods are developed to address the underlying need for more efficient NNS algorithms. These methods include both approximate and exact NNS, and improvements over current algorithms are demonstrated. Some of the applications of these algorithms are covered in the following section.

1.1 Applications of Nearest-Neighbour Search

As this work is primarily concerned with the improvement of underlying search algorithms, it is useful to examine the applications of NNS and take note of the common requirements for fast, large-scale search algorithms. This section describes a few applications of NNS, and the computational requirements these applications have.

Bottom-up data clustering algorithms, such as DBSCAN (Ester et al., 1996), rely on k-nearest-neighbour (KNN) graphs in order to construct clusters. For DBSCAN, the KNN graph contains information about the local density of data around each point. Assuming clusters are denser toward their centres, this information can be used to label points which are “inside” clusters, and points which are “on the edge” of clusters. Points on the “inside” of clusters can then be grouped together and labelled as distinct clusters, and “edge” points can be assigned based on their neighbours. For this application, it is desirable to query the nearest neighbours for the entire dataset. On larger datasets, this means that queries must execute very quickly, otherwise the amount of time taken to build the graph will make clustering impractical.

NNS is often used for classification (García-Pedrajas and Ortiz-Boyer, 2009, Liu and Cao, 2015). K-nearest-neighbour (KNN) classifiers can provide high levels of performance from raw labelled data without requiring complicated learning processes. This is achieved simply by embedding labelled examples into a search structure, and returning the k closest results upon query. The labels of these results then allow an estimate of the query point’s label.

With enough data samples, KNN-based classification can be very competitive with state-of-the-art classification methods, while also being simple to apply.
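
As a concrete illustration of this embed-and-query pattern, the short sketch below uses scikit-learn's KNeighborsClassifier on synthetic data. It is illustrative only; the data, labels, and library are assumptions for the example and are not the implementation or datasets used in this thesis.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 16))        # labelled examples embedded in the search structure
    y_train = (X_train[:, 0] > 0).astype(int)    # toy labels for illustration only
    X_query = rng.normal(size=(5, 16))

    clf = KNeighborsClassifier(n_neighbors=5)    # the k closest stored examples vote on the label
    clf.fit(X_train, y_train)
    print(clf.predict(X_query))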

NNS-based classification can also be used to pointwise-approximate complex continuous functions by making the labels continuous values. An example application of this in machine learning is the kNN-TD learning algorithm (Martín H. et al., 2009), which was highly successful in solving control problems in reinforcement learning competitions. However, its use of KD-trees and grid-spaced interpolation points limited its applicability to high-dimensional data and therefore its real-world applicability, because these NNS methods scale exponentially in time and memory use with the input dimensionality. In order to apply KNN-based reinforcement learning to robots with complex dynamics and high degrees of freedom, an efficient and scalable KNN search and storage structure is required for this online learning algorithm.

Many state-of-the-art vision-based navigation algorithms also use a variation of NNS for re-localising. Particularly as a part of Simultaneous Localization And Mapping (SLAM) algorithms, a topological graph of location features is often stored to match against, in order to close loops in navigation or re-localise when lost. Efficient similarity search is important for mapping. These topological maps will become very large as machines gain more autonomy in exploration and navigation of their environment. Algorithms must therefore scale efficiently to deal with larger topologies in an incremental manner.

Often, data visualisation techniques use NNS to preserve structure in low-dimensional visualisations. Examples of these are Isomap (Tenenbaum et al., 2000), LLE (Roweis and Saul, 2000), and t-SNE (van der Maaten and Hinton, 2008), which each process a set of nearest neighbours for all points when constructing low-dimensional visualisations. A key requirement for this kind of analysis is that the KNN graph can be constructed quickly. This places focus on both the search structure construction time and the resultant query time used to construct the graph, requiring both to be very fast.

Recent combinations of classification and manifold techniques have produced promising results for streaming problems. The incremental Isomap algorithm (Shi et al., 2005) was used on streaming X-ray images to classify respiratory movement with state-of-the-art accuracy (Fischer et al., 2014). However, the algorithm uses a brute-force method for creating the KNN graph required for Isomap, which limits its scalability as the size of the data stream grows. In this case, both the search and insertion times for KNN graph updates are important, and the data structure must deal with incremental construction efficiently.

1.1.1 Motivation: Improving NNS for Machine Learning Applications

We can see from the applications of NNS that requirements for search algorithms can vary significantly. For some applications such as clustering and low-dimensional manifold embeddings, the entirety of the search algorithm's execution time, from search structure construction to querying the whole dataset, is very important. This motivates the exploration of faster methods for constructing and querying high-performance approximate search structures. These themes are further explored in Chapters 3, 4 and 5.

Reinforcement learning based control, topological SLAM navigation, and incremental manifold learning techniques are all online NNS-based algorithms. For these methods, data arrives during learning, and inserts and queries on the search database are interleaved. This creates a requirement for efficient updates and queries of the search structure even when the distribution of the data might change significantly over time. In the case of incremental Isomap and some other algorithms, an efficient update algorithm for k-NN graphs is needed as well. These requirements motivate the solutions investigated in Chapters 5 and 6.

1.2 Contributions

This section outlines the main contributions of this thesis, and provides references to the appropriate chapters for each contribution. Each of the contributions described has been developed to improve performance for the machine-learning oriented applications of NNS which have been described in this chapter. Together, these contributions aim to make a significant impact on research in the area of NNS, providing new frameworks and results for NNS optimisation, k-NN graph construction, and approximate NNS query methods. These methods improve search structure construction time, query efficiency, and incremental k-NN graph construction time, allowing for faster and more scalable NNS on large datasets.

1.2.1 Subsampled Locality Sensitive Hashing (Chapter 3)

A popular method for reducing the complexity of data in order to perform NNS is to use random projections for dimensionality reduction (Liu et al., 2005). We develop and explore the performance of a sub-sampling method as a replacement for random projections (Chapter 3). This method does not perform projections, and so calculation time can be independent of the dimensionality of the data. We show that for datasets up to the order of 10^4 elements, and for high-dimensional datasets, this method is faster than the state-of-the-art for performing NNS.

In addition, sub-sampling reduces database construction times significantly, making it suitable for ad-hoc search. In one particular test (raw image search), sub-sampling performed both database construction and the search in less time than it took the fastest competitor to perform construction only. The construction times observed are fast enough that sub-sampled LSH searching could be applied directly to real-time image processing.

One weakness of subsampled LSH is that it has lower recall and precision than the other LSH methods tested. In order to improve this, supervised methods for optimising LSH were investigated as a follow-up study. The improvement of LSH optimisation methods in Chapter 4 forms the next major part of this work.

1.2.2 Approximate Boosted Similarity-Sensitive Collections (Chapter 4)

Applications of NNS such as k-nearest-neighbour classification and internet search queries require highly efficient query mechanisms, but do not place limitations on search structure construction times. In these cases, learning-based approaches to optimising approximate similarity search can significantly improve performance. A collection of results has been published in this area, using pairwise error weights on data-points to aid optimisation (Gao et al., 2014, Liu et al., 2016, 2013). In order to reduce the complexity of optimising, previous approaches sparsely sample the training data pairs rather than using all of the information available to them.

We develop an approach to approximating a large portion of the pairwise error weights, reducing memory usage to linear in the number of data-points (Chapter 4). We also show that the resulting framework improves the algorithmic complexity of LSH optimization, and can be applied to existing algorithms with a resulting improvement in speed.

We also give a new general optimisation algorithm, “SimpleBoost”, which is able to optimise any randomised approximate similarity search structure, and show that it outperforms the previous state-of-the-art for general LSH optimization in terms of speed, precision, and recall. To show the generality of this algorithm, SimpleBoost is demonstrated to be effective in optimizing LSH-Forest search performance. By contrast, the previous state-of-the-art optimiser could not improve over random collections for LSH-Forest due to its stronger assumptions on the type of search algorithm used.

1.2.3 Approximate Cover Tree Search (Chapter 5)

We present an approximate search algorithm for cover trees (Beygelzimer et al., 2006), with a theoretical basis for lower-bound performance in Euclidean space. This search algorithm has a single parameter controlling approximation, which gives a smooth trade-off between search accuracy and query speed. Comparisons with a state-of-the-art LSH method show that for high accuracy levels, approximate cover tree search is very competitive in terms of performance. Cover trees also provide a significant space advantage over table-based lookup structures, as the tree is represented directly and in linear space using data points, with minimal space overhead.

1.2.4 Exact Streaming All-Nearest-Neighbours (Chapter 6)

Recently, several advanced manifold embedding algorithms have been developed which make use of an incrementally updated k-nearest-neighbour graph in order to perform online analysis of data streams. One example is the incremental Isomap algorithm (Gao and Liang, 2015, Shi et al., 2005), which has been applied to image processing (Fischer et al., 2014). The incremental Isomap algorithm proposes streaming-based improvements to all parts of the original algorithm except the graph construction step. When applied to image processing, this approach has obvious performance limitations for longer data streams.

Motivated by this, we develop a variant of spatial tree search and insertion which improves the performance of adding points to a directed k-nearest-neighbour graph (Chapter 6). The solution works by bounding candidate search using the existing KNN graph structure, so that only changes to the smallest relevant sub-graph are considered at each level of the spatial tree search. Results show that this improves runtimes over brute force search significantly, allowing algorithms using dynamic KNN graphs to scale to much larger datasets in real-time.

1.3 Thesis Organization

The idea of searching for items in a database which are similar to a given query item comprises a significant amount of business today. This task is performed everywhere from web document search to music recommendations, and the high volume of queries to be processed drives new development and research into faster and more memory ecient query methods.

The utility of similarity search becomes immediately apparent whenever we think of a situation in which some words of a book, movie, or quote are partially remembered, and we desire to find the source information. In this case, a web search will perform a similarity query between the half-remembered words and the collection of documents which have been indexed by the search engine. From a technology perspective, this type of task introduces several complex requirements.

Chapter 2

Introduction

2.1 The Nearest-Neighbour Search Problem

The question of finding items that are close to some query point in Euclidean space occurs often in data analysis. This could be for example-based classification (García-Pedrajas and Ortiz-Boyer, 2009), clustering (Ester et al., 1996), data summarisation and compression (Gersho and Gray, 1991), or dimensionality reduction and visualisation (Talwalkar et al., 2008, Tenenbaum et al., 2000). In all cases, the question posed can be reduced to: what is the closest item (or items) in my dataset to the given query point in my chosen measurement space?

To state this formally, the 1-nearest-neighbour query problem is defined in the Euclidean l_2 space for a query q as returning the dataset item d_i ∈ D which minimises ||q − d_i||_2. Frequently, more points than just the single nearest neighbour are used in analysis algorithms. A natural extension of 1-nearest-neighbour search is to define the problem for the set of k nearest neighbours to some query point q. This is usually abbreviated to k-NN search. k-NN search is a basic building block for clustering algorithms such as DBSCAN (Ester et al., 1996), where the density of several nearest neighbours is used in establishing cluster bounds.
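
For reference, a minimal brute-force k-NN query can be written directly from this definition. This is a NumPy sketch of the O(n)-per-query baseline that the structures in this chapter aim to beat; it assumes k < |D| and Euclidean distance.

    import numpy as np

    def knn_brute_force(D, q, k):
        # Distance from q to every item in D: O(n) work for a single query.
        dists = np.linalg.norm(D - q, axis=1)
        # Select the k smallest distances, then order them by distance.
        idx = np.argpartition(dists, k)[:k]
        return idx[np.argsort(dists[idx])]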

Analytical solutions for faster k-NN search can be complex. The local properties of datasets include changes in density which affect the optimal search and construction settings of many algorithms. As a simplification, most algorithms target ε-nearest-neighbour search. This type of search returns all neighbours within distance ε of the query point q. The value of ε can then be chosen to guarantee high performance in most cases, with corner cases being handled by extending the ε search. In the following chapter we discuss several algorithms for ε-nearest-neighbour and k-NN search, with the understanding that the ε-only algorithms can usually be extended to solve k-NN search either exactly or approximately.

2.1.1 Measures of Similarity

The similarity of two items of data can be measured in many different ways. Often the term distance is used to denote a measure of dissimilarity (Rajaraman and Ullman, 2011). Usually, we try to define a similarity or distance measure which matches our intuitive sense of similarity between objects as closely as possible. In some areas of study, measures of similarity do not even need to be symmetric, such as the Kullback–Leibler divergence (Kullback and Leibler, 1951). However, most similarity measures are chosen to form a metric space, and are therefore symmetric. A metric space has a distance function defined for all points, d(x, y), such that:

d(x, y) ≥ 0 (2.1)

d(x, y) = d(y, x) (2.2)

d(x, y) = 0 ⟺ x = y (2.3)

d(x, y) + d(y, z) ≥ d(x, z) (2.4)

Commonly used measures of similarity for search include the Euclidean or l_2 distance, the city-block or l_1 distance, the χ² distance (Gorisse et al., 2012), and the earth-mover's distance (Izbicki and Shelton, 2015, Lv et al., 2004). These measures all form metric spaces, and so similarity search methods which operate on general metric space properties can work with all of them. Methods of speeding up search on metric spaces usually use the definition of ball distance for search structures. Due to the generality of metric spaces, it is difficult to define search structures other than those constructed using ball search.

The Euclidean and city-block measures are additionally part of a set of vector spaces with stronger guarantees with regard to measure and projection. These are part of the l_p spaces, defined as having the distance measure:

d_p(x, y) = (Σ_i |x_i − y_i|^p)^(1/p)

Using a value p ≥ 1 forms a Banach vector space. Banach l_p spaces have additional vector space properties which allow operations such as projection, and can define angles between vectors. These properties can often be helpful in improving bounds on similarity search time, but put restrictions on the distance measure used.
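
Written as code, the l_p distance is straightforward. The following is a minimal NumPy sketch (the function name is chosen here for illustration):

    import numpy as np

    def minkowski_distance(x, y, p):
        # l_p (Minkowski) distance: p = 2 gives the Euclidean distance, p = 1 the city-block distance.
        return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

For example, minkowski_distance([0, 0], [3, 4], 2) returns 5.0, while with p = 1 it returns 7.0.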

In practice, it can often be easier to define a feature extraction pre-processing step before measuring distance. This step is designed to reduce the input data to a set of features where a simpler distance metric achieves the desired results. The Euclidean metric is often favoured for distance calculation, as the meaning of distance is more intuitive in this space.

2.1.2 Feature Extraction for Similarity Search

When data is represented by an appropriate feature set, the Euclidean similarity of two items can be interpreted as an approximate measure of class similarity. Examples of these feature sets include GIST and SIFT visual features (Lowe, 1999, Siagian and Itti, 2007) for images. These features provide a clustering of similar visual objects and scenes in Euclidean space, allowing content search for images using Euclidean distance. Similarly for text data, TF-IDF features provide a sparse representation of content which can be compared using Euclidean distance (Sparck Jones, 1988).

SIFT features were originally developed for object recognition and computer vision (Lowe, 1999). SIFT features operate from a code-book of “distinctive” image features, identifying whether these features exist in an image. For image categorisation and comparison, a histogram of these features over the whole image can be used. SIFT histograms are somewhat less dense than other measures such as GIST, especially at higher codebook sizes. This allows SIFT features to be used for localisation problems such as SLAM in addition to direct image similarity comparisons (Chen and Chan, 2007). SLAM localisation algorithms also use image similarity (or partial similarity) matching to detect loop closures.

GIST features reduce image data to a lower-dimensional information vector which sum- marises important statistics of the image (Siagian and Itti, 2007). These statistics capture orientation and texture at multiple scales, giving a more appropriate Euclidean measure of the similarity of content in images. The representation of multiple scales is achieved by convolution with a set of filters at multiple scales of the input image, before averaging values for various segments of the image. GIST features are used for many image search databases as they provide a relatively straightforward, high-quality Euclidean similarity measure (Jegou et al., 2011a, Russell et al., 2008).

Text similarity is often measured by a very sparse, high-dimensional representation named TF-IDF (Sparck Jones, 1988). This stands for “Term Frequency-Inverse Document Frequency”, and is comprised of two parts. Term frequency measures the number of times a word (or phrase) occurs within one document. These frequencies help to understand which terms are characteristic of single documents. Inverse document frequency measures how commonly words (or phrases) occur across many documents. As the frequency count is inverse, words common to many documents have their relative importance for the final feature vector reduced. The TF-IDF combination effectively singles out characteristic words (or phrases) which are not common to all topics and documents as important feature candidates.

As a collection of documents can contain a huge number of words (or phrases) even after culling those determined to be uninformative for the task at hand, TF-IDF feature vectors are usually very sparse. As an example, “crystal” may be mentioned many times in a materials science document, but not at all in a computer science document. This means there are no entries for “crystal” or many other discipline-specific terms in the computer science document's TF-IDF vector. TF-IDF is a useful example feature set for investigating similarity search performance on high-dimensional, sparse data.
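
The sparsity described above is easy to observe with an off-the-shelf TF-IDF implementation. The sketch below uses scikit-learn on two toy documents; these are not the corpora used in this thesis.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["crystal lattice defects in metal alloys",          # materials science style document
            "cover trees accelerate nearest neighbour search"]  # computer science style document
    X = TfidfVectorizer().fit_transform(docs)   # sparse matrix: one TF-IDF row per document
    print(X.shape, X.nnz)                       # nnz shows how few entries are actually non-zero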

2.2 Exact Methods for k-Nearest-Neighbour Search

In this section, we review the development of exact methods for k-NN search. A naive approach would be to compare the query item q to all data items in D, resulting in |D| distance calculations for every query. For large datasets, running multiple queries can quickly become computationally expensive with this naive linear approach.

A number of methods have been developed to segment and search the data space efficiently for nearest-neighbour search. These generally use hierarchical representations of the data space, allowing large sections of it to be left unsearched when a query is executed. The algorithms reviewed here are KD-trees, metric trees, and cover trees. While variations on each of these algorithms exist, these algorithms represent a summary of the historically significant steps forward in improving the efficiency of exact nearest-neighbour search.

2.2.1 KD-trees

KD-trees were applied to accelerate search for the k-NN problem by Friedman et al. (1977). This successfully reduced search time from O(n) to O(log_k(n)) by dividing the search space up into an efficiently searchable tree structure (Zolnowsky, 1978). This efficient search structure brings a 2n memory cost associated with creating the nodes of the KD-tree. Algorithm 1 gives the recursive procedure for constructing a KD-tree. Variations on KD-trees exist which seek to divide along the best axis according to better metrics for search performance. These variations generally improve performance somewhat, at the expense of slightly slower tree construction.

In Algorithm 1, pseudocode is given for KD-tree construction based on the description of Kibriya and Frank (2007), which uses an adaptive widest midpoint decision plane. The decision plane defined by KD-trees limits their use to vector spaces, where such planes can be defined. More general metric spaces cannot be directly indexed with KD-trees without modifications, as there may not be a notion of orthogonal basis in these spaces. In the case of general metric spaces, partitions must be defined in terms of distance to a point or set of points. Calculating distance is generally more expensive than calculating axis-aligned splits, so KD-trees can be much faster for low-dimensional Euclidean search. This makes KD-trees very efficient for processing 3D point-cloud data.

Algorithm 1 KD-tree construction using sliding widest-midpoint decision boundaries.
Require: Data: the dataset to be processed by this node.
Require: LowerBound: lower bounds for Data.
Require: UpperBound: upper bounds for Data.
Require: LeafSize: maximum leaf size.

function MakeKDTree(Data, LowerBound, UpperBound, LeafSize)
    result ← KDTreeNode()
    if sizeOf(Data) <= LeafSize then
        result.data ← Data
    else
        splitDim ← indexOfMax(UpperBound - LowerBound)
        splitVal ← (LowerBound[splitDim] + UpperBound[splitDim]) / 2
        // Slide the midpoint so that neither side of the split is left empty.
        while countTrue(d[splitDim] > splitVal for d ∈ Data) == 0 do
            decrease splitVal
        end while
        while countTrue(d[splitDim] <= splitVal for d ∈ Data) == 0 do
            increase splitVal
        end while
        result.splitVal ← splitVal
        result.splitDim ← splitDim
        leftBound ← UpperBound.copy()
        leftBound[splitDim] ← splitVal
        rightBound ← LowerBound.copy()
        rightBound[splitDim] ← splitVal
        leftData ← {d ∈ Data where d[splitDim] <= splitVal}
        rightData ← {d ∈ Data where d[splitDim] > splitVal}
        result.left ← MakeKDTree(leftData, LowerBound, leftBound, LeafSize)
        result.right ← MakeKDTree(rightData, rightBound, UpperBound, LeafSize)
    end if
    return result
end function

Search in a KD-tree can be performed efficiently given a neighbour radius ε and query point q. At every level of the tree, the splitting plane is checked to see if it intersects with the ε-radius sphere centred on q. If the sphere intersects the splitting plane, both children of the current node are included in the search. Otherwise, only the child intersecting the sphere is included in the search. The ε-nearest-neighbour search can be efficiently extended to k-NN search by keeping the k closest neighbour candidates in a heap, and setting ε to the distance of the candidate furthest from the query point. If there are fewer than k neighbours found so far, ε is set to ∞. The algorithm for k-nearest-neighbour search is given in Algorithm 2. For efficiency, only squared distances are calculated.

Algorithm 2 KD-tree search for k-nearest-neighbours.
Require: Query: the query point to be matched.
Require: k: the number of neighbours to return.
Require: KDRoot: the root node for the KD-tree.

function SearchKDTree(Query, k, KDRoot)
    result ← MaxHeap()
    Fill result with k {weight, data} pairs set to {∞, NULL}
    candidates ← MinHeap()
    candidates.push({0, KDRoot})
    while candidates.peek().weight < result.peek().weight do
        {dist, node} ← candidates.pop()
        while node has left and right children do
            if Query[node.splitDim] > node.splitVal then
                farChild ← node.leftChild
                farDist ← dist + (Query[node.splitDim] - node.splitVal)^2
                node ← node.rightChild
            else
                farChild ← node.rightChild
                farDist ← dist + (node.splitVal - Query[node.splitDim])^2
                node ← node.leftChild
            end if
            // Queue the far branch with a lower bound on its squared distance to the query.
            candidates.push({farDist, farChild})
        end while
        for d ∈ node.data do
            dist ← ||d - Query||^2
            if dist < result.peek().weight then
                result.pop()
                result.push({dist, d})
            end if
        end for
    end while
    return result
end function

Due to the way KD-trees separate the search space, the dimensionality of the data appears in the exponent of the KD-tree search complexity. This means that for high enough dimensionality data, brute force search could have a lower time complexity. To illustrate the effects of dimensionality on KD-tree search time visually, Figure 2.1 shows the matched leaves for one- and two-dimensional KD-tree searches on example data. As the dimensionality of the data increases, we can see that the number of candidate tree leaves considered increases even more quickly. This limitation prevents the efficient use of KD-trees for data types such as images and TF-IDF coded documents.

Figure 2.1: KD-tree queries on 1D and 2D data, for k = 3. Lines indicate tree decision boundaries, red-dashed lines are inspected in the example query. As dimensionality increases, the average number of leaves searched (cyan shaded areas) also increases.

2.2.2 Metric trees

Metric trees, also called vantage-point trees or ball trees, were developed to scale exact search to higher dimensions more efficiently than previous methods (Yianilos, 1993). The primary difference between metric trees and KD-trees is the use of bounding balls rather than bounding planes to determine data separation. The use of a radial division rather than the KD-tree's hyperplane division also allows metric trees to be constructed on more general metric spaces as well as vector spaces. This gives metric trees more flexibility in their applications for similarity matching. Focusing on large-scale search, metric trees offer O(c^d log(n)) query complexity for some constant c, provided that n > log_2(d). Metric trees which use data points as centroids generally have a memory usage of n graph nodes, since every node in the tree uses up a unique database point. Several variations of metric trees exist, with varying benefits and complexities.

Pseudocode for metric tree construction is given in Algorithm 3. The splitting criterion for nodes has again been based on the description given by Kibriya and Frank (2007). Constructing a node consists of recursively selecting the point in the node’s subset of data with the largest distance to the dataset mean as the centre of the ball for the current split. The radius is set to half the distance from the centre of the ball to the furthest point from it in the node’s subset of data. The data is then split between children of the tree node according to being inside or outside of the constructed ball.

Algorithm 3 Metric tree construction using furthest-from-the-mean ball construction.
Require: Data: the dataset to be processed by this node.
Require: Mean: the mean value of the data.
Require: LeafSize: maximum leaf size.

function MakeMetricTree(Data, Mean, LeafSize)
    result ← MetricTreeNode()
    if sizeOf(Data) <= LeafSize then
        result.data ← Data
    else
        // Use the point furthest from the dataset mean as the centre of the ball.
        splitPt ← Data[indexOfMax(norm(Data - Mean))]
        Data.delete(splitPt)
        // Radius: half the distance to the furthest remaining point (see text above).
        splitVal ← max_{d ∈ Data}(||d - splitPt||) / 2
        result.splitPt ← splitPt
        result.splitVal ← splitVal
        leftData ← {d ∈ Data where ||d - splitPt|| <= splitVal}
        rightData ← {d ∈ Data where ||d - splitPt|| > splitVal}
        result.left ← MakeMetricTree(leftData, Mean, LeafSize)
        result.right ← MakeMetricTree(rightData, Mean, LeafSize)
    end if
    return result
end function

Metric tree search proceeds similarly to KD-tree search, where nodes of the tree are included in the search if they intersect with the ε-search radius. The metric tree search algorithm is given in detail in Algorithm 4. Metric trees use locality-based data splits instead of the axis-aligned splits that KD-trees use. For this reason, metric trees can scale more efficiently to higher-dimensional data. One downside to the metric tree is that calculating the distance to the centre of a ball is more computationally expensive than checking which side of an axis-aligned plane a point is on. Unlike KD-trees, however, metric trees compare a distance to a point in the dataset at each level of the tree. This allows us to start filling the heap as we are descending the tree, so the check against a ball radius serves a double purpose and helps prune the search space earlier in the algorithm.

Algorithm 4 Metric tree search for k-nearest-neighbours.
Require: Query: the query point to be matched.
Require: k: the number of neighbours to return.
Require: MetricRoot: the root node for the metric tree.

function SearchMetricTree(Query, k, MetricRoot)
    result ← MaxHeap()
    Fill result with k {weight, data} pairs set to {∞, NULL}
    candidates ← MinHeap()
    candidates.push({0, MetricRoot})
    while candidates.peek().weight < result.peek().weight do
        {parentWeight, node} ← candidates.pop()
        while node has left and right children do
            dist ← ||Query - node.splitPt||
            if dist > node.splitVal then
                farChild ← node.leftChild
                nearChild ← node.rightChild
            else
                farChild ← node.rightChild
                nearChild ← node.leftChild
            end if
            // Each split point is a database point, so test it as a neighbour candidate.
            if dist < result.peek().weight then
                result.pop()
                result.push({dist, node.splitPt})
            end if
            candidates.push({max(dist, parentWeight), farChild})
            node ← nearChild
        end while
        for d ∈ node.data do
            dist ← ||d - Query||
            if dist < result.peek().weight then
                result.pop()
                result.push({dist, d})
            end if
        end for
    end while
    return result
end function

While the dimensionality of the data still makes a significant impact on query complexity, metric trees have much lower penalties associated with high-dimensional data than KD-trees do, allowing them to be used in large-scale database search. An example of metric tree search is given in Figure 2.2. We can see by comparison to Figure 2.1 that the division of space in the metric tree example adapts to local data density far more quickly than the KD-tree example.

Figure 2.2: Metric tree queries on 2D data, for k = 3. Lines indicate tree decision boundaries, red-dashed lines are inspected in the example query. The number of leaves searched (cyan shaded areas) is on average significantly lower than for KD-trees.

2.2.3 Cover trees

Cover Trees were developed by Beygelzimer et al. (2006) as an evolution of navigating nets (Krauthgamer and Lee, 2004). Cover trees have better theoretical search bounds than navigating nets, although both algorithms are based on the same core idea. Cover trees (and navigating nets) apply a hierarchical system of ε-nets to generate a tree-like search structure, allowing better search bounds than previous algorithms. We use the recent nearest-ancestor cover tree insertion algorithm developed by Izbicki and Shelton (2015). Nearest-ancestor trees are shown by Izbicki and Shelton (2015) to be more efficient for both construction and querying than the original implementation.

Figure 2.3: Cover tree queries on 2D data, for k = 3. Red-dashed lines are branches tested by the query. The leaf areas searched have been shaded in cyan. While the area of leaves searched is large, the number of points inspected is quite small.

An ε-net is a set of points drawn from the data with two properties (Sutherland, 1975). The first is that it must be an ε-covering, which means that for any item in the dataset, the distance to the nearest point in the ε-net must be less than ε. The second is that the set must be an ε-packing, which means that any two points in the set must have a distance of more than ε between them.

Together, the definitions for an ε-net ensure that for any query point q, the ε-net points within ε of q cover every data point within ε of q. This means that if ε is set to be the maximum query distance for a nearest-neighbour search, the ε-net provides an efficient covering of possible nearest-neighbours.
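
The two properties can be checked directly for a candidate net. The following brute-force NumPy sketch is intended only to make the definitions concrete (net and data are assumed to be arrays of points of the same dimensionality):

    import numpy as np

    def is_epsilon_net(net, data, eps):
        # epsilon-covering: every data point has a net point within distance eps.
        covering = all(np.min(np.linalg.norm(net - x, axis=1)) < eps for x in data)
        # epsilon-packing: any two distinct net points are more than eps apart.
        packing = all(np.linalg.norm(net[i] - net[j]) > eps
                      for i in range(len(net)) for j in range(i + 1, len(net)))
        return covering and packing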

Applying ε-nets hierarchically, doubling ε at each step, allows a tree-like structure to be created. In this structure, a query on a higher cover in the tree establishes which candidates are to be queried on a lower cover in the tree. The covering redundancy allowed by the ε-net also means that, unlike in other tree structures, some nodes have multiple possible parents. Izbicki and Shelton (2015) solved this by requiring that for any node, its ancestor at any level of the tree must be the closest node on that level. This has the additional benefit of localising children of nodes more tightly, providing some performance improvements.

Algorithm 5 Cover tree insertion for simplified cover trees.
Require: Query: the new query point.
Require: rootNode: the root node for the graph.
Require: leafSize: the maximum leaf size for the graph.

function CoverTreeInsert(Query, rootNode, leafSize)
    if sizeOf(rootNode.children) == 0 then
        rootNode.leafData.insert(Query)
        if sizeOf(rootNode.leafData) > leafSize then
            // Leaf overflow: promote one point to a child and reinsert the rest beneath it.
            Insert rootNode.leafData[0] as a child of rootNode.
            for i ∈ 1..sizeOf(rootNode.leafData) do
                CoverTreeInsert(rootNode.leafData[i], rootNode, leafSize)
            end for
            rootNode.leafData.clear()
        end if
    else
        candidate ← argmin_{child ∈ rootNode.children} Dist(Query, child.centroid)
        if Dist(Query, candidate.centroid) < candidate.coverSize then
            CoverTreeInsert(Query, candidate, leafSize)
        else
            Insert Query as a child of rootNode.
        end if
    end if
end function

Figure 2.3 illustrates the leaves explored for a query at the bottom level of the cover tree. Compared to metric trees and KD-trees, cover trees have better performance guarantees as the data dimensionality increases. The complexity bound for search time with cover trees is O(c^12 log(n)), where c is the expansion constant of the dataset being searched (Beygelzimer et al., 2006). The expansion constant of a dataset is the mean number of points of an ε/2-cover that fit inside a single ball of size ε. This value is related to the branching factor of the cover tree for a particular set of data.
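
The expansion constant is a property of the data rather than of the tree, and an empirical feel for it can be obtained from doubling ratios |B(p, 2r)| / |B(p, r)| over sampled centres and radii. The sketch below estimates this related Karger-Ruhl style quantity rather than the exact ε-cover count defined above; the sampling scheme and function name are choices made for the example.

    import numpy as np

    def doubling_ratio_estimate(data, radii, n_centres=100, seed=0):
        # Largest observed ratio of points within radius 2r to points within radius r.
        rng = np.random.default_rng(seed)
        centres = data[rng.choice(len(data), size=min(n_centres, len(data)), replace=False)]
        worst = 1.0
        for p in centres:
            d = np.linalg.norm(data - p, axis=1)
            for r in radii:
                inner = np.count_nonzero(d <= r)
                if inner > 0:
                    worst = max(worst, np.count_nonzero(d <= 2 * r) / inner)
        return worst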

Algorithm 6 Cover tree insertion sub-function for nearest-ancestor cover trees.
Require: data: the data to be inserted into this node.
Require: rootNode: the root node of the tree to insert in.
Require: LeafSize: maximum leaf size.

function BatchConstruct(rootNode, data, LeafSize)
    rootNode.maxChild ← max_{d ∈ data}(Dist(rootNode.centroid, d))
    if sizeOf(data) < LeafSize then
        rootNode.leafData ← data
    else
        dists ← array(sizeOf(data))
        dists.fill(∞)
        labels ← array(sizeOf(data))
        labels.fill(-1)
        i ← 0
        while -1 ∈ labels do
            // Choose an unassigned point as the centroid of a new child.
            d ← the first point in data with label -1
            Create new child node of rootNode with centroid d
            for p ∈ data do
                dst ← Dist(p, d)
                if dst < rootNode.children[i].coverSize and dst < dists[p] then
                    // Assign p to the closest covering child found so far.
                    labels[p] ← i
                    dists[p] ← dst
                end if
            end for
            i++
        end while
        for i ∈ 0..sizeOf(rootNode.children) - 1 do
            childData ← {data[j] where labels[j] == i}
            BatchConstruct(rootNode.children[i], childData, LeafSize)
        end for
    end if
end function

Algorithms 5 and 6 give the pseudocode for insertion into a simplified cover tree and for batch construction of nearest-ancestor cover trees. These have been modified from the description given by Izbicki and Shelton (2015) to include the ability for leaves to contain a flat set of datapoints. In practice, this can make queries faster by reducing the branching and logic involved in navigating the tree. It is important to note that cover trees also provide a bound of O(c^6 log(n)) on insertion and, unlike simplified cover trees, nearest-ancestor cover trees need to be rebalanced if insertions are made after creation. By comparison, metric trees and KD-trees can become unbalanced from new data inserts, and for some versions of these rebalancing may not be possible. Unbalanced trees lead to either slowdowns in search speed or the need to rebalance the entire search tree. This advantage makes incremental or streaming uses of similarity search much more efficient with cover trees than with binary partition based trees such as metric and KD-trees.

Algorithm 7 Cover tree search for k-nearest-neighbours.
Require: Root: the root node for the currently explored sub-tree.
Require: kNeighbours: a distance-ordered max-heap of neighbour candidates.
Require: K: the number of neighbours to find.

function QueryCoverTree(Query, Root, kNeighbours, K)
    if Root.parent == NULL then
        // Do this once when we first start the query at the real root.
        kNeighbours ← K copies of {∞, NULL}
        dist ← distance(Query, Root.centroid)
        kNeighbours.pop()
        kNeighbours.push({dist, Root.centroid})
    end if
    dist ← {}
    for child ∈ Root.children do
        dist[child] ← distance(Query, child.centroid)
    end for
    // Visit children closest-first so the pruning bound tightens as early as possible.
    for child ∈ Root.children sorted by dist[child] do
        if dist[child] < kNeighbours.peek()[0] then
            kNeighbours.pop()
            kNeighbours.push({dist[child], child.centroid})
        end if
        covdist ← dist[child] - child.maxDist
        if covdist < kNeighbours.peek()[0] then
            QueryCoverTree(Query, child, kNeighbours, K)
        end if
    end for
end function

2.2.4 Empirical Comparison on Large Benchmark Data

Kibriya and Frank (2007) performed detailed tests with KD-trees, metric trees, and cover trees on a variety of simulated data types. The results suggest that at dimensionality 16 and above, there is usually little to no gain from using these search methods over linear search. However, these results were collected for database sizes up to n = 16000 and dimensionalities up to d = 16. Databases for search today are typically much larger and have more dimensions than this. This leads to very different behaviour of search algorithms.

To supplement the conclusions of Kibriya and Frank (2007) for larger data, Figure 2.5 in this section shows performance for varying dataset sizes over the BigANN benchmark datasets (Jegou et al., 2011a). The KD-tree and VP-tree algorithms have been re-implemented according to the description given by Kibriya and Frank (2007), and the cover tree algorithm has been implemented according to the nearest-ancestor cover tree formulation given by Izbicki and Shelton (2015). All code and instructions to reproduce these experiments are provided at https://github.com/josiahw/SearchTrees.

The BigANN SIFT and GIST datasets consist of one million 128-dimensional and 960-dimensional image features respectively. These are significantly larger and have significantly more dimensions than the dataset benchmarks previously recorded in the literature (Beygelzimer et al., 2006, Izbicki and Shelton, 2015, Kibriya and Frank, 2007). We note that for the GIST dataset, the KD-tree construction was changed to split on the data median of the widest dimension, as the sliding-midpoint method resulted in a recursion overflow error. For the purposes of fair testing, single-threaded operation is used unless stated otherwise.
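
The timings reported here come from the implementations in the linked repository. As a rough, independent way to reproduce the same kind of measurement, an equivalent loop can be run with scikit-learn's KDTree and BallTree; the sketch below uses random stand-in data rather than the BigANN files and is not the benchmark code used for Figure 2.5.

    import time
    import numpy as np
    from sklearn.neighbors import KDTree, BallTree

    rng = np.random.default_rng(0)
    X = rng.random((100_000, 128))   # stand-in for SIFT-like vectors
    Q = rng.random((1_000, 128))

    for name, Tree in [("KD-tree", KDTree), ("metric (ball) tree", BallTree)]:
        tree = Tree(X, leaf_size=40)
        start = time.perf_counter()
        tree.query(Q, k=10)
        print(name, round(time.perf_counter() - start, 2), "seconds for 1000 queries")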

Figure 2.4: Standard deviations of the first few PCA dimensions for the BigANN 1 million sized test datasets. The first few dimensions have an outsized importance compared to others.

Kibriya and Frank (2007) tested search algorithms on synthetic data, which is not necessarily representative of real-world data. Real-world data often does not exhibit an even distribution across all dimensions, and in extreme cases can have only a few significant dimensions in the distribution, despite being represented by high-dimensional vectors of data. When testing on synthetic data, i.i.d. multivariate Gaussian distributions are usually used. To equate these tests to real-world data, we talk about intrinsic dimensionality, which is the approximate dimensionality of the data distribution. This can be difficult to determine in practice.

Intrinsic dimensionality can also be different for the dataset as a global entity compared to the local distribution of points. An example of this is the Swiss Roll dataset, which is a spiral extruded into 3 dimensions. The global dataset is 3-dimensional and well distributed, while the local data points are arranged in a curved 2-dimensional sheet. Recent methods to measure the local intrinsic dimensionality of data include the work of Amsaleg et al. (2015). However, for search trees which examine the data at multiple levels of detail as the data is partitioned, the global intrinsic dimensionality of the data will likely also have an effect on search cost.

Figure 2.4 shows the variance of the principal components of the BigANN datasets. We can see that the first few dimensions dominate the data distribution, and so we know that at a global level the datasets could be intrinsically relatively low-dimensional, but there is still significant variance in the other dimensions. This makes it difficult to quantify how we should measure the intrinsic dimensionality of the data. The result of these differences is that algorithms can perform unexpectedly on real-world data compared to synthetic test predictions.
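
Curves of this kind can be produced with a few lines of PCA. In the sketch below, bigann_sample.npy is a hypothetical file holding a sample of one of the datasets; it is not part of the thesis materials.

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.load("bigann_sample.npy")           # hypothetical path to a sample of a BigANN dataset
    pca = PCA(n_components=10).fit(X)
    print(np.sqrt(pca.explained_variance_))    # per-component standard deviations; a steep drop-off
                                               # suggests a low global intrinsic dimensionality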

Figure 2.5 shows how the difference in dimensionality between two datasets can impact the scalability of search methods. The SIFT data shows a sub-linear scaling with search database size for metric and cover trees, beating brute force search by a significant margin. This indicates that the spatial search structures are effective at speeding up queries compared to brute force search (which scales linearly with dataset size). Testing on the GIST data, however, shows that search time scales close to linearly for all data structures. Cover trees achieve some small improvement as dataset sizes increase, but this is only slight even at the 1 million scale. Brute force search on the GIST data scales at the same rate as metric tree search, and is a little faster than the cover tree and KD-tree search (although the sub-linear curve of the cover tree suggests that at larger scales it may beat other methods). This data indicates that there is an inherent dimensionality limit in the effectiveness of current exact nearest-neighbour search structures.

Figure 2.5: Search time for 1000 queries on varying database sizes.

2.3 Approximate Nearest-Neighbour Tree Search

To make similarity search even faster, an approximate search with reasonable accuracy is often used. With approximate search, many of the limitations leading to more expensive exact search can be relaxed in exchange for some loss of accuracy. This allows significant speed-ups for search, but proving accuracy guarantees can become more difficult for approximate algorithms.

An example of a simple approximation method is to limit the number of leaves searched in the exact search tree structures of Section 2.2. Limiting the number of leaves searched provides significant speedups for KD-trees and metric trees. Cover trees can also benefit, although due to the large number of children each node can have in a cover tree, the benefit is often less significant than for binary trees. Figure 2.6 gives approximate KD-tree, metric tree and cover tree search times for varying numbers of leaves searched on the BigANN datasets. For low numbers of leaves, search times can be very fast. This speedup comes with an accuracy cost, however. The average accuracy, or recall, of these methods compared to database size is shown in Figure 2.7. We can see that search results are no longer exact, and in fact recall is quite low when the number of leaves is low.

Figure 2.6: Search time for 1000 queries with varying number of leaves searched.
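
To make the idea of a leaf budget concrete, the following minimal Python sketch (assuming NumPy; this is an illustrative re-implementation, not the benchmarked code from the repository above) builds a KD-tree with median splits on the widest dimension, as used for the GIST data, and answers approximate 1-nearest-neighbour queries with a best-first search that stops after a fixed number of leaves.

    import heapq
    import numpy as np

    class KDNode:
        __slots__ = ("axis", "threshold", "left", "right", "points")
        def __init__(self, axis=None, threshold=None, left=None, right=None, points=None):
            self.axis, self.threshold = axis, threshold
            self.left, self.right, self.points = left, right, points

    def build_kdtree(X, leaf_size=32):
        """Median split on the widest dimension; leaves store raw points."""
        if len(X) <= leaf_size:
            return KDNode(points=X)
        axis = int(np.argmax(X.max(axis=0) - X.min(axis=0)))
        threshold = float(np.median(X[:, axis]))
        mask = X[:, axis] <= threshold
        if mask.all() or not mask.any():                      # degenerate split: stop here
            return KDNode(points=X)
        return KDNode(axis, threshold,
                      build_kdtree(X[mask], leaf_size),
                      build_kdtree(X[~mask], leaf_size))

    def leaf_limited_nn(root, q, max_leaves=4):
        """Approximate 1-NN: best-first descent that inspects at most max_leaves leaves."""
        best, best_dist = None, np.inf
        counter = 0                                           # tie-breaker for the heap
        frontier = [(0.0, counter, root)]                     # (distance lower bound, _, node)
        leaves_visited = 0
        while frontier and leaves_visited < max_leaves:
            bound, _, node = heapq.heappop(frontier)
            if bound >= best_dist:
                break                                         # no closer point can exist
            if node.points is not None:                       # leaf: scan its bucket
                dists = np.linalg.norm(node.points - q, axis=1)
                i = int(np.argmin(dists))
                if dists[i] < best_dist:
                    best, best_dist = node.points[i], float(dists[i])
                leaves_visited += 1
                continue
            diff = float(q[node.axis] - node.threshold)
            near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
            counter += 1
            heapq.heappush(frontier, (bound, counter, near))
            counter += 1
            heapq.heappush(frontier, (max(bound, abs(diff)), counter, far))
        return best, best_dist

Setting max_leaves high enough recovers exact behaviour; small values trade recall for speed in the way Figures 2.6 and 2.7 illustrate.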

The cover tree algorithm gives particularly interesting performance on the GIST dataset search case. Cover trees are not binary trees, and the average number of children per tree node for a cover tree increases with the dimensionality of the data. Because of this, descending each level in a cover tree involves calculating the distance to many items in the dataset. As the search descends deeper into the tree, the items compared become closer and closer to the query point. Because of this behaviour, cover trees can have good approximation performance even when only a single leaf is searched. One problem with this is illustrated clearly in the BigANN GIST search statistics: using leaf-limited search on cover trees does not always guarantee a smooth recall increase as seen in the KD and metric trees. This means that leaf search can be an ineffective method for approximate search on cover trees, although no alternatives have been proposed. This is investigated further in Chapter 5.

Figure 2.7: Query time for increasing number of leaves searched (for 1000 queries) compared to resulting average recall.

2.3.1 Spill-trees

Spill-trees are efficient approximate search trees which can be implemented over KD-trees and metric trees (Liu et al., 2005). A spill-tree is searched simply by descending the tree to the closest leaf to the query point, and can give guarantees on ε-nearest-neighbour search. This search method allows O(log n) search time guarantees regardless of the dimensionality of the data, as long as ε is known at the time of search tree construction. Spill-trees are similar to P-Sphere trees (Goldstein and Ramakrishnan, 2000), in which each child node is allowed to contain overlapping data. By controlling the radius of overlap, spill-trees can trade storage space (in the form of duplicate entries) for search accuracy (since a larger radius has a higher chance of containing all neighbours).

Figure 2.8 gives the search times for varying recall levels (this is achieved by varying the overlap factor ε). The GIST dataset is only shown up to 50% recall, as it encountered RAM limitations for the test machine. As can be seen, spill-tree search is significantly faster than exact search.

Figure 2.8: Spill-tree search time vs Recall as the ε parameter is increased.

A spill-tree is constructed very similarly to a KD-tree or metric tree, with the exception that all data points within ε of the decision boundary are added to both children of the tree node. This ensures that every leaf will include data within a distance of ε of the area it contains. The downside to this is that spill-trees trade higher memory usage for better search performance, since data within ε of decision boundaries is duplicated at every branch of the decision tree.
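
As a minimal sketch of this duplication step (assuming NumPy; here ε is treated as an absolute spill distance along the split dimension, and defeatist search simply descends to a single leaf):

    import numpy as np

    def build_kd_spill_tree(X, epsilon, leaf_size=32):
        """Build a KD spill-tree: points within epsilon of the split go to BOTH children."""
        if len(X) <= leaf_size:
            return {"points": X}
        axis = int(np.argmax(X.max(axis=0) - X.min(axis=0)))
        threshold = float(np.median(X[:, axis]))
        left_mask = X[:, axis] <= threshold + epsilon    # left child spills past the plane
        right_mask = X[:, axis] > threshold - epsilon    # right child spills past the plane
        if left_mask.all() or right_mask.all():
            return {"points": X}                         # overlap covers everything: stop splitting
        return {"axis": axis, "threshold": threshold,
                "left": build_kd_spill_tree(X[left_mask], epsilon, leaf_size),
                "right": build_kd_spill_tree(X[right_mask], epsilon, leaf_size)}

    def defeatist_search(node, q):
        """Defeatist search: descend to the single closest leaf and scan it."""
        while "points" not in node:
            node = node["left"] if q[node["axis"]] <= node["threshold"] else node["right"]
        dists = np.linalg.norm(node["points"] - q, axis=1)
        return node["points"][int(np.argmin(dists))]

The duplication in the two overlapping masks is exactly where the memory growth reported below comes from.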

Dasgupta and Sinha (2013) analysed spill-trees for the case where the spill points preserved were defined by a proportion 0 < α < 1/2 of the total range. This is a slightly different case to our description and the description of Liu et al. (2005), however it is informative. For the spill-tree definition given by Dasgupta and Sinha (2013), the memory usage is O(n^(1/(1 − log(1 + 2α)))). This result bounds spill-trees to a polynomial memory usage, which can be very large for large datasets. We note however that ε-radius results cannot be guaranteed under defeatist search with this definition, as there is no absolute inclusion radius. The memory usage for spill-trees of arbitrary ε-accuracy is likely to be higher, as splits might not be bounded to a proportion of total elements.

Figure 2.9 shows the size increase of spill-trees over exact trees compared to the recall achieved. The cut-offs for each plot indicate the point where a 32 GB RAM machine could no longer construct the spill-tree in RAM. The implementation duplicated references rather than copying feature vectors, so each item in the spill-tree only consumed 64 bytes. Even with this memory reduction strategy, the spill-trees required orders of magnitude more memory than an exact search tree. It is interesting to see the differences in memory performance for the KD spill-tree, which bounds leaf data in an axis-aligned hyper-rectangle, and the metric spill-tree, which bounds data in a ball. While the KD spill-tree keeps SIFT feature data well localised, it does far more poorly for GIST features.

Figure 2.9: Spill-tree size increase vs Recall, relative to original dataset size. To achieve 50% recall for the GIST dataset, metric tree memory usage increases by a factor of over 3000.

Talwalkar et al. (2008) used KD spill-trees to generate a k-nearest-neighbour graph for a face image dataset of size 20 million. This required a cluster of computers and took several days to complete. At the time, this was one of the largest results reported for generating k-nearest-neighbour graphs. Considering the size increase caused by spill-trees in Figure 2.9, it is likely that achieving almost exact recall with spill-trees would need a significant amount of memory, much more than a single machine can work with effectively.

2.3.2 Low-dimensional Projections

A second way to approximate search is to project high-dimensional data into a lower set of representative dimensions and perform search in this reduced space. Unlike the spill-tree definition, which works for all metric spaces, low-dimensional projections only work for vector spaces. PCA or randomised Gaussian projections are commonly used for this application.

Low-dimensional projections can lower the dimensional search penalty associated with exact search tree methods directly. This comes at the cost of precision in the results, meaning that complexity in searching the tree is traded for a few more false positive candidates to filter out of the results. If two points are close in the low-dimensional projection, it is not guaranteed they will be close in the original data. However, if two points are close in the original data, then it is likely they will be close in the low-dimensional projection. KD spill-trees with random projections were recommended by Liu et al. (2005) as one of the fastest large-scale approximate search methods.
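
A minimal sketch of this project-then-filter strategy (assuming NumPy and SciPy's cKDTree; an exact KD-tree is used in the projected space here rather than the spill-tree recommended by Liu et al. (2005), and the parameter choices are purely illustrative): candidates are retrieved in a low-dimensional Gaussian random projection of the data and then re-ranked with exact distances in the original space.

    import numpy as np
    from scipy.spatial import cKDTree

    def build_projected_index(X, target_dims=16, seed=0):
        rng = np.random.default_rng(seed)
        # Gaussian random projection into a lower-dimensional space
        P = rng.normal(size=(X.shape[1], target_dims)) / np.sqrt(target_dims)
        return P, cKDTree(X @ P)

    def approximate_knn(X, P, tree, q, k=10, candidates=100):
        # Retrieve extra candidates in the projected space...
        _, idx = tree.query(q @ P, k=candidates)
        # ...then re-rank them with exact distances in the original space
        dists = np.linalg.norm(X[idx] - q, axis=1)
        order = np.argsort(dists)[:k]
        return idx[order], dists[order]

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(10000, 128)).astype(np.float32)
        P, tree = build_projected_index(X)
        print(approximate_knn(X, P, tree, X[0])[0][:5])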

2.4 Locality-Sensitive Hashing

An alternative type of approximate nearest-neighbour search with a rigorous analytical basis has been developed by Indyk and Motwani (1998). This type of search uses spatial hash functions to detect and match similar items probabilistically. The original LSH work and proofs from Indyk and Motwani (1998) and Kushilevitz et al. (2000) constructed weak search partitions that could be concatenated to match similar items with high accuracy. As hash functions use a flat search structure, rather than the node-based structure of search trees, lookups for matching items in hash tables can be performed with less logic and branching overhead than tree-based search.

Indyk and Motwani (1998) define a hash function h(x) as locality-sensitive for a pair of points x and y of some metric space (X, dist(·, ·)), if there are two probabilities p_1 > p_2 such that for a given pair of distances r_1 < r_2:

P(h(x) = h(y) | dist(x, y) ≤ r_1) ≥ p_1    (2.5)

P(h(x) = h(y) | dist(x, y) ≥ r_2) ≤ p_2    (2.6)

These conditions ensure that there will always be a greater chance of neighbours within r_1 being assigned the same hash value than neighbours more distant than r_2. Obviously, if this holds for any pair r_1 < r_2, then points more distant than r_1 will have a lower probability of collision. Additionally, the construction implies that if hash functions with independent collision probabilities are concatenated, the relative probability of finding a neighbour compared to a non-neighbour is increased. This means that the number of false positives returned by LSH can be reduced simply by increasing the length of the hash codes. However, when code lengths are increased the probability of any collision decreases, which means that more hash tables are needed in order to retrieve the same number of items. This creates a computational trade-off between the number of false positives that need to be inspected, and the number of hash tables used.
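
To make this trade-off concrete, a small numerical sketch in Python (the per-bit collision probabilities p1 and p2 below are assumed illustrative values, not measurements): concatenating k independent bits suppresses non-neighbour collisions, while querying L independent tables restores the chance of retrieving a true neighbour.

    # Illustrative LSH amplification arithmetic (p1, p2 are assumed, not measured).
    p1, p2 = 0.8, 0.5          # per-bit collision probabilities: neighbour vs non-neighbour
    for k in (8, 16, 32):      # hash-code length in bits
        hit1, hit2 = p1 ** k, p2 ** k
        for L in (1, 10, 100): # number of independent hash tables
            recall = 1 - (1 - hit1) ** L        # chance a true neighbour collides in >= 1 table
            false_pos = 1 - (1 - hit2) ** L     # chance a non-neighbour collides in >= 1 table
            print(f"k={k:2d} L={L:3d}  recall={recall:.3f}  false positive rate={false_pos:.6f}")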

Locality-sensitive hashing (LSH) has been applied to other measures of similarity, such as the χ² distance (Gorisse et al., 2012) and the Earth-Mover’s distance (Lv et al., 2004), and even general metric spaces. It can also be optimised for classification and a variety of other content measures (Shakhnarovich, 2005). When LSH is used in this way, it is sometimes classified as an inverted index method. Inverted indices were originally developed for database and text search (Zobel and Moffat, 2006). They map from raw data to its location in a database, which allows fast search. Using LSH for inverted indices then probabilistically maps data based on what is similar in the database. One of the most successful uses of LSH as an inverted index has been its use to speed up machine vision systems which recognise thousands of categories of objects (Dean et al., 2013). This speedup is achieved by first performing a fast lookup to determine likely candidate categories, and then only applying the appropriate (more computationally intensive) category classifiers to images.

Broadly, LSH methods fall into either a centroid-based or projection-based framework. In a centroid-based framework, items are associated with one or more centroids which determine hash-bit encodings. In a projection-based framework, items are projected (sometimes non-linearly) into a subspace where they can be quantised and encoded into hash-bits. Centroid-based hashes usually have the capability to operate on general metric spaces, but are often at a computation speed disadvantage when compared to projection-based methods. This is because generating an n-bit code as part of a centroid-based hash-function requires 2^n centroids to be evaluated in the naive case. Recent work partially addresses these issues (Ge et al., 2014, Jegou et al., 2011b, Kalantidis and Avrithis, 2014). By comparison, projection-based LSH by definition operates on vector spaces. Kernelisation allows us to change the definition of locality within the projection subspace (Raginsky and Lazebnik, 2009), allowing a wide range of similarity measures to be used.

Moran et al. (2013) define projection-based hashing as having two stages: projection and quantisation. The projection stage consists of projections which reduce the original item data into a 1-dimensional locality-sensitive mapping. This mapping can be either linear or non-linear in nature, and usually includes a data mean-centring step before projection. Often, these projections are offset from their mean centre by adding a small amount of noise in order to better guarantee conditional independence and recall. However, the investigation of Li et al. (2014) showed that random offsets are detrimental to overall search quality, so they are not used in the following work. The quantisation stage reduces each projection into one or more locality-sensitive bits, which can be concatenated to form hash-codes. The simplest quantisation method is to threshold each projection around zero (after mean-centring), which roughly divides the data in half for each hash-bit.

2.4.1 LSH Query Methods

In practice, LSH functions are often composed of several independent hash-functions to produce the desired amount of separation between data items. Increasing the number of bits in an LSH function has the effect of reducing the average number of candidates found in an LSH search. The larger the dataset, the more bits are required to reduce the number of candidates to a volume that can be searched within the time limits allowed.

In order to improve the time and space trade-off of LSH, several methods have been developed to search within the lookup table for a given LSH hash-code. These methods assume that the hash-code is decomposable into short sections (usually individual bits) which each have an independent probability of separating item pairs of interest. Using this assumption, advanced LSH query methods explore the “neighbourhood” of hash codes which might encode similar items to the query item. Advanced query methods can significantly improve the accuracy and speed of LSH queries, but are also more difficult to implement, especially for arbitrary hash-functions where the query method’s basic assumptions do not hold.

2.4.1.1 Naive Multi-Table Search

LSH functions are usually constructed so that each function has an independent chance of retrieving each neighbour for a given query. This lets us query a collection of LSH functions to improve retrieval performance. In practice, large numbers (even into the hundreds) of LSH functions are needed to achieve good performance. Storing this many hash-tables can use a significant amount of memory for large datasets. Calculating many hash-functions for every query can also slow down the query process.

Figure 2.10: Naive Search for regular square cells. Search area for 5 hash functions shown, each with random offsets.

Figure 2.10 shows the patches explored by naive search for a collection of 5 grid-based hash functions. As can be seen, the naive method results in a very tightly localised search. The overlap in search areas is significant, which can contribute to inefficiency in the search process. Naive search can require many hash functions in high-dimensional spaces in order to adequately cover the locality of a query point.

2.4.1.2 Hamming Ball Search

Naive lookup does not exploit the structure of LSH as it is generally used, because it regards only a complete match of all bits in the LSH hash as evidence of a possible nearest-neighbour match. However, as we have noted, LSH functions are usually made up of multiple, independent locality-sensitive parts. This imposes a measure of difference on the resulting hash-codes. For example, if each independent hash-function part generates a 1-bit locality-sensitive identifier, then we can define the distance between two hash-codes h(x) and h(y) as their Hamming distance, d_H(h(x), h(y)). Since each bit encodes probabilistic closeness between two points, we can then say that the Hamming distance and the similarity distance are approximately proportional to each other, so that d(x, y) is roughly proportional to d_H(h(x), h(y)).

Naive search essentially queries the hash-codes at a Hamming distance of d_H(h(x), h(y)) = 0 to look for neighbours. If we want to reduce the number of hash-functions we use compared to naive search, we can improve retrieval rates by querying at some larger Hamming distance, so we search for d_H(h(x), h(y)) ≤ m for some maximum distance m. Since it queries a radius, and is often used directly on Hamming encodings, this method is generally called Hamming ball search.

Ball search can be transferred to other hash-part encodings easily, and allows efficient querying of hash-tables for similar items. Figure 2.11 shows ball search at radius 1 applied to a regular grid hashing function as an example. One down-side to ball queries is that the number of hash-table cells to search increases rapidly with search distance. For a hash-code of length 32, there are 32 similar cells to search at distance 1, and 496 similar cells to search at distance 2. Each of these lookups carries overhead and requires searching structures that are often large and do not fit into CPU cache, which in turn causes cache misses and slows the search process. For these reasons, it is best in practice to use small radius ball searches.

Figure 2.11: Ball search method on a single 2D grid hash function. Queries expand out uniformly from the original query cell.
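
A minimal sketch of the ball query itself (pure Python; table is assumed to be a dictionary mapping integer hash codes to lists of item identifiers):

    from itertools import combinations

    def hamming_ball_query(table, code, n_bits=32, radius=2):
        """Look up every code within the given Hamming radius of `code`.

        The number of probed cells grows as the sum of C(n_bits, r) for r <= radius,
        so small radii are strongly preferred in practice.
        """
        results = list(table.get(code, []))
        for r in range(1, radius + 1):
            for bits in combinations(range(n_bits), r):
                probe = code
                for b in bits:
                    probe ^= (1 << b)             # flip r bits of the query code
                results.extend(table.get(probe, []))
        return results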

An alternate perspective on Hamming ball search can be derived from the hash-part re-use strategies employed in some high-performance LSH implementations to improve hash calculation speed (Sundaram et al., 2013). In this perspective, a ball search of radius m on a hash-code of length b is equivalent to searching a series of hash-tables with functions made from all permutations of the b hash-parts, for a hash-code length of b − m. From this perspective, ball search is the equivalent of searching a set of lower hash-length hash-tables, but trades the increased memory use of maintaining many tables for a slightly increased number of queries on a single hash-table.

This alternate perspective of Hamming ball search highlights a weakness when comparing with naive search. This weakness is that Hamming ball search is limited in the number of hash-function parts that are used, and so will probably not provide the same quality of recall per hash-table query that naive search does with its larger set of independent hash-function parts. For this reason, Hamming search radius and the number of tables to search are often tuned together to provide the best performance on a particular dataset.

2.4.1.3 Multiprobe Search

For projection-based hashing, it is possible to use the information from the projection phase to improve on Hamming ball search. When each bit of the hash-function is quantised, the remainder (the distance from the quantisation threshold) can be stored. When expanding the Hamming search, these distances can be used to re-rank search candidates so that the closest hash-codes to the query are searched first. Lv et al. (2007) gave an efficient construction method for this type of search, and showed significant search quality improvements over Hamming search.
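
A minimal sketch of this re-ranking (assuming NumPy and that the query's signed distances to each bit's quantisation threshold are available; real multiprobe search as described by Lv et al. (2007) also scores multi-bit perturbations, which this single-bit version omits):

    import numpy as np

    def multiprobe_sequence(projections, n_probes=5):
        """Order single-bit perturbations of the query's hash code by flip cost.

        `projections` holds the query's signed distance to each bit's quantisation
        threshold; flipping a bit whose projection lies near zero is cheap, so those
        neighbouring cells are probed first.
        """
        code = int(np.sum((projections > 0) * (1 << np.arange(len(projections)))))
        flip_cost = np.abs(projections)            # distance to the threshold for each bit
        order = np.argsort(flip_cost)
        probes = [code]                            # always probe the exact cell first
        for bit in order[:n_probes - 1]:
            probes.append(code ^ (1 << int(bit)))
        return probes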

Figure 2.12: Multiprobe Search on a single 2D grid hash function. Cells are searched in order of distance to the original query.

Figure 2.12 shows multiprobe search for the first 5 cells searched in a grid. Compared to ball search, multiprobe search prioritises grid cells which are closest to the query point. This improves the bounds on search error compared to ball search at the cost of a small overhead, if hash-code distance to the query point is available. As can be seen, multiprobe search provides a more accurate exploration of surrounding cells. However, multiprobe search makes some linearity assumptions about the projection and quantisation of data during hashing. This means that multiprobe is not an effective general search method for all hashing techniques.

Multiprobe search improves results significantly, but increases the implementation complexity of LSH. It also requires careful programming and optimisation techniques to gain the improvement in search quality without slowing down the search significantly. For these reasons, multiprobe search is often not feasible for prototyping solutions. Since multiprobe search virtually guarantees an improvement in search quality in the Euclidean case, it should be considered preferable for practical implementations.

2.4.1.4 LSH Forest Search

Bawa et al. (2005) realised that many LSH hash-codes naturally represent a prefix-tree structure. In these structures, each hash-bit represents a left- or right-decision in the tree. These trees can be compacted to give an O(n) memory usage. LSH forests have many favourable performance characteristics, including the ability to incrementally add to the search structure efficiently.

Queries in an LSH forest assume that the similarity between two items is the maximum prefix length that their hash-codes share. Search can then be simplified to ascending the prefix-tree until a desired number of items have been found. This type of search lacks the ability to differentiate between near and far quantisation boundaries as multiprobe search does, and has not been directly compared to multiprobe search in the literature.
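
A minimal sketch of prefix-based querying (pure Python; a real LSH forest stores a compressed prefix-tree over sorted codes, whereas this illustration simply scans a dictionary of codes at progressively shorter prefix lengths):

    def lsh_forest_query(codes, query_code, n_bits=32, min_results=10):
        """Return ids whose hash codes share the longest possible prefix with the query.

        `codes` is assumed to map item id -> integer hash code. Starting from the full
        code length, the required shared-prefix length is relaxed until enough
        candidates are found, which mirrors ascending the prefix-tree.
        """
        for prefix_len in range(n_bits, -1, -1):
            shift = n_bits - prefix_len
            wanted = query_code >> shift
            matches = [i for i, c in codes.items() if (c >> shift) == wanted]
            if len(matches) >= min_results or prefix_len == 0:
                return matches, prefix_len

The linear scan at each level is only for clarity; the compacted prefix-tree of Bawa et al. (2005) finds the same candidates without touching every stored code.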

The prefix-tree search examines a strict subset of Hamming ball search, and it is interesting to see that not searching the entire Hamming radius still results in high quality queries. Figure 2.13 shows an example of LSH forest search on a Morton curve ordered grid (Morton, 1966). Exploration of new values tends to jump further from the start node more quickly, but can miss some closer nodes. An advantage of the prefix-tree search is that the number of nodes to search increases more slowly with radius, and the radius is fixed by the number of items retrieved.

Figure 2.13 also shows that although fewer neighbours are explored as the search radius is increased, accuracy in targeting the query’s neighbourhood is lower compared to Hamming ball search. An implication of this is that LSH forest queries could need more hash-tables than Hamming ball queries, since the exploration of the Hamming space neighbourhood is poorer for LSH forest queries. This leads us to suspect that the advantages of LSH forest querying over Hamming ball querying will depend critically on the lookup performance of the hash table and the degree of tuning of LSH parameters.

Figure 2.13: LSH forest search method on a single 2D grid hash function, with search height 2. Cell addresses for the tree search are given by a Morton space-filling curve.

2.4.2 Query Method Performance Comparison

To establish performance expectations for general LSH search, we test LSH-forest against Hamming ball search with 32-bit Gaussian random projection hash functions. We use 100 hash tables for each method, and report the recall versus time for various query sizes. A set of 500 query points is timed and recall is determined for each point at each query size. The methods are tested on the BigANN SIFT and GIST datasets, plotting time against recall for increasing Hamming radius searches. All code and instructions for reproducing these results are available at https://github.com/josiahw/LSH.

Two implementations of Hamming search are used. The first, which we label Hamming-std in results, uses the C++ standard library's unordered_map implementation to perform hash index lookups. The second, labelled Hamming-RH in the results, uses an implementation of robin-hood hashing (Celis et al., 1985) with linear probing for its hash table. Robin-hood hashing seeks to minimise the maximum chain length of hash table lookups required for all queries. As a result, extra metadata about chain lengths of lookups is stored in the data structure. This can be helpful in determining whether a hash index exists in the map or not, as a query which hits a bin with a high chain length only has to inspect a few cells to determine that the index cannot possibly have been inserted into the data structure. Fast returns for empty indices are important for LSH, where we have in this case 2^32 ≈ 4 billion possible hash values, and only 1 million values to be hashed, resulting in an extremely sparse distribution of values.

Figure 2.14: Query time vs recall for Hamming ball search methods with 2 hash map implementations, and LSH-forest search. Lower is better.

Figure 2.14 shows time and accuracy comparisons for our (reasonably optimised) implementations of Hamming ball and LSH forest search on the Gaussian random projection hash family. We can see immediately that LSH-forest achieves far better query times, especially for high levels of recall on the GIST dataset. This is consistent with previous findings (Bawa et al., 2005), indicating that LSH-forest can have significant performance advantages over Hamming ball query methods.

The Hamming ball search methods record queries with each marker indicating an increment of 1 in the Hamming radius searched. For the SIFT dataset, 4 markers are displayed on the Hamming search curves. For the GIST dataset, 6 markers are displayed. We can see in both the SIFT and GIST graphs that when the Hamming radius is less than 4, robin-hood hashing achieves a performance very close to that of LSH-forest. However, as the Hamming radius grows, the search cost quickly becomes prohibitive. The usual answer for Hamming radius based search is to simply reduce the number of bits used in the hash function and re-hash the data, but this leaves us with a brittle data structure which does not scale well and is only useful for querying within a narrow range of recall values.

While LSH-forest is a far faster search method, Hamming ball based search is more commonly used in the literature for reporting results, often in terms of a precision-recall graph (Gong and Lazebnik, 2011, Moran et al., 2013). Additionally, Hamming ball based search better reflects the underlying nature of hash function accuracy, since it does not miss parts of the data during incremental exploration as LSH-forest does in Figure 2.13. Therefore, this work defaults to Hamming distance for most performance measurements, only using LSH-forest search where it is deemed appropriate.

2.4.3 Randomly Constructed Hash Families

The benefits of using multiple LSH functions for improving search accuracy require the ability to instantiate many such functions. Each function must have independent neighbour retrieval probabilities so that all the k-nearest-neighbours can be retrieved if a large enough collection is used. The simplest way to do this is to define a family of randomised hash-functions. These types of functions can be combined into arbitrarily large collections, and usually (with the right settings) have reasonable guarantees for retrieving all neighbours.

2.4.3.1 Gaussian Random Projections

One of the earliest LSH functions to be proposed and analysed is Gaussian random projections (RP) (Kushilevitz et al., 2000). The RP LSH function constructs a low-dimensional projection matrix with every entry drawn from a uniform Gaussian distribution. The input dataset is mean-centred and projected into a lower set of dimensions through the matrix. After this, a further randomly chosen offset is added to each dimension of the projection. The work of Li et al. (2014) casts doubt on whether this offset is necessary. In testing, adding the offset reduced accuracy slightly on all datasets, so the offset is removed in all reported tests. Finally, each of the low-dimensional projections is thresholded at 0 to produce a single bit of the hash-code. These bits are concatenated together to form the final hash-code.

Kushilevitz et al. (2000) give an analysis of a variation to RP where the projection matrix is orthonormalised first. It is interesting to note that in this case, the hash-code defines a Morton code ordering on the data (Morton, 1966). While RP was one of the earliest LSH algorithms to be developed and analysed, it is one of the simplest LSH functions to implement, and often has a competitive effectiveness compared to other randomly constructed hash-function families. A sketch of example construction and hashing functions for RP is given below.
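
The following minimal Python sketch (assuming NumPy; the function names are illustrative) follows the construction just described: mean-centring, Gaussian projections with no random offsets, and thresholding each projection at zero.

    import numpy as np

    def fit_rp_hash(X, n_bits=32, seed=0):
        """Construct a Gaussian random projection (RP) hash: returns (mean, projection)."""
        rng = np.random.default_rng(seed)
        mean = X.mean(axis=0)
        W = rng.normal(size=(X.shape[1], n_bits))   # one Gaussian direction per bit
        return mean, W                              # no random offsets (cf. Li et al., 2014)

    def rp_hash(x, mean, W):
        """Hash one vector: mean-centre, project, then threshold each projection at zero."""
        bits = (x - mean) @ W > 0
        return int(np.sum(bits * (1 << np.arange(len(bits)))))

Independent tables are obtained simply by calling fit_rp_hash with different seeds and storing each item's code in a separate dictionary keyed by hash value.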

RP essentially treats data as lying around the surface of an n-sphere, and divides up the sphere into smaller parts using origin-centred hyperplanes. This method is surprisingly effective on high-dimensional data. The analysis of Aggarwal et al. (2001) gives some explanation as to why this might be the case. Aggarwal et al. (2001) show that as the dimensionality of any i.i.d. Gaussian dataset increases, the Euclidean l2 distance of points from the centre of the distribution converges toward a spherical shell. As discussed in Section 2.2.4, we regard the i.i.d. dimensionality as roughly equivalent to the intrinsic dimensionality of a real-world dataset, and so we expect the effect to be somewhat lower per-dimension than in the pure synthetic case. As the number of intrinsic dimensions in the data increases, treating the data as embedded on the surface of a sphere becomes an increasingly appropriate representation. It is interesting to note that this is equivalent to stating that for intrinsically high-dimensional data, the mean-centred cosine angle between two data points is approximately proportional to the Euclidean l2 distance between them.

2.4.3.2 P-Stable projections

P-stable projections were developed by Datar et al. (2004) and address several shortcomings of RP. P-stable projections alter the thresholding stage of the RP algorithm so that each dimension is encoded in multiple bits. This alteration is done through the multiplication and clipping of output dimensions so that when cast to integer values, they give an m-bit binary value. These multi-bit values are then concatenated together to form the hash-code.

Unlike RP, P-stable projections can generate more bits of hash-code than the number of low-dimensional projections used. This makes P-stable projections more computationally efficient, leading to faster search times. This also allows orthonormal projections to be used in the algorithm when the number of input dimensions is less than the required number of hash-code bits, unlike RP. A sketch of construction and hashing using P-stable projections is given below.
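
A rough Python sketch following the description above (assuming NumPy; the bucket width w and bits-per-projection count are assumed tuning parameters, and the projection matrix W would contain correspondingly fewer columns than the total number of hash-code bits):

    import numpy as np

    def pstable_hash(x, mean, W, w=4.0, bits_per_dim=4):
        """Quantise each projection into a small integer and concatenate the bit groups.

        Each projected value is divided by the bucket width w, shifted and clipped into
        the range representable by bits_per_dim bits, and the resulting integers are
        packed together to form the final hash code.
        """
        levels = 2 ** bits_per_dim
        proj = (x - mean) @ W / w
        q = np.clip(np.floor(proj) + levels // 2, 0, levels - 1).astype(np.int64)
        code = 0
        for value in q:                      # pack bits_per_dim bits per projection
            code = (code << bits_per_dim) | int(value)
        return code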

The P-stable projection methodology breaks from the spherical shell distance interpretation of RP in favour of a more grid-based interpretation. Given the findings of Aggarwal et al. (2001) about the distribution of high-dimensional data, P-stable projections may be less effective for intrinsically very high-dimensional data. P-stable projections are also less suited than RP to measuring the cosine distance between two points.

2.4.3.3 Double Hadamard Projections

Dasgupta et al. (2011) showed that a double application of Hadamard matrices could be used to provide random projections with a much lower computational complexity than the original RP algorithm. This algorithm, called DHHash, has been demonstrated to be significantly superior to RP hashing when applied to very high-dimensional data. The Hadamard matrix is an orthonormal 2^m by 2^m matrix constructed by the recursive definition (with a base case of H_0 = 1):

H_m = [ H_{m-1}    H_{m-1} ]
      [ H_{m-1}   -H_{m-1} ]

The dot product of an n by d matrix (where d = 2^m) with the Hadamard matrix can be performed in O(nd log(d)) time, as opposed to a complexity of O(nd^2) for RP hashing. A recursive function for calculating this projection is sketched below.
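
The recursive structure means the product with H_m can be evaluated without ever materialising the matrix; the following minimal Python sketch (assuming NumPy) computes H_m x for a vector of length d = 2^m in O(d log d) operations using the standard in-place butterfly.

    import numpy as np

    def hadamard_transform(x):
        """Multiply H_m by a vector of length d = 2^m in O(d log d) time (unnormalised)."""
        x = np.asarray(x, dtype=float).copy()
        h = 1
        while h < len(x):
            for start in range(0, len(x), 2 * h):
                a = x[start:start + h].copy()
                b = x[start + h:start + 2 * h].copy()
                x[start:start + h] = a + b          # top block:    H_{m-1} x1 + H_{m-1} x2
                x[start + h:start + 2 * h] = a - b  # bottom block: H_{m-1} x1 - H_{m-1} x2
            h *= 2
        return x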

The DHHash algorithm itself uses several matrices whose dot products can be efficiently computed to provide projections of a similar quality to the RP algorithm, at a lower computational cost. The DHHash projection formula is defined as:

proj(x) = H G M H D x

where H is the Hadamard projection matrix; D is a diagonal matrix with values drawn from the Gaussian distribution; G is a diagonal matrix with values randomly selected from {−1, 1}; and M is a random rearrangement matrix, having a single non-zero value of 1 in each row and column. The dot products of D, G, and M can be calculated in O(nd) time, while the dot products of the Hadamard matrix H are calculated in O(nd log(d)) time, giving an overall complexity of O(nd log(d)) for projecting n items of dimensionality d.

2.4.3.4 Shift Invariant Kernels

Raginsky and Lazebnik (2009) proposed using kernels to perform non-linear projections of data in order to change the properties of the locality-sensitive embedding. The kernel trick allows a range of transformations to be applied, many of which change the definition of distance without affecting the rest of the hashing process. This allows a greater flexibility in using LSH for alternate spaces, as well as for defining alternate approaches to Euclidean l2 distance search.

Raginsky and Lazebnik (2009) apply a cosine function to random projections as an exemplar kernel function, before thresholding values at 0 as in RP. This is called Shift-Invariant Kernel Hashing (SIKH), because the application of the cosine function creates a projection whose locality-sensitivity is independent of the origin of the data. This means that unlike RP, DHHash, and P-stable projections, SIKH does not require the data to be mean-centred before applying the projection function. As a trade-off for this benefit, a scaling factor for the projections should be chosen so that the wavelength of the cosine function is on a similar scale to the distance between neighbours in the search phase.
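
A minimal Python sketch following this description (assuming NumPy; W is a Gaussian projection matrix as in RP, and the scale parameter is an assumed tuning value that must be matched to the neighbour distances of interest):

    import numpy as np

    def sikh_hash(x, W, scale=1.0):
        """SIKH sketch: cosine of scaled random projections, thresholded at zero.

        No mean-centring is required, because the cosine makes the encoding
        shift-invariant with respect to the data's origin.
        """
        bits = np.cos(scale * (x @ W)) > 0
        return int(np.sum(bits * (1 << np.arange(len(bits)))))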

Geometrically, we can visualise SIKH as assigning data to hash-codes based on a regular grid structure, where a recurring offset from a patch is always mapped to the same value.

This structure is created by the intersection of sinusoidal waves from each projection. As the number of projections used is increased, the distance between patches that map to the same value increases. This allows a large enough SIKH hash to distinguish local patches in very large datasets. Figure x visualises SIKH in two dimensions for several projection sizes. As can be seen, increasing the number of projections both decreases the number of patches mapped to the same value in the area chosen and decreases the area of the patches, which increases the locality of the overall function.

2.4.4 Data Driven Hash Families

Randomly instantiated hash-function families provide many benefits for unknown or new data, but cannot take full advantage of the information provided by exemplar or known data. Fitting the hash function to the distribution of a dataset can improve search performance significantly, but optimising spatial search is a hard problem with somewhat less obvious techniques than classification based optimisation. LSH methods which fit the hash-function to the data will be called “data driven hashing” in this work, to indicate that the characteristics of the data are used in producing the hash-function.

Many approaches have been devised for fitting hash-functions to datasets, leading to a variety of solutions with differing trade-offs. Several solutions aim at producing only a single, high-performance hash-function (Gong and Lazebnik, 2011, Kong and Li, 2012b). Others create a set of hash-functions which are conditionally independent (Wang et al., 2012, Weiss et al., 2009). In this section we examine a variety of data-driven hash families, including several state-of-the-art algorithms.

2.4.4.1 PCA Hashing with Random Rotations

One of the simplest methods for fitting the projection stage of LSH to a dataset is to use the most dominant PCA-projected dimensions. PCA selects independent dimensions of the data in order of decreasing variance, and so the first b dimensions naturally contain the highest amount of distance information that we can put into a basis of size b.

Once the first b dimensions of the PCA projection are selected, a random rotation is performed to create variety in the data. The rotation step allows an arbitrary number of hash-functions to be instantiated with independent neighbour information, so PCA based hashing with random rotations (PCA-RR) can be used in a similar manner to the randomly constructed hash-function families.

2.4.4.2 Iterative Quantization

A disadvantage of many projection-based LSH methods is that not all projections are equally informative. Some provide much more distance information than others, but the LSH algorithm treats all projections equally. Gong and Lazebnik (2011) developed a solution to this which iteratively rotates the top PCA projections until all projections contain equal variance information. This algorithm, called iterative quantisation (ITQ), has proven very effective in creating a single high-quality LSH function for datasets.

ITQ is often used as a baseline for other single-function methods due to its relative simplicity and near state-of-the-art performance (Ge et al., 2014, Kong and Li, 2012b, Sato et al., 2013). The idea of equalising the distribution of data in LSH projections is a significant contribution to the area, and has proven to be a robust optimisation method across many data types. Alternative methods for performing similar optimisations have since been developed with faster construction times (Kong and Li, 2012b), although ITQ continues to be used as a baseline for newer algorithms.

2.4.4.3 SPLH

Sequential Projection Learning for Hashing (SPLH) was developed by Wang et al. (2012) to create analytically optimised sets of low-dimensional projections. The core idea of SPLH is to explicitly optimise each projection sequentially to separate dissimilar data that other projections haven’t separated. SPLH performs a label-weighted principal component extraction, which iteratively extracts the eigenvector associated with the greatest label-weighted variance, and updates the residual error of labels. This corresponds to the splitting plane which best reduces the residual classification error between neighbour and non-neighbour pairs.

SPLH stands out as an interesting example of data driven hashing, since it is both analytical and can generate arbitrary numbers of hash-function projections. While each of these parts is complementary by design, they are also separable, and even a relatively small number of projections can be combined in many permutations to create a collection of hash functions. This makes SPLH potentially useful in conjunction with other LSH optimisation techniques.

2.4.4.4 K-Means Hashing

While many LSH algorithms use projection-based hashing for its efficiency of calculation and easy extension, a significant amount of investigation has been made into cluster-based methods. Paulevé et al. (2010) investigated a variety of methods and concluded that, in terms of quality of results, K-means based hash-code assignment was the most effective encoding method.

K-means hashing finds a set of k cluster centres, and assigns a hash-code identifier to each point based on the closest cluster centre. As cluster centres tend to be in the middle of denser groupings of data, this gives a fairly high probability of finding most of the nearest-neighbours within those clusters.

A disadvantage to K-means hashing is that the number of cluster centres chosen needs to be proportional to the size of the dataset to provide good performance. This often means slow hash-code evaluation, making K-means hashing unsuitable for high-speed queries.

K-means hashing is also often called vector quantization, as associating each point with a cluster centroid is equivalent to quantizing it to that centroid. Some variations of this type of hashing can be performed in general metric spaces as well (Tellez and Chavez, 2010). Chapter 2. Introduction to Approximate Nearest-Neighbour Search 47

2.4.4.5 Product Quantization

Jegou et al. (2011b) developed the Product Quantization (PQ) algorithm to mitigate the high computational complexity of K-means hashing. PQ performs K-means hashing with a small number of clusters on a series of sub-groups of the input dimensions (other low-dimensional projections can also be used). The final hash-code is then a concatenation, or product, of K-means hash-codes. Since each query is associated with a number of mutually orthogonal centroids, the search order of similar hash-codes can be determined by combining the distance from the query to each of the centroids used by the hash-code. This gives an approximate distance for points contained within that hash-code, allowing a search priority to be followed.
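
A minimal Python sketch of the encoding side of PQ (assuming NumPy and SciPy's kmeans2/vq routines; the numbers of subspaces and clusters are illustrative, and the asymmetric distance computation used at query time is omitted):

    import numpy as np
    from scipy.cluster.vq import kmeans2, vq

    def fit_pq(X, n_subspaces=8, n_clusters=256):
        """Fit a product quantiser: an independent K-means per subspace of the input."""
        d = X.shape[1] // n_subspaces
        codebooks = []
        for s in range(n_subspaces):
            sub = X[:, s * d:(s + 1) * d].astype(float)
            centroids, _ = kmeans2(sub, n_clusters, minit="points")
            codebooks.append(centroids)
        return codebooks

    def pq_encode(x, codebooks):
        """Encode a vector as one centroid index per subspace (the 'product' code)."""
        d = len(x) // len(codebooks)
        code = []
        for s, centroids in enumerate(codebooks):
            idx, _ = vq(x[s * d:(s + 1) * d].reshape(1, -1).astype(float), centroids)
            code.append(int(idx[0]))
        return tuple(code)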

PQ performs well in tests; however, parameters for the number of clusters and projections should be chosen with care. If too much information is provided by each clustering, there will be little to no variance in the solutions, and the resulting hash-codes will be too similar to each other to retrieve all neighbours. This essentially breaks the conditional independence requirement for hash function collections, since the cluster centroid information is relatively constant when sufficient data is given (in the form of number of centroids and number of projections).

An extension of PQ, Optimized Product Quantization (OPQ), aims to provide a single high-quality hash-code which exploits the high neighbour retrieval accuracy that cluster centroids provide (Ge et al., 2014). OPQ alternately fits cluster centroids and rotates the low-dimensional sub-spaces used by the hash function parts jointly to minimize the true cluster centroid error. This results in a single hash function which is highly fitted to the data. OPQ provides very high precision results, particularly with large 64-bit hash functions (Ge et al., 2014). In practice however, large hash functions mean time-consuming search of neighbouring hash codes to find relevant neighbours.

2.4.4.6 Boosted Similarity-Sensitive Coding

Boosted Similarity-Sensitive Coding (BSSC) combines decision stumps with a boosting algorithm to define a series of quantizations which maximize retrieval accuracy and correctness (Shakhnarovich, 2005). BSSC was originally developed as a generic way of targeting alternative definitions of similarity with locality-sensitive hashing. To achieve this, BSSC uses supervised learning. This means BSSC requires a training database with labelled neighbour and non-neighbour query pairs according to the similarity measure being targeted.

To create a hash-function, BSSC iteratively chooses a projection and quantization that optimally reduce the residual pair-wise classification error of the training set. The projections originally used by Shakhnarovich (2005) were simply sampled dimensions of the input dataset, but random or even non-linear projections could be used without affecting the implementation details of the algorithm. BSSC only requires that at each iteration, it is provided with a pool of projections to choose from. Decision stumps are then trained on each projection in the pool to minimise the residual training set error. The projection with the best error minimization is then selected and added to the BSSC function. The residual error is updated, and the process is repeated until the desired length of hash function is achieved.

There are two interesting properties of BSSC which make it an efficient and convenient method to use. The first is that, in its original implementation, BSSC selected only quantisations directly on the input data. This means that the hash code calculation step of BSSC does not depend on the input data dimensionality. Being free of the input data dimensionality allows BSSC to scale to very high-dimensional data much more efficiently than projection-based methods, which is important for high-speed similarity queries.

The second interesting feature of BSSC is that it targets the similarity search error function directly. Using a training set, error can be directly and exactly minimized, providing the training data is representative of the overall data. This results in far better performance than is achievable through randomised or even distribution-based and unsupervised hash functions for many applications. However, BSSC is still limited in performance and accuracy by its training dataset, and the scalability of the training process is limited by the complexity of decision stumps and boosting. While these limitations allow BSSC to perform well for large datasets, boosting in particular can cause memory usage problems on large training sets.

2.4.4.7 Spectral Hashing and Manifold Methods

Spectral Hashing was developed by Weiss et al. (2009) as an efficient way to simultaneously solve compactness and accuracy constraints for LSH. This is achieved by requiring all bits to be 1 exactly 50% of the time, and requiring spatial partitioning frequencies to be approximately equal. These requirements lead the authors to a spectral solution which leverages the fact that eigenfunctions of uniform distributions can be computed analytically.

To implement spectral hashing, data is first projected into its PCA dimensions. Once the data is PCA-aligned, it is assumed that each dimension’s projection is independent of the others and can be analysed separately. Next, each projected dimension is analysed to find the splitting modes which have minimal eigenvalues associated with their analytic eigenfunctions. For b bits, this corresponds to finding the overall minimum b values of mode/range_i, where mode indicates the number of splitting planes along the dimension, and range_i is the range of values for dimension i. This information is then used to create hash functions by using a sinusoidal kernel similar to SIKH, with mode determining the scaling factor before the sin of the value is taken.

Overall, Spectral Hashing could be seen as an analytical, data-driven improvement over the randomised SIKH algorithm, although its basis in development starts from a somewhat different origin. For some datasets, Spectral Hashing is very effective at exploiting data structure. However, the number of bits and ordering of bits generated by Spectral Hashing is still deterministic and aims to create a single, high-performance hash function. While more bits can be created for more tables, the theory behind the creation of Spectral Hashing indicates this might degrade performance.

Several other methods have also been proposed for using manifold embeddings in LSH calculations. Shen et al. (2015) propose a framework for generating highly efficient LSH codes from both parametric and non-parametric embeddings. Shen et al. (2015) also improve the speed of computing out-of-sample embeddings for new query points. As a result, this method provides very high accuracy LSH codes for databases where embeddings can be pre-computed.

2.4.5 Hierarchical Hashing

A recent development in LSH algorithms is the notion of hierarchical, or data-dependent, hashing. This type of hashing acts somewhat similarly to a tree data type, in that it is represented by several consecutive layers of hash-based decision functions. Hierarchical hashing combines the greater locality of search trees with the fast, large-scale decision methods of hashing. Andoni and Razenshteyn (2015) have further shown analytically that a hash-function can be constructed in this way which achieves a lower complexity bound than LSH.

Hierarchical hashing methods complicate the use of some search and optimisation methods used for LSH. An example of this is the LSH forest search algorithm, which assumes that successive bits in a hash code have equal or lower data separation value in order to work. The multiprobe search algorithm is similarly limited in effectiveness, as the meaning of distance between bins changes depending on which level in the hierarchical hash the bits are drawn from. The RDHF hash function optimization algorithm may also have difficulty, since it optimizes assuming the Hamming distance is directly proportional to the item distance. This is not necessarily true for hierarchical hash functions.

2.4.5.1 Beyond LSH

Andoni et al. (2014) developed a new approach to constructing LSH functions which has better theoretical bounds than previous results. This approach recognises that LSH algorithms generally perform best when the search radius around a query is comparable to the diameter of the point-set being queried. This generally requires that the dataset being queried is relatively small. In order to meet these conditions within the LSH framework, Andoni et al. (2014) proposed the use of multi-stage LSH algorithms, where each stage creates datasets of successively smaller diameter to be queried.

This multi-level approach consists of two stages. In the first stage, LSH is used to decompose the data roughly into balls. These balls have their diameters limited purposely during construction, which allows the proof of stronger bounds on search. The first stage may be repeated several times to successively decompose large datasets. In the second stage, a normal application of an LSH function (such as Gaussian random projection LSH) is performed on each decomposed dataset. This second stage has much better error bounds due to the smaller diameter of the data it is searching on. Overall, these results show better approximability of search, which should lead to better practical performance in terms of the recall/precision trade-off.

One issue with hierarchical LSH algorithms such as this is that traditional LSH search methods, which are used to reduce the number of hash-tables needed to a manageable size, may not be very effective. As a large part of the hash-function is dependent on earlier parts of the hash-function, Hamming based search must generally be limited to the second stage only. Some LSH optimisation methods likewise rely on assumptions about the hash-function structure that do not hold in the hierarchical case.

2.4.5.2 Locally Optimized Product Quantization

Locally Optimized Product Quantization (LOPQ) (Kalantidis and Avrithis, 2014) builds on the results of the OPQ algorithm discussed in Section 2.4.4.5. In the LOPQ framework, data is hierarchically clustered using OPQ. For each cluster, OPQ is applied a second time, resulting in a local and more optimized fitting for the data.

LOPQ is intended for very large databases, as there must be enough data in the training set to allow meaningful cluster fittings for two levels of OPQ hash functions to be trained. Due to the large data requirements, LOPQ also has a relatively slow construction process, as many clusters are fitted to the data. Using PQ for hash codes also results in slower hash calculation and search times than many other LSH function families. However, Kalantidis and Avrithis (2014) report very high precision search results using LOPQ, which may give it some advantages in situations where this is desirable.

2.5 Optimizing Approximate Search Collections

A natural progression from collections of stochastic search structures is to create a collection of search structures with complementary characteristics, so that the precision and recall of the collection can be optimised. Similarity search, and especially ranking of queries, can be very hard to optimise however. This has led to several different approaches to solving the challenges associated with optimising discrete search for continuous similarity.

A common thread across all approaches has been the use of the boosting framework (Schapire, 2003). The boosting framework is used in classification as a stage-wise greedy algorithm to minimise the residual error left in classification. With some adaptation, boosting can be applied to k-neighbour or ε-neighbour search to provide an error gradient across training samples. As boosting is stage-wise greedy, this removes the need to consider complementarity when constructing the collection. The problem can then be reduced to creating a search structure which minimises the weighted error on a set of training samples. Hash-functions and search trees create discrete partitions of data to search, so the main challenge for constructing a complementary search collection within the boosting framework is to find a way to translate continuous error weights into a discrete partitioning framework.

2.5.0.1 Boosted Search Forests

The Boosted Search Forest (BSF) was developed as a method for constructing approximate search tree collections (Li et al., 2011). The design of the BSF algorithm is split into two parts: the boosting part, which optimises for recall and low retrieval cost; and the tree construction part, which optimises for precision. Jointly, these two aspects create a collection of search trees with balanced precision and recall. This system naturally complements the strengths of search trees, which can achieve high precision based on repeated division of the data. However, the boosting mechanism does not enforce precision strongly enough to be used effectively for other algorithms such as LSH-based search.

BSF trees are constructed so that the split at each level of the tree lies along an optimally separating hyperplane determined by the projection of similar and non-similar weighted candidates. This allows a greedy optimal split to be made for each new node in the tree.

When a leaf is created, its weighted error value is calculated and used to enqueue it into a max-heap. The next leaf to split is removed from the top of this max-heap, ensuring that the maximum error in the tree is reduced at each splitting step. This process helps to control the precision of the overall tree.

The BSF algorithm is superior to k-means and spectral hashing schemes for search, making it a high performance search algorithm (Li et al., 2011). However, BSF still has several significant limiting factors. Using the boosting framework for similarity search requires data samples to be given as labelled pairs, meaning that to completely label a dataset of size n, we need a matrix of n^2 labels. This can of course be sampled to reduce complexity and memory requirements; however, as the depth of the tree grows, the number of samples in each leaf shrinks quickly. This leads to increasingly inaccurate estimates of optimal splits as the depth of the tree grows. Similarly, if a smaller representative set is used to construct the BSF, as the tree depth increases the data in the leaves will become increasingly inaccurate representations of the larger dataset and splits may not be optimal any more. These limitations mean that in practice, the optimal performance that BSF aims for sometimes cannot be achieved due to computational and memory constraints.

2.5.0.2 Data-Sensitive Hashing

Gao et al. (2014) developed Data Sensitive Hashing (DSH) as a variation of LSH that targets the k-neighbour problem directly. This differs from most LSH-based approaches, which inherently target the ε-neighbour problem. The number of ε-neighbours for a query can vary widely, depending on dataset-specific changes in density and the location of the query. This means that ε-neighbour based systems such as LSH may be much less accurate and efficient at retrieving k-neighbour sets than an equivalent construction targeting k-neighbour sets directly. DSH addresses this by re-defining the basic probabilities for LSH collisions in terms of k-neighbour accuracy rather than ε-neighbour accuracy.

DSH uses the AdaBoost framework to create near-optimal projections according to its data sample weight matrix. These projections are then quantised to create hash function bits. The weight matrix for projection generation is updated after each hash bit is generated. A collection of hash functions is generated in this way, with a global weight matrix being updated each time a new hash function is added. This helps to ensure the hash functions are complementary in finding neighbours.
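The following sketch illustrates the general shape of this boosted bit-generation loop. It is a simplification, not the exact objective or update of Gao et al. (2014): the eigenvector step and the exponential re-weighting are stand-ins for their formulation, and pairs/labels are assumed inputs of sampled neighbour (+1) and non-neighbour (−1) pairs.

```python
import numpy as np

def boosted_bits(X, pairs, labels, n_bits):
    # pairs: (m, 2) index array; labels: +1 for neighbour pairs, -1 otherwise.
    weights = np.ones(len(pairs)) / len(pairs)
    projections = []
    for _ in range(n_bits):
        diffs = X[pairs[:, 0]] - X[pairs[:, 1]]
        # Weighted pair scatter: positive terms for neighbour pairs, negative
        # for non-neighbour pairs, so the smallest eigenvector pulls neighbours
        # together while pushing non-neighbours apart along the projection.
        S = (diffs * (weights * labels)[:, None]).T @ diffs
        w = np.linalg.eigh(S)[1][:, 0]
        bits = (X @ w > np.median(X @ w)).astype(int)  # quantise to one hash bit
        # AdaBoost-style re-weighting: up-weight pairs the new bit handles badly.
        agree = bits[pairs[:, 0]] == bits[pairs[:, 1]]
        correct = np.where(labels == 1, agree, ~agree)
        weights *= np.exp(np.where(correct, -1.0, 1.0))
        weights /= weights.sum()
        projections.append(w)
    return np.array(projections)
```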

While this process sounds very similar to the leaf splitting of BSF, the method of choosing optimal projections is somewhat different. The optimisation used by BSF seeks primarily to maximise the negative covariance of non-neighbour pairs and the positive covariance of neighbour pairs. The decomposition used by DSH seeks to minimise the squared distance between neighbour pairs while maximising the squared distance between non-neighbour pairs. A direct comparison of the effectiveness of these different splitting methods on similarity search data would be interesting, to determine which method targets true similarity best.

2.5.0.3 Reciprocal Dominant Hash Functions

The Reciprocal Dominant Hash Function (RDHF) algorithm developed by Liu et al. (2013) takes a general approach to hash-bit optimisation. Hash functions of n bits are selected from a pool of generated single-bit hash functions, using a process based on a novel solution to the normalised dominating set graph problem (Pavan and Pelillo, 2007). This allows any generic hash-function generation technique to be used in conjunction with RDHF to improve search performance.

Liu et al. (2013) construct a weighted edge-matrix between all remaining single-bit hash functions in the pool, and select an n-bit group of hash functions whose weighted edges dominate the graph. The weights are carefully constructed to incorporate information about redundancy between hash functions, information from the pairwise weight matrix for the input data, and information about the minimum Hamming distance of pairs over all the hash functions previously constructed. Once a new hash function is constructed, the weight matrices and distance measures are updated and the relevant candidates are removed from the pool.

RDHF is able to optimise over any general collection of single-bit hash functions, making it much more flexible than DSH and BSF. However, there are no other algorithms in the same class of generic hash-function selection to compare it against. Consequently, although RDHF is shown to be superior to other high-performance methods, it is not known whether RDHF itself is particularly high or low performing compared to the class of generic hash-function selection methods. This is investigated further in Chapter 4.

2.6 Chapter Summary

This chapter has provided a broad overview of the current state of research in exact and approximate similarity search. While not every topic has been explored in detail, the majority of the ideas and contributions driving the development of the state of the art in similarity search have been covered. This forms a solid basis to build on in terms of algorithms, theory, and links between relevant areas of research. The methods reviewed here inform, and in some cases motivate, the ideas presented in the following chapters.

Chapter 3

Subsampling for fast Locality Sensitive Hashing

This chapter addresses the problem of fast Locality Sensitive Hashing (LSH) in high-dimensional datasets. LSH is an approximate similarity search method for large datasets and high-dimensional spaces. LSH search uses hash functions which are more likely to collide for similar items, so that hash-table lookups can be used to search for likely candidate matches to a query.

We propose an extremely fast and simple hashing algorithm for high-dimensional data: Sub-sampled Locality Sensitive Hashing (SLSH). Our method chooses a random subset of dimensions from the data, rather than constructing a new set of projections for the data as prior work has done. Unlike prior approaches (Dasgupta et al., 2011, Kushilevitz et al., 2000), the calculation time for our method does not depend upon the number of dimensions in the input data. This allows it to scale to high-dimensional datasets without paying an associated processing penalty for hashing.

We show that our approach out-performs the baseline on several datasets in terms of construction time and similarity search time. This leads us to the conclusion that for similarity search in high-dimensional data, sub-sampled hashing can frequently improve performance over the commonly used baseline locality-sensitive hashing algorithm, and is much simpler to implement.


Introduction

Locality Sensitive Hashing (LSH) was first proposed as a method for finding similar items in large datasets (Indyk and Motwani, 1998). Hash-functions are normally used to generate short codes which represent larger documents or pieces of data. A good hash-function will have a low likelihood of generating the same code for two different pieces of data. In this way, hash-codes can be used to check the integrity and identity of documents or other pieces of data. If the same hash-code is generated for two different pieces of data, we call it a hash collision. In normal usage, hash collisions have negative effects on the performance and security of the hash-function.

Unlike conventional hashing, LSH functions have a high likelihood of collision between similar pieces of data. Similarity for LSH search is generally defined by the Euclidean distance metric, although other measures such as the χ² distance have also been used (Gorisse et al., 2012). LSH functions allow the construction of hash-tables, where a lookup points to a bin of data items corresponding to that hash code. The aim of LSH is that the data items in each hash-table bin have a high probability of being similar to each other.
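A minimal hash-table layout of this kind can be sketched as follows (illustrative only; hash_fn is any locality-sensitive hash function producing a hashable code):

```python
from collections import defaultdict

def build_table(hash_fn, data):
    # Each bin maps a hash code to the indices of the items that produced it.
    table = defaultdict(list)
    for i, x in enumerate(data):
        table[hash_fn(x)].append(i)
    return table

def query_table(table, hash_fn, q):
    # Candidates are whatever shares the query's bin; a final re-ranking by
    # true distance is still needed to answer the nearest-neighbour query.
    return table.get(hash_fn(q), [])
```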

Being able to find similar items through the use of hash-table lookups allows the implementation of approximate similarity (or nearest-neighbour) queries in sub-linear time. This result is particularly important for real-time and big data algorithms. In these cases, the run-time complexity of algorithms has a significant impact on their usefulness.

A naive search involves checking the distance to every item in the dataset to find similar items. For each query, this gives a run-time on the order of O(n). By comparison, LSH has performance guarantees in the order of O(log(n)) (Kushilevitz et al., 2000). LSH and other approximate search methods have been applied in many areas including simultaneous localisation and mapping (SLAM) (Shahbazi and Zhang, 2011), medical time series and sequence data (Buckingham et al., 2014, Woodbridge et al., 2012), non-linear dimensionality reduction (Talwalkar et al., 2008), and approximate nearest neighbour queries in general (Liu et al., 2005).

Many applications of LSH utilise a common baseline approach named Gaussian random projections (Liu et al., 2005, Shahbazi and Zhang, 2011, Woodbridge et al., 2012). Gaussian random projections involve first mean-centering the data, then multiplying by a projection matrix. The projection matrix transforms the input data so that its number of dimensions matches the desired hash-code length in bits. This projection matrix is created by sampling from a Gaussian distribution, which gives the method its name. After the projection step, each dimension is thresholded at 0 to convert it into a single bit of the final hash code. The complexity of Gaussian random projections is O(db) for data with dimensionality d, to generate hash-codes of length b bits.
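A direct rendering of this baseline, kept deliberately minimal, might look as follows (function and parameter names are ours, not from the cited works):

```python
import numpy as np

def gaussian_projection_hash(X, n_bits, seed=0):
    # Mean-centre, project onto a d x b Gaussian matrix, threshold at 0:
    # O(db) work per item, one hash bit per projected dimension.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    P = rng.standard_normal((X.shape[1], n_bits))
    return (centred @ P > 0).astype(np.uint8)
```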

Gaussian random projections are effective in partitioning both sparse and dense data. Many LSH algorithms can achieve higher recall and precision values than Gaussian random projections, but in general require more complexity in terms of implementation and tuning parameters. Due to its simplicity, Gaussian random projection hashing is popular as a first baseline: it is used as a first implementation to test the efficacy of LSH in machine learning and robotics applications (Shahbazi and Zhang, 2011), and also as a baseline test case for improving the performance of LSH (Gong and Lazebnik, 2011).

This chapter outlines a fast projection method for locality-sensitive hashing. We show experimentally that this approach is competitive with the commonly used Gaussian random projections baseline, while having a much lower computational cost. Comparable or superior performance is shown for several datasets. The scaling performance of this method relative to Gaussian random projections is also investigated, giving a context for when this new method is most advantageous. The results show that the proposed sub-sampling projection method achieves superior precision and performance compared to Gaussian random projection LSH for datasets up to the order of 10⁴ elements.

3.1 Locality-Sensitive Hash Functions

The underlying theory for locality preserving hash functions in high dimensional spaces was first developed by Indyk and Motwani (1998), who set out several required properties for locality sensitive hash functions. We first define a hash function family hᵢ(x) ∈ H(x), where the hash functions hᵢ(x) of H(x) are constructed to give a varied (preferably conditionally independent) association of hash codes on the data. An extended description of LSH can be found in Chapter 2, Section 2.4.

Moran et al. (2013) consider most LSH functions to be composed of two steps: first, the high-dimensional data is mapped to a low-dimensional feature set in a projection step; then, the low-dimensional feature set is mapped onto a binary hash function by a quantisation step. A natural question stemming from this two-step procedure is whether we can find a more effective or faster projection method for common use, as projection is typically the most time-consuming part of hash-function calculation.

A number of hash-functions use Gaussian random projections for their projection step. These include the baseline Gaussian random-projection LSH (RPLSH), LSH using P-stable projections, and shift-invariant kernel hashing (SIKH, Raginsky and Lazebnik (2009)). Due to the simplicity of implementation and relatively good performance, these methods are often used as baselines.

The DHHash algorithm developed by Dasgupta et al. (2011) shows that an order of magnitude speed-up of the hash function can be achieved when calculating random projections by using a double application of Hadamard matrices. The retrieval performance is similar to baseline random projection LSH, while evaluating the hash-function only requires a complexity of O(d log d) for projection calculations. Even though low-dimensional projection with DHHash is fast, its calculation time still depends on the number of input dimensions. This speedup is also dependent on the implementation of a fast path for Hadamard matrix multiplication, since faster than usual matrix multiplication (O(d log d) compared to O(d²)) is a key component in the improved performance of DHHash (Dasgupta et al., 2011).

Some locality-sensitive hashing methods utilise the original dimensions of the data without a data projection stage. However, this idea is applied infrequently compared with the methods reviewed so far, which re-project the data along new axes. Lv et al. (2004) describe a method of randomly splitting the original data dimensions in order to store feature data in compact search structures for the earth-mover's distance. The thresholded binary codes produced by repeatedly splitting dimensions are then combined using XOR to create compact hash codes. As a result of repeatedly thresholding and XORing each dimension, the run-time complexity of this method still depends linearly on the input data dimensionality.

Boosted Similarity-Sensitive Coding (BSSC) is another algorithm which uses the original projections of the input data (Shakhnarovich, 2005). Unlike the other algorithms which have been discussed, BSSC requires labelled training data in order to select the most favourable dimensions and splits from a pre-generated collection. This advantage allows BSSC to out-perform randomly constructed algorithms, but extensive fore-knowledge and pre-training on the dataset is required to do so. The time consumed by constructing and optimising the hash-functions could in some cases take longer than constructing cheaper hash-functions and searching the data-set, depending on the number of queries required. An interesting property of BSSC is that once the hash-function is constructed, it uses at most one dimension per hash-bit. This makes it possible to calculate BSSC hash-codes with a complexity of O(b), which is independent of the input data dimensionality (although the hash-function construction stage of BSSC is not free from this dependence).

3.2 Sub-sampled Locality Sensitive Hashing (SLSH)

The method explored in this thesis is analogous to Gaussian random projections for the Euclidean l2 distance. We replace the random projection step of the algorithm with a sub-sampling step, keeping the quantisation stage the same. This reduces hash-code calculation to sampling and mean-thresholding b dimensions for b bits. In practice, this can be performed using most matrix libraries by creating a matrix sub-view of the b selected dimensions to threshold. Using sub-views avoids the calculation of a projection matrix. Without a projection matrix, the computational complexity of calculating a hash code reduces from O(db) for d dimensions and b bits to O(b). While this gives faster hash calculation times, it also has the potential to ignore additional distance information in the input data, as not all dimensions are considered. Intuitively this should lead to worse distance estimation performance; however, this disadvantage could be offset by the faster hashing calculations for sub-sampling.
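A minimal sketch of the SLSH hashing step, under the assumption that the mean threshold per dimension is estimated from the data being hashed (names are illustrative):

```python
import numpy as np

def slsh_hash(X, n_bits, seed=0):
    # Pick b random dimensions and threshold each at its mean: O(b) per item,
    # independent of the original dimensionality d.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    dims = rng.choice(X.shape[1], size=n_bits, replace=False)
    sub = X[:, dims]                     # the "matrix sub-view" of b dimensions
    return (sub > sub.mean(axis=0)).astype(np.uint8)
```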

DHHash (Dasgupta et al., 2011) makes use of the properties of the Hadamard projection matrix to achieve a projection and quantisation complexity of O(d log d), while still maintaining reasonable guarantees about the randomness of projections. In comparison, sub-sampling reduces complexity to O(d) in the worst case, but we give up some analytical performance guarantees in doing so.

Random sub-sampling should have a good chance of capturing the neighbourhood information in the source data dimensions for high-dimensional data. As the required number of sampled dimensions approaches the dimensionality of the data, there may be some performance losses associated with correlation and redundancy. The complexity of relationships in real-world data makes these claims impossible to analyse in the general case, but we can still experimentally verify the effects on a range of data-sets.

3.2.1 Sub-sampling on Sparse Data

While sub-sampling is a favourable method for dense data, it can fail on sparse data. As an example, when 80% of input data dimensions are completely 0, SLSH might select only dimensions which are 0 as projections. This would result in all data items having the same hash code, as the data SLSH selects is identical in every case. RPLSH and similar methods use the entire feature vector to obtain projections, allowing this issue to be avoided. To mitigate this problem for SLSH, we take a sum over several source data dimensions to generate each hash bit, instead of using only one source dimension per hash bit. This strategy differs from other sparse projection schemes (Dasgupta et al., 2011) in that no weights are used, and only enough dimensions are summed to ensure a reasonable amount of discrimination in the resulting hash-function.
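The sparse-data variant can be sketched in the same style; the grouping of dimensions and the default of 15 summed dimensions per bit (the trade-off discussed below) are illustrative assumptions:

```python
import numpy as np

def sparse_slsh_hash(X, n_bits, dims_per_bit=15, seed=0):
    # Each bit sums several randomly chosen source dimensions (unweighted)
    # before mean-thresholding, so an all-zero sub-sample becomes unlikely.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    groups = [rng.choice(X.shape[1], size=dims_per_bit, replace=False)
              for _ in range(n_bits)]
    summed = np.stack([X[:, g].sum(axis=1) for g in groups], axis=1)
    return (summed > summed.mean(axis=0)).astype(np.uint8)
```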

Figure 3.1 shows the mean precision for 32-bit sub-sampled hashes of the sparse TDT-2 dataset as the number of summed dimensions per bit increases. For a small number of dimensions (less than 10), precision increases relatively quickly. After this, the change in precision slows down. As we are interested in the speed of hashing in this work, a value of d = 15 summed dimensions is selected as a reasonable trade-off between precision and minimising the number of sums required.

Figure 3.1: Precision for k = 100 vs number of dimensions summed per bit for the TDT-2 dataset. After the first few dimensions, precision changes only slowly.

In terms of performance, it is expected that the number of sums needed for this method will be affected by the sparsity of the dataset. Datasets with a higher sparsity will require more summations to achieve acceptable performance. This means that sparse sub-sampling will be of a slightly higher complexity than plain sub-sampling, but should still achieve optimal performance with a complexity that is unrelated to the actual dimensionality of the input dataset.

3.3 Experimental Protocols

The primary measures of performance used in this investigation are Precision and Recall (Manning et al., 2008). Precision and Recall operate on a k-nearest-neighbour query, where there are k relevant, or correct, answers to be returned from the database. In the following experiments we set k = 100, as this follows settings reported in the LSH literature (Dasgupta et al., 2011, Kalantidis and Avrithis, 2014, Moran et al., 2013). Precision measures the fraction of returned items which are relevant (i.e. are among the k closest to the query point), while Recall measures the fraction of relevant items returned (i.e. how many of the true k closest items to the query are actually returned in the first k results of the query). We can also plot these values against one another, to measure the relationship between Precision, which improves speed, and Recall, which improves accuracy. This curve measures the relative cost of queries in terms of the number of false positives that will be processed for a given recall score. The area under this curve can be used as an estimate of overall performance (Manning et al., 2008).
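For reference, precision and recall at k, and the area under the resulting curve, can be computed as in the short sketch below (our own helper, using the trapezoidal rule for the area):

```python
import numpy as np

def precision_recall_at_k(retrieved, true_neighbours, k=100):
    # Precision: fraction of retrieved items that are true k-NN.
    # Recall: fraction of the true k-NN that were retrieved.
    hits = len(set(retrieved) & set(true_neighbours))
    return hits / max(len(retrieved), 1), hits / k

def auprc(precisions, recalls):
    # Area under the precision-recall curve; points ordered by increasing recall.
    return float(np.trapz(precisions, recalls))
```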

The performance of SLSH is tested relative to its random projection equivalent, RPLSH, and also its nearest competitor for fast hashing, DHHash. The Reuters (Moran et al., 2013) and TDT-2 (Cai et al., 2007) text datasets are used to test sparse data performance. The CIFAR-10 (Krizhevsky, 2009) and LabelMe-22k (Russell et al., 2008) datasets are used to test smaller scale performance on image retrieval. The GIST and SIFT 1 million sized datasets are used to test large scale, high-dimensional data (Jegou et al., 2011a, Siagian and Itti, 2007). For each dataset, we report the Precision-Recall curve as Hamming search radius is increased.

For the SIFT 1 million dataset, the relative area under the precision-recall curve (AUPRC) is plotted as a function of dataset size to understand the performance differences between SLSH and RPLSH. We also report hash-database construction times and search time relative to recall for the best hash-function configurations of each algorithm, to evaluate the practical performance differences.

3.3.1 Test Data

The datasets tested have been drawn from those used to test other hashing schemes in prior work, covering datasets of varying dimensionality and size (Gong and Lazebnik, 2011, Kong and Li, 2012a,b, Moran et al., 2013). These datasets allow us to explore the properties of sub-sampled hashing for a variety of dimensionalities and data sizes.

The LabelMe-22k image dataset contains 22,019 data points with a dimensionality of 512. LabelMe (Russell et al., 2008) is encoded as GIST features, which were developed for image content similarity matching (Siagian and Itti, 2007) and have been used previously in LSH studies (Gong and Lazebnik, 2011).

LSH on raw pixel data from CIFAR-10 (Krizhevsky, 2009) is also tested, as it is larger than LabelMe and also the highest-dimensional of the datasets tested. We use the 60,000 training samples from the CIFAR-10 dataset for these tests. CIFAR-10 has a dimensionality of 3072, making projections and distances more expensive to compute.

The Reuters-21578 corpus was constructed for testing text categorisation on real-world data. We use the ModApte version, modified as used to test LSH algorithms by Moran et al. (2013). It contains 8293 points which consist of 18,933-dimensional sparse vectors of word occurrences.

The TDT-2 corpus is a collection of news stories which has been used for LSH and clustering benchmarking (Moran et al., 2013, Xu et al., 2003). We use the top-30-category version created by Cai et al. (2007), which contains 11,201 entries with sparse word occurrence vectors of length 36,771.

We also use the BigANN 1-million approximate nearest neighbour test data (Jegou et al., 2011a), using both the 128-dimensional SIFT features and the 960-dimensional GIST features. These datasets were created for the purpose of testing approximate nearest-neighbour algorithms.

3.3.2 Test Procedure

To test each dataset, we hold out a set of 500 query points and use the rest to construct the query database. In the literature there is no clear standard value for the number of query results k; however, a value of k = 100 is commonly used in recent results (Dasgupta et al., 2011, Kalantidis and Avrithis, 2014, Moran et al., 2013) and so this value is used for testing. The l2 ground-truth for the k points closest to each query is determined by a linear scan, which is then used to calculate the recall and precision for testing. For each query point, we use LSH to find the k = 100 approximate nearest neighbours in the query database, and calculate the mean precision versus recall as the similarity search is expanded. We perform this test 10 times, and report the mean precision-recall curve for each dataset.

To expand the similarity search using a single hash-function, and report precision values as recall approaches 100%, we order the dataset by increasing Hamming distance to each query. Treating binary hash-codes as [0, 1] vectors, Hamming distance is simply the l1 or taxi-cab distance between any two hash-codes. These codes can be efficiently enumerated in an expanding radius from the original query code, allowing fast querying of a hash-table for possible neighbours. We calculate the recall and precision for a query at increasing Hamming distances until 100% recall is reached. These results are presented in graphical form in Figures 3.2, 3.3, and 3.4.
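The radius-by-radius enumeration can be implemented by flipping every combination of r bits for r = 0, 1, 2, ...; a small illustrative generator is given below. Each probe code is then looked up in the hash-table, and the query stops once the required recall or candidate budget is reached.

```python
from itertools import combinations

def hamming_ball(code, radius):
    # Yield all codes within `radius` bit flips of `code`, in order of
    # increasing Hamming distance; `code` is a tuple of 0/1 bits.
    for r in range(radius + 1):
        for flips in combinations(range(len(code)), r):
            probe = list(code)
            for i in flips:
                probe[i] ^= 1
            yield tuple(probe)
```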

Figure 3.2: Precision-Recall curves for Reuters and TDT-2 text datasets using 32 and 64 hash-bit sizes, higher is better. For 32 bit hashes, SLSH is similar to or better than RPLSH and DHHASH. Standard deviation bars shown for RPLSH to indicate the uncertainty in results.

We also include tables of the mean area under the precision-recall curve (AUPRC), as used by Moran et al. (2013) and Kong and Li (2012a). Comparisons of our results in Table 3.1 with those obtained for the CIFAR-10 and LabelMe datasets by Kong and Li (2012b) show that this corresponds well with mean average precision (mAP) as suggested by Manning et al. (2008). The AUPRC therefore acts as a useful single number for comparing the effectiveness of different hash-functions.

Figure 3.3: Precision-Recall curves for CIFAR-10 raw image data and LabelMe image features using 32 and 64 hash-bit sizes, higher is better. For 32 bit hashes, SLSH is similar to or better than RPLSH and DHHASH here. Standard deviation bars shown for RPLSH to indicate the uncertainty in results.

In order to test the speed of SLSH for real-world use, it is important to perform benchmarks which include the timing of practical elements such as cache misses, branching, and memory bandwidth limitations. To test this, a simple implementation of LSH was created using a hash-map indexing into a contiguous array of element indices corresponding to hash-bins. Candidates are gathered using a unique set, so that multiple hash-tables can be searched without duplicate entries. Once all candidates are gathered, their distances to the query point are calculated and the top 100 results are returned. All basic data structures used come from the C++ standard template library, and math calculations are performed through the Armadillo BLAS library (Sanderson, 2010). The timings reported will be slightly slower than optimal performance, as they include truth-checking. This cost can be considered constant, and so does not affect the accuracy of timing comparisons.

Figure 3.4: Precision-Recall curves for the BigANN 1 million sized datasets using 32 and 64 hash-bit sizes, higher is better. For 32 bit hashes, SLSH performs slightly worse than RPLSH and DHHASH on these large datasets. Standard deviation bars shown for RPLSH to indicate the uncertainty in results.

In a practical LSH system there are multiple parameters which are optimised for best performance for a given dataset and hash-function. Each hashing algorithm will have a different time-optimal configuration, as the probability of hashes colliding and the time consumed by calculating the hash will be different. To report performance fairly over varying recall levels, a grid-search was run over the number of hash-tables to use, the number of hash-bits per table, and the Hamming search radius to use for each algorithm. The lower time bound over the whole grid-search is then plotted for 500 queries in Figure 3.6 for fixed increments in recall, resulting in a recall-based best-performance plot for each algorithm, independent of parameter settings. The corresponding search database construction times are reported in Figure 3.7 against recall to show the difference in creation cost for each algorithm. For DHHash on the CIFAR dataset, the number of final projections was reduced to 1024 to maintain competitive time performance. This approach was discussed by Dasgupta et al. (2011) as a way of improving speed on high-dimensionality datasets.

Table 3.1: Mean Area Under the Precision-Recall Curve (AUPRC) for all tests, best results bolded. SLSH is competitive on smaller datasets and 32 hash-bit sizes, but does not maintain performance as the hash and data sizes are increased.

                 32 Bits                      64 Bits
Dataset          SLSH    RPLSH   DHHASH       SLSH    RPLSH   DHHASH
Reuters          0.1383  0.1416  0.1355       0.1527  0.2000  0.2016
LabelMe          0.0501  0.0494  0.0489       0.0789  0.0855  0.0837
TDT2             0.0266  0.0155  0.0158       0.0298  0.0206  0.0212
CIFAR10          0.0199  0.0203  0.0208       0.0282  0.0359  0.0339
BigANN SIFT      0.0031  0.0050  0.0050       0.0077  0.0142  0.0145
BigANN GIST      0.0008  0.0013  0.0011       0.0010  0.0019  0.0015

3.4 Precision and Recall Results

The precision-recall graphs for all datasets can be seen in Figures 3.2, 3.3, and 3.4. Table 3.1 reports the area under these curves for each dataset, ordered by increasing size from smallest to largest. It is interesting to see that for smaller datasets, SLSH performs similarly to or better than RPLSH and DHHash for 32-bit hash-codes. DHHash performs similarly to RPLSH in almost every test, indicating a high degree of functional equivalence.

One trend immediately stands out across Figures 3.2, 3.3, and 3.4: SLSH does not scale in performance as well as RPLSH and DHHash when the number of bits is increased. This could be a problem when using SLSH for large datasets, where more bits are needed to separate data items. It may also make SLSH less suitable for nearest-neighbour queries with a very low k, where more bits might be used so that neighbours are much sparser and precisions are higher.

Figure 3.5: log(dataset size) vs AUPRC (normalised to RPLSH = 1.0) demonstrating the relationship between dataset size and accuracy of SLSH compared to RPLSH. Each blue dot represents the mean AUPRC of SLSH at a particular dataset size with a 32 bit hash function. Higher is better. For small to medium datasets, SLSH is as good as or better than RPLSH in terms of recall and precision accuracy.

The apparent relationship between dataset size and SLSH performance relative to other hash functions is particularly interesting. This indicates that SLSH may be more useful for smaller datasets. In smaller datasets, LSH typically is not used as the cost of hashing data and building hash-table structures is higher than the search time gains. SLSH however is much cheaper to calculate, and could have some practical use in this area.

To test the validity of the dataset-size relationship, the performance of SLSH relative to RPLSH, measured by the area under the precision-recall curve (AUPRC) for 32-bit hashes, was calculated for increasing query database sizes on a dataset where SLSH does not perform as well as the others. Figure 3.5 shows the results for the BigANN SIFT dataset. We cannot draw too many conclusions from only one dataset, but from our other results it seems likely that the relative performance trend we see here is common. We can see that for smaller database sizes, SLSH actually out-performs RPLSH. However, once database sizes grow to 10⁴, the relative performance of SLSH is significantly reduced. The CIFAR-10 dataset is in the range of 10⁴ elements, however, and does not show the same relative performance penalty for SLSH in Table 3.1. This suggests that part of the performance penalty observed could be specific to the BigANN dataset.

From the data as a whole, it appears that SLSH is generally directly competitive per-bit and per-hash-function with RPLSH for data-sets up to around 10⁴ elements. This is a useful property for search on medium-sized datasets (approximately 10⁴ elements) when fast hash-code construction and evaluation is required. An example of this might be real-time SLAM keyframe matching, where the goal is to find keyframes similar to the current viewpoint very quickly (Shahbazi and Zhang, 2011). In this case, database sizes are appropriately small but hash-codes must be generated quickly in order to be useful. It is interesting to note here that since applying SLSH does not depend on the dimensionality of the dataset, it will have a lower computational complexity on large, high-dimensional datasets than even running a single brute-force query. This is also true of DHHash to a lesser extent, although DHHash has significant memory requirements for its hash computation.

3.5 Timed Performance Results

Figure 3.6 shows the speed of the best-performing configuration of each hash-function for 500 queries. Each data point in Figure 3.6 was selected from a grid-search over the number of hash-bits (8 to 32 in increments of 2), the number of hash tables (10 to 120 in increments of 10), and the radius to search around each hash index (0 to 2 in increments of 1). The plotted points indicate the most efficient configuration for a given recall. This curve indicates the best-case performance for each hash-function across a range of recall levels, allowing real-world performance trends to be observed.
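The sketch below shows the shape of this grid-search; run_lsh is an assumed helper that builds the index for a configuration and returns its mean recall and total query time for the 500 held-out queries.

```python
import itertools

def best_time_per_recall(run_lsh, bit_range, table_range, radius_range):
    # Keep, for each recall bucket, the fastest query time seen over the grid.
    best = {}
    for bits, tables, radius in itertools.product(bit_range, table_range, radius_range):
        recall, query_time = run_lsh(bits, tables, radius)
        level = round(recall, 2)
        if level not in best or query_time < best[level]:
            best[level] = query_time
    return dict(sorted(best.items()))

# Grid matching the text:
# best_time_per_recall(run_lsh, range(8, 33, 2), range(10, 121, 10), range(0, 3))
```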

Remembering that the time axis of Figure 3.6 is in log-space, SLSH has a fair time advantage when used on LabelMe and CIFAR, which are smaller datasets on the order of 10⁴ data elements. Interestingly, the BigANN GIST and SIFT datasets have almost identical timing under all methods, with SLSH performing equally to the other methods.

Figure 3.6: Log(search time) vs Recall for dense datasets. Times reported for the best configuration of LSH parameters at each level of recall. SLSH is faster than DHHASH and RPLSH on the LabelMe and CIFAR datasets. On the BigANN datasets, SLSH loses out by a small margin.

In this case, although the precision of SLSH is relatively the worst on this dataset, the higher dimensionality of the dataset made hash calculation more expensive for DHHash and RPLSH. Being independent of dimensionality, SLSH allows a trade-off of less hash-function calculation time in order to calculate more distance comparisons. Doing so allows SLSH to compensate for its lower precision, resulting in performance close to the other algorithms.

The results for the CIFAR dataset in Figure 3.6 are also interesting. Given that this is high-dimensional raw image data, we might expect SLSH to do poorly, as it can only capture information from a small number of dimensions in each hash-function. Instead, SLSH reports significantly better results than DHHash and RPLSH, even at high levels of recall.

Figure 3.7: Construction Time (s) vs Recall for best performing configurations. Times reported for the LSH searches reported in Figure 3.6. The cost of construction for SLSH is an order of magnitude lower than DHHASH and RPLSH on most datasets.

Overall, the results in Figure 3.6 agree with our conclusions based on precision and recall measurements. On smaller datasets with high dimensionality, SLSH has a clear advantage over RPLSH and DHHash. As datasets increase in size towards 10⁶ elements, SLSH may sometimes be at a small disadvantage. However, these results also show that for larger datasets such as the SIFT dataset, RPLSH is still competitive with the other methods. This is possibly due to DHHash and SLSH creating a more limited number of projections in the low-dimensional case, which reduces the variation of the hash functions. A reduction in variation limits the recall and precision trade-off that can be made, since we are no longer sampling from an effectively unlimited pool of unique hash functions to obtain separation information.

Figure 3.7 gives the time taken to construct a hash database for the datasets. While Figure 3.6 measured the time taken to search the entire dataset, Figure 3.7 measures the initial cost involved in setting up the search structure to perform efficient queries. This is a particularly useful measure when evaluating the amount of set-up time available. In situations where searches must be performed on changing datasets, a slow construction time will be a disadvantage.

The time taken to construct the hash database for searching depends on the number of hash-functions used and the number of bits per hash function. The optimal configuration for various recall levels can be very different, and so the graphs in Figure 3.7 exhibit significant spikes in places where the number of hash-functions or hash-bits per table increases sharply.

Even considering these spikes as noise, Figure 3.7 shows several clear trends. Due to the separation between projection complexity and hash-function assembly, both SLSH and DHHash tend to have smoother curves with fewer spikes caused by changes in the number of hash-functions or bits per function.

The most notable trend across datasets in Figure 3.7 is how much faster SLSH is for hash database construction. As SLSH does not depend on the dimensionality of the data for its complexity, this difference becomes larger as the dimensionality of the dataset increases. For most of the datasets, SLSH is over an order of magnitude faster than the other algorithms at constructing the hash database. The one exception is the SIFT dataset, which is interesting because it is also the lowest-dimensional dataset. As the number of projections SLSH can use is quite low in this case, far more hash-functions are required to achieve enough variation for effective search. State-of-the-art LSH optimisation algorithms such as the work of Liu et al. (2013) may also be ineffective at optimising multi-level hash-functions, since there is an innate assumption that all the hash bits have data separation characteristics which are independent of each other.

3.6 Chapter Summary

The reported results are somewhat surprising, as they suggest that the information necessary for competitive locality-sensitive search can typically be recovered from a very small number of the original data dimensions, even when the dimensionality of a problem is quite high. There are many examples which show the discriminative power of simple sub-sampling, such as decision trees where each decision depends on a single feature (Iba and Langley, 1992). However, these results go beyond selected features and boundaries to show that effective discriminative power for similarity search can in many cases be obtained from completely random input selection and only mean-centred decision boundaries.

Further work in this area may benefit from making use of (or adapting) the projection method proposed by Allen-Zhu et al. (2014). Applying this method could give further improvements for sparsely summed projections. Sub-sampling as a low-dimensional projection method could also be used to create efficient variations of other existing hash-functions such as SIKH. Interestingly, these results indicate that random projections often may not have a significant performance advantage compared to simply sub-sampling the data. This is useful information from an implementation perspective, as sub-sampling is very simple and can be performed efficiently with minimal code. By comparison, RPLSH and DHHash require the use of a library or specialised implementation to provide fast projection calculations.

We have demonstrated that SLSH is competitive with RPLSH and DHHash for a range of datasets and feature types, and is in some cases superior. Additionally, SLSH improves hash-function evaluation times significantly when dealing with high-dimensional data, which reduces search database construction time by an order of magnitude or more. Based on the results reported in this investigation, SLSH could improve performance for a variety of applications including image feature matching, text search, and real-time object recognition.

Chapter 4

Approximate Boosting for Locality Sensitive Hashing

In this chapter we introduce a new framework to accelerate boosting for locality-sensitive hashing (LSH). This involves a new method for approximating boosting weights for nearest-neighbour problems. Our method improves the scalability of the boosting process significantly. Comparisons with recent work in the area show that our approximate boosting framework retains precision and recall when working with a range of data-sets and locality-sensitive functions, while significantly reducing optimisation time and memory requirements. Due to its scalability advantages, this approach could become critical for large-scale similarity search.

We also present a simple boosting method which allows optimisation over arbitrary approximate search functions and query methods. We show that this simpler boosting method achieves state-of-the-art performance while using much less time to produce optimised hash collections when compared to other approaches.

4.1 Introduction

Within the realm of machine learning and information retrieval, it is often the case that we wish to query a large set of items for those which are similar to some reference item. This type of query has been termed locality sensitive search, or nearest neighbour search (Indyk and Motwani, 1998). As the size of data-sets increases, faster methods for locality sensitive search are required to maintain acceptable performance in real-world scenarios.

A naive exact nearest-neighbour search would examine the distance from every point in the data-set to the query. As data-sets grow, this quickly becomes prohibitive as a mechanism for rapidly querying data. Efficient data structures for exact search, such as KD-trees, work well for problems of lower dimensionality. For KD-trees, query time grows logarithmically with respect to an input of size n (Friedman et al., 1977). However, KD-trees suffer in performance as the dimensionality of the data is increased: the expected search complexity grows exponentially with respect to the dimensionality d (Friedman et al., 1977). This dependency on the dimensionality of the input data makes exact KD-trees less useful for search over documents, images, and other data with high dimensionality.

In order to scale to larger data-sets, approximate solutions to nearest-neighbour search have been developed to trade off accuracy for execution time. An example of an approximate nearest neighbour search structure is Locality-Sensitive Hashing (LSH), which has a certain probability of returning close neighbour points when queried. In order to improve the number of nearest neighbours returned, a collection of LSH functions is often used.

Examples of machine learning and robotics areas where approximate nearest neighbour methods are already applied include localisation systems for mobile robotics (Bonin-Font et al., 2014, Shahbazi and Zhang, 2011), medical systems (Woodbridge et al., 2012), and dimensionality reduction for large datasets (Talwalkar et al., 2008). Approximate solutions have proven to be flexible in terms of data types. They have been used effectively for many types of high-dimensional data including text features, image features, sparse data, and dense data. As the amount of data we collect and store increases, there is also increasing pressure on the efficiency and scalability of similarity search. This is due to similarity search being a critical step in commonly used data analysis tools such as clustering (Ester et al., 1996).

LSH has seen a significant body of research toward improving accuracy and memory usage for large-scale search tasks based around text, images, and other high-dimensional data (Kalantidis and Avrithis, 2014, Lv et al., 2007, Sundaram et al., 2013). Recent results have also shown boosting algorithms to be effective in optimising LSH collections (Gao et al., 2014, Li et al., 2011, Liu et al., 2013, Wang et al., 2012). To apply boosting to LSH, a set of n items becomes n² labelled training pairs, indicating whether each pair of points are nearest neighbours or not. Training labels quickly become impractical to store as the data size is increased. Current methods are consequently limited to sparse sampling of the training set as its size increases. This means a large amount of potentially important data is necessarily excluded from the optimisation process.

In the following sections, we develop a scalable approach to the complementary optimisation of approximate similarity search structures. To reduce computational and memory overhead, we propose a method for approximating residual error values as averages. Residual error values are used in the similarity search optimisation algorithms developed by Gao et al. (2014), Li et al. (2011) and Liu et al. (2013). Approximated residual values are compared to sparse sampling for the Reciprocal Dominant Hash Function algorithm (Liu et al., 2016, 2013). We also develop a simple method for optimising arbitrary locality sensitive search collections based on approximate boosting, and show that this can rival current state-of-the-art methods while being applicable to more general similarity search optimisation than previous techniques.

4.2 Related Work

Indyk and Motwani (1998) introduced an approximate solution for nearest-neighbour search based on locality-sensitive hash functions. These hash functions are constructed in such a way that hash-codes are likely to be equal (that is, to collide) if items are similar, and unlikely to be equal if they are dissimilar. As a result, collections of these hash functions give a high likelihood of colliding with most similar items, while colliding with relatively few dissimilar items. A detailed introduction to LSH can be found in Chapter 2, Section 2.4.

As hash-codes are discrete, locality-sensitive hash functions effectively divide the dataset space into different sections. If the data is well distributed, points can often have neighbours residing in other nearby sections of the data (and hash-code) space. Hash codes are often composed of multiple conditionally independent single-bit hash functions, which jointly act to improve the chance of finding similar neighbours.

Since the aim is to find as many of the true nearest neighbours as possible, a method of searching the hash-codes which are nearby in data space is often used to improve recall. This reduces the number of hash-functions (and hence memory) needed for search structures. We use two methods in this work, which are outlined briefly in this section.

One of the simplest multiple hash-code query methods is to search hash codes within an increasing radius from the original query. To do this, each hash-code is treated as a vector of {0, 1} values for the purposes of distance calculation. Since each bit conditionally separates neighbours from non-neighbours, querying an increasing distance in hash-code space corresponds to looking at the hash codes which have the highest probability of containing more neighbours than non-neighbours. Since the search is radial and is applied to the {0, 1} space, it is usually termed Hamming ball search. This type of search fits well with the distance optimisation criterion used by Gao et al. (2014), as the assumption that similar hash codes contain neighbours is explicitly optimised for.

A more recent query method with strong performance is LSH-forest (Bawa et al., 2005). This method treats the length of the maximum matching prefix of a pair of {0, 1} hash codes as the measure of their similarity, rather than basing similarity on Hamming distance. Within the framework of single-bit concatenated hash functions, this interpretation is still correct. However, LSH-forest will only search a subset of the Hamming ball for each radius. Despite lower search coverage, LSH-forest has been shown to outperform plain Hamming ball search in terms of practical query speed. A practical comparison of these query methods can be found in Chapter 2, Section 2.4.2. RDHF has not been tested with LSH-forest queries. Since RDHF minimises the Hamming distance between true neighbours jointly over the bits in the hash-code, the gains from using RDHF optimisation may have less effect for this query method.

4.2.1 Locality-sensitive Hash-Function Families

Various families of locality-sensitive hash-functions have been proposed in the literature. The most commonly used approach is Gaussian Random Projections (Kushilevitz et al., 2000). It is often used as a baseline approach for comparison with other methods (Liu et al., 2013, Wang et al., 2012). In Gaussian Random projections the data is mean-centred and then projected onto a random axis. The data on this axis is then thresholded at 0 to produce a single bit of the locality-sensitive hash code. Multiple projections and thresholds are used to form a multi-bit locality sensitive hash code. Other examples of random hash function families include Shift-Invariant Kernel Hashing (SIKH Raginsky and Lazebnik (2009)), and hashing based on P-Stable Distributions (Datar et al., 2004). For these hash-function families, minimal prior knowledge is required to use them successfully. This information can usually be gathered quickly from a small sample of the dataset.

More advanced hash function families define projections based on a sampled data distribution, rather than defining projections randomly. These include PCA dimensionality reduction with Random Rotations (Gong and Lazebnik, 2011), Spectral Hashing (Weiss et al., 2009), and Product Quantisation (Jegou et al., 2011b).

Recent approaches have also considered locality-sensitive hashing as a classification problem. In this setting, thresholds and projections for hash functions can be analysed and optimised with respect to a sample set of neighbour and non-neighbour pairs of data items. This allows the hash function to be fitted more exactly to the particular observed data distribution. Hash-function generation algorithms employing this approach include Variable Bit Quantisation (Moran et al., 2013) and Boosted Similarity Sensitive Coding (Shakhnarovich, 2005).

Sequential Projection Learning for Hashing (SPLH) was developed by Wang et al. (2012) as a method for sequentially learning complementary projections from sample data to construct hash-functions. This results in a hashing algorithm similar to random projection hashing, with the important difference that projection data is learned from sample neighbour and non-neighbour pairs of training data. A regularisation term composed from a larger set of unlabelled training data is also used in the optimisation step. The authors state that due to the limited number of data pairs that can be processed by sparse sampling of the labelled training data, regularisation using other unlabelled data is necessary for good performance.

4.2.2 Optimising Locality Sensitive Hash Collections

In addition to applications in generating single hash functions, classification-based optimisation also allows boosting approaches to be used. Boosting algorithms can then construct multiple complementary hash functions by updating the classification weights of data pairs after including each new hash function. Using boosting allows hash function collections to retrieve true neighbours with higher efficiency compared to randomised hash function collections. This in turn reduces storage space and lookup costs for LSH queries after the optimisation is completed.

A number of methods for optimising collections of locality sensitive hash functions have recently been developed based on the boosting framework described by Freund et al. (1996) and Schapire (2003). As boosting is based around classification, locality sensitive search must first be rephrased as a classification problem in order to use the boosting framework. In this section we explore how locality sensitive search is related to the classification problem, and outline previous solutions to the locality sensitive search collection optimisation problem.

Consider a set Q of queries and a database D of items. Each query in Q has some associated set of k-nearest neighbours in D, which we aim to retrieve. Locality sensitive search can be posed as a multi-label classification problem with |Q| classes and |D| training examples. Multi-label classification defines a training label matrix M of size |Q| × |D|. In the label matrix, Mᵢⱼ = 1 if data item Dⱼ is a k-nearest neighbour of query Qᵢ, and Mᵢⱼ = −1 otherwise.

Given a set of training data D, we can maximise the amount of labelled training data available by treating every point as both a query and a part of the database to search. This is done by setting Q = D, giving a label matrix with |D|² elements. Boosting normally updates an individual weight for each label in M each time a new classifier is constructed. This gives boosting-based optimisation a minimum of n² runtime and memory consumption for a training set of size n. For small data-sets, this does not pose much trouble. For locality sensitive search, however, even common benchmark datasets can be of size 1 million, with training sets of 100,000 or more (Jegou et al., 2011b). Due to the sizes of training data involved, all methods we review in this section sparsely sample the label matrix M to keep computation time and memory usage reasonable.

Li et al. (2011) applied this technique to construct boosted search forests (BSF). Under this scheme, a stage-wise boosting procedure was used to optimise the sample weights over the collection of trees. A specialised tree-construction algorithm was able to use the boosting weights to construct each search tree. While the boosting algorithm had a bias towards higher precision, the tree construction algorithm was designed with a bias towards recall. As the two algorithms were designed to work together, it is expected that transferring one to a different domain (such as non-tree hash functions) without the other will weaken performance.

Data Sensitive Hashing (DSH), developed by Gao et al. (2014), combines a stage-wise boosting procedure with a sequential projection learning procedure somewhat similar to the one developed by Wang et al. (2012). This method outperforms random projection LSH, and is focused on achieving higher recall per hash-function rather than improving the precision of hash-functions.

Reciprocal Dominant Hash Functions (RDHF) were proposed by Liu et al. (2013) as a general solution for optimising over pre-generated parts of hash functions. Unlike approaches such as BSF and DSH, RDHF is able to operate on hash parts generated from arbitrary hash-function families. This provides a more powerful optimisation method, as different families of LSH functions can potentially be mixed together to improve performance.

RDHF (Liu et al., 2016, 2013) optimises over pair data, attempting to maximise the hash-function Hamming distance of non-neighbour pairs, and minimise the Hamming distance of neighbour pairs, when constructing the hash function collection. The optimisation step finds the hash function parts which separate the k nearest neighbours of a query by as small a Hamming distance as possible. At the same time, RDHF attempts to separate non-neighbours of a query by as large a Hamming distance as possible. A re-weighting stage between the selection of each hash function allows the algorithm to focus on unretrieved neighbours and minimise false positives over multiple tables. A disadvantage of this procedure is that the calculations involved require large matrices as the labelled training set size and number of input hash functions are increased. This can lead to memory and compute-time limitations as the input data sizes are scaled up for large data-sets.

4.3 Approximate Boosting For Hash Function Collections

The exact boosting approach for hash-function collection optimisation requires O(n²) computation time and memory usage. This is a result of labelling and processing the input data as a set of all pairs. All of the methods reviewed rely on sparse sampling of the labelled data pairs to reduce the computation time on larger data-sets. Sparse sampling provides a good approximation for small data sizes, but can become increasingly inaccurate as the size of the training database is increased. The objective in this section is to develop a general modification to the boosting framework which is capable of using information from all data pairs in the training set without incurring the O(n²) computation penalty associated with traditional boosting methods.

As an alternative to the commonly used sparse sampling approach, we propose the use of summary statistics to model the contribution of the distribution of non-neighbour pairs. To do this, non-neighbour training pair weights are grouped together under a single representative statistical average, w. This statistic could be calculated per training query, or as a single value across all queries. In either case, the memory cost of all negative labels is reduced to a maximum of O(n) in the per-query case. This leaves the remaining data labels as a sparse matrix M+ which contains only k neighbour entries per row. The memory usage for M+ and its associated weight matrix W+ is now O(nk). Since the number of neighbours, k, is usually fixed relatively low (Shahbazi and Zhang, 2011, Tenenbaum et al., 2000), we can regard the overall memory usage as linear in n. This reduces the memory cost of boosting from quadratic to linear, allowing much larger datasets to be considered while still including hash collision information from all training pairs.
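The data layout this implies is small; the sketch below is illustrative only (the per-query averaged negative weight and the update rule are our simplifications, not the exact scheme evaluated later):

```python
import numpy as np

class ApproxBoostWeights:
    def __init__(self, n, k):
        self.W_plus = np.ones((n, k)) / (n * k)  # O(nk) individual neighbour-pair weights
        self.w_bar = np.ones(n) / n              # O(n) averaged non-neighbour weight per query

    def update(self, query, missed, step=1.0):
        # Up-weight the neighbours the latest hash function failed to retrieve;
        # decay the pooled non-neighbour weight for this query.
        self.W_plus[query, missed] *= np.exp(step)
        self.w_bar[query] *= np.exp(-step)
```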

It is important to note that for the purposes of boosting in a classification setting, approximating the negative boosting weights by w is likely to degrade precision somewhat. This is because there is no explicit weighted penalty for the same point being misclassified many times. In the case of hash-function collections, however, this has much less impact on the final results. Severe misclassification of single points will at worst slightly reduce the effectiveness of some collision-counting re-ranking strategies. Ultimately, when distances are calculated for the final candidates, the severely misclassified candidates can be rejected. This stands in contrast to classification-based boosting, where there is no ground truth "distance test" to determine membership after the results have been determined.

The performance implications of approximating negative weights are specific to each algorithm, and so an assessment of the potential speedup depends on the structure of the algorithm. We will examine two common optimisation strategies which cover approaches used in the literature. The first (and most common) is the calculation of weighted data covariances over the training data, incorporating the label weights stored in the matrix W (Gao et al., 2014, Liu et al., 2013, Wang et al., 2012). This generally uses an eigendecomposition of a weighted covariance matrix to calculate candidates and/or projections for each successive iteration. Forms of this method are used in SPLH, DSH, and RDHF (which uses an iterative method rather than eigenvectors). Since all negative examples always share the same weight, the negative-label part of the matrix can be pre-calculated, then simply multiplied by w and added to the positive-label calculations for each boosting iteration. Using our proposed approximation reduces computation time from O(mn^2) to O(n^2 + mn) for the creation of m complementary hash-functions, saving a significant multiplier on the least scalable part of the algorithm.
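A minimal sketch of this precomputation trick is given below, using dense NumPy arrays for clarity (in practice the positive-label matrix is sparse); the function names and the exact form of the all-pairs term are illustrative rather than taken from SPLH, DSH or RDHF.

import numpy as np

def precompute_negative_term(Y, M_pos):
    # Y: (total_bits, n) matrix of hash bit outputs; M_pos: (n, n) indicator of neighbour pairs.
    # The contribution of every pair minus that of the neighbour pairs leaves the non-neighbour term.
    col_sums = Y.sum(axis=1, keepdims=True)
    all_pairs = col_sums @ col_sums.T            # Y J Y^T with J the all-ones matrix
    return all_pairs - Y @ M_pos @ Y.T

def iteration_statistic(Y, W_pos, w_neg, V_neg):
    # The O(n^2) negative part V_neg is computed once and reused, so each of the
    # m boosting iterations only pays for the sparse positive part plus a scaling.
    return Y @ W_pos @ Y.T - w_neg * V_neg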

A second interesting potential optimisation concerns boosting general similarity search algorithms. In the case where the boosting algorithm can only access the hash-function search results of a query q, only the positive examples need to be tested for specific weight assignments. This gives weight updates a time complexity of |q| log(k), using a tree-based set to check for and find the positive examples within the query. Once the positive examples q+ are found, the negative examples can be counted using the shortcut |q| - |q+|. There are only k + 1 weight updates to perform in the worst case after finding the true neighbours, and this will still be upper bounded by |q| log(k). For all training labels, this results in a complexity of O(n |q| log(k)). Neither |q| nor k are directly proportional to n, although one could say very roughly that we aim for |q| proportional to log(n) (Indyk and Motwani, 1998, Kushilevitz et al., 2000). This is interesting, as the computational cost for a single iteration of boosting a similarity search collection has been decreased from O(n^2) to O(n log(n)) (assuming log(k) is small enough to be considered constant). This does not include the construction time for the search structures being considered, which in some cases might make the total execution time greater than the guaranteed optimisation complexity of O(n log(n)).
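The set-based update can be sketched as follows (illustrative Python; a hash set is used in place of the tree-based set mentioned above, which replaces the log(k) factor with an expected constant but does not change the overall idea).

def update_from_query(results, true_neighbours, w_pos):
    # results: list of item ids returned for one training query;
    # true_neighbours: set of the k labelled neighbour ids;
    # w_pos: dict mapping neighbour id -> importance weight.
    retrieved_pos = [r for r in results if r in true_neighbours]
    num_neg_retrieved = len(results) - len(retrieved_pos)   # the |q| - |q+| shortcut
    for r in retrieved_pos:
        w_pos[r] = 0.0   # at most k individual updates; the shared negative weight is updated once by the caller
    return retrieved_pos, num_neg_retrieved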

The benefit of reduced memory usage and computation time comes at the cost of accuracy in estimating non-neighbour importance weights. As we have argued, this should have a minimal effect on LSH-based similarity search due to the nature of the algorithm. In the following sections we make concrete measurements of the performance of an existing algorithm under our approximate boosting framework. We also give a more general approximate boosting algorithm which can be applied over any set of similarity-sensitive search structures, using the O(n log(n)) boosting complexity method we have described here.

4.3.1 Approximate Boosting for RDHF

In this section we outline the RDHF algorithm used by Liu et al. (2013). We then explain the modifications required to use the algorithm in our approximate boosting framework, and compare the pseudocode of Approximate RDHF with the original. This process demonstrates the ease of adapting existing solutions to use approximate boosting.

Algorithm 8 (Exact RDHF) gives the pseudocode for the RDHF algorithm following the description of Liu et al. (2013). Blank lines have been left to align Algorithm 8 with the Approximate RDHF pseudocode shown by Algorithm 9. We now discuss the relevant sections of the RDHF algorithm based on the description given by Liu et al. (2013) and the adaptation of RDHF to approximate boosting.

RDHF operates on an input data-set of n items with a collection of single-bit hash-functions represented in the matrix Y. The values y_ij ∈ Y are encoded as 1 if hash function i returns 1 for data item j, and -1 if it returns 0. RDHF draws a collection of s hash functions with b bits in each from Y. This means Y must be at minimum sb × n in size, although the number of hash functions in Y may be larger to allow better optimisation possibilities for the algorithm. The use of the hash bit matrix Y also allows RDHF to optimise over any type of hash-function, giving it great flexibility.

Algorithm 8 Exact RDHF
Require: Hash-function bit labels: Y, Data labels: M, Target hash bit size: b, Target hash collection size: s
 1: Hashes ← {}
 2: W ← M
 3:
 4:
 5: for all a_ij ∈ A do
 6:     a_ij ← exp(-Σ_{y_i} Σ_{y_j} p(y_i, y_j) log[ p(y_i, y_j) / (p(y_i) p(y_j)) ])
 7: end for
 8:
 9: for i = 0 ... s do
10:     π ← exp(diag(Y W Y^T))
11:     Â ← diagmat(π) A diagmat(π)
12:     Remove indices in Hashes from Â
13:     Hashes ← Hashes ∪ GetCandidate(Â, b)
14:     for all d_ij ∈ D do
15:         d_ij ← min_{h ∈ Hashes} Σ_{k ∈ h} ‖Y_ki - Y_kj‖ / 2
16:     end for
17:
18:     P ← D - mean(d_ij where m_ij ∈ M > 0)
19:     α ← ln( Σ_ij p_ij [p_ij m_ij < 0] / Σ_ij p_ij [p_ij m_ij ≥ 0] )
20:     for all w_ij ∈ W do
21:         w_ij ← w_ij exp(-α m_ij p_ij)
22:     end for
23:
24: end for
25: return Hashes
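For reference, the mutual information entries that populate the matrix A (line 6 of Algorithm 8) can be computed directly from the bit outputs. The short Python sketch below is illustrative only; it omits the exp(-·) wrapping and any weighting used in the original implementation.

import numpy as np

def pairwise_mutual_information(bits):
    # bits: (h, n) array of 0/1 outputs, one row per single-bit hash function.
    h = bits.shape[0]
    A = np.zeros((h, h))
    for i in range(h):
        for j in range(h):
            mi = 0.0
            for vi in (0, 1):
                for vj in (0, 1):
                    p_ij = np.mean((bits[i] == vi) & (bits[j] == vj))
                    p_i = np.mean(bits[i] == vi)
                    p_j = np.mean(bits[j] == vj)
                    if p_ij > 0:
                        mi += p_ij * np.log(p_ij / (p_i * p_j))
            A[i, j] = mi
    return A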

The RDHF algorithm uses a sparse n × n label matrix M, with the non-empty values m_ij ∈ M having a value of 1 when i and j are a neighbour pair, and -1 otherwise. Each row of M is sparsely sampled, and has k positive entries and 2k negative entries. The matrices D and W are sparse with identical entries to M. The value d_ij ∈ D encodes the minimum Hamming distance between data items i and j over the hash collection which has been generated so far. The value w_ij ∈ W encodes the importance weight for classifying the pair correctly in the next boosting iteration.

Algorithm 9 Approximate RDHF
Require: Hash-function bit labels: Y, Data labels: M, Target hash bit size: b, Target hash collection size: s
 1: Hashes ← {}
 2: W ← M
 3: V ← Y^T Y - Y M Y^T
 4: w_N ← 1
 5: for all a_ij ∈ A do
 6:     a_ij ← exp(-Σ_{y_i} Σ_{y_j} p(y_i, y_j) log[ p(y_i, y_j) / (p(y_i) p(y_j)) ])
 7: end for
 8:
 9: for i = 0 ... s do
10:     π ← exp(diag(Y W Y^T - w_N V))
11:     Â ← diagmat(π) A diagmat(π)
12:     Remove indices in Hashes from Â
13:     Hashes ← Hashes ∪ GetCandidate(Â, b)
14:     for all d_ij ∈ D do
15:         d_ij ← min_{h ∈ Hashes} Σ_{k ∈ h} ‖Y_ki - Y_kj‖ / 2
16:     end for
17:     P ← D - mean_{i=1..10, j=0..n}( min_dist(Y_j, Y_i) )
18:     α ← ln( Σ_ij p_ij [p_ij m_ij < 0] / Σ_ij p_ij [p_ij m_ij > 0] )
19:
20:     for all w_ij ∈ W do
21:         w_ij ← w_ij exp(-α p_ij)
22:     end for
23:     w_N ← w_N · exp(α mean(p_ij where m_ij = 0))
24: end for
25: return Hashes

The square matrix A holds the pairwise mutual information between each of the single-bit hash-functions in Y. This is combined with the value π in line 10 of Algorithm 8. π is derived from the W matrix. The combined value, Â, is then treated as a weighted graph matrix.

Both RDHF and Approximate RDHF solve the bit selection problem using an iterative solution for the dominating set problem. This is shown in Algorithm 10. This converges to a solution for the current Â matrix, which is square and the same size as the number of hash bit functions remaining to be chosen in Y. As the size of the arguments in the iterative method does not depend directly on the label matrix, this part of the algorithm will not benefit from approximate boosting. Therefore, if Algorithm 10 dominates the running time of the algorithm, approximate boosting may not give significant improvements.

Algorithm 10 RDHF Bit Selection
Require: Weighted bit matrix: Â, Target number of bits: b
 1: Candidates ← {}
 2: while Candidates

In approximate RDHF, the label matrix M is reduced to contain only true-neighbour pairs, rather than sampling from both neighbours and non-neighbours. The contribution of the non-neighbour sample distribution V can then be efficiently pre-calculated using the expression on line 3 of Algorithm 9. This contribution is also assigned an average weight value wN = 1. V is applied during the optimisation step on line 9 of Algorithm 9, allowing a wN-weighted contribution from all non-neighbour terms to influence the selection of hash-function bits. In contrast with RDHF, we have removed the need to calculate individual non-neighbour contribution values at each iteration, which removes a potentially expensive series of matrix multiplications. Additionally, we have included information from all non-neighbour pairs in the optimisation, rather than only 2k for each labelled item as in RDHF.

Updating the weight value wN for approximate RDHF requires the estimation of the average minimum distance of non-neighbour pairs in the current hash function collection. Assuming that the number of true neighbours for each query is an insignificant portion of the dataset, the mean non-neighbour distance can be approximated as the mean minimum Hamming distance of the rest of the training data from a query point. We average the mean Hamming distance over several queries to ensure robustness. For these experiments, 10 queries are used to set the mean non-neighbour distance. The Hamming distances calculated are also used to estimate the number of non-neighbours with update weights larger or smaller than 0 for the calculation of the α update factor in the RDHF algorithm.
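A small Python sketch of this estimate is shown below (names are illustrative; hash codes are assumed to be stored as 0/1 bit arrays grouped into n_hashes tables of b bits each).

import numpy as np

def estimate_mean_nonneighbour_distance(query_bits, data_bits, n_hashes, b, n_queries=10):
    # query_bits: (q, n_hashes * b) bits of sampled queries; data_bits: (n, n_hashes * b).
    estimates = []
    for q in query_bits[:n_queries]:
        diffs = data_bits != q                                    # bitwise disagreement
        per_table = diffs.reshape(len(data_bits), n_hashes, b).sum(axis=2)
        estimates.append(per_table.min(axis=1).mean())            # min over tables, mean over items
    return float(np.mean(estimates))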

An interesting observation about our construction of ARDHF is that the summary negative example matrix V is similar in form to the regularisation term used by Wang et al. (2012). The key difference is the introduction of a boosting weight wN for the term V. Incorporating boosting feedback allows the algorithm to have greater control over the performance and selection of hash codes, compared to a static regularisation term.

Separating negative examples out as a pre-calculated value also reduces the computation and memory cost associated with boosting. For approximate RDHF, only positive sample weights are stored and updated at every boosting step. In comparison, Liu et al. (2013) use a sparse sampling of k positive samples and 2k negative samples for each query. Approximate RDHF should therefore give a linear compute time reduction over the sparse sampling method used in RDHF, but hopefully without an associated performance penalty as all negative sample weights are still considered during optimisation.

4.3.2 Simple Boosted Hash-Function Collections

While approximation can be applied to existing locality-sensitive boosting algorithms, there are still some limitations in doing so. The RDHF algorithm should improve in run-time, and may also generalise better due to the inclusion of more example data. However, RDHF only targets collections of divisible hash functions. Additionally, the effectiveness of RDHF is still limited to traditional locality sensitive hashing assumptions. These assumptions are that the value of each bit in the hash is not dependent on other bits, and that the query methods used (such as Hamming ball search) value each hash bit equally in terms of the division of data. More recent search structures such as data dependent hashing (Andoni and Razenshteyn, 2015) and LSH-forest (Bawa et al., 2005) do not adhere to these assumptions, and so the performance of RDHF for these methods may suffer.

Table 4.1: Comparison of Locality Sensitive Search Optimisation Algorithms

Algorithm                   Search Structure
BSF (Li et al., 2011)       Binary trees
DSH (Gao et al., 2014)      Single-bit Quantisation Hashing
RDHF (Liu et al., 2013)     Any separable Hash-Function
SimpleBoost (Ours)          Any Search Collection

To address these issues, we propose a simple boosting algorithm which operates over a collection of pre-generated approximate locality sensitive search structures. Boosting calculations are based on the performance of query results obtained from each locality sensitive search structure. Including the query mechanism in this way abstracts the underlying data structures so that the algorithm can work on trees, hash functions, or other approximate search structures. We will label this algorithm SimpleBoost (listed in Algorithm 11). Table 4.1 shows a comparison of hash-function collection optimisation methods and their abilities. Where previous methods have been applicable to a specific type of search structure, SimpleBoost requires only a varied collection of approximate search structures as input.

The SimpleBoost algorithm takes a number of input search structures h_i ∈ H, and returns a selection of S ⊆ H complementary search structures as a result. These structures are chosen greedily in an iterative fashion, with each structure trying to decrease the error score of the previous selections. This method is inspired by the Boosting framework given by Freund et al. (1996). We detail only the approximated negative weights version of SimpleBoost here, as it is the most interesting and efficient.

A weight matrix WP contains the importance weights of the positive examples. As with approximate RDHF, the negative examples are summarised into a negative weight value wN. Unlike RDHF and other classification-based boosting schemes, SimpleBoost assumes that returning an item in a query is a definite positive result. This means that an item only has an impact on recall and precision values the first time it is retrieved in a locality sensitive collection. Practically, this assumption allows SimpleBoost to set the boosting importance weights of all items retrieved so far to 0. Setting the weights to 0 informs the algorithm that retrieving an item a second time has no extra benefit or cost. This change to the update steps also means that there is no explicit calculation of the boosting weight update parameter α, since weights are either multiplied by 1 or 0 depending on whether they have been retrieved during the query.

Table 4.2: Evaluating SimpleBoost objective functions: normalised area under the precision-recall curve for SimpleBoost using various objective functions. Larger is better; standard deviation error is less than ±5%. Totals are indicative of the best average performing objective.

Mixture Type       Weighted Mean            Harmonic Mean            Geometric Mean
Precision:Recall   1:2     1:1     2:1      1:2     1:1     2:1      1:2     1:1     2:1
GIST-1M            0.818   0.810   0.835    0.897   0.944   0.964    0.857   0.944   1.0
SIFT-1M            0.678   0.968   1.0      0.530   0.554   0.631    0.482   0.643   0.907
CIFAR-10           1.0     0.954   0.956    0.818   0.883   0.864    0.902   0.913   0.996
LabelMe            0.823   0.801   0.834    0.927   0.985   0.967    0.950   1.0     0.968
Total              3.319   3.533   3.624    3.172   3.367   3.426    3.191   3.5005  3.871

Setting the weights of neighbour items retrieved in WP to 0 is relatively straightforward, and further optimisations of this could be investigated to reduce computation time by removing the examples altogether. The summary of non-neighbour items retrieved so far, wN, should be considered as a statistical property of the search function being considered. To do so, the observed mean error rates of non-neighbour recall are used to update the importance weight proportionally. Suppose the first locality sensitive search structure, h1, retrieves a fraction p1 of all non-neighbour items. Then the negative importance weight is updated to wN ← wN (1 - p1).
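The bookkeeping for one boosting step can be sketched as follows (illustrative Python; W_pos, w_neg and the total non-neighbour count are assumptions about how the quantities above might be stored, not the thesis implementation).

def simpleboost_update(W_pos, w_neg, retrieved, true_neighbours, total_non_neighbours):
    # W_pos: dict mapping neighbour id -> importance weight;
    # retrieved: set of items returned by the newly selected search structure.
    for item in retrieved & set(W_pos):
        W_pos[item] = 0.0                            # retrieved neighbours carry no further weight
    non_neighbours_hit = len(retrieved - set(true_neighbours))
    p = non_neighbours_hit / float(total_non_neighbours)
    w_neg *= (1.0 - p)                               # w_N <- w_N (1 - p)
    return W_pos, w_neg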

In optimising locality sensitive search structures, previous approaches have used several possible criteria for an objective function. Liu et al. use a Hamming distance error margin across all pairs, although this can be expensive to compute (Liu et al., 2013). Moran et al. (2013) use the harmonic mean of precision and recall. Li et al. (2011) optimise recall at the boosting level of BSF and precision at the tree construction level. A common theme in many studies is to use a combination of precision and recall for optimising locality sensitive search structures. Since a variety of optimisation objectives have been proposed and appear to work in the literature, we test several combinations of precision and recall using various weighted mean measures. We are aiming to determine a robust target objective which does not require tuning to achieve good performance across a variety of datasets.

Weighted versions of the mean, harmonic mean, and geometric mean of recall and precision were evaluated as boosting objective functions. The area under the precision-recall curve is given in Table 4.2, with each dataset normalised to the best performing objective function. The normalisation equalises the contribution of each dataset toward deciding the best function. The totals at the bottom are indicative of overall performance across the 4 datasets. The parameters used are optimised collections of 100 hash functions chosen out of 1000 candidates at 32 bits each, with the area under the curve calculated by the order in which candidates are retrieved from the LSH collection. The Hamming search radius is set to 2. For more details on the datasets and area-under-the-curve evaluation method, see the descriptions in Section 4.4. Using this method, we determined that the 2:1 precision:recall weighted geometric mean was the most robust optimisation objective to changes in dataset. Taking advantage of this, the SimpleBoost error objective was set to 1 - Rec · Prec^2, where Rec and Prec are the importance-weighted recall and precision respectively for the current boosting iteration.
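As a concrete illustration, the selected error objective can be computed from importance-weighted counts as in the sketch below (the argument names are illustrative; how the weighted counts are accumulated depends on the search structure being scored).

def simpleboost_error(weighted_true_pos, weighted_false_neg, retrieved_count):
    recall = weighted_true_pos / (weighted_true_pos + weighted_false_neg + 1e-12)
    precision = weighted_true_pos / max(retrieved_count, 1)
    return 1.0 - recall * precision ** 2             # 2:1 precision:recall weighting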

Algorithm 11 Approximate SimpleBoost
Require: Candidate functions: H, Initial classification weights: W_P, Summary error weight: w_N, Query set: P, Target hash collection size: s
 1: H_2 ← {}
 2: for i = 0 ... s do
 3:     Score ← -1
 4:     Candidate ← 0
 5:     for h_j ∈ H do
 6:         cScore ← GetScore(h_j, W, P)
 7:         if cScore > Score then
 8:             Score ← cScore
 9:             Candidate ← j
10:         end if
11:     end for
12:     H_2 ← H_2 ∪ h_Candidate
13:     T_P, F_P, T_N, F_N ← Classify(h_Candidate, P)
14:     Divisor ← Σ_{w ∈ W_P[F_N]} w + |T_N| · w_N
15:     W_P[T_P] ← 0
16:     W_P[F_N] ← W_P[F_N] / Divisor
17:     w_N ← |T_N| · w_N / (TotalNegExamples · Divisor)
18: end for
19: return H_2

For SimpleBoost, there are two parameters governing overall performance: the number of

LSH function candidates to select from; and the number of labelled training examples to use. Other hash-function related parameters such as number of bits and query radius can be chosen by finding suitable parameters for random hash-function collections and using those. When compared to DSH, RDHF, and BSF, SimpleBoost does not have any extra parameters requiring tuning for good performance.

The cost of this simplicity is that SimpleBoost does not optimise below the whole hash-function level. With such coarse optimisation, SimpleBoost may require many more hash function candidates than RDHF in order to achieve the same performance. For this reason, SimpleBoost can have worse performance than RDHF on hash function families that are expensive to instantiate, such as Locally Optimized Product Quantization (Kalantidis and Avrithis, 2014).

4.4 Methodology

This section describes the protocols for evaluating performance of SimpleBoost and Approximate RDHF. Random hash-function collections and RDHF optimized collections are used as baseline algorithms for comparison. All data shown is the average performance over 10 experiments. In order to represent real-world results, large image data-sets of one million elements are first tested. These have a dimensionality of between 128 and 960. An extended image data-set of eighty million images and a dimensionality of 384 is then tested to examine the scalability of both solutions. This allows us to test the performance and generalisation of the SimpleBoost and RDHF algorithms as data size increases.

Li et al. (2011) state that training on small sample sizes carries a risk of over-fitting, which reduces search performance. To investigate this problem, we also test SimpleBoost with varying training data sizes. These results will be used to establish whether over-fitting is present and what amount of sample data might be needed to achieve good generalisation.

Both algorithms are tested using PCA-RR (PCA with random rotations (Gong and Lazebnik, 2011)) and RPLSH (random projections (Kushilevitz et al., 2000)). For the orthogonally constrained PCA-RR hash family, it is expected that SimpleBoost may be more competitive as there is a higher chance of generating suitable hash-functions. As RDHF is able to optimise on a per-bit level, it is expected to achieve much better boosting performance than SimpleBoost on less constrained hash families such as RPLSH. As the size of training data grows, SimpleBoost should be more time-efficient than RDHF.

Candidate hash function collections are constructed for each data-set randomly. The GIST and SIFT one million size BigANN data-sets are used for evaluation (Jegou et al., 2011b, Siagian and Itti, 2007). These data-sets are 960 dimensions and 128 dimensions respectively, providing a challenge for compressed representations of data such as hashing. The BigANN data-sets have been created as a common benchmark for state-of-the-art locality sensitive search algorithms, and allow a standard comparison of the relative performance of algorithms for large-scale search. These datasets provide testing sets with ground-truth neighbours for the Euclidean distance measure, which we use here.

A collection of 50 32-bit hash functions is generated using each method. A fixed-radius Hamming ball search is used to query the closest 2000 distinct items in the hash function collection for 1000 test queries which are held out of the main dataset. Both SimpleBoost and RDHF use a training dataset of 100,000 points, with true neighbours labelled for 20,000 queries. RDHF uses 2,000 candidate bits to generate its hash functions, while SimpleBoost selects from a pool of 1000 randomly generated hash functions. The SIFT-1M dataset is queried with a Hamming radius of 2, while the GIST-1M dataset is queried with a Hamming radius of 3. The parameters for RDHF are fixed to the same as those used by Liu et al. (2013).

For all experiments, k_test = 100 nearest neighbours are used as the testing ground truth. However, LSH is designed to solve the ε-nearest-neighbour problem (Indyk and Motwani, 1998). This variation defines the true nearest neighbours as those within a ball of size ε around the query point. To ensure that the LSH algorithms are optimised according to the underlying assumptions of the LSH algorithm, the radius ε is calculated as the average distance to the furthest neighbour for sampled queries in the testing set. The number of training neighbours, k_train, is then selected to match the ε radius as closely as possible.

As an example, the BigANN SIFT 1M dataset uses k_train = 98, meaning that the 98th training neighbour and the 100th testing neighbour are on average a similar distance from any given query point. This method respects the definition of searching within a specific ε radius, even when the training data is much smaller in scale than the testing data.
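A small sketch of this matching procedure is given below, assuming pre-computed sorted neighbour distances for sampled queries (array names are illustrative).

import numpy as np

def match_k_train(test_dists, train_dists, k_test=100):
    # test_dists: (queries, >= k_test) sorted neighbour distances in the testing set;
    # train_dists: (queries, max_k) sorted neighbour distances in the training set.
    epsilon = test_dists[:, k_test - 1].mean()       # average distance to the k_test-th neighbour
    train_means = train_dists.mean(axis=0)           # average distance at each neighbour rank
    k_train = int(np.argmin(np.abs(train_means - epsilon))) + 1
    return k_train, epsilon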

RDHF and SimpleBoost were implemented in C++, with attention paid to optimising the performance of both methods. The implementation of RDHF was verified by comparing results using the original settings of Liu et al. (2013). The true neighbour definitions used in the original RDHF experiments are significantly more relaxed compared to those commonly used in reporting LSH performance (Kalantidis and Avrithis, 2014, Liu et al., 2013, Moran et al., 2013). As a result, there may be some deviation from the original reported performance with the more restricted number of nearest neighbours used in these experiments.

4.5 Results

4.5.1 Effects of Training Data Size

Before testing on larger data-sets, we evaluate the impact of training data size on the SIFT-1M data-set using varying numbers of training queries. Figure 4.1 shows the effect of increasing the proportion of hash function candidates. As more candidates are considered, we can see that performance improves to a level noticeably better than without training, even though the number of candidates considered is only a very small amount more than the minimum required.

Figure 4.1 also shows the impact of considering a larger number of hash-functions at each optimisation step. We can see that there is an "overfit" region where performance drops below that of the random baseline approach. However, as more hash-functions are considered, relative performance gains reduce until there is little noticeable improvement. At this point much more data is required for very small gains. In fact, the majority of accuracy gains are achieved in the first 500 or so labels. This could be a useful property as it allows good performance to be gained from a very cheap optimisation step.

Figure 4.1: Recall vs parameter settings for Sift-1M using RPLSH and PCA-RR, for varying numbers of hash candidates and labelled data queries. Higher recall is better.

4.5.2 Comparative Performance

Figure 4.2 shows relative performances on the Sift-1M task. The results show that SimpleBoost has somewhat worse recall than RDHF and ARDHF below query sizes of 2000. However, at a query size of 2000, SimpleBoost achieves similar recall levels to RDHF and ARDHF. This likely reflects the simplified optimisation method of SimpleBoost. Where RDHF and ARDHF selectively combine function bits to produce hash functions, SimpleBoost selects from a pool of randomly generated hash functions to maximise search within the given query parameters (in this case, a query size of 2000). RDHF and ARDHF in Figure 4.2 have very similar performance, indicating that although ARDHF is faster, there is little performance difference between the two.

Figure 4.2: Recall and Precision curves vs max query size for Sift-1M using RPLSH and PCA-RR. Collections are created randomly (Random), by our boosting method (SimpleBoost) or using RDHF. All Y axes are to the same scale.

The recalls in Figure 4.3 are significantly lower than for the SIFT dataset. Even under these conditions, SimpleBoost is able to improve results over the random projections baseline to a similar level to RDHF at 2000 neighbours. The precision values in Figures 4.2 and 4.3 show little difference between the different methods. This is possibly an indication that the achievable precision is more related to the expressiveness of the individual hash functions than to the ability of the optimisation algorithm.

Figure 4.3: Recall and Precision curves vs max query size for Gist-1M using RPLSH and PCA-RR. Collections are created randomly (Random), by our boosting method (SimpleBoost) or using RDHF. All Y axes are to the same scale.

Table 4.3 shows mean execution times for each algorithm and data-set, performed on an Intel Xeon E5-2643 with 128 GB of RAM. Both RDHF and SimpleBoost are implemented within the same C++ framework to minimize disparities, and both methods have been optimised heavily. When the query performance of the search structure is efficient, SimpleBoost is much faster. This difference is apparent in Table 4.3 with the jump in time for SimpleBoost from the SIFT dataset (Hamming distance 2 queries) to the GIST dataset (Hamming distance 3 queries). SimpleBoost also has the advantage of being able to be applied directly to LSH-forest query methods. Given the performance of LSH-forest at high recall levels, SimpleBoost may result in even larger performance gains for this method.

Table 4.3: LSH optimisation runtimes (in seconds)

Optimisation method   SimpleBoost   RDHF   ARDHF   Random
Sift-1M                       62s   398s    299s     0.4s
Gist-1M                      443s   435s    304s       4s

4.6 Chapter Summary

We have presented a novel approximate boosting framework for reducing the time and memory cost of boosting for locality sensitive search. We have demonstrated that this reduced cost does not impede the resulting performance of the boosted locality sensitive search collection. As a result, boosting can be performed over larger training datasets, and boosted collections can be constructed in less time.

Approximate boosting also provides possibilities for further investigation, including combining with other existing algorithms to improve scalability. In particular, SPLH (Wang et al., 2012) is a good candidate as approximate boosting can be used instead of its regularisation term.

We have also presented a simple, generalised approximate search boosting algorithm, SimpleBoost (Algorithm 11). We have demonstrated that it is able to compete with the state-of-the-art in terms of precision and recall at the levels it is optimised for. We have demonstrated that SimpleBoost can have practical advantages in settings where fast optimisation is required, as it produces positive results from very small numbers of candidates. By comparison, RDHF has a larger associated set-up cost, but produces somewhat better hash functions.

The SimpleBoost algorithm is also compatible with other locality sensitive search structures such as random forests. Further investigation is required to assess the possible advantage of boosting other locality sensitive algorithms.

Chapter 5

Improved Approximate Euclidean Search with Cover Trees

Cover trees for exact KNN graph search already provide strong results as a baseline to compare approximate methods against. Given this strength, it is natural to consider using an approximate method for cover tree search. When considering the practical use of approximate search, control over the trade-off between accuracy and time is often important in bounding the overall error of an algorithm.

Popular methods for approximate search on spatial trees are generally variations on leaf-limited or limited backtracking search (Auclair et al., 2007). The nature of cover trees in particular is to use a high branching factor with data points being evaluated at each tree node. Due to these properties, limiting the leaves searched or the backtracking steps only provides a very coarse approximation with significant variance in performance.

A 1+ε approximate search for a single nearest neighbour was suggested by the original authors of the cover tree algorithm (Beygelzimer et al., 2006). This consists of a descent down the cover tree until an early bailout approximation criterion is reached. This method guarantees approximate closeness, but not approximate queries for exact k nearest neighbours. Another alternative approximate search method is defeatist search (Liu et al., 2005), which is the most restrictive form of leaf and backtrack limited search (Auclair et al., 2007). Defeatist search involves descending the tree until the first leaf is encountered. This type

of search has no parameters, and so the trade-off between accuracy and query time cannot be changed. It is quite effective for 1-nearest-neighbour search; however, the accuracy quickly drops when more neighbours are considered, to the point where many more leaves need to be searched in order to find the relevant neighbours. The reason for lower recall as the number of neighbours is increased is that only a limited number of most-similar nodes are explored during the initial descent, which limits the number of close neighbours encountered to those few which are most similar.

By contrast, approximate search algorithms such as Locality-Sensitive Hashing (LSH) have several parameters which can be tuned to change the accuracy of queries with much more control. Some of these parameters have been explored in Chapters 3 and 4. However, tuning multiple parameters can make finding the most efficient combination more complicated. Efforts have been made to characterise and control these trade-offs in an easier way (Dong et al., 2008), resulting in a more robust approximate search algorithm.

In this chapter, we develop an approximate l2 Euclidean search method for cover trees which has easy, precise control over the trade-off between search time and accuracy. We show that our method performs significantly better than leaf-limited approximate search on cover trees. Our method also has a sound theoretical basis which can be used as a rough predictor of performance when tuning search accuracy. To achieve this we bound search accuracy by a scaling factor related to the probability of missing relevant data. This method allows performance bounds to be established, and provides exact and smooth control over the approximation trade-offs compared to leaf or back-track limiting. It is also easy to implement within the established cover tree query framework and creates little to no overhead compared to exact cover tree queries.

5.1 s-Approximate Cover Tree Search

Cover trees differ from other spatial tree search structures as each non-terminal tree node contains a set of children which aim to provide a uniform covering of the tree node's volume (Beygelzimer et al., 2006). This can be regarded as an approximately uniform distribution across the parent node's volume. This means that dividing an intersection volume by the parent node's volume gives a probability estimate of encountering a child of the parent inside that intersection volume. We can use this interpretation to estimate the probability of finding a relevant child for a query.

Suppose we wish to calculate the probability of finding a child, given a query point q and query radius r. We calculate the volume of the intersection between the d-dimensional hypersphere at q with radius r and the d-dimensional cover tree node hypersphere which contains all children (see Figure 5.2 for an illustration of this). We then divide this volume by the total volume of the current tree node to obtain a probability P(c | q, r, node) of finding a child c within r in the cover tree node. In order to ensure that exploration of the tree is performed fairly from a probabilistic point of view, at this point exploration should descend into the child node with probability P(c | q, r, node) to give a guaranteed behaviour over the various probabilities in the tree. However, this approach can slow down search because many extra random numbers must be generated in the course of searching the tree. At the risk of some systematic error, we use a rejection threshold P0 instead of a randomised number. In circumstances where all neighbours lie within low-probability query intersections of cover tree nodes this will fail to give good results, because the low-probability nodes will never be explored under our thresholding simplification. However, in general we find thresholding works well. To use simple thresholding we set a parameter P0, and then if the probability P is larger than P0, we descend into that cover tree node and evaluate the children.

In order to provide a lower bound estimate for approximate search on cover trees, two requirements must be met. First, we must establish that there is indeed an approximately uniform distribution of children for each cover tree node. Second, we also need to simplify the estimation of P(c | q, r, node) in order to compute it efficiently at run-time.

5.1.1 Children within the Cover Tree are Approximately Uniformly Distributed

The first assumption we make is that the distribution of child nodes within any cover tree node is approximately uniform. This is an important first step in establishing an estimated probability of finding relevant items. At an intuitive level, the approximate uniformity of a cover is obvious as there are minimum and maximum separation bounds for the child nodes (Beygelzimer et al., 2006). More specifically, we can calculate a maximal deviation from uniformly distributed points and show that it is relatively small.

Given a child-node covering of radius ε, cover trees are created so that no two nodes are closer than ε to each other. Additionally, assuming the data is well-distributed throughout the volume and there are enough samples, the resultant covering should have no holes. In other words, every point within the parent node's radius should be within ε of a child node. This places an effective upper limit on the separation of any two nodes since the furthest point in a cover can be at most ε from a child node centre. Because there are limits on the spacing of nodes, it is possible to calculate the minimum and maximum possible regular density for a cover tree node.

Figure 5.1: Most-dense and least-dense ε cover configurations in 2 dimensions.

Figure 5.1 illustrates the variation in density between the most-dense and least-dense regular packings possible for the cover. In the most-dense case, all nodes are at a distance of ε from their neighbours. In the least-dense case, every point is just within the covering and the nodes are more distant. These regular packings can be generalised to a covering in d dimensions, where the child node centres form the vertices of a regular n-simplex. The arrangement of nodes in Figure 5.1 results in an equilateral triangle between node centroids, so the spacing between nodes can be determined using trigonometric identities as twice the base length of the right-angle triangle with hypotenuse ε, and right-angle corner bisecting the edge from one node centre to another:

2\epsilon \left( \frac{\sqrt{3}}{2} \right) = \epsilon \sqrt{3} \qquad (5.1)

To find the maximum factor of variation in density from a uniform distribution, we divide this value by ε, giving a maximum error factor of \sqrt{3} \approx 1.73. In d dimensions, the geometric properties of the regular n-simplex (Stein, 1966) can be used to generalise the density variance equation as:

\max(\text{Density}) = 2\sin\left(\frac{1}{2}\cos^{-1}\left(-\frac{1}{d}\right)\right) \qquad (5.2)

for a given dimensionality d. The worst-case variation in density that we have calculated is relatively low, and decreases even further as dimensionality increases. Therefore, this should not significantly affect results since cover tree child nodes are chosen with a random distribution. However, as we can compute the worst-case difference in density for any given dimensionality, this can be used to calculate more accurate lower bounds on expected performance when intersection probabilities are found.

5.1.2 A Spherical Cap Bounds Intersection Volume

In order to bound the probability of finding a child in the intersection between a query search and a cover tree node, we first find the upper bound of P(c | q, r, node) for all q and r. Figure 5.2 illustrates the intersection between a query and a cover tree node. The intersection between the tree node and the query q with radius r is a lens-shaped region, with volume and shape dependent on both the tree node's radius, and the query's location and radius.

Figure 5.2: Approximate cover tree search for 1-nearest-neighbour: predicted lower bound for recall compared to actual recall for values of s.

As the radius of the query r approaches infinity (with the intersection depth staying the same), the curve of the intersected disk flattens and approaches the line b in Figure 5.2. So the value of P(c | q, r, node) can be simplified to estimating the volume of the spherical cap cut off at line b. If we normalise the cover tree node volume, we can replace P(c | q, r, node) with the conservative estimate P(c | s) ≥ P(c | q, r, node), where s is the distance from the cover node centre to the line b, divided by the cover node radius. This means that for each value s ∈ [0, 1], there is a fixed probability P_0 of finding a child within the intersected volume. In addition, for any pair of s values s_1, s_2, we can say that if s_1 > s_2, then P(c | s_2) > P(c | s_1) (this follows from the difference in spherical cap volumes). This gives a direct relationship such that we can scale the cover tree node's radius by some s_0 to obtain a rejection rate of its related P_0. More precisely, this establishes a mathematical relationship:

P(c \mid s) > P_0 \iff s < s_0 \qquad (5.3)

This relationship means that decreasing the value of s increases the rejection rate of nearest-neighbour candidates monotonically. The monotonic recall property makes s an ideal parameter for tuning approximation and speed trade-offs, since it establishes a predictable direction of change when s is changed. To derive an exact relationship between s and P(c | s), we use the simplified spherical cap equation given by Li (2011). This defines the volume of the spherical cap as:

V_{cap} = \frac{1}{2} V_0 \, I_{\sin^2\theta}\left(\frac{d+1}{2}, \frac{1}{2}\right) \qquad (5.4)

where θ is the half-angle of the cap from the centre of the hypersphere, V_0 is the volume of the hypersphere, d is the dimensionality of the hypersphere, and I_x(y, z) is the regularised incomplete Beta function (Pearson, 1968). We note that we have assumed a normalised volume, so the volume of the full hypersphere V_0 can be divided out of this equation immediately. Secondly, we note that since s is normalised by dividing by the cover tree node radius, s = cos(θ). The value sin^2(θ) can then be derived from s as sin^2(θ) = 1 - s^2. Substituting this value in and simplifying gives the equation:

P(c \mid s) = \frac{1}{2} I_{1-s^2}\left(\frac{d+1}{2}, \frac{1}{2}\right) \qquad (5.5)

Since this is a conservative upper bound on the true probability of encountering a relevant child within the spherical cap volume, this estimation of P should ensure that for a given value of s and dimensionality d, query accuracy is at least P, given a uniform distribution within the cover tree node. Factoring in the possible non-uniformity factor of 2\sin\left(\frac{1}{2}\cos^{-1}\left(-\frac{1}{d}\right)\right) which we derived, we can define the final worst-case probability bound by allowing the cap to be denser than the rest of the distribution by the worst-case factor. Renormalizing the probabilities allowing for the difference in density between the cap and the remainder of the sphere gives:

P_{worst}(c \mid s, d) = \frac{\text{Density} \cdot P(c \mid s)}{\text{Density} \cdot P(c \mid s) + (1 - P(c \mid s))}
                      = \frac{2\sin\left(\frac{1}{2}\cos^{-1}\left(-\frac{1}{d}\right)\right) P(c \mid s)}{1 + \left(2\sin\left(\frac{1}{2}\cos^{-1}\left(-\frac{1}{d}\right)\right) - 1\right) P(c \mid s)} \qquad (5.6)

So far, analysis has been done for the single cover tree node case, giving P_worst(c | s, d), the worst-case probability of ignoring a relevant child within the cover tree node. To extend this to a lower-bound prediction of overall behaviour, the probability of not ignoring the relevant data at each level of the cover tree must be considered. This can be calculated simply as the probability chain of the opposite event:

P_{found}(c \mid s, d) = (1 - P_{worst}(c \mid s, d))^m \qquad (5.7)

where m is the depth of the cover tree node containing the relevant item. This formula calculates the odds of consecutively selecting each relevant child node to reach a nearest neighbour at a depth of m in the cover tree. To approximate the search depth practically, we set m to be the mean tree depth over all the search data:

m = \frac{1}{n} \sum_{c \in \text{data}} \text{node\_depth}(c) \qquad (5.8)
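Assuming the reconstructed forms of Equations 5.2 and 5.5-5.7 above (including the -1/d argument of the inverse cosine), the lower bound can be evaluated numerically with SciPy's regularised incomplete Beta function, as in the sketch below.

import numpy as np
from scipy.special import betainc

def p_found(s, d, m):
    p_cap = 0.5 * betainc((d + 1) / 2.0, 0.5, 1.0 - s ** 2)       # Equation 5.5
    density = 2.0 * np.sin(0.5 * np.arccos(-1.0 / d))              # worst-case factor, Equation 5.2
    p_worst = density * p_cap / (1.0 + (density - 1.0) * p_cap)    # Equation 5.6
    return (1.0 - p_worst) ** m                                    # Equation 5.7

# Example with the MNIST values quoted in Section 5.3.1 (d = 6.3, m = 5.16):
print(p_found(s=0.9, d=6.3, m=5.16))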

5.1.3 Local Intrinsic Dimensionality Matters

So far we have assumed that the dimensionality of child distributions within cover tree nodes is uniform. In many datasets, this is not the case. An example of this is the Swiss Roll dataset, developed for testing manifold learning algorithms (Tenenbaum et al., 2000). This dataset is embedded into 3 dimensions, and is relatively well-distributed (a further discussion of this can be found in Chapter 2, Section 2.2.4). However, at smaller scales, the local intrinsic dimensionality of this dataset is only 2 dimensions. This means the root node of the cover tree will have children distributed in 3 dimensions, while nodes towards the leaves of the cover tree will have children distributed predominantly in a 2-dimensional plane. This change of dimensionality at different scales will affect the properties of our approximate search method, since we can expect nodes higher in the cover tree to contain children in a 3-dimensional distribution, while nodes lower in the cover tree contain children in an approximately 2-dimensional distribution.

This becomes a particular problem for the calculation of the probability P_found of finding a child, as it relies on the dimensionality of the distribution within the cover tree node.

This means that for a set value of s, P_found can be different at different scales, due to the changes in the intrinsic dimensionality of the child node distribution. Fortunately, lower-dimensional calculations of P_found will lower-bound higher-dimensional values of P_found. This can be easily seen as increasing dimensionality directly increases the first argument of the incomplete Beta function. Using this, we can say that the lowest intrinsic dimensionality encountered will be a lower bound for the performance of our approximation on any particular dataset.

Practically, it will suffice to choose the minimum local intrinsic dimensionality as a lower-bounded estimate of P_found, since it will give a more conservative estimate of the likelihood of ignoring data. Amsaleg et al. (2015) provide robust methods and estimates for local intrinsic dimensionality across a range of datasets. These methods rely on distributional data around local neighbours for each point to estimate the number of important orthogonal dimensions in local areas of the data.

5.2 s-Approximate Cover Tree Search

In this section, the changes necessary for approximate search on a cover tree are outlined. As will be seen, these changes are exceedingly simple in practice, as the use of s as a surrogate for the probability of finding results simplifies the implementation significantly. We base our search algorithm on the cover tree search method given by Izbicki and Shelton (2015), with one small change: children are inspected by order of distance from the query, rather than distance from the cover node centroid.

Algorithm 12 shows pseudocode for k-nearest-neighbour search on a cover tree, based on the description given by Izbicki and Shelton (2015). At each level, the distance from the query to each child node is found, and then children are inspected in increasing order of distance. If the sum of the maximum query radius and the child node radius is larger than this distance, then the child node’s sub-tree is searched. The child node radius is defined as the distance of the child’s furthest descendant in its sub-tree, to ensure all descendants are considered. Any centroids that are within the query radius are added to the query results, and if there are more than k results, the one with the largest distance is removed.

Algorithm 13 shows the pseudocode for s-approximate k-nearest-neighbour search on a cover tree, with the line of code affected by s underlined. In comparison with Algorithm 12, the only change is to scale each child's covering radius by s when checking whether to search its sub-tree. This is the only change required to introduce controllable approximate search based on our proposed method. As the modification is so small, it can readily be integrated into exact search code and provides the option of faster search at a small accuracy penalty. This can still be used for exact search, as using s = 1.0 will result in normal exact cover tree search.

Algorithm 12 k-NN cover tree search.
Require: Query: The query to be processed by this search.
Require: Root: The root node for the currently explored sub-tree.
Require: kNeighbours: A distance ordered max-heap of neighbour candidates.
Require: K: The number of neighbours to find.
function QueryCoverTree(Query, Root, kNeighbours, K)
    if Root.parent == NULL then
        // Do this once when we first start the query at the real root.
        kNeighbours ← K copies of {∞, NULL}
        dist ← distance(Query, Root.centroid)
        kNeighbours.pop()
        kNeighbours.push(dist, Root.centroid)
    end if
    dist ← {}
    for child ∈ Root.children do
        dist[child] ← distance(Query, child.centroid)
    end for
    for child ∈ Root.children sorted by dist[child] do
        AddNeighbour(Query, child.centroid, kNeighbours, dist[child])
        if dist[child] < kNeighbours.peek()[0] then
            kNeighbours.pop()
            kNeighbours.push(dist[child], child.centroid)
        end if
        covdist ← dist[child] - child.maxDist
        if covdist < kNeighbours.peek()[0] then
            QueryCoverTree(Query, child, kNeighbours, K)
        end if
    end for
end function

5.3 Experimental Results

In this section, s-approximate cover trees are tested and compared to our theoretical worst-case predictions. As our model only really predicts finding a single relevant item in the cover tree with this search implementation, we also compare the performance of approximation as the number of neighbours k is increased. Finally, we compare search performance to locality-sensitive hash based approximate search in order to establish a relative efficiency measure against another popular approximate search method.

The cover tree implementations are nearest-ancestor cover trees based on the descriptions given by Izbicki and Shelton (2015). These are implemented in C++, using the Armadillo linear algebra library for vector calculations. The cover size scaling factor used to construct all cover trees is 1.3. The LSH based methods are similarly implemented in C++ using the Armadillo library. The performance per query is reported as the mean over a batch of 1000 sample queries.

Algorithm 13 s-Approximate k-NN cover tree search.
Require: Query: The query to be processed by this search.
Require: Root: The root node for the currently explored sub-tree.
Require: kNeighbours: A distance ordered max-heap of neighbour candidates.
Require: k: The number of neighbours to find.
Require: s: The scaling factor for approximation.
function QueryCoverTree(Query, Root, kNeighbours, k, s)
    if Root.parent == NULL then
        // Do this once when we first start the query at the real root.
        kNeighbours ← k copies of {∞, NULL}
        dist ← distance(Query, Root.centroid)
        kNeighbours.pop()
        kNeighbours.push(dist, Root.centroid)
    end if
    dist ← {}
    for child ∈ Root.children do
        dist[child] ← distance(Query, child.centroid)
    end for
    for child ∈ Root.children sorted by dist[child] do
        AddNeighbour(Query, child.centroid, kNeighbours, dist[child])
        if dist[child] < kNeighbours.peek()[0] then
            kNeighbours.pop()
            kNeighbours.push(dist[child], child.centroid)
        end if
        covdist ← dist[child] - child.maxDist · s
        if covdist < kNeighbours.peek()[0] then
            QueryCoverTree(Query, child, kNeighbours, k, s)
        end if
    end for
end function
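For concreteness, the listing below is a runnable Python sketch of the same s-approximate descent, assuming a simplified node structure with centroid, children and max_dist (the distance to the node's furthest descendant); it is illustrative only and is not the thesis C++ implementation.

import heapq
import itertools

_counter = itertools.count()

def query(node, q, k, s, dist, heap=None):
    # heap holds (-distance, tiebreak, point) so the current worst candidate sits on top.
    if heap is None:                                  # first call at the real root
        heap = [(-float("inf"), next(_counter), None) for _ in range(k)]
        heapq.heapify(heap)
        heapq.heapreplace(heap, (-dist(q, node.centroid), next(_counter), node.centroid))
    d = {id(c): dist(q, c.centroid) for c in node.children}
    for child in sorted(node.children, key=lambda c: d[id(c)]):
        dc = d[id(child)]
        if dc < -heap[0][0]:
            heapq.heapreplace(heap, (-dc, next(_counter), child.centroid))
        if dc - child.max_dist * s < -heap[0][0]:     # the only change from exact search
            query(child, q, k, s, dist, heap)
    return sorted(((-nd, p) for nd, _, p in heap if p is not None), key=lambda t: t[0])

Here dist can be any metric function, for example a Euclidean distance such as lambda a, b: float(np.linalg.norm(a - b)), and setting s = 1.0 reproduces exact search.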

All timing results are reported on a machine with an Intel i7-4820K processor and 32 GB of RAM. When performance is measured, the mean number of distance comparisons per query is also reported to give an architecture-independent measure of search efficiency.

5.3.1 Comparison to Theoretical Results

In this section, we check the predicted conservative recall values for P_found(c | s, d, m) (using Equations 5.5, 5.6 and 5.7) against the actual recall for 1-nearest-neighbour search on two datasets with known intrinsic dimensionality. Amsaleg et al. (2015) provide estimates of local intrinsic dimensionality for several datasets, including MNIST and BigANN SIFT.

We use these intrinsic dimensionality values for d in the calculation of P_found, and compare P_found(c | s, d, m) to actual search results for the MNIST and BigANN SIFT datasets.

Figure 5.3: Approximate cover tree search for 1-nearest-neighbour: predicted lower bound for recall compared to actual recall for values of s.

Figure 5.3 shows s-approximate cover tree recall for varying values of s on the MNIST digit dataset (Lecun et al., 1998). The local intrinsic dimensionality calculated by Amsaleg et al. (2015) for this dataset is d = 6.3, and the average depth of data items within the tree is m = 5.16. The predicted lower bound for recall is significantly conservative compared to real performance, but since our calculations assume the worst case for search misses, this might be expected. The fit of P_found(c | s, d, m) still effectively lower-bounds the real average performance and the profile of the curve is quite similar to the real result.

Figure 5.4 shows s-approximate cover tree recall for varying values of s on the BigANN SIFT 1 million dataset (Jegou et al., 2011a). The fit for this curve is much closer than that for Figure 5.3. Our predictions still lower-bound the recall curve in all places, but are a much more accurate prediction of the true recall for values of s than in the MNIST case. While the model for these predictions might be improved to more accurately reflect the data, results show that our model effectively lower-bounds the average recall for approximate 1-NN queries. This allows conservative predictions to be made about performance based on the characteristics of the dataset, even when there is no ground-truth value available.

Figure 5.4: Approximate cover tree search for 1-nearest-neighbour: predicted lower bound for recall compared to actual recall for values of s.

It is interesting to see that both datasets have an extended region of values of s which correspond to 100% accuracy, far more so than is predicted by our model. This indicates that exact results (or so close to exact as makes no practical difference) can be obtained with far tighter cover tree node boundaries than the conservative furthest-descendant estimate used for exact cover trees.

5.3.2 Performance for Varying Values of k

In this section we examine the scalability of s-approximate cover tree search to larger values of k. The performance of s-approximate cover trees is measured for k = 1, 10, 100 on the LabelMe (Russell et al., 2008), CIFAR-10 (Krizhevsky, 2009), and BigANN SIFT and GIST datasets (Jegou et al., 2011a, Siagian and Itti, 2007). We compare the performance of search for different values of k with varying values of s to examine the effects of the search parameter on larger neighbour sets. The timed performance of approximate search with varying values of k is also given in order to compare the performance impact of searching for a larger number of neighbours. Time is reported as the mean single-threaded query execution time over a set of 1000 queries, measured against the mean recall achieved on those queries.

Figure 5.5: Performance of s-approximate cover tree search on the LabelMe dataset. Recall shown compared to both the value of s, and the time taken to execute search. In both graphs, higher is better.

Figures 5.5 and 5.6 show results for the smaller LabelMe and CIFAR-10 datasets. These both have a number of elements on the order of 10^4, compared to 10^6 for the BigANN datasets. It is interesting that the patterns seen in the performance curves between the two datasets are very similar. This could indicate that the performance characteristics of our approximate search are linked strongly to the size of the dataset.

Figure 5.6: Performance of s-approximate cover tree search on the CIFAR-10 dataset. Recall shown compared to both the value of s, and the time taken to execute search. In both graphs, higher is better.

In both figures (5.5 and 5.6), we see that the baseline performance at s = 1 is much higher for k = 10 and k = 100 search. This is due to the search bounds being looser, as the kth element is used to bound node search. As s increases towards 1.0, the recall for k = 1 increases sharply and overtakes the recalls for k = 10 and k = 100. This is expected if the cover tree is well arranged, as the closest neighbours should be found first. The result of this can be seen in the timing graphs, where the speed of queries for k = 1 is approximately an order of magnitude better at recall levels of 0.99.

Figures 5.7 and 5.8 show results for the BigANN 1 million sized datasets. In comparison with the smaller datasets, the performance in Figure 5.7 with changing s behaves much more predictably. As the number of neighbours k is increased, recall at the same setting of s increases smoothly for the BigANN SIFT dataset. Figure 5.8 shows a similar trend, although at higher levels of recall the k = 1 curve eventually overtakes the other curves. Interestingly, for large datasets the number of neighbours k barely affects query running times on either dataset.

These results demonstrate good scalability for s-approximate search when retrieving larger numbers of neighbours. This scalability appears to be better on larger datasets, which is a particular advantage for large data processing and search. Some further investigation into the reasons for changes in approximate search performance as k is increased might give insights into the search process that allow us to improve the construction of cover trees. This could be a direction for future investigation.

Figure 5.7: Performance of s-approximate cover tree search on the BigANN SIFT 1 Million dataset. Recall shown compared to both the value of s, and the time taken to execute search. In both graphs, higher is better.

Figure 5.8: Performance of s-approximate cover tree search on the BigANN GIST 1 Million dataset. Recall shown compared to both the value of s, and the time taken to execute search. In both graphs, higher is better.

5.3.3 Comparison to Locality-sensitive Hashing and Leaf-search

s-Approximate cover tree search allows smooth variation in the number of candidates inspected when searching for nearest neighbours. This provides an objective measure of the efficiency of the cover tree data structure in organising and searching through data. We can compare the efficiency of various approximate methods in terms of the number of candidates inspected to achieve a certain performance. This efficiency value does not include the cost of traversing storage structures, but if the data used is high-dimensional enough, distance comparisons will dominate search time.

Using the number of candidates inspected compared to recall as a measure of efficiency, we compare the efficiency of random projection Locality-Sensitive Hashing (LSH) (Kushilevitz et al., 2000) to s-approximate cover trees. This could be considered a comparison of the organisational efficiency of these search structures, since a lower number of candidates returned for the same recall value indicates that less data was searched to arrive at the same answer. We also include a comparison of leaf-limited search on cover trees. For leaf-limited search, the tree search process is terminated once a certain number of leaf nodes are encountered.

LSH, s-approximate cover tree search, and cover tree leaf search are compared using a precision-recall graph, where precision indicates the proportion of candidates returned that match the ground-truth results, and recall indicates the proportion of the actual ground-truth results returned among the candidates. The precision-recall curve can be read in a similar manner to ROC curves, where the area under the curve can be used as an indicator of the overall quality of search performance (Manning et al., 2008).
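For clarity, the two measures can be computed per query as in the small illustrative helper below (not code from the evaluation framework used here); `candidates` denotes the points that were distance-tested for the query.

```python
def precision_recall(candidates, ground_truth):
    """Precision and recall as used in this comparison: recall is the fraction of
    ground-truth neighbours that appear among the candidates, and precision is the
    number of true neighbours found divided by the number of candidates inspected
    (i.e. the number of distance tests made for the query)."""
    found = len(set(candidates) & set(ground_truth))
    recall = found / len(ground_truth)
    precision = found / len(candidates)
    return precision, recall
```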

For random projection LSH, the following formula determines the value of bit i of the hash code h for a given query vector q:

h_i(q) = \begin{cases} 1, & \text{if } q \cdot r_i \ge 0 \\ 0, & \text{otherwise} \end{cases}

where h_i is the ith bit of the hash code, and the r_i are a set of random projections which uniquely define the hash function. To measure LSH performance, we compare single hash functions of 32 and 64 bits and use Hamming ball search (search sorted by l1 distance) to expand the candidate set. This orders candidates by decreasing similarity for a given LSH hash code. Each hash-code distance cut-off is measured in terms of recall and precision and plotted. The ground-truth nearest neighbours are set to k = 100, and results are averaged over 1000 randomly chosen queries held out from the dataset. For both LSH and cover tree search, the precision is calculated as the mean number of true neighbours found per query divided by the mean number of distance tests made per query. Results are averaged over 10 randomly generated LSH hash functions.
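As a concrete illustration of this query procedure, the following minimal sketch (illustrative names, not the evaluation code used in these experiments) draws the random hyperplanes r_i, computes the hash code, and orders stored items by Hamming distance so the candidate set can be expanded by increasing distance cut-off.

```python
import numpy as np

def make_lsh(dim, n_bits, seed=None):
    """Draw the random hyperplanes r_i that define one random-projection hash function."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def hash_code(R, x):
    """h_i(x) = 1 if x . r_i >= 0, else 0."""
    return (R @ x >= 0).astype(np.uint8)

def hamming_ball_order(R, query, codes):
    """Order stored items by the Hamming (l1) distance between their codes and the
    query's code, so candidates can be taken in order of decreasing similarity."""
    hq = hash_code(R, query)
    dists = np.count_nonzero(codes != hq, axis=1)
    return np.argsort(dists, kind="stable")

# Usage sketch: X is an (n, d) database, q is a d-dimensional query vector.
# R = make_lsh(X.shape[1], 32)
# codes = (X @ R.T >= 0).astype(np.uint8)        # precomputed once per hash function
# candidates = hamming_ball_order(R, q, codes)[:budget]
```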

The LSH and cover tree construction methods used do not take into account any special details of the data distribution. Cover trees are randomised in the sense that the first candidate found for a new child of a node is always chosen, and so no attention is given to the overall structure of the tree or data. LSH is randomised in the sense that its bits are set based on random hyperplanes, with no bias towards a particular structure. There exist high-performance methods for fitting LSH to data distributions (Gong and Lazebnik, 2011), but equivalent fitting methods have not yet been developed for cover trees. It is therefore interesting to compare the baseline efficiency of approximate cover tree search and projection-based LSH.

Figure 5.9 shows LSH compared to s-approximate cover trees for the LabelMe (Russell et al., 2008), CIFAR-10 (Krizhevsky, 2009), and BigANN SIFT and GIST datasets (Jegou et al., 2011a, Siagian and Itti, 2007). As has been observed, s-approximate search has a minimum recall level for larger values of k, and so does not scale to very low recall levels. LSH is able to achieve high precision rates for very low levels of recall, but its precision falls off quickly as recall increases. This leads to remarkably better precision for approximate cover tree search at higher recall levels. This result is slightly surprising, as we know that cover trees must navigate and make distance comparisons on multiple levels of a highly branching tree structure (where most comparisons are to irrelevant data) in order to find a small number of relevant candidates. A second interesting but not particularly surprising result is that s-approximate cover tree search achieves a much better precision than leaf-limited search across a wide range of recall levels. This is a result of s-approximate search ignoring nodes which have very low chances of contributing useful data to the query. As more nodes are searched, the precision difference between s-approximate search and leaf-limited search increases, due to leaf-limited search exploring these low-gain nodes.

Figure 5.9: Precision-Recall performance of s-approximate and leaf-limited cover tree searches compared to LSH. Higher precision values indicate more efficient performance at a given recall level.

The gap between LSH and cover tree precision becomes particularly noticeable at recall levels of 0.8 and above. At recall = 0.8, cover tree search is more than a factor of 2 higher in precision across all datasets. This means that s-approximate cover trees are processing only half the number of candidates that LSH would for the same level of recall. Even if we assume that there will be a higher performance penalty for the branching nature of tree algorithms compared to the flat lookup policy of hash tables, cover trees are still at a significant performance advantage for high-recall approximate search. This is especially true for high-dimensional data, as the distance calculations (and therefore the number of candidates retrieved) will dominate the overall query evaluation time.

It is important to remember that there are several other design considerations when using cover trees or LSH search. LSH search uses a significant amount of memory, as it instantiates a collection of hash functions in order to improve precision during search. The hash functions and associated tables can be generated quickly, but use a lot of memory in doing so. In comparison, cover trees use a much smaller amount of memory, which is a constant factor of the input dataset size. Cover trees can take a significant amount of time to construct, although parallel construction algorithms have recently been given (Izbicki and Shelton, 2015). This means that although approximate cover tree search may have superior performance when executing queries, time constraints on the construction and maintenance of the cover tree search structure might limit its use. However, for applications where large datasets may need to be searched frequently with high levels of recall, s-approximate cover tree search provides a significant improvement over LSH and previous cover tree approximate search methods in these tests.

5.4 Chapter Summary

A controlled method with a theoretical basis for approximating search based on cover trees has been demonstrated. The soundness of the method has been verified through tests on a range of datasets, and some interesting connections to intrinsic dimensionality have been explored. There is a possibility that the relation of the search bounds to intrinsic dimensionality could be used for further refinement of the approximate search process. This would allow more exact control over the accuracy of results, independently of the dataset being queried. Such a result would remove the dependency on parameter tuning for fast approximate search, meaning that the parameter s could be set to guarantee a certain accuracy across all datasets without any tuning needed.

s-Approximate cover tree search has been compared to LSH, using precision and recall as key measures of performance so that the two approximate search algorithms can be accurately compared. We have demonstrated that s-approximate cover tree search performs significantly better than the baseline random projections LSH method. An area for further work is to develop better data-fitting constructions for cover trees and compare s-approximate cover trees to more advanced LSH algorithms.

The tests performed show that our method for approximate search on cover trees can reduce query time by a large factor over exact search, while maintaining very high accuracy in results. Combined with the simplicity of its implementation, these characteristics make s-approximate cover tree search highly effective for approximate search on large databases. We have also shown that s-approximate search has superior performance when compared to leaf-limited search, making it the better option for approximate search on cover trees.

Chapter 6

Creating Large-Scale Incremental K-Nearest-Neighbour Graphs

While many machine learning and data analysis problems involve the computation of a static k-nearest-neighbour (KNN) graph, there exist several algorithms which expect incremental updates to the neighbourhood graph as new data becomes available. These include the instantaneous topological map (Jockusch and Ritter, 1999), and newer incremental Isomap algorithms (Gao and Liang, 2015, Shi et al., 2005).

The Instantaneous Topological Map (ITM) is an efficient, incremental algorithm for building a navigable Delaunay mesh from streaming incoming data (Jockusch and Ritter, 1999). ITM is used for robotic navigation and mapping algorithms such as SLAM (Laugier et al., 2008), biological modeling (Jockusch and Ritter, 1999), and reinforcement learning algorithms (Hafez and Loo, 2015). While the ITM algorithm itself is efficient, ITM includes a nearest-neighbour finding step in its update phase which will slow down the system as it scales to large KNN graphs.

Isomap is a nonlinear, manifold-based dimensionality-reduction technique built around KNN graphs (Tenenbaum et al., 2000). The Isomap algorithm consists of first creating a KNN graph of all data, then finding the shortest paths through the graph for all data

pairs (a representative subset can be used for large datasets), and finally using multidimensional scaling (MDS, Torgerson (1958)) to embed the resultant graph into a lower-dimensional space. More recently, several incremental versions of the Isomap algorithm have been developed, with efficient update methods for the graph paths and incremental MDS calculation (Gao and Liang, 2015, Shi et al., 2005). These methods have not focused on improving the KNN-graph update portion of the algorithm, however, perhaps because tests concentrate on relatively small input datasets.
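As a compact reference for this batch pipeline (KNN graph, then shortest paths, then MDS), the sketch below uses standard scipy/scikit-learn routines; it assumes a connected KNN graph and Euclidean distances, and it is the static algorithm, not one of the incremental variants discussed in this chapter.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_embed(X, k=10, out_dim=2):
    """Batch Isomap: KNN graph -> all-pairs shortest paths -> classical MDS."""
    # 1. KNN graph with Euclidean edge weights, symmetrised to an undirected graph
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G = G.maximum(G.T)
    # 2. geodesic distance estimates via graph shortest paths (Dijkstra)
    D = shortest_path(G, method="D", directed=False)
    # 3. classical MDS on the squared geodesic distances
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:out_dim]          # keep the largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```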

Incremental Isomap has already been applied to real-time X-ray monitoring of breathing patterns (Fischer et al., 2014), with the results showing a very high accuracy. The Isomap algorithm itself was originally designed with the intent of uncovering the intrinsic degrees of freedom of a dataset through embedding in a low-dimensional space (Tenenbaum et al., 2000). In a series of X-ray images of the lungs and diaphragm, the breathing movements become a natural degree of freedom in the dataset. Due to this, the manifold embedding of the data greatly improves estimation of the diaphragm movements both in size and regularity over linear methods such as PCA, even when embedding into a single dimension. The incremental nature of the Isomap algorithm used meant that estimation could be started from a very small number of images, and could adapt as new data was acquired.

The nearest-neighbour finding part of both ITM and incremental Isomap will have a complexity of O(n^2) when running over streams of data, as each new candidate is compared to all previously accepted candidates when running an update. This poses a particular problem for applications such as robot navigation and classifying image streams, as these algorithms will need to run in real-time to be useful. In order to scale both ITM and incremental Isomap to larger datasets, the nearest-neighbour finding part of these algorithms will need to be improved.

There is relatively little previous work in the area of non-static KNN search, since historic memory limitations meant it has not been a common use-case for large-scale data analysis. Prior work includes the development of Sphere-tree based incremental search (Yu et al., 2010), and HDR-tree based incremental search (Yang et al., 2014). These search structures are variations of Euclidean spatial trees, limiting applications to lp-space distance measures. Yu et al. (2010) introduce the notion of a reverse k-NN (RKNN) query, which aims to find all existing nodes which would include a new query point as a part of their k-NN set. Separately, upon insertion, the k-NN set for the query point is found by again querying the search structure in the opposite direction.

The HDR-Tree approach of Yang et al. (2014) uses a PCA transformation of the data to create an information-based ordering on the dimensions of the data. The fan-out of the tree, or number of children per node, L, is a parameter that can be set to some number ≥ 2. The root node clusters data in the first PCA dimension only using the K-means algorithm, with each cluster becoming a child centroid. This ensures good locality of the child data. At each successive level of the tree, an extra PCA dimension is added before the means are calculated for the next level of the tree, up to a maximum of L dimensions, or until the number of points in the node is less than the maximum leaf size θ. The reduction of dimensions using PCA gives cheap but informative low-dimensional comparisons while searching the tree for neighbours. Extra data is stored on each tree node which allows the reverse checking of queries against existing graph points to see if the query will become a new nearest neighbour for that point. Two separate query methods are then defined: one to find the k nearest neighbours, and one to find if the query belongs in the "reverse" k nearest neighbours for points already in the graph.
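As a rough illustration of this level-wise partitioning (and only of the partitioning, not the reverse-KNN bookkeeping an HDR-Tree also stores), a minimal sketch follows. The function and parameter names are ours; the default fan-out and leaf size are the hand-tuned values used later in this chapter's experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def build_hdr_level(points, pca, depth, fanout=6, leaf_size=120):
    """Recursively partition `points` with k-means on a growing prefix of PCA
    dimensions (depth 1 uses one dimension, depth 2 uses two, ...)."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    dims = min(depth, pca.n_components_)              # add one PCA dimension per level
    proj = pca.transform(points)[:, :dims]
    k = min(fanout, len(points))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(proj)
    if len(np.unique(labels)) < 2:                    # degenerate split: stop recursing
        return {"leaf": points}
    return {"children": [build_hdr_level(points[labels == c], pca, depth + 1,
                                         fanout, leaf_size)
                         for c in np.unique(labels)]}

# Usage sketch:
# pca = PCA().fit(X)
# tree = build_hdr_level(np.asarray(X), pca, depth=1)
```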

As shown by the methods of Yu et al. (2010) and Yang et al. (2014), an incremental KNN graph is more complex to update than it is to answer simple KNN queries. The KNN graph not only considers the k-nearest-neighbours of the query as graph additions, but also whether the query is closer than the existing k-nearest-neighbours of any other graph node. Due to this, existing KNN query methods will not work as-is to construct the KNN graph accurately. In applications such as incremental Isomap, brute-force search has been used to solve this problem, since it is simple and satisfies both conditions at the same time (Gao and Liang, 2015, Shi et al., 2005). However, this limits the size of dataset that can be processed by the algorithm.

In this chapter, a unified query and insertion methodology is developed for fast, incremental KNN graph construction. Using this query methodology, we show that KNN graph properties can be used to bound neighbourhood search efficiently for spatial trees, allowing efficient and scalable incremental KNN graph construction and search. We apply our query method to cover trees to create a construction which supports general metric spaces, is more memory efficient than previous structures, and is significantly faster than previous methods.

6.1 Constructing k-NN Graphs Incrementally Using max(ε, k)-NN Queries

Since KNN graph updates can affect not only the immediate query point but also the larger structure of the graph, we need to ensure that queries are aware of the global graph limits as well as the KNN search requirements when performing a search. To first satisfy the KNN graph requirements for the query, we simply need to perform a k-nearest-neighbour query using some accelerated search structure.

When the query point is added to the graph, it may also become one of the k-nearest-neighbours of a point which is not a k-nearest-neighbour of the query. We can imagine a ball existing around each vertex in the KNN graph with radius equal to the vertex's longest edge. When a new point is added to the graph, each ball that it lies inside indicates that the new point will become one of the new k-nearest-neighbours of that ball's parent vertex. This gives an exact spatial bound on search distance for each particular vertex. We can upper-bound the search radii for the entire graph by taking the maximum ball radius over the entire graph, which is equivalent to the longest edge in the graph. Assuming we know the length of the longest graph edge, the ball-search definition means an ε-nearest-neighbour search with ε equal to this length will return all possible vertex candidates for checking updates against. Using this, an ε-NN search with any generic search algorithm will yield the same results as the RKNN query described by Yu et al. (2010), even if it does not support reverse neighbour ball lookups as the data structures of Yang et al. (2014) and Yu et al. (2010) do.

Figure 6.1 gives an illustration of this bounding method. Given a KNN graph, the longest edge in the existing graph defines ε when querying for KNN graph insertion. Once the k neighbours of the query point are found, the search radius expands to ε, searching for vertices whose longest edge is longer than dist(query, vertex). When a longer edge is found, it is removed (indicated by dashed gray lines) and new outgoing edges are created from the vertex to the query point. The global longest-edge value of ε guarantees that no matter where in the graph the query is located, all existing vertices that need updating when inserting into the graph are found.

Figure 6.1: Using the longest graph edge for max(ε, k)-NN search with global ε.

The k-nearest-neighbour and ε-nearest-neighbour queries can be combined into a single operation, which we call max(ε, k)-NN search. In max(ε, k)-NN search, the larger answer set out of the two conditional queries is returned. Since both the k and ε conditions select a set of nearest neighbours, the smaller of the two sets will always be a subset of the larger: if the k-neighbour set is larger then it must include the ε neighbours, and vice versa. Because of this, only a single search operation needs to be performed, which returns the larger set for checking. This type of search operation can be applied to bound the number of comparisons (although not the number of distance checks) that even a brute-force search performs. The max(ε, k)-NN query efficiently combines the separate forward and reverse k-NN query steps performed by Yu et al. (2010) and Yang et al. (2014).
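As a concrete (if naive) reference for this combined query, the brute-force sketch below performs a max(ε, k)-NN search and derives both the forward and reverse graph updates from its result; it assumes Euclidean vectors and illustrative names throughout, and is not the tree-based method developed in the next section.

```python
import numpy as np

def max_eps_k_nn(query, points, edge_lengths, k, eps):
    """Brute-force reference for a max(eps, k)-NN query: return the larger of the
    k-NN set and the eps-NN set, where eps is the longest edge currently in the
    KNN graph and edge_lengths[i] is the longest outgoing edge of vertex i."""
    dists = np.linalg.norm(points - query, axis=1)
    order = np.argsort(dists)
    knn = order[:k]                           # the k nearest existing vertices
    eps_nn = np.flatnonzero(dists <= eps)     # every vertex the query could update
    candidates = knn if len(knn) >= len(eps_nn) else eps_nn

    # forward edges: the query adopts its k nearest neighbours
    forward = knn
    # reverse updates: any candidate whose longest edge exceeds its distance
    # to the query gains the query as a new neighbour
    reverse = [i for i in candidates if dists[i] < edge_lengths[i]]
    return forward, reverse
```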

6.2 Incremental k-NN Graphs with Cover Trees

Any spatial tree structure can be easily adapted for KNN graph search. Since we are particularly concerned with the ability to efficiently handle high-dimensional data (as in the incremental Isomap use-case) and fast data point insertion to minimize overall processing time, we choose the simplified cover tree search structure described by Izbicki and Shelton (2015). Simplified cover trees are a type of metric tree where each level of the tree reduces the tree node's covering radii by a fixed factor (usually divided by a factor of 1.3 as a good practical value). In higher dimensions, this approach is particularly effective, as its search time grows according to the growth constant of the dataset rather than its dimensionality (Beygelzimer et al., 2006). Cover trees have also demonstrated some of the most efficient exact k-nearest-neighbour search performance to date (Beygelzimer et al., 2006, Izbicki and Shelton, 2015, Kumar and K.Ramareddy, 2010). The memory requirements for simplified cover trees are always linear in the dataset size, allowing in-memory scalability into the millions of elements, even on an average desktop machine.
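For reference while reading the pseudocode that follows, a simplified cover tree node needs only a handful of fields. The sketch below is illustrative (field names chosen to mirror Algorithms 14 and 15), not the optimised C++ structure used in the experiments.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class CoverNode:
    """A simplified cover tree node: each child's covering radius is its parent's
    divided by 1.3, maxDist tracks the furthest descendant, and small sub-trees
    are collapsed into leafdata."""
    centroid: np.ndarray
    coversize: float = 1.0
    maxDist: float = 0.0
    parent: Optional["CoverNode"] = None
    children: List["CoverNode"] = field(default_factory=list)
    leafdata: List[np.ndarray] = field(default_factory=list)
```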

Algorithm 14 gives pseudocode for insertion of a data-point into a simplified cover tree. Unlike the original algorithm given by Izbicki and Shelton (2015), a maximum terminal leaf size is used in this code. This can help somewhat with insertion times, and also reduces the amount of conditional logic in search when the remaining sub-tree is small. In experiments, the leaf size is set to 10.

A second change made to the cover tree insertion algorithm is that maximum node distances are calculated from the leaf up to the parent after determining the insertion point. This adds a logarithmic (in n) number of distance calculations to insertion, but allows us to supply a node lower down in the tree as the "root node" for insertion purposes without losing node distance information. This node can be found while querying for nearest neighbours, removing the need to start at the top of the tree when finding an insertion point.

Most spatial search structures are capable of supporting both ε-NN and k-NN queries. These queries find all neighbours within ε of the query point, and the closest k neighbours to the query point, respectively. A small modification of the search algorithm allows us to implement max(ε, k)-NN search efficiently in a single query.

Algorithm 14 Simplified cover tree insertion.
Require: Query: The data point to be inserted.
Require: Root: The cover tree root node.
Require: LeafSize: The maximum leaf size.
function CoverTreeInsert(Query, Root, LeafSize)
    if Root == NULL then
        Root ← new node with Query as centroid and cover size 1.0; return
    end if
    dist ← distance(Query, Root.centroid)
    if size(Root.children) > 0 then
        while dist > Root.coversize do
            Remove a leaf point from the cover tree
            Create a new Root with the leaf point, set new coversize to 1.3 · Root.coversize
            Insert old Root as a child of the new Root
            dist ← distance(Query, Root.centroid)
        end while
    else
        Root.coversize ← dist
    end if
    current ← Root
    while dist < current.coversize and size(current.children) > 0 do
        current ← child in current.children with min. distance to Query
        dist ← distance(Query, current.centroid)
    end while
    if dist > current.coversize then
        current ← current.parent
        Add new node to current.children with Query as centroid
        Set new tree node coversize to current.coversize/1.3
    else
        if size(current.leafdata) ≥ LeafSize then
            Add new tree node to current.children with Query as centroid
            Set new tree node coversize to current.coversize/1.3
            for point ∈ current.leafdata do
                CoverTreeInsert(point, current, LeafSize)
            end for
            Set current.leafdata to empty
        else
            Add Query to current.leafdata
        end if
    end if
    while current ≠ NULL do
        current.maxDist ← max(current.maxDist, distance(current.centroid, Query))
        current ← current.parent
    end while
end function

Algorithm 15 gives pseudocode for the modified max(ε, k)-NN search for cover trees, using the supporting AddNeighbour function listed in Algorithm 16. We note that the query point is used to represent both its spatial data and its vertex reference in the KNN graph. The major search change from normal k-nearest-neighbour queries is the use of two inclusion criteria for the search, in order to satisfy both k and ε. We also include logic during search to locate the position the query point should be inserted at, and return it as insertcandidate. This reduces the number of distance comparisons used during insertion significantly. The only complexity added by calculating insertcandidate is a single check and assignment for each child during traversal, to return the candidate value of the appropriate child for insertion.

To update the global ε value at each step, we first check whether the newly inserted vertex has an edge longer than the current longest edge. If so, the longest edge length can be updated with this information. If not, we check whether the vertex with the longest edge has been updated with a shorter edge. If so, all vertices in the graph must be searched to find the new longest edge. In practice, this is fast to check and occurs less frequently as the graph grows. The longest edges for vertices could be kept in a heap-like data structure; however, the updates from each new change to non-longest nodes would likely create more bookkeeping overhead than is saved. Therefore we opt for the simpler implementation.
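A minimal sketch of this global ε bookkeeping is given below; it assumes a dictionary-of-heaps representation of the KNN graph in which `knn_graph[v][0][0]` exposes vertex v's current longest edge, and is illustrative rather than the implementation used in the experiments.

```python
def update_global_eps(knn_graph, eps_state, inserted, changed_vertices):
    """Maintain the global longest-edge bound after inserting one vertex.
    eps_state is (eps_value, eps_vertex): the longest edge and the vertex
    that currently owns it."""
    eps_value, eps_vertex = eps_state
    new_longest = knn_graph[inserted][0][0]
    if new_longest > eps_value:
        return new_longest, inserted           # the new vertex owns the longest edge
    if eps_vertex in changed_vertices and knn_graph[eps_vertex][0][0] < eps_value:
        # the previous owner's longest edge shrank, so rescan every vertex;
        # in practice this is triggered less and less often as the graph grows
        owner = max(knn_graph, key=lambda v: knn_graph[v][0][0])
        return knn_graph[owner][0][0], owner
    return eps_value, eps_vertex
```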

Previous methods show that spatial tree search can improve the global ε bound of max(ε, k)-NN search even further by exploiting the hierarchical nature of the search tree (Yang et al., 2014, Yu et al., 2010). For each spatial sub-tree, search can be effectively pruned to local graph data by assigning ε to the maximum KNN edge length of the sub-tree. This adapts the search to the density of the data, which allows tighter pruning bounds in dense regions. The only extra information that needs to be added to a spatial tree in order to do this is for the new maximum edge length to be propagated up the tree and stored on tree nodes when changes to neighbour lists occur.

Figure 6.2 shows this hierarchical situation. Instead of a global setting, the ε search value is set by the maximum edge length of the KNN graph points covered by the current spatial tree node. The global longest edge length is likely to belong to an outlier data point, where the graph is least dense. This means it will be much larger than the required value for most areas in the graph. Using a locally defined ε allows dense areas of the KNN graph with short connections to be searched more efficiently, without using the global longest edge length.

Algorithm 15 max(ε, k)-NN cover tree search.
Require: Query: The query to be processed by this search.
Require: Root: The root node for the currently explored sub-tree.
Require: KnnGraph: A list of edge max-heaps for all nodes in the graph.
Require: K: The number of neighbours per node.
Require: ε: The length of the longest edge in the current KNN graph.
function QueryKNNGraph(Query, Root, KnnGraph, K, ε)
    if Root.parent == NULL then
        // Do this once when we first start the query at the real root.
        KnnGraph[Query] ← K copies of {∞, NULL}
        dist ← distance(Query, Root.centroid)
        AddNeighbour(Query, Root.centroid, KnnGraph, dist)
    end if
    mindist ← ∞
    insertcandidate ← NULL
    if size(Root.children) == 0 then
        for point ∈ Root.leafdata do
            dist ← distance(Query, point)
            AddNeighbour(Query, point, KnnGraph, dist)
        end for
    else
        for child ∈ Root.children do
            tmpcand ← NULL
            dist ← distance(Query, child.centroid)
            AddNeighbour(Query, child.centroid, KnnGraph, dist)
            covdist ← dist − child.maxDist
            if covdist < max(KnnGraph[Query].peek()[0], ε) then
                tmpcand ← QueryKNNGraph(Query, child, KnnGraph, K, ε)
            end if
            if dist < mindist then
                mindist ← dist
                if tmpcand == NULL then
                    insertcandidate ← child
                else
                    insertcandidate ← tmpcand
                end if
            end if
        end for
    end if
    return insertcandidate
end function

Algorithm 16 Add a potential neighbour to the KNN graph.
Require: Query: The new query point.
Require: Vertex: The existing vertex to check against.
Require: KnnGraph: A list of edge max-heaps for all nodes in the graph.
Require: Dist: The distance from Query to Vertex.
function AddNeighbour(Query, Vertex, KnnGraph, Dist)
    if KnnGraph[Query].peek()[0] > Dist then
        KnnGraph[Query].pop()
        KnnGraph[Query].push({Dist, Vertex})
    end if
    if KnnGraph[Vertex].peek()[0] > Dist then
        KnnGraph[Vertex].pop()
        KnnGraph[Vertex].push({Dist, Query})
    end if
end function

Figure 6.2: Using the longest local graph edge to perform a max(ε, k)-NN search with local ε. The longest graph edge within the spatial tree node is used as ε. Compared to global ε search, the ε search radius is significantly reduced.

The changes for local ε search are relatively straightforward. When inserting, the maximum k-NN graph edge value for the query is propagated up the tree nodes at the same time as the maximum covering distance at the end of the algorithm. When querying, the current node has an attached ε value which is used in place of the global ε argument. When adding neighbours to the graph, any changes in maximum edge length need to be propagated upwards through the tree. This is detailed in a new AddNeighbour function, Algorithm 17. In this new function, Vertex.parent references the cover tree node which Vertex is attached to. In general, the complexity will be no worse than the equivalent checks for the global ε version of the algorithm, and is likely to be much better, as the whole dataset does not need to be checked when the longest edge length is reduced.

Algorithm 17 Add a potential neighbour to the KNN graph and propagate local ε.
Require: Query: The new query point.
Require: Vertex: The existing vertex to check against.
Require: KnnGraph: A list of edge max-heaps for all nodes in the graph.
Require: Dist: The distance from Query to Vertex.
function AddNeighbour2(Query, Vertex, KnnGraph, Dist)
    if KnnGraph[Query].peek()[0] > Dist then
        KnnGraph[Query].pop()
        KnnGraph[Query].push({Dist, Vertex})
    end if
    if KnnGraph[Vertex].peek()[0] > Dist then
        maxEdge ← KnnGraph[Vertex].peek()[0]
        KnnGraph[Vertex].pop()
        KnnGraph[Vertex].push({Dist, Query})
        parent ← Vertex.parent
        while parent.ε == maxEdge do
            if size(parent.children) == 0 then
                parent.ε ← max(KnnGraph[point].peek()[0] for point ∈ parent.leafdata)
            else
                parent.ε ← max(child.ε for child ∈ parent.children)
            end if
            parent.ε ← max(parent.ε, KnnGraph[Vertex].peek()[0])
            parent ← parent.parent
        end while
    end if
end function

6.2.1 Complexity

It is important to examine the complexity of a method when scaling to very large datasets. In the case of this algorithm, queries inherit the O(c^12 log(n)) (Beygelzimer et al., 2006) complexity guarantees of cover trees, where c is a dataset-specific spatial doubling constant. Insertions similarly cost O(c^6 log(n)) (Beygelzimer et al., 2006), although our method retrieves the relevant insertion point at query time, reducing the insertion complexity to O(log(n)).

These ε-based methods insert and query on an incrementally growing database, giving an upper complexity bound of

\sum_{i=1}^{n} O\left(c^{12}\log(i)\right) \le O\left(c^{12}\, n \log(n)\right).

These results show that the performance of ε-nearest-neighbour methods will be upper-bounded by static KNN construction and query using cover trees. This is also an indicator that ε-based methods will scale at a similar rate to normal cover tree search, so the speed-up factor will be roughly constant.
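Spelling out the intermediate step behind this inequality (a standard bound, stated here for completeness):

\sum_{i=1}^{n} \log(i) \;=\; \log(n!) \;\le\; n \log(n),

so the dataset-dependent factor c^{12} simply carries through the sum unchanged.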

6.2.2 Experimental Results

To test the performance of the proposed online KNN graph methods, we report the time to insert and build streaming Euclidean 10-NN graphs on a variety of datasets. We compare results for online global ε cover trees, local ε cover trees, HDR-Trees (Yang et al., 2014), and brute-force search. For HDR-Trees, the parameters were hand-tuned for best performance: a fan-out factor of 6 with a maximum leaf size of 120 was selected, which is similar to the best-performing values reported by Yang et al. (2014). All tests are performed in an optimised C++ framework with the number of threads set to one for the purposes of benchmarking. An Intel 4820k CPU with 32 gigabytes of RAM is used for the timing tests. Times are reported as an average over the most recent 10,000 updates, as individual updates can vary based on cover tree traversal.

We use the LabelMe (Russell et al., 2008) and CIFAR-10 (Krizhevsky, 2009) datasets to test the performance of incremental KNN graph construction at a size where the total runtime can be compared for all methods. These datasets range in dimensionality from 512 to 3072 and in size from 22,019 to 50,000 elements, respectively. The high dimensionalities in particular give a good picture of how this approach scales as the dimensionality of the data increases. This is particularly important for Isomap-based tasks, as the datasets used are often very high-dimensional.

The SIFT and GIST one million element BigANN datasets are tested as base cases for large graphs. The TinyImages dataset is streamed up to the first twenty million elements as a scalability test. The results in Figures 6.3, 6.4, and x show that local KNN graph search based on cover trees maintains a reasonable performance even for very large data sizes.

Figure 6.3: Mean update time per element for the KNN graph of the BigANN SIFT dataset. Local and Global ε refer to the cover tree implementations.

Figure 6.3 shows the time taken and the number of items searched for each algorithm on the SIFT 1M dataset. The brute-force search was terminated early due to long running times. Compared to brute force, all three incremental KNN graph algorithms are quite effective. The global ε cover tree shows frequent variations in search time as it scales, which are less noticeable in the other two algorithms. These are likely points where the maximum edge length of the graph was increased, possibly by an outlier point being added to the data. Using local edge length data embedded in the cover tree results in significantly less variability, and nearly halves running time compared to the global ε version. When exploration descends with a selective maximum edge length, tightly clustered areas that would slow down search and produce no graph updates can be efficiently avoided. At 1 million elements, the local ε cover tree still takes under 20 milliseconds per update, which is faster than the brute-force method on a dataset only 10% of the size.

Figure 6.4: Mean update time per element for the KNN graph of the BigANN GIST dataset. Local and Global ε refer to the cover tree implementations.

Figure 6.4 shows the time taken and the number of items searched for each algorithm on the GIST 1M dataset. This is a higher-dimensional dataset than the SIFT data, with 960 features. On the GIST dataset, the relative penalty for global ε cover trees is much higher, and the test was terminated at only 300,000 data points. The HDR-Tree method outperforms the global ε search here, as it has superior locality bounds for edges, but still cannot compete with the local ε cover tree method. This still outperformed brute-force searching significantly. By comparison, local ε search keeps a relatively low run-time, still taking an average of less than 60 milliseconds per update at 1 million elements. The relative improvement shown here on higher-dimensional data demonstrates that incremental KNN graphs can be updated efficiently even on large and complex datasets.

Table 6.1 compares the total running time in seconds for incremental graph search against KNN graph generation using a nearest-ancestor cover tree, as performed by Izbicki and Shelton (2015). The times are recorded for single-threaded search.

Table 6.1: Comparison of single-threaded running times (in seconds) for incremental k-NN algorithms, with the best time for each dataset bolded. * Values are predicted based on sampling the first 60,000 queries. + Test terminated after 120,000 elements and can't be predicted.

Dataset        Global ε    Local ε    HDR-Tree    Nearest-ancestor cover tree    Naive search
LabelMe        128s        19s        108s        46s                            192s
CIFAR-10       1111s       408s       1819s       1020s                          5819s
BigANN SIFT    19447s      9502s      50244s      34076s*                        119798s*
BigANN GIST    42295s+     33394s     40489s+     73477s*                        812526s*

Across all datasets, local ε cover tree search achieves a consistent speed-up of between 2.2 and 3.6 times over static cover tree search. The HDR-Tree algorithm shows a significant speedup over naive search, particularly for small datasets. However, it is much less effective than cover trees as the data size grows. It should be noted that the nearest-ancestor and naive search methods can be processed in parallel to reduce the total running time significantly, although we have not included construction time for the nearest-ancestor cover trees in these comparisons. The proposed ε-based method currently lacks multi-threading ability, but does not require pre-search tree construction. Parallel processing might be a promising area for further work, either by coordinating batch insertions or by creating a parallel search and insertion structure.

However, in terms of real CPU time, local ε based search is significantly faster than all other algorithms in every test. The streaming search methods also construct a KNN graph incrementally, which is valid after the insertion of every element. This means that for a dataset of size n, the streaming methods produce n valid KNN graphs containing the data up to each of the n points in the dataset. By comparison, the nearest-ancestor and parallel naive search methods produce only a single KNN graph of size n.

6.3 Chapter Summary

An efficient incremental KNN graph update algorithm has been proposed and tested across a variety of datasets. Compared to previous algorithms, we provide several key improvements over the state-of-the-art, including: support for general metric distance measures; reduced memory usage; far better performance and scalability; and guaranteed complexity bounds.

Experiments show that, using our method, incremental KNN graphs can scale orders of magnitude beyond previous methods while retaining fast performance, and are more efficient in terms of total CPU time than current fixed-size KNN graph construction. This result will directly improve the performance of several algorithms such as incremental Isomap, and provides a basis for the efficient implementation of others, such as ITM.

The cover tree search implementation is not as optimised as it could be, and speeds could benefit from further refinements to the code and algorithm. Additionally, it may be possible to improve the update procedure in order to take advantage of multicore systems. This could be implemented as batch-parallel updates of multiple elements, in which case a thread-aware cover tree insertion method is needed. Alternatively, it could be implemented as a parallel cover tree descent for a single element, in which case a low-contention, thread-aware heap for the query's KNN candidates would be required. Another possibility is to adapt the cover tree merging algorithm given by Izbicki and Shelton (2015) in order to efficiently construct and merge batches of points.

Chapter 7

Conclusion

In a world where data storage, search, and processing needs are growing faster than ever, the development of highly efficient similarity search methods is increasingly relevant. Requirements for similarity search include the ability to index large datasets quickly for search; improve search indices based on the distribution of the data; control the trade-off between recall and query speed easily; and efficiently build and modify large similarity-search-based structures such as k-nearest-neighbour graphs. This thesis presents a series of investigations into these areas with a particular focus on improving the performance of large-scale similarity search. In each investigation, a new method for addressing this area has been proposed and thoroughly tested. This has resulted in the development of several similarity-search-based algorithms which give state-of-the-art performance in their intended context.

7.1 Subsampled Locality-Sensitive Hashing (Chapter 3)

Investigating a dimensionality-free hashing technique for similarity search on high-dimensional data resulted in Subsampled LSH. This technique is simple and efficient, reducing hash-table construction times by over an order of magnitude in many cases. Although SLSH does not scale as well to large datasets or high bit-sizes as random projection based LSH, this disadvantage is practically offset by its faster hashing method. The main advantages

of SLSH are its extremely fast construction and evaluation times. These make SLSH most useful for either ad-hoc or very high-dimensional searches.

A promising path of investigation which may improve the performance of SLSH while maintaining high efficiency is the work of Allen-Zhu et al. (2014). This work gives an efficient method for creating controlled sparse sign-consistent projections which are optimised to reduce reconstruction error. The method is very similar to the sparse solution that has been given for SLSH, although the choice of dimensions can be directly optimised for higher performance. Several challenges, including how to instantiate an entire hash family from this method, remain to be solved; however, it may give better performance guarantees at a similar computational cost to SLSH.

7.2 Approximate Boosted Similarity-Sensitive Collections (Chapter 4)

Motivated to improve the performance of SLSH, hash function collection optimisation strategies were examined. The most general and suitable algorithm, RDHF, had limitations when applied to SLSH, as hash-bits could not be re-used. This both makes the calculation of RDHF-generated hash functions potentially more expensive and limits the use of RDHF on SLSH to very high-dimensional data only (that is, for 100 tables at 32 bits, data must be at least 3200-dimensional). A secondary concern against RDHF and all boosting-inspired LSH optimisation algorithms is the potentially quadratic cost of accurately optimising LSH function collections for improved search.

These problems were addressed by the development of an approximate boosting framework, which reduces the computation time of RDHF significantly, and a simplified boosting baseline for collections of randomised similarity search structures, which outperforms RDHF, especially at lower levels of Hamming distance search. This gives a simple, robust baseline for universal similarity search optimisation to compare against, and is efficient enough that the cost of performing the optimisation is less than the performance gains made, even with a relatively small number of searches.

Future work incorporating and comparing techniques from DSH (Gao et al., 2014), SPLSH (Wang et al., 2012), and BSF (Li et al., 2011) could result in large performance improvements for these methods. The newer work of Liu et al. (2016) adds advanced complementary query methods to RDHF which might also provide further improvements in other cases. Another direction which would be very useful is the continued development of simple algorithms and heuristics which approach the performance of more sophisticated methods. The development of simpler algorithms enables better standard baselines and easier implementation of solutions which are close to state-of-the-art.

7.3 Approximate Cover Tree Search (Chapter 5)

The development of a simple extension to cover tree search which gives controlled, approximate results in exchange for much faster query speeds is a significant contribution to the literature surrounding cover trees, which are one of the highest-performance methods for exact high-dimensional search. The recall and precision results for approximate cover tree search show it to be extremely efficient at high recall levels, even compared to previous methods for approximate search on the same cover tree data structure. This, combined with the linear memory requirements of cover trees, makes approximate cover tree search an attractive method for fast large-scale data retrieval and search.

The model presented for approximate cover tree search is very coarse and does not even properly sample the rejection probabilities for exploring the cover tree. There is likely a significant amount of progress which can be made in terms of finding the most effective stochastic query methods and simulating multiple stochastic queries. Improving the probabilistic model of recall with regard to s is also an important topic, as the results given provide a sometimes very inefficient lower bound for guiding search settings.

7.4 Incremental k-NN Graph Construction (Chapter 6)

Motivated by the recent development of incremental Isomap methods for manifold learning (Gao and Liang, 2015), and the challenges for incremental k-nearest-neighbour based learning and navigation algorithms (Hafez and Loo, 2015, Laugier et al., 2008), a method for performing efficient k-NN graph updates has been developed. This method bounds search to relevant local parts of the graph, allowing fast and efficient sequential updates of the k-NN graph even when the graph size is 1 million nodes or more. Our method outperforms previous state-of-the-art algorithms by a large margin, leading to the possibility of real-time graph updates even when the number of graph nodes is in the millions. Our method also supports any metric distance measure, whereas previous algorithms only support vector spaces.

While the method developed is much more efficient than brute force and simpler approaches, there is still a significant amount of structure within both the graph and the cover tree which can be exploited. Furthermore, since the updates are processed sequentially, this algorithm cannot be easily parallelised in its current state, which means that less efficient update methods with the same upper-bound complexity are likely to win on highly parallel systems. Addressing these issues could result in a new best-in-class KNN graph construction algorithm for large-scale data.

7.5 Further Work

This work only touches on a few areas of large-scale similarity search, which is a quickly advancing field with many sub-topics and specialty areas. However, the methods presented in this thesis are often complementary to new state-of-the-art approaches (for example, more advanced LSH query mechanisms). As such, the work done on reducing boosting complexity for similarity search optimisation can also form a useful component to combine with newer methods to produce even more efficient search performance. Additionally, the work presented here can be used to create new, scalable versions of algorithms that are commonly used for data processing and analysis.

The development of a controlled approximate query method for Euclidean search on cover trees gave surprisingly good results when compared to other approximate methods. This area has many possibilities left to investigate in terms of approximate search, and so has the potential to become very competitive with state-of-the-art algorithms. Another very promising development of this could be even faster incremental k-NN graph algorithms using approximation to improve update speeds. Results have shown that query time can be halved at less than a 2% cost to recall for approximate cover tree search, so this could translate into similarly large performance improvements in incremental k-NN graphs.

The introduction of incremental k-NN graph algorithms is also significant in the field. This development, mainly inspired by the introduction of incremental Isomap methods (Gao and Liang, 2015), has the potential to be used to create a range of other incremental algorithms which might be of interest. One example could be the development of an incremental DBSCAN algorithm (Ester et al., 1996), which would provide real-time, iteratively revised clustering results for very large databases.

7.6 Final Remarks

This thesis has presented a series of algorithms for similarity search, and has achieved best-in-class results in several cases. These algorithms provide efficient and simple building blocks for use in advanced machine learning algorithms. The pursuit of better scalability in machine learning algorithms has motivated much of this work, leading to solutions with strengths tailored to match the needs of machine learning algorithms.

Subsampled LSH presents a hashing mechanism which is algorithmically faster than others, since it has a hashing complexity of O(b) compared to a complexity of O(b log(d)) for its fastest competitor (where b is the number of bits and d is the dimensionality). This change leads to a large improvement in search structure construction times, to the point where construction time is similar to search time in some cases.
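To illustrate why an O(b) hash is possible at all (independent of d once the coordinates are fixed), the sketch below shows a generic coordinate-subsampling hash. It is intended only as an illustration of the complexity argument and is not claimed to reproduce the exact SLSH construction developed in Chapter 3; all names and the zero thresholds are assumptions.

```python
import numpy as np

def make_subsampled_hash(dim, n_bits, thresholds=None, seed=None):
    """A generic O(b)-per-query subsampling hash: choose b input coordinates once,
    then threshold each at hash time; no projection over all d dimensions is needed."""
    rng = np.random.default_rng(seed)
    coords = rng.choice(dim, size=n_bits, replace=False)
    if thresholds is None:
        thresholds = np.zeros(n_bits)          # per-coordinate medians would be one practical choice
    def hash_fn(x):
        return (x[coords] > thresholds).astype(np.uint8)   # O(b) work per query
    return hash_fn
```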

Approximate boosting makes an algorithmic step towards reducing the number of operations required for optimisation by removing the need to update individual negative examples. SimpleBoost takes this a step further, and reduces the cost of boosting to O(n log(n)), compared to O(n^2) for competitors.

s-Approximate cover tree search uses a probabilistic estimate to prune low-return tree nodes during search, resulting in significantly fewer nodes being checked compared to backtracking search at the same recall level. The gains for approximate search at high recall levels are often almost an order of magnitude speed-up compared to exact search, making this a very practical and memory-efficient search method.

Since cover trees are known to be highly efficient data structures, the reverse k-NN method of Yu et al. (2010) is applied to create streaming k-NN graphs. The query and insert methods are then combined to reduce the number of distance calculations needed when adding values to the graph. This results in an incremental graph building algorithm which is far faster than previous results, and can scale across many orders of magnitude without losing its performance edge.

These solutions provide important, efficient building blocks to aid in solving larger machine learning and robotics problems. With more efficient similarity search, these systems and algorithms have increased access to recorded memory while running. This can be used to improve the scalability and speed of many machine learning and data analysis algorithms.

Appendix A

Additional Publications During PhD Candidacy

Over the course of my candidacy, I was involved in and contributed to several works which are not directly related to the main body of my thesis. These are listed here with a brief description of the topic and my contribution.

In Houliston et al. (2016), a new domain-specific language (DSL) for message passing systems in robots was presented, along with an architecture for competitive robotics using this DSL. I was a part of the ideas generation process and design of several key parts of the architecture, as well as a significant contributor to the content of the article.

In Walker and Chalup (2015) the primary results from my honours thesis were presented. These consisted of an investigation into using non-linear parameters for reinforcement algorithms for learning music.

In Fountain et al. (2014), a reinforcement learning method for humanoid robot head actuation was presented. This used motivated reinforcement learning to improve robot localisation by strategically targeting important landmarks. I contributed ideas for the design of the algorithms, as well as the experiments performed.

In Budden et al. (2013), a non-gradient direct policy learning algorithm is presented. It is then applied to optimising robot walk parameters for humanoid robots.

The new algorithm is demonstrated to be significantly more effective than prior work in the area. I contributed the algorithm design and code, and contributed to the testing methodology and experiments of the system.

In Budden et al. (2012), a simple and robust ball-finding algorithm for RoboCup soccer environments is presented. It is demonstrated to provide more accurate estimates of ball position and size than previous methods. I contributed to the implementation and testing setup for the experiments performed.

Bibliography

Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In Bussche, J. and Vianu, V., editors, Proceedings of The 8th International Conference on Database Theory, pages 420–434, Berlin, Heidelberg. Springer Berlin Heidelberg.

Allen-Zhu, Z., Gelashvili, R., Micali, S., and Shavit, N. (2014). Sparse sign-consistent Johnson-Lindenstrauss matrices: Compression with neuroscience-based constraints. Proceedings of the National Academy of Sciences, 111(47):16872–16876.

Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M. E., Kawarabayashi, K.-i., and Nett, M. (2015). Estimating local intrinsic dimensionality. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 29–38, New York, NY, USA. ACM.

Andoni, A., Indyk, P., Nguyen, H. L., and Razenshteyn, I. (2014). Beyond locality-sensitive hashing. In Chekuri, C., editor, Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’14, pages 1018–1028, Philadelphia, PA, USA. ACM-SIAM.

Andoni, A. and Razenshteyn, I. (2015). Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC ’15, pages 793–801, New York, NY, USA. ACM.

Auclair, A., Cohen, L. D., and Vincent, N. (2007). How to use SIFT vectors to analyze an image with database templates. In Boujemaa, N., Detyniecki, M., and Nürnberger, A., editors, Adaptive Multimedia Retrieval: Retrieval, User, and Semantics: 5th International Workshop, AMR 2007, Revised Selected Papers, volume 4918, pages 224–236. Springer.

Bawa, M., Condie, T., and Ganesan, P. (2005). LSH forest: Self-tuning indexes for similarity search. In Proceedings of the 14th International Conference on World Wide Web, WWW ’05, pages 651–660, New York, NY, USA. ACM.

Beygelzimer, A., Kakade, S., and Langford, J. (2006). Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 97–104, New York, NY, USA. ACM.

Bonin-Font, F., Carrasco, P. L. N., Burguera, A. B., and Codina, G. O. (2014). Lsh for loop closing detection in underwater visual slam. In Grau, A. and Martinez, H., editors, Emerging Technology and Factory Automation (ETFA), 2014 IEEE, pages 1–4. IEEE.

Buckingham, L., Hogan, J. M., Geva, S., and Kelly, W. (2014). Locality-sensitive hashing for protein classification. In Nayak, R., Li, X., Liu, L., Ong, K.-L., Zhao, Y., and Kennedy, P., editors, Australasian Data Mining Conference (AusDM 2014), Brisbane, 27-28 November 2014, volume 158 of Conferences in Research and Practice in Information Technology. Australian Computer Society, Inc.

Budden, D., Fenn, S., Walker, J., and Mendes, A. (2012). A novel approach to ball detection for humanoid robot soccer. In Advances in Artificial Intelligence (LNAI 7691). Springer.

Budden, D., Walker, J., Flannery, M., and Mendes, A. (2013). Probabilistic gradient ascent with applications to bipedal robot locomotion. In Australasian Conference on Robotics and Automation (ACRA).

Cai, D., He, X., Zhang, W. V., and Han, J. (2007). Regularized locality preserving indexing via spectral regression. In CIKM’07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 741–750, New York, NY, USA. ACM.

Celis, P., Larson, P. A., and Munro, J. I. (1985). Robin Hood hashing. In 26th Annual Symposium on Foundations of Computer Science (FOCS 1985), pages 281–288, Washington, DC, USA. IEEE.

Chen, C.-H. and Chan, Y.-P. (2007). SIFT-based monocular SLAM with inverse depth parameterization for robot localization. In IEEE Workshop on Advanced Robotics and Its Social Impacts (ARSO 2007), pages 1–6. IEEE.

Dasgupta, A., Kumar, R., and Sarlos, T. (2011). Fast locality-sensitive hashing. In KDD’11: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1073–1081, New York, NY, USA. ACM.

Dasgupta, S. and Sinha, K. (2013). Randomized partition trees for exact nearest neighbor search. In Shalev-Shwartz, S. and Steinwart, I., editors, COLT, volume 30 of JMLR Workshop and Conference Proceedings, pages 317–337. JMLR.org.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG ’04, pages 253–262, New York, NY, USA. ACM.

Dean, T., Ruzon, M. A., Segal, M., Shlens, J., Vijayanarasimhan, S., and Yagnik, J. (2013). Fast, accurate detection of 100,000 object classes on a single machine. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1814–1821. IEEE.

Dong, W., Wang, Z., Josephson, W., Charikar, M., and Li, K. (2008). Modeling LSH for performance tuning. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 669–678, New York, NY, USA. ACM.

Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231. AAAI Press.

Fischer, P., Pohl, T., and Hornegger, J. (2014). Real-time respiratory signal extraction from x-ray sequences using incremental manifold learning. In 11th International Symposium on Biomedical Imaging (ISBI), pages 915–918. IEEE.

Fountain, J., Walker, J., Budden, D., Mendes, A., and Chalup, S. K. (2014). Motivated reinforcement learning for improved head actuation of humanoid robots. In RoboCup 2013: Robot World Cup XVII, volume 8371 of Lecture Notes in Artificial Intelligence (LNAI), pages 268–279. Springer.

Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML), pages 148–156.

Friedman, J. H., Bentley, J. L., and Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226.

Gao, J., Jagadish, H. V., Lu, W., and Ooi, B. C. (2014). DSH: Data Sensitive Hashing for High-dimensional K-NN search. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 1127–1138, New York, NY, USA. ACM.

Gao, X. and Liang, J. (2015). An improved incremental nonlinear dimensionality reduction for isometric data embedding. Information Processing Letters, 115(4):492–501.

García-Pedrajas, N. and Ortiz-Boyer, D. (2009). Boosting k-nearest neighbor classifier by means of input space projection. Expert Systems with Applications, 36(7):10570–10582.

Ge, T., He, K., Ke, Q., and Sun, J. (2014). Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):744–755.

Gersho, A. and Gray, R. M. (1991). Vector Quantization and Signal Compression. Kluwer Academic Publishers, Norwell, MA, USA.

Goldstein, J. and Ramakrishnan, R. (2000). Contrast plots and p-sphere trees: Space vs. time in nearest neighbour searches. In Proceedings of the 26th International Conference on Very Large Data Bases, VLDB ’00, pages 429–440, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Gong, Y. and Lazebnik, S. (2011). Iterative quantization: A procrustean approach to learning binary codes. In CVPR’11: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 817–824, Washington, DC, USA. IEEE Computer Society.

Gorisse, D., Cord, M., and Precioso, F. (2012). Locality-sensitive hashing for Chi2 distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2):402–409.

Hafez, M. B. and Loo, C. K. (2015). Topological Q-learning with internally guided exploration for mobile robot navigation. Neural Computing and Applications, 26(8):1939–1954.

Houliston, T., Fountain, J., Lin, Y., Mendes, A., Metcalfe, M., Walker, J., and Chalup, S. K. (2016). NUClear: A loosely coupled software architecture for humanoid robot systems. Manuscript in preparation.

Iba, W. and Langley, P. (1992). Induction of one-level decision trees. In Sleeman, D. H. and Edwards, P., editors, ML’92: Proceedings of the Ninth International Workshop on Machine Learning, pages 233–240, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC’98: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613, New York, NY, USA. ACM.

Izbicki, M. and Shelton, C. R. (2015). Faster cover trees. In Proceedings of the Thirty-Second International Conference on Machine Learning, pages 1162–1170. JMLR.org.

Jegou, H., Douze, M., and Schmid, C. (2011a). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128.

Jegou, H., Douze, M., and Schmid, C. (2011b). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128.

Jockusch, J. and Ritter, H. (1999). An instantaneous topological mapping model for correlated stimuli. In International Joint Conference on Neural Networks, volume 1 of IJCNN ’99, pages 529–534. IEEE.

Kalantidis, Y. and Avrithis, Y. (2014). Locally optimized product quantization for approximate nearest neighbor search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pages 2329–2336, Los Alamitos, CA, USA. IEEE Computer Society.

Kibriya, A. and Frank, E. (2007). An empirical comparison of exact nearest neighbour algorithms. In Kok, J., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., and Skowron, A., editors, Knowledge Discovery in Databases: PKDD 2007, volume 4702 of Lecture Notes in Computer Science, pages 140–151. Springer-Verlag, Berlin, Heidelberg.

Kong, W. and Li, W.-J. (2012a). Double-bit quantization for hashing. In Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 634–640. Association for the Advancement of Artificial Intelligence, AAAI Press.

Kong, W. and Li, W.-J. (2012b). Isotropic hashing. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 1646–1654. Curran Associates, Inc.

Krauthgamer, R. and Lee, J. R. (2004). Navigating nets: Simple algorithms for proximity search. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’04, pages 798–807, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master’s thesis, Computer Science Department, University of Toronto, Toronto, Canada.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Kumar, R. D. and Ramareddy, K. (2010). A simple region descriptor based on object area per scan line. International Journal of Computer Applications, 3(7):24–27.

Kushilevitz, E., Ostrovsky, R., and Rabani, Y. (2000). Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing, 30(2):457–474.

Laugier, C., Vasquez, D., Yguel, M., Fraichard, T., and Aycard, O. (2008). Geometric and Bayesian models for safe navigation in dynamic environments. Intelligent Service Robotics, 1(1):51–72.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Li, P., Mitzenmacher, M., and Shrivastava, A. (2014). Coding for random projections. In Xing, E. P. and Jebara, T., editors, Proceedings of The 31st International Conference on Machine Learning, pages 676–684. JMLR.org.

Li, S. (2011). Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics & Statistics (AJMS), 4(1):66–70.

Li, Z., Ning, H., Cao, L., Zhang, T., Gong, Y., and Huang, T. S. (2011). Learning to search efficiently in high dimensions. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 24 (NIPS 2011), pages 1710–1718. Curran Associates, Inc.

Liu, C. and Cao, L. (2015). A coupled k-nearest neighbor algorithm for multi-label classification. In Cao, T., Lim, E., Zhou, Z., Ho, T., Cheung, D. W., and Motoda, H., editors, Advances in Knowledge Discovery and Data Mining - 19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part I, volume 9077 of Lecture Notes in Computer Science, pages 176–187. Springer.

Liu, T., Moore, A. W., Yang, K., and Gray, A. G. (2005). An investigation of practical approximate nearest neighbor algorithms. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17 (NIPS 2004), pages 825–832. MIT Press.

Liu, X., Deng, C., Lang, B., Tao, D., and Li, X. (2016). Query-adaptive reciprocal hash tables for nearest neighbor search. IEEE Transactions on Image Processing, 25(2):907–919.

Liu, X., He, J., and Lang, B. (2013). Reciprocal hash tables for nearest neighbor search. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 626–632. Association for the Advancement of Artificial Intelligence, AAAI Press.

Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, volume 2, ICCV ’99, pages 1150–1157, Washington, DC, USA. IEEE Computer Society.

Lv, Q., Charikar, M., and Li, K. (2004). Image similarity search with compact data structures. In CIKM’04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 208–217, New York, NY, USA. ACM.

Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. (2007). Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd International Conference on Very Large Databases (VLDB), pages 950–961. VLDB Endowment.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.

Martín H., J. A., Lope, J., and Maravall, D. (2009). The kNN-TD reinforcement learning algorithm. In Proceedings of the 3rd International Work-Conference on The Interplay Between Natural and Artificial Computation: Part I: Methods and Models in Artificial and Natural Computation. A Homage to Professor Mira’s Scientific Legacy, IWINAC ’09, pages 305–314, Berlin, Heidelberg. Springer-Verlag.

Moran, S., Lavrenko, V., and Osborne, M. (2013). Variable bit quantisation for LSH. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, volume 2: Short Papers, pages 753–758, Sofia, Bulgaria. Association for Computational Linguistics.

Morton, G. M. (1966). A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, Canada.

Paulevé, L., Jégou, H., and Amsaleg, L. (2010). Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 31(11):1348–1358.

Pavan, M. and Pelillo, M. (2007). Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):167–172.

Pearson, K. (1968). Tables of the incomplete beta-function. Cambridge University Press, Cambridge, UK.

Raginsky, M. and Lazebnik, S. (2009). Locality-sensitive binary codes from shift-invariant kernels. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A., editors, Advances in Neural Information Processing Systems 22 (NIPS 2009), pages 1509–1517. Curran Associates, Inc.

Rajaraman, A. and Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge University Press, New York, NY, USA.

Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326.

Russell, B. C., Torralba, A., Murphy, K. P., and Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173.

Sanderson, C. (2010). Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments. Technical report, NICTA.

Sato, I., Ambai, M., and Suzuki, K. (2013). Sparse isotropic hashing. Information and Media Technologies, 8(4):1031–1035.

Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pages 149–171. Springer-Verlag, Berlin, Heidelberg.

Shahbazi, H. and Zhang, H. (2011). Application of locality sensitive hashing to real-time loop closure detection. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1228–1233. IEEE.

Shakhnarovich, G. (2005). Learning Task-Specific Similarity. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA.

Shen, F., Shen, C., Shi, Q., van den Hengel, A., Tang, Z., and Shen, H. T. (2015). Hashing on nonlinear manifolds. IEEE Transactions on Image Processing, 24(6):1839–1851.

Shi, L., He, P., and Liu, E. (2005). An incremental nonlinear dimensionality reduction algorithm based on isomap. In Zhang, S. and Jarvis, R., editors, AI 2005: Advances in Artificial Intelligence, volume 3809 of Lecture Notes in Computer Science, pages 892–895. Springer-Verlag, Berlin, Heidelberg.

Siagian, C. and Itti, L. (2007). Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):300–312.

Sparck Jones, K. (1988). A statistical interpretation of term specificity and its application in retrieval. In Willett, P., editor, Document Retrieval Systems, pages 132–142. Taylor Graham Publishing, London, UK.

Stein, P. (1966). A note on the volume of a simplex. The American Mathematical Monthly, 73(3):299–301.

Sundaram, N., Turmukhametova, A., Satish, N., Mostak, T., Indyk, P., Madden, S., and Dubey, P. (2013). Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment, 6(14):1930–1941.

Sutherland, W. A. (1975). Introduction to Metric and Topological Spaces. Oxford University Press, Oxford, UK.

Talwalkar, A., Kumar, S., and Rowley, H. (2008). Large-scale manifold learning. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

Tellez, E. S. and Chavez, E. (2010). On locality sensitive hashing in metric spaces. In Proceedings of the Third International Conference on SImilarity Search and APplications, SISAP ’10, pages 67–74, New York, NY, USA. ACM.

Tenenbaum, J. B., De Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.

Torgerson, W. S. (1958). Theory and Methods of Scaling. Wiley.

van der Maaten, L. and Hinton, G. E. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Walker, J. and Chalup, S. K. (2015). Learning nursery rhymes using adaptive parameter neurodynamic programming. In Chalup, S. K., Blair, A. D., and Randall, M., editors, Artificial Life and Computational Intelligence, First Australasian Conference, ACALCI 2015, Newcastle, NSW, Australia, February 5-7, 2015. Proceedings, volume 8955 of Lecture Notes in Artificial Intelligence (LNAI), pages 196–209. Springer International Publishing.

Wang, J., Kumar, S., and Chang, S.-F. (2012). Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2393–2406.

Weiss, Y., Torralba, A., and Fergus, R. (2009). Spectral hashing. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1753–1760. Curran Associates, Inc.

Woodbridge, J., Mortazavi, B., Bui, A. A. T., and Sarrafzadeh, M. (2012). High performance biomedical time series indexes using salient segmentation. In 34th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC’12), pages 5086–5089. IEEE.

Xu, W., Liu, X., and Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In SIGIR ’03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273, New York, NY, USA. ACM.

Yang, C., Yu, X., and Liu, Y. (2014). Continuous kNN join processing for real-time recommendation. In Kumar, R., Toivonen, H., Pei, J., Huang, J. Z., and Wu, X., editors, Proceedings of the 2014 IEEE International Conference on Data Mining, ICDM ’14, pages 640–649, Washington, DC, USA. IEEE Computer Society.

Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’93, pages 311–321, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.

Yu, C., Zhang, R., Huang, Y., and Xiong, H. (2010). High-dimensional kNN joins with incremental updates. GeoInformatica, 14(1):55–82.

Zobel, J. and Moffat, A. (2006). Inverted files for text search engines. ACM Computing Surveys (CSUR), 38(2):1–56.

Zolnowsky, J. E. (1978). Topics in computational geometry. Technical Report STAN-CS-78-659, Stanford University, Stanford, CA, USA.