IMPROVING DUAL-TREE ALGORITHMS

A Dissertation Presented to The Academic Faculty

By

Ryan R. Curtin

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in Electrical and Computer Engineering

School of Electrical and Computer Engineering Georgia Institute of Technology December 2015

Copyright © 2015 by Ryan R. Curtin

Approved by:

Dr. David V. Anderson, Committee Chair
Associate Professor, School of Electrical and Computer Engineering
Georgia Institute of Technology

Dr. Mark A. Clements
Associate Professor, School of Electrical and Computer Engineering
Georgia Institute of Technology

Dr. Charles L. Isbell, Jr., Advisor
Professor, School of Interactive Computing
Georgia Institute of Technology

Dr. Polo Chau
Assistant Professor, School of Computational Science and Engineering
Georgia Institute of Technology

Dr. Richard W. Vuduc
Associate Professor, School of Computational Science and Engineering
Georgia Institute of Technology

Date Approved: August 18, 2015

This document is dedicated solely to my cats, who do not and will not ever have the capacity to understand even the title of this manuscript, and who, thanks to domestication, are actually entirely incapable of leading any sort of autonomous lifestyle and thus are mortally dependent on my completion of simple maintenance tasks.

TABLE OF CONTENTS

LIST OF TABLES ...... vii

LIST OF FIGURES ...... ix

CHAPTER 1 THE POINT ...... 1

CHAPTER 2 INTRODUCTION ...... 3
2.1 An abridged history of statistical computing ...... 3
2.2 A less abridged history of the development of trees ...... 4
2.3 The explosion of single-tree algorithms ...... 10
2.4 A smorgasbord of trees ...... 12
2.5 The fast multipole method and query amortization ...... 14
2.6 Redirection to statistics and dual-tree algorithms ...... 17

CHAPTER 3 TREE-INDEPENDENT DUAL-TREE ALGORITHMS ...... 23
3.1 A bibliographical note ...... 23
3.2 The goal: unification of dual-tree algorithms ...... 23
3.3 Space trees ...... 24
3.4 Space tree notation ...... 28
3.5 Bounding quantities with space trees ...... 29
3.6 A quick survey of some space trees ...... 31
3.6.1 The quad-tree, octree, and hyperoctree ...... 32
3.6.2 The kd-tree ...... 34
3.6.3 The ball tree ...... 34
3.6.4 The metric tree / vantage-point tree ...... 36
3.6.5 The cover tree ...... 37
3.7 Traversals and problem-specific rules ...... 39
3.8 A meta-algorithm to produce a dual-tree algorithm ...... 44

CHAPTER 4 MLPACK: A FLEXIBLE C++ FRAMEWORK ...... 47
4.1 A survey of the landscape of libraries ...... 47
4.2 The Armadillo linear algebra library ...... 49
4.3 Template paradigms for fast, generic code ...... 50
4.4 Design principles of mlpack ...... 52
4.4.1 Scalable and fast machine learning algorithms ...... 52
4.4.2 Intuitive, consistent, and simple API ...... 54
4.4.3 Current functionality of mlpack ...... 56
4.5 Tree-independent dual-tree algorithms in mlpack ...... 58
4.5.1 The TreeType policy ...... 58
4.5.2 The RuleType policy ...... 63
4.5.3 The TraversalType policy ...... 65
4.5.4 Assembling a dual-tree algorithm in mlpack ...... 66

CHAPTER 5 TREES ...... 68
5.1 Free parameters in the cover tree ...... 68
5.1.1 The cover tree: a rehash ...... 69
5.1.2 The expansion constant ...... 71
5.1.3 Root point selection policy ...... 73
5.1.4 Correlation of tree width to performance ...... 75
5.2 Cover tree runtime bounds ...... 78
5.2.1 Tree imbalance ...... 79
5.2.2 General runtime bound ...... 82
5.3 An issue with the cover tree single-tree runtime bound proof ...... 92

CHAPTER 6 TRAVERSALS ...... 95
6.1 Improved dual depth-first traversal ...... 95
6.1.1 Prioritized recursions and ...... 97
6.1.2 Delaying reference recursion ...... 99
6.1.3 Experimental evaluation ...... 102

CHAPTER 7 ALGORITHMS ...... 106
7.1 Nearest neighbor search ...... 106
7.1.1 A tree-independent dual-tree algorithm ...... 107
7.1.2 Correctness proof ...... 109
7.1.3 Specialization to existing k-NN algorithms ...... 111
7.1.4 Runtime bounds ...... 112
7.2 Range search ...... 115
7.2.1 A tree-independent dual-tree algorithm ...... 116
7.2.2 Runtime bound ...... 117
7.3 Kernel density estimation ...... 121
7.3.1 Dual-tree algorithm for absolute-value approximation ...... 122
7.3.2 Absolute-value approximate KDE runtime bounds ...... 124
7.3.3 Relative Value Approximation ...... 126
7.3.4 Runtime bounds for relative value approximate KDE ...... 127
7.4 Minimum spanning tree calculation ...... 128
7.5 Sparse kernel matrix approximation ...... 130
7.5.1 Sparsity in kernel matrices ...... 132
7.5.2 Related work on kernel matrix approximation ...... 134
7.5.3 A dual-tree algorithm ...... 135
7.5.4 Correctness proof ...... 137
7.5.5 Application to kernel PCA ...... 138
7.5.6 Theoretical results ...... 139
7.5.7 Empirical results for kernel PCA ...... 142
7.5.8 Extensions ...... 144
7.5.9 Application to other kernel methods ...... 145
7.5.10 Discussion ...... 145
7.6 Gaussian mixture model training ...... 147
7.6.1 Problem introduction ...... 147

7.6.2 A generalized single-tree algorithm ...... 148
7.6.3 Possible improvements and extensions ...... 152
7.7 Max-kernel search ...... 153
7.7.1 Introduction to max-kernel search ...... 154
7.7.2 Related work ...... 157
7.7.3 Unnormalized kernels ...... 159
7.7.4 Analysis of the problem ...... 160
7.7.5 Indexing points in H ...... 163
7.7.6 Bounding the kernel value ...... 165
7.7.7 Single-tree max-kernel search ...... 171
7.7.8 Dual-tree fast max-kernel search ...... 174
7.7.9 Dual-tree algorithm runtime analysis ...... 176
7.7.10 Extensions for approximate max-kernel search ...... 182
7.7.11 Empirical evaluation ...... 187
7.7.12 Future directions for max-kernel search ...... 194
7.7.13 Wrap-up for max-kernel search ...... 195
7.8 k-means clustering ...... 196
7.8.1 Introduction ...... 196
7.8.2 Scaling k-means ...... 197
7.8.3 The blacklist algorithm and trees ...... 199
7.8.4 Pruning strategies ...... 199
7.8.5 The dual-tree k-means algorithm ...... 203
7.8.6 Theoretical results ...... 210
7.8.7 Experiments ...... 222
7.8.8 Future directions ...... 225

CHAPTER 8 CONCLUSION AND FUTURE DIRECTIONS ...... 227

REFERENCES ...... 230

LIST OF TABLES

1 Notation for trees. See text for details...... 29

2 Properties of hyperoctrees...... 33

3 Properties of kd-trees...... 34

4 Properties of ball trees...... 35

5 Properties of vp-trees...... 37

6 Properties of cover trees...... 39

7 mlpack benchmark dataset sizes...... 53

8 All-k-nearest neighbor benchmarks (in seconds)...... 53

9 k-means benchmarks (in seconds)...... 54

10 Runtime statistics for different root point policies...... 74

11 Build-time statistics for different root point policies...... 74

12 Empirically calculated tree imbalances...... 82

13 Dataset information...... 102

14 Runtime (distance evaluations) for exact nearest neighbor search. . . . . 103

15 Runtime (distance calculations) [ or M/W] for approximate NN search. . 104

16 Image denoising performance on the USPS dataset as a function of σ... 133

17 Datasets used for kernel PCA experiments...... 142

18 Results for Epanechnikov kernel...... 143

19 Results for Gaussian kernel...... 144

20 Vector dataset details; |S q| and |S r| denote the number of objects in the query and reference sets respectively and dims denotes the dimensionality of the sets...... 188

21 Single-tree and dual-tree FastMKS on vector datasets with k = 1, part one.191

22 Single-tree and dual-tree FastMKS on vector datasets with k = 1, part two.192

23 Single-tree and dual-tree FastMKS on protein sequences with k = 1. . . . 193

24 Runtime and memory bounds for k-means algorithms...... 198

25 Dataset information for dual-tree k-means...... 223

26 Empirical results for k-means...... 224

LIST OF FIGURES

1 Abstract representation of an example quadtree...... 5

2 Geometric representation of the same example quadtree...... 6

3 Abstract representation of an example kd-tree...... 8

4 Geometric representation of the same example kd-tree (top two levels)...... 8

5 Geometric representation of the same example kd-tree (bottom two levels)...... 9

6 Abstract representation of an example kd-tree...... 25

7 Geometric representation of the same example kd-tree...... 26

8 Parameterized representation of the same example kd-tree...... 27

9 Another example space tree...... 27

10 A bound on the minimum distance between a point pq and a node Ni... 30

11 A bound on the minimum distance between a node Ni and a node Nj...... 31

12 Three levels of an example quadtree with a leaf size of 3...... 32

13 kd-tree and vp-tree space decompositions...... 37

14 Methods required by TreeType class (part one)...... 59

15 Methods required by TreeType class (part two)...... 60

16 Example StatisticType class...... 61

17 Example specialization of TreeTraits...... 62

18 Recursing into the children of a node...... 63

19 Compile-time specialization of recursion with TreeTraits...... 63

20 Required API for RuleType classes...... 63

21 Example TraversalType class...... 65

22 Code to run a sample dual-tree algorithm in mlpack...... 66

23 Fritz and Drusilla...... 67

24 Tree performance related to the average number of children per node, for the cloud dataset...... 75

25 Tree performance related to the average number of children per node, for the sat-train dataset...... 76

26 Tree performance related to the average number of children per node, for the winequality dataset...... 77

27 A pectinate tree...... 78

28 Balanced and imbalanced cover trees...... 79

29 Single-outlier cover tree...... 80

30 A multiple-outlier cover tree...... 80

31 Different situations for recursion...... 99

32 Progression of Borůvka's algorithm...... 129

33 A standard kernel PCA example...... 131

34 Typical reconstruction; top: clean data, middle: noisy, bottom: after KPCA with σ = 600...... 133

35 Matching images: an example of max-kernel search...... 155

36 Concentration of projections...... 162

37 Point-to-node max-kernel upper bound...... 165

38 Node-to-node max-kernel upper bound...... 167

39 Speedups of single-tree and dual-tree FastMKS over linear scan with k = {1, 2, 5, 10}...... 189

40 Speedups of single-tree and dual-tree FastMKS over linear scan for protein sequences with k = {1, 2, 5, 10}...... 194

41 Different pruning situations...... 201

CHAPTER 1 THE POINT

This large body of work is entirely centered around dual-tree algorithms, a class of algorithm based on spatial indexing structures that often provide large amounts of acceleration for various problems. This work focuses on understanding dual-tree algorithms using a new, tree-independent abstraction, and using this abstraction to develop new algorithms.

Stated more clearly, the thesis of this entire work is that we may improve and expand the class of dual-tree algorithms by focusing on and providing improvements for each of the three independent components of a dual-tree algorithm: the type of space tree, the type of pruning dual-tree traversal, and the problem-specific BaseCase() and Score() functions.

I demonstrate this by expressing many existing dual-tree algorithms in the tree-independent framework, and focusing on improving each of these three pieces.

After historical trivia and an introduction to trees in Chapter 2, Chapter 3 introduces the tree-independent dual-tree algorithm abstraction and notation which will be used throughout the document. Chapter 4 describes mlpack, the C++ machine learning library in which most of the advancements in this thesis are implemented. Then, the focus turns to each of the three components of dual-tree algorithms; Chapter 5 focuses on trees, Chapter 6 focuses on pruning dual-tree traversals, and Chapter 7 (a much longer chapter) focuses on new or improved dual-tree algorithms to solve various tasks that are generally related to machine learning. Finally, Chapter 8 concludes the work and sets the stage for all of the potential interesting directions I was not able to consider during my work on this thesis.

An important note is that this document is probably not best read cover-to-cover. Only an insane person would do that. Instead, the thesis is more effectively used as a piecemeal reference. As a result, most sections are readable as standalone sections, and where necessary they will reference previous chapters or sections. Therefore, reading about a particular algorithm can generally be done in a depth-first manner, following link chains to learn all

necessary background. Nonetheless, the document is arranged such that it could be read serially, in order to appease any readers who are insane.

CHAPTER 2 INTRODUCTION

2.1 An abridged history of statistical computing

Statistical computing as a field first became relevant in the 1920s and 1930s with the widespread adoption of early IBM punched card tabulators [1], after their initial introduction in the late 1800s [2]. These machines made the computation of statistics on non-trivial sets of data feasible (such as the early iris flower dataset by Fisher [3]). Then, the 1940s saw the invention of the transistor [4] and digital computing machines [5], allowing general-purpose computational engines to be easily available from the 1950s onwards [6].

These advancements allowed entirely new types of questions to be answered: data-driven questions. One of the earliest of these to gain popularity was the nearest-neighbor distance [7, 8, 9], which led to the well-known nearest neighbor rule for classification [10]. Numerous other data-driven algorithms for various tasks appeared: neural networks [11], minimum spanning trees [12], the fast Fourier transform [13], maximum-likelihood estimation [14], density estimation [15], sorting [16], matrix decompositions [17], and an enormous host of algorithms for countless tasks.

However, the size of the problems continued to grow. The advent of the microprocessor in 1971 [18], the introduction of the "1977 Trinity" [19]—the Apple II, the Commodore PET, and the TRS-80—and the fulfillment of Moore's law [20] meant that an entirely new generation of programmers and scientists could apply the algorithms of the 1960s and 1970s to continually larger and more difficult problems as computational barriers were demolished.

The trends of increasing computational power leading to increasing dataset size in an interesting positive feedback loop are still in effect today. Virtually every presentation and publication in the field of machine learning, data mining, and statistical analysis has the same introduction depicting the "data deluge" as a giant computational problem that is increasingly insurmountable—except with the methods described in that particular publication, of course. Even the popular media has latched onto this phenomenon, with numerous articles devoted to "big data" [21, 22, 23].

Still, one thing that redundant presentation introductions and media outlets alike all have correct is that computational advances are continually pushing the bounds of dataset sizes upwards. This highlights the ever-increasing importance of algorithms that scale well with dataset size, which justifies the study and development of scalable algorithms—and that is the focus of this thesis.

2.2 A less abridged history of the development of trees

When considering large datasets, there are two commonly-employed general approaches: sampling and trees1. The sampling school of thought states that not all of the data is necessary: only some small amount of the data is necessary to obtain an approximate solution. This approach is heavily used in the kernel methods subgenre of the machine learning community [24, 25]. The tree-based school of thought states that a dataset may be represented hierarchically: we may select a few points that represent the data at a very high level, then some points that represent the data at a medium level, then many points that reflect the data at a low level, then finally the data itself at the lowest level. This thesis is concerned solely with understanding and improving the second strategy; therefore, an in-depth discussion of sampling approaches is not found here. That may be found elsewhere [26, 27]. In this section, we discuss the history of tree-based algorithms with an eye towards the generalized tree-based algorithm framework that this thesis is largely based on [28].

In the previous section, the nearest-neighbor rule for classification was briefly mentioned [10]. This task is our jumping-off point for trees, so let us consider it formally2. We are given some reference dataset S r full of reference points pr ∈ S r; each reference point pr is associated with some class cr. We are also given some query point pq. Our task is to predict cq.

1Surely I have not considered every possible approach, but these two are quite standard. It is also worth noting that these approaches are not exclusive: for instance, one may build a tree on sampled data.
2This statement of the problem is not true to the original notation. But it is consistent with the rest of the document.

The nearest neighbor rule for classification states that cq = cnn, where cnn corresponds to pnn, which is the nearest neighbor of pq in S r:

p_{nn} = \arg\min_{p_r \in S_r} d(p_q, p_r) \qquad (1)

for some distance metric d(·, ·). At first glance—and in the first implementation—pnn is simply calculated by iterating over all points in S r and saving the best result. If the size of S r is N, this takes O(N) time per query point pq. While modern computing equipment runs this algorithm in reasonable time for N in the hundreds of thousands, larger datasets present severe computational challenges, and if there exists a sizeable query S q instead of just a single query point pq, the scaling issues are even more severe.

In 1974, Finkel and Bentley [29] proposed a multidimensional binary space tree called a 'quadtree', in the context of information retrieval systems and databases. The quadtree is a hierarchical indexing structure that requires the data S r to lie in two dimensions. The top level (the root) of a quadtree corresponds to a square which encompasses the entire dataset S r. The root has up to four children, each corresponding to the four half-size squares that fit in the square represented by the root. Each of these children is split in the same way recursively until the node contains at most some specified number of points (call this the leaf size), and these leaf nodes contain each of the points in S r which lie in the square represented by the leaf node.
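To make this structure concrete, below is a minimal C++ sketch of a quadtree node of the type just described, assuming two-dimensional points stored as std::array<double, 2>. The type and member names (QuadtreeNode, halfWidth, leafSize, and so forth) are hypothetical and chosen only for illustration; this is a sketch of the idea, not code from any particular library.

#include <array>
#include <memory>
#include <vector>

using Point = std::array<double, 2>;

// A minimal quadtree node: a square region with up to four half-size children.
struct QuadtreeNode {
  Point center;      // Center of the square this node represents.
  double halfWidth;  // Half of the square's side length.
  std::vector<Point> points;                      // Points held (leaves only).
  std::array<std::unique_ptr<QuadtreeNode>, 4> children;

  QuadtreeNode(const Point& c, double hw, std::vector<Point> pts,
               size_t leafSize = 3) : center(c), halfWidth(hw)
  {
    if (pts.size() <= leafSize)
    {
      points = std::move(pts);  // This node is a leaf; hold the points.
      return;
    }

    // Partition the points among the four half-size child squares.
    std::array<std::vector<Point>, 4> childPoints;
    for (const Point& p : pts)
    {
      const size_t q = (p[0] >= c[0] ? 1 : 0) + (p[1] >= c[1] ? 2 : 0);
      childPoints[q].push_back(p);
    }

    // Create only the children that actually hold points.
    const double offs[4][2] = { {-0.5, -0.5}, {0.5, -0.5}, {-0.5, 0.5}, {0.5, 0.5} };
    for (size_t q = 0; q < 4; ++q)
    {
      if (childPoints[q].empty())
        continue;
      const Point childCenter = { c[0] + offs[q][0] * hw, c[1] + offs[q][1] * hw };
      children[q] = std::make_unique<QuadtreeNode>(childCenter, hw / 2,
          std::move(childPoints[q]), leafSize);
    }
  }
};

Note that, just as in the text, children whose squares would contain no points are simply never created.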

A quadtree may be best explained visually; to that end, Figure 1 displays the abstract representation of an example quadtree, and Figure 2 displays the representation of three levels in R2 of the same example quadtree.

Figure 1: Abstract representation of an example quadtree.

(a) Points in dataset. (b) Root level. (c) Second level. (d) Third level.
Figure 2: Geometric representation of the same example quadtree.

The points (Figure 2a) all lie in the square represented by the root node N15 (Figure 2b). Then, N15 is split into four children: N11, N12, N13, and N14; each of these corresponds to a square with half the side length as N15. Then, if the node contains more than three points3, it is split again, yielding the lowest level of nodes, shown in Figure 2d (note that in this last figure, the points held in N14 are not shown). The node which would be the fourth child of N13 does not hold any points—therefore, it does not need to be a part of the tree. Quadtrees built on larger or different datasets have the same type of structure.

3Three, the leaf size here, is selected arbitrarily. Different choices are of course possible.

Finding the nearest neighbor of S r using a quadtree amounts to a depth-first search with backtracking where at any time the nearest neighbor candidate p̂nn is cached4. Then, if the minimum distance between pq and the square represented by any node Ni is greater than d(pq, p̂nn), then no descendant point of Ni can possibly hold the nearest neighbor of pq, and the search does not need to recurse into any children of Ni. In this way, the search for pnn is greatly accelerated and takes far less than O(N) time for a single query point.
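As a rough sketch of that pruning rule, the minimum possible distance between a query point and a node's square can be computed in constant time, and the node can be skipped whenever that bound exceeds the current candidate distance. The code below reuses the hypothetical QuadtreeNode and Point types from the previous sketch; it is only an illustration under those same assumptions, not an existing implementation.

#include <algorithm>
#include <cmath>

// Lower bound on the distance between pq and any point inside the node's square.
double MinDistance(const Point& pq, const QuadtreeNode& node)
{
  double sq = 0.0;
  for (size_t d = 0; d < 2; ++d)
  {
    const double lo = node.center[d] - node.halfWidth;
    const double hi = node.center[d] + node.halfWidth;
    const double gap = std::max({ lo - pq[d], pq[d] - hi, 0.0 });
    sq += gap * gap;
  }
  return std::sqrt(sq);
}

// The prune check: if true, no descendant of the node can improve the candidate.
bool CanPrune(const Point& pq, const QuadtreeNode& node, double candidateDist)
{
  return MinDistance(pq, node) > candidateDist;
}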

The quadtree was later generalized, in 1980, to the octree [30], which works in three dimensions instead of two. Each node in an octree has eight children, instead of four. Further generalization to arbitrary dimensions is possible, but has a clear problem: in d dimensions, each node will have up to 2^d children.

As an effort to work around this unfavorable exponential dependence on dimension, in 1975, Bentley [31] proposed the kd-tree: this structure is far more effective in high-dimensional settings. The idea is simple and related to the quadtree: given a dataset S r ∈ R^d, we build a hierarchical structure where each node in the hierarchy corresponds to some region of R^d. In a kd-tree, each node Ni may have a left child and a right child. Splitting the region represented by a node Ni into the region represented by its left child Nl and right child Nr may be done many ways, but it always involves an axis-aligned split. This means choosing a dimension to split on (often dimensions are chosen sequentially or as the dimension with maximum data variance) and choosing a value to split on (often the median or mean of the data in the chosen split dimension). The left child will correspond to the region required to encompass the data with value less than the split value in the split dimension, and the right child will correspond to the region required to encompass the data with value greater than or equal to the split value in the split dimension.

4I am hand-waving here, but a detailed algorithm will be given shortly.
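The following C++ sketch builds such a tree by cycling through dimensions and splitting at the median; the names (KDTreeNode, Build, PointD) are hypothetical and chosen for illustration only, and other split strategies from the paragraph above (maximum-variance dimension, mean split) would be equally valid.

#include <algorithm>
#include <memory>
#include <vector>

using PointD = std::vector<double>;  // A point in R^d.

// A kd-tree node: an axis-aligned split, or a leaf holding points.
struct KDTreeNode {
  std::vector<PointD> points;  // Points held (leaves only).
  size_t splitDim = 0;         // Dimension of the axis-aligned split.
  double splitValue = 0.0;     // Value the split is made at.
  std::unique_ptr<KDTreeNode> left, right;
};

// Build a kd-tree by cycling through dimensions and splitting at the median.
std::unique_ptr<KDTreeNode> Build(std::vector<PointD> pts, size_t depth = 0,
                                  size_t leafSize = 3)
{
  auto node = std::make_unique<KDTreeNode>();
  if (pts.size() <= leafSize)
  {
    node->points = std::move(pts);  // Leaf: hold the points directly.
    return node;
  }

  const size_t d = pts[0].size();
  node->splitDim = depth % d;  // Dimensions chosen sequentially here.

  // Partition around the median in the chosen dimension.
  const size_t mid = pts.size() / 2;
  std::nth_element(pts.begin(), pts.begin() + mid, pts.end(),
      [&](const PointD& a, const PointD& b)
      { return a[node->splitDim] < b[node->splitDim]; });
  node->splitValue = pts[mid][node->splitDim];

  // Left child gets the smaller half; right child gets the rest.
  std::vector<PointD> leftPts(pts.begin(), pts.begin() + mid);
  std::vector<PointD> rightPts(pts.begin() + mid, pts.end());
  node->left = Build(std::move(leftPts), depth + 1, leafSize);
  node->right = Build(std::move(rightPts), depth + 1, leafSize);
  return node;
}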

Figure 3: Abstract representation of an example kd-tree.

The kd-tree is visually described in Figure 3, as an abstract tree, and in Figures 4 and 5, in R2. Similar to the quadtree, the root node encompasses all of the points. The first split dimension is along the horizontal axis; points with horizontal axis value less than the split value are grouped into the left node (N12) and points with horizontal axis value greater than the split value are grouped into the right node (N13). Note that the bounding rectangles for the child nodes are the smallest bounding rectangles that contain all of the points, so they may be smaller than the split rectangles of the root.

(a) Top level. (b) Second level.
Figure 4: Geometric representation of the same example kd-tree (top two levels).

(a) Third level. (b) Lowest level (leaves).
Figure 5: Geometric representation of the same example kd-tree (bottom two levels).

For instance, node N1 (Figure 5b) contains only one point and is thus a rectangle with zero area (i.e. just a point).

The use of a kd-tree to perform nearest neighbor search is similar to the use of a quadtree. An algorithm is given in Algorithm 1. This algorithm is a depth-first traversal that begins at the root of the kd-tree node, and caches a current nearest neighbor candidate p_r^*.

Algorithm 1 kd nn(): nearest neighbor search using a kd-tree.

1: Input: query point pq, reference kd-tree node Nr
   {Prune if possible.}
2: if dmin(pq, Nr) > d(pq, p_r^*) then
3:   return
   {Perform base cases.}
4: if Nr is a leaf then
5:   for all points pr held in Nr do
6:     if d(pq, pr) < d(pq, p_r^*) then
7:       p_r^* ← pr
   {Recurse.}
8: if Nr is a leaf then
9:   return
10: else
11:   kd nn(pq, left child of Nr)
12:   kd nn(pq, right child of Nr)

The recursion traverses the tree, attempting to prune based on geometric reasoning: if the minimum distance between a node Nr and pq, denoted dmin(pq, Nr), is greater than the distance between pq and its current nearest neighbor candidate, then we can reason that no point held in any descendant node of Nr can possibly be closer to pq than p_r^*, and thus we can prune that node. This check is performed in line 2.

If a node is not pruned, and it is a leaf node (that is, if it has no children), then each point in the node is compared with pq in an attempt to improve the nearest neighbor candidate p_r^* (lines 5–7).

At the end of the traversal, p_r^* will contain the nearest neighbor of pq that is in S r. This is easy to prove: one may first show that if no pruning occurs, p_r^* will be correct because every point in S r will be searched. Then, one must simply show that the pruning rule never prunes away any point which could be the nearest neighbor of pq, and correctness is thus proven. Despite the fact that the search performs backtracking, an expected runtime of O(log N) can be shown [31, 32].

Extension of Algorithm 1 to the quadtree case simply involves modifying the recursion to visit each of the quadtree's four children (as opposed to the kd-tree's two). Further extensions and improvements of the algorithm as given do exist; Algorithm 1 differs wildly from Friedman's implementation [32], but the goal here is to present the algorithm as simply as possible for discussion purposes.

2.3 The explosion of single-tree algorithms

The nearest neighbor search algorithm for kd-trees given in the previous section (and its quadtree extension) may be referred to as single-tree algorithms5, as they construct a single tree on the reference dataset and traverse the tree in order to solve the problem. But nearly simultaneous to Bentley's developments was Fukunaga's exploration of a more generalized branch-and-bound algorithm for finding k-nearest neighbors [33], except that the tree structure proposed differed significantly: the children are created by running the k-means clustering algorithm. This structure may be referred to as the k-means tree, and was also applied to clustering [34] and feature selection [35].

5The term single-tree algorithm is probably specific to myself and other members of Alex Gray's lab; these algorithms may also be known in other circles as tree-based algorithms, tree algorithms, and/or branch-and-bound algorithms. I will, however, use my term, in order to differentiate between single-tree and dual-tree algorithms.

This handful of algorithms spurred the development of numerous algorithms; a partial list (with quick descriptions of the problem being solved) is given below.

• minimum spanning tree calculation [36]: given a dataset S r, calculate the spanning tree with minimum total edge distance.

• range search [37]: given a query point pq, a range [l, u], and a reference dataset S r, find every point in S r with distance to pq in the range [l, u].

• approximate nearest neighbor search [38]: given a query point pq and a reference set S r, find the approximate nearest neighbor of pq; there are a multitude of approximation schemes, but the most common is probably relative-value approximation, where the nearest neighbor returned must have distance no more than (1 + ε) times the true nearest neighbor distance.

• k-means clustering [39]: given a dataset S r and a number k, find k clusters by minimizing the within-cluster sum of squared distances.

• training Gaussian mixture models [40]: given a dataset S r and a number k, fit k Gaussians to S r.

• ray tracing [41]: given a collection of objects, light sources, and a camera location, trace the path of light through the space and account for its interaction with the objects; a common application is the generation of realistic images.

• solving the gravitational n-body problem [42]: given a set of particles S r, calculate the gravitational force on a query particle pq exerted by every particle in S r (often, instead of a single query particle pq, results are required for every particle in S r).

• Gaussian process regression [43]: a flexible regression technique that interpolates the observations and allows confidence intervals for predictions.

• kernel regression [44]: a regression technique where given a kernel function and a dataset S r, the prediction at a particular query point pq is a weighted combination of the predictions for points in S r.

• maximum inner product search [45]: given a dataset S r and a query point pq, find the point in S r with maximum inner product with pq.

• max-kernel search [46]: a generalization of maximum inner product search; given a kernel function K(·, ·), a dataset S r, and a query point pq, find the point in S r with maximum kernel function evaluation with pq.

These are only a few of the countless existing single-tree algorithms. These algorithms, as proposed, often used various different types of tree structures; sometimes, tree structures were proposed independently. In general, each single-tree algorithm above is adaptable to different types of trees, but at that time there was no formalized notion of tree or single-tree algorithm and thus each adaptation required special handling and care.

2.4 A smorgasbord of trees

We have only discussed the kd-tree and quadtree in any detail, but it should be clear that the design space for trees is enormous. Roughly speaking, a tree is any sort of hierarchical indexing structure, and there is huge flexibility in how the children of a node are selected and created. Some trees are known to work better for some problems than others, and some trees work better for certain types of data than others. This no-clear-winner conundrum led to incomprehensible numbers of different techniques for tree-building and a mystifying assortment of different trees suited to different tasks. A partial list of many tree types is given below; note that it is (very) incomplete! An important observation is how different the structures of these trees are.

• quadtree [29] (1974): splits 2-d space into four nodes.

• kd-tree [31] (1975): splits data according to axis-aligned splits; has two children.

• k-means tree [33] (1975): data is split using k-means clustering, recursively.

• octree [30] (1980): a generalization of quadtrees to 3 dimensions, with 8 children for each node.

• R trees [47] (1984): a height-balanced tree similar to the B tree, optimized for dynamic insertions and removals.

• ball trees [48] (1989 or earlier): each node represents a ball in the input space; nodes may end up overlapping depending on the construction technique used.

• R* trees [49] (1990): a modified R tree which has better optimization of node area during point insertion.

• vantage-point trees [50] (1993) / metric trees [51] (1991): each (ball-shaped) node has two children: one child corresponds to those points near the center of the ball, and the other corresponds to the points far away from the center.

• Hilbert R trees [52] (1994): an improvement on older R tree variants, which forces a linear ordering on the nodes to improve search time.

• TV trees [53] (1994): indexes high-dimensional data by ignoring all but a few features.

• X trees [54] (1996): an optimization of R trees to high-dimensional settings that uses 'supernodes'.

• principal axis trees [55] (2001): each node splits its children along the principal axis of its subset of the data.

• spill trees [56] (2004): a modified kd-tree which allows nodes to overlap and points to be held in multiple leaves; developed for approximate nearest neighbor search.

• cover trees [57] (2006): a fascinatingly complex tree structure specialized for proving worst-case runtime bounds with respect to properties of the dataset.

• cosine trees [58] (2008): the cosine similarity is used to split points into "similar" and "dissimilar" points.

• max-margin trees [59] (2012): a tree where each node split enforces a robust separation of the data, in order to minimize the number of nodes searched to find the true nearest neighbor.

• cone trees [45] (2012): each node corresponds to those points which lie in a cone around a vector; this is specialized for maximum inner-product search.

In general, every tree type listed above can be used to solve each of the problems listed in the previous section, but often some amount of adaptation is necessary. For instance, the nearest neighbor search algorithm for the cover tree [57] differs significantly from the nearest neighbor search algorithm as given in Algorithm 1.

2.5 The fast multipole method and query amortization

The intuition that eventually led to this thesis was developed at approximately the same time as the author in 1987; it is a well-known algorithm called the 'fast multipole method' (or more colloquially, 'FMM') for the calculation of pairwise interactions of particles in particle simulations, due to Greengard and Rokhlin [60]6. To demonstrate the advancement of Greengard and Rokhlin's algorithm, let us first consider an incrementally older single-tree algorithm by Barnes and Hut [42] that solves the same problem.

6Was the fast multipole method the first algorithm to amortize work over query points? Maybe not—but it is certainly the earliest well-known work that could be considered a dual-tree algorithm, and the eventual development of dual-tree algorithms traces its origins directly to Greengard and Rokhlin's algorithm [61].

Algorithm 2 Barnes-Hut force calculation for quadtrees: bh quadtree().

1: Input: quadtree node Ni, particle pq, approximation parameter θ, force estimate F̂q.
   {Attempt to approximate and prune, if possible.}
2: if sidelength(Ni) / d(pq, com(Ni)) < θ then
3:   F̂q ← F̂q + G mq tw(Ni) (pq − com(Ni)) / ||pq − com(Ni)||^3
4:   return
   {If we are a leaf, add the exact contributions of the points we hold.}
5: if Ni is a leaf then
6:   for all points pi held in Ni do
7:     F̂q ← F̂q + G mq mi (pq − pi) / ||pq − pi||^3
   {Recurse into children.}
8: for all children Nc of Ni do
9:   Call bh quadtree() with Nc and pq.

Our problem is to solve the gravitational N-body problem: we are given some set S r of points which generally lie in either R2 or R3 at time t. Our task is to compute (approximately) the position of each of the particles in S r at time t + ε for some given time step ε. Given that each point pi ∈ S r has mass mi, the core of this task may be expressed as computing the force Fi on each point pi ∈ S r:

F_i = -G \sum_{p_j \neq p_i,\ p_j \in S_r} \frac{m_i m_j (p_i - p_j)}{\| p_i - p_j \|^3}. \qquad (2)

Notice that as ||pi − pj|| becomes large, the force interaction between pi and pj becomes very small. Therefore, if we allow some amount of approximation, we may ignore or approximate those calculations where pi and pj are very far apart. This reasoning is similar enough to the type of reasoning used to prune away nearest neighbor search that it is not too hard to see the outlines of a single-tree algorithm. In Algorithm 2, we show pseudocode for Barnes and Hut's O(N log N) algorithm for solving this problem, specialized to R2 and assuming a quadtree T has been built on S r. When the quadtree is built, the center-of-mass of each node, denoted com(·), is calculated and cached, as well as the total weight, denoted tw(·).

The exposition here differs significantly from—and is far more readable than—the arcane SCHEME code given in the original paper [42]. Given some query point pq ∈ S r, some approximation parameter θ ∼ 1, and starting at the root of T, the algorithm calculates F̂q, an approximation to Fq. As θ is increased, the approximation at line 3 happens at higher and higher levels of the recursion.

A rough expected-time analysis by Barnes and Hut shows that the force calculation for a single point takes O(log N) time assuming a homogeneous mass distribution over S r. This means the time to calculate the force for every particle—and thus to perform the N-body simulation for a single time-step—takes O(N log N) time.
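Below is a minimal C++ sketch of the Barnes-Hut recursion of Algorithm 2, assuming a quadtree node type that caches its side length, total weight tw, and center of mass com at build time. All type and member names are illustrative assumptions, and the sign of the accumulated force follows the convention of Equation 2 (attractive force).

#include <array>
#include <cmath>
#include <memory>
#include <vector>

using Vec2 = std::array<double, 2>;

struct BHNode {
  double sideLength;          // Side length of the node's square.
  double totalWeight;         // tw(N): sum of descendant masses.
  Vec2 centerOfMass;          // com(N): mass-weighted centroid.
  std::vector<Vec2> points;   // Positions of held points (leaves).
  std::vector<double> masses; // Masses of held points (leaves).
  std::vector<std::unique_ptr<BHNode>> children;
  bool IsLeaf() const { return children.empty(); }
};

double Norm(const Vec2& v) { return std::sqrt(v[0] * v[0] + v[1] * v[1]); }

// Algorithm 2: accumulate the (approximate) force on particle pq with mass mq.
void BarnesHut(const BHNode& node, const Vec2& pq, double mq, double theta,
               double G, Vec2& force)
{
  const Vec2 delta = { pq[0] - node.centerOfMass[0], pq[1] - node.centerOfMass[1] };
  const double dist = Norm(delta);

  // Line 2: if the node is small relative to its distance, approximate it by
  // its center of mass (sign follows Equation 2).
  if (dist > 0.0 && node.sideLength / dist < theta)
  {
    const double scale = -G * mq * node.totalWeight / (dist * dist * dist);
    force[0] += scale * delta[0];
    force[1] += scale * delta[1];
    return;
  }

  // Lines 5-7: at a leaf, add the exact contribution of each held point.
  if (node.IsLeaf())
  {
    for (size_t i = 0; i < node.points.size(); ++i)
    {
      const Vec2 d = { pq[0] - node.points[i][0], pq[1] - node.points[i][1] };
      const double r = Norm(d);
      if (r == 0.0) continue;  // Skip the query particle itself.
      const double scale = -G * mq * node.masses[i] / (r * r * r);
      force[0] += scale * d[0];
      force[1] += scale * d[1];
    }
    return;
  }

  // Lines 8-9: otherwise recurse into the children.
  for (const auto& child : node.children)
    BarnesHut(*child, pq, mq, theta, G, force);
}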

An important note about Barnes and Hut's algorithm is that we must iterate over every point in S r for a single time step. However, this is not strictly necessary, and Greengard and Rokhlin's FMM (which can be seen, with some mental gymnastics, as the 'first dual-tree algorithm') shares work across points in S r. The details of the FMM are quite complex and for the sake of this discussion unnecessary, but the entire algorithm depends on the multipole expansion, which is, roughly, an approximation of Equation 2 as an order-p polynomial (the choice of p controls the approximation level, unlike θ in Barnes and Hut's algorithm). The key observation is that given some multipole expansion about some point pi, we may translate the expansion to be about some other point pj. Thus, we can form multipole expansions about the centers of the various nodes in our tree, and translate them to the centroids of other nodes.

The algorithm itself, then, consists of two passes over a tree: an upward pass, where multipole expansions are computed about nodes in the tree, and a downward pass, where these pre-computed multipole expansions are translated to other nodes in the tree and expanded to provide a force estimate for each point in S r. This only requires two passes over the tree, and Greengard and Rokhlin claimed worst-case O(N) running time (for |S r| = N). This claim was later contested by Aluru [62]; however, the FMM is known to scale linearly in practice, instead of as O(N log N), as in Barnes and Hut's single-tree algorithm. Empirical results provided by both papers established each algorithm as an effective and efficient alternative to brute-force calculation7.

The importance of the fast multipole method to the computational physics community and related communities cannot be overstated; it has spawned more descendant literature than is worth citing here and the careers of numerous researchers have focused entirely on applications of and improvements to the fast multipole method.

What we can take out of the FMM from this quick discussion, instead of the details of the algorithm, is the intuition that is used. Through the use of the multipole expansion, we exploit the fact that two nearby points have very similar interactions with faraway points.

That is, the work to compute force interactions for nearby points is amortized across those points and is not unnecessarily duplicated, as in Barnes and Hut’s algorithm.

2.6 Redirection to statistics and dual-tree algorithms

Finally, we can return to the problem of nearest neighbor search. Suppose now that instead of a single query point pq, we have an entire query set S q, and we must find the nearest neighbor of every pq ∈ S q in the reference set S r. This is sometimes called the batch nearest neighbor search problem. The work of Gray and Moore in 2001 adapted the fast multipole method to the problem of nearest neighbor search (and a few other problems) in order to obtain dual-tree algorithms [61].

In short, the idea is this: instead of traversing the tree built on S r for each query point pq ∈ S q, we will also build a query tree on S q, and traverse both trees (the query tree and reference tree) simultaneously, in order to obtain results for all points in S q during a single traversal.

7Both of these results were obtained on VAX machines; it seems as though Greengard and Rokhlin had access to nicer equipment, having run their simulations on the then-recent VAX 8600, whereas Barnes and Hut's results were on the older VAX 11/780. Sadly, the VAX architecture is all but dead now except in the hands of collectors8, after the implosion of DEC in the late 1980s, precipitated in part by the company's failure to recognize the importance of the PC market [63], which led to the acquisition of DEC by Compaq, which later became a part of HP, which has had more than its fair share of issues over the years.
8The author is an unsuccessful DEC collector.

Algorithm 3 dual kd nn(): Nearest-neighbor search using two kd-trees.

1: Input: query tree node Nq, reference tree node Nr
   {Calculate bound, and prune if possible.}
   {D^p(Nq) represents the set of descendant points of the query node Nq.}
2: b ← max_{pq ∈ D^p(Nq)} D_pq
3: if dmin(Nq, Nr) > b then
4:   return
   {Perform base cases.}
5: for all pq ∈ Pq do
6:   for all pr ∈ Pr do
7:     if d(pq, pr) < D_pq then
8:       N_pq ← pr
9:       D_pq ← d(pq, pr)
   {Dual-tree recursion.}
10: if Nq is a leaf and Nr is a leaf then
11:   return
12: else if Nq is a leaf then
13:   dual kd nn(Nq, left child of Nr)
14:   dual kd nn(Nq, right child of Nr)
15: else if Nr is a leaf then
16:   dual kd nn(left child of Nq, Nr)
17:   dual kd nn(right child of Nq, Nr)
18: else
19:   dual kd nn(left child of Nq, left child of Nr)
20:   dual kd nn(left child of Nq, right child of Nr)
21:   dual kd nn(right child of Nq, left child of Nr)
22:   dual kd nn(right child of Nq, right child of Nr)


To demonstrate this, let us adapt Algorithm 1 to a dual-tree algorithm. This new dual-tree algorithm, Algorithm 3, uses kd-trees and is similar to the dual-tree algorithm given by Gray and Moore [61] to calculate the two-point correlation. Before running the algorithm, we build a query tree Tq on the query set S q, a reference tree Tr on the reference set S r, and initialize two auxiliary arrays: N, where N_pq contains the current nearest neighbor candidate of a query point pq, and D, where D_pq contains the distance between pq and its current nearest neighbor candidate N_pq. Each element of D should be initialized to ∞. When this initialization step is complete, we call Algorithm 3 with the root of the query tree and the root of the reference tree as arguments.

algorithm, the pruning rule only needs to compare against the current nearest neighbor

candidate distance for the single query point pq; but in the dual-tree algorithm, we must consider the nearest neighbor candidate distances for every descendant point of the query node. It is possible to cache the calculation on line 2 during the traversal; for simplicity of exposition, those details are omitted9. The base case for loop is also similar—but in the

dual-tree algorithm, we now have multiple query points to consider, so we have a double

for loop. As in the single-tree algorithm, no base cases are performed unless Nq and Nr are both leaf nodes. Lastly, the recursion is slightly more complex: it must recurse into

both the query and the reference node simultaneously, if they are not leaves.

Similar to the single-tree algorithm, a correctness proof of Algorithm 3 is not very dif-

ficult. The first step is to show that if nothing is pruned, the correct results are obtained for

each pq ∈ S q. Then, the second step is to show that no combinations of query and refer- ence nodes are pruned when they should not be. Some of my previous work [28] contains

a correctness proof, which will be restated later in this document in a more comprehensive

and generalized form.

Like the fast multipole method, the key here is that work is amortized across queries; we do not need to perform separate searches for each query point. In practice, for |S q| ∼ |S r| ∼ O(N), this dual-tree nearest neighbor search algorithm scales linearly and provides massive speedup over other approaches (in low-to-medium dimensions).
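A condensed C++ sketch of Algorithm 3 follows. It assumes an illustrative node layout in which leaves hold point indices and every node caches its bounding box and the indices of all of its descendant points; the bound of line 2 is recomputed from scratch here, purely to keep the sketch short, whereas a real implementation would cache it during the traversal as noted above.

#include <algorithm>
#include <cmath>
#include <limits>
#include <memory>
#include <vector>

using PointD = std::vector<double>;

// Node layout assumed by this sketch only.
struct DTNode {
  std::vector<size_t> heldIndices;        // Indices of points held (leaves).
  std::vector<size_t> descendantIndices;  // Indices of all descendant points.
  PointD lo, hi;                          // Bounding hyperrectangle.
  std::unique_ptr<DTNode> left, right;
  bool IsLeaf() const { return left == nullptr && right == nullptr; }
};

double Distance(const PointD& a, const PointD& b)
{
  double sq = 0.0;
  for (size_t d = 0; d < a.size(); ++d) sq += (a[d] - b[d]) * (a[d] - b[d]);
  return std::sqrt(sq);
}

// Lower bound on the distance between any descendant point of nq and of nr.
double MinDistance(const DTNode& nq, const DTNode& nr)
{
  double sq = 0.0;
  for (size_t d = 0; d < nq.lo.size(); ++d)
  {
    const double gap = std::max({ nq.lo[d] - nr.hi[d], nr.lo[d] - nq.hi[d], 0.0 });
    sq += gap * gap;
  }
  return std::sqrt(sq);
}

// Algorithm 3: dual-tree nearest neighbor search.  D holds each query point's
// current candidate distance (initialized to infinity); N holds the candidates.
void DualKDNearestNeighbor(const DTNode& nq, const DTNode& nr,
                           const std::vector<PointD>& queries,
                           const std::vector<PointD>& references,
                           std::vector<double>& D, std::vector<size_t>& N)
{
  // Line 2: bound over all descendant query points (cacheable in practice).
  double b = 0.0;
  for (const size_t q : nq.descendantIndices)
    b = std::max(b, D[q]);

  // Line 3: prune if no (query, reference) pair under these nodes can improve.
  if (MinDistance(nq, nr) > b)
    return;

  // Lines 5-9: base cases between held points (non-empty only at leaves).
  for (const size_t q : nq.heldIndices)
    for (const size_t r : nr.heldIndices)
    {
      const double d = Distance(queries[q], references[r]);
      if (d < D[q]) { D[q] = d; N[q] = r; }
    }

  // Lines 10-22: recurse into all combinations of children.
  if (nq.IsLeaf() && nr.IsLeaf())
    return;
  else if (nq.IsLeaf())
  {
    DualKDNearestNeighbor(nq, *nr.left, queries, references, D, N);
    DualKDNearestNeighbor(nq, *nr.right, queries, references, D, N);
  }
  else if (nr.IsLeaf())
  {
    DualKDNearestNeighbor(*nq.left, nr, queries, references, D, N);
    DualKDNearestNeighbor(*nq.right, nr, queries, references, D, N);
  }
  else
  {
    DualKDNearestNeighbor(*nq.left, *nr.left, queries, references, D, N);
    DualKDNearestNeighbor(*nq.left, *nr.right, queries, references, D, N);
    DualKDNearestNeighbor(*nq.right, *nr.left, queries, references, D, N);
    DualKDNearestNeighbor(*nq.right, *nr.right, queries, references, D, N);
  }
}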

It turns out that there are many problems that dual-tree algorithms may solve; often, to develop these algorithms, the single-tree approaches referenced in previous sections may be adapted (with some effort). Below is a nearly-comprehensive list of problems that have been solved with dual-tree algorithms, including my own contributions.

9One may refer to the works of Gray [64, 65] for examples of how bounding information can be efficiently cached during the dual depth-first traversal of kd-trees.

• Nearest neighbor search [61, 57, 28, 66]: given a query set S q and reference set S r, find the nearest neighbor of each point in S q in S r.

• Approximate nearest neighbor search [67, 59]: the same as the above problem, but allowing approximate neighbors to be returned, according to various approximation schemes.

• Minimum spanning tree calculation [68]: given a dataset S r, calculate the spanning tree with minimum edge weight.

• Kernel density estimation [65, 64, 69]: given a kernel function, a reference set S r, and a query set S q, compute the approximate kernel density estimate at each point in the query set.

• Conditional kernel density estimation [70]: an extension of the kernel density estimation problem where conditional density estimates are required (i.e. f (x|y) instead of f (x, y)).

• Approximate matrix multiplication [71]: given two matrices S q and S r, approximately compute S q · S r.

• Mean shift clustering [72]: using a kernel function, locate the density maxima of a dataset S r, and use these maxima to define a clustering of the dataset.

• Gaussian summations [73, 74]: given a query set S q and a reference set S r, compute the sum of Gaussian kernel interactions with every reference point, for every query point.

• Generalized kernel summations [75, 76]: similar to the above problem, but generalized to any type of shift-invariant kernel.

• n-point correlation function estimation [77, 78]: a common technique used in astronomy to obtain a statistic useful for describing structure formation models.

• Maximum inner product search [45]: given a query set S q and a reference set S r, compute the reference point with maximum inner product to each query point.

• Max-kernel search [46, 79]: a generalization of maximum inner product search; given a query set S q, a reference set S r, and a kernel K(·, ·), compute the reference point with maximum kernel evaluation to each query point.

• k-means clustering [80]: given a dataset S r and a number k, find k clusters by minimizing the within-cluster sum of squared distances.

• Particle smoothing [81]: given a sequence of observations S r, compute a smoothed estimate of those observations.

• T-SNE (embedding) [82]: given a potentially high-dimensional dataset S r, embed S r into a few dimensions in a way that effectively captures the distribution of the data at both large and small scales.

• Approximate matrix-vector multiplication [83]: given a vector, quickly multiply it against a data matrix.

• Mode seeking [84]: a generalization of the mean-shift algorithm; given a dataset S r, find the modes of the distribution of points, usually for the task of clustering.

• Transition matrix approximation [85]: given a data graph, approximate the transition matrix for random walks on that graph.

At this point, we have seen three lists, containing numerous types of trees, a veritable plethora of single-tree algorithms, and a great deal of dual-tree algorithms. Clearly tree-based approaches are useful, given that there is no dearth of literature concerning them.

But there is one very important fact that this quick overview has downplayed: each of these types of trees and each of these algorithms are significantly different; there is no coherence or unification. It is like the worst of urban sprawl in large American cities: each algorithm or tree type is its own closed-off neighborhood, lacking connections to other algorithms or tree types. There is no master plan or big picture; these algorithms are developed haphazardly for individual tasks. Although usually the papers in which these algorithms are developed do cite other relevant works, the terminology and notation is not standardized and thus translating the core ideas of one algorithm to another can often be complex, unwieldy, and time-consuming.

To me, this is a significant problem, and the next chapter attempts a solution: a unified abstraction to tie together all types of trees, all types of single-tree algorithms, and all types of dual-tree algorithms, in order to organize the landscape of tree-based algorithms, allow easy knowledge transfer between algorithms, and simplify the development of new tree types, single-tree algorithms, and dual-tree algorithms.

CHAPTER 3 TREE-INDEPENDENT DUAL-TREE ALGORITHMS

3.1 A bibliographical note

The work in this chapter is an extended and somewhat rewritten version of the paper "Tree-independent dual-tree algorithms", by myself and numerous helpful coauthors, which was presented at ICML 2013 [28]. This work formalizes a lot of intuitive but informal notions that had been floating around the community and generalizes the entire classes of single-tree and dual-tree algorithms. The abstractions, definitions, and notation introduced in this chapter are used throughout the rest of the document.

3.2 The goal: unification of dual-tree algorithms

The previous chapter outlined the history of dual-tree algorithms, concluding by observing that the landscape of dual-tree algorithms (at least as of 2013) was not unified, was confusing, and it took a great deal of effort to develop new algorithms. In practice, a researcher may have had to implement entirely separate algorithms to solve the same problems with different trees, which is time-consuming and clearly suboptimal. Worse yet, parallel dual-tree algorithms are difficult to develop (for an example see Lee's work [76], for which the associated code took many months to develop) and appear far more complex than serial implementations; yet, both are solving the same problem.

This chapter introduces a formalizing abstraction for dual-tree algorithms, allowing us to address the issues above with the following tools:

• Formalized definitions of space trees and traversals.

• A representation of dual-tree algorithms as three separate components: a space tree, a traversal, and problem-specific rules: a point-to-point base case and a pruning rule.

23 • A meta-algorithm that produces dual-tree algorithms, given those four separate components.

This representation of dual-tree algorithms also has favorable implications for theoretical work as well as implementation.
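As a preview of how this separation can play out in code (mlpack's actual TreeType, TraversalType, and RuleType policy classes are covered in Chapter 4), the toy sketch below shows one hypothetical way the components could be composed with C++ templates. None of these class names or member functions correspond to any real library interface; the sentinel-value pruning convention is also only an assumption of this illustration.

#include <cstddef>
#include <limits>

// Toy illustration of the three-component decomposition.  TreeType is assumed
// to expose NumChildren(), Child(i), NumPoints(), and Point(i).
template<typename RuleType>
class ToyDualDepthFirstTraversal {
 public:
  explicit ToyDualDepthFirstTraversal(RuleType& rules) : rules(rules) { }

  template<typename TreeType>
  void Traverse(TreeType& queryNode, TreeType& referenceNode)
  {
    // Problem-specific pruning decision (Score()).
    if (rules.Score(queryNode, referenceNode) ==
        std::numeric_limits<double>::max())
      return;

    // Problem-specific point-to-point base cases (BaseCase()).
    for (size_t i = 0; i < queryNode.NumPoints(); ++i)
      for (size_t j = 0; j < referenceNode.NumPoints(); ++j)
        rules.BaseCase(queryNode.Point(i), referenceNode.Point(j));

    // Recurse into combinations of children, mirroring Algorithm 3.
    const size_t qc = queryNode.NumChildren(), rc = referenceNode.NumChildren();
    if (qc == 0 && rc == 0)
      return;
    else if (qc == 0)
      for (size_t j = 0; j < rc; ++j)
        Traverse(queryNode, referenceNode.Child(j));
    else if (rc == 0)
      for (size_t i = 0; i < qc; ++i)
        Traverse(queryNode.Child(i), referenceNode);
    else
      for (size_t i = 0; i < qc; ++i)
        for (size_t j = 0; j < rc; ++j)
          Traverse(queryNode.Child(i), referenceNode.Child(j));
  }

 private:
  RuleType& rules;
};

// The meta-algorithm: any tree, any pruning dual-tree traversal, any rules.
template<typename TreeType, typename RuleType,
         template<typename> class TraversalType = ToyDualDepthFirstTraversal>
void RunDualTreeAlgorithm(TreeType& queryTree, TreeType& referenceTree,
                          RuleType& rules)
{
  TraversalType<RuleType> traversal(rules);
  traversal.Traverse(queryTree, referenceTree);
}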

3.3 Space trees

The first thing to do is formalize the notion of a tree. Below, we present a definition of space tree that encapsulates all of the tree types mentioned in the previous chapter [28].

Definition 1. A space tree on a dataset S ∈ R^(N×d) is an undirected, connected, acyclic, rooted simple graph with the following properties:

• Each node (or vertex) holds a number of points (possibly zero) and is connected to one parent node and a number of child nodes (possibly zero).

• There is one node in every space tree with no parent; this is the root node of the tree.

• Each point in S is contained in at least one node.

• Each node corresponds to some subset of Rd that contains each point in that node and also the subsets that correspond to each child of the node.

There is nothing counterintuitive about this definition: a tree is a hierarchical graph structure on data, and each node corresponds to some region of the input space. As the tree is descended, the space corresponded to by each node will shrink. Each node in the tree, then, may be fully parameterized by the points it holds, the children it holds, the region of input space it corresponds to, and its parent. Despite this fairly straightforward definition, though, there are a couple important notes.

First, we are unable to use the term space partitioning tree, because in our definition we do not require that nodes are non-overlapping. For instance, the spill tree [56], ball trees [48], and the cover tree [57] can each have overlapping nodes. This forces us to use the more general term space tree, which is also something of a nice consolation prize for the author's failed dream of becoming an astronaut.

Second, it is imperative to note that the points held by a node are not necessarily the same as the points held by the node’s children. The only restriction on the sets of points held by a node’s children (and by the children’s children, and so forth) is that they all fall into the region of input space represented by that node. This distinction is incredibly important when talking about space trees: the term descendant points of a node refers to the points held by the node plus all the points held by the descendant nodes, whereas the term points of a node refers only to the points held in a node. This distinction will become clearer in the tree survey of Section 3.6.

Third, consider the last part of the definition: each node corresponds to some subset of R^d. In general, trees are designed so that these subsets are geometrically easy structures to work with, such as balls, rectangles, cones, slices, and so forth. Any of these structures can usually be parameterized by only a few values; for instance, a ball only needs a center and a radius, and a rectangle only needs an origin and side lengths. This means implementation of a tree node is generally simple; it has only a list of children, a list of points, optionally a parent, and whatever is necessary to parameterize its bounding shape.

Figure 6: Abstract representation of an example kd-tree.

(a) Top level. (b) Second level. (c) Third level. (d) Lowest level (leaves).
Figure 7: Geometric representation of the same example kd-tree.

We may now revisit the kd-tree we presented in the previous chapter, in Figures 3, 4, and 5, and discuss this tree in the context of our definition. Figures 6 and 7 re-present the same kd-tree, and Figure 8 shows the parameterized representation of the children and the points held in each node. Note that kd-trees only hold points in the leaves, and the bounding shapes are the smallest rectangles which enclose all of the descendant points.

To further explore the possibility space for space trees, Figures 9a and 9b show another possible space tree (not a kd-tree).

26 • N0: children {}, points {p0, p1, p2}

• N1: children {}, points {p3}

• N2: children {}, points {p4, p5}

• N3: children {}, points {p6, p7}

• N4: children {}, points {p8, p9}

• N5: children {}, points {p10, p11}

• N6: children {}, points {p12, p13}

• N7: children {}, points {p14, p15}

• N8: children {N0, N1}, points {}

• N9: children {N2, N3}, points {}

• N10: children {N4, N5}, points {}

• N11: children {N6, N7}, points {}

• N12: children {N8, N9}, points {}

• N13: children {N10, N11}, points {}

• N14: children {N12, N13}, points {}

Figure 8: Parameterized representation of the same example kd-tree.

In the abstract representation of the tree given in Figure 9a, Nr is the root node of the tree; it has no parent and it contains the points x3 and x1.

The node Np contains points x1 and x5 and has children Nc and Nd (which each have no children and contain points x2 and x4, respectively). Figure 9b draws the tree in the input space of the points, R2. The points in the tree and the subsets of input space represented by Nr (darker rectangle) and Np (lighter rectangle) are plotted. The subsets of input space corresponding to Nc and Nd are not labeled, because those subsets are simply {x2} and {x4}, respectively.

(a) Abstract representation. (b) R2 representation.
Figure 9: Another example space tree.
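To connect Definition 1 to an implementation, here is one hypothetical C++ node layout that carries exactly the pieces the definition requires—held points, children, a parent, and a parameterization of the node's region (a ball, in this sketch). The names are illustrative only.

#include <memory>
#include <vector>

// One possible realization of a space tree node (Definition 1).  The region of
// input space is parameterized here as a ball (center + radius); other shapes
// (rectangles, cones, slices) would simply swap out those members.
struct SpaceTreeNode {
  std::vector<std::vector<double>> points;              // Points held (possibly empty).
  std::vector<std::unique_ptr<SpaceTreeNode>> children; // Child nodes (possibly empty).
  SpaceTreeNode* parent = nullptr;                      // Null only for the root.

  std::vector<double> center;  // Center of the region containing all descendant points.
  double radius = 0.0;         // Radius of that region.
};

A kd-tree, ball tree, or cover tree node would all fit this general shape, differing only in how the points, children, and bounding region are chosen.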

27 3.4 Space tree notation

Now that we have defined a space tree and mildly explored the possibility space, we must establish standardized notation which will be used throughout the rest of the document. A quick reference table is given in Table 1, and detailed definitions are given below.

• The set of child nodes of a node Ni is denoted C (Ni) or Ci.

• The set of points held in a node Ni is denoted P(Ni) or Pi.

• The subset of input space represented by a node Ni is denoted S (Ni) or Si.

• The set of descendant nodes of a node Ni, denoted D^n(Ni) or D^n_i, is the set of nodes C(Ni) ∪ C(C(Ni)) ∪ .... By C(C(Ni)), we mean all the children of the children of node Ni: C(C(Ni)) = {C(Nc) : Nc ∈ C(Ni)}.

• The set of descendant points of a node Ni, denoted D^p(Ni) or D^p_i, is the set of points {p : p ∈ P(D^n(Ni)) ∪ P(Ni)}. The meaning of P(D^n(Ni)) is similar to the meaning of C(C(Ni)): P(D^n(Ni)) = {P(Nd) : Nd ∈ D^n(Ni)}.

• The parent of a node Ni is denoted parent(Ni).

• The centroid of a node Ni is denoted centroid(Ni); this is the centroid of all descendant points of the node. Usually, this quantity is easily calculated at tree-building time and may be cached then.

• The center of a node Ni is denoted µi; this is the center of the region Si. This is, in general, different than the centroid; for some tree types, it is easy to calculate; for others, it is not easy.

• The furthest descendant distance for a node Ni and a metric d(·, ·), denoted λ(Ni) or λi, is defined as

\lambda(N_i) = \max_{p \in \mathcal{D}^p(N_i)} d(\mu_i, p). \qquad (3)

It is often possible to calculate λ(Ni) exactly, depending on the type of tree, or at least calculating a bound on λ(Ni) is often possible.

Table 1: Notation for trees. See text for details.

Symbol           Description
N                A tree node
Ci               Set of child nodes of Ni
Pi               Set of points held in Ni
D^n_i            Set of descendant nodes of Ni
D^p_i            Set of points contained in Ni and D^n_i
parent(Ni)       The parent of Ni
centroid(Ni)     The centroid of all descendant points of Ni
µi               Center of Ni
λi               Furthest descendant distance

In general, the short notation (i.e. Ci instead of C (Ni)) will be used where possible, and the long notation will only be used when further clarity is required.

3.5 Bounding quantities with space trees

The real utility of a simple bounding shape comes from the ability to quickly calculate bounds on geometric quantities. In the introduction, during the discussion on single-tree nearest neighbor search, pruning was possible when the minimum distance between the node and the query point was sufficiently large. Let us now formalize this notion of mini- mum distance.

Definition 2. The exact minimum distance between a node Ni and a point pq is defined as

d*_min(pq, Ni) := min { d(pq, pj) ∀ pj ∈ D^p_i }.    (4)

In general, computing the exact minimum distance between a point pq and a node Ni is computationally infeasible and defeats the entire purpose of trees in the first place, which

Figure 10: A bound on the minimum distance between a point pq and a node Ni.

is to represent the data compactly at various scales. If we have to scan every descendant

point of a node to calculate the minimum distance, the entire exercise of building a tree

is pointless. Fortunately, the fact that each space tree node corresponds to a subset of the

input space and is often a convenient geometric shape allows us to easily place a lower

bound on d*_min(·, ·); see Figure 10. Note that the distance dmin(pq, Ni) is a lower bound on the exhaustively calculated d*_min(pq, Ni) (which is not drawn in the figure).

Given the parameters of the region represented by Ni (which in general we do have), calculating dmin(·, ·) is a relatively trivial O(d) operation. Suppose that Si (unlike in the

figure) is a ball of radius λi with center µi; then, we may easily calculate dmin(pq, Ni):

dmin(pq, Ni) = d(pq, µi) − λi. (5)

This calculation is easy and fast, and is often a reasonable bound for d*_min(pq, Ni). For trees with different bounding shapes, the calculation can be quite different. Quadtrees, for instance, require a slightly more complex calculation, because the bounding box is a square. Similarly, kd-trees have hyperrectangular bounds, which means the calculation is not as simple as taking the distance between the point pq and the center of the node and subtracting the radius. Still, the majority of useful space trees are able to produce a bound dmin(pq, Ni) in O(d) time or better. We can now generalize this point-to-node distance bound to a node-to-node distance bound.
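As an illustration of the hyperrectangle case, the sketch below computes dmin(pq, Ni) in O(d) time for the Euclidean metric, assuming (hypothetically) that the bound of Ni is stored as per-dimension limits lo[j] and hi[j]; this is only a minimal example of the idea, not any particular library's implementation.

#include <armadillo>
#include <cmath>
#include <cstddef>

// Lower bound dmin(pq, Ni) for a hyperrectangle bound stored as per-dimension
// limits lo[j] and hi[j].  Each dimension contributes only the distance by
// which pq lies outside the box, so the computation is O(d).
double MinDistance(const arma::vec& pq, const arma::vec& lo, const arma::vec& hi)
{
  double sum = 0.0;
  for (size_t j = 0; j < pq.n_elem; ++j)
  {
    double outside = 0.0;
    if (pq[j] < lo[j])
      outside = lo[j] - pq[j];   // query lies below the box in dimension j
    else if (pq[j] > hi[j])
      outside = pq[j] - hi[j];   // query lies above the box in dimension j
    sum += outside * outside;    // dimensions where pq is inside add nothing
  }
  return std::sqrt(sum);
}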

Figure 11: A bound on the minimum distance between a node Ni and a node Nj.

Definition 3. The exact minimum distance between two nodes Ni and Nj is defined as

d*_min(Ni, Nj) := min { d(pi, pj) ∀ pi ∈ D^p_i, pj ∈ D^p_j }.    (6)

Again, we may easily bound this quantity using trees. Figure 11 demonstrates this bound dmin(Ni, Nj) geometrically, in the same way as Figure 10. In the figure given, Si and Sj are both balls with centers µi and µj and radii λi and λj, respectively. This means we can easily calculate dmin(Ni, Nj):

dmin(Ni, Nj) = d(µi, µj) − λi − λj.    (7)

As with the point-to-node bound, the exact way to quickly calculate a lower bound dmin(·, ·) varies across tree types.

It is easy to extend the intuition used to define dmin(·, ·) to other quantities, like dmax(·, ·). These bounds provide useful summarization of the data points contained in a node, and fast evaluation of these bounds is paramount to any of the tree-based strategies discussed here or in related works.
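For concreteness, when Si and Sj are balls as in Equations 5 and 7, the analogous upper bounds (a sketch following directly from the triangle inequality, since any descendant point of Ni lies within λi of µi) are

dmax(pq, Ni) = d(pq, µi) + λi,    dmax(Ni, Nj) = d(µi, µj) + λi + λj.

As with dmin(·, ·), other bounding shapes admit similar O(d) calculations.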

3.6 A quick survey of some space trees

In order to demonstrate the utility of the space tree abstraction, let us now consider popular types of trees and show how they are described generally as space trees. The descriptions

here do not consider how the tree is built in-depth, since the tree-building procedure is somewhat unimportant with respect to how the tree fits into the space tree abstraction.

Figure 12: Three levels of an example quadtree with a leaf size of 3. (a) Points in dataset. (b) Root level. (c) Second level. (d) Third level.

There are many more types of trees, as I alluded to in the introduction; understanding these as space trees is often straightforward and can be done using the same type of reasoning as in the following subsections.

3.6.1 The quad-tree, octree, and hyperoctree

The quad-tree [29], octree [30], and hyperoctree [86] are all the same tree, just specialized for different dimensionalities. A quad-tree lives in two dimensions and each node has up to

Table 2: Properties of hyperoctrees.

Quantity   Description                Value for hyperoctrees
Ci         children of Ni             ∅ for leaves, |Ci| ≤ 2^d otherwise
Pi         points of Ni               ∅ for non-leaves, all points in Si otherwise
Si         region of Ni               hypercube with side-length l
µi         center of Ni               center of hypercube Si
λi         furthest desc. distance    √d (l/2)

Property                    Value for hyperoctrees
Can nodes overlap?          No.
Points in multiple nodes?   No.
Points only in leaves?      Yes.
Leaves hold all points?     Yes.

four children; an octree lives in three dimensions and each node has up to eight children; a

hyperoctree living in d dimensions has up to 2^d children.

Building a hyperoctree (which is the term I will use from here) in dimension d involves

finding a hypercube that contains all the data. This is the root of the tree. This hypercube is

then split into 2^d hypercubes with side lengths equal to half that of the root. Any hypercubes that contain no points in the dataset are not created. Thus, the root may contain up to 2^d children.

This splitting procedure continues recursively until a node contains some specified number

of points (the leaf size). Figure 12 shows a quadtree (that is, a hyperoctree with dimension

2) with a leaf size of 3; this is the same quadtree shown in the previous chapter.
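To make the splitting rule concrete, the short sketch below computes which of the 2^d children of a node a point falls into, assuming (hypothetically) that the node stores the center of its hypercube; each dimension contributes one bit of the child index.

#include <armadillo>
#include <cstddef>

// Child index of a point p within a hyperoctree node whose hypercube is
// centered at 'mid'.  Bit j of the index is set when p lies in the upper half
// of dimension j, giving one of the 2^d possible children.
size_t ChildIndex(const arma::vec& p, const arma::vec& mid)
{
  size_t index = 0;
  for (size_t j = 0; j < p.n_elem; ++j)
    if (p[j] > mid[j])
      index |= (size_t(1) << j);
  return index;
}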

In general, most implementations of hyperoctrees store points only in the leaves; that

is, for some node Ni, Pi = {} unless Ni is a leaf. With our description complete, we may now summarize characteristics of the hyperoctree in Table 2.

Because of the huge number of children a hyperoctree has at high dimensions, hyperoctrees are often a bad choice past d > 3 or so. Often, normalizing a dataset so each dimension has unit variance is a good choice, because of the restriction that hyperoctrees correspond to hypercubes.

Table 3: Properties of kd-trees.

Quantity   Description                Value for kd-trees
Ci         children of Ni             ∅ for leaves, |Ci| = 2 (left and right) otherwise
Pi         points of Ni               ∅ for non-leaves, all points in Si otherwise
Si         region of Ni               hyperrectangle enclosing all descendant points
µi         center of Ni               center of hyperrectangle Si
λi         furthest desc. distance    distance between µi and any corner of Si

Property                    Value for kd-trees
Can nodes overlap?          No.
Points in multiple nodes?   No.
Points only in leaves?      Yes.
Leaves hold all points?     Yes.

3.6.2 The kd-tree

The kd-tree [31] presents a better solution for higher-dimensional data by only allowing

two children per node and allowing hyperrectangle bounds, instead of hypercube bounds.

A kd-tree is built by first finding an enclosing hyperrectangle for all of the data points;

this is the root. Then, a dimension is chosen to split on (often, this is the dimension with

maximum variance). A split value is then chosen; this may be the median, mean, or mid-

point of the data in the chosen dimension. Points with value in the chosen dimension less

than or equal to the split value will be descendants of the left node; points with value greater than the split value will be descendants of the right node. This procedure is continued recursively, until a node contains some specified number of points (again, the leaf

size). Figure 7, in Section 3.3, shows an example kd-tree.
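A minimal sketch of one such split step is given below; the function and variable names are hypothetical, real implementations also track how points were reordered, and the data matrix is assumed to hold one point per column.

#include <armadillo>
#include <cstddef>

// One kd-tree split over the points in columns [begin, begin + count): choose
// the dimension of maximum variance, split at the median value in that
// dimension, and partition the columns so the left child's points come first.
// Returns the first column index belonging to the right child.
size_t SplitNode(arma::mat& data, const size_t begin, const size_t count)
{
  // Dimension with maximum variance over this node's points.
  arma::vec variances = arma::var(data.cols(begin, begin + count - 1), 0, 1);
  const size_t splitDim = variances.index_max();

  // Median value in the chosen dimension serves as the split value.
  arma::rowvec vals = data.row(splitDim).subvec(begin, begin + count - 1);
  const double splitVal = arma::median(vals);

  // Partition: points with value <= splitVal go to the left child.
  size_t left = begin;
  size_t right = begin + count; // one past the end
  while (left < right)
  {
    if (data(splitDim, left) <= splitVal)
      ++left;
    else
      data.swap_cols(left, --right);
  }
  return left;
}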

As with hyperoctrees, most implementations of kd-trees store points only in the leaves.

A summary of the characteristics of kd-trees is given in Table 3.

3.6.3 The ball tree

The ball tree is not a specific type of tree, but encompasses many different types of trees

with similar characteristics1; for example, Omohundro describes five different construction

1Our discussion here assumes ball trees to be quite similar to kd-trees. Other authors may take ‘ball tree’ to mean something else entirely, but we stick to the general definition settled on by Gray and coauthors [61] and implemented in the mlpack machine learning library [87].

Table 4: Properties of ball trees.

Quantity   Description                Value for ball trees
Ci         children of Ni             ∅ for leaves, |Ci| = 2 (left and right) otherwise
Pi         points of Ni               ∅ for non-leaves, all points in Si otherwise
Si         region of Ni               a ball enclosing all descendant points
µi         center of Ni               center of ball Si
λi         furthest desc. distance    radius of ball Si

Property                    Value for ball trees
Can nodes overlap?          Yes.
Points in multiple nodes?   No.
Points only in leaves?      Yes.
Leaves hold all points?     Yes.

algorithms for balltrees [48]. Roughly, the ball tree may be understood as an analog of the

kd-tree, where each node has a left and right child. However, instead of each node being represented by a hyperrectangle, it is instead represented by a ball.

Each ball corresponding to a node should ideally be the smallest ball that encloses all of the node's descendant points, but the minimum enclosing ball problem is known to be difficult, with a simple implementation finding the minimum enclosing ball over n points taking O(n^4) time. Fortunately, faster algorithms do exist, such as the O(n log n) algorithm of Shamos and Hoey [88] and the later O(n) linear programming algorithm of

Megiddo, Zemel, and Hakimi [89]. Unfortunately, Megiddo’s algorithm is impractical for large d, with a running time of O((d + 1)(d + 1)!n); thus, there exist numerous alternate strategies to provide approximate bounding spheres [90, 91, 92, 93]. Nonetheless, even with these accelerated algorithms, quickly computing a ‘good’ minimum enclosing ball approximation is a difficult challenge for ball trees.
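As a sketch of one common cheap alternative (not any of the cited algorithms), the ball can simply be centered at the centroid of the points with radius equal to the furthest point distance; this is not a minimum enclosing ball, but it is always valid and costs only O(dn) time.

#include <algorithm>
#include <armadillo>
#include <cstddef>

// Cheap approximate enclosing ball: center at the centroid of the points
// (columns of 'data'), radius equal to the furthest point from that center.
void ApproxBall(const arma::mat& data, arma::vec& center, double& radius)
{
  center = arma::mean(data, 1); // centroid of the column vectors
  radius = 0.0;
  for (size_t i = 0; i < data.n_cols; ++i)
    radius = std::max(radius, arma::norm(data.col(i) - center, 2));
}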

With a good minimum enclosing ball, though, the computation of minimum distances between nodes is simple, as in Equation 7 from the last section. The general structure of ball trees is the same as kd-trees, with two children per non-leaf node, and points only held in the leaves. Properties of ball trees are given in Table 4. Note that although the two children of a ball tree node may overlap, no points are held in both the left and right child.

3.6.4 The metric tree / vantage-point tree

The metric tree, developed by Uhlmann in 1991 [51], or the vantage point tree, developed

by Yianilos in 1993 [50], turn out to be the same tree structure. I will use the “vantage-point

tree” name here because I find it to be more descriptive of the tree type.

Instead of each node representing a different configuration of the same general shape,

each vantage point tree node Ni has a near child (also called left or inside child) and a

far child (also called right or outside child), and is centered at a point pi. The near child

corresponds to the ball centered at pi with some radius r, and the far child corresponds to

the ball slice centered at pi with minimum radius r and maximum radius λi. One way of constructing a vantage point tree (the "simplest" vp-tree, according to Yianilos [50]) is as follows. A ball that encloses the entire dataset and is centered at some point

pi is chosen, usually by some type of random sampling procedure; this corresponds to the root node. Then, the median distance of other points from pi is chosen as the radius r; points with distance less than r from pi go into the close child, and points with distance greater than or equal to r from pi go into the far child. This process is repeated recursively, but the “vantage point” pi is not passed into its children. Thus, this tree is different in that not all points are held at the leaves, and the leaves do not hold every point in the dataset.

Construction of the tree may continue recursively until a node contains only one point.

It is possible to modify the construction algorithm to stop splitting when a node contains some specified number of points (as with earlier trees, the leaf size).
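A minimal sketch of a single vantage-point split is given below (the names and the use of the Euclidean metric are illustrative assumptions): distances to the vantage point are computed, the median becomes the radius r, and the remaining points are divided into the near and far children.

#include <armadillo>
#include <cstddef>
#include <vector>

// One vp-tree split: partition the columns of 'data' around the median
// distance r from the vantage point.
void VPSplit(const arma::mat& data, const arma::vec& vantage,
             std::vector<size_t>& nearSet, std::vector<size_t>& farSet,
             double& r)
{
  arma::vec dists(data.n_cols);
  for (size_t i = 0; i < data.n_cols; ++i)
    dists[i] = arma::norm(data.col(i) - vantage, 2);

  r = arma::median(dists); // split radius

  for (size_t i = 0; i < data.n_cols; ++i)
  {
    if (dists[i] < r)
      nearSet.push_back(i); // inside the ball of radius r
    else
      farSet.push_back(i);  // in the ball slice between r and the outer bound
  }
}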

The unusual shape of the near and far child means that the space decomposition of a vantage-point tree looks significantly different than the axis-aligned hyperrectangles of the kd-tree. Figure 13a and Figure 13b illustrate this difference; they are reprinted from

Yianilos' original work [50]. Table 5 lists characteristics of the vantage-point tree, in the same format as previous tables.

2This is not related to Starchild, the divine being who comes to earth to bring Funk to humanity, according to George Clinton and his associates in the bands Parliament and Funkadelic. Despite this, Starchild continues to be a strong influence on the work of the author.

Figure 13: kd-tree and vp-tree space decompositions. (a) vp-tree decomposition of a dataset (from [50]). (b) kd-tree decomposition of the same dataset (from [50]).

Table 5: Properties of vp-trees.

Quantity   Description                Value for vp-trees
Ci         children of Ni             ∅ for leaves, |Ci| = 2 (near and far) otherwise
Pi         points of Ni               the vantage point pi
Si         region of Ni               a ball centered at pi
µi         center of Ni               center of ball Si, which is pi
λi         furthest desc. distance    radius of ball Si

Property                    Value for vp-trees
Can nodes overlap?          Yes.
Points in multiple nodes?   No.
Points only in leaves?      No.
Leaves hold all points?     No.

3.6.5 The cover tree

The cover tree, more recently proposed in 2006 [57], is the most complex tree type we will consider here. Its complexity stems from its theoretical utility, which we will discuss in much more detail later in this document. This subsection only serves as a brief introduction to the basic properties of the tree and how it fits into the space tree abstraction.

When building a cover tree, we assume only that we have some dataset S and some

metric d(·, ·); so, like the vantage point tree, the space need not be Euclidean or even representable. Like the vantage point tree, each node in a cover tree is parameterized by a center point pi and a radius λi ≤ 2^(si+1), where si is the integer scale of the node. The root of the tree has the largest scale, and the leaves (with scale −∞) have the smallest scale. However, the

cover tree also satisfies some other invariants. I have yet to find a better summary than the

original provided by the authors [57], so I will simply quote their words:

A cover tree T on a dataset S is a leveled tree where each level is a "cover" for the level beneath it. Each level is indexed by an integer scale si which decreases as the tree is descended. Every node in the tree is associated with a point in S. Each point in S may be associated with multiple nodes in the tree; however, we require that any point appears at most once in every level. Let C_si denote the set of points in S associated with the nodes at level si. The cover tree obeys the following invariants for all si:

• (Nesting). C_si ⊂ C_si−1. This implies that once a point p ∈ S appears in C_si then every lower level in the tree has a node associated with p.

• (Covering tree). For every pi ∈ C_si−1, there exists a pj ∈ C_si such that d(pi, pj) < 2^si and the node in level si associated with pj is a parent of the node in level si − 1 associated with pi.

• (Separation). For all distinct pi, pj ∈ C_si, d(pi, pj) > 2^si.

As a consequence of this definition, if there exists a node Ni, containing the point pi at some scale si, then there will also exist a self-child node Nic containing the point pi at scale si − 1 which is a child of Ni. In addition, every descendant point of the node Ni is contained within a ball of radius 2^(si+1) centered at the point pi; therefore, the furthest descendant distance λi is bounded by 2^(si+1) and the center of the node µi is the point pi.
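A quick sketch of why this bound holds: by the covering invariant, a child at scale si − 1 lies within 2^si of its parent, a grandchild within a further 2^(si−1), and so on; chaining these distances down the tree gives, for any descendant point p of Ni,

d(pi, p) < 2^si + 2^(si−1) + 2^(si−2) + · · · = 2^(si+1),

which is exactly the bound on λi stated above.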

Table 6: Properties of cover trees.

Quantity   Description                Value for cover trees
Ci         children of Ni             ∅ for leaves, |Ci| potentially large otherwise
Pi         points of Ni               one point, pi
Si         region of Ni               a ball centered at pi with radius 2^(si+1)
µi         center of Ni               center of ball Si, which is pi
λi         furthest desc. distance    2^(si+1)

Property                    Value for cover trees
Can nodes overlap?          Yes.
Points in multiple nodes?   Yes.
Points only in leaves?      No.
Leaves hold all points?     Yes.

Note that the cover tree may be interpreted as an infinite-leveled tree, with C_∞ containing only the root point, C_−∞ = S, and all levels between defined as above. Beygelzimer et al. [57] find this representation (which they call the implicit representation) easier for description of their algorithms and some of their proofs. But clearly, this is not suitable for implementation; hence, there is an explicit representation in which all nodes that have only a self-child are coalesced upwards (that is, the node's self-child is removed, and the children of that self-child are taken to be the children of the node). In this work, we consider only the explicit representation of a cover tree.

Thus, the major structural difference from any tree we have considered in-depth so far is that points may exist at multiple levels of the tree. We can encapsulate the primary properties of the cover tree in Table 6.

3.7 Traversals and problem-specific rules

Now that we have defined a tree in an abstract sense, we no longer need to think about the individual properties of trees and can consider them by using our abstract space tree definition. The definitions we present here formalize and abstract the traversal strategies used by the numerous single-tree and dual-tree algorithms.

It is easier to start with the single-tree traversal definitions.

Definition 4. A single-tree traversal is a process that, given a space tree, will visit each node in that tree once and perform a computation on any points contained within the node

39 that is being visited; call this computation BaseCase().

As an example, the standard depth-first traversal or breadth-first traversal are single-tree traversals. From a programming perspective, the computation in the single-tree traversal can be implemented with a simple callback BaseCase(point) function. This allows the computation to be entirely independent of the single-tree traversal itself. For instance, a simple single-tree algorithm to count the number of points in a given tree would increment a counter variable each time BaseCase(pi) was called, where pi is some point in the node currently being visited. Note that at each node, BaseCase() is only called on those points in Pi, not all descendant points of the node. So, as long as the tree type being used satisfies the condition that each point is contained in only one node, this example algorithm to count the number of points would return the correct result.
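As a sketch of how little code this requires, the fragment below implements the point-counting example, assuming only that a node exposes NumPoints(), Point(i), NumChildren(), and Child(i) accessors (an interface along the lines of what Chapter 4 formalizes); the traversal is a plain depth-first recursion with no pruning.

#include <cstddef>

size_t counter = 0; // number of points seen so far

// The problem-specific computation: count each point that is visited.
template<typename VecType>
void BaseCase(const VecType& /* point */) { ++counter; }

// A depth-first single-tree traversal: call BaseCase() on every point held in
// the node, then recurse into each child.
template<typename TreeType>
void SingleTreeTraversal(TreeType& node)
{
  for (size_t i = 0; i < node.NumPoints(); ++i)
    BaseCase(node.Point(i));
  for (size_t i = 0; i < node.NumChildren(); ++i)
    SingleTreeTraversal(node.Child(i));
}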

However, this concept of a single-tree traversal by itself is not very useful; without prun- ing branches, no computations can be avoided. Thus, we must now introduce a mechanism for pruning.

Definition 5. A pruning single-tree traversal is a process that, given a space tree, will visit nodes in the tree and perform a computation to assign a score to that node; call this computation Score(). If the score is ∞, the node is “pruned” and none of its descendants will be visited; otherwise, a computation is performed on any points contained within that node; call that computation BaseCase(). If no nodes are pruned, then the traversal will visit each node in the tree once.

Clearly, a pruning single-tree traversal that does not prune any nodes is just a single- tree traversal. A pruning single-tree traversal can be implemented with two callbacks:

BaseCase() and Score(). This allows both the point-to-point computation and the scor- ing to be entirely independent of the traversal. Thus, single-tree branch-and-bound algo- rithms can be expressed in a tree-independent manner.

Below is a simple BaseCase() function (Algorithm 4) and Score() function (Algo- rithm 5) that will print “Hello!” once for each point in a reference tree Tr that has distance

less than or equal to 1 from a given query point pq.

Algorithm 4 BaseCase() for hello-printing nonsense example algorithm.

1: Input: query point pq, reference point pr

2: if pr not already visited with query point pq then
3:   if d(pq, pr) ≤ 1 then
4:     print "Hello!"

Algorithm 5 Single-tree Score() for hello-printing nonsense example algorithm.

1: Input: query point pq, node Ni from tree T

2: if dmin(pq, Ni) > 1 then
3:   {Prune the node; it is too far away.}
4:   return ∞
5: else
6:   {Recursion order does not matter here; we just return an arbitrary finite value.}
7:   return 867.5309

The Score() function prunes away a branch of the tree if the given node is sufficiently

far away from the point pq (that is, if the minimum distance between pq and any descendant

point of Ni is greater than 1). There are two details in the algorithm that deserve further discussion.

The first is the conditional “if pi not already visited with query point pq then”. For some trees, we know that points cannot be duplicated across nodes (the cover tree is the only exception we have considered here in any detail, but the spill tree also can duplicate points). So in those cases where we know that points cannot be duplicated, the check is simply unnecessary. In the case of the more complex cover tree, the check can be restated in an easy-to-compute manner: “if the parent of Ni does not hold pi”. The second detail is the value that Score() returns. The definition of a pruning single- tree traversal only requires that ∞ signifies that the node should be pruned, but when the node is not pruned, there is no restriction on what Score() should return. In practice, though, many single-tree algorithms (and dual-tree algorithms—more on this shortly) ben- efit from a prioritized recursion. One example is nearest neighbor search: searches are faster when you recurse first into nodes Ni with smaller dmin(pq, Ni), because this is more

likely to decrease the distance between pq and its candidate nearest neighbor, which will in turn allow more pruning during the search [66].
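A sketch of such a prioritized Score() for nearest neighbor search is shown below; it assumes a MinDistance() bound in the spirit of Section 3.5 and a running best candidate distance, and it returns dmin itself (rather than an arbitrary constant) so that a prioritized traversal can visit closer nodes first.

#include <limits>

// Prune when the node cannot contain anything closer than the current best
// candidate distance; otherwise return dmin so closer nodes are recursed into
// first.
template<typename VecType, typename TreeType>
double Score(const VecType& pq, TreeType& node, const double bestDistance)
{
  const double dMin = node.MinDistance(pq);
  if (dMin > bestDistance)
    return std::numeric_limits<double>::infinity(); // prune this branch
  return dMin;
}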

In the case of our simple algorithm above, the recursion order makes no difference as

far as pruning is concerned, and we are unconcerned with the order in which we receive our

(potentially many) greetings; thus, I have chosen an arbitrary value to return in accordance

with the ideas of Tommy Tutone. Any other finite value could work just fine too.

Now, let us extend our general definition to the dual-tree case.

Definition 6. A dual-tree traversal is a process that, given two space trees Tq (query

tree) and Tr (reference tree), will visit every combination of nodes (Nq, Nr) once, where

Nq ∈ Tq and Nr ∈ Tr. At each visit (Nq, Nr), a computation is performed between each

point in Nq and each point in Nr; call this computation BaseCase().

The primary difference between the single-tree definition and the dual-tree definition is

that instead of visiting a single node Ni at a time, we are visiting a combination of nodes

(Nq, Nr) where Nq is the query node and Nr is the reference node. As with the single-tree traversal, if the tree type is such that each point is held in only one node, then any dual-

tree traversal will call BaseCase() once on each combination of query point and reference

point.

Again, we can introduce the notion of a Score() function for pruning.

Definition 7. A pruning dual-tree traversal is a process that, given two space trees Tq (the query tree, built on the query set Sq) and Tr (the reference tree, built on the reference set Sr), will visit combinations of nodes (Nq, Nr) such that Nq ∈ Tq and Nr ∈ Tr no more than once, and call a function Score(Nq, Nr) to assign a score to that node combination. If the score is ∞, the combination is pruned and no combinations (Nqc, Nrc) such that Nqc ∈ D^n_q and Nrc ∈ D^n_r are visited. Otherwise, for every combination of points (pq, pr) such that pq ∈ Pq and pr ∈ Pr, a function BaseCase(pq, pr) is called. If no node combinations are pruned during the traversal, BaseCase(pq, pr) is called at least once on each combination of pq ∈ Sq and pr ∈ Sr.

Algorithm 6 Dual-tree Score() for hello-printing nonsense example algorithm.

1: Input: query node Nq from tree Tq, node Nr from tree Tr

2: if dmin(Nq, Nr) > 1 then
3:   {Prune the node combination; they are sufficiently far apart.}
4:   return ∞
5: else
6:   {Recursion order does not matter here; we just return an arbitrary finite value.}
7:   return 1968

Though this definition is quite complex, it is a generalization of the single-tree traversal

to the dual-tree situation. The BaseCase() function is identical; it compares two points.

But the Score() function is different: in the dual-tree setting, it compares a query node

and a reference node, whereas in the single-tree setting, it compares a query point and a

reference node.

We may revisit the Score() function for our greeting algorithm from earlier, and ex-

tend it to the dual-tree scenario. Now we assume that (for some unclear reason) we have

an entire query set S q and a reference set S r—as opposed to a single query point pq and reference set S r—and we wish to print “Hello!” once for each pair (pqi, pri) where pqi ∈ S q and pri ∈ S r such that d(pqi, pri) ≤ 1. We may use a BaseCase() and Score() func- tion to describe a dual-tree algorithm to perform this task. The BaseCase() has already

been given in Algorithm 4—it generalizes from the single-tree case without modification.

Score() is given in Algorithm 6.

In the dual-tree Score() function, we are able to prune for many query points at once: if Nq and Nr are sufficiently far apart (specifically, if the minimum distance between the two nodes is greater than 1), then no combination of descendant points between D^p_q and D^p_r can have distance less than or equal to 1, and thus they do not need to be visited. Again, in this problem, recursion order does not matter, so I have selected another arbitrary value, this time influenced by the year Miles Davis first released two landmark albums incorporating electric instruments, thus (in part) paving the direction towards jazz-rock fusion and later exciting experimentation.

Algorithm 7 DepthFirstTraversal(Nq, Nr).

1: Input: query node Nq, reference node Nr
2: Output: none
   {Check to see if combination can be pruned.}
3: if Score(Nq, Nr) = ∞ then
4:   return
   {Perform base cases for combinations of points held in the nodes.}
5: for all pq ∈ Pq do
6:   for all pr ∈ Pr do
7:     BaseCase(pq, pr)
   {Recurse into combinations of children.}
8: for all Nqc ∈ Cq do
9:   for all Nrc ∈ Cr do
10:    DepthFirstTraversal(Nqc, Nrc)

An example pruning dual-tree traversal is given in Algorithm 7. This traversal is a dual depth-first traversal, and is generalized directly to the arbitrary space tree case from Gray’s earlier work [61, 64].

The traversal is straightforward: when visiting a node combination (Nq, Nr), we first check to see if it can be pruned, using the Score() function. Then, we perform base cases between the points held in Nq and Nr using the BaseCase() function. Lastly, we recurse into all child combinations of Nq and Nr. It is not difficult to refactor this traversal into a dual breadth-first traversal or a combined traversal which is depth-first in the query tree and breadth-first in the reference tree (or vice versa).
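For reference, a C++ rendering of Algorithm 7 is sketched below; it assumes generic rules and tree objects with the obvious accessors (Score(), BaseCase(), NumPoints(), Point(), NumChildren(), Child()), much like the interfaces formalized in the next chapter, and it is a sketch rather than any library's actual traversal code.

#include <cstddef>
#include <limits>

// Dual depth-first traversal (a direct translation of Algorithm 7): prune the
// node combination if Score() returns infinity, otherwise run base cases on
// all point pairs and recurse into all child combinations.
template<typename RuleType, typename TreeType>
void DualDepthFirstTraversal(RuleType& rules, TreeType& queryNode,
                             TreeType& referenceNode)
{
  if (rules.Score(queryNode, referenceNode) ==
      std::numeric_limits<double>::infinity())
    return;

  for (size_t i = 0; i < queryNode.NumPoints(); ++i)
    for (size_t j = 0; j < referenceNode.NumPoints(); ++j)
      rules.BaseCase(queryNode.Point(i), referenceNode.Point(j));

  for (size_t i = 0; i < queryNode.NumChildren(); ++i)
    for (size_t j = 0; j < referenceNode.NumChildren(); ++j)
      DualDepthFirstTraversal(rules, queryNode.Child(i), referenceNode.Child(j));
}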

Importantly, note that our definitions of traversals here are problem-independent: although certain traversals will be better choices for certain problems, each definition is sufficiently general that no piece of the problem is wrapped into the traversal itself.

3.8 A meta-algorithm to produce a dual-tree algorithm

With each piece of dual-tree algorithms defined, we may propose a meta-algorithm to create a dual-tree algorithm from the pieces:

Given a type of space tree, a pruning dual-tree traversal, a BaseCase() function, and a Score() function, use the pruning dual-tree traversal with the given BaseCase() and Score() functions on two space trees Tq (built on Sq) and Tr (built on Sr).

This modular way of viewing tree-based algorithms has several useful immediate applications. The first is implementation. Given a tree implementation and a dual-tree traversal implementation, all that is required are the BaseCase() and Score() functions. Thus, code reuse can be maximized, and new algorithms can be implemented simply by writing two new functions. More importantly, the code is now modular. mlpack [87], the subject of the next chapter, which is written in C++, uses templates to accomplish this.

The next advantage of the abstraction is clear when we consider all of the papers which do not use this abstraction. March's dual-tree Borůvka algorithm for calculating minimum spanning trees [68] provides a great example. The paper contains two dual-tree algorithms: one for kd-trees and one for cover trees. Each algorithm given is quite different and it is not easy to see their similarities. Using our meta-algorithm, any of these tree-based algorithms can be expressed with less effort—especially for more complex trees like the cover tree—and in a more general sense.

In addition, correctness proofs for our algorithms tend to be quite simple. These proofs for each algorithm here can be given in two simple sub-proofs: (1) prove the correctness of BaseCase() when no prunes are made, and (2) prove that Score() does not prune any subtrees on which the algorithm’s correctness depends.

The logical split of base case, pruning rule, tree type, and traversal can also be advantageous. As one example, the split distills the pruning rule in Score() to its essence, easing the cognitive load of understanding the algorithm and allowing more general pruning observations to be made. In the case of nearest neighbor search, this has led to a new tighter pruning bound, which is discussed far later in the document, in Section 7.1.

More importantly, this logical split allows us to consider each part of dual-tree algorithms individually, and make improvements to each component individually. These improvements are then general enough to apply to many dual-tree algorithms, not just one: if we develop a new type of tree, we do not need to re-tune every existing dual-tree algorithm to work with the new type of tree. Instead, we can simply plug in existing BaseCase() and Score() functions for a given problem into some type of pruning dual-tree traversal, and use our new type of tree, and the algorithm will work as intended. This point in particular is the entire crux of this thesis: we will show the utility of this logical split and see how the entire class of dual-tree algorithms can be improved by improving each of the modular pieces.

CHAPTER 4 MLPACK: A FLEXIBLE C++ FRAMEWORK

Over the past six years, I have been fortunate to lead the mlpack machine learning library

development and maintenance effort. The project has changed much since its original in-

ception in 2007 (as FASTLIB/MLPACK) when it was a storehouse for implementations

of new algorithms, and continues to change as the project gains momentum. At the time

of this writing, in 2015, mlpack nears 30k downloads, has almost 40 contributors, and

contains nearly 100k lines of code.

This chapter provides an introduction to the reasons motivating mlpack, the Armadillo

linear algebra library which makes up the core of mlpack, the template metaprogramming

techniques used to obtain fast, generic machine learning implementations, and how tree-

independent dual-tree algorithms are currently implemented in mlpack1.

4.1 A survey of the landscape of machine learning libraries

As with virtually any niche, the landscape of machine learning libraries (both before and

after mlpack’s introduction) is scattered, smothered, and diced: there are numerous li-

braries but few general-purpose libraries. Libraries generally are specific to one language

or environment (i.e. Java, R, MATLAB); in addition, there are very many one-off tools

for single purposes—for example, libsvm is specifically for SVMs [94]. There are very

many dead libraries which are no longer maintained, and there are a handful of libraries

that try to unify the scattered landscape by using many single-purpose libraries to build a

general-purpose library. There are also numerous attempts at large-scale distributed ma-

chine learning libraries such as those built on Spark [95] and Apache Mahout [96]2.

1Depending on how long it has been since the publication of this document, the information here may be quite out of date; for exact and up-to-date API details, you should refer to the mlpack website at http: //www.mlpack.org. 2The concern here is not distributed systems, so we will not focus on distributed machine learning li- braries. It is an interesting direction for future mlpack development, though.

It is becoming increasingly difficult, though, to see things this way because the machine learning and data science communities seem to have settled on scikit-learn, a

Python toolkit that implements a vast variety of standard machine learning techniques [97]. scikit-learn is not the only Python toolkit for machine learning; there is also MLPy

(now dead) [98], the similarly named PyML [99], Elefant (also dead) [100], PyBrain [101],

Theano [102], and numerous others. Nonetheless, scikit-learn is by far the most widely used of these libraries, and is paralleled in popularity perhaps only by R.

But one large problem with the choice of Python or R as a language (or Java, like Weka

[103]) is that it is often more difficult to achieve runtime efficiency. In Python, one must often write Cython, a language extension of Python that compiles directly to C [104]. In R, the Rcpp package is often used to call out to fast C++ implementations [105]. MATLAB provides the mex compiler so that fast C++ code can be called from inside MATLAB code.

On the other hand, it is colloquially believed that the fastest code tends to be lower-level code—FORTRAN, C, or C++. This is because the low-level thinking necessary to write effective code in these languages often forces consideration of processor architecture, cache effects, linear memory accesses, and so forth; in high-level languages like Python or

MATLAB or R, these details are all hidden from the user and thus it is more difficult to write fast code. Based on this reasoning, if we are aiming to produce fast code, eschewing

R and Python in favor of a lower-level language is a reasonable choice. Indeed, this choice was made by the Vowpal Wabbit online learning library [106], the SHOGUN Machine

Learning Toolbox [107], and the Shark machine learning library [108]; all three of these libraries are written in C++. Vowpal Wabbit in particular is known for its speed, and this stems in part from the choice of C++ as language. Another particular advantage of C++ is the availability of template metaprogramming, a technique that can allow us to write generic, reusable code, often without runtime performance penalties for that genericity

[109].

Unfortunately, at the time of the inception of mlpack, there was no general-purpose

library for tree-based algorithms, and there were no open-source implementations of dual-tree algorithms at all. Hence, mlpack development commenced, and now centers around

tree algorithms at all. Hence, mlpack development commenced, and now centers around

the following simple list of goals:

• to implement scalable, fast machine learning algorithms,

• to design an intuitive, consistent, and simple API for non-expert users,

• to implement a variety of machine learning methods, and

• to provide cutting-edge machine learning algorithms unavailable elsewhere.

Next comes a discussion of how we have achieved each of these goals during development.

4.2 The Armadillo linear algebra library

For linear algebra, mlpack uses the Armadillo linear algebra library [110], which depends on heavy use of template metaprogramming techniques to achieve speed and ease of use.

Armadillo expressions are painless and easy-to-understand; consider the example program below that, after generating some randomly distributed matrix X, computes the inverse of

X^T X + 3I:

#include <armadillo>

using namespace arma;

int main()
{
  mat X;
  X.randu(50, 50); // Size will be 50x50.
  mat Y = inv(X.t() * X + 3 * eye(50, 50));
}

This syntax is based on the syntax of MATLAB or Octave, but in general, linear algebra expressions in Armadillo execute far faster than their MATLAB or Octave counterparts.

The speed of Armadillo comes from two places: its dependence on LAPACK and BLAS

(or suitable replacements such as MKL, ACML, or OpenBLAS), and its ability to remove

the need for temporary objects via template metaprogramming. Let us discuss this second

point in greater detail. Consider the following Armadillo expression:

mat Y = (A + B + C) * D;

where A, B, C, and D are each matrices. If one were to write this expression in MATLAB

or Octave, the program would first add A and B, storing the result in a temporary matrix.

Then, C would be added to this temporary matrix, resulting in another temporary. Finally,

this second temporary matrix would be multiplied with D and stored in the output matrix

Y3.

All of these temporaries are avoidable with C++ thanks to operator overloading and

templates. The technique is referred to as ‘lazy evaluation’ or ‘delayed evaluation’ and the

actual implementation details, which are fairly complex, may be found elsewhere [110].
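To give a flavor of the trick without reproducing Armadillo's machinery, the toy sketch below (entirely hypothetical, and limited to a single addition of two vectors) shows the key idea: operator+ returns a lightweight proxy that merely records its operands, and the element-wise additions only happen in one pass when the proxy is assigned to a destination, so no temporary vector is ever created. Armadillo generalizes this with templates so that arbitrary chains such as (A + B + C) * D are handled the same way.

#include <cstddef>
#include <vector>

struct Vec;

// Lightweight proxy recording a pending element-wise addition; no arithmetic
// happens when it is constructed.
struct AddExpr
{
  const Vec& a;
  const Vec& b;
  double operator[](const std::size_t i) const; // defined below
};

struct Vec
{
  std::vector<double> data;
  explicit Vec(const std::size_t n) : data(n, 0.0) { }

  // Assignment from the proxy performs the additions in a single pass,
  // writing directly into this vector, with no temporary Vec.
  Vec& operator=(const AddExpr& e)
  {
    for (std::size_t i = 0; i < data.size(); ++i)
      data[i] = e[i];
    return *this;
  }
};

inline double AddExpr::operator[](const std::size_t i) const
{
  return a.data[i] + b.data[i];
}

// operator+ does no work; it just builds the proxy object.
inline AddExpr operator+(const Vec& a, const Vec& b) { return AddExpr{a, b}; }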

Armadillo supports numerous standard linear algebra operations, as well as preliminary

sparse matrix operation support; this allows us to make mlpack robust and able to handle

both dense and sparse data with ease.

4.3 Template paradigms for fast, generic code

The primary template technique used in mlpack to provide robust, generic code is policy-

based design, popularized by Andrei Alexandrescu in his book “Modern C++ Design”

[109]. The central idea is that a class can be separated into orthogonal, modular components. One easy example is kernel principal components analysis [111], which has a clear

nents. One easy example is kernel principal components analysis [111], which has a clear

parameter: the kernel. Given the ubiquity of object-oriented programming, most will see

this as a perfect problem for inheritance, and construct a base Kernel class and derive all

kinds of kernels from the base class, and then the KernelPCA class will take a downcasted

3Some obtuse implementations might store the result of the final multiplication in a temporary matrix before storing it in Y. I’d like to hope such an implementation was never released as ‘production-quality code’, but, having seen the code quality in the some of the internals of MATLAB, it is certainly possible...

base Kernel object and use that; this is exactly what the SHOGUN toolbox does [107].

There is a big problem here, though: inheritance is not free (or more specifically, virtual inheritance and virtual methods in C++). Each call to a method of the base Kernel class incurs a pointer lookup to the derived kernel class.

For a method like kernel PCA, where a dataset of size N means the kernel must be evaluated O(N^2) times, the overhead of this lookup is non-negligible. Instead of inheritance, we should use templates, which are actually more flexible. Consider the following KernelPCA class, with a template parameter for the kernel:

template<typename KernelType>
class KernelPCA;

In this case, the type of the kernel is known at compile-time, and thus there is no need for a pointer lookup. If we assume (or say in the documentation) that each KernelType parameter will provide some Evaluate() function, we may easily evaluate the kernel inside the KernelPCA class by calling KernelType::Evaluate().

This design pattern makes the KernelPCA class robust, generic, and extensible. Sup- pose a user wants to use some kernel which does not ship with the KernelPCA class. All they have to do is implement a class with an Evaluate() method, and they can plug their class in as the KernelType parameter; they do not need to know the internals of the

KernelPCA class and in fact they don’t even need to know how kernel PCA works, or consider any of the internal details of the class.
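For instance, a user-defined kernel might look like the sketch below (the kernel and its Evaluate() signature are illustrative assumptions, not code shipped with any library); it can then be supplied directly as the KernelType template argument.

#include <armadillo>

// A hypothetical user-supplied kernel: the Cauchy kernel with unit bandwidth.
class MyCauchyKernel
{
 public:
  // Evaluate the kernel between two points a and b.
  double Evaluate(const arma::vec& a, const arma::vec& b) const
  {
    return 1.0 / (1.0 + arma::dot(a - b, a - b));
  }
};

// The class now drops straight into the policy slot:
//   KernelPCA<MyCauchyKernel> kpca;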

The overarching principle of policy-based design is that each component that is a template parameter has orthogonal functionality and no dependence on the other parameters. Thus, we can further generalize our kernel PCA class: some other policies might be a sampling strategy (i.e. NystroemMethod, NoSampling, RandomSampling, and so forth), element precision (float, double, etc.), and data matrix type (sparse or dense data matrices).

One major drawback of policy-based design is that there is not a clean way to enforce the interface requirements of a template parameter. In the context of our kernel PCA

example, there is no way in C++ to require that the KernelType type implements the Evaluate() method with the proper signature. Unfortunately, C++ Concepts, a solution to this, was not included in the C++11 standard; however, in some cases, SFINAE ("substitution failure is not an error") can be used to mitigate this issue.
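A rough illustration of that SFINAE workaround is sketched below (this is an assumption for illustration, not mlpack's actual machinery): the trait detects at compile time whether KernelType provides an Evaluate() method taking two vectors and returning double, so a static_assert can produce a readable error instead of a long template backtrace.

#include <type_traits>
#include <utility>

// Primary template: assume the method does not exist.
template<typename KernelType, typename VecType, typename = void>
struct HasEvaluate : std::false_type { };

// Specialization selected (via SFINAE) only when
// KernelType::Evaluate(const VecType&, const VecType&) is valid and returns
// double.
template<typename KernelType, typename VecType>
struct HasEvaluate<KernelType, VecType, typename std::enable_if<
    std::is_same<double, decltype(std::declval<const KernelType&>().Evaluate(
        std::declval<const VecType&>(),
        std::declval<const VecType&>()))>::value>::type> : std::true_type { };

// Usage inside a class such as KernelPCA<KernelType>:
//   static_assert(HasEvaluate<KernelType, arma::vec>::value,
//                 "KernelType must provide double Evaluate(a, b)");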

4.4 Design principles of mlpack

As mentioned earlier, there are four overarching guidelines of mlpack development:

• to implement scalable, fast machine learning algorithms,

• to design an intuitive, consistent, and simple API for non-expert users,

• to implement a variety of machine learning methods, and

• to provide cutting-edge machine learning algorithms unavailable elsewhere.

We can consider each of these individually in turn.

4.4.1 Scalable and fast machine learning algorithms

Implementing scalable and fast machine learning algorithms often means that the simplest implementation of a technique will not suffice. Nearest neighbor search—a primary problem in this thesis and a primary component of mlpack's applicability—thus is not implemented as a linear scan over all points; as we have already discussed, this is slow and does not scale. Instead, mlpack uses dual-tree algorithms to solve that problem, and numerous other problems where dual-tree algorithms are applicable. In addition, other techniques are also available, such as Nyström sampling for kernel PCA, and a low-rank semidefinite program solver [112].

But scalability and speed do not just correspond to implementing the asymptotically fastest algorithms. Numerous implementational tricks are often required, and where possible mlpack uses template techniques for speed. These techniques are effective, as benchmarks from the mlpack paper show [87].

Table 7: mlpack benchmark dataset sizes.

Dataset    wine     cloud     wine-q        isolet    mboone
UCI Name   Wine     Cloud     Wine Quality  ISOLET    MiniBooNE
Size       178x13   2048x10   6497x11       7797x617  130064x50

Dataset    yp-msd              corel     covtype     mnist      randu
UCI Name   YearPredictionMSD   Corel     Covertype   N/A        N/A
Size       515345x90           37749x32  581082x54   70000x784  1000000x10

Table 8: All-k-nearest neighbor benchmarks (in seconds).

Dataset   mlpack     Weka        Shogun      MATLAB      mlpy        scikit
wine      0.0003     0.0621      0.0277      0.0021      0.0025      0.0008
cloud     0.0069     0.1174      0.5000      0.0210      0.3520      0.0192
wine-q    0.0290     0.8868      4.3617      0.6465      4.0431      0.1668
isolet    13.0197    213.4735    37.6190     46.9518     52.0437     46.8016
mboone    20.2045    216.1469    2351.4637   1088.1127   3219.2696   714.2385
yp-msd    5430.0478  >9000.0000  >9000.0000  >9000.0000  >9000.0000  >9000.0000
corel     4.9716     14.4264     555.9600    60.8496     209.5056    160.4597
covtype   14.3449    45.9912     >9000.0000  >9000.0000  >9000.0000  651.6259
mnist     2719.8087  >9000.0000  3536.4477   4838.6747   5192.3586   5363.9650
randu     1020.9142  2665.0921   >9000.0000  1679.2893   >9000.0000  8780.0176

First, considering the task of nearest neighbor search, we compare mlpack against

Weka [103], SHOGUN [107], MATLAB, mlpy [98], and scikit-learn [97]4. mlpack

contains a dual-tree algorithm for nearest neighbor search, and both scikit-learn and

Weka contain a single-tree algorithm. MATLAB and SHOGUN and mlpy, however, con-

tain the brute-force search. Using the datasets in Table 7, all-nearest-neighbor search is run

for each dataset, and the results are presented in Table 8.

The next benchmark focuses on implementation efficiency, not algorithmic efficiency.

We compare k-means for the same libraries, starting from the same centroids for each

library (except mlpy and Weka, which did not allow specification of the starting centroids).

Each library implements the standard k-means algorithm, which is a linear scan over every

4This comparison was performed in 2012, using mlpack 1.0.3 and period-appropriate versions of the other libraries. Things have changed since then—but mlpack’s automatic benchmarking system [113] shows that mlpack continues to hold a competitive edge for nearest neighbor search and other machine learning algorithms; see http://www.mlpack.org/benchmarks.html.

Table 9: k-means benchmarks (in seconds).

Dataset   Clusters  mlpack    Shogun     MATLAB     scikit
wine      3         0.0006    0.0073     0.0055     0.0064
cloud     5         0.0036    0.1240     0.0194     0.1753
wine-q    7         0.0221    0.6030     0.0987     4.0407
isolet    26        4.9762    8.5093     54.7463    7.0902
mboone    2         0.1853    8.0206     0.7221     memory
yp-msd    10        34.8223   135.8853   269.7302   memory
corel     10        0.4672    2.4237     1.6318     memory
covtype   7         13.5997   71.1283    54.9034    memory
mnist     10        80.2092   163.7513   133.9970   memory
randu     75        727.1498  7443.2675  3117.5177  memory

point and every cluster to find the nearest cluster centroid of every point (later, we will describe a fast dual-tree algorithm for this); thus, each library is doing the same amount of work. Still, we see that mlpack’s implementation is superior in efficiency to the other libraries, in Table 9.

4.4.2 Intuitive, consistent, and simple API

The second focus of mlpack development is consistent, easy-to-use code. To this end, mlpack is organized into two large directories of code: core/ and methods/. The code in core/ includes things like probability distributions, metrics, kernels, and other utility functionality that makes up the building blocks of the machine learning methods implemented in methods/.

In general, classes in mlpack operate in the same way: the constructor performs preprocessing and must return a valid object which is ready to be trained or used. Training (for a machine learning method) then must take place in an Estimate() or Train() method. Then, application of the machine learning model to data should be through a function such as Cluster(), Predict(), Regress(), or a similarly descriptively named function.

Further, classes in mlpack implement policy-based design in the manner discussed in

Section 4.3, and each template parameter (when possible) should have a default to improve usability. This means that virtually every class in mlpack is extensible and modular, so that

users who are not familiar with the internals of the codebase can still implement their own

modifications of the algorithms that mlpack provides. Below is an example interface, for

k-means clustering:

template<typename MetricType = metric::EuclideanDistance,
         typename InitialPartitionPolicy = RandomPartition,
         typename EmptyClusterPolicy = MaxVarianceNewCluster,
         template<class, class> class LloydStepType = NaiveKMeans,
         typename MatType = arma::mat>
class KMeans;

In this example, the k-means implementation provides five significant degrees of free-

dom to the user. First, the user may choose their own distance metric5; but, if they choose

not to, the standard L2 distance is the default. The second template parameter allows the

user to specify the way that initial clusters are assigned; the third allows the user to spec-

ify the action to be taken when an empty cluster (that is, a cluster that owns no points) is

detected at the end of an iteration. The LloydStepType, a template template parameter6, allows the user to specify their own algorithm to perform a single k-means iteration. The default is the brute-force implementation used for the benchmarks, but mlpack provides

four other techniques that a user can plug in, too. Lastly, the user may specify the type of

data: dense (arma::mat) or sparse (arma::sp_mat).

Lastly, not every person who wishes to use machine learning software is familiar with

C++; therefore, each machine learning algorithm implemented by mlpack also includes a

well-documented command-line interface. This follows the UNIX tradition of providing

small tools which can be wrapped together into larger applications. To continue with the

k-means example from just above, below demonstrates how k-means can be run from the

command line once mlpack is built and installed:

$ kmeans -i dataset.csv -c 25 -a elkan -v -C centroids.csv

5In our implementation, this is limited to Euclidean metrics, but that is a minor detail that the documenta- tion clarifies. 6C++ is horribly confusing, and with terms like ‘template template parameter’ and ‘rvalue reference’ and ‘functor’ and ‘partial template function specialization’, the entire language is quite possibly an advanced form of satire. Still, the generic programming infrastructure is quite powerful, as the text shows; personally, I consider C++ both the worst and the best language ever designed.

The example above performs k-means clustering with 25 clusters on the dataset in dataset.csv, using Elkan's algorithm, and storing the converged centroids in a file named centroids.csv. These are only a few of the options available, though; the rest can be accessed with either man kmeans or kmeans -h. This pattern is consistent across the rest of the mlpack methods.

Although this discussion has been relatively short, it is still possible to see how a robust, modular design can lead to flexibility for advanced users while maintaining simplicity for novice users.

4.4.3 Current functionality of mlpack

The last two goals of mlpack, a variety of methods and cutting-edge algorithms unavailable elsewhere, are represented easily by a list of the current functionality of mlpack. Given below are each of the significant modules that mlpack provides, with relevant citations. A

(*) indicates that mlpack has the only implementation of the technique.

• Dual-tree k-nearest neighbor search (*) [66]

• Dual-tree k-furthest neighbor search (*)

• Dual-tree range search (*)

• Dual-tree Borůvka's algorithm for minimum spanning tree computation (*) [68]

• Dual-tree fast max-kernel search (*) [46, 79]

• Rank-approximate nearest neighbor search (*) [67]

• Adaboost [114]

• Non-negative matrix factorization [115]

• SVD batch learning matrix factorization [116]

• SVD incremental learning matrix factorization [116]

• Least-squares linear and ridge regression

• Naive Bayes classifier

• Least angle regression [117]

• Gaussian mixture model training and prediction

• Density estimation trees [118]

• Mean shift clustering [119]

• Locality-sensitive hashing based on p-stable distributions [120]

• Neighborhood components analysis [121]

• Local coordinate coding [122]

• Sparse coding via LARS [123]

• Dual-tree k-means clustering (*) plus four other k-means techniques [80]

• Kernel principal components analysis [111]

• QUIC-SVD (*) [58]

• Perceptrons [11]

• Low-rank semidefinite program solver [112]

• General artificial neural network framework

• Hidden Markov model training, prediction, and generation

• Principal components analysis

• Softmax regression

• Collaborative filtering via matrix decomposition [124]

• Nyström method for sampling [24]

4.5 Tree-independent dual-tree algorithms in mlpack

The current chapter, up to this point, has been something of a deviation from the rest of this thesis; it has been focused on a general-purpose machine learning toolkit. However, one part of mlpack is of particular importance to this thesis, and builds upon the material just introduced: the dual-tree algorithm framework in mlpack.

We already know from Chapter 3 that dual-tree algorithms may be represented by three separate components: a type of tree, a type of traversal, and a Score() and BaseCase() function. This allows us to construct a relatively straightforward API for dual-tree algo- rithms. The API reference given here is as of mlpack 1.1.07.

In our API, we assume that a tree is built on a dataset of type MatType (usually arma::mat), and each point in this dataset is indexable by a unique unsigned integer type, size_t. A point in the input space, when not represented by an index, is represented by

a VecType object (usually arma::vec). The type of the dataset will follow the same API

conventions as Armadillo matrices; however, in this discussion, we don’t need to dig that

deep.

4.5.1 The TreeType policy

The first API to establish is the TreeType class policy, which defines the methods that a tree type must implement. In mlpack, there is no distinction between a tree and a node, because each node corresponds to its own subtree. Therefore, only one class is necessary to represent a tree and all nodes contained in that tree. The class must implement each of the methods required by the TreeType class policy. Figure 14 and 15 show most of the methods that a tree type must implement. There are a few others not documented here, but the excerpt given here is sufficient to get a feel for the API. The comments above each method describe the required functionality.

7Yet to be released—but the API is finalized.

// Return the dataset that the tree is built on.
const MatType& Dataset();

// Return the parent of the node, or NULL if this is the root.
TreeType* Parent();

// Return the number of children held by the node.
size_t NumChildren();

// Return the ith child held by the node.
TreeType& Child(const size_t i);

// Return the number of points held by the node.
size_t NumPoints();

// Return the ith point held by the node.
VecType& Point(const size_t i);

// Return the number of descendant points of the node.
size_t NumDescendants();

// Return the ith descendant point of the node.
VecType& Descendant(const size_t i);

// Return the number of descendant nodes.
size_t NumDescendantNodes();

// Return the ith descendant node.
TreeType& DescendantNode(const size_t i);

// Store the centroid of the node in the given vector.
void Centroid(VecType& centroid);

Figure 14: Methods required by TreeType class (part one).

Most of the methods have an analog to quantities or bounds described in Chapter 3

(specifically, Section 3.3). Points, children, and descendant points are all accessible through

the API. It is important to note that points are referred to by their index in the dataset (of

unsigned integer type size t). Methods such as MinDistance() and MaxDistance() correspond to the bounds dmin(·, ·) and dmax(·, ·), respectively. Each of the methods in this API are directly usable and useful by the soon-to-be-formalized-in-code BaseCase() and

Score() functions.

However, it is often the case that the API provided by the TreeType policy is insuf-

ficient for certain dual-tree algorithms. For instance, suppose the existence of a dual-tree algorithm which depends on the kurtosis of the descendant points of each node. For this

// Get the distance between the centroid of this node and the centroid
// of its parent.
double ParentDistance();

// Return the furthest distance from the centroid of the node to any
// point held in the node.
double FurthestPointDistance();

// Return the furthest descendant point distance from the centroid
// of the node.
double FurthestDescendantDistance();

// Return the minimum distance between the given point and the node.
template<typename OtherVecType>
double MinDistance(OtherVecType& point);

// Return the minimum distance between the given node and this node.
double MinDistance(TreeType& otherNode);

// Return the maximum distance between the given point and the node.
template<typename OtherVecType>
double MaxDistance(OtherVecType& point);

// Return the maximum distance between the given node and this node.
double MaxDistance(TreeType& otherNode);

Figure 15: Methods required by TreeType class (part two).

niche situation, it is unreasonable to add a Kurtosis() function to the TreeType policy; acquiescing to each of these requests quickly leads to a wildly complex set of requirements for tree implementers. The solution for mlpack is simpler and more elegant: each class implementing the TreeType policy must have as a template parameter a StatisticType class, and must hold an instance of the StatisticType class in each node.

The StatisticType class (or just the statistic) is useful for holding problem-specific quantities such as the kurtosis, as in the above example, or other cached statistics similar to those suggested by Moore in the Anchors hierarchy [125]. The restrictions on the

StatisticType API are very loose; a StatisticType must simply provide a default constructor, and a constructor called with the node that the statistic corresponds to. An example class definition is given for the kurtosis example in Figure 16.

In general, most dual-tree algorithms in mlpack implement their own statistic type to cache bounding information, centroids, centers, or other quantities not provided for by the

60 class KurtosisStatistic { public: // Default constructor; sets kurtosis to 0. KurtosisStatistic(); // Construct the KurtosisStatistic on the given node. // This calculates the kurtosis on the descendant points of the node. KurtosisStatistic(TreeType& node); // More methods and members may (and should) be added, but the two // above are all that is required. }; Figure 16: Example StatisticType class. general TreeType class policy.

There is one more important part of the mlpack TreeType class: the TreeTraits

class for template metaprogramming. We may use the TreeTraits class to determine traits about the tree at compile-time (similar to the traits in Tables 2, 3, 4, 5, and 6), which may allow us to make additional assumptions in the BaseCase() and Rules() function.

We must simply define an appropriate specialization of the TreeTraits template class.

An example, for the cover tree, is given in Figure 17; note that in this example, the spe- cialization itself must be templatized because the CoverTree class is templatized too (in accordance with the principles of policy-based design).

Using these traits in practice is fairly straightforward. Consider the simple but common task of recursing into the children of a node8. This can be written generally using the API we have introduced (in Figure 18):

But we can use TreeTraits::BinaryTree to avoid the loop entirely for

binary trees (Figure 19). The if statement is known at compile-time, and therefore the

compiler can optimize away the for loop entirely if the tree is a binary tree.

This strategy of using compile-time constants to allow the compiler to use specialized

8This example is somewhat contrived. Most modern optimizing compilers would be able to unroll the loop completely for binary trees, without the hint provided by template metaprogramming. However, the example is still useful in that it demonstrates how one might use the TreeTraits<> class in more complex situations that the compiler will not optimize correctly.

61 template class TreeTraits> { public: /** * The cover tree (or, this implementation of it) does not require * that children represent non-overlapping subsets of the parent * node. */ static const bool HasOverlappingChildren = true; /** * Each cover tree node contains only one point, and that point is * its centroid. */ static const bool FirstPointIsCentroid = true; /** * Cover trees do have self-children. */ static const bool HasSelfChildren = true; /** * Points are not rearranged when the tree is built. */ static const bool RearrangesDataset = false; /** * The cover tree is not necessarily a binary tree. */ static const bool BinaryTree = false; }; Figure 17: Example specialization of TreeTraits. sections of code is applicable to far more complex situations than the one described above, and is used at length in mlpack to accelerate algorithms.

62 for(size ti=0; i < node.NumChildren(); ++i) Recurse(node.Child(i)); Figure 18: Recursing into the children of a node.

if (TreeTraits::BinaryTree) { Recurse(node.Child(0)); Recurse(node.Child(1)); } else { for(size ti=0; i < node.NumChildren(); ++i) Recurse(node.Child(i)); } Figure 19: Compile-time specialization of recursion with TreeTraits.

// Compute the base case between a query and reference point. double BaseCase(const size t queryIndex, size t referenceIndex); // A single-tree Score() function; returns DBL MAX if the node should // be pruned. double Score(const size t queryIndex, TreeType& referenceNode); // Re-score the node, considering the old score. double Rescore(const size t queryIndex, TreeType& referenceNode, const double oldScore); // A dual-tree Score() function; returns DBL MAX if the combination // should be pruned. double Score(TreeType& queryNode, TreeType& referenceNode); // Re-score the node combination, using the old score. double Rescore(TreeType& queryNode, TreeType& referenceNode, const double oldScore); Figure 20: Required API for RuleType classes.

4.5.2 The RuleType policy

The next part of the dual-tree algorithm infrastructure in mlpack is the RuleType policy class, which encapsulates the BaseCase() and Score() functions that describe the dual- tree (or single-tree) algorithm. These classes end up being more complex than the paper- presented abstraction would suggest, with the necessary methods being listed in Figure 20.

Short comments for each method are given in the figure; details are given below.

63 The BaseCase() and Score() methods are nearly as one would expect. BaseCase()

takes a query point and reference point, but returns a double, somewhat contrary to the

abstraction. The reason for this is that sometimes, the result of the BaseCase() calcula-

tion is useful elsewhere. Consider nearest-neighbor search with ball trees. The base case

calculates the distance between a query point and a reference point, but this is also usable

during Score(), if the query and reference points pq and pr are the centers of tree nodes

Nq and Nr. Inside the Score() method, one can call BaseCase() directly, which will

return d(pq, pr). This may be easily adjusted into dmin(Nq, Nr), by subtracting the radii

of the two nodes (λq and λr). Some semi-clever software engineering is then necessary to

make sure the subsequent calls to BaseCase() with pq and pr (which are required by the definition of pruning dual-tree traversal) don’t perform any work. For certain types of trees,

a significant amount of computation can be avoided in this manner. However, because this

is entirely implementation-specific, there is no need to discuss this detail in the abstract

terminology of the previous chapter.

Because ∞ is difficult to represent in C++9, the Score() functions return DBL MAX

when pruning is possible instead of ∞. Any value less than DBL MAX indicates a score

for prioritized recursion, with lower scores indicating that the node combination should be

recursed into sooner.

The last deviation from the abstraction is the two Rescore() functions. The pruning

rules for most dual-tree algorithms boil down to some comparison of the form “if a is greater than b, then prune”, where a is some combination-specific quantity that usually relates to the distance between the nodes, and b is some bound quantity. For instance, in single-tree nearest neighbor search, a is dmin(pq, Nr) and b is d(pq, pˆnn), wherep ˆnn is the current nearest neighbor candidate for the query point pq. In these situations, a (or some algebraic manipulation thereof) is often a good score to return to the traversal for prioritized recursion. The Rescore() function allows the traversal to make a final check for pruning

9More specifically, it takes longer and is uglier to type std::numeric limits::infty() than DBL MAX.

64 template class ExampleTraversal { public: // Construct the traversal with the given instantiated RuleType. ExampleTraversal(RuleType& rule); // Perform a single-tree traversal on the given query point and // reference tree. template void Traverse(const size t queryIndex, TreeType& referenceNode); // Perform a dual-tree traversal on the given query and reference // tree. template void Traverse(TreeType& queryNode, TreeType& referenceNode); // More methods are allowable, but not required. }; Figure 21: Example TraversalType class. before recursion, in case the bounding quantity b has tightened between the call to Score() and the recursion.

It is important to note that the Rescore() functions are not required, and in situations where it is known that Rescore() cannot be effective, the body of the Rescore() function can simply be ‘return oldScore;’.

4.5.3 The TraversalType policy

The last piece of mlpack’s dual-tree algorithm puzzle is the TraversalType policy, which uses both the RuleType and TreeType policies. The requirements are quite simple and are shown in Figure 21. The traversal must take a RuleType class as its only template param- eter, have a constructor which takes an instantiated RuleType, and provide a Traverse() function to perform either a single-tree or dual-tree traversal (or a Traverse() function for each case may be provided, like in the example). The Traverse() overload for the single- tree case takes a query point (size t) and a reference node (TreeType&); in the dual-tree

case, Traverse() takes a query node and reference node, both of type TreeType&. For

either overload, the node is not required to be the root of a tree, which means that a traversal

65 // Assume we have datasets from somewhere. extern arma::vec queryPoint; extern arma::mat queryData, referenceData; // Build the trees. MyTree queryTree(queryData); MyTree referenceTree(referenceData); // Create the rule. We assume MyRule has a default constructor (this // assumption is generally not true, though, so the code ends up being // slightly more complex). MyRule rule; // Create the traversal. MyTraversal traversal(rule); // First run a single-tree traversal with the query point. traversal.Traverse(queryPoint, referenceTree); // Now run a dual-tree traversal with the query tree. traversal.Traverse(queryTree, referenceTree); Figure 22: Code to run a sample dual-tree algorithm in mlpack.

can be run only on subtrees, if so desired.

This API leaves a lot of flexibility to the implementer; additional functions may be

provided if desired. Further, Traverse() is not required to be recursive; it is only the

starting point for a traversal.

4.5.4 Assembling a dual-tree algorithm in mlpack

With the TreeType, RuleType, and TraversalType policy classes, which represent each

of the pieces of a dual-tree algorithm, it is easy to assemble a dual-tree (or single-tree)

algorithm. Figure 22 shows some example code that assembles a dual-tree algorithm using

MyTree as the tree type, MyRule as the rule type, and MyTraversal as the traversal type.

In reality, the dual-tree algorithms implemented in mlpack tend to be quite more com- plex than the snippet given. In particular, RuleType classes often have constructors that take a number of matrices to store the results of the dual-tree algorithm, and sometimes other tuning parameters or options. Both traversals and trees may also have options that can be (optionally) set in the constructor.

66 Nonetheless, the dual-tree algorithm API as detailed in this section is sufficient to en- capsulate the class of dual-tree algorithms.

Figure 23: Fritz and Drusilla.10

(a) Fritz. (b) Drusilla.

10I was originally challenged to include a picture of my cats somewhere in this document. In an unusual moment of lucid reasoning, I decided that it would not be a good idea. However, during discussions with the committee, the point came up (as a joke) and the committee seemed to be of the opinion that inclusion of the cats would not detract from the thesis, especially given that McClellan [126] did so much more obviously on the front cover of his book. Therefore, I have finally published an academic document with a picture of my cats in it, fulfilling a long-standing bet. I believe that I am now owed lunch, or a beer, or something.

67 CHAPTER 5 TREES

The first piece of a dual-tree algorithm is the type of space tree. As with each piece of the tree-independent dual-tree algorithm abstraction detailed in Chapter 3, any improvements to trees can propagate to the algorithms that use that particular type of tree. Therefore, I have spent some time working with individual tree types, and the advancements and other insights I have found are thus detailed in this chapter. The majority of my work—and all of the work worth publishing in this chapter—concerns the cover tree, a complex tree type that is most useful for its theoretical properties.

5.1 Free parameters in the cover tree

The cover tree is a widely-known tree structure that has gained popularity for nearest neighbor search and other tasks such as local SVM training [127], max-kernel search

[46, 79], Euclidean minimum spanning tree calculation [68], and k-average-medoid cal-

culation [128] (as well as many others). A large part of the interest in the structure comes

from its convenient theoretical properties, which allow bounding the complexity of the

structure with respect to a dataset-dependent quantity called the expansion constant.

However, the cover tree is an extremely complex structure and has been found repeat-

edly to have a lot of tree-building overhead [129, 130]; the implementation in mlpack

shows this assessment to be accurate, with cover tree construction often taking significantly

longer than kd-tree construction (or other tree types).

The cover tree construction algorithm, as given, has several parameters for tuning, in-

cluding the selection of the root of the tree, which I therefore investigated. I also attempted

to find those characteristics of the cover tree which correlate to better performance. Unfor-

tunately, the only correlation I have found to date is a weak correlation, and my efforts to

select a root point better were a wash. Still, I find the negative results to be valuable, if only

68 as a warning to others, and therefore we will discuss them (briefly).

5.1.1 The cover tree: a rehash

The cover tree is a leveled hierarchical originally proposed for the task of

nearest neighbor search by Beygelzimer, Kakade, and Langford [57]. Each node Ni in the

cover tree is associated with a single point pi. An adequate description is given in their work (we have adapted notation slightly):

A cover tree T on a dataset S is a leveled tree where each level is a “cover”

for the level beneath it. Each level is indexed by an integer scale si which decreases as the tree is descended. Every node in the tree is associated with a

point in S . Each point in S may be associated with multiple nodes in the tree;

however, we require that any point appears at most once in every level. Let Csi

denote the set of points in S associated with the nodes at level si. The cover

tree obeys the following invariants for all si:

• (Nesting). Csi ⊂ Csi−1. This implies that once a point p ∈ S appears in

Csi then every lower level in the tree has a node associated with p.

• (Covering tree). For every pi ∈ Csi−1, there exists a p j ∈ Csi such that

si d(pi, p j) < 2 and the node in level si associated with p j is a parent of the

node in level si − 1 associated with pi.

si • (Separation). For all distinct pi, p j ∈ Csi , d(pi, p j) > 2 .

As a consequence of this definition, if there exists a node Ni, containing the point pi at some scale si, then there will also exist a self-child node Nic containing the point pi at scale si − 1 which is a child of Ni. In addition, every descendant point of the node Ni is

si+1 si+1 contained within a ball of radius 2 centered at the point pi; therefore, λi = 2 and

µi = pi (see Table 1).

69 Note that the cover tree may be interpreted as an infinite-leveled tree, with the highest

level C∞ containing only the root point, the lowest level C−∞ containing every point (that is,

C−∞ = S ), and all levels between defined as above. Beygelzimer et al. find this representa- tion (which they call the implicit representation) easier for description of their algorithms

and some of their proofs [57]. But clearly, this is not suitable for implementation; hence,

they also introduce an explicit representation in which all nodes that have only a self-child

are coalesced upwards (that is, the node’s self-child is removed, and the children of that

self-child are taken to be the children of the node).

The theoretical results for the cover tree apply to both the explicit and implicit repre-

sentations of the tree. Personally, I think that the explicit representation is much simpler to

work with; therefore, all of the following results work with the explicit representation of

the tree. Some key points of the explicit representation of a cover tree are listed below (see

also Table 6 and Section 5.1.1):

• Each node Ni contains a single point pi and has a scale si.

si+1 • All descendant points of Ni are within the radius 2 .

• The children of Ni are not necessarily at the scale si − 1, but their scale must be less

than si.

• The regions corresponded to by nodes Ni and N j where si = s j may overlap (Ni

si+1 corresponds to a ball of radius 2 and center pi; N j corresponds to a ball of the

same radius and center p j), but the nodes are still beholden to the separation invariant.

We are not concerned with a batch construction algorithm for the cover tree here, but

the interested reader should refer to the original paper for details [57]. The tree can be

built in O(c6N log N) time, where c is the expansion constant (described in the following

subsection).

70 5.1.2 The expansion constant

The explicit representation of a cover tree has a number of useful theoretical properties based on the expansion constant [131]; we restate its definition below.

Definition 8. Let BS (p, ∆) be the set of points in S within a closed ball of radius ∆ around some p ∈ S with respect to a metric d: BS (p, ∆) = {r ∈ S : d(p, r) ≤ ∆}. Then, the expansion constant of S with respect to the metric d is the smallest c ≥ 2 such that

|BS (p, 2∆)| ≤ c|BS (p, ∆)| ∀ p ∈ S, ∀ ∆ > 0. (8)

The expansion constant is used heavily in the cover tree literature. It is, in some sense, a notion of instrinic dimensionality, and previous work has shown that there are many scenarios where c is independent of the number of points in the dataset [131, 57, 132, 133].

Note also that if points in S ⊂ H are being drawn according to a stationary distribution f (x), then c will converge to some finite value c f as |S | → ∞. To see this, define c f as a generalization of the expansion constant for distributions. c f ≥ 2 is the smallest value such that

Z ! Z ! f (x)dx ≤ c f f (x)dx (9) BH (p,2∆) BH (p,∆) R for all p ∈ H and ∆ > 0 such that f (x)dx > 0, and with BH (p, ∆) defined as the BH (p,∆) closed ball of radius ∆ in the space H.

As a simple example, take f (x) as a uniform spherical distribution in Rd: for any |x| ≤ 1,

d f (x) is a constant; for |x| > 1, f (x) = 0. It is easy to see that c f in this situation is 2 , and thus for some dataset S , c must converge to that value as more and more points are added to S . Closed-form solutions for c f for more complex distributions are less easy to derive; however, empirical speedup results from the original cover tree paper suggest the existence of datasets where c is not strongly dependent on d [57]. For instance, the covertype dataset has 54 dimensions but the expansion constant is much smaller than other, lower- dimensional datasets.

71 There are some other important observations about the behavior of c. Adding a single point to S may increase c arbitrarily: consider a set S distributed entirely on the surface of a unit hypersphere. If one adds a single point at the origin, producing the set S 0, then c explodes to |S 0| whereas before it may have been much smaller than |S |. Adding a single point may also decrease c significantly. Suppose one adds a point arbitrarily close to the origin to S 0; now, the expansion constant will be |S 0|/2. Both of these situations are degen- erate cases not commonly encountered in real-world behavior; we discuss them in order to point out that although we can bound the behavior of c as |S | → ∞ for S from a stationary distribution, we are not able to easily say much about its convergence behavior.

The expansion constant can be used to show a few useful bounds on various properties of the cover tree; we restate these results below, given some cover tree built on a dataset S

with expansion constant c and |S | = N:

• Width bound: no cover tree node has more than c4 children (Lemma 4.1, [57]).

• Depth bound: the maximum depth of any node is O(c2 log N) (Lemma 4.3, [57]).

• Space bound: a cover tree has O(N) nodes (Theorem 1, [57]).

Lastly, we introduce a convenience lemma of our own which is a generalization of the packing arguments used by [57]. This is a more flexible version of their argument.

Lemma 1. Consider a dataset S with expansion constant c and a subset C ⊆ S such that every point in C is separated by δ. Then, for any point p (which may or may not be in S ), and any radius ρδ > 0:

2+dlog2 ρe |BS (p, ρδ) ∩ C| ≤ c . (10)

Proof. The proof is based on the packing argument from Lemma 4.1 in [57]. Consider two cases: first, let d(p, pi) > ρδ for any pi ∈ S . In this case, BS (p, ρδ) = ∅ and the lemma

72 holds trivially. Otherwise, let pi ∈ S be a point such that d(p, pi) ≤ ρδ. Observe that

2+dlog ρe BS (p, ρδ) ⊆ BS (pi, 2ρδ). Also, |BS (pi, 2ρδ)| = c 2 |BS (pi, δ/2)| by the definition of the expansion constant. Because each point in C is separated by δ, the number of points in

BS (p, ρδ) ∩ C is bounded by the number of disjoint balls of radius δ/2 that can be packed

into BS (p, ρδ). In the worst case, this packing is perfect, and

|BS (pi, 2ρδ)| 2+dlog2 ρe |BS (p, ρδ)| ≤ ≤ c . (11) |BS (pi, δ/2)| 

5.1.3 Root point selection policy

Undeterred by the hearsay of others, who reported that they were unable to get any notice- able improvement out of the cover tree by tweaking build-time heuristics, I observed that one free parameter in the build process of the cover tree is the selection of the root point.

The reference implementation simply selects the first point in the dataset (and mlpack does this too).

It is intuitive to reason that a good cover tree will have a root point near the centroid, which will allow the scale of the root node to potentially be smaller and may result in a more balanced tree. The centroid can be found quite quickly in a single pass, and then the nearest point in the dataset can be found by simply using nearest neighbor search on the dataset with the centroid as a query point. Although this is time-consuming, a simple heuristic may exist to find a point near enough to the centroid to still obtain the runtime benefits of selecting a root point near the centroid—if such runtime benefits existed.

Table 10 shows statistics for the search-time performance of trees built using standard construction techniques and the nearest-point-to-centroid root selection construction tech- nique at build-time, and Table 11 shows build-time performance statistics on a variety of datasets.

More trials show no clear trends between the choice of root point and the performance

73 Table 10: Runtime statistics for different root point policies.

Dataset Size (n × d) Root Policy Base cases Scores cloud 10x2048 standard 100,294 200,777 cloud 10x2048 centroid 99,287 201,118 sat-train 37x4435 standard 796,257 1,453,644 sat-train 37x4435 centroid 774,093 1,417,471 isolet 617x7797 standard 36,151,613 50,368,497 isolet 617x7797 centroid 35,789,986 49,869,594 corel 32x37749 standard 96,773,720 206,686,912 corel 32x37749 centroid 100,138,961 210,169,714 miniboone 50x130064 standard 303,840,668 670,128,641 miniboone 50x130064 centroid 305,535,492 674,806,902 covertype 54x581012 standard 125,172,202 280,307,963 covertype 54x581012 centroid 125,052,402 279,764,213 mnist 784x70000 standard 2,545,638,649 3,750,433,910 mnist 784x70000 centroid 2,546,585,630 3,758,033,756

Table 11: Build-time statistics for different root point policies.

Dataset Root Policy Distance Evals Nodes max |Ci| smax smin cloud standard 88,796 2,856 21 12 -1 cloud centroid 91,540 2,871 22 12 -1 sat-train standard 1,814,451 5,157 273 9 4 sat-train centroid 1,752,514 5,155 295 9 4 isolet standard 22,951,726 9,158 2112 5 2 isolet centroid 22,495,666 9,147 2504 5 2 corel standard 80,757,262 46,885 488 1 -7 corel centroid 85,601,844 47,020 407 1 -7 miniboone standard 239,336,248 160,348 626 24 4 miniboone centroid 241,277,485 160,419 550 24 4 covertype standard 160,801,654 801,884 71 14 2 covertype centroid 164,858,576 801,864 98 14 2 mnist standard 1,512,062,473 77,851 5410 12 8 mnist centroid 1,468,179,161 77,882 5113 13 8 of the tree at runtime. Sometimes, the standard policy performs better, and sometimes, the centroid policy performs better. Swings of up to about 10% (in terms of number of base case computations) are observed. In those files, there is also data for when the base is set to 1.3, as in the original implementation [57]; it becomes clear that the speedup due to the smaller base is not due to a better tree—the number of base case computations almost

74 always goes up—but instead that the number of distance computations during tree-building can be orders of magnitude fewer. Often, with cover trees, the longest part of the search is the tree-building, due to the complexity of the cover tree building procedure.

5.1.4 Correlation of tree width to performance

To further investigate the effect of root point selection, on a few sample datasets (‘cloud’,

‘sat-train’, and ‘winequality’, all from the UCI repository [134]), cover trees were made

Figure 24: Tree performance related to the average number of children per node, for the cloud dataset.

75 Figure 25: Tree performance related to the average number of children per node, for the sat-train dataset.

with every possible root point. Then, each tree was used to perform dual-tree all-nearest-

neighbor search. Overall, the difference between the number of distance calculations per-

formed during search for the best and worst trees was less than 20% for each dataset,

meaning that even if there was a good way to select the root point, it would not affect performance significantly.

The only correlation found between these trees and their performance was the average number of children per node, and even that was only a weak correlation. Figures 24, 25, and

76 Figure 26: Tree performance related to the average number of children per node, for the winequality dataset.

26 show this relation for the ‘cloud’, ‘sat-train’, and ‘winequality’ datasets, by sorting the trees by their performance (with better-performing trees coming before worse-performing trees), and then generating a scatter plot using the average number of children per node.

Intuitively, the reason that trees with lower average numbers of children perform worse is probably because those trees are more likely to have very unbalanced pectinate branches; that is, branches where each node only has two children, and one of those children is a leaf, such as in Figure 27.

77 Figure 27: A pectinate tree.

5.2 Cover tree runtime bounds

Of significant interest to the tree community is worst-case runtime bounds for tree-based algorithms. Unfortunately, though, assumption-free worst-case runtime bounds that are better than the brute-force approach continue to elude the community, and this hurdle is not likely to be overcome because of worst-case datasets. However, adaptive runtime analysis allows us to show tighter, more descriptive worst-case runtime bounds if we assume that dataset-dependent characteristics do not scale with the data.

To date, the most well-known tree structure to which this technique has been applied is the cover tree [57], which has been mentioned several times already during this thesis and was the subject of the last section. Therefore, refer there for a more in-depth discus- sion of the cover tree itself, and the expansion constant c, which is the dataset-dependent characteristic that we assume does not scale with the dataset. Importantly, the cover tree has been shown to scale linearly in the dataset size for dual-tree nearest neighbor search and dual-tree kernel density estimation [133]; this is a significant improvement over the quadratic scaling of the brute-force approach.

In this section, we introduce the notion of cover tree imbalance, motivated by the pre- vious section’s observations on pectinate tree branches. This notion allows us to develop a plug-and-play runtime bound. We will use this plug-and-play bound later, in order to show worst-case linear runtime bounds on dual-tree nearest neighbor search (Section 7.1), dual-tree kernel density estimation (Section 7.3), dual-tree max-kernel search (Section 7.7),

78 sa sm Na Nm

sa − 1 sm − 1 Nb Nc Nd Nn

sa − 2 sm − 2 Ne N f Ng Nh Ni N j Nk Np Nq Nr Ns Nt

(a) Balanced cover tree. (b) Imbalanced cover tree.

Figure 28: Balanced and imbalanced cover trees.

dual-tree range search / range count (Section 7.2), and dual-tree kernel matrix approxima-

tion (Section 7.5). Each of these bounds is an improvement on the state-of-the-art, and in

the case of range search and kernel matrix approximation, are the first such bounds. The

work in this section is an adapted and extended version of a recent paper [135].

5.2.1 Tree imbalance

It is well-known that imbalance in trees leads to degradation in performance; for instance,

a kd-tree node with every descendant in its left child except one is effectively useless. A

kd-tree full of nodes like this will perform abysmally for nearest neighbor search, and it is

not hard to generate a pathological dataset that will cause a kd-tree of this sort.

This sort of imbalance applies to all types of trees, not just kd-trees. In our situa- tion, we are interested in a better understanding of this imbalance for cover trees, and thus endeavor to introduce a more formal measure of imbalance which is correlated with tree performance. Numerous measures of tree imbalance have already been established; one example is that proposed by Colless [136], and another is Sackin’s index [137], but we aim to capture a different measure of imbalance that utilizes the leveled structure of the cover tree.

We already know each node in a cover tree is indexed with an integer level (or scale).

In the explicit representation of the cover tree, each non-leaf node has children at a lower level. But these children need not be strictly one level lower; see Figure 28. In Figure 28a, each cover tree node has children that are strictly one level lower; we will refer to this as a

79 outlier

Figure 29: Single-outlier cover tree.

Figure 30: A multiple-outlier cover tree.

perfectly balanced cover tree. Figure 28b, on the other hand, contains the node Nm which

has two children with scale two less than sm. We will refer to this as an imbalanced cover tree. Note that in our definition, the balance of a cover tree has nothing to do with differing

number of descendants in each child branch but instead only missing levels.

An imbalanced cover tree can happen in practice, and in the worst cases, the imbalance

may be far worse than the simple graphs of Figure 28. Consider a dataset with a single

outlier which is very far away from all of the other points1. Figure 29 shows what happens

in this situation: the root node has two children; one of these children has only the outlier

as a descendant, and the other child has the rest of the points in the dataset as a descendant.

In fact, it is easy to find datasets with a handful of outliers that give rise to a chain-like

structure at the top of the tree: see Figure 30 for an illustration2.

A tree that has this chain-like structure all the way down, which is similar to the kd-tree example at the beginning of this section, is going to perform horrendously; motivated by

1Note also that for an outlier sufficiently far away, the expansion constant is N − 1, so we should expect poor performance with the cover tree anyway. 2As a side note, this behavior is not limited to cover trees, and can happen to mean-split kd-trees too, especially in higher dimensions.

80 this observation, we define a measure of tree imbalance.

Definition 9. The cover node imbalance in(Ni) for a cover tree node Ni with scale si in the cover tree T is defined as the cumulative number of missing levels between the node and its parent Np (which has scale sp). If the node is a leaf child (that is, si = −∞), then number of missing levels is defined as the difference between sp and smin − 1 where smin is the smallest scale of a non-leaf node in T . If Ni is the root of the tree, then the cover node imbalance is 0. Explicitly written, this calculation is

  sp − si − 1 if Ni is not a leaf and not the root node   i ( ) =  (12) n Ni max(sp − smin − 1, 0) if Ni is a leaf   0 if Ni is the root node.

This simple definition of cover node imbalance is easy to calculate, and using it, we can generalize to a measure of imbalance for the full tree.

Definition 10. The cover tree imbalance it(T ) for a cover tree T is defined as the cumu- lative number of missing levels in the tree. This can be expressed as a function of cover node imbalances easily:

X it(T ) = in(Ni). (13)

Ni∈T

A perfectly balanced cover tree Tb with no missing levels has imbalance it(Tb) = 0 (for instance, Figure 28a). A worst-case cover tree Tw which is entirely a chain-like structure with maximum scale smax and minimum scale smin will have imbalance it(Tw) ∼ N(smax − smin). Because of this chain-like structure, each level has only one node and thus there are at least N levels; or, smax − smin ≥ N, meaning that in the worst case the imbalance is quadratic in N3.

3Note that in this situation, c ∼ N also.

81 Table 12: Empirically calculated tree imbalances.

Imbalance Dataset d N = 5k N = 50k N = 500k lcdm 3 29402 313190 3400305 sdss 4 17691 191716 1483711 power 7 22152 202050 1345825 susy 18 3686 64231 930969 higgs 29 404 1272 73793 covertype 55 9680 100344 1319447 mnist 784 4004 79375 983554

However, for most real-world datasets with the cover tree implementation in mlpack

[87] and the reference implementation [57], the tree imbalance is near-linear with the num- ber of points. Generally, most of the cover tree imbalance is contributed by leaf nodes whose parent has scale greater than smin. At this time, no cover tree construction algorithm specifically aims to minimize imbalance.

To demonstrate the near-linearity of the cover tree imbalance, Table 12 shows empiri- cally calculated cover tree imbalance for cover trees on a number of datasets built with both the cover tree implementation in mlpack [87] and the original reference implementation

[57]. The ‘power’, ‘susy’, ‘higgs’, and ‘covertype’ datasets are found in the UCI Machine

Learning Repository [134], the ‘mnist’ dataset is ubiquitous but absent from the UCI repos- itory for some reason and is published by LeCun et al. [138], and the ‘sdss’ dataset is Sloan

Digital Sky Survey data [139]. Imbalance is shown for subsets of the datasets of size 5000,

50000, and 500000; the actual value of the imbalance is of less important than the scaling properties with N—which tend to be near-linear.

5.2.2 General runtime bound

With the notion of cover tree imbalance developed, we may turn our attention towards a general runtime bound for dual-tree algorithms that use the cover tree. This is the main theoretical result of the entire thesis, and nearly every other theoretical result depends upon

82 Algorithm 8 The standard pruning dual-tree traversal for cover trees.

1: Input: query node Nq, set of reference nodes R 2: Output: none max 3: sr ← maxNr∈R sr max 4: if (sq < sr ) then 5: {Perform a reference recursion.} 6: for all Nr ∈ R do 7: BaseCase(pq, pr) max 8: Rr ← {Nr ∈ R : sr = sr } 9: Rr−1 ← {C (Nr): Nr ∈ Rr} ∪ (R \ Rr) 0 10: Rr−1 ← {Nr ∈ Rr−1 : Score(Nq,Nr) , ∞} 0 11: recurse with Nq and Rr−1 12: else 13: {Perform a query recursion.} 14: for all Nqc ∈ C (Nq) do 0 15: R ← {Nr ∈ R : Score(Nq,Nr) , ∞} 0 16: recurse with Nqc and R

it. Although cover trees were originally intended for nearest neighbor search (see Algo-

rithm Find-All-Nearest in [57]), they have been adapted to a wide variety of problems:

minimum spanning tree calculation [68], approximate nearest neighbor search [67], Gaus-

sian processes posterior calculation [140], and max-kernel search [79] are some examples.

We can express the original nearest neighbor search algorithm (and later extensions) inside

of the framework of tree-independent dual-tree algorithms, and this gives us the traversal

shown in Algorithm 8, originally presented in the context of max-kernel search [79]. This

traversal stems from Find-All-Nearest by Beygelzimer et al. [57], and is implemented

in both the cover tree reference implementation and in a more flexible manner in mlpack

[87].

Initially, the traversal is called with the root of the query tree and a reference set R

containing only the root of the reference tree4.

This dual-tree recursion is a depth-first recursion in the query tree and a breadth-first

4Though this is different than other traversals which take the root of two trees, this still fits in the TraversalType abstraction laid out in Section 4.5.3 easily. Also, the implementation of this traversal is far more complex than Algorithm 8 might suggest. mlpack’s code is highly commented, so the enthusiastic reader is directed to go there to lose their enthusiasm.

83 recursion in the reference tree; to this end, the recursion maintains one query node Nq and a reference set R. The set R may contain reference nodes with many different scales; the

max maximum scale in the reference set is sr (line 3). Each single recursion will descend either the query tree or the reference tree, not both; the conditional in line 4, which deter- mines whether the query or reference tree will be recursed, is aimed at keeping the relative scales of query nodes and reference nodes close.

A query recursion (lines 12–16) is straightforward: for each child Nqc of Nq, the node combinations (Nqc, Nr) are scored for each Nr in the reference set R. If possible, these combinations are pruned to form the set R0 (line 16) by checking the output of the Score()

0 function, and then the algorithm recurses with Nqc and R . A reference recursion (lines 4–11) is similar to a query recursion, but the pruning strat- egy is significantly more complicated. Given R, we calculate Rr, which is the set of nodes

max in R that have scale sr . Then, for each node Nr in the set of children of nodes in Rr, the node combinations (Nq, Nr) are scored and pruned if possible. Those reference nodes that

0 were not pruned form the set Rr−1. Then, this set is combined with R \ Rr—that is, each of the nodes in R that was not recursed into—to produce R0, and the algorithm recurses with

0 Nq and the reference set R . The reference recursion only recurses into the top-level subset of the reference nodes in order to preserve the separation invariant. It is easy to show that every pair of points held

max in nodes in R is separated by at least 2sr :

Lemma 2. For all nodes Ni, N j ∈ R (in the context of Algorithm 8) which contain points

max sr max pi and p j, respectively, d(pi, p j) > 2 , with sr defined as in line 3.

Proof. This proof is by induction. If |R| = 1, such as during the first reference recursion, the result obviously holds. Now consider any reference set R and assume the statement

max of the lemma holds for this set R, and define sr as the maximum scale of any node in

R. Construct the set Rr−1 as in line 9 of Algorithm 8; if |Rr−1| ≤ 1, then Rr−1 satisfies the desired property.

84 Otherwise, take any Ni, N j in Rr−1, with points pi and p j, respectively, and scales

max si and s j, respectively. Clearly, if si = s j = sr − 1, then by the separation invariant

smax−1 d(pi, p j) > 2 r .

max Now suppose that si < sr − 1. This implies that there exists some implicit cover tree

max node with point pi and scale sr − 1 (as well an implicit child of this node pi with scale max sr − 2 and so forth until one of these implicit nodes has child pi with scale si). Because the separation invariant applies to both implicit and explicit representations of the tree, we

smax conclude that d(pi, pr) > 2 r − 1. The same argument may be made for the case where

max s j < sr − 1, with the same conclusion.

smax−1 We may therefore conclude that each point of each node in Rr−1 is separated by 2 r .

0 Note that Rr−1 ⊆ Rr−1 and that R \ Rr−1 ⊆ R in order to see that this condition holds for all 0 0 nodes in R = Rr−1 ∪ (R \ Rr−1). Because we have shown that the condition holds for the initial reference set and for any reference set produced by a reference recursion, we have shown that the statement of the lemma is true. 

This observation means that the set of points P held by all nodes in R is always a subset

max of Csr . This fact will be useful in our later runtime proofs. Next, we develop notions with which to understand the behavior of the cover tree dual- tree traversal when the datasets are of significantly different scale distributions.

If the datasets are similar in scale distribution (that is, inter-point distances tend to follow the same distribution), then the recursion will alternate between query recursions and reference recursions. But if the query set contains points which are, in general, much farther apart than the reference set, then the recursion will start with many query recursions before reaching a reference recursion. The converse case also holds. We are interested in formalizing this notion of scale distribution; therefore, define the following dataset- dependent constants for the query set S q and the reference set S r:

• ηq: the largest pairwise distance in S q

85 • δq: the smallest nonzero pairwise distance in S q

• ηr: the largest pairwise distance in S r

• δr: the smallest nonzero pairwise distance in S r

These constants are directly related to the aspect ratio of the datasets; indeed, ηq/δq is

exactly the aspect ratio of S q. Further, let us define and bound the top and bottom levels of each tree:

T • The top scale sq of the query tree Tq is the scale of the root of Tq, and is bounded as

T dlog2(ηq)e − 1 ≤ sq ≤ dlog2(ηq)e.

• The minimum scale of the query tree Tq is the scale of the lowest non-leaf node of

min the tree, and may be explicitly defined equivalently as sq = dlog2(δq)e.

T • The top scale sr of the reference tree Tr is the scale of the root of Tr, and is bounded

T as dlog2(ηr)e − 1 ≤ sr ≤ dlog2(ηr)e.

• The minimum scale of the reference tree Tr is the scale of the lowest non-leaf node

min of the tree, and may be explicitly defined equivalently as sr = dlog2(δr)e.

Note that the minimum scale is not the minimum scale of any cover tree node (that

would be −∞), but the minimum scale of any non-leaf node in the tree.

T T min min Suppose that our datasets are of a similar scale distribution: sq = sr , and sq = sr . In this setting we will have alternating query and reference recursions. But if this is not the case, then we have extra reference recursions before the first query recursion or after the last query recursion (situations where both these cases happen are possible). Motivated by this observation, let us quantify these extra reference recursions:

Lemma 3. For a dual-tree algorithm with S q ∼ S r ∼ O(N) using cover trees and the traversal given in Algorithm 8, the number of extra reference recursions that happen before

the first query recursion is bounded by

86   min O(N), log2(ηr/ηq) − 1 . (14)

max Proof. The first query recursion happens once sq ≥ sr . The number of reference re- cursions before the first query recursion is then bounded as the number of levels in the

T T reference tree between sr and sq that have at least one explicit node. Because there are O(N) nodes in the reference tree, the number of levels cannot be greater than O(N) and thus the result holds.

T T The second bound holds by applying the definitions of sr and sq to the expression

T T sr − sq − 1:

T T sr − sq − 1 ≤ dlog2(ηr)e − (dlog2(ηq)e − 1) − 1 (15)

≤ log2(ηr) + 1 − log2(ηq) (16) which gives the statement of the lemma after applying logarithmic identities. 

Note that the O(N) bound may be somewhat loose, but it suffices for our later purposes.

Now let us consider the other case:

Lemma 4. For a dual-tree algorithm with S q ∼ S r ∼ O(N) using cover trees and the traversal given in Algorithm 8, the number of extra reference recursions that happen after the last query recursion is bounded by

n h 2 i o θ = max min O(N log2(δq/δr)), O(N ) , 0 . (17)

Proof. Our goal here is to count the number of reference recursions after the final query

min max min recursion at level sq ; the first of these reference recursions is at scale sr = sq . Because query nodes are not pruned in this traversal, each reference recursion we are counting will be duplicated over the whole set of O(N) query nodes. The first part of the bound follows

min min by observing that sq − sr ≤ dlog2(δq)e − dlog2(δr)e − 1 ≤ log2(δq/δr).

87 The second part follows by simply observing that there are O(N) reference nodes. 

These two previous lemmas allow us a better understanding of what happens as the reference set and query set become different. Lemma 3 shows that the number of extra recursions caused by a reference set with larger pairwise distances than the query set (ηr larger than ηq) is modest; on the other hand, Lemma 4 shows that for each extra level in the

min reference tree below sq , O(N) extra recursions are required. Using these lemmas and this intuition, we will prove general runtime bounds for the cover tree traversal.

Theorem 1. Given a reference set S r of size N with an expansion constant cr and a set

of queries S q of size O(N), a standard cover tree based dual-tree algorithm (Algorithm 8) takes

 4 ∗  O cr |R |χψ(N + it(Tq) + θ) (18) time, where |R∗| is the maximum size of the reference set R (line 1) during the dual-tree recursion, χ is the maximum possible runtime of BaseCase(), ψ is the maximum possible runtime of Score(), and θ is defined as in Lemma 4.

Proof. First, split the algorithm into two parts: reference recursions (lines 4–11) and query recursions (lines 12–16). The runtime of the algorithm is bounded as the runtime of a reference recursion times the total number of reference recursions plus the total runtime of all query recursions.

Consider a reference recursion (lines 4–11). Define R∗ to be the largest set R for any

max scale sr and any query node Nq during the course of the algorithm; then, it is true that |R| ≤ |R∗|. The work done in the base case loop from lines 6–7 is thus O(χ|R|) ≤ O(χ|R∗|).

4 4 ∗ Then, lines 9 and 10 take O(cr ψ|R|) ≤ O(cr ψ|R |) time, because each reference node has up

4 4 ∗ to cr children. So, one full reference recursion takes O(cr ψχ|R |) time.

Now, note that there are O(N) nodes in Tq. Thus, line 16 is visited O(N) times. The

88 4 ∗ amount of work in line 15, like in the reference recursion, is bounded as O(cr ψ|R |). There-

4 ∗ fore, the total runtime of all query recursions is O(cr ψ|R |N). Lastly, we must bound the total number of reference recursions. Reference recursions

max happen in three cases: (1) sr is greater than the scale of the root of the query tree (no

max query recursions have happened yet); (2) sr is less than or equal to the scale of the root of the query tree, but is greater than the minimum scale of the query tree that is not −∞;

max (3) sr is less than the minimum scale of the query tree that is not −∞. First, consider case (1). Lemma 3 shows that the number of reference recursions of this

type is bounded by O(N). Although there is also a bound that depends on the sizes of the

datasets, we only aim to show a linear runtime bound, so the O(N) bound is sufficient here.

Next, consider case (2). In this situation, each query recursion implies at least one

reference recursion before another query recursion. For some query node Nq, the exact number of reference recursions before the children of Nq are recursed into is bounded

above by in(Nq) + 1: if Nq has imbalance 0, then it is exactly one level below its parent,

and thus there is only one reference recursion. On the other hand, if Nq is many levels below its parent, then it is possible that a reference recursion may occur for each level in

between; this is a maximum of in(Nq) + 1.

Because each query node in Tq is recursed into once, the total number of reference recursions before each query recursion is

X in(Nq) + 1 = it(Tq) + O(N) (19)

Nq∈Tq since there are O(N) nodes in the query tree.

Lastly, for case (3), we may refer to Lemma 4, giving a bound of θ reference recursions

in this case.

We may now combine these results for the runtime of a query recursions with the total

number of reference recursions in order to give the result of the theorem:

89  4 ∗    4 ∗   4 ∗   O cr |R |ψχ N + it(Tq) + θ + O cr |R |ψN ∼ O cr |R |ψχ N + it(Tq) + θ . (20)



When we consider the monochromatic case (where S q = S r), the results trivially sim- plify.

Corollary 1. Given the situation of Theorem 1 but with S q = S r = S so that cq = cr = c and

Tq = Tr = T , a dual-tree algorithm using the standard cover tree traversal (Algorithm 8) takes

 4 ∗  O c |R |χψ (N + it(T )) (21) time, where |R∗| is the maximum size of the reference set R (line 1) during the dual-tree recursion, χ is the maximum possible runtime of BaseCase(), and ψ is the maximum possible runtime of Score().

An intuitive understanding of these bounds is best achieved by first considering the monochromatic case (this case arises, for instance, in all-nearest-neighbor search). The linear dependence on N arises from the fact that all query nodes must be visited. The dependence on the reference tree, however, is encapsulated by the term c4|R∗|, with |R∗| being the maximum size of the reference set R; this value must be derived for each specific problem. The bad performance of poorly-behaved datasets with large c (or, in the worst case, c ∼ N) is then captured in both of those terms. Poorly-behaved datasets may also have a high cover tree imbalance it(T ); the linear dependence of runtime on imbalance is thus sensible for well-behaved datasets.

The bichromatic case (S q , S r) is a slightly more complex result which deserves a bit more attention. The intuition for all terms except θ remain virtually the same.

90 The term θ captures the effect of query and reference datasets with different widths,

and has one unfortunate corner case: when δq > ηr, then the query tree must be entirely de- scended before any reference recursion. This results in a bound of the form O(N log(ηr/δr)), or O(N2) (see Lemma 4). This is because the reference tree must be descended individually

for each query point.

The quantity |R∗| bounds the amount of work that needs to be done for each recursion.

In the worst case, |R∗| can be N. However, dual-tree algorithms rely on branch-and-bound

techniques to prune away work (lines 10 and 15 in Algorithm 8). A small value of |R∗| will

imply that the algorithm is extremely successful in pruning away work. An (upper) bound

on |R∗| (and the algorithm’s success in pruning work) will depend on the problem and the

data. As we will show, bounding |R∗| is often possible. For many dual-tree algorithms, χ ∼

ψ ∼ O(1); often, cached sufficient statistics [125] can enable O(1) runtime implementations

of BaseCase() and Score().

These results hold for any dual-tree algorithm regardless of the problem. Hence, the

runtime of any dual-tree algorithm would be at least O(N) using our bound, which matches

the intuition that answering O(N) queries would take at least O(N) time. For a particular

∗ problem and data, if cr, |R |, χ, ψ are bounded by constants independent of N and θ is no more than linear in N (for large enough N), then the dual-tree algorithm for that problem

has a runtime linear in N. Our theoretical result separates out the problem-dependent and

the problem-independent elements of the runtime bound, which allows us to simply plug in

the problem-dependent bounds in order to get runtime bounds for any dual-tree algorithm

without requiring an analysis from scratch.

Our results are similar to that of Ram et al. [133], but those results depend on a quantity

called the constant of bichromaticity, denoted κ, which has unclear relation to cover tree

4κ imbalance. The dependence on κ is given as cq , which is not a good bound, especially

because κ may be much greater than 1 in the bichromatic case (where S q , S r). The more recent results for max-kernel search [79] are more related to these results,

91 but they depend on the inverse constant of bichromaticity ν which suffers from the same

problem as κ. Although the dependence on ν is linear (that is, O(νN)), bounding ν is difficult and it is not true that ν = 1 in the monochromatic case.

ν corresponds to the maximum number of reference recursions between a single query recursion, and κ corresponds to the maximum number of query recursions between a single reference recursion. The respective proofs that use these constants then apply them as a worst-case measure for the whole algorithm: when using κ, Ram et al. [133] assume that every reference recursion may be followed by κ query recursions; similarly, myself and

Ram [79] assume that every query recursion may be followed by ν reference recursions.

In this proof, though, we have simply used it(Tq) and θ as an exact summation of the total extra reference recursions, which gives us a much tighter bound than ν or κ on the running time of the whole algorithm.

Further, both ν and κ are difficult to empirically calculate and require an entire run of the dual-tree algorithm. On the other hand, bounding it(Tq) (and θ) can be done in one pass of the tree (assuming the tree is already built). Thus, not only is our bound tighter

when the cover tree imbalance is sublinear in N, it more closely reflects the actual behavior of dual-tree algorithms, and the constants which it depends upon are straightforward to calculate.

Later in the thesis, we will apply Theorem 1 to many different algorithms and see how it simplifies proofs and provides an intuitive and useful bound.

5.3 An issue with the cover tree single-tree runtime bound proof

Beygelzimer, Kakade, and Langford show a proof that claims an O(c12 log N) worst-case runtime per query for single-tree nearest neighbor search using cover trees [57]. This sub- linear runtime bound has been adapted to other situations [46, 67] and referenced repeatedly

[127, 141, 142, 143, 144] despite the fact that there is a flaw in the proof. Let us review

92 Algorithm 9 Single-tree nearest neighbor search for cover trees.

1: Input: query point pq, reference tree Ti 2: Output: nearest neighborp ˆnn

3: R ← {root(Ti)} 4: while |R| > 0 do 5: {Form new reference set out of children of nodes with maximum scale.} max 6: sr ← maximum scale of nodes in R 0 max 7: R ← {Nr ∈ R : sr = sr } 00 S ← 0 8: R Nr∈R Cr 9: {Determine the nearest neighbor in R00 and attempt to prune.} 10: {(A tree-independent formulation would have Score() and BaseCase() here.)}

11: ← 00 pˆnn argminNi∈R d(pq, pi) 00 max max sr 12: Rsr −1 ← {Ni ∈ R : d(pq, pi) ≤ d(pq, pˆnn) + 2 13: {Merge unpruned nodes into R and remove nodes we recursed into.} 0 max 14: R ← (R \ R ) ∪ Rsr −1

15: return pˆnn

the theorem and proof [57] in order to discuss the flaw. Algorithm 9 introduces a non-

tree-independent formulation of the single-tree nearest neighbor search algorithm for cover

trees. It is straightforward to generalize, but for the purposes of this short discussion there

is no need.

When Algorithm 9 is called with a query point pq and reference tree Tr, a breadth-first traversal of Tr is performed, pruning branches when possible (in line 12).

Theorem 2 (Theorem 5 from Beygelzimer, Kakade, and Langford [57]). If the dataset

S r ∪ {pq} has expansion constant c and |S r| = N, the nearest neighbor of pq can be found in time O(c12 log N).

In order to discuss the issue with the proof, we need to present the proof until the flaw

(no more is necessary).

Partial proof. (Notation adapted from Beygelzimer, Kakade, and Langford [57]) Let R∗ be

the last R considered by the algorithm (so, R∗ consists only of leaf nodes with scale −∞).

Lemma 4.3 (from [57]) bounds the explicit depth of any node in the tree (and in particular

93 any node in R∗) by k = O(c2 log N). Consequently, the number of iterations is at most k|R∗| ≤ k. 

This assertion attempts to bound the number of iterations in the while loop by mul- tiplying the size of the final reference set R∗ by the depth bound O(c2 log N). This does effectively bound all of the iterations required by the ancestors of every node in R∗, but it potentially undercounts. Suppose that one branch of the tree Tr has nodes with scales that are not present elsewhere in the tree (call this the special branch, just for clarity).

Suppose now that all descendant nodes of this special branch are pruned before the final it- eration (so, R∗ contains no nodes from this particular branch). The bounding strategy using

|R∗|O(c2 log N) iterations does not count any iterations caused by the nodes in the special branch!

This flaw in the proof means that the result may potentially be incorrect, and there is no easy way to resolve the issue of uncounted iterations and retain the sublinear runtime bound. Although it is surely true that scaling logarithmic in N will be observed in practice for well-behaved datasets, the assumption that c does not change as the dataset grows does not necessarily mean that sublinear search times are guaranteed.

94 CHAPTER 6 TRAVERSALS

This chapter is focused on improvement of the second piece of dual-tree algorithms: the

traversals. As introduced in Chapter 3, the dual-tree traversal visits combinations of nodes

in the query and reference tree, and calls a Score() function to determine if this node

combination can be pruned. If not, a BaseCase() function is called on each pair of query

and reference points held in the nodes, and the node combination is recursed into. It is

worth re-printing the definition here:

Definition 11. A pruning dual-tree traversal is a process that, given two space trees Tq

(the query tree, built on the query set S q) and Tr (the reference tree, built on the reference set S r), will visit combinations of nodes (Nq, Nr) such that Nq ∈ Tq and Nr ∈ Tr no more than once, and call a function Score(Nq, Nr) to assign a score to that node. If the score

n is ∞, the combination is pruned and no combinations (Nqc, Nrc) such that Nqc ∈ Dq and

n Nrc ∈ Dr are visited. Otherwise, for every combination of points (pq, pr) such that pq ∈ Pq and pr ∈ Pr, a function BaseCase(pq, pr) is called. If no node combinations are pruned during the traversal, BaseCase(pq, pr) is called at least once on each combination of

pq ∈ S q and pr ∈ S r.

Common strategies for traversals are dual breadth-first traversals and dual depth-first traversals. More complex strategies are possible: the standard cover tree traversal, intro- duced in Algorithm 8, is a combination of breadth-first and depth-first.

This chapter discusses an improved dual depth-first traversal that is shown to provide good speedup for the task of nearest neighbor search. This traversal is also applicable to other tasks, though, because it is presented in a tree-independent manner. This presentation of the improved dual depth-first traversal is based on recently submitted work [66].

6.1 Improved dual depth-first traversal

95 Algorithm 10 DualDepthFirstTraversal(Nq, Nr). 1: Input: query node Nq, reference node Nr 2: Output: none 3: {Perform base cases for points in node combination.} 4: for all pq ∈ Pq do 5: for all pr ∈ Pr do 6: BaseCase(pq, pr) 7: {Assemble list of combinations to recurse into.} 8: q ← empty priority queue 9: if Nq and Nr both have children then 10: for all Nqc ∈ Cq do 11: for all Nrc ∈ Cr do 12: si ← Score(Nqc,Nrc) 13: if si , ∞ then push (Nqc, Nrc) into q with priority 1/si 14: else if Nq has children but Nr does not then 15: for all Nqc ∈ Cq do 16: si ← Score(Nqc,Nr) 17: if si , ∞ then push (Nqc, Nr) into q with priority 1/si 18: else if Nq does not have children but Nr does then 19: for all Nrc ∈ Cr do 20: si ← Score(Nq,Nrc) 21: if si , ∞ then push (Nq, Nrc) into q with priority 1/si 22: {Recurse into combinations with highest priority first.} 23: for all (Nqi, Nri) ∈ q, highest priority first do 24: DualDepthFirstTraversal(Nqi, Nri)

Although the definition is quite complex, real-world dual-tree traversals tend to be straightforward. The standard depth-first dual-tree traversal is shown in Algorithm 10; this is the same traversal used in most dual-tree algorithms that use the kd-tree [61] [68] [74]1 and is often used in practice [87]. Generally, a depth-first traversal is preferred because many space trees in practice only hold points in the leaves; breadth-first traversals may not perform well in these situations. To illustrate this, consider the example of nearest neighbor search where the pruning depends on the current candidate nearest neighbors; a breadth-

first search will not encounter any points until the leaves of the tree, and therefore nothing

1The algorithms in each of the referenced papers tend to look very different because they are not derived in a tree-independent form, but using the kd-tree with the traversal in Algorithm 10 and simplifying will yield the same algorithm.

96 can be pruned because no candidate nearest neighbors will have been encountered2.

The traversal is originally called with the root of the query tree Tq and the root of the reference tree Tr. First, BaseCase() is called with every pair of query and reference points (lines 4–6)—note that this is not every pair of descendant query and reference points. Then, for recursion, we collect a list of combinations to recurse into, sorted by their score. Any combinations with score ∞ are not recursed into. If both nodes have children, then we re- curse into combinations of query children and reference children. If only the reference node has children, we recurse into combinations of the query node and the reference children. If only the query node has children, we recurse into combinations of the query children and the reference node. If neither node has children, there is no need to recurse.

The algorithm first recurses into those node combinations with lowest score. Depend- ing on the task being solved (that is, which Score() and BaseCase() functions are being used), this prioritized approach to recursion can provide significant speedup over unpri- oritized recursion. For nearest neighbor search, a prioritized recursion gives significantly faster results. Other dual-tree algorithms that would also see speedup include max-kernel search, minimum spanning tree calculation, kernel density estimation, and k-means clus- tering.

6.1.1 Prioritized recursions and nearest neighbor search

Algorithm 10 is the standard depth-first dual-tree traversal that is used in practice, and it prioritizes recursions: node combinations with lower scores (from Score()) are recursed into first. Therefore, let us consider a task that is benefitted by a prioritized recursion: nearest neighbor search (described earlier in Section 2.2). The goal is, for each query point pq in the query set S q, find the nearest point in the reference set S r. Algorithms 11 and 12 present simplified versions of the BaseCase() and Score() later presented in Section 7.1. This algorithm is just a tree-independent expression of the

2It should be noted that it is possible to generate pruning rules for nearest neighbor search that can work for a breadth-first traversal also, and this is done later in Section 7.1. But rules that complex are not always preferred in practice.

97 Algorithm 11 Simple BaseCase() for nearest neighbor search.

1: Input: query point pq, reference point pr, candidate point N[pq], candidate distance D[pq] 2: Output: distance d(pq, pr)

3: if d(pq, pr) < D[pq] then 4: N[pq] ← pr 5: D[pq] ← d(pq, pr)

Algorithm 12 Simple Score() for nearest neighbor search.

1: Input: query node Nq, reference node Nr 2: Output: a score for the node combination (Nq, Nr) or ∞ if it should be pruned

3: if dmin(Nq, Nr) > Bd f (Nq) then 4: return ∞ 5: return dmin(Nq, Nr)

nearest neighbor search algorithm given in Algorithm 3 from Section 2.6.

The algorithm maintains a list of candidate neighbors (N[·]) and a list of candidate distances (D[·]). During the algorithm, for a given query point pq, N[pq] contains the current nearest neighbor candidate of pq, and D[pq] contains the distance between pq and its current nearest neighbor candidate. At the end of the algorithm, N[pq] contains the nearest neighbor of pq. For initialization, D[pq] is set to ∞ (or some other sufficiently large value).

The bound function, Bd f (Nq), represents the worst candidate nearest neighbor distance

p for any descendant point of the query node: maxpq∈Dq D[pq]. However, this formulation would require looping over every descendant point in Nq, so it is infeasible to calculate. Fortunately, we can use caching along with a re-expression of the worst candidate nearest neighbor distance to define Bd f (Nq) in a way we can easily calculate:

( ) Bd f (Nq) = max max D[pq], max Bd f (Nqc) . (22) pq∈Pq Nqc∈Cq

We may cache the current value of Bd f (·) in any query node, and then subsequent calcu- lations of Bd f (·) include only a scan over the points held in the node and over the children held by the node. In general this is a very fast, O(1) calculation: kd-trees, for instance, have

98 Nq

Nr1 Nr1 Nq Nr1 Nqc

Nr2 Nr2 Nr2

(a) Nq closer to Nr1. (b) Nq equidistant... (c) ...but Nqc is not equidistant.

Figure 31: Different situations for recursion.

no more than two children, and only hold a specified number of points in the leaves.

Now, see that the result of Score() is dmin(Nq, Nr), if the node is not pruned. This encourages the traversal to first recurse into node combinations where the query and ref-

erence node are closer; the intuition here is that for a given query node, recursing into a

closer reference node is more likely to produce closer nearest neighbor candidates for each

query descendant point, thus allowing Bd f (·) to tighten more and more work to be pruned as a result. Especially for a depth-first recursion near the top of the tree, the stakes are high: a bad recursion choice can potentially mean huge amounts of extra work.

6.1.2 Delaying reference recursion

In the situation depicted in Figure 31a, combination (Nq, Nr1) should be visited before

combination (Nq, Nr2). It is clear that this is the right choice, because a depth-first traversal

of (Nq, Nr1) is more likely to tighten the bound Bd f (Nq) such that (Nq, Nr2) can be pruned when it is recursed into.

But, consider a more tricky case, depicted in Figure 31b. Here, dmin(Nq, Nr1) = dmin(Nq, Nr2) = 0, so we are unable to tell whether it is better to recurse into (Nq, Nr1)

first or into (Nq, Nr2) first. Indeed, Algorithm 10 will select arbitrarily. This situation may occur in Algorithm 10 from lines 11 to 13 if, for a given child query node Nqc, two or more reference children Nrc have the same score si.

We can do better than arbitrary selection. Consider some child Nqc of Nq. Figure

99 31c shows an example Nqc. In this example, the choice is now clear: the combination

(Nqc, Nr1) should be recursed into before (Nqc, Nr2). Thus, the correct answer to the ques-

tion “should we recurse into (Nq, Nr1) or (Nq, Nr2) first?” is to sidestep the question entirely: we should not recurse in the reference node, but instead in the query node. Then,

at the level of the query child, the decision may be clearer.

A clever reader may ask, “Why not use the distance between the centers of the query

node and reference node as the score? This would alleviate many situations like that in

Figure 31b.” Although this is true, it is sidestepping the issue. Consider Figure 31b: if Nq

has two children, and one is closer to Nr1 (like Nqc in Figure 31c) and the other is closer

to Nr2, then regardless of which reference node is chosen for recursion, the choice will be suboptimal for one of the two query children. Therefore, it is more prudent to wait, and

recurse in the query node one more level before making a decision.

In essence, the strategy is to delay recursion in the reference nodes until it is clear

which reference node should be recursed into first. This improvement, once generalized, is

encapsulated in Algorithm 13. Lines 15–20 check if reference recursion should be delayed

because the scores of all reference children are identical. If so, the recursion will proceed

by recursing only in the queries. If necessary, this reference recursion delay will continue

until no longer possible. This delay is not possible when the query node does not have

any children. This improved strategy can make a huge difference in the performance of the

algorithm; recursing into a suboptimal reference child first can cause the bound Bd f (·) to be

unnecessarily loose, whereas first recursing into the best reference child will tighten Bd f (·) more quickly and possibly allow other reference children to be pruned entirely.

For trees such as the kd-tree where each node has two children only, the extra imple-

mentation overhead for this strategy is trivial and simplifies to the addition of a single if

statement. However, note that there are some situations where the modified traversal will

not outperform the original prioritized traversal. For instance, for nearest neighbor search,

if the query tree is identical to the reference tree and nodes in the tree cannot overlap, then

100 Algorithm 13 ImprovedDualDepthFirstTraversal(Nq, Nr). 1: Input: query node Nq, reference node Nr 2: Output: none 3: {Perform base cases for points in node combination.} 4: for all pq ∈ Pq do 5: for all pr ∈ Pr do 6: BaseCase(pq, pr) 7: {Assemble list of combinations to recurse into.} 8: q ← empty priority queue 9: if Nq and Nr both have children then 10: for all Nqc ∈ Cq do 11: qqc ← {} 12: for all Nrc ∈ Cr do 13: si ← Score(Nqc,Nrc) 14: if si , ∞ then push (Nqc, Nrc, si) into qqc 15: if all elements of qqc have identical score then 16: si ← Score(Nqc,Nr) 17: push (Nqc, Nr) into q with priority 1/si 18: else 19: for all (Nqi, Nri, si) ∈ qqc do 20: push (Nqi, Nri) into q with priority 1/si 21: else if Nq has children but Nr does not then 22: for all Nqc ∈ Cq do 23: si ← Score(Nqc,Nr) 24: if si , ∞ then push (Nqc, Nr) into q with priority 1/si 25: else if Nq does not have children but Nr does then 26: for all Nrc ∈ Cr do 27: si ← Score(Nq,Nrc) 28: if si , ∞ then push (Nq, Nrc) into q with priority 1/si 29: {Recurse into combinations with highest priority first.} 30: for all (Nqi, Nri) ∈ q, highest priority first do 31: ImprovedDualDepthFirstTraversal(Nqi, Nri) it is very unlikely that the situation described in Figure 31a will be encountered: during the recursion, the query node will only overlap itself and possibly be adjacent to a sibling node.

101 Table 13: Dataset information.

Dataset n d cloud 2048 10 winequality 6497 11 birch3 100000 2 miniboone 130064 50 covertype 581012 55 power 2075259 7 lcdm 16777216 3 sdss-dr6 39761242 4

6.1.3 Experimental evalution

To test the efficiency of this strategy, we will observe the performance of our recursion strategy on the tasks of exact and approximate nearest neighbor search, with multiple types of trees, and with many different datasets. For approximate search, we compare with LSH

(locality-sensitive hashing). The datasets utilized in these experiments are described in

Table 13. Each dataset is from the UCI dataset repository [134], with the exception of the birch3 dataset [145], LCDM dataset [146], and SDSS-DR6 dataset [147].

The first test will focus on the task of exact nearest neighbor search; Algorithms 11 and

12 paired with a type of tree and traversal. Using the flexible mlpack library [87], we test

with the kd-tree and the ball tree, using three dual-tree traversal strategies: a depth-first

unordered recursion (equivalent to Algorithm 10 where the recursion priority is ignored);

the standard depth-first prioritized recursion (Algorithm 10); and our improved recursion

(Algorithm 13). In addition, a single-tree algorithm is used; this is the canonical tree-based

nearest neighbor search algorithm [31] with a prioritized recursion, run once for each query

point. The dataset is randomly split into 60% reference set and 40% query set, and the

algorithm is run ten times. The number of distance evaluations and the total runtime are

collected. Table 14 shows the average number of distance calculations for each algorithm

and the average runtime for each algorithm.

We can see from the results that our improvement is, in many cases, significant. In

102 Table 14: Runtime (distance evaluations) for exact nearest neighbor search.

algorithm cloud winequality birch3 miniboone kd-tree, unordered 0.036s (270k) 0.288s (2.15M) 7.310s (62.2M) 62.481s (214M) kd-tree, prioritized 0.005s (34.2k) 0.039s (222k) 0.419s (2.90M) 25.081s (78.8M) kd-tree, improved 0.005s (27.7k) 0.021s (104k) 0.201s (1.10M) 12.643s (34.5M) single kd-tree 0.005s (32.9k) 0.017s (112k) 0.262s (1.65M) 6.637s (19.2M) ball tree, unordered 0.011s (356k) 0.104s (3.08M) 1.817s (71.6M) 32.947s (616M) ball tree, prioritized 0.003s (104k) 0.023s (666k) 0.285s (10.9M) 27.934s (514M) ball tree, improved 0.003s (86.8k) 0.017s (455k) 0.160s (5.65M) 2.332s (351M) single ball tree 0.002s (69.6k) 0.012s (315k) 0.165s (5.38M) 26.357s (254M)

algorithm covertype power lcdm sdss-dr6 kd-tree, unordered 302.8s (1.09B) 1163.0s (18.7B) 5628.7s (41.5B) 24717s (156B) kd-tree, prioritized 15.823s (52.5M) 30.072s (302M) 319.871s (1.87B) 9069s (50.3B) kd-tree, improved 4.469s (12.8M) 12.714s (200M) 71.587s (350M) 428.9s (2.14B) single kd-tree 6.207s (16.3M) 19.684s (232M) 120.6s (476M) 471.4s (2.24B) ball tree, unordered 163.027s (2.90B) 771.975s (25.3B) 1861.9s (71.1B) 9444s (363B) ball tree, prioritized 52.487s (902M) 113.437s (3.90B) 386.74s (14.4B) 5202s (192B) ball tree, improved 27.251s (392M) 83.744s (2.58B) 195.175s (6.46B) 5150s (136B) single ball tree 29.948s (228M) 138.422s (2.49B) 402.6s (5.93B) 7226s (101B)

the best case, it gives more than 2x speedup over the next fastest strategy. This effect is especially pronounced on larger datasets, which will have deeper trees: a bad recursion decision early on can significantly affect the ability to prune during the algorithm. Ball trees exhibit less pronounced effects. This is because the bounding structure is a ball of fixed radius, whereas the kd-tree is adaptive in all dimensions. Therefore, two child nodes of a ball tree node may overlap, causing the improved strategy of delaying reference recursions to not pay off at lower levels. Nonetheless, especially for large datasets, where the dual-tree strategy is faster than the single-tree strategy, the improved traversal is a clear best choice.

The second task is approximate nearest neighbor search, and in this situation we will also be able to compare with locality-sensitive hashing. Relative-value approximation means that for an approximation parameter , we are guaranteed for a query point pq

∗ with true nearest neighbor pr , the algorithm will return an approximate nearest neigh-

∗ borp ˆr such that d(pq, pˆr) ≤ (1 + )d(pq, pr ). It is easy to modify the given Score() function to enforce this condition; replace the equation in line 3 of Algorithm 12 with

103 Table 15: Runtime (distance calculations) [ or M/W] for approximate NN search.

algorithm cloud winequality birch3 miniboone kd-tree, unordered 0.005s (34.5k) [1.5] 0.025s (148k) [1.44] 0.267s (2.14M) [1.44] 6.831s (22.6M) [1.38] kd-tree, prioritized 0.003s (17.4k) [1.5] 0.012s (74.5k) [1.5] 0.140s (1.16M) [1.5] 4.863s (15.5M) [1.38] kd-tree, improved 0.002s (13.7k) [1.7] 0.010s (51.2k) [1.63] 0.107s (654k) [1.63] 3.360s (9.28M) [1.38] single kd-tree 0.003s (23.2k) [2.45] 0.013s (78.0k) [2.33] 0.198s (1.47M) [2.33] 1.845s (5.75M) [1.5] ball tree, unordered 0.002s (50.8k) [27.6] 0.007s (186k) [32.3] 0.079s (2.72M) [11.5] 2.942s (50.4M) [285] ball tree, prioritized 0.002s (49.2k) [27.6] 0.006s (167k) [32.3] 0.072s (2.46M) [11.5] 3.266s (54.2M) [249] ball tree, improved 0.002s (45.1k) [27.6] 0.006s (161k) [32.3] 0.072s (2.25M) [11.5] 3.494s (50.3M) [99] single ball tree 0.002s (43.2k) [999] 0.006s (176k) [36.0] 0.111s (3.56M) [10.1] 3.812s (36.1M) [99] multiprobe LSH 0.031s (19.3k) [20/122] 0.011s (472k) [37/33] 1.614s (8.85M) [8/16k] 175.995s (1.77B) [13/328]

algorithm covertype power lcdm sdss-dr6 kd-tree, unordered 7.796s (27.4M) [1.5] 419.725s (13.0B) [1.27] 75.432s (508M) [1.33] 512.829s (2.89B) [1.27] kd-tree, prioritized 2.954s (10.6M) [1.5] 8.392s (189M) [1.44] 44.187s (306M) [1.38] 380.047s (2.17B) [1.27] kd-tree, improved 2.045s (6.25M) [1.5] 11.044s (191M) [1.56] 29.069s (160M) [1.44] 242.624s (1.11B) [1.27] single kd-tree 3.869s (11.2M) [1.86] 16.674s (226M) [2.33] 85.821s (397M) [1.78] 329.663s (1.58B) [1.27] ball tree, unordered 2.187s (33.0M) [99] 415.964s (13.0B) [11.5] 19.776s (668M) [19] 73.638s (239M) [49] ball tree, prioritized 2.183s (32.3M) [75.9] 6.753s (233M) [13.3] 20.158s (660M) [19] 75.687s (237M) [49] ball tree, improved 2.539s (33.8M) [49] 8.269s (248M) [15.7] 25.749s (702M) [21.2] 299.8s (451M) [49] single ball tree 5.496s (40.3M) [27.6] 19.097s (431M) [15.7] 113.299s (1.46B) [21.2] 2054.8s (3.06B) [19] multiprobe LSH 130.699s (963M) [0.51] 1181.32s (14.0B) [63/9.6] timeout [14/0.968] timeout [7/0.29]

dmin(Nq, Nr) > (1/(1 + ))B(Nq). After applying this change, testing is performed in the same way as for exact nearest neighbor search.  for each tree-based approach is selected to give an average per-point relative error of 0.1 (±0.01) for each dataset. Because our scheme does not allow the error for an individual point to exceed , the actual relative error for an individual query point is often much lower. Thus, it is often necessary to set  far higher than the target average error of 0.1. For LSH, the LSHKIT package is used, which implements multi-probe LSH and autotunes the hashing parameters [148]. We use the suggested number of hash tables

(L = 10) and probes (T = 20), and then autotune to select the number of hash functions

(M) and bin width (W). Autotuning failed for the larger power, lcdm, and sdss-dr6 datasets; in these cases suggestions of the LSHKIT authors are used [149].

The results are given in Table 15. With approximation, the improved dual-tree traver- sal performs fewer distance calculations on smaller datasets, and is still dominant for the larger datasets with kd-trees. LSH is not competitive on the larger datasets, and on the largest datasets LSH did not complete within 3 days, but it should be noted that the low- dimensional setting is where trees are most effective.

104 Overall, for large datasets in low-to-medium dimensions, dual-tree search is faster, and

the improved traversal we have proposed is the fastest. These experiments seem to show for

smaller datasets, single-tree search may be fastest; for sufficiently high dimensions, LSH is

faster. This corroborates existing results [125]; as the dimension of data gets higher, prun-

ing rules become less effective. Regardless, in low-to-medium dimensions, the improved dual-tree traversal is dominant.

105 CHAPTER 7 ALGORITHMS

This chapter, certainly the longest in the thesis, shows how we can use the versatile tree- independent dual-tree algorithm abstraction to describe numerous algorithms—and de- velop entirely new ones—using only a BaseCase() and Score() function. Some of the algorithms presented here were not originally designed by me but I have generalized them from their tree-specific form to an actual tree-independent algorithm; others I have im- proved somewhat; others still are completely original contributions. For some algorithms,

I have used the theoretical adaptive analysis techniques presented in Section 5.2 in order to bound the running time of the algorithm.

7.1 Nearest neighbor search

Nearest neighbor search is arguably the most well-known problem to be solved by dual-tree algorithms, with multiple dual-tree algorithms proposed for both exact and approximate nearest-neighbor search [61, 67, 28] as well as a nearly infinite set of other techniques for exact and approximate solutions [150, 31, 33, 50, 151, 152, 153, 57, 56, 154, 155, 120,

156, 157, 158, 32]—and the citations here represent only a miniscule fraction of the not- completely-connected graph of nearest neighbor search literature.

Here, we will allow ourselves to consider the slightly more general problem of k-nearest neighbor search, where instead of finding only one neighbor, we find k neighbors for each query point. To formalize the problem, we can state it as follows:

Given a query dataset S q, a reference dataset S r, and an integer k : 0 < k < N, for each point pq ∈ S q, find the k nearest neighbors in S r and their distances from pq. The list

of nearest neighbors for a point pq can be referred to as Npq and the distances to nearest

neighbors for pq can be referred to as Dpq . Thus, the k-th nearest neighbor to point pq is

Npq [k] and Dpq [k] = kpq − Npq [k]k.

106 Algorithm 14 k-nearest-neighbors BaseCase()

Input: query point pq, reference point pr, list of k nearest candidate points Npq and k

candidate distances Dpq (both ordered by ascending distance) Output: distance d between pq and pr

d ← kpq − prk

if d < Dpq [k] and BaseCase(pq, pr) not yet called then

insert d into ordered list Dpq and truncate list to length k

insert pr into Npq such that Npq is ordered by distance and truncate list to length k return d

Clearly, a brute-force approach can be used: compare every possible point combination

and store the k smallest distance results for each pq. Of course, this scales poorly—hence the vast collection of literature referenced above. We will add to this vast collection by

describing our own tree-independent dual-tree algorithm to solve the k-nearest neighbor

search task. Notation used in this section is given in Chapter 3; specifically, in Table 1.

7.1.1 A tree-independent dual-tree algorithm

We unify all of these branch-and-bound strategies by defining methods BaseCase(pq,

pr) and Score(Nq, Nr) for use with a pruning dual-tree traversal.

At the initialization of the tree traversal, the lists Npq and Dpq are empty lists for each

query point pq. After the traversal is complete, the set {Npq [1], ..., Npq [k]} is the ordered set

of k nearest neighbors of the query point pq, and each Dpq [i] = kpq − Npq [i]k. If we assume

that Dpq [i] = ∞ if i is greater than the length of Dpq , we can formulate BaseCase() as given in Algorithm 141.

With the base case established, only the pruning rule remains. A valid pruning rule

will, for a given query node Nq and reference node Nr, prune the reference subtree rooted

p at Nr if and only if it is known that there are no points in Dr that are in the set of k

p nearest neighbors of any points in Dq . Thus, at any point in the traversal, we can prune the

combination (Nq, Nr) if and only if dmin(Nq, Nr) ≥ B1(Nq), where

1 In practice, k-nearest-neighbors is often run with identical reference and query sets. In that situation it may be useful to modify this implementation of BaseCase() so that a point does not return itself as the nearest neighbor (with distance 0).

107 B ( ) = max D [k]. (23) 1 Nq p p p∈Dq Now, we can describe this bound recursively. This is important for implementation; a

recursive function can cache previous calculations for large speedups.

B ( ) = max  max D [k], max D [k] (24) 1 Nq p p p p∈Pq p∈Dq ,p

p Suppose we have, at some point in the traversal, two points p0, p1 ∈ Dq for some node

1 k Nq, with Dp0 [k] = ∞ and Dp1 [k] < ∞. This means there exist k points {pr ,..., pr } in S r

i such that d(p1, pr) ≤ Dp1 [k] for i = {1,..., k}. Because p0 and p1 are both descendant

points of Nq, we can apply the triangle inequality to see that d(p0, p1) ≤ 2λq. Therefore,

i d(p0, pr) ≤ Dp1 [k] + 2λ(Nq) for i = {1,..., k}. Using this observation we can construct an

alternate bound function B2(Nq):

B ( ) = min D [k] + 2λ( ) (27) 2 Nq p p Nq p∈Dq

This bound can, like B1(Nq), be rearranged to provide a recursive definition. In addi-

p tion, if p0 ∈ Pq and p1 ∈ Dq , we can bound d(p0, p1) more tightly with ρ(Nq) + λ(Nq) instead of 2λ(Nq). These observations yield

( B2(Nq) = min min(Dp[k] + ρ(Nq) + λ(Nq)), p∈Pq ) (28) min (B2(Nc) + 2(λ(Nq) − λ(Nc)) . Nc∈Cq

Both B1(Nq) and B2(Nr) provide valid pruning rules. We can combine both to get a tighter pruning rule by taking the tighter of the two bounds. In addition, B1(Nq) ≥ B1(Nc)

108 Algorithm 15 k-nearest-neighbors Score()

Input: query node Nq, reference node Nr Output: a score for the node combination (Nq, Nr), or ∞ if the combination should be pruned

if dmin(Nq, Nr) < B(Nq) then return dmin(Nq, Nr) return ∞

and B2(Nq) ≥ B2(Nc) for all Nc ∈ Cq. Therefore, we can prune (Nq, Nr) if dmin(Nq, Nr) ≥ min{B1(parent(Nq)), B2(parent(Nq))}. These observations are combined for a better bound:

(  B(Nq) = min max max Dp[k], max B(Nc) , p∈Pq Nc∈Cq  min Dp[k] + ρ(Nq) + λ(Nq) , p∈ Pq (29)   min B(Nc) + 2 λ(Nq) − λ(Nc) Nc∈Cq  B(parent(Nq)) .

As a result of this bound function being expressed recursively, previous bounds can

be cached and used to calculate the bound B(Nq) quickly. We can use this to structure Score() as given in Algorithm 15.

7.1.2 Correctness proof

A correctness proof is straightforward.

N×D M×D Theorem 3. Given two datasets S q ∈ < and S r ∈ < , a value k such that 0 <

k < M, two arbitrary space trees Tq and Tr built on S q and S r respectively, and an initially empty lists D and N, then any arbitrary pruning dual-tree traversal which uses

Algorithm 14 for its BaseCase() and Algorithm 15 for its Score() will result in the list

Npq being populated with the k nearest neighbors in S r for each point pq ∈ S q, and Dpq [i] =

kpq − Npq [i]k ∀ 0 < i ≤ k, pq ∈ S q.

109 Proof. The proof can be split into two parts; first, we can prove that BaseCase() (Al- gorithm 14) is correct for a dual-tree (non-pruning) traversal. Then, we can prove that

Score() (Algorithm 15) does not prune any node combinations which could contain any improvements to the list D. ∈ ∗ Denote the list of true k nearest neighbors in S r for pq S q as Npq and the corre- ∗ BaseCase() sponding distances as Dpq . The implementation given in Algorithm 14 stores

the k nearest neighbors and distances in S r for each point in S q. In a dual-tree traversal,

BaseCase() is not called with any reference points that are not in S r. Thus, it is clear that

for Npq to be correct for each pq ∈ S q, then BaseCase() must be called with at least each

∗ combination of pq with each of the k elements in Npq . First, consider that in a dual-tree non-pruning traversal, each possible combination of nodes in Tq and Tr are visited. This follows from the definition. Also by definition, at

each combination (Nq, Nr), BaseCase() is called on each possible combination of points in Nq and Nr.

∗ Now, suppose there exists, for some query point pq ∈ Sq, a point pr not in the final

∗ list Dpq such that kpq − pr k < Dpq [k]. Algorithm 14 will never discard a candidate with

distance less than the current value of Dpq [k], but it is clear that at all times during the

∗ ∗ ∗ traversal, kpq − pr k < Dpq [k]. Thus, pr not being in the final list Dpq implies that pr < Sr.

This, with the fact that Dpq is an ordered tuple list, shows that Algorithm 14 used in a dual- tree non-pruning traversal produces the correct results for k-nearest-neighbors. Note that

each call BaseCase(pq, pr) with pr < Dpq was unnecessary. Next, consider the pruning rule given in Algorithm 15. A node combination is only

p pruned, according to the algorithm, if dmin(Nq, Nr) ≥ B(Nq) ∀ pi ∈ Dq . B(Nq) is the minimum of four other bound functions. Each of those four bound functions was devised

in such a way that at any point in the traversal, no node combination (Nq, Nr) which could k − k ∗ ≤ contain a point combination (pq, pr) where pq pr < Dpq [k] is pruned. Because Dpq [k] ∈ ∗ Dpq [k] at all times during the traversal, then no point combination (pq, pr) where pr Npq

110 Algorithm 16 AllNN(Nq, Nr) [64] nn if dmin(Nq, Nr) ≥ δq , then return if Nq is leaf and Nr is leaf then for all pq ∈ Pq, pr ∈ Pr do dqr ← kpq − prk.

if dqr < Dpq then Dpq = dqr; Npq = pr nn nn if dqr < δq then δq ← dqr AllNN(Nq.left, closer-of(Nr.left, Nr.right)) AllNN(Nq.left, farther-of(Nr.left, Nr.right)) AllNN(Nq.right, closer-of(Nr.left, Nr.right)) AllNN(Nq.right, farther-of(Nr.left, Nr.right)) nn nn nn nn δq = min(δq , max(δq.left, δq.right))

is ever pruned. Thus, BaseCase() is called with at least every point combination (pq, pr) ∈ ∗ ∈ ∗ ∗ where pr Npq for all pq S q. Therefore, at the end of the traversal, N = N and D = D , so the theorem holds. 

7.1.3 Specialization to existing k-NN algorithms

This algorithm is a generalization of the standard kd-tree k-NN search, which uses a prun-

ing dual-tree depth-first traversal. The archetypal algorithm for all-nearest neighbor search

(k-nearest neighbor search with k = 1) given for kd-trees in Alex Gray’s Ph.D. thesis [64]

nn is shown here in Algorithm 16 with converted notation. δq is the bound for a node Nq

and is initialized to ∞; Dpq represents the nearest distance for a query point pq, and Npq

represents the nearest neighbor for a query point pq. Nq.left represents the left child of Nq

and is defined to be Nq if Nq has no children; Nq.right is similarly defined. The structure of the algorithm matches Algorithm 7; it is a dual-tree depth-first recur-

nn sion. Because this is a depth-first recursion, δq = ∞ for a node Nq if no descendants of

nn p Nq have been recursed into. Otherwise, δq is the maximum of Dpq for all pq ∈ Dq . That

nn is, δq = B1(Nq). Thus, the comparison in the first line of Algorithm 16 is equivalent to

Algorithm 15 with B1(Nq) instead of B(Nq). This algorithm is also a generalization of the standard cover tree k-NN search [57]. The

cover tree search is a pruning dual-tree traversal where the query tree is traversed depth-first

111 while the reference tree is simultaneously traversed breadth-first. The pruning rule (after

simple adaptation to the k-nearest-neighbor search problem instead of the nearest-neighbor

search problem) is equivalent to

dmin(Nq, Nr) ≥ Dpq [k] + λq (30)

where pq is the point contained in Nq (remember, each node of a cover tree contains one

point). This is equivalent to B2(Nq) because ρ(Nq) = 0 for cover trees. The transformation from the algorithm given by Beygelzimer et al. [57] to our representation is clearer when

considering the tree-independent form of the cover tree traversal (Algorithm 8) and also

in the k-nearest neighbor search implementation of mlpack [159]; this implementation

follows the API laid out in Chapter 4.

Specific algorithms for ball trees, metric trees, VP trees, octrees, and other space trees

are trivial to create using the BaseCase() and Score() implementation given here (and in mlpack). Note also that this implementation will work in any metric space.

An extension to k-furthest neighbor search is straightforward. The bound function must

be ‘inverted’ by changing ‘max’ to ‘min’ (and vice versa); in addition, the distances Dpq [i] must be initialized to 0 instead of ∞, and the lists D and N must be sorted by descending

distance instead of ascending distance. Lastly, the comparison d < Dpq [k] must be changed

to d > Dpq [k]. With these simple changes, we have easily solved an entirely different problem using our meta-algorithm. An implementation using our meta-algorithm for both kd-trees and cover trees is also available in mlpack.

7.1.4 Runtime bounds

We now consider the running time of the algorithm, but with two important specializations for simplicity:

1. k = 1; we only show bounds for nearest neighbor search, not generalized k-nearest

neighbor search. The bounds we present can be adapted but the exposition is more

112 complex.

2. The tree type is the cover tree and the traversal is the standard cover tree pruning

traversal (see Section 5.2).

Because we are using the cover tree, we will simplify our bound function by bounding the bound function:

B(Nq) ≤ D[pq] + λq (31) and then we may also bound λq (because we have restricted the tree type to cover trees) to see

sq+1 B(Nq) ≤ D[pq] + 2 (32)

2 wherein sq is the scale of the query node Nq .

Now, using the expansion constant cr of the reference set S r and the expansion constant cq of the query set S q, and defining

! ! 0 cqr = max max cr , cr , (33) pq∈S q 0 where cr is the expansion constant of the set S r ∪ {pq}.

Theorem 4. Using cover trees, the standard cover tree pruning dual-tree traversal, and the nearest neighbor search BaseCase() and Score() as given in Algorithms 14 and 15 with k = 1, respectively, and also given a reference set S r with expansion constant cr, and

4 5 a query set S q, the running time of the algorithm is bounded by O(cr cqr(N + it(Tq) + θ))

with it(Tq and θ defined as in Definition 10 and Lemma 4, respectively.

Proof. The running time of BaseCase() and Score() are clearly O(1). Due to Theorem 1,

4 ∗ we therefore know that the runtime of the algorithm is bounded by O(cr |R |(N +it(Tq)+θ)). Thus, the only thing that remains is to bound the maximum size of the reference set, |R∗|.

2If the term ‘scale’ is unfamiliar, refer to the extensive discussion of cover trees in Sections 5.1.1 and 5.2.

113 ∗ max Assume that when R is encountered, the maximum reference scale is sr and the

∗ query node is Nq. Every node Nr ∈ R satisfies the property enforced in line 10 that

dmin(Nq, Nr) ≤ B(Nq). Using the definition of dmin(·, ·) and B(·), we expand the equation.

Note that pq is the point held in Nq and pr is the point held in Nr. Also, takep ˆr to be the

current nearest neighbor candidate for pq; that is, D[pq] = d(pq, pˆr) and N[pq] = pˆr. Then,

dmin(Nq, Nr) ≤ B(Nq) (34)

sq+1 sr+1 sq+1 d(pq, pr) ≤ d(pq, pˆr) + 2 + 2 + 2 (35)

max sr +1 ≤ d(pq, pˆr) + 2(2 ) (36)

max max where the last step follows because sq + 1 ≤ sr and sr ≤ sr . Define the set of points P

∗ ∗ as the points held in each node in R (that is, P = {pr ∈ P(Nr): Nr ∈ R }). Then, we can write

max sr +1 P ⊆ BS r (pq, d(pq, pˆr) + 2(2 )). (37)

max ∗ ∗ sr +1 ∗ Suppose that the true nearest neighbor is pr and d(pq, pr ) > 2 . Then, pr must

∗ be held as a descendant point of some node in R which holds some pointp ˜r. Using the triangle inequality,

max ∗ ∗ ∗ sr +1 d(pq, pˆr) ≤ d(pq, p˜r) ≤ d(pq, pr ) + d(p ˜r, pr ) ≤ d(pq, pr ) + 2 . (38)

max ∗ sr +1 This gives that P ⊆ BS r∪{pq}(pq, d(pq, pr ) + 3(2 )). The previous step is necessary: to apply the definition of the expansion constant, the ball must be centered at a point in the set; now, the center (pq) is part of the set.

max ∗ sr +1 ∗ |BS r∪{pq}(pq, d(pq, pr ) + 3(2 ))| ≤ |BS r∪{pq}(pq, 4d(pq, pr ))| (39)

3 ∗ ≤ cqr|BS r∪{pq}(pq, d(pq, pr )/2)| (40)

114 which follows because the expansion constant of the set S r ∪ {pq} is bounded above by cqr.

∗ Next, we know that pr is the closest point to pq in S r ∪ {pq}; thus, there cannot exist a point

0 0 ∗ pr , pq ∈ S r ∪ {pq} such that pr ∈ BS qr (pq, d(pq, pr )/2) because that would imply that

0 ∗ d(pq, pr) < d(pq, pr ), which is a contradiction. Thus, the only point in the ball is pq, and

∗ 3 we have that |BS r∪{pq}(pq, d(pq, pr )/2)| = 1, giving the result that |R| ≤ cqr in this case.

max max ∗ sr +1 sr +2 The other case is when d(pq, pr ) ≤ 2 , which means that d(pq, pˆr) ≤ 2 . Note

max that P ∈ Csr , and therefore

∗ smax+1 r max P ⊆ BS r (pq, d(pq, pr ) + 3(2 )) ∩ Csr (41)

smax+1 r max ⊆ BS r (pq, 4(2 )) ∩ Csr . (42)

max max max sr sr Every point in Csr is separated by at least 2 . Using Lemma 1 with δ = 2 and

5 5 5 ρ = 8 yields that |P| ≤ cr . This gives the result, because cr ≤ cqr. 

9 In the monochromatic case where S q = S r, the bound is O(c (N + it(T )) because c = cr = cqr and θ = 0. For well-behaved trees where it(Tq) is linear or sublinear in N, this represents the current tightest worst-case runtime bound for nearest neighbor search.

7.2 Range search

Range search is another popular neighbor searching problem related to k-nearest neighbor search. In addition to being a fairly standard machine learning task, it has numerous uses in applications such as databases and geographic information systems (GIS). A treatise on the history of the problem and solutions is given by Agarwal & Erickson [160]. The problem is:

Given query and reference datasets S q, S r and a range [l, u], for each point pq ∈ S q, find all points in S r such that l ≤ kpq − prk ≤ u. Refer to the list of neighbors for each query point pq as S [pq]. This list is not sorted in any particular order, and at initialization time, it is empty.

115 Algorithm 17 Range search BaseCase()

1: Input: query point pq, reference point pr, range sets N[pq] and range [l, u] 2: Output: distance d between pq and pr 3: if d(pq, pr) ∈ [rmin, rmax] and BaseCase(pq, pr) not yet called then 4: S [pq] ← S [pq] ∪ {pr} 5: return d .

Algorithm 18 Range search Score()

1: Input: query node Nq, reference node Nr 2: Output: a score for the node combination (Nq, Nr), or ∞ if the combination should be pruned

3: if dmin(Nq, Nr) ∈ [l, u] or dmax(Nq, Nr) ∈ [l, u] then 4: return dmin(Nq, Nr) 5: return ∞ .

In different settings, the problem of range search may not be stated identically; however,

our results are easily adaptable. A closely related problem is range count, where instead of

the set S [pq], only the size of the set |S [pq]| is desired for each query point pq. While range search is sometimes mentioned in the context of dual-tree algorithms [61],

the focus is usually on k-nearest neighbor search. As a result, I cannot find any explicitly

published dual-tree algorithms to generalize; however, a single-tree algorithm was pro-

posed by Bentley and Friedman [37]. Therefore, we will develop a novel tree-independent

dual-tree algorithm for range search, which is easily adaptible to range count.

7.2.1 A tree-independent dual-tree algorithm

Range search turns out to be far simpler than k-nearest neighbor search, mainly because

there is no complex bounding function B(Nq) for pruning. Pruning is only necessary when

we can determine that for a node combination (Nq, Nr), no descendant points of Nr could

possibly be in the range [l, u] for any descendant point of Nq. This also means that recursion order does not matter for range search.

BaseCase() and Score() functions are given in Algorithms 17 and 18. Algorithm 17,

the BaseCase() function, merely needs to add a reference point pr to the range set S [pq] if

116 Algorithm 19 More complicated Score() for range search.

1: Input: query node Nq, reference node Nr 2: Output: a score for the node combination (Nq, Nr), or ∞ if the combination should be pruned

3: if dmin(Nq, Nr) ∈ [l, u] or dmax(Nq, Nr) ∈ [l, u] then 4: return dmin(Nq, Nr) 5: else if dmin(Nq, Nr) ≥ l and dmax(Nq, Nr) ≤ u then p 6: for all pq ∈ Dq do p 7: S [pq] ← S [pq] ∪ Dr 8: return ∞

d(pq, pr) lies in the desired range. For pruning, observe that a node combination (Nq, Nr)

can be pruned if no descendant reference point of Nr can possibly fall within the desired

range of any descendant query point of Nq. This is straightforward to formalize; we may prune if

[dmin(Nq, Nr), dmax(Nq, Nr)] ∪ [l, u] , ∅. (43)

The given Score() algorithm expresses this in a more simple manner.

However, there is another pruning possibility that Score() in Algorithm 18 does not

exploit: if [dmin(Nq, Nr), dmax(Nq, Nr)] falls completely within the desired range [l, u], then every descendant reference point of Nr must fall into the result set for every descendant query point of Nq. A more complex Score() algorithm is given in Algorithm 19. This algorithm is implemented in mlpack.

7.2.2 Runtime bound

We turn our attention to bounding the running time of range search (and range count). Until the recent work which is presented in this section [135], there was no existing bound for range search which was better than the bound for brute-force range search (O(NM) for a query set of size M and a reference set of size N). It turns out that the simpler Score()

(Algorithm 18) is sufficient for a runtime bound, so in this section we will consider that implementation, as opposed to the more complex Score() of Algorithm 19. The bound

117 will apply for both algorithms, though, since the more complex Score() is easily shown

to only ever do less work than the simpler Score().

In order to bound the running time of dual-tree range search, we require better notions

for understanding the difficulty of the problem. Observe that if the range is sufficiently

large, then for every query point pq, S [pq] = S r. Clearly, for S q ∼ S r ∼ O(N), this cannot be solved in anything less than quadratic time simply due to the time required to fill each

output array S [pq]. Define the maximum result size for a given query set S q, reference set

S r, and range [l, u] as

|S max| = max |S [pq]|. (44) pq∈S q

Small |S max| implies an easy problem; large |S max| implies a difficult problem. For bounding the running time of range search, we require one more notion of difficulty, related

to how |S max| changes due to changes in the range [l, u].

Definition 12. For a range search problem with query set S q, reference set S r, range [l, u],

and results S [pq] for each query point pq given as

S [pq] = {pr : pr ∈ S r, l ≤ d(pq, pr) ≤ u}, (45) define the α-expansion of the range set S [pq] as the slightly larger set

α S [pq] = {pr : pr ∈ S r, (1 − α)l ≤ d(pq, pr) ≤ (1 + α)u}. (46)

When the α-expansion of the set S max is approximately the same size as S max, then the problem would not be significantly more difficult if the range [l, u] was increased slightly.

Using these notions, then, we may now bound the running time of range search.

Theorem 5. Given a reference set S r of size N with expansion constant cr, and a query set S q of size O(N), a search range of [l, u], and using the range search BaseCase()

118 and Score() as given in Algorithms 17 and 18, respectively, with the standard cover tree

pruning dual-tree traversal as given in Algorithm 8, and also assuming that for some α > 0,

α |S [pq] \ S [pq]| ≤ C ∀ pq ∈ S q, (47) the running time of range search or range count is bounded by

 4  4+β   O cr max cr , |S max| + C (N + it(Nq) + θ) (48)

−1 with θ defined as in Lemma 4, β = dlog2(1 + α )e, and S max as defined in Equation 44.

Proof. Both BaseCase() (Algorithm 17) and Score() (Algorithm 18) take O(1) time.

Therefore, using Lemma 1, we know that the runtime of the algorithm is bounded by

4 ∗ O(cr |R |(N + it(Nq) + θ)). As with the previous proofs, then, our only task is to bound the maximum size of the reference set, |R∗|.

∗ By the pruning rule, for a query node Nq, the reference set R is made up of reference

max sq+1 sr+1 s +2 nodes Nr that are within a margin of 2 + 2 ≤ 2 r of the range [l, u]. Given that pr is the point in Nr,

 smax+2   smax+2  r max r max pr ∈ BS r (pq, u + 2 ) ∩ Csr \ BS r (pq, l − 2 ) ∩ Csr . (49)

A bound on the number of elements in this set is a bound on |R∗|. First, consider the

− max max − case where u ≤ α 12sr +2. Ignoring the smaller ball, take δ = 2sr and ρ = 4(1 + α 1) and apply Lemma 1 to produce the bound

−1 ∗ 4+dlog2(1+α )e |R | ≤ cr . (50)

− max Now, consider the other case: u > α 12sr +1. This means

max max sr +1 sr +1 BS r (pq, u + 2 ) \ BS r (pq, l − 2 ) ⊆ BS r (pq, (1 + α)u) \ BS r (pq, (1 − α)l). (51)

119 α This set is necessarily a subset of S [pq]; by assumption, the number of points in this

∗ set is bounded above by |S max| + C. We may then conclude that |R | ≤ |S max| + C. By taking the maximum of the sizes of |R∗| in both cases above, we obtain the statement of the

theorem. 

This bound displays both the expected dependence on cr and |S max|. As the largest range

set S max increases in size (with the worst case being S max ∼ N), the runtime degenerates

to quadratic. But for adequately small S max the runtime is instead dependent on cr and the

parameter C of the α-expansion of S max. This situation leads to a simplification.

Corollary 2. For sufficiently small |S max| and sufficiently small C, the runtime of range search under the conditions of Theorem 5 simplifies to

8+β O(cr (N + it(Nq) + θ)). (52)

In this setting we can more easily consider the relation of the running time to α. Con-

sider α = (1/3); this yields a running time of O(c8(N + θ)). α = (1/7) yields O(c9(N +

10 it(Nq + θ)), α = (1/15) yields O(c (N + it(Nq) + θ)), and so forth. As α gets smaller, the exponent on c gets larger, and diverges as α → 0.

For reasonable runtime it is necessary that the α-expansion of S max be bounded. This is because the dual-tree recursion must retain reference nodes which may contain descendants

in the range set S [pq] for some query pq. The parameter C of the α-expansion allows us to bound the number of reference nodes of this type, and if α increases but C remains small

enough that Corollary 2 applies, then we are able to obtain tighter running bounds.

It is worth reiterating that the bound here depends only on the pruning rule of Algorithm

18, not the more complex pruning rule of Algorithm 19; thus, our bound is potentially

somewhat looser than it could be. Unfortunately, the expansion constant does not make

120 working with slices of balls easy, and therefore the exact route to a tighter bound with the

more complex pruning rule is unclear.

7.3 Kernel density estimation

Much work has been produced regarding the use of dual-tree algorithms for kernel density

estimation (KDE), including by Gray & Moore [61, 65] and later by Lee et al. [73, 75].

Kernel density estimation is an important machine learning task with a vast range of appli-

cations, from signal processing to econometrics.

Given a reference set S r, a query point pq, and a kernel K(·, ·), the true kernel density

estimate for a query point pq is given as

∗ X f (pq) = K(pq, pr). (53)

pr∈S r Often, the kernel K(·, ·) is shift-invariant; that is, it is just a function of the distance

between two points:

K(pq, pr):= K(kpq − prk). (54)

Some common examples of kernels include the Gaussian, Epanechnikov, Laplacian, exponential, and hyperbolic tangent kernels.

In the case of an infinite-tailed shift-invariant kernel K(·, ·), the exact computation can-

not be accelerated; thus, attention has turned towards tractable approximation schemes.

∗ Two simple schemes for the approximation of f (pq) are well-known: absolute value ap- proximation and relative value approximation. Absolute value approximation requires that

∗ each density estimate f (pq) is within  of the true estimate f (pq):

∗ | f (pq) − f (pq)| <  ∀pq ∈ S q. (55)

Relative value approximation is a more flexible approximation scheme; given some parameter , the requirement is that each density estimate is within a relative tolerance of

121 ∗ f (pq):

| f (p ) − f ∗(p )| q q ∀ ∈ ∗ <  pq S q. (56) | f (pq)| Kernel density estimation is related to the well-studied problem of kernel summation, which can also be solved with dual-tree algorithms [74, 75]. In both of those problems, regardless of the approximation scheme, simple geometric observations can be made to ac- celerate computation: when K(·, ·) is shift-invariant, faraway points have very small kernel evaluations. Thus, trees can be built on S q and S r, and node combinations can be pruned when the nodes are far apart while still obeying the error bounds.

In the following two subsections, we will show two simple dual-tree algorithms for both absolute-value and relative-value approximate kernel density estimation. We additionally restrict ourselves to the standard kernel density estimation assumptions of a shift-invariant kernel K(pq, pr) = K(kpq − prk) which is monotonically decreasing and non-negative. These dual-tree algorithms are useful when density estimates are required for not just a single query point pq but instead an entire query set S q. Then, we will analyze the running time of each algorithm.

7.3.1 Dual-tree algorithm for absolute-value approximation

A tree-independent algorithm for solving approximate kernel density estimation with ab- solute value approximation under the previous assumptions on the kernel is given as a

BaseCase() function in Algorithm 20 and a Score() function in Algorithm 21. The list

fp holds partial kernel density estimates for each query point, and the list fn holds partial kernel density estimates for each query node. At the beginning of the dual-tree traversal, the lists fp and fn, which are both of size O(N), are each initialized to 0. As the traver- sal proceeds, node combinations are pruned if the difference between the maximum kernel value K(dmin(Nq, Nr)) and the minimum kernel value K(dmax(Nq, Nr)) is sufficiently small (line 3). If the node combination can be pruned, then the partial node estimate is updated

122 Algorithm 20 Approximate kernel density estimation BaseCase()

1: Input: query point pq, reference point pr, list of kernel point estimates fˆp 2: Output: kernel value K(pq, pr)

3: fp(pq) ← fp(pq) + K(pq, pr) 4: return K(pq, pr)

Algorithm 21 Absolute-value approximate kernel density estimation Score()

1: Input: query node Nq, reference node Nr, list of node kernel estimates fˆn 2: Output: a score for the node combination (Nq, Nr), or ∞ if the combination should be pruned

3: if K(dmin(Nq, Nr)) − K(dmax(Nq, Nr)) <  then p   4: fn(Nq) ← fn(Nq) + |D (Nr)| K(dmin(Nq, Nr)) + K(dmax(Nq, Nr)) / 2 5: return ∞ 6: return K(dmin(Nq, Nr)) − K(dmax(Nq, Nr))

(line 4). When node combinations cannot be pruned, BaseCase() may be called, which simply updates the partial point estimate with the exact kernel evaluation (line 3).

After the dual-tree traversal, the actual kernel density estimates f must be extracted. P This can be done by traversing the query tree and calculating f (pq) = fp(pq)+ Ni∈T fn(Ni), where T is the set of nodes in Tq that have pq as a descendant. Each query node needs to be visited only once to perform this calculation; it may therefore be accomplished in O(N) time.

Note that this version is far simpler than other dual-tree algorithms that have been pro- posed for approximate kernel density estimation (see, for instance, Gray’s algorithm [65]); however, this version is sufficient for our runtime analysis. Real-world implementations tend to be far more complex.

Proving correct functionality of kernel density estimation is simple. In Algorithm

p 21, each pruned subtree introduces a maximum of (|Dr |/|S r|) error into the density es- timate. Clearly, we cannot prune more than |S r| reference points, giving a maximum of

(|S r|/|S r|) =  approximation error. In addition, for every reference point not pruned, we correctly add its contribution to the density estimate in Algorithm 20. Thus, our algorithm

123 produces approximate kernel density estimates within the bounds specified by the error pa-

rameter . Often, the actual empirical error for any query point will be significantly smaller

than .

7.3.2 Absolute-value approximate KDE runtime bounds

If we place additional restrictions on the dual-tree algorithm, we may use adaptive algo- rithm analysis techniques to bound the running time. The restrictions are the usual restric- tions: we will use the cover tree (Section 5.1.1) and standard cover tree traversal (Algorithm

8). In addition, we will require that the shift-invariant kernel K(·, ·) satisfies the following properties: there exists some bandwidth h such that K(d) must be concave for d ∈ [0, h] and convex for d ∈ [h, ∞). This assumption implies that the magnitude of the derivative

|K 0(d)| is maximized at d = h. This assumption is not restrictive; most standard kernels fall into this class, including the Gaussian, exponential, and Epanechnikov kernels.

Theorem 6. Assume that K(·, ·) is a kernel satisfying the assumptions above. Then, given a query set S q and a reference set S r with expansion constant cr, and using the approximate kernel density estimation BaseCase() and Score() as given in Algorithms 20 and 21, respectively, with the traversal given in Algorithm 8, the running time of approximate kernel

8+dlog2 ζe density estimation for some error parameter  is bounded by O(cr (N + it(Tq) + θ))

0 −1 −1 with ζ = −K (h)K () , it(Tq) defined as in Definition 10, and θ defined as in Lemma 4.

Proof. It is clear that BaseCase() and Score() both take O(1) time, so Theorem 1 implies

4 ∗ the total runtime of the dual-tree algorithm is bounded by O(cr |R |(N+it(Tq)+θ)). Thus, we will bound |R∗| using techniques related to those used by Ram et al. [133]. The bounding

∗ max ∗ of |R | is split into two sections: first, we show that when the scale sr is small enough, R

∗ max is empty. Second, we bound R when sr is larger.

∗ The Score() function is such that any node in R for a given query node Nq obeys

124 K(dmin(Nq, Nr)) − K(dmax(Nq, Nr)) ≥ . (57)

Thus, we are interested in the maximum possible value K(a) − K(b) for a fixed value of b − a > 0. Due to our assumptions, the maximum value of K 0(·) is K 0(h); therefore, the maximum possible value of K(a) − K(b) is when the interval [a, b] is centered on h. This allows us to say that K(a) − K(b) ≤  when (b − a) ≤ (−/K 0(h)). Note that

max max sr +1 sr +1 dmax(Nq, Nr) − dmin(Nq, Nr) ≤ d(pq, pr) + 2 − d(pq, pr) + 2 (58)

max ≤ 2sr +2. (59)

max ∗ sr +2 0 max 0 Therefore, R = ∅ when 2 ≤ −/K (h), or when sr ≤ log2(−/K (h)) − 2.

max 0 Consider, then, the case when sr > log2(−/K (h)) − 2. Because of the pruning rule,

∗ for any Nr ∈ R , K(dmin(Nq, Nr)) > ; we may refactor this by applying definitions to

−1 smax+1 show d(pq, pr) < K () + 2 r . Therefore, bounding the number of points in the set

− max ∗ 1 sr +1 max BS r (pq, K () + 2 ) ∩ Csr is sufficient to bound |R |. For notational convenience,

− max define ω = (K 1()/2sr +1) + 1, and the statement may be more concisely written as

max sr +1 max BS r (pq, ω2 ) ∩ Csr .

max s ∗ 3+dlog2 ωe Using Lemma 1 with δ = 2 r and ρ = 2ω gives |R | = cr .

max max The value ω is maximized when sr is minimized. Using the lower bound on sr , ω is bounded as ω = −2K 0(h)K −1()−1. Finally, with ζ = −K 0(h)K −1()−1, we are able to

∗ 3+dlog2(2ζ)e 4+dlog2 ζe conclude that |R | ≤ cr = cr . Therefore, the entire dual-tree traversal takes

8+dlog2 ζe O(cr (N + θ)) time. The postprocessing step to extract the estimates f (·) requires one traversal of the tree

Tr; the tree has O(N) nodes, so this takes only O(N) time. This is less than the runtime of the dual-tree traversal, so the runtime of the dual-tree traversal dominates the algorithm’s runtime, and the theorem holds. 

The dependence on  (through ζ) is expected: as  → 0 and the search becomes exact, ζ

125 diverges both because −1 diverges and also because K −1() diverges, and the runtime goes to the worst-case O(N2); exact kernel density estimation means no nodes can be pruned at all.

2 2 For the Gaussian kernel with bandwidth σ defined by Kg(d) = exp(−d /(2σ )), ζ does not depend on the kernel bandwidth; only the approximation parameter . For this kernel, √ 0 −1 −1/2 −1 h = σ and therefore −Kg(h) = σ e . Additionally, Kg () = σ 2 ln(1/). This means p that for the Gaussian kernel, ζ = (−2 ln )/(e2). Again, as  → 0, the runtime diverges;

however, note that there is no dependence on the kernel bandwidth σ. To demonstrate

the relationship of runtime to , see that for a reasonably chosen  = 0.05, the runtime is

8.89 11.52 approximately O(cr (N + θ)); for  = 0.01, the runtime is approximately O(cr (N + θ)).

22.15 For very small  = 0.00001, the runtime is approximately O(cr (N + θ)).

Next, consider the exponential kernel: Kl(d) = exp(−d/σ). For this kernel, h = 0 (that

0 −1 is, the kernel is always convex), so then Kl (h) = σ . Simple algebraic manipulation gives

−1 0 −1 −1 −1 Kl () = −σ ln , resulting in ζ = −Kl (h)Kl () =  ln . So both the exponential and Gaussian kernels do not exhibit dependence on the bandwidth.

To understand the lack of dependence on kernel bandwidth more intuitively, consider

that as the kernel bandwidth increases, two things happen: (a) the reference set R becomes

empty at larger scales, and (b) K −1() grows, allowing less pruning at higher levels. These

effects are opposite, and for the Gaussian and exponential kernels they cancel each other

out, giving the same bound regardless of bandwidth.

7.3.3 Relative Value Approximation

It is straightforward to adapt the Score() function in Algorithm 21 to relative value ap-

proximation; the pruning condition only needs a little tweaking. BaseCase() can remain

the same as in Algorithm 20.

First, we must establish a Score() function for relative value approximation. The

∗ difference between Equations 55 and 56 is the division by the term | f (pq)|. But we can

∗ quickly bound | f (pq)|:

126 Algorithm 22 Relative-value approximate kernel density estimation Score()

1: Input: query node Nq, reference node Nr, list of node kernel estimates fˆn 2: Output: a score for the node combination (Nq, Nr), or ∞ if the combination should be pruned max 3: if K(dmin(Nq, Nr)) − K(dmax(Nq, Nr)) < K then p   4: fn(Nq) ← fn(Nq) + |D (Nr)| K(dmin(Nq, Nr)) + K(dmax(Nq, Nr)) / 2 5: return ∞ 6: return K(dmin(Nq, Nr)) − K(dmax(Nq, Nr))

! ∗ | f (pq)| ≥ NK max d(pq, pr) . (60) pr∈S r

This is clearly true: each point in S r must contribute more than K(maxpr∈S r d(pq, pr))

∗ to f (pq). Now, we may revise the relative approximation condition in Equation 56:

∗ max | f (pq) − f (pq)| ≤ K (61)

max where K is lower bounded by K(maxpr∈S r d(pq, pr)). Assuming we have some estimate K max, this allows us to create a Score() algorithm, given in Algorithm 22. An estimate

Kmax may easily be obtained: one pass over both S q and S r can determine a bounding box (or ball) of the data, and then the diameter of the box (or sphere) can be used to produce an estimate K max. This is only a strategy for a rough estimate; it is possible to produce tighter bounds by exploiting the already-built query and reference trees, as we will see in the upcoming proof.

7.3.4 Runtime bounds for relative value approximate KDE

Using the Score() function in Algorithm 22 and the runtime bound results for absolute value approximate kernel density estimation, we may prove linear runtime bounds for rel- ative value approximate kernel density estimation.

Theorem 7. Assume that K(·, ·) is a kernel satisfying the same assumptions as Theorem 6.

Then, given a query set S q and a reference set S r both of size O(N), it is possible to perform

127 relative value approximate kernel density estimation (satisfying the condition of Equation

56) in O(N) time, assuming that the expansion constant cr of S r is not dependent on N.

Proof. It is easy to see that Theorem 6 may be adapted to the very slightly different

Score() rule of Algorithm 22 while still providing an O(N) bound. With that Score() function, the dual-tree algorithm will return relative-value approximate kernel density esti- mates satisfying Equation 56.

max We now turn to the calculation of K . Given the cover trees Tq and Tr with root

R R max nodes Nr and Nr , respectively, we may calculate a suitable K value in constant time:

R R max R R R R sq +1 sr +1 K = dmax(Nq , Nr ) = d(pq , pr ) + 2 + 2 . (62)

This proves the statement of the theorem. 

In this case, we have not shown tighter bounds because the algorithm we have proposed is not useful in practice. For an example of a better relative-value approximate kernel density estimation dual-tree algorithm, see the work of Gray [65].

With linear runtime bounds proved for both relative value approximate and absolute value approximate kernel density estimation, we move on to the next algorithm.

7.4 Minimum spanning tree calculation

Finding a Euclidean minimum spanning tree has been a relevant problem since Boruvka’s˚ algorithm was proposed in 1926. Recently, a dual-tree version of Boruvka’s˚ algorithm was developed [68] for kd-trees and cover trees. We unify these two algorithms and generalize to other types of space tree by formulating BaseCase() and Score() functions.

N×D For a dataset S r ∈ < , Boruvka’s˚ algorithm connects each point to its nearest neigh- bor, giving many ‘components’. For each component c, the nearest point in S r to any point of c that is not part of c is found. The points are connected, combining those components.

This process repeats until only one component—the minimum spanning tree—remains.

Figure 32 shows a sample evolution of components.

128 c2

c4 c6

c1 c3 c5

(a) First stage. (b) Second stage. (c) Final MST.

Figure 32: Progression of Boruvka’s˚ algorithm.

During the algorithm, we maintain a list F made up of i components Fi : {Ei, Vi} where

Ei is the list of edges and Vi is the list of vertices in the component Fi (these are points in

S r). Each point in S r belongs to only one Fi. At initialization, |F| = |S r| and Fi = {∅, {pi}}

for i = {1,..., |S r|}, where pi is the i’th point in S r. For p ∈ S r we define F(p) = Fi if Fi is the component containing p. During the algorithm, we maintain N(Fi) as the candidate

nearest neighbor of component Fi and pc(Fi) as the point in component Fi nearest to N(Fi).

Then, D(Fi) = kpc(Fi) − N(Fi)k. Remember that F(N(Fi)) , Fi.

To run Boruvka’s˚ algorithm with a space tree Tr built on the set of points S r, a pruning dual-tree traversal is run with BaseCase() as Algorithm 23, Score() as Algorithm 24, Tr as both of the trees, and F as initialized before. Note that Score() uses B(Nq) from Section

7.1 with k = 1. Upon traversal completion, we have a list N(Fi) of nearest neighbors of each

Algorithm 23 Boruvka’s˚ algorithm BaseCase().

Input: query point pq, reference point pr, nearest candidate point N(F(pq)) and distance D(F(pq)) Output: distance d between pq and pr

if pq = pr then return 0 if F(pq) , F(pr) and kpq − prk < D(F(pq)) then D(F(pq)) ← kpq − prk N(F(pq)) ← pr; pc(F(pq)) ← pq return kpq − prk

129 Algorithm 24 Boruvka’s˚ algorithm Score().

Input: query node Nq, reference node Nr Output: a score for the node combination (Nq, Nr), or ∞ if the combination should be pruned

if dmin(Nq, Nr) < B(Nq) then p p if F(pq) = F(pr) ∀pq ∈ Dq , pr ∈ Dr then return ∞ return dmin(Nq, Nr) return ∞

component Fi. The edge (N(Fi), pc(Fi)) is added to Fi for each Fi. Then, any components in F with shared edges are merged into a new list F0 where |F0| < |F|. The pruning dual- tree traversal is then run again with F = F0 and the traversal-merge process repeats until

|F| = 1. When |F| = 1, then F1 is the minimum spanning tree of S r. To prove the correctness of the meta-algorithm, see Theorem 4.1 in March et al. [68].

That proof can be adapted from kd-trees to general space trees. This representation is a generalization of their algorithms; our meta-algorithm to produces their kd-tree and cover tree implementations with a tighter distance bound B(Nq). Further, the meta-algorithm produces a provably correct dual-tree algorithm with any type of space tree.

7.5 Sparse kernel matrix approximation

The introduction of the kernel trick gave rise to an entire class of kernelized algorithms including kernel principal components analysis (kernel PCA) [111], kernel support vector machines [161], kernel regression [162, 163], spectral clustering [164, 165, 166], Gaussian processes [167, 168], and a variety of other problems. Though each of these methods is significantly different, the commonalities are that a kernel matrix K must often be calcu- lated.

Given some positive definite Mercer kernel K(·, ·) and some dataset S r that contains N points, the kernel matrix K ∈ RN×N is assembled with each element defined as

Ki j = K(pi, p j) (63)

130 (a) Concentric rings. (b) After KPCA (σ = 10).

Figure 33: A standard kernel PCA example.

for pi and p j corresponding to points in K. In kernel PCA, for instance, K is then eigen- decomposed and the input points are projected onto some (normally only a few) of the

eigenvectors of the kernel matrix.

For kernel PCA, this is superior to regular PCA because nonlinear projections are possi-

ble; see Figure 33. This has repeatedly been shown to be an effective technique for a variety

of tasks, with applications in handwritten digit recognition [169], image de-noising [170],

speech recognition [171, 172], face recognition [173], and a multitude of other machine

learning tasks [174, 175, 176].

However, it is important to note that K takes O(N2) memory and O(N2) kernel eval-

uations to compute. In my experience, on modern commodity computing hardware, this

means explicit calculation of K is simply impossible for N larger than about 15000.

In this chapter, we examine those situations in which the kernel matrix is sparse or near-sparse, and exploit this fact to develop a dual-tree algorithm which assembles a sparse kernel matrix, for small-bandwidth shift-invariant kernels. As an example, we then apply this to kernel PCA, showing that for small bandwidths, our algorithm outperforms alterna- tives such as the Nystrom¨ method and variants. Although the application of interest here is

131 kernel PCA3, the algorithm may be adapted and applied to other kernel methods.

7.5.1 Sparsity in kernel matrices

A sparse (or near-sparse) kernel matrix K can arise if a shift-invariant kernel is used (that is, K(xi, x j) = K(kxi − x jk)). Genton [177] refers to these kernels as ‘stationary kernels’.

With a kernel of this type, when two points xi and x j are far apart, the kernel evaluation

K(xi, x j) approaches zero. When kernel PCA is viewed as a manifold learning technique, a sparse kernel matrix

K is sensible. Similar techniques like IsoMap, LLE, and Laplacian Eigenmaps generate the k-nearest-neighbor graph of the data—which can be transformed to a sparse similarity matrix—in order to represent the manifold the data lie upon. This is intuitive: faraway points are unlikely to lie on the same locally linear manifold region. Thus, interactions between faraway points are not useful, and for kernel PCA a sparse K is reasonable. The connections between kernel PCA and other manifold learning techniques are well-known

[178]. Note that Laplacian Eigenmaps corresponds directly to kernel PCA with a sparse kernel matrix.

Of course, a sparse kernel matrix is not always useful. When considering the wider class of all kernel methods (not just kernel PCA), it is known that very small kernel bandwidths can lead to severe loss in performance for some methods [179]. For instance, Murray [180] argues that the small-bandwidth case is mostly irrelevant for Gaussian process regression.

On the other hand, it has been shown that there is a definite tradeoff between the sparsity

(or near-sparsity) of the kernel matrix and the performance of the method [181]. This relationship can be optimized to provide the best tradeoff between algorithm performance and kernel matrix sparsity.

To show that this is possible for kernel PCA, consider the image de-noising task on the

USPS dataset, as in Scholkopf¨ and Smola [111]. In this task we add Gaussian noise to each

3Some assumptions must be satisfied for this fast kernel matrix approximation technique to be useful for kernel PCA. A better discussion can be found in Section 7.5.10.

132 Table 16: Image denoising performance on the USPS dataset as a function of σ.

Elements of Elements of σ pSNR K < 0.01 K < 0.001 1500 13.933 dB 0.00% 0.00% 1000 14.579 dB 0.01% 0.00% 900 14.547 dB 0.88% 0.00% 800 14.421 dB 13.16% 0.04% 700 14.132 dB 56.17% 3.30% 600 13.468 dB 90.75% 38.71% 550 12.712 dB 96.42% 68.55% 500 11.595 dB 98.60% 88.39% noisy 10.797 dB – –

Figure 34: Typical reconstruction; top: clean data, middle: noisy, bottom: after KPCA with σ = 600. pixel of each image in the USPS dataset, with variance equal to half the dynamic range of the pixels. Then, we perform kernel PCA, saving 64 of the kernel eigenvectors4, and reconstruct the images according to Mika et al. [182]. This was done with the Statistical

Pattern Recognition Toolbox [183]. For the quality measure, we use the mean pSNR (peak signal-to-noise ratio). Table 16 shows results for various σ values on the full USPS dataset

(11k points) using the Gaussian kernel with bandwidth σ. Reconstructions of typical digits are shown in Figure 34.

The tradeoff between pSNR and sparsity of K is clear; but note that we can still obtain reasonable performance with small σ. With σ = 600, the mean pSNR is only 1 dB lower than peak performance, and K is mostly near-sparse. These results corroborate those shown by Zhang and Genton [181], and also by Mika et al. [182] who showed that de-noising with

464 eigenvectors seemed to provide the best-looking reconstructions; results for other number of eigen- vectors exhibited similar trends, though.

133 kernel PCA is effective when small bandwidths are used.

In addition, if n is increased and points are coming from the same distribution while σ is held constant, then on average, each point will have large kernel interactions with more other points. Thus, it should be possible to reduce σ as n increases.

We can conclude that although performance degrades, a small σ can be chosen to in- duce a near-sparse K, and at least for kernel PCA, we may still maintain satisfactory per- formance.

By thresholding small values, though, we are not guaranteed that Kˆ is positive definite

(see Genton [177])5. Fortunately, the positive definiteness of Kˆ is not strictly necessary for eigendecomposition. Further, only the eigenvectors corresponding to large eigenvalues of

Kˆ are interesting for kernel PCA. The eigenvectors of Kˆ are essentially perturbed eigenvec- tors of K. So, if the error K − Kˆ is reasonably small, any eigenvectors of Kˆ with negative eigenvalues will correspond to eigenvectors of K with small eigenvalues—which would have been ignored anyway. Thus, despite the fact that thresholding can cause Kˆ to be in- definite, this presents no serious issue, at least for the task of kernel principal components analysis.

7.5.2 Related work on kernel matrix approximation

The problem of large-scale kernel methods has been studied extensively. The most obvious solution is to ignore some of the input points, by generating a smaller kernel matrix on only a cleverly-chosen subset of the data. Smola and Scholkopf¨ [184] suggest three differ- ent schemes to select subsets of input data. More recently, Williams and Seeger [24] pro- posed the Nystrom¨ method for subset selection. This approach assumes that K is low-rank, and approximates K as a product of two smaller matrices, significantly reducing storage and computation costs. Another similar approach is the column sampling method [185].

Unfortunately, theoretical work on these algorithms only gives a probabilistic error bound

5It would be possible to adapt this algorithm to guarantee that Kˆ remains positive definite by adapting the techniques of Zhang and Genton [181].

134 [25, 186]; guaranteed approximation error bounds are not known. Additionally, sampling-

based approaches may have problems for datasets with small clusters, where clusters of

data with few points may be entirely ignored by the sampling algorithm.

For kernel PCA specifically, another approach is the Kernel Hebbian Algorithm [170],

an iterative approach to estimate kernel PCA with memory requirements that are linear in

the number of points. This type of approach allowed estimation of kernel PCA on the full

MNIST training set (60000 points) in 30 to 60 hours using 2007-era computing equipment.

The few approaches that exploit sparsity in the kernel matrix [177, 181] do so only

in order to avoid potentially expensive operations that must be performed with the kernel

matrix; for instance, in kernel PCA, this involves an O(N3) eigendecomposition of the

kernel matrix. These approaches do not present a solution to the O(N2) construction of the

sparse kernel matrix, and thus still scale poorly with the number of points in the dataset.

More related to this work is that of Gray [83], who applied space partitioning trees to

approximate kernels in the context of Gaussian process regression. This approach is related

to fast tree-based kernel density estimation [65] and kernel summations [74]. However,

Gray’s approach does not consider sparsely approximating the kernel matrix and it is not

readily adaptable to the kernel PCA problem. In addition, these types of approaches are

generally limited to one type of tree (usually kd-trees), even though the best tree type is

often problem-dependent and dataset-dependent.

7.5.3 A dual-tree algorithm

Like every other algorithm we have discussed in this chapter of the thesis, we will intro-

duce a dual-tree algorithm as a BaseCase() and Score() function, and it will be tree-

independent as a result. The notation, as with all other algorithms, is given in Table 1. The

key to the algorithm is determining when we can prune away subtrees of work.

We will use simple thresholding to construct a sparsified kernel matrix approximation

Kˆ ; that is, given some approximation parameter , we take Kˆ i j = 0 when K(pi, p j) ≤ .

Then, a node combination (Nq, Nr) can be pruned when it can be shown that Ki j ≤  for

135 Algorithm 25 Score(Nq, Nr) for sparse kernel matrix approximation. 1: Input: query node Nr, reference node Nr 2: Output: a score for the node, or ∞ if the node can be pruned

3: if Kmax(Nq, Nr) ≤  then 4: return ∞ 5: else if Kmax(Nq, Nr) − Kmin(Nq, Nr) ≤ 2 then   ← 1 K − K 6: kmid 2 max(Nq, Nr) min(Nq, Nr) p p 7: for all pq ∈ Dq and pr ∈ Dr do 8: Kqr ← kmid 9: return ∞ 10: else 11: return Kmax(Nq, Nr)

all xi that are descendant points of Nq and all x j that are descendant points of Nr.

Given two nodes Nq and Nr with centers µq and µr and furthest descendant distances

λq and λr, respectively, we can define the minimum distance between any two descendant points in the two nodes easily:

dmin(Nq, Nr) = kµq − µrk − λq − λr. (64)

When the kernel is shift-invariant, it is easy to define the maximum kernel value:

  Kmax(Nq, Nr) = K dmin(Nq, Nr) . (65)

This allows us to construct a simple Score() function, given in Algorithm 25. If the maximum kernel evaluation between any two descendant points is less than or equal to , then there is no need to recurse into those nodes—all kernel evaluations between descendant points will be approximated as 0.

If a node combination is not pruned, BaseCase() is called on combinations of points in the two nodes. Thus, given two points pq and pr from an unpruned node combination

(Nq, Nr), we must set Kˆ qr to Kqr, if Kqr > .A BaseCase() function that performs this is given in Algorithm 26.

To run the algorithm, a type of tree and pruning dual-tree traversal are first selected

136 Algorithm 26 BaseCase(pq, pr) for sparse kernel matrix approximation.

1: Input: query point pq, reference point pr 2: Output: none

3: if K(pq, pr) >  then 4: Kˆ qr ← K(pq, pr)

(standard choices might be a kd-tree and a dual depth-first traversal). The approximation

parameter  is chosen. A tree T is built on the dataset S and the pruning dual-tree traversal

is started with the node combination (root(T ), root(T )). The sparse matrix Kˆ is initialized to all zeros before the traversal.

7.5.4 Correctness proof

Showing the correctness of the algorithm is straightforward and the proof technique resem- bles correctness proofs for other algorithms.

Theorem 8. A pruning dual-tree traversal using Algorithm 26 as its BaseCase() and

Algorithm 25 as its Score() will produce a sparse approximation Kˆ of the true kernel matrix K such that |Ki j − Kˆ i j| ≤  for all i, j ∈ {1,..., n}.

Proof. First, assume Score() prunes no node combinations. This means BaseCase() is called with every possible combination of points pq, pr ∈ S . Line 4 of Algorithm 26 means that Kqr = Kˆ qr. Otherwise, |Kqr − Kˆ qr| = |Kqr| ≤ . So when Score() prunes nothing, BaseCase() is called on every combination of points, and the result is correct.

Now we show that combinations of points pruned by Score() are never outside the  tolerance. By its design, Score() only prunes a node combination under two conditions:

p p first, if for every pq ∈ Dq and pr ∈ Dr , it is known that K(pq, pr) ≤ , then the node combination is pruned. In that case, Kˆ qr = 0 and then |Kqr − Kˆ qr| = |Kqr| ≤ , which is within tolerance.

Now, consider the second pruning condition: if Kmax(Nq, Nr) − Kmin(Nq, Nr) ≤ 2, then the node combination is pruned, and for every descendant point combination (pq, pr),

Kqr is set to the midpoint of the range, k| (line 6). Clearly, Kmax(Nq, Nr) − k| ≤ , and

137 k| − Kmin(Nq, Nr) ≤ . Therefore, the approximation for any point is within a tolerance of , and the approximation satisfies the desired condition. 

Note that Theorem 8 holds for any type of space tree and pruning dual-tree traversal.

7.5.5 Application to kernel PCA

In the previous section, we introduced a dual-tree algorithm that can quickly construct a sparse approximation Kˆ of the kernel matrix K of some dataset S containing n points. A procedure is given below to efficiently perform kernel PCA using this algorithm.  is the user-specified approximation parameter, and dˆ is the desired dimension of the nonlinear projections.

1. Construct a tree T on the dataset S .

2. Initialize Kˆ to an empty sparse matrix.

3. Run the pruning dual-tree traversal with Algorithm 26 as the BaseCase() function

and Algorithm 25 as the Score() function, and  as the approximation parameter.

ˆ 4. Use a sparse eigensolver to recover d eigenvectors V = {v1,..., vdˆ}.

5. Construct projected data Sˆ = VT Kˆ .

In traditional kernel PCA, the eigendecomposition step (Step 4) takes O(N3) time6.

However, because Kˆ is sparse, a sparse eigensolver implementation such as ARPACK [187] can scale linearly in N, under the assumption that a matrix-vector product can be calculated in O(N) time. This is true when the matrix has O(N) elements. In Section 7.5.6, we show that for certain dataset conditions this is true; then, the eigendecomposition will take O(N) time. In addition, the final projection step (Step 5) takes O(N2dˆ) time for traditional kernel

PCA, but with a sparse Kˆ , the computations required to compute VT Kˆ are far fewer. If Kˆ

6A smart eigensolver can do better; in fact, ARPACK can be used with dense matrices, but it will still take O(N2) time.

138 has O(N) nonzero elements, then the entire kernel PCA algorithm will take O(N) time (plus

the tree construction time, which depends on the type of tree but is often O(N log N)).

7.5.5.1 Centering the kernel matrix

Although K can be approximated effectively and quickly by a sparse matrix Kˆ , often the

eigendecomposition is not performed on K but instead a centered version of K [111]:

Kc = K − 1nK − K1n + 1nK1n (66)

where 1n is the n×n matrix filled with (1/n). Fortunately, this poses no issue. The last term,

1nK1n, is an n × n matrix filled entirely with the mean value of K; thus, it can be stored as a single floating-point number. Similarly, 1N K is a matrix where each row is identical and thus can be stored as a row vector of size n, and K1n is a matrix where each column is identical, and can be stored as a column vector of size n.

During eigendecomposition, each Arnoldi iteration requires calculation of y = Kxˆ for some vector x. Kˆ is sparse, so this calculation is very fast. If the matrix Kˆ is centered to produce Kˆ c, the terms (1nKˆ )x,(Kˆ 1n)x, and (1nKˆ 1n)x can all be calculated in O(n) time, preserving the speedup seen when Kˆ is not centered.

7.5.6 Theoretical results

Like most other dual-tree algorithms, we may use the cover tree and standard cover tree dual-tree traversal to show that, under certain assumptions on the dataset, the entire algo- rithm will take linear time. These results are in terms of the expansion constant; for more information on the expansion constant and the theoretical properties of the cover tree, see

Sections 5.1.1 and 5.2.

We do not need to introduce any new theory to bound the runtime of the sparse kernel matrix approximation method. Instead, runtime bound results for range search and kernel density estimation may be used to produce two different bounds.

First, we must show that sparse kernel matrix approximation may be viewed as an

139 instance of range search or kernel density estimation. The first pruning rule in Algorithm

25 prunes when

Kmin(Nq, Nr) ≤ . (67)

The second pruning rule (line 5) can only ever reduce the amount of work performed, so we may ignore that for the purposes of this discussion. If we may assume that K(·, ·) is monotonically decreasing, then we may define K −1(·) such that

−1 K (α) = kpi − p jk (68) if K(pi, p j) = α. Then, our solution (if we ignore the second pruning rule) is equivalent to range search with the range [0, α]. As in Section 7.2, if we assume that no column of

Kˆ has more than |S max| entries and |S max| is sufficiently small, and also that C (the ratio of the number of points in the α-expansion of S max to the number of points in S max) is also sufficiently small, then we obtain the following result.

Theorem 9. Given the above assumptions on the approximated kernel matrix K,ˆ the run- ning time of the sparse kernel matrix approximation dual-tree algorithm using cover trees and the standard cover tree dual-tree traversal is bounded by

8+β O(cr (N + it(Nq) + θ)), (69) with β defined as in Section 7.2.

Proof. After the assumptions above are applied, this follows directly from Corollary 2. 

Next, we can reduce the algorithm to kernel density estimation, this time ignoring the

first pruning rule (line 3). The pruning condition is now

Kmax(Nq, Nr) − Kmin(Nq, Nr) ≤ 2 (70)

140 and if we assume a shift-invariant, monotonically decreasing kernel this reduces to

K(dmin(Nq, Nr)) − K(dmax(Nq, Nr)) ≤ 2 (71) and this is nearly identical to the pruning rule for absolute-value approximate kernel density estimation in Algorithm 21—except the pruning is twice as tight. Now, if we introduce the same assumptions as we did for kernel density estimation, we may adapt the result.

In addition to being shift-invariant and monotonically decreasing, it is required that there exists some bandwidth h such that K(d) must be concave for d ∈ [0, h] and convex for d ∈ [h, ∞).

Theorem 10. Assume that the kernel function K(·, ·) satisfies the assumptions above. Then, given a dataset S r with expansion constant cr and using the approximate sparse kernel matrix calculation BaseCase() and Score() functions as given in Algorithms 26 and

25, respectively, with the traversal given in Algorithm 8, the running time to calculate the approximate sparse kernel matrix is bounded by

0  7+dlog2 ζ e  O cr (N + it(Tq) + θ) (72)

0 0 −1 −1 with ζ = −K (h)K (2) , it(Tq) defined as in Definition 10, and θ defined as in Lemma 4.

Proof. The discussion before the theorem clarified that the pruning of the Score() func- tion given in Algorithm 25 is at least as tight as approximate kernel density estimation with an approximation parameter of 2. Therefore, under the assumptions of the kernel, we may reuse the results of Theorem 6 and simplify the exponent on the expansion constant cr:

 0 −1 −1  0 −1 −1 8 + dlog2 −K (h)K (2)(2) e = 7 + dlog2 −K (h)K (2) e (73) and this gives the result. 

141 Table 17: Datasets used for kernel PCA experiments.

Dataset Short name n d dˆ cloud cloud 2048 10 2 sat-image sat 4495 37 3 winequality wineq 6497 11 3 ISOLET isolet 7797 617 15 Corel corel 37749 32 3 MNIST mnist 70000 784 10 Physics phy 150000 78 5 Covertype cover 581012 54 5

The primary difference between this bound and the bound for approximate kernel den-

−1 sity estimation is the smaller dependence on cr; note also that K (2) (found in the term ζ0) may be much smaller than K −1() (found in the original term ζ).

7.5.7 Empirical results for kernel PCA

To evaluate the efficiency of our proposed algorithm, we have tested it against other kernel

PCA algorithms using shift-invariant kernels. First, to show the algorithm’s efficiency with compactly supported kernels, we use the Epanechnikov kernel7 and perform exact kernel

PCA. Then, we demonstrate approximate kernel PCA using the Gaussian kernel. Our algorithms were implemented using mlpack [87] in C++ using kd-trees with a dual depth-

first traversal.

We compare with the MATLAB implementation of an improved Nystrom¨ approxima- tion scheme by Zhang et al. [189] and the mlpack standard kernel PCA implementation.

As suggested by Zhang et al. [189], we use m = 0.05n; that is, we use 5% of the points as

‘landmark points’ for the Nystrom¨ method.

Table 17 lists the datasets used in our experiments. They are mostly standard datasets available from the UCI repository [134]. Also listed is the target dimensionality dˆ after kernel PCA. 7The Epanechnikov kernel is not positive definite, but many finitely supported kernels are conditionally positive definite and thus useful for kernel PCA [188]. It is chosen here not for its performance with kernel PCA but merely as an example of a compactly supported kernel.

142 Table 18: Results for Epanechnikov kernel.

Data σ dt-kpca density nystrom¨ kpca cloud 50 0.05s 3.6% 0.09s 29.4s sat 30 0.48s 1.8% 0.59s 247s wineq 15 1.29s 7.1% 1.51s 717s isolet 10 49.87s 3.3% 5.03s 1524s corel 0.3 91.0s 6.1% 267s fail mnist 1000 2417s 0.9% fail fail phy 5 238s 1.1% fail fail cover 225 239s 0.1% fail fail

Table 18 shows results for the Epanechnikov kernel:

 2 2  Ke(x, y) = max 0, 1 − kx − yk /σ ) . (74)

σ was chosen to be small while still providing reasonable projections for kernel PCA.

The runtime of our method (‘dt-kpca’: dual-tree kernel PCA) is often highly dependent on the sparsity of the kernel matrix, which is a function of σ. Although speedups are lower on high-dimensional datasets (this is typical of tree-based algorithms), it should not be overlooked that our algorithm still outperforms other algorithms for large high-dimensional datasets; other algorithms fail entirely. Performance is poor on the ISOLET and MNIST datasets, likely because those datasets are high-dimensional, and kd-trees are known to

perform poorly in high dimensions [190]. In either case, when the bandwidth used is such

that the kernel matrix is sufficiently sparse, our algorithm can scale to over half a million

points without a problem; competing algorithms run out of memory.

A more common kernel choice, though, is the Gaussian kernel:

−kx−yk2/(2σ2) Kg(x, y) = e . (75)

The Gaussian kernel has infinite support. Thus, our algorithm will provide an approx- imate kernel matrix. In this situation, we can compare both runtimes and approximation

143 Table 19: Results for Gaussian kernel.

Data σ dt-kpca  density dt-kpca nystrom¨ nystrom¨ kpca avg. error avg. error cloud 25 0.12s 0.001 9.7% 3.48e-8 0.08s 1.83e-5 29.8s sat 12 0.72s 0.01 3.5% 1.69e-7 0.395s 3.57e-6 303.5s wineq 5 2.10s 0.001 10.6% 1.25e-8 0.92s 3.41e-6 686.7s isolet 3 59.04s 0.001 8.1% 2.05e-8 3.804s 1.43e-6 1550.2s corel 0.1 138.4s 0.005 8.7% 1.83e-8 288.8s 2.22e-7 fail mnist 350 2842.3s 0.001 3.0% 1.84e-8 fail n/a fail phy 0.75 330.4s 0.001 0.9% n/a fail n/a fail cover 50 470.3s 1e-5 0.1% n/a fail n/a fail

accuracy between our algorithm and the improved Nystrom¨ method. Table 19 shows re-

2 sults. The error measure is kKˆ − KkF/n , is the matrix norm of the difference between the matched eigenvectors; results closer to 0 indicate that the recovered approximate kernel

eigenvectors are closer to the true kernel eigenvectors. For larger datasets, standard kernel

PCA fails, and computing the errors of either our algorithm or the Nystrom¨ method is not

possible.

7.5.8 Extensions 7.5.8.1 Any positive definite kernel

Until this point we have only discussed shift-invariant kernels; however, this does not in-

clude many popular kernels such as the polynomial kernel (K(x, y) = (xT y)d), the linear kernel (K(x, y) = xT y), or the hyperbolic tangent kernel (K(x, y) = tanh(xT y)). It is known that one can bound kernel values for any positive definite kernel by exploiting the triangle inequality in the kernel space [79]:

q Kmax(Nq, Nr) = K(µq, µr) + λq K(µq, µr) (76) q + λr K(µq, µr) + λqλr, (77) q Kmin(Nq, Nr) = K(µq, µr) − λq K(µq, µr) (78) q − λr K(µq, µr) − λqλr. (79)

144 The use of these bounds requires that K(·, ·) can be evaluated between all µq and µr in the trees. Thus, each node’s centroid must be a point in the dataset. This is not true of kd-trees; they cannot be used in this situation. Cover trees [57] provide one alternative, though there are many others, of course; see Chapter 5 and Section 3.6.

The assumption that K(µq, µr) decreases to 0 as distance between µq and µr increases is not valid for non-shift-invariant kernels. Thus, the pruning strategy must be to find when Kmax(Nq, Nr) − Kmin(Nq, Nr) is small, and then approximate the kernel values. The Score() function given in Algorithm 25 can be easily adapted by removing the first prun- ing condition (Kmax(Nq, Nr) ≤ , in line 3). However, the main problem with this approach is that there is no guarantee that K is near sparse. Thus, how to avoid the O(N2) memory requirement for K is unclear, and left as a future challenge.

7.5.9 Application to other kernel methods

The dual-tree algorithm we have proposed is not specific to kernel PCA, although we have proposed it in that context. Most kernel-based algorithms require construction of the kernel matrix K, or at least many kernel evaluations between points in the dataset.

Some examples of these algorithms include kernel discriminant analysis [191], kernel

ICA [192], and support vector machines [161]. In settings where K is sparse, or close to sparse, our dual-tree algorithm is easy to adapt and could provide similar speedups.

One challenge here is that the thresholding strategy does not guarantee that the result- ing sparse kernel matrix approximation is positive definite; for those kernel methods that require a positive definite kernel matrix, then, the technique of Zhang and Genton [181] for thresholding while maintaining positive definiteness would need to be adapted.

7.5.10 Discussion

This section has described a dual-tree algorithm for quickly approximating a kernel ma- trix as sparse using thresholding, and has pointed the way toward several extensions and

145 improvements, and has also demonstrated the effectiveness of the algorithm for the kernel principal components analysis problem. However, it is worth taking a step back to consider those situations in which this algorithm is useful, because it is certainly not useful in all situations; the earlier discussion in Section 7.5.1 touched on this point.

The underlying assumption of the Nystrom¨ method is that the kernel matrix is low-rank;

that is, it is represented well by K = GGT where G ∈ RN×r and r, the rank parameter, is

far smaller than N. Another way to state the assumption of a low-rank K is to say that

there a few large eigenvalues of K, and most are small (or zero); alternately, the eigen-

spectrum decays quickly. This is also the underlying assumption of kernel PCA: the basis

of kernel matrix K can be accurately (though approximately) represented with only a few

eigenvectors of K.

But when a kernel matrix K is assembled with a small bandwidth, the matrix becomes

block diagonal. Consider the extreme case, where the bandwidth is 0 and the only nonzero

elements in K are on the diagonal. If the kernel is shift-invariant then K = I, which is

certainly not low-rank and cannot be approximated well by a few eigenvectors. Therefore,

it is reasonable to say that shrinking the bandwidth of the kernel K(·, ·) will cause the kernel

matrix K to become higher-rank and therefore the underlying assumption of kernel PCA is more and more violated.

Are sparsified kernel matrices relevant and useful, then? I would argue yes: as in

Section 7.5.1, small-bandwidth kernel machines can still be useful, though often at a per- formance penalty. Interest has resurfaced recently on small-bandwidth kernels, with the

MEKA algorithm [193] and ASKIT [194] garnering recent attention.

Nonetheless, it is important when considering this particular algorithm to know its lim- itations and assumptions.

146 7.6 Gaussian mixture model training

In this section, a single-tree algorithm for Gaussian mixture model training is presented.

This is a general restatement of the original algorithm by Moore [40], which was limited

to mrkd-trees; it has been generalized to fit into the tree-independent single-tree algorithm

framework, and thus after introducing the problem, we only need to present a BaseCase()

and Score() function.8

7.6.1 Problem introduction

The use of Gaussian mixture models is a common machine learning technique to represent

complex distributions. Gaussian mixture models are often used in speech recognition ap-

plications to represent the distribution of each phoneme. Although a better introduction to

GMMs, their uses, and their training is given by both Reynolds [196] and Bilmes [197], we

still re-introduce the model briefly and establish our notation. Interested readers who are

unfamiliar with GMMs should consult either of those two introductions.

Assume that we are given a dataset S = {p0, p1,..., pn}, and we wish to fit a Gaussian mixture model with m components to this data. Each component in our Gaussian mixture model θ is described as c j = (φ j, µ j, Σ j) for j ∈ [0, m), where φ j = P(c j|θ) is the mix-

ture weight of component j, µ j is the mean of component j, and Σ j is the covariance of component j. Then, the probability of a point arising from the GMM θ may be calculated

as

m X 1 T −1 −1/2 − (pi−µ j) Σ (pi−µ j) P(pi|θ) = ω j(2πkΣ jk) e 2 j . (80) j=1

We may also define the probability of a point pi arising from a particular component in the mixture as

1 T −1 −1/2 − (pi−µ j) Σ (pi−µ j) ai j := P(pi|c j, θ) = ω j(2πkΣ jk) e 2 j . (81)

8This section is similar to my presentation of this material in a technical report [195].

147 Then, we may use Bayes’ rule to define

ai jφ j ωi j := P(c j|pi, θ) = P . (82) k aikφk Often, GMMs are trained using an iterative procedure known as the EM (expectation

maximization) algorithm, which proceeds in two steps. In the first step, we will compute

the probability of each point pi ∈ S arising from each component (so, we calculate ai j =

P(pi|c j, θ) for all pi ∈ S and c j ∈ M). We can then calculate ωi j using the current parameters of the model θ and the already-calculated ai j. Then, in the second step, we update the parameters of the model θ according to the following rules:

1 Xn φ ← ω , (83) j n i j i=0 1 Xn µ ← ω p , (84) j Pn ω i j i i=0 i j i=0 1 Xn Σ ← ω (p − µ )(p − µ )T . (85) j Pn ω i j i j i j i=0 i j i=0

Implemented naively and exactly, we must calculate ai j for each pi and c j, giving O(nm) operations per iteration. We can do better with trees, although we will have to introduce

some level of approximation.

7.6.2 A generalized single-tree algorithm

We will build a tree, T , on the dataset S , and use this to approximate the values of ai j

and ωi j for each i and j. The basic observation is that for any pi ∈ S , there is likely to be

some component (or many components) c j such that P(pi|c j, θ) (and therefore P(c j|pi, θ))

is quite small. Because P(c j|pi, θ) never decays to 0 for finite kpi − µ jk, we may not avoid

any calculations of ωi j if we want to perform the exact EM algorithm. However, if we allow some amount of approximation, and can determine (for instance)

that ωi j < , then we can avoid empirically calculating ωi j and simply approximate it as

148 0. Further, if we can place a bound such that ζ −  < ωi j < ζ + , then we can simply approximate ωi j as ζ.

max Now, note that for some node Ni, we may calculate a j for some component j, which

p is an upper bound on the value of ai j for any point pi ∈ D (Ni):

M −1 max −1/2 dmin(Ni,µ j,Σ j ) a j = (2πkΣ jk) e (86)

In the equation above, dM(·, ·, Σ−1) is the Mahalanobis distance:

M −1 T −1 d (pi, p j, Σ ) = (pi − p j) Σ (pi − p j) (87)

M −1 and dmin(·, ·, Σ ) is a generalization of dmin(·, ·) to the Mahalanobis distance:

M −1 n T −1 p o dmin(Ni, p j, Σ ) = min (pi − p j) Σ (pi − p j) , pi ∈ Di . (88)

M We again assume that we can quickly calculate a lower bound on dmin(·, ·, ·) without checking every descendant point in the tree node. Now, we may use this lower bound to

max min calculate the upper bound a j . We may similarly calculate a lower bound a j :

M −1 min −1/2 dmax(Ni,µ j,Σ j ) a j = (2πkΣ jk) e (89)

M M with dmax(·, ·, ·) defined similarly to dmin(·, ·, ·). Finally, we can use Bayes’ rule to produce min max the bounds ω j and ω j (see Equation 82):

aminφ min j j ω j = min P max , (90) a j φ j + k, j ak φk max a φ j max j ω j = max P min . (91) a j φ j + k, j ak φk P Note that in each of these, we must approximate the term k aikφk, but we do not know

min max the exact values aik. Thus, for ω j , we must take the bound aik ≤ ak , except for when min max j = k, where we can use the tighter a j . Symmetric reasoning applies for the case of ω j .

149 Algorithm 27 GMM training BaseCase(). 0 1: Input: model θ = {(φ0, µ0, Σ0),..., (φm−1, µm−1, Σm−1)}, point pi, partial model θ = 0 0 0 0 t t {(µ0, Σ0),..., (µm−1, Σm−1)}, weight estimates (ω0, . . . , ωm−1) 2: Output: updated partial model θ0 3: {Some trees hold points in multiple places; ensure we don’t double-count.} 4: if point pi already visited then return

5: {Calculate all ai j.} 6: for all j in [0, m) do T −1 −1/2 −1/2(pi−µ j) Σ (pi−µ j) 7: ai j ← (2πkΣ jk) e j P 8: asum ← k aikφk

9: {Calculate all ωi j and update model.} 10: for all j in [0, m) do ai jφi 11: ωi j ← asum t t 12: ω j ← ω j + ωi j

13: µ j ← µ j + ωi j pi T 14: Σ j ← Σ j + ωi j(pi pi )

15: return ai j

Now, following the advice of Moore [40], we note that a decent pruning rule is to prune

max min t t if, for all components j, ω j − ω j < τω j, where ω j is a lower bound on the total weight that component j has.

Using that intuition, let us define the BaseCase() and Score() functions that will define our single-tree algorithm. During our single-tree algorithm, we will have the current model θ and a partial model θ0, which will hold unnormalized means and covariances of components. After the single-tree algorithm runs, we can normalize θ0 to produce the next model θ.

Algorithm 27 defines the BaseCase() function and Algorithm 28 defines the Score()

t t function. At the beginning of the traversal, we initialize the weight estimates ω0, . . . , ωm 0 0 0 0 0 all to 0 and the partial model θ = {(µ0, Σ0),..., (µm, Σm)} to 0. At the end of the traversal, we will generate our new model as follows, for each component j ∈ [0, m):

150 Algorithm 28 GMM training Score().

1: Input: model θ = {(φ0, µ0, Σ0),..., (φm−1, µm−1, Σm−1)}, node Ni, weight estimates t t (ω0, . . . , ωm−1), pruning tolerance τ 2: Output: ∞ if Ni can be pruned, score for recursion priority otherwise

3: {Calculate bounds on ai j for each component.} 4: for all j in [0, m) do M −1 min −1/2 −1/2(dmax(Ni,µ j,Σ j )) 5: a j ← (2πkΣ jk) e − M −1 max −1/2 1/2(dmin(Ni,µ j,Σ j )) 6: a j ← (2πkΣ jk) e

7: {Calculate bounds on ωi j for each component.} 8: for all j in [0, m) do aminφ min j j 9: ω j ← min P max a j φ j+ k, j ak φk amaxφ max j j 10: ω j ← max P min a j φ j+ k, j ak φk t 11: {Remove parent’s prediction for ω j contribution from this node.} 12: if Ni is not the root then p min 13: ω j ← the value of ω j calculated by the parent t t p p 14: ω j ← ω j − |D (Ni)|ω j 15: {Determine if we can prune.} max min t 16: if ω j − ω j < τω j for all j ∈ [0, m) then 17: {We can prune, so update µ j and Σ j.} 18: for all j in [0, m) do avg max min 19: ω j ← 1/2(ω j + ω j ) t t p avg 20: ω j ← ω j + |D (Ni)|ω j 21: ci ← centroid of Ni avg 22: µ j ← µ j + ω j ci avg T 23: Σ j ← Σ j + ω j cici 24: return ∞ t 25: {Can’t prune; update ω j and return.} 26: for all j ∈ [0, m) do t t p min 27: ω j ← ω j + |D (Ni)|ω j max 28: return 1/(max j∈[0,m) ω j )

1 φ ← ωt (92) j n j 1 0 µ j ← t µ j (93) ω j

1 0 Σ j ← t Σ j (94) ω j

After this, the array of φ j values will need to be normalized to sum to 1; this is necessary

151 t because each ω j may be approximate. To better understand the algorithm, let us first consider the BaseCase() function.

0 Given some point pi, our goal is to update the partial model θ with the contribution of pi. Therefore, we first calculate ai j for every component (φ j, µ j, Σ j). This allows us to then

t calculate ωi j for each component, and then we may update ω j (our lower bound on the 0 0 total weight of component j) and our partial model components µ j and Σ j. Note that in the BaseCase() function there is no approximation; if we were to call BaseCase() with

0 every point in the dataset, we would end up with µ j equal to the result of Equation 84 and

0 t Σ j equal to the result of Equation 85. In addition, ω j would be an exact lower bound. Now, let us consider Score(), which is where the approximation happens. When we visit a node Ni, our goal is to determine whether or not we can approximate the contribution

max min t of all of the descendant points of Ni at once. As stated earlier, we prune if ω j −ω j < τω j

max min for all components j. Thus, the Score() function must calculate ω j and ω j (lines 4–

t 10) and make sure ω j is updated.

t t Keeping ω j correct requires a bit of book-keeping. Remember that ω j is a lower bound

P min on i ωi j; we maintain this bound by using the lower bound ω j for each descendant point of a particular node. Therefore, when we visit some node Ni, we must remove the parent’s

min lower bound before adding the lower bound produced with the ω j value for Ni (lines 12–14).

Because we have defined our single-tree algorithm as only a BaseCase() and Score() function, we are left with a generic algorithm. We may use any tree and any traversal (so long as they satisfy the definitions given in Chapter 3).

7.6.3 Possible improvements and extensions

Although we have demonstrated how GMM training can be performed approximately and efficiently with trees, there are still several extensions and improvements that may be per- formed but are not detailed here:

152 • A better type of approximation. We are only performing relative approximation using the same heuristic as introduced by Moore [40]. But other types of approximation

exist: absolute-value approximation [28], or budgeting [65].

• Provable approximation bounds. In this algorithm, the user selects τ to control the approximation, but there is no derived relationship between τ and the quality of the

results. A better user-tunable parameter might be something directly related to the

quality of the results; for instance, the user might place a bound on the total mean

squared error allowed in µ j and Σ j for each j.

• Provable worst-case runtime bounds. Using cover trees, a relationship between the properties of the dataset and the runtime may be derived, similar to other tree-based

algorithms which use the cover tree [57, 135].

min max • Caching during the traversal. During the traversal, quantities such as a j , a j ,

min max ω j , and ω j for a node Ni will have some geometric relation to those quantities

as calculated by the parent of Ni. These relations could potentially be exploited in order to prune a node without evaluating those quantities. This type of strategy is

already in use for nearest neighbor search and max-kernel search in mlpack.

Nonetheless, the algorithm, as we have presented it, is generic and flexible and does indeed solve the Gaussian mixture model training problem approximately and efficiently.

Our formulation, for mrkd-trees, will reduce to Moore’s formulation [40], and should per- form comparably.

7.7 Max-kernel search

This section introduces the problem of max-kernel search, which is a generalized form of similarity search. Until the introduction of this algorithm, there has been no fast, exact algorithm for generalized similarity search (when defined as max-kernel search). Trees can

153 be used for a fast, exact solution, though; therefore, a single-tree and dual-tree algorithm

for fast max-kernel search is developed and introduced.

Although the algorithms in this thesis are generally meant to be tree-independent, un-

fortunately this particular problem requires some restrictions on the tree. Here, we present

an algorithm for cover trees, although it is easily generalizable to similar ball trees, and less

easily generalizable to other types of trees (such as cone trees [45]).

This section is a re-presentation of work by myself and Parikshit Ram in SIAM Data

Mining 2013 [46] and a later extension of that work [79].

7.7.1 Introduction to max-kernel search

A particularly ubiquitous problem in computer science is that of max-kernel search: for a given set S r of N objects (the reference set), a similarity function K(·, ·), and a query object

pq, find the object pr ∈ R such that

pr = argmax K(pq, p). (95) p∈S r

Often, max-kernel search is performed for a large set of query objects S q. The most simple approach to this general problem is a linear scan over all the objects in S r. However, the computational cost of this approach scales linearly with the size of the reference set for a single query, making it computationally prohibitive for large datasets. If

2 |S q| = |S r| = O(N), then this approach scales as O(N ); thus, the approach quickly becomes infeasible for large N.

In our setting we restrict the similarity function K(·, ·) to be a Mercer kernel. As we will see, this is not very restrictive. A Mercer kernel is a positive semidefinite kernel function; these can be expressed as an inner product in some Hilbert space H:

K(x, y) = hϕ(x), ϕ(y)iH . (96)

Often, in practice, the space H is unknown; thus, the mapping of x to H (ϕ(x)) for any

154 Figure 35: Matching images: an example of max-kernel search. object x is not known. Fortunately, we do not need to know ϕ because of the renowned

“kernel trick”—the ability to evaluate inner products between any pair of objects in the space H without requiring the explicit representations of those objects.

Because Mercer kernels do not require explicit representations in H, they are ubiq- uitous and can be devised for any new class of objects, such as images and documents

(which can be considered as points in Rd), to more abstract objects like strings (protein sequences [198], text), graphs (molecules [199], brain neuron activation paths), and time series (music, financial data) [200].

As we mentioned, the max-kernel search problem appears everywhere in computer sci- ence and related applications. The widely studied problem of image matching in computer vision is an instance of max-kernel search (Figure 35 presents a simple example). The size of the image sets is continually growing, rendering linear scan computationally pro- hibitive. Max-kernel search also appears in maximum a posteriori inference [81] as well

155 as collaborative filtering via the widely successful matrix factorization framework [201].

This matrix factorization obtains an accurate representation of the data in terms of user

vectors and item vectors, and the desired result—the user preference of an item—is the in-

ner product between the two respective vectors (this is a Mercer kernel). With ever-scaling

item sets and millions of users [202], efficient retrieval of recommendations (which is also

max-kernel search) is critical for real-world systems.

Finding similar protein/DNA sequences for a query sequence from a large set of bio-

logical sequences is also an instance of max-kernel search with biological sequences as the

objects and various domain-specific kernels (for example, the p-spectrum kernel [198], the

maximal segment match score [203] and the Smith-Waterman alignment score [204]9).

An efficient max-kernel search algorithm can be directly applied to biological sequence matching. The field of document retrieval—and information retrieval in general—can be easily seen to be an instance of max-kernel search: for some given similarity function, return the document that is most similar to the query. Spell checking systems are an inter- esting corollary of information retrieval and also an instance of max-kernel search [205].

A special case of max-kernel search is the problem of nearest neighbor search in metric spaces. In this problem, the closest object to the query with respect to a distance metric is sought. The requirement of a distance metric allows numerous efficient methods for exact

and approximate nearest neighbor search, including searches based on space partitioning

trees [33, 31, 51, 56, 67, 206] and other types of data structures [150, 156, 132, 157].

However, none of these numerous algorithms are suitable for solving generalized max-

kernel search, which is the problem we are considering.

Given the wide applicability of kernels, it is desirable to have a general method for

efficient max-kernel search that is applicable to any class of objects and corresponding

Mercer kernels. To this end, we present a method to accelerate exact max-kernel search for

any general class of objects and corresponding Mercer kernels. The specific contributions

9These functions provide matching scores for pairs of sequences and can easily be shown to be Mercer kernels.

156 of this section are listed below.

• The first concept for characterizing the hardness of max-kernel search in terms of the concentration of the kernel values in any direction: the directional concentration.

• An single-tree algorithm to index any set of N objects directly in the Hilbert space defined by the kernel without requiring explicit representations of the objects in this

space.

• Novel single-tree and dual-tree branch-and-bound algorithms on the Hilbert space indexing, which can achieve orders of magnitude speedups over linear search.

• Value-approximate, order-approximate, and rank-approximate extensions to the ex- act max-kernel search algorithms.

• An O(N) runtime bound for exact max-kernel search for O(N) queries with our pro- posed dual-tree algorithm for any Mercer kernel.

7.7.2 Related work

Although there are existing techniques for max-kernel search, almost all of them solve the approximate search problem under restricted settings. The most common assumption is that the objects are in some metric space and the kernel function is shift-invariant—a monotonic function of the distance between the two objects (K(p, q) = f (kp − qk)). One example is the Gaussian radial basis function (RBF) kernel.

For shift-invariant kernels, a tree-based recursive algorithm has been shown to scale to large datasets for maximum a posteriori inference [81]. However, a shift-invariant kernel is only applicable to objects already represented in some metric space. In fact, max-kernel search with a shift-invariant kernel is equivalent to nearest neighbor search in the input space itself, and can be solved directly using existing methods for nearest neighbor search—an easier and better-studied problem. For shift-invariant kernels, the points can be explicitly embedded in some low-dimensional metric space such that the inner product

between these representations of any two points approximates their corresponding kernel value [207]. This step takes O(Nd^2) time for S_r ⊂ R^d and can be followed by nearest neighbor search on these representations to obtain results for max-kernel search in the setting of a shift-invariant kernel.

This technique of obtaining the explicit embedding of objects in some low-dimensional

metric space while approximating the kernel function can also be applied to dot-product

kernels [208]. Dot-product kernels produce kernel values between any pair of points by

applying a monotonic non-decreasing function to their mutual dot-product (K(x, y) = f(⟨x, y⟩)). Linear and polynomial kernels are simple examples of dot-product kernels.

However, this is only applicable to objects which already are represented in some vector

space which allows the computation of the dot-products.

Locality-sensitive hashing (LSH) [155] is widely used for image matching, but only with explicitly representable kernel functions that admit a locality sensitive hashing function [209]10. Kulis and Grauman [210] apply LSH to solve max-kernel search approximately for normalized kernels without any explicit representation. Normalized kernels restrict the self-similarity value to a constant (K(x, x) = K(y, y) ∀ x, y ∈ S). The preprocessing time for this locality sensitive hashing is O(p^3) and a single query requires O(p) kernel evaluations. Here p controls the accuracy of the algorithm—larger p implies better approximation; the suggested value for p is O(√N), with no rigorous approximation guarantees.

A recent work [45] proposed the first technique for exact max-kernel search using a tree-based branch-and-bound algorithm, but it is restricted only to linear kernels and the algorithm does not have any runtime guarantees. The authors suggest a method for extending their algorithm to non-representable spaces with general Mercer kernels, but this requires O(N^2) preprocessing time.

There has also been recent interest in similarity search with Bregman divergences

10The Gaussian and cosine kernels admit locality sensitive hashing functions with some modifications.

[211], which are non-metrics. Bregman divergences are not directly comparable to kernels, though; they are harder to work with because they are not symmetric like kernels, and are also not as generally applicable to any class of objects as kernel functions. In this work, we do not address this form of similarity search; Bregman divergences are not Mercer kernels.

7.7.3 Unnormalized kernels

Some kernels used in machine learning are normalized (K(x, x) = K(y, y) ∀ x, y); examples

include the Gaussian and the cosine kernel. As we have discussed, there already exist

techniques to solve the max-kernel search problem with normalized kernels.

However, many common kernels like the linear kernel (K(x, y) = x^T y) and the polynomial kernel (K(x, y) = (α + x^T y)^d for some offset α and degree d) are not normalized.

Many applications require unnormalized kernels:

• In the retrieval of recommendations, the normalized linear kernel will result in inaccurate user-item preference scores.

• In biological sequence matching with domain-specific matching functions, K(x, x) implicitly corresponds to the presence of genetically valuable letters (such as W, H,

or P) or not valuable letters (such as X)11 in the sequence x. This crucial information

is lost in kernel normalization.

Therefore, we consider unnormalized kernels. No existing techniques consider unnormalized kernels, and thus no existing techniques can be directly applied to every instance of max-kernel search with general Mercer kernels and any class of objects (Equation

95). Moreover, almost all existing techniques resort to approximate solutions. Not only do our algorithms work for general Mercer kernels instead of just normalized or shift-invariant kernels, but they also provide exact solutions; in addition, extensions to our algorithms for

11See the score matrix for letter pairs in protein sequences at http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt.

approximation are trivial, and for both the exact and approximate algorithms, we can give

asymptotic preprocessing and runtime bounds, as well as rigorous accuracy guarantees for

approximate max-kernel searches.

7.7.4 Analysis of the problem

Remember that a Mercer kernel implies that the kernel value for a pair of objects (x, y) corresponds to an inner product between the vector representations of the objects (ϕ(x), ϕ(y)) in some Hilbert space H (see Equation 96). Hence, every Mercer kernel induces the following metric in H:

d_K(x, y) = ‖ϕ(x) − ϕ(y)‖_H = √(K(x, x) + K(y, y) − 2K(x, y)).   (97)
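To make Equation 97 concrete, the following minimal C++ sketch computes the induced metric using only kernel evaluations. It is illustrative only; the LinearKernel functor and the Evaluate() interface are assumptions made for the example, not a prescribed API.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// A simple linear kernel on dense vectors, used only for illustration.
struct LinearKernel {
  double Evaluate(const std::vector<double>& x, const std::vector<double>& y) const {
    double d = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
      d += x[i] * y[i];
    return d;
  }
};

// Induced metric d_K(x, y) = sqrt(K(x,x) + K(y,y) - 2 K(x,y)) (Equation 97).
// Works for any Mercer kernel; only kernel evaluations are required.
template<typename KernelType, typename ObjectType>
double InducedMetric(const KernelType& kernel, const ObjectType& x, const ObjectType& y) {
  const double kxx = kernel.Evaluate(x, x);
  const double kyy = kernel.Evaluate(y, y);
  const double kxy = kernel.Evaluate(x, y);
  // Guard against tiny negative values caused by floating-point roundoff.
  return std::sqrt(std::max(0.0, kxx + kyy - 2.0 * kxy));
}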

7.7.4.1 Reduction to nearest neighbor search

In situations where max-kernel search can be reduced to nearest neighbor search in H, nearest neighbor search methods for general metrics [157] can be used for efficient max-kernel search. This reduction is possible for normalized kernels. The nearest neighbor for a query p_q in H,

argmin_{p_r ∈ S_r} d_K(p_q, p_r),   (98)

is the max-kernel candidate (Equation 95) if K(·, ·) is a normalized kernel. To see this, note that for normalized kernels, K(p_q, p_q) = K(p_r, p_r), and thus

d_K(p_q, p_r) = √(2c − 2K(p_q, p_r)),   (99)

where the normalization constant c = K(p_q, p_q) = K(p_r, p_r) does not depend on p_q or p_r. Therefore, d_K(p_q, p_r) decreases as K(p_q, p_r) increases, and so d_K(·, ·) is minimized when K(·, ·) is maximized. However, the two problems can have very different

answers with unnormalized kernels, because d_K(p_q, p_r) is not guaranteed to decrease as

K(pq, pr) increases. As we discussed earlier in Section 7.7.3, unnormalized kernels are an important class of kernels that we wish to consider. Thus, although a reduction to nearest neighbor search is sometimes possible, it is only under the strict condition of a normalized kernel.

7.7.4.2 Hardness of max-kernel search

Even if max-kernel search can be reduced to nearest neighbor search, the problem is still hard (Ω(N) for a single query) without any assumption on the structure of the metric or the dataset S r. However, better results are possible when assumptions are placed on the expansion constant c of the dataset (see Section 5.2).

The expansion constant effectively bounds the number of points that could be sitting on

the surface of a hyper-sphere of any radius around any point. If c is high, nearest neighbor

search could be expensive. A value of c ∼ Ω(N) implies that the search cannot be better

than linear scan asymptotically. Under the assumption of a bounded expansion constant,

though, nearest-neighbor search methods with sublinear or logarithmic theoretical runtime

guarantees have been presented [131, 57, 133].

Now, we extend the concept of the expansion constant in order to characterize the difficulty of max-kernel search.

For a given query p_q and Mercer kernel K(·, ·), the kernel values are proportional to the lengths of the projections in the direction of ϕ(p_q) in H. While the hardness of nearest-neighbor search depends on how concentrated the surfaces of spheres are (as characterized by the expansion constant), the hardness of max-kernel search should depend on the distribution of the projections in the direction of the query. This distribution can be characterized using the distribution of points in terms of distances:

If two points are close in distance, then their projections in any direction are

close as well. However, if two points have close projections in a direction, it is

not necessary that the points themselves are close to each other.

Figure 36: Concentration of projections. (a) Projection interval set. (b) Low value of γ. (c) High value of γ.

It is to characterize this reverse relationship between points (closeness in projections to closeness in distances) that we present a new notion of concentration in any direction:

Definition 13. Let K(x, y) = ⟨ϕ(x), ϕ(y)⟩_H be a Mercer kernel, let d_K(x, y) be the induced metric from K (Equation 97), and let B_S(p, ∆) denote the closed ball of radius ∆ around a point p in H. Also, let

I_S(v, [a, b]) = { r ∈ S : ⟨v, ϕ(r)⟩_H ∈ [a, b] }   (100)

be the set of objects in S whose projections onto a direction v in H lie in the interval [a, b] along v. Then, the directional concentration constant of S with respect to the Mercer kernel K(·, ·) is defined as the smallest γ ≥ 1 such that for all u ∈ H with ‖u‖_H = 1, all p ∈ S, and all ∆ > 0, the set

I_S(u, [⟨u, ϕ(p)⟩_H − ∆, ⟨u, ϕ(p)⟩_H + ∆])

can be covered by at most γ balls of radius ∆.

The directional concentration constant is not a measure of the number of points projected into a small interval, but rather a measure of the number of "patches" of the data in a particular direction. For a set of points that are close in projections, there are at most γ subsets of points that are also close in distances. Consider the set of points B = I_S(q, [a, b]) projected onto an interval in some direction (Figure 36a). A high value of γ implies that the number of points in B is high but the points are spread out and the number of balls

(with diameter |b − a|) required to cover all these points is high as well—with each point

possibly requiring an individual ball. Figure 36c provides one such example. A low value

of γ implies that either B has a small size or the size of B is high and B can be covered with

a small number of balls (of diameter |b − a|). Figure 36b is an example of a set with low

γ. The directional concentration constant bounds the number of balls of a particular radius

required to index points that project onto an interval of size twice the radius.

7.7.5 Indexing points in H

We already know that trees are useful for nearest neighbor search—they have been a primary motivating structure for everything in this thesis—so it should come as no surprise

mary motivating structure for everything in this thesis—so it should come as no surprise

that we may also use trees to perform max-kernel search. However, there are problems we

must first overcome.

The first problem, which is the lack of a distance metric (remember, K(·, ·) does not satisfy the triangle inequality), is addressed by the induced metric d_K(·, ·) in the space H. However, we now have another problem. The standard procedure for constructing kd-trees depends on axis-aligned splits along the mean (or median) of a subset of the data in a particular dimension. In H this does not make sense because we do not have access to the mapping ϕ(·). Thus, kd-trees—and any tree that requires knowledge of ϕ(·)—cannot be used to index points in H. This includes random projection trees [212] and principal-axis trees [55]12.

Metric trees [213] are a type of space tree that does not require axis-aligned splits. Instead, during construction, metric trees calculate a mean µ for each node [125]. In general,

µ is not a point in the dataset the tree is built on. In our situation, we cannot calculate µ because it lies in H and we do not have access to ϕ(·). However, we can use the kernel trick to avoid calculating µ and evaluate kernels involving µ (assume µ is the mean of node

12The explicit embedding techniques mentioned earlier [207, 208] could be used to approximate the mapping ϕ(·) and allow kd-trees (and other types of trees) to be used. However, we do not consider that approach in this work.

N, and D^p(N) refers to the set of descendant points of N):

K(q, µ) = (1 / |D^p(N)|) Σ_{r ∈ D^p(N)} K(q, r).   (101)

This type of trick can also be applied to ball trees and some other similar tree structures.

However, it is clear that a single kernel evaluation against the mean now becomes numerous kernel evaluations, making the use of metric or ball trees computationally prohibitive in our

setting, for both tree construction in H and max-kernel search.
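As a rough sketch of the cost of Equation 101 (illustrative only; the KernelType functor and the descendants container are hypothetical stand-ins for D^p(N)), evaluating a kernel against the implicit mean requires one kernel evaluation per descendant point:

#include <vector>

// Evaluate K(q, mu) for a node whose mean mu lies in H and cannot be computed
// explicitly, using the kernel trick of Equation 101.
template<typename KernelType, typename ObjectType>
double KernelAgainstMean(const KernelType& kernel,
                         const ObjectType& q,
                         const std::vector<ObjectType>& descendants) {
  double sum = 0.0;
  for (const ObjectType& r : descendants)
    sum += kernel.Evaluate(q, r);  // One kernel evaluation per descendant point.
  return sum / descendants.size();
}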

Therefore, we consider the cover tree [57], a recently formulated tree that bears some

similarity to the ball tree. The cover tree has been detailed at length throughout this thesis;

see Sections 3.6.5 or 5.2 for details.

The cover tree has the interesting property that explicit object representations are unnecessary for tree construction: the tree can be built entirely with only knowledge of the

metric function dK (·, ·) evaluated on points in the dataset. Each node Ni in the cover tree

represents a ball in H with a known radius λi and its center µi is a point in the dataset.

Thus, we can evaluate the minimum distance between two nodes Nq and Nr quickly:

d_min(N_q, N_r) = d_K(µ_q, µ_r) − λ_q − λ_r.   (102)

Our knowledge of K(·, ·) and its induced metric dK (·, ·) in H, then, is entirely sufficient to construct a cover tree with no computational penalty. In addition to this clear advantage,

the time complexities of building and querying a cover tree have been analyzed extensively

(see Section 5.2 and [57, 133, 135]), whereas kd-trees, metric trees, and other similar structures have been analyzed only under very limited settings [32].
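For illustration, Equation 102 can be evaluated with nothing more than the induced metric and two cached node quantities. The NodeInfo structure and the callable dK below are hypothetical placeholders, not cover tree internals:

#include <algorithm>
#include <cstddef>

// A minimal stand-in for a cover tree node: its center is a dataset point and
// lambda is the furthest descendant distance (both known after building).
struct NodeInfo {
  std::size_t center;  // Index of the node's center point in the dataset.
  double lambda;       // Furthest descendant distance of the node.
};

// Minimum possible distance in H between descendants of two nodes (Equation 102),
// clamped at zero since a negative value just means the bounding balls overlap.
// 'dK' is any callable implementing the induced metric of Equation 97 on indices.
template<typename InducedMetricType>
double MinNodeDistance(const InducedMetricType& dK,
                       const NodeInfo& nq, const NodeInfo& nr) {
  return std::max(0.0, dK(nq.center, nr.center) - nq.lambda - nr.lambda);
}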

Although we have presented the cover tree as the best tree option, it is not the only option for a choice of tree. What we require of a tree structure is that it can be built only with kernel evaluations between points in the dataset (or distance evaluations)13. Therefore,

13Earlier, we mentioned kernels that work between abstract objects. For our purposes, it does not make a difference if the kernel works on abstract objects or points, so for simplicity we use the term ‘points’ instead of ‘objects’ although the two are essentially interchangeable.

Figure 37: Point-to-node max-kernel upper bound.

we use the notation introduced in Chapter 3 and summarized in Table 1.

An important note is that for cover trees, the center of node Ni, denoted µi, is simply the single point held in the cover tree node, pi (that is, for a node Ni, Pi = {pi}).

7.7.6 Bounding the kernel value

To construct a tree-based algorithm that prunes certain subtrees, we must be able to determine the maximum kernel value possible between a point and any descendant point of a node.

Theorem 11. Given a space tree node N_i with center µ_i = ϕ(p_i) and furthest descendant distance λ_i, the maximum kernel function value between some point p_q and any point in D_i^p is bounded by the function

K_max(p_q, N_i) = K(p_q, p_i) + λ_i √(K(p_q, p_q)).   (103)

Proof. Suppose that p* is the best possible match in D_i^p for p_q, and let u be a unit vector in the direction of the line joining ϕ(p_i) to ϕ(p*) in H. Then,

ϕ(p*) = ϕ(p_i) + ∆u,   (104)

where ∆ = d_K(µ_i, p*) is the distance between ϕ(p_i) and the best possible match ϕ(p*) (see Figure 37). Then, we have the following:

K(p_q, p*) = ⟨ϕ(p_q), ϕ(p*)⟩_H
           = ⟨ϕ(p_q), ϕ(p_i) + ∆u⟩_H
           = ⟨ϕ(p_q), ϕ(p_i)⟩_H + ⟨ϕ(p_q), ∆u⟩_H
           ≤ ⟨ϕ(p_q), ϕ(p_i)⟩_H + ∆‖ϕ(p_q)‖_H,   (105)

where the inequality step follows from the Cauchy-Schwarz inequality (⟨x, y⟩ ≤ ‖x‖ · ‖y‖) and the fact that ‖u‖_H = 1. From the definition of the kernel function, Equation 105 gives

K(p_q, p*) ≤ K(p_q, p_i) + ∆√(K(p_q, p_q)).   (106)

We can bound ∆ by noting that the distance d_K(·, ·) between the center of N_i and any point in D_i^p is less than or equal to λ_i. We call our bound K_max(p_q, N_i), and the statement of the theorem follows. □
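A minimal sketch of the bound of Equation 103 follows; it assumes only a kernel with an Evaluate() method and is not mlpack's implementation. In practice K(p_q, p_q) is fixed for a given query and would be cached rather than recomputed.

#include <cmath>

// Upper bound of Theorem 11:
//   K_max(p_q, N_i) = K(p_q, p_i) + lambda_i * sqrt(K(p_q, p_q)).
template<typename KernelType, typename ObjectType>
double PointToNodeMaxKernel(const KernelType& kernel,
                            const ObjectType& query,
                            const ObjectType& nodeCenter,
                            const double furthestDescendantDistance) {
  const double selfKernel = kernel.Evaluate(query, query);        // K(p_q, p_q)
  const double centerKernel = kernel.Evaluate(query, nodeCenter); // K(p_q, p_i)
  return centerKernel + furthestDescendantDistance * std::sqrt(selfKernel);
}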

In addition, to construct a dual-tree algorithm, it is useful to extend the maximum point-to-node kernel value of Theorem 11 to the node-to-node setting.

Theorem 12. Given two space tree nodes N_q and N_r with centers µ_q = ϕ(p_q) and µ_r = ϕ(p_r), respectively, the maximum kernel function value between any point in D_q^p and any point in D_r^p is bounded by the function

K_max(N_q, N_r) = K(p_q, p_r) + λ_q √(K(p_r, p_r)) + λ_r √(K(p_q, p_q)) + λ_q λ_r.   (107)

Proof. Suppose that p_q* ∈ D_q^p and p_r* ∈ D_r^p are the best possible matches between N_q and N_r (see Figure 38 for an illustration); that is,

K(p_q*, p_r*) = max_{p_q ∈ D_q^p, p_r ∈ D_r^p} K(p_q, p_r).   (108)

Figure 38: Node-to-node max-kernel upper bound.

Now, let u_q be a unit vector in the direction of the line joining ϕ(p_q) to ϕ(p_q*) in H, and let u_r be a unit vector in the direction of the line joining ϕ(p_r) to ϕ(p_r*) in H. Then let ∆_q = d_K(p_q, p_q*) and ∆_r = d_K(p_r, p_r*). We can use similar reasoning as in the proof of Theorem 11 to show the following:

K(p_q*, p_r*) = ⟨ϕ(p_q*), ϕ(p_r*)⟩_H
             = ⟨ϕ(p_q) + ∆_q u_q, ϕ(p_r) + ∆_r u_r⟩_H
             = ⟨ϕ(p_q) + ∆_q u_q, ϕ(p_r)⟩_H + ⟨ϕ(p_q) + ∆_q u_q, ∆_r u_r⟩_H
             = ⟨ϕ(p_q), ϕ(p_r)⟩_H + ⟨∆_q u_q, ϕ(p_r)⟩_H + ⟨ϕ(p_q), ∆_r u_r⟩_H + ⟨∆_q u_q, ∆_r u_r⟩_H
             ≤ ⟨ϕ(p_q), ϕ(p_r)⟩_H + ∆_q ‖ϕ(p_r)‖_H + ∆_r ‖ϕ(p_q)‖_H + ∆_q ∆_r,   (109)

where again the inequality step follows from the Cauchy-Schwarz inequality. We can then substitute in the kernel functions to obtain

K(p_q*, p_r*) ≤ K(p_q, p_r) + ∆_q √(K(p_r, p_r)) + ∆_r √(K(p_q, p_q)) + ∆_q ∆_r.   (110)

Then, as with the point-to-node case, we can bound ∆_q by λ_q and ∆_r by λ_r. Call the bound K_max(N_q, N_r), and the statement of the theorem follows. □
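A corresponding sketch for the node-to-node bound of Equation 107 is below (again illustrative only; the self-kernel terms would normally be cached at tree construction time):

#include <cmath>

// Upper bound of Theorem 12:
//   K_max(N_q, N_r) = K(p_q, p_r) + lambda_q * sqrt(K(p_r, p_r))
//                                 + lambda_r * sqrt(K(p_q, p_q)) + lambda_q * lambda_r.
template<typename KernelType, typename ObjectType>
double NodeToNodeMaxKernel(const KernelType& kernel,
                           const ObjectType& queryCenter, const double lambdaQ,
                           const ObjectType& refCenter, const double lambdaR) {
  const double centerKernel = kernel.Evaluate(queryCenter, refCenter);
  const double queryNorm = std::sqrt(kernel.Evaluate(queryCenter, queryCenter));
  const double refNorm = std::sqrt(kernel.Evaluate(refCenter, refCenter));
  return centerKernel + lambdaQ * refNorm + lambdaR * queryNorm + lambdaQ * lambdaR;
}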

For normalized kernels (K(x, x) = 1 ∀x)14, all the points are on the surface of a hypersphere in H. In this case, the above bounds in Theorems 11 and 12 are both correct but possibly loose. Therefore, we can present tighter bounds specifically for this condition.

14Earlier we defined normalized kernels as K(x, x) = c for some constant c, but here for simplicity we consider only c = 1. Adapting the proof and bounds for c ≠ 1 is straightforward.

Theorem 13. Consider a kernel K such that K(x, x) = 1 ∀ x, and a space tree node N_i with center µ_i = ϕ(p_i) and furthest descendant distance λ_i. Define the following quantities:

α_i = 1 − (1/2) λ_i^2,   (111)
β_i = λ_i √(1 − (1/4) λ_i^2).   (112)

Then, the maximum kernel function value between some point p_q and any point in D_i^p is bounded from above by the function

K^n_max(p_q, N_i) = K(p_q, p_i) α_i + β_i √(1 − K(p_q, p_i)^2)   if K(p_q, p_i) ≤ α_i,
                  = 1   otherwise.   (113)

Proof. Since p_q and all the points of D_i^p are sitting on the surface of a hypersphere in H, K(p_q, p) denotes the cosine of the angle made by ϕ(p_q) and ϕ(p) at the origin. If we first consider the case where p_q lies within the ball bounding space tree node N_i (that is, if d_K(p_q, p_i) < λ_i), it is clear that the maximum possible kernel evaluation should be 1, because there could exist a point in D_i^p whose angle to p_q is 0. We can restate our condition as a condition on K(p_q, p_i) instead of d_K(p_q, p_i):

d_K(p_q, p_i) < λ_i,
√(K(p_q, p_q) + K(p_i, p_i) − 2K(p_q, p_i)) < λ_i,
K(p_q, p_i) > 1 − (1/2) λ_i^2,
K(p_q, p_i) > α_i.

Now, for the other case, let cos θ_{p_q p_i} = K(p_q, p_i) and p* = argmax_{p ∈ D_i^p} K(p_q, p). Let θ_{p_i p*} be the angle between ϕ(p_i) and ϕ(p*) at the origin, let θ_{p_q p*} be the angle between ϕ(p_q) and ϕ(p*) at the origin, and let θ_{p_q p_i} be the angle between ϕ(p_q) and ϕ(p_i) at the origin. Then,

K(p_q, p*) = cos θ_{p_q p*}
           ≤ cos({θ_{p_q p_i} − θ_{p_i p*}}_+).

We know that d_K(p_i, p*) ≤ λ_i, and also that d_K(p_i, p*) = √(2 − 2 cos θ_{p_i p*}). Therefore, cos θ_{p_i p*} ≥ 1 − (1/2) λ_i^2. This means

θ_{p_i p*} ≤ |cos^{-1}(1 − (1/2) λ_i^2)|.   (114)

Combining this with the inequality above, we get

K(p_q, p*) ≤ cos([θ_{p_q p_i} − θ_{p_i p*}]_+).   (115)

Now, if we substitute |cos^{-1}(1 − (1/2) λ_i^2)|, the largest possible value for θ_{p_i p*}, into Equation 115, we obtain the following:

K_max(p_q, N_i) ≤ cos([θ_{p_q p_i} − cos^{-1}(1 − (1/2) λ_i^2)]_+),

which can be reduced to the statement of the theorem by the use of trigonometric identities. Combine with the case where K(p_q, p_i) > α_i, and call that bound K^n_max(p_q, N_i). Then, the theorem holds. □
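For a normalized kernel, Equation 113 is cheap to evaluate directly. The sketch below assumes K(x, x) = 1 for all x and uses a hypothetical kernel interface; it is illustrative only.

#include <algorithm>
#include <cmath>

// Tighter point-to-node bound of Theorem 13; valid only when K(x, x) = 1 for all x.
template<typename KernelType, typename ObjectType>
double NormalizedPointToNodeMaxKernel(const KernelType& kernel,
                                      const ObjectType& query,
                                      const ObjectType& nodeCenter,
                                      const double lambda) {
  const double alpha = 1.0 - 0.5 * lambda * lambda;                     // Equation 111.
  const double beta = lambda * std::sqrt(1.0 - 0.25 * lambda * lambda); // Equation 112.
  const double k = kernel.Evaluate(query, nodeCenter);  // Cosine of the angle at the origin.
  if (k > alpha)
    return 1.0;  // The query may lie inside the node's ball; the bound is trivial.
  return k * alpha + beta * std::sqrt(std::max(0.0, 1.0 - k * k));
}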

We can show a similar tighter bound for the dual-tree case.

Theorem 14. Consider a kernel K such that K(x, x) = 1 ∀ x, and two space tree nodes N_q and N_r with centers µ_q = ϕ(p_q) and µ_r = ϕ(p_r), respectively, and furthest descendant distances λ_q and λ_r, respectively. Define the following four quantities:

α_q = 1 − (1/2) λ_q^2,   (116)
α_r = 1 − (1/2) λ_r^2,   (117)
β_q = λ_q √(1 − (1/4) λ_q^2),   (118)
β_r = λ_r √(1 − (1/4) λ_r^2).   (119)

Then, the maximum kernel function value between any point in D_q^p and any point in D_r^p is bounded from above by the function

K^n_max(N_q, N_r) = K(p_q, p_r)(α_q α_r − β_q β_r) + √(1 − K(p_q, p_r)^2)(α_q β_r + β_q α_r)   if K(p_q, p_r) ≤ 1 − (1/2)(λ_q + λ_r)^2,
                  = 1   otherwise.   (121)

Proof. All of the points in D_q^p and D_r^p are sitting on the surface of a hypersphere in H. This means that K(p_q, p_r) denotes the cosine of the angle made by ϕ(p_q) and ϕ(p_r) at the origin. Similar to the previous proof, we first consider the case where the balls in H centered at ϕ(p_q) and ϕ(p_r) with radii λ_q and λ_r, respectively, overlap. This situation happens when d_K(p_q, p_r) < λ_q + λ_r. In this case, it is clear that the maximum possible kernel evaluation should be 1, because there could exist a point in D_q^p whose angle to a point in D_r^p is 0. We can restate the condition as a condition on K(p_q, p_r):

K(p_q, p_r) > 1 − (1/2)(λ_q + λ_r)^2.   (122)

Now, for the other case, assume that p_q* and p_r* are the best matches between points in D_q^p and D_r^p. Let cos θ_{p_q p_r} = K(p_q, p_r); let θ_{p_q p_q*} be the angle between ϕ(p_q) and ϕ(p_q*) at the origin; similarly, let θ_{p_r p_r*} be the angle between ϕ(p_r) and ϕ(p_r*) at the origin. Lastly, let θ_{p_q* p_r*} be the angle between ϕ(p_q*) and ϕ(p_r*) at the origin. Then,

K(p_q*, p_r*) = cos θ_{p_q* p_r*} ≤ cos([θ_{p_q p_r} − θ_{p_q p_q*} − θ_{p_r p_r*}]_+).   (123)

Using reasoning similar to the last proof, we obtain the following bounds:

θ_{p_q p_q*} ≤ cos^{-1}(1 − (1/2) λ_q^2),   (124)
θ_{p_r p_r*} ≤ cos^{-1}(1 − (1/2) λ_r^2).   (125)

We can substitute these two values into Equation 123 to obtain

K(p_q*, p_r*) ≤ cos([θ_{p_q p_r} − cos^{-1}(1 − (1/2) λ_q^2) − cos^{-1}(1 − (1/2) λ_r^2)]_+).   (126)

This can be reduced to the statement of the theorem by the use of trigonometric identities. Combine with the conditional from earlier and call the combined bound K^n_max(N_q, N_r). Then, the theorem holds. □

In the upcoming algorithms, we will not use the tighter bounds for normalized kernels given in Theorems 13 and 14; however, it is easy to re-derive the algorithm with the tighter bounds, if a normalized kernel is being used. Simply replace instances of K_max(·, ·) with K^n_max(·, ·).

7.7.7 Single-tree max-kernel search

First, we will present a single-tree algorithm called single-tree FastMKS that works on a single query pq and a reference set S r. As with all other algorithms in the thesis, we will simply present a BaseCase() and Score() function, and this is all we need to describe the algorithm (details of this abstraction are the topic of Chapter 3). This allows us to formulate the single-tree algorithm simply and intuitively.

Algorithm 29 BaseCase(p_q, p_r) for FastMKS.

1: Input: query point p_q, reference point p_r
2: Output: none
3: if K(p_q, p_r) > k* then
4:   k* ← K(p_q, p_r)
5:   p* ← p_r

Algorithm 30 Score(p_q, N_r) for FastMKS.
1: Input: query point p_q, reference space tree node N_r, max-kernel candidate p* for p_q and corresponding max-kernel value k*
2: Output: a score for the node, or ∞ if the node can be pruned
3: if K_max(p_q, N_r) < k* then
4:   return ∞
5: else
6:   return K_max(p_q, N_r)

In our problem setting, we can prune a node N_r if no points in D_r^p can possibly be a better max-kernel candidate than what has already been found as a max-kernel candidate for p_q. Thus, any descendants of N_r do not need to be visited, as they cannot improve the solution.

The BaseCase() function can be seen in Algorithm 29. It assumes p* is a global variable representing the current max-kernel candidate and k* is a global variable representing the current best max-kernel value. The method itself is very simple: calculate K(p_q, p_r), and if that kernel evaluation is larger than the current best max-kernel value k*, then store p_r as the new best max-kernel candidate and K(p_q, p_r) as the new best max-kernel value.

The Score() function for single-tree FastMKS is given in Algorithm 30. The intuition

is clear: if the maximum possible kernel value between p_q and any point in D_r^p is less than the current max-kernel candidate value, then N_r cannot possibly hold a better candidate and it can be pruned (return ∞). Otherwise, the bound K_max(p_q, N_r) itself is returned. This return value is chosen because pruning single-tree traversals may use the value returned by Score() to determine the order in which to visit subsequent nodes [28].

The actual single-tree FastMKS algorithm is constructed by selecting a type of space

tree and selecting a pruning single-tree traversal with the BaseCase() function as in Algorithm 29 and the Score() function as in Algorithm 30. The algorithm is run by building a space tree T_r on the set of reference points S_r, then using the pruning single-tree traversal with point p_q and tree T_r. At the beginning of the traversal, p* is initialized to an invalid value and k* is initialized to −∞.
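As a sketch of how these pieces might be expressed as a rules class in the spirit of the RuleType policy of Chapter 4, consider the simplified code below. The node interface (Point(0), FurthestDescendantDistance()) and the indexable reference set are assumptions made for illustration; this is not the actual mlpack FastMKS implementation.

#include <cmath>
#include <cstddef>
#include <limits>

// Simplified single-tree FastMKS rules (Algorithms 29 and 30).
template<typename KernelType, typename QueryType, typename DatasetType>
class SimpleFastMKSRules {
 public:
  SimpleFastMKSRules(const KernelType& kernel, const QueryType& query,
                     const DatasetType& referenceSet) :
      kernel(kernel), query(query), referenceSet(referenceSet),
      bestIndex(std::size_t(-1)),
      bestKernel(-std::numeric_limits<double>::infinity()),
      querySelfNorm(std::sqrt(kernel.Evaluate(query, query))) { }

  // Algorithm 29: update the best candidate if the reference point improves it.
  void BaseCase(const std::size_t referenceIndex) {
    const double k = kernel.Evaluate(query, referenceSet[referenceIndex]);
    if (k > bestKernel) { bestKernel = k; bestIndex = referenceIndex; }
  }

  // Algorithm 30: return infinity (prune) if the node cannot contain a better
  // candidate; otherwise return the bound of Equation 103 as the score.
  template<typename NodeType>
  double Score(const NodeType& node) {
    const double bound = kernel.Evaluate(query, referenceSet[node.Point(0)]) +
        node.FurthestDescendantDistance() * querySelfNorm;
    return (bound < bestKernel) ? std::numeric_limits<double>::infinity() : bound;
  }

  std::size_t BestIndex() const { return bestIndex; }
  double BestKernel() const { return bestKernel; }

 private:
  const KernelType& kernel;
  const QueryType& query;
  const DatasetType& referenceSet;
  std::size_t bestIndex;
  double bestKernel;
  const double querySelfNorm;  // ||phi(p_q)||_H = sqrt(K(p_q, p_q)), cached once.
};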

Proving the correctness of the single-tree FastMKS algorithm is trivial.

Theorem 15. At the termination of the single-tree FastMKS algorithm for a given space

tree and pruning single-tree traversal,

p* = argmax_{p_r ∈ S_r} K(p_q, p_r).   (127)

Proof. First, assume that Score() does not prune any nodes during the traversal of the tree

Tr. Then, by the definition of pruning single-tree traversal, BaseCase() is called with pq and every pr ∈ S r. This is equivalent to linear scan and will give the correct result.

Then, by Theorem 11 (or Theorem 13 if K(·, ·) is normalized and K^n_max(·, ·) is being used), a node is only pruned if it does not contain a point p_r where K(p_q, p_r) > k*. Thus, BaseCase() is only not called in situations where p* and k* would not be modified. This, combined with the previous observation, means that p* and k* are equivalent to the linear scan results at the end of the traversal—and we know the linear scan results are correct.

Thus, the theorem holds. 

In the original publications, an O(log N) per query runtime bound was claimed. However, this proof depends on the original nearest neighbor runtime proof for cover trees, which I now believe to be either unclear or incorrect (see Section 5.3). Therefore, I have omitted this runtime bound.

7.7.8 Dual-tree fast max-kernel search

Now, we present a dual-tree algorithm for max-kernel search, called dual-tree FastMKS.

This algorithm, as with the single-tree algorithm in Section 7.7.7, is presented as only a

BaseCase() and Score() function; in this case, BaseCase() remains the same (Algorithm 29).

Because the dual-tree algorithm solves max-kernel search for an entire set of query

points S_q, we must store a kernel candidate p* and value k* for each query point p_q; call these p*(p_q) and k*(p_q), respectively. At the initialization of the algorithm, k*(p_q) = −∞ for each p_q ∈ S_q and p*(p_q) is set to some invalid point.

The pruning rule is slightly more complex. In the dual-tree setting, we can prune a node combination (N_q, N_r) if and only if D_r^p contains no points that can improve p*(p_q) and k*(p_q) for any p_q ∈ D_q^p. There are multiple ways to express this concept, and we will use two of them to construct a bound function to determine when we can prune. This section is heavily based on the reasoning used to derive the nearest-neighbor search bound; see Section 7.1 and [28].

First, consider the smallest max-kernel value k*(p_q) over all points p_q ∈ D_q^p; call this B_1(N_q):

B_1(N_q) = min_{p_q ∈ D_q^p} k*(p_q)
         = min { min_{p_q ∈ P_q} k*(p_q), min_{N_c ∈ C_q} B_1(N_c) },

where the simplification is a result of expressing B_1(N_q) recursively. Now, note also that for any point p_q ∈ D_q^p with max-kernel candidate value k*(p_q), we can place a lower bound on the true max-kernel value k̂(p'_q) for any p'_q ∈ D_q^p by bounding K(p'_q, p*(p_q)). This gives

k̂(p'_q) ≥ k*(p_q) − (ρ_q + λ_q) √(K(p*(p_q), p*(p_q))),

where ρ_q is the maximum distance from any p ∈ P_q to the centroid of N_q (for cover trees, this value is always 0).

Algorithm 31 Score(N_q, N_r) for FastMKS.
1: Input: query node N_q, reference node N_r
2: Output: a score for the node combination (N_q, N_r), or ∞ if the combination can be pruned
3: if K_max(N_q, N_r) < B(N_q) then
4:   return ∞
5: else
6:   return K_max(N_q, N_r)

This inequality follows using similar reasoning as Theorem 11, except that we are finding a lower bound instead of an upper bound.

Considering all the points p_q ∈ D_q^p, we find that a lower bound on the max-kernel value for any point in D_q^p can be expressed as

max_{p_q ∈ D_q^p} ( k*(p_q) − (ρ_q + λ_q) √(K(p*(p_q), p*(p_q))) ).

However, this is difficult to calculate in practice; thus, we introduce a second bounding function that can be quickly calculated by only considering points in P_q and not D_q^p:

B_2(N_q) = max_{p_q ∈ P_q} ( k*(p_q) − (ρ_q + λ_q) √(K(p*(p_q), p*(p_q))) ).

Now, we can take the better of B1(Nq) and B2(Nq) as our pruning bound:

B(N_q) = max { B_1(N_q), B_2(N_q) }.   (128)

This means that we can prune a node combination (Nq, Nr) if

K_max(N_q, N_r) < B(N_q),

and therefore we introduce a Score() function in Algorithm 31 that uses B(N_q) to determine if a node combination should be pruned.
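A rough sketch of this dual-tree Score() rule follows, assuming a hypothetical node interface that caches B(N_q) (as Bound()) and exposes its center point and furthest descendant distance; the real implementation maintains B_1 and B_2 with extra bookkeeping.

#include <cmath>
#include <limits>

// Dual-tree Score() (Algorithm 31): prune the combination (Nq, Nr) when the
// node-to-node bound of Equation 107 cannot beat the query node's bound B(Nq).
// All interfaces here (Bound(), Point(0), FurthestDescendantDistance(), and the
// indexable query/reference sets) are hypothetical placeholders.
template<typename KernelType, typename NodeType, typename DatasetType>
double DualTreeScore(const KernelType& kernel,
                     const DatasetType& querySet, NodeType& queryNode,
                     const DatasetType& referenceSet, NodeType& refNode) {
  const auto& pq = querySet[queryNode.Point(0)];
  const auto& pr = referenceSet[refNode.Point(0)];
  const double lq = queryNode.FurthestDescendantDistance();
  const double lr = refNode.FurthestDescendantDistance();

  // Node-to-node upper bound (Theorem 12).
  const double kmax = kernel.Evaluate(pq, pr) +
      lq * std::sqrt(kernel.Evaluate(pr, pr)) +
      lr * std::sqrt(kernel.Evaluate(pq, pq)) + lq * lr;

  // B(Nq) = max(B1(Nq), B2(Nq)), assumed to be cached on the query node.
  if (kmax < queryNode.Bound())
    return std::numeric_limits<double>::infinity();  // Prune.
  return kmax;
}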

As with the single-tree algorithm, a correctness proof is straightforward.

Theorem 16. At the termination of the dual-tree FastMKS algorithm for a given space tree

and pruning dual-tree traversal,

p*(p_q) = argmax_{p_r ∈ S_r} K(p_q, p_r)   ∀ p_q ∈ S_q.   (129)

Proof. First, assume that Score() does not prune any node combinations during the dual traversal of the trees Tq and Tr. Then, by the definition of pruning dual-tree traversal,

BaseCase() will be called with each pq ∈ S q and each pr ∈ S r; this is equivalent to linear scan and will give the correct results.

We have already stated the validity of B(Nq) (Equation 128). Because of that, and also

by Theorem 12 (or Theorem 14 if K(·, ·) is normalized and K^n_max(·, ·) is being used), a node combination is only pruned if it does not contain a point p_r that would modify p*(p_q) or k*(p_q) for any p_q ∈ D_q^p. This, combined with the previous observation, means that p* and k* are equivalent to the linear scan results for each p_q ∈ S_q, and thus, the theorem holds. □

7.7.9 Dual-tree algorithm runtime analysis

Now, we bound the runtime of dual-tree FastMKS using the same adaptive algorithm analysis techniques for the cover tree as in other sections. The original formulation of this bound depended on a virtually incomprehensible quantity called the inverse constant of bichromaticity [79], which is related to the similarly opaque constant of bichromaticity from a few years prior [133]. Here, we re-derive the FastMKS runtime bound using Theorem 1, from Section 5.2 (or [135]), which provides a bound that is much easier to understand and depends on the imbalance of the tree and a few other quantities.

As with all other adaptive analysis runtime bounds in this thesis, we restrict our consideration of the dual-tree algorithm to the cover tree and the standard cover tree dual-tree traversal (Algorithm 8).

We still must introduce a handful of additional quantities, though. First, we define the maximum norms and minimum norms of the query set S q and reference set S r:

η_q = max_{p_q ∈ S_q} ‖ϕ(p_q)‖_H,   (130)
η_r = max_{p_r ∈ S_r} ‖ϕ(p_r)‖_H,   (131)
τ_q = min_{p_q ∈ S_q} ‖ϕ(p_q)‖_H,   (132)
τ_r = min_{p_r ∈ S_r} ‖ϕ(p_r)‖_H.   (133)

Next, we use these quantities to place bounds on the maximum distances d_H(·, ·) between points in the dataset, and place an upper bound on the maximum scale of cover tree nodes.

Lemma 5. For the query set S_q, the maximum distance between any two points in S_q satisfies

d_H^max(S_q) ≤ 2η_q.   (134)

Proof. We can alternately write d_H^max(S_q) as

d_H^max(S_q) = max_{p_i ∈ S_q, p_j ∈ S_q} d_H(p_i, p_j),
(d_H^max(S_q))^2 = max_{p_i ∈ S_q, p_j ∈ S_q} ( ‖ϕ(p_i)‖_H^2 + ‖ϕ(p_j)‖_H^2 − 2⟨ϕ(p_i), ϕ(p_j)⟩_H ).

Note that ⟨ϕ(p_i), ϕ(p_j)⟩_H is minimized when ϕ(p_i) and ϕ(p_j) point opposite ways in H: ϕ(p_i)/‖ϕ(p_i)‖_H = −(ϕ(p_j)/‖ϕ(p_j)‖_H). Thus,

(d_H^max(S_q))^2 ≤ max_{p_i ∈ S_q, p_j ∈ S_q} ( ‖ϕ(p_i)‖_H^2 + ‖ϕ(p_j)‖_H^2 − 2 max{⟨ϕ(p_i), −ϕ(p_i)⟩_H, ⟨ϕ(p_j), −ϕ(p_j)⟩_H} )   (135)
                 ≤ max_{p_i ∈ S_q, p_j ∈ S_q} ( ‖ϕ(p_i)‖_H^2 + ‖ϕ(p_j)‖_H^2 + 2 max{‖ϕ(p_i)‖_H^2, ‖ϕ(p_j)‖_H^2} )   (136)
                 ≤ 4η_q^2.   (137)

This trivially reduces to the result. □

Corollary 3. The maximum distance between any two points in S_r satisfies

d_H^max(S_r) ≤ 2η_r.   (138)

Lemma 6. The top scale s_r^T (the maximum, or largest, scale) in the cover tree T_r built on S_r is bounded as

s_r^T ≤ log_2(η_r).   (139)

Proof. The root of the tree Tr is the node with the largest scale, and it is the only node of

that scale (call this scale s_r^T). The furthest descendant distance of the root node is bounded by 2^{s_r^T + 1}; however, this is not necessarily the distance between the two furthest points in the dataset (consider a tree where the root node is near the centroid of the data). This, with Corollary 3, yields 2^{s_r^T + 1} ≤ 2η_r, which is trivially reduced to the result. □

Finally, we are ready to show the main result of the section.

Theorem 17. Given a Mercer kernel K(·, ·), a reference set S_r of size N with expansion constant c_r and directional concentration constant γ_r, a query set S_q of size O(N), and with α defined as

α = 1 + (2η_r / τ_q),   (140)

the dual-tree FastMKS algorithm using cover trees and the standard dual-tree cover tree traversal on T_q (a cover tree built on S_q) and T_r (a cover tree built on S_r), with i_t(·) defined as in Definition 10 and θ defined as in Lemma 4, requires time

O( γ_r c_r^{7 log_2 α} ( N + i_t(T_q) + θ ) ).   (141)

Proof. We know from Theorem 1 that the running time of any dual-tree algorithm which

uses the cover tree and the standard cover tree dual-tree traversal is O(c_r^4 |R*| χψ (N + i_t(T_q) + θ)). Our only job, then, is to fill in each of these quantities and simplify.

Smart caching strategies allow BaseCase() and Score() to be written such that they

each take O(1) time, meaning that χ = ψ = O(1) and we have no further need to consider those terms. For Score(), the primary calculation is that of B(Nq). This can be made

O(1) by caching the values of both B1(Nq) and B2(Nq) and only updating when necessary:

changes to B2(Nq) for a node are propagated upwards, and changes to B1(Nq) only require an O(1) check anyway because each cover tree node has only one point.

The last thing, then, is to bound |R∗|, the size of the largest reference set; this turns out

to be quite in-depth. Consider some reference set R encountered with maximum reference

scale s_r^max and query node N_q. Every node N_r ∈ R satisfies the property enforced in line 10 that

K_max(N_q, N_r) ≥ B(N_q).   (142)

Remembering that √(K(p, p)) = ‖ϕ(p)‖_H, we can relax B(N_q) (Equation 128) for the cover tree (where ρ_i = 0 for all N_i) to show

B(N_q) ≥ max_{p ∈ P_q} ( k*(p) − λ_q ‖ϕ(p*(p))‖_H )
       = k*(p_q) − λ_q ‖ϕ(p*(p_q))‖_H,   (143)

which we can combine with Equation 142 to obtain

K_max(N_q, N_r) ≥ k*(p_q) − λ_q ‖ϕ(p*(p_q))‖_H,
K(p_q, p_r) ≥ k*(p_q) − λ_q ( ‖ϕ(p_r)‖_H + ‖ϕ(p*(p_q))‖_H ) − λ_r ‖ϕ(p_q)‖_H − λ_q λ_r,   (144)

and, remembering that the scale of N_q is s_q and the scale of N_r is bounded above by s_r^max, we simplify further to

K(p_q, p_r) ≥ k*(p_q) − 2^{s_q+1} ( ‖ϕ(p_r)‖_H + ‖ϕ(p*(p_q))‖_H ) − 2^{s_r^max+1} ‖ϕ(p_q)‖_H − 2^{s_q + s_r^max + 2}.   (145)

We can express this conditional as membership in a set I_{S_r} by first defining the true maximum kernel value for p_q as

k̂(p_q) = max_{p_r ∈ S_r} K(p_q, p_r).   (146)

The condition (Equation 145) can be stated as membership in a set:

ϕ(p_r) ∈ I_{S_r}( ϕ(p_q), [b_l, k̂(p_q)] ),   (147)

where

b_l = k*(p_q) − 2^{s_q+1} ( ‖ϕ(p_r)‖_H + ‖ϕ(p*(p_q))‖_H ) − 2^{s_r^max+1} ‖ϕ(p_q)‖_H − 2^{s_q + s_r^max + 2}.   (148)

Now, we produce a lower bound for b_l. Note that k̂(p_q) ≤ k*(p_q) + 2^{s_r^max+1} ‖ϕ(p_q)‖_H, and see

b_l ≥ k̂(p_q) − 2^{s_q+1} ( ‖ϕ(p_r)‖_H + ‖ϕ(p*(p_q))‖_H ) − 2^{s_r^max+2} ‖ϕ(p_q)‖_H − 2^{s_q + s_r^max + 2}   (149)
    ≥ k̂(p_q) − 2^{s_r^max+1} ( ‖ϕ(p_r)‖_H + ‖ϕ(p*(p_q))‖_H ) − 2^{s_r^max+2} ‖ϕ(p_q)‖_H − 2^{2 s_r^max + 2},   (150)

which follows because s_q < s_r^max during a reference recursion (see line 4). Using the maximum and minimum norms defined earlier, we can bound b_l further:

b_l ≥ k̂(p_q) − 2^{s_r^max+1} (η_r + η_r) − 2^{s_r^max+2} ‖ϕ(p_q)‖_H − 2^{2 s_r^max + 2}   (151)
    = k̂(p_q) − 2^{s_r^max+2} ( ‖ϕ(p_q)‖_H + η_r + 2^{s_r^max} )   (152)
    ≥ K(p_q, p_r) − 2^{s_r^max+2} ( ‖ϕ(p_q)‖_H + η_r + 2^{s_r^max} )   (153)
    ≥ K(p_q, p_r) − 2^{s_r^max+2} ( ‖ϕ(p_q)‖_H + η_r + 2^{s_r^T} )   (154)
    ≥ K(p_q, p_r) − 2^{s_r^max+2} ( ‖ϕ(p_q)‖_H + 2η_r ),   (155)

where the last two bounding steps result from Lemma 6.

Now, note that

b_l / ‖ϕ(p_q)‖_H ≥ ⟨u, ϕ(p_r)⟩_H − 2^{s_r^max+2} ( 1 + (η_r + 2^{s_r^T}) / τ_q ),   (156)

where u = ϕ(p_q)/‖ϕ(p_q)‖_H; then set α = 1 + (2η_r / τ_q) (α is not dependent on the scale s_r^max; this is important) and use the conditional from Equation 147 to get

ϕ(p_r) ∈ I_{S_r}( ϕ(p_q), [b_l, k̂(p_q)] )   (157)
        ⊆ I_{S_r}( ϕ(p_q), [b_l, K(p_q, p_r) + 2^{s_r^max+1} ‖ϕ(p_q)‖_H] )   (158)
        ⊆ I_{S_r}( u, [⟨u, ϕ(p_r)⟩_H − 2^{s_r^max+2} α, ⟨u, ϕ(p_r)⟩_H + 2^{s_r^max+1}] )   (159)
        ⊆ I_{S_r}( u, [⟨u, ϕ(p_r)⟩_H − 2^{s_r^max+2} α, ⟨u, ϕ(p_r)⟩_H + 2^{s_r^max+2} α] ).   (160)

This is true for each point pi of each node Ni in Ri. Thus, if we can place a bound on the number of points in the set given in Equation 160, then we are placing a bound on |Ri| for any scale si. To this end, we can use the definition of directional concentration constant, to show that there exist γr points p j ∈ S r such that

I_{S_r}( u, [⟨u, ϕ(p_r)⟩_H − 2^{s_r^max+2} α, ⟨u, ϕ(p_r)⟩_H + 2^{s_r^max+2} α] ) ⊆ ⋃_{j=1}^{γ_r} B_{S_r}(p_j, 2^{s_r^max+2} α).   (161)

By Lemma 2, each point p_r of each node N_r ∈ R must be separated by at least 2^{s_r^max}, because each point in R must have a parent with scale at least s_r^max + 1. Thus, we must bound the number of balls of radius 2^{s_r^max − 1} that can be packed into the set defined by Equation 161.

For each p j, we have

|B_{S_r}(p_j, 2^{s_r^max+2} α)| ≤ c_r^3 |B_{S_r}(p_j, 2^{s_r^max−1} α)|
                               ≤ c_r^{3 log_2 α} |B_{S_r}(p_j, 2^{s_r^max−1})|.

This allows us to conclude that |R*| ≤ γ_r c_r^{3 log_2 α}, and therefore the total running time of the algorithm is O( γ_r c_r^{7 log_2 α} ( N + i_t(T_q) + θ ) ), and the theorem holds. □

Note that if dual-tree FastMKS is being run with the same set as the query set and

reference set (i.e. monochromatic search), θ = 0, yielding a tighter bound.

7.7.10 Extensions for approximate max-kernel search

For further scalability, we can develop an extension of FastMKS that does not return the

exact max-kernel value but instead an approximation thereof. Even though we are focusing

on exact max-kernel search, we wish to demonstrate that the tree-based method can be very easily extended to perform approximate max-kernel search. For any query p_q, we are seeking p̂(p_q) = argmax_{p_r ∈ S_r} K(p_q, p_r). Let K(p_q, p̂(p_q)) = k̂(p_q) (as before). Then approximation can be achieved in the following ways:

1. Absolute value approximation: for all queries pq ∈ S q, find pr ∈ S r such that

K(p_q, p_r) ≥ k̂(p_q) − ε for some ε > 0.

2. Relative value approximation: for all queries pq ∈ S q, find pr ∈ S r such that

K(p_q, p_r) ≥ (1 − ε) k̂(p_q) for some ε > 0.¹⁵

3. Rank approximation: return p_r ∈ S_r such that |{ p'_r ∈ S_r : K(p_q, p'_r) > K(p_q, p_r) }| ≤ τ.

The following three subsubsections present how both single-tree FastMKS and dual-tree

FastMKS can be easily extended for approximate max-kernel search.

7.7.10.1 Absolute value approximation

From Theorem 11 and Algorithm 30, at any point in the single-tree algorithm with query

point p_q and node N_i and best candidate kernel value k*(p_q), we know that we must descend

Ni if

K_max(p_q, N_i) ≥ k*(p_q),   (162)

but with absolute value approximation for some ε, we can loosen the condition to

15Here we are assuming that k̂(p_q) > 0. In the case where k̂(p_q) < 0, we seek a p_r ∈ S_r such that K(p_q, p_r) > k̂(p_q) − ε|k̂(p_q)|.

Algorithm 32 Score(p_q, N_r) for absolute value approximation of FastMKS.
1: Input: query point p_q, reference space tree node N_r, max-kernel candidate p* for p_q and corresponding max-kernel value k*, absolute value approximation ε
2: Output: a score for the node, or ∞ if the node can be pruned
3: if ε > λ_r √(K(p_q, p_q)) then
4:   return ∞
5: else if K_max(p_q, N_r) < k* then
6:   return ∞
7: else
8:   return K_max(p_q, N_r)

K_max(p_q, N_i) ≥ k*(p_q) + ε,   (163)

which can be simplified:

K(p_q, p_i) + λ_i √(K(p_q, p_q)) ≥ k*(p_q) + ε,
K(p_q, p_i) + λ_i √(K(p_q, p_q)) ≥ K(p_q, p_i) + ε,
λ_i √(K(p_q, p_q)) ≥ ε.   (164)

This yields that we can prune if ε > λ_i √(K(p_q, p_q)). While this is looser than possible, it has the advantage that K(p_q, p_i) does not need to be calculated to prune N_i. This yields a modified Score() algorithm, given in Algorithm 32.

In the dual-tree case, we must descend (Nq, Nr) if

K_max(N_q, N_r) ≥ B(N_q).   (165)

Using absolute value approximation this condition loosens to

K_max(N_q, N_r) ≥ B(N_q) + ε,   (166)

but we cannot easily simplify this to eliminate the evaluation of K(p_q, p_r) due to the complexity of B(N_q). A modified Score() function for dual-tree absolute value approximate FastMKS is given in Algorithm 33.

Algorithm 33 Score(N_q, N_r) for absolute value approximation of FastMKS.
1: Input: query node N_q, reference node N_r, absolute value approximation ε
2: Output: a score for the node combination (N_q, N_r), or ∞ if the combination can be pruned
3: if K_max(N_q, N_r) < B(N_q) + ε then
4:   return ∞
5: else
6:   return K_max(N_q, N_r)

7.7.10.2 Relative value approximation

Relative value approximation is a more useful form of approximation, because the user

does not need knowledge of k̂(p_q) to set ε reasonably. However, care has to be taken for relative value approximation because there is no guarantee that k̂(p_q) > 0. We can take Equation 162 and modify it for ε-relative-value-approximate pruning. In

this case, we must descend Ni if

K_max(p_q, N_i) ≥ (1 + ε) k*(p_q),   (167)

and similar algebraic manipulations yield

K(p_q, p_i) + λ_i √(K(p_q, p_q)) ≥ (1 + ε) k*(p_q),
K(p_q, p_i) + λ_i √(K(p_q, p_q)) ≥ K(p_q, p_i) + ε k*(p_q),
λ_i √(K(p_q, p_q)) ≥ ε k*(p_q),

meaning we can prune a node N_i when k*(p_q) > (λ_i / ε) √(K(p_q, p_q)). This is looser than possible (like the absolute-value approximation bound) but has the advantage that K(p_q, p_i) does not need to be calculated to prune N_i. This yields a modified Score() algorithm, given in Algorithm 34.

Similar to absolute value approximation, we can loosen the condition for recursion given in Equation 165 to obtain the rule

Algorithm 34 Score(p_q, N_r) for relative value approximation of FastMKS.
1: Input: query point p_q, reference space tree node N_r, max-kernel candidate p* for p_q and corresponding max-kernel value k*, relative value approximation ε
2: Output: a score for the node, or ∞ if the node can be pruned
3: if k* > (λ_r / ε) √(K(p_q, p_q)) then
4:   return ∞
5: else if K_max(p_q, N_r) < k* then
6:   return ∞
7: else
8:   return K_max(p_q, N_r)

Algorithm 35 Score(N_q, N_r) for relative value approximation of FastMKS.
1: Input: query node N_q, reference node N_r, relative value approximation ε
2: Output: a score for the node combination (N_q, N_r), or ∞ if the combination can be pruned
3: if K_max(N_q, N_r) < (1 + ε) B(N_q) then
4:   return ∞
5: else
6:   return K_max(N_q, N_r)

K_max(N_q, N_r) ≥ (1 + ε) B(N_q).   (168)

Unlike the single-tree relative value approximation rule, this rule does not easily simplify; this is due to the complexity of B(N_q). A modified Score() function is given in Algorithm 35.

7.7.10.3 Rank Approximation

Rank approximation is a relatively new approximation paradigm introduced by Ram et al.

[67]. The idea is to return a max-kernel candidate p_r for query p_q, reference set S_r, and parameter τ such that p_r is in the top τ max-kernel results with high probability. That is, for p_q, S_r, and τ, find an object p_r ∈ S_r such that

|{ p'_r ∈ S_r : K(p_q, p'_r) > K(p_q, p_r) }| < τ.   (169)

This is often a better technique than absolute-value-approximate search, which requires

Algorithm 36 Score(p_q, N_r) for rank approximation of FastMKS.
1: Input: query point p_q, reference space tree node N_r, max-kernel candidate p* for p_q and corresponding max-kernel value k*, required number of samples k for τ-rank approximation in a reference set of size n
2: Output: a score for the node, or ∞ if the node can be pruned
3: if |D_r^p| ≤ (n/k) then
4:   S'_r ← ⌈(k/n)|D_r^p|⌉ random samples from D_r^p
5:   for all p'_r ∈ S'_r do
6:     BaseCase(p_q, p'_r)
7:   return ∞
8: else if K_max(p_q, N_r) < k* then
9:   return ∞
10: else
11:   return K_max(p_q, N_r)

a tuned parameter ε for each dataset, and relative-value-approximate search, which may

return useless results when the values of K(pq, pr) are very close for all pr ∈ S r.

The idea presented in [67] is to draw a set of samples S'_r large enough that the maximum kernel value between p_q and any point in S'_r (call this k*) is such that

Pr( |{ p'_r ∈ S_r : K(p_q, p'_r) > k* }| < τ ) ≥ 1 − δ.   (170)

Simplifying the formulation presented in [67], the probability of always missing the

top τ values for a given query p_q after k samples with replacement is given by (1 − (τ/n))^k, where |S_r| = n. If we want a (1 − δ) success rate of sampling, then we want k to be such that

(1 − τ/n)^k < δ   and   (1 − τ/n)^{k−1} > δ,   (171)

which gives

k = ⌈ log δ / log(1 − τ/n) ⌉.   (172)

Following the logic of [67], if a node N_i contains more than n/k points (|D_i^p| > n/k), then we can prune the node after we randomly sample ⌈(k/n)|D_i^p|⌉ points from D_i^p. In addition to that, the standard FastMKS pruning rules still apply. An updated single-tree Score() function is given in Algorithm 36.

Algorithm 37 Score(N_q, N_r) for rank approximation of FastMKS.
1: Input: query node N_q, reference node N_r, required number of samples k for τ-rank approximation in a reference set of size n
2: Output: a score for the node combination (N_q, N_r), or ∞ if the combination can be pruned
3: if |D_r^p| ≤ (n/k) then
4:   for all p_q ∈ D_q^p do
5:     S'_r ← ⌈(k/n)|D_r^p|⌉ random samples from D_r^p
6:     for all p'_r ∈ S'_r do
7:       BaseCase(p_q, p'_r)
8:   return ∞
9: else if K_max(N_q, N_r) < B(N_q) then
10:   return ∞
11: else
12:   return K_max(N_q, N_r)

An extension of this for a dual-tree algorithm is straightforward; for a reference node

N_r, if |D_r^p| > (n/k), then we can sample it for each query point p_q ∈ D_q^p and prune the node combination. A Score() function is given in Algorithm 37.
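As a small sketch of Equation 172 (the parameter names are hypothetical and this is not mlpack code), the required number of samples can be computed directly; for example, with n = 1,000,000, τ = 1,000, and δ = 0.05 this gives k = 2,995.

#include <cmath>
#include <cstddef>

// Number of samples (with replacement) needed so that, with probability at
// least (1 - delta), at least one sample lies in the top tau of n reference
// points (Equation 172).
std::size_t RequiredSamples(const std::size_t n, const std::size_t tau, const double delta) {
  return (std::size_t) std::ceil(std::log(delta) /
                                 std::log(1.0 - (double) tau / (double) n));
}

// Example: RequiredSamples(1000000, 1000, 0.05) == 2995.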

7.7.11 Empirical evaluation

We evaluate single-tree and dual-tree FastMKS with different kernels and datasets. For each experiment, we query the top {1, 2, 5, 10} max-kernel candidates and report the speedup over linear search (in terms of the number of kernel evaluations performed during the search). The cover tree and the algorithms are implemented in C++ in mlpack [87].

7.7.11.1 Datasets

We use two different classes of datasets. First, we use datasets with fixed-length objects.

These include the MNIST dataset [138], the Isomap "Images" dataset, several datasets from the UCI machine learning repository [134], three collaborative filtering datasets (MovieLens, Netflix [214], Yahoo! Music [202]), the LCDM astronomy dataset [147], the LiveJournal blog moods text dataset [215], and a subset of the 80 Million Tiny Images dataset

Table 20: Vector dataset details; |S_q| and |S_r| denote the number of objects in the query and reference sets, respectively, and dims denotes the dimensionality of the sets.

Dataset        |S_q|      |S_r|       dims
Y! Music       10000      624961      51
MovieLens      6040       3706        11
Opt-digits     450        1347        64
Physics        37500      112500      78
Homology       75000      210409      74
Covertype      100000     481012      55
LiveJournal    10000      10000       25327
MNIST          10000      60000       784
Netflix        17770      480189      51
Corel          10000      27749       32
LCDM           6000000    10777216    3
TinyImages     1000       1000000     384

[216]. The sizes of the datasets are presented in Table 20.

The second class of datasets we use is those without a fixed-length representation. We use protein sequences from GenBank16.

7.7.11.2 Kernels

We consider the following kernels for the vector datasets:

• linear: K(x, y) = x^T y

• polynomial: K(x, y) = (x^T y)^2

• cosine: K(x, y) = (x^T y) / (‖x‖ ‖y‖)

• polynomial, deg. 10: K(x, y) = (x^T y)^10

• Epanechnikov: K(x, y) = max(0, 1 − ‖x − y‖^2 / b^2)
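For reference, kernels like these are straightforward to express as functor classes with an Evaluate() method, in the spirit of a KernelType policy. The sketch below uses Armadillo vectors and is illustrative only, not mlpack's exact source:

#include <algorithm>
#include <armadillo>
#include <cmath>

// Illustrative kernel functors in the style of a KernelType policy class.
struct LinearKernel {
  double Evaluate(const arma::vec& x, const arma::vec& y) const {
    return arma::dot(x, y);
  }
};

struct PolynomialKernel {
  PolynomialKernel(const double degree) : degree(degree) { }
  double Evaluate(const arma::vec& x, const arma::vec& y) const {
    return std::pow(arma::dot(x, y), degree);
  }
  double degree;
};

struct CosineKernel {
  double Evaluate(const arma::vec& x, const arma::vec& y) const {
    return arma::dot(x, y) / (arma::norm(x) * arma::norm(y));
  }
};

struct EpanechnikovKernel {
  EpanechnikovKernel(const double b) : b(b) { }
  double Evaluate(const arma::vec& x, const arma::vec& y) const {
    const double d = arma::norm(x - y);
    return std::max(0.0, 1.0 - (d * d) / (b * b));
  }
  double b;  // Bandwidth.
};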

While the Epanechnikov kernel is normalized and thus reduces to nearest neighbor search, we choose it regardless to show the applicability of FastMKS to a variety of kernels. It is important to remember that standard techniques for nearest neighbor search

16See ftp://ftp.ncbi.nih.gov/refseq/release/complete.

Figure 39: Speedups of single-tree and dual-tree FastMKS over linear scan with k = {1, 2, 5, 10}.

should be able to perform the task faster—we do not compare with those techniques in these experiments.

For the protein sequences, we use the p-spectrum string kernel [198], which is a measure of string similarity. The kernel value for two given strings is the number of length-p substrings that appear in both strings.

7.7.11.3 Implementation

For maximum performance, the implementation in mlpack does not precisely follow the algorithms we have given. By default, the cover tree is designed to use a base of 2 during construction, but following the authors' observations, we find that a base of 1.3 seems to give better performance results [57]. In addition, for both the single-tree and dual-tree algorithms, we attempt to first score nodes (and node combinations) whose kernel values K(p_q, p_r) are higher, in hopes of tightening the bounds B(N_q) and k*(p_q) more quickly.

189 Lastly, the Score() method as implemented in mlpack is somewhat more complex: it attempts to prune the node combination (Nq, Nr) with a looser bound that does not evaluate

K(pq, pr). If that is not successful, Score() proceeds as in Algorithm 31 (or 30 in the single-tree case). This type of prune seems to give 10–30% reductions in the number of kernel evaluations (or more, depending on the dataset).

The mlpack implementation can be downloaded from http://www.mlpack.org/; it includes C++ library bindings for FastMKS and each kernel we have discussed, as well as a fastmks executable that can be used to run FastMKS easily from the command line. In addition, a tutorial can be found on the website, and the source code is extensively documented.

7.7.11.4 Results

The results for the vector datasets are summarized in Figure 39 and detailed for k = 1 in Tables 21 and 22. The tables also provide the number of kernel evaluations calculated during the search for linear search, single-tree FastMKS, and dual-tree FastMKS. Speedups over a factor of 100 are highlighted in bold. While the speedups range from anywhere between 1 (which indicates no speedup) to 50000, many datasets give speedups of an order of magnitude or more. As would be expected with the postulated O(log N) bounds for single-tree FastMKS and the O(N) bounds for dual-tree FastMKS, larger datasets (such as LCDM) tend to provide larger speedups. In the cases where large datasets are used but small speedup values are obtained, the conclusion must be that the expansion constant cr and the directional concentration constant γr for that dataset and kernel are large. In addition, the Epanechnikov kernel is parameterized by a bandwidth b; this bandwidth will seriously affect the runtime if it is too small (all kernel evaluations are 0) or too large (all kernel evaluations are 1). We have arbitrarily chosen 10 as our bandwidth for simplicity in simulations, but for each dataset, it is certain that a better bandwidth value that will provide additional speedup exists.

Another observation is that the single-tree algorithm tends to perform better than the

Table 21: Single-tree and dual-tree FastMKS on vector datasets with k = 1, part one.

Kernel          Dataset        Linear scan   Single-tree   Dual-tree    Single-tree speedup   Dual-tree speedup
linear          Y! Music       6.249B        859.1M        1.056B       7.27                  5.91
                MovieLens      22.38M        2.635M        2.790M       8.49                  8.02
                Optdigits      606.1k        333.2k        366.6k       1.82                  1.65
                Physics        4.219B        628.8M        852.9M       6.71                  4.95
                Bio            20.36B        100.2M        8.174B       203.2                 2.49
                Covertype      48.10B        35.06M        160.9M       1372                  299.0
                LiveJournal    100.0M        13.88M        36.09M       7.21                  2.77
                MNIST          600.0M        229.6M        288.2M       2.62                  2.08
                Netflix        8.532B        2.632B        2.979B       3.12                  2.86
                Corel          277.5M        6.626M        44.02M       41.88                 6.30
                LCDM           64.66T        1.566B        2.778B       41282                 23269
                TinyImages     100.0M        22.30M        35.70M       4.48                  2.80
poly.           Y! Music       6.249B        2.187B        2.221B       2.86                  2.81
                MovieLens      22.38M        1.865M        1.833M       12.00                 12.21
                Optdigits      606.1k        235.1k        296.5k       2.58                  2.04
                Physics        4.219B        823.9M        1.017B       5.12                  4.15
                Bio            20.36B        1.538B        10.87B       13.23                 1.87
                Covertype      48.10B        30.65M        629.7M       1569                  76.39
                LiveJournal    100.0M        12.91M        38.16M       7.75                  2.62
                MNIST          600.0M        202.8M        266.8M       2.96                  2.25
                Netflix        8.532B        2.528B        2.953B       3.37                  2.89
                Corel          277.5M        4.687M        60.30M       59.20                 4.60
                LCDM           64.66T        1.171B        14.98B       55204                 4316
                TinyImages     100.0M        6.957M        34.32M       14.37                 2.91
poly. deg. 10   Y! Music       6.249B        4.296B        4.310B       1.45                  1.45
                MovieLens      22.38M        2.814M        2.826M       7.96                  7.92
                Optdigits      606.1k        212.3k        318.2k       2.86                  1.91
                Physics        4.219B        1.441B        1.481B       2.93                  2.91
                Bio            20.36B        6.018B        12.45B       3.38                  1.63
                Covertype      48.10B        361.1M        13.78B       133.2                 3.49
                LiveJournal    100.0M        12.75M        43.25M       7.84                  2.31
                MNIST          600.0M        205.4M        277.1M       2.92                  2.17
                Netflix        8.532B        2.977B        3.470B       2.87                  2.46
                Corel          277.5M        19.68M        131.1M       14.10                 2.12
                LCDM           64.66T        8.124B        485.2B       7959                  133.3
                TinyImages     100.0M        1.076M        42.23M       92.96                 2.37

dual-tree algorithm, in spite of the better scaling of the dual-tree algorithm. There are multiple potential explanations for this phenomenon:

Table 22: Single-tree and dual-tree FastMKS on vector datasets with k = 1, part two.

Kernel    Dataset        Linear scan   Single-tree   Dual-tree    Single-tree speedup   Dual-tree speedup
cosine    Y! Music       6.249B        849.6M        1.586B       7.36                  3.94
          MovieLens      22.38M        4.044M        8.322M       5.54                  2.69
          Optdigits      606.1k        190.0k        319.8k       3.19                  1.90
          Physics        4.219B        28.82M        140.0M       146.3                 30.14
          Bio            20.36B        14.40B        15.54B       1.41                  1.31
          Covertype      48.10B        50.15M        3.119B       959.2                 15.42
          LiveJournal    100.0M        99.23M        98.78M       1.01                  1.01
          MNIST          600.0M        237.0M        376.7M       2.53                  1.59
          Netflix        8.532B        3.426B        5.344B       2.49                  1.60
          Corel          277.5M        16.22M        61.95M       17.10                 4.48
          LCDM           64.66T        1.058B        112.9B       61063                 572.6
          TinyImages     100.0M        50.49M        92.36M       1.98                  1.02
Epan.     Y! Music       6.249B        3.439B        3.630B       1.82                  1.72
          MovieLens      22.38M        3.243M        4.471M       6.90                  5.01
          Optdigits      606.1k        606.1k        606.1k       1.00                  1.00
          Physics        4.219B        957.6M        1.213B       4.40                  3.48
          Bio            20.36B        20.25B        20.25B       1.01                  1.01
          Covertype      48.10B        48.10B        48.10B       1.00                  1.00
          LiveJournal    100.0M        99.57M        99.15M       1.00                  1.01
          MNIST          600.0M        600.0M        600.0M       1.00                  1.00
          Netflix        8.532B        7.602B        8.293B       1.12                  1.03
          Corel          277.5M        18.53M        119.9M       14.98                 2.31
          LCDM           64.66T        72.32B        119.0B       894.1                 543.3
          TinyImages     100.0M        42.49M        87.99M       2.35                  1.14

• The single-tree bounds given in Theorem 11 (Equation 103) and Theorem 13 (Equation 113) are tighter than the dual-tree bounds of Theorem 12 (Equation 107) and

Theorem 14 (Equation 121).

• The dual-tree algorithm’s runtime is also bounded by the parameters ν, ηr, and τq, whereas the single-tree algorithm is not. This could mean that N would need to

be very large before the dual-tree algorithm became faster, despite the fact that the

dual-tree algorithm scales with c_r^7 and the single-tree algorithm scales with c_r^12.

• The single-tree algorithm considers each element in the query set S_q separately, but the dual-tree algorithm is able to obtain max-kernel bounds for many query points at

Table 23: Single-tree and dual-tree FastMKS on protein sequences with k = 1.

|S_q|    |S_r|    Linear scan   Single-tree   Dual-tree    Single-tree speedup   Dual-tree speedup
391      649      253.8k        5.255k        43.27k       48.29                 5.87
1091     649      708.1k        14.99k        122.7k       47.25                 5.77
2635     649      1.710M        36.04k        327.7k       47.45                 5.22
8604     649      5.584M        115.3k        832.9k       48.43                 6.70
37606    649      24.41M        512.9k        3.763M       47.58                 6.49
63180    649      41.00M        848.1k        4.999M       48.35                 8.20
63180    391      24.70M        484.3k        3.511M       51.01                 7.04
63180    1091     68.93M        834.9k        7.529M       82.56                 9.16
63180    2635     166.5M        927.8k        22.95M       179.4                 7.25
63180    8604     543.6M        692.5k        32.26M       785.1                 16.85
63180    37606    2.376B        743.2k        65.09M       3197                  36.50
63180    63180    3.992B        1.140M        150.2M       3500                  26.59
391      391      152.8k        2.973k        30.68k       51.42                 4.98
1091     1091     1.190M        14.96k        183.2k       79.56                 6.50
2635     2635     6.943M        43.95k        1.689M       158.0                 4.11
8604     8604     323.6M        104.4k        13.76M       783.2                 12.79
37606    37606    1.414B        470.2k        39.70M       3007                  35.62
63180    63180    3.992B        1.141M        150.2M       3500                  26.59

once thanks to the use of the second tree. Thus, the dual-tree algorithm may require

a much larger S q before it outperforms the single-tree algorithm.

The results for the protein sequence data are shown in Figure 40 and Table 23. The table shows that for constant reference set size (649), the dual-tree algorithm provides better scaling as the query set grows. This agrees with the better scaling of dual-tree FastMKS as exhibited in Theorem 17.

However, in every case in Table 23, the single-tree algorithm provides better performance than the dual-tree algorithm. This implies that the query sets and reference sets would likely have to be several orders of magnitude larger for the dual-tree algorithm to provide better speedups. With larger datasets, the single-tree algorithm showed more than 3000x speedup over linear scan. Other datasets may exhibit better or worse scaling depending on the expansion constant and directional concentration constant.

Figure 40: Speedups of single-tree and dual-tree FastMKS over linear scan for protein sequences with k = {1, 2, 5, 10}.

7.7.12 Future directions for max-kernel search

We have now presented two algorithms to solve exact max-kernel search, and six additional extensions for approximation. However, as always, there is room for improvement and further investigation, and a few avenues are laid out below. Another possible avenue is parallelism, but this is an improvement to a traversal, and thus is not restricted to max-kernel search. Traversal improvements are detailed further in Chapter 6.

7.7.12.1 Tighter bounds for specific kernels

In Theorems 13 and 14 we described a tighter bound for normalized kernels (K(x, x) = 1 ∀ x). It is our intuition that similar tighter bounds can be developed for other specific types of Mercer kernels.

This may be especially applicable in domain-specific kernels such as string kernels or graph kernels. Any kernel that has some known structure on how points are mapped to H may be bounded more tightly than the general Mercer kernel bounds given in Equations 103 and 107.

7.7.12.2 Domain-specific applications

In the introduction, we mentioned the wide applicability of max-kernel search, discussing its use in image retrieval, document retrieval, collaborative filtering, and even finding similar protein/DNA sequences. That list contains only a few of the numerous max-kernel search problems that arise ubiquitously in countless fields (not just computing-related fields).

In many of these fields, there are existing domain-specific solutions. One example in genomics is BLAST (Basic Local Alignment Search Tool) [203], a utility that searches for similarity between biological sequences. Another tool of this sort is the older FASTA algorithm [217]. Both of these algorithms are improvements over linear scan with the Smith-Waterman alignment score [204]. However, despite its large speedups, BLAST cannot guarantee exact results.

The Smith-Waterman alignment score can easily be shown to be a Mercer kernel; there- fore, FastMKS could be used to give speedups over linear scan and it would also return provably exact results. Furthermore, approximation extensions to FastMKS could provide additional speedups by relaxing the exact result constraint, potentially making FastMKS competitive with BLAST.

7.7.13 Wrap-up for max-kernel search

In this section, we have described a tree-independent single-tree and dual-tree algorithm in Algorithms 29, 30, and 31 which are able to quickly perform the task of max-kernel search, for individual query points and also for sizeable query sets. As we have seen, max-kernel search is ubiquitous in computer science, so these algorithms—the first to solve max-kernel search exactly—are groundbreaking and useful. In addition, the dual-tree algorithm can be shown to scale linearly, with some assumptions on the dataset. Lastly, the empirical results show both algorithms to be effective in practice.

Next, we will move toward another common problem in machine learning—clustering.

7.8 k-means clustering

This section develops a dual-tree algorithm for fast k-means clustering for large datasets and large k. It is based upon work recently submitted [80].

7.8.1 Introduction

Of all the clustering algorithms in use today, among the simplest and most utilized is the venerated k-means clustering algorithm, usually implemented via Lloyd’s algorithm.

Lloyd’s algorithm is quite simple and well-known: given a dataset S , repeat the following two steps (a ‘Lloyd iteration’) until the centroids of each of the k clusters converge:

1. Assign each point pi ∈ S to the cluster with nearest centroid.

2. Recalculate the centroids for each cluster using the assignments of each point in S .

Clearly, a simple implementation of this algorithm will take O(kN) time per iteration, where N = |S|.
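To make the O(kN) cost concrete, here is a minimal C++ sketch of one naive Lloyd iteration (the baseline later referred to as "naive" in Table 24). The function name lloydIteration and the column-major Armadillo layout are illustrative assumptions, not part of any particular implementation.

#include <armadillo>

// One naive Lloyd iteration: assign every point to its nearest centroid,
// then recompute each centroid as the mean of its assigned points.
// Points and centroids are stored one per column.
void lloydIteration(const arma::mat& data, arma::mat& centroids)
{
  const size_t k = centroids.n_cols;
  arma::mat newCentroids(centroids.n_rows, k, arma::fill::zeros);
  arma::uvec counts(k, arma::fill::zeros);

  for (size_t i = 0; i < data.n_cols; ++i)
  {
    // Step 1: find the nearest centroid; this is O(k) work per point,
    // or O(kN) in total.
    size_t owner = 0;
    double bestDistance = arma::norm(data.col(i) - centroids.col(0));
    for (size_t j = 1; j < k; ++j)
    {
      const double distance = arma::norm(data.col(i) - centroids.col(j));
      if (distance < bestDistance) { bestDistance = distance; owner = j; }
    }

    // Step 2 (accumulation): add the point to its owner's running sum.
    newCentroids.col(owner) += data.col(i);
    counts[owner]++;
  }

  // Finish step 2: each centroid becomes the mean of its assigned points.
  for (size_t j = 0; j < k; ++j)
    if (counts[j] > 0)
      centroids.col(j) = newCentroids.col(j) / (double) counts[j];
}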

However, the number of iterations is not bounded unless the practitioner manually sets a maximum, and k-means is not guaranteed to converge to the global best clustering. Despite these shortcomings, in practice k-means tends to quickly converge to reasonable solutions.

Even so, there is no shortage of techniques for improving the clusters k-means converges to: refinement of initial centroid selection [218] and weighted sampling of initial centroids

[219] are just two of the many popular existing strategies.

There are also a number of methods for accelerating the runtime of a single iteration of k-means. In general, these ideas use the triangle inequality to prune work during the assignment step. Pelleg and Moore [39] build a kd-tree on the set S in order to rule out certain centroids for entire kd-tree nodes. Elkan [220] describes an algorithm which constructs an O(k² + kN)-size data structure to store between-cluster distances and bounds;

Hamerly [221] proposes his own improvement to Elkan’s algorithm. These algorithms have been shown to provide massive speedup—however, the quadratic scaling in k of Hamerly and Elkan’s algorithms makes them problematic for large k.

In the development of this fast dual-tree k-means algorithm, we first show the relevance

of the large k case; then, we observe that a tree can also be built on the k clusters, and then

a dual-tree algorithm [28, 61] can be used to efficiently perform an exact single iteration

of k-means clustering. The dual-tree algorithm developed here is independent of the type

of tree used and therefore can be used with not only kd-trees, but also metric trees, cover

trees, and other types of space trees [28]. When cover trees are used, we use adaptive

runtime analysis techniques to show a worst-case runtime bound of O(N + k log k); this

bound depends on properties of the dataset. To our knowledge, these are the first sub-O(kN) worst-case runtime bounds for an exact Lloyd iteration. Empirical results indicate

that our algorithm is the best in its intended scenario: the large k and large N case.

7.8.2 Scaling k-means

Although the original publications on k-means only applied the algorithm to a maximum

dataset size of 760 points, the half-century of relentless progress since then has seen dataset

sizes scale into the billions. Due to its simplicity, though, k-means has remained relevant,

and is still applied in numerous large-scale applications.

In cases where N scales but k remains small, a good choice of algorithm is a sampling algorithm, which will return an approximate clustering. One sampling technique, coresets, can produce good clusterings for N in the millions using several hundred or a few thousand points [222]. However, for large k, the number of samples required to produce good clusterings can become prohibitive.

For large k, then, we turn to an alternative approach: accelerating exact Lloyd iterations. Existing techniques include the naive linear scan implementation implied in the previous section, the blacklist algorithm [39], Elkan’s algorithm [220], and Hamerly’s algorithm [221]. The blacklist algorithm builds a kd-tree on the dataset and, while the tree is traversed, blacklists individual clusters that cannot be the closest cluster (the owner) of

Table 24: Runtime and memory bounds for k-means algorithms.

Algorithm   Setup        Worst-case          Memory
naive       n/a          O(kN)               O(k + N)
blacklist   O(N log N)   O(kN)               O(k log N + N)
elkan       n/a          O(k² + kN)          O(k² + kN)
hamerly     n/a          O(k² + kN)          O(k + N)
dualtree    O(N log N)   O(k log k + N)¹⁷    O(k + N)

any descendant points of a node. Elkan’s algorithm maintains an upper bound and a lower bound on the distance between each point and centroid; Hamerly’s algorithm is a simplification of this technique which uses less memory. Table 24 shows setup costs, worst-case per-iteration runtimes, and memory usage of each of these algorithms as well as the proposed dual-tree algorithm¹⁷. The expected runtime of the blacklist algorithm is, under some assumptions, roughly O(k + k log N + N) per iteration. The expected runtime of Hamerly’s and Elkan’s algorithms is O(k² + αN), where α is the expected number of clusters visited by each point (in both Elkan’s and Hamerly’s results, α seems to be small).

However, none of these algorithms are specifically tailored to the large k case, and the

large k case is common. Pelleg and Moore [39] report several hundred clusters in a subset

of 800k objects from the SDSS dataset. Clusterings for n-body simulations on astronomical

data often involve several thousand clusters [223]. Csurka et al. [224] extract vocabularies

from image sets using k-means with k ∼ 1000. Coates et al. [225] show that k-means can work surprisingly well for unsupervised feature learning for images, using k as large as

4000 on 50000 images. Also, in text mining, datasets can have up to 18000 unique labels

[226]. Can and Ozkarahan [227] suggest that the number of clusters in text data is directly related to the size of the vocabulary, suggesting k ∼ mN/t where m is the vocabulary size, N is the number of documents, and t is the number of nonzero entries in the term matrix.

Further, in vector quantization codebook generation, for which k-means is sometimes used, k may be in the tens of thousands [225].

17: The worst-case runtime bound for the dual-tree algorithm also depends on some assumptions on dataset-dependent constants. This is detailed further in Section 7.5.6.

Thus, it is important to have an algorithm with favorable scaling properties for both large k and N.

7.8.3 The blacklist algorithm and trees

The blacklist algorithm is a single-tree algorithm: one tree (the reference tree) is built on the dataset, and then that tree is traversed in order to assign each point to its nearest cluster centroid. We know from Chapter 2 that there are numerous single-tree algorithms to solve a whole host of problems, and we also know that most of these single-tree algorithms have related dual-tree algorithms.

Following the empirical success of the blacklist algorithm, then, it is only natural to build a tree on the data points. Tree-building is (generally) a one-time O(N log N) cost and

for large N and/or k, the cost of tree building is often negligible compared to the time it

takes to perform the clustering. We may also build a tree on the k centroids (the query tree)

which will allow us to rule out many centroids for many points at once.

Due to the tree-independent dual-tree algorithm abstraction introduced in Chapter 3,

we know that to describe a dual-tree algorithm we only need to provide a BaseCase()

and Score() function. Then, we may use any tree and any traversal in order to create a

working dual-tree algorithm.

The two types of trees we will explicitly consider for dual-tree k-means are the kd-tree

[31] and the cover tree [57], but it should be remembered that the algorithm as provided is

sufficiently general to work with any other type of tree. As with everything in this thesis,

notation is standardized according to Section 3.4 and Table 1.

7.8.4 Pruning strategies

All of the existing accelerated k-means algorithms operate by avoiding unnecessary work

via the use of pruning strategies.

A first observation is that the first step of a Lloyd iteration—assign each point pi ∈ S to the cluster with the nearest centroid—is exactly nearest neighbor search, with the set of

query points equal to S and the set of reference points equal to the set of centroids. Thus, it is possible to use dual-tree nearest neighbor search as a black box. However, we may implement more complex pruning strategies which can give significantly more speedup.
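As an aside, the black-box approach is easy to realize with mlpack's existing dual-tree nearest neighbor search; a minimal sketch is given below. The Search() interface shown follows recent mlpack releases and should be treated as an assumption, and this of course only covers the assignment step, with none of the k-means-specific pruning developed in this section.

#include <mlpack/core.hpp>
#include <mlpack/methods/neighbor_search/neighbor_search.hpp>

using namespace mlpack;

// Assignment step of one Lloyd iteration as a black-box dual-tree nearest
// neighbor search: the centroids are the reference set and the points are
// the query set, with k = 1.
void assignClusters(const arma::mat& data,
                    const arma::mat& centroids,
                    arma::Row<size_t>& assignments)
{
  neighbor::NeighborSearch<neighbor::NearestNeighborSort,
                           metric::EuclideanDistance> search(centroids);

  arma::Mat<size_t> neighbors;
  arma::mat distances;
  search.Search(data, 1, neighbors, distances);

  // neighbors has a single row: the nearest centroid for each point.
  assignments = neighbors.row(0);
}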

We will base our algorithm on dual-tree nearest neighbor search with S as the query points and the centroids as the reference points. Below are the four pruning strategies we will pursue:

Strategy one. When visiting a particular combination (Nq, Nr) (with Nq holding points in the dataset and Nr holding centroids), the combination should be pruned if every descendant centroid in Nr can be shown to own none of the points in Nq. If we have cached an upper bound ub(Nq) on the distance between any descendant point of Nq and its nearest cluster centroid that satisfies

ub(Nq) ≥ max_{pq ∈ Dq^p} d(pq, cq),   (173)

where cq is the cluster centroid nearest to point pq, the node Nr can contain no centroids that own any descendant points of Nq if

dmin(Nq, Nr) > ub(Nq). (174)

This relation bears similarity to the pruning rules for nearest neighbor search [28] and max-kernel search [79]. A graphical depiction of a situation where Nr can be pruned is given in Figure 41a; in this case, ball-shaped tree nodes are used, and the upper bound ub(Nq) is set to dmax(Nq, Nr2).

Strategy two. The recursion down a particular branch of the query tree should terminate early if we can determine that only one cluster can possibly own all of the descendant points of that branch. This is related to the first strategy. If we have been caching the number of pruned centroids (call this pruned(Nq)), as well as the identity of any arbitrary non-pruned centroid (call this closest(Nq)), then if pruned(Nq) = k − 1, we may conclude that the centroid closest(Nq) is the owner of all descendant points of Nq, and there is no

Figure 41: Different pruning situations. (a) Nr can be pruned. (b) pq's owner cannot change. (c) pq's owner can change.

need for further recursion in Nq.

Strategy three. The traversal should not visit nodes whose owner could not have possibly changed between iterations; that is, the tree should be coalesced before traversal to include only nodes whose owners may have changed.

There are two easy ways to use the triangle inequality to show that the owner of a point cannot change between iterations. Figures 41b and 41c show the first: assume that we have a point pq with owner c j and second-closest centroid ck. Between iterations, each centroid

will move when it is recalculated; define the distance that centroid ci has moved as mi.

Using these quantities, we may bound the distances for the next iteration: d(pq, c j)+m j is an

upper bound on the distance between pq and its owner next iteration, and d(pq, ck)−maxi mi

is a lower bound on the distance between pq and its second closest centroid next iteration. We may use these bounds to conclude that if

d(pq, cj) + mj < d(pq, ck) − max_i mi,   (175)

then the owner of pq next iteration must be cj. Now, let us generalize this from individual points pq to tree nodes Nq. This pruning strategy can only be used when all descendant points of Nq are owned by a single centroid, and in order to perform the prune, we need to establish a lower bound on the distance between any descendant point of the node Nq and the second closest centroid. Call this bound lb(Nq). Now, remember that ub(Nq)

provides an upper bound on the distance between any descendant point of Nq and its nearest

centroid. Then, if all descendant points of Nq are owned by some cluster c j in one iteration, and

ub(Nq) + mj < lb(Nq) − max_i mi,   (176)

then Nq is owned by cluster cj in the next iteration. Implementationally, it is convenient to have lb(Nq) store a lower bound on the distance between any descendant point of Nq and the nearest pruned centroid. Then, if Nq is entirely owned by one cluster, all other centroids are pruned, and lb(Nq) holds the necessary lower bound for pruning according to the rule above.

The second way to use the triangle inequality to show that an owner cannot change depends on the distances between centroids. Suppose that pq is owned by c j at the current iteration; then, if

d(pq, cj) + mj < min_{ci ∈ C, ci ≠ cj} d(ci, cj) / 2,   (177)

then cj will own pq next iteration [220]. We may adapt this rule to tree nodes Nq in the same way as the previous rule; if Nq is owned by cluster cj during this iteration and

ub(Nq) + mj < min_{ci ∈ C, ci ≠ cj} d(ci, cj) / 2,   (178)

then Nq is owned by cluster cj in the next iteration. Note that the above rules do work with individual points pq instead of nodes Nq if we have a valid upper bound ub(pq) and

a valid lower bound lb(pq). Any nodes or points that satisfy the above conditions do not need to be visited during the next iteration, and thus can be removed from the tree for the next iteration.
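To illustrate strategy three at the level of a single point, a small sketch of the owner-can-change check is given below; all names are hypothetical, and the inter-centroid distances are assumed to be computed between the already-updated centroids, as in the discussion above.

#include <algorithm>

// Can the owner of a single point change next iteration?  upperBound is
// ub(p_q) (distance to the current owner c_j), lowerBound is lb(p_q)
// (distance to the second closest centroid), ownerMovement is m_j,
// maxMovement is max_i m_i, and halfMinOwnerSeparation is
// min_{c_i != c_j} d(c_i, c_j) / 2.
bool ownerCanChange(const double upperBound,
                    const double lowerBound,
                    const double ownerMovement,
                    const double maxMovement,
                    const double halfMinOwnerSeparation)
{
  // Adjust the bounds so that they remain valid for the next iteration.
  const double nextUpperBound = upperBound + ownerMovement;
  // Both adjusted quantities are valid lower bounds, so take the larger
  // (tighter) of the two.
  const double nextLowerBound = std::max(lowerBound - maxMovement,
                                         halfMinOwnerSeparation);

  // If the upper bound stays below the lower bound, no other centroid can
  // become closer than the current owner, and the point may be pruned from
  // the next traversal.
  return !(nextUpperBound < nextLowerBound);
}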

Strategy four. The traversal should use bounding information from previous iterations; for instance, ub(Nq) should not be reset to ∞ at the beginning of each iteration. Between iterations, we may update ub(Nq), ub(pq), lb(Nq), and lb(pq) according to the following rules:

ub(Nq) ← ub(Nq) + mj         if Nq is owned by a single cluster cj,
ub(Nq) ← ub(Nq) + max_i mi   if Nq is not owned by a single cluster,   (179)

ub(pq) ← ub(pq) + mj,   (180)

lb(Nq) ← lb(Nq) − max_i mi,   (181)

lb(pq) ← lb(pq) − max_i mi.   (182)

Note that special handling is required when descendant points of Nq are not owned by a single centroid (Equation 179). It is also true that for a child node Nc of Nq, ub(Nq) is a

valid upper bound for Nc and lb(Nq) is a valid lower bound for Nc: that is, the upper and lower bounds may be taken from a parent, and they are still valid.
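A compact sketch of these per-node bound updates is given below; the NodeBounds structure and its fields are purely illustrative bookkeeping, not the tree interface used elsewhere in this thesis.

#include <cstddef>
#include <vector>

// Per-node bookkeeping for strategy four; field names are illustrative.
struct NodeBounds
{
  double ub;           // ub(N_q): upper bound to the nearest centroid.
  double lb;           // lb(N_q): lower bound to the nearest pruned centroid.
  bool singleOwner;    // true if all descendant points share one owner.
  size_t owner;        // index j of that owner, if singleOwner is true.
};

// Carry the bounds of one node from this iteration to the next
// (Equations 179 and 181); movements[i] holds m_i and maxMovement is
// max_i m_i.
void carryBoundsToNextIteration(NodeBounds& bounds,
                                const std::vector<double>& movements,
                                const double maxMovement)
{
  // Equation 179: the upper bound grows by the owner's movement when the
  // owner is unique, and by the largest movement otherwise.
  if (bounds.singleOwner)
    bounds.ub += movements[bounds.owner];
  else
    bounds.ub += maxMovement;

  // Equation 181: the lower bound shrinks by the largest centroid movement.
  bounds.lb -= maxMovement;
}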

7.8.5 The dual-tree k-means algorithm

These four pruning strategies lead to a high-level k-means algorithm, described in Algo-

rithm 38. During the course of this algorithm, to implement each of our pruning strategies,

we will need to maintain the following quantities:

• ub(Nq): an upper bound on the distance between any descendant point of a node Nq and the nearest centroid to that point.

• lb(Nq): a lower bound on the distance between any descendant point of a node Nq and the nearest pruned centroid.

• pruned(Nq): the number of centroids pruned during traversal for Nq.

• closest(Nq): if pruned(Nq) = k − 1, this holds the owner of all descendant points of

Nq.

• canchange(Nq): whether or not Nq can change owners next iteration.

• ub(pq): an upper bound on the distance between point pq and its nearest centroid.

Algorithm 38 High-level outline of dual-tree k-means.
1: Input: dataset S ∈ R^(N×d), initial centroids C ∈ R^(k×d).
2: Output: converged centroids C.
3: T ← a tree built on S
4: while centroids C not converged do
5:   {Remove nodes in the tree whose nearest cluster is already known.}
6:   T ← CoalesceNodes(T)
7:   Tc ← a tree built on C
8:   {Call dual-tree algorithm for finding nearest clusters.}
9:   Perform a dual-tree recursion with T, Tc, BaseCase(), and Score().
10:  {Restore the tree to its non-coalesced form.}
11:  T ← DecoalesceNodes(T)
12:  {Update centroids and bounding information in the tree.}
13:  C ← UpdateCentroids(T)
14:  T ← UpdateTree(T)
15: return C

• lb(pq): a lower bound on the distance between point pq and its second nearest cen- troid.

• closest(pq): the closest centroid to pq (this is also the owner of pq).

• canchange(pq): whether or not pq can change owners next iteration.

At the beginning of the algorithm, each upper bound is initialized to ∞, each lower

bound is initialized to ∞, pruned(·) is initialized to 0 for each node, and closest(·) is initial- ized to an invalid centroid for each cluster and point. canchange(·) is initialized to true for each node and point. Because of this, line 6 will do nothing on the first iteration.

7.8.5.1 BaseCase() and Score()

First, consider the dual-tree algorithm called on line 9. As detailed earlier, we can describe a dual-tree algorithm as a combination of tree type, traversal type, and point-to- point BaseCase() and node-to-node Score() functions. Therefore, we need only present

BaseCase() (Algorithm 39) and Score() (Algorithm 40)18. The BaseCase() function is

18 In these algorithms, we assume that any point present in a node Ni will also be present in at least one

Algorithm 39 BaseCase() for dual-tree k-means.
1: Input: query point pq, reference centroid cr
2: Output: distance between pq and cr
3: if d(pq, cr) < ub(pq) then
4:   lb(pq) ← ub(pq)
5:   ub(pq) ← d(pq, cr)
6:   closest(pq) ← cr
7: else if d(pq, cr) < lb(pq) then
8:   lb(pq) ← d(pq, cr)
9: return d(pq, cr)

Algorithm 40 Score() for dual-tree k-means.
1: Input: query node Nq, reference node Nr
2: Output: score for node combination (Nq, Nr), or ∞ if the combination can be pruned
3: {Update the number of pruned nodes, if necessary.}
4: if Nq not yet visited and Nq is not the root node then
5:   pruned(Nq) ← pruned(parent(Nq))
6:   lb(Nq) ← lb(parent(Nq))
7:   if pruned(Nq) = k − 1 then return ∞
8: s ← dmin(Nq, Nr)
9: c ← any descendant cluster centroid of Nr
10: if dmin(Nq, Nr) > ub(Nq) then
11:   {This cluster node can own no points in this query node.}
12:   if dmin(Nq, Nr) < lb(Nq) then
13:     {We may improve the lower bound for pruned nodes.}
14:     lb(Nq) ← dmin(Nq, Nr)
15:   pruned(Nq) ← pruned(Nq) + |Dr^p \ {clusters not pruned}|
16:   s ← ∞
17: else if dmax(Nq, c) < ub(Nq) then
18:   {We may improve the upper bound.}
19:   ub(Nq) ← dmax(Nq, c)
20:   closest(Nq) ← c
21: {Check if all clusters (except one) are pruned.}
22: if pruned(Nq) = k − 1 then return ∞
23: return s

straightforward: given a point pq and a centroid cr, the distance d(pq, cr) is calculated, and ub(pq), lb(pq), and closest(pq) are updated if necessary. Score(), however, is significantly

child Nc ∈ Ci. It is possible to fully generalize to any tree type, but the exposition is significantly more complex, and our assumption covers most standard tree types anyway.

more complex.

The first stanza (lines 4–6) takes the values of pruned(·) and lb(·) from the parent node of

Nq; this is necessary because centroids may have been pruned before the node combination

(Nq, Nr) was visited, and if pruned(·) and lb(·) are not taken from the parent node, then pruned(·) may undercount and lb(·) may be too tight. Next, we prune if the owner of Nq is already known (line 7). If the minimum distance between any descendant point of Nq and any descendant centroid of Nr is greater than the current upper bound for Nq, then we may prune the combination (line 16). In addition, this gives us an opportunity to improve the lower bound (line 14). Note the special handling in line 15: our definition of tree allows points to be held in more than one node; thus, we must avoid double-counting clusters that we prune. The details are left to the implementation19. If the node combination cannot be

pruned in this way, an attempt is made to update the upper bound (lines 17–20). Instead

of using dmax(Nq, Nr) (the maximum possible distance between any descendant point of

Nq and any descendant centroid of Nr), we may use a tighter upper bound: select any

descendant centroid c from Nr and use dmax(Nq, c). This still provides a valid upper bound,

and in practice is generally smaller than dmax(Nq, Nr). We may simply set closest(Nq) to c (line 20): remember that closest(Nq) only holds the owner of Nq if all centroids except one are pruned—in which case the owner must be c.

Thus, at the end of the dual-tree algorithm, we know the owner of every node (if it exists) via closest(·) and pruned(·), and we know the owner of every point via closest(·).
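In terms of the mlpack abstractions of Chapter 4, these two functions slot into a rules class satisfying the RuleType policy; a heavily simplified skeleton is sketched below. The class name, its members, and the omitted bound bookkeeping are illustrative assumptions and do not reproduce the actual mlpack implementation; only the overall shape of the BaseCase()/Score()/Rescore() interface is the point here.

#include <mlpack/core.hpp>

// Skeleton of a RuleType-style rules class for dual-tree k-means; the bound
// maintenance of Algorithms 39 and 40 is elided and the names are
// illustrative.
template<typename MetricType, typename TreeType>
class DualTreeKMeansRules
{
 public:
  DualTreeKMeansRules(const arma::mat& dataset,
                      const arma::mat& centroids,
                      MetricType& metric) :
      dataset(dataset), centroids(centroids), metric(metric) { }

  // BaseCase(): Algorithm 39.  Compute d(p_q, c_r) and (not shown here)
  // update ub(p_q), lb(p_q), and closest(p_q).
  double BaseCase(const size_t queryIndex, const size_t referenceIndex)
  {
    return metric.Evaluate(dataset.col(queryIndex),
                           centroids.col(referenceIndex));
  }

  // Score(): Algorithm 40.  Return the minimum distance between the nodes;
  // the pruned-centroid counting, bound updates, and the DBL_MAX prune of
  // strategy one are omitted in this sketch.
  double Score(TreeType& queryNode, TreeType& referenceNode)
  {
    return queryNode.MinDistance(referenceNode);
  }

  // Rescore(): prune with a cached score if possible; here it simply keeps
  // the old score.
  double Rescore(TreeType& /* queryNode */,
                 TreeType& /* referenceNode */,
                 const double oldScore) const
  {
    return oldScore;
  }

 private:
  const arma::mat& dataset;
  const arma::mat& centroids;
  MetricType& metric;
};

With a rules class of this shape, line 9 of Algorithm 38 amounts to constructing the rules object and handing it to whatever traverser the chosen tree provides (for instance, the nested DualTreeTraverser of mlpack's tree types), with the tree built on the points as the query tree and the tree built on the centroids as the reference tree.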

7.8.5.2 Updating the centroids with UpdateCentroids()

After running the dual-tree algorithm with BaseCase() and Score(), the centroids must

be updated. A simple algorithm to do this is given in Algorithm 41: it is a depth-first

recursion through the tree that terminates a branch when a node is owned by a single cluster.

We first initialize the new centroids C to zero; then, starting at the root node, we check

19For trees like the kd-tree and the metric tree, which do not hold points in more than one node, no special handling is required: we will never prune a cluster twice for a given query node Nq.

Algorithm 41 UpdateCentroids() for dual-tree k-means.
1: Input: tree T built on dataset S
2: Output: new centroids C
3: C := {c0, ..., c(k−1)} ← 0^(k×d)
4: n ← 0^k
5: {A depth-first recursion to calculate centroids. s is a stack.}
6: s ← {root(T)}
7: while |s| > 0 do
8:   Ni ← s.pop()
9:   if pruned(Ni) = k − 1 then
10:    {The node is entirely owned by a cluster.}
11:    j ← index of closest(Ni)
12:    cj ← cj + |Di^p| centroid(Ni)
13:    nj ← nj + |Di^p|
14:  else
15:    {The node is not entirely owned by a cluster. Recurse.}
16:    if |Ci| > 0 then s.push(Ci)
17:    else
18:      for pi ∈ Pi not yet considered
19:        j ← index of closest(pi)
20:        cj ← cj + pi
21:        nj ← nj + 1
22: for ci ∈ C, if ni > 0 then ci ← ci/ni
23: return C

if the node is owned by a single cluster. If so (lines 9–13), then the centroid of the node multiplied by the number of descendants is added to the right centroid and the counts for that cluster are updated (lines 12 and 13). Otherwise, we add all the children of the node to the stack, and then add the contributions of each point which has not yet been considered

(lines 18–21).

7.8.5.3 Updating the tree with UpdateTree()

The next step is updating the bounds in the tree and determining if nodes and points can change owners next iteration; this work is encapsulated in the UpdateTree() algorithm.

Essentially, this is an implementation of Strategies 3 and 4. Unfortunately, though, this yields a particularly complex recursive algorithm, given in Algorithms 42 and 43 (it is too long for one page). At the end of this algorithm, canchange(·) is set correctly for every

Algorithm 42 UpdateTree() for dual-tree k-means.
1: Input: Ni, ub(·), lb(·), pruned(·), closest(·), canchange(·), centroid movements m
2: Output: updated ub(·), lb(·), pruned(·), canchange(·)
3: canchange(Ni) ← true
4: if Ni has a parent and canchange(parent(Ni)) = false then
5:   {Use the parent's bounds.}
6:   closest(Ni) ← closest(parent(Ni))
7:   j ← index of closest(Ni)
8:   ub(Ni) ← ub(Ni) + mj
9:   lb(Ni) ← lb(Ni) − max_i mi
10:  canchange(Ni) ← false
11: else if pruned(Ni) = k − 1 then
12:  {Ni is owned by a single cluster. Can that owner change next iteration?}
13:  j ← index of closest(Ni)
14:  ub(Ni) ← ub(Ni) + mj
15:  lb(Ni) ← max( lb(Ni) − max_i mi, min_{k≠j} d(ck, cj)/2 )
16:  if ub(Ni) < lb(Ni) then
17:    {The owner cannot change next iteration.}
18:    canchange(Ni) ← false
19:  else
20:    {Tighten the upper bound and try to prune again.}
21:    ub(Ni) ← min( ub(Ni), dmax(Ni, cj) )
22:    if ub(Ni) < lb(Ni) then canchange(Ni) ← false
23: else
24:  j ← index of closest(Ni)
25:  ub(Ni) ← ub(Ni) + mj
26:  lb(Ni) ← lb(Ni) − max_k mk
27: {Recurse into each child.}
28: for each child Nc of Ni, call UpdateTree(Nc)
29: {The function is too long for one page...}
30: call UpdateTreePartTwo(Ni)

point and node.

The first if statement (lines 4–10) catches the case where the parent cannot change owner next iteration; in this case, the parent’s upper bound and lower bound can be taken as valid bounds. In addition, the upper and lower bounds are adjusted to account for cluster movement between iterations, so that the bounds are valid for next iteration.

Algorithm 43 UpdateTreePartTwo() for dual-tree k-means.
1: Input: Ni, ub(·), lb(·), pruned(·), closest(·), canchange(·), centroid movements m
2: Output: updated ub(·), lb(·), pruned(·), canchange(·)
3: {Try to determine points whose owner cannot change if Ni can change owners.}
4: if canchange(Ni) = true then
5:   for pi ∈ Pi do
6:     j ← index of closest(pi)
7:     ub(pi) ← ub(pi) + mj
8:     lb(pi) ← min( lb(pi) − max_k mk, min_{k≠j} d(ck, cj)/2 )
9:     if ub(pi) < lb(pi) then
10:      canchange(pi) ← false
11:    else
12:      {Tighten the upper bound and try again.}
13:      ub(pi) ← min( ub(pi), d(pi, cj) )
14:      if ub(pi) < lb(pi) then
15:        canchange(pi) ← false
16:      else
17:        {Point cannot be pruned.}
18:        ub(pi) ← ∞
19:        lb(pi) ← ∞
20: else
21:  for pi ∈ Pi where canchange(pi) = false do
22:    {Maintain upper and lower bounds for points whose owner cannot change.}
23:    j ← index of closest(pi)
24:    ub(pi) ← ub(pi) + mj
25:    lb(pi) ← lb(pi) − max_k mk
26: if canchange(·) = false for all children Nc of Ni and all points pi ∈ Pi then
27:  canchange(Ni) ← false
28: if canchange(Ni) = true then
29:  pruned(Ni) ← 0

If the node Ni has an owner, the algorithm then attempts to use the pruning rules established in Equations 176 and 178 to determine if the owner of Ni can change next iteration.

If not, canchange(Ni) is set to false (line 18). On the other hand, if the pruning check fails, the upper bound is tightened and the pruning check is performed a second time. It

is worth noting that dmax(Ni, c j) may not actually be less than the current value of ub(Ni), which is why the min is necessary.

After recursing into the children of Ni, if Ni could have an owner change, each point

209 is individually checked using the same approach (lines 4–20). However, there is a slight difference: if a point’s owner can change, the upper and lower bounds must be set to ∞

(lines 18–19). This is only necessary with points; BaseCase() does not take bounding information from previous iterations into account, because no work can be avoided in that way.

Then, we may set canchange(Ni) to false if every point in Ni and every child of Ni cannot change owners (and the points and nodes do not necessarily have to have the same owner). Otherwise, we must set pruned(Ni) to 0 for the next iteration.

7.8.5.4 Coalescing (and decoalescing) the tree

Once UpdateTree() sets the correct value of canchange(·) for every point and node, we may coalesce the tree for the next iteration with the CoalesceTree() function. Coalescing the tree is straightforward: essentially, we simply remove any nodes from the tree where canchange(·) is false. This leaves us with a smaller tree that has no nodes where canchange(·) is false.

This can be accomplished via a single pass over the tree. A simple implementation is given in Algorithm 44. DecoalesceTree() may be implemented by simply restoring a pristine copy of the tree which was cached right before CoalesceTree() is called.

7.8.6 Theoretical results

In this subsection, we show theoretical results for the dual-tree k-means algorithm, includ- ing a correctness proof, bounds on per-iteration runtime, and memory bounds. We will start with the (quite complex) correctness proof.

7.8.6.1 Correctness of the algorithm

We will individually prove the correctness of various pieces of the dual-tree k-means algo- rithm, and then we will prove the main correctness result. Correctness is proven not just for a certain type of tree but for all types of trees that satisfy the definition of space tree and all types of traversals that satisfy the definition of pruning dual-tree traversal as given

Algorithm 44 CoalesceTree() for dual-tree k-means.
1: Input: tree T
2: Output: coalesced tree T
3: {A depth-first recursion to hide nodes where canchange(·) is false.}
4: s ← {root(T)}
5: while |s| > 0 do
6:   Ni ← s.pop()
7:   {Special handling is required for leaf nodes and the root node.}
8:   if |Ci| = 0 then
9:     continue
10:  else if Ni is the root node then
11:    for Nc ∈ Ci do
12:      s.push(Nc)
13:  {See if children can be removed.}
14:  for Nc ∈ Ci do
15:    if canchange(Nc) = false then
16:      remove child Nc
17:    else
18:      s.push(Nc)
19:  {If only one child is left, then this node is unnecessary.}
20:  if |Ci| = 1 then
21:    add child to parent(Ni)
22:    remove Ni from parent(Ni)'s children
23: return T

in Chapter 3.

Lemma 7. A pruning dual-tree traversal which uses BaseCase() as given in Algorithm

39 and Score() as given in Algorithm 40, and which starts with valid ub(·), lb(·), pruned(·), and closest(·) for each node Ni ∈ T and ub(pq) = lb(pq) = ∞ for each point pq ∈ S, will satisfy the following conditions upon completion:

• For every pq ∈ S that is a descendant of a node Ni that has been pruned (that is,

pruned(Ni) = k − 1), ub(Ni) is an upper bound on the distance between pq and its

closest centroid, and closest(Ni) is the owner of pq.

• For every pq ∈ S that is not a descendant of any node that has been pruned, ub(pq) is

an upper bound on the distance between pq and its closest centroid, and closest(pq)

is the owner of pq.

• For every pq ∈ S that is a descendant of a node Ni that has been pruned (that is,

pruned(Ni) = k − 1), lb(Ni) is a lower bound on the distance between pq and its second closest centroid.

• For every pq ∈ S that is not a descendant of any node that has been pruned, the

quantity min(lb(pq), lb(Nq)) where Nq is a node such that pq ∈ Pq is a lower bound

on the distance between pq and its second closest centroid.

Proof. It is easiest to consider each condition individually. Thus, we will first consider the upper bound on the distance to the closest cluster centroid. Consider some pq and suppose

that the closest cluster centroid to pq is c∗.

Now, suppose first that the point pq is a descendant point of a node Nq that has been

pruned. We must show, then, that c∗ is closest(Nq). Take R = {Nr0, Nr1, ..., Nrj} to be the set of reference nodes visited during the traversal with Nq as a query node; that is, the combinations (Nq, Nri) were visited for all Nri ∈ R. Any Nri is pruned only if

dmin(Nq, Nri) > ub(Ni) (183) according to line 10 of Score(). Thus, as long as ub(Ni) is a valid upper bound on the closest cluster distance for every descendant point in Nq, then no nodes are incorrectly pruned. It is easy to see that the upper bound is valid: initially, it is valid by assump- tion; each time the bound is updated with some node Nri (on lines 19 and 20), it is set to dmax(Ni, c) where c is some descendant centroid of Nri. This is clearly a valid upper bound,

since c cannot be any closer to any descendant point of Ni than c∗. We may thus conclude that no node is incorrectly pruned from R; we may apply this reasoning recursively to the

Nq’s ancestors to see that no reference node is incorrectly pruned.

When a node is pruned from R, the number of pruned clusters for Nq is updated: the count of all clusters not previously pruned by Nq (or its ancestors) is added. We cannot

double-count the pruning of a cluster; thus the only way that pruned(Nq) can be equal to k − 1 is if every centroid except one is pruned. The centroid which is not pruned will be the

nearest centroid c∗, regardless of whether closest(Nq) was set during this traversal or still holds

its initial value, and therefore it must be true that ub(Nq) is an upper bound on the distance

between pq and c∗, and closest(Nq) = c∗.

This allows us to finally conclude that if pq is a descendant of a node Nq that has been

pruned, then ub(Nq) contains a valid upper bound on the distance between pq and its closest

cluster centroid, and closest(Nq) is that closest cluster centroid.

Now, consider the other case, where pq is not a descendant of any node that has been

pruned. Take Ni to be any node containing pq²⁰. We have already reasoned that any cluster

centroid node that could possibly contain the closest cluster centroid to pq cannot have been pruned; therefore, by the definition of pruning dual-tree traversal, we are guaranteed that

BaseCase() will be called with pq as the query point and the closest cluster centroid as

the reference point. This will then cause ub(pq) to hold the distance to the closest cluster centroid—assuming ub(pq) is always valid, which it is even at the beginning of the traversal because it is initialized to ∞—and closest(pq) to hold the closest cluster centroid. Therefore, the first two conditions are proven. The third and fourth conditions, for the lower bounds, require a slightly different strategy.

There are two ways lb(Nq) is modified: first, at line 14, when a node combination is pruned, and second, at line 6 when the lower bound is taken from the parent. Again, consider the set R = {Nr0, Nr1,..., Nr j} which is the set of reference nodes visited during

the traversal with Nq as a query node. Call the set of reference nodes that were pruned R^p. At the end of the traversal, then,

20: Note that the meaning here is not that pq is a descendant of Ni (pq ∈ Di^p), but instead that pq is held directly in Ni: pq ∈ Pi.

lb(Nq) ≤ min_{Nri ∈ R^p} dmin(Nq, Nri)   (184)
       ≤ min_{ck ∈ C^p} dmin(Nq, ck),    (185)

where C^p is the set of centroids that are descendants of nodes in R^p. Applying this reasoning recursively to the ancestors of Nq shows that at the end of the dual-tree traversal, lb(Nq) will contain a lower bound on the distance between any descendant point of Nq and any pruned centroid. Thus, if pruned(Nq) = k − 1, then lb(Nq) will contain a lower bound on the distance between any descendant point in Nq and its second closest centroid. So if we consider some point pq which is a descendant of Nq and Nq is pruned (pruned(Nq) = k − 1), then lb(Nq) is indeed a lower bound on the distance between pq and its second closest centroid.

Now, consider the case where pq is not a descendant of any node that has been pruned, and take Nq to be some node that owns pq (that is, pq ∈ Pq). In this case, BaseCase() will be called with every centroid that has not been pruned. So lb(Nq) is a lower bound on the distance between pq and every pruned centroid, and lb(pq) will be a lower bound on the distance between pq and the second-closest non-pruned centroid, due to the structure of the BaseCase() function. Therefore, min(lb(pq), lb(Nq)) must be a lower bound on the distance between pq and its second closest centroid. Finally, we may conclude that each item in the theorem holds. 

Next, we must prove that UpdateTree() functions correctly.

Lemma 8. In the context of Algorithm 38, given a tree T with all associated bounds ub(·) and lb(·) and information pruned(·), closest(·), and canchange(·), a run of UpdateTree() as given in Algorithm 42 will have the following effects:

• For every node Ni, ub(Ni) will be a valid upper bound on the distance between any

descendant point of Ni and its nearest centroid next iteration.

214 • For every node Ni, lb(Ni) will be a valid lower bound on the distance between any

descendant point of Ni and any pruned centroid next iteration.

• A node Ni will only have canchange(Ni) = false if the owner of any descendant

point of Ni cannot change next iteration.

• A point pi will only have canchange(pi) = false if the owner of pi cannot change next iteration.

• Any point pi with canchange(pi) = true that does not belong to any node Ni with

canchange(Ni) = false will have ub(pi) = lb(pi) = ∞, as required by the dual-tree traversal.

• Any node Ni with canchange(Ni) = false at the end of UpdateTree() will have

pruned(Ni) = 0.

Proof. Each point is best considered individually. It is important to remember during this proof that the centroids have been updated, but the bounds have not. So any cluster centroid

ci is already set for next iteration. Take ci^l to mean the cluster centroid ci before adjustment (that is, the old centroid). Also take ub^l(·), lb^l(·), pruned^l(·), and canchange^l(·) to be the values at the time UpdateTree() is called, before any of those values are changed. Due to the assumptions in the statement of the lemma, each of these quantities is valid.

Suppose that for some node Ni, closest(Ni) is some cluster cj. For ub(Ni) to be valid for next iteration, we must guarantee that ub(Ni) ≥ max_{pq ∈ Dq^p} d(pq, cj) at the end of UpdateTree(). There are four ways ub(Ni) is updated: it may be taken from the parent and adjusted (line 8), it may be adjusted before a prune attempt (line 14), it may be tightened after a failed prune attempt (line 21), or it may be adjusted without a prune attempt (line 25). If we can show that each of these four ways always results in ub(Ni) being valid, then the first condition of the lemma holds.

If ub(Ni) is adjusted in line 14 or 25, the resulting value of ub(Ni), assuming that closest(Ni) = cj, is

ub(Ni) = ub^l(Ni) + mj                        (186)
       ≥ max_{pq ∈ Dq^p} d(pq, cj^l) + mj     (187)
       ≥ max_{pq ∈ Dq^p} d(pq, cj),           (188)

where the last step follows by the triangle inequality: d(cj^l, cj) = mj. Therefore those two

updates to ub(Ni) result in valid upper bounds for next iteration. If ub(Ni) is recalculated,

in line 21, then we are guaranteed that ub(Ni) is valid because

dmax(Ni, cj) ≥ max_{pq ∈ Dq^p} d(pq, cj).   (189)

We may therefore conclude that ub(Ni) is correct for the root of the tree, because line 8 can never be reached. Reasoning recursively, we can see that any upper bound passed from

the parent must be valid. Therefore, the first item of the lemma holds.

Next, we will consider the lower bound, using a similar strategy. We must show that

lb(Ni) ≤ min_{pq ∈ Dq^p} min_{cp ∈ C^p} d(pq, cp),   (190)

where C^p is the set of centroids pruned by Ni and ancestors during the last dual-tree traversal. The lower bound can be taken from the parent in line 10 and adjusted, it can be adjusted

before a prune attempt in line 15 or in a similar way without a prune attempt in line 26.

The last adjustment can easily be shown to be valid:

lb(Ni) = lb^l(Ni) − max_k mk                                         (191)
       ≤ min_{pq ∈ Dq^p} min_{cp ∈ C^p} ( d(pq, cp^l) − max_k mk )   (192)
       ≤ min_{pq ∈ Dq^p} min_{cp ∈ C^p} d(pq, cp),                   (193)

which follows by the triangle inequality: d(cp^l, cp) ≤ max_k mk. Line 15 is slightly more complex; we must also consider the term min_{k≠j} d(ck, cj)/2. Suppose that

min_{k≠j} d(ck, cj)/2 > lb^l(Ni) + max_k mk.   (194)

We may use the triangle inequality (d(pq, ck) ≤ d(cj, ck) + d(pq, cj)) to show that if this is true, the second closest centroid ck is such that d(pq, ck) > 2d(ck, cj) and therefore min_{k≠j} d(ck, cj)/2 is also a valid lower bound. We can lastly use the same recursive argument from the upper bound case to show that the second item of the lemma holds.
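For readability, the triangle-inequality argument behind the d(ck, cj)/2 term can be written out explicitly; the following short derivation simply restates the reasoning above and introduces nothing new.

% For a point p_q owned by c_j, we have d(p_q, c_j) <= d(p_q, c_k) for any
% other centroid c_k, so:
\begin{align*}
d(c_k, c_j) &\le d(p_q, c_k) + d(p_q, c_j) && \text{(triangle inequality)} \\
            &\le 2\, d(p_q, c_k) && \text{(since $d(p_q, c_j) \le d(p_q, c_k)$)},
\end{align*}
% and therefore
\begin{equation*}
d(p_q, c_k) \;\ge\; \frac{d(c_k, c_j)}{2} \;\ge\; \frac{1}{2} \min_{c_i \ne c_j} d(c_i, c_j),
\end{equation*}
% so half the distance from the owner c_j to its nearest other centroid is a
% valid lower bound on the distance from p_q to its second closest centroid.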

Showing the correctness of canchange(Ni) is straightforward: we know that ub(Ni) and lb(Ni) are valid for next iteration by the time any check to set canchange(Ni) to false happens, due to the discussion above. The situations where canchange(Ni) is set to false, in lines 16 and 22, are simply applications of Equations 176 and 178, and are therefore valid.

There are two other ways canchange(Ni) can be set to false. The first is on line 10, and this is easily shown to be valid: if a parent’s owner cannot change, then a child’s owner cannot change either. The other way to set canchange(Ni) to false is in line 27. This is only possible if all points in Pi and all children of Ni have canchange(·) set to false; thus, no descendant point of Ni can change owner next iteration, and we may set canchange(Ni) to false.

Next, we must show that canchange(pi) = false only if the owner of pi cannot change

next iteration. If canchange^l(pi) = true, then due to Lemma 7, ub^l(pi) and lb^l(pi) will be valid bounds. In this case, we may use similar reasoning to show that ub(pi) and lb(pi) are valid, and then we may see that the pruning attempts at lines 9 and 14 are valid. Now,

consider the other case, where canchange^l(pi) = false. Then, ub^l(pi) and lb^l(pi) will not have been modified by the dual-tree traversal, and will hold the values set in the previous run of UpdateTree(). As long as those values are valid, then the fourth item holds.

The checks to see if canchange(pi) can be set to false (from lines 4 to 20) are only reached if canchange(Ni) is true. We already have shown that ub(pi) and lb(pi) are set

correctly in that stanza. The other case is if canchange(Ni) is false; in this case, lines 21 to 25 apply. It is easy to see, using similar reasoning to all previous cases, that these lines result in valid ub(pi) and lb(pi). Therefore, the fourth item does hold.

The fifth item is taken care of in lines 18 and 19. Given some point pi with canchange(pi) set to true, and where pi does not belong to any node Ni where canchange(Ni) = false, these two lines must be reached, and therefore the fifth item holds.

The last item holds trivially—any node Ni where canchange(Ni) = true will have pruned(Ni) set to 0 on line 29. 

Showing that the three auxiliary methods CoalesceTree(), DecoalesceTree(), and

UpdateCentroids() function correctly follows directly from the algorithm descriptions.

Therefore, we are ready to show the main correctness result.

Theorem 18. A single iteration of dual-tree k-means as given in Algorithm 38 will produce exactly the same results as the standard brute-force O(kN) implementation.

Proof. We may use the previous lemmas to flesh out our earlier proof sketch.

First, we know that the dual-tree algorithm (line 9) produces correct results for ub(·), lb(·), pruned(·), and closest(·) for every point and node, due to Lemma 7. Next, we know that UpdateTree() maintains the correctness of those four quantities and only marks canchange(·) to false when the node or point truly cannot change owner, due to Lemma 8.

Next, we know from earlier discussion that CoalesceTree() and DecoalesceTree() do not affect the results of the dual-tree algorithm because the only nodes and points removed are those where canchange(·) = false. We also know that UpdateCentroids() pro- duces centroids correctly. Therefore, the results from Algorithm 38 are identical to those of a brute-force O(kN) k-means implementation. 

7.8.6.2 Per-iteration runtime bound

Next, we consider the runtime of the algorithm, using adaptive algorithm analysis tech- niques in order to bound the per-iteration running time of Algorithm 38. In order to use

218 these techniques, this runtime bound assumes the use of the cover tree and the standard

pruning cover-tree traversal (see Algorithm 8).

These bounds are based on techniques and quantities introduced comprehensively in

Section 5.2 and in other works [57, 135, 133]; the results are with respect to the expansion

constant ck of the centroids, which is a measure of intrinsic dimension. cqk is a related quantity: the largest expansion constant of C plus any point in the dataset. Our results also depend on the imbalance of the tree it(T ), which in practice generally scales linearly in N [135].

We may first use the runtime bound result from nearest neighbor search (Section 7.1.4) to bound the running time of the dual-tree algorithm part of dual-tree k-means.

Theorem 19. The dual-tree k-means algorithm with BaseCase() as in Algorithm 39 and

Score() as in Algorithm 40, with a point set S q that has expansion constant cq and size

N, and k centroids C with expansion constant ck, takes no more than O(ck^4 cqk^5 (N + it(Tq))) time.

Proof. Both Score() and BaseCase() for dual-tree k-means can be performed in O(1) time. In addition, the pruning of Score() for dual-tree k-means is at least as tight as

Score() for nearest neighbor search: the pruning rule in Equation 176 is equivalent to the pruning rule for nearest neighbor search. Therefore, dual-tree k-means can visit no more nodes than nearest neighbor search would with query set S q and reference set C. Lastly, note that the range of pairwise distances of C will be entirely contained in the range of pairwise distances in S q, to see that we can use the result of Theorem 4. Adapting that result, then, yields the statement of the algorithm. 

The expansion constant of the centroids, ck, can be expected to behave similarly to the expansion constant cq of the dataset, because the centroids will arise from a similar distribution as the points. It is thus reasonable to assume ck does not scale with k, if it is already assumed that cq does not scale with N. It is also reasonable to assume cqk does not

219 scale with N or k, for the same reasons.

Next, we turn to bounding the entire algorithm.

Theorem 20. A single iteration of the dual-tree k-means algorithm on a dataset S q using the cover tree T , the standard cover tree pruning dual-tree traversal, BaseCase() as given in Algorithm 39, Score() as given in Algorithm 40, will take no more than

O(ck^4 cqk^5 (N + it(T)) + ck^9 k log k)   (195)

time, where ck is the expansion constant of the centroids, cqk is defined as in Theorem 19, and it(T) is the imbalance of the tree as defined in Definition 10.

Proof. Consider each of the steps of the algorithm individually:

• CoalesceNodes() can be performed in a single pass of the cover tree N , which takes O(N) time.

• Building a tree on the centroids (Tc) takes O(ck^6 k log k) time due to the result for cover tree construction time [57].

• The dual-tree algorithm takes O(ck^4 cqk^5 (N + it(T))) time due to Theorem 19.

• DecoalesceNodes() can be performed in a single pass of the cover tree N , which takes O(N) time.

• UpdateCentroids() can be performed in a single pass of the cover tree N , so it also takes O(N) time.

• UpdateTree() depends on the calculation of how much each centroid has moved; this costs O(k) time. In addition, we must find the nearest centroid of every centroid;

this is nearest neighbor search, and we may use the runtime bound for monochro-

matic nearest neighbor search for cover trees from [133], so this costs O(ck^9 k) time. Lastly, the actual tree update visits each node once and iterates over each point in

220 the node. Cover tree nodes only hold one point, so each visit costs O(1) time, and

with O(N) nodes, the entire update process costs O(N) time. When we consider the

preprocessing cost too, the total cost of UpdateTree() per iteration is O(ck^9 k + N).

We may combine these into a final result:

O(N) + O(ck^6 k log k) + O(ck^4 cqk^5 (N + it(T))) + O(N) + O(N) + O(ck^9 k + N),   (196)

and after simplification, we get the statement of the theorem:

O(ck^4 cqk^5 (N + it(T)) + ck^9 k log k).   (197)



Therefore, we see that under some assumptions on the data, we can bound the runtime of the dual-tree k-means algorithm to something tighter than O(kN) per iteration. As expected, we are able to amortize the cost of k across all N nodes, giving amortized O(1) search for the nearest centroid per point in the dataset. This is similar to the results for nearest neighbor search, which obtain amortized O(1) search for a single query point. Also similar to the results for nearest neighbor search is that the search time may, in the worst case, degenerate to O(kN + k²) when the assumptions on the dataset are not satisfied. However, empirical results [67, 61, 68, 57] show that well-behaved datasets are common in the real world, and thus degeneracy of the search time is uncommon.

Comparing this bound with the bounds for other k-means algorithms is somewhat difficult; first, none of the other algorithms have bounds which are adaptive to the characteristics of the dataset. It is possible that the blacklist algorithm could be refactored to use the cover tree, but even if that was done it is not completely clear how the running time could be bounded. How to apply the expansion constant to an analysis of Hamerly’s algorithm and

Elkan’s algorithm is also unclear at the time of this writing.

Lastly, the bound we have shown above is potentially loose. We have reduced dual-tree

k-means to the problem of nearest neighbor search, but our pruning rules are tighter. Dual-

tree nearest neighbor search assumes that every query node will be visited (this is where

the O(N) in the bound comes from), but dual-tree k-means can prune a query node entirely

if all but one cluster is pruned (Strategy 2). These bounds do not take this pruning strategy

into account, and they also do not consider the fact that coalescing the tree can greatly

reduce its size. These would be interesting directions for future theoretical work.

7.8.6.3 Memory usage bounds

Bounding the memory usage of dual-tree k-means is comparatively a walk in the park.

Theorem 21. Algorithm 38 uses no more than O(N +k) memory when cover trees are used.

Proof. This proof is straightforward. Because a cover tree on N points takes O(N) space,

holding the tree built on the points and all associated bounds takes O(N) space. Holding the tree built on the centroids takes O(k) space. The dataset takes O(N) space, and the centroids take O(k) space. Therefore, the theorem holds. 

7.8.7 Experiments

The next thing to consider is the empirical performance of the algorithm. The kmeans program in mlpack [87] implements each of the k-means algorithms we have discussed to this point. In our experiments, we run it as follows:

$ kmeans -i dataset.csv -I centroids.csv -c $k -v -e -a $algorithm

where $k is the number of clusters and $algorithm is one of a handful of choices:

• elkan: Elkan’s algorithm [220],

• hamerly: Hamerly’s algorithm [221],

• blacklist: Pelleg and Moore’s blacklist algorithm [39],

Table 25: Dataset information for dual-tree k-means.

                               tree build time
Dataset   N         d    kd-tree   cover tree
cloud     2048      10   0.001s    0.005s
cup98b    95413     56   1.640s    32.41s
birch3    100000    2    0.037s    2.125s
phy       150000    78   4.138s    22.99s
power     2075259   7    7.342s    1388s
lcdm      6000000   3    4.345s    6214s

• dualtree-kd: the dual-tree algorithm using kd-trees, and

• dualtree-ct: the dual-tree algorithm using cover trees.
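As a concrete usage example (with placeholder file names), the lcdm run with k = 5000 and the kd-tree dual-tree algorithm from Table 26 would correspond to an invocation along the lines of

$ kmeans -i lcdm.csv -I centroids.csv -c 5000 -v -e -a dualtree-kd

where lcdm.csv and centroids.csv stand in for the actual dataset and initial centroid files.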

Each algorithm is implemented in C++ in the same framework, so no algorithm has

an implementation quality advantage. We use a variety of k values on mostly real-world

datasets; details are shown in Table 25. These datasets are each from the UCI dataset

repository [134], with the exception of the (synthetic) birch3 dataset [145] and the LCDM

dataset [146]. Table 25 also contains the time taken to build a kd-tree (for blacklist and dualtree-kd) and a cover tree (for dualtree-ct). Cover tree construction is significantly more complex than kd-tree construction²¹; this accounts for the long cover tree build time. Even so, the tree only needs to be built once during the k-means run. If results are

required for multiple values of k—such as in the X-means algorithm [229]—then the tree

built on the points may be re-used.

Average runtime per iteration results are shown in Table 26. The amount of work that

is being pruned away is somewhat unclear from the runtime results, because the elkan and hamerly algorithms access points linearly and thus benefit from cache effects; this is not true of the tree-based algorithms. Therefore, the average number of distance calculations per iteration is also included in the results.

21: Izbicki and Shelton recently proposed a parallel cover tree construction algorithm which might be useful in this situation [228]. They also provide a few improvements to the cover tree construction algorithm which can make search faster for nearest neighbor search; whether or not those improvements would be useful here is as yet unknown, but in my personal opinion it's at least worth trying someday.

Table 26: Empirical results for k-means.

                           avg. per-iteration runtime (distance calculations)
dataset     k      iter.   elkan             hamerly            blacklist          dualtree-kd        dualtree-ct
cloud       3      8       1.50e-4s (867)    1.11e-4s (1.01k)   4.68e-5s (302)     1.27e-4s (278)     2.77e-4s (443)
cloud       10     14      2.09e-4s (1.52k)  1.92e-4s (4.32k)   1.55e-4s (2.02k)   3.69e-4s (1.72k)   5.36e-4s (2.90k)
cloud       50     19      5.87e-4s (2.57k)  5.30e-4s (21.8k)   8.20e-4s (12.6k)   1.23e-3s (5.02k)   1.09e-3s (9.84k)
cup98b      50     224     0.0445s (25.9k)   0.0557s (962k)     0.0409s (277k)     0.0955s (254k)     0.1089s (436k)
cup98b      250    168     0.1972s (96.8k)   0.4448s (8.40M)    0.2033s (1.36M)    0.4585s (1.38M)    0.3237s (2.73M)
cup98b      750    116     1.1719s (373k)    1.8778s (36.2M)    0.6365s (4.11M)    1.2847s (4.16M)    0.8056s (81.4M)
birch3      50     129     0.0194s (24.2k)   0.0093s (566k)     0.0030s (42.7k)    0.0082s (37.4k)    0.0378s (67.9k)
birch3      250    812     0.0895s (42.8k)   0.0314s (2.59M)    0.0164s (165k)     0.0183s (79.7k)    0.0485s (140k)
birch3      750    373     0.3253s (292k)    0.0972s (8.58M)    0.0554s (450k)     0.02989s (126k)    0.0581s (235k)
phy         50     34      0.0668s (82.3k)   0.1064s (1.38M)    0.0081s (33.0k)    0.02689s (67.8k)   0.0945s (188k)
phy         250    38      0.1627s (121k)    0.4634s (6.83M)    0.0249s (104k)     0.0398s (90.4k)    0.1023s (168k)
phy         750    35      0.7760s (410k)    2.9192s (43.8M)    0.2478s (1.19M)    0.2939s (1.10M)    0.3330s (1.84M)
covertype   50     405     0.2660s (180k)    0.1970s (3.13M)    0.1220s (747k)     0.1951s (419k)     0.4252s (656k)
covertype   100    455     0.4625s (224k)    0.5347s (9.71M)    0.2025s (1.15M)    0.3152s (754k)     0.5523s (1.31M)
covertype   500    1000    2.0546s (295k)    3.4966s (69.0M)    0.7583s (3.49M)    0.8989s (2.12M)    0.8890s (4.12M)
power       25     4       0.3872s (2.98M)   0.2880s (12.9M)    0.0301s (216k)     0.0950s (87.4k)    0.6658s (179k)
power       250    101     2.6532s (425k)    0.1868s (7.83M)    0.1504s (1.13M)    0.1354s (192k)     0.6405s (263k)
power       1000   870     out of memory     6.2407s (389M)     0.6657s (2.98M)    0.4115s (1.57M)    1.1799s (4.81M)
power       5000   504     out of memory     29.816s (1.87B)    4.1597s (11.7M)    1.0580s (3.85M)    1.7070s (12.3M)
power       15000  301     out of memory     111.74s (6.99B)    out of memory      2.3708s (8.65M)    2.9472s (30.9M)
lcdm        500    507     out of memory     6.4084s (536M)     0.9347s (4.20M)    0.7574s (3.68M)    2.9428s (7.03M)
lcdm        1000   537     out of memory     16.071s (1.31B)    2.0345s (5.93M)    0.9827s (5.11M)    3.3482s (10.0M)
lcdm        5000   218     out of memory     64.895s (5.38B)    12.909s (16.2M)    1.8972s (8.54M)    3.9110s (19.0M)
lcdm        20000  108     out of memory     298.55s (24.7B)    out of memory      4.1911s (17.8M)    5.5771s (43.2M)

It is immediately clear that for large datasets, the dualtree-kd algorithm is fastest, and the dualtree-ct algorithm is almost as fast. However, for small datasets, the pruning does not save enough work to offset the extra overhead of tree construction and tree traversal, so the speedup is limited.

The elkan algorithm, because it holds kN bounds, is able to prune away a huge amount of work; however, maintaining all of these bounds becomes prohibitive with large k and the algorithm exhausts all available memory. The same is true of the blacklist algorithm: on the largest datasets, with the largest k values, the space required to maintain blacklists for each node in the stack is too much. The hamerly and dual-tree algorithms, on the other hand, are the best-behaved with memory usage and do not have any issues with large N or large k; however, the hamerly algorithm is very slow on large datasets because it is not able to prune many points at once.
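To make the memory argument concrete, here is a back-of-the-envelope sketch; the values of N and k below are hypothetical round numbers chosen only to show the scaling, not measurements from Table 26.

    #include <cstdio>

    int main()
    {
      // Hypothetical sizes: N points, k clusters, 8-byte double per bound.
      const double n = 2e6;      // number of points (hypothetical)
      const double k = 15000;    // number of clusters (hypothetical)
      const double elkanBytes = n * k * sizeof(double);   // one lower bound per (point, centroid)
      const double hamerlyBytes = 2 * n * sizeof(double); // one upper and one lower bound per point
      std::printf("elkan bound storage:   %8.1f GB\n", elkanBytes / (1024.0 * 1024.0 * 1024.0));
      std::printf("hamerly bound storage: %8.4f GB\n", hamerlyBytes / (1024.0 * 1024.0 * 1024.0));
      // Roughly 223 GB versus 0.03 GB: it is the kN bounds that exhaust memory,
      // not the dataset itself.
      return 0;
    }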

Similar to the observations about the blacklist algorithm, the tree-based approaches are less effective in higher dimensions [39]. In our results, though, we do still see competitive speedup on datasets up to a hundred dimensions or so.

Another clear observation is that when k is scaled on a single dataset, the dualtree-kd and dualtree-ct algorithms nearly always scale better (in terms of runtime) than the other algorithms. These results show that our algorithm satisfies its original goal: the ability to scale effectively to large k and N.

7.8.8 Future directions

Using four pruning strategies, we have developed a flexible, tree-independent dual-tree k-means algorithm that is the best-performing algorithm for large datasets and large k in small-to-medium dimensions. It is theoretically favorable, with the first provably sub-O(kN) bounds (though these depend on dataset-dependent constants), has a small memory footprint, and may be used in conjunction with initial point selection and approximation or sampling schemes to provide additional speedup.

There are still interesting future directions to pursue, though. The first direction is parallelism: because our dual-tree algorithm is agnostic to the type of traversal used, we may use a parallel traversal [28], such as an adapted version of a recent parallel dual-tree algorithm [76]. This parallelism can either be at the single-node level, allowing faster clustering on a single system, or at a larger scale, allowing distributed k-means clustering on a large set of systems. Potentially, this could allow k-means to be an effective strategy for extremely large data which traditionally has been handled with single-pass algorithms.

The second direction is kernel k-means and other spectral clustering techniques: it is possible to merge the dual-tree algorithm here with ideas from the dual-tree algorithm for max-kernel search to perform kernel k-means. This work thus opens promising avenues to further accelerated clustering algorithms. In addition, because spectral clustering has a connection to nonnegative matrix factorization [166], there may be further extensions in that direction: it may be possible to adapt the k-means algorithm given here to the seemingly completely different problem of matrix factorization for fast collaborative filtering.

CHAPTER 8 CONCLUSION AND FUTURE DIRECTIONS

Throughout this thesis, three things have been quite clear:

1. There is no end of large-scale statistical problems, and ‘large-scale’ is getting larger every day.

2. Hierarchical representations (trees) can alleviate these large-scale problems.

3. Most of these tree-based approaches share substantial similarities.

It is from this perspective that the contributions of this thesis are most apparent. By unifying the class of dual-tree algorithms in Chapter 3 (and also the class of single-tree algorithms), the path to improvement becomes clear. The logical split of a dual-tree algorithm into a type of space tree, a pruning dual-tree traversal, and problem-specific BaseCase() and Score() functions allows each piece to be considered individually, rather than intertwined with the others (which was the way of most previous improvements).
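To make the split concrete, the sketch below shows what a problem-specific rule set looks like in this abstraction; it counts, for each query point, the reference points within a fixed range. The class and member names are hypothetical (this is not code taken from mlpack), but the only interface it assumes of the tree is a MinDistance() bound, which is the point: the same rules work with any space tree and any pruning dual-tree traversal.

    #include <cfloat>
    #include <cstddef>
    #include <vector>

    // Hypothetical rule set following the BaseCase()/Score() convention.
    template<typename MetricType, typename TreeType>
    class RangeCountRules
    {
     public:
      RangeCountRules(const std::vector<std::vector<double>>& queries,
                      const std::vector<std::vector<double>>& references,
                      const double range,
                      std::vector<std::size_t>& counts) :
          queries(queries), references(references), range(range), counts(counts) { }

      // Point-to-point work, run by whatever traversal is in use.
      double BaseCase(const std::size_t queryIndex, const std::size_t referenceIndex)
      {
        const double d = MetricType::Evaluate(queries[queryIndex],
                                              references[referenceIndex]);
        if (d <= range)
          ++counts[queryIndex];
        return d;
      }

      // Node-to-node pruning; returning DBL_MAX signals "prune this combination".
      double Score(TreeType& queryNode, TreeType& referenceNode)
      {
        const double minDistance = queryNode.MinDistance(referenceNode);
        return (minDistance > range) ? DBL_MAX : minDistance;
      }

     private:
      const std::vector<std::vector<double>>& queries;
      const std::vector<std::vector<double>>& references;
      const double range;
      std::vector<std::size_t>& counts;
    };

A nearest neighbor or kernel summation rule set has exactly the same shape; only the bodies of BaseCase() and Score() change.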

Chapter 4 showed how the mlpack library can exploit this logical split to provide a clean interface for programmers and an efficient library for users; then, each subsequent chapter detailed improvements for one of the three pieces of dual-tree algorithms: Chapter 5 focused on the improvement of trees—primarily theoretical improvements. Chapter 6 detailed an improved dual depth-first traversal that can be applied to any situation. Chapter 7, by far the longest, detailed a cabal of problems that can be solved with dual-tree algorithms, describing each one as a combination of a BaseCase() and Score() function. Although in many cases empirical results were shown for only one or two types of trees, it cannot be emphasized enough that these BaseCase() and Score() functions can be used with any combination of tree type and traversal. mlpack exploits this fact, providing easily-configurable yet fast algorithms.
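The way the pieces plug together can be sketched just as briefly. The function below is a schematic (not mlpack's literal traverser; the node accessors NumPoints(), Point(), NumChildren(), and Child() are assumed members of whatever TreeType is supplied): a pruning dual depth-first traversal only ever talks to the rules through Score() and BaseCase(), so swapping the tree type or the rule set requires no change to the traversal itself.

    #include <cfloat>
    #include <cstddef>

    // Schematic pruning dual depth-first traversal over any TreeType/RuleType
    // pair; a sketch of the abstraction, not a drop-in mlpack component.
    template<typename TreeType, typename RuleType>
    void DualDepthFirstTraversal(TreeType& queryNode,
                                 TreeType& referenceNode,
                                 RuleType& rules)
    {
      // The problem-specific rules decide whether this node combination can be
      // pruned entirely.
      if (rules.Score(queryNode, referenceNode) == DBL_MAX)
        return;

      // Run point-to-point base cases for the points held in these two nodes.
      for (std::size_t i = 0; i < queryNode.NumPoints(); ++i)
        for (std::size_t j = 0; j < referenceNode.NumPoints(); ++j)
          rules.BaseCase(queryNode.Point(i), referenceNode.Point(j));

      // Recurse into every combination of children.
      for (std::size_t qi = 0; qi < queryNode.NumChildren(); ++qi)
        for (std::size_t ri = 0; ri < referenceNode.NumChildren(); ++ri)
          DualDepthFirstTraversal(queryNode.Child(qi), referenceNode.Child(ri), rules);
    }

Prioritized or breadth-first traversals, including the improved traversal of Chapter 6, fit the same mold.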

The original thesis statement, laid out in Chapter 1, states that we may improve and expand the class of dual-tree algorithms by focusing on and providing improvements for each of the three independent components of a dual-tree algorithm. All of the evidence in the past horde of pages supports this statement, and therefore I consider my justification of the thesis complete.

But, I am not yet done writing. No work is ever complete, and I think it is important to highlight a few issues in the field of tree-based algorithms that warrant further investigation.

High-dimensional data structures. It is well-known that trees scale poorly to high dimensions. Yet a large amount of data today exists in high dimensions (thousands or more), and traditional tree-based techniques for search are mostly ineffective there. Hashing techniques can often provide decent results in high dimensions, but these are still nowhere near the speedups seen for low-dimensional datasets with trees.

Automatic selection of algorithm components. One notable disadvantage of the tree- independent dual-tree algorithm abstraction is that it now burdens the practitioner with the choice of tree, the choice of traversal, and the choice of problem to solve (though a practitioner will generally know what problem they want to solve). Especially with the number of choices available, making an informed choice is something of a daunting task that requires a large amount of experimentation. However, simple heuristics like those employed by Muja and Lowe [206] could provide at least a partial solution to this problem and help to assemble an auto-tuned black box that generally makes decent choices for the parameters. For this to happen, though, a better understanding of the dataset characteristics that make trees more or less effective is required.

More reasonable and tight runtime bounds. The expansion constant is to date the only quantity that has allowed more descriptive runtime bounds for dual-tree algorithms. However, it has some serious drawbacks: it is sensitive to individual points and outliers, it is extremely time-consuming to calculate1, and it is not necessarily a great indicator of the performance of trees [57]. Worse yet, the bounds that have been shown involve particularly large powers of the expansion constant. These bounds are useful in that they show the scaling properties of the algorithm, but they are not useful in practice, for the reasons listed above. A more robust notion of intrinsic dimensionality could pave the way towards better and more practically usable runtime bounds for dual-tree algorithms.
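To illustrate why this is costly, the sketch below spells out the brute-force O(N² log N) computation mentioned in the footnote at the end of this chapter. It assumes the usual definition (the expansion constant is the smallest c such that |B(p, 2r)| ≤ c|B(p, r)| for every point p in the set and every radius r > 0), uses the Euclidean metric, and, as a simplification, checks only radii equal to observed pairwise distances rather than the exact supremum over all radii.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Brute-force sketch: O(N^2) distances, sorted per point, giving O(N^2 log N).
    double ExpansionConstant(const std::vector<std::vector<double>>& points)
    {
      const std::size_t n = points.size();
      double c = 1.0;

      for (std::size_t i = 0; i < n; ++i)
      {
        // Sorted distances from point i to every point (including itself).
        std::vector<double> dists(n);
        for (std::size_t j = 0; j < n; ++j)
        {
          double sq = 0.0;
          for (std::size_t d = 0; d < points[i].size(); ++d)
            sq += (points[i][d] - points[j][d]) * (points[i][d] - points[j][d]);
          dists[j] = std::sqrt(sq);
        }
        std::sort(dists.begin(), dists.end());

        // For radius r = dists[j], |B(i, r)| and |B(i, 2r)| follow from the
        // sorted order via binary search; track the worst ratio seen.
        for (std::size_t j = 1; j < n; ++j)
        {
          const double r = dists[j];
          if (r == 0.0)
            continue;  // skip duplicate points
          const double inner = static_cast<double>(
              std::upper_bound(dists.begin(), dists.end(), r) - dists.begin());
          const double outer = static_cast<double>(
              std::upper_bound(dists.begin(), dists.end(), 2.0 * r) - dists.begin());
          c = std::max(c, outer / inner);
        }
      }

      return c;
    }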

To my mind, these are the three most important issues to be addressed in the tree-based algorithm literature, and it is my hope that, in the time that follows, I will be able to make attempts at solving them.

1There is a relatively straightforward O(N² log N) algorithm for this calculation, but it would be difficult to make it any faster while still being exact, which it would need to be.

REFERENCES

[1] A. Norberg, “High-technology calculation in the early 20th century: Punched card machinery in business and government,” Technology and Culture, vol. 31, no. 4, pp. 753–779, 1990.

[2] H. Hollerith, “The electrical tabulating machine,” Journal of the Royal Statistical Society, vol. 57, no. 4, pp. 678–689, 1894.

[3] R. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[4] W. Shockley, Electrons and Holes In Semiconductors with Applications to Transistor Electronics. Toronto: D. Van Nostrand Company, Inc., 1950.

[5] J. Von Neumann, “First draft of a report on the EDVAC,” tech. rep., Moore School of Electrical Engineering, University of Pennsylvania, June 1945.

[6] F. Hamilton and E. Kubie, “The IBM Magnetic Drum Calculator Type 650,” Journal of the ACM (JACM), vol. 1, no. 1, pp. 13–20, 1954.

[7] P. Clark and F. Evans, “Distance to nearest neighbor as a measure of spatial relationships in populations,” Ecology, vol. 35, no. 4, pp. 445–453, 1954.

[8] G. Cottam and J. Curtis, “The use of distance measures in phytosociological sampling,” Ecology, vol. 37, no. 3, pp. 451–460, 1956.

[9] M. Dacey, “A note on the derivation of nearest neighbor distances,” Journal of Regional Science, vol. 2, no. 2, pp. 81–88, 1960.

[10] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.

[11] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.

[12] J. Gower and G. Ross, “Minimum spanning trees and single linkage cluster analysis,” Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 18, no. 1, pp. 54–64, 1969.

[13] J. Cooley and J. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[14] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.

[15] E. Parzen, “On estimation of a probability density function and mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.

[16] C. Hoare, “Quicksort,” The Computer Journal, vol. 5, no. 1, pp. 10–16, 1962.

[17] G. Golub and C. Reinsch, “Singular value decomposition and least squares solutions,” Numerische Mathematik, vol. 14, no. 5, pp. 403–420, 1970.

[18] F. Faggin, M. Shima, M. Hoff Jr., H. Feeney, and S. Mazor, “The MCS-4: An LSI micro computer system,” in Proceedings of the 1972 IEEE Region Six Conference, pp. 8–11, 1972.

[19] “Most important companies,” Byte Magazine, vol. 20, pp. 99–102, September 1995.

[20] G. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, pp. 114–117, 1965.

[21] S. Lohr, “Amid the flood, a catchphrase is born,” The New York Times, p. BU3, August 12, 2012.

[22] M. Parry, “Please be eAdvised,” The New York Times, p. ED24, July 22, 2012.

[23] S. Lohr, “The age of Big Data,” The New York Times, p. SR1, February 12, 2012.

[24] C. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing Systems 13 (NIPS 2000), pp. 682–688, 2001.

[25] A. Talwalkar and A. Rostamizadeh, “Matrix coherence and the Nyström method,” in Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), pp. 572–579, 2010.

[26] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[27] A. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artificial Intelligence, vol. 97, no. 1, pp. 245–271, 1997.

[28] R. Curtin, W. March, P. Ram, D. Anderson, A. Gray, and C. Isbell, Jr., “Tree-independent dual-tree algorithms,” in Proceedings of the 30th International Conference on Machine Learning (ICML ’13), 2013.

[29] R. Finkel and J. Bentley, “Quad trees: a data structure for retrieval on composite keys,” Acta Informatica, vol. 4, no. 1, pp. 1–9, 1974.

[30] C. Jackins and S. Tanimoto, “Oct-trees and their use in representing three-dimensional objects,” Computer Graphics and Image Processing, vol. 14, no. 3, pp. 249–270, 1980.

[31] J. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.

[32] J. Friedman, J. Bentley, and R. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM Transactions on Mathematical Software (TOMS), vol. 3, no. 3, pp. 209–226, 1977.

[33] K. Fukunaga and P. Narendra, “A branch and bound algorithm for computing k-nearest neighbors,” IEEE Transactions on Computers, vol. 100, no. 7, pp. 750–753, 1975.

[34] W. Koontz, P. Narendra, and K. Fukunaga, “A branch and bound clustering algorithm,” IEEE Transactions on Computers, vol. 100, no. 9, pp. 908–915, 1975.

[35] P. Narendra and K. Fukunaga, “A branch and bound algorithm for feature subset selection,” IEEE Transactions on Computers, vol. 100, no. 9, pp. 917–922, 1977.

[36] J. Bentley and J. Friedman, “Fast algorithms for constructing minimal spanning trees in coordinate spaces,” IEEE Transactions on Computers, vol. 100, no. 2, pp. 97–105, 1978.

[37] J. Bentley and J. Friedman, “Data structures for range searching,” ACM Computing Surveys (CSUR), vol. 11, no. 4, pp. 397–409, 1979.

[38] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu, “An optimal algorithm for approximate nearest neighbor searching in fixed dimensions,” Journal of the ACM (JACM), vol. 45, no. 6, pp. 891–923, 1998.

[39] D. Pelleg and A. Moore, “Accelerating exact k-means algorithms with geometric reasoning,” in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’99), pp. 277–281, ACM, 1999.

[40] A. W. Moore, “Very fast EM-based mixture model clustering using multiresolution kd-trees,” Advances in Neural Information Processing Systems, pp. 543–549, 1999.

[41] D. Fussell and K. Subramanian, “Fast ray tracing using kd-trees,” Tech. Rep. TR-88-07, The University of Texas at Austin, Department of Computer Sciences, March 1988.

[42] J. Barnes and P. Hut, “A hierarchical O(N log N) force-calculation algorithm,” Nature, vol. 324, pp. 446–449, 1986.

[43] Y. Shen, A. Ng, and M. Seeger, “Fast Gaussian process regression using kd-trees,” Advances in Neural Information Processing Systems (NIPS), vol. 18, p. 1225, 2006.

[44] K. Deng and A. Moore, “Multiresolution instance-based learning,” in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI ’95), vol. 2, pp. 1233–1239, 1995.

[45] P. Ram and A. Gray, “Maximum inner-product search using cone trees,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’12), (New York, NY, USA), pp. 931–939, ACM, 2012.

[46] R. Curtin, P. Ram, and A. Gray, “Fast exact max-kernel search,” in SIAM International Conference on Data Mining (SDM ’13), pp. 1–9, 2013.

[47] A. Guttman, “R-Trees: a dynamic index structure for spatial searching,” in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD ’84), pp. 47–57, ACM, 1984.

[48] S. Omohundro, “Five balltree construction algorithms,” Tech. Rep. TR-89-063, University of California, Berkeley International Computer Science Institute Technical Reports, 1989.

[49] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: an efficient and robust access method for points and rectangles,” in Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD ’90), pp. 322–331, ACM, 1990.

[50] P. Yianilos, “Data structures and algorithms for nearest neighbor search in general metric spaces,” in Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 311–321, Society for Industrial and Applied Mathematics, 1993.

[51] J. Uhlmann, “Satisfying general proximity/similarity queries with metric trees,” Information Processing Letters, vol. 40, no. 4, pp. 175–179, 1991.

[52] I. Kamel and C. Faloutsos, “Hilbert R-tree: an improved R-tree using fractals,” in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94), pp. 500–509, ACM, 1994.

[53] K. Lin, H. Jagadish, and C. Faloutsos, “The TV-tree: an index structure for high-dimensional data,” The VLDB Journal: The International Journal on Very Large Data Bases - Systems, vol. 3, no. 4, pp. 517–542, 1994.

[54] S. Berchtold, D. Keim, and H.-P. Kriegel, “The X-tree: an index structure for high-dimensional data,” in Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB ’96), pp. 28–39, ACM, 1996.

[55] J. McNames, “A fast nearest-neighbor algorithm based on a principal axis search tree,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 964–976, 2001.

[56] T. Liu, A. Moore, K. Yang, and A. Gray, “An investigation of practical approximate nearest neighbor algorithms,” in Advances in Neural Information Processing Systems 18 (NIPS ’04), pp. 825–832, Neural Information Processing Systems Foundation, 2004.

[57] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), vol. 23, pp. 97–106, 2006.

[58] M. Holmes, A. Gray, and C. Isbell Jr., “QUIC-SVD: Fast SVD using cosine trees,” Advances in Neural Information Processing Systems (NIPS), vol. 21, 2008.

[59] P. Ram, D. Lee, and A. Gray, “Nearest-neighbor search on a time budget via max-margin trees,” in SIAM International Conference on Data Mining (SDM ’12), pp. 1011–1022, 2012.

[60] L. Greengard and V. Rokhlin Jr., “A fast algorithm for particle simulations,” Journal of Computational Physics, vol. 73, no. 2, pp. 325–348, 1987.

[61] A. Gray and A. Moore, “‘N-Body’ problems in statistical learning,” in Advances in Neural Information Processing Systems 14 (NIPS 2001), vol. 4, pp. 521–527, 2001.

[62] S. Aluru, “Greengard’s n-body algorithm is not order n,” SIAM Journal on Scientific Computing, vol. 17, no. 3, pp. 773–776, 1996.

[63] E. Schein, DEC is Dead, Long Live DEC. Berrett-Koehler Publishers, August 2004.

[64] A. Gray, Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003.

[65] A. Gray and A. Moore, “Nonparametric density estimation: Toward computational tractability,” in SIAM International Conference on Data Mining (SDM), pp. 203– 211, 2003.

[66] R. Curtin, “Faster dual-tree traversal for nearest neighbor search,” in Proceedings of The 8th International Conference on Similarity Search and Applications (SISAP 2015), to appear, 2015.

[67] P. Ram, D. Lee, H. Ouyang, and A. Gray, “Rank-approximate nearest neighbor search: Retaining meaning and speed in high dimensions,” Advances in Neural Information Processing Systems, vol. 22, 2009.

[68] W. March, P. Ram, and A. Gray, “Fast Euclidean minimum spanning tree: algorithm, analysis, and applications,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 603–612, 2010.

[69] A. Gray and A. Moore, “Rapid evaluation of multiple density models,” in Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics (AISTATS ’03), 2003.

[70] M. Holmes, A. Gray, and C. Isbell Jr., “Fast nonparametric conditional density estimation,” in Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI ’07), 2007.

[71] N. Slagle and L. Fortnow, “Approximate matrix multiplication and space partitioning trees: An exploration,” tech. rep., http://www.npslagle.info, 2012.

[72] P. Wang, D. Lee, A. Gray, and J. Rehg, “Fast mean shift with accurate and stable convergence,” in Workshop on Artificial Intelligence and Statistics (AISTATS), 2007.

[73] D. Lee, A. Gray, and A. Moore, “Dual-tree fast Gauss transforms,” in Advances in Neural Information Processing Systems 18, pp. 747–754, Cambridge, MA: MIT Press, 2006.

[74] D. Lee and A. Gray, “Faster Gaussian summation: Theory and experiment,” in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI ’06), 2006.

[75] D. Lee and A. Gray, “Fast high-dimensional kernel summations using the Monte Carlo multipole method,” Advances in Neural Information Processing Systems (NIPS), vol. 21, 2008.

[76] D. Lee, R. Vuduc, and A. Gray, “A distributed kernel summation framework for general-dimension machine learning,” in SIAM International Conference on Data Mining (SDM ’12), pp. 391–402, 2012.

[77] W. March, A. Connolly, and A. Gray, “Fast algorithms for comprehensive n-point correlation estimates,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’12), pp. 1478–1486, ACM, 2012.

[78] W. B. March, K. Czechowski, M. Dukhan, T. Benson, D. Lee, A. J. Connolly, R. Vuduc, E. Chow, and A. G. Gray, “Optimizing the computation of n-point correlations on large-scale astronomical data,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 74, IEEE Computer Society Press, 2012.

[79] R. Curtin and P. Ram, “Dual-tree fast exact max-kernel search,” Statistical Analysis and Data Mining, vol. 7, no. 4, pp. 229–253, 2014.

[80] R. Curtin, “Dual-tree k-means clustering with bounded single-iteration runtime,” submitted to Neural Information Processing Systems 28 (NIPS 2015), 2015.

[81] M. Klaas, M. Briers, N. de Freitas, A. Doucet, S. Maskell, and D. Lang, “Fast particle smoothing: if I had a million particles,” in Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), pp. 481–488, ACM, 2006.

[82] L. Van Der Maaten, “Accelerating T-SNE using tree-based algorithms,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3221–3245, 2014.

[83] A. Gray, “Fast kernel matrix-vector multiplication with application to Gaussian process learning,” Tech. Rep. CMU-CS-04-1103, School of Computer Science, Carnegie Mellon University, 2004.

[84] B. Thiesson and J. Kim, “Fast variational mode-seeking,” in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), pp. 1230–1242, 2012.

[85] S. Amizadeh, B. Thiesson, and M. Hauskrecht, “Variational dual-tree framework for large-scale transition matrix approximation,” in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI ’12), pp. 64–73, 2012.

[86] C. Puech and H. Yahia, “Quadtrees, octrees, hyperoctrees: a unified analytical approach to tree data structures used in graphics, geometric modeling and image processing,” in Proceedings of The First Annual Symposium on Computational Geometry (SoCG ’85), pp. 272–280, ACM, 1985.

[87] R. Curtin, J. Cline, N. Slagle, W. March, P. Ram, N. Mehta, and A. Gray, “MLPACK: A scalable C++ machine learning library,” Journal of Machine Learning Research, vol. 14, pp. 801–805, 2013.

[88] M. Shamos and D. Hoey, “Closest-point problems,” in Proceedings of the 16th Annual Symposium on Foundations of Computer Science (FOCS 1975), pp. 151–162, IEEE, 1975.

[89] N. Megiddo, E. Zemel, and S. Hakimi, “The maximum coverage location problem,” SIAM Journal on Algebraic Discrete Methods, vol. 4, no. 2, pp. 253–261, 1983.

[90] J. Ritter, “An efficient bounding sphere,” in Graphics Gems, pp. 301–303, Academic Press Professional, Inc., 1990.

[91] M. Bădoiu, S. Har-Peled, and P. Indyk, “Approximate clustering via core-sets,” in Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC 2002), pp. 250–257, ACM, 2002.

[92] P. Kumar, J. Mitchell, and E. Yildirim, “Approximate minimum enclosing balls in high dimensions using core-sets,” Journal of Experimental Algorithmics, vol. 8, pp. 1–29, 2003.

[93] K. Fischer, B. Gärtner, and M. Kutz, “Fast smallest-enclosing-ball computation in high dimensions,” in Proceedings of the 11th Annual European Symposium on Algorithms (ESA 2003), pp. 630–641, Springer, 2003.

[94] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.

[95] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX conference on Hot Topics in Cloud Computing, vol. 10, pp. 10–17, 2010.

[96] The Apache Foundation, “Apache Mahout,” June 2015. Project homepage and software download at http://mahout.apache.org/.

[97] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[98] D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, and C. Furlanello, “mlpy: Machine Learning PYThon,” 2012. Project homepage at http://mlpy.fbk.eu/.

[99] A. Ben-Hur and J. Weston, “PyML - machine learning in Python.” http://pyml.sourceforge.net, 2015.

[100] K. Gawande, C. Webers, A. Smola, and S. Vishwanathan, “ELEFANT: A python machine learning toolbox,” in SciPy Conference, 2007.

[101] T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber, “Pybrain,” The Journal of Machine Learning Research, vol. 11, pp. 743–746, 2010.

[102] J. Bergstra, F. Bastien, O. Breuleux, P. Lamblin, R. Pascanu, O. Delalleau, G. Desjardins, D. Warde-Farley, I. Goodfellow, A. Bergeron, and Y. Bengio, “Theano: Deep learning on GPUs with Python,” in Proceedings of the NIPS 2011 BigLearning Workshop, 2011.

[103] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten, “The WEKA Data Mining Software: An Update,” SIGKDD Explorations, vol. 11, no. 1, 2009.

[104] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. Seljebotn, and K. Smith, “Cython: The best of both worlds,” Computing in Science & Engineering, vol. 13, no. 2, pp. 31–39, 2011.

[105] D. Eddelbuettel, R. François, J. Allaire, J. Chambers, D. Bates, and K. Ushey, “Rcpp: Seamless R and C++ integration,” Journal of Statistical Software, vol. 40, no. 8, pp. 1–18, 2011.

[106] J. Langford, L. Li, and A. Strehl, “Vowpal Wabbit online learning project,” tech. rep., http://hunch.net, 2007.

[107] S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc, “The SHOGUN machine learning toolbox,” The Journal of Machine Learning Research, vol. 11, pp. 1799–1802, 2010.

[108] C. Igel, V. Heidrich-Meisner, and T. Glasmachers, “Shark,” The Journal of Machine Learning Research, vol. 9, pp. 993–996, 2008.

[109] A. Alexandrescu, Modern C++ Design: Generic Programming and Design Patterns Applied. Indianapolis: Addison-Wesley Professional, Feb. 2001.

[110] C. Sanderson, “Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments,” tech. rep., NICTA, 2010.

[111] B. Schölkopf, A. Smola, and K. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.

[112] S. Burer and R. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.

[113] M. Edel, A. Soni, and R. Curtin, “An automatic benchmarking system,” in Proceedings of the NIPS 2014 Workshop on Software Engineering for Machine Learning (SE4ML), 2014.

[114] R. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.

[115] D. Lee and H. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[116] C.-C. Ma, “A guide to singular value decomposition for collaborative filtering,” 2008.

[117] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.

[118] P. Ram and A. Gray, “Density estimation trees,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 627–635, ACM, 2011.

[119] K. Fukunaga and L. Hostetler, “The estimation of the gradient of a density function, with applications in pattern recognition,” IEEE Transactions on Information Theory, vol. 21, no. 1, pp. 32–40, 1975.

[120] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry (SoCG ’04), pp. 253–262, ACM, 2004.

[121] J. Goldberger, G. Hinton, S. Roweis, and R. Salakhutdinov, “Neighbourhood components analysis,” in Advances in Neural Information Processing Systems 17 (NIPS 2004), pp. 513–520, 2004.

[122] K. Yu, T. Zhang, and Y. Gong, “Nonlinear learning using Local Coordinate Coding,” in Advances in Neural Information Processing Systems 22 (NIPS 2009), pp. 2223– 2231, 2009.

[123] H. Lee, A. Battle, R. Raina, and A. Ng, “Efficient sparse coding algorithms,” in Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 801–808, 2006.

[124] S. Agrawal, R. Curtin, S. Ghaisas, and M. Gupta, “Collaborative filtering via matrix decomposition in mlpack,” in Proceedings of the ICML 2015 Workshop on Machine Learning Open Source Software, 2015.

[125] A. Moore, “The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data,” in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI ’00), pp. 397–405, Morgan Kaufmann Publishers Inc., 2000.

[126] J. McClellan, R. Schafer, and M. Yoder, Signal Processing First. Prentice Hall, 2003.

[127] N. Segata and E. Blanzieri, “Fast and scalable local kernel machines,” The Journal of Machine Learning Research, vol. 99, pp. 1883–1926, 2010.

[128] B. Liu and H. Jagadish, “Using trees to depict a forest,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 133–144, 2009.

[129] A. Kibriya and E. Frank, “An empirical comparison of exact nearest neighbour algorithms,” in Knowledge Discovery in Databases: PKDD 2007, pp. 140–151, Springer, 2007.

[130] C. Parker, A. Fern, and P. Tadepalli, “Learning for efficient retrieval of structured data with noisy queries,” in Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pp. 729–736, ACM, 2007.

[131] D. Karger and M. Ruhl, “Finding nearest neighbors in growth-restricted metrics,” in Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC ’02), pp. 741–750, ACM, 2002.

[132] R. Krauthgamer and J. Lee, “Navigating nets: simple algorithms for proximity search,” in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 798–807, Society for Industrial and Applied Mathematics, 2004.

[133] P. Ram, D. Lee, W. March, and A. Gray, “Linear-time algorithms for pairwise statistical problems,” Advances in Neural Information Processing Systems 22 (NIPS 2009), vol. 23, 2009.

[134] K. Bache and M. Lichman, “UCI Machine Learning Repository,” 2013. http://archive.ics.uci.edu/ml.

[135] R. Curtin, D. Lee, W. March, and P. Ram, “Plug-and-play dual-tree algorithm runtime analysis,” Journal of Machine Learning Research (to appear), 2015.

[136] D. Colless, “Review of ‘Phylogenetics: The Theory and Practice of Phylogenetic Systematics’, by E.O. Wiley,” Systematic Zoology, vol. 31, pp. 100–104, 1982.

[137] M. Sackin, ““Good” and “bad” phenograms,” Systematic Biology, vol. 21, no. 2, pp. 225–226, 1972.

[138] Y. LeCun, C. Cortes, and C. Burges, “MNIST dataset,” 2000. http://yann.lecun.com/exdb/mnist/.

[139] J. Adelman-McCarthy, M. Agüeros, S. Allam, C. Prieto, K. Anderson, S. Anderson, J. Annis, N. Bahcall, C. Bailer-Jones, I. Baldry, et al., “The sixth data release of the Sloan Digital Sky Survey,” The Astrophysical Journal Supplement Series, vol. 175, no. 2, p. 297, 2008.

[140] D. Moore and S. Russell, “Fast Gaussian process posteriors with product trees,” in Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI-14), (Quebec City), July 2014.

[141] M. Houle and M. Nett, “Rank cover trees for nearest neighbor search,” in Proceedings of the 6th International Conference on Similarity Search and Applications (SISAP 2013), pp. 16–29, Springer, 2013.

[142] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, “Spectral hashing with semantically consistent graph for image indexing,” IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 141–152, 2013.

[143] N. Tziortziotis, C. Dimitrakakis, and K. Blekas, “Cover tree Bayesian reinforcement learning,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 2313–2335, 2014.

[144] A. Kibriya, “Fast algorithms for nearest neighbor search,” Master’s thesis, University of Waikato, 2007.

[145] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: A new data clustering algorithm and its applications,” Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.

[146] R. Lupton, J. Gunn, Z. Ivezic, G. Knapp, and S. Kent, “The SDSS imaging pipelines,” in Astronomical Data Analysis Software and Systems X, vol. 238, p. 269, 2001.

[147] J. Adelman-McCarthy, M. Agüeros, S. Allam, C. Prieto, K. Anderson, et al., “The sixth data release of the Sloan Digital Sky Survey,” The Astrophysical Journal Supplement Series, vol. 175, no. 2, p. 297, 2008.

[148] W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li, “Modeling LSH for performance tuning,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), pp. 669–678, ACM, 2008.

[149] W. Dong. Personal communication, 2015.

[150] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu, “An optimal algorithm for approximate nearest neighbor searching in fixed dimensions,” Journal of the ACM, vol. 45, no. 6, pp. 891–923, 1998.

[151] J. Kleinberg, “Two algorithms for nearest-neighbor search in high dimensions,” in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (STOC ’97), pp. 599–608, ACM, 1997.

[152] S. Nene and S. Nayar, “A simple algorithm for nearest neighbor search in high dimensions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 9, pp. 989–1003, 1997.

[153] P. Yianilos, “Excluded middle vantage point forests for nearest neighbor search,” in DIMACS Implementation Challenge, ALENEX’99, 1999.

[154] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC ’98), pp. 604–613, ACM, 1998.

[155] A. Gionis, P. Indyk, R. Motwani, et al., “Similarity search in high dimensions via hashing,” in Proceedings of the Twenty-Fifth International Conference on Very Large Data Bases (VLDB ’99), vol. 99, pp. 518–529, 1999.

[156] K. Clarkson, “Nearest neighbor queries in metric spaces,” Discrete & Computational Geometry, vol. 22, no. 1, pp. 63–93, 1999.

[157] K. Clarkson, “Nearest-neighbor searching and metric space dimensions,” Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pp. 15–59, 2006.

[158] K. Clarkson, “Fast algorithms for the all nearest neighbors problem,” in 24th Annual Symposium on Foundations of Computer Science (FOCS 1983), pp. 226–232, 1983.

[159] R. Curtin, J. Cline, N. Slagle, M. Amidon, and A. Gray, “MLPACK: A Scalable C++ Machine Learning Library,” in BigLearning: Algorithms, Systems, and Tools for Learning at Scale, 2011.

[160] P. Agarwal and J. Erickson, “Geometric range searching and its relatives,” Contemporary Mathematics, vol. 223, pp. 1–56, 1999.

[161] J. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.

[162] A. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.

[163] T. Jaakkola and D. Haussler, “Probabilistic kernel regression models,” in Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics (AISTATS ’99), 1999.

[164] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[165] I. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: spectral clustering and normalized cuts,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04), pp. 551–556, ACM, 2004.

[166] C. Ding, X. He, and H. Simon, “On the equivalence of Nonnegative Matrix Factorization and spectral clustering,” in Proceedings of the 2005 SIAM International Conference on Data Mining (SDM ’05), pp. 606–610, SIAM, 2005.

[167] C. Rasmussen, “Gaussian Processes for machine learning,” 2006.

[168] C. Williams, “Prediction with Gaussian Processes: from linear regression to linear prediction and beyond,” in Learning in Graphical Models, pp. 599–621, Springer, 1998.

[169] B. Schölkopf, A. Smola, and K. Müller, “Kernel principal component analysis,” in Advances in Kernel Methods: Support Vector Learning, pp. 327–352, MIT Press, 1999.

[170] S. Günter, N. Schraudolph, and S. Vishwanathan, “Fast iterative kernel principal component analysis,” Journal of Machine Learning Research, vol. 8, pp. 1893–1918, 2007.

[171] L. Amaro, Z. Heiga, Y. Nankaku, C. Miyajima, K. Tokuda, and T. Kitamura, “On the use of kernel PCA for feature extraction in speech recognition,” IEICE Transactions on Information and Systems, vol. 87, no. 12, pp. 2802–2811, 2004.

[172] T. Takiguchi and Y. Ariki, “Robust feature extraction using kernel PCA,” in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), 2006.

[173] C. Liu, “Gabor-based kernel PCA with fractional power polynomial models for face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 5, pp. 572–581, 2004.

[174] H. Hoffman, “Kernel PCA for novelty detection,” Pattern Recognition, vol. 40, no. 3, pp. 863–874, 2007.

[175] Y. Rathi, S. Dambreville, and A. Tannenbaum, “Statistical shape analysis using kernel PCA,” in Proceedings of the SPIE 6064, Image Processing: Algorithms and Systems, Neural Networks, and Machine Learning, 2006.

[176] M. López, J. Ramírez, J. Górriz, I. Álvarez, D. Salas-Gonzalez, F. Segovia, and R. Chaves, “SVM-based CAD system for early detection of Alzheimer’s disease using kernel PCA and LDA,” Neuroscience Letters, vol. 464, no. 3, pp. 233–238, 2009.

[177] M. Genton, “Classes of kernels for machine learning: a statistics perspective,” The Journal of Machine Learning Research, vol. 2, pp. 299–312, 2002.

[178] J. Ham, D. Lee, S. Mika, and B. Schölkopf, “A kernel view of the dimensionality reduction of manifolds,” in Proceedings of the 21st International Conference on Machine Learning (ICML ’04), pp. 47–54, 2004.

[179] B. Hamers, J. Suykens, and B. De Moor, “Compactly supported RBF kernels for sparsifying the Gram matrix in LS-SVM regression models,” in Artificial Neural Networks – ICANN 2002, pp. 720–726, Springer, 2002.

[180] I. Murray, “Gaussian processes and fast matrix-vector multiplies,” 2009. In the Numerical Mathematics in Machine Learning Workshop at ICML ’09.

[181] H. Zhang, M. Genton, and P. Liu, “Compactly supported radial basis function kernels,” Tech. Rep. 2570, North Carolina State University Department of Statistics Technical Reports, 2004.

[182] S. Mika, B. Schölkopf, A. Smola, K. Müller, M. Scholz, and G. Rätsch, “Kernel PCA and de-noising in feature spaces,” in Advances in Neural Information Processing Systems 11 (NIPS ’97), pp. 536–542, 1998.

[183] V. Franc and V. Hlaváč, “Statistical pattern recognition toolbox for Matlab,” no. CTU-CMP-2004-08, 2004.

[184] A. Smola and B. Schölkopf, “Sparse greedy matrix approximation for machine learning,” in Proceedings of the 17th International Conference on Machine Learning (ICML ’00), pp. 911–918, 2000.

[185] A. Frieze, R. Kannan, and S. Vempala, “Fast Monte-Carlo algorithms for finding low-rank approximations,” Journal of the ACM (JACM), vol. 51, no. 6, pp. 1025–1041, 2004.

[186] A. Gittens, “The spectral norm error of the naive Nyström extension,” arXiv preprint arXiv:1110.5305, 2011.

[187] R. Lehoucq, D. Sorensen, and C. Yang, ARPACK Users’ Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods, vol. 6. SIAM, 1998.

[188] B. Schölkopf, “The kernel trick for distances,” Advances in Neural Information Processing Systems 13 (NIPS ’01), pp. 301–307, 2001.

[189] K. Zhang, I. Tsang, and J. Kwok, “Improved Nyström low-rank approximation and error analysis,” in Proceedings of the 25th International Conference on Machine Learning (ICML ’08), pp. 1232–1239, ACM, 2008.

[190] R. Weber, H. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in Proceedings of the 24th International Conference on Very Large Databases (VLDB ’98), pp. 194–205, 1998.

[191] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müller, “Fisher discriminant analysis with kernels,” in Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48, IEEE, 1999.

[192] F. Bach and M. Jordan, “Kernel independent component analysis,” The Journal of Machine Learning Research, vol. 3, pp. 1–48, 2003.

[193] S. Si, C.-J. Hsieh, and I. Dhillon, “Memory efficient kernel approximation,” in Proceedings of the Thirty-First International Conference on Machine Learning (ICML ’14), pp. 701–709, 2014.

[194] W. March, B. Xiao, S. Tharakan, D. Chenhan, and G. Biros, “Robust treecode approximation for kernel machines,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 2015.

[195] R. Curtin, “Single-tree GMM training,” Tech. Rep. GT-CSE-2015-01, Georgia Institute of Technology, School of Computational Science and Engineering, 2015.

[196] D. Reynolds, “Gaussian mixture models,” in Encyclopedia of Biometrics, pp. 659– 663, 2009.

[197] J. Bilmes, “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models,” Tech. Rep. TR-97-021, Department of Electrical Engineering and Computer Science, University of California, Berkeley, April 1998.

[198] C. Leslie, E. Eskin, and W. Noble, “The spectrum kernel: A string kernel for SVM protein classification,” in Proceedings of the Pacific Symposium on Biocomputing, pp. 564–575, 2002.

[199] K. Borgwardt, C. Ong, S. Schönauer, S. V. N. Vishwanathan, A. Smola, and H. Kriegel, “Protein function prediction via graph kernels,” Bioinformatics, vol. 21, no. suppl. 1, pp. i47–i56, 2005.

[200] K. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik, “Predicting time series with Support Vector Machines,” Proceedings of the 7th International Conference on Artificial Neural Networks (ICANN ’97), pp. 999–1004, 1997.

[201] Y. Koren, R. M. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” IEEE Computer, vol. 42, no. 8, pp. 30–37, 2009.

[202] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer, “The Yahoo! Music Dataset and KDD-Cup’11,” Journal of Machine Learning Research (Proceedings Track), vol. 18, pp. 8–18, 2012.

[203] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[204] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.

[205] V. Hodge and J. Austin, “A comparison of standard spell checking algorithms and a novel binary neural approach,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, pp. 1073–1081, 2003.

[206] M. Muja and D. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in International Conference on Computer Vision Theory and Applications (VISAPP), 2009.

[207] A. Rahimi and B. Recht, “Random Features for Large-scale Kernel Machines,” Advances in Neural Information Processing Systems 20 (NIPS ’07), pp. 1177–1184, 2008.

[208] P. Kar and H. Karnick, “Random feature maps for dot product kernels,” in Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS ’12), vol. 22, pp. 583–591, 2012.

[209] M. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC ’02), pp. 380–388, 2002.

[210] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proceedings of the 12th IEEE International Conference on Computer Vision (ICCV ’09), 2009.

[211] L. Cayton, “Fast nearest neighbor retrieval for Bregman divergences,” in Proceedings of the 25th International Conference on Machine Learning (ICML ’08), pp. 112–119, 2008.

[212] S. Dasgupta and Y. Freund, “Random projection trees and low dimensional manifolds,” in Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC ’08), (New York, NY, USA), pp. 537–546, ACM, 2008.

[213] F. P. Preparata and M. I. Shamos, Computational Geometry: An Introduction. Springer, 1985.

[214] J. Bennett and S. Lanning, “The Netflix Prize,” in Proceedings of the KDD Cup and Workshop, pp. 3–6, 2007.

[215] S. Kim, F. Li, G. Lebanon, and I. Essa, “Beyond sentiment: The manifold of human emotions,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS ’13), pp. 360–369, 2013.

[216] A. Torralba, R. Fergus, and W. Freeman, “80 Million Tiny Images: A large data set for nonparametric object and scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[217] W. Pearson and D. Lipman, “Improved tools for biological sequence comparison,” Proceedings of the National Academy of Sciences, vol. 85, no. 8, pp. 2444–2448, 1988.

[218] P. Bradley and U. Fayyad, “Refining initial points for k-means clustering,” in Proceedings of the 15th International Conference on Machine Learning (ICML ’98), pp. 91–99, 1998.

[219] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035, 2007.

[220] C. Elkan, “Using the triangle inequality to accelerate k-means,” in Proceedings of the 20th International Conference on Machine Learning (ICML ’03), vol. 3, pp. 147– 153, 2003.

[221] G. Hamerly, “Making k-means even faster,” in Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 130–140, 2010.

[222] G. Frahling and C. Sohler, “A fast k-means implementation using coresets,” International Journal of Computational Geometry & Applications, vol. 18, no. 6, pp. 605–625, 2008.

[223] Y. Kwon, D. Nunley, J. Gardner, M. Balazinska, B. Howe, and S. Loebman, “Scalable clustering algorithm for n-body simulations in a shared-nothing cluster,” in Scientific and Statistical Database Management, pp. 132–150, Springer, 2010.

[224] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on Statistical Learning in Computer Vision, ECCV, vol. 1, pp. 1–16, 2004.

[225] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of AISTATS, pp. 215–223, 2011.

[226] S. Bengio, J. Weston, and D. Grangier, “Label embedding trees for large multi-class tasks,” in Advances in Neural Information Processing Systems 23 (NIPS ’10), vol. 23, p. 3, 2010.

[227] F. Can and E. Ozkarahan, “Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases,” ACM Transactions on Database Systems, vol. 15, pp. 483–517, Dec. 1990.

[228] M. Izbicki and C. Shelton, “Faster cover trees,” in Proceedings of the Thirty-Second International Conference on Machine Learning (ICML ’15), 2015.

[229] D. Pelleg and A. Moore, “X-means: extending k-means with efficient estimation of the number of clusters,” in Proceedings of the 17th International Conference on Machine Learning (ICML ’00), pp. 727–734, 2000.
