EFFICIENT RANDOM PROJECTION TREES FOR NEAREST NEIGHBOR SEARCH AND RELATED PROBLEMS

A Dissertation by

Omid Keivani

Master of Science, Ferdowsi University of Mashhad, Mashhad, Iran, 2014

Bachelor of Science, Sadjad Institute of Technology, Mashhad, Iran, 2011

Submitted to the Department of Electrical Engineering and Computer Science and the faculty of the Graduate School of Wichita State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

May 2019

© Copyright 2019 by Omid Keivani

All Rights Reserved

EFFICIENT RANDOM PROJECTION TREES FOR NEAREST NEIGHBOR SEARCH AND RELATED PROBLEMS

The following faculty members have examined the final copy of this dissertation for form and content, and recommend that it be accepted in partial fulfillment of the requirement for the degree of Doctor of Philosophy with a major in Computer Science.

______Kaushik Sinha, Committee Chair

______Krishna Krishnan, Committee Member

______Edwin Sawan, Committee Member

______Chengzong Pang, Committee Member

______Hongsheng He, Committee Member

Accepted for the College of Engineering

Dennis Livesay, Dean

Accepted for the Graduate School

Kerry Wilks, Interim Dean

iii

DEDICATION

To my wife, my son, my brother and my parents

iv

ABSTRACT

Nearest neighbor search (NNS) is one of the most well-known problems in the field of computer science. It has been widely used in many different areas such as recommender systems, classification, clustering etc. Given a database 푆 of 푛 objects, a query 푞, and a measure of similarity, the naive way to solve a NNS problem is to perform a linear search over the objects in database 푆 and return an object from 푆, which based on the similarity measure, is most similar to 푞. However, due to the growth of data in recent years, a solution better than linear time complexity is desirable. Locality sensitivity hashing (LSH) and random projection trees (RPT) are two popular methods to solve NNS problem in sublinear time. Earlier works have demonstrated that RPT has superior performance compared to LSH. However, RPT has two major drawbacks, namely, i) its high space complexity ii) if it makes a mistake at any internal node of a single , it cannot recover from this mistake and the rest of the search for that tree becomes useless.

One of the main contributions of this thesis is to propose new methods to address these two drawbacks. To address the first issue, we design a sparse version of RPT which reduces the space complexity overhead without significantly affecting nearest neighbor search performance. To address the second issue, we develop various strategies that uses auxiliary information and priority function to improve nearest neighbor search performance of original RPT. We support our claims both theoretically and experimentally on many real-world datasets.

A second contribution of the thesis is to use the RPT data structure to solve related search problems such as, maximum inner product search (MIPS) and nearest neighbor to query hyperplane (NNQH) search. Both these problems can be reduced to an equivalent NNS problem by applying appropriate transformations. In case of MIPS problem, we establish among many different transformations that reduce a MIPS problem to an equivalent NNS problem, which one is more preferable to be used in conjunction with RPT. In case of NNQH problem, the transformation that reduces NNQH problem to an equivalent NNS problem increases the data dimensionality tremendously and hence space complexity requirement of original RPT. In the latter case, we show that our sparse RPT version comes to rescue. Our NNQH solution which uses space efficient versions of RPT is used to solve active learning problem. We perform extensive empirical evaluations for both these applications on many real world datasets to show superior performance of our proposed methods compare to the state of the art algorithms. v

TABLE OF CONTENTS

Chapter Page 1 INTRODUCTION 1 1.1 NEAREST NEIGHBOR SEARCH ...... 1 1.2 LOCAL SENSITIVITY HASHING ...... 2 1.2.1 HASH FUNCTION ...... 3 1.2.2 P-STABLE DISTRIBUTION HASH FAMILY ...... 4 1.3 TREE BASED APPROACHES ...... 5 1.3.1 KD TREE ...... 5 1.3.2 SPILL TREE ...... 6 1.3.3 RANDOM PROJECTION TREE ...... 7 1.3.4 VIRTUAL SPILL TREE ...... 8 1.3.5 FAILURE PROBABILITY ANALYSIS ...... 8 1.4 NEAREST NEIGHBOR SEARCH AND RELATED PROBLEMS ...... 9 1.4.1 MAXIMUM INNER PRODUCT SEARCH ...... 10 1.4.2 NEAREST NEIGHBOR TO QUERY HYPERPLANE ...... 10 1.5 LIMITATIONS OF NEAREST NEIGHBOR SEARCH USING RPT ...... 11 1.6 OUR CONTRIBUTIONS ...... 12 2 SPACE EFFICIENT RPT 13 2.1 SPACE COMPLEXITY REDUCTION STRATEGY 1: RPTS ...... 14 2.2 SPACE COMPLEXITY REDUCTION STRATEGY 2: RPTB ...... 14 2.3 SPACE COMPLEXITY REDUCTION STRATEGY 3: SPARSE RPT ...... 16 2.3.1 ANALYSIS OF SPARSE RPT FOR NEAREST NEIGHBOR SEARCH . . . . . 17 2.4 EXPERIMENTAL RESULTS ...... 20 2.4.1 DATASETS ...... 21 2.4.2 COMPARISON OF SPARSE AND NON-SPARSE RPT ...... 21 3 IMPROVING THE PERFORMANCE OF A SINGLE TREE 23 3.1 DEFEATIST SEARCH WITH AUXILIARY INFORMATION ...... 24 3.2 GUIDED PRIORITIZED SEARCH ...... 27 3.3 COMBINED APPROACH ...... 30 3.4 EMPIRICAL EVALUATION ...... 31 3.4.1 EXPERIMENT 1 ...... 32 3.4.2 EXPERIMENT 2 ...... 33 3.4.3 EXPERIMENT 3 ...... 34 3.4.4 EXPERIMENT 4 ...... 36 3.4.5 EXPERIMENT 5 ...... 36 3.4.6 EXPERIMENT 6 ...... 37 3.5 CONCLUSION ...... 39 4 MAXIMUM INNER PRODUCTS PROBLEM 40 4.1 EXISTING SOLUTIONS FOR MIPS ...... 41 4.2 MAXIMUM INNER PRODUCT SEARCH WITH RPT ...... 44

vi

TABLE OF CONTENTS (continued)

Chapter Page 4.3 EMPIRICAL EVALUATIONS ...... 46 4.3.1 EXPERIMENT I : POTENTIAL FUNCTION EVALUATION ...... 46 4.3.2 EXPERIMENT II: PRECISION-RECALL CURVE ...... 49 4.3.3 EXPERIMENT III: ACCURACY VS INVERSE SPEED-UP ...... 50 5 ACTIVE LEARNING 53 5.1 WHAT IS ACTIVE LEARNING? ...... 53 5.2 POOL BASED ACTIVE LEARNING APPROACHES ...... 54 5.2.1 UNCERTAINTY SAMPLING ...... 54 5.2.2 QUERY BY COMMITTEE ...... 54 5.2.3 NEAREST NEIGHBOR TO QUERY HYPERPLANE (NNQH) ...... 55 5.3 PROPOSED METHOD ...... 56 5.4 EXPERIMENTAL RESULTS ...... 57 5.4.1 TOY EXAMPLE ...... 57 5.4.2 SVM SETTING ...... 57 5.4.3 ACCURACY VS SPEED-UP TRADE-OFF ...... 59 5.5 CONCLUSION ...... 60 6 CONCLUSION AND FUTURE WORK 62 6.1 FUTURE WORK ...... 62 BIBLIOGRAPHY 64 APPENDIXES 73 A CHAPTER 2 PROOFS 74 A.1 Proof of Lemma1 ...... 74 A.2 Proof of Lemma3 ...... 74 A.3 Proof of Theorem4 ...... 75 A.4 Proof of Corallary5 ...... 77 A.5 Proof of Corollary6 ...... 78 A.6 Proof of Lemma7 ...... 79 A.7 Proof of Lemma8 ...... 82 A.8 Proof of Lemma9 ...... 82 B CHAPTER 3 PROOFS 85 B.1 Proof of Theorem10 ...... 85 B.2 Proof of Theorem11 ...... 85 B.3 Proof of Theorem12 ...... 86 C CHAPTER 4 PROOFS 88 C.1 Proof of Theorem13 ...... 88 C.2 Proof of Corallary14 ...... 88 C.3 Proof of Theorem15 ...... 89

vii

TABLE OF CONTENTS (continued)

Chapter Page C.4 Proof of Theorem16 ...... 90 C.5 Proof of Corallary17 ...... 91

viii

LIST OF TABLES

Table Page 2.1 Dataset description ...... 21 2.2 For each value of 푝, the left column is size of 푅 and the right one is accuracy ± standard deviation. Bold values are the best values among that row...... 22 3.1 Datasets details ...... 32 3.2 Effect of c on 1-NN search accuracy...... 33 3.3 Comparison of 1-NN search accuracy using defeatist search strategy with auxiliary information with baseline methods...... 34 3.4 Comparisons of 1-NN accuracy of prioritized search with baseline methods...... 35 3.5 Comparisons of 1-NN accuracy of combined method with different priority functions. Performance of two priority functions in combined approach are very similar except in few cases, for fixed number of iterations, priority function fpr2 performs marginally better compared to priority function fpr1 (shown in bold)...... 36 3.6 10-NN search accuracy using Multi-Combined method with different # of trees (L) and # of iterations per tree (iter)...... 37 4.1 Dataset description ...... 46 4.2 Hellinger distance between PD1; PD2 and PD3...... 48

ix

LIST OF FIGURES

Figure Page 1.1 2-dimensional example for 1-NNS problem with euclidean distance as the distance function...... 2

1.2 Assume r2 > r1, then pi is the probability that vi end up in the same bucket as q. Obviously, since D(q; v1) < D(q; v2), p1 should be much larger than p2...... 4 1.3 KD-tree example and its corresponding tree. Each node stores some data points (number of data points in each node is shown inside it), one projection direction and a split value. Maximum number of points allowed in each node (n0) is 6...... 5

1.4 A possible partitioning by RPT for our toy example with n0 = 6...... 7 1.5 Two ways to partition the space, single split point (left) and multiple split points (right). kd-tree, RPT and VST are using a single split point to construct the tree while ST is using multiple split points. Also, all approaches except VST are using a single split point to traverse the tree and answer the query. Note that in a single split point partitioning _ = 1=2 corresponds to median split...... 9 3.1 Defeatist query processing using auxiliary information. Blue node is where a query is routed to. Red rectangles indicates auxiliary information stored on the opposite side of the split point from which candidate nearest neighbors for the unvisited subtree are selected...... 24 3.2 Query processing for three iterations (visiting three different leaf nodes) using priority function. The three retrieved leaf nodes are colored blue. At each internal node an integer represents the priority score ordering. Lower the value, higher the priority. After each new leaf node visit ordering of priority scores is updated. Note that if a mistake is made at the root level and true nearest neighbor lie on the left subtree rooted at root node, with three iterations, DFS will never visit this left subtree and fails to find true nearest neighbor...... 28 3.3 Query processing using combined approach. Blue circles indicate retrieved leaf nodes. Red rectangles indicate auxiliary information stored on the opposite side of the split point at each internal node, from which candidate nearest neighbors are selected, along query processing path for which only one subtree rooted at that node is explored. This figure should be interpreted as a result of applying ideas from section 3.1 to Figure 3.2 after 3rd iteration...... 30 3.4 Trade-o_ between accuracy and running time as we increase c. Accuracy computed Based on 1-NN...... 34 3.5 Number of iterations required to achieve a fixed level of precision...... 38 3.6 Speedup obtained for a fixed level of precision...... 38

x

LIST OF FIGURES (continued)

Figure Page 4.1 Potential function differences (y-axis) vs query index (x-axis) plot (please view in colour): The green line indicates the sorted differences PF3 - PF1, while the red dots indicate the differences PF2 - PF1 against the sorted index of PF3 - PF1. The blue line indicates the sorted differences PF4 - PF1, while the purple dots indicate the differences PF5 - PF1 against the sorted index of PF4 - PF1...... 47 4.2 Histgoram for the potential function with different transformations on eight different datasets. First, second and third columns in the above figure represents histogram of potential function values obtained by applying transformation T1, T2 and T3 respectively...... 49 4.3 Precision recall curves for MIPS with RPTs (please view in colour): The first, second, third and fourth row correspond to n0 = 10; 20; 40 & 50 respectively...... 50 4.4 Accuracy vs inverse speed up plots for six real world datasets with RPTs and Simple-LSH (please view in colour)...... 51 5.1 Toy Example with 2-d datapoints...... 58 5.2 Steps of the svm setting...... 58 5.3 Average F1-score after 3 runs. Classes with high number of edge cases will get the most benefit from active learning (i.e Automobile, Frog and Truck)...... 59 5.4 Performance of combined sparse-RPTB vs EH-Hash. Markers from left to right corresponds to number of iterations [1; 20; 50; 100; 200] ...... 60

1+휖 푦 ∈ 푅푑 ||푥 − 푞|| ≤ ||푦 − 푞|| ≤ ||푥 − 푞||(√ − 1) A.1 Any point that satisfy 1−휖 and (푞 − 푥)푇(푞 − 푦) > (1 − 2휀)||푞 − 푥||||푞 − 푦|| lies in the shaded region...... 78 휃 푎푟푐푡푎푛(1⁄ 푒) Pr(퐴 ) = 1 = A.2 The left panel corresponds to the case when e > 0. In this case, 1 2휋 2휋 where 휃1 is the shaded angle. The right panel corresponds to the case when e < 0.In this case, 휃 휋+푎푟푐푡푎푛(1⁄ 푒) Pr(퐴 ) = 2 = 휃 1 2휋 2휋 where 2 is the shaded angle...... 82

xi

Chapter 1

INTRODUCTION

The main focus of this dissertation is on the well-known problem of nearest neighbor search (NNS) and how to solve it efficiently in the presence of high dimensional data. We start by giving the reader a brief background on the NNS in section.1.1 followed by its state of the art solutions in sections.1.2 and 1.3. In section.1.4 we describe two related search problems which are related to nearest neighbor search and discuss how solving a transformed nearest neighbor search problem enables us to solve these two related search problems. In section.1.5 we discuss some of the limitations of nearest neighbor search using a well known technique called Random Projection Tree (RPT). A major motivation of this dissertation is to address these limitations. Finally, in section.1.6 we list the main contributions of this dissertation.

1.1 NEAREST NEIGHBOR SEARCH

NNS is a well known problem which is widely used in many applications [1, 2, 3, 4, 5]. The problem is defined as follows: Given a S of n d-dimensional data points x1, x2, . . . , xn ∈ d d R , a distance metric dist(·, ·) and a query q ∈ R , we want to find xi ∈ S such that xi = argminx∈Sdist(x, q). In real world data each dimension corresponds to a feature. For example, if our data contains information about patients, [Age, Height, W eight, Gender] could be some of many potential features. As for the metric, euclidean distance is the most common function which is defined as follows: d ∀x, q ∈ R , dist(x, q) = ||x − q||2 (1.1) In this dissertation, we restrict the choice of metric to be Euclidean distance. Often we are interested in finding the top k nearest neighbor of a given query and the resulting search problem is denoted by k-NNS problem. Figure.1.1 show an example of 1-NNS problem involving 37 2-dimensional data point. The query point is red and its 1-NN point is green. A naive way to get this solution is to perform a linear scan over all 37 data points, that is compute the distance from query point to all 37 data points and choose the data point which is closest to the query point to be the 1-NNS solution. To generalize, since dataset S contains n data points, such naive linear scan will take O(n) time. For large dataset where n is large, such linear query time is often unacceptable.

1 Figure 1.1: 2-dimensional example for 1-NNS problem with euclidean distance as the dis- tance function.

To address this problem researchers proposed multiple approaches to solve NNS in a sublinear time [6, 7, 8, 9, 10, 11], most of which are based on partitioning the space. The most well-known approaches are locality sensitivity hashing (LSH) and its variants [6, 8, 9, 12, 13, 14] and tree based methods such as random projection tree (RPT) [10, 11, 15, 16, 17, 18]. Throughout this thesis we focus on RPT and improve its performance and efficiency and show its improvement by comparing it to LSH based methods.

1.2 LOCAL SENSITIVITY HASHING

LSH is a well-known technique to find approximate solution to NNS which reduces the dimensionaliy of high dimensional data using hash functions. The problem of approximate nearest neighbor search (ANNS) is to find a point x ∈ S s.t dist(x, q) ≤ (1 + ε) ∗ dist(x∗, q) where, q is the query point, x∗ is the actual NN and dist could be any distance function. The main idea of LSH [6] is to map high dimensional data to lower dimensions using hash functions that preserves locality with high probability. That means, any two data points which are close to one another tend to achieve identical mapped representation in lower dimension and thus ends up in the same hash bucket, while far away points tend to achieve different mapped representation in lower dimension and thus ends up in the different

2 hash buckets with high probability. A hash function converts a point in Rd to a binary value and by applying multiple hash functions to a single point we get a binary vector which represents a bucket. When a query comes same set of hash functions are applied to identify its corresponding bucket and do a linear search only on the samples belonging to the same bucket as the query. However, as one can guess the most important part of LSH is its hash function which have a direct effect on its performance. A proper choice of hash function ensures that with high probability, close points will end up in the same bucket while far away points lie in different buckets. In the next section we discuss the properties of a good hash function.

1.2.1 HASH FUNCTION As mentioned before, hash function are the heart of LSH and will ensure that with high probability close points will end up in the same bucket while far away points will lie in different buckets. In this section we define LSH family and show how they ensure the aforementioned property. For a set of points in S with a distance measure D, LSH family is defined as:

DEFINITION. A family H is called (r1, r2, p1, p2)-sensitive for D if for any v, q ∈ S

• if v1 ∈ B(q, r1) then P rH [h(q) = h(v1)] ≥ p1

• if v2 ∈/ B(q, r2) then P rH [h(q) = h(v2)] ≤ p2 where, B(q, r) is a ball centered at q with radius r. If we want to have a meaningful and useful LSH family it should satisfy the following conditions.

1. p1 > p2

2. r2 > r1

In other words, q and v1 should belong to the same bucket and v2 should be in a different bucket. Figure.1.2 depicts why these conditions are necessary to have a good LSH family. log(1/p ) Note that we can define ρ = 1 , where ρ represent the search performance of our log(1/p2) algorithm. The smaller the ρ is (i.e. larger gap between p1 and p2) the better our search performance. Under such conditions, [19] showed there exist an algorithm which uses O(dn+ 1+ρ ρ n ) space, with query time O(n log1/p2 n) to find an ANNS answer. In order to amplify the gap between p1 and p2 we can use multiple hash functions and concatenate the output of each function to make a vector. Lets call this vector g, then g(v) = [h1(v), h2(v), ..., hk(v)], where k would be the number of hash functions (bits) for LSH. Large values of k will result in large number of buckets which most of them will be empty. Also, it will decrease the collision probability between close points (i.e. p1). On the other hand, small value of k will result in very small number of buckets which most of them will be very dense (i.e. large p2) and hence not useful. Therefore, k is a very sensitive parameter which the user should choose it with cautious. There are many LSH families which are satisfying mentioned conditions [12, 19, 20, 6, 21, 22], however, hashing using p-stable distribution [19] has been used a lot for euclidean

3 Figure 1.2: Assume r2 > r1, then pi is the probability that vi end up in the same bucket as q. Obviously, since D(q, v1) < D(q, v2), p1 should be much larger than p2.

distance recently [23, 24] and we are going to focus on this particular method in this thesis. For more details on other methods we refer you to [25].

1.2.2 P -ST ABLE DISTRIBUTION HASH FAMILY

A LSH family based on p-stable distribution [19] can be used to solve NN problem for lp distances where p ∈ (0, 2]. DEFINITION. A distribution f is called p-stable where p ≥ 0 if for any n, real num- bers [a1, a2, ..., an] and i.i.d random variables X1,X2, ..., Xn with distribution f the random Pn Pn p 1/p variable i=1 viXi has the same distribution as ( i=1 |vi| ) X where X is a random vari- able with distribution f. Normal distribution, cauchy distribution and L´evydistribution are all satisfying the above property and hence are stable distributions. In particular, normal distribution is 2-stable while cauchy distribution is a 1-stable distribution. [26] provides lower bound analysis for different distance functions. Now, we are ready to see how to generate hash function using p-stable distribution. Assume we have two d-dimensional vectors v1 and v2 and a d-dimensional vector w which its entries chosen uniformly at random from a p-stable distribution. According to > > p-stable definition, (w v1 − w v2) has the same distribution as ||v1 − v2||pX where X has a p-stable distribution. It is easy to see that the distance between vectors will be preserved > locally [19]. w vi is projecting the vector vi onto a single line (real number). If we discretize this line into r bins and assign hash values based on the vector’s bin, the hash function would obviously satisfy the LSH properties. Hence the hash function would be as follows:

w>v + b h (v) = b i c (1.2) w,b r Where b is a real number chosen uniformly from the range [0, r] and r is user defined. We can see that for a k-bit hash function we need to store k directions which needs O(kd) space. For more details on the proofs and bounds see [19]. A very good implementation for p-stable distribution based LSH is publicly available [27], which we use through out this thesis.

4 Figure 1.3: KD-tree example and its corresponding tree. Each node stores some data points (number of data points in each node is shown inside it), one projection direction and a split value. Maximum number of points allowed in each node (n0) is 6.

1.3 TREE BASED APPROACHES

Tree based methods have been around for a long time and are widely used to partition the space [28, 29, 17, 30, 10]. kd tree [31] is one of the first partitioning methods which has been used in many applications [32, 33, 34, 35]. However, there exists a variety of tree based approaches for partitioning the space and in particular finding NN in a sublinear time [10, 11, 15, 36]. Each method has its advantages and disadvantages, however in general for a NNS problem, tree based approaches are shown to have better performance compare to hashing methods at the cost of higher space complexity [37]. In this section we are going to introduce and discuss the most popular tree based approaches for NNS i.e kd tree, random projection trees (RPT) [10] and its variants.

1.3.1 KD TREE kd tree [31] is one of the first tree based approaches to partition the space. It relies on projecting data points onto a single coordinate. Root node consist of all samples. To create left and right childs we pick a coordinate and project all samples in the parent node onto that coordinate and compute their median. Samples which their projection lies to the left of the median will create the left child while others create the right child. This process will be applied to all nodes unless a node contains less than a predefined n0 samples. Figure.1.3 shows a toy example for kd-tree and its corresponding tree. Once the tree is constructed, a nearest neighbor query is answered by starting at the root node and routing the query to an appropriate leaf node and do a linear search for the data points in that leaf node. Algorithm.1 shows the steps of this search. The depth of the tree is log2(n/n0) hence the time complexity would be O(n0 + log2(n/n0)), where n is the total number of data points. Also, the space complexity would be O(n(log2 d + 1)). It has been reported that such a strategy does not work well beyond 30 dimensions [36]. It is easy to observe that the strategy is very similar to LSH in principle. However, the advantage of tree based method is the fine grained control over the number of samples in

5 Algorithm 1 Search strategy Function: SearchTree(T ree, Query) Input: T ree, Query 1: index = Root 2: while T ree[index].type 6= Leaf do 3: if inner(T ree[index].projection direction, Query) ≤ T ree[index].sv then 4: index = T ree[index].left child 5: else 6: index = T ree[index].right child 7: end if 8: end while 9: return Tree[index].samples 10: do a linear search on the returned samples

each bucket. We mentioned that in LSH some buckets could be empty or dense. However, in tree based methods the user has the power of controlling the number of samples in each n bucket by determining n . The number of points in each bucket is in the range [ 0 , n ]. This 0 2 0 will prevent redundancy and will lead to a more efficient and robust algorithm. This approach fails when the query is close to the bucket boundary. In this situation true NN may lie in an adjacent node while the query is routed to the other branch of the subtree rooted at this node. This will make the failure probability become unacceptably high (close to 1/2 [11]). Various strategies have been developed to reduce the failure probability of kd- tree, for example [11] improved the failure probability by increasing time/space complexity and adding randomness which is inspired by LSH. In the following sections we describe spill tree (ST) [36], random projection tree (RPT) [10] and virtual spill tree (VST) [11]. All these methods have one important assumption that the metric is euclidean distance. Since euclidean distance is the most widely used metric, these methods can be applied to many real world problems. In case of non euclidean distance there exist other tree based approaches such as [15] and ball tree [38]. However, in this thesis our metric of choice is euclidean distance.

1.3.2 SPILL TREE As we mentioned, kd-tree fails when the query is close to the decision boundary. Spill tree (ST) was introduced [36] to address this shortcoming. ST is just a variant of kd-tree, which uses the same projection direction and structure as kd-tree. The only difference is that ST choose multiple split values instead of one. To be exact, after projecting data points into a coordinate, just like kd-tree we find the median (med). However, unlike kd-tree, two split values are chosen, med − τ and med + τ would be our two split values. Moreover, when we want to decide if a sample belongs to the left or right child, we apply the following rule: ( v ∈ Left Child, if v ≤ med + τ Let v = inner(X, P rojection Direction) v ∈ Right Child, if v ≥ med − τ

6 Figure 1.4: A possible partitioning by RPT for our toy example with n0 = 6.

Due to the allowed overlap samples in between [med − τ, med + τ] will belong to both left and right child. This way one can capture points which are close to decision boundaries and not lose their information, hence, a data point may belong to multiple leaves. But, when a query comes, since we are only interested in returning a single leaf node, we still use the median as our split value. However, depending on τ value the depth of the tree can grow and the whole partitioning could become meaningless, hence τ should be selected carefully and should usually be small. Another variant of ST was introduced by using a random direction as the projection direction [11]. Also, instead of fixing a real number τ to determine the two split values, they used 1 1 − α and + α fractiles. From now on, whenever we mention ST we are pointing to this 2 2 approach and not the one from [36]. Note that the time and space complexity of ST is 1/1−log2 (1+2α) O(n0 + log2(n/n0)) and O(n d) respectively [11].

1.3.3 RANDOM PROJECTION TREE One way to overcome the problem of kd-tree is to inject randomness, where at each internal node, data points are projected on to a random projection direction. The resulting tree is called Random Projection Tree (RPT) [10]. It was shown that the constructed RPT adopts to intrinsic low dimension of the data [10]. Such a random tree construction, with slight modification, was shown to perform well for solving nearest neighbor search problem and its failure probability to find exact nearest neighbor search was thoroughly analyzed in [11]. Further randomness was added to this approach by picking the split value uniformly at 1 3 1 random from [ , ] fractile ( would correspond to median). Algorithm.2 shows how to 4 4 2 make RPT’s structure. Figure.1.4 shows one possible partitioning of RPT for our toy example. After constructing the tree we can use the same search strategy as in kd-tree (algorithm.1) to answer any query. The time complexity of RPT is the same as kd-tree and ST, however unlike ST its space complexity would be linear O(nd) [10, 11].

7 Algorithm 2 Pseudocode for constructing an RPT Function: MakeRPT(S) Input: S, n0 1: Left = {∅}, Right = {∅} 2: if |S| ≤ n0 then 3: return leaf containing S 4: else 5: Pick a random projection U uniformly at random 1 3 6: Pick α randomly between [ , ] 3 4 7: Determine the split value (sv) from the projection of S onto U 8: for x ∈ S do 9: if inner(x, U) ≤ sv then 10: Left = {Left} ∪ x 11: else 12: Right = {Right} ∪ x 13: end if 14: end for 15: Left Child = MakeRPT(Left) 16: Right Child = MakeRPT(Right) 17: end if

1.3.4 VIRTUAL SPILL TREE Another way to improve the failure probability of kd-tree is to use virtual overlap, where the resulting tree is called Virtual Spill Tree (VST) [11]. Similar to ST, it keeps two split points α − τ and α + τ where α can be the projected median or any randomly chosen point close to the projected mdian to ensure that tree depth is logarithmic in number of samples (data points). While routing a query through any internal node, if the projected query lies to left side of α − τ, only the left subtree is traversed, if it lies to rge right of α + τ, only the right subtree is traversed, otherwise both subtrees are travesed. Since this is done recursively, while answering a NNS query using VST it is possible to access multiple leaf nodes. However, one downside for using VST is that unlike RPT and ST we won’t have control on number of retrieved points any more. Assume n0 = 10 then for both RPT and ST we are sure that the maximum number of retrieved points would be 10, but for VST depending on α value the number of retrieved points could be very large, hence making the time complexity of VST to be close to linear.

1.3.5 FAILURE PROBABILITY ANALYSIS We mentioned two different approaches to improve kd-tree performance and both are parti- tioning the space in a different way (figure.1.5). But, how are they different and how much they improve the performance of kd-tree? A thorough analysis for all three approaches has been done in [11]. Therefore, we are only going to discuss the important results here. Failure probability of RPT, ST and VST can be unified into a single framework [11]. It

8 Figure 1.5: Two ways to partition the space, single split point (left) and multiple split points (right). kd-tree, RPT and VST are using a single split point to construct the tree while ST is using multiple split points. Also, all approaches except VST are using a single split point to traverse the tree and answer the query. Note that in a single split point partitioning α = 1/2 corresponds to median split.

d d has been shown that for xi ∈ R where i = 1, 2..., n and a query q ∈ R , the probability of not finding the nearest neighbor for the aforementioned three methods are related to the following function [11].

n 1 X ||q − x1||2 φ(q, {x , x , ..., x }) = (1.3) 1 2 n n ||q − x || i=2 i 2

{x1, x2, ..., xn} is in ascending order based on the distance of the data points to the query q. Hence, x1 is the actual nearest neighbor. This function is very intuitive and it is easy to see that φ would be close to 1 when all data points are very close to each other and 0 while they are far apart from the actual nearest neighbor. Therefore, φ is a very intuitive measure for the difficulty of the NNS problem. RPT’s failure probability is proportional to φ(log(1/φ)) while this probability for ST and VST is φ [11]. As mentioned earlier we can use φ prior to solving NNS to see how difficult the data set is and what to expect for our accuracy [37]. Note that φ is showing the probability of failure for a single tree only. In reality we are going to use multiple trees to boost the performance and reduce the failure probability [37, 11].

1.4 NEAREST NEIGHBOR SEARCH AND RELATED PROBLEMS

With the growth of technology, capability of storing data increased as well. In fact memory has become so cheap that almost anyone can store millions of data even on their personal computer. This is the main reason that big data became such a popular field in the recent

9 years. But as storage becomes cheaper and cheaper we need to have a way to process it as well. Unfortunately linear processing time is no longer acceptable at the presence of big data in many cases. Many companies have big data and need to use and analyze it in an efficient manner. Some common fields for use of big data are weather forecasting [39], agriculture [40], bioinformatics [41, 42], networking [43], finance [44], speech recognition [45] and many more. NNS is a well-known problem which has applications in many real world problems [46, 47, 48]. Also, NNS is an example of big data where linear time complexity is hardly acceptable. In fact, NNS is such an intuitive problem that in many fields researchers try to convert their problem to a NNS and then solve it. This thesis covers two such problems a) maximum inner product search (MIPS) [49] and b) nearest neighbor to query hyperplane (NNQH) [50].

1.4.1 MAXIMUM INNER PRODUCT SEARCH Chapter 4 discuss this problem in detail, but we give a high level overview in this section. The problem of MIPS is as follows: given a set S ⊂ Rd of d-dimensional points and a query point q ∈ Rd, the task is to find a p ∈ S such that, p = arg max q>x. (1.4) x∈S

Recommender systems is an example of MIPS [51, 52] where given a dataset consist of millions of users and rated items (e.g. Amazon) we try to answer the following question:

• Given a user, which items could be more appealing to that specific user?

Solving equation.1.4 answers this question, where x is the items and q is the user we are asking question about. Now in order to transform this problem to an equivalent NNS (equation.1.1) we need transformations P and Q such that:

2 > arg min ||P (x) − Q(q)||2 = arg max q x. (1.5) x∈S x∈S Once the problem is reduced to NNS it can be solved using any NNS technique such as LSH [53, 51, 54]. However, no one studied which transformation performs best. Chapter 4 is dedicated to use most widely used transformations in combine with RPT and analyses which transformation performs best.

1.4.2 NEAREST NEIGHBOR TO QUERY HYPERPLANE NNQH is a well defined problem where given a hyperplane we try to find the nearest neighbor. This problem is critical to pool-based active learning, where the goal is to request labels for those points that appear most informative. An example of such a problem is semi-supervised support vector machine (SVM) where only a limited number of labeled data points are available [55, 56, 57]. NNQH is defined as follows: given a database x1, ..., xn ∈ S of n points in Rd, the goal is to retrieve the points from the database that are closest to a given hyperplane query whose normal is given by w ∈ Rd. Without loss of generality, we assume

10 that the hyperplane passes through origin, and that each xi, w is unit norm. The Euclidean distance of a point x to a given hyperplane hw parameterized by normal w is:

> > d(hw, x) = ||(x w)w|| = |x w| (1.6) Hence, the task of NNQH is to find x∗ such that: x∗ = arg min |x>w| (1.7) x∈S Two different approaches has been suggested to convert NNQH problem to an equivalent NNS [50] such that LSH would be applicable. However, to the best of our knowledge, there are no existing algorithms to make use of RPT to solve NNQH problem.

1.5 LIMITATIONS OF NEAREST NEIGHBOR SEARCH USING RPT

Due to its superior theoretical guarantee RPT is the main focus of this thesis. However, it is known that RPT has the following major drawbacks which we address them throughout this thesis. • Failure probability of nearest neighbor search using RPT depends on data distribution and a data dependent term called potential function (equation.1.3). Depending on data distribution, failure probability for a single RPT can be high, to compensate for this high failure probability, often multiple (a forest of) RPTs are used. This increases both space complexity and query time which is undesirable for large scale applications. • RPT has high space complexity even for a single tree. Space complexity of storing a single tree is O(nd), where n is the number of samples and d is the dimension of our data. For large n and d, such space complexity is not desirable for many real world applications. • RPTs theoretical guarantee only holds for Euclidean distance. Hence for many ap- plications one needs to reduce the problem to a NNS where the similarity metric is Euclidean distance. While reducing a related search problem such as MIPS (or NNQH) and solve through RPT, it is not clear how to reduce the MIPS to an NNS problem, especially when multiple equivalent reductions are possible. The main motivation of this thesis is to alleviate these issues by answering the following questions: • How space complexity for RPT can be reduced without affecting search quality and query time significantly? • How to improve search quality of a single RPT, thereby reducing the need of using multiple RPTs? • How to reduce a related search problem to a NNS problem in optimal way by ensuring theoretical guarantee.

11 1.6 OUR CONTRIBUTIONS

The main contribution of this thesis is to address the research questions posed in the previous sections. In particular we make the following contributions:

• Address the high space complexity of RPT and propose three different strategies to reduce it without sacrificing the accuracy (chapter2).

• Address the need to have large number of trees to get acceptable accuracy and propose three algorithms to improve a single tree’s accuracy (chapter3).

• Apply proposed RPTs to maximum inner product problem and rank the existing so- lutions. The resulting method is used to solve recommender system problem (chapter 4).

• Apply proposed RPTs to active learning problem by reducing NNQH problem to an equivalent NNS problem (chapter 5).

Rest of this dissertation is organized as follows: In the next chapter we discuss high space complexity problem of RPT and provide three different algorithms to address this issue. In chapter 3, we present three different algorithms to enhance the accuracy of a single RPT and its usability for real world problems. In chapter 4, we apply RPT to solve maximum inner product search problem and use the proposed algorithm to solve recommender system applications. We discuss active learning problem and how RPT can be used for active learning in chapter 5. We conclude in chapter 6 and also provide possible future work directions. Note that part of the contents of chapters 2, 3 and 4 have been published in [58, 59, 60, 61].

12 Chapter 2

SPACE EFFICIENT RPT

In this chapter we address a major drawback of RPT, namely, its high space complexity. We first explain this drawback in details in the next paragraph. We then propose three different strategies to reduce the space complexity in sections 2.1, 2.2 and 2.3 followed by our experimental results in section.5.4. In spite of having nice theoretical guarantee in terms of finding exact nearest neighbors, RPTs are not memory efficient. Each internal node of RPT needs to store a pair consisting of a d dimensional random projection vector and a scalar random split point. Space required to Pln(n/n0) i store these projections directions is i=0 2 · d = O(dn) for constant n0 for a single RPT. Moreover, if L independent such RPTs are used, total memory requirement for storing all the projection directions is O(Ldn). This leads to total space complexity of O(nd + Lnd + Ln) for L RPTs, where the first term is to store the dataset(n d-dimensional data points), the second term is to store the random projection directions and the third term is to store random split points. In comparison, space complexity of LSH, which is a random projection based method, is O(nd + nρd log n + n1+ρ). The first term above is to store n d-dimensional data points. The second terms corresponds to space required to store random projection directions for computing the random hash functions. For a single hash table the random hash function has the form h : Rd → {0, 1}k and one needs to store k d-dimensional random projection vectors to store this. To ensure constant failure probability in solving approximate nearest neighbor search, it is recommended to use k = log n and use L = nρ hash tables, where the value of ρ is 1/c if LSH needs to return a c-approximate nearest neighbor solution1 [6, 19, 12]. Finally the third term n1+ρ corresponds to space required to store nρ hash tables each of which takes O(n) space. In practice however, practitioners use different values of k and L and the space complexity of LSH reduces to O(nd + Lkd + Ln). As one can see from the above discussion, the dominating term appearing in the space complexity expression of RPTs is the term O(Lnd) (compared to the corresponding O(Lkd) term for LSH), i.e. the space required to store the random projection directions. In the following, we discuss three strategies that reduces this term significantly.

1 d ∗ d ∗ For a query q ∈ R , let p be its exact nearest neighbor in S ⊂ R , i.e., p = argminp∈S kq − pk.A c-approximate nearest neighbor of q is any p ∈ S, that satisfies kq − pk ≤ (1 + c)kq − p∗k.

13 2.1 SPACE COMPLEXITY REDUCTION STRATEGY 1: RPTS

Our first strategy is to reduce the space complexity for individual RPTs. Here, instead of storing separate random projection direction at each internal node, we keep a single common random projection direction for all internal nodes located at any fixed tree depth (level). We call this space reduction strategy as RPTS and provide its pseudo code in algorithm 3. Since RPTS tree depth is at most O(log n), each RPTS requires O(d log n) space to store all

Algorithm 3 Function ChooseRule for RPTS Input : data S, depth of the current node from root dl Output : rule function ChooseRule(S, dl)

1: if no projection direction have been chosen for this level dl yet then 2: Pick U uniformly at random from the unit sphere by choosing each of its coordinate independently at random from a standard Normal distribution 3: Pick β uniformly at random from [1/4, 3/4] 4: else 5: Use same U and β already chosen for this level. 6: end if 7: Let v be the β-fractile point on the projection of S onto U 8: Rule(x) = (x · U ≤ v) 9: return (Rule)

the projection directions for that tree. Consequently, if there are L such trees, total space requirement is O(Ld log n). Performance guarantee of RPTS is immediate as projection directions at different levels are independent to each other and we can simply use an union bound of the failure probabilities over the path that conveys query q from root to leaf node of an RPTS and is given in the following lemma.

Lemma 1. Given any query point q, probability that a RPTS fails in finding true nearest neighbor of q is same as that of a RPT.

2.2 SPACE COMPLEXITY REDUCTION STRATEGY 2: RPTB

While RPTS has reduced space complexity as compared to RPT, space required to store the projection directions still increases linearly with L, the number of trees. We now present a second strategy, for which, memory required to store all the projection directions of L trees is independent of L. To achieve this, we keep a fixed number of independent projec- tion directions, chosen uniformly at random from the unit sphere in a bucket. Projection directions from this bucket is used to construct all L randomized partition trees. Using this strategy, while constructing a randomized partition tree, we still use a single projec- tion direction for all nodes located at a fixed level as in RPTS, but the difference now is

14 that projection directions at each level are chosen uniformly at random without replacement from the bucket. Since all projection directions stored in the bucket are independent to each other, this strategy ensures that projection directions at different level of the tree thus constructed are still independent to each other. We call this space reduction strategy as RPTB and provide its pseudo code in algorithm 4. Number of projection directions stored in a bucket is typically a constant times log n, as a consequence, space required for storing all the projections directions of L RPTBs is O(d log n) and is independent of L.

Algorithm 4 Function ChooseRule for RPTB Input : data S, depth of the current node from root dl, constant c Output : rule function ChooseRule(S, dl)

1: if no projection direction have been chosen for this level dl yet then 2: Pick U uniformly at random without replacement from a bucket containing c · log n projection directions. 3: Pick β uniformly at random from [1/4, 3/4] 4: else 5: Use same U and β already chosen for this level. 6: end if 7: Let v be the β-fractile point on the projection of S onto U 8: Rule(x) = (x · U ≤ v) 9: return (Rule)

As before, it is easy to see that,

Lemma 2. Given any query point q, probability that a RPTB fails in finding true nearest neighbor of q is same as that of a RPT.

The reason for keeping the bucket size to be a constant times log n is apparent from the following lemma which states that with high probability no two RPTBs have same sequence of projection directions at every level from root to leaf node.

Lemma 3. For any c ≥ 3, suppose the bucket in RPTB contains c · log n distinct projection directions uniformly chosen at random from a unit sphere,√ where n is the number of data points in S. If the number of RPTBs is limited to at most n, then with probability at least   √1 1 − 2 n , no two RPTBs will have same sequence of projection directions at each level along the path from root to leaf node.

1 Note that, extreme tree depths, i.e., (5/2) · log n or 2 log n happens very rarely. For all our experiments tree depth was very close to log n and a bucket containing 2 · log n distinct projection directions preformed very well.

15 2.3 SPACE COMPLEXITY REDUCTION STRATEGY 3: SPARSE RPT

As for our third strategy we target the d-dimensional projection direction itself and propose a way to avoid storing all dimensions. To reduce the O(d) space complexity at each internal node of an RPT, we propose a sparse RPT where, at each internal node, the random projec- tion direction U ∈ Rd is made sparse by pre-multiplying U with random d×d diagonal matrix B, whose entries are drawn i.i.d from a Bernoulli distribution with success probability p. It is easy to see that for small p, only few entries of this new projection direction BU is non-zero, and these are the entries that need to be stored at each internal node. However, this poses a potential problem because if the entries of data points xi and query q that corresponds > > to the nonzero indices of BU are zero, then (BU) xi = (BU) q = 0. Inspired by [62], we solve this problem by densifying xi and q with an application of norm preserving random rotation using a Walsh-Hadamard matrix and a random diagonal matrix. In particular, let −1/2 hi−1,j−1i H be a d × d Walsh-Hadamard matrix whose entries are given by Hij = d (−1) , where hi − 1, j − 1i is the dot product (modulo 2) of the vectors i, j expressed in binary. Also, let D be a d × d diagonal matrix whose entries are drawn independently from {−1, 1} with success probability 1/2. It is easy to see that kHDxik = kxik, kHDqk = kqk and kHD(x − q)k = kx − qk. This simple modification leads to our sparse RPT which is shown in algorithm 5 and 6.

Algorithm 5 Sparse RP-tree d Input : data S = {x1, . . . , xn} ⊂ R , maximum number of data points in leaf node n0 Preprocessing : Pre-multiply each xi ∈ S and let S = {HDxi : xi ∈ S}. Output : tree data structure function MakeTree(S, n0)

1: if |S| ≤ n0 then 2: return leaf containing S 3: else 4: Rule = ChooseRule(S) 5: LeftTree = MakeTree({x ∈ S : Rule = true}, n0) 6: RightTree = MakeTree({x ∈ S : Rule = false}, n0) 7: return (Rule, LeftTree, RightTree) 8: end if

Note that while answering query q, we first need to apply the same transformation and find nearest neighbors of HDq. As we will in next section, for any fixed , δ ∈ (0, 1), setting   nd  (1+) log( δ ) log(1/δ) p = min 1, Θ 2d leads to expected fraction of non-nearest neighbors of q that falls between q and its nearest neighbor to be same as that of non-sparse RPT except an additional multiplicative factor (1 + ) and an additive factor (δ + η()), where η() is an increasing function of  defined in Corollary 5. This reduces the space complexity of  nd  n log( δ ) log(1/δ) sparse RPT to Θ 2 as compared to Θ(nd) in case of non-sparse RPT. Note

16 also that by property of Walsh-Hadamard matrix, any matrix vector multiplication involving d × d Walsh-Hadamard matrix can be computed in O(d log d) time. As a by product of this,  nd  log n log( δ ) log(1/δ) query time of our proposed sparse RPT becomes O d log d + 2 . For large n (compared to d), this query time can be potentially much faster as compared to that of its non-sparse version (see Corollary 6 for details).

Algorithm 6 Function ChooseRule for sparse RP-tree Input : data S Output : rule function ChooseRule(S)

1: Pick U uniformly at random from the unit sphere by choosing each of its coordinate independently at random from a standard Normal distribution 2: Pick a diagonal matrix B whose entries are drawn independently from a Bernoulli dis- tribution with success probability p. 3: Pick β uniformly at random from [1/4, 3/4] 4: Let v be the β-fractile point on the projection of S onto BU 5: Rule(x) = (x>BU ≤ v) 6: return (Rule)

2.3.1 ANALYSIS OF SPARSE RPT FOR NEAREST NEIGHBOR SEARCH In this section we present theoretical analysis of our proposed sparse RPT for nearest neigh- bor search. Since structurally, sparse and non-sparse versions of RPT are very similar except the sparse random projection direction, if we can get an estimate of the expected fraction of non-nearest neighbors that fall between query q and its nearest neighbor upon projection at any internal node, we can essentially reuse the proof technique developed in [11] to bound the failure probability of nearest neighbor search, simply by plugging in the corresponding estimate for sparse version of RPT. What we will show in this section is that the above es- timate for sparse RPT is very similar to that of non-sparse RPT except an small additional multiplicative as well as small additive term. More importantly, these additional terms are user controllable and can be made as small as one wants at the expense of how much sparsity our proposed method can handle for a fixed d. Before we present the actual proof, we provide a high level proof sketch.

2.3.1.1 Proof sketch

Crux of the analysis is to solve the following problem: given any x, y, q ∈ Rd with kq − xk ≤ kq − yk, what is the probability that upon projection onto a random direction U, U >y falls strictly between U >q and U >x, which is equivalent to asking what is the probability that U >(y − q) falls strictly between 0 and U >(x − q). In [11], without loss of general- ity, this problem is solved by assuming that x = kxke1 = (kxk, 0,..., 0) and simplify-

17 ing the proof by taking advantage of that assumption. In our proposed method we can not make this assumption since we are densifying the query and data points by applying Walsh-Hadamard transform. Additionally, in our case projection direction is not U but BU, where B is a d × d diagonal matrix whose entries are drawn independently from a > > > Bernoulli distribution. Letting xB = (BU) HDx, yB = (BU) HDy, qB = (BU) HDq > > and X1 = (BU) HD(x − q),X2 = (BU) HD(y − q), we observe the conditioned on > B,(X1,X2) follows a bivariate normal distribution with zero mean and covariance ma-  2 >  kxB − qBk (xB − qB) (yB − qB) trix CB = > 2 . Using this observation we develop (xB − qB) (yB − qB) kyB − qBk a new proof technique to find the probability that X2 fall strictly between 0 and X1 in Lemma 7. Note that Lemma 7 can be applied to non-sparse version of RPT and we can recover Lemma 1 of [11]. Note that in non-sparse case (where H,B,D are identity matrix) > (X1,X2) follows a bivariate normal distribution with zero mean and covariance matrix  kx − qk2 (x − q)>(y − q) and due to assumption on x, y, q, second diagonal ky −qk2 (x − q)>(y − q) ky − qk2 is at least as large as the first diagonal kx−qk2. This makes the proof simpler. In sparse case 2 2 however, due to random choice of B, kyB − qBk may be even smaller than kxB − qBk and this makes the proof of Lemma 7 more involved. Next, we observe that taking expectation  pkx − qk2 p(x − q)>(y − q) with respect to B, (C ) = . Moreover, we show in EB B p(x − q)>(y − q) pky − qk2 2 Lemma 9 that over random choice of B, entries of CB are tightly concentrated near their respective exceptions with high probability. This gives us the desired value of Bernoulli suc- cess probability p. Equipped with this, using Lemma 7, 9 and technical Lemma 8 we prove the main theorem (Theorem 4) for sparse RPT.

2.3.1.2 Main result Here we present the main theorem of this chapter. Statement of all auxiliary lemmas are presented at the end of this section.

Theorem 4. Let H be d × d Walsh-Hadamard matrix. Pick any q, x, y ∈ Rd with kq − xk ≤ kq − yk. Pick any random U ∈ Rd whose entries are drawn i.i.d from a standard Normal distribution and a random diagonal matrix D ∈ Rd×d, whose entries are ±1 and drawn independently and uniformly. Pick any , δ ∈ (0, 1) and a random diagonal matrix B ∈ Rd×d whose entries are drawn i.i.d from a Bernoulli distribution with success probability   nd  (1+) log( δ ) log(1/δ) p = min 1, Θ 2d and are independent from entries of U and B. Let B be the event, B ≡ {U >BHDy falls between U >BHDq and U >BHDx}. Then the following holds.

( 1 kq−xk (1 + ) + δ, if 1C > 0 Pr(B) ≤ 2 kq−yk 1 otherwise.

2The diagonal terms are more tightly concentrated than the off-diagonal terms as norm is much better preserved than inner product in this case.

18 where, the indicator function 1C is defined as, ( 1, if (x − q)>(y − q) ≤ (1 − 2)kq − xkkq − yk 1C = 0 otherwise.

The following corollary follows immediately from Theorem 4.

Corollary 5. Let H be d × d Walsh-Hadamard matrix. Pick any random U ∈ Rd whose entries are drawn i.i.d from a standard Normal distribution and a random diagonal matrix D ∈ Rd×d, whose entries are ±1 and drawn independently and uniformly. Pick any , δ ∈ (0, 1) and a random diagonal matrix B ∈ Rd×d whose entries are drawn i.i.d from a Bernoulli   nd  (1+) log( δ ) log(1/δ) distribution with success probability p = min 1, Θ 2d and are independent d from entries of U and B. Pick any q, x1, . . . , xn ∈ R . If these points are projected to BU then the expected fraction of the projected xi that fall between q and x(1) is at most 1 2 (1 + )Φn(q, {x1, . . . , xn}) + δ + η(), where η() is the fraction of the points xi that satisfy q 1+  > kx(1) − qk ≤ kxi − qk ≤ kx(1) − qk 1− − 1 and (q − x(1)) (q − xi) > (1 − 2)kq − x(1)kkq − xik. Note that in case of original RP-tree, the expected fraction of non-neighbor points that 1 fall between q and its nearest neighbor x(1) upon projection is 2 Φn(q, {x1, . . . , xn}). It is easy to see that for our proposed sparse version, the additional multiplicative term (1 + ) and the additive term (η() + δ) can be made small. To see this, fix ρ, 0 < ρ < 1, and set q nd 1  log( δ ) log( δ )  = Θ dρ . While on one hand this ensures the space complexity of sparse RPT to be O(ndρ) as opposed to O(nd) in case of original RPT, on the other hand, for fixed ρ, as d increases,  decreases (in fact  → 0 as d → ∞), and consequently the additional multiplicative term (1+) approaches 1. In addition, as can be seen from Figure A.1, volume of the shaded region tends to zero as  → 0, therefore, the fraction of the data points that fall within this shaded region, η(), tends to zero as well. Finally, the confidence parameter δ can be chosen arbitrarily small as its effect is reflected in terms of log(1/δ) in p. Next, we show that for large d, query time of our proposed sparse RPT is smaller than that of the original RP-tree, for large dataset size.

Corollary 6. Fix any ρ ∈ (0, 1/2) and  ∈ (0, 1). Choose d large enough so that space complexity of proposed sparse RPT is O(ndρ). Then query time of our proposed sparse RPT is smaller than that of original RP-tree if n ≥ d2.

The following lemmas are required to prove Theorem 4.

d > d Lemma 7. Pick any q, x, y ∈ R . Pick any random U = (U1,...,Ud) ∈ R whose el- ements are drawn i.i.d from a standard Normal distribution. Let A be the event A ≡ {U >y falls (strictly) between U >q and U >x}. Then the following holds. (a) If ky − qk2 ≥ (x − q)>(y − q), then,  s  1 kx − qk (x − q)>(x − y)2 Pr(A) = arcsin 1 − π  ky − qk kx − qkkx − yk 

19 (b) If ky − qk2 < (x − q)>(y − q), then,  s  1 kx − qk (x − q)>(x − y)2 Pr(A) = 1 − arcsin 1 − π  ky − qk kx − qkkx − yk 

d d Lemma 8. Let S = {x1, . . . , xn} ⊂ R be a set of n vectors in R , H be a d × d determin- istic Walsh-Hadamard matrix and D be a d × d diagonal matrix, where each Dii is drawn independently from {−1, +1} with probability 1/2. Then for any δ > 0, with probability at q 2 log(2nd/δ) least 1 − δ, the following holds for all xi ∈ S : kHDxik∞ ≤ kxik d . q d 2nd  Lemma 9. Let v1, v2 ∈ R be any two vectors such that kv1k∞ ≤ kv1k 2 log δ /d and q 2nd  > d kv2k∞ ≤ kv2k 2 log δ /d and let U = (U1,...,Ud) ∈ R be a random vector whose entries are drawn i.i.d from a standard Normal distribution. Also, for any , δ ∈ (0, 1), let B be a d × d diagonal matrix whose diagonal entries Bii’s are drawn i.i.d from a Bernoulli  2nd  4(1+/3) log( δ ) log(8/δ) distribution with success probability p = min 1, 2d , and where diagonal > > entries of B are independent from entries of U. Let Y1 = U Bv1 and Y2 = U Bv2. Then the following holds:

> 1. (Y1,Y2) follows a bivariate normal distribution with zero mean and covariance matrix CB = Pd 2 2 Pd 2 ! i=1 Biiv1i i=1 Biiv1iv2i Pd 2 Pd 2 2 , where CB is random quantity. i=1 Biiv1iv2i i=1 Biiv2i  2 >  pkv1k p(v1 v2) 2. EB(CB) = > 2 . p(v1 v2) pkv2k δ 2 Pd 2 2 2 2 3. With probability at least 1− 2 , (1−)pkv1k ≤ i=1 Biiv1i ≤ (1+)pkv1k and (1−)pkv2k ≤ Pd 2 2 2 i=1 Biiv2i ≤ (1 + )pkv2k . δ >  2 2  Pd 2 >  2 2  4. With probability at least 1− 2 , p v1 v2 − 2 (kv1k + kv2k ) ≤ i=1 Biiv1iv2i ≤ p v1 v2 + 2 (kv1k + kv2k ) .

2.4 EXPERIMENTAL RESULTS

In this section we report empirical performance of the proposed sparse RPT data structure in solving nearest neighbor search problem. We used four real world datasets of varied dimensionality. Number of data points to construct our proposed data structure and number of queries are listed in Table 2.1. In our experiments, we randomly split each dataset D into two disjoint subsets S and Q, such that, D = S ∪ Q, where all data points in S were used to construct the RPTs (or their sparse versions) and these tree data structure was then used to find 10 nearest neighbors for each query q from Q. We evaluated our proposed method using number of trees (L) chosen from the set {8, 16, 32, 64, 128}. For each choice of L, we created L independent sparse RPTs. Now given a query point q, we retrieved union of all the points corresponding to L leaf nodes of these trees (call this set R) and find the 10 nearest neighbors from R. We say that our method accurately finds 10 nearest neighbors

20 only if our answer is same as the true 10 nearest neighbors obtained by performing a linear scan over the entire dataset. Averaging over all queries in Q, we report nearest neighbor accuracy and standard deviation of our proposed method. We also report average number of retrieved points (size of R). In our experiments, Q contains 5000 query points, except for USPS dataset for which Q contain 2298 data points. We also set n0 to be 100.

2.4.1 DATASETS Details of the datasets used for our experimental evaluations are listed in table.2.1. The USPS dataset contains hand written digits. The AERIAL dataset contains texture information of large areal photographs [63]. The COREL dataset is available at the UCI repository [64]. After removing the missing data we keep only 50,000 instances. This SIFT dataset contains SIFT image descriptors, introduced in [65]. The original dataset contains 1 million image descriptors. We used 50,000 image descriptors from this dataset for our experiments. USPS had only 9298 data points. Therefore, we used 7000 for building the data structure and the remaining as query points.

Table 2.1: Dataset description

Dataset # points in S # of queries # dimension AERIAL 45000 5000 60 COREL 45000 5000 89 SIFT 45000 5000 128 USPS 7000 2298 256

2.4.2 COMPARISON OF SPARSE AND NON-SPARSE RPT To empirically demonstrate the effectiveness of our proposed method we used multiple p values, where p is Bernoulli success probability of choosing a non-zero coordinates of a projection direction at each internal node of an RPT. In particular, we used p values from the set {0.1, 0.3, 0.5, 0.7, 1.0}. Note that, we only need to store the non-zero coordinates of projection direction at internal node, since other coordinates do not contribute in computing an inner product (1-D projection). Thus, if we have m number of internal nodes in a tree, the non-sparse version needs to store (m·d) coordinates (where d is the data dimensionality), whereas sparse version will store (m · d · p) coordinates on average. Thus percentage of space savings is (m·d−m·d·p)/(m·d) = (1−p) percentage. Thus p = 1.0 corresponds to original non-sparse RPT (no space savings), whereas, p = 0.1, we have 90% space savings. Note that based on lemma 1 and 2 RPTS and RPTB will have the performance as original RPT. Goal of this experiment was to evaluate how sparse RPT answers nearest neighbor search query compared to its non-sparse counterpart for various values of p. The results for four datasets are reported in Table 2.2. As we expected, in most cases non-sparse RPT has the best performance in terms of higher accuracy and lower number of retrieved points (R). More importantly, both accuracy and number of retrieved points of sparse RPTs are very close to that of non-sparse RPTs, even when p equals to 0.1.

21 Table 2.2: For each value of p, the left column is size of R and the right one is accuracy ± standard deviation. Bold values are the best values among that row.

Aerial p = 1 p = 0.7 p = 0.5 p = 0.3 p = 0.1 L = 8 503 0.716 ± 0.21 505 0.709 ± 0.22 503 0.707 ± 0.22 508 0.711 ± 0.22 508 0.701 ± 0.22 L = 16 923 0.885 ± 0.14 925 0.884 ± 0.14 921 0.885 ± 0.14 925 0.888 ± 0.14 930 0.878 ± 0.15 L = 32 1617 0.975 ± 0.06 1620 0.974 ± 0.06 1615 0.974 ± 0.06 1619 0.975 ± 0.06 1626 0.974 ± 0.06 L = 64 2691 0.998 ± 0.01 2702 0.998 ± 0.01 2686 0.998 ± 0.01 2692 0.998 ± 0.02 2703 0.998 ± 0.01 L = 128 4243 1 ± 0.00 4246 1 ± 0.00 4220 1 ± 0.00 4239 0.999 ± 0.00 4261 0.999 ± 0.00 Corel 1 − p = 0 1 − p = 0.3 1 − p = 0.5 1 − p = 0.7 1 − p = 0.9 L = 8 510 0.757 ± 0.18 510 0.750 ± 0.18 510 0.748 ± 0.18 513 0.752 ± 0.18 514 0.754 ± 0.18 L = 16 936 0.925 ± 0.10 936 0.923 ± 0.11 939 0.920 ± 0.11 939 0.921 ± 0.10 940 0.924 ± 0.10 L = 32 1644 0.990 ± 0.03 1648 0.990 ± 0.04 1652 0.989 ± 0.04 1651 0.988 ± 0.04 1647 0.989 ± 0.04 L = 64 2742 1 ± 0.01 2749 1 ± 0.01 2752 0.999 ± 0.01 2753 0.999 ± 0.01 2742 1 ± 0.01 L = 128 4324 1 ± 0.00 4347 1 ± 0.00 4342 1 ± 0.00 4338 1 ± 0.00 4318 1 ± 0.00 SIFT 1 − p = 0 1 − p = 0.3 1 − p = 0.5 1 − p = 0.7 1 − p = 0.9 L = 8 546 0.444 ± 0.24 546 0.441 ± 0.24 547 0.437 ± 0.24 547 0.434 ± 0.24 546 0.433 ± 0.24 L = 16 1062 0.637 ± 0.23 1058 0.633 ± 0.23 1057 0.635 ± 0.23 1061 0.631 ± 0.23 1056 0.625 ± 0.23 L = 32 2007 0.824 ± 0.17 2003 0.822 ± 0.17 2003 0.822 ± 0.17 2009 0.819 ± 0.17 2008 0.818 ± 0.17 L = 64 3669 0.948 ± 0.09 3678 0.946 ± 0.09 3676 0.947 ± 0.09 3678 0.945 ± 0.09 3683 0.946 ± 0.09 L = 128 6387 0.993 ± 0.03 6400 0.993 ± 0.03 6400 0.993 ± 0.03 6399 0.992 ± 0.03 6405 0.993 ± 0.03 USPS 1 − p = 0 1 − p = 0.3 1 − p = 0.5 1 − p = 0.7 1 − p = 0.9 L = 8 496 0.740 ± 0.22 513 0.730 ± 0.22 497 0.733 ± 0.22 505 0.743 ± 0.22 507 0.749 ± 0.21 L = 16 907 0.907 ± 0.13 930 0.906 ± 0.13 901 0.904 ± 0.14 916 0.909 ± 0.13 923 0.906 ± 0.13 L = 32 1567 0.981 ± 0.05 1595 0.981 ± 0.06 1573 0.980 ± 0.06 1578 0.981 ± 0.05 1602 0.981 ± 0.05 L = 64 2541 0.998 ± 0.02 2587 0.998 ± 0.02 2553 0.998 ± 0.01 2537 0.998 ± 0.02 2575 0.998 ± 0.01 L = 128 3781 1 ± 0.00 3806 1 ± 0.00 3782 1 ± 0.00 3768 1 ± 0.00 3795 1 ± 0.00

This chapter introduced three strategies to reduce the space complexity of RPT and we showed, both theoretically and empirically, these strategies are performing very similar to RPT. Next chapter is going to discuss how to improve a single RPT performance.

22 Chapter 3

IMPROVING THE PERFORMANCE OF A SINGLE TREE

In this chapter we address another major drawback of RPTs, which is the need to use high number of trees to achieve acceptable accuracy. We first explain why so many trees are needed and then in sections.3.1, 3.2 and 3.3 we propose three different approaches to overcome this problem. Finally section.3.4 includes the experimental results for the proposed approaches. As we mentioned in chapter1, although RPT has proven to have superior performance compare with LSH methods, it has its own shortcomings. One disadvantage is the high space complexity which we addressed in chapter2. The other problem is if a mistake is made at the top level (near the root node), that is, say left subtree at this level is not visited and this subtree contains the true nearest neighbor, then rest of the search becomes useless. To avoid this two search strategies, spill tree and virtual spill tree [36] have also been proposed. Both methods have been discussed in details in 1.3.2 and 1.3.4. ST’s improvement comes at the cost of super-linear (as opposed to linear) space complexity for a single tree, which may be unacceptable for large scale applications. Also, while VST’s search performance typically improves with allowed virtual overlap, enforcing user defined control on how many leaf nodes to visit becomes problematic and often leaf nodes containing useless information are retrieved. It is worth mentioning that we need to use multiple trees to get an acceptable performance for RPT, which will increase space complexity as well. To address these issues, in this chapter we propose various strategies to improve nearest neighbor search performance of a single space partition tree by using auxiliary information and priority functions. We use properties of random projection to choose auxiliary information. Our proposed auxiliary information based prioritized guided search improves nearest search performance of a single RPT significantly. We make the following contributions in this chapter. • Using properties of random projection, we show how to store auxiliary information of additional space complexity O˜(n + d) at the internal nodes of an RPT to improve nearest neighbor search performance using defeatist search strategy.

• We propose two priority functions to retrieve data points from multiple leaf nodes of

23 a single RPT and thus extend the usability of RPT beyond defeatist search. Such pri- ority functions can be used to perform nearest neighbor search under a computational budget, where to goal is to retrieve data points from most informative leaf nodes of a single RPT specified by the computational budget.

• We combine the above two approaches to present an effective methodology to improve nearest neighbor search performance of a single RPT and perform extensive exper- iments on six real world datasets to demonstrate the effectiveness of our proposed method.

3.1 DEFEATIST SEARCH WITH AUXILIARY INFORMATION

In our first approach, we introduce a modified defeatist search strategy, where, at each internal node we store auxiliary information to compensate for the fact that while routing a query from root node to a leaf node, only one of the two branches is chosen at each internal node. The stored auxiliary information at any internal node aims to compensate for the un- visited subtree rooted at this node by identifying a small set of candidate nearest neighbors that lie in this unvisited subtree. Note that this small set of candidate nearest neighbor points otherwise would not be considered had we adopted the traditional defeatist search strategy. Now, to answer a nearest neighbor query, a linear scan is performed among the points lying in the leaf node (where the query is routed to) and union of the set of identified candidate nearest neighbor points at each internal node along the query routing path as shown in Figure 3.1. A natural question that arises is, what kind of auxiliary information

Figure 3.1: Defeatist query processing using auxiliary information. Blue node is where a query is routed to. Red rectangles indicates auxiliary information stored on the opposite side of the split point from which candidate nearest neighbors for the unvisited subtree are selected. can we store to achieve this? On one hand, we would like to ensure that auxiliary infor- mation does not increase space complexity of the data structure significantly, while on the other hand, we would like the candidate nearest neighbors identified at each internal node along the query routing path to be query dependent (so that not the same candidate nearest neighbors are used for every query), and therefore, this additional query dependent compu- tation (for each query) needs to be performed quickly without significantly increasing overall query processing time. We argue next that we can exploit properties of random projection to store auxiliary information that helps us achieve the above goals. Note that if we have an ordering of the distances of points from S to query q, then any one dimensional random projection has the

24 Algorithm 7 RPT construction with auxiliary information d Input : data S = {x1, . . . , xn} ⊂ R , maximum number of data points in leaf node n0, aux- d−1 iliary index size c, m independent random vectors {V1,...,Vm} sampled uniformly from S .

Output : tree data structure function MakeTree(S, n0)

1: if |S| ≤ n0 then 2: return leaf containing S 3: else 4: Pick U uniformly at random from Sd−1 5: Let v be the median of projection of S onto U 6: Set ail be the set of indices of c points in S so that upon projection onto U, they are the c closest points to v from the left. 7: Set air be the set of indices of c points in S so that upon projection onto U, they are the c closest points to v from the right. th 8: Construct a c × m matrix Lcnn whose i row is the vector > > > (V1 xail(i),V2 xail(i),...,Vm xail(i)). th 9: Construct a c × m matrix Rcnn whose i row is the vector > > > (V1 xair(i),V2 xair(i),...,Vm xair(i)). 10: Rule(x) = (x>U ≤ v) 11: LSTree = MakeTree({x ∈ S : Rule(x) = true}, n0) 12: RSTree = MakeTree({x ∈ S : Rule(x) = false}, n0) 13: return (Rule, LSTree, RSTree) 14: end if

property that upon projection, this ordering (of projected points) is perturbed locally near projected q but is preserved globally with high probability as shown below.

d d Theorem 10. Pick any query q ∈ R and set of database points S = {x1, . . . , xn} ⊂ R and let x(1), x(2),... denote the re-ordering of the points by increasing distance from q, so that x(1) is nearest neighbor of q in S. Consider any internal node of RPT that contains a subset S ⊂ S containing x(1) and q. If q and points from S are projected onto a direction U chosen at random from a unit sphere, then for any 1 ≤ k < |S|, the probability that there exists a > > subset of k points from S that are all not more than |U (q − x(1))| distance away from U q upon projection is at most 1 P|S| kq−x(1)k2 . k i=1 kq−x(i)k2 Roughly speaking, this means that upon projection, those points that are far away from projected q are unlikely to be close to q in original high dimension (thus unlikely to be its nearest neighbors). On the other hand, the projected points that are close to projected q, may or may not be its true nearest neighbors in original high dimension. In other words, with high probability, true nearest neighbor of any query will remain close to the query even after projection, since distance between two points does not increase upon projection, however points which were far away from query in original high dimension may come closer to the query upon projection. Therefore, among the points which are close to the projected

25 q will definitely contain q’s true nearest neighbor but possibly will also contain points which are not q’s true nearest neighbors (we call such points nearest neighbor false positives). We utilize this important property to guide us choose auxiliary information as follows. At any internal node of an RPT, suppose the projected q lies on left side of the split point (so that left child node falls on the query routing path). From the above discussion, it is clear that if q’s nearest neighbor lie on the right subtree rooted at this node, then their 1-d projections will be very close to the split point on the right side. Therefore, to identify q’s true nearest neighbor, one possibility is to store actual c high dimensional points (where c is some pre-defined fixed number) which are closest c points (upon projection) to the right side of the split point as auxiliary information for this node. There are two potential problems with this approach. First, additional space complexity due to such auxiliary information is O(nd) which may be prohibitive for large scale applications. Second, because of 1-d random projection property (local ordering perturbation), we may have nearest neighbor false positives within these c points. To prune out these nearest neighbor false positives for each query, if we attempt to compute actual distance from q to these c points in original high dimensions and keep only closest points based on actual distance as candidate nearest neighbors, this extra computation, will increase query time for large d. To alleviate this, we rely on celebrated Johnson Lindenstrauss lemma [66] which says that if we use m =  log(c+1)  O 2 random projections then pairwise distance between c points and q are preserved in Rm within a multiplicative factor of (1 ± ) of the original high dimensional distance in Rd. Equipped with this result, at each internal node of an RPT we store two matrices of size c × m (one for left subtree, one for right) as auxiliary information, that is for each of these c points, we store their m dimensional representation as a proxy for their original high dimensional representation. For all our experiments, we set m to be 20. Algorithm 7 provides details of an RPT construction where line 6-9 is different from traditional RPT construction and describes how auxiliary information is stored. With this modification and the following theorem shows that additional space complexity due to aux- ˜ iliary information is O(n + d), where we hide the log log(n/n0) factor. Theorem 11. Consider a modified version of RPT where, at each internal node of the RPT, auxiliary information is stored in the form of two matrices, each of size c × m (one for left subtree, one for right). If we choose c ≤ 10n0, additional space complexity of this modified version of RPT due to auxiliary information is O˜(n + d).

In the above theorem n0 is user defined maximum leaf node size of an RPT and for all practical purpose log log(n/n0) can be treated as a constant. Therefore, additional space complexity is merely O(n+d). While processing a query q using this strategy, to prune out the nearest neighbor false positives at any internal node of an RPT, we project q onto m random projection directions to have q’s m dimensional representationq ˜. At each internal node along the query routing path, we useq ˜ to add c0, where c0 < c and c0 is some pre-defined fixed number, candidate nearest neighbors to the set of retrieved points (by computing distances fromq ˜ to c data points in m dimensional representation and keeping the c0 closest ones) to compensate for the unvisited subtree rooted at this node. We use c0 = 10 for all our experiments. Details of query processing using this approach is presented in Algorithm 8.

26 Algorithm 8 Query processing using defeatist search with auxiliary information

Input : RP tree constructed using Algorithm.7, m independent random vectors {V1,...,Vm} sampled uniformly from Sd−1, query q, number of candidate neighbors at each node c0. Output : Candidate nearest neighbors

1: Set Cq = ∅. > > > 2: Setq ˜ = (V1 q, V2 q, . . . , Vm q) 3: Set current node to be root node of the input tree. 4: while current node 6= leaf node do 5: if U >q < v then 6: A = Rcnn 7: ai = air 8: current node = current node.left 9: else 10: A = Lcnn 11: ai = ail 12: current node = current node.right 13: end if 14: Sort the rows of A in increasing order according to their distance fromq ˜ and let array a contains these sorted indices. 0 15: Cq = Cq ∪ {ai(a(1)), ai(a(2)),..., ai(a(c ))} 16: end while 17: Set leafq to be the indices of points in S that lie in leaf node. 18: Cq = Cq ∪ leafq 19: return Cq

Note that due to this modification, time to reach leaf node increases from O(d log(n/n0)) 1 to O ((d + cm + c log(c)) log(n/n0) + md) and the number of retrieved points that require 2 0 a linear scan increases from n0 to (n0 + c log(n/n0)). We also note that using spill tree for defeatist search yields a super-linear space complexity as shown by the following theorem.

Theorem 12. Space complexity for a spill tree with α percentile overlap, where α ∈ (0, 1),   1  on each side of the median at each internal node is O dn 1−log2(1+2α) .

3.2 GUIDED PRIORITIZED SEARCH

In our second approach, we seek to retrieve data points from multiple leaf nodes, as opposed to a single leaf node, of an RPT as candidate nearest neighbors. We can specify a constant

1This increase is due to one time computation of m dimensional representationq ˜ of q, and then at each internal node along the query routing path computation of distances in m dimensional space fromq ˜ to c points, and sorting these c distances. 2This increase is due to additional c0 retrieved points at each internal node along the query routing path.

27 Figure 3.2: Query processing for three iterations (visiting three different leaf nodes) using priority function. The three retrieved leaf nodes are colored blue. At each internal node an integer represents the priority score ordering. Lower the value, higher the priority. After each new leaf node visit ordering of priority scores is updated. Note that if a mistake is made at the root level and true nearest neighbor lie on the left subtree rooted at root node, with three iterations, DFS will never visit this left subtree and fails to find true nearest neighbor.

number of leaf nodes (say, l) a-priori from which points will be retrieved and a linear scan will be performed. In order to identify l such appropriate leaf nodes, we present a guided search strategy based on priority functions. The general strategy is as follows. First, the query is routed to the appropriate leaf node as usual. Next, we compute a priority score for all the internal nodes along the query routing path. These priority scores along with their node locations are stored in a priority queue sorted by priority scores in decreasing order. Now we choose the node with highest priority score, remove it from priority queue, and route the query from this node to its child node that is not visited earlier. Once reached at this child node, standard query routing is followed to reach to a different leaf node. Next, priority scores are computed for all internal nodes along this new query routing path and are inserted to the priority queue. This process is repeated until l different leaf nodes are visited (see Figure 3.2). This search process is guided by the current highest priority score where high priority score of an internal node indicates that there is a high likelihood that nearest neighbors of the query lies in unexplored subtree rooted at this node and must be visited to improve nearest neighbor search performance. We use local perturbation property of 1-d random projection to define a priority score. At any internal node (with stored projection direction U and split point v) of an RPT, we define priority score at this node to be 1 f (U, v, q) = (3.1) pr1 |v − U >q| Here the intuition is that if the projected query lies very close to the split point and since distance ordering is perturbed locally (Theorem 10), there is a very good chance that true nearest neighbor of the query, upon projection is located on the other side of the split point. Therefore, this node should be assigned a high priority score. Just because a query upon projection lies close to the split point at any internal node of an RPT does not necessarily mean that nearest neighbor of the query lies on the opposite side of the split point (unvisited subtree rooted at this node). However, at any internal node of an RPT, if the minimum distance (original distance in Rd) between the query and the set of points lying on the same side of the split point as the query is larger than the minimum distance between the query and the set of points lying on the opposite side of the split point, then visiting the unvisited child node rooted at this node will make more sense. We use this idea to design our next priority function. Since computing actual distance in Rd will increase query time, we use auxiliary information for this purpose.

28 Algorithm 9 Query processing using priority function Input : RP tree constructed using Algorithm.7, query q, number of iterations t Output : Candidate nearest neighbors

1: Set Cq = ∅. 2: Set P to be an empty priority queue. 3: Set current node to be root node of the input tree. 4: if t > 0 then 5: while current node 6= leaf node do 6: if U >q < v then 7: current node = current node.left 8: else 9: current node = current node.right 10: end if 11: Compute priority value 12: P.insert(current node, priority value). 13: end while 14: Set leafq to be the indices of points in S that lie in leaf node. 15: Cq = Cq ∪ leafq 16: Set t = t − 1 17: Set current node = P.extract max 18: end if 19: return Cq

29 Figure 3.3: Query processing using combined approach. Blue circles indicate retrieved leaf nodes. Red rectangles indicate auxiliary information stored on the opposite side of the split point at each internal node, from which candidate nearest neighbors are selected, along query processing path for which only one subtree rooted at that node is explored. This figure should be interpreted as a result of applying ideas from section 3.1 to Figure 3.2 after 3rd iteration.

Note that at each internal node, on each side of the split point, we store m dimen- sional representation of c closest points from the split point upon 1-d projection as auxiliary information. For any query, once at any internal node of an RPT, using m dimensional representation of q, we first compute the closest point (in Rm) among these c points that same lie on the same side of the split point as the query upon 1-d projection and call it dmin . opp same opp In a similar manner we compute dmin. Due to JL lemma, dmin and dmin are good proxy for original minimum distance between and q and those c points in Rd on both sides of the split opp same point. Ideally, if dmin ≤ dmin , the priority score should increase and vice versa because one of the c points on the unexplored side is closer to the query compared to the c points on the same side of the query. To take this into account, we propose a new priority function defined as,

 same  opp same 1 dmin fpr2(U, v, q, dmin, dmin ) = > · opp (3.2) |v − U q| dmin Note that at each internal node, while the first priority function can be computed in O(1) , the second priority function takes time O(mc) plus an additional O(md) time to compute m dimensional representationq ˜. We note that, while a priority function similar to fpr1 has been proposed recently [67], fpr2 is new. In all our experiments, prioritized search based on fpr2 outperforms fpr1. Algorithm for query processing using priority function is presented in Algorithm 9.

3.3 COMBINED APPROACH

Integrating ideas from section 3.1 and 3.2, we present a combined strategy for effective nearest neighbor search using RPT, where data points are retrieved from multiple informative leaf nodes based on priority function (as described in section 3.2) and also from internal nodes along these query processing routes as described in section 3.1. Note that while accessing multiple leaf nodes using priority function, if at any internal node of an RPT both its subtrees are visited then there is no need to use auxiliary information at that node.

30 Algorithm 10 Query processing using combined approach

Input : RP tree constructed using Algorithm 7, m independent random vectors {V1,...,Vm} sampled uniformly from Sd−1, query q, number of iterations t, number of candidate neigh- bors at each node c0. Output : Candidate nearest neighbors

1: Set Cq = ∅, set B to be an empty binary 2: Set P to be an empty priority queue, set count = 0 > > > 3: Setq ˜ = (V1 q, V2 q, . . . , Vm q) 4: Set current node to be root node of the input tree. 5: if t > 0 then 6: while current node 6= leaf node do 7: if U >q < v then 8: A = Rcnn, ai = air 9: current node = current node.left 10: else 11: A = Lcnn, ai = ail 12: current node = current node.right 13: end if 14: Computepriority value, set count = count + 1 15: Sort the rows of A in increasing order according to their distance fromq ˜ and let array a contains these sorted indices. 16: Insert {ai(a(1)),..., ai(a(c0))} into B with value count 17: Set struct = (count, current node) 18: P.insert(struct, priority value). 19: end while 20: Set leafq to be the indices of points in S that lie in leaf node. 21: Set Cq = Cq ∪ leafq, t = t − 1 22: Set current node=(P.extract max).current node 23: Delete from B candidate set with value (P.extract max).count 24: end if 25: return Cq ∪ {all candidate set from B}

This combined approach is shown in Figure 3.3 and algorithm for query processing is presented in Algorithm 10.

3.4 EMPIRICAL EVALUATION

In this section we present empirical evaluations of our proposed method and compare them with baseline methods3. We use six real world datasets of varied size and dimension as shown in Table 3.1. Among these, MNIST, SIFT and SVHN are image datasets, JESTER

3We do not compare with LSH, since earlier study demonstrated that RPT achieves superior nearest neighbor search accuracy compared to LSH by retrieving fewer data points [37].

31 is a recommender systems dataset and 20Newsgroup and SIAM07 are text mining datasets. Both SIAM07 and 20Newsgroup have very large data dimensions (d), while SIFT has very large number of instances (n). For each dataset, we randomly choose instances (as shown in

Table 3.1: Datasets details

Dataset # instances #queries #dimensions MNIST 65000 5000 768 SIFT 400000 10000 128 SVHN 68257 5000 3072 JESTER 68421 5000 101 20Newsgroup 15846 3000 26214 SIAM07 23596 5000 30438 2nd column of Table 3.1) to build appropriate data structure and randomly choose queries (as shown in 3rd column of Table 3.1) to report nearest neighbor search performance of various methods. We design six different experiments. For the first four experiments we present our results for 1-NN search problem, whereas for the last two experiments we present results for 10-NN search problem. We use accuracy to measure the effectiveness of various methods. For 1-NN search, accuracy is simply calculated as the fraction of the query points for which its true nearest neighbor is within the retrieved set of points returned by respective methods. For 10-NN search problem, accuracy (which is essentially, precision) of a query q is defined T R T T R as |Aq (k) ∩ Aq (k)|/|Aq (k)|, where Aq (k) is the set of true k nearest neighbors and Aq (k) be the set of k-nearest neighbors reported by a nearest neighbor search algorithm. We report accuracy averaged over number of queries listed in third column of Table 3.1. For all our experiments, we set n0 to be 100 and use median split while constructing an RPT. Also as explained in section 3.1 we set m = 20 for all our experiments.

3.4.1 EXPERIMENT 1 In this experiment, we study the effect of c, the number of data points stored (in m dimen- sional representation) on each side of the split point at each internal node of an RPT, on nearest neighbor search performance. Note that as we increase c (that is, we use increasing amount of auxiliary information), it is expected that accuracy, space complexity and query time of our proposed method will increases. Consequently, purpose of this experiment is to empirically find a value for c that can be fixed for subsequent experiments. In Table 3.2 we report how accuracy of 1-NN search varies as we increase c. As can be seen from Table 3.2, with increasing c, 1-NN search accuracy increases. The biggest increase in accuracy is when we set c to be 500. For example, for JESTER dataset, the accuracy difference between c = 100 and c = 500 is 17% while this difference is only 5% when we change c from 500 to 1000. An obvious question that immediately arises is that at what cost do we achieve this improvement in nearest neighbor search accuracy? To answer this question, note that total

32 4 number computations Tnc required to answer a query can be represented as :

Tnc = d · log(n/n0) + m · d + (m · c + c · log c) · log(n/n0) 0 +(n0 + c · log(n/n0)) · d 0 = (n0 + log(n/n0)) · d + (m + c · log(n/n0)) · d | {z } | {z } vanilla RP T computation extra computation

+ (m · c + c · log c) · log(n/n0) (3.3) | {z } extra computation

where (on right hand side of the first equality) the first term denotes the number of inner

Table 3.2: Effect of c on 1-NN search accuracy.

c 50 100 500 1000 MNIST 22% 30% 44% 51% SIFT 32% 35% 47% 52% SVHN 19% 24% 39% 48% JESTER 25% 31% 48% 53% 20Newsgroup 17% 21% 33% 38% SIAM07 8% 9% 12% 14%

products computations required at internal nodes along the query routing path to determine which of the two branches (left or right) should be taken, the second term denotes (one time) computation required for m dimensional representation of the query, the third term denotes the amount of computation required to compute distance of m dimensional representation of the query from c points and sorting these distances at internal nodes of the query routing path and the last term indicates computation required for exhaustive distance computation (in original d dimensional space) for all retrieved data points. The second equality in the above equation clearly indicates extra computation required while answering a query due to auxiliary information. Since m = 20 and c0 = 10 are small constants, it is clear that with increasing c, query processing time also increases. To see this effect in Figure 3.4 we plot the ratio of 1-NN accuracy and actual query time against increasing c for three of our datasets. As can been seen from this figure, initially with increasing c, the ratio of 1-NN accuracy and actual query time increases but after around c = 500, either it decreases or does not increase at the same rate. This indicates that for c larger than 500, query time increases at a much faster rate compared to increase in 1-NN accuracy and thus increasing c beyond c = 500 may not worth it. Therefore all our subsequent experiments we set c = 500.

3.4.2 EXPERIMENT 2 In the second experiment, we empirically show how auxiliary information stored at internal nodes of an RPT improves 1-NN search accuracy using defeatist search (for fixed c = 500). We compare our proposed method (which we call RPT1) with vanilla RPT without any

4For brevity we have omitted the Big O notation.

33 (a) JESTER (b) 20NEWSGROUP (c) SIAM07

Figure 3.4: Trade-off between accuracy and running time as we increase c. Accuracy computed based on 1-NN.

stored information (which we call Normal RPT) and spill tree with different percentile of overlap5 α. As can be seen from Table 3.3, RPT1 outperforms all other methods by a significant mar- gin. With increasing α, search accuracy of spill tree increases but so does space complexity. For example, when α = 0.1, space complexity of spill tree is O (dn1.36), which is super-linear in n (see Theorem 12).

Table 3.3: Comparison of 1-NN search accuracy using defeatist search strategy with auxiliary information with baseline methods.

Dataset RPT1 Normal RPT STα=0.025 STα=0.05 STα=0.1 MNIST 44% 12% 12% 16% 21% SIFT 47% 23% 26% 30% 38% SVHN 39% 8% 8% 12% 15% JESTER 48% 13% 13% 19% 23% 20Newsgroup 33% 7% 9% 9% 13% SIAM07 12% 5% 5% 6% 7%

3.4.3 EXPERIMENT 3

In the third experiment we empirically evaluate how two priority functions fpr1 and fpr2 pro- posed in this chapter help in improving guided 1-NN search accuracy by retrieving multiple leaf nodes of a single RPT. A natural competitor of our approach is depth first search (DFS) strategy and virtual spill tree. Note that a virtual spill tree is same as vanilla RPT, except at each internal node, in addition to a random projection direction U and a split point v, 1 two points corresponding to ( 2 − α) percentile point upon projection onto U (call it l) and 1 ( 2 + α) percentile point upon projection onto U (call it r) are stored. While processing a query q at any internal node, if U >q ∈ [l, r] then both left and right child nodes are visited otherwise, similar to defeatist query processing in RPT, a single child node is visited. Em- pirical comparison of these four methods are provided in Table 3.4. In case of fpr1, fpr2 and DFS, iter is used to indicate how many distinct leaf nodes are accessed, while α is used to indicate virtual spill amount in case of a virtual spill tree. As can be seen from Table 3.4,

5Note that, when we split an internal node of a spill tree with overlap α on both sides of the median, left 1 1 and right child nodes represent data points corresponding to 0 to ( 2 + α) ∗ 100 percentile and ( 2 − α) ∗ 100 to 100 percentile upon projection.

34 Table 3.4: Comparisons of 1-NN accuracy of prioritized search with baseline methods.

MNIST (iter, α) (2,0.025) (5,0.05) (10,0.075) (15,0.1) (20,0.15)

fpr2 19% 33% 47% 55% 61% fpr1 19% 32% 44% 51% 56% DFS 15% 22% 28% 30% 34% Virtual 15% 20% 25% 30% 42% SIFT (iter, α) (2,0.025) (5,0.05) (10,0.075) (15,0.1) (20,0.15)

fpr2 34% 47% 57% 61% 65% fpr1 32% 43% 52% 58% 61% DFS 27% 32% 36% 37% 40% Virtual 29% 35% 41% 47% 58% SVHN (iter, α) (2,0.025) (5,0.05) (10,0.075) (15,0.1) (20,0.15)

fpr2 12% 23% 34% 42% 48% fpr1 13% 24% 34% 41% 47% DFS 10% 16% 22% 25% 29% Virtual 10% 14% 18% 21% 34% JESTER (iter, α) (2,0.025) (5,0.05) (10,0.075) (15,0.1) (20,0.15)

fpr2 21% 34% 46% 54% 60% fpr1 20% 32% 43% 50% 55% DFS 16% 22% 28% 30% 35% Virtual 17% 22% 26% 31% 41% 20Newsgroup (iter, α) (2,0.025) (5,0.05) (10,0.075) (15,0.1) (20,0.15)

fpr2 11% 20% 29% 34% 40% fpr1 11% 19% 28% 33% 36% DFS 9% 14% 18% 21% 26% Virtual 9% 11% 13% 15% 22% SIAM07 (iter, α) (2,0.025) (5,0.05) (10,0.075) (15,0.1) (20,0.15)

fpr2 8% 14% 19% 22% 25% fpr1 7% 12% 17% 21% 24% DFS 6% 9% 13% 15% 18% Virtual 6% 7% 9% 11% 14%

1-NN search accuracy of all methods improve with increasing iter and α. Observe that fpr1 outperforms both DFS and Virtual spill tree. Moreover, fpr2 always performs better than fpr1. One observation that we make from Table 3.4 is that, for fpr1 or fpr2, as we increase number of iterations, initially accuracy increases at a much faster rate but with increasing iterations, it slows down. This indicates that later iterations are not as useful as initial iterations.

35 Table 3.5: Comparisons of 1-NN accuracy of combined method with different priority func- tions. Performance of two priority functions in combined approach are very similar except in few cases, for fixed number of iterations, priority function fpr2 performs marginally better compared to priority function fpr1 (shown in bold).

Datasets iter = 2 iter = 5 iter = 10 iter = 15 iter = 20

Priority Function fpr1 fpr2 fpr1 fpr2 fpr1 fpr2 fpr1 fpr2 fpr1 fpr2 MNIST 54% 54% 65% 66% 74% 74% 78% 79% 81% 82% SIFT 49% 49% 61% 61% 69% 69% 73% 73% 76% 76% SVHN 49% 49% 61% 61% 69% 70% 74% 75% 78% 79% Jester 56% 56% 66% 66% 74% 74% 78% 78% 81% 82% 20Newsgroup 39% 39% 48% 48% 55% 55% 59% 60% 63% 63% SIAM07 15% 15% 20% 20% 25% 25% 29% 29% 32% 32%

3.4.4 EXPERIMENT 4 In this experiment we empirically evaluate 1-NN search accuracy of our proposed combined approach, which exploits auxiliary information and priority functions, using a single random- ized partition tree. The empirical results of this combined approach are presented in Table 3.5 where the combined approach is allowed to access up to 20 leaf nodes of a single ran- domized partition tree guided by the priority function. We have empirically demonstrated earlier in section 3.4.2 and section 3.4.3 respectively that auxiliary information and priority functions individually achieves superior nearest neighbor search accuracy compared to its baseline competitors, namely, vanilla RPT, spill tree, virtual spill tree and depth first search strategy. The results presented in Table 3.5 are directly comparable to the results presented in Table 3.3 and Table 3.4. For example, for any of the six dataset, NN search accuracy of the combined approach is better than NN search accuracy using auxiliary function alone or better than NN-search accuracy of priority functions alone (for fixed number of itera- tions). Also note that, NN search accuracy of two priority functions in combined approach are very similar except in few cases where, for fixed number of iterations, priority function fpr2 performs marginally better compared to priority function fpr1 (shown in bold in Table 3.5).

3.4.5 EXPERIMENT 5 We have empirically demonstrated in the previous experiment that our proposed combined approach indeed improves NN search accuracy of a vanilla RPT. However, it is clear from Table 3.5 that as we increase number of iterations (that is, access increasing number of leaf nodes) of a single tree using our combined approach, NN-accuracy increases at a much faster rate during initial increase of number of iterations compared to later increase of number of iterations. In other words, the increase in NN search accuracy using our combined approach and a single tree kind of saturates as we increase number of iterations. One possible hypoth- esis for this phenomenon is that there is lack of randomness in a single tree and possibly using our proposed combined approach on multiple RPTs (thereby increasing randomness) could increase NN search accuracy even further. To test our hypothesis, in this section we design an experiment to empirically find a trade-off between number of trees and number of

36 Table 3.6: 10-NN search accuracy using Multi-Combined method with different # of trees (L) and # of iterations per tree (iter).

(L, iter) (1,20) (2,10) (3,7) (4,5) (5,4) MNIST 72% 84% 89% 91% 93% SIFT 55% 66% 74% 77% 80% SVHN 63% 77% 84% 87% 89% JESTER 72% 82% 87% 89% 90% 20Newsgroup 39% 47% 52% 54% 56% SIAM07 19% 21% 24% 24% 25%

iterations per tree. All results presented in this section are for 10-NN search problem. We use fpr2 as our choice of priority function for the combined approach. We design our experiment to answer 10-NN query under a budget constraint where we are allowed to access only 20 leaf nodes using combined approach. However, these 20 leaf nodes can be accessed varying number of RPTs. This can be done in multiple ways, such as, 20 iterations in a single tree, 10 iterations each in two trees, etc. While using three trees, number of iterations per tree will be roughly 7. We call this search strategy ‘Multi-Combined’, as we are using multiple trees and using combined search strategy in each tree. The results are listed in Table 3.66. As can be seen from Table 3.6, increasing number of trees increases search accuracy (of course, at the cost of additional space complexity). The biggest increasing in accuracy occurs when we use two trees instead of one. This happens not only because we are introducing additional randomness by adding an extra tree, but also because we are reducing later iterations in each tree which we already observed are not very useful in increasing accuracy. We see from Table 3.6 that beyond three trees, accuracy increases very slowly, and due to additional space complexity overhead, adding more than three trees is probably not worth it. We reiterate that each additional normal RPT requires O(nd) space, while additional space requirement for each Multi-combined tree is O˜(n + d).

3.4.6 EXPERIMENT 6 Goal of this final experiment is to empirically evaluate how our proposed combined approach, using a single RPT, performs across different precision range to answer a 10-NN query. We vary precision in steps of 0.1 starting at 0.5 and going all the way up to 1.0. For a query to achieve a certain fixed level of precision, we keep on visiting different leaf nodes of the tree, guided by our choice of priority function in combined approach, until precision reaches the desired level. We average over all queries and report the average number of iterations required to reach a fixed level of precision in Figure 3.5. As can be seen from Figure 3.5, our proposed combined approach requires fewer iterations compared to DFS to reach a certain level of precision. Also, the gap between the number of iterations requires for our proposed combine approach and DFS widens with increasing level of precision. However, note that in the combined approach additional data points from internal nodes (c0 = 10) are also added

6Note that these numbers are slightly different from Table 3.5 since we are considering 10-NN query as opposed to 1-NN query.

37 (a) MNIST (b) SIFT (c) SVHN

(d) JESTER (e) 20NEWS (f) SIAM

Figure 3.5: Number of iterations required to achieve a fixed level of precision.

(a) MNIST (b) SIFT (c) SVHN

(d) JESTER (e) 20NEWS (f) SIAM

Figure 3.6: Speedup obtained for a fixed level of precision. to the set of retrieved points which will be much smaller compared to retrieved data points obtained by visiting additional leaf nodes in case of DFS method. As we increase number of iterations, according to our cost model given in Equation 3.3, query time is dominated by the exhaustive distance computation in original d dimensional N space for all retrieved data points. Therefore, we define speedup to be #RetrievedP oints where N is the number of training samples (dataset size) and plot speedup against precision for our proposed combined approach and DFS method in Figure 3.6. Note that speed ≈ 1 it means that we are retrieving all samples and essentially performing a linear search. As can be seen from Figure 3.6, our proposed method achieves significantly higher (often order of magnitude different) speed up compared to DFS method. This difference is more prominent for lower precision values and decreases as precision approaches 1.

38 3.5 CONCLUSION

In this chapter we presented various strategies to improve nearest neighbor search perfor- mance using a single space partition tree, where basic tree construct was an RPT. Exploiting proprieties of random projection, we demonstrated how to store auxiliary information of ad- ditional space complexity O˜(n + d) at the internal nodes of an RPT that helps to improve nearest search performance using defeatist search and guided prioritized search as well as their combination. Empirical results on six real world demonstrated that our proposed method indeed improve the search accuracy of a single RPT compared to baseline methods. We end this chapter noting that our proposed method can also be used for efficiently solving related search problems that can be reduced to an equivalent nearest neighbor search problem and solved using RPT, for example, maximum inner product search problems [61, 58].

39 Chapter 4

MAXIMUM INNER PRODUCTS PROBLEM

In this chapter we discuss maximum inner product search (MIPS) problem and take advan- tage of proposed methods in previous chapters to solve it. We first define the problem in the next paragraph and then provide a background on existing solutions in section 4.1. Then we show how to use RPT to solve MIPS and prove which of the existing solutions will have the best performance combined with RPT in section.4.2. Finally we back our theoretical results by doing some experiments in section.4.3. The problem of MIPS has received considerable attention in recent years due to its wide usage in problem domains such as matrix factorization based recommender systems [52, 68, 69], multi-class prediction with large number of classes [70, 71], large scale object detection [70, 72] and structural SVMs [73, 74]. The problem of MIPS is as follows: given a set S ⊂ Rd of d-dimensional points and a query point q ∈ Rd, the task is to find a p ∈ S such that, p = arg max q>x. (4.1) x∈S A naive way to solve this problem is to perform a linear search over S, which often becomes impractical as the size of S increases. The goal is to develop sub-linear algorithms to solve MIPS. Towards this end, recent work [49, 53, 51, 54] proposed the solution of MIPS with algorithms for nearest-neighbour search (NNS) by presenting transformations to all points x ∈ S and the query q which reduce the problem of MIPS to a problem of NNS1. The sublinear approximate NNS algorithm used for MIPS is locality sensitive hashing (LSH) [12, 75, 19]. While LSH is widely used for approximate NNS (and now for approximate MIPS), it is known to not provide a good control over the accuracy-efficiency trade-off for approximate NNS or approximate MIPS. Moreover, specific to the problem of MIPS, it is not clear which of the multiple proposed reductions [49, 53, 51, 54] would provide the best accuracy or efficiency (or the best trade-off of the two). We will discuss both these issues in more detail in the following section. These issues raise two questions that we address here:

• Can we develop a MIPS solution which provides fine-grained control over the accuracy- efficiency trade-off?

1More precisely, a true nearest neighbor of transformed q in transformed S is the transformed form of p, where p is a solution of the MIPS problem defined in Equation 4.1.

40 • Can we definitively choose the best MIPS-to-NNS reduction in terms of the accuracy- efficiency trade-off?

To this end, we propose the use of an ensemble of randomized partition trees (RPTs) [11] for MIPS. RPTs have been shown to demonstrate favorable accuracy-efficiency trade-off for NNS relative to LSH while providing fine-grained control on the trade-off spectrum [37]. We demonstrate that this usability of RPTs seamlessly transfers over to the problem of MIPS, thereby addressing the first question. Moreover, RPTs theoretically bound the probability of failing to find the exact nearest-neighbor as opposed to the failure probability guarantees of LSH which only apply to the approximate nearest-neighbor2. We build upon this theoretical property of RPTs to address the second question by providing a theoretical ordering of the existing MIPS-to-NNS reductions with respect to the performance of RPTs. We continue this chapter by discussing the existing solutions to exact and approximate MIPS and present their limitations. Also we motivate the use of RPT for MIPS by demon- strating how RPTs overcome the limitations of existing solutions for MIPS.

4.1 EXISTING SOLUTIONS FOR MIPS

MIPS has received a lot of recent attention. Linear search over the set S scales as O(d|S|) per query, becoming prohibitively high for moderately large sets. There was an attempt to solve exact MIPS using a space-partitioning tree and a branch-and-bound algorithm in the original input space [76, 77, 78]. These branch-and-bound algorithms were shown to have a logarithmic scaling on |S| but the dependence on the dimensionality was exponential, limiting their usability to small or moderate number of dimensions. To improve upon this, the problem of MIPS was approximated, reduced to the problem of NNS and solved using an approximate NNS algorithm. Approximate MIPS. The problem of MIPS is approximated in the following way: given the set S ⊂ Rd, a query q ∈ Rd and an approximation parameter , the task is to find any p0 ∈ S such that

q>p0 ≥ (1 − ) max q>x. (4.2) x∈S Connecting MIPS to NNS. Recent work [49, 53, 51, 54] pointed out that MIPS and NNS are closely related and a MIPS problem can be reduced to an NNS problem by applying appropriate transformation to the points in the set S and the query q. We list four such transformations below. Typically these transformations add extra dimensions to the points in S as well to query q so that solving MIPS in original space is equivalent to solving th NNS in the transformed higher dimensional space. We use Pi(·) and Qi(·) to denote the i transformations applied to data points and query point respectively.

d d+1 d d+1 • Transformation 1 (T1): P1 : R → R and Q1 : R → R are defined as follows: 2We will further discuss these points in the next section.

41  s  2   x kxk2 x P1(x) =  , 1 − 2  ,Q1(x) = , 0 β β kxk2

with β = maxx∈S kxk2 is the maximum norm among all data points in S [54].

d d+2 d d+2 • Transformation 2 (T2): P2 : R → R and Q2 : R → R are defined as follows:

s 2 ! x kxk2 P2(x) = , 1 − 2 , 0 β1 β1

s 2 ! x kxk2 Q2(x) = , 0, 1 − 2 β1 β1

Here β1 = max {maxx∈S kxk2, maxq kqk2} is maximum norm among all data points in S as well as all possible query points3 [54].

d d+1 d d+1 • Transformation 3 (T3): P3 : R → R and Q3 : R → R are defined as follows:

 q  2 2 P3(x) = x, β − kxk2 ,Q3(x) = (x, 0) .

where β = maxx∈S kxk2 is maximum norm among all data points in S [51].

d d+m d d+m • Transformation 4 (T4): P4 : R → R and Q4 : R → R are defined as follows:

 x kxk2 kxk2m  P (x) = , 2 , ··· , 2 4 α α2 α2m  x 1 1 Q4(x) = , , ··· , kxk2 2 2

Here β = maxx∈S kxk2 is maximum norm among all data points in S and α = cβ for some c > 1 [53]. Each of the above transformations satisfy the following: Theorem 13. Suppose S ⊂ Rd is a set of data points and q ∈ Rd is a query point. Then the following holds:

> arg max q x = arg min kP1(x) − Q1(q)k2 x∈S x∈S

= arg min kP2(x) − Q2(q)k2 x∈S

= arg min kP3(x) − Q3(q)k2 x∈S   = arg min lim kP4(x) − Q4(q)k2 . x∈S m→∞ 3Note that for this transformation, the maximum norm over all possible queries is needed in advance.

42 The above theorem simply says that each of the four transformations reduces a MIPS problem to an NNS problem and the solution to the exact NNS problem in the transformed space is the solution to the exact MIPS problem. Approximating MIPS via LSH. LSH approximates the NNS problem in the following manner: exact NNS to find the p ∈ S such that p = arg minx∈S kq −xk2; LSH solves the approximate version of the problem with an approximation parameter  to find any p0 ∈ S 0 such that kq − p k2 ≤ (1 + ) minx∈S kq − xk2. LSH is known to scale sub-linearly to |S| and polynomially to the dimensionality d under favorable conditions. However, LSH comes with some known shortcomings for NNS. LSH parameters (hash code length and number of hash tables) do not allow the user to have fine-grained control over the accuracy-efficiency trade-off. For example, specifying a particular hash code length and number of hash tables does not provide any information on the maximum number of points that might be retrieved. This is due to the fact that LSH builds hash tables on a grid irrespective of the data density, and hence can have buckets with no points or with large number of points for the same data set. These issues directly transfer over to the problem of approximate MIPS. However, a lot of research has been done to improve the performance and the accuracy-efficiency trade-off of LSH. But many of these improvements either require a deep understanding of LSH which is limited to LSH experts or are based on computationally intensive data-dependent indexing schemes without any theoretical guarantees. In addition to this, there are some issues with LSH which apply only to the MIPS problem. Firstly, with multiple ways of reducing MIPS to NNS, it is not (theoretically) clear which transformation is best for solving approximate MIPS via LSH4. Secondly, after transforming the data with functions P and Q, if LSH is solving the (1 + )-approximate NNS in the transformed space, it translates to a (1 − f(, P, Q, β))-approximate MIPS solution where f is some function. For an user to control the approximation in the MIPS problem, they have to carefully translate the approximation desired with LSH, adversely affecting the usability of LSH for approximate MIPS. We propose the use of randomized partition trees (RPTs) [11] for MIPS. We will provide details regarding RPTs in the following section. We believe that RPTs avoid the aforemen- tioned shortcomings of LSH for MIPS as follows:

• NNS with RPTs allows the user to have fine-grained control over the accuracy-efficiency tradeoff – the user just needs to set two parameters, the maximum leaf size n0 (for each tree) and the number of trees L, and the maximum number of retrieved points is upper bounded by L · n0. We will show how this property will seamlessly transfer to the task of MIPS.

• RPTs provide guarantees of the following form for NNS with a given set S and a query q – the probability of finding the exact nearest neighbors of q with each RPT is at least some ρ ∈ (0, 1) where ρ depends on q and S. Using L trees boosts this probability of finding the exact neighbors to at least 1 − (1 − ρ)L.

4Recent work has identified transformations which preserve locality sensitive property better [53, 54]. But there is no clear understanding (to the best of our knowledge) of how that translates to the final MIPS performance.

43 With Theorem 13, for a set S, a query q and transformations (P,Q), we know that the exact NNS solution in the transformed space is the exact MIPS solution. The only change for MIPS is that ρ now depends on q, S and (P,Q). Unlike LSH, we no longer need to translate the approximation  in the NNS solution to a different approximation f(, P, Q, β) for the MIPS solution.

• The quantity ρ for RPTs is (theoretically) controlled by an intuitive potential function [11]. Since, for MIPS, ρ depends on q, S and (P,Q), we will be able to present a theo- retical ordering of the values of ρ for the aforementioned transformations (Pi,Qi) ∀i = 1,..., 4. This allows one to definitively answer the question which is the best transfor- mation for solving MIPS via RPTs and relinquishing the user from having to choose a transformation (unlike in MIPS with LSH where the user needs to make such a choice and then translate the approximation guarantee).

4.2 MAXIMUM INNER PRODUCT SEARCH WITH RPT

Given transformations (P,Q), we can solve MIPS with RPTs by first preprocessing S as follows:

• Choose RPT parameters n0 and L, • Generate set P (S) = {P (x)∀x ∈ S},

• Build L RPTs τ1, . . . , τL on P (S) with leaf size n0.

l For a query q, let S (q) ⊂ S be the points in the leaf of τl containing Q(q). The MIPS solution for q is obtained by:

• Generate Q(q) and initialize candidate set R = ∅,

• For l = 1,...,L, R = R ∪ Sl(q),

> • return arg maxx∈R q x

By construction, RPTs are balanced and require O (n0 + log(n/n0)) time per tree which 5 is O(L log n) for a small constant n0  n and L trees . The probability of success depends on the value of the potential function (Equation 1.3). The following corollary of Theorem 13 defines this potential function for the MIPS problem:

d d Corollary 14. Given a set S ⊂ R of n data points and a query q ∈ R , let (x(1), x(2), . . . , x(n)) > > be an ordering, satisfying q x(i) ≥ q x(i+1) for i = 1, . . . , n−1. For any j = 1,..., 4, suppose transformation (Pj,Qj) is applied to (S, q). Then the following hold. (i) (Pj(x(1)),Pj(x(2)),...,Pj(x(n))) is an ordering satisfying, kPj(x(i))−Qj(q)k2 ≤ kPj(x(i+1))− Qj(q)k2, for i = 1, . . . , n − 1.

5 O(log(n/n0)) time is required to route a query to its corresponding leaf. Then, at most n0 points are processed at this leaf.

44 0 (ii) For any subset S of S = {P (x ),...,P (x )}, the potential function Φ 0 (Q (q),S ) j j j (1) j (n) |Sj | j j is defined as:

|S0 | j   1 X kPj(x(1)) − Qj(q)k2 Φ|S0 |(Qj(q),Sj) = (4.3) j |S0 | kP (x ) − Q (q)k j i=2 j (i) j 2 Lower values of this potential function imply higher probabilities of success in finding the exact MIPS solution. This puts us in a unique position – finding the transformation T∗ = (P ∗,Q∗) which achieves the lowest potential value among all transformations T1 - T4 . To this end, we consider each term in the right hand side of Equation 4.3, which corresponds to relative placement of any two data points x, y ∈ Rd and query point q ∈ Rd > > such that q y ≥ q x. Applying any of the four transformations (Pj,Qj)(j = 1,..., 4) kPj (y)−Qj (q)k ensures that the ratio ≤ 1. The following theorem shows that T1 = (P1,Q1) kPj (x)−Qj (q)k achieves the smallest (among the four considered) ratio for any q, x and y:

Theorem 15. Let x, y ∈ Rd be any two data points and let q ∈ Rd be a query point such > > 2  that q y ≥ q x. Suppose transformation T4 uses c > 1 and m satisfying m ≥ 4 c − 1 . kQ1(q)−P1(y)k Then, kQ (q)−P (x)k ≤ γ, where, n1 1 o γ = min kQ2(q)−P2(y)k , kQ3(q)−P3(y)k , kQ4(q)−P4(y)k . kQ2(q)−P2(x)k kQ3(q)−P3(x)k kQ4(q)−P4(x)k The above theorem implies that T1 will achieve the lowest potential value (as defined in Equation 4.3), and will have the highest success probability in finding the exact MIPS solution. Under mild conditions, we provide a total ordering of all the transformations with respect to the ratio kPj (y)−Qj (q)k in the following theorem: kPj (x)−Qj (q)k

Theorem 16. Let x, y ∈ Rd be any two data points and let q ∈ Rd be a query point such > > 8 n kqk β o that q y ≥ q x. Suppose T4 uses c > 1 and m satisfying m ≥ c max β , kqk . Then the following holds: kQ1(q)−P1(y)k ≤ kQ3(q)−P3(y)k ≤ kQ2(q)−P2(y)k ≤ kQ4(q)−P4(y)k . kQ1(q)−P1(x)k kQ3(q)−P3(x)k kQ2(q)−P2(x)k kQ4(q)−P4(x)k This implies a total ordering among the four transformations in increasing order of the resulting potential values:

Corollary 17. Given a set S ⊂ Rd of data points and a query q ∈ Rd, let Ti  Tj indicate that applying Ti = (Pi,Qi) to (S, q) yields a lower potential value as compared to applying 8 n kqk β o Tj to the same. Suppose T4 uses c and m satisfying m ≥ c max β , kqk . Then T1  T3  T2  T4 holds.

This result suggests that, for RPTs, we can expect the MIPS accuracy of each of the four transformations to follow the same ordering. Note that, T1  T3  T2 always holds. If the query norm is either too large or too small relative to the maximum point norm in S, m needs to be large enough to maintain relative position of T4 in the above ordering. This result theoretically suggests that T1 is the best (among the existing) transformation for solving MIPS with RPTs.

45 4.3 EMPIRICAL EVALUATIONS

In this section, we present empirical results in the form of three experiments. The first experiment validates the ordering of the transformations T1 - T4 presented in Corollary 17. The second experiment demonstrates that the ordering of the potential function values agrees with the actual performance of RPTs for MIPS – that is, the transformation with the lowest potential function value provides the best accuracy-efficiency trade-off for MIPS. The final experiment compares our proposed MIPS solution with the state-of-the-art approximate MIPS solution with LSH. For all our empirical evaluations we consider six real world datasets. The sizes of these sets are shown in Table 4.1.

Table 4.1: Dataset description Dataset # points in S # queries # dimensions d

Aerial 50000 10000 60 Corel 50000 10000 89 MNIST 50000 10000 784 Reuters 6000 2293 18933 Netflix 17770 10000 300 Movielens 10677 10000 150

The AERIAL dataset contains texture information of large areal photographs [63]. The COREL dataset is available at the UCI repository [64]. MNIST is a dataset of handwritten digits. REUTERS is a common text dataset used in machine learning and is available in Matlab format in [79]. After removing the missing data, the dataset contained 8293 documents. Netflix and Movielens are datasets usually used in recommender systems. We used exactly the same pre-processing step described in [54] for these two datasets. For Aerial, Corel and MNIST dataset we randomly chose 50000 points as database points that were used to construct appropriate data structure (RPT or hash tables) and 10000 points as queries. For Movielense and Netflix dataset [54], number of database points were fixed to 10677 and 17770 respectively. We chose the first 10000 points as query points in each case. For Reuters dataset, we randomly chose 6000 data points as database points and the remaining 2293 points as query points.

4.3.1 EXPERIMENT I : POTENTIAL FUNCTION EVALUATION In this experiment, we compute the potential function values for each query after applying each of the four transformations. For any dataset, we use PFi, i = 1, 2 & 3 to denote the vector of potential function values for all queries upon applying transformation Ti. Since transformation T4 depends on the choice of m, we choose two values – at one extreme, we choose a small m = 3 (as suggested in [53]), and at the other extreme, we choose a large m = 100. The corresponding vectors of potential function values are denoted by PF4 and

46 Figure 4.1: Potential function differences (y-axis) vs query index (x-axis) plot (please view in colour): The green line indicates the sorted differences PF3 − PF1, while the red dots indicate the differences PF2 − PF1 against the sorted index of PF3 − PF1. The blue line indicates the sorted differences PF4 − PF1, while the purple dots indicate the differences PF5 − PF1 against the sorted index of PF4 − PF1.

PF5 respectively. We visualize the relative ordering of the transformations for each dataset in Figure 4.1. The left panel for each dataset compares T1 to T2 and T3 , while the right panel compares T1 to T4 with two different values of m. To create the visualization in the left panel, we first compute the differences PF3 − PF1, sort it in the increasing order and generate the green line. We generate the red dots by plotting the differences PF2 − PF1 against the sorted index of PF3 − PF1. The green line is always positive for all datasets, indicating that T1 produces lower potential function values than T3 . The red dots values are always on or above the green line, indicating that T3 produces lower potential function values than T2 (and T1 produces lower values that both). This demonstrates that T1  T3  T2 holds in practice. The visualization in the right panel of each sub-figure in Figure 4.1 is generated in a similar manner. We use the sorted differences PF4 − PF1 to generate the blue line. We generate the purple dots by plotting the differences PF5 − PF1 against the sorted index of PF4 − PF1. The results indicate that T1  T4 for different values of m. We do not present a direct comparison of T2 and T3 with T4 because their relative ordering depends on the parameters chosen for T4 and the dataset characteristics (Corollary 17). However, we would like to note that T4 ensures that the exact NNS solution in the transformed space is the exact MIPS solution in the original space only as m → ∞ (Theorem 13). Since RPTs provide guarantees on the exact NNS solution (and hence, the exact MIPS solution), T4 only makes sense for RPTs with large m. However, our empirical results indicate that, as m grows, the potential function value grows, making NNS (and subsequently MIPS) harder, and T4 undesirable for MIPS with RPTs. Hence, we will not consider T4 any further in our evaluations. Note that, T4 for small m will direct RPTs to find the exact NN in the transformed space which could be significantly different from the exact MIPS

47 solution in the original space. Since the rest of the transformations will direct RPTs to find the exact MIPS solution, the relative comparison for MIPS with T4 can be unstable and unintuitive. T4 with large m will direct RPTs to find the exact MIPS solution and produce more intuitive results. However, T4 with large m is undesirable because the transformed space dimensionality as well as the potential function value will have increased significantly (as observed in the right panels of Figure 4.1). Note that, Figure 4.1 provides only a qualitative view of the relative ordering of various transformation by plotting the potential function values. To present a quantitative view, we first plot histograms (we used 50 bins to plot these histograms) of PF1, PF2 and PF3 for all eight datasets in Figure 5.4. To draw these histograms, first the range of potential function, which is [0, 1], is discretized in to 50 disjoint bins, where each bin corresponds to a range of potential function values and then number of queries whose potential function values lie in that range are plotted. Shape of these histograms agree with the ordering of the transformations T1 , T2 and T3 in the sense that histograms of PF1 (that corresponds to T1 ) is concentrated more towards the left as compared to the histograms of PF2 and PF3 for almost all datasets. This indicates that transformation T1 results in more queries to have lower potential function values as compared to transformation T2 and T3 . To quantify this, we convert the histograms into a discrete probability distribution in a straight forward manner by dividing number of query points in each bin by the total number of query points. Let us call these probability distributions PD1, PD2 and PD3, each of which is a vector of size 50. To quantify the different between these distributions, we compute the well known Hellinger distance between these probability distributions using equation 4.4 for each dataset and present in Table 4.2. v u K √ 1 uX p 2 H(PDi, PDj) = √ t ( PDik − PDjk) (4.4) 2 k=1

Table 4.2: Hellinger distance between PD1, PD2 and PD3.

Dataset H(PD1, PD2) H(PD1, PD3) H(PD2, PD3) Aerial 0.96 0.84 0.31 Corel 0.81 0.66 0.22 MNIST 0.74 0.34 0.45 Movielens 0.98 0.89 0.23 Netflix 0.95 0.75 0.28 Reuters 0.35 0.27 0.11 SIFT 0.08 0.003 0.08 USPS 0.15 0.01 0.15

Hellinger distance H(PDi, PDj) between any two discrete probability distribution PDi and PDj, as given in equation 4.4, is symmetric and is always bounded between 0 and 1. Hellinger distribution 0 implies that two probability distributions are exactly same while Hellinger distance 1 implies they are very very different. In general, higher that Hellinger distance

48 between two probability distributions, more different the two distributions are. Note that a popular measure to compare two probability distribution is Kullback-Leibler divergence or KL-divergence for short. Unlike Hellinger distance, KL-divergence is not symmetric (thus not a distance metric) and is unbounded and therefore we choose Hellinger distance to represent the difference between these probability distributions. Please note that, Helinger distance and KL divergence are related as follows: for any two distributions P and Q, 1 1  4 H(P,Q) ≤ 2 DKL(P ||Q) . This follows from the fact that Hellinger distance H(P,Q) and 2 √total variation distance δ(P,Q) are related by the following inequality H (P,Q) ≤ δ(P,Q) ≤ 2H(P,Q) and Pinsker’s inequality relates total variation distance and KL divergence via q 1 δ(P,Q) ≤ 2 DKL(P ||Q).

Figure 4.2: Histgoram for the potential function with different transformations on eight different datasets. First, second and third columns in the above figure represents histogram of potential function values obtained by applying transformation T1 , T2 and T3 respectively.

As can be seen from Table 4.2, H(PD1, PD2) ≥ H(PD1, PD3) for all eight datasets, indi- cating that PD1 is more similar to PD3 than to PD2. This agrees with visual inspection of histogram plots in Figure 4.2. In particular, the value of H(PD1, PD3) sheds some light on the relative position of the green line (PF3−PF1) in Figure 4.1. For example, in case of SIFT and and USPS dataset H(PD1, PD3) ≈ 0, and this explains why the green line representing (PF3 − PF1) lies very close to zero in Figure 4.1, whereas in case of, say Aerial dataset value of H(PD1, PD3) is very high and explains why the green line representing (PF3 − PF1) lies far away from the x-axis in Figure 4.1.

4.3.2 EXPERIMENT II: PRECISION-RECALL CURVE In this experiment, we generate precision recall (p-r) curves for finding the 20 highest inner products for each query using RPTs and present the relative performance when using one of T1 - T3 as the MIPS-to-NNS reduction. We use TREC interpolation rule [80] to produce precision recall (p-r) curves for all our datasets. TREC interpolation rule requires the precision and recall values be first computed based on a ranked list, and then interpolated to produce the final p-r curve. To produce such

49 Figure 4.3: Precision recall curves for MIPS with RPTs (please view in colour): The first, second, third and fourth row correspond to n0 = 10, 20, 40 & 50 respectively.

a ranked list, we use the experimental procedure used in [53]. We consider four different values of n0 and present the results in Figure 4.3. The results indicate that the p-r curves for T1 usually dominate the other p-r curves by a significant margin for all datasets and values of n0. Moreover, the T3 performance domi- nates the T2 performance. This demonstrates that the MIPS performance of RPTs for the different transformations agrees with the ordering of the potential function values presented in Corollary 17. We would like to point out that the RPT guarantees are probabilistic and there are times when the ordering is violated. For example, T3 achieves higher precision than T1 at low recall values for the Reuters set with n0 = 20 and for the Corel set with n0 = 50.

4.3.3 EXPERIMENT III: ACCURACY VS INVERSE SPEED-UP In this experiment, we compare the accuracy-efficiency trade-off of RPTs with T1 for MIPS to an existing baseline. We choose Simple-LSH as a representative of LSH methods since it has reportedly produced the best performance among other LSH based MIPS solutions [54]. We consider the task of finding the 10 highest inner products for each query. We choose the hash code lengths of 4, 8, 16 for Simple-LSH as they reportedly produce the best perfor- mance. We choose leaf sizes n0 of 50, 150 for RPTs. The accuracy of a method is defined as the recall of the 10 highest inner products averaged over all the queries. The efficiency of a method is defined by the inverse speed up over linear search and computed as

Inverse speed up = (T otal # inner products)/|S| (4.5)

with any set S for a MIPS query6.

6Precision-recall curves make sense when the underlying method/parameters are the same, and imply the

50 Figure 4.4: Accuracy vs inverse speed up plots for six real world datasets with RPTs and Simple-LSH (please view in colour).

The number of RPTs and Simple-LSH hash tables are chosen from the set {4, 8, 16, 32, 64, 128, 256} to obtain 7 (accuracy, inverse speed up) pairs for each method-parameter combination. These are used to generate the accuracy-efficiency trade-off curves in Figure 4.4. Moving from left to right on each curve implies increased computation (hence lower efficiency). The results indicate that RPTs with T1 achieve a particular level of accuracy much more efficiently (at smaller values of inverse speed up) than Simple-LSH. Moreover, the performance of RPTs does not appear to be significantly affected by the choice of n0 (as long as n0  |S|) and RPTs provide easy access to the full accuracy-efficiency trade-off spectrum. On the other hand, performance of Simple-LSH is quite sensitive to its parameters. For large hash code lengths (16 or bigger), the probability of generating hash buckets with very low density increases significantly and small (or even empty) candidate sets are generated, leading to low accuracy (recall). For small hash code lengths (say, 4), most hash buckets become very dense, leading to large candidate sets, which produce high accuracy but also high values of inverse speed up (low efficiency). For almost all datasets in Figure 4.4, a hash code length of 4 requires an inverse speed up close to 1 to achieve 90% accuracy. In our experiments, only the hash code length of 8 in Simple LSH allows the access to the full accuracy-efficiency trade-off spectrum for two of the datasets (Aerial and MNIST). In contrast, RPTs provide full access to the trade-off spectrum for all datasets and parameters. Furthermore, RPTs same amount of computation to retrieve the candidate set. We consider the total number of inner products µ required for a MIPS query to provide a fair comparison between different kinds of methods/parameters. For each RPT, this corresponds to the number of inner products needed to route a query to a leaf and process the points in that leaf. For each LSH hash table, this corresponds to the number of inner products needed to generate a hash code of a length k and process the points in that hash bucket. Let R be the size of the candidate set retrieved by each method. For RPTs with L trees, µ = R + L log |S|. For LSH with H hash tables, µ = R + Hk.

51 achieve almost 100% accuracy with an inverse speed up between 0.3-0.5 for three of the datasets (Aerial, Corel and MNIST). To summarize this chapter, we proposed the use of RPTs for the solution of MIPS for two main reasons – (i) to obtain a MIPS solution that allows simple but fine-grained control over the accuracy-efficiency trade-off, and (ii) to theoretically determine the best (among the set of existing) MIPS-to-NNS reduction in terms of the accuracy (for fixed efficiency). Our empirical results validate our theoretical claims and also demonstrate our superiority to the current state-of-the-art. For example, at 80% accuracy, our proposed MIPS solution produced results 2-5 times more efficiently than the state-of-the-art. However, there are a couple of limitations to our proposed solution. Firstly, a single RPT has a memory requirement of O(dn/n0). Hence, the complete ensemble of L RPTs would require a memory overhead of O(Lnd/n0) which can be severely limiting. Secondly, while we are able to get state-of-the-art performance for MIPS, we are not taking advantage of the fact that, after using T1 to reduce MIPS to NNS, the norms P1(x) = Q1(q) = constant ∀x ∈ S & ∀q. Finally, there is also the unanswered question of whether (and how) we can develop a better, or even optimal, MIPS-to-NNS reduction in terms of the potential function. While we did not address this question here, we presented a precise notion (the potential function in the transformed space) which can be used to answer questions such as “how is a MIPS-to-NNS reduction better?” or “how is a MIPS-to-NNS reduction optimal?”.

52 Chapter 5

ACTIVE LEARNING

In this chapter we focus on a well-known topic in machine learning called active learning. We use our proposed methods from chaper.2 and chapter.3 to solve this problem efficiently.

5.1 WHAT IS ACTIVE LEARNING?

n In traditional supervised learning setting, given a labeled training set of size n, {xi, yi}i=1 ⊂ d X × Y , where xi ∈ X is description of any object in R and yi ∈ Y is its label, one aims to learn function f : X → Y (also called classifier or predictor) to predict the label of future unseen examples. Choice of function f is typically by choosing appropriate model class such as linear vs non-linear or parametric vs non-parametric etc. Once an appropriate model class is decided classification accuracy typically improves with labeled training set size. However, for many real world application fields obtaining a large labeled training set is often not possible as it requires domain expertise, time and money. The key idea behind active learning is that a classifier can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. An active learner may introduce queries to be labeled by an oracle. Active learning is well-motivated in many modern machine learning problems, where unlabeled data is available at almost no cost, but labels are time-consuming, or expensive to obtain (pool based active learning). This is a common issue in health related data where knowing if a patient is ill or not could require doing multiple tests which could be expensive and time consuming (e.g. MRI). Another example would be in text classification, imagine how long it would take for a human to label each webpage or sentence (e.g. sports, politics and etc.). Not only the process would be tedious, but it is error prone as well. Hence, the goal is to be able to provide an algorithm which can learn from a small set of labeled samples. Consider the medical data example and assume we have a fixed budget to run some tests on 100 patients, it is obvious that handpicking the patients would be beneficial as oppose to choosing them randomly. In other words, running tests on patients which are more informative would benefit us more.

53 5.2 POOL BASED ACTIVE LEARNING APPROACHES

As mentioned earlier active learning focus on identifying informative samples and ask for their labels. There exist many methods to define informativeness of a sample. For details of existing active learning methods please see [81, 82, 83]. We discuss three common approaches next.

5.2.1 UNCERTAINTY SAMPLING Uncertainty sampling is based on information theory measures. These methods try to select samples which the model is uncertain about. In other words, getting the true label of these samples will benefit the model most. One way to identify such samples is through entropy. Imagine a binary Naive Bayes classifier, the model has the most Uncertainty about a sample when probability of the sample (xi) belonging to each of the classes (c) is 0.5 i.e. P (c = 1|xi) = P (c = 0|xi) = 0.5. This happens when the entropy of the model prediction th for the i sample (Ent(pi)) is maximized. Another well-known information theory measure which can be used is the gini index. For more information and comparison of the methods we refer interested readers to [84, 85, 86, 87, 88]. Like every approach Uncertainty sampling has its shortcomings. First of all in order to use Uncertainty sampling one is only limited to use probability based classifiers e.g. bayes classifiers, decision trees and etc.. Second drawback of Uncertainty sampling is when we face imbalanced data, where the classifier usually predict the dominant class with a high probability (90% or more). Fortunately, this is easy to solve in most cases. We just need to assign prior knowledge to each class to resolve this issue.

5.2.2 QUERY BY COMMITTEE This approach [89] uses a committee of different classifiers, which are trained on the current set of labeled instances. The instance for which the classifiers disagree the most is selected as the most informative sample. It is easy to see query-by-committee method achieves similar goal as the uncertainty sampling, except that it does so by measuring the differences in the different classifiers, rather than the uncertainty of labeling a particular instance. Interestingly, the method for measuring the disagreement is also quite similar between the two methods. The most common technique is called vote entropy [90].

X V (yi) V (yi) log (5.1) M M i

th Where yi is the label of the i classifier, V (yi) is the number of votes that a label receives among the committee members and M is the committee size. We can see this equation corresponds to entropy (same as uncertainty sampling). In addition, other probabilistic measures such as the KL-divergence have been proposed in [91] for this purpose. The construction of the committee can be achieved by varying the model parameters of a particular classifier, using bagging/boosting methods, using a subset of features to build multiple classifiers or by using totally different classifiers. The crux of

54 query by committee methods is to be able to achieve diverse classifiers. This is done by ensemble methods [92, 93].

5.2.3 NEAREST NEIGHBOR TO QUERY HYPERPLANE (NNQH) In this section we focus on linear (hyperplane based) classifier under the setting when the size of the training set is very small but we have a large pool of unlabeled examples. For any such linear classifier such as support vector machine (SVM), the classification function > can be written as f(x) = sign(w x + b), where w is the normal vector to a hyperplane hw and b is offset from the origin. The hyperplane separates examples from two classes from one another in case of a binary classification. Given a labeled training set SVM aims to learn the ideal hyperplane hw ( specified by normal vector w) that separates examples from two classes. However, with small training set the learned hyperplane is far from ideal. We propose the active learning setting where, given this far from ideal hyperplane, we aim to identify most informative (for finding a better hyperplane) examples from the unlabeled set, ask for their labels and add them to the training set and re-train SVM. For such classifiers (e.g. Perceptron and SVM) we measure the informativeness of a sample based on its distance to the hyperplane. The closer the point to the hyperplane the more uncertainty it has i.e. if a point is far away from the hyperplane we are confident about its label. The Euclidean distance of a point x to a given hyperplane hw parameterized by normal w is: > > d(hw, x) = ||(x w)w|| = |x w| (5.2) Hence, the task of NNQH is to find x∗ such that:

x∗ = arg min |x>w| (5.3) x∈S As always doing an exhaustive search will give us the exact answer with a linear time com- plexity which is not desirable in many cases. Two different approaches has been suggested to solve NNQH problem in a sublinear time [50, 94] such that LSH would be applicable. These two approaches are, 1) Using the angle distance and the fact that close points are almost perpendicular to the w and defining a new hyperplane hash function which satisfies the local sensitivity condition (H-Hash) [50] 2) Using a transformation to reduce the NNQH problem to a NNS problem and then solve it using existing LSH methods (EH-Hash). While the former is less expensive it is not as accurate as the second approach. To the best of our knowledge, there are no existing algorithms to make use of RPT to solve NNQH problem. This chapter uses the latter approach and RPT to solve NNQH problem.

5.2.3.1 Reducing NNQH to NNS As mentioned before, [50] used a tranformation to reduce NNQH to NNS. This transforma- tion relies on a euclidean embedding for the hyperplane and points. Given a d-dimensional

55 vector x, we compute an embedding [94] that yields a d2-dimensional vector by vectorizing the corresponding rank-1 matrix xx>: 2 2 2 > x1 x2 xd V (x) = vec(xx ) = [√ , x1x2, ..., x1xd, √ , x2x3..., √ ] (5.4) 2 2 2 th where xi denotes the i element of x. Assuming x and y to be unit vectors, the euclidean dis- 2 > 2 tance between the embeddings V (x) and −V (y) is given by ||V (x)−(−V (y))||2 = 2+2(x y) . Hence, minimizing the distance between the two embeddings is equivalent to minimizing |x>y|, our intended function. 2 However, this embedding increases the dimension quadratically Rd → Rd . Recall that RPT and LSH space complexities are O(Lnd) and O(Lkd) respectively, hence making the proposed transformation inefficient. However, based on the proposed techniques in chap- ter.2 and chapter.3 it is possible to reduce RPT’s space complexity and compensate for the quadratic dimension resulting from the transformation. Next section discuss our approach to solve NNQH problem using RPT.

5.3 PROPOSED METHOD

As mentioned in the previous section EH-Hash transformation increases the dimension quadratically which is problematic for vanilla RPT due to its high space complexity. We combine all our best performing approaches discussed in chapter 2 and chapter 3 to make the most space efficient version of RPT and compensate for the quadratic dimension. One of the most effective approaches to reduce the space complexity of RPT without hurting the accuracy was RPTB and Sparse RPT which were introduced in chapter. 2. The space complexity of these two approaches for L trees are O(d log n) and O(Lndρ). By combining these two approaches the space complexity of L RPT would reduce to O(dρ log n). Also, chapter.3 introduced the combined approach which reduced the number of required trees to achieve an acceptable accuracy significantly using auxiliary information and priority functions. Combining all these methods helps us to have a sparse version of RPT which only requires a few number of trees. We call this approach combined sparse-RPTB. Having all these methods in our arsenal we can use the aforementioned transformation to reduce the NNQH to NNS and solve the active learning problem via RPT. Note that in chapter. 2 in order to have a sparse projection direction we sampled uni- formly at random. However we can slightly improve the performance of RPT by sampling based on the weight of each element. This approach is based on a lemma which states that sampling a vector v according to the weights of each element leads to a good approximation to vT y for any vector y (with constant probability). Similar sampling schemes have been used for a variety of matrix approximation problems [95]. 2 d vi d th Lemma 18. Let v ∈ R and define pi = ||v||2 . Construct vˆ ∈ R such that the i element is vi with probability pi and is 0 otherwise. Select t such elements using sampling with d c replacement. Then, for any y ∈ R ,  > 0, c ≥ 1, t > 2 1 P r[|vˆT y − vT y| ≤ ||v||2||y||2] > 1 − (5.5) c

56 Interested readers can see the proof at [96]. This will give us an upper bound on the error when we make a random direction sparse and by keeping the elements with high weight we guarantee to keep as much information as possible. We call this version of RPT as combined sparse-RPTBsamp. In our experiments we show that sparsifying a vector this way will indeed improve RPT’s performance slightly. Note that by doing the sampling this way, we are not adding any space complexity. Next section applies the combined sparse-RPTB to active learning problems.

5.4 EXPERIMENTAL RESULTS

In this section we compare combined sparse-RPTB mehod to EH-Hash and exhaustive search. For this purpose we are going to use 2 datasets. 1) SIFT with 150000 samples and 128 dimensions. This is a very well known data set for finding nearest neighbors which is widely used [58, 60, 61]. 2) CIFAR-10 which is an image processing dataset with 128 dimensions (we used gist descriptor), 10 classes and 60000 total samples. In all our experiments we fix the hyperparameters of combined sparse-RPTB to the following values C0 = 10,C =

1000, BucketSize = 3 ∗ log2 N, m = 100, n0 = 250 and sparsity = 1%. In section 5.4.1, we use a toy example to show exactly what the method is doing. Section 5.4.2 compares combined sparse-RPTB to EH-Hash and exhaustive approach using the svm setting and finally section 5.4.3 does the comparison to show how accurate each method is and at what cost we gain this accuracy.

5.4.1 TOY EXAMPLE In all our experiments we are using linear svm with one vs all setting. This means that if the data has t classes, svm will result in t hyperplanes and we are interested in finding the closest points to these hyperplanes (NNQH problem). Figure.5.1 is showing an example with different t values and show us the result of exhaus- tive search and combined sparse-RPTB. Note that for multi-class classification it is possible that none of the points in one class are the answer to NNQH (e.g. 3 classes and 4 classes in figure.5.1). We can see that in this toy example combined sparse-RPTB is able to find the exact answer, however, in bigger datasets with nonlinearity, this won’t be the case. The next two sections will deal with large datasets.

5.4.2 SVM SETTING

In this section we evaluate combined sparse-RPTBsamp in regards to svm performance. Figure.5.2 shows the general schema of this experiment. Assume the original dimension of the dataset is d and the transformed dimension is d0, we can learn our model of choice offline based on all the transformed data. Also, in the beginning we will choose 5 sam- ples from each class to train svm and calculate its performance on the test dataset (10,000 samples). After that, at each step we will find the k-NNQH for each hyperplane and add them to the training data and re-train svm. This way we may end up with an unbalanced training data after a couple of iterations as some of the classes could have a lot of edge cases

57 (a) 2 classes (b) 3 classes (c) 4 classes

Figure 5.1: Toy Example with 2 − d datapoints.

Figure 5.2: Steps of the svm setting.

(close to hyperplanes) while others are far away from the hyperplanes (we discussed this in section5.4.1). To avoid this we choose candidate points in a way so that we end up with 5 samples from each class and add them to the training set. We iterate 300 times and evaluate the performance of svm at each iteration. We are using F1-score to evaluate the performance of svm. P recision ∗ Recall F 1 = 2 ∗ (5.6) Score P recision + Recall Figure.5.3 shows the average result for some of the classes (Automobile, Deer, Frog and Truck) after 3 runs. As we mentioned before some classes may have more edge cases than others and in our experiment we observed that Automobile, Frog and Truck classes are having many more edge cases compare to others. This is what we expect as for example automobile and trucks are very similar while deer or airplanes are not similar to any other classes. This phenomena is the reason for significant improvement in the performance of svm for some classes and not the others. Also, we can see that both combined sparse-RPTB and EH-Hash are performing very similar to exhaustive method. However, as [50] mentioned, under this setting, a method could even outperform exhaus- tive search since there is no guarantee that the best active choice will help test performance

58 (a) Automobile Class (b) Deer Class

(c) Frog Class (d) Truck Class

Figure 5.3: Average F1-score after 3 runs. Classes with high number of edge cases will get the most benefit from active learning (i.e Automobile, Frog and Truck).

(e.g. automobile and frog class). Hence, this experiment is not the best way to compare methods and won’t tell us how good combined sparse-RPTB or EH-Hash are performing and at what cost (number of retrieved points). But, we did this comparison just for completeness and to show that combined sparse-RPTB follows the same trend as EH-Hash and exhaustive search. Section 5.4.3 does the comparison in a way so that we can exactly see how well each algorithm performs compare to the exhaustive search and at what cost.

5.4.3 ACCURACY VS SPEED-UP TRADE-OFF In this section, we divide the dataset into 2 sets, training set (N-10000) and test set (hyper- plane queries, 10000) and compare the top 10 NNQHs for combined sparse-RPTB, EH-Hash and exhaustive search. Note that exhaustive search will return the exact answer at a linear time and the whole purpose of combined sparse-RPTB and EH-Hash is to get as close as pos-

59 (a) SIFT Dataset (b) CIFAR-10 Dataset

Figure 5.4: Performance of combined sparse-RPTB vs EH-Hash. Markers from left to right corresponds to number of iterations [1, 20, 50, 100, 200]

sible to the exact answer in a sublinear time. Figure.5.4 shows this trade-off. Each marker in the curve corresponds to the number of iterations that has been used to achieve that performance. We fix the number of iterations to [1, 20, 50, 100, 200]. X-axis is the inverse speed up and indicates how much better we are doing in compare to a linear search (i.e. 0.2 in the x-axis shows we are spending 80% less time than the linear search) and y-axis is the accuracy. # retrieved points inverse speedup = (5.7) T otal number of samples For EH-Hash we use two different values for k, [16, 32]. From figure.5.4 we can see that EH-Hash with k = 32 is not performing well at all and the reason is that most of the buckets are empty. Even when k is 16 combined sparse-RPTB is performing much better. By better performance, we mean retrieving less number of points and having high accuracy. It is obvious that with the same number of iterations, combined sparse-RPTB always retrieves less number of points and still have higher accuracy compare to EH-Hash. Also, cleverly making projection directions sparse (combined sparse-RPTBsamp) based on lemma.18, performs slightly better than doing it uniformly at randomly.

5.5 CONCLUSION

In this chapter we discussed the problem of active learning and demonstrated how RPT can be used to perform active learning. One of the most recent approaches involves using a transformation which reduces NNQH to NNS but increases the dimension quadratically. With this increase in dimensionality we have no hope of using RPT due to its high space complexity. However, we used space efficient versions of RPT proposed in chapter. 2 and

60 chapter. 3 to make the problem manageable. We did a toy example to demonstrate the process of selecting samples for SVM. Then we compared exhaustive search, a LSH based method and our proposed method on 2 different data sets with 2 different settings to show our RPT based approach is superior to LSH based approach both from accuracy and efficiency perspectives.

61 Chapter 6

CONCLUSION AND FUTURE WORK

This thesis addresses two major drawbacks of RPT which are its high space complexity and high error for a single tree. We proposed multiple approaches in chapter 2 to reduce the space complexity by introducing sparsity and reducing number of possible projection directions. Our proposed techniques reduces space complexity without significantly affecting nearest neighbor search quality. In chapter 3 we demonstrated how auxiliary information and priority functions can be used to improve nearest neighbor search performance of a single RPT. These purposed techniques improves search quality by retrieving data points from multiple informative leaf nodes of the tree. As a result of these approaches RPT required about 85% less number of trees to achieve same level of accuracy. We tested our proposed approach against other tree based methods on real world data sets and showed our proposed method has superior performance compaerd to all other tree based methods. Chapters 4 and 5 focused on RPT applications, namely, maximum inner product search and nearest neighbor to query hyperplane search. The former problem is more widely known and has many transformations to reduce it to an equivalent NNS problem. We ranked such transformations and showed both theoretically and experimentally which one would perform best in conjunction with RPT. On the other hand, to the best of our knowledge there exist only one transformation for the latter problem to reduce it to an equivalent NNS problem and we used it to solve the NNQH problem via our modified versions of RPT. For both MIPS and NNQH problems we performed extensive empirical evaluations and compared against the state of the art approaches on real world data sets and demonstrated superiority of our proposed approach.

6.1 FUTURE WORK

The work presented in this thesis can be extended in many interesting directions. For both MIPS and NNQH we use existing transformations which were not designed to minimize RPT’s error. However, it would be interesting to investigate and find new transformations to not only reduce the MIPS and NNQH problems to equivalent NNS problems, but also to minimize RPT’s error at the same time. By doing so, we anticipate major improvement to nearest neighbor search performance of a RPT.

62 Since similarity search is very popular and has many applications, another interesting direction could be to find other big data applications of similarity search where even linear time complexity is not desirable. Therefore, RPT’s high accuracy and low query time com- plexity could be vital in solving these type of problems efficiently. Once such a problem is identified, we can aim to reduce it to an equivalent NNS problem and solve it efficiently via RPT. As a specific example, similarity search in kernel space can be reduced to an equivalent NNS problem. Kernel similarity search has many applications in bioinformatics, information retrieval and etc. [97, 98].

63 BIBLIOGRAPHY

64 BIBLIOGRAPHY

[1] V. Ramasubramanian and K. K. Paliwal, “Fast k-dimensional tree algorithms for nearest neighbor search with application to vector quantization encoding,” IEEE Transactions on Signal Processing, vol. 40, no. 3, pp. 518–531, 1992.

[2] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud, “K-nearest neighbor search: Fast gpu-based implementations and application to high-dimensional feature matching,” in Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 3757–3760, IEEE, 2010.

[3] L. Devroye, “The uniform convergence of nearest neighbor regression function estima- tors and their application in optimization,” IEEE Transactions on Information Theory, vol. 24, no. 2, pp. 142–151, 1978.

[4] C.-L. Liu and M. Nakagawa, “Evaluation of prototype learning algorithms for nearest- neighbor classifier in application to handwritten character recognition,” Pattern Recog- nition, vol. 34, no. 3, pp. 601–615, 2001.

[5] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967.

[6] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613, ACM, 1998.

[7] I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Nearest neighbor based greedy coordi- nate descent,” in Advances in Neural Information Processing Systems, pp. 2160–2168, 2011.

[8] V. Athitsos, M. Potamias, P. Papapetrou, and G. Kollios, “Nearest neighbor retrieval using distance-based hashing,” in Data Engineering, 2008. ICDE 2008. IEEE 24th In- ternational Conference on, pp. 327–336, IEEE, 2008.

[9] M. Slaney and M. Casey, “Locality-sensitive hashing for finding nearest neighbors [lec- ture notes],” IEEE Signal processing magazine, vol. 25, no. 2, pp. 128–131, 2008.

[10] S. Dasgupta and Y. Freund, “Random projection trees and low dimensional manifolds,” in Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 537– 546, ACM, 2008.

65 [11] S. Dasgupta and K. Sinha, “Randomized partition trees for exact nearest neighbor search,” in Conference on Learning Theory, pp. 317–337, 2013. [12] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pp. 459–468, IEEE, 2006. [13] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Complementary hashing for approx- imate nearest neighbor search,” in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 1631–1638, IEEE, 2011. [14] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 2227– 2240, 2014. [15] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in Proceedings of the 23rd international conference on Machine learning, pp. 97–104, ACM, 2006. [16] N. Roussopoulos, S. Kelley, and F. Vincent, “Nearest neighbor queries,” in ACM sigmod record, vol. 24, pp. 71–79, ACM, 1995. [17] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, “idistance: An adaptive b+-tree based indexing method for nearest neighbor search,” ACM Transactions on Database Systems (TODS), vol. 30, no. 2, pp. 364–397, 2005. [18] N. Katayama and S. Satoh, “The sr-tree: An index structure for high-dimensional nearest neighbor queries,” ACM Sigmod Record, vol. 26, no. 2, pp. 369–380, 1997. [19] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the twentieth annual sympo- sium on Computational geometry, pp. 253–262, ACM, 2004. [20] K. Eshghi and S. Rajaram, “Locality sensitive hash functions based on concomitant rank order statistics,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 221–229, ACM, 2008. [21] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, “Super-bit locality-sensitive hashing,” in Advances in Neural Information Processing Systems, pp. 108–116, 2012. [22] K. Terasawa and Y. Tanaka, “Spherical lsh for approximate nearest neighbor search on unit hypersphere,” in Workshop on Algorithms and Data Structures, pp. 27–38, Springer, 2007. [23] P. Li, M. Mitzenmacher, and A. shrivastava, “Coding for random projections and ap- proximate near neighbor search,” arXiv preprint arXiv:1403.8144, 2014. [24] A. Dasgupta, R. Kumar, and T. Sarl´os,“Fast locality-sensitive hashing,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1073–1081, ACM, 2011.

66 [25] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity search: A survey,” arXiv preprint arXiv:1408.2927, 2014.

[26] R. O’Donnell, Y. Wu, and Y. Zhou, “Optimal lower bounds for locality-sensitive hashing (except when q is tiny),” ACM Transactions on Computation Theory (TOCT), vol. 6, no. 1, p. 5, 2014.

[27] A. Andoni, “E2lsh 0.1 user manual,” http://www. mit. edu/andoni/LSH/, 2005.

[28] W. N. Venables and B. D. Ripley, “Tree-based methods,” in Modern Applied Statistics with S, pp. 251–269, Springer, 2002.

[29] W. F. Mitchell, “A refinement-tree based partitioning method for dynamic load bal- ancing with adaptively refined grids,” Journal of Parallel and Distributed Computing, vol. 67, no. 4, pp. 417–429, 2007.

[30] S. Ahmed, F. Coenen, and P. Leng, “Tree-based partitioning of date for association rule mining,” Knowledge and information systems, vol. 10, no. 3, pp. 315–331, 2006.

[31] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.

[32] M. Shevtsov, A. Soupikov, and A. Kapustin, “Highly parallel fast kd-tree construction for interactive ray tracing of dynamic scenes,” in Computer Graphics Forum, vol. 26, pp. 395–404, Wiley Online Library, 2007.

[33] A. Nuchter, K. Lingemann, and J. Hertzberg, “Cached kd tree search for icp algo- rithms,” in 3-D Digital Imaging and Modeling, 2007. 3DIM’07. Sixth International Conference on, pp. 419–426, IEEE, 2007.

[34] K. Xu, Y. Li, T. Ju, S.-M. Hu, and T.-Q. Liu, “Efficient affinity-based edit propagation using kd tree,” in ACM Transactions on Graphics (TOG), vol. 28, p. 118, ACM, 2009.

[35] W. Hunt, W. R. Mark, and G. Stoll, “Fast kd-tree construction with an adaptive error- bounded heuristic,” in Interactive Ray Tracing 2006, IEEE Symposium on, pp. 81–88, IEEE, 2006.

[36] T. Liu, A. W. Moore, K. Yang, and A. G. Gray, “An investigation of practical ap- proximate nearest neighbor algorithms,” in Advances in neural information processing systems, pp. 825–832, 2005.

[37] K. Sinha, “Lsh vs randomized partition trees: Which one to use for nearest neighbor search?,” in Machine Learning and Applications (ICMLA), 2014 13th International Conference on, pp. 41–46, IEEE, 2014.

[38] S. M. Omohundro, Five balltree construction algorithms. International Computer Sci- ence Institute Berkeley, 1989.

67 [39] P. C. Reddy and A. S. Babu, “Survey on weather prediction using big data analystics,” in Electrical, Computer and Communication Technologies (ICECCT), 2017 Second In- ternational Conference on, pp. 1–6, IEEE, 2017. [40] M. Rajshree, S. Arya, and R. Agarwal, “Data mining technique for agriculture and related areas,” International Journal of Advanced Research in Computer Science, vol. 2, no. 6, 2011. [41] P. J. Clark and F. C. Evans, “Distance to nearest neighbor as a measure of spatial relationships in populations,” Ecology, vol. 35, no. 4, pp. 445–453, 1954. [42] M.-L. Zhang and Z.-H. Zhou, “A k-nearest neighbor based algorithm for multi-label classification,” in Granular Computing, 2005 IEEE International Conference on, vol. 2, pp. 718–721, IEEE, 2005. [43] K. Mouratidis, M. L. Yiu, D. Papadias, and N. Mamoulis, “Continuous nearest neighbor monitoring in road networks,” in Proceedings of the 32nd international conference on Very large data bases, pp. 43–54, VLDB Endowment, 2006. [44] M. Blume, M. A. Lazarus, L. S. Peranich, F. Vernhes, W. R. Caid, T. E. Dunning, G. R. Russell, and K. L. Sitze, “Predictive modeling of consumer financial behavior using supervised segmentation and nearest-neighbor matching,” Jan. 4 2005. US Patent 6,839,682. [45] S. Parameswaran and K. Q. Weinberger, “Large margin multi-task metric learning,” in Advances in neural information processing systems, pp. 1867–1875, 2010. [46] T. Liu, C. Rosenberg, and H. A. Rowley, “Clustering billions of images with large scale nearest neighbor search,” in null, p. 28, IEEE, 2007. [47] B. M. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Recommender systems for large- scale e-commerce: Scalable neighborhood formation using clustering,” in Proceedings of the fifth international conference on computer and information technology, vol. 1, pp. 291–324, 2002. [48] N. Bhatia et al., “Survey of nearest neighbor techniques,” arXiv preprint arXiv:1007.0085, 2010. [49] A. Shrivastava and P. Li, “Asymmetric lsh (alsh) for sublinear time maximum in- ner product search (mips),” in Advances in Neural Information Processing Systems, pp. 2321–2329, 2014. [50] P. Jain, S. Vijayanarasimhan, and K. Grauman, “Hashing hyperplane queries to near points with applications to large-scale active learning,” in Advances in Neural Informa- tion Processing Systems, pp. 928–936, 2010. [51] Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, and U. Paquet, “Speeding up the xbox recommender system using a euclidean trans- formation for inner-product spaces,” in Proceedings of the 8th ACM Conference on Recommender systems, pp. 257–264, ACM, 2014.

68 [52] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, no. 8, pp. 30–37, 2009.

[53] A. Shrivastava and P. Li, “Improved asymmetric locality sensitive hashing (alsh) for maximum inner product search (mips),” arXiv preprint arXiv:1410.5410, 2014.

[54] B. Neyshabur and N. Srebro, “On symmetric and asymmetric lshs for inner product search,” arXiv preprint arXiv:1410.5518, 2014.

[55] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” Journal of machine learning research, vol. 2, no. Nov, pp. 45–66, 2001.

[56] G. Schohn and D. Cohn, “Less is more: Active learning with support vector machines,” in ICML, pp. 839–846, Citeseer, 2000.

[57] B. Settles, “Active learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, 2012.

[58] O. Keivani, K. Sinha, and P. Ram, “Improved maximum inner product search with bet- ter theoretical guarantee using randomized partition trees,” Machine Learning, pp. 1–26, 2018.

[59] K. Sinha and O. Keivani, “Sparse randomized partition trees for nearest neighbor search,” in Artificial Intelligence and Statistics, pp. 681–689, 2017.

[60] O. Keivani and K. Sinha, “Improved nearest neighbor search using auxiliary information and priority functions,” in International Conference on Machine Learning, pp. 2578– 2586, 2018.

[61] O. Keivani, K. Sinha, and P. Ram, “Improved maximum inner product search with better theoretical guarantees,” in Neural Networks (IJCNN), 2017 International Joint Conference on, pp. 2927–2934, IEEE, 2017.

[62] N. Ailon and B. Chazelle, “The fast johnson–lindenstrauss transform and approximate nearest neighbors,” SIAM Journal on computing, vol. 39, no. 1, pp. 302–322, 2009.

[63] B. S. Manjunath and W.-Y. Ma, “Texture features for browsing and retrieval of image data,” IEEE Transactions on pattern analysis and machine intelligence, vol. 18, no. 8, pp. 837–842, 1996.

[64] K. Bache and M. Lichman, “Uci machine learning repository,” 2013.

[65] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 1, pp. 117– 128, 2011.

[66] W. B. Johnson and J. Lindenstrauss, “Extensions of lipschitz mappings into a hilbert space,” Contemporary mathematics, vol. 26, no. 189-206, p. 1, 1984.

69 [67] A. Babenko and V. S. Lempitsky, “Product split trees.,” in CVPR, pp. 6316–6324, 2017.

[68] N. Srebro, J. Rennie, and T. S. Jaakkola, “Maximum-margin matrix factorization,” in Advances in neural information processing systems, pp. 1329–1336, 2005.

[69] P. Cremonesi, Y. Koren, and R. Turrin, “Performance of recommender algorithms on top-n recommendation tasks,” in Proceedings of the fourth ACM conference on Recom- mender systems, pp. 39–46, ACM, 2010.

[70] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik, “Fast, accurate detection of 100,000 object classes on a single machine,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1814–1821, 2013.

[71] P. Jain and A. Kapoor, “Active learning for large multi-class problems,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 762–769, IEEE, 2009.

[72] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[73] T. Joachims, “Training linear svms in linear time,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 217– 226, ACM, 2006.

[74] T. Joachims, T. Finley, and C.-N. J. Yu, “Cutting-plane training of structural svms,” Machine Learning, vol. 77, no. 1, pp. 27–59, 2009.

[75] A. Gionis, P. Indyk, R. Motwani, et al., “Similarity search in high dimensions via hashing,” in Vldb, vol. 99, pp. 518–529, 1999.

[76] P. Ram and A. G. Gray, “Maximum inner-product search using cone trees,” in Proceed- ings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 931–939, ACM, 2012.

[77] R. R. Curtin, P. Ram, and A. G. Gray, “Fast exact max-kernel search.,” in Proceedings of SIAM Data Mining, 2013.

[78] R. R. Curtin and P. Ram, “Dual-tree fast exact max-kernel search,” Statistical Analysis and Data Mining, vol. 7, no. 4, pp. 229–253, 2014.

[79] D. Cai, “Text Dat sets in Matlab Format,” 2009. Available at : http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html.

[80] “Trec interpolation.” http://trec.nist.gov/pubs/trec16/appendices/measures.pdf. Ac- cessed: 2016-09-14.

[81] B. Settles, “Active learning literature survey,” tech. rep., University of Wisconsin- Madison Department of Computer Sciences, 2009.

70 [82] Y. Fu, X. Zhu, and B. Li, “A survey on instance selection for active learning,” Knowledge and information systems, vol. 35, no. 2, pp. 249–283, 2013.

[83] M. Elahi, F. Ricci, and N. Rubens, “A survey of active learning in collaborative filtering recommender systems,” Computer Science Review, vol. 20, pp. 29–50, 2016.

[84] A. Culotta and A. McCallum, “Reducing labeling effort for structured prediction tasks,” in AAAI, vol. 5, pp. 746–751, 2005.

[85] B. Settles and M. Craven, “An analysis of active learning strategies for sequence label- ing tasks,” in Proceedings of the conference on empirical methods in natural language processing, pp. 1070–1079, Association for Computational Linguistics, 2008.

[86] D. D. Lewis and J. Catlett, “Heterogeneous uncertainty sampling for supervised learn- ing,” in Machine learning proceedings 1994, pp. 148–156, Elsevier, 1994.

[87] C. K¨ornerand S. Wrobel, “Multi-class ensemble-based active learning,” in European conference on machine learning, pp. 687–694, Springer, 2006.

[88] R. Hwa, “Sample selection for statistical parsing,” Computational linguistics, vol. 30, no. 3, pp. 253–276, 2004.

[89] H. S. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in Proceedings of the fifth annual workshop on Computational learning theory, pp. 287–294, ACM, 1992.

[90] I. Dagan and S. P. Engelson, “Committee-based sampling for training probabilistic classifiers,” in Machine Learning Proceedings 1995, pp. 150–157, Elsevier, 1995.

[91] M. McCallum and K. Nigam, “Employing em in pool-based active learning for text classification, 1998,” in International Conference on Machine Learning (ICML).

[92] N. C. Oza and S. Russell, Online ensemble learning. University of California, Berkeley, 2001.

[93] C. Zhang and Y. Ma, Ensemble machine learning: methods and applications. Springer, 2012.

[94] R. Basri, T. Hassner, and L. Zelnik-Manor, “Approximate nearest subspace search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 266–278, 2011.

[95] R. Kannan, S. Vempala, et al., “Spectral algorithms,” Foundations and Trends R in Theoretical Computer Science, vol. 4, no. 3–4, pp. 157–288, 2009.

[96] S. Vijayanarasimhan, P. Jain, and K. Grauman, “Hashing hyperplane queries to near points with applications to large-scale active learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 2, pp. 276–288, 2014.

[97] J. Salomon and D. R. Flower, “Predicting class ii mhc-peptide binding: a kernel based approach using similarity scores,” BMC bioinformatics, vol. 7, no. 1, p. 501, 2006.

71 [98] W. Wu, J. Xu, H. Li, and S. Oyama, “Learning a robust relevance model for search using kernel methods,” Journal of Machine Learning Research, vol. 12, no. May, pp. 1429– 1458, 2011.

[99] K. Li and J. Malik, “Fast k-nearest neighbour search via dynamic continuous indexing,” in International Conference on Machine Learning, pp. 671–679, 2016.

72 APPENDIXES

73 Appendix A

CHAPTER 2 PROOFS

In this appendix we provide all proofs for chapter2.

A.1 Proof of Lemma1

Proof. As projection directions at intermediate nodes at different levels are independent of each other along any path from root to any leaf node, using union bound, the failure probability analysis is essentially the same as in [11].

A.2 Proof of Lemma3

Proof. An RPTB is constructed by choosing a projection direction for each level of the RPT uniformly at random without replacement from the bucket. Let m be the depth of such an RPTB. Consider any two instantiations of RPTBs, namely, τi and τj. Let Aij be the event that τi and τj has same sequence of projection directions at every level along the path 1 1 1 1 1 from root to leaf node. Then Pr(Aij) = N · N−1 ··· N−(m−1) ≤ (N−m+1)m ≤ (N−m)m . Note that because of the choice random split, depth m of any RPTB can be at most log4/3 n = 1 (log4/3 2) · log n ≤ (5/2) · log n, and at least log4 n = 2 log n. Suppose the bucket contains N = cm = c · log n projection directions for some c ≥ 3. If we have L RPTBs, then the probability that any pair of RPTBs have same sequence of projection directions at every

74 level along the path from root to leaf node is: L Pr (∃(i, j) such that A happens) ≤ Pr(A ) ij 2 ij L2  1  ≤ 2 (N − m)m ! L2 1 ≤ 1 log n 2 ((c − 1) · m) 2 ! L2 1 ≤ (c−1) 1 2 2 log n ( 2 log n) ! L2 1 ≤ (c−1) 1 1 2 2 log n 2 log n ( 2 ) (log n) ! a L2 1 ≤ 1 log n 2 (log n) 2 b L2  1  ≤ 1 log log n 2 n 2 c L2  1  ≤ 3 2 n 2 1 L2  d 1 ≤ √ ≤ √ 2 n n 2 n

Inequality a is due to choice of c while inequality b follows from the following observation. β 1 log n Suppose log n = 2 for some β > 0. This implies β = log log n. Clearly, (log n) 2 = β log n log n β β 1 log log n 2 2 = (2 ) 2 = n 2 = n 2 . Inequality c holds as long as n ≥ 256 and inequality d follows from the restriction on L.

A.3 Proof of Theorem4

Proof. For ease of readability we use the following notation. Let xH = HDx, yH = HDx, qH = > HDq and let xB = BxH , yB = ByH , qB = BqH . Also, let X1 = U BHD(x − q) = > > > > > U B(xH − qH ) = U (xB − qB) and X2 = U BHD(y − q) = U B(yH − qH ) = U (yB − qB). Event B can now be written as : B ≡ {U >BHDy falls between U >BHDq and U >BHDx} ≡ {U >BHD(y − q) falls between 0 and U >BHD(x − q)} > > ≡ {U (yB − qB) falls between 0 and U (xB − qB)}

≡ {X2 falls between 0 and X1}

≡ {0 < X2 < X1} ∪ {X1 < X2 < 0} δ Using Lemma 8, with probability at least 1 − 2 , we have kxH − qH k∞ = kHD(x − q)k∞ ≤ q 2 log(4nd/δ) q 2 log(4nd/δ) kx − qk d and kyH − qH k∞ = kHD(y − q)k∞ ≤ ky − qk d . Also note

75 Pd 2 2 > 2 that, i=1 Bii((xH )i − (qH )i) = (B(xH − qH )) B(xH − qH ) = kxB − qBk , and similarly, Pd 2 2 2 Pd 2 i=1 Bii((xH )i − (qH )i) = kyH − qH k and i=1 Bii((xH )i − (qH )i) · ((yH )i − (qH )i) = > > (xB − qB) (yB − qB). Using this observation and Lemma 9, it follows that (X1,X2) follows a bivariate normal distribution with zero mean and covariance matrix given by  2 >  kxB − qBk (xB − qB) (yB − qB) δ CB = > 2 , where with probability at least 1 − 2 , (xB − qB) (yB − qB) kyB − qBk the following holds:

2 2 2 (1 − )pkq − xk ≤ kxB − qBk ≤ (1 + )pkq − xk (A.1)

2 2 2 (1 − )pkq − yk ≤ kyB − qBk ≤ (1 + )pkq − yk (A.2)

  |(x − q )>(y − q ) − p(q − x)>(q − y)| ≤ p kq − xk2 B B B B 2  + kq − yk2 (A.3)

Next we use this information and Lemma 7 to estimate Pr(B). We consider the following two cases for this purpose. Case 1: (q − x)>(q − y) ≤ (1 − 2)kq − xkkq − yk > 2 We will show that if (q − x) (q − y) ≤ (1 − 2)kq − xkkq − yk then kyB − qBk ≥ > (xB − qB) (yB − qB) and we can use the first case of Lemma 7. To see this suppose the condition (q − x)>(q − y) ≤ tkq − xkkq − yk holds for some positive t. Then using equation A.3 we can write,

> (xB − qB) (yB − qB)    ≤ p (q − x)>(q − y) + (kq − xk2 + kq − yk2) 2    ≤ p tkq − xkkq − yk + (kq − xk2 + kq − yk2) 2    ≤ p tkq − yk2 + (kq − yk2 + kq − yk2) 2 = pkq − yk2(t + )

> 2 Therefore, the maximum value of (xB −qB) (yB −qB) can be at most pkq−yk (t+). Now 2 2 using equation A.2 it is easy to see that kyB −qBk can be at least pkq−yk (1−). Therefore, 2 > 2 2 kyB − qBk ≥ (xB − qB) (yB − qB) if, pkq − yk (1 − ) ≥ pkq − yk (t + ) ⇒ t ≤ (1 − 2). Using Lemma 7, we see

76

1 kx − q k Pr(B) = arcsin B B π kyB − qBk s ! (x − q )>(x − y )2 × 1 − B B B B kxB − qBkkxB − yBk 1 kx − q k ≤ arcsin B B π kyB − qBk   p ! 1 kxB − qBk 1 kx − qk (1 + ) ≤ ≤ p 2 kyB − qBk 2 ky − qk (1 − ) 1 kx − qk ≤ (1 + 2) 2 ky − qk

Case 2: (q − x)>(q − y) > (1 − 2)kq − xkkq − yk 2 > Note that in this case we can have kyB −qBk < (xB −qB) (yB −qB). Since CB is positive 2 2 > 2 definite, its determinant is non-negative, i.e, kxB −yBk kyB −qBk ≥ ((xB −qB) (yB −qB)) . 2 2 Combining these two facts it is easy to see that kxB − qBk > kyB − qBk , or in other 2 2 2 kxB −qB k kxB −qB k kx−qk (1+) words, 2 > 1. Now using equation A.1 and A.2, we see that 2 ≤ 2 . kyB −qB k kyB −qB k ky−qk (1−)  kx−qk2(1+)   ky−qk2 1+  Combining these two facts we see that 1 ≤ ky−qk2(1−) ⇒ kx−qk2 ≤ 1− . However, it 2 is assumed that kx − qk ≤ ky − qk. Therefore, over random choice of B, kyB − qBk < > (xB − qB) (yB − qB), when the following events holds

ky − qk2 1 +  1 ≤ ≤ kx − qk2 1 −  and (q − x)>(q − y) > (1 − 2)kq − xkkq − yk This corresponds to the shaded region in Figure A.1. Note that for fix q and x, whenever y falls in this shaded region Pr(B) can be close to 1 in the worst case. However, by choosing small  we can control the volume of this shaded region and make it as small as we want.

A.4 Proof of Corallary5  q 1+  > Proof. Let A = xi : kx(1) −qk ≤ kxi −qk ≤ kx(1) −qk 1− − 1 and (q−x(1)) (q−xi) >  (1−2)kq−x(1)kkq−xik . Note that |A| = nη(). Let Zi be the indicator variable that that Pn takes value 1 if x(i) falls between x(1) and q in projection and zero otherwise. Let Z = i=2.

77 d q 1+  Figure A.1: Any point y ∈ R that satisfy kx − qk ≤ ky − qk ≤ kx − qk 1− − 1 and (q − x)>(q − y) > (1 − 2)kq − xkkq − yk lies in the shaded region.

Using Theorem 4, it is easy to see that

n n X X E(Z) = E(Zi) = Pr(Zi = 1) i=2 i=2 X X = Pr(Z(i) = 1) + Pr(Z(i) = 1)

x(i)∈A x(i)∈/A   X X 1 kq − x(1)k ≤ 1 + (1 + ) + δ 2 kq − x(i)k x(i)∈A x(i)∈/A n   X 1 kq − x(1)k ≤ nη() + (1 + ) + δ 2 kq − x k i=2 (i) n 1 X kq − x(1)k = nη() + (1 + ) + (n − 1)δ 2 kq − x k i=2 (i)

≤ (1 + )nΦn(q, {x1, . . . , xn}) + n (η() + δ)

Therefore, the expected fraction of the points that fall between x(1) and q is at most (1 + )Φn(q, {x1, . . . , xn}) + ((η) + δ).

A.5 Proof of Corollary6

Proof. Note that query time of RPT (or sparse RPT) data structure is the sum of, (a) time required to reach to a leaf node (from root node) and (b) time required to process the data points lying in that leaf node. By construction, maximum number of data points at leaf node of an RP-tree (or Sparse RP-tree) is at most n0, and consequently second part of query

78 time is n0d in both cases. Now, in case of RP-tree, time required to reach a leaf node is O(d log n) as the depth of the tree is at most O(log n) and an inner product needs to be computed along the path from root node to a leaf node. Now in case of sparse RP-tree, Walsh Hadamard transform of a d-dimensional query point can be computed in O(d log d) q nd 1  log( δ ) log( δ ) time. Choose d large enough so that  = Θ dρ . This ensures that average number of non-zero coordinates of the projection direction stored at each internal node of sparse RP-tree is pd = dρ. Therefore, time required to reach to a leaf node (from root node), in case of sparse RP-tree is O(d log d + dρ log n). Without considering the constants ρ 2 in asymptotic notation we would like to show that d log n ≥ d log d + d log n when n√≥ d . To achieve this, first we claim that d ≤ 2. To see this, note that, ρ < 1/2 ⇒ dρ < d ⇒ √ d−dρ (d− d) < (d−dρ) ⇒ d < d√ ⇒ d < 2, where the last implication follows from that d−dρ d− d d−dρ d d fact that √ ≤ 2 for any d ≥ 4. Now if n ≥ d2, that would imply n ≥ d2 ≥ d d−dρ . taking d− d d logarithm on both sides yield, log n ≥ d−dρ log d. After cross multiplication and rearranging the terms it is easy to see that d log n ≥ d log d + dρ log n.

A.6 Proof of Lemma7

> > Proof. Let X1 = U (x − q) and X2 = U (y − q). Without loss of generality we can write A as, {U >(y − q) falls between U >(q − q) and U >(x − q)} ≡ {U >(y − q) falls between 0 and U >(x − q)}

≡ {0 < X2 < X1} ∪ {X1 < X2 < 0} 2 Using 2-stability of normal distribution it is easy to see that X1 ∼ N(0, kx − qk ) and 2 similarly X2 ∼ N(0, ky − qk ). Moreover,  d X 2 EU (X1X2) = EU Ui (xi − qi) · (yi − qi) i=1 X  + UiUj(xi − qi) · (yi − qi) i6=j d X 2 = EU (Ui ) · (xi − qi) · (yi − qi) i=1 X + EU (Ui) · EU (Uj) · (xi − qi) · (yi − qi) i6=j d X = (xi − qi) · (yi − qi) i=1 = (x − q)>(y − q) For ease of notation let us use the notation a2 = kx − qk2, b2 = ky − qk2, c = (x − q)>(y − q). > Then it is easy to see that (X1,X2) follows a bivariate normal distribution with zero mean

79 a2 c  and covariance matrix . Let Z and Z be i.i.d standard Normal random variables. c b2 1 2 Then we can write X and X as a linear combination of Z and Z as follows: X = 1 √ 2 1 2 1 c   a2b2−c2  2 dZ1+ b Z2, where, d = b and X2 = bZ2. It is easy to see that, X1 ∼ N(0, a ),X2 ∼ 2 N(0, b ) and E(X1X2) = c. Let A1 and A2 be the events as follows: A1 ≡ {0 < X2 < X1} and A2 ≡ {X1 < X2 < 0}. Then we can write

A1 ≡ {0 < X2 < X1}

≡ {0 < bZ2 < dX1 + (c/b)Z2}

≡ {Z2 > 0,Z1 > eZ2}

b2−c 2 where, e = bd . Since Z1,Z2 are independent, when b > c, slope of the line Z1 = eZ2 is positive and therefore A1 corresponds to an angular sector in the (Z2,Z1) plane with angle 0 < θ1 ≤ π/2 (see left panel of Figure A.2). By invariance of rotation of the distribution > arctan(1/e) (Z2,Z1) , Pr(A1) = 2π . Similarly, A2 can be represented as

A2 ≡ {X1 < X2 < 0}

≡ {dX1 + (c/b)Z2 < bZ2 < 0}

≡ {Z2 < 0,Z1 < eZ2}

arctan(1/e) Using similar argument, it is easy to see that Pr(A2) = 2π . Since A1, A2 are disjoint, arctan(1/e) Pr(A) = Pr(A1) + Pr(A2) = π . 2 When b < c, slope of the line Z1 = eZ2 is negative. Therefore, the region A1 corresponds to an angular sector in the (Z2,Z1) plane with angle 0 < θ2 ≤ π (see right panel of Figure > π+arctan(1/e) A.2). By invariance of rotation of the distribution (Z2,Z1) , Pr(A1) = 2π . Using π+arctan(1/e) similar argument, it is easy to see that Pr(A2) = 2π , and consequently, Pr(A) = arctan(1/e) Pr(A1) + Pr(A2) = 1 + π . Therefore,

( arctan(1/e) 2 π , if b ≥ c Pr(A) = arctan(1/e) (A.4) 1 + π otherwise.

80 Next, note that if b2 ≥ c then,

arctan(1/e) √  a2b2 − c2  = arctan b2 − c √ ! a2b2 − c2 = arcsin p(a2b2 − c2) + (b2 − c)2 ! r a2b2 − c2 = arcsin a2b2 + b4 − 2b2c ! ar b2 − (c/a)2 = arcsin b a2 + b2 − 2c s ! a a2b2 − c2 = arcsin b a2(a2 + b2 − 2c) s ! a (a2 − c)2 = arcsin 1 − b a2(a2 + b2 − 2c)  s  kx − qk ((x − q)>((x − q) − (y − q)))2 =α arcsin 1 − ky − qk kx − qk2k(x − q) − (y − q)k2   s  kx − qk (x − q)>(x − y)2 = arcsin 1 − ky − qk kx − qkkx − yk 

where, equality α follows by plugging values of a, b and c and subsequent simplification. However, if c > b2, using the same argument as above we get,

arctan(1/e) √  a2b2 − c2  = arctan b2 − c √ ! a2b2 − c2 = arcsin − p(a2b2 − c2) + (b2 − c)2 √ ! a2b2 − c2 = − arcsin p(a2b2 − c2) + (b2 − c)2  s  kx − qk (x − q)>(x − y)2 = − arcsin 1 − ky − qk kx − qkkx − yk 

The result now follows immediately from equation A.4.

81 θ1 Figure A.2: The left panel corresponds to the case when e > 0. In this case, Pr(A1) = 2π = arctan(1/e) 2π , where θ1 is the shaded angle. The right panel corresponds to the case when e < 0. θ2 π+arctan(1/e) In this case, Pr(A1) = 2π = 2π , where θ2 is the shaded angle.

A.7 Proof of Lemma8

> Proof. Fix any xi ∈ S and define a random variable u = HDx˜i = (u1, . . . , ud) , where√ Pd x˜i = xi/kxik so that kx˜ik = 1. Note that u1 has the form j=1 ajx˜ij, where each aj = ±1/ d is chosen independently and uniformly. Next we present a Chernoff bound type argument.  √  2 2 2 sdu1  Q sdaj x˜ij  Q s dkx˜ik /2 s d/2 For any s > 0, we have E e = j E e = j cosh s dx˜ij ≤ e = e . Now applying Markov’s inequality we get,

 2  sdu1 s d Pr (|u1| > s) = 2Pr e > e

2 2 2 sdu1  s d s d/2−s d ≤ 2E e /e ≤ 2e = 2e−s2d/2 ≤ δ/(nd)

Setting s = p2 log(2nd/δ)/d yields the last inequality. Taking union bound over all nd coordinates of vectors {HDx˜i : xi ∈ S} and noting that HDxi = kxikHDx˜i, the result follows.

A.8 Proof of Lemma9

Pd Pd Proof. Note that we can write Y1 and Y2 as, Y1 = i=1 Ui(Biiv1i) and Y2 = i=1 Ui(Biiv2i).  Pd 2 2   Pd 2 2  By 2-stability of Normal distribution, Y1 ∼ N 0, i=1 Biiv1i and Y2 ∼ N 0, i=1 Biiv2i .

82 Next note that,

EU (Y1Y2) d ! d !! X X = EU Ui(Biiv1i) · Ui(Biiv2i) i=1 i=1 d ! X 2 2 X = E Ui Biiv1iv2i + UiUjBiiBjjv1iv2j i=1 i6=j d X 2 2 = EU Ui Biiv1iv2i i=1 X + EU (Ui) EU (Uj) BiiBjjv1iv2j i6=j d X 2 X = 1 · (Biiv1iv2i) + 0 · 0 · BiiBjj(v1iv2j) i=1 i6=j d X 2 = Biiv1iv2i i=1

> Therefore, (Y1,Y2) follows a bivariate normal distribution with zero mean and covariance  Pd 2 2 Pd 2  i=1 Biiv1i i=1 Biiv1iv2i matrix CB = Pd 2 Pd 2 2 . Note that CB is a random quantity that i=1 Biiv1iv2i i=1 Biiv2i Pd 2 2  depends on B. Taking expectation with respect to B, we see that, EB i=1 Biiv1i = Pd 2 2 2 Pd 2 2  2 i=1 EB(Bii)v1i = pkv1k . Similarly, EB i=1 Biiv2i = pkv2k , and EBEU (Y1Y2) =  2 >  Pd 2 > pv1 p(v1 v2) i=1 EB(Bii)v1iv2i = p(v1 v2). That is EB(CB) = > 2 . Let us denote p(v1 v2) pv2 this matrix by C. > Therefore, in expectation (with respect to B), (Y1,Y2) follows a bivariate normal dis- tribution with zero mean and fixed covariance matrix C. What we show next, is that over randomization of B, diagonal entries of CB are tightly concentrated near its expected value, i.e., corresponding diagonal entries of C. Pd 2 2 We start with the term i=1 Biiv1i and observe that,

d d X 2 2 X 4 2 Var( Biiv1i) = v1iVar(Bii) i=1 i=1 d d X 4 2 X 2 = v1ip(1 − p) ≤ pkv1k∞ v1i i=1 i=1 ! 2 log( 2nd ) ≤ pkv k4 δ 1 d

Now using Bernstein inequality we get,

83 d !

X 2 2 2 2 Pr Biiv1i − qkv1k > pkv1k i=1 d d ! !

X 2 2 X 2 2 2 = Pr Biiv1i − EB Biiv1i > pkv1k i=1 i=1   1 2 2 (pkv1k ) − 2  Pd 2 2 1 2 2 2 log(2nd/δ)  Var(B v )+( pkv1k )·(kv1k ( )) ≤ 2e i=1 ii 1i 3 d 1 2 2 4 ! − 2  p kv1k 2 log(2nd/δ) 2 log(2nd/δ) pkv k4 + 1 pkv k4 ≤ 2e 1 ( d ) 3 1 ( d ) ! −2pd 4(1+  ) log 2nd δ = 2e 3 ( δ ) ≤ 4

  where the last nequality follows due to the choice of p. Using similar argument, Pr Pd B2 v2 − pkv k2 > qkv k2 ≤ i=1 ii 2i 2 2 δ 4 . Applying the above result to (v1 + v2) and (v1 − v2), we get that the following holds with δ probability at least 1 − 2 .

d 2 X 2 2 2 (1 − )pkv1 + v2k ≤ Bii(v1i + v2i) ≤ (1 + )pkv1 + v2k i=1

d 2 X 2 2 2 (1 − )pkv1 − v2k ≤ Bii(v1i − v2i) ≤ (1 + )pkv1 − v2k i=1 Using the above we get,

d X 2 4 Biiv1iv2i i=1 d d X 2 2 X 2 2 = Bii(v1i + v2i) − Bii(v1i − v2i) i=1 i=1 2 2 ≥ (1 − )pkv1 + v2k − (1 + )pkv1 − v2k > 2 2 = 4pv1 v2 − 2p(kv1k + kv2k )    = 4p v>v − (kv k2 + kv k2) 1 2 2 1 2

Pd 2 >  2 2  Using similar argument it is easy to show that, 4 i=1 Biiv1iv2i ≤ 4p v1 v2 + 2 (kv1k + kv2k ) and the result follows.

84 Appendix B

CHAPTER 3 PROOFS

In this section we are providing all the proofs for Chapter3.

B.1 Proof of Theorem10

Proof. For any x(i) ∈ S, using lemma 1 of [99], we get,   > >  2 kq − x(1)k2 Pr |U (q − x(i))| ≤ |U (q − x(1))| ≤ 1 − arccos π kq − x(i)k2

π 2θ Noting that for any z, arccos(z) = 2 − arcsin(z), and the inequality θ ≥ sin θ ≥ π , for 0 ≤ π > >  kq−x(1)k2 θ ≤ , we get Pr |U (q − x(i))| ≤ |U (q − x(1))| ≤ . Let Zi be indicator variable 2 kq−x(i)k2 > > kq−x(1)k2 that takes value 1 if |U (q − x(i))| ≤ |U (q − x(1))|, and 0 otherwise. Then (Zi) ≤ . E kq−x(i)k2 P|S| Let Z= i=1 Zi. Then Z indicates the number of points in S whose distance from q upon > projection is smaller than |U (q − x(1))|. Using Markov’s inequality, P|S| (Z ) kq−x k Pr(Z > k) ≤ E(Z) = i=1 E i ≤ 1 P|S| (1) 2 . k k k i=1 kq−x(i)k2

B.2 Proof of Theorem11

k Proof. Let c = kn0. Since we are using median split, it is easy to see that exactly d 2 e levels from the leaf node level (and excluding leaf node level) will have less than c points on each side of the median. Note that at the leaf node level, we have at most d n e nodes. Now n0 consider the level just above the leaf node level. Total number of nodes at this level is at 1 n most · d e and on each side of the median we have at most n0 points. Since n0 < c, on 2 n0 each side of the median, for each node at this level, will store a matrix of size n0 × (m + 1) matrix (n0 × m matrix for m dimensional representation of n0 points and additional n0 × 1 space for storing index of these n0 points). If we further go one level up, maximum number 1 n of nodes at this level is · d e and on each side of the median we have at most 2n0 points 4 n0 and so on. Therefore total additional space complexity for storing auxiliary information at

85 k d 2 e levels from the leaf node level is,

d k e d k e X2  1 n  X2 1 2(m + 1) · d e · (2i−1n ) ≤ 2(m + 1)n 2i n 0 2 i=1 0 i=1 k = (m + 1)nd e 2 For the remaining levels, we store c × (m + 1) matrix on each side of the median at each internal node. Total additional space required to store this auxiliary information is,

logd n e− d k e+1  n0 ( 2 )  X i 2(m + 1)c  2  i=0

logd n e−d k e ! 2 n0 2 − 1 = 2(m + 1)c 2 − 1 1 n k ≤ 2(m + 1)c d e ≤ 2(m + 1)n d k e d k e 2 2 n0 2 2 Summing these two terms, additional space requirement is,

 k 2k  (m + 1)n d e + ≤ 6(m + 1)n d k e 2 2 2

where the last inequality follows from the fact that d k e+ 2k ≤ 6 for all k ≤ 10. In addition, 2 d k e 2 2 we also need to store m random projection directions for the entire tree requiring extra md space. Therefore, total additional space requirement for storing auxiliary information is at most 6(m + 1)n + md ≤ (m + 1)(6n + d). Now, note that we want to choose number of projection directions m in such a way that for all auxiliary data points stored at the internal nodes along a query routing path (from root node to leaf node) and for the query, pairwise distances are preserved up to a multiplicative error (1 ± ) compared to the respective original distances. Total number of such points   is n0 = 2c logd n e − 1 + 1 ≤ 2c logd n e, considering all levels except leaf node level. JL n0 n0 0  log n   log c+log log(n/n0)  lemma tells us that m = O 2 = O 2 = O (log log(n/n0)) would suffice, where the last inequality follows if we fix  and since c is a fixed quantity. therefore total additional space complexity for storing auxiliary information is O ((n + d) log log(n/n0)), ˜ which we write as O(n + d) hiding the log log(n/n0) factor.

B.3 Proof of Theorem12

Proof. Due to overlap, at each successive level size of the internal node reduces by factor of 1 ( + α). Simple calculation shows that depth of the tree is at most k = log 2 (n/n0). Total 2 1+2α

86 number of nodes is,

k−1 1 log 2 (n/n0) 1 log2(n/n0)·log 2 2 2 = · 2 1+2α = · 2 1+2α 2 2 1  log 2 2   2 1 n 1+2α 1 n log2 = · = · 1+2α 2 n0 2 n0 1 1  n  1−log2(1+2α) = · 2 n0 Due to median split, total number of internal nodes will be exactly one less than the number of leaf nodes, moreover, we need to store a random projection direction at each of these   1  1−log (1+2α) internal nodes. Therefore, for fixed n0, space complexity of spill tree is O dn 2 .

87 Appendix C

CHAPTER 4 PROOFS

This section provides all proofs for Chapter 4.

C.1 Proof of Theorem13

Proof. For different transformations, the theorem have already been proven in different lit- erature. For the sake of completeness, we provide the proof below. Note that for any trans- 2 2 2 > formation Tj, j = 1,..., 4, we have,kQj(q) − Pj(x)k = kQj(q)k + kPj(x)k − 2Q(q) Pj(x). after doing simple algebraic calculations, for different transformations this yields to:  q>x  kQ (q) − P (x)k2 = 2 1 − (C.1) 1 1 kqkβ  q>x kQ (q) − P (x)k2 = 2 1 − (C.2) 2 2 β2 2 2 2 > kQ3(q) − P3(x)k = β + kqk − 2q x (C.3)

q>x kQ (q) − P (x)k2 = (1 + m/4) − (2/α) 4 4 kqk m+1 kxk 2 + 2 (C.4) α It is easy to see from Equation C.1, C.2, C.3 that maximizing q>x is same as minimizing kQj(q) − Pj(x)k for j = 1, 2, 3 which corresponds to transformations T1 , T2 , T3 . Since the last term of Equation C.4 tends to zero as m → 0, the same holds for Transformation T4 as well.

C.2 Proof of Corallary14

> > Proof. Consider any pair (x(i), x(i+1)) such that q x(i) ≥ q x(i+1). From Equation C.1, C.2, C.3, it is easy to see that kQj(q) − Pj(x(i+1))k − kQj(q) − Pj(x(i))k ≥ 0 for any j = 1, 2, 3.

88 Moreover, as can be seen from Equation C.4 that the same holds for j = 4 as m → 0. Considering all the pairs for i = 1, 2,..., (n − 1), it is easy to see that kQj(q) − Pj(x(1))k ≤ kQj(q) − Pj(x(2))k ≤ · · · ≤ kQj(q) − Pj(x(n))k. Combining this fact with the definition of potential function from Equation 1.3, Equation 4.3 follows.

C.3 Proof of Theorem15

Proof. First, we prove a simple Lemma that will be used in our proof. a Lemma 19. Let a, b two positive scalars such that b ≤ 1. For positive scalars x, y such that x ≤ y, the following holds: a a + x a + y ≤ ≤ b b + x b + y Proof. The first inequality follows by observing that a ≤ b ⇒ ax ≤ bx ⇒ ab + ax ≤ ab + bx and then rearranging terms. Since (y − x) ≥ 0, the second inequality follows in similar way.

2 kQ1(q)−P1(y)k a Now, we will express 2 as and then show that same ratio for other transfor- kQ1(q)−P1(x)k b a+x  mation functions can be represented as b+x for some positive x. Invoking Lemma 19 then 2  q>y  will yield the desired result. Note that, kQ1(q) − P1(y)k = 2 1 − βkqk . Therefore,

2 > kQ1(q) − P1(y)k (βkqk − q y) 2 = > (C.5) kQ1(q) − P1(x)k (βkqk − q x)

2  q>y  Now, kQ2(q) − P2(y)k = 2 1 − β2 . Therefore,

2 2 > > 2 kQ2(q) − P2(y)k (β1 − q y) (βkqk − q y) + (β1 − βkqk) 2 = 2 > = > 2 (C.6) kQ2(q) − P2(x)k (β1 − q x) (βkqk − q x) + (β1 − βkqk) 2 Combining Lemma 19, Equation C.5 and the fact that βkqk ≤ β1 · β1 = β1 , we get, kQ1(q)−P1(y)k ≤ kQ2(q)−P2(y)k . Next, kQ1(q)−P1(x)k kQ2(q)−P2(x)k

2 2 2 > kQ3(q) − P3(y)k (kqk + β − 2q y) 2 = 2 2 > kQ3(q) − P3(x)k (kqk + β − 2q x) 2 2 ( kqk +β − q>y) = 2 kqk2+β2 > ( 2 − q x) 2 2 (βkqk − q>y) + ( kqk +β − βkqk) = 2 (C.7) > kqk2+β2 (βkqk − q x) + ( 2 − βkqk) Note that (kqk2 + β2)/2 ≥ βkqk. This follows from the fact that (kqk − β)2 ≥ 0 for any value of kqk and β. Therefore, combining this with Equation C.5 and Lemma 19 we get, kQ1(q)−P1(y)k ≤ kQ3(q)−P3(y)k . Now, kQ1(q)−P1(x)k kQ3(q)−P3(x)k

89 m+1  m 2q>x kxk2 kQ (q) − P (x)k2 = 1 + − + 4 4 4 αkqk α m+1 2q>x m  kxk2 = 2 − + − 1 + αkqk 4 α 2(αkqk − q>x) + αkqk m − 1 + αkqkf(x, m) = 4 αkqk  > αkqk m  αkqk  2 (αkqk − q x) + 2 4 − 1 + 2 f(x, m) = αkqk 2 (βkqk − q>x) + βkqk c (1 + m ) − 1 = 2 4 αkqk  αkqk  2 2 f(x, m) + (C.8) αkqk

2m+1  kxk  where, f(x, m) = α . Therefore,

kQ (q) − P (y)k2 (βkqk − q>y) + βkqk c (1 + m ) − 1 + αkqk f(y, m) 4 4 = 2 4 2 2 c m αkqk kQ4(q) − P4(x)k >  (βkqk − q x) + βkqk 2 (1 + 4 ) − 1 + 2 f(x, m)

2  For m is large enough (m is at least 4 c − 1 so that the second term is non-negative) f(·, m) tends to zero doubly exponentially fast, and therefore, combining this with Equation C.5 and Lemma 19 we get, kQ1(q)−P1(y)k ≤ kQ4(q)−P4(y)k . kQ1(q)−P1(x)k kQ4(q)−P4(x)k

C.4 Proof of Theorem16

Proof. We have shown in Theorem 15 that the following four ratios 2 2 2 2 kQ1(q)−P1(y)k kQ2(q)−P2(y)k kQ3(q)−P3(y)k kQ4(q)−P4(y)k a a+δ1 a+δ2 2 , 2 , 2 and 2 can be expressed as , , kQ1(q)−P1(x)k kQ2(q)−P2(x)k kQ3(q)−P3(x)k kQ4(q)−P4(x)k b b+δ1 b+δ2 a+δ3(y) and respectively (see Equation C.5, C.6, C.7, C.8). Note that in δ3(·) differs in the b+δ3(x) numerator and denominator as it contains terms f(y, m) and f(x, m) respectively. However, since f decreases doubly exponentially fast in m, for large enough m both f(x, m) and f(y, m) approach zero and δ3(y) becomes equal to δ3(x). We will show δ1 ≥ δ2 and for large enough m, δ3 ≥ δ1, then invoking Lemma 19 will yield the desired result. Note 2 kqk2+β2 that δ1 = β1 − βkqk and δ2 = 2 − βkqk. Since kqk ≤ β1 and β ≤ β1, we have 2 2 2 2 kqk +β β1 +β1 δ2 = 2 − βkqk ≤ 2 − βkqk = δ1. Next, note that,  c  m  αkqk δ = βkqk 1 + − 1 + f(x, m) 3 2 4 2 βkqkc  m a ≥ 1 + − βkqk ≥ δ 2 4 1

90   2   βkqkc m  2 2 β1 Inequality a holds as long as, 2 1 + 4 ≥ β1 =⇒ m ≥ 4 c βkq − 1 or for any 2 2 8β1 β1 kqk m ≥ cβkqk . Note that when kqk ≥ β, β1 = max{β, kqk} = kqk, and therefore, βkqk = β . 2 β1 β Alternatively, when β ≥ kqk, β1 = max{β, kqk} = β, and therefore, βkqk = kqk . Therefore, as 8 n kqk β o long as m is large enough (at least c max β , kqk ) the statement of the theorem holds.

C.5 Proof of Corallary17

Proof. The corollary follows imemdiately from Theorem 16, definition of transformations T1 , T2 , T3 , T4 and equation 4.3.

91