
The CRO Kernel: Using Concomitant Rank Order Hashes for Sparse High Dimensional Randomized Feature Maps

Kave Eshghi and Mehran Kafai
Hewlett Packard Labs
1501 Page Mill Rd., Palo Alto, CA 94304
{kave.eshghi, [email protected]

Abstract—Kernel methods have been shown to be effective for many machine learning tasks such as classification, clustering and regression. In particular, support vector machines with the RBF kernel have proved to be powerful classification tools. The standard way to apply kernel methods is to use the ’kernel trick’, where the inner product of the vectors in the feature space is computed via the kernel function. Using the kernel trick for SVMs, however, leads to training that is quadratic in the number of input vectors and classification that is linear with the number of support vectors. We introduce a new kernel, the CRO (Concomitant Rank Order) kernel, that approximates the RBF kernel for unit length input vectors. We also introduce a new randomized feature map, based on concomitant rank order hashing, that produces sparse, high dimensional feature vectors whose inner product asymptotically equals the CRO kernel. Using the Hadamard transform for computing the CRO hashes ensures that the cost of computing feature vectors is very low. Thus, for unit length input vectors, we get the accuracy of the RBF kernel with the efficiency of a sparse high dimensional linear kernel. We show the efficacy of our approach using a number of standard datasets.

I. INTRODUCTION

Kernel methods have been shown to be effective for many machine learning tasks such as classification, clustering and regression. The theory behind kernel methods relies on a mapping between the input space and the feature space such that the inner product of the vectors in the feature space can be computed via the kernel function, aka the ’kernel trick’. The kernel trick is used because a direct mapping to the feature space is expensive or, in the case of the RBF kernel, impossible, since the feature space is infinite dimensional.

The canonical example is Support Vector Machine (SVM) classification with the Gaussian kernel. It has been shown that for many types of data, their classification accuracy far surpasses linear SVMs. For example, for the MNIST [1] handwritten number recognition dataset, SVM with the RBF kernel achieves accuracy of 98.6%, whereas linear SVM can only achieve 92.7%.

For SVMs, the main drawback of the kernel trick is that both training and classification can be expensive. Training is expensive because the kernel function must be applied for each pair of the training samples, making the training task at least quadratic in the number of training samples. Classification is expensive because for each classification task the kernel function must be applied for each of the support vectors, whose number may be large. As a result, kernel SVMs are rarely used when the number of training instances is large or for online applications where classification must happen very fast. Many approaches have been proposed in the literature to overcome these efficiency problems with non-linear kernel SVMs.

The situation is very different with sparse, high dimensional input vectors and linear kernels. For this class of problems, there are efficient algorithms for both training and classification. One implementation is LIBLINEAR [2], from the same group that implemented LIBSVM [3]. When the data fits this model, e.g. for text classification, these algorithms are very effective. But when the data does not fit this model, e.g. for image classification, these algorithms are not particularly efficient and the classification accuracy is low.

We introduce a new kernel Kγ(A, B) defined as

    K_\gamma(A, B) = \Phi_2(\gamma, \gamma, \cos(A, B))    (1)

where γ is a constant, and Φ2(x, y, ρ) is the CDF of the standard bivariate normal distribution with correlation ρ.

We also introduce the randomized feature map Fγ,Q(A), where Q ∈ R^{U×N} is an instance of an iid matrix of standard normal random variables. We prove that

    E_{U \to \infty}\left[ \frac{F_{\gamma,Q}(A) \cdot F_{\gamma,Q}(B)}{U} \right] = K_\gamma(A, B)    (2)
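As a concrete illustration of eq. (1), the following is a minimal sketch in Python, not taken from the paper, of evaluating Kγ(A, B) numerically with SciPy's bivariate normal CDF; the function name cro_kernel, the choice γ = 1.0, and the input dimension are ours.

# Minimal sketch (not from the paper) of evaluating the CRO kernel of eq. (1).
# Assumes NumPy/SciPy; gamma = 1.0 and the dimension 64 are arbitrary choices.
import numpy as np
from scipy.stats import multivariate_normal


def cro_kernel(a, b, gamma=1.0):
    """K_gamma(A, B) = Phi_2(gamma, gamma, cos(A, B)), where Phi_2 is the CDF of
    the standard bivariate normal distribution with correlation cos(A, B)."""
    rho = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([gamma, gamma])


a = np.random.randn(64); a /= np.linalg.norm(a)
b = np.random.randn(64); b /= np.linalg.norm(b)
print(cro_kernel(a, b))   # a value in (0, 1), larger for more similar inputs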
To perform the mapping from the input space to the feature space, we use a variant of the concomitant rank order hash function [4], [5]. Relying on a result first presented in [5], we use a random permutation followed by a Hadamard transform to compute the random projection that is at the heart of this operation. The resulting algorithm for computing the feature map is highly efficient.

The proposed kernel K and feature map F have interesting properties:

• The kernel approximates the RBF kernel on the unit sphere.
• The feature map is sparse and high dimensional.
• The feature map can be computed very efficiently.

Thus, for the class of problems where this kernel is effective, we have the best of both worlds: the accuracy of the RBF kernel, and the efficiency of the sparse, high dimensional linear models.

Along the way, we prove a new result in the theory of concomitant rank order statistics for bivariate normal distributions, given in Theorem 2 in the Appendix.

We show the efficacy of our approach, in terms of classification accuracy, training time, and classification time, on a number of standard datasets. We also make a detailed comparison with alternative approaches for randomized mapping to the feature space presented in [6], [7], [8].

II. RELATED WORK

Reducing the training and classification cost of non-linear SVMs has attracted a great deal of attention in the literature. Joachims et al. [9] use basis vectors other than support vectors to find sparse solutions that speed up training and prediction. Segata et al. [10] use local SVMs on redundant neighborhoods and choose the appropriate model at query time. In this way, they divide the large SVM problem into many small local SVM problems. Tsang et al. [11] re-formulate kernel methods as minimum enclosing ball (MEB) problems in computational geometry, and solve them via an efficient approximate MEB algorithm, leading to the idea of core sets. Nandan et al. [12] choose a subset of the training data, called the representative set, to reduce the training time. This subset is chosen using an algorithm based on convex hulls and extreme points.

A number of approaches compute approximations to the feature vectors and use linear SVM on these vectors. Chang et al. [13] do an explicit mapping of the input vectors into a low degree polynomial feature space, and then apply fast linear SVMs for classification. Vedaldi et al. [14] introduce explicit feature maps for the additive kernels, such as the intersection, Hellinger’s, and χ2 kernels.

Weinberger et al. [15] use hashing to reduce the dimensionality of the input vectors. Litayem et al. [16] use hashing to reduce the size of the input vectors and speed up the prediction phase of linear SVM. Su et al. [17] use sparse projection to reduce the dimensionality of the input vectors while preserving the kernel function.

Rahimi et al. [6] map the input data to a randomized low-dimensional feature space using sinusoids randomly drawn from the Fourier transform of the kernel function. Quoc et al. [7] replace the random matrix proposed in [6] with an approximation that allows for fast multiplication. Pham et al. [8] introduce Tensor Sketching, a method for approximating polynomial kernels which relies on fast convolution of Count Sketches. Both [7] and [8] improve upon Rahimi's work [6] in terms of time and storage complexity [18]. Raginsky et al. [19] compute locality sensitive hashes where the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel. They use the results in [6] for this purpose.
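To make this class of randomized baselines concrete, the following is a brief Python sketch of the random Fourier feature construction of Rahimi et al. [6] for the RBF kernel exp(−‖x − y‖²/(2σ²)); it is illustrative only and not part of the paper, and the feature count D, bandwidth σ, and input dimension are arbitrary choices.

# Sketch of random Fourier features for the RBF kernel, in the spirit of
# Rahimi et al. [6]; D (number of random features) and sigma are arbitrary.
import numpy as np


def rff_map(X, D=1024, sigma=1.0, seed=0):
    """Map rows of X to D-dimensional random Fourier features so that
    z(x) . z(y) approximates exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(N, D))  # frequencies drawn from the kernel's Fourier transform
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)


X = np.random.randn(2, 64)
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-length inputs, as in this paper's setting
Z = rff_map(X)
print(Z[0] @ Z[1], np.exp(-np.sum((X[0] - X[1]) ** 2) / 2.0))  # approximation vs. exact RBF value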
III. THE CONCOMITANT RANK ORDER (CRO) KERNEL AND FEATURE MAP

A. Notation

1) Φ(x): We use Φ(x) to denote the CDF of the standard normal distribution N(0, 1), and φ(x) to denote its PDF.

    \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}    (3)

    \Phi(x) = \int_{-\infty}^{x} \phi(u)\, du    (4)

2) Φ2(x, y, ρ): We use Φ2(x, y, ρ) to denote the CDF of the standard bivariate normal distribution

    N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right)

and φ2(x, y, ρ) to denote its PDF.

    \phi_2(x, y, \rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\!\left( \frac{-x^2 - y^2 + 2\rho x y}{2(1-\rho^2)} \right)    (5)

    \Phi_2(x, y, \rho) = \int_{-\infty}^{x} \int_{-\infty}^{y} \phi_2(u, v, \rho)\, du\, dv    (6)

3) R(x, A): We use R(x, A) to denote the rank of x in A, defined as follows:

Definition 1. For the scalar x and vector A ∈ R^N, R(x, A) is the count of the elements of A which are less than or equal to x.

B. The CRO Kernel

Definition 2. Let A, B ∈ R^N. Let γ ∈ R. Then the kernel Kγ(A, B) is defined as:

    K_\gamma(A, B) = \Phi_2(\gamma, \gamma, \cos(A, B))    (7)

In order for Kγ(A, B) to be admissible as a kernel for support vector machines, it needs to satisfy the Mercer conditions. Theorem 3 in the Appendix proves this.

In Proposition 4 in the Appendix we derive the following equation for Kγ(A, B):

    K_\gamma(A, B) = \Phi^2(\gamma) + \int_{0}^{\cos(A,B)} \frac{\exp\!\left( \frac{-\gamma^2}{1+\rho} \right)}{2\pi\sqrt{1-\rho^2}}\, d\rho    (8)

C. CRO Kernel Approximates RBF Kernel on the Unit Sphere

Let

    \alpha = \Phi(\gamma)    (9)

From the definition of the bivariate normal distribution we can show that

    \Phi_2(\gamma, \gamma, 0) = \alpha^2    (10)

    \Phi_2(\gamma, \gamma, 1) = \alpha    (11)

Thus

    \log(\Phi_2(\gamma, \gamma, 0)) = 2 \log(\alpha)    (12)

    \log(\Phi_2(\gamma, \gamma, 1)) = \log(\alpha)    (13)

If we linearly approximate log(Φ2(γ, γ, ρ)) between ρ = 0 and ρ = 1, we get the following:

    \log(\Phi_2(\gamma, \gamma, \rho)) \approx \log(\alpha)(2 - \rho)    (14)

Figure 1 shows the two sides of eq. (14).
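As a small numerical check of this approximation, in the spirit of the comparison that Figure 1 illustrates, the sketch below (not from the paper) evaluates both sides of eq. (14) over a grid of ρ values; γ = 1.5 is an arbitrary choice and SciPy is assumed.

# Numerical check of eq. (14): compare log(Phi_2(gamma, gamma, rho)) with
# log(alpha) * (2 - rho). gamma = 1.5 is an arbitrary illustrative choice.
import numpy as np
from scipy.stats import multivariate_normal, norm

gamma = 1.5
alpha = norm.cdf(gamma)                              # eq. (9)
for rho in np.linspace(0.0, 0.95, 6):
    cov = [[1.0, rho], [rho, 1.0]]
    exact = np.log(multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([gamma, gamma]))
    approx = np.log(alpha) * (2.0 - rho)             # eq. (14)
    print(f"rho={rho:.2f}  log Phi_2={exact:.4f}  linear approx={approx:.4f}")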