The CRO Kernel: Using Concomitant Rank Order Hashes for Sparse High Dimensional Randomized Feature Maps

Kave Eshghi and Mehran Kafai Hewlett Packard Labs 1501 Page Mill Rd. Palo Alto, CA 94304 {kave.eshghi, mehran.kafai}@hpe.com

Abstract—Kernel methods have been shown to be effective for many tasks such as classification, clustering and regression. In particular, support vector machines with the RBF kernel have proved to be powerful classification tools. The standard way to apply kernel methods is to use the 'kernel trick', where the inner product of the vectors in the feature space is computed via the kernel function. Using the kernel trick for SVMs, however, leads to training that is quadratic in the number of input vectors and classification that is linear in the number of support vectors. We introduce a new kernel, the CRO (Concomitant Rank Order) kernel, that approximates the RBF kernel for unit length input vectors. We also introduce a new randomized feature map, based on concomitant rank order hashing, that produces sparse, high dimensional feature vectors whose inner product asymptotically equals the CRO kernel. Using the Hadamard transform for computing the CRO hashes ensures that the cost of computing feature vectors is very low. Thus, for unit length input vectors, we get the accuracy of the RBF kernel with the efficiency of a sparse high dimensional linear kernel. We show the efficacy of our approach using a number of standard datasets.

I. INTRODUCTION

Kernel methods have been shown to be effective for many machine learning tasks such as classification, clustering and regression. The theory behind kernel methods relies on a mapping between the input space and the feature space such that the inner product of the vectors in the feature space can be computed via the kernel function, aka the 'kernel trick'. The kernel trick is used because a direct mapping to the feature space is expensive or, in the case of the RBF kernel, impossible, since the feature space is infinite dimensional.

The canonical example is Support Vector Machine (SVM) classification with the Gaussian kernel. It has been shown that for many types of data, its classification accuracy far surpasses that of linear SVMs. For example, for the MNIST [1] handwritten number recognition dataset, SVM with the RBF kernel achieves an accuracy of 98.6%, whereas linear SVM can only achieve 92.7%.

For SVMs, the main drawback of the kernel trick is that both training and classification can be expensive. Training is expensive because the kernel function must be applied for each pair of the training samples, making the training task at least quadratic in the number of training samples. Classification is expensive because for each classification task the kernel function must be applied for each of the support vectors, whose number may be large. As a result, kernel SVMs are rarely used when the number of training instances is large or for online applications where classification must happen very fast. Many approaches have been proposed in the literature to overcome these efficiency problems with non-linear kernel SVMs.

The situation is very different with sparse, high dimensional input vectors and linear kernels. For this class of problems, there are efficient algorithms for both training and classification. One implementation is LIBLINEAR [2], from the same group that implemented LIBSVM [3]. When the data fits this model, e.g. for text classification, these algorithms are very effective. But when the data does not fit this model, e.g. for image classification, these algorithms are not particularly efficient and the classification accuracy is low.

We introduce a new kernel Kγ(A, B) defined as

K_\gamma(A, B) = \Phi_2(\gamma, \gamma, \cos(A, B))    (1)

where γ is a constant, and Φ2(x, y, ρ) is the CDF of the standard bivariate normal distribution with correlation ρ. We also introduce the randomized feature map Fγ,Q(A), where Q ∈ R^{U×N} is an instance of an iid matrix of standard normal random variables. We prove that

\lim_{U \to \infty} E\!\left[ \frac{F_{\gamma,Q}(A) \cdot F_{\gamma,Q}(B)}{U} \right] = K_\gamma(A, B)    (2)

To perform the mapping from the input space to the feature space, we use a variant of the concomitant rank order hash function [4], [5]. Relying on a result first presented in [5], we use a random permutation followed by a Hadamard transform to compute the random projection that is at the heart of this operation. The resulting algorithm for computing the feature map is highly efficient.

The proposed kernel K and feature map F have interesting properties:
• The kernel approximates the RBF kernel on the unit sphere.
• The feature map is sparse and high dimensional.
• The feature map can be computed very efficiently.

Thus, for the class of problems where this kernel is effective, we have the best of both worlds: the accuracy of the RBF kernel, and the efficiency of sparse, high dimensional linear models.

Along the way, we prove a new result in the theory of concomitant rank order statistics for bivariate normal distributions, given in theorem 2 in the Appendix.

We show the efficacy of our approach, in terms of classification accuracy, training time, and classification time, on a number of standard datasets. We also make a detailed comparison with alternative approaches for randomized mapping to the feature space presented in [6], [7], [8].

II. RELATED WORK

Reducing the training and classification cost of non-linear SVMs has attracted a great deal of attention in the literature. Joachims et al. [9] use basis vectors other than support vectors to find sparse solutions that speed up training and prediction. Segata et al. [10] use local SVMs on redundant neighborhoods and choose the appropriate model at query time. In this way, they divide the large SVM problem into many small local SVM problems. Tsang et al. [11] re-formulate kernel methods as minimum enclosing ball (MEB) problems in computational geometry, and solve them via an efficient approximate MEB algorithm, leading to the idea of core sets. Nandan et al. [12] choose a subset of the training data, called the representative set, to reduce the training time. This subset is chosen using an algorithm based on convex hulls and extreme points.

A number of approaches compute approximations to the feature vectors and use linear SVM on these vectors. Chang et al. [13] do an explicit mapping of the input vectors into a low degree polynomial feature space, and then apply fast linear SVMs for classification. Vedaldi et al. [14] introduce explicit feature maps for the additive kernels, such as the intersection, Hellinger's, and χ2 kernels.

Weinberger et al. [15] use hashing to reduce the dimensionality of the input vectors. Litayem et al. [16] use hashing to reduce the size of the input vectors and speed up the prediction phase of linear SVM. Su et al. [17] use sparse projection to reduce the dimensionality of the input vectors while preserving the kernel function.

Rahimi et al. [6] map the input data to a randomized low-dimensional feature space using sinusoids randomly drawn from the Fourier transform of the kernel function. Quoc et al. [7] replace the random matrix proposed in [6] with an approximation that allows for fast multiplication. Pham et al. [8] introduce Tensor Sketching, a method for approximating polynomial kernels which relies on fast convolution of Count Sketches. Both [7] and [8] improve upon Rahimi's work [6] in terms of time and storage complexity [18]. Raginsky et al. [19] compute locality sensitive hashes where the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel. They use the results in [6] for this purpose.
III. THE CONCOMITANT RANK ORDER (CRO) KERNEL AND FEATURE MAP

A. Notation

1) Φ(x): We use Φ(x) to denote the CDF of the standard normal distribution N(0, 1), and φ(x) to denote its PDF.

\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}    (3)

\Phi(x) = \int_{-\infty}^{x} \phi(u)\, du    (4)

2) Φ2(x, y, ρ): We use Φ2(x, y, ρ) to denote the CDF of the standard bivariate normal distribution

\mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)

and φ2(x, y, ρ) to denote its PDF.

\phi_2(x, y, \rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\!\left( \frac{-x^2 - y^2 + 2\rho x y}{2(1-\rho^2)} \right)    (5)

\Phi_2(x, y, \rho) = \int_{-\infty}^{x} \int_{-\infty}^{y} \phi_2(u, v, \rho)\, du\, dv    (6)

3) R(x, A): We use R(x, A) to denote the rank of x in A, defined as follows:

Definition 1. For the scalar x and vector A ∈ R^N, R(x, A) is the count of the elements of A which are less than or equal to x.

B. The CRO Kernel

Definition 2. Let A, B ∈ R^N. Let γ ∈ R. Then the kernel Kγ(A, B) is defined as:

K_\gamma(A, B) = \Phi_2(\gamma, \gamma, \cos(A, B))    (7)

In order for Kγ(A, B) to be admissible as a kernel for support vector machines, it needs to satisfy the Mercer conditions. Theorem 3 in the Appendix proves this.

In proposition 4 in the Appendix we derive the following equation for Kγ(A, B):

K_\gamma(A, B) = \Phi^2(\gamma) + \int_{0}^{\cos(A,B)} \frac{\exp\!\left( \frac{-\gamma^2}{1+\rho} \right)}{2\pi\sqrt{1-\rho^2}}\, d\rho    (8)

C. CRO Kernel Approximates RBF Kernel on the Unit Sphere

Let

\alpha = \Phi(\gamma)    (9)

From the definition of the bivariate normal distribution we can show that

\Phi_2(\gamma, \gamma, 0) = \alpha^2    (10)

\Phi_2(\gamma, \gamma, 1) = \alpha    (11)

Thus

\log(\Phi_2(\gamma, \gamma, 0)) = 2 \log(\alpha)    (12)

\log(\Phi_2(\gamma, \gamma, 1)) = \log(\alpha)    (13)

If we linearly approximate log(Φ2(γ, γ, ρ)) between ρ = 0 and ρ = 1, we get the following:

\log(\Phi_2(\gamma, \gamma, \rho)) \approx \log(\alpha)(2 - \rho)    (14)

Figure 1 shows the two sides of eq. (14) as ρ goes from 0 to 1. The choice of γ = −2.42617 is for a typical use case of the kernel. α is related to γ through eq. (9).

Fig. 1. Comparison of the two sides of eq. (14) for γ = −2.42617
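The comparison plotted in Fig. 1 is easy to reproduce numerically. The sketch below is illustrative only (it is not the authors' code) and assumes Matlab's Statistics and Machine Learning Toolbox for normcdf and mvncdf.

    gamma = -2.42617;
    alpha = normcdf(gamma);                      % eq. (9)
    rho   = 0.01:0.01:0.99;                      % stay away from rho = 1
    lhs   = zeros(size(rho));
    for k = 1:numel(rho)
        lhs(k) = log(mvncdf([gamma gamma], [0 0], [1 rho(k); rho(k) 1]));
    end
    rhs = log(alpha)*(2 - rho);                  % linear approximation, eq. (14)
    plot(rho, lhs, rho, rhs); xlabel('\rho');
    legend('log \Phi_2(\gamma,\gamma,\rho)', 'log(\alpha)(2-\rho)');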

From eq. (14) we get

\Phi_2(\gamma, \gamma, \rho) \approx e^{\log(\alpha)(2-\rho)}    (15)

\approx \alpha\, e^{\log(\alpha)(1-\rho)}    (16)

Let ρ = cos(A, B). Then from Definition 2 and eq. (16) we get

K_\gamma(A, B) \approx \alpha\, e^{\log(\alpha)(1-\cos(A,B))}    (17)

Let A, B ∈ R^N be on the unit sphere, i.e.

\|A\| = \|B\| = 1    (18)

Then

\frac{\|A-B\|^2}{2} = \frac{\|A\|^2 + \|B\|^2 - 2\|A\|\|B\|\cos(A,B)}{2}    (19)

= 1 - \cos(A, B)    (20)

Replacing the rhs of eq. (20) in eq. (17) we get

K_\gamma(A, B) \approx \alpha\, e^{\log(\alpha)\frac{\|A-B\|^2}{2}}    (21)

But the rhs of eq. (21) is the definition of an RBF kernel with parameter log(α).

Notice that the purpose of the comparison with the RBF kernel is to give us an intuition about Kγ(A, B), to enable us to make meaningful comparisons with implementations that use RBF. To this end, how close is the approximation to the RBF kernel? Figure 2 shows the two sides of eq. (21) as cos(A, B) goes from 0 to 1. Notice that when cos(A, B) is negative the values of both kernels are very close to zero, so we ignore such cases.

Fig. 2. Comparison of the two sides of eq. (21) for γ = −2.42617

We will now describe the feature map that allows us to use the kernel Kγ(A, B) for tasks such as SVM training.

IV. THE CONCOMITANT RANK ORDER (CRO) HASH FUNCTION AND FEATURE MAP

A. The CRO hash function

Definition 3. Let U, τ be positive integers, with τ < U. Let A ∈ R^N and Q ∈ R^{U×N}. Let P = QA. Then

H_{Q,\tau}(A) = \{ j : R(P[j], P) \le \tau \}    (22)

where R is the rank function defined in Definition 1.

We call HQ,τ(A) the CRO hash set of A with respect to Q and τ. In words, the hash set of A with respect to Q and τ is the set of indexes of those elements of P whose rank is less than or equal to τ. If P does not have any repetitions, the following Matlab code returns the hash set:

    [~, ix] = sort(P); H = ix(1:tau);

According to the above definition, the universe from which the hashes are drawn is 1 . . . U, where U is the number of rows of Q.

Theorem 1. Let M be a U×N matrix of iid N(0, 1) random variables. Let Q be an instance of M. Let A, B ∈ R^N. Let 0 < λ < 1 be a constant. Let τ = ⌈λ(U − 1)⌉. Let

\gamma = \Phi^{-1}\!\left( \frac{\tau}{U-1} \right)    (23)

Then

\lim_{U \to \infty} E\!\left[ \frac{|H_{Q,\tau}(A) \cap H_{Q,\tau}(B)|}{U} \right] = \Phi_2(\gamma, \gamma, \cos(A, B))    (24)

Proof: In the Appendix.
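Theorem 1 lends itself to a quick numerical sanity check: draw one instance of Q, build the two hash sets directly from Definition 3, and compare the normalized overlap with Kγ(A, B). The following sketch is ours, not part of the paper, and assumes Matlab's Statistics and Machine Learning Toolbox for norminv and mvncdf.

    N = 50; U = 2^16; lambda = 0.01;
    tau   = ceil(lambda*(U-1));
    gamma = norminv(tau/(U-1));                  % eq. (23)
    A = randn(N,1); B = A + 0.5*randn(N,1);      % two arbitrary input vectors
    c = dot(A,B)/(norm(A)*norm(B));              % cos(A,B)
    Q = randn(U,N);                              % one instance of the iid N(0,1) matrix
    [~,ia] = sort(Q*A); HA = ia(1:tau);          % CRO hash sets per Definition 3
    [~,ib] = sort(Q*B); HB = ib(1:tau);
    empirical = numel(intersect(HA,HB))/U;
    predicted = mvncdf([gamma gamma], [0 0], [1 c; c 1]);   % K_gamma(A,B)
    fprintf('overlap/U = %.2e   kernel = %.2e\n', empirical, predicted);

Averaging over several draws of Q reduces the variance of the empirical estimate.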
B. The CRO Feature Map

In this section we introduce the feature map

F_{\gamma,Q}(A) : \mathbb{R}^N \to \mathbb{R}^U    (25)

This is the function that maps vectors from the input space to sparse, high dimensional vectors in the feature space.

Definition 4. Let A, B, Q, γ, τ, U be defined as in theorem 1. The feature map Fγ,Q(A): R^N → R^U is defined as follows:

F_{\gamma,Q}(A)[j] = \begin{cases} 1 & \text{if } j \in H_{Q,\tau}(A) \\ 0 & \text{if } j \notin H_{Q,\tau}(A) \end{cases}    (26)

Proposition 1 establishes the relationship between the feature map Fγ,Q(A) and the kernel Kγ(A, B).

Proposition 1.

\lim_{U \to \infty} E\!\left[ \frac{F_{\gamma,Q}(A) \cdot F_{\gamma,Q}(B)}{U} \right] = K_\gamma(A, B)    (27)

Proof: It is a simple matter to show that

F_{\gamma,Q}(A) \cdot F_{\gamma,Q}(B) = |H_{Q,\tau}(A) \cap H_{Q,\tau}(B)|    (28)

By theorem 1

\lim_{U \to \infty} E\!\left[ \frac{F_{\gamma,Q}(A) \cdot F_{\gamma,Q}(B)}{U} \right] = \Phi_2(\gamma, \gamma, \cos(A, B))    (29)

Thus, from Definition 2 and eq. (29)

\lim_{U \to \infty} E\!\left[ \frac{F_{\gamma,Q}(A) \cdot F_{\gamma,Q}(B)}{U} \right] = K_\gamma(A, B)    (30)

C. The Feature Map is Binary and Sparse

From Definition 3 and Definition 4 it follows that
• The total number of elements in Fγ,Q(A) is U.
• The number of non-zero elements in Fγ,Q(A) is τ.
• All the non-zero elements of Fγ,Q(A) are 1.

Recall that α = τ/(U−1). We call α the sparsity of the feature map.

As discussed previously, Kγ(A, B) approximates the RBF kernel α e^{log(α)‖A−B‖²/2} on the unit sphere, where α is the sparsity of the feature map. It follows that the relationship between the RBF parameter and the sparsity of the feature map is exponential.

We can interpret eq. (30) as follows: Fγ,Q(A) and Fγ,Q(B) are the projections of A and B into the feature space, whose expected inner product is U Kγ(A, B). What is more, Fγ,Q(A) and Fγ,Q(B) are sparse and high dimensional.

Thus, as long as we can efficiently compute Fγ,Q(A) and Fγ,Q(B), we can use algorithms optimized for sparse linear computations on binary vectors in the feature space. As an example, when the task we are trying to perform is to train a support vector machine, instead of using the kernel trick in the input space, we can use linear SVM with sparse high dimensional vectors in the feature space, using highly efficient sparse linear SVM solvers such as LIBLINEAR [2] to do the training and classification.

D. Computing the CRO Hash Set

The computation of the projection vector P = QA requires U × N operations, which can be very expensive when N is large. To avoid this, we use the scheme described in [5] to compute the CRO hash set for the input vectors.

The CRO hash function maps input vectors to a set of hashes chosen from a universe 1 . . . U, where U is a large integer. We use τ to denote the number of hashes that we require per input vector.

Let A ∈ R^N be the input vector. The hash function takes as a second input a random permutation Π of 1 . . . U. It should be emphasized that the random permutation Π is chosen once and used for hashing all input vectors.

Table I shows the procedure for computing the CRO hash set. Here we use −A to represent the vector A multiplied by −1. We use ⟨A, B, C, . . .⟩ to represent the concatenation of vectors A, B, C, etc.

TABLE I
COMPUTING THE CRO HASH SET FOR INPUT VECTOR A

1) Let Â = ⟨A, −A⟩.
2) Create a repeated input vector A′ = ⟨Â, Â, . . . , Â, 0, . . . , 0⟩, where Â is repeated d = U div |Â| times and is followed by r = U mod |Â| zeros; div represents integer division. Thus |A′| = 2dN + r = U.
3) Apply the random permutation Π to A′ to get the permuted input vector V.
4) Calculate the Hadamard transform of V to get S. If an efficient implementation of the Hadamard transform is not available, we can use another orthogonal transform, for example the DCT transform.
5) Find the indices of the smallest τ members of S. These indices are the hash set of the input vector A.

Table II presents an implementation of the CRO hash function in Matlab.
TABLE II
MATLAB CODE FOR THE CRO HASH FUNCTION

    function hashes = CROHash(A,U,P,tau)
    % A is the input vector.
    % U is the size of the hash universe.
    % P is a random permutation of 1:U,
    %   chosen once and for all and used
    %   in all hash calculations.
    % tau is the desired number of hashes.
    E = zeros(1,U);
    AHat = [A,-A];
    N2 = length(AHat);
    d = floor(U/N2);
    for i = 0:d-1
        E(i*N2+1:(i+1)*N2) = AHat;
    end
    Q = E(P);
    % If an efficient implementation of
    % the Walsh-Hadamard transform is
    % available, we can use it instead, i.e.
    % S = fwht(Q);
    S = dct(Q);
    [~,ix] = sort(S);
    hashes = ix(1:tau);
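To use the hash sets with a sparse linear solver such as LIBLINEAR [2], it is enough to assemble the binary feature matrix of Definition 4 in sparse form. The helper below is our own illustrative sketch (CROFeatureMap is not a function defined in the paper); it calls the CROHash function of Table II on each row of an n-by-N data matrix X.

    function F = CROFeatureMap(X, U, Pi, tau)
        % Pi is a fixed random permutation of 1:U, shared by all input vectors.
        n    = size(X,1);
        rows = repmat((1:n)', 1, tau);
        cols = zeros(n, tau);
        for i = 1:n
            cols(i,:) = CROHash(X(i,:), U, Pi, tau);
        end
        % F(i,j) = 1 iff j is in the hash set of the i-th input vector, per eq. (26).
        F = sparse(rows(:), cols(:), 1, n, U);
    end

The resulting n-by-U matrix has exactly τ non-zeros per row and can be passed directly to a sparse linear SVM trainer.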

V. EXPERIMENTS

We present the experiment results in two sections. In Section V-A we compare the proposed CRO feature map with other approaches that perform random feature mappings with respect to the transformation time. In Section V-B we evaluate CROSVM (CRO kernel + linear SVM) and compare it to LIBLINEAR [2], LIBSVM [3], Fastfood [7] and Tensor Sketch [8].

We've implemented a highly optimized Walsh-Hadamard transform function which we call within the CRO hash function. The built-in Matlab implementation of the Walsh-Hadamard transform (fwht) is slow, so we recommend using the DCT function or a faster implementation of the WHT function.
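The point about fwht can be checked with a one-off timing run; the snippet below is ours, not the authors' benchmark, and assumes the Signal Processing Toolbox for fwht and dct.

    U = 2^18; v = randn(1, U);
    tic; fwht(v); t_wht = toc;                   % built-in Walsh-Hadamard transform
    tic; dct(v);  t_dct = toc;                   % DCT used as the orthogonal transform
    fprintf('fwht: %.3f s   dct: %.3f s\n', t_wht, t_dct);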

A. Transformation time comparison

The random feature mapping introduced by Rahimi et al. [6] works as follows:
• A projection matrix P is created where each row of P is drawn randomly from the Fourier transform of the kernel function.
• To calculate the feature vector for the input vector A, Ap = PA is calculated.
• The feature vector Φ(A) is a vector whose length is twice the length of Ap, and for each coordinate i in Ap, Φ(A)[2i] = cos(Ap[i]) and Φ(A)[2i + 1] = sin(Ap[i]).
The authors show that the expected inner product of the feature vectors is the kernel function applied to the input vectors. Our approach has the following two advantages over the work by Rahimi et al. [6]:
1) We can use the Hadamard transform to compute the feature vectors, whereas they need to do the projection through multiplying the input vector with the projection matrix. This can be more expensive if the input vector is relatively high dimensional.
2) We generate a sparse feature vector (only τ entries out of U in the feature vector are non-zero, where τ << U), whereas their feature vector is dense. Having sparse feature vectors can be advantageous in many circumstances; for example, many training algorithms converge much more quickly with sparse feature vectors.
A small illustrative sketch of Rahimi's construction is given below for reference.
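The following sketch of the random Fourier feature map for the RBF kernel exp(−‖a − b‖²/(2σ²)) is our own illustration, not the code of [6]; the scaling and the ordering of the cosine and sine coordinates follow one common convention and may differ from the layout described above.

    % W is D-by-N with rows drawn i.i.d. from N(0, (1/sigma^2) I), sampled once.
    function phi = randomFourierFeatures(a, W)
        Ap  = W * a;                             % random projection of the input vector a
        phi = [cos(Ap); sin(Ap)] / sqrt(size(W,1));
        % E[ phi(a)' * phi(b) ] approximates exp(-||a-b||^2/(2*sigma^2))
    end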

More recently, Fastfood [7] and Tensor Sketch [8] proposed approaches for random feature mappings which improved upon the work by Rahimi et al. [6] in terms of speed and efficiency. Fastfood generates an approximate mapping for the RBF kernel using the Walsh-Hadamard transform. Tensor Sketch approximates polynomial kernels using tensor sketch convolution.

Classification, or learning in general, using random feature mapping approaches requires three main steps: transformation, training, and testing. In this section we compare the CRO kernel with Fastfood [7] and Tensor Sketch [8] in terms of transformation time.

Figure 3 compares the CRO kernel, Fastfood, and Tensor Sketch in terms of transformation time with increasing input space dimensionality d. The input size n is equal to 10,000 and the feature dimensionality D is set to 4096. The results show that Tensor Sketch transformation time increases linearly with increasing d, whereas Fastfood and the CRO kernel show a logarithmic dependency on d. Transformation via the CRO kernel is faster than Fastfood for the given parameter values for all experimented values of d.

Fig. 3. Transformation time with increasing input space dimensionality d

Figure 4 illustrates the linear dependency between transformation time and input size n for all three approaches. For this experiment we used face image data with dimensionality d = 1024. Also, the feature space dimensionality D was set to 4096. The results show that the transformation time for Fastfood increases at a higher linear rate compared to Tensor Sketch and the CRO kernel, and that transformation via the CRO kernel is 15% faster than Tensor Sketch.

Fig. 4. Transformation time (in seconds) with increasing input size n

Figure 5 presents the transformation time comparison between the three approaches with increasing values of feature space dimensionality D. All three approaches show a linear increase in transformation time when increasing feature space dimensionality D; however, transformation using the CRO kernel has a smaller average rate of change.

Fig. 5. Transformation time as a function of feature space dimensionality D

B. Evaluation of CROSVM

CROSVM is the combination of the CRO kernel and linear SVM. We evaluated CROSVM on three publicly available datasets listed in Table IV, all downloaded from the LIBSVM [3] website.

TABLE IV
DATASET INFORMATION

    dataset    training size    testing size    attributes
    mnist      60,000           10,000          780
    w8a        49,749           14,951          300
    covtype    500,000          81,012          54

The largest dataset is the Covertype dataset (covtype), which includes 581,012 instances and 54 attributes. We chose these datasets because the RBF kernel shows significant improvement in accuracy over the linear kernel. Thus, for example, we did not include the adult dataset, since for this dataset linear SVMs are as good as non-linear ones.

Table V presents the classification results on the MNIST dataset [1] for LIBLINEAR, LIBSVM, CROSVM, Fastfood, and Tensor Sketch. LIBLINEAR is used as the linear classifier for CROSVM, Fastfood, and Tensor Sketch.

TABLE V
CLASSIFICATION RESULTS ON MNIST DATASET

    method          trans. time    training time    testing time    error rate
    LIBLINEAR       0              9.1              0.01            8.4%
    CROSVM          11             12.6             0.1             1.5%
    LIBSVM RBF      0              465              104             1.4%
    Fastfood        28             265              16              1.9%
    Tensor Sketch   14             254              9               2.6%

In comparison with Fastfood and Tensor Sketch, CROSVM delivers the fastest transformation time; however, the main advantage of CROSVM is related to the training time. The output of the CRO kernel is a sparse vector, whereas Fastfood and Tensor Sketch generate dense vectors. As a result, the training time for CROSVM is significantly less than for Fastfood and Tensor Sketch. CROSVM performs similarly to LIBSVM in terms of classification accuracy but in less time: CROSVM achieves 98.54% classification accuracy whereas LIBSVM achieves 98.57%; however, CROSVM takes 12.6 seconds for SVM training compared to 465 seconds for LIBSVM. LIBLINEAR has the fastest training and testing time; however, its error rate is higher compared to the other approaches.

Table III presents the classification accuracy and the training and testing time for LIBSVM, LIBLINEAR and CROSVM on w8a and covtype. The time reported in Table III for CROSVM includes only the time required for training and predicting. Table VI presents the processing time breakdown for the entire CROSVM process, which includes mapping the input vectors into the sparse, high dimensional feature vectors and then performing linear SVM. The total time required for CROSVM is still much less compared to LIBSVM with the RBF kernel.

TABLE III
CLASSIFICATION ACCURACY AND PROCESSING TIME COMPARISON (SECONDS)

               LIBLINEAR                       LIBSVM RBF                      CROSVM
    dataset    training  testing  accuracy     training  testing  accuracy     training  testing  accuracy
    w8a        0.3       .01      98.34%       33        2.9      99.39%       7.1       .02      99.45%
    covtype    6.2       .02      67.14%       10228     416      84.42%       610       .1       85.67%

TABLE VI
PROCESSING TIME BREAKDOWN FOR CROSVM

               Training time (seconds)            Testing time (seconds)
    dataset    hash    liblinear train    total    hash    liblinear predict    total
    w8a        3.7     7.1                10.9     1.1     0.02                 1.12
    covtype    55      610                665      8.9     0.1                  9

On the covtype dataset CROSVM achieves higher accuracy than LIBSVM, while requiring less time for training and testing.

The accuracy of CROSVM is not very sensitive to the values of U and τ. Table VII shows the CROSVM cross-validation accuracy on the MNIST dataset for different values of log2(U) and τ.

TABLE VII
MNIST CROSS VALIDATION ACCURACY WITH CROSVM

    τ \ log2(U)    15       16       17       18       19
    400            98.03    98.16    98.12    98.14    98.11
    500            98.10    98.20    98.25    98.25    98.24
    600            98.17    98.25    98.31    98.32    98.30
    700            98.12    98.32    98.44    98.44    98.43
    800            98.21    98.32    98.50    98.49    98.46
    900            98.21    98.32    98.50    98.48    98.46
    1000           98.20    98.37    98.54    98.53    98.52

We used the grid function from the LIBSVM toolbox to find the best cost (c) and gamma (g) parameter values for the experiments with LIBSVM. For LIBLINEAR, we chose the value of the parameter s (solver type) which achieves the highest accuracy; s = 1 is L2-regularized L2-loss support vector classification (dual) and s = 2 is L2-regularized L2-loss support vector classification (primal). For CROSVM, we performed a cross-validation search for each dataset to find the optimal values for parameters U and τ. Table VIII presents the parameter values used in the experiments for each dataset.

TABLE VIII
EXPERIMENT PARAMETERS FOR EACH DATASET

               LIBLINEAR    LIBSVM              CROSVM
    dataset    s            c        g          log2(U)    τ
    mnist      2            64       0.03125    17         1000
    w8a        1            8        8          13         500
    covtype    2            32768    8          16         500

C. Space Complexity

The space complexity of our approach is O(nτ), whereas Fastfood and Tensor Sketch both have a space complexity of O(nD). For example, in the results reported in Table V, τ = 1000 and D = 4096.

VI. CONCLUSION

We introduced a highly efficient feature map that transforms vectors in the input space to sparse, high dimensional vectors in the feature space, where the inner product in the feature space is the CRO kernel. We showed that the CRO kernel approximates the RBF kernel for unit length input vectors. This allows us to use very efficient linear SVM algorithms that have been optimized for high dimensional sparse vectors. The results show that we can achieve the same accuracy as non-linear RBF SVMs with linear training time and constant classification time. This approach can enable many new time-sensitive applications where accuracy does not need to be sacrificed for training and classification speed.
REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[2] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, June 2008.
[3] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Trans. on Intelligent Systems and Technology, 2011.
[4] K. Eshghi and S. Rajaram, "Locality sensitive hash functions based on concomitant rank order statistics," in KDD, 2008, pp. 221–229.
[5] M. Kafai, K. Eshghi, and B. Bhanu, "Discrete cosine transform locality-sensitive hashes for face retrieval," IEEE Trans. on Multimedia, no. 4, pp. 1090–1103, June 2014.
[6] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Info. Proc. Systems, 2007, pp. 1177–1184.
[7] Q. Le, T. Sarlós, and A. Smola, "Fastfood: Approximate kernel expansions in loglinear time," in Int. Conf. on Machine Learning, 2013.
[8] N. Pham and R. Pagh, "Fast and scalable polynomial kernels via explicit feature maps," in 19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2013, pp. 239–247.
[9] T. Joachims and C.-N. J. Yu, "Sparse kernel SVMs via cutting-plane training," Mach. Learn., vol. 76, no. 2-3, pp. 179–193, Sep. 2009.
[10] N. Segata and E. Blanzieri, "Fast and scalable local kernel machines," Journal of Machine Learning Research, vol. 11, pp. 1883–1926, 2010.
[11] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, "Core vector machines: Fast SVM training on very large data sets," in Journal of Machine Learning Research, 2005, pp. 363–392.
[12] M. Nandan, P. P. Khargonekar, and S. S. Talathi, "Fast SVM training using approximate extreme points," Journal of Machine Learning Research, vol. 15, no. 1, pp. 59–98, 2014.
[13] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, "Training and testing low-degree polynomial data mappings via linear SVM," J. of Machine Learning Research, vol. 11, pp. 1471–1490, Aug. 2010.
[14] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 480–492, 2012.
[15] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, "Feature hashing for large scale multitask learning," in Int. Conf. on Machine Learning (ICML). ACM, 2009, pp. 1113–1120.
[16] S. Litayem, A. Joly, and N. Boujemaa, "Hash-based support vector machines approximation for large scale prediction," in British Machine Vision Conference, 2012, pp. 1–11.
[17] Y.-C. Su, T.-H. Chiu, Y.-H. Kuo, C.-Y. Yeh, and W. Hsu, "Scalable mobile visual classification by kernel preserving projection over high-dimensional features," IEEE Trans. on Multimedia, vol. 16, no. 6, pp. 1645–1653, Oct. 2014.
[18] J. von Tangen Sivertsen, "Scalable learning through linearithmic time kernel approximation techniques," Master's thesis, IT University of Copenhagen, 2014.
[19] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Advances in Neural Information Processing Systems, 2009, pp. 1509–1517.
[20] M. M. Siddiqui, "Distribution of quantiles in samples from a bivariate population," Journal of Research of the National Institute of Standards and Technology, 1960.
[21] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199–222, Aug. 2004.
[22] O. A. Vasicek, "A series expansion for the bivariate normal integral," Journal of Computational Finance, Jul. 2000.
[23] M. Sibuya, "Bivariate extreme statistics, I," Annals of the Institute of Statistical Mathematics, vol. 11, no. 2, pp. 195–210, 1959.
VII. APPENDIX

A. Concomitant Rank

Theorem 1. Let M be a U×N matrix of iid N(0, 1) random variables. Let Q be an instance of M. Let A, B ∈ R^N. Let 0 < λ < 1 be a constant. Let τ = ⌈λ(U − 1)⌉. Let γ = Φ^{-1}(τ/(U−1)). Then

\lim_{U \to \infty} E\!\left[ \frac{|H_{Q,\tau}(A) \cap H_{Q,\tau}(B)|}{U} \right] = \Phi_2(\gamma, \gamma, \cos(A, B))    (31)

Proof: First, we note that for any scalar s > 0 and input vector A, HQ,τ(sA) = HQ,τ(A). This is because multiplying the input vector by a positive scalar simply scales the elements in QA, and does not change the order of the elements. We also note that for any scalars s1 ≠ 0, s2 ≠ 0, cos(A, B) = cos(s1 A, s2 B).

Let \bar{A} = \frac{A}{\|A\|\sqrt{U}} and \bar{B} = \frac{B}{\|B\|\sqrt{U}}. We will prove the following:

\lim_{U \to \infty} E\!\left[ \frac{|H_{Q,\tau}(\bar{A}) \cap H_{Q,\tau}(\bar{B})|}{U} \right] = \Phi_2(\gamma, \gamma, \cos(\bar{A}, \bar{B}))    (32)

which, by the argument above, is sufficient to prove eq. (31). Consider the vectors

x = Q\bar{A}    (33)

y = Q\bar{B}    (34)

Let

\rho = \cos(\bar{A}, \bar{B})    (35)

It is possible to prove that \binom{x_1}{y_1}, \binom{x_2}{y_2}, \ldots, \binom{x_U}{y_U} are iid samples from a bivariate normal distribution, where for all 1 ≤ i ≤ U,

\begin{pmatrix} x_i \\ y_i \end{pmatrix} \xrightarrow{D} \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)    (36)

Now consider the following question: for any 1 ≤ i ≤ U, what is the probability that

i \in H_{Q,\tau}(\bar{A}) \cap H_{Q,\tau}(\bar{B})

From Definition 3 and the definition of x and y it follows that

i \in H_{Q,\tau}(\bar{A}) \iff R(x_i, x) \le \tau    (37)

i \in H_{Q,\tau}(\bar{B}) \iff R(y_i, y) \le \tau    (38)

where R is the rank function introduced in Definition 1. Thus

i \in H_{Q,\tau}(\bar{A}) \cap H_{Q,\tau}(\bar{B}) \iff (R(x_i, x) \le \tau) \wedge (R(y_i, y) \le \tau)    (39)

By theorem 2

\lim_{U \to \infty} P[R(x_i, x) \le \tau \wedge R(y_i, y) \le \tau] = \Phi_2(\gamma, \gamma, \rho)    (40)

Substituting in eq. (39)

\lim_{U \to \infty} P\!\left[ i \in H_{Q,\tau}(\bar{A}) \cap H_{Q,\tau}(\bar{B}) \right] = \Phi_2(\gamma, \gamma, \rho)    (41)

Since 1 ≤ i ≤ U, from eq. (41) it follows that

\lim_{U \to \infty} E\!\left[ |H_{Q,\tau}(\bar{A}) \cap H_{Q,\tau}(\bar{B})| \right] = U\, \Phi_2(\gamma, \gamma, \rho)    (42)

\lim_{U \to \infty} E\!\left[ \frac{|H_{Q,\tau}(\bar{A}) \cap H_{Q,\tau}(\bar{B})|}{U} \right] = \Phi_2(\gamma, \gamma, \rho)    (43)
Theorem 2. Let U be a positive integer. Let 0 < λ < 1 be a constant. Let t = ⌈λ(U − 1)⌉. Let γ = Φ^{-1}(t/(U−1)). Let

S = \left( \begin{pmatrix} x_1 \\ y_1 \end{pmatrix}, \begin{pmatrix} x_2 \\ y_2 \end{pmatrix}, \ldots, \begin{pmatrix} x_U \\ y_U \end{pmatrix} \right)

be a sample from Φ2(x, y, ρ). Let x = (x_1, x_2, \ldots, x_U)^T and y = (y_1, y_2, \ldots, y_U)^T. Let 1 ≤ i ≤ U. Then

\lim_{U \to \infty} P[R(x_i, x) \le t \wedge R(y_i, y) \le t] = \Phi_2(\gamma, \gamma, \rho)    (44)

Proof: First, we note that since x and y are samples from a normal distribution, the probability that they have duplicate elements is zero. So we will assume that they do not have duplicate elements. Consider the sample

S_- = \left( \begin{pmatrix} x_1 \\ y_1 \end{pmatrix}, \ldots, \begin{pmatrix} x_{i-1} \\ y_{i-1} \end{pmatrix}, \begin{pmatrix} x_{i+1} \\ y_{i+1} \end{pmatrix}, \ldots, \begin{pmatrix} x_U \\ y_U \end{pmatrix} \right)

which is S with \binom{x_i}{y_i} taken out. Let

x_- = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_U)^T

y_- = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_U)^T

Since there are no duplicates in x_- or y_-, there are unique elements x̂ and ŷ in x_- and y_- such that

R(\hat{x}, x_-) = t

R(\hat{y}, y_-) = t

It is not too difficult to show that

R(x_i, x) \le t \wedge R(y_i, y) \le t \iff x_i < \hat{x} \wedge y_i < \hat{y}    (45)

That is,

P[R(x_i, x) \le t \wedge R(y_i, y) \le t] = P[x_i < \hat{x} \wedge y_i < \hat{y}]    (46)

Let

q = \Phi^{-1}(\lambda)    (47)

Then, when we apply proposition 3 to S_-, we get the following:

\begin{pmatrix} \hat{x} \\ \hat{y} \end{pmatrix} \xrightarrow[U \to \infty]{D} \mathcal{N}\!\left( \begin{pmatrix} q \\ q \end{pmatrix}, \begin{pmatrix} \frac{s^2}{U-1} & \frac{r}{U-1} \\ \frac{r}{U-1} & \frac{s^2}{U-1} \end{pmatrix} \right)    (48)

with s and r constants solely dependent on λ.

Now, by assumption,

\begin{pmatrix} x_i \\ y_i \end{pmatrix} \xrightarrow{D} \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)    (49)

We know that \binom{x_i}{y_i} and \binom{\hat{x}}{\hat{y}} are instances of independent random variables with distributions eq. (48) and eq. (49). Using the standard procedure for the subtraction of two independent bivariate normal random variables, we can derive

\begin{pmatrix} x_i - \hat{x} \\ y_i - \hat{y} \end{pmatrix} \xrightarrow[U \to \infty]{D} \mathcal{N}\!\left( \begin{pmatrix} -q \\ -q \end{pmatrix}, \begin{pmatrix} 1 + \frac{s^2}{U-1} & \rho + \frac{r}{U-1} \\ \rho + \frac{r}{U-1} & 1 + \frac{s^2}{U-1} \end{pmatrix} \right)    (50)

Now, by definition,

t = \lceil \lambda(U-1) \rceil    (51)

Thus for some 0 < δ < 1

t = \lambda(U-1) + \delta    (52)

\frac{t}{U-1} = \lambda + \frac{\delta}{U-1}    (53)

\lim_{U \to \infty} \frac{t}{U-1} = \lambda    (54)

\lim_{U \to \infty} \Phi^{-1}\!\left( \frac{t}{U-1} \right) = \Phi^{-1}(\lambda)    (55)

\lim_{U \to \infty} \gamma = q    (56)

It is also clear that

\lim_{U \to \infty} \left( 1 + \frac{s^2}{U-1} \right) = 1    (57)

\lim_{U \to \infty} \left( \rho + \frac{r}{U-1} \right) = \rho    (58)

Thus, from eq. (50), eq. (56), eq. (57) and eq. (58) we can derive

\begin{pmatrix} x_i - \hat{x} \\ y_i - \hat{y} \end{pmatrix} \xrightarrow[U \to \infty]{D} \mathcal{N}\!\left( \begin{pmatrix} -\gamma \\ -\gamma \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)    (59)

Thus by proposition 2

\lim_{U \to \infty} P[x_i - \hat{x} < 0 \wedge y_i - \hat{y} < 0] = \Phi_2(\gamma, \gamma, \rho)    (60)

\lim_{U \to \infty} P[x_i < \hat{x} \wedge y_i < \hat{y}] = \Phi_2(\gamma, \gamma, \rho)    (61)

Substituting the left hand side of eq. (61) in the right hand side of eq. (46) we get

\lim_{U \to \infty} P[R(x_i, x) \le t \wedge R(y_i, y) \le t] = \Phi_2(\gamma, \gamma, \rho)    (62)

Proposition 2. Let

\begin{pmatrix} x \\ y \end{pmatrix} \xrightarrow{D} \mathcal{N}\!\left( \begin{pmatrix} q \\ q \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)    (63)

Then

P(x < 0 \wedge y < 0) = \Phi_2(-q, -q, \rho)

Proof: Let \binom{u}{v} = \binom{x - q}{y - q}. Then from eq. (63) it easily follows that

\begin{pmatrix} u \\ v \end{pmatrix} \xrightarrow{D} \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)    (64)

Thus, by eq. (64) and the definition of Φ2

P[u < -q \wedge v < -q] = \Phi_2(-q, -q, \rho)    (65)

By definition, u = x − q and v = y − q. Substituting in eq. (65) we get

P[x - q < -q \wedge y - q < -q] = \Phi_2(-q, -q, \rho)    (66)

P[x < 0 \wedge y < 0] = \Phi_2(-q, -q, \rho)    (67)

Proposition 3. Let \binom{x_1}{y_1}, \binom{x_2}{y_2}, \ldots, \binom{x_n}{y_n} be a sample from Φ2(x, y, ρ). Let 0 < λ < 1 be a constant. Let

t = \lceil \lambda n \rceil

q = \Phi^{-1}(\lambda)

s = \frac{\sqrt{\lambda(1-\lambda)}}{\phi(q)}

r = \frac{\Phi_2(q, q, \rho) - \lambda^2}{\phi^2(q)}

Let X_{(1:n)}, X_{(2:n)}, \ldots, X_{(n:n)} be the order statistics on x_1, x_2, \ldots, x_n and Y_{(1:n)}, Y_{(2:n)}, \ldots, Y_{(n:n)} the order statistics on y_1, y_2, \ldots, y_n. Then, as n → ∞,

\begin{pmatrix} X_{(t:n)} \\ Y_{(t:n)} \end{pmatrix} \xrightarrow{D} \mathcal{N}\!\left( \begin{pmatrix} \Phi^{-1}(\lambda) \\ \Phi^{-1}(\lambda) \end{pmatrix}, \begin{pmatrix} \frac{s^2}{n} & \frac{r}{n} \\ \frac{r}{n} & \frac{s^2}{n} \end{pmatrix} \right)    (68)

Proof: This follows from the theorem on page 148 in [20] with appropriate substitutions.

B. Kγ(A, B) is an Admissible Support Vector Kernel

Theorem 3. Let A, B ∈ R^n. Then Kγ(A, B) satisfies the Mercer conditions, and is an admissible SV kernel.

Proof: According to the criteria in Theorem 8 in [21], to prove that Kγ(A, B) satisfies the Mercer conditions, it is sufficient to prove that there is a convergent series

K_\gamma(A, B) = \sum_{n=0}^{\infty} \alpha_n (A \cdot B)^n    (69)

where αn ≥ 0 for all n. By definition,

K_\gamma(A, B) = \Phi_2(\gamma, \gamma, \cos(A, B))    (70)

Using the tetrachoric series expansion for Φ2 [22] we can show that

\Phi_2(\gamma, \gamma, \rho) = \Phi^2(\gamma) + \phi^2(\gamma) \sum_{k=0}^{\infty} \frac{(He_k(\gamma))^2}{(k+1)!} \rho^{k+1}    (71)

= \Phi^2(\gamma) + \sum_{k=1}^{\infty} \frac{[\phi(\gamma) He_{k-1}(\gamma)]^2}{k!} \rho^k    (72)

where He_k(x) is the k-th Hermite polynomial. Let

\rho = \cos(A, B)    (73)

z = \frac{1}{\|A\| \|B\|}    (74)

Then we have

\rho = z (A \cdot B)    (75)

Substituting in eq. (70) and eq. (72) and simplifying we get

K_\gamma(A, B) = \Phi^2(\gamma) + \sum_{k=1}^{\infty} \frac{[\phi(\gamma) He_{k-1}(\gamma)]^2}{k!} (z(A \cdot B))^k    (76)

= \Phi^2(\gamma) + \sum_{k=1}^{\infty} \frac{[\phi(\gamma) He_{k-1}(\gamma)]^2 z^k}{k!} (A \cdot B)^k    (77)

Since Φ²(γ) ≥ 0 and, for all k,

\frac{[\phi(\gamma) He_{k-1}(\gamma)]^2 z^k}{k!} \ge 0    (78)

eq. (77) is sufficient to prove the theorem.
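The truncated series in eq. (72) can be checked numerically using the three-term recurrence He_{k+1}(x) = x·He_k(x) − k·He_{k−1}(x) for the probabilists' Hermite polynomials. The sketch below is ours, not part of the paper, and assumes the Statistics and Machine Learning Toolbox for normcdf, normpdf and mvncdf.

    gamma = -2.42617; rho = 0.5; K = 50;
    He = zeros(1, K+1);                          % He(k+1) holds He_k(gamma)
    He(1) = 1; He(2) = gamma;
    for k = 2:K
        He(k+1) = gamma*He(k) - (k-1)*He(k-1);   % probabilists' Hermite recurrence
    end
    s = normcdf(gamma)^2;                        % Phi^2(gamma) term
    for k = 1:K
        s = s + (normpdf(gamma)*He(k))^2 / factorial(k) * rho^k;   % eq. (72), truncated
    end
    fprintf('series %.10f   Phi_2 %.10f\n', s, mvncdf([gamma gamma],[0 0],[1 rho; rho 1]));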

C. Formula for The CRO Kernel

Proposition 4.

K_\gamma(A, B) = \Phi^2(\gamma) + \int_{0}^{\cos(A,B)} \frac{\exp\!\left( \frac{-\gamma^2}{1+\rho} \right)}{2\pi\sqrt{1-\rho^2}}\, d\rho    (79)

Proof: Let

C(x, \rho) = \Phi_2(x, x, \rho)    (80)

By definition,

K_\gamma(A, B) = C(\gamma, \cos(A, B))    (81)

Now, as proved in [23]

\frac{d}{d\rho} \Phi_2(x, y, \rho) = \phi_2(x, y, \rho)    (82)

i.e.

\frac{d}{d\rho} \Phi_2(x, y, \rho) = \frac{\exp\!\left( \frac{-x^2 - y^2 + 2\rho x y}{2(1-\rho^2)} \right)}{2\pi\sqrt{1-\rho^2}}    (83)

Thus

\frac{d}{d\rho} C(x, \rho) = \frac{\exp\!\left( \frac{-x^2 - x^2 + 2\rho x^2}{2(1-\rho^2)} \right)}{2\pi\sqrt{1-\rho^2}}    (84)

= \frac{\exp\!\left( \frac{-x^2(1-\rho)}{1-\rho^2} \right)}{2\pi\sqrt{1-\rho^2}}    (85)

Let

\rho \ne 1    (86)

Then eq. (85) can be simplified as

\frac{d}{d\rho} C(x, \rho) = \frac{\exp\!\left( \frac{-x^2}{1+\rho} \right)}{2\pi\sqrt{1-\rho^2}}    (87)

We know that C(x, 0) = Φ²(x). Thus

C(x, r) = \Phi^2(x) + \int_{0}^{r} \frac{\exp\!\left( \frac{-x^2}{1+\rho} \right)}{2\pi\sqrt{1-\rho^2}}\, d\rho    (88)

From eq. (81) and eq. (88) we get

K_\gamma(A, B) = \Phi^2(\gamma) + \int_{0}^{\cos(A,B)} \frac{\exp\!\left( \frac{-\gamma^2}{1+\rho} \right)}{2\pi\sqrt{1-\rho^2}}\, d\rho    (89)

Notice that condition (86) is satisfied inside the integral in eq. (89).
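Eq. (89) (equivalently eq. (8)) is convenient for a quick numerical cross-check against a bivariate normal CDF routine. The sketch below is ours and assumes the Statistics and Machine Learning Toolbox for normcdf and mvncdf.

    gamma = -2.42617; c = 0.7;                   % c plays the role of cos(A,B)
    f   = @(rho) exp(-gamma^2./(1+rho)) ./ (2*pi*sqrt(1-rho.^2));
    lhs = mvncdf([gamma gamma], [0 0], [1 c; c 1]);        % Phi_2(gamma,gamma,c), eq. (7)
    rhs = normcdf(gamma)^2 + integral(f, 0, c);            % right-hand side of eq. (89)
    fprintf('direct %.10f   via eq. (89) %.10f\n', lhs, rhs);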