Fast and Scalable Polynomial Kernels via Explicit Feature Maps*
Ninh Pham and Rasmus Pagh
IT University of Copenhagen, Copenhagen, Denmark
[email protected], [email protected]

* This work is supported by the Danish National Research Foundation under the Sapere Aude program.

ABSTRACT

Approximation of non-linear kernels using random feature mapping has been successfully employed in large-scale data analysis applications, accelerating the training of kernel machines. While previous random feature mappings run in O(ndD) time for n training samples in d-dimensional space and D random feature maps, we propose a novel randomized tensor product technique, called Tensor Sketching, for approximating any polynomial kernel in O(n(d + D log D)) time. Also, we introduce both absolute and relative error bounds for our approximation to guarantee the reliability of our estimation algorithm. Empirically, Tensor Sketching achieves higher accuracy and often runs orders of magnitude faster than the state-of-the-art approach on large-scale real-world datasets.

Categories and Subject Descriptors

I.5.2 [Pattern Recognition]: Design Methodology – Classifier design and evaluation

General Terms

Algorithms, Performance, Experimentation

Keywords

polynomial kernel; SVM; tensor product; Count Sketch; FFT

1. INTRODUCTION

Kernel machines such as Support Vector Machines (SVMs) have recently emerged as powerful approaches for many machine learning and data mining tasks. One of the key properties of kernel methods is the capability to efficiently find non-linear structure in data by the use of kernels. A kernel can be viewed as an implicit non-linear mapping from the original data space into a high-dimensional feature space, where each coordinate corresponds to one feature of the data points. In that space, one can perform well-known data analysis algorithms without ever interacting with the coordinates of the data, but rather by simply computing their pairwise inner products. This operation not only avoids the cost of explicitly computing the coordinates in feature space but also handles general types of data (such as numeric and symbolic data).

While kernel methods have been used successfully in a variety of data analysis tasks, their scalability is a bottleneck. Kernel-based learning algorithms usually scale poorly with the number of training samples (a cubic running time and quadratic storage for direct methods). This drawback is becoming more crucial with the rise of big data applications [12, 3]. Recently, Joachims [9] proposed an efficient training algorithm for linear SVMs that runs in time linear in the number of training examples. Since one can view non-linear SVMs as linear SVMs operating in an appropriate feature space, Rahimi and Recht [16] first proposed a random feature mapping to approximate shift-invariant kernels in order to combine the advantages of both linear and non-linear SVM approaches. This approach approximates kernels by an explicit data mapping into a relatively low-dimensional random feature space. In this random feature space, the kernel of any two points is well approximated by their inner product. Therefore, one can apply existing fast linear learning algorithms to find data relations corresponding to non-linear kernel methods in the random feature space. That leads to a substantial reduction in training time while obtaining similar testing error.

Following up this line of work, many randomized approaches to approximate kernels have been proposed for accelerating the training of kernel machines [10, 12, 20, 21]. While the training algorithm is linear, existing kernel approximation mappings require time proportional to the product of the number of dimensions d and the number of random features D. This means that the mapping itself is a bottleneck whenever dD is not small. In this paper we address this bottleneck and present a near-linear time mapping for approximating any polynomial kernel.

Particularly, given any two points x = {x_1, ..., x_d}, y = {y_1, ..., y_d} of a dataset S ⊂ R^d of n points and an implicit feature space mapping φ : R^d → F, the inner product between these points in the feature space F can be quickly computed as ⟨φ(x), φ(y)⟩ = κ(x, y), where κ is an easily computable kernel function. An explicit random feature mapping f : R^d → R^D can efficiently approximate a kernel κ if it satisfies:

    E[⟨f(x), f(y)⟩] = ⟨φ(x), φ(y)⟩ = κ(x, y).

So we can transform data from the original data space into a low-dimensional explicit random feature space and use any linear learning algorithm to find non-linear data relations.
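To make this property concrete, the following minimal Python sketch (an illustration, not code from the paper) checks it numerically for one well-known explicit map, the random Fourier feature map of Rahimi and Recht [16] for the Gaussian kernel, which is discussed next. The dimensions d and D, the bandwidth sigma, and the test points are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 20, 2000, 1.5           # illustrative choices, not values from the paper

# Random Fourier feature map f : R^d -> R^D for the Gaussian kernel
# kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2)), built so that
# E[<f(x), f(y)>] = kappa(x, y).
W = rng.normal(scale=1.0 / sigma, size=(D, d))   # rows w_j ~ N(0, sigma^-2 I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def f(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x = rng.normal(size=d)
y = x + 0.25 * rng.normal(size=d)     # a nearby point, so kappa(x, y) is not tiny
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))
print("kappa(x, y)  :", exact)
print("<f(x), f(y)> :", f(x) @ f(y))
```

With D in the low thousands the two printed values typically agree to within a few percent, and the mapped vectors f(x) can then be fed to any fast linear learner in place of the kernel.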
Rahimi and Recht [16] introduced a random projection-based algorithm to approximate shift-invariant kernels (e.g. the Gaussian kernel κ(x, y) = exp(−‖x − y‖²/2σ²), for σ > 0). Vempati et al. [21] extended this work to approximate generalized radial-basis function (RBF) kernels (e.g. the exponential-χ² kernel κ(x, y) = exp(−χ²(x, y)/2σ²), where σ > 0 and χ² is the Chi-squared distance measure). Recently, Kar and Karnick [10] made use of the Maclaurin series expansion to approximate inner product kernels (e.g. the polynomial kernel κ(x, y) = (⟨x, y⟩ + c)^p, for c ≥ 0 and an integer p).

These approaches have to maintain D random vectors ω_1, ..., ω_D ∈ R^d in O(dD) space and need O(ndD) operations for computing D random feature maps. That incurs significant (quadratic) computational and storage costs when D = O(d) and d is rather large. When the decision boundary of the problem is rather smooth, the computational cost of the random mapping might dominate the training cost. In addition, the absolute error bounds of previous approaches are not tight. Particularly, the Maclaurin expansion based approach [10] suffers from large error because it approximates the homogeneous polynomial kernel κ(x, y) = ⟨x, y⟩^p by ∏_{i=1}^p ⟨ω_i, x⟩ · ∏_{i=1}^p ⟨ω_i, y⟩, where ω_i ∈ {+1, −1}^d. Our experiments show that the large estimation error results in either accuracy degradation or a negligible reduction in training time.
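For reference, the following minimal Python sketch is a simplified, fixed-degree version of the product-of-projections estimator just described for the homogeneous polynomial kernel ⟨x, y⟩^p; the full construction of [10] additionally randomizes over the terms of the Maclaurin series. The sizes d, D, p and the test points are illustrative assumptions. The points to note are the O(pdD) mapping cost and the comparatively noisy estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, p = 50, 5000, 2                 # illustrative sizes, not values from the paper

# D groups of p Rademacher vectors omega in {+1, -1}^d: O(pdD) storage,
# and O(pdD) work to map a single point.
omega = rng.choice([-1.0, 1.0], size=(D, p, d))

def f(x):
    # Each of the D output coordinates is a product of p Rademacher
    # projections; averaging D of them gives E[<f(x), f(y)>] = <x, y>^p.
    return np.prod(omega @ x, axis=1) / np.sqrt(D)

x = rng.normal(size=d); x /= np.linalg.norm(x)
w = rng.normal(size=d); w /= np.linalg.norm(w)
y = 0.8 * x + 0.6 * w                 # approximately unit-length point correlated with x

print("<x, y>^p     :", (x @ y) ** p)
print("<f(x), f(y)> :", f(x) @ f(y))  # noticeably noisier than the exact value
```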
In this work, we consider the problem of approximating the commonly used polynomial kernel κ(x, y) = (⟨x, y⟩ + c)^p to accelerate the training of kernel machines. We develop a fast and scalable randomized tensor product technique, named Tensor Sketching, to estimate the polynomial kernel of any pair of points of the dataset. Our proposed approach works in O(np(d + D log D)) time and requires O(pd log D) space for random vectors. The main technical insight is the connection between the tensor product and fast convolution of Count Sketches [2, 14], which enables us to reduce the computational complexity and space usage. We introduce both absolute and relative error bounds for our approximation to guarantee the reliability of our estimation algorithm.

2. RELATED WORK

Decomposition methods find the optimal solution of kernel methods' coefficients by performing coordinate ascent on subsets of the training set until the KKT conditions have been satisfied to within a certain tolerance. Although such approaches can handle the memory restrictions involving the dense kernel matrix, they still involve numerical solutions of optimization subproblems and can therefore be problematic and expensive for large-scale datasets.

In order to apply kernel methods to large-scale datasets, many approaches have been proposed for quickly approximating the kernel matrix, including the Nyström methods [5, 23], sparse greedy approximation [19], and low-rank kernel approximation [7]. These approximation schemes can reduce the computational and storage costs of operating on a kernel matrix while preserving the quality of results. An assumption of these approaches is that the kernel matrix has many zero eigenvalues, which might not be true in many datasets. Furthermore, there is a lack of experiments illustrating the efficiency of these approaches on large-scale datasets [16].

Instead of approximating the kernel matrix, recent approaches [10, 12, 16, 20, 21, 24] approximate the kernels by explicitly mapping data into a relatively low-dimensional random feature space. The explicit mapping transforms data into a random feature space where the pairwise inner products of transformed data points are approximately equal to kernels in feature space. Therefore, we can apply existing fast linear learning algorithms [6, 9, 18] to find non-linear data relations in that random feature space. While previous such approaches can efficiently accelerate the training of kernel machines, they incur significant computational cost (quadratic in the dimensionality of the data). That results in performance degradation on large-scale high-dimensional datasets.

3. BACKGROUND AND PRELIMINARIES

3.1 Count Sketch

Charikar et al. [2] described and analyzed a sketching approach, called Count Sketch, to estimate the frequency of all items in a stream. Recently, the machine learning community has used Count Sketch as a feature hashing technique for large-scale multitask learning [22] because Count Sketches can preserve pairwise inner products within an arbitrarily small factor.
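The excerpt ends here, so the following minimal Python sketch previews how these pieces fit together; it is an illustration under assumed names (count_sketch, tensor_sketch) and illustrative sizes d, D, p, not the authors' reference implementation. A Count Sketch approximately preserves inner products, as stated above, and combining p Count Sketches by FFT-based convolution, the Tensor Sketching idea outlined in the introduction, yields an unbiased estimate of the homogeneous polynomial kernel ⟨x, y⟩^p at O(p(d + D log D)) cost per mapped point once the hash functions are fixed.

```python
import numpy as np

rng = np.random.default_rng(2)
d, D, p = 100, 2 ** 13, 3             # illustrative sizes; D a power of two suits the FFT

# One hash function h_i : [d] -> [D] and one sign function s_i : [d] -> {+1, -1}
# per tensor factor, drawn uniformly at random here for simplicity.
h = rng.integers(0, D, size=(p, d))
s = rng.choice([-1.0, 1.0], size=(p, d))

def count_sketch(x, i):
    """Count Sketch of x using the i-th hash/sign pair: coordinate x_j
    contributes s_i(j) * x_j to bucket h_i(j)."""
    cx = np.zeros(D)
    np.add.at(cx, h[i], s[i] * x)
    return cx

def tensor_sketch(x):
    """Combine p Count Sketches by FFT-based circular convolution. This
    sketches the p-fold tensor product of x, so the inner product of two
    such sketches estimates <x, y>^p."""
    fx = np.ones(D, dtype=complex)
    for i in range(p):
        fx *= np.fft.fft(count_sketch(x, i))   # convolution = product in the frequency domain
    return np.real(np.fft.ifft(fx))

x = rng.normal(size=d)
y = x + 0.5 * rng.normal(size=d)      # correlated with x so the kernel value is sizeable

# Count Sketch alone approximately preserves pairwise inner products.
print("<x, y>         :", x @ y)
print("<CS(x), CS(y)> :", count_sketch(x, 0) @ count_sketch(y, 0))

# Tensor Sketch approximates the homogeneous polynomial kernel <x, y>^p.
print("<x, y>^p       :", (x @ y) ** p)
print("<TS(x), TS(y)> :", tensor_sketch(x) @ tensor_sketch(y))
```

One standard way to handle the non-homogeneous kernel (⟨x, y⟩ + c)^p with the same machinery is to append √c as an extra coordinate to every point, since ⟨(x, √c), (y, √c)⟩ = ⟨x, y⟩ + c.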