Incremental Randomized Sketching for Online Kernel Learning

Xiao Zhang, Shizhong Liao
College of Intelligence and Computing, Tianjin University, Tianjin 300350, China. Correspondence to: Shizhong Liao.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Randomized sketching has been used in offline kernel learning, but it cannot be applied directly to online kernel learning due to the lack of incremental maintenances for randomized sketches with regret guarantees. To address these issues, we propose a novel incremental randomized sketching approach for online kernel learning, which has efficient incremental maintenances with theoretical guarantees. We construct two incremental randomized sketches using the sparse transform matrix and the sampling matrix for kernel matrix approximation, update the incremental randomized sketches using rank-1 modifications, and construct a time-varying explicit feature mapping for online kernel learning. We prove that the proposed incremental randomized sketching is statistically unbiased for the matrix product approximation, obtains a $1+\epsilon$ relative-error bound for the kernel matrix approximation, enjoys a sublinear regret bound for online kernel learning, and has constant time and space complexities at each round for incremental maintenances. Experimental results demonstrate that the incremental randomized sketching achieves a better learning performance in terms of accuracy and efficiency even in adversarial environments.

1. Introduction

Online kernel learning must rise to the challenge of incremental maintenance in a low computational complexity with accuracy and convergence guarantees at each round. Online gradient descent is a judicious choice for online kernel learning due to its efficiency and effectiveness (Bottou, 2010; Shalev-Shwartz et al., 2011). However, the kernelization of the online gradient descent may cause a linear growth with respect to the model size and increase the hypothesis updating time significantly (Wang et al., 2012). To solve this problem, budgeted online kernel learning has been proposed to maintain a buffer of support vectors (SVs) with a fixed budget, which limits the model size to reduce the computational complexities (Crammer et al., 2003). Several budget maintenance strategies were introduced to bound the buffer size of SVs. Dekel et al. (2005) maintained the budget by discarding the oldest support vector. Orabona et al. (2008) projected the new example onto the linear span of SVs in the feature space to reduce the buffer size. But the buffer of SVs cannot be used for kernel matrix approximation directly. Besides, it is difficult to provide a lower bound of the buffer size that is key to the theoretical guarantees for budgeted online kernel learning.

Unlike the budget maintenance strategies of SVs, sketches of the kernel matrix have been introduced into online kernel learning to approximate the kernel incrementally. Engel et al. (2004) used the approximate linear dependency (ALD) rule to construct the sketches for kernel approximation in kernel regression, but the kernel matrix approximation error is not compared with the best rank-$k$ approximation. Sun et al. (2012) proved that the size of sketches constructed by the ALD rule grows sublinearly with respect to the data size when the eigenvalues decay fast enough. Recently, randomized sketches of the kernel matrix were used as surrogates to perform the offline kernel approximation with finite sample bounds (Sarlós, 2006; Woodruff, 2014; Raskutti & Mahoney, 2015), but few efforts have been made to formulate suitable randomized sketches for online kernel learning. Existing online kernel learning approaches using randomized sketches usually adopted a random sampling strategy. Lu et al. (2016) applied the Nyström approach for kernel function approximation and constructed the explicit feature mapping using the singular value decomposition, but it fixed the randomized sketches after initialization, which is equivalent to the offline Nyström approach using uniform sampling and may suffer from a low approximation accuracy in adversarial environments due to the non-uniform structure of the data (Kumar et al., 2009). Calandriello et al. (2017b) assumed that the kernel function can be expressed as an inner product in the explicit feature space, and proposed a second-order online kernel learning approach using an online sampling strategy, but the construction of the explicit feature mapping needs to be further explored. Calandriello et al. (2017a) formulated an explicit feature mapping using a Nyström approximation, and performed effective second-order updates for hypothesis updating, which enjoys a logarithmic regret and has a per-step cost that grows with the effective dimension.

In this paper, we propose an incremental randomized sketching approach for online kernel learning. The proposed incremental randomized sketching inherits the efficient incremental maintenance for matrix approximations and explicit feature mappings, has a constant time complexity at each round, achieves a $1+\epsilon$ relative-error for kernel matrix approximation, and enjoys a sublinear regret for online kernel learning, therefore presenting an effective and efficient sketching approach to online kernel learning.

2. Notations and Preliminaries

Let $[A]_{*i}$ and $[A]_{i*}$ denote the $i$-th column of the matrix $A$ and the column vector consisting of the entries from the $i$-th row of $A$ respectively, $\|A\|_F$ and $\|A\|_2$ the Frobenius and spectral norms of $A$ respectively, and $A^{\dagger}$ the Moore-Penrose pseudoinverse of $A$. Let $[M] = \{1, 2, \ldots, M\}$ and $S = \{(x_i, y_i)\}_{i=1}^{T} \subseteq (\mathcal{X} \times \mathcal{Y})^{T}$ be the sequence of $T$ instances, where $\mathcal{X} \subseteq \mathbb{R}^{D}$ and $\mathcal{Y} = \{-1, 1\}$. We denote the loss function by $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{+} \cup \{0\}$, the kernel function by $\kappa: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and its corresponding kernel matrix by $K = (\kappa(x_i, x_j)) \in \mathbb{R}^{T \times T}$, the $i$-th eigenvalue of $K$ by $\lambda_i(K)$, $\lambda_1(K) \ge \lambda_2(K) \ge \cdots \ge \lambda_T(K)$, and the reproducing kernel Hilbert space (RKHS) associated with $\kappa$ by $\mathcal{H}_\kappa = \mathrm{span}\{\kappa(\cdot, x) : x \in \mathcal{X}\}$.

2.1. Kernel Matrix Approximation

Given a kernel matrix $K \in \mathbb{R}^{T \times T}$, the sketched kernel matrix approximation problem in the offline setting is described as follows (Wang et al., 2016):
\[
F_{\mathrm{fast}} = \arg\min_{F \in \mathbb{R}^{B \times B}} \big\| S^{\top}(C F C^{\top} - K) S \big\|_F^2,
\]
from which we can obtain the approximate kernel matrix $K_{\mathrm{fast}} = C F_{\mathrm{fast}} C^{\top} \approx K$, where $S \in \mathbb{R}^{T \times s}$ is a sketch matrix, $C \in \mathbb{R}^{T \times B}$, and $s, B > 0$ are the sketch sizes. The Nyström approach uses the column selection matrix $P$ as the sketch matrix (Williams & Seeger, 2001), i.e., $S = P \in \mathbb{R}^{T \times B}$, and formulates $C$ as $C = KP$, which yields the approximate kernel matrix $K_{\mathrm{ny}} = C W^{\dagger} C^{\top}$, where $W = P^{\top} K P \in \mathbb{R}^{B \times B}$. When $S = I_T \in \mathbb{R}^{T \times T}$, the modified Nyström approach (also called the prototype model) (Wang & Zhang, 2013) is obtained as
\[
K_{\mathrm{mod}} = C F_{\mathrm{mod}} C^{\top} \approx K,
\]
where $F_{\mathrm{mod}} = C^{\dagger} K (C^{\top})^{\dagger}$, which is more accurate but more computationally expensive than the standard Nyström. But in the context of online kernel learning, the size of the kernel matrix increases over time, a scenario for which the offline kernel matrix approximation approaches do not readily apply due to the lack of efficient incremental maintenances with regret guarantees.

2.2. Randomized Sketches

Randomized sketches via hash functions can be described in general using hash-based sketch matrices. Without loss of generality, we assume that the sketch size $s$ is divisible by $d$. Let $h_1, \ldots, h_d$ be hash functions from $\{1, \ldots, n\}$ to $\{1, \ldots, s/d\}$, and $\sigma_1, \ldots, \sigma_d$ the other $d$ hash functions. We denote the hash-based sketch submatrix by $S_k \in \mathbb{R}^{n \times (s/d)}$ and the hash-based sketch matrix by $S = [S_1, S_2, \ldots, S_d] \in \mathbb{R}^{n \times s}$, where $[S_k]_{i,j} = \sigma_k(i)$ for $j = h_k(i)$ and $[S_k]_{i,j} = 0$ for $j \ne h_k(i)$, $k = 1, \ldots, d$. The randomized sketch of $A \in \mathbb{R}^{T \times n}$ is denoted by $AS \in \mathbb{R}^{T \times s}$. Sketch matrices of different randomized sketches can be specified by different parameters $s$, $d$ and functions $\{\sigma_k\}_{k=1}^{d}$ as follows.

Count Sketch Matrix (Charikar et al., 2004): $d = 1$ and $\sigma_1(i) \in \{-1, +1\}$ is a 2-wise independent hash function.

Sparser Johnson-Lindenstrauss Transform (SJLT) (Kane & Nelson, 2014): For the block construction of SJLT, $\sigma_k(i) \in \{-1/\sqrt{d}, +1/\sqrt{d}\}$ is an $O(\log T)$-wise independent hash function for $k = 1, \ldots, d$, where $d$ is the number of blocks. SJLT is a generalization of the Count Sketch matrix.
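For illustration, a minimal Python sketch of the block construction above (the function name is ours, and the hash and sign values are drawn i.i.d. rather than from $O(\log T)$-wise independent families, so this is a simplification of the construction assumed in the analysis):

```python
import numpy as np

def sjlt_matrix(n, s, d, rng):
    """Block-construction SJLT S = [S_1, ..., S_d] in R^{n x s}.

    Each block S_k in R^{n x (s/d)} has one nonzero per row:
    [S_k]_{i, h_k(i)} = sigma_k(i) in {-1/sqrt(d), +1/sqrt(d)}.
    """
    assert s % d == 0, "sketch size s must be divisible by d"
    b = s // d
    S = np.zeros((n, s))
    for k in range(d):
        cols = rng.integers(0, b, size=n)           # hash values h_k(i)
        signs = rng.choice([-1.0, 1.0], size=n)     # signs of sigma_k(i)
        S[np.arange(n), k * b + cols] = signs / np.sqrt(d)
    return S

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))                  # A in R^{T x n}
S = sjlt_matrix(n=200, s=40, d=4, rng=rng)
print((A @ S).shape)                                # randomized sketch AS in R^{T x s}
# squared Frobenius norms are preserved in expectation
print(np.linalg.norm(A), np.linalg.norm(A @ S))
```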

3. Novel Incremental Randomized Sketching

In this section, we propose a novel incremental randomized sketching approach. Specifically, we construct the incremental randomized sketches for the kernel matrix approximation, formulate the incremental maintenance for the incremental randomized sketches, and build a time-varying explicit feature mapping. In the online setting, at round $t+1$, a new example $x_{t+1}$ arrives and the kernel matrix $K^{(t+1)}$ can be represented as a bordered matrix as follows:
\[
K^{(t+1)} =
\begin{bmatrix}
K^{(t)} & \psi^{(t+1)} \\
\psi^{(t+1)\top} & \xi^{(t+1)}
\end{bmatrix}
\in \mathbb{R}^{(t+1) \times (t+1)},
\]
where $\xi^{(t+1)} = \kappa(x_{t+1}, x_{t+1})$ and
\[
\psi^{(t+1)} = \big[\kappa(x_{t+1}, x_1), \kappa(x_{t+1}, x_2), \ldots, \kappa(x_{t+1}, x_t)\big]^{\top}.
\]
To approximate the kernel matrix incrementally, we introduce a sketch matrix $S_p^{(t+1)}$ to reduce the complexity of the problem (2) and another sketch matrix $S_m^{(t+1)}$ to reduce the size of the approximate kernel matrix, where $S_p^{(t+1)} \in \mathbb{R}^{(t+1) \times s_p}$ is an SJLT (defined in Section 2.2) and $S_m^{(t+1)} \in \mathbb{R}^{(t+1) \times s_m}$ is a sub-sampling matrix. Using the two sketch matrices, we first formulate the incremental randomized sketches of $K^{(t+1)}$ as follows:
\[
\Phi_{pm}^{(t+1)} = S_p^{(t+1)\top} C_m^{(t+1)}
\quad\text{and}\quad
\Phi_{pp}^{(t+1)} = S_p^{(t+1)\top} C_p^{(t+1)},
\]
where $C_p^{(t+1)} = K^{(t+1)} S_p^{(t+1)}$ and $C_m^{(t+1)} = K^{(t+1)} S_m^{(t+1)}$. Then $K^{(t+1)}$ can be approximated by
\[
K_{\mathrm{sk}}^{(t+1)} = C_m^{(t+1)} F_{\mathrm{sk}}^{(t+1)} C_m^{(t+1)\top} \approx K^{(t+1)}, \tag{1}
\]
where $F_{\mathrm{sk}}^{(t+1)} \in \mathbb{R}^{s_m \times s_m}$ is obtained by solving the following sketched matrix approximation problem at round $t+1$:
\[
F_{\mathrm{sk}}^{(t+1)}
= \arg\min_{F \in \mathbb{R}^{s_m \times s_m}} \big\| S_p^{(t+1)\top} E_F^{(t+1)} S_p^{(t+1)} \big\|_F^2
= \big(\Phi_{pm}^{(t+1)}\big)^{\dagger} \Phi_{pp}^{(t+1)} \big(\Phi_{pm}^{(t+1)\top}\big)^{\dagger}, \tag{2}
\]
where $E_F^{(t+1)} = C_m^{(t+1)} F C_m^{(t+1)\top} - K^{(t+1)}$.

Next, we design the incremental maintenance for the incremental randomized sketches. We partition the sketch matrices into block matrices as
\[
S_p^{(t+1)} = \big[\,S_p^{(t)\top},\ s_p^{(t+1)}\,\big]^{\top},
\qquad
S_m^{(t+1)} = \big[\,S_m^{(t)\top},\ s_m^{(t+1)}\,\big]^{\top},
\]
where $s_m^{(t+1)} \in \mathbb{R}^{s_m}$ is a sampling vector and $s_p^{(t+1)} \in \mathbb{R}^{s_p}$ is a vector that contains $d$ nonzero entries determined by the $d$ different hash mappings in SJLT. Then we can update the sketch $\Phi_{pm}^{(t+1)} \in \mathbb{R}^{s_p \times s_m}$ incrementally by rank-1 modifications
\[
\Phi_{pm}^{(t+1)} = \Phi_{pm}^{(t)} + \Delta_{pm}^{(t+1)},
\]
where $\Delta_{pm}^{(t+1)} = R_{pm}^{(t+1)} + R_{mp}^{(t+1)\top} + T_{pm}^{(t+1)}$ consists of the following three rank-1 matrices:
\[
R_{pm}^{(t+1)} = s_p^{(t+1)} \psi^{(t+1)\top} S_m^{(t)},
\qquad
R_{mp}^{(t+1)} = s_m^{(t+1)} \psi^{(t+1)\top} S_p^{(t)},
\qquad
T_{pm}^{(t+1)} = \xi^{(t+1)} s_p^{(t+1)} s_m^{(t+1)\top}.
\]
Similarly, we update the sketch $\Phi_{pp}^{(t+1)} \in \mathbb{R}^{s_p \times s_p}$ by rank-1 modifications as follows:
\[
\Phi_{pp}^{(t+1)} = \Phi_{pp}^{(t)} + \Delta_{pp}^{(t+1)},
\]
where $\Delta_{pp}^{(t+1)} = R_{pp}^{(t+1)} + R_{pp}^{(t+1)\top} + T_{pp}^{(t+1)}$ and the modifications are performed using two rank-1 matrices
\[
R_{pp}^{(t+1)} = s_p^{(t+1)} \psi^{(t+1)\top} S_p^{(t)},
\qquad
T_{pp}^{(t+1)} = \xi^{(t+1)} s_p^{(t+1)} s_p^{(t+1)\top}.
\]
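The rank-1 maintenance above amounts to a few outer products. A minimal Python sketch (our own helper, not the authors' implementation) applies the updates of $\Phi_{pm}$ and $\Phi_{pp}$ for one new example and checks them against the direct products $S_p^{\top} K S_m$ and $S_p^{\top} K S_p$:

```python
import numpy as np

def rank1_update(Phi_pm, Phi_pp, Sp_prev, Sm_prev, sp_new, sm_new, psi, xi):
    """One maintenance step at round t+1.

    Phi_pm: (s_p, s_m), Phi_pp: (s_p, s_p) sketches at round t.
    Sp_prev: (t, s_p), Sm_prev: (t, s_m) sketch rows seen so far.
    sp_new: (s_p,), sm_new: (s_m,) new sketch rows.
    psi: (t,) values kappa(x_{t+1}, x_i); xi: kappa(x_{t+1}, x_{t+1}).
    """
    R_pm = np.outer(sp_new, psi @ Sm_prev)          # s_p x s_m
    R_mp = np.outer(sm_new, psi @ Sp_prev)          # s_m x s_p
    T_pm = xi * np.outer(sp_new, sm_new)            # s_p x s_m
    Phi_pm = Phi_pm + R_pm + R_mp.T + T_pm

    R_pp = np.outer(sp_new, psi @ Sp_prev)          # s_p x s_p
    T_pp = xi * np.outer(sp_new, sp_new)            # s_p x s_p
    Phi_pp = Phi_pp + R_pp + R_pp.T + T_pp
    return Phi_pm, Phi_pp

# Consistency check of the algebra (it holds for any sketch rows, not only SJLT):
rng = np.random.default_rng(1)
t, s_p, s_m = 30, 12, 6
X = rng.standard_normal((t + 1, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)   # Gaussian kernel
Sp = rng.standard_normal((t + 1, s_p))
Sm = rng.standard_normal((t + 1, s_m))
Phi_pm = Sp[:t].T @ K[:t, :t] @ Sm[:t]
Phi_pp = Sp[:t].T @ K[:t, :t] @ Sp[:t]
Phi_pm, Phi_pp = rank1_update(Phi_pm, Phi_pp, Sp[:t], Sm[:t],
                              Sp[t], Sm[t], K[t, :t], K[t, t])
print(np.allclose(Phi_pm, Sp.T @ K @ Sm), np.allclose(Phi_pp, Sp.T @ K @ Sp))
```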

Figure 1 illustrates the kernel matrix approximation using the incremental randomized sketching.

Figure 1. The proposed incremental randomized sketching for kernel matrix approximation at round $t+1$, where $K_{\mathrm{sk}}^{(t+1)} = C_m^{(t+1)} F_{\mathrm{sk}}^{(t+1)} C_m^{(t+1)\top}$ is the approximate kernel matrix, $\Phi_{pm}^{(t+1)}$ and $\Phi_{pp}^{(t+1)}$ are two incremental randomized sketches for constructing $F_{\mathrm{sk}}^{(t+1)}$, $\Delta_{pp}^{(t+1)}$ can be seen as a modification function with input variables $\xi^{(t+1)}$, $\psi^{(t+1)}$, $S_p^{(t+1)}$, and $\Delta_{pm}^{(t+1)}$ uses an additional input variable $S_m^{(t+1)}$. Since the incremental randomized sketches retain the key information with the efficient rank-1 modifications, our incremental randomized sketching can be computed in constant time and space complexities and enjoys a sublinear regret when applied to online kernel learning (see Section 5).

Finally, we construct the time-varying explicit feature mapping using the incremental randomized sketches. We decompose $\Phi_{pp}^{(t+1)}$ via the rank-$k$ singular value decomposition (SVD) as follows:
\[
\Phi_{pp}^{(t+1)} \approx V^{(t+1)} \Sigma^{(t+1)} V^{(t+1)\top}, \tag{3}
\]
where $V^{(t+1)} \in \mathbb{R}^{s_p \times k}$, $\Sigma^{(t+1)} \in \mathbb{R}^{k \times k}$ and rank $k \le s_p$. Then $F_{\mathrm{sk}}^{(t+1)}$ in (2) is approximated by
\[
F_{\mathrm{sk}}^{(t+1)} \approx Q_{t+1} Q_{t+1}^{\top},
\quad\text{where}\quad
Q_{t+1} = \big(\Phi_{pm}^{(t+1)}\big)^{\dagger} V^{(t+1)} \big(\Sigma^{(t+1)}\big)^{\frac{1}{2}},
\]
which yields the approximate kernel matrix from (1)
\[
K_{\mathrm{sk}}^{(t+1)} \approx \big(C_m^{(t+1)} Q_{t+1}\big)\big(C_m^{(t+1)} Q_{t+1}\big)^{\top}.
\]
Further, the time-varying explicit feature mapping can be updated at round $t+1$ by
\[
\phi_{t+2}(\cdot) = \big(\,[\kappa(\cdot, \tilde{x}_1), \ldots, \kappa(\cdot, \tilde{x}_{s_m})]\, Q_{t+1}\big)^{\top}, \tag{4}
\]
where $\{\tilde{x}_i\}_{i=1}^{s_m}$ are the sampled examples obtained by $S_m^{(t+1)}$, by which the kernel function value between the examples $x_i$ and $x_j$ can be approximated by $\kappa(x_i, x_j) \approx \langle \phi_{t+2}(x_i), \phi_{t+2}(x_j)\rangle$.
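A compact sketch of this feature-mapping construction, using numpy's exact truncated SVD in place of the randomized SVD adopted later in Section 4; the dense random matrix standing in for the SJLT, the landmark selection, and all helper names are ours and only illustrative:

```python
import numpy as np

def compute_Q(Phi_pm, Phi_pp, k):
    """Q = pinv(Phi_pm) V Sigma^{1/2} from a rank-k decomposition of Phi_pp."""
    U, s, _ = np.linalg.svd((Phi_pp + Phi_pp.T) / 2.0)   # symmetrize for stability
    V, sigma_sqrt = U[:, :k], np.sqrt(np.maximum(s[:k], 0.0))
    return np.linalg.pinv(Phi_pm) @ (V * sigma_sqrt)     # s_m x k

def feature_map(X, landmarks, Q, kernel):
    """phi(x) = ([kappa(x, x_tilde_1), ..., kappa(x, x_tilde_{s_m})] Q)^T, row-wise."""
    return kernel(X, landmarks) @ Q                      # n x k

gauss = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / 2.0)
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
K = gauss(X, X)
idx = rng.choice(40, size=8, replace=False)              # s_m = 8 sampled examples
Sm = np.eye(40)[:, idx]                                  # sub-sampling matrix
Sp = rng.standard_normal((40, 16)) / 4.0                 # dense stand-in for the SJLT
Phi_pm, Phi_pp = Sp.T @ K @ Sm, Sp.T @ K @ Sp
Q = compute_Q(Phi_pm, Phi_pp, k=6)
Phi = feature_map(X, X[idx], Q, gauss)                   # explicit features
print(np.linalg.norm(Phi @ Phi.T - K) / np.linalg.norm(K))   # approximation quality
```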

4. Application to Online Kernel Learning

Using the time-varying explicit feature mapping $\phi_t(\cdot)$, we formulate the hypothesis $f_t(x_t)$ for prediction at round $t$, which is an approximation of the optimal hypothesis on the first $t-1$ instances¹:
\[
f_t(x_t) = \langle w_t, \phi_t(x_t)\rangle \approx \sum_{i \in [t-1]} \alpha_i^{*} \kappa(x_t, x_i),
\]
where $w_t$ is the weight vector and $\{\alpha_i^{*}\}_{i \in [t-1]}$ is the set of the optimal coefficients of the optimal hypothesis over the first $t-1$ instances. When applying the hypotheses $\{f_t\}_{t \in [T]}$ to online kernel learning, the learning procedure consists of the following two stages.

¹It follows from the representer theorem that it is sufficient to approximate the optimal hypothesis of the linear combination form.

At the first stage, we use a buffer $V_t$ to store the examples with nonzero losses, restrict the size of the buffer under a fixed budget $B$ ($s_p, s_m < B \ll T$), and perform Kernelized Online Gradient Descent (KOGD) (Kivinen et al., 2001) until the size of $V_t$ reaches the budget $B$. Assuming that $T_0$ rounds have passed when the budget is reached, let $S_m^{(T_0)}$ be the uniform sampling matrix; we sample $\{\tilde{x}_i\}_{i=1}^{s_m}$ using $S_m^{(T_0)}$ and initialize the sketches $\Phi_{pp}^{(T_0)}$, $\Phi_{pm}^{(T_0)}$ and $Q_{T_0}$. Then we obtain $\phi_{T_0+1}(\cdot)$ and initialize the hypothesis $w_{T_0+1}$ such that
\[
w_{T_0+1}^{\top} \phi_{T_0+1}(x_{T_0}) = \sum_{x_i \in V_{T_0}} \alpha_i \kappa(x_{T_0}, x_i),
\]
which yields
\[
w_{T_0+1}^{\top} = \sum_{x_i \in V_{T_0}} \alpha_i \kappa(x_{T_0}, x_i)\, \frac{\phi_{T_0+1}(x_{T_0})^{\top}}{\|\phi_{T_0+1}(x_{T_0})\|_2^2},
\]
where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_B]^{\top}$ is obtained by KOGD.

At the second stage, we update the hypothesis using Online Gradient Descent (OGD) (Zinkevich, 2003). Denote the hypothesis at round $t$ using the updated explicit feature mapping $\phi_{t+1}(\cdot)$ by
\[
\bar{f}_t(x_t) = \langle \bar{w}_t, \phi_{t+1}(x_t)\rangle. \tag{5}
\]
We can obtain $\bar{w}_t$ by setting $\bar{f}_t(x_t) = f_t(x_t)$, which yields
\[
\bar{w}_t^{\top} = f_t(x_t)\, \phi_{t+1}(x_t)^{\dagger} = \frac{f_t(x_t)\cdot\phi_{t+1}(x_t)^{\top}}{\|\phi_{t+1}(x_t)\|_2^2}.
\]
Then we update the hypothesis $\bar{f}_t(\cdot) = \langle \bar{w}_t, \phi_{t+1}(\cdot)\rangle$ in (5) as follows:
\[
w_{t+1} = \bar{w}_t - \eta\nabla\mathcal{L}_t(\bar{w}_t), \tag{6}
\]
where $\ell_t(\bar{w}_t) = \ell_t(\bar{f}_t) := \ell(\bar{f}_t(x_t), y_t)$ and $\mathcal{L}_t(\bar{w}_t) := \ell_t(\bar{w}_t) + \frac{\lambda}{2}\|\bar{w}_t\|_2^2$.

To avoid the unbounded sizes of $\psi^{(t+1)}$, $S_p^{(t+1)}$ and $S_m^{(t+1)}$, we fix the sampled examples $\{\tilde{x}_i\}_{i=1}^{s_m}$ in $\phi_t(\cdot)$, $t = T_0+1, \ldots, T$, and adopt a periodic updating strategy for $Q_t$. Specifically, we update $Q_t$ in (4) once for every $\rho$ examples at the second stage, where $\rho \in [T-B]$ is the update cycle. To reduce the time complexity of the matrix decomposition, we adopt the randomized SVD (Halko et al., 2011) for (3). Besides, we also use the randomized SVD for computing the Moore-Penrose pseudoinverse of $\Phi_{pm}^{(t+1)}$. Finally, we summarize the above stages into Algorithm 1, called sketched online gradient descent (SkeGD).

Algorithm 1 SkeGD Algorithm
Require: sketch sizes $s_p$ and $s_m$, stepsize $\eta$, rank $k$, budget $B$, update cycle $\rho$, number of blocks $d$, regularization parameter $\lambda$
1: Initialize $f_1 = 0$ and $V_1 = \emptyset$
2: for $t = 1, \ldots, T$ do
3:   Receive new example $x_t$
4:   Predict $\hat{y}_t = \mathrm{sgn}(f_t(x_t))$
5:   if $|V_t| < B$ then
6:     $V_{t+1} = V_t \cup \{x_t\}$ whenever the loss is nonzero
7:     Update $f_t$ by kernelized online gradient descent
8:   else
9:     if $|V_t| = B$ then
10:      Initialize $T_0 = t$
11:      Initialize $\Phi_{pp}^{(T_0)}$, $\Phi_{pm}^{(T_0)}$ and $Q_{T_0}$
12:      Initialize $\phi_{T_0+1}(\cdot)$ and $w_{T_0+1}$
13:    else
14:      if $t \bmod \rho = 1$ then
15:        Update $\Phi_{pp}^{(t)}$ and $\Phi_{pm}^{(t)}$ using rank-1 modifications
16:        Compute $Q_t$ using $\Phi_{pp}^{(t)}$ and $\Phi_{pm}^{(t)}$
17:        $\phi_{t+1}(x_t) = ([\kappa(x_t, \tilde{x}_1), \ldots, \kappa(x_t, \tilde{x}_{s_m})] Q_t)^{\top}$
18:        $\bar{w}_t = f_t(x_t)\,\phi_{t+1}(x_t)/\|\phi_{t+1}(x_t)\|_2^2$
19:      else
20:        $Q_t = Q_{t-1}$, $\Phi_{pp}^{(t)} = \Phi_{pp}^{(t-1)}$, $\Phi_{pm}^{(t)} = \Phi_{pm}^{(t-1)}$
21:        $\bar{w}_t = w_t$
22:      end if
23:      $w_{t+1} = \bar{w}_t - \eta\nabla\mathcal{L}_t(\bar{w}_t)$
24:    end if
25:  end if
26: end for
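For concreteness, a minimal sketch of one second-stage update (line 23 of Algorithm 1, i.e., update (6)) instantiated with the hinge loss used in the experiments; any convex Lipschitz loss fits the analysis, and the function and variable names here are ours:

```python
import numpy as np

def skegd_second_stage_step(w_bar, phi_x, y, eta, lam):
    """w_{t+1} = w_bar - eta * grad L_t(w_bar),
    with L_t(w) = max(0, 1 - y <w, phi_x>) + (lam / 2) * ||w||_2^2."""
    margin = y * float(w_bar @ phi_x)
    grad_loss = -y * phi_x if margin < 1.0 else np.zeros_like(phi_x)
    return w_bar - eta * (grad_loss + lam * w_bar)

# usage on the k-dimensional sketched features
rng = np.random.default_rng(0)
w = 0.01 * rng.standard_normal(6)
phi_x, y = rng.standard_normal(6), 1.0
w = skegd_second_stage_step(w, phi_x, y, eta=0.1, lam=0.01)
```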

5. Theoretical Analysis

In this section, we demonstrate that the proposed incremental randomized sketching satisfies the low-rank approximation property, analyze the regret for online kernel learning, and give the complexity analysis for SkeGD. The detailed proofs can be found in the supplementary material. For convenience, we denote $S_p^{(T)}$, $S_m^{(T)}$, $C_m^{(T)}$, $F_{\mathrm{sk}}^{(T)}$ and $K^{(T)}$ by $S_p$, $S_m$, $C_m$, $F_{\mathrm{sk}}$ and $K$, respectively.

5.1. Low-Rank Approximation Property

We first present the bias and variance analysis for approximating the matrix product using $S_p$ in the proposed incremental randomized sketches, which shows that our sketched kernel matrix approximation problem (2) is an unbiased estimate of the modified Nyström kernel matrix approximation problem.

Lemma 1 (Matrix Product Preserving Property). Let $A \in \mathbb{R}^{T \times m}$ and $S_p \in \mathbb{R}^{T \times s_p}$. Then,
\[
\mathbb{E}\big[\|S_p^{\top}A\|_F^2\big] = \|A\|_F^2,
\qquad
\mathrm{Var}\big[\|S_p^{\top}A\|_F^2\big] \le \frac{2}{s_p}\|A\|_F^4.
\]

Proof Sketch. We first partition $S_p$ into $d$ hash-based sketch submatrices, and demonstrate the expectation and variance while approximating the inner product by $S_p$, which yields the unbiasedness for the matrix product approximation. Then we derive the expectation of $\|S_p^{\top}A\|_F^4$ and obtain the variance of $\|S_p^{\top}A\|_F^2$.

By Lemma 1, we can prove that the proposed incremental randomized sketching is nearly as accurate as the modified Nyström for kernel matrix approximation, which is more accurate than the standard Nyström approach.

Theorem 1 (Low-Rank Approximation Property). Let $K \in \mathbb{R}^{T \times T}$ be a symmetric matrix, $\epsilon_0 \in (0, 1)$, and let $F_{\mathrm{sk}} \in \mathbb{R}^{s_m \times s_m}$, $C_m \in \mathbb{R}^{T \times s_m}$ be the matrices defined in (1). Set $\tau = s_m/s_p$ and $d = \Theta\big(\log^3(s_m)\big)$ for $S_p \in \mathbb{R}^{T \times s_p}$. Let $U_m \in \mathbb{R}^{T \times s_m}$ be a matrix with orthonormal columns, $U_m^{\perp} \in \mathbb{R}^{T \times (T-s_m)}$ be a matrix satisfying $U_m U_m^{\top} + U_m^{\perp}(U_m^{\perp})^{\top} = I_T$ and $U_m^{\top} U_m^{\perp} = O$, and
\[
s_p = \Omega\big(s_m\,\mathrm{polylog}\big(s_m\delta_0^{-1}\big)/\epsilon_0^2\big).
\]
Then with probability at least $1-\delta_0$ all singular values of $S_p^{\top}U_m$ are $1 \pm \epsilon_0$, and with probability at least $1-\delta$,
\[
\|C_m F_{\mathrm{sk}} C_m^{\top} - K\|_F^2 \le (1+\epsilon)\,\|C_m F_{\mathrm{mod}} C_m^{\top} - K\|_F^2,
\]
where
\[
\epsilon = 2\tau\sqrt{\frac{T}{\delta_1\delta_2}} + \epsilon_0 + 2\epsilon_0^2 + 2\epsilon_0\sqrt{\frac{2\tau^2}{\delta_2}}, \tag{7}
\]
$\delta_i$ is the failure probability of matrix product preserving as
\[
\Pr\left[\frac{\|B_iA_i - B_iS_pS_p^{\top}A_i\|_F^2}{\|B_i\|_F^2\,\|A_i\|_F^2} > \frac{2}{\delta_i s_p}\right] \le \delta_i, \quad i = 1, 2,
\]
$A_1 = U_m$, $B_1 = I_T$, $A_2 = U_m^{\perp}(U_m^{\perp})^{\top}K$, $B_2 = U_m^{\top}$, and $\delta = \delta_0 + \delta_1 + \delta_2$.

Proof Sketch. Let $A \in \mathbb{R}^{T \times m}$ and $B \in \mathbb{R}^{p \times T}$. By Lemma 1 and Markov's inequality, we derive the error bound for approximating $BA$ by $BS_pS_p^{\top}A$, which yields the bound of the kernel matrix approximation error combined with the singular values of $S_pU_m$. Then we bound the singular values of $S_pU_m$ and obtain the final upper bound of the approximation error.

Remark 1. Theorem 1 shows that the proposed incremental randomized sketching achieves a relative-error bound for kernel matrix approximation, since the modified Nyström is a $1+\epsilon_0$ relative-error approximation with respect to the best rank-$k$ approximation (Wang & Zhang, 2013). The kernel approximation error of the modified Nyström approach is determined by $s_m$; thus, a larger $s_m$ leads to a tighter kernel approximation error bound. Besides, from (7) we can obtain $\sqrt{\tau} = \Theta(T^{-\frac{1}{4}}\epsilon^{\frac{1}{4}})$. Clearly, a small $\tau$ can guarantee the approximation error $\epsilon$. Finally, the choice of $s_p$ depends on $\tau$ and $s_m$, i.e., $s_p = s_m/\tau$.

5.2. Regret Analysis

Let $K_{B,\rho} \in \mathbb{R}^{(B+\lfloor(T-B)/\rho\rfloor)\times(B+\lfloor(T-B)/\rho\rfloor)}$ be the intersection matrix of $K$ which is constructed by $B + \lfloor(T-B)/\rho\rfloor$ examples, and let $\mu(K_{B,\rho})$ be the coherence of $K_{B,\rho}$:
\[
\mu(K_{B,\rho}) = \frac{B + \lfloor(T-B)/\rho\rfloor}{\mathrm{rank}(K_{B,\rho})}\,\max_i \big\|(U_{B,\rho})_{i,:}\big\|_2^2,
\]
where $U_{B,\rho}$ is the singular vector matrix of $K_{B,\rho}$. Obviously, $\mu(K_{B,\rho})$ is independent of $T$ when $\rho = \Theta(T-B)$. We demonstrate the regret bound for online kernel learning using our incremental randomized sketching as follows².

²In the theoretical analysis, we assume $T_0 = B$ and omit KOGD used at the first stage, which enjoys a $O(\sqrt{B})$ regret bound.

Theorem 2 (Regret Bound). Let $K \in \mathbb{R}^{T \times T}$ be a kernel matrix with $\kappa(x_i, x_j) \le 1$, $k$ ($k \le s_p$) be the rank in the incremental randomized sketches, $\epsilon_0 \in (0, 1)$ and $\delta_i$ ($i = 0, 1, 2$) be the failure probabilities defined in Theorem 1. Set the update cycle $\rho = \lfloor\theta(T-B)\rfloor$, $\theta \in (0, 1)$, $d = \Theta\big(\log^3(s_m)\big)$, and
\[
s_p = \Omega\big(s_m\,\mathrm{polylog}(s_m\delta_0^{-1})/\epsilon_0^2\big),
\qquad
s_m = \Omega\big(\mu(K_{B,\rho})\, k\log k\big),
\]
for $S_p \in \mathbb{R}^{T \times s_p}$ and $S_m \in \mathbb{R}^{T \times s_m}$. Assume $\ell_t(\cdot)$ is a convex loss function that is Lipschitz continuous with Lipschitz constant $L$, and that the eigenvalues of $K$ decay polynomially with decay rate $\beta > 1$. Let $w_t$, $t \in [T]$, be the sequence of hypotheses generated by (6), satisfying $|f_t(x_t)| = |\langle w_t, \phi_t(x_t)\rangle| \le C_f$, $t \in [T]$. Then, for the optimal hypothesis $f^*$ that minimizes $\frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2 + \frac{1}{T}\sum_{t=1}^{T}\ell_t(f)$ in the original reproducing kernel Hilbert space $\mathcal{H}_\kappa$, with probability at least $1-\delta$,
\[
R_{\mathrm{reg}}(T, f^*)
\le \frac{\|w_Z^*\|_2^2}{2\theta\eta} + \frac{\eta L^2 T}{2} + O(\sqrt{B})
+ \frac{1+\epsilon}{\lambda}\left(\frac{C_f^2}{2\theta\eta} + \frac{1}{\lambda(\beta-1)} - \frac{3(B + 1/\theta)}{2T}\right),
\]
where $R_{\mathrm{reg}}(T, f^*) = \sum_{t=1}^{T}\big(\mathcal{L}_t(w_t) - \mathcal{L}_t(f^*)\big)$, $\delta = \delta_0 + \delta_1 + \delta_2$, $\epsilon$ is defined in (7), $w_t^*$ is the optimal hypothesis on the incremental randomized sketches over the first $t$ instances, and $Z = \arg\max_{i \in [T]}\|w_i^*\|_2$.

Proof Sketch. We first decompose $\mathcal{L}_t(w_t) - \mathcal{L}_t(w^*)$ into two terms as follows:
\[
\underbrace{\mathcal{L}_t(w_t) - \mathcal{L}_t(w_t^*)}_{\text{Optimization error}}
+ \underbrace{\mathcal{L}_t(w_t^*) - \mathcal{L}_t(w^*)}_{\text{Estimation error}},
\]
where $w_t^*$ is the optimal hypothesis on the incremental randomized sketches in hindsight. Then we bound the two errors and complete the proof.

Remark 2. The assumption of polynomial decay for the eigenvalues of the kernel matrix is a general assumption, met for shift invariant kernels, finite rank kernels and other commonly used kernels (Liu & Liao, 2015; Belkin, 2018), that is, there exists $\beta > 1$ such that $\lambda_i(K) = O(i^{-\beta})$. Since the regret bound contains the term $1/\theta$ and the term $-1/\theta$ while SkeGD requires an additional $O(\theta^{-2})$ runtime for updating (see Section 5.3), a suitable update cycle $\rho = \lfloor\theta(T-B)\rfloor$ needs to be tuned, which is analyzed in the experiments in adversarial environments. From Remark 1, a smaller $\tau$ leads to a tighter kernel matrix approximation error bound, which yields a tighter regret bound. Besides, the lower bounds of the sketch sizes $s_p$ and $s_m$ are independent of the number of rounds $T$.

In Theorem 2, we can obtain a $O(\sqrt{T})$ regularized regret bound by setting $\eta = 1/\sqrt{T}$, and achieve a $O(\sqrt{T})$ unregularized regret bound by setting $\eta, \lambda = 1/\sqrt{T}$. The existing regret bound of first-order online kernel learning with Nyström is $O(\sqrt{T})$ while requiring the budget $B = O(T)$ (Lu et al., 2016), but the proposed SkeGD does not need this requirement. Under the same assumption that the eigenvalues decay polynomially with decay rate $\beta > 1$, the regret bound of existing second-order online kernel learning via adaptive sketching is $O(T^{1/\beta}\log^2 T)$ (Calandriello et al., 2017a), which is tighter than our bound when $\beta \gg 2$. But this second-order approach has a $O(T^{2/\beta})$ time complexity at each round, while our SkeGD has a constant time complexity at each round with respect to $T$ (see Section 5.3).

5.3. Complexity Analysis for SkeGD

The running time of SkeGD is mainly consumed by the matrix decomposition and multiplication while updating the incremental randomized sketches. Table 1 summarizes the computational complexities of SkeGD at each round³. At the first stage ($t \le T_0$), SkeGD has constant time and space complexities per round. At the second stage ($t > T_0$), the time complexity of SkeGD at each updating round ($Q_t$ is updated) is $O(\nu + s_p^2 + s_p s_m)$, and the space complexity of SkeGD is $O(s_p^2 + s_p s_m + \nu)$, where $\nu = B + \lfloor(T-B)/\rho\rfloor = B + \lfloor\theta^{-1}\rfloor$ when $\rho = \lfloor\theta(T-B)\rfloor$ ($0 < \theta < 1$). Since $Q_t$ is updated once for every $\rho$ examples, SkeGD has a constant space complexity and a $O(s_m)$ time complexity at each round except for the $\lfloor\theta^{-1}\rfloor$ updating rounds, which are in $O(B + s_p^2 + s_p s_m)$ time complexity. Thus, the overall time complexity of SkeGD at the second stage is
\[
O\left(T s_m + \frac{T-B}{\rho}\big(\nu + s_p^2 + s_p s_m\big)\right).
\]

³Since we set $k, d < s_m < s_p$ in the experiments, we omit $k$ and $d$ in the complexity analysis for SkeGD.

Table 1. The computational complexities of SkeGD at each round, where $\nu = B + \lfloor\theta^{-1}\rfloor$ when $\rho = \lfloor\theta(T-B)\rfloor$ ($0 < \theta < 1$).

Item                    Round        Operation        Time             Space
KOGD                    $t < T_0$    --               $O(B)$           $O(B)$
$\Phi_{pp}^{(T_0)}$     $t = T_0$    Initialization   $O(Bs_p)$        $O(Bs_p)$
$\Phi_{pm}^{(T_0)}$     $t = T_0$    Initialization   $O(Bs_m)$        $O(Bs_m)$
$\Phi_{pp}^{(t)}$       $t > T_0$    Updating         $O(\nu)$         $O(s_p^2 + \nu)$
$\Phi_{pm}^{(t)}$       $t > T_0$    Updating         $O(s_m)$         $O(s_p s_m)$
$\Phi_{pp}^{(t)}$       $t > T_0$    Decomposition    $O(s_p^2)$       $O(s_p^2)$
$\Phi_{pm}^{(t)}$       $t > T_0$    Decomposition    $O(s_p s_m)$     $O(s_p s_m)$
$Q_t$                   $t > T_0$    Multiplication   $O(s_p s_m)$     $O(s_p s_m)$
$\phi_{t+1}(x)$         $t > T_0$    Multiplication   $O(s_m)$         $O(s_m)$

6. Experiments

In this section, we evaluate the effectiveness and efficiency of the proposed incremental randomized sketching for kernel approximation and online kernel learning.

6.1. Experimental Setups

All experiments are performed on a machine with a 4-core Intel Core i7 3.60 GHz CPU and 16GB memory. We compare our SkeGD with the state-of-the-art online kernel learning algorithms on well-known classification benchmarks⁴. We merge the training and testing data into a single dataset for each benchmark, which covers numbers of instances ranging from 1,000 to 581,012. All the experiments are performed over 20 different random permutations of the datasets. The stepsizes $\eta$ of all the gradient descent based algorithms are tuned in $10^{[-5:+1:0]}$, and the regularization parameters $\lambda$ are tuned in $10^{[-4:+1:1]}$. Besides, we use the Gaussian kernel $\kappa(x, x') = \exp\big(-\|x - x'\|_2^2/2\sigma^2\big)$, where $\sigma \in \{2^{[-5:+0.5:7]}\}$ is adopted as the candidate kernel set. Since kernel alignment (KA) is computationally efficient and independent of learning algorithms (Cortes et al., 2012; Ding et al., 2018), which makes it suitable for kernel selection in online kernel learning (Zhang & Liao, 2018), we select the optimal kernel using KA for each dataset.

⁴https://www.csie.ntu.edu.tw/~cjlin/libsvm
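Spelled out in code (our own reading of the MATLAB-style range notation above; illustrative only), the kernel and the candidate parameter grids are:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """kappa(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2)) for all pairs."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

etas = 10.0 ** np.arange(-5, 1)             # stepsizes 10^[-5:+1:0]
lambdas = 10.0 ** np.arange(-4, 2)          # regularization 10^[-4:+1:1]
sigmas = 2.0 ** np.arange(-5, 7.5, 0.5)     # kernel widths 2^[-5:+0.5:7]
```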
6.2. Kernel Approximation

First, we evaluate the performance of the proposed incremental randomized sketching (IRS) in terms of kernel approximation. The kernel values are approximated using the explicit feature mapping (4) in our IRS. We compare with the incremental Nyström (INys) approach (Lu et al., 2016), which constructs the fixed intersection matrix using the first $B$ examples in the sequence of $T$ instances. For our IRS, we set $s_p = 3B/4$, $s_m = B/2$, $k = 0.2B$ and $d = 4$. Table 2 shows the relative error $\|K_f^{(T)} - K\|_F^2/\|K\|_F^2$ and the running time on german and svmguide3 with different budgets $B$, where $K_f^{(T)}$ is the approximate kernel matrix after $T$ rounds. We can observe that IRS outperforms INys in terms of the relative approximation error on both datasets. Besides, our IRS performs a randomized SVD and only needs to sample $s_m = B/2$ examples while updating the incremental randomized sketches, hence it is more efficient.

Table 2. Results of kernel approximation using IRS and INys.

              german (B = 100)              german (B = 200)
Approach      Relative error     Time (s)   Relative error     Time (s)
INys          0.078 ± 0.012      0.054      0.047 ± 0.006      0.081
IRS           0.059 ± 0.008      0.047      0.031 ± 0.003      0.066

              svmguide3 (B = 100)           svmguide3 (B = 200)
Approach      Relative error     Time (s)   Relative error     Time (s)
INys          0.037 ± 0.001      0.083      0.031 ± 0.001      0.092
IRS           0.018 ± 0.001      0.065      0.012 ± 0.002      0.077
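A small helper, under our own naming, for the relative-error metric reported in Table 2:

```python
import numpy as np

def relative_approx_error(K, Phi):
    """||Phi Phi^T - K||_F^2 / ||K||_F^2, where the rows of Phi are the
    explicit features phi(x_i) obtained from the feature mapping (4)."""
    K_approx = Phi @ Phi.T
    return np.linalg.norm(K_approx - K, "fro") ** 2 / np.linalg.norm(K, "fro") ** 2
```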

Figure 2. The average mistake rates and average running time w.r.t. the budget B on cod-rna: (a) average mistake rate (%); (b) average running time (log10 time), for RBP, Forgetron, Projectron, Projectron++, BPA-S, BOGD, NOGD and SkeGD with budgets B from 100 to 500.

6.3. Online Learning under a Fixed Budget

To demonstrate the performance of SkeGD for online kernel learning, we compare SkeGD with the existing sketching and budgeted online kernel learning algorithms under a fixed budget $B$, including RBP (Cavallanti et al., 2007), Forgetron (Dekel et al., 2005), Projectron and Projectron++ (Orabona et al., 2008), BPA-S (Wang & Vucetic, 2010), BOGD (Zhao et al., 2012), and NOGD (Lu et al., 2016). The compared algorithms are obtained from the LIBOL v0.3.0 toolbox and the LSOKL toolbox⁵. For all the algorithms, we adopt the hinge loss function, and set $B = 100$ for small datasets ($T \le 10{,}000$) and $B = 200$ for the other datasets. Besides, we set $\tau = 0.2$, $s_p = 3B/4$, $s_m = \tau s_p$, $d = 4$ and $\rho = 0.3T$ in our SkeGD if not specially specified, and the rank $k = 0.1B$ for NOGD and SkeGD. The mistake rate is used to evaluate the accuracy of online kernel learning, which is computed by $\sum_{t=1}^{T}\mathbb{I}(y_t f_t(x_t) < 0)/T \times 100$.

⁵http://libol.stevenhoi.org/ and http://lsokl.stevenhoi.org/

Experimental results from Table 3 show that SkeGD consistently performs better than the other algorithms in terms of the mistake rate of online kernel learning. For most datasets, SkeGD is more efficient than the other algorithms under the same budget $B$, because the time complexity per round of SkeGD is $O(s_m)$ ($s_m < B$) for most rounds, as shown in the complexity analysis, while the compared algorithms are in $\Omega(B)$ time complexity at each round. Then, we set $s_p = 3B/4$, $s_m = B/2$ and study the influence of the budget $B$. From Figure 2, we observe that an increasing budget $B$ yields a lower mistake rate but leads to a higher computation time cost. This is because a larger budget $B$ means that more examples are preserved to construct the explicit feature mapping, and a higher computation cost is needed to obtain a better approximation quality.

For fixed budgets ($B = 200, 300, 400$), we set $s_p = 3B/4$, $s_m = \tau s_p$ and vary $\tau$ from 0.2 to 0.8. From Figure 3 (a), (b) we can see that SkeGD with smaller $\tau$ yields better empirical performance in terms of the average mistake rate and the average running time, which conforms to the theoretical results. For different $B$, we set $s_p = 3B/4$, $s_m = B/2$, and explore the influence of $d$, the number of blocks in the sketch matrix $S_p$. The results in Figure 3 (c), (d) verify that a large $d$ can significantly improve the accuracy of online kernel learning with little sacrifice of the time efficiency, where $d$ is equivalent to the number of nonzero entries in each row of $S_p$.

Figure 3. (a), (b): average mistake rate and average running time (log10 time) w.r.t. $\tau = s_m/s_p$ on cod-rna; (c), (d): average mistake rate and average running time w.r.t. the number of blocks $d$ on spambase, for SkeGD with B = 200, 300, 400.

Table 3. Comparisons among RBP, Forgetron, Projectron, Projectron++, BPA-S, BOGD, NOGD and our SkeGD w.r.t. the mistake rates (%) and the running time (s).

                 german                      svmguide3                   spambase                    a9a
Algorithm        Mistake rate      Time      Mistake rate      Time      Mistake rate      Time      Mistake rate      Time
RBP              38.924 ± 1.754    0.071     34.091 ± 1.230    0.080     35.533 ± 0.769    0.391     28.297 ± 0.115    15.338
Forgetron        38.387 ± 1.345    0.089     35.023 ± 1.161    0.119     36.082 ± 0.692    0.502     28.977 ± 0.380    19.503
Projectron       37.008 ± 1.232    0.069     34.172 ± 0.989    0.081     33.627 ± 1.531    0.381     21.376 ± 0.315    15.310
Projectron++     35.396 ± 1.262    0.162     28.902 ± 0.650    0.161     32.424 ± 0.932    0.928     18.702 ± 0.389    23.525
BPA-S            35.345 ± 1.119    0.072     28.948 ± 0.765    0.081     33.359 ± 1.351    0.392     21.840 ± 0.391    15.304
BOGD             33.985 ± 2.013    0.085     30.013 ± 1.109    0.088     33.112 ± 0.692    0.447     29.614 ± 0.244    15.569
NOGD             29.881 ± 0.510    0.075     22.891 ± 0.043    0.085     34.332 ± 1.671    0.396     18.843 ± 0.366    15.416
SkeGD            27.932 ± 0.481    0.076     21.388 ± 0.122    0.082     31.301 ± 0.531    0.322     18.524 ± 0.283    11.736

                 w7a                         ijcnn1                      cod-rna                     covtype.binary
Algorithm        Mistake rate      Time      Mistake rate      Time      Mistake rate      Time      Mistake rate      Time
RBP              5.246 ± 0.034     17.753    16.702 ± 0.120    11.887    19.359 ± 0.114    46.182    41.321 ± 0.024    217.216
Forgetron        5.552 ± 0.045     18.579    17.092 ± 0.203    13.648    18.874 ± 0.048    54.248    41.334 ± 0.076    299.276
Projectron       4.749 ± 0.455     17.639    12.231 ± 0.020    11.682    16.138 ± 0.452    50.473    32.748 ± 0.424    226.050
Projectron++     4.236 ± 1.042     17.101    9.588 ± 0.012     17.930    12.743 ± 0.560    162.823   30.749 ± 0.311    466.637
BPA-S            3.204 ± 0.010     17.937    11.368 ± 0.035    11.610    13.967 ± 0.235    47.952    36.538 ± 0.409    215.262
BOGD             3.482 ± 0.133     17.969    12.064 ± 0.079    11.993    13.818 ± 0.049    53.349    41.858 ± 0.014    223.180
NOGD             2.903 ± 0.009     17.923    9.540 ± 0.003     11.984    12.393 ± 0.583    56.845    34.021 ± 0.617    216.327
SkeGD            2.782 ± 0.001     7.452     9.348 ± 0.001     8.034     11.638 ± 0.050    25.918    30.493 ± 1.063    142.739

6.4. Online Learning in Adversarial Environments

In this experiment, we compare SkeGD with NOGD, FOGD (Lu et al., 2016) and PROS-N-KONS (Calandriello et al., 2017a) in adversarial environments. We set the budget $B = 100$ for SkeGD and NOGD, and use identical parameters to those used in Section 6.3. Besides, we use the same experimental settings for FOGD (feature dimension = 4B) and PROS-N-KONS as provided in the original papers. Inspired by the adversarial settings in (Calandriello et al., 2017a; Wang et al., 2018), we build adversarial datasets using the benchmark german. We decompose the online learning game into $k_b$ blocks, where each block includes $k_r$ rounds. At each round of the same block, we receive the same instance sampled from german. Moreover, the revealed labels are flipped at each even block by multiplying by $-1$. Thus, the sample size of the adversarial dataset is $k_b \times k_r$. We set $k_b = 500$, $k_r = 10$ in german-1, and $k_b = 500$, $k_r = 20$ in german-2.
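A sketch of how such an adversarial stream can be generated (function and variable names are ours, and the random placeholder matrix stands in for the german features):

```python
import numpy as np

def build_adversarial_dataset(X, y, k_b, k_r, rng):
    """k_b blocks of k_r rounds each: every round of a block repeats one
    instance sampled from the base dataset, and the revealed labels of
    every even block (1-indexed) are flipped."""
    xs, ys = [], []
    for b in range(k_b):
        i = rng.integers(0, X.shape[0])            # one instance per block
        sign = -1.0 if b % 2 == 1 else 1.0         # flip labels on even blocks
        xs.extend([X[i]] * k_r)
        ys.extend([sign * y[i]] * k_r)
    return np.asarray(xs), np.asarray(ys)

# german-1: k_b = 500, k_r = 10; german-2: k_b = 500, k_r = 20
rng = np.random.default_rng(0)
X_base = rng.standard_normal((1000, 24))           # placeholder for german features
y_base = rng.choice([-1.0, 1.0], size=1000)
X_adv, y_adv = build_adversarial_dataset(X_base, y_base, k_b=500, k_r=10, rng=rng)
print(X_adv.shape, y_adv.shape)                    # (5000, 24), (5000,)
```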

Table 4. Comparison of online kernel learning algorithms in adversarial environments w.r.t. the mistake rates (%) and the running time (s), where $\rho = \lfloor\theta(T-B)\rfloor$ is the update cycle of SkeGD.

                          german-1                      german-2
Algorithm                 Mistake rate        Time      Mistake rate        Time
FOGD                      37.493 ± 0.724      0.140     32.433 ± 0.196      0.265
NOGD                      30.918 ± 0.003      0.405     26.737 ± 0.002      0.778
PROS-N-KONS               27.633 ± 0.416      33.984    17.737 ± 0.900      98.873
SkeGD (θ = 0.1)           17.320 ± 0.136      0.329     7.865 ± 0.059       0.597
SkeGD (θ = 0.01)          17.272 ± 0.112      0.402     7.407 ± 0.086       0.633
SkeGD (θ = 0.005)         16.578 ± 0.360      0.484     7.266 ± 0.065       0.672
SkeGD (θ = 0.001)         16.687 ± 0.155      1.183     6.835 ± 0.136       1.856

From Table 4, we can observe that SkeGD is significantly more accurate than the other algorithms in adversarial environments, which demonstrates the effectiveness of our incremental randomized sketches. We also notice that SkeGD is much more efficient than the second-order algorithm PROS-N-KONS, and has time costs comparable to the other first-order algorithms. Additionally, too large or too small values of the update cycle may lead to an increased loss, which conforms to Remark 2, and $\rho \approx 0.005T$ performs well in terms of accuracy while having a comparable efficiency.

7. Conclusion

In this paper, we have proposed the novel incremental randomized sketching for online kernel learning, which maintains the incremental randomized sketches using efficient rank-1 modifications and constructs a time-varying explicit feature mapping that is suitable for adversarial environments. The proposed sketching approach preserves the kernel matrix approximation accuracy with a $1+\epsilon$ relative-error, has lower bounds of the sketch sizes that are independent of the number of rounds, and enjoys a sublinear regret bound for online kernel learning. Under the same assumptions about kernel eigenvalue decay, and in contrast to second-order online kernel learning via sketching whose per-round computational complexities depend on the number of rounds, our sketching has constant computational complexities at each round, independently of the number of rounds. The theoretical formulation and algorithmic implementation may provide a sketch scheme for both online and offline large-scale kernel learning.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61673293).

References

Belkin, M. Approximation beats concentration? An approximation view on inference with smooth radial kernels. In Proceedings of the 31st Conference on Learning Theory, pp. 1348–1361, 2018.

Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics, pp. 177–186, 2010.

Calandriello, D., Lazaric, A., and Valko, M. Efficient second-order online kernel learning with adaptive embedding. In Advances in Neural Information Processing Systems 30, pp. 6140–6150, 2017a.

Calandriello, D., Lazaric, A., and Valko, M. Second-order kernel online convex optimization with adaptive sketching. In Proceedings of the 34th International Conference on Machine Learning, pp. 645–653, 2017b.

Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2):143–167, 2007.

Charikar, M., Chen, K., and Farach-Colton, M. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004.

Cortes, C., Mohri, M., and Rostamizadeh, A. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13:795–828, 2012.

Crammer, K., Kandola, J., and Singer, Y. Online classification on a budget. In Advances in Neural Information Processing Systems 16, pp. 225–232, 2003.

Dekel, O., Shalev-Shwartz, S., and Singer, Y. The forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems 18, pp. 259–266, 2005.

Ding, L., Liao, S., Liu, Y., Yang, P., and Gao, X. Randomized kernel selection with spectra of multilevel circulant matrices. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 2910–2917, 2018.

Engel, Y., Mannor, S., and Meir, R. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285, 2004.

Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

Kane, D. M. and Nelson, J. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1):4:1–23, 2014.

Kivinen, J., Smola, A. J., and Williamson, R. C. Online learning with kernels. In Advances in Neural Information Processing Systems 14, pp. 785–792, 2001.

Kumar, S., Mohri, M., and Talwalkar, A. On sampling-based approximate spectral decomposition. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 553–560, 2009.

Liu, Y. and Liao, S. Eigenvalues ratio for kernel selection of kernel methods. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2814–2820, 2015.

Lu, J., Hoi, S. C., Wang, J., Zhao, P., and Liu, Z. Large scale online kernel learning. Journal of Machine Learning Research, 17:1613–1655, 2016.

Orabona, F., Keshet, J., and Caputo, B. The projectron: A bounded kernel-based perceptron. In Proceedings of the 25th International Conference on Machine Learning, pp. 720–727, 2008.

Raskutti, G. and Mahoney, M. Statistical and algorithmic perspectives on randomized sketching for ordinary least-squares. In Proceedings of the 32nd International Conference on Machine Learning, pp. 617–625, 2015.

Sarlós, T. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pp. 143–152, 2006.

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

Sun, Y., Schmidhuber, J., and Gomez, F. J. On the size of the online kernel sparsification dictionary. In Proceedings of the 29th International Conference on Machine Learning, pp. 329–336, 2012.

Wang, G., Zhao, D., and Zhang, L. Minimizing adaptive regret with one gradient per iteration. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2762–2768, 2018.

Wang, S. and Zhang, Z. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14:2729–2769, 2013.

Wang, S., Zhang, Z., and Zhang, T. Towards more efficient SPSD matrix approximation and CUR matrix decomposition. Journal of Machine Learning Research, 17:1–49, 2016.

Wang, Z. and Vucetic, S. Online passive-aggressive algorithms on a budget. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 908–915, 2010.

Wang, Z., Crammer, K., and Vucetic, S. Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research, 13:3103–3131, 2012.

Williams, C. and Seeger, M. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14, pp. 682–688, 2001.

Woodruff, D. P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

Zhang, X. and Liao, S. Online kernel selection via incremental sketched kernel alignment. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3118–3124, 2018.

Zhao, P., Wang, J., Wu, P., Jin, R., and Hoi, S. C. Fast bounded online gradient descent algorithms for scalable kernel-based online learning. In Proceedings of the 29th International Conference on Machine Learning, pp. 169–176, 2012.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp. 928–936, 2003.