Polynomial Tensor Sketch for Element-wise Function of Low-Rank Matrix
Insu Han 1   Haim Avron 2   Jinwoo Shin 3

1 School of Electrical Engineering, KAIST, Daejeon, Korea   2 School of Mathematical Sciences, Tel Aviv University, Israel   3 Graduate School of AI, KAIST, Daejeon, Korea. Correspondence to: Jinwoo Shin <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s). arXiv:1905.11616v3 [cs.LG] 29 Jun 2020.

Abstract

This paper studies how to sketch element-wise functions of low-rank matrices. Formally, given a low-rank matrix $A = [A_{ij}]$ and a scalar non-linear function $f$, we aim to find an approximate low-rank representation of the (possibly high-rank) matrix $[f(A_{ij})]$. To this end, we propose an efficient sketching-based algorithm whose complexity is significantly lower than the number of entries of $A$, i.e., it runs without accessing all entries of $[f(A_{ij})]$ explicitly. The main idea underlying our method is to combine a polynomial approximation of $f$ with the existing tensor sketch scheme for approximating monomials of entries of $A$. To balance the errors of the two approximation components in an optimal manner, we propose a novel regression formula to find polynomial coefficients given $A$ and $f$. In particular, we utilize a coreset-based regression with a rigorous approximation guarantee. Finally, we demonstrate the applicability and superiority of the proposed scheme under various machine learning tasks.

1. Introduction

Given a low-rank matrix $A = [A_{ij}] \in \mathbb{R}^{n \times n}$ with $A = UV^\top$ for some matrices $U, V \in \mathbb{R}^{n \times d}$ with $d \ll n$, and a scalar non-linear function $f : \mathbb{R} \to \mathbb{R}$, we are interested in the following element-wise matrix function:*

$$f^{\odot}(A) \in \mathbb{R}^{n \times n}, \qquad \text{where } f^{\odot}(A)_{ij} := f(A_{ij}).$$

Our goal is to design a fast algorithm for computing tall-and-thin matrices $T_U, T_V$ in time $o(n^2)$ such that $f^{\odot}(A) \approx T_U T_V^\top$. In particular, our algorithm should run without computing all entries of $A$ or $f^{\odot}(A)$ explicitly. As a side benefit, we obtain an $o(n^2)$-time approximation scheme $f^{\odot}(A)x \approx T_U T_V^\top x$ for an arbitrary vector $x \in \mathbb{R}^n$, due to the associative property of matrix multiplication, whereas the exact computation of $f^{\odot}(A)x$ requires $\Theta(n^2)$ time.

*In this paper, we primarily focus on the square matrix $A$ for simplicity, but it is straightforward to extend our results to the case of non-square matrices.
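To make the cost comparison in the paragraph above concrete, here is a minimal NumPy sketch (an illustration only, not the paper's algorithm): the tall-and-thin factors `T_U`, `T_V` below are random placeholders standing in for the output of such an algorithm, and the point is purely the $\Theta(n^2)$ versus $O(nm)$ cost structure enabled by associativity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 30, 50

# Low-rank input A = U V^T; the fast path never materializes A itself.
U = rng.standard_normal((n, d)) / np.sqrt(d)
V = rng.standard_normal((n, d)) / np.sqrt(d)
x = rng.standard_normal(n)

# Exact computation of f_odot(A) x with f = exp: forms all n^2 entries.
exact = np.exp(U @ V.T) @ x                 # Theta(n^2) time and memory

# Fast path: given tall-and-thin T_U, T_V with f_odot(A) ~= T_U T_V^T,
# associativity gives T_U (T_V^T x) without ever forming an n x n matrix.
# (Random placeholders here; constructing good T_U, T_V is the paper's topic.)
T_U = rng.standard_normal((n, m))
T_V = rng.standard_normal((n, m))
approx = T_U @ (T_V.T @ x)                  # O(nm) time and memory
```

With placeholder factors the two results of course do not match; the snippet only illustrates why a factorization $f^{\odot}(A) \approx T_U T_V^\top$ immediately yields fast matrix-vector products.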
The matrix-vector multiplication $f^{\odot}(A)x$ or the low-rank decomposition $f^{\odot}(A) \approx T_U T_V^\top$ is useful in many machine learning algorithms. For example, the Gram matrices of certain kernel functions, e.g., the polynomial and radial basis function (RBF) kernels, are element-wise matrix functions where the rank is the dimension of the underlying data. Such matrices are the cornerstone of so-called kernel methods, and the ability to multiply the Gram matrix by a vector suffices for most kernel learning. The Sinkhorn-Knopp algorithm is a powerful, yet simple, tool for computing optimal transport distances (Cuturi, 2013; Altschuler et al., 2017) and also involves the matrix-vector multiplication $f^{\odot}(A)x$ with $f(x) = \exp(x)$. Finally, $f^{\odot}(UV^\top)$ can also describe the non-linear computation of activations in a layer of deep neural networks (Goodfellow et al., 2016), where $U$, $V$ and $f$ correspond to the input, weight and activation function (e.g., sigmoid or ReLU) of the previous layer, respectively.

Unlike the element-wise matrix function $f^{\odot}(A)$, a traditional matrix function $f(A)$ is defined on the eigenvalues of $A$ (Higham, 2008) and possesses clean algebraic properties, i.e., it preserves eigenvectors. For example, given a diagonalizable matrix $A = PDP^{-1}$, it is defined as $f(A) = P f^{\odot}(D) P^{-1}$. A classical problem addressed in the literature is that of approximating the trace of $f(A)$ (or $f(A)x$) efficiently (Han et al., 2017; Ubaru et al., 2017). However, these methods are not applicable to our problem because element-wise matrix functions are fundamentally different from traditional matrix functions, e.g., they do not preserve the spectral structure. We are unaware of any approximation algorithm that targets general element-wise matrix functions. The special case of element-wise kernel functions such as $f(x) = \exp(x)$ and $f(x) = x^k$ has been addressed in the literature, e.g., using random Fourier features (RFF) (Rahimi & Recht, 2008; Pennington et al., 2015), the Nyström method (Williams & Seeger, 2001), sparse greedy approximation (Smola & Schölkopf, 2000) and sketch-based methods (Pham & Pagh, 2013; Avron et al., 2014; Meister et al., 2019; Ahle et al., 2020), to name a few. We aim not only to design an approximation framework for general $f$, but also to outperform the previous approximations (e.g., RFF) even for the special case $f(x) = \exp(x)$.

The high-level idea underlying our method is to combine two approximation schemes: TENSORSKETCH, which approximates the element-wise matrix function of the monomial $x^k$, and a degree-$r$ polynomial approximation $p_r(x) = \sum_{j=0}^{r} c_j x^j$ of the function $f$, e.g., $f(x) = \exp(x) \approx p_r(x) = \sum_{j=0}^{r} \frac{1}{j!} x^j$. More formally, we consider the following approximation:

$$f^{\odot}(A) \overset{(a)}{\approx} \sum_{j=0}^{r} c_j \cdot (x^j)^{\odot}(A) \overset{(b)}{\approx} \sum_{j=0}^{r} c_j \cdot \mathrm{TENSORSKETCH}\big((x^j)^{\odot}(A)\big),$$

which we call POLY-TENSORSKETCH. This is a linear-time approximation scheme with respect to $n$ under the choice of $r = O(1)$. Here, a non-trivial challenge occurs: a larger degree $r$ is required to approximate an arbitrary function $f$ better, while the approximation error of TENSORSKETCH is known to increase exponentially with respect to $r$ (Pham & Pagh, 2013; Avron et al., 2014). Hence, it is important to choose good $c_j$'s to balance the two approximation errors in (a) and (b). The known (truncated) polynomial expansions such as Taylor or Chebyshev (Mason & Handscomb, 2002) are far from optimal for this purpose (see Sections 3.2 and 4.1 for more details).
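As a small, self-contained illustration of step (a) alone (a hedged sketch, not the proposed method: the monomial terms are formed explicitly and the coefficients are plain truncated-Taylor coefficients, which the text above notes are suboptimal), the snippet below checks that $\sum_{j=0}^{r} A_{ij}^j / j!$ approaches $\exp(A_{ij})$ entry-wise. POLY-TENSORSKETCH instead estimates each term $(x^j)^{\odot}(A)$ with TENSORSKETCH and chooses the $c_j$'s by the regression described next.

```python
import math
import numpy as np

def poly_elementwise(A, coeffs):
    """Step (a) only: approximate f_odot(A) by sum_j c_j * (A ** j), entry-wise.

    POLY-TENSORSKETCH never forms these monomial matrices explicitly;
    each term would be replaced by its TENSORSKETCH estimate.
    """
    out = np.zeros_like(A)
    for j, c in enumerate(coeffs):
        out += c * A**j
    return out

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(300, 300))      # entries in [-1, 1]

r = 6
taylor = [1.0 / math.factorial(j) for j in range(r + 1)]   # exp(x) ~= sum_j x^j / j!
err = np.max(np.abs(poly_elementwise(A, taylor) - np.exp(A)))
print(err)                                        # small, and shrinks as r grows
```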
To address this challenge, we formulate an optimization problem on the polynomial coefficients (the $c_j$'s) by relating them to the approximation error of POLY-TENSORSKETCH. However, this optimization problem is intractable since the objective involves an expectation taken over random variables whose supports are exponentially large. Thus, we derive a novel tractable alternative optimization problem based on an upper bound on the approximation error of POLY-TENSORSKETCH. This problem turns out to be an instance of generalized ridge regression (Hemmerle, 1975), and as such has a closed-form solution. Furthermore, we observe that the regularization term effectively forces the coefficients to decay exponentially, which compensates for the exponentially growing errors of TENSORSKETCH with respect to the degree, while simultaneously maintaining a good polynomial approximation to the scalar function $f$ on the entries of $A$. We further reduce the complexity by regressing only over a subset of the matrix entries found by an efficient coreset clustering algorithm (Har-Peled, 2008): if the matrix entries are well clustered, the coreset-based regression comes with a rigorous approximation guarantee. Finally, we use the Chebyshev polynomial basis instead of monomials, i.e., $p_r(x) = \sum_{j=0}^{r} c_j t_j(x)$ for the Chebyshev polynomial $t_j(x)$ of degree $j$, in order to resolve numerical issues arising for large degrees.

We evaluate the approximation quality of our algorithm under RBF kernels on synthetic and real-world datasets. We then apply the proposed method to classification using kernel SVM (Cristianini et al., 2000; Schölkopf & Smola, 2001) and to the computation of optimal transport distances (Cuturi, 2013), both of which require computing element-wise matrix functions with $f(x) = \exp(x)$. Our experimental results confirm that our scheme is an order of magnitude faster than the exact method with a marginal loss of accuracy. Furthermore, our scheme also significantly outperforms a state-of-the-art approximation method, RFF, for the aforementioned applications. Finally, we demonstrate the wide applicability of our method by applying it to the linearization of neural networks, which requires computing element-wise matrix functions with $f(x) = \mathrm{sigmoid}(x)$.

2. Preliminaries

In this section, we provide background on the randomized sketching algorithms COUNTSKETCH and TENSORSKETCH, which are crucial components of the proposed scheme. First, COUNTSKETCH (Charikar et al., 2002; Weinberger et al., 2009) was proposed for effective dimensionality reduction of a high-dimensional vector $u \in \mathbb{R}^d$. Formally, consider a random hash function $h : [d] \to [m]$ and a random sign function $s : [d] \to \{-1, +1\}$, where $[d] := \{1, \dots, d\}$. Then, COUNTSKETCH transforms $u$ into $C_u \in \mathbb{R}^m$ such that $[C_u]_j := \sum_{i : h(i) = j} s(i)\, u_i$ for $j \in [m]$. The algorithm takes $O(d)$ time since it requires only a single pass over the input. It is known that applying the same COUNTSKETCH transform to two vectors preserves their dot-product in expectation, i.e., $\langle u, v \rangle = \mathbb{E}\left[\langle C_u, C_v \rangle\right]$.

TENSORSKETCH (Pham & Pagh, 2013) was proposed as a generalization of COUNTSKETCH to tensor products of vectors. Given $u \in \mathbb{R}^d$ and a degree $k$, consider i.i.d. random hash functions $h_1, \dots, h_k : [d] \to [m]$ and i.i.d. random sign functions $s_1, \dots, s_k : [d] \to \{-1, +1\}$. TENSORSKETCH applied to $u$ is defined as the $m$-dimensional vector $T_u^{(k)} \in \mathbb{R}^m$ such that, for $j \in [m]$,

$$[T_u^{(k)}]_j := \sum_{H(i_1, \dots, i_k) = j} s_1(i_1) \cdots s_k(i_k)\, u_{i_1} \cdots u_{i_k},$$

where $H(i_1, \dots, i_k) \equiv h_1(i_1) + \cdots + h_k(i_k) \pmod{m}$. In (Pham & Pagh, 2013; Avron et al., 2014; Pennington et al., 2015; Meister et al., 2019), TENSORSKETCH was used for approximating the power of the dot-product between vectors.
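The brute-force NumPy sketch below implements the two definitions above literally (for illustration only; it is not the authors' code, and practical implementations compute TENSORSKETCH via FFTs of $k$ COUNTSKETCHes rather than enumerating all $d^k$ index tuples). It checks the stated properties: $\langle u, v \rangle \approx \langle C_u, C_v \rangle$ and $\langle T_u^{(k)}, T_v^{(k)} \rangle \approx \langle u, v \rangle^k$.

```python
import itertools
import numpy as np

def countsketch(u, h, s, m):
    """[C_u]_j = sum_{i : h(i) = j} s(i) * u_i  -- one O(d) pass over u."""
    c = np.zeros(m)
    for i, u_i in enumerate(u):
        c[h[i]] += s[i] * u_i
    return c

def tensorsketch(u, hs, ss, m):
    """Degree-k TENSORSKETCH, directly from the definition (O(d^k); demo only).

    [T_u]_j = sum over tuples (i_1, ..., i_k) with h_1(i_1)+...+h_k(i_k) = j (mod m)
              of s_1(i_1) ... s_k(i_k) * u_{i_1} ... u_{i_k}.
    """
    d, k = len(u), len(hs)
    t = np.zeros(m)
    for idx in itertools.product(range(d), repeat=k):
        j = sum(hs[l][i] for l, i in enumerate(idx)) % m
        t[j] += np.prod([ss[l][i] * u[i] for l, i in enumerate(idx)])
    return t

rng = np.random.default_rng(0)
d, m, k = 6, 32, 3
u, v = rng.standard_normal(d), rng.standard_normal(d)

# COUNTSKETCH: the *same* hash/sign pair is applied to both vectors.
h = rng.integers(0, m, size=d)
s = rng.choice([-1.0, 1.0], size=d)
print(u @ v, countsketch(u, h, s, m) @ countsketch(v, h, s, m))    # <u,v> ~= <C_u, C_v>

# TENSORSKETCH: k i.i.d. hash/sign pairs, shared by both vectors.
hs = [rng.integers(0, m, size=d) for _ in range(k)]
ss = [rng.choice([-1.0, 1.0], size=d) for _ in range(k)]
print((u @ v) ** k,
      tensorsketch(u, hs, ss, m) @ tensorsketch(v, hs, ss, m))     # <u,v>^k ~= <T_u, T_v>
```

Both checks hold only in expectation for a single draw of the hash and sign functions; averaging over independent sketches tightens the estimates.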