
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Approximate Manifold Regularization: Scalable Algorithm and Generalization Analysis

Jian Li^{1,2}, Yong Liu^1, Rong Yin^{1,2} and Weiping Wang^{1,*}
^1 Institute of Information Engineering, Chinese Academy of Sciences
^2 School of Cyber Security, University of Chinese Academy of Sciences
{lijian9026, liuyong, yinrong, [email protected]}
* Corresponding author

Abstract

Graph-based semi-supervised learning is one of the most popular and successful semi-supervised learning approaches. Unfortunately, it suffers from high time and space complexity, at least quadratic in the number of training samples. In this paper, we propose an efficient graph-based semi-supervised algorithm with a sound theoretical guarantee. The proposed method combines Nyström subsampling and preconditioned conjugate gradient descent, substantially improving computational efficiency and reducing memory requirements. Extensive empirical results reveal that our method achieves state-of-the-art performance in a short time even with limited computing resources.

1 Introduction

Recently, the explosive growth of computing power and network applications has made data generation and acquisition much easier. However, most of the collected data are unlabeled, while data annotation is laborious. Semi-supervised learning (SSL) methods are therefore developed to estimate a learner from a few labeled samples together with a large amount of unlabeled data; examples include transductive support vector machines [Joachims, 1999] and graph-based methods [Belkin et al., 2006; Camps-Valls et al., 2007]. Graph-based manifold regularization methods have drawn wide attention in the SSL area due to their good performance and relative simplicity of implementation [Belkin et al., 2006]. Despite those advantages, manifold regularization remains challenging on gigantic datasets because of its high computational complexity: kernel-matrix-related operations cost at least O(n²) and construction of the graph Laplacian costs at least O(n log n), where n is the total sample size.

To tackle these scalability issues, many approaches have been proposed [Liu et al., 2012; Jiang et al., 2017; Liu et al., 2019]: (1) Accelerate construction of the Laplacian graph. Methods based on fast spectral decomposition of the Laplacian matrix have been well studied in [Talwalkar et al., 2013], using a few eigenvalues of the graph Laplacian to represent the manifold structure. Graph sparsification approaches were devised to approximate the Laplacian graph by a line or spanning tree [Cesa-Bianchi et al., 2013] and were improved by minimizing tree cut (MTC) in [Zhang et al., 2016]. (2) Accelerate operations associated with the kernel matrix. Several distributed approaches have been applied to semi-supervised learning [Chang et al., 2017], decomposing a large-scale problem into smaller ones. Anchor Graph regularization (Anchor) constructs an anchor graph with the training samples and a few anchor points to approximate the Laplacian graph [Liu et al., 2010]. The work of [McWilliams et al., 2013; Rastogi and Sampath, 2017] applied random projections, including Nyström methods and random features, to manifold regularization. Gradient methods have been introduced to solve manifold regularization in the primal, such as preconditioned conjugate gradient [Melacci and Belkin, 2011] and stochastic gradient descent [Wang et al., 2012].

In this paper, we focus on the latter scalability issue. With sound theoretical guarantees, we devise a novel graph-based SSL framework that substantially reduces computational time and memory requirements. More precisely, our approach approximates Laplacian regularized least squares (LapRLS) by Nyström methods and then accelerates the solution with preconditioned conjugate gradient methods. It is a non-trivial extension of FALKON [Rudi et al., 2017] to graph-based SSL, with technical challenges in both algorithm design and theoretical analysis. Theoretical analysis demonstrates that O(√m) labeled samples and O(log m) iterations (m being the number of labeled samples) can guarantee good statistical properties. Complexity analysis shows that our method solves LapRLS in O(n√n) time and O(n) space (n being the total number of samples).
2 Related Work

To overcome the computational and memory bottlenecks of LapRLS, practical algorithms have been developed, including Nyström methods [Williams and Seeger, 2001], whose statistical properties are well studied in [Rastogi and Sampath, 2017], and preconditioned conjugate gradient (PCG), which reduces the number of iterations [Cutajar et al., 2016]. The FALKON approach combines Nyström methods and PCG in supervised learning [Rudi et al., 2017]. Our work extends this combination to SSL with high computational gains and sound statistical guarantees. The approach improves computational efficiency from O(n³) to O(n√n) and reduces memory cost from O(n²) to O(n).

3 Preliminaries

3.1 Problem Definition

Assume there is a fixed but unknown distribution ρ on X × Y, where X = R^d and Y = R. Further, m labeled samples (x_1, y_1), ..., (x_m, y_m) ∈ X × Y are drawn i.i.d. from ρ, and n − m unlabeled samples x_{m+1}, ..., x_n ∈ X are drawn i.i.d. according to the marginal distribution ρ_X of ρ.

3.2 Manifold Regularization

Manifold learning based on the spectral graph, known as graph-based SSL, is a typical approach to semi-supervised learning [Zhu et al., 2003; Belkin et al., 2006]; it seeks a smooth low-dimensional manifold, embedded in the high-dimensional vector space, that underlies the sample points. In particular, Laplacian regularization [Belkin et al., 2006] is extensively used in graph-based SSL.

For a Mercer kernel K : X × X → R, there is an associated reproducing kernel Hilbert space (RKHS) H of functions f : X → R with corresponding norm ‖·‖_H. Manifold regularization considers the following optimization problem:

$$\hat{f}_{\lambda} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{m} \ell(y_i, f(x_i)) + \lambda_A \|f\|_{\mathcal{H}}^2 + \lambda_I \mathbf{f}^{\top} L \mathbf{f}, \tag{1}$$

where ℓ is the loss function, L = D − W is the graph Laplacian, f = [f(x_1), ..., f(x_n)]^⊤, λ_A controls the complexity of the function in the ambient space, and λ_I controls the complexity of the function in the intrinsic space. Here, W ∈ R^{n×n} records the undirected weights between points, and the diagonal matrix D is given by D_ii = Σ_{j=1}^n W_ij. The minimizer of the optimization problem (1) admits an expansion in terms of both labeled and unlabeled data:

$$\hat{f}_{\lambda}(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x).$$

3.3 Laplacian Regularized Least Squares (LapRLS)

With the squared loss function, problem (1) becomes LapRLS:

$$\arg\min_{f \in \mathcal{H}} \sum_{i=1}^{m} \big(y_i - f(x_i)\big)^2 + \lambda_A \|f\|_{\mathcal{H}}^2 + \lambda_I \mathbf{f}^{\top} L \mathbf{f}. \tag{2}$$

Setting the derivative of the objective function to zero leads to the closed-form solution

$$\hat{\alpha} = (JK + \lambda_A I + \lambda_I LK)^{-1} \mathbf{y}_n, \tag{3}$$

where K_ij = K(x_i, x_j) is the n × n kernel matrix on the training data, J = diag(1, ..., 1, 0, ..., 0) has its first m diagonal entries equal to 1 and the rest 0, and y_n = [y_1, y_2, ..., y_m, 0, ..., 0]^⊤ contains the m labels with the remaining entries filled by 0. Note that when λ_I = 0, Equation (3) gives zero coefficients over the unlabeled data, and the form reduces to standard RLS.
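As a concrete reference point for the exact solver that Section 4 accelerates, the following is a minimal NumPy sketch of the closed-form solution (3). The Gaussian kernel, the k-nearest-neighbor weight graph, and all function and variable names are illustrative assumptions of this sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph


def exact_laprls(X, y_labeled, lam_A, lam_I, gamma=1.0, k=10):
    """Closed-form LapRLS coefficients via Eq. (3).

    X         : (n, d) array, the m labeled samples first, then the unlabeled ones.
    y_labeled : (m,) labels for the first m rows of X.
    The n x n kernel matrix makes this O(n^2) in memory and O(n^3) in time,
    which is exactly the bottleneck the Nystrom-PCG method is designed to avoid.
    """
    n, m = X.shape[0], y_labeled.shape[0]

    K = rbf_kernel(X, X, gamma=gamma)                      # n x n kernel matrix

    # Graph Laplacian L = D - W from a symmetrized k-NN connectivity graph (an assumed choice).
    W = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W

    # J selects the labeled block; y_n pads the labels with zeros (Section 3.3).
    J = np.diag(np.concatenate([np.ones(m), np.zeros(n - m)]))
    y_n = np.concatenate([y_labeled, np.zeros(n - m)])

    # Eq. (3): alpha = (J K + lam_A I + lam_I L K)^{-1} y_n
    A = J @ K + lam_A * np.eye(n) + lam_I * (L @ K)
    return np.linalg.solve(A, y_n)
```

Predictions then use the full expansion f(x) = Σ_{i=1}^n α_i K(x_i, x) over all labeled and unlabeled points, which is what makes both the storage and the solve expensive at scale.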
4 Algorithm

We devise a fast and scalable graph-based semi-supervised learning framework, Nyström-PCG, shown as Algorithm 1, which consists of two steps: (1) apply Nyström subsampling with uniform sampling on the training data to the LapRLS problem, resulting in a linear system Hα = z; (2) define a preconditioner P to approximate H and accelerate the iterative solution of this linear system.

4.1 Nyström Subsampling on LapRLS

We consider Nyström subsampling to reduce the memory requirement; it uses a smaller matrix obtained from random column sampling to approximate the empirical kernel matrix. Thus, a smaller hypothesis space H_s is introduced:

$$\mathcal{H}_s = \Big\{ f \in \mathcal{H} \;\Big|\; f = \sum_{i=1}^{s} \alpha_i K(\tilde{x}_i, \cdot),\; \alpha \in \mathbb{R}^s \Big\},$$

where s ≤ n and x^s = (x̃_1, ..., x̃_s) are the Nyström centers selected by uniform subsampling from the training set. The minimizer of LapRLS (2) over the space H_s has the form

$$\hat{f}^{\,s}_{\lambda}(x) = \sum_{i=1}^{s} \alpha_i K(\tilde{x}_i, x), \quad \text{with} \quad \alpha = \big( \underbrace{K_{ms}^{\top} K_{ms} + \lambda_A K_{ss} + \lambda_I K_{ns}^{\top} L K_{ns}}_{H} \big)^{\dagger} \underbrace{K_{ms}^{\top} \mathbf{y}}_{z}, \tag{4}$$

where H^† denotes the Moore-Penrose pseudoinverse of a matrix H, (K_ms)_ij = K(x_i, x̃_j) with i ∈ {1, ..., m} and j ∈ {1, ..., s}, (K_ss)_kj = K(x̃_k, x̃_j) with k, j ∈ {1, ..., s}, and y = [y_1, ..., y_m]^⊤ ∈ R^m.

4.2 Solving the Linear System by Preconditioning

The Nyström-subsampled LapRLS solution (4) is itself a linear system, so we consider accelerating its solution by preconditioning, that is, solving

$$P^{-1} H \alpha = P^{-1} z.$$

The number of iterations of preconditioned methods depends on the condition number cond(P^{-1}H), so the preconditioner needs to approximate H closely. To obtain a small condition number while keeping the preconditioner cheap to compute, we define the following preconditioners:

• For m ≤ √n:
$$P = K_{ms}^{\top} K_{ms} + \lambda_A K_{ss} + \frac{\lambda_I n}{s^2} K_{ss} L_{ss} K_{ss}. \tag{5}$$

• For m > √n:
$$P = \frac{m}{s} K_{ss}^{\top} K_{ss} + \lambda_A K_{ss} + \frac{\lambda_I n}{s^2} K_{ss} L_{ss} K_{ss}. \tag{6}$$

In each iteration of any PCG solver, the product Hα is needed. To accelerate this computation, Hα is decomposed into a series of matrix-vector multiplications:

$$H\alpha = K_{ms}^{\top}(K_{ms}\alpha) + \lambda_A K_{ss}\alpha + \lambda_I K_{ns}^{\top}\big(L(K_{ns}\alpha)\big). \tag{7}$$

Remark 1. We use LU or QR decomposition to compute the matrix inverse P^{-1}, since they are significantly faster than Cholesky decomposition.

Remark 2. Storing the kernel matrix K_ns needs at least O(ns) memory, but this drops to O(s²) when we perform the matrix multiplications in s × s blocks.
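To show how the pieces of Section 4 fit together, here is a minimal sketch of the Nyström-plus-PCG pipeline using SciPy's conjugate gradient solver as a generic PCG routine: H is never formed explicitly but applied through the matrix-vector products of Eq. (7), and the preconditioner from Eqs. (5)-(6) is LU-factorized once (Remark 1) and applied at every iteration. The Gaussian kernel, the uniform sampling via NumPy, the caller-supplied dense Laplacian, and all names are assumptions of this sketch; the authors' Algorithm 1 may differ in details such as the stopping rule and how K_ns is streamed in blocks (Remark 2).

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, cg
from sklearn.metrics.pairwise import rbf_kernel


def nystrom_pcg_laprls(X, y_labeled, L, lam_A, lam_I, s, gamma=1.0, max_iter=100, seed=0):
    """Approximate LapRLS via Nystrom subsampling + PCG (Sections 4.1-4.2).

    X : (n, d) samples, the m labeled ones first; y_labeled : (m,) labels;
    L : (n, n) dense graph Laplacian; s : number of Nystrom centers.
    Returns the s coefficients and the indices of the selected centers.
    """
    n, m = X.shape[0], y_labeled.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=s, replace=False)        # uniform Nystrom centers
    centers = X[idx]

    K_ns = rbf_kernel(X, centers, gamma=gamma)        # n x s, the largest object kept here
    K_ms = K_ns[:m]                                   # labeled rows, m x s
    K_ss = rbf_kernel(centers, centers, gamma=gamma)  # s x s
    L_ss = L[np.ix_(idx, idx)]                        # Laplacian restricted to the centers

    # Eq. (7): apply H = K_ms^T K_ms + lam_A K_ss + lam_I K_ns^T L K_ns
    # as a chain of matrix-vector products, never forming an n x n matrix.
    def H_matvec(alpha):
        return (K_ms.T @ (K_ms @ alpha)
                + lam_A * (K_ss @ alpha)
                + lam_I * (K_ns.T @ (L @ (K_ns @ alpha))))

    H = LinearOperator((s, s), matvec=H_matvec, dtype=np.float64)

    # Preconditioner, Eqs. (5)-(6), chosen by comparing m with sqrt(n).
    graph_term = (lam_I * n / s**2) * (K_ss @ L_ss @ K_ss)
    if m <= np.sqrt(n):
        P = K_ms.T @ K_ms + lam_A * K_ss + graph_term            # Eq. (5)
    else:
        P = (m / s) * (K_ss @ K_ss) + lam_A * K_ss + graph_term  # Eq. (6)
    lu_piv = lu_factor(P)                                        # Remark 1: LU factorization
    M = LinearOperator((s, s), matvec=lambda v: lu_solve(lu_piv, v), dtype=np.float64)

    z = K_ms.T @ y_labeled                            # right-hand side of H alpha = z
    alpha, _ = cg(H, z, M=M, maxiter=max_iter)        # preconditioned conjugate gradient
    return alpha, idx
```

In this sketch K_ns is held in memory, which costs O(ns); applying Eq. (7) in s × s blocks as described in Remark 2 is what brings the memory footprint down further. Each PCG iteration then reduces to a handful of matrix-vector products plus one triangular solve against the factorized preconditioner.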