
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Approximate Manifold Regularization: Scalable Algorithm and Generalization Analysis

Jian Li^{1,2}, Yong Liu^1, Rong Yin^{1,2} and Weiping Wang^{1,*}
^1 Institute of Information Engineering, Chinese Academy of Sciences
^2 School of Cyber Security, University of Chinese Academy of Sciences
{lijian9026, liuyong, yinrong, [email protected]}
* Corresponding author

Abstract

Graph-based semi-supervised learning is one of the most popular and successful semi-supervised learning approaches. Unfortunately, it suffers from high time and space complexity, at least quadratic in the number of training samples. In this paper, we propose an efficient graph-based semi-supervised algorithm with a sound theoretical guarantee. The proposed method combines Nyström subsampling and preconditioned conjugate gradient descent, substantially improving computational efficiency and reducing memory requirements. Extensive empirical results reveal that our method achieves state-of-the-art performance in a short time even with limited computing resources.

1 Introduction

Recently, the explosive growth of computing power and network applications has made data generation and acquisition much easier. However, most of the collected data are unlabeled, while data annotation is laborious. Semi-supervised learning (SSL) methods are therefore developed to estimate a learner from a few labeled samples together with a large amount of unlabeled data; examples include transductive support vector machines [Joachims, 1999] and graph-based methods [Belkin et al., 2006; Camps-Valls et al., 2007]. Graph-based manifold regularization methods have drawn wide attention in the SSL area due to their good performance and relative simplicity of implementation [Belkin et al., 2006]. Despite those advantages, manifold regularization remains challenging on gigantic datasets because of its high computational complexity: kernel-matrix-related operations cost at least O(n²) and construction of the graph Laplacian costs at least O(n log n), where n is the total sample size.

To tackle these scalability issues, many approaches have been proposed [Liu et al., 2012; Jiang et al., 2017; Liu et al., 2019]: (1) Accelerate construction of the Laplacian graph. Methods based on fast spectral decomposition of the Laplacian matrix have been well studied in [Talwalkar et al., 2013], using a few eigenvalues of the graph Laplacian to represent the manifold structure. Graph sparsification approaches were devised to approximate the Laplacian graph by a line or spanning tree [Cesa-Bianchi et al., 2013] and were improved by minimizing tree cut (MTC) in [Zhang et al., 2016]. (2) Accelerate operations associated with the kernel matrix. Several distributed approaches have been applied to semi-supervised learning [Chang et al., 2017], decomposing a large-scale problem into smaller ones. Anchor Graph regularization (Anchor) constructs an anchor graph with the training samples and a few anchor points to approximate the Laplacian graph [Liu et al., 2010]. The work of [McWilliams et al., 2013; Rastogi and Sampath, 2017] applied random projections, including Nyström methods and random features, to manifold regularization. Gradient methods have been introduced to solve manifold regularization in the primal, such as preconditioned conjugate gradient [Melacci and Belkin, 2011] and stochastic gradient descent [Wang et al., 2012].

In this paper, we focus on the latter scalability issue. With sound theoretical guarantees, we devise a novel graph-based SSL framework that substantially reduces computational time and memory requirements. More precisely, our approach approximates Laplacian regularized least squares (LapRLS) by Nyström methods and then accelerates the solution with preconditioned conjugate gradient methods. It is a non-trivial extension of FALKON [Rudi et al., 2017] to graph-based SSL, with technical challenges in both algorithm design and theoretical analysis. Theoretical analysis demonstrates that O(√m) labeled samples and O(log m) iterations (m being the number of labeled samples) can guarantee good statistical properties. Complexity analysis shows that our method solves LapRLS in O(n√n) time and O(n) space (n being the total number of samples).
2 Related Work

To overcome the computational and memory bottlenecks of LapRLS, practical algorithms have been developed, including Nyström methods [Williams and Seeger, 2001], whose statistical properties are well studied in [Rastogi and Sampath, 2017], and preconditioned conjugate gradient (PCG), which reduces the number of iterations [Cutajar et al., 2016]. The FALKON approach combines Nyström methods and PCG in supervised learning [Rudi et al., 2017]. Our work extends this combination to SSL with high computational gains and sound statistical guarantees. The approach improves computational efficiency from O(n³) to O(n√n) and reduces memory cost from O(n²) to O(n).

3 Preliminaries

3.1 Problem Definition

Assume there is a fixed but unknown distribution ρ on X × Y, where X = R^d and Y = R. Further, m labeled samples (x_1, y_1), ..., (x_m, y_m) ∈ X × Y are drawn i.i.d. from ρ, and n − m unlabeled samples x_{m+1}, ..., x_n ∈ X are drawn i.i.d. according to the marginal distribution ρ_X of ρ.

3.2 Manifold Regularization

Manifold learning based on the spectral graph, known as graph-based SSL, is a typical approach to semi-supervised learning [Zhu et al., 2003; Belkin et al., 2006]; it seeks a smooth low-dimensional manifold, embedded in the high-dimensional vector space, that underlies the sample points. In particular, Laplacian regularization [Belkin et al., 2006] is extensively used in graph-based SSL.

For a Mercer kernel K : X × X → R, there is an associated reproducing kernel Hilbert space (RKHS) H of functions f : X → R with corresponding norm ‖·‖_H. Manifold regularization considers the following optimization problem:

$$\hat{f}_{\lambda} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{m} \ell(y_i, f(x_i)) + \lambda_A \|f\|_{\mathcal{H}}^2 + \lambda_I \mathbf{f}^{\top} L \mathbf{f}, \tag{1}$$

where ℓ is the loss function, L = D − W is the graph Laplacian, f = [f(x_1), ..., f(x_n)]^⊤, λ_A controls the complexity of the function in the ambient space, and λ_I controls the complexity of the function in the intrinsic space. Here, W ∈ R^{n×n} records the undirected weights between points, and the diagonal matrix D is given by D_ii = Σ_{j=1}^n W_ij. The minimizer of the optimization problem (1) admits an expansion in terms of both labeled and unlabeled data:

$$\hat{f}_{\lambda}(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x).$$

3.3 Laplacian Regularized Least Squares (LapRLS)

With the squared loss function, problem (1) becomes LapRLS:

$$\arg\min_{f \in \mathcal{H}} \sum_{i=1}^{m} \big(y_i - f(x_i)\big)^2 + \lambda_A \|f\|_{\mathcal{H}}^2 + \lambda_I \mathbf{f}^{\top} L \mathbf{f}. \tag{2}$$

Setting the derivative of the objective function to zero leads to the closed-form solution

$$\hat{\alpha} = (JK + \lambda_A I + \lambda_I LK)^{-1} \mathbf{y}_n, \tag{3}$$

where K_ij = K(x_i, x_j) is the n × n kernel matrix on the training data, J = diag(1, ..., 1, 0, ..., 0) has its first m diagonal entries equal to 1 and the rest 0, and y_n = [y_1, y_2, ..., y_m, 0, ..., 0]^⊤ contains the m labels with the remaining entries filled by 0. Note that when λ_I = 0, Equation (3) gives zero coefficients over the unlabeled data, and the form reduces to standard RLS.
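As a concrete reference point for the exact solver that Section 4 accelerates, the following is a minimal NumPy sketch of the closed-form solution (3). The Gaussian kernel, the k-nearest-neighbor weight graph, and all function and variable names are illustrative assumptions of this sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph


def exact_laprls(X, y_labeled, lam_A, lam_I, gamma=1.0, k=10):
    """Closed-form LapRLS coefficients via Eq. (3).

    X         : (n, d) array, the m labeled samples first, then the unlabeled ones.
    y_labeled : (m,) labels for the first m rows of X.
    The n x n kernel matrix makes this O(n^2) in memory and O(n^3) in time,
    which is exactly the bottleneck the Nystrom-PCG method is designed to avoid.
    """
    n, m = X.shape[0], y_labeled.shape[0]

    K = rbf_kernel(X, X, gamma=gamma)                      # n x n kernel matrix

    # Graph Laplacian L = D - W from a symmetrized k-NN connectivity graph (an assumed choice).
    W = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W

    # J selects the labeled block; y_n pads the labels with zeros (Section 3.3).
    J = np.diag(np.concatenate([np.ones(m), np.zeros(n - m)]))
    y_n = np.concatenate([y_labeled, np.zeros(n - m)])

    # Eq. (3): alpha = (J K + lam_A I + lam_I L K)^{-1} y_n
    A = J @ K + lam_A * np.eye(n) + lam_I * (L @ K)
    return np.linalg.solve(A, y_n)
```

Predictions then use the full expansion f(x) = Σ_{i=1}^n α_i K(x_i, x) over all labeled and unlabeled points, which is what makes both the storage and the solve expensive at scale.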
4 Algorithm

We devise a fast and scalable graph-based semi-supervised learning framework, Nyström-PCG, shown as Algorithm 1, which consists of two steps: (1) apply Nyström subsampling with uniform sampling on the training data to the LapRLS problem, resulting in a linear system Hα = z; (2) define a preconditioner P to approximate H and accelerate the iterative solution of this linear system.

4.1 Nyström Subsampling on LapRLS

We consider Nyström subsampling to reduce the memory requirement; it uses a smaller matrix obtained from random column sampling to approximate the empirical kernel matrix. Thus, a smaller hypothesis space H_s is introduced:

$$\mathcal{H}_s = \Big\{ f \in \mathcal{H} \;\Big|\; f = \sum_{i=1}^{s} \alpha_i K(\tilde{x}_i, \cdot),\; \alpha \in \mathbb{R}^s \Big\},$$

where s ≤ n and x^s = (x̃_1, ..., x̃_s) are the Nyström centers selected by uniform subsampling from the training set. The minimizer of LapRLS (2) over the space H_s has the form

$$\hat{f}^{\,s}_{\lambda}(x) = \sum_{i=1}^{s} \alpha_i K(\tilde{x}_i, x), \quad \text{with} \quad \alpha = \big( \underbrace{K_{ms}^{\top} K_{ms} + \lambda_A K_{ss} + \lambda_I K_{ns}^{\top} L K_{ns}}_{H} \big)^{\dagger} \underbrace{K_{ms}^{\top} \mathbf{y}}_{z}, \tag{4}$$

where H^† denotes the Moore-Penrose pseudoinverse of a matrix H, (K_ms)_ij = K(x_i, x̃_j) with i ∈ {1, ..., m} and j ∈ {1, ..., s}, (K_ss)_kj = K(x̃_k, x̃_j) with k, j ∈ {1, ..., s}, and y = [y_1, ..., y_m]^⊤ ∈ R^m.

4.2 Solving the Linear System by Preconditioning

The Nyström-subsampled LapRLS solution (4) is itself a linear system, so we consider accelerating its solution by preconditioning, that is, solving

$$P^{-1} H \alpha = P^{-1} z.$$

The number of iterations of preconditioned methods depends on the condition number cond(P^{-1}H), so the preconditioner needs to approximate H closely. To obtain a small condition number while keeping the preconditioner cheap to compute, we define the following preconditioners:

• For m ≤ √n:
$$P = K_{ms}^{\top} K_{ms} + \lambda_A K_{ss} + \frac{\lambda_I n}{s^2} K_{ss} L_{ss} K_{ss}. \tag{5}$$

• For m > √n:
$$P = \frac{m}{s} K_{ss}^{\top} K_{ss} + \lambda_A K_{ss} + \frac{\lambda_I n}{s^2} K_{ss} L_{ss} K_{ss}. \tag{6}$$

In each iteration of any PCG solver, the product Hα is needed. To accelerate this computation, Hα is decomposed into a series of matrix-vector multiplications:

$$H\alpha = K_{ms}^{\top}(K_{ms}\alpha) + \lambda_A K_{ss}\alpha + \lambda_I K_{ns}^{\top}\big(L(K_{ns}\alpha)\big). \tag{7}$$

Remark 1. We use LU or QR decomposition to compute the matrix inverse P^{-1}, since they are significantly faster than Cholesky decomposition.

Remark 2. Storing the kernel matrix K_ns needs at least O(ns) memory, but this drops to O(s²) when we perform the matrix multiplications in s × s blocks.
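To show how the pieces of Section 4 fit together, here is a minimal sketch of the Nyström-plus-PCG pipeline using SciPy's conjugate gradient solver as a generic PCG routine: H is never formed explicitly but applied through the matrix-vector products of Eq. (7), and the preconditioner from Eqs. (5)-(6) is LU-factorized once (Remark 1) and applied at every iteration. The Gaussian kernel, the uniform sampling via NumPy, the caller-supplied dense Laplacian, and all names are assumptions of this sketch; the authors' Algorithm 1 may differ in details such as the stopping rule and how K_ns is streamed in blocks (Remark 2).

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, cg
from sklearn.metrics.pairwise import rbf_kernel


def nystrom_pcg_laprls(X, y_labeled, L, lam_A, lam_I, s, gamma=1.0, max_iter=100, seed=0):
    """Approximate LapRLS via Nystrom subsampling + PCG (Sections 4.1-4.2).

    X : (n, d) samples, the m labeled ones first; y_labeled : (m,) labels;
    L : (n, n) dense graph Laplacian; s : number of Nystrom centers.
    Returns the s coefficients and the indices of the selected centers.
    """
    n, m = X.shape[0], y_labeled.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=s, replace=False)        # uniform Nystrom centers
    centers = X[idx]

    K_ns = rbf_kernel(X, centers, gamma=gamma)        # n x s, the largest object kept here
    K_ms = K_ns[:m]                                   # labeled rows, m x s
    K_ss = rbf_kernel(centers, centers, gamma=gamma)  # s x s
    L_ss = L[np.ix_(idx, idx)]                        # Laplacian restricted to the centers

    # Eq. (7): apply H = K_ms^T K_ms + lam_A K_ss + lam_I K_ns^T L K_ns
    # as a chain of matrix-vector products, never forming an n x n matrix.
    def H_matvec(alpha):
        return (K_ms.T @ (K_ms @ alpha)
                + lam_A * (K_ss @ alpha)
                + lam_I * (K_ns.T @ (L @ (K_ns @ alpha))))

    H = LinearOperator((s, s), matvec=H_matvec, dtype=np.float64)

    # Preconditioner, Eqs. (5)-(6), chosen by comparing m with sqrt(n).
    graph_term = (lam_I * n / s**2) * (K_ss @ L_ss @ K_ss)
    if m <= np.sqrt(n):
        P = K_ms.T @ K_ms + lam_A * K_ss + graph_term            # Eq. (5)
    else:
        P = (m / s) * (K_ss @ K_ss) + lam_A * K_ss + graph_term  # Eq. (6)
    lu_piv = lu_factor(P)                                        # Remark 1: LU factorization
    M = LinearOperator((s, s), matvec=lambda v: lu_solve(lu_piv, v), dtype=np.float64)

    z = K_ms.T @ y_labeled                            # right-hand side of H alpha = z
    alpha, _ = cg(H, z, M=M, maxiter=max_iter)        # preconditioned conjugate gradient
    return alpha, idx
```

In this sketch K_ns is held in memory, which costs O(ns); applying Eq. (7) in s × s blocks as described in Remark 2 is what brings the memory footprint down further. Each PCG iteration then reduces to a handful of matrix-vector products plus one triangular solve against the factorized preconditioner.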