Linear Manifold Regularization for Large Scale Semi-supervised Learning

Vikas Sindhwani [email protected]
Partha Niyogi [email protected]
Mikhail Belkin [email protected]
Department of Computer Science, University of Chicago, Hyde Park, Chicago, IL 60637

Sathiya Keerthi [email protected]
Yahoo! Research Labs, 210 S Delacey Avenue, Pasadena, CA 91105

Appearing in Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data, Bonn, Germany, August 2005. Copyright 2005 by the author(s).

Abstract

The enormous wealth of unlabeled data in many applications of machine learning is beginning to pose challenges to the designers of semi-supervised learning methods. We are interested in developing linear classification algorithms to efficiently learn from massive partially labeled datasets. In this paper, we propose Linear Laplacian Support Vector Machines and Linear Laplacian Regularized Least Squares as promising solutions to this problem.

1. Introduction

A single web-crawl by search engines like Yahoo and Google indexes billions of webpages. Only a very small fraction of these webpages can be hand-labeled and assembled into topic directories. The remaining webpages form a massive collection of unlabeled documents. Text categorization is among an increasing range of applications that stand to benefit immensely from the development of large scale semi-supervised learning methods.

In this paper, we propose linear semi-supervised classification algorithms for dealing with large datasets. Linear methods have traditionally played a pivotal role in the development of machine learning algorithms, and have been pervasively deployed in information retrieval, data analysis and pattern recognition systems.

Our algorithms are rooted in a general framework for semi-supervised learning called Manifold Regularization (Belkin, Niyogi & Sindhwani, 2004; Sindhwani, Niyogi & Belkin, 2005). In this framework, unlabeled data is incorporated within a geometrically motivated regularizer. We specialize this framework to linear classification, and focus on the problem of dealing with large amounts of data.

This paper is organized as follows: In section 2, we set up the problem of linear semi-supervised learning and discuss some prior work. The general Manifold Regularization framework is reviewed in section 3. In section 4, we discuss our methods. Section 5 outlines some research in progress.

2. Linear Semi-supervised Learning

The problem of linear semi-supervised classification is set up as follows: We are given $l$ labeled examples $\{x_i, y_i\}_{i=1}^{l}$ and $u$ unlabeled examples $\{x_i\}_{i=l+1}^{l+u}$, where the patterns $x \in X \subset \mathbb{R}^d$ are $d$-dimensional vectors and the labels $y_i \in \{-1, +1\}$ represent two classes. We are interested in developing tractable algorithms to learn a linear classifier $f(x) = \mathrm{sign}(w^T x + b)$, given by a weight vector $w$ and a threshold $b$, in the case where $u$ is very large. We can hope for tractability when the size of the data matrix is still manageable, e.g. when dealing with low-dimensional problems, or with very sparse high-dimensional problems (e.g. text categorization).
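To make the tractability point concrete: everything a linear classifier needs at prediction time is a single matrix-vector product with the data matrix. The following minimal sketch (using scipy.sparse on synthetic data; purely illustrative, not part of any algorithm proposed here) evaluates $f(x) = \mathrm{sign}(w^T x + b)$ over a sparse high-dimensional matrix at a cost proportional to its number of non-zeros:

```python
import numpy as np
import scipy.sparse as sp

# Illustrative sketch: a linear classifier f(x) = sign(w^T x + b) evaluated on a
# sparse data matrix X (n examples, d features). The cost of X @ w scales with
# the number of non-zeros in X rather than with n * d, which is what makes
# linear methods attractive for sparse high-dimensional data such as text.

rng = np.random.default_rng(0)
n, d = 10000, 100000                      # many examples, high dimension
X = sp.random(n, d, density=1e-4, format="csr", random_state=0)
w = rng.standard_normal(d)                # weight vector
b = 0.0                                   # threshold

scores = X @ w + b                        # one sparse matrix-vector product
labels = np.where(scores >= 0, 1, -1)     # predicted labels in {-1, +1}
```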
A number of recent efforts have considered semi-supervised extensions of well-established supervised methods, e.g. Support Vector Machines (SVMs). Transductive SVMs (Joachims, 1999) implement the following philosophy for using unlabeled data to choose a weight vector: find a weight vector and a labeling of the unlabeled examples so that the data (both labeled and unlabeled examples) is separated with maximum margin. This requires a joint optimization of the SVM objective function over possible choices of binary-valued labels on the unlabeled data and the weight vector.

Transductive SVM:

$$(w^*, b^*) = \operatorname*{argmin}_{\substack{w,\, b \\ y_{l+1}, \ldots, y_{l+u}}} \; \frac{1}{2} w^T w + C_l \sum_{i=1}^{l} V(y_i, w^T x_i + b) + C_u \sum_{i=l+1}^{l+u} V(y_i, w^T x_i + b)$$

where $V(y, f) = (1 - yf)_+$ is the hinge loss and $C_l, C_u$ are real-valued parameters that balance the hinge loss between the labeled and unlabeled examples. Thus, TSVM minimizes regularized hinge loss over choices of labels for the unlabeled examples. This optimization is implemented in (Joachims, 1999) by first using an inductive SVM to label the unlabeled data and then iteratively solving SVM quadratic programs. At each step, labels are switched to improve the objective function. This procedure is susceptible to local minima and requires an unknown, possibly large, number of label switches before converging. Thus, this approach does not seem to be well-suited to large scale problems.

Semi-supervised SVMs (S3VM) (Bennett & Demiriz, 1998) use a similar objective function optimized using mixed integer programming: consider both the positive and the negative label for each unlabeled example, and choose the label that incurs the smaller total hinge loss:

Semi-supervised SVM:

$$(w^*, b^*) = \operatorname*{argmin}_{w,\, b} \; \frac{1}{2} w^T w + C_l \sum_{i=1}^{l} V(y_i, w^T x_i + b) + C_u \sum_{i=l+1}^{l+u} \min\left[ V(-1, w^T x_i + b),\; V(+1, w^T x_i + b) \right]$$

The 1-norm of the weight vector is preferred in (Bennett & Demiriz, 1998); the optimization is found to be intractable even for moderate amounts of unlabeled data. (Fung & Mangasarian, 2001) reformulate this approach as a concave minimization problem which is solved by a successive linear approximation algorithm. These methods have been applied to relatively small problems.

3. Manifold Regularization

Before discussing our algorithms, we briefly discuss the intuition and algorithmic framework of Manifold Regularization. The success of semi-supervised learning rests on how much information unlabeled examples carry about the distribution of labels in the pattern space. Frequently, unlabeled examples may help in identifying clusters or a low-dimensional manifold structure along which labels can be assumed to vary smoothly. These are the cluster and manifold assumptions, respectively.

The Manifold Regularization framework extends the classical framework of regularization in Reproducing Kernel Hilbert Spaces (RKHS) to exploit the geometry of the marginal distribution, as estimated from unlabeled data, when these assumptions are satisfied. Unlabeled data is incorporated via a regularization term (in addition to the RKHS norm) whose role is to penalize functions for changing rapidly along a manifold or within a cluster.

The empirical estimate of these underlying structures can be encoded as a graph whose vertices are the labeled and unlabeled data points and whose edge weights $\{W_{ij}\}_{i,j=1}^{l+u}$ represent appropriate pairwise similarity relationships between examples. Given this graph, one can bias learning in an RKHS towards functions that vary smoothly along the edges (along the underlying structure). The following optimization problem is solved over an RKHS $\mathcal{H}_K$ with kernel function $K(x, y)$ and loss function $V$:

Manifold Regularization:

$$f^* = \operatorname*{argmin}_{f \in \mathcal{H}_K} \; \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \sum_{i,j=1}^{l+u} \left( f(x_i) - f(x_j) \right)^2 W_{ij}$$

Here, $\gamma_A, \gamma_I$ are regularization parameters that control the RKHS norm and the intrinsic norm, respectively. Defining $\hat{f} = [f(x_1), \ldots, f(x_{l+u})]^T$ and letting $L$ be the Laplacian matrix of the graph, given by $L = D - W$ where $D$ is the diagonal degree matrix with $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$, we re-write the above problem as:

$$f^* = \operatorname*{argmin}_{f \in \mathcal{H}_K} \; \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \hat{f}^T L \hat{f}$$

One can also take powers of the Laplacian, $L^p$, to define the graph regularizer. For more on graph regularization, see (Belkin, Matveeva & Niyogi; Kondor & Lafferty, 2002).
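As a concrete illustration of the graph regularizer, the sketch below builds a k-nearest-neighbor graph with binary edge weights (one common choice; the framework above leaves the similarity $W$ unspecified), forms $L = D - W$, and evaluates the smoothness penalty $\hat{f}^T L \hat{f}$:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Illustrative sketch: build a symmetric k-NN similarity graph W over all
# (labeled + unlabeled) points, form the graph Laplacian L = D - W, and
# evaluate the smoothness penalty f^T L f. Binary edge weights are one
# common choice; heat-kernel weights are another.

def knn_laplacian(X, k=5):
    D2 = cdist(X, X)                        # pairwise distances
    W = np.zeros_like(D2)
    for i in range(len(X)):
        nbrs = np.argsort(D2[i])[1:k + 1]   # k nearest neighbors (skip self)
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                  # symmetrize the adjacency
    D = np.diag(W.sum(axis=1))              # diagonal degree matrix
    return D - W                            # graph Laplacian L

X = np.random.default_rng(0).standard_normal((200, 10))
L = knn_laplacian(X)
f_hat = X @ np.ones(10)                     # values of some linear function on the nodes
penalty = f_hat @ L @ f_hat                 # equals 0.5 * sum_ij W_ij (f_i - f_j)^2
```

The identity in the last comment holds for any symmetric $W$, which is why $\hat{f}^T L \hat{f}$ penalizes functions that change rapidly across strongly connected vertices.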
A version of the Representer theorem can be easily proved, showing that the minimizer has the following form:

$$f(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x) \qquad (1)$$

This reduces the optimization problem in Manifold Regularization to a finite dimensional problem of estimating the $(l + u)$ expansion coefficients $\alpha_i$. The algorithms that estimate these coefficients and construct the optimal function are called Laplacian RLS and Laplacian SVM, for the squared loss and the hinge loss respectively. Both algorithms involve inverting dense $(l + u) \times (l + u)$ Gram matrices, so that the complexity of a naive implementation is $O((l + u)^3)$, which is prohibitive for large datasets. Experiments showing state-of-the-art semi-supervised learning performance with small to moderate numbers of unlabeled examples are reported in (Sindhwani, Niyogi & Belkin, 2005).

4. Linear Manifold Regularization

Linear Manifold Regularization specializes the above algorithms to linear functions $f(x) = w^T x + b$, for which $\|f\|_K^2 = w^T w$ and $\hat{f} = Xw + b\mathbf{1}$, where $X$ denotes the $(l + u) \times d$ data matrix. This results in the following optimization problem:

$$(w^*, b^*) = \operatorname*{argmin}_{w,\, b} \; \frac{1}{l} \sum_{i=1}^{l} V(y_i, w^T x_i + b) + \gamma_A w^T w + \gamma_I \hat{f}^T L \hat{f} \qquad (2)$$

4.1. Linear Laplacian RLS

For the squared loss $V(y, f) = (y - f)^2$, the objective in (2) is a quadratic function of $w$ (we omit the bias $b$ here for brevity; note that $L\mathbf{1} = 0$, so the intrinsic term reduces to $(Xw)^T L (Xw)$). Setting its gradient to zero leads to the linear system:

$$\left( X_l^T X_l + \gamma_A l\, I + \gamma_I l\, X^T L X \right) w = X_l^T Y \qquad (3)$$

Here $X_l$ is the submatrix of $X$ corresponding to labeled examples and $Y$ is the vector of labels. This is a $d \times d$ system which can be easily solved when $d$ is small. When $d$ is large but the feature vectors are highly sparse, with $\bar{d}$ non-zero entries on average, we can employ Conjugate Gradient (CG) methods to solve this system. CG techniques are Krylov methods that solve a system $Ax = b$ by repeatedly multiplying a candidate solution $z$ by $A$. In the case of Linear Laplacian RLS, we can construct the matrix-vector product on the LHS of (3) in time $O(n(\bar{d} + pk))$, where $n = l + u$, $p$ (typically very small) is the power of the Laplacian matrix (if $L^p$ is used as the graph regularizer), and $k$ (typically small) is the average number of entries per column of $L$. This is done by using an intermediate vector $Xw$ and appropriately forming sparse matrix-vector products. Thus, the algorithm can employ very well-developed methods for efficiently obtaining the solution.

4.2. Linear Laplacian SVM

To solve the optimization problem (4) for the hinge loss, we adopt a different strategy (which also works for Linear Laplacian RLS).
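As an illustration of the CG strategy of Section 4.1, the following minimal sketch (assuming scipy's `cg` and the system (3) above, with $p = 1$, no bias term, and synthetic data) applies the left-hand side matrix-free via the intermediate vector $Xw$, so that no $d \times d$ matrix is ever formed:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

# Illustrative sketch of Linear Laplacian RLS (squared loss) solved by conjugate
# gradients, for the d x d system (3):
#   (X_l^T X_l + gamma_A * l * I + gamma_I * l * X^T L X) w = X_l^T Y
# The left-hand side is applied matrix-free: each matvec uses only sparse
# products and the intermediate vector Xw, so no d x d matrix is formed.

def linear_lap_rls(X, L, Y, l, gamma_A, gamma_I):
    n, d = X.shape
    X_l = X[:l]                                   # rows of X with labels

    def matvec(w):
        Xw = X @ w                                # intermediate vector Xw
        return (X_l.T @ (X_l @ w)
                + gamma_A * l * w
                + gamma_I * l * (X.T @ (L @ Xw)))

    A = LinearOperator((d, d), matvec=matvec)     # symmetric positive definite
    w, info = cg(A, X_l.T @ Y)                    # CG: repeated matvecs with A
    return w

# Toy usage: sparse random data and a chain-graph Laplacian L = D - W.
n, d, l = 500, 2000, 50
X = sp.random(n, d, density=1e-2, format="csr", random_state=0)
W = sp.diags([np.ones(n - 1), np.ones(n - 1)], [1, -1], format="csr")
L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W
Y = np.where(np.random.default_rng(0).standard_normal(l) >= 0, 1.0, -1.0)
w = linear_lap_rls(X, L, Y, l, gamma_A=0.01, gamma_I=0.01)
```

Each CG iteration here costs two sparse products with $X$ and one with $L$, matching the $O(n(\bar{d} + k))$ estimate above for $p = 1$.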