A Simple Algorithm for Semi-Supervised Learning with Improved Generalization Error Bound
Ming Ji∗‡ [email protected]
Tianbao Yang∗† [email protected]
Binbin Lin♮ [email protected]
Rong Jin† [email protected]
Jiawei Han‡ [email protected]

‡ Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
† Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
♮ State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, 310058, China
∗ Equal contribution

Abstract

In this work, we develop a simple algorithm for semi-supervised regression. The key idea is to use the top eigenfunctions of an integral operator derived from both labeled and unlabeled examples as the basis functions, and to learn the prediction function by a simple linear regression. We show that, under appropriate assumptions about the integral operator, this approach achieves a regression error bound better than existing bounds for supervised learning. We also verify the effectiveness of the proposed algorithm by an empirical study.

1. Introduction

Although numerous algorithms have been developed for semi-supervised learning (see Zhu (2008) and references therein), most of them do not have a theoretical guarantee of improving the generalization performance of supervised learning. A number of theories have been proposed for semi-supervised learning, and most of them are based on one of two assumptions: (1) the cluster assumption (Seeger, 2001; Rigollet, 2007; Lafferty & Wasserman, 2007; Singh et al., 2008; Sinha & Belkin, 2009), which assumes that two data points should have the same class label or similar values if they are connected by a path passing through a high density region; (2) the manifold assumption (Lafferty & Wasserman, 2007; Niyogi, 2008), which states that the prediction function lives on a low dimensional manifold of the marginal distribution $P_X$.

It has been pointed out by several studies (Lafferty & Wasserman, 2007; Nadler et al., 2009) that the manifold assumption by itself is insufficient to reduce the generalization error bound of supervised learning. On the other hand, it was found in (Niyogi, 2008) that for certain learning problems no supervised learner can learn effectively, while a manifold based learner (one that knows the manifold or learns it from unlabeled examples) can learn well with relatively few labeled examples. Compared to the manifold assumption, theoretical results based on the cluster assumption appear to be more encouraging. In early studies (Castelli & Cover, 1995; 1996), the authors show that under the assumption that the marginal distribution $P_X$ is a mixture of class conditional distributions, the generalization error is reduced exponentially in the number of labeled examples if the mixture is identifiable. Rigollet (2007) defines the cluster assumption in terms of density level sets, and shows a similar exponential convergence rate given a sufficiently large number of unlabeled examples. Furthermore, Singh et al. (2008) show that the mixture components can be identified if $P_X$ is a mixture of a finite number of smooth density functions and the separation/overlap between different mixture components is significantly large. Despite the encouraging results, one major problem of the cluster assumption is that it is difficult to verify given a limited number of labeled examples.
In addition, the learning algorithms suggested in (Rigollet, 2007; Singh et al., 2008; Zhang & Ando, 2005) are difficult to implement efficiently even if the cluster assumption holds, making them impractical for real-world problems.

In this work, we aim to develop a simple algorithm for semi-supervised learning that on one hand is easy to implement, and on the other hand is guaranteed to improve the generalization performance of supervised learning under appropriate assumptions. The main idea of the proposed algorithm is to estimate the top eigenfunctions of the integral operator from both labeled and unlabeled examples, and to learn from the labeled examples the best prediction function in the subspace spanned by the estimated eigenfunctions. Unlike previous studies that explore eigenfunctions for semi-supervised learning (Fergus et al., 2009; Sinha & Belkin, 2009), we show that under appropriate assumptions the proposed algorithm achieves a better generalization error bound than supervised learning algorithms.

To derive the generalization error bound, we make a different set of assumptions from previous studies. First, we assume a skewed eigenvalue distribution and bounded eigenfunctions of the integral operator. The assumption of skewed eigenvalue distributions has been verified and used in multiple studies of kernel learning (Koltchinskii, 2011; Steinwart et al., 2006; Minh, 2010; Zhang & Ando, 2005), while the assumption of bounded eigenvectors is mostly found in the study of compressive sensing (Candès & Tao, 2006). Second, we assume that a sufficient number of labeled examples is available, an assumption also used in other analyses of semi-supervised learning (Rigollet, 2007). It is the combination of these assumptions that allows us to derive a better generalization error bound for semi-supervised learning.

The rest of the paper is arranged as follows. Section 2 presents the proposed algorithm and verifies its effectiveness by an empirical study. Section 3 shows the improved generalization error bound for the proposed semi-supervised learning algorithm, and Section 4 outlines the proofs. Section 5 concludes with future work.

2. Algorithm and Empirical Validation

Let $\mathcal{X}$ be a compact domain or a manifold in the Euclidean space $\mathbb{R}^d$. Let $D = \{x_i,\, i = 1, \ldots, N \mid x_i \in \mathcal{X}\}$ be a collection of training examples. We randomly select $n$ examples from $D$ for labeling. Without loss of generality, we assume that the first $n$ examples are labeled by $y_l = (y_1, \ldots, y_n)^\top$. We denote by $y = (y_1, \ldots, y_N)^\top \in \mathbb{R}^N$ the true labels for all the examples in $D$. In this study, we assume that $y = f(x)$ is determined by an unknown deterministic function $f(x)$. Our goal is to learn an accurate prediction function by exploiting both labeled and unlabeled examples. Below we first present our algorithm and then verify its empirical performance by comparing against state-of-the-art algorithms for supervised and semi-supervised learning.

2.1. A Simple Algorithm for Semi-Supervised Learning

Let $\kappa(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel, and let $\mathcal{H}_\kappa$ be the Reproducing Kernel Hilbert Space (RKHS) of functions $\mathcal{X} \to \mathbb{R}$ endowed with the kernel $\kappa(\cdot,\cdot)$. We assume that $\kappa$ is a bounded function, i.e., $|\kappa(x, x)| \le 1$ for all $x \in \mathcal{X}$. Similar to most semi-supervised learning algorithms, in order to effectively exploit the unlabeled data, we need to relate the prediction function $f(x)$ to the unlabeled examples (or the marginal distribution $P_X$). To this end, we assume there exists an accurate prediction function $g(x) \in \mathcal{H}_\kappa$ with $\|g\|_{\mathcal{H}_\kappa} \le R$. More specifically, we define

\[
\varepsilon^2 = \min_{h \in \mathcal{H}_\kappa,\, \|h\|_{\mathcal{H}_\kappa} \le R} \mathrm{E}_x\big[(f(x) - h(x))^2\big], \tag{2}
\]
\[
g(x) = \arg\min_{h \in \mathcal{H}_\kappa,\, \|h\|_{\mathcal{H}_\kappa} \le R} \mathrm{E}_x\big[(f(x) - h(x))^2\big]. \tag{3}
\]

Our basic assumption (A0) is that the regression error $\varepsilon^2 \ll R^2$ is small, and the maximum regression error of $g(x)$ over $x \in \mathcal{X}$ is also small, i.e.,

\[
\sup_{x \in \mathcal{X}} \big(f(x) - g(x)\big)^2 \triangleq \varepsilon_{\max}^2 = O\!\left(n \varepsilon^2 / \ln N\right).
\]

To present our algorithm, we define an integral operator over the examples in $D$:

\[
\hat{L}_N(f)(\cdot) = \frac{1}{N} \sum_{i=1}^{N} \kappa(x_i, \cdot)\, f(x_i), \tag{4}
\]

where $f \in \mathcal{H}_\kappa$. Let $(\hat{\varphi}_i(x), \hat{\lambda}_i)$, $i = 1, 2, \ldots, N$, be the eigenfunctions and eigenvalues of $\hat{L}_N$, ranked in descending order of the eigenvalues, where $\langle \hat{\varphi}_i(\cdot), \hat{\varphi}_j(\cdot) \rangle_{\mathcal{H}_\kappa} = \delta(i, j)$ for any $1 \le i, j \le N$. According to (Guo & Zhou, 2011), the prediction function $g(x)$ can be well approximated by a function in the subspace spanned by the top eigenfunctions of $\hat{L}_N$. Hence, we propose to learn a target prediction function $\hat{g}(x)$ as a linear combination of the first $s$ eigenfunctions, i.e.,

\[
\hat{g}(x) = \sum_{j=1}^{s} \gamma_j^{*}\, \hat{\varphi}_j(x), \tag{5}
\]

where the coefficients $\gamma^* = (\gamma_1^*, \ldots, \gamma_s^*)^\top$ are obtained by solving the regression problem in step 3 of Algorithm 1.

Algorithm 1 A Simple Algorithm for Semi-supervised Learning
1: Input
   • $D = \{x_1, \ldots, x_N\}$: labeled and unlabeled examples
   • $y_l = (y_1, \ldots, y_n)^\top$: labels for the first $n$ examples in $D$
   • $s$: the number of eigenfunctions to be used
2: Compute $(\hat{\varphi}_i, \hat{\lambda}_i)$, $i = 1, \ldots, s$, the first $s$ eigenfunctions and eigenvalues of the integral operator $\hat{L}_N$ defined in (4).
3: Compute the prediction $\hat{g}(x)$ in (5), where $\gamma^* = (\gamma_1^*, \ldots, \gamma_s^*)^\top$ is given by solving the following regression problem:
\[
\gamma^* = \arg\min_{\gamma \in \mathbb{R}^s} \sum_{i=1}^{n} \Big( \sum_{j=1}^{s} \gamma_j\, \hat{\varphi}_j(x_i) - y_i \Big)^2 \tag{1}
\]
4: Output the prediction function $\hat{g}(\cdot)$
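For concreteness, the following sketch (ours, not from the paper) shows one way Algorithm 1 could be implemented. It uses the standard correspondence between eigenfunctions of the empirical integral operator $\hat{L}_N$ and eigenvectors of the kernel matrix $K/N$ over $D$, extended to new points in Nyström fashion; the Gaussian kernel, its bandwidth `gamma`, and all function names are our own placeholder choices, not specified by the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fit_semisupervised(X, y_labeled, n_labeled, s, gamma=1.0):
    """Sketch of Algorithm 1: eigenfunctions of L_N plus least squares on labeled data.

    X         : (N, d) array holding all labeled and unlabeled examples (the set D)
    y_labeled : (n_labeled,) labels of the first n_labeled rows of X
    s         : number of eigenfunctions to keep
    """
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)

    # Step 2: eigen-decompose K/N. An eigenvector u of K/N with eigenvalue lam
    # corresponds to the eigenfunction phi(x) = (N*lam)^(-1/2) * sum_i u_i kappa(x_i, x),
    # which has unit RKHS norm and satisfies phi(x_j) = sqrt(N*lam) * u_j on the sample.
    eigvals, eigvecs = np.linalg.eigh(K / N)        # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:s]             # indices of the s largest
    lam = np.clip(eigvals[top], 1e-12, None)        # guard against numerical negatives
    U = eigvecs[:, top]

    # Step 3: evaluate the s eigenfunctions on the labeled examples and solve (1).
    Phi_labeled = U[:n_labeled, :] * np.sqrt(N * lam)
    gamma_star, *_ = np.linalg.lstsq(Phi_labeled, y_labeled, rcond=None)

    def predict(X_new):
        # Equation (5), with the eigenfunctions extended to new points (Nystrom form).
        K_new = rbf_kernel(X_new, X, gamma)
        Phi_new = (K_new @ U) / np.sqrt(N * lam)
        return Phi_new @ gamma_star

    return predict
```

In this form, step 2 amounts to a single eigendecomposition of the $N \times N$ kernel matrix and step 3 to an ordinary least-squares problem in only $s$ unknowns, which is what makes the method simple to implement.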
Table 1. Statistics of datasets

Name          #Objects   #Features
insurance     9,822      85
wine          4,898      11
temperature   9,504      2

The first two datasets are from the UCI Machine Learning Repository (Frank & Asuncion, 2010), while the task of the last dataset is to predict the temperature based on the coordinates (latitude, longitude) on the earth's surface. All three datasets are designed for regression tasks with real-valued outputs. We choose these three datasets because they fit the assumptions that will be elaborated in Section 3.2.

We randomly choose 90% of the data for training, and use the remaining 10% for testing.
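As a minimal illustration of this evaluation protocol, a driver for one random 90%/10% split could look as follows, reusing the `fit_semisupervised` sketch above. The loader `load_dataset`, the kernel bandwidth, the error metric, and the values of $n$ and $s$ are placeholders of ours, not settings reported in the paper.

```python
import numpy as np

# Hypothetical driver for the 90%/10% protocol; load_dataset is a placeholder loader
# returning a feature matrix X and real-valued targets y for one of the datasets above.
X, y = load_dataset("temperature")
rng = np.random.default_rng(0)

perm = rng.permutation(len(X))
split = int(0.9 * len(X))
train_idx, test_idx = perm[:split], perm[split:]

# The training portion plays the role of D (labeled + unlabeled examples);
# only its first n rows are treated as labeled, matching the setting of Algorithm 1.
n, s = 100, 20                      # placeholder numbers of labels / eigenfunctions
X_train, y_train = X[train_idx], y[train_idx]
predict = fit_semisupervised(X_train, y_train[:n], n_labeled=n, s=s, gamma=1.0)

rmse = np.sqrt(np.mean((predict(X[test_idx]) - y[test_idx]) ** 2))
print(f"test RMSE: {rmse:.4f}")
```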