
An RKHS for Multi-View Learning and Manifold Co-Regularization

Vikas Sindhwani ([email protected])
Mathematical Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 USA

David S. Rosenberg ([email protected])
Department of Statistics, University of California Berkeley, CA 94720 USA

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

Inspired by co-training, many multi-view semi-supervised kernel methods implement the following idea: find a function in each of multiple Reproducing Kernel Hilbert Spaces (RKHSs) such that (a) the chosen functions make similar predictions on unlabeled examples, and (b) the average prediction given by the chosen functions performs well on labeled examples. In this paper, we construct a single RKHS with a data-dependent "co-regularization" norm that reduces these approaches to standard supervised learning. The reproducing kernel for this RKHS can be explicitly derived and plugged into any kernel method, greatly extending the theoretical and algorithmic scope of co-regularization. In particular, with this development, the Rademacher complexity bound for co-regularization given in (Rosenberg & Bartlett, 2007) follows easily from well-known results. Furthermore, more refined bounds given by localized Rademacher complexity can also be easily applied. We propose a co-regularization based algorithmic alternative to manifold regularization (Belkin et al., 2006; Sindhwani et al., 2005a) that leads to major empirical improvements on semi-supervised tasks. Unlike the recently proposed transductive approach of (Yu et al., 2008), our RKHS formulation is truly semi-supervised and naturally extends to unseen test data.

1. Introduction

In semi-supervised learning, we are given a few labeled examples together with a large collection of unlabeled data from which to estimate an unknown target function. Suppose we have two hypothesis spaces, $\mathcal{H}^1$ and $\mathcal{H}^2$, each of which contains a predictor that well-approximates the target function. We know that predictors that agree with the target function also agree with each other on unlabeled examples. Thus, any predictor in one hypothesis space that does not have an "agreeing predictor" in the other can be safely eliminated from consideration. Due to the resulting reduction in the complexity of the joint learning problem, one can expect improved generalization performance.

These conceptual intuitions and their algorithmic instantiations together constitute a major line of work in semi-supervised learning. One of the earliest approaches in this area was "co-training" (Blum & Mitchell, 1998), in which $\mathcal{H}^1$ and $\mathcal{H}^2$ are defined over different representations, or "views", of the data, and trained alternately to maximize mutual agreement on unlabeled examples. More recently, several papers have formulated these intuitions as joint complexity regularization, or co-regularization, between $\mathcal{H}^1$ and $\mathcal{H}^2$, which are taken to be Reproducing Kernel Hilbert Spaces (RKHSs) of functions defined on the input space $\mathcal{X}$. Given a few labeled examples $\{(x_i, y_i)\}_{i \in L}$ and a collection of unlabeled data $\{x_i\}_{i \in U}$, co-regularization learns a prediction function,

$$f_\star(x) = \frac{1}{2}\big( f^1_\star(x) + f^2_\star(x) \big) \qquad (1)$$

where $f^1_\star \in \mathcal{H}^1$ and $f^2_\star \in \mathcal{H}^2$ are obtained by solving the following optimization problem,

$$(f^1_\star, f^2_\star) = \underset{f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2}{\mathrm{argmin}} \;\; \gamma_1 \|f^1\|^2_{\mathcal{H}^1} + \gamma_2 \|f^2\|^2_{\mathcal{H}^2} + \mu \sum_{i \in U} \big[ f^1(x_i) - f^2(x_i) \big]^2 + \sum_{i \in L} V\big(y_i, f(x_i)\big) \qquad (2)$$

In this objective function, the first two terms measure complexity by the RKHS norms $\|\cdot\|_{\mathcal{H}^1}$ and $\|\cdot\|_{\mathcal{H}^2}$ in $\mathcal{H}^1$ and $\mathcal{H}^2$ respectively, the third term enforces agreement among predictors on unlabeled examples, and the final term evaluates the empirical loss of the mean function $f = (f^1 + f^2)/2$ on the labeled data with respect to a loss function $V(\cdot, \cdot)$. The real-valued parameters $\gamma_1$, $\gamma_2$, and $\mu$ allow different tradeoffs between the regularization terms. $L$ and $U$ are index sets over labeled and unlabeled examples respectively.
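To make the objective of Eqn. 2 concrete, the following sketch evaluates it for the special case of squared loss, with $f^1$ and $f^2$ written as kernel expansions over the labeled and unlabeled points (a standard representer-theorem parameterization). The squared loss, the expansion, and all variable names are illustrative assumptions, not part of the paper's exposition.

```python
import numpy as np

def coreg_objective(alpha1, alpha2, K1, K2, y, L_idx, U_idx,
                    gamma1, gamma2, mu):
    """Evaluate the co-regularization objective of Eqn. 2 for squared loss
    V(y, t) = (y - t)^2, with each view's function written as a kernel
    expansion f^v(x_i) = (K^v alpha^v)_i over all labeled+unlabeled points.

    K1, K2         : Gram matrices of k^1, k^2 over all points
    alpha1, alpha2 : expansion coefficients of f^1 and f^2
    L_idx, U_idx   : index arrays of labeled and unlabeled points
    """
    f1 = K1 @ alpha1                          # f^1 evaluated at all points
    f2 = K2 @ alpha2                          # f^2 evaluated at all points
    norm1 = alpha1 @ K1 @ alpha1              # ||f^1||^2 in H^1
    norm2 = alpha2 @ K2 @ alpha2              # ||f^2||^2 in H^2
    disagreement = np.sum((f1[U_idx] - f2[U_idx]) ** 2)   # agreement term on U
    f_mean = 0.5 * (f1[L_idx] + f2[L_idx])    # mean prediction on labeled points
    loss = np.sum((y - f_mean) ** 2)          # squared loss on L
    return gamma1 * norm1 + gamma2 * norm2 + mu * disagreement + loss
```

Minimizing this objective jointly over the two coefficient vectors is one route; the reformulation developed next reduces the problem to a single standard supervised learner instead.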
Several variants of this formulation have been proposed independently and explored in different contexts: linear logistic regression (Krishnapuram et al., 2005), regularized least squares classification (Sindhwani et al., 2005b), regression (Brefeld et al., 2006), support vector classification (Farquhar et al., 2005), Bayesian co-training (Yu et al., 2008), and generalization theory (Rosenberg & Bartlett, 2007).

The main theoretical contribution of this paper is the construction of a new "co-regularization RKHS," in which standard supervised learning recovers the solution to the co-regularization problem of Eqn. 2. Theorem 2.2 presents the RKHS and gives an explicit formula for its reproducing kernel. This "co-regularization kernel" can be plugged into any standard kernel method, giving convenient and immediate access to two-view semi-supervised techniques for a wide variety of learning problems. Utilizing this kernel, in Section 3 we give much simpler proofs of the results of (Rosenberg & Bartlett, 2007) concerning bounds on the Rademacher complexity and generalization performance of co-regularization. As a more algorithmic application, in Section 4 we consider the semi-supervised learning setting where examples live near a low-dimensional manifold embedded in a high-dimensional ambient Euclidean space. Our approach, manifold co-regularization (CoMR), gives major empirical improvements over the manifold regularization (MR) framework of (Belkin et al., 2006; Sindhwani et al., 2005a).

The recent work of (Yu et al., 2008) considers a similar reduction. However, this reduction is strictly transductive and does not allow prediction on unseen test examples. By contrast, our formulation is truly semi-supervised and provides a principled out-of-sample extension.

2. An RKHS for Co-Regularization

We start by reformulating the co-regularization optimization problem, given in Eqn. 1 and Eqn. 2, in the following equivalent form, where we directly solve for the final prediction function $f_\star$:

$$f_\star = \underset{f}{\mathrm{argmin}} \;\; \min_{\substack{f = f^1 + f^2 \\ f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2}} \frac{\gamma_1}{2} \|f^1\|^2_{\mathcal{H}^1} + \frac{\gamma_2}{2} \|f^2\|^2_{\mathcal{H}^2} + \frac{\mu}{2} \sum_{i \in U} \big[ f^1(x_i) - f^2(x_i) \big]^2 + \sum_{i \in L} V\Big(y_i, \tfrac{1}{2} f(x_i)\Big) \qquad (3)$$

Consider the sum space of functions, $\tilde{\mathcal{H}}$, given by,

$$\tilde{\mathcal{H}} = \mathcal{H}^1 \oplus \mathcal{H}^2 = \big\{ f \;\big|\; f(x) = f^1(x) + f^2(x),\; f^1 \in \mathcal{H}^1,\; f^2 \in \mathcal{H}^2 \big\} \qquad (4)$$

and impose on it a data-dependent norm,

$$\|f\|^2_{\tilde{\mathcal{H}}} = \min_{\substack{f = f^1 + f^2 \\ f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2}} \gamma_1 \|f^1\|^2_{\mathcal{H}^1} + \gamma_2 \|f^2\|^2_{\mathcal{H}^2} + \mu \sum_{i \in U} \big[ f^1(x_i) - f^2(x_i) \big]^2 \qquad (5)$$

The minimization problem in Eqn. 3 can then be posed as standard supervised learning in $\tilde{\mathcal{H}}$ as follows,

$$f_\star = \underset{f \in \tilde{\mathcal{H}}}{\mathrm{argmin}} \;\; \gamma \|f\|^2_{\tilde{\mathcal{H}}} + \sum_{i \in L} V\Big(y_i, \tfrac{1}{2} f(x_i)\Big) \qquad (6)$$

where $\gamma = \tfrac{1}{2}$.
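To see where the value $\gamma = \tfrac{1}{2}$ comes from, substitute the norm of Eqn. 5 into Eqn. 6; since the scaling by $\tfrac{1}{2}$ passes inside the minimum, the inner minimization reproduces the objective of Eqn. 3 term by term (a short check, written out here for convenience):

$$\tfrac{1}{2} \|f\|^2_{\tilde{\mathcal{H}}} + \sum_{i \in L} V\Big(y_i, \tfrac{1}{2} f(x_i)\Big) = \min_{\substack{f = f^1 + f^2 \\ f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2}} \frac{\gamma_1}{2} \|f^1\|^2_{\mathcal{H}^1} + \frac{\gamma_2}{2} \|f^2\|^2_{\mathcal{H}^2} + \frac{\mu}{2} \sum_{i \in U} \big[ f^1(x_i) - f^2(x_i) \big]^2 + \sum_{i \in L} V\Big(y_i, \tfrac{1}{2} f(x_i)\Big)$$

so minimizing Eqn. 6 over $f \in \tilde{\mathcal{H}}$ is exactly the joint minimization of Eqn. 3.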
Of course, this reformulation is not really useful unless $\tilde{\mathcal{H}}$ itself is a valid new RKHS. Let us recall the definition of an RKHS.

Definition 2.1 (RKHS). A reproducing kernel Hilbert space (RKHS) is a Hilbert space $\mathcal{F}$ that possesses a reproducing kernel, i.e., a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for which the following hold: (a) $k(x, \cdot) \in \mathcal{F}$ for all $x \in \mathcal{X}$, and (b) $\langle f, k(x, \cdot) \rangle_{\mathcal{F}} = f(x)$ for all $x \in \mathcal{X}$ and $f \in \mathcal{F}$, where $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ denotes the inner product in $\mathcal{F}$.

In Theorem 2.2, we show that $\tilde{\mathcal{H}}$ is indeed an RKHS, and moreover we give an explicit expression for its reproducing kernel. Thus, it follows that although the domain of optimization in Eqn. 6 is nominally a function space, by the Representer Theorem we can express it as a finite-dimensional optimization problem.

2.1. Co-Regularization Kernels

Let $\mathcal{H}^1$, $\mathcal{H}^2$ be RKHSs with kernels given by $k^1$, $k^2$ respectively, and let $\tilde{\mathcal{H}} = \mathcal{H}^1 \oplus \mathcal{H}^2$ as defined in Eqn. 4. We have the following result.

Theorem 2.2. There exists an inner product on $\tilde{\mathcal{H}}$ for which $\tilde{\mathcal{H}}$ is an RKHS with norm defined by Eqn. 5 and reproducing kernel $\tilde{k} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by,

$$\tilde{k}(x, z) = s(x, z) - \mu\, d_x^{T} H d_z \qquad (7)$$

where $s(x, z)$ is the (scaled) sum of kernels given by,

$$s(x, z) = \gamma_1^{-1} k^1(x, z) + \gamma_2^{-1} k^2(x, z),$$

and $d_x$ is a vector-valued function that depends on the difference in views, measured as,

$$d_x = \gamma_1^{-1} k^1_{Ux} - \gamma_2^{-1} k^2_{Ux},$$

where $k^i_{Ux} = \big[ k^i(x, x_j) \big]_{j \in U}$, and $H$ is a positive-definite matrix given by $H = (I + \mu S)^{-1}$. Here, $S$ is the Gram matrix of $s(\cdot, \cdot)$, i.e., $S = \gamma_1^{-1} K^1_{UU} + \gamma_2^{-1} K^2_{UU}$, where $K^i_{UU} = k^i(U, U)$ denotes the Gram matrix of $k^i$ over the unlabeled examples. (A computational sketch of this kernel is given below.)

In (Rosenberg & Bartlett, 2007), bounds are given on the Rademacher complexity of the co-regularized hypothesis space. This leads to generalization bounds in terms of the Rademacher complexity. In this section, we derive these complexity bounds in a few lines using Theorem 2.2 and a well-known result on RKHS balls. Furthermore, we present improved generalization bounds based on the theory of localized Rademacher complexity.

3.1. Rademacher Complexity Bounds

Definition 3.1. The empirical Rademacher complexity of a function class $\mathcal{A} = \{ f : \mathcal{X} \to \mathbb{R} \}$ on a sample
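As a concrete illustration of Theorem 2.2, the sketch below assembles the co-regularization kernel of Eqn. 7 from two base kernels and a set of unlabeled points, and then plugs it into a plain kernel ridge regression on the labeled examples. The RBF base kernels, all parameter values, and the choice of ridge regression as the downstream learner are illustrative assumptions (and the 1/2 scaling inside the loss of Eqn. 6 is ignored for simplicity); only the kernel construction itself follows the formulas of the theorem.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # Gaussian RBF kernel matrix between the rows of X and the rows of Z.
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def coreg_kernel(k1, k2, X_unlab, gamma1, gamma2, mu):
    """Build k_tilde(x, z) = s(x, z) - mu * d_x^T H d_z  (Eqn. 7)."""
    # S: Gram matrix of s(.,.) over the unlabeled points; H = (I + mu*S)^{-1}.
    S = k1(X_unlab, X_unlab) / gamma1 + k2(X_unlab, X_unlab) / gamma2
    H = np.linalg.inv(np.eye(X_unlab.shape[0]) + mu * S)

    def k_tilde(X, Z):
        s_XZ = k1(X, Z) / gamma1 + k2(X, Z) / gamma2              # s(x, z)
        D_X = k1(X, X_unlab) / gamma1 - k2(X, X_unlab) / gamma2   # rows are d_x^T
        D_Z = k1(Z, X_unlab) / gamma1 - k2(Z, X_unlab) / gamma2   # rows are d_z^T
        return s_XZ - mu * D_X @ H @ D_Z.T
    return k_tilde

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_lab = rng.normal(size=(10, 3))     # labeled inputs (hypothetical data)
    y = rng.normal(size=10)              # labels
    X_unlab = rng.normal(size=(50, 3))   # unlabeled inputs
    X_test = rng.normal(size=(5, 3))     # unseen test inputs

    # Two "views" simulated here by two different kernel widths.
    k1 = lambda A, B: rbf_kernel(A, B, sigma=1.0)
    k2 = lambda A, B: rbf_kernel(A, B, sigma=0.5)
    k_tilde = coreg_kernel(k1, k2, X_unlab, gamma1=1.0, gamma2=1.0, mu=1.0)

    # Standard kernel ridge regression with the co-regularization kernel:
    # the semi-supervised problem is now an ordinary supervised one.
    lam = 0.5
    K = k_tilde(X_lab, X_lab)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    predictions = k_tilde(X_test, X_lab) @ alpha   # out-of-sample extension
```

Because the resulting kernel is an ordinary positive-definite kernel, the same construction can be dropped into any kernel method, which is precisely the point of the reduction.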