An RKHS for Multi-View Learning and Manifold Co-Regularization

Vikas Sindhwani  [email protected]
Mathematical Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 USA

David S. Rosenberg  [email protected]
Department of Statistics, University of California, Berkeley, CA 94720 USA

Abstract

Inspired by co-training, many multi-view semi-supervised kernel methods implement the following idea: find a function in each of multiple Reproducing Kernel Hilbert Spaces (RKHSs) such that (a) the chosen functions make similar predictions on unlabeled examples, and (b) the average prediction given by the chosen functions performs well on labeled examples. In this paper, we construct a single RKHS with a data-dependent "co-regularization" norm that reduces these approaches to standard supervised learning. The reproducing kernel for this RKHS can be explicitly derived and plugged into any kernel method, greatly extending the theoretical and algorithmic scope of co-regularization. In particular, with this development, the Rademacher complexity bound for co-regularization given in (Rosenberg & Bartlett, 2007) follows easily from well-known results. Furthermore, more refined bounds given by localized Rademacher complexity can also be easily applied. We propose a co-regularization based algorithmic alternative to manifold regularization (Belkin et al., 2006; Sindhwani et al., 2005a) that leads to major empirical improvements on semi-supervised tasks. Unlike the recently proposed transductive approach of (Yu et al., 2008), our RKHS formulation is truly semi-supervised and naturally extends to unseen test data.

1. Introduction

In semi-supervised learning, we are given a few labeled examples together with a large collection of unlabeled data from which to estimate an unknown target function. Suppose we have two hypothesis spaces, $\mathcal{H}^1$ and $\mathcal{H}^2$, each of which contains a predictor that well-approximates the target function. We know that predictors that agree with the target function also agree with each other on unlabeled examples. Thus, any predictor in one hypothesis space that does not have an "agreeing predictor" in the other can be safely eliminated from consideration. Due to the resulting reduction in the complexity of the joint learning problem, one can expect improved generalization performance. These conceptual intuitions and their algorithmic instantiations together constitute a major line of work in semi-supervised learning. One of the earliest approaches in this area was "co-training" (Blum & Mitchell, 1998), in which $\mathcal{H}^1$ and $\mathcal{H}^2$ are defined over different representations, or "views", of the data, and trained alternately to maximize mutual agreement on unlabeled examples. More recently, several papers have formulated these intuitions as joint complexity regularization, or co-regularization, between $\mathcal{H}^1$ and $\mathcal{H}^2$, which are taken to be Reproducing Kernel Hilbert Spaces (RKHSs) of functions defined on the input space $\mathcal{X}$. Given a few labeled examples $\{(x_i, y_i)\}_{i \in L}$ and a collection of unlabeled data $\{x_i\}_{i \in U}$, co-regularization learns a prediction function,

$$f_\star(x) = \frac{1}{2}\left(f^1_\star(x) + f^2_\star(x)\right) \quad (1)$$

where $f^1_\star \in \mathcal{H}^1$ and $f^2_\star \in \mathcal{H}^2$ are obtained by solving the following optimization problem,

$$(f^1_\star, f^2_\star) = \mathop{\mathrm{argmin}}_{f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2} \; \gamma_1 \|f^1\|^2_{\mathcal{H}^1} + \gamma_2 \|f^2\|^2_{\mathcal{H}^2} + \mu \sum_{i \in U} \left[f^1(x_i) - f^2(x_i)\right]^2 + \sum_{i \in L} V(y_i, f(x_i)) \quad (2)$$

In this objective function, the first two terms measure complexity by the RKHS norms $\|\cdot\|^2_{\mathcal{H}^1}$ and $\|\cdot\|^2_{\mathcal{H}^2}$ in $\mathcal{H}^1$ and $\mathcal{H}^2$ respectively, the third term enforces agreement among predictors on unlabeled examples, and the final term evaluates the empirical loss of the mean function $f = (f^1 + f^2)/2$ on the labeled data with respect to a loss function $V(\cdot, \cdot)$. The real-valued parameters $\gamma_1$, $\gamma_2$, and $\mu$ allow different tradeoffs between the regularization terms. $L$ and $U$ are index sets over labeled and unlabeled examples respectively.

Several variants of this formulation have been proposed independently and explored in different contexts: linear logistic regression (Krishnapuram et al., 2005), regularized classification (Sindhwani et al., 2005b), regression (Brefeld et al., 2006), support vector classification (Farquhar et al., 2005), Bayesian co-training (Yu et al., 2008), and generalization theory (Rosenberg & Bartlett, 2007).

The main theoretical contribution of this paper is the construction of a new "co-regularization RKHS," in which standard supervised learning recovers the solution to the co-regularization problem of Eqn. 2. Theorem 2.2 presents the RKHS and gives an explicit formula for its reproducing kernel. This "co-regularization kernel" can be plugged into any standard kernel method, giving convenient and immediate access to two-view semi-supervised techniques for a wide variety of learning problems. Utilizing this kernel, in Section 3 we give much simpler proofs of the results of (Rosenberg & Bartlett, 2007) concerning bounds on the Rademacher complexity and generalization performance of co-regularization. As a more algorithmic application, in Section 4 we consider the semi-supervised learning setting where examples live near a low-dimensional manifold embedded in a high dimensional ambient euclidean space. Our approach, manifold co-regularization (CoMR), gives major empirical improvements over the manifold regularization (MR) framework of (Belkin et al., 2006; Sindhwani et al., 2005a).

The recent work of (Yu et al., 2008) considers a similar reduction.
However, this reduction is strictly transductive and does not allow prediction on unseen test examples. By contrast, our formulation is truly semi-supervised and provides a principled out-of-sample extension.

2. An RKHS for Co-Regularization

We start by reformulating the co-regularization optimization problem, given in Eqn. 1 and Eqn. 2, in the following equivalent form where we directly solve for the final prediction function $f_\star$:

$$f_\star = \mathop{\mathrm{argmin}}_{f} \;\; \min_{\substack{f = f^1 + f^2 \\ f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2}} \; \frac{\gamma_1}{2}\|f^1\|^2_{\mathcal{H}^1} + \frac{\gamma_2}{2}\|f^2\|^2_{\mathcal{H}^2} + \frac{\mu}{2}\sum_{i \in U}\left[f^1(x_i) - f^2(x_i)\right]^2 + \sum_{i \in L} V\!\left(y_i, \tfrac{1}{2}f(x_i)\right) \quad (3)$$

Consider the sum space of functions, $\tilde{\mathcal{H}}$, given by,

$$\tilde{\mathcal{H}} = \mathcal{H}^1 \oplus \mathcal{H}^2 = \left\{ f \mid f(x) = f^1(x) + f^2(x),\; f^1 \in \mathcal{H}^1,\; f^2 \in \mathcal{H}^2 \right\} \quad (4)$$

and impose on it a data-dependent norm,

$$\|f\|^2_{\tilde{\mathcal{H}}} = \min_{\substack{f = f^1 + f^2 \\ f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2}} \; \gamma_1\|f^1\|^2_{\mathcal{H}^1} + \gamma_2\|f^2\|^2_{\mathcal{H}^2} + \mu \sum_{i \in U}\left[f^1(x_i) - f^2(x_i)\right]^2 \quad (5)$$

The minimization problem in Eqn. 3 can then be posed as standard supervised learning in $\tilde{\mathcal{H}}$ as follows,

$$f_\star = \mathop{\mathrm{argmin}}_{f \in \tilde{\mathcal{H}}} \; \gamma \|f\|^2_{\tilde{\mathcal{H}}} + \sum_{i \in L} V\!\left(y_i, \tfrac{1}{2}f(x_i)\right) \quad (6)$$

where $\gamma = \frac{1}{2}$. Of course, this reformulation is not really useful unless $\tilde{\mathcal{H}}$ itself is a valid new RKHS. Let us recall the definition of an RKHS.

Definition 2.1 (RKHS). A reproducing kernel Hilbert space (RKHS) is a Hilbert space $\mathcal{F}$ that possesses a reproducing kernel, i.e., a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for which the following hold: (a) $k(x, \cdot) \in \mathcal{F}$ for all $x \in \mathcal{X}$, and (b) $\langle f, k(x, \cdot)\rangle_{\mathcal{F}} = f(x)$ for all $x \in \mathcal{X}$ and $f \in \mathcal{F}$, where $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ denotes the inner product in $\mathcal{F}$.

In Theorem 2.2, we show that $\tilde{\mathcal{H}}$ is indeed an RKHS, and moreover we give an explicit expression for its reproducing kernel. Thus, it follows that although the domain of optimization in Eqn. 6 is nominally a function space, by the Representer Theorem we can express it as a finite-dimensional optimization problem.

2.1. Co-Regularization Kernels

Let $\mathcal{H}^1$, $\mathcal{H}^2$ be RKHSs with kernels given by $k^1$, $k^2$ respectively, and let $\tilde{\mathcal{H}} = \mathcal{H}^1 \oplus \mathcal{H}^2$ as defined in Eqn. 4. We have the following result.

Theorem 2.2. There exists an inner product on $\tilde{\mathcal{H}}$ for which $\tilde{\mathcal{H}}$ is an RKHS with norm defined by Eqn. 5 and reproducing kernel $\tilde{k} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by,

$$\tilde{k}(x, z) = s(x, z) - \mu\, d_x^T H d_z \quad (7)$$

where $s(x, z)$ is the (scaled) sum of kernels given by,

$$s(x, z) = \gamma_1^{-1} k^1(x, z) + \gamma_2^{-1} k^2(x, z),$$

and $d_x$ is a vector-valued function that depends on the difference in views, measured as,

$$d_x = \gamma_1^{-1} k^1_{Ux} - \gamma_2^{-1} k^2_{Ux},$$

where $k^i_{Ux} = \left[k^i(x, x_j)\right]^T_{j \in U}$, and $H$ is a positive-definite matrix given by $H = (I + \mu S)^{-1}$. Here, $S$ is the gram matrix of $s(\cdot, \cdot)$, i.e., $S = \gamma_1^{-1} K^1_{UU} + \gamma_2^{-1} K^2_{UU}$, where $K^i_{UU} = k^i(U, U)$ denotes the Gram matrix of $k^i$ over unlabeled examples.

We give a rigorous proof in Appendix A.
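As a concrete illustration of Eqn. 7, the sketch below assembles the co-regularization kernel from two base kernels with numpy. The function name, the callable-kernel interface `k(X, Z)`, and the dense linear algebra are our own illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def coreg_kernel(k1, k2, Xu, gamma1, gamma2, mu):
    """Build the co-regularization kernel of Eqn. 7 from base kernels k1, k2.

    k1, k2 : callables returning the Gram matrix k(X, Z) between two point sets
    Xu     : array of unlabeled points indexed by U
    """
    # S = gamma1^-1 K1_UU + gamma2^-1 K2_UU is the Gram matrix of s(.,.) on U.
    S = k1(Xu, Xu) / gamma1 + k2(Xu, Xu) / gamma2
    H = np.linalg.inv(np.eye(len(Xu)) + mu * S)        # H = (I + mu*S)^-1

    def k_tilde(X, Z):
        s = k1(X, Z) / gamma1 + k2(X, Z) / gamma2      # s(x, z)
        Dx = k1(Xu, X) / gamma1 - k2(Xu, X) / gamma2   # columns are d_x for x in X
        Dz = k1(Xu, Z) / gamma1 - k2(Xu, Z) / gamma2   # columns are d_z for z in Z
        return s - mu * Dx.T @ H @ Dz                  # Eqn. 7
    return k_tilde
```

Any standard kernel method can then be run with `k_tilde` in place of its usual kernel; Section 2.2 spells out how the two component predictors are recovered.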

2.2. Representer Theorem

Theorem 2.2 says that $\tilde{\mathcal{H}}$ is a valid RKHS with kernel $\tilde{k}$. By the Representer Theorem, the solution to Eqn. 6 is given by

$$f_\star(x) = \sum_{i \in L} \alpha_i \tilde{k}(x_i, x) \quad (8)$$

The corresponding components in $\mathcal{H}^1$, $\mathcal{H}^2$ can also be retrieved as,

$$f^1_\star(x) = \sum_{i \in L} \alpha_i \gamma_1^{-1}\left(k^1(x_i, x) - \mu\, d_{x_i}^T H k^1_{Ux}\right) \quad (9)$$

$$f^2_\star(x) = \sum_{i \in L} \alpha_i \gamma_2^{-1}\left(k^2(x_i, x) + \mu\, d_{x_i}^T H k^2_{Ux}\right) \quad (10)$$

Note that $\mathcal{H}^1$ and $\mathcal{H}^2$ are defined on the same domain $\mathcal{X}$ so that taking the mean prediction is meaningful. In a two-view problem one may begin by defining $\mathcal{H}^1$, $\mathcal{H}^2$ on different view spaces $\mathcal{X}^1$, $\mathcal{X}^2$ respectively. Such a problem can be mapped to our framework by extending $\mathcal{H}^1$, $\mathcal{H}^2$ to $\mathcal{X} = \mathcal{X}^1 \times \mathcal{X}^2$ by re-defining $f^1(x^1, x^2) = f^1(x^1)$, $f^1 \in \mathcal{H}^1$; similarly for $\mathcal{H}^2$. While we omit these technical details, it is important to note that in such cases, Eqns. 9 and 10 can be reinterpreted as predictors defined on $\mathcal{X}^1$, $\mathcal{X}^2$ respectively.
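To make the reduction concrete, the sketch below trains a regularized least squares model with the co-regularization kernel (squared loss is also the choice used in the experiments of Section 4.3) and exposes the component predictors of Eqns. 9 and 10. It reuses the hypothetical `coreg_kernel` helper from Section 2.1; the plain RLS solver and the omission of the 1/2 prediction scaling of Eqn. 1 are simplifications for illustration.

```python
import numpy as np

def fit_coreg_rls(k1, k2, Xl, y, Xu, gamma1, gamma2, mu, gamma):
    """Supervised learning in H-tilde (Eqn. 6) with squared loss, plus the
    component predictors of Eqns. 9 and 10."""
    k_tilde = coreg_kernel(k1, k2, Xu, gamma1, gamma2, mu)
    K = k_tilde(Xl, Xl)                                       # labeled Gram matrix
    alpha = np.linalg.solve(K + gamma * np.eye(len(Xl)), y)   # RLS expansion coefficients

    # Quantities shared by Eqns. 9 and 10.
    S = k1(Xu, Xu) / gamma1 + k2(Xu, Xu) / gamma2
    H = np.linalg.inv(np.eye(len(Xu)) + mu * S)
    D_UL = k1(Xu, Xl) / gamma1 - k2(Xu, Xl) / gamma2          # columns are d_{x_i}, i in L

    def f(X):    # combined predictor, Eqn. 8
        return k_tilde(X, Xl) @ alpha
    def f1(X):   # view-1 component, Eqn. 9
        return (k1(X, Xl) - mu * k1(X, Xu) @ H @ D_UL) @ alpha / gamma1
    def f2(X):   # view-2 component, Eqn. 10
        return (k2(X, Xl) + mu * k2(X, Xu) @ H @ D_UL) @ alpha / gamma2
    return f, f1, f2
```

On any test set, `f(X)` agrees with `f1(X) + f2(X)` up to numerical error, which is a useful sanity check on an implementation.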

3. Bounds on Complexity and Generalization

By eliminating all predictors that do not collectively agree on unlabeled examples, co-regularization intuitively reduces the complexity of the learning problem. It is reasonable then to expect better test performance for the same amount of labeled training data. In (Rosenberg & Bartlett, 2007), the size of the co-regularized function class is measured by its empirical Rademacher complexity, and tight upper and lower bounds are given on the Rademacher complexity of the co-regularized hypothesis space. This leads to generalization bounds in terms of the Rademacher complexity. In this section, we derive these complexity bounds in a few lines using Theorem 2.2 and a well-known result on RKHS balls. Furthermore, we present improved generalization bounds based on the theory of localized Rademacher complexity.

3.1. Rademacher Complexity Bounds

Definition 3.1. The empirical Rademacher complexity of a function class $\mathcal{A} = \{f : \mathcal{X} \to \mathbb{R}\}$ on a sample $x_1, \ldots, x_\ell \in \mathcal{X}$ is defined as

$$\hat{R}_\ell(\mathcal{A}) = \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{A}} \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right],$$

where the expectation is with respect to $\sigma = \{\sigma_1, \ldots, \sigma_\ell\}$, and the $\sigma_i$ are i.i.d. Rademacher random variables, that is, $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$.

Let $\mathcal{H}$ be an arbitrary RKHS with kernel $k(\cdot,\cdot)$, and denote the standard RKHS supervised learning objective function by $Q(f) = \sum_{i \in L} V(y_i, f(x_i)) + \lambda \|f\|^2_{\mathcal{H}}$. Let $f_\star = \mathrm{argmin}_{f \in \mathcal{H}} Q(f)$. Then $Q(f_\star) \le Q(0) = \sum_{i \in L} V(y_i, 0)$. It follows that $\|f_\star\|^2_{\mathcal{H}} \le Q(0)/\lambda$. Thus if we have some a priori control on $Q(0)$, then we can restrict the search for $f_\star$ to a ball in $\mathcal{H}$ of radius $r = \sqrt{Q(0)/\lambda}$.

We now cite a well-known result about the Rademacher complexity of a ball in an RKHS (see e.g. (Boucheron et al., 2005)). Let $\mathcal{H}_r := \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le r\}$ denote the ball of radius $r$ in $\mathcal{H}$. Then we have the following:

Lemma 3.2. The empirical Rademacher complexity on the sample $x_1, \ldots, x_\ell \in \mathcal{X}$ for the RKHS ball $\mathcal{H}_r$ is bounded as follows:

$$\frac{1}{4}\,\frac{2r}{\sqrt{2}\,\ell}\,\sqrt{\mathrm{tr}\,K} \;\le\; \hat{R}_\ell(\mathcal{H}_r) \;\le\; \frac{2r}{\ell}\,\sqrt{\mathrm{tr}\,K}$$

where $K = \left[k(x_i, x_j)\right]_{i,j=1}^{\ell}$ is the kernel matrix.

For the co-regularization problem described in Eqns. 3 and 6, we have $f_\star \in \tilde{\mathcal{H}}_r$ where $r^2 = 2\ell \sup_y V(0, y)$, where $\ell$ is the number of labeled examples. We now state and prove bounds on the Rademacher complexity of $\tilde{\mathcal{H}}_r$. The bounds here are exactly the same as those given in (Rosenberg & Bartlett, 2007). However, while they take a lengthy "bare-hands" approach, here we get the result as a simple corollary of Theorem 2.2 and Lemma 3.2.

Theorem 3.3. The empirical Rademacher complexity on the labeled sample $x_1, \ldots, x_\ell \in \mathcal{X}$ for the RKHS ball $\tilde{\mathcal{H}}_r$ is bounded as follows:

$$\frac{1}{4}\,\frac{2r}{\sqrt{2}\,\ell}\,\sqrt{\mathrm{tr}\,\tilde{K}} \;\le\; \hat{R}_\ell(\tilde{\mathcal{H}}_r) \;\le\; \frac{2r}{\ell}\,\sqrt{\mathrm{tr}\,\tilde{K}},$$

where $\tilde{K} = \gamma_1^{-1} K^1_{LL} + \gamma_2^{-1} K^2_{LL} - \mu D_{UL}^T (I + \mu S)^{-1} D_{UL}$ and $D_{UL} = \gamma_1^{-1} K^1_{UL} - \gamma_2^{-1} K^2_{UL}$.

Proof. Note that $\tilde{K}$ is just the kernel matrix for the co-regularization kernel $\tilde{k}(\cdot,\cdot)$ on the labeled data. The bound then follows immediately from Lemma 3.2.

3.2. Co-Regularization Reduces Complexity

The co-regularization parameter $\mu$ controls the extent to which we enforce agreement between $f^1$ and $f^2$. Let $\tilde{\mathcal{H}}(\mu)$ denote the co-regularization RKHS for a particular value of $\mu$. From Theorem 3.3, we see that the Rademacher complexity for a ball of radius $r$ in $\tilde{\mathcal{H}}(\mu)$ decreases with $\mu$ by an amount determined by

$$\Delta(\mu) = \mathrm{tr}\left[\mu\, D_{UL}^T (I + \mu S)^{-1} D_{UL}\right] \quad (11)$$

$$\phantom{\Delta(\mu)} = \sum_{i=1}^{\ell} \rho^2\!\left(\gamma_1^{-1} k^1_{Ux_i},\; \gamma_2^{-1} k^2_{Ux_i}\right) \quad (12)$$

where $\rho(\cdot,\cdot)$ is a metric on $\mathbb{R}^{|U|}$ defined by,

$$\rho^2(s, t) = \mu (s - t)^T (I + \mu S)^{-1} (s - t)$$

We see that the complexity reduction, $\Delta(\mu)$, grows with the $\rho$-distance between the two different representations of the labeled points. Note that the metric $\rho$ is determined by $S$, the weighted sum of gram matrices of the two kernels on unlabeled data. A small numerical sketch of these quantities appears at the end of this section.

3.3. Generalization Bounds

With Theorem 2.2 allowing us to express multi-view co-regularization problems as supervised learning in a data-dependent RKHS, we can now bring a large body of theory to bear on the generalization performance of co-regularization methods. We start by quoting the theorem proved in (Rosenberg & Bartlett, 2007). Next, we state an improved bound based on localized Rademacher complexity. Below, we denote the unit ball in $\tilde{\mathcal{H}}$ by $\tilde{\mathcal{H}}_1$.

Condition 1. The loss $V(\cdot,\cdot)$ is Lipschitz in its first argument, i.e., there exists a constant $A$ such that $\forall y, \hat{y}_1, \hat{y}_2: \; |V(\hat{y}_1, y) - V(\hat{y}_2, y)| \le A\, |\hat{y}_1 - \hat{y}_2|$.

Theorem 3.4. Suppose $V : \mathcal{Y}^2 \to [0, 1]$ satisfies Condition 1. Then conditioned on the unlabeled data, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the sample of labeled points $(x_1, y_1), \ldots, (x_\ell, y_\ell)$ drawn i.i.d. from $P$, we have for any predictor $f \in \tilde{\mathcal{H}}_1$ that

$$P\left[V(f(x), y)\right] \;\le\; \frac{1}{\ell}\sum_{i=1}^{\ell} V(f(x_i), y_i) + 2A\,\hat{R}_\ell(\tilde{\mathcal{H}}_1) + \frac{1}{\sqrt{\ell}}\left(2 + 3\sqrt{\ln(2/\delta)/2}\right)$$

We need two more conditions for the localized bound:

Condition 2. For any probability distribution $P$, there exists $f_\star \in \tilde{\mathcal{H}}_1$ satisfying $P[V(f_\star(x), y)] = \inf_{f \in \tilde{\mathcal{H}}_1} P[V(f(x), y)]$.

Condition 3. There is a constant $B \ge 1$ such that for every probability distribution $P$ and every $f \in \tilde{\mathcal{H}}_1$ we have, $P(f - f_\star)^2 \le B\, P\left(V[f(x), y] - V[f_\star(x), y]\right)$.

In the following theorem, let $P_\ell$ denote the empirical probability measure for the labeled sample of size $\ell$.

Theorem 3.5. [Cor. 6.7 from (Bartlett et al., 2002)] Assume that $\sup_{x \in \mathcal{X}} \tilde{k}(x, x) \le 1$ and that $V$ satisfies the three conditions above. Let $\hat{f}$ be any element of $\tilde{\mathcal{H}}_1$ satisfying $P_\ell V[\hat{f}(x), y] = \inf_{f \in \tilde{\mathcal{H}}_1} P_\ell V[f(x), y]$. There exists a constant $c$ depending only on $A$ and $B$ such that with probability at least $1 - 6e^{-\nu}$,

$$P\left(V[\hat{f}(x), y] - V[f_\star(x), y]\right) \;\le\; c\left(\hat{r}^* + \frac{\nu}{\ell}\right),$$

where $\hat{r}^* \le \min_{0 \le h \le \ell}\left(\frac{h}{\ell} + \sqrt{\frac{1}{\ell}\sum_{i > h} \lambda_i}\right)$ and $\lambda_1, \ldots, \lambda_\ell$ are the eigenvalues of the labeled-data kernel matrix $\tilde{K}_{LL}$ in decreasing order.

Note that while Theorem 3.4 bounds the gap between the expected and empirical performance of an arbitrary $f \in \tilde{\mathcal{H}}_1$, Theorem 3.5 bounds the gap between the empirical loss minimizer over $\tilde{\mathcal{H}}_1$ and the true risk minimizer in $\tilde{\mathcal{H}}_1$. Since the localized bound only needs to account for the capacity of the function class in the neighborhood of $f_\star$, the bounds are potentially tighter. Indeed, while the bound in Theorem 3.4 is in terms of the trace of the kernel matrix, the bound in Theorem 3.5 involves the tail sum of the kernel eigenvalues. If the eigenvalues decay very quickly, the latter is potentially much smaller.
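The quantities appearing in this section are easy to compute once the relevant Gram matrices are available. The sketch below evaluates the complexity reduction $\Delta(\mu)$ of Eqn. 11 and contrasts the trace-based quantity of Theorem 3.3 with the eigenvalue tail-sum term of Theorem 3.5 (radii and constants omitted). It reuses the illustrative kernel interface from Section 2.1 and is not part of the paper's experimental code.

```python
import numpy as np

def complexity_reduction(k1, k2, Xl, Xu, gamma1, gamma2, mu):
    """Delta(mu) of Eqn. 11: the amount by which tr(K_tilde) shrinks as the
    co-regularization parameter mu is turned on."""
    S = k1(Xu, Xu) / gamma1 + k2(Xu, Xu) / gamma2
    D_UL = k1(Xu, Xl) / gamma1 - k2(Xu, Xl) / gamma2
    M = np.linalg.solve(np.eye(len(Xu)) + mu * S, D_UL)   # (I + mu*S)^-1 D_UL
    return mu * np.trace(D_UL.T @ M)

def capacity_summaries(K_tilde_LL):
    """Trace term of the global Rademacher bound (Theorem 3.3) versus the
    eigenvalue tail-sum term of the localized bound (Theorem 3.5)."""
    ell = K_tilde_LL.shape[0]
    trace_term = np.sqrt(np.trace(K_tilde_LL)) / ell
    lam = np.sort(np.linalg.eigvalsh(K_tilde_LL))[::-1]   # eigenvalues, decreasing
    r_hat = min(h / ell + np.sqrt(lam[h:].sum() / ell) for h in range(ell + 1))
    return trace_term, r_hat
```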
4. Manifold Co-Regularization

Consider the picture shown in Figure 1(a), where there are two classes of data points in the plane ($\mathbb{R}^2$) lying on one of two concentric circles. The large, colored points are labeled while the smaller, black points are unlabeled. The picture immediately suggests two notions of distance that are very natural but radically different. For example, the two labeled points are close in the ambient euclidean distance on $\mathbb{R}^2$, but infinitely far apart in terms of the intrinsic geodesic distance measured along the circles.

Suppose for this picture one had access to two kernel functions, $k^1$, $k^2$, that assign high similarity to nearby points according to euclidean and geodesic distance respectively. Because of the difference in ambient and intrinsic representations, by co-regularizing the associated RKHSs one can hope to get good reductions in complexity, as suggested in Section 3.2. In Figure 1, we report the value of the complexity reduction (Eqn. 12) for four point clouds generated at increasing levels of noise off the two concentric circles. When the noise becomes large, the ambient and intrinsic notions of distance converge and the amount of complexity reduction decreases.

[Figure 1. Complexity reduction $\Delta(\mu = 1)$ (Eqn. 12) with respect to noise level $\rho$. The choice of $k^1$, $k^2$ is discussed in the following subsections. Panels: (a) $\Delta = 7957$, $\rho = 0$; (b) $\Delta = 7569$, $\rho = 0.1$; (c) $\Delta = 3720$, $\rho = 0.2$; (d) $\Delta = 3502$, $\rho = 0.5$.]

The setting where data lies on a low-dimensional submanifold $\mathcal{M}$ embedded in a higher dimensional ambient space $\mathcal{X}$, as in the concentric circles case above, has attracted considerable research interest recently, almost orthogonal to multi-view efforts. The main assumption underlying manifold-motivated semi-supervised learning algorithms is the following: two points that are close with respect to geodesic distances on $\mathcal{M}$ should have similar labels. Such an assumption may be enforced by an intrinsic regularizer that emphasizes complexity along the manifold.

Since $\mathcal{M}$ is truly unknown, the intrinsic regularizer is empirically estimated from the point cloud of labeled and unlabeled data. In the graph transduction approach, an $nn$-nearest neighbor graph $\mathcal{G}$ is constructed which serves as an empirical substitute for $\mathcal{M}$. The vertices $\mathcal{V}$ of this graph are the set of labeled and unlabeled examples. Let $\mathcal{H}_I$ denote the space of all functions mapping $\mathcal{V}$ to $\mathbb{R}$, where the subscript implies "intrinsic." Any function $f \in \mathcal{H}_I$ can be identified with the vector $\mathbf{f} = [f(x_i),\; x_i \in \mathcal{V}]^T$. One can impose a norm $\|f\|^2_I = \sum_{ij} W_{ij}\left[f(x_i) - f(x_j)\right]^2$ on $\mathcal{H}_I$ that provides a natural measure of smoothness for $f$ with respect to the graph. Here, $W$ denotes the adjacency matrix of the graph. When $\mathcal{X}$ is a euclidean space, a typical $W$ is given by $W_{ij} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$ if $i$ and $j$ are nearest neighbors and $0$ otherwise. In practice, one may use a problem-dependent similarity matrix to set these edge weights. This norm can be conveniently written as a quadratic form $\mathbf{f}^T M \mathbf{f}$, where $M$ is the graph Laplacian matrix defined as $M = D - W$, and $D$ is a diagonal degree matrix with entries $D_{ii} = \sum_j W_{ij}$.

It turns out that with the norm $\|\cdot\|_I$, $\mathcal{H}_I$ is an RKHS whose reproducing kernel $k_I : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ is given by $k_I(x_i, x_j) = M^\dagger_{ij}$, where $M^\dagger$ denotes the pseudo-inverse of the Laplacian. Given $\mathcal{H}_I$ with its reproducing kernel, graph transduction solves the standard RKHS regularization problem, $f_\star = \mathrm{argmin}_{f \in \mathcal{H}_I}\; \gamma \|f\|^2_I + \sum_{i \in L} V(y_i, f(x_i))$, where $y_i$ is the label associated with the node $x_i$. Note that the solution $f_\star$ is only defined over $\mathcal{V}$, the set of labeled and unlabeled examples. Since graph transduction does not provide a function whose domain is the ambient space $\mathcal{X}$, it is not clear how to make predictions on unseen test points $x \in \mathcal{X}$. Possessing a principled "out-of-sample extension" distinguishes semi-supervised methods from transductive procedures.
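The intrinsic Gram matrix used above is straightforward to construct from a point cloud. The following sketch builds the Gaussian-weighted nearest-neighbor graph, the Laplacian $M = D - W$, and its pseudo-inverse. It is a dense, minimal illustration (a practical implementation would exploit sparsity), and the helper name is our own.

```python
import numpy as np

def intrinsic_kernel(X, nn=10, sigma=1.0):
    """Gram matrix of the intrinsic kernel k_I = M^dagger on the point cloud X."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:nn + 1]                # nn nearest neighbors (skip self)
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                                # symmetrize the adjacency matrix
    M = np.diag(W.sum(axis=1)) - W                        # graph Laplacian, M = D - W
    return np.linalg.pinv(M)                              # k_I(x_i, x_j) = M^dagger_ij
```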

4.1. Ambient and Intrinsic Co-Regularization

We propose a co-regularization solution for out-of-sample prediction. Conceptually, one may interpret the manifold setting as a multi-view problem where each labeled or unlabeled example appears in two "views": (a) an ambient view in $\mathcal{X}$ in terms of euclidean co-ordinates $x$, and (b) an intrinsic view in $\mathcal{G}$ as a vertex index $i$. Let $\mathcal{H}_A$ be an RKHS defined over the ambient domain $\mathcal{X}$ with an associated kernel $k_A : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We can now combine ambient and intrinsic views by co-regularizing $\mathcal{H}_A$, $\mathcal{H}_I$. This can be done by setting $k^1 = k_A$ and $k^2 = k_I$ in Eqn. 7 and solving Eqn. 6. The combined prediction function $f_\star$ given by Eqn. 8 is the mean of an ambient component $f^1_\star$ given by Eqn. 9 and an intrinsic component $f^2_\star$ given by Eqn. 10. Even though $f_\star$ is transductive and only defined on labeled and unlabeled examples, the ambient component $f^1_\star$ can be used for out-of-sample prediction. Due to co-regularization, this ambient component is (a) smooth in $\mathcal{H}_A$ and (b) in agreement with a smooth function on the data manifold. We call this approach manifold co-regularization, and abbreviate it as CoMR.
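Putting the pieces together, the sketch below is one way to realize CoMR end-to-end: a Gaussian ambient kernel, the graph-based intrinsic Gram matrix from the previous sketch, the co-regularization kernel restricted to the labeled block (which is exactly $\tilde{K}$ of Theorem 3.3), RLS training, and out-of-sample prediction through the ambient component of Eqn. 9. The function names, the shared $\sigma$ for the graph weights and the ambient kernel, and the plain RLS solver are illustrative simplifications, not the paper's exact experimental protocol.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def comr_fit(Xl, y, Xu, sigma, gamma1, gamma2, mu, gamma, nn=10):
    """Manifold co-regularization sketch: k1 = ambient Gaussian kernel,
    k2 = intrinsic graph kernel on V = L u U; returns an out-of-sample predictor."""
    X = np.vstack([Xl, Xu])
    l, u = len(Xl), len(Xu)
    K_A = gaussian_kernel(X, X, sigma)                 # ambient Gram matrix on L u U
    K_I = intrinsic_kernel(X, nn=nn, sigma=sigma)      # intrinsic Gram matrix on L u U

    # Blocks of the co-regularization kernel (Eqn. 7) with U as the unlabeled set:
    # indices 0..l-1 label L, indices l..l+u-1 label U.
    S = K_A[l:, l:] / gamma1 + K_I[l:, l:] / gamma2
    H = np.linalg.inv(np.eye(u) + mu * S)
    D_UL = K_A[l:, :l] / gamma1 - K_I[l:, :l] / gamma2
    K_LL = (K_A[:l, :l] / gamma1 + K_I[:l, :l] / gamma2
            - mu * D_UL.T @ H @ D_UL)                  # K_tilde on labeled points

    alpha = np.linalg.solve(K_LL + gamma * np.eye(l), y)   # RLS in H-tilde (Eqn. 6)

    def predict(Xtest):
        # Ambient component of Eqn. 9: the principled out-of-sample extension.
        k_tl = gaussian_kernel(Xtest, Xl, sigma)
        k_tu = gaussian_kernel(Xtest, Xu, sigma)
        return (k_tl - mu * k_tu @ H @ D_UL) @ alpha / gamma1
    return predict
```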
4.2. Manifold Regularization

In the manifold regularization (MR) approach of (Belkin et al., 2006; Sindhwani et al., 2005a), the following optimization problem is solved:

$$f^\star = \mathop{\mathrm{argmin}}_{f \in \mathcal{H}_A}\; \gamma_1\|f\|^2_{\mathcal{H}_A} + \gamma_2\, \mathbf{f}^T M \mathbf{f} + \sum_{i \in L} V(y_i, f(x_i)) \quad (13)$$

where $\mathbf{f} = [f(x_i),\; i \in L \cup U]^T$. The solution $f^\star$ is defined on $\mathcal{X}$, and can therefore be used for out-of-sample prediction.

In Figure 2, we show a simple two-dimensional dataset where MR provably fails when $\mathcal{H}_A$ is the space of linear functions on $\mathbb{R}^2$. The LINES dataset consists of two classes spread along perpendicular lines. In MR, the intrinsic regularizer is enforced directly on $\mathcal{H}_A$. It can easily be shown that the intrinsic norm of a linear function $f(x) = w^T x$ along the perpendicular lines is exactly the same as the ambient norm, i.e., $\|f\|^2_I = \|f\|^2_A = w^T w$. Due to this, MR simply ignores unlabeled data and reduces to supervised training with the regularization parameter $\gamma_1 + \gamma_2$.

The linear function that gives maximally smooth predictions on one line also gives the maximally non-smooth predictions on the other line. One way to remedy this restrictive situation is to introduce slack variables $\xi = (\xi_i)_{i \in L \cup U}$ in Eqn. 13 with an $\ell_2$ penalty, and instead solve: $f^\star = \mathrm{argmin}_{f \in \mathcal{H}_A, \xi}\; \gamma_1\|f\|^2_{\mathcal{H}_A} + \gamma_2(\mathbf{f} + \xi)^T M (\mathbf{f} + \xi) + \mu\|\xi\|^2 + \sum_{i \in L} V(y_i, f(x_i))$. Re-parameterizing $g = \mathbf{f} + \xi$, we can re-write the above problem as, $f^\star = \mathrm{argmin}_{f \in \mathcal{H}_A, g \in \mathcal{H}_I}\; \gamma_1\|f\|^2_{\mathcal{H}_A} + \gamma_2\|g\|^2_I + \mu\|\mathbf{f} - g\|^2 + \sum_{i \in L} V(y_i, f(x_i))$, which may be viewed as a variant of the co-regularization problem in Eqn. 2 where the empirical loss is measured for $f$ alone. Thus, this motivates the view that CoMR adds extra slack variables to the MR objective function to better fit the intrinsic regularizer. Figure 2 shows that CoMR achieves better separation between classes on the LINES dataset.

[Figure 2. Decision boundaries of (a) MR and (b) CoMR (using the quadratic hinge loss) on the LINES dataset.]

4.3. Experiments

In this section, we compare MR and CoMR. Similar to our construction of the co-regularization kernel, (Sindhwani et al., 2005a) provide a data-dependent kernel that reduces manifold regularization to standard supervised learning in an associated RKHS. We write the manifold regularization kernel in the following form,

$$\tilde{k}_{mr}(x, z) = \bar{s}(x, z) - \bar{d}_x^T \bar{H} \bar{d}_z \quad (14)$$

where we have $\bar{s}(x, z) = \gamma_1^{-1} k^1(x, z)$, $\bar{d}_x = \gamma_1^{-1} k^1_{Ux}$, and $\bar{H} = \left(\gamma_1^{-1}\bar{K}^1 + \gamma_2^{-1}\bar{K}^2\right)^{-1}$, where $\bar{K}^1$ is the Gram matrix of $k^1 = k_A$ over labeled and unlabeled examples, and $\bar{K}^2 = M^\dagger$. We use the notations $\bar{s}$, $\bar{d}$, $\bar{H}$ so that the kernel can be easily compared with the corresponding quantities in the co-regularization kernel of Eqn. 7. In this section we empirically compare this kernel with the co-regularization kernel of Eqn. 7 for exactly the same choice of $k^1$, $k^2$; a small code sketch of this kernel appears at the end of this subsection. Semi-supervised classification experiments were performed on 5 datasets described in Table 1.

Table 1. Datasets with d features and c classes. 10 random data splits were created with l labeled, u unlabeled, t test, and v validation examples.

Dataset | d    | c  | l  | u    | t   | v
LINES   | 2    | 2  | 2  | 500  | 250 | 250
G50C    | 50   | 2  | 50 | 338  | 112 | 50
USPST   | 256  | 10 | 50 | 1430 | 477 | 50
COIL20  | 241  | 20 | 40 | 1320 | 40  | 40
PCMAC   | 7511 | 2  | 50 | 1385 | 461 | 50

The LINES dataset is a variant of the two-dimensional problem shown in Figure 2 where we added random noise around the two perpendicular lines. The G50C, USPST, COIL20, and PCMAC datasets are well known and have frequently been used for empirical studies in the semi-supervised learning literature. They were used for benchmarking manifold regularization in (Sindhwani et al., 2005a) against a number of competing methods. G50C is an artificial dataset generated from two unit-covariance normal distributions with equal probabilities. The class means are adjusted so that the Bayes error is 5%. COIL20 consists of 32x32 gray scale images of 20 objects viewed from varying angles. USPST is taken from the test subset of the USPS dataset of images containing 10 classes of handwritten digits. PCMAC is used to set up binary text categorization problems drawn from the 20-newsgroups dataset.

For each of the 5 datasets, we constructed random splits into labeled, unlabeled, test and validation sets. The sizes of these sets are given in Table 1. For all datasets except LINES, we used Gaussian ambient kernels $k^1(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$, and an intrinsic graph kernel whose gram matrix is of the form $K^2 = (M^p + 10^{-6} I)^{-1}$. Here, $M$ is the normalized graph Laplacian constructed using $nn$ nearest neighbors and $p$ is an integer. These parameters are tabulated in Table 4 for reproducibility. For more details on these parameters see (Sindhwani et al., 2005a).
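As referenced above, the MR kernel of Eqn. 14 can be assembled in the same style as the co-regularization kernel sketch of Section 2.1, which makes the structural comparison between the two explicit. As before, the interface is an illustrative assumption; the deforming point set is taken to be the labeled and unlabeled examples, matching the Gram matrices appearing in $\bar{H}$.

```python
import numpy as np

def mr_kernel(k1, K2_cloud, Xcloud, gamma1, gamma2):
    """Manifold regularization kernel of Eqn. 14.

    k1       : ambient kernel function k_A(X, Z) -> Gram matrix
    K2_cloud : intrinsic Gram matrix (e.g. the Laplacian pseudo-inverse) on Xcloud
    Xcloud   : labeled and unlabeled examples used to deform the kernel
    """
    K1_cloud = k1(Xcloud, Xcloud)
    H_bar = np.linalg.inv(K1_cloud / gamma1 + K2_cloud / gamma2)

    def k_mr(X, Z):
        s_bar = k1(X, Z) / gamma1
        Dx = k1(Xcloud, X) / gamma1        # columns are d_bar_x
        Dz = k1(Xcloud, Z) / gamma1
        return s_bar - Dx.T @ H_bar @ Dz   # Eqn. 14
    return k_mr
```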

We chose the squared loss for $V(\cdot,\cdot)$. Manifold regularization with this choice is also referred to as Laplacian RLS and empirically performs as well as Laplacian SVMs. For multi-class problems, we used the one-versus-rest strategy. $\gamma_1$, $\gamma_2$ were varied on a grid of values, $10^{-6}, 10^{-4}, 10^{-2}, 1, 10, 100$, and chosen with respect to validation error. The chosen parameters are also reported in Table 4. Finally, we evaluated the MR solution and the ambient component of CoMR on an unseen test set. In Tables 2 and 3 we report the mean and standard deviation of error rates on test and unlabeled examples with respect to 10 random splits. We performed a paired t-test at the 5% significance level to assess the statistical significance of the results. Results shown in bold are statistically significant.

Our experimental protocol makes MR and CoMR exactly comparable. We find that CoMR gives major empirical improvements over MR on all datasets except G50C, where both methods approach the Bayes error rate.

Table 2. Error rates (in percentage) on test data.

Dataset | MR          | CoMR
LINES   | 7.7 (1.2)   | 1.0 (1.5)
G50C    | 5.8 (2.8)   | 5.5 (2.3)
USPST   | 18.2 (1.5)  | 14.1 (1.6)
COIL20  | 23.8 (11.1) | 14.8 (8.8)
PCMAC   | 11.9 (3.4)  | 8.9 (2.6)

Table 3. Error rates (in percentage) on unlabeled data.

Dataset | MR         | CoMR
LINES   | 7.5 (1.0)  | 1.3 (2.0)
G50C    | 6.6 (0.8)  | 6.9 (1.0)
USPST   | 18.6 (1.4) | 13.3 (1.0)
COIL20  | 37.5 (6.0) | 14.8 (3.3)
PCMAC   | 11.0 (2.4) | 9.4 (1.9)

Table 4. Parameters used. Note mu = 1 for CoMR. A linear kernel was used for the LINES dataset.

Dataset | nn | sigma | p | MR (gamma1, gamma2) | CoMR (gamma1, gamma2)
LINES   | 10 | -     | 1 | 0.01, 1e-6          | 1e-4, 100
G50C    | 50 | 17.5  | 5 | 1, 100              | 10, 10
USPST   | 10 | 9.4   | 2 | 0.01, 0.01          | 1e-6, 1e-4
COIL20  | 2  | 0.6   | 1 | 1e-4, 1e-6          | 1e-6, 1e-6
PCMAC   | 50 | 2.7   | 5 | 10, 100             | 1, 10

5. Conclusion

In this paper, we have constructed a single, new RKHS in which standard supervised algorithms are immediately turned into multi-view semi-supervised learners. This construction brings about significant theoretical simplifications and algorithmic benefits, which we have demonstrated in the context of generalization analysis and manifold regularization respectively.
References

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2002). Localized Rademacher complexities. COLT 2002 (pp. 44-58). Springer-Verlag.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7, 2399-2434.

Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers.

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. COLT.

Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: P&S, 9, 323-375.

Brefeld, U., Gärtner, T., Scheffer, T., & Wrobel, S. (2006). Efficient co-regularised least squares regression. ICML (pp. 137-144).

Farquhar, J. D. R., Hardoon, D. R., Meng, H., Shawe-Taylor, J., & Szedmák, S. (2005). Two view learning: SVM-2K, theory and practice. NIPS.

Krishnapuram, B., Williams, D., Xue, Y., Hartemink, A., & Carin, L. (2005). On semi-supervised classification. NIPS.

Rosenberg, D., & Bartlett, P. L. (2007). The Rademacher complexity of co-regularized kernel classes. AISTATS.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005a). Beyond the point cloud: From transductive to semi-supervised learning. ICML.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005b). A co-regularization approach to semi-supervised learning with multiple views. ICML Workshop on Learning with Multiple Views (pp. 824-831).

Yu, S., Krishnapuram, B., Rosales, R., Steck, H., & Rao, R. (2008). Bayesian co-training. NIPS.

A. Proof of Theorem 2.2

This theorem generalizes Theorem 5 in (Berlinet & Thomas-Agnan, 2004).

The Product Hilbert Space. We begin by introducing the product space,

$$\mathcal{F} = \mathcal{H}^1 \times \mathcal{H}^2 = \left\{(f^1, f^2) : f^1 \in \mathcal{H}^1,\; f^2 \in \mathcal{H}^2\right\},$$

and an inner product on $\mathcal{F}$ defined by,

$$\left\langle (f^1, f^2), (g^1, g^2) \right\rangle_{\mathcal{F}} = \gamma_1 \langle f^1, g^1 \rangle_{\mathcal{H}^1} + \gamma_2 \langle f^2, g^2 \rangle_{\mathcal{H}^2} + \mu \sum_{i \in U} \left[f^1(x_i) - f^2(x_i)\right]\left[g^1(x_i) - g^2(x_i)\right] \quad (15)$$

It is straightforward to check that $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ is a valid inner product. Moreover, we have the following:

Lemma A.1. $\mathcal{F}$ is a Hilbert space.

Proof. (Sketch.) We need to show that $\mathcal{F}$ is complete. Let $(f^1_n, f^2_n)$ be a Cauchy sequence in $\mathcal{F}$. Then $f^1_n$ is Cauchy in $\mathcal{H}^1$ and $f^2_n$ is Cauchy in $\mathcal{H}^2$. By the completeness of $\mathcal{H}^1$ and $\mathcal{H}^2$, we have $f^1_n \to f^1$ in $\mathcal{H}^1$ and $f^2_n \to f^2$ in $\mathcal{H}^2$, for some $(f^1, f^2) \in \mathcal{F}$. Since $\mathcal{H}^1$ and $\mathcal{H}^2$ are RKHSs, convergence in norm implies pointwise convergence, and thus the co-regularization part of the distance also goes to zero.

$\tilde{\mathcal{H}}$ is a Hilbert Space. Recall the definition of $\tilde{\mathcal{H}}$ in Eqn. 4. Define the map $u : \mathcal{F} \to \tilde{\mathcal{H}}$ by $u(f^1, f^2) = f^1 + f^2$. The map's kernel $N := u^{-1}(0)$ is a closed subspace of $\mathcal{F}$, and thus its orthogonal complement $N^\perp$ is also a closed subspace. We can consider $N^\perp$ as a Hilbert space with the inner product that is the natural restriction of $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ to $N^\perp$. Define $v : N^\perp \to \tilde{\mathcal{H}}$ as the restriction of $u$ to $N^\perp$. Then $v$ is a bijection, and we define an inner product on $\tilde{\mathcal{H}}$ by

$$\langle f, g \rangle_{\tilde{\mathcal{H}}} = \left\langle v^{-1}(f), v^{-1}(g) \right\rangle_{\mathcal{F}}. \quad (16)$$

We conclude that $\tilde{\mathcal{H}}$ is a Hilbert space isomorphic to $N^\perp$.

The Co-Regularization Norm. Fix any $f \in \tilde{\mathcal{H}}$, and note that $u^{-1}(f) = \{v^{-1}(f) + n \mid n \in N\}$. Since $v^{-1}(f)$ and $N$ are orthogonal, it is clear by the Pythagorean theorem that $v^{-1}(f)$ is the element of $u^{-1}(f)$ with minimum norm.
Thus

$$\|f\|^2_{\tilde{\mathcal{H}}} = \|v^{-1}(f)\|^2_{\mathcal{F}} = \min_{(f^1, f^2) \in u^{-1}(f)} \|(f^1, f^2)\|^2_{\mathcal{F}}$$

We see that the inner product on $\tilde{\mathcal{H}}$ induces the norm claimed in the theorem statement.

We next check the two conditions for validity of an RKHS (see Definition 2.1).

(a) $\tilde{k}(z, \cdot) \in \tilde{\mathcal{H}}$ for all $z \in \mathcal{X}$. Recall from Eqn. 7 that the co-regularization kernel is defined as

$$\tilde{k}(x, z) = \gamma_1^{-1} k^1(x, z) + \gamma_2^{-1} k^2(x, z) - \mu \left(\gamma_1^{-1} k^1_{Ux} - \gamma_2^{-1} k^2_{Ux}\right)^T \beta_z$$

where $\beta_z = H d_z = (I + \mu S)^{-1}\left(\gamma_1^{-1} k^1_{Uz} - \gamma_2^{-1} k^2_{Uz}\right)$. Define $h^1(x) = \gamma_1^{-1} k^1(x, z) - \mu \gamma_1^{-1} k^1(x, U)\beta_z$ and $h^2(x) = \gamma_2^{-1} k^2(x, z) + \mu \gamma_2^{-1} k^2(x, U)\beta_z$. Note that $h^1 \in \mathrm{span}\left\{k^1(z, \cdot), k^1(x_1, \cdot), \ldots, k^1(x_u, \cdot)\right\} \subset \mathcal{H}^1$, and similarly, $h^2 \in \mathcal{H}^2$. It is clear that $\tilde{k}(z, \cdot) = h^1(\cdot) + h^2(\cdot)$, and thus $\tilde{k}(z, \cdot) \in \tilde{\mathcal{H}}$.

(b) Reproducing Property. For convenience, we collect some basic properties of $h^1$ and $h^2$ in the following lemma:

Lemma A.2 (Properties of $h^1$ and $h^2$). Writing $h^1(U)$ for the column vector with entries $h^1(x_i)$, $\forall i \in U$, and similarly for other functions on $\mathcal{X}$, we have the following:

$$\langle f^1, h^1 \rangle_{\mathcal{H}^1} = \gamma_1^{-1} f^1(z) - \mu \gamma_1^{-1} f^1(U)^T \beta_z \quad (17)$$
$$\langle f^2, h^2 \rangle_{\mathcal{H}^2} = \gamma_2^{-1} f^2(z) + \mu \gamma_2^{-1} f^2(U)^T \beta_z \quad (18)$$
$$h^1(U) - h^2(U) = \beta_z \quad (19)$$

Proof. The first two equations follow from the definitions of $h^1$ and $h^2$ and the reproducing kernel property. The last equation is derived as follows:

$$h^1(U) - h^2(U) = \gamma_1^{-1} k^1(U, z) - \mu \gamma_1^{-1} k^1(U, U)\beta_z - \gamma_2^{-1} k^2(U, z) - \mu \gamma_2^{-1} k^2(U, U)\beta_z$$
$$= d_z - \mu S (I + \mu S)^{-1} d_z = \left(I - \mu S (I + \mu S)^{-1}\right) d_z = (I + \mu S)^{-1} d_z = \beta_z$$

where the last line follows from the Sherman-Morrison-Woodbury inversion formula.

Since $\tilde{k}(z, \cdot) = h^1(\cdot) + h^2(\cdot)$, it is clear that $(h^1, h^2) = v^{-1}\left(\tilde{k}(z, \cdot)\right) + n$, for some $n \in N$. Since $v^{-1}(f) \in N^\perp$, we have

$$\left\langle f, \tilde{k}(z, \cdot) \right\rangle_{\tilde{\mathcal{H}}} = \left\langle v^{-1}(f), v^{-1}\!\left(\tilde{k}(z, \cdot)\right) \right\rangle_{\mathcal{F}} = \left\langle v^{-1}(f), (h^1, h^2) - n \right\rangle_{\mathcal{F}} = \left\langle v^{-1}(f), (h^1, h^2) \right\rangle_{\mathcal{F}}$$
$$= \gamma_1 \langle f^1, h^1 \rangle_{\mathcal{H}^1} + \gamma_2 \langle f^2, h^2 \rangle_{\mathcal{H}^2} + \mu \left(h^1(U) - h^2(U)\right)^T \left(f^1(U) - f^2(U)\right)$$
$$= f^1(z) + f^2(z) - \mu\left(f^1(U) - f^2(U)\right)^T \beta_z + \mu\left(f^1(U) - f^2(U)\right)^T \beta_z \quad \text{(from Lemma A.2)}$$
$$= f(z)$$
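As an informal complement to the proof, an implementation of Eqn. 7 can be sanity-checked numerically: on any finite sample the co-regularization Gram matrix should come out symmetric and positive semi-definite, as Theorem 2.2 guarantees. The helper below is our own illustration, reusing the hypothetical `coreg_kernel` sketch from Section 2.1.

```python
import numpy as np

def check_coreg_kernel_psd(k_tilde, X, tol=1e-8):
    """Numerically verify that the Gram matrix of k_tilde on the sample X is
    symmetric positive semi-definite, consistent with Theorem 2.2."""
    K = k_tilde(X, X)
    symmetric = np.abs(K - K.T).max() < tol
    psd = np.linalg.eigvalsh((K + K.T) / 2).min() > -tol
    return symmetric and psd
```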