An RKHS for Multi-View Learning and Manifold Co-Regularization

Vikas Sindhwani  [email protected]
Mathematical Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 USA

David S. Rosenberg  [email protected]
Department of Statistics, University of California, Berkeley, CA 94720 USA

Abstract

Inspired by co-training, many multi-view semi-supervised kernel methods implement the following idea: find a function in each of multiple Reproducing Kernel Hilbert Spaces (RKHSs) such that (a) the chosen functions make similar predictions on unlabeled examples, and (b) the average prediction given by the chosen functions performs well on labeled examples. In this paper, we construct a single RKHS with a data-dependent "co-regularization" norm that reduces these approaches to standard supervised learning. The reproducing kernel for this RKHS can be explicitly derived and plugged into any kernel method, greatly extending the theoretical and algorithmic scope of co-regularization. In particular, with this development, the Rademacher complexity bound for co-regularization given in (Rosenberg & Bartlett, 2007) follows easily from well-known results. Furthermore, more refined bounds given by localized Rademacher complexity can also be easily applied. We propose a co-regularization based algorithmic alternative to manifold regularization (Belkin et al., 2006; Sindhwani et al., 2005a) that leads to major empirical improvements on semi-supervised tasks. Unlike the recently proposed transductive approach of (Yu et al., 2008), our RKHS formulation is truly semi-supervised and naturally extends to unseen test data.

1. Introduction

In semi-supervised learning, we are given a few labeled examples together with a large collection of unlabeled data from which to estimate an unknown target function. Suppose we have two hypothesis spaces, $\mathcal{H}^1$ and $\mathcal{H}^2$, each of which contains a predictor that well-approximates the target function. We know that predictors that agree with the target function also agree with each other on unlabeled examples. Thus, any predictor in one hypothesis space that does not have an "agreeing predictor" in the other can be safely eliminated from consideration. Due to the resulting reduction in the complexity of the joint learning problem, one can expect improved generalization performance. These conceptual intuitions and their algorithmic instantiations together constitute a major line of work in semi-supervised learning. One of the earliest approaches in this area was "co-training" (Blum & Mitchell, 1998), in which $\mathcal{H}^1$ and $\mathcal{H}^2$ are defined over different representations, or "views", of the data, and trained alternately to maximize mutual agreement on unlabeled examples. More recently, several papers have formulated these intuitions as joint complexity regularization, or co-regularization, between $\mathcal{H}^1$ and $\mathcal{H}^2$, which are taken to be Reproducing Kernel Hilbert Spaces (RKHSs) of functions defined on the input space $\mathcal{X}$. Given a few labeled examples $\{(x_i, y_i)\}_{i \in L}$ and a collection of unlabeled data $\{x_i\}_{i \in U}$, co-regularization learns a prediction function,

$$f_\star(x) = \frac{1}{2}\left(f^1_\star(x) + f^2_\star(x)\right) \quad (1)$$

where $f^1_\star \in \mathcal{H}^1$ and $f^2_\star \in \mathcal{H}^2$ are obtained by solving the following optimization problem,

$$(f^1_\star, f^2_\star) = \mathop{\mathrm{argmin}}_{f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2} \; \gamma_1 \|f^1\|^2_{\mathcal{H}^1} + \gamma_2 \|f^2\|^2_{\mathcal{H}^2} + \mu \sum_{i \in U} \left[f^1(x_i) - f^2(x_i)\right]^2 + \sum_{i \in L} V(y_i, f(x_i)) \quad (2)$$

In this objective function, the first two terms measure complexity by the RKHS norms $\|\cdot\|^2_{\mathcal{H}^1}$ and $\|\cdot\|^2_{\mathcal{H}^2}$ in $\mathcal{H}^1$ and $\mathcal{H}^2$ respectively, the third term enforces agreement among predictors on unlabeled examples, and the final term evaluates the empirical loss of the mean function $f = (f^1 + f^2)/2$ on the labeled data with respect to a loss function $V(\cdot, \cdot)$. The real-valued parameters $\gamma_1$, $\gamma_2$, and $\mu$ allow different tradeoffs between the regularization terms. $L$ and $U$ are index sets over labeled and unlabeled examples respectively.

Several variants of this formulation have been proposed independently and explored in different contexts: linear logistic regression (Krishnapuram et al., 2005), regularized classification (Sindhwani et al., 2005b), regression (Brefeld et al., 2006), support vector classification (Farquhar et al., 2005), Bayesian co-training (Yu et al., 2008), and generalization theory (Rosenberg & Bartlett, 2007).

The main theoretical contribution of this paper is the construction of a new "co-regularization RKHS," in which standard supervised learning recovers the solution to the co-regularization problem of Eqn. 2. Theorem 2.2 presents the RKHS and gives an explicit formula for its reproducing kernel. This "co-regularization kernel" can be plugged into any standard kernel method, giving convenient and immediate access to two-view semi-supervised techniques for a wide variety of learning problems. Utilizing this kernel, in Section 3 we give much simpler proofs of the results of (Rosenberg & Bartlett, 2007) concerning bounds on the Rademacher complexity and generalization performance of co-regularization. As a more algorithmic application, in Section 4 we consider the semi-supervised learning setting where examples live near a low-dimensional manifold embedded in a high dimensional ambient euclidean space. Our approach, manifold co-regularization (CoMR), gives major empirical improvements over the manifold regularization (MR) framework of (Belkin et al., 2006; Sindhwani et al., 2005a).

The recent work of (Yu et al., 2008) considers a similar reduction.
However, this reduction is strictly transductive and does not allow prediction on unseen test examples. By contrast, our formulation is truly semi-supervised and provides a principled out-of-sample extension.

2. An RKHS for Co-Regularization

We start by reformulating the co-regularization optimization problem, given in Eqn. 1 and Eqn. 2, in the following equivalent form where we directly solve for the final prediction function $f_\star$:

$$f_\star = \mathop{\mathrm{argmin}}_{f} \;\; \min_{\substack{f = f^1 + f^2 \\ f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2}} \; \frac{\gamma_1}{2}\|f^1\|^2_{\mathcal{H}^1} + \frac{\gamma_2}{2}\|f^2\|^2_{\mathcal{H}^2} + \frac{\mu}{2}\sum_{i \in U}\left[f^1(x_i) - f^2(x_i)\right]^2 + \sum_{i \in L} V\!\left(y_i, \tfrac{1}{2}f(x_i)\right) \quad (3)$$

Consider the sum space of functions, $\tilde{\mathcal{H}}$, given by,

$$\tilde{\mathcal{H}} = \mathcal{H}^1 \oplus \mathcal{H}^2 = \left\{ f \mid f(x) = f^1(x) + f^2(x),\; f^1 \in \mathcal{H}^1,\; f^2 \in \mathcal{H}^2 \right\} \quad (4)$$

and impose on it a data-dependent norm,

$$\|f\|^2_{\tilde{\mathcal{H}}} = \min_{\substack{f = f^1 + f^2 \\ f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2}} \; \gamma_1\|f^1\|^2_{\mathcal{H}^1} + \gamma_2\|f^2\|^2_{\mathcal{H}^2} + \mu \sum_{i \in U}\left[f^1(x_i) - f^2(x_i)\right]^2 \quad (5)$$

The minimization problem in Eqn. 3 can then be posed as standard supervised learning in $\tilde{\mathcal{H}}$ as follows,

$$f_\star = \mathop{\mathrm{argmin}}_{f \in \tilde{\mathcal{H}}} \; \gamma \|f\|^2_{\tilde{\mathcal{H}}} + \sum_{i \in L} V\!\left(y_i, \tfrac{1}{2}f(x_i)\right) \quad (6)$$

where $\gamma = \frac{1}{2}$. Of course, this reformulation is not really useful unless $\tilde{\mathcal{H}}$ itself is a valid new RKHS. Let us recall the definition of an RKHS.

Definition 2.1 (RKHS). A reproducing kernel Hilbert space (RKHS) is a Hilbert space $\mathcal{F}$ that possesses a reproducing kernel, i.e., a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for which the following hold: (a) $k(x, \cdot) \in \mathcal{F}$ for all $x \in \mathcal{X}$, and (b) $\langle f, k(x, \cdot)\rangle_{\mathcal{F}} = f(x)$ for all $x \in \mathcal{X}$ and $f \in \mathcal{F}$, where $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ denotes the inner product in $\mathcal{F}$.

In Theorem 2.2, we show that $\tilde{\mathcal{H}}$ is indeed an RKHS, and moreover we give an explicit expression for its reproducing kernel. Thus, it follows that although the domain of optimization in Eqn. 6 is nominally a function space, by the Representer Theorem we can express it as a finite-dimensional optimization problem.

2.1. Co-Regularization Kernels

Let $\mathcal{H}^1$, $\mathcal{H}^2$ be RKHSs with kernels given by $k^1$, $k^2$ respectively, and let $\tilde{\mathcal{H}} = \mathcal{H}^1 \oplus \mathcal{H}^2$ as defined in Eqn. 4. We have the following result.

Theorem 2.2. There exists an inner product on $\tilde{\mathcal{H}}$ for which $\tilde{\mathcal{H}}$ is an RKHS with norm defined by Eqn. 5 and reproducing kernel $\tilde{k} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by,

$$\tilde{k}(x, z) = s(x, z) - \mu\, d_x^T H d_z \quad (7)$$

where $s(x, z)$ is the (scaled) sum of kernels given by,

$$s(x, z) = \gamma_1^{-1} k^1(x, z) + \gamma_2^{-1} k^2(x, z),$$

and $d_x$ is a vector-valued function that depends on the difference in views, measured as,

$$d_x = \gamma_1^{-1} k^1_{Ux} - \gamma_2^{-1} k^2_{Ux},$$

where $k^i_{Ux} = \left[k^i(x, x_j)\right]^T_{j \in U}$, and $H$ is a positive-definite matrix given by $H = (I + \mu S)^{-1}$. Here, $S$ is the gram matrix of $s(\cdot, \cdot)$, i.e., $S = \gamma_1^{-1} K^1_{UU} + \gamma_2^{-1} K^2_{UU}$, where $K^i_{UU} = k^i(U, U)$ denotes the Gram matrix of $k^i$ over unlabeled examples.

We give a rigorous proof in Appendix A.
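As a concrete illustration of Eqn. 7, the sketch below assembles the co-regularization kernel from two base kernels with numpy. The function name, the callable-kernel interface `k(X, Z)`, and the dense linear algebra are our own illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def coreg_kernel(k1, k2, Xu, gamma1, gamma2, mu):
    """Build the co-regularization kernel of Eqn. 7 from base kernels k1, k2.

    k1, k2 : callables returning the Gram matrix k(X, Z) between two point sets
    Xu     : array of unlabeled points indexed by U
    """
    # S = gamma1^-1 K1_UU + gamma2^-1 K2_UU is the Gram matrix of s(.,.) on U.
    S = k1(Xu, Xu) / gamma1 + k2(Xu, Xu) / gamma2
    H = np.linalg.inv(np.eye(len(Xu)) + mu * S)        # H = (I + mu*S)^-1

    def k_tilde(X, Z):
        s = k1(X, Z) / gamma1 + k2(X, Z) / gamma2      # s(x, z)
        Dx = k1(Xu, X) / gamma1 - k2(Xu, X) / gamma2   # columns are d_x for x in X
        Dz = k1(Xu, Z) / gamma1 - k2(Xu, Z) / gamma2   # columns are d_z for z in Z
        return s - mu * Dx.T @ H @ Dz                  # Eqn. 7
    return k_tilde
```

Any standard kernel method can then be run with `k_tilde` in place of its usual kernel; Section 2.2 spells out how the two component predictors are recovered.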

2.2. Representer Theorem

Theorem 2.2 says that $\tilde{\mathcal{H}}$ is a valid RKHS with kernel $\tilde{k}$. By the Representer Theorem, the solution to Eqn. 6 is given by

$$f_\star(x) = \sum_{i \in L} \alpha_i \tilde{k}(x_i, x) \quad (8)$$

The corresponding components in $\mathcal{H}^1$, $\mathcal{H}^2$ can also be retrieved as,

$$f^1_\star(x) = \sum_{i \in L} \alpha_i \gamma_1^{-1}\left(k^1(x_i, x) - \mu\, d_{x_i}^T H k^1_{Ux}\right) \quad (9)$$

$$f^2_\star(x) = \sum_{i \in L} \alpha_i \gamma_2^{-1}\left(k^2(x_i, x) + \mu\, d_{x_i}^T H k^2_{Ux}\right) \quad (10)$$

Note that $\mathcal{H}^1$ and $\mathcal{H}^2$ are defined on the same domain $\mathcal{X}$ so that taking the mean prediction is meaningful. In a two-view problem one may begin by defining $\mathcal{H}^1$, $\mathcal{H}^2$ on different view spaces $\mathcal{X}^1$, $\mathcal{X}^2$ respectively. Such a problem can be mapped to our framework by extending $\mathcal{H}^1$, $\mathcal{H}^2$ to $\mathcal{X} = \mathcal{X}^1 \times \mathcal{X}^2$ by re-defining $f^1(x^1, x^2) = f^1(x^1)$, $f^1 \in \mathcal{H}^1$; similarly for $\mathcal{H}^2$. While we omit these technical details, it is important to note that in such cases, Eqns. 9 and 10 can be reinterpreted as predictors defined on $\mathcal{X}^1$, $\mathcal{X}^2$ respectively.
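To make the reduction concrete, the sketch below trains a regularized least squares model with the co-regularization kernel (squared loss is also the choice used in the experiments of Section 4.3) and exposes the component predictors of Eqns. 9 and 10. It reuses the hypothetical `coreg_kernel` helper from Section 2.1; the plain RLS solver and the omission of the 1/2 prediction scaling of Eqn. 1 are simplifications for illustration.

```python
import numpy as np

def fit_coreg_rls(k1, k2, Xl, y, Xu, gamma1, gamma2, mu, gamma):
    """Supervised learning in H-tilde (Eqn. 6) with squared loss, plus the
    component predictors of Eqns. 9 and 10."""
    k_tilde = coreg_kernel(k1, k2, Xu, gamma1, gamma2, mu)
    K = k_tilde(Xl, Xl)                                       # labeled Gram matrix
    alpha = np.linalg.solve(K + gamma * np.eye(len(Xl)), y)   # RLS expansion coefficients

    # Quantities shared by Eqns. 9 and 10.
    S = k1(Xu, Xu) / gamma1 + k2(Xu, Xu) / gamma2
    H = np.linalg.inv(np.eye(len(Xu)) + mu * S)
    D_UL = k1(Xu, Xl) / gamma1 - k2(Xu, Xl) / gamma2          # columns are d_{x_i}, i in L

    def f(X):    # combined predictor, Eqn. 8
        return k_tilde(X, Xl) @ alpha
    def f1(X):   # view-1 component, Eqn. 9
        return (k1(X, Xl) - mu * k1(X, Xu) @ H @ D_UL) @ alpha / gamma1
    def f2(X):   # view-2 component, Eqn. 10
        return (k2(X, Xl) + mu * k2(X, Xu) @ H @ D_UL) @ alpha / gamma2
    return f, f1, f2
```

On any test set, `f(X)` agrees with `f1(X) + f2(X)` up to numerical error, which is a useful sanity check on an implementation.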

3. Bounds on Complexity and Generalization

By eliminating all predictors that do not collectively agree on unlabeled examples, co-regularization intuitively reduces the complexity of the learning problem. It is reasonable then to expect better test performance for the same amount of labeled training data. In (Rosenberg & Bartlett, 2007), the size of the co-regularized function class is measured by its empirical Rademacher complexity, and tight upper and lower bounds are given on the Rademacher complexity of the co-regularized hypothesis space. This leads to generalization bounds in terms of the Rademacher complexity. In this section, we derive these complexity bounds in a few lines using Theorem 2.2 and a well-known result on RKHS balls. Furthermore, we present improved generalization bounds based on the theory of localized Rademacher complexity.

3.1. Rademacher Complexity Bounds

Definition 3.1. The empirical Rademacher complexity of a function class $\mathcal{A} = \{f : \mathcal{X} \to \mathbb{R}\}$ on a sample $x_1, \ldots, x_\ell \in \mathcal{X}$ is defined as

$$\hat{R}_\ell(\mathcal{A}) = \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{A}} \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right],$$

where the expectation is with respect to $\sigma = \{\sigma_1, \ldots, \sigma_\ell\}$, and the $\sigma_i$ are i.i.d. Rademacher random variables, that is, $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$.

Let $\mathcal{H}$ be an arbitrary RKHS with kernel $k(\cdot,\cdot)$, and denote the standard RKHS supervised learning objective function by $Q(f) = \sum_{i \in L} V(y_i, f(x_i)) + \lambda \|f\|^2_{\mathcal{H}}$. Let $f_\star = \mathrm{argmin}_{f \in \mathcal{H}} Q(f)$. Then $Q(f_\star) \le Q(0) = \sum_{i \in L} V(y_i, 0)$. It follows that $\|f_\star\|^2_{\mathcal{H}} \le Q(0)/\lambda$. Thus if we have some a priori control on $Q(0)$, then we can restrict the search for $f_\star$ to a ball in $\mathcal{H}$ of radius $r = \sqrt{Q(0)/\lambda}$.

We now cite a well-known result about the Rademacher complexity of a ball in an RKHS (see e.g. (Boucheron et al., 2005)). Let $\mathcal{H}_r := \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le r\}$ denote the ball of radius $r$ in $\mathcal{H}$. Then we have the following:

Lemma 3.2. The empirical Rademacher complexity on the sample $x_1, \ldots, x_\ell \in \mathcal{X}$ for the RKHS ball $\mathcal{H}_r$ is bounded as follows:

$$\frac{1}{4}\,\frac{2r}{\sqrt{2}\,\ell}\,\sqrt{\mathrm{tr}\,K} \;\le\; \hat{R}_\ell(\mathcal{H}_r) \;\le\; \frac{2r}{\ell}\,\sqrt{\mathrm{tr}\,K}$$

where $K = \left[k(x_i, x_j)\right]_{i,j=1}^{\ell}$ is the kernel matrix.

For the co-regularization problem described in Eqns. 3 and 6, we have $f_\star \in \tilde{\mathcal{H}}_r$ where $r^2 = 2\ell \sup_y V(0, y)$, where $\ell$ is the number of labeled examples. We now state and prove bounds on the Rademacher complexity of $\tilde{\mathcal{H}}_r$. The bounds here are exactly the same as those given in (Rosenberg & Bartlett, 2007). However, while they take a lengthy "bare-hands" approach, here we get the result as a simple corollary of Theorem 2.2 and Lemma 3.2.

Theorem 3.3. The empirical Rademacher complexity on the labeled sample $x_1, \ldots, x_\ell \in \mathcal{X}$ for the RKHS ball $\tilde{\mathcal{H}}_r$ is bounded as follows:

$$\frac{1}{4}\,\frac{2r}{\sqrt{2}\,\ell}\,\sqrt{\mathrm{tr}\,\tilde{K}} \;\le\; \hat{R}_\ell(\tilde{\mathcal{H}}_r) \;\le\; \frac{2r}{\ell}\,\sqrt{\mathrm{tr}\,\tilde{K}},$$

where $\tilde{K} = \gamma_1^{-1} K^1_{LL} + \gamma_2^{-1} K^2_{LL} - \mu D_{UL}^T (I + \mu S)^{-1} D_{UL}$ and $D_{UL} = \gamma_1^{-1} K^1_{UL} - \gamma_2^{-1} K^2_{UL}$.

Proof. Note that $\tilde{K}$ is just the kernel matrix for the co-regularization kernel $\tilde{k}(\cdot,\cdot)$ on the labeled data. The bound then follows immediately from Lemma 3.2.

3.2. Co-Regularization Reduces Complexity

The co-regularization parameter $\mu$ controls the extent to which we enforce agreement between $f^1$ and $f^2$. Let $\tilde{\mathcal{H}}(\mu)$ denote the co-regularization RKHS for a particular value of $\mu$. From Theorem 3.3, we see that the Rademacher complexity for a ball of radius $r$ in $\tilde{\mathcal{H}}(\mu)$ decreases with $\mu$ by an amount determined by

$$\Delta(\mu) = \mathrm{tr}\left[\mu\, D_{UL}^T (I + \mu S)^{-1} D_{UL}\right] \quad (11)$$

$$\phantom{\Delta(\mu)} = \sum_{i=1}^{\ell} \rho^2\!\left(\gamma_1^{-1} k^1_{Ux_i},\; \gamma_2^{-1} k^2_{Ux_i}\right) \quad (12)$$

where $\rho(\cdot,\cdot)$ is a metric on $\mathbb{R}^{|U|}$ defined by,

$$\rho^2(s, t) = \mu (s - t)^T (I + \mu S)^{-1} (s - t)$$

We see that the complexity reduction, $\Delta(\mu)$, grows with the $\rho$-distance between the two different representations of the labeled points. Note that the metric $\rho$ is determined by $S$, the weighted sum of gram matrices of the two kernels on unlabeled data. A small numerical sketch of these quantities appears at the end of this section.

3.3. Generalization Bounds

With Theorem 2.2 allowing us to express multi-view co-regularization problems as supervised learning in a data-dependent RKHS, we can now bring a large body of theory to bear on the generalization performance of co-regularization methods. We start by quoting the theorem proved in (Rosenberg & Bartlett, 2007). Next, we state an improved bound based on localized Rademacher complexity. Below, we denote the unit ball in $\tilde{\mathcal{H}}$ by $\tilde{\mathcal{H}}_1$.

Condition 1. The loss $V(\cdot,\cdot)$ is Lipschitz in its first argument, i.e., there exists a constant $A$ such that $\forall y, \hat{y}_1, \hat{y}_2: \; |V(\hat{y}_1, y) - V(\hat{y}_2, y)| \le A\, |\hat{y}_1 - \hat{y}_2|$.

Theorem 3.4. Suppose $V : \mathcal{Y}^2 \to [0, 1]$ satisfies Condition 1. Then conditioned on the unlabeled data, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the sample of labeled points $(x_1, y_1), \ldots, (x_\ell, y_\ell)$ drawn i.i.d. from $P$, we have for any predictor $f \in \tilde{\mathcal{H}}_1$ that

$$P\left[V(f(x), y)\right] \;\le\; \frac{1}{\ell}\sum_{i=1}^{\ell} V(f(x_i), y_i) + 2A\,\hat{R}_\ell(\tilde{\mathcal{H}}_1) + \frac{1}{\sqrt{\ell}}\left(2 + 3\sqrt{\ln(2/\delta)/2}\right)$$

We need two more conditions for the localized bound:

Condition 2. For any probability distribution $P$, there exists $f_\star \in \tilde{\mathcal{H}}_1$ satisfying $P[V(f_\star(x), y)] = \inf_{f \in \tilde{\mathcal{H}}_1} P[V(f(x), y)]$.

Condition 3. There is a constant $B \ge 1$ such that for every probability distribution $P$ and every $f \in \tilde{\mathcal{H}}_1$ we have, $P(f - f_\star)^2 \le B\, P\left(V[f(x), y] - V[f_\star(x), y]\right)$.

In the following theorem, let $P_\ell$ denote the empirical probability measure for the labeled sample of size $\ell$.

Theorem 3.5. [Cor. 6.7 from (Bartlett et al., 2002)] Assume that $\sup_{x \in \mathcal{X}} \tilde{k}(x, x) \le 1$ and that $V$ satisfies the three conditions above. Let $\hat{f}$ be any element of $\tilde{\mathcal{H}}_1$ satisfying $P_\ell V[\hat{f}(x), y] = \inf_{f \in \tilde{\mathcal{H}}_1} P_\ell V[f(x), y]$. There exists a constant $c$ depending only on $A$ and $B$ such that with probability at least $1 - 6e^{-\nu}$,

$$P\left(V[\hat{f}(x), y] - V[f_\star(x), y]\right) \;\le\; c\left(\hat{r}^* + \frac{\nu}{\ell}\right),$$

where $\hat{r}^* \le \min_{0 \le h \le \ell}\left(\frac{h}{\ell} + \sqrt{\frac{1}{\ell}\sum_{i > h} \lambda_i}\right)$ and $\lambda_1, \ldots, \lambda_\ell$ are the eigenvalues of the labeled-data kernel matrix $\tilde{K}_{LL}$ in decreasing order.

Note that while Theorem 3.4 bounds the gap between the expected and empirical performance of an arbitrary $f \in \tilde{\mathcal{H}}_1$, Theorem 3.5 bounds the gap between the empirical loss minimizer over $\tilde{\mathcal{H}}_1$ and the true risk minimizer in $\tilde{\mathcal{H}}_1$. Since the localized bound only needs to account for the capacity of the function class in the neighborhood of $f_\star$, the bounds are potentially tighter. Indeed, while the bound in Theorem 3.4 is in terms of the trace of the kernel matrix, the bound in Theorem 3.5 involves the tail sum of the kernel eigenvalues. If the eigenvalues decay very quickly, the latter is potentially much smaller.
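The quantities appearing in this section are easy to compute once the relevant Gram matrices are available. The sketch below evaluates the complexity reduction $\Delta(\mu)$ of Eqn. 11 and contrasts the trace-based quantity of Theorem 3.3 with the eigenvalue tail-sum term of Theorem 3.5 (radii and constants omitted). It reuses the illustrative kernel interface from Section 2.1 and is not part of the paper's experimental code.

```python
import numpy as np

def complexity_reduction(k1, k2, Xl, Xu, gamma1, gamma2, mu):
    """Delta(mu) of Eqn. 11: the amount by which tr(K_tilde) shrinks as the
    co-regularization parameter mu is turned on."""
    S = k1(Xu, Xu) / gamma1 + k2(Xu, Xu) / gamma2
    D_UL = k1(Xu, Xl) / gamma1 - k2(Xu, Xl) / gamma2
    M = np.linalg.solve(np.eye(len(Xu)) + mu * S, D_UL)   # (I + mu*S)^-1 D_UL
    return mu * np.trace(D_UL.T @ M)

def capacity_summaries(K_tilde_LL):
    """Trace term of the global Rademacher bound (Theorem 3.3) versus the
    eigenvalue tail-sum term of the localized bound (Theorem 3.5)."""
    ell = K_tilde_LL.shape[0]
    trace_term = np.sqrt(np.trace(K_tilde_LL)) / ell
    lam = np.sort(np.linalg.eigvalsh(K_tilde_LL))[::-1]   # eigenvalues, decreasing
    r_hat = min(h / ell + np.sqrt(lam[h:].sum() / ell) for h in range(ell + 1))
    return trace_term, r_hat
```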
4. Manifold Co-Regularization

Consider the picture shown in Figure 1(a), where there are two classes of data points in the plane ($\mathbb{R}^2$) lying on one of two concentric circles. The large, colored points are labeled while the smaller, black points are unlabeled. The picture immediately suggests two notions of distance that are very natural but radically different. For example, the two labeled points are close in the ambient euclidean distance on $\mathbb{R}^2$, but infinitely far apart in terms of the intrinsic geodesic distance measured along the circles.

Suppose for this picture one had access to two kernel functions, $k^1$, $k^2$, that assign high similarity to nearby points according to euclidean and geodesic distance respectively. Because of the difference in ambient and intrinsic representations, by co-regularizing the associated RKHSs one can hope to get good reductions in complexity, as suggested in Section 3.2. In Figure 1, we report the value of the complexity reduction (Eqn. 12) for four point clouds generated at increasing levels of noise off the two concentric circles. When the noise becomes large, the ambient and intrinsic notions of distance converge and the amount of complexity reduction decreases.

[Figure 1. Complexity reduction $\Delta(\mu = 1)$ (Eqn. 12) with respect to noise level $\rho$. The choice of $k^1$, $k^2$ is discussed in the following subsections. Panels: (a) $\Delta = 7957$, $\rho = 0$; (b) $\Delta = 7569$, $\rho = 0.1$; (c) $\Delta = 3720$, $\rho = 0.2$; (d) $\Delta = 3502$, $\rho = 0.5$.]

The setting where data lies on a low-dimensional submanifold $\mathcal{M}$ embedded in a higher dimensional ambient space $\mathcal{X}$, as in the concentric circles case above, has attracted considerable research interest recently, almost orthogonal to multi-view efforts. The main assumption underlying manifold-motivated semi-supervised learning algorithms is the following: two points that are close with respect to geodesic distances on $\mathcal{M}$ should have similar labels. Such an assumption may be enforced by an intrinsic regularizer that emphasizes complexity along the manifold.

Since $\mathcal{M}$ is truly unknown, the intrinsic regularizer is empirically estimated from the point cloud of labeled and unlabeled data. In the graph transduction approach, an $nn$-nearest neighbor graph $\mathcal{G}$ is constructed which serves as an empirical substitute for $\mathcal{M}$. The vertices $\mathcal{V}$ of this graph are the set of labeled and unlabeled examples. Let $\mathcal{H}_I$ denote the space of all functions mapping $\mathcal{V}$ to $\mathbb{R}$, where the subscript implies "intrinsic." Any function $f \in \mathcal{H}_I$ can be identified with the vector $\mathbf{f} = [f(x_i),\; x_i \in \mathcal{V}]^T$. One can impose a norm $\|f\|^2_I = \sum_{ij} W_{ij}\left[f(x_i) - f(x_j)\right]^2$ on $\mathcal{H}_I$ that provides a natural measure of smoothness for $f$ with respect to the graph. Here, $W$ denotes the adjacency matrix of the graph. When $\mathcal{X}$ is a euclidean space, a typical $W$ is given by $W_{ij} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$ if $i$ and $j$ are nearest neighbors and $0$ otherwise. In practice, one may use a problem-dependent similarity matrix to set these edge weights. This norm can be conveniently written as a quadratic form $\mathbf{f}^T M \mathbf{f}$, where $M$ is the graph Laplacian matrix defined as $M = D - W$, and $D$ is a diagonal degree matrix with entries $D_{ii} = \sum_j W_{ij}$.

It turns out that with the norm $\|\cdot\|_I$, $\mathcal{H}_I$ is an RKHS whose reproducing kernel $k_I : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ is given by $k_I(x_i, x_j) = M^\dagger_{ij}$, where $M^\dagger$ denotes the pseudo-inverse of the Laplacian. Given $\mathcal{H}_I$ with its reproducing kernel, graph transduction solves the standard RKHS regularization problem, $f_\star = \mathrm{argmin}_{f \in \mathcal{H}_I}\; \gamma \|f\|^2_I + \sum_{i \in L} V(y_i, f(x_i))$, where $y_i$ is the label associated with the node $x_i$. Note that the solution $f_\star$ is only defined over $\mathcal{V}$, the set of labeled and unlabeled examples. Since graph transduction does not provide a function whose domain is the ambient space $\mathcal{X}$, it is not clear how to make predictions on unseen test points $x \in \mathcal{X}$. Possessing a principled "out-of-sample extension" distinguishes semi-supervised methods from transductive procedures.
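The intrinsic Gram matrix used above is straightforward to construct from a point cloud. The following sketch builds the Gaussian-weighted nearest-neighbor graph, the Laplacian $M = D - W$, and its pseudo-inverse. It is a dense, minimal illustration (a practical implementation would exploit sparsity), and the helper name is our own.

```python
import numpy as np

def intrinsic_kernel(X, nn=10, sigma=1.0):
    """Gram matrix of the intrinsic kernel k_I = M^dagger on the point cloud X."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:nn + 1]                # nn nearest neighbors (skip self)
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                                # symmetrize the adjacency matrix
    M = np.diag(W.sum(axis=1)) - W                        # graph Laplacian, M = D - W
    return np.linalg.pinv(M)                              # k_I(x_i, x_j) = M^dagger_ij
```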

4.1. Ambient and Intrinsic Co-Regularization

We propose a co-regularization solution for out-of-sample prediction. Conceptually, one may interpret the manifold setting as a multi-view problem where each labeled or unlabeled example appears in two "views": (a) an ambient view in $\mathcal{X}$ in terms of euclidean co-ordinates $x$, and (b) an intrinsic view in $\mathcal{G}$ as a vertex index $i$. Let $\mathcal{H}_A$ be an RKHS defined over the ambient domain $\mathcal{X}$ with an associated kernel $k_A : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We can now combine ambient and intrinsic views by co-regularizing $\mathcal{H}_A$, $\mathcal{H}_I$. This can be done by setting $k^1 = k_A$ and $k^2 = k_I$ in Eqn. 7 and solving Eqn. 6. The combined prediction function $f_\star$ given by Eqn. 8 is the mean of an ambient component $f^1_\star$ given by Eqn. 9 and an intrinsic component $f^2_\star$ given by Eqn. 10. Even though $f_\star$ is transductive and only defined on labeled and unlabeled examples, the ambient component $f^1_\star$ can be used for out-of-sample prediction. Due to co-regularization, this ambient component is (a) smooth in $\mathcal{H}_A$ and (b) in agreement with a smooth function on the data manifold. We call this approach manifold co-regularization, and abbreviate it as CoMR.
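Putting the pieces together, the sketch below is one way to realize CoMR end-to-end: a Gaussian ambient kernel, the graph-based intrinsic Gram matrix from the previous sketch, the co-regularization kernel restricted to the labeled block (which is exactly $\tilde{K}$ of Theorem 3.3), RLS training, and out-of-sample prediction through the ambient component of Eqn. 9. The function names, the shared $\sigma$ for the graph weights and the ambient kernel, and the plain RLS solver are illustrative simplifications, not the paper's exact experimental protocol.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def comr_fit(Xl, y, Xu, sigma, gamma1, gamma2, mu, gamma, nn=10):
    """Manifold co-regularization sketch: k1 = ambient Gaussian kernel,
    k2 = intrinsic graph kernel on V = L u U; returns an out-of-sample predictor."""
    X = np.vstack([Xl, Xu])
    l, u = len(Xl), len(Xu)
    K_A = gaussian_kernel(X, X, sigma)                 # ambient Gram matrix on L u U
    K_I = intrinsic_kernel(X, nn=nn, sigma=sigma)      # intrinsic Gram matrix on L u U

    # Blocks of the co-regularization kernel (Eqn. 7) with U as the unlabeled set:
    # indices 0..l-1 label L, indices l..l+u-1 label U.
    S = K_A[l:, l:] / gamma1 + K_I[l:, l:] / gamma2
    H = np.linalg.inv(np.eye(u) + mu * S)
    D_UL = K_A[l:, :l] / gamma1 - K_I[l:, :l] / gamma2
    K_LL = (K_A[:l, :l] / gamma1 + K_I[:l, :l] / gamma2
            - mu * D_UL.T @ H @ D_UL)                  # K_tilde on labeled points

    alpha = np.linalg.solve(K_LL + gamma * np.eye(l), y)   # RLS in H-tilde (Eqn. 6)

    def predict(Xtest):
        # Ambient component of Eqn. 9: the principled out-of-sample extension.
        k_tl = gaussian_kernel(Xtest, Xl, sigma)
        k_tu = gaussian_kernel(Xtest, Xu, sigma)
        return (k_tl - mu * k_tu @ H @ D_UL) @ alpha / gamma1
    return predict
```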
4.2. Manifold Regularization

In the manifold regularization (MR) approach of (Belkin et al., 2006; Sindhwani et al., 2005a), the following optimization problem is solved:

$$f^\star = \mathop{\mathrm{argmin}}_{f \in \mathcal{H}_A}\; \gamma_1\|f\|^2_{\mathcal{H}_A} + \gamma_2\, \mathbf{f}^T M \mathbf{f} + \sum_{i \in L} V(y_i, f(x_i)) \quad (13)$$

where $\mathbf{f} = [f(x_i),\; i \in L \cup U]^T$. The solution $f^\star$ is defined on $\mathcal{X}$, and can therefore be used for out-of-sample prediction.

In Figure 2, we show a simple two-dimensional dataset where MR provably fails when $\mathcal{H}_A$ is the space of linear functions on $\mathbb{R}^2$. The LINES dataset consists of two classes spread along perpendicular lines. In MR, the intrinsic regularizer is enforced directly on $\mathcal{H}_A$. It can easily be shown that the intrinsic norm of a linear function $f(x) = w^T x$ along the perpendicular lines is exactly the same as the ambient norm, i.e., $\|f\|^2_I = \|f\|^2_A = w^T w$. Due to this, MR simply ignores unlabeled data and reduces to supervised training with the regularization parameter $\gamma_1 + \gamma_2$.

The linear function that gives maximally smooth predictions on one line also gives the maximally non-smooth predictions on the other line. One way to remedy this restrictive situation is to introduce slack variables $\xi = (\xi_i)_{i \in L \cup U}$ in Eqn. 13 with an $\ell_2$ penalty, and instead solve: $f^\star = \mathrm{argmin}_{f \in \mathcal{H}_A, \xi}\; \gamma_1\|f\|^2_{\mathcal{H}_A} + \gamma_2(\mathbf{f} + \xi)^T M (\mathbf{f} + \xi) + \mu\|\xi\|^2 + \sum_{i \in L} V(y_i, f(x_i))$. Re-parameterizing $g = \mathbf{f} + \xi$, we can re-write the above problem as, $f^\star = \mathrm{argmin}_{f \in \mathcal{H}_A, g \in \mathcal{H}_I}\; \gamma_1\|f\|^2_{\mathcal{H}_A} + \gamma_2\|g\|^2_I + \mu\|\mathbf{f} - g\|^2 + \sum_{i \in L} V(y_i, f(x_i))$, which may be viewed as a variant of the co-regularization problem in Eqn. 2 where the empirical loss is measured for $f$ alone. Thus, this motivates the view that CoMR adds extra slack variables to the MR objective function to better fit the intrinsic regularizer. Figure 2 shows that CoMR achieves better separation between classes on the LINES dataset.

[Figure 2. Decision boundaries of (a) MR and (b) CoMR (using the quadratic hinge loss) on the LINES dataset.]

4.3. Experiments

In this section, we compare MR and CoMR. Similar to our construction of the co-regularization kernel, (Sindhwani et al., 2005a) provide a data-dependent kernel that reduces manifold regularization to standard supervised learning in an associated RKHS. We write the manifold regularization kernel in the following form,

$$\tilde{k}_{mr}(x, z) = \bar{s}(x, z) - \bar{d}_x^T \bar{H} \bar{d}_z \quad (14)$$

where we have $\bar{s}(x, z) = \gamma_1^{-1} k^1(x, z)$, $\bar{d}_x = \gamma_1^{-1} k^1_{Ux}$, and $\bar{H} = \left(\gamma_1^{-1}\bar{K}^1 + \gamma_2^{-1}\bar{K}^2\right)^{-1}$, where $\bar{K}^1$ is the Gram matrix of $k^1 = k_A$ over labeled and unlabeled examples, and $\bar{K}^2 = M^\dagger$. We use the notations $\bar{s}$, $\bar{d}$, $\bar{H}$ so that the kernel can be easily compared with the corresponding quantities in the co-regularization kernel of Eqn. 7. In this section we empirically compare this kernel with the co-regularization kernel of Eqn. 7 for exactly the same choice of $k^1$, $k^2$; a small code sketch of this kernel appears at the end of this subsection. Semi-supervised classification experiments were performed on 5 datasets described in Table 1.

Table 1. Datasets with d features and c classes. 10 random data splits were created with l labeled, u unlabeled, t test, and v validation examples.

Dataset | d    | c  | l  | u    | t   | v
LINES   | 2    | 2  | 2  | 500  | 250 | 250
G50C    | 50   | 2  | 50 | 338  | 112 | 50
USPST   | 256  | 10 | 50 | 1430 | 477 | 50
COIL20  | 241  | 20 | 40 | 1320 | 40  | 40
PCMAC   | 7511 | 2  | 50 | 1385 | 461 | 50

The LINES dataset is a variant of the two-dimensional problem shown in Figure 2 where we added random noise around the two perpendicular lines. The G50C, USPST, COIL20, and PCMAC datasets are well known and have frequently been used for empirical studies in the semi-supervised learning literature. They were used for benchmarking manifold regularization in (Sindhwani et al., 2005a) against a number of competing methods. G50C is an artificial dataset generated from two unit-covariance normal distributions with equal probabilities. The class means are adjusted so that the Bayes error is 5%. COIL20 consists of 32x32 gray scale images of 20 objects viewed from varying angles. USPST is taken from the test subset of the USPS dataset of images containing 10 classes of handwritten digits. PCMAC is used to set up binary text categorization problems drawn from the 20-newsgroups dataset.

For each of the 5 datasets, we constructed random splits into labeled, unlabeled, test and validation sets. The sizes of these sets are given in Table 1. For all datasets except LINES, we used Gaussian ambient kernels $k^1(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$, and an intrinsic graph kernel whose gram matrix is of the form $K^2 = (M^p + 10^{-6} I)^{-1}$. Here, $M$ is the normalized graph Laplacian constructed using $nn$ nearest neighbors and $p$ is an integer. These parameters are tabulated in Table 4 for reproducibility. For more details on these parameters see (Sindhwani et al., 2005a).
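As referenced above, the MR kernel of Eqn. 14 can be assembled in the same style as the co-regularization kernel sketch of Section 2.1, which makes the structural comparison between the two explicit. As before, the interface is an illustrative assumption; the deforming point set is taken to be the labeled and unlabeled examples, matching the Gram matrices appearing in $\bar{H}$.

```python
import numpy as np

def mr_kernel(k1, K2_cloud, Xcloud, gamma1, gamma2):
    """Manifold regularization kernel of Eqn. 14.

    k1       : ambient kernel function k_A(X, Z) -> Gram matrix
    K2_cloud : intrinsic Gram matrix (e.g. the Laplacian pseudo-inverse) on Xcloud
    Xcloud   : labeled and unlabeled examples used to deform the kernel
    """
    K1_cloud = k1(Xcloud, Xcloud)
    H_bar = np.linalg.inv(K1_cloud / gamma1 + K2_cloud / gamma2)

    def k_mr(X, Z):
        s_bar = k1(X, Z) / gamma1
        Dx = k1(Xcloud, X) / gamma1        # columns are d_bar_x
        Dz = k1(Xcloud, Z) / gamma1
        return s_bar - Dx.T @ H_bar @ Dz   # Eqn. 14
    return k_mr
```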

We chose the squared loss for $V(\cdot,\cdot)$. Manifold regularization with this choice is also referred to as Laplacian RLS and empirically performs as well as Laplacian SVMs. For multi-class problems, we used the one-versus-rest strategy. $\gamma_1$, $\gamma_2$ were varied on a grid of values, $10^{-6}, 10^{-4}, 10^{-2}, 1, 10, 100$, and chosen with respect to validation error. The chosen parameters are also reported in Table 4. Finally, we evaluated the MR solution and the ambient component of CoMR on an unseen test set. In Tables 2 and 3 we report the mean and standard deviation of error rates on test and unlabeled examples with respect to 10 random splits. We performed a paired t-test at the 5% significance level to assess the statistical significance of the results. Results shown in bold are statistically significant.

Our experimental protocol makes MR and CoMR exactly comparable. We find that CoMR gives major empirical improvements over MR on all datasets except G50C, where both methods approach the Bayes error rate.

Table 2. Error rates (in percentage) on test data.

Dataset | MR          | CoMR
LINES   | 7.7 (1.2)   | 1.0 (1.5)
G50C    | 5.8 (2.8)   | 5.5 (2.3)
USPST   | 18.2 (1.5)  | 14.1 (1.6)
COIL20  | 23.8 (11.1) | 14.8 (8.8)
PCMAC   | 11.9 (3.4)  | 8.9 (2.6)

Table 3. Error rates (in percentage) on unlabeled data.

Dataset | MR         | CoMR
LINES   | 7.5 (1.0)  | 1.3 (2.0)
G50C    | 6.6 (0.8)  | 6.9 (1.0)
USPST   | 18.6 (1.4) | 13.3 (1.0)
COIL20  | 37.5 (6.0) | 14.8 (3.3)
PCMAC   | 11.0 (2.4) | 9.4 (1.9)

Table 4. Parameters used. Note mu = 1 for CoMR. A linear kernel was used for the LINES dataset.

Dataset | nn | sigma | p | MR (gamma1, gamma2) | CoMR (gamma1, gamma2)
LINES   | 10 | -     | 1 | 0.01, 1e-6          | 1e-4, 100
G50C    | 50 | 17.5  | 5 | 1, 100              | 10, 10
USPST   | 10 | 9.4   | 2 | 0.01, 0.01          | 1e-6, 1e-4
COIL20  | 2  | 0.6   | 1 | 1e-4, 1e-6          | 1e-6, 1e-6
PCMAC   | 50 | 2.7   | 5 | 10, 100             | 1, 10

5. Conclusion

In this paper, we have constructed a single, new RKHS in which standard supervised algorithms are immediately turned into multi-view semi-supervised learners. This construction brings about significant theoretical simplifications and algorithmic benefits, which we have demonstrated in the context of generalization analysis and manifold regularization respectively.
References

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2002). Localized Rademacher complexities. COLT 2002 (pp. 44-58). Springer-Verlag.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7, 2399-2434.

Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers.

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. COLT.

Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: P&S, 9, 323-375.

Brefeld, U., Gärtner, T., Scheffer, T., & Wrobel, S. (2006). Efficient co-regularised least squares regression. ICML (pp. 137-144).

Farquhar, J. D. R., Hardoon, D. R., Meng, H., Shawe-Taylor, J., & Szedmák, S. (2005). Two view learning: SVM-2K, theory and practice. NIPS.

Krishnapuram, B., Williams, D., Xue, Y., Hartemink, A., & Carin, L. (2005). On semi-supervised classification. NIPS.

Rosenberg, D., & Bartlett, P. L. (2007). The Rademacher complexity of co-regularized kernel classes. AISTATS.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005a). Beyond the point cloud: From transductive to semi-supervised learning. ICML.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005b). A co-regularization approach to semi-supervised learning with multiple views. ICML Workshop on Learning with Multiple Views (pp. 824-831).

Yu, S., Krishnapuram, B., Rosales, R., Steck, H., & Rao, R. (2008). Bayesian co-training. NIPS.

A. Proof of Theorem 2.2

This theorem generalizes Theorem 5 in (Berlinet & Thomas-Agnan, 2004).

The Product Hilbert Space. We begin by introducing the product space,

$$\mathcal{F} = \mathcal{H}^1 \times \mathcal{H}^2 = \left\{(f^1, f^2) : f^1 \in \mathcal{H}^1,\; f^2 \in \mathcal{H}^2\right\},$$

and an inner product on $\mathcal{F}$ defined by,

$$\left\langle (f^1, f^2), (g^1, g^2) \right\rangle_{\mathcal{F}} = \gamma_1 \langle f^1, g^1 \rangle_{\mathcal{H}^1} + \gamma_2 \langle f^2, g^2 \rangle_{\mathcal{H}^2} + \mu \sum_{i \in U} \left[f^1(x_i) - f^2(x_i)\right]\left[g^1(x_i) - g^2(x_i)\right] \quad (15)$$

It is straightforward to check that $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ is a valid inner product. Moreover, we have the following:

Lemma A.1. $\mathcal{F}$ is a Hilbert space.

Proof. (Sketch.) We need to show that $\mathcal{F}$ is complete. Let $(f^1_n, f^2_n)$ be a Cauchy sequence in $\mathcal{F}$. Then $f^1_n$ is Cauchy in $\mathcal{H}^1$ and $f^2_n$ is Cauchy in $\mathcal{H}^2$. By the completeness of $\mathcal{H}^1$ and $\mathcal{H}^2$, we have $f^1_n \to f^1$ in $\mathcal{H}^1$ and $f^2_n \to f^2$ in $\mathcal{H}^2$, for some $(f^1, f^2) \in \mathcal{F}$. Since $\mathcal{H}^1$ and $\mathcal{H}^2$ are RKHSs, convergence in norm implies pointwise convergence, and thus the co-regularization part of the distance also goes to zero.

$\tilde{\mathcal{H}}$ is a Hilbert Space. Recall the definition of $\tilde{\mathcal{H}}$ in Eqn. 4. Define the map $u : \mathcal{F} \to \tilde{\mathcal{H}}$ by $u(f^1, f^2) = f^1 + f^2$. The map's kernel $N := u^{-1}(0)$ is a closed subspace of $\mathcal{F}$, and thus its orthogonal complement $N^\perp$ is also a closed subspace. We can consider $N^\perp$ as a Hilbert space with the inner product that is the natural restriction of $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ to $N^\perp$. Define $v : N^\perp \to \tilde{\mathcal{H}}$ as the restriction of $u$ to $N^\perp$. Then $v$ is a bijection, and we define an inner product on $\tilde{\mathcal{H}}$ by

$$\langle f, g \rangle_{\tilde{\mathcal{H}}} = \left\langle v^{-1}(f), v^{-1}(g) \right\rangle_{\mathcal{F}}. \quad (16)$$

We conclude that $\tilde{\mathcal{H}}$ is a Hilbert space isomorphic to $N^\perp$.

The Co-Regularization Norm. Fix any $f \in \tilde{\mathcal{H}}$, and note that $u^{-1}(f) = \{v^{-1}(f) + n \mid n \in N\}$. Since $v^{-1}(f)$ and $N$ are orthogonal, it is clear by the Pythagorean theorem that $v^{-1}(f)$ is the element of $u^{-1}(f)$ with minimum norm.
Thus

$$\|f\|^2_{\tilde{\mathcal{H}}} = \|v^{-1}(f)\|^2_{\mathcal{F}} = \min_{(f^1, f^2) \in u^{-1}(f)} \|(f^1, f^2)\|^2_{\mathcal{F}}$$

We see that the inner product on $\tilde{\mathcal{H}}$ induces the norm claimed in the theorem statement.

We next check the two conditions for validity of an RKHS (see Definition 2.1).

(a) $\tilde{k}(z, \cdot) \in \tilde{\mathcal{H}}$ for all $z \in \mathcal{X}$. Recall from Eqn. 7 that the co-regularization kernel is defined as

$$\tilde{k}(x, z) = \gamma_1^{-1} k^1(x, z) + \gamma_2^{-1} k^2(x, z) - \mu \left(\gamma_1^{-1} k^1_{Ux} - \gamma_2^{-1} k^2_{Ux}\right)^T \beta_z$$

where $\beta_z = H d_z = (I + \mu S)^{-1}\left(\gamma_1^{-1} k^1_{Uz} - \gamma_2^{-1} k^2_{Uz}\right)$. Define $h^1(x) = \gamma_1^{-1} k^1(x, z) - \mu \gamma_1^{-1} k^1(x, U)\beta_z$ and $h^2(x) = \gamma_2^{-1} k^2(x, z) + \mu \gamma_2^{-1} k^2(x, U)\beta_z$. Note that $h^1 \in \mathrm{span}\left\{k^1(z, \cdot), k^1(x_1, \cdot), \ldots, k^1(x_u, \cdot)\right\} \subset \mathcal{H}^1$, and similarly, $h^2 \in \mathcal{H}^2$. It is clear that $\tilde{k}(z, \cdot) = h^1(\cdot) + h^2(\cdot)$, and thus $\tilde{k}(z, \cdot) \in \tilde{\mathcal{H}}$.

(b) Reproducing Property. For convenience, we collect some basic properties of $h^1$ and $h^2$ in the following lemma:

Lemma A.2 (Properties of $h^1$ and $h^2$). Writing $h^1(U)$ for the column vector with entries $h^1(x_i)$, $\forall i \in U$, and similarly for other functions on $\mathcal{X}$, we have the following:

$$\langle f^1, h^1 \rangle_{\mathcal{H}^1} = \gamma_1^{-1} f^1(z) - \mu \gamma_1^{-1} f^1(U)^T \beta_z \quad (17)$$
$$\langle f^2, h^2 \rangle_{\mathcal{H}^2} = \gamma_2^{-1} f^2(z) + \mu \gamma_2^{-1} f^2(U)^T \beta_z \quad (18)$$
$$h^1(U) - h^2(U) = \beta_z \quad (19)$$

Proof. The first two equations follow from the definitions of $h^1$ and $h^2$ and the reproducing kernel property. The last equation is derived as follows:

$$h^1(U) - h^2(U) = \gamma_1^{-1} k^1(U, z) - \mu \gamma_1^{-1} k^1(U, U)\beta_z - \gamma_2^{-1} k^2(U, z) - \mu \gamma_2^{-1} k^2(U, U)\beta_z$$
$$= d_z - \mu S (I + \mu S)^{-1} d_z = \left(I - \mu S (I + \mu S)^{-1}\right) d_z = (I + \mu S)^{-1} d_z = \beta_z$$

where the last line follows from the Sherman-Morrison-Woodbury inversion formula.

Since $\tilde{k}(z, \cdot) = h^1(\cdot) + h^2(\cdot)$, it is clear that $(h^1, h^2) = v^{-1}\left(\tilde{k}(z, \cdot)\right) + n$, for some $n \in N$. Since $v^{-1}(f) \in N^\perp$, we have

$$\left\langle f, \tilde{k}(z, \cdot) \right\rangle_{\tilde{\mathcal{H}}} = \left\langle v^{-1}(f), v^{-1}\!\left(\tilde{k}(z, \cdot)\right) \right\rangle_{\mathcal{F}} = \left\langle v^{-1}(f), (h^1, h^2) - n \right\rangle_{\mathcal{F}} = \left\langle v^{-1}(f), (h^1, h^2) \right\rangle_{\mathcal{F}}$$
$$= \gamma_1 \langle f^1, h^1 \rangle_{\mathcal{H}^1} + \gamma_2 \langle f^2, h^2 \rangle_{\mathcal{H}^2} + \mu \left(h^1(U) - h^2(U)\right)^T \left(f^1(U) - f^2(U)\right)$$
$$= f^1(z) + f^2(z) - \mu\left(f^1(U) - f^2(U)\right)^T \beta_z + \mu\left(f^1(U) - f^2(U)\right)^T \beta_z \quad \text{(from Lemma A.2)}$$
$$= f(z)$$
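As an informal complement to the proof, an implementation of Eqn. 7 can be sanity-checked numerically: on any finite sample the co-regularization Gram matrix should come out symmetric and positive semi-definite, as Theorem 2.2 guarantees. The helper below is our own illustration, reusing the hypothetical `coreg_kernel` sketch from Section 2.1.

```python
import numpy as np

def check_coreg_kernel_psd(k_tilde, X, tol=1e-8):
    """Numerically verify that the Gram matrix of k_tilde on the sample X is
    symmetric positive semi-definite, consistent with Theorem 2.2."""
    K = k_tilde(X, X)
    symmetric = np.abs(K - K.T).max() < tol
    psd = np.linalg.eigvalsh((K + K.T) / 2).min() > -tol
    return symmetric and psd
```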