Multi-view Metric Learning in Vector-valued Kernel Spaces

Riikka Huusari Hachem Kadri Cécile Capponi

Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France

Abstract

We consider the problem of metric learning for multi-view data and present a novel method for learning within-view as well as between-view metrics in vector-valued kernel spaces, as a way to capture multi-modal structure of the data. We formulate two convex optimization problems to jointly learn the metric and the classifier or regressor in kernel feature spaces. An iterative three-step multi-view metric learning algorithm is derived from the optimization problems. In order to scale the computation to large training sets, a block-wise Nyström approximation of the multi-view kernel is introduced. We justify our approach theoretically and experimentally, and show its performance on real-world datasets against relevant state-of-the-art methods.

1 Introduction

In this paper we tackle the problem of supervised multi-view learning, where each labeled example is observed under several views. These views might be not only correlated, but also complementary, redundant or contradictory. Thus, learning over all the views is expected to produce a final classifier (or regressor) that is better than each individual one. Multi-view learning is well known in the semi-supervised setting, where the agreement among views is usually optimized [4, 28]. Yet, the supervised setting has proven to be interesting as well, independently from any agreement condition on views.

Co-regularization and multiple kernel learning (MKL) are two well-known kernel-based frameworks for learning in the presence of multiple views of data [31]. The former attempts to optimize measures of agreement and smoothness between the views over labeled and unlabeled examples [26]; the latter tries to efficiently combine multiple kernels defined on each view to exploit information coming from different representations [11]. More recently, vector-valued reproducing kernel Hilbert spaces (RKHSs) have been introduced to the field of multi-view learning for going further than MKL by incorporating in the learning model both within-view and between-view dependencies [20, 14]. It turns out that these kernels and their associated vector-valued reproducing kernel Hilbert spaces provide a unifying framework for a number of previous multi-view kernel methods, such as co-regularized multi-view learning and manifold regularization, and naturally allow encoding within-view as well as between-view similarities [21].

Kernels of vector-valued RKHSs are positive semidefinite matrix-valued functions. They have been applied with success in various machine learning problems, such as multi-task learning [10], functional regression [15] and structured output prediction [5]. The main advantage of matrix-valued kernels is that they offer a higher degree of flexibility in encoding similarities between data points. However, finding the optimal matrix-valued kernel for a given application is difficult, as is the question of how to build such kernels. In order to overcome the need for choosing a kernel before the learning process, we propose a supervised metric learning approach that learns a matrix-valued multi-view kernel jointly with the decision function. We refer the reader to [3] for a review of metric learning. It is worth mentioning that algorithms for learning matrix-valued kernels have been proposed in the literature, see for example [9, 8, 17]. However, these methods mainly consider separable kernels, which are not suited for the multi-view setting, as will be illustrated later in this paper.

The main contributions of this paper are: 1) we introduce and learn a new class of matrix-valued kernels designed to handle multi-view data, 2) we give an iterative algorithm that learns simultaneously a vector-valued multi-view function and a block-structured metric between views, 3) we provide a generalization analysis of our algorithm with a Rademacher bound, and 4) we show how matrix-valued kernels can be efficiently computed via a block-wise Nyström approximation in order to significantly reduce their high computational cost.

2 Preliminaries

We start here by briefly reviewing the basics of vector-valued RKHSs and their associated matrix-valued kernels. We then describe how they can be used for learning from multi-view data.

2.1 Vector-valued RKHSs

Vector-valued RKHSs were introduced to the machine learning community by Micchelli and Pontil [19] as a way to extend kernel machines from scalar to vector outputs. In this setting, given a random training sample {x_i, y_i}_{i=1}^n on X × Y, the optimization problem

    \arg\min_{f \in \mathcal{H}} \sum_{i=1}^n V(f, x_i, y_i) + \lambda \|f\|_{\mathcal{H}}^2,    (1)

where f is a vector-valued function and V is a loss function, can be solved in a vector-valued RKHS H by means of a vector-valued extension of the representer theorem. To see this more clearly, we recall some fundamentals of vector-valued RKHSs.

Definition 1. (vector-valued RKHS)
A Hilbert space H of functions from X to R^v is called a reproducing kernel Hilbert space if there is a positive definite R^{v×v}-valued kernel K on X × X such that:

i. the function z ↦ K(x, z)y belongs to H, ∀ z, x ∈ X, y ∈ R^v;

ii. ∀ f ∈ H, x ∈ X, y ∈ R^v, ⟨f, K(x, ·)y⟩_H = ⟨f(x), y⟩_{R^v} (reproducing property).

Definition 2. (matrix-valued kernel)
An R^{v×v}-valued kernel K on X × X is a function K(·,·): X × X → R^{v×v}; it is positive semidefinite if:

i. K(x, z) = K(z, x)^⊤, where ⊤ denotes the transpose of a matrix;

ii. and, for every r ∈ N and all {(x_i, y_i)_{i=1,...,r}} ∈ X × R^v, \sum_{i,j} ⟨y_i, K(x_i, x_j) y_j⟩_{R^v} ≥ 0.

Important results for matrix-valued kernels include the positive semidefiniteness of the kernel K and the fact that we obtain a solution for the regularized optimization problem (1) via a representer theorem. It states that the solution f̂ ∈ H of a learning problem can be written as

    \hat{f}(x) = \sum_{i=1}^n K(x, x_i)\, c_i, \quad \text{with } c_i \in \mathbb{R}^v.

Some well-known classes of matrix-valued kernels include separable and transformable kernels. Separable kernels are defined by

    K(x, z) = k(x, z)\, T,

where T is a matrix in R^{v×v}. This class of kernels is very attractive in terms of computational time, as it is easily decomposable. However, the matrix T acts only on the outputs independently of the input data, which makes it difficult for these kernels to encode the necessary similarities in the multi-view setting. Transformable kernels are defined by

    [K(x, z)]_{lm} = k(S_m x, S_l z).

Here m and l are indices of the output matrix (views in the multi-view setting) and the operators {S_t}_{t=1}^v are used to transform the data. In contrast to separable kernels, here the S_t operate on the input data; however, choosing them is a difficult task. For further reading on matrix-valued reproducing kernels, see, e.g., [1, 6, 7, 15].
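To make the contrast concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; the Gaussian scalar kernel, the toy output metric T and the function names are assumptions) builds the Gram matrix of a separable kernel K(x, z) = k(x, z) T. The output metric T multiplies every scalar similarity by the same v × v matrix, so it never sees which inputs are being compared, which is exactly why separable kernels struggle to encode view-specific similarities.

    import numpy as np

    def gaussian_kernel(X, Z, gamma=1.0):
        """Scalar Gaussian kernel matrix k(x, z) = exp(-gamma * ||x - z||^2)."""
        d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
        return np.exp(-gamma * np.maximum(d2, 0.0))

    def separable_gram(X, T, gamma=1.0):
        """Gram matrix of the separable kernel K(x, z) = k(x, z) T.

        Laid out in v x v blocks of size n x n (block (l, m) equals T[l, m] * K),
        matching the view-block layout used later for the multi-view kernel.
        """
        K = gaussian_kernel(X, X, gamma)   # n x n scalar Gram matrix
        return np.kron(T, K)               # (n v) x (n v) block matrix

    # toy usage: n = 5 points, v = 2 outputs, an illustrative PSD output metric T
    X = np.random.randn(5, 3)
    T = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
    print(separable_gram(X, T).shape)      # (10, 10)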
(2) positive semidefiniteness of the kernel K and that we ob- f∈H,W i=1 tain a solution for regularized optimization problem (1) via a representer theorem. It states that solution fˆ ∈ H Here f is a vector-valued function that groups v learn- for a learning problem can be written as ing functions, each corresponding to one view, and W : v → is combination operator for combining n R R ˆ X v the results of the learning functions. f(x) = K(x, xi)ci, with ci ∈ R . i=1 While the vector-valued extension of the representer theorem provides an algorithmc way for computing the Some well-known classes of matrix-valued kernels in- solution of the multi-view learning problem (2), the clude separable and transformable kernels. Separable question of choosing the multi-view kernel K remains Riikka Huusari, Hachem Kadri, Cécile Capponi

3 Multi-View Metric Learning

Here we introduce an optimization problem for learning simultaneously a vector-valued multi-view function and a positive semidefinite metric between kernel feature maps, as well as an operator for combining the answers from the views to yield the final decision. We then derive a three-step metric learning algorithm for multi-view data and give a Rademacher bound for it. Finally we demonstrate how it can be implemented efficiently via a block-wise Nyström approximation and give a block-sparse version of our formulation.

3.1 Matrix-valued multi-view kernel

We consider the following class of matrix-valued kernels that can operate over multiple views:

    K(x_i, x_j)_{lm} = \langle \Phi_l(x_i^l), C_{X_l X_m} \Phi_m(x_j^m) \rangle,    (3)

where Φ_l (resp. Φ_m) is the feature map associated to the scalar-valued kernel k_l (resp. k_m) defined on the view l (resp. m). In the following we will leave out the view label from a data instance when the feature map or kernel function already carries that information, e.g. instead of Φ_l(x_i^l) we write Φ_l(x_i). C_{X_l X_m}: H_m → H_l is a linear operator between the scalar-valued RKHSs H_l and H_m of the kernels k_l and k_m, respectively. The operator C_{X_l X_m} allows one to encode both within-view and between-view similarities.

The choice of the operator C_{X_l X_m} is crucial and depends on the multi-view problem at hand. In the following we only consider operators C_{X_l X_m} that can be written as C_{X_l X_m} = Φ_l A_{lm} Φ_m^⊤, where Φ_s = (Φ_s(x_1), ..., Φ_s(x_n))^⊤ with s = l, m, and A_{lm} ∈ R^{n×n} is a positive definite matrix which plays the role of a metric between the two feature maps associated with the kernels k_l and k_m defined over the views l and m. This is a large set of possible operators, yet it depends only on a finite number of parameters. It gives us the following class of kernels:

    K(x_i, x_j)_{lm} = \langle \Phi_l(x_i), \Phi_l A_{lm} \Phi_m^\top \Phi_m(x_j) \rangle
                     = \langle \Phi_l^\top \Phi_l(x_i), A_{lm} \Phi_m^\top \Phi_m(x_j) \rangle
                     = \langle k_l(x_i), A_{lm}\, k_m(x_j) \rangle,    (4)

where we have written k_l(x_i) = (k_l(x_t, x_i))_{t=1}^n. We note that this class is not in general separable or transformable. However, in the special case when it is possible to write A_{ml} = A_m A_l, the kernel is transformable.

It is easy to see that the lm-th block of the block kernel matrix K built from the matrix-valued kernel (4) can be written as K_{lm} = K_l A_{lm} K_m, where K_s = (k_s(x_i, x_j))_{i,j=1}^n for view s. The block kernel matrix K = (K(x_i, x_j))_{i,j=1}^n in this case has the form

    \mathbf{K} = \mathbf{H} \mathbf{A} \mathbf{H},    (5)

where H = blockdiag(K_1, ..., K_v) is the block-diagonal matrix satisfying H_{l,l} = K_l, ∀ l = 1, ..., v, and the matrix A = (A_{lm})_{l,m=1}^v ∈ R^{nv×nv} encodes pairwise similarities between all the views. Multi-view metric learning then corresponds to simultaneously learning the metric A and the classifier or regressor.

From this framework, with suitable choices of A, we can recover the cross-covariance multi-view kernel of [14], or for example an MKL-like multi-view kernel containing only one-view kernels.

3.2 Algorithm

Using the vector-valued representer theorem, the multi-view learning problem (2) becomes

    \arg\min_{c_1,...,c_n \in \mathbb{R}^v} \sum_{i=1}^n V\Big(y_i, \mathrm{W}\Big(\sum_{j=1}^n K(x_i, x_j)\, c_j\Big)\Big) + \lambda \sum_{i,j=1}^n \langle c_i, K(x_i, x_j)\, c_j \rangle.

We set V to be the square loss function and assume the operator W to be known. We choose it to be a weighted sum of the outputs, W = w^⊤ ⊗ I_n, giving us Wf(x) = \sum_{m=1}^v w_m f^m(x). Let y ∈ R^n be the output vector (y_i)_{i=1}^n. The previous optimization problem can now be written as

    \arg\min_{c \in \mathbb{R}^{nv}} \|y - (w^\top \otimes I_n)\, \mathbf{K} c\|^2 + \lambda \langle \mathbf{K}c, c \rangle,

where K ∈ R^{nv×nv} is the block kernel matrix associated to the matrix-valued kernel (3) and w ∈ R^v is a vector containing weights for combining the final result.

Using (5), and considering an additional regularizer, we formulate the multi-view metric learning (MVML) optimization problem:

    \min_{\mathbf{A}, c} \|y - (w^\top \otimes I_n)\, \mathbf{H}\mathbf{A}\mathbf{H} c\|^2 + \lambda \langle \mathbf{H}\mathbf{A}\mathbf{H} c, c \rangle + \eta \|\mathbf{A}\|_F^2, \quad \text{s.t. } \mathbf{A} \succeq 0.    (6)
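As a concrete illustration of Equation (5), the sketch below (our own NumPy/SciPy illustration, not the authors' released code; the Gaussian per-view kernels, the identity initialisation of A and the function names are assumptions) assembles the multi-view Gram matrix K = HAH from the per-view Gram matrices and a block metric A.

    import numpy as np
    from scipy.linalg import block_diag

    def gaussian_gram(X, gamma=1.0):
        """Per-view Gaussian Gram matrix K_s with entries k_s(x_i, x_j)."""
        d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        return np.exp(-gamma * np.maximum(d2, 0.0))

    def multiview_gram(views, A, gamma=1.0):
        """Block kernel matrix K = H A H of Equation (5).

        views : list of v arrays of shape (n, d_l), one per view
        A     : (n v) x (n v) metric with n x n blocks A_lm
        Block (l, m) of the result equals K_l A_lm K_m, as in the text.
        """
        H = block_diag(*[gaussian_gram(X, gamma) for X in views])  # blockdiag(K_1, ..., K_v)
        return H @ A @ H

    # toy usage: v = 2 views of n = 6 samples; A initialised to the identity,
    # which keeps only within-view blocks (each equal to K_l squared) and
    # zeroes every between-view block
    n = 6
    views = [np.random.randn(n, 3), np.random.randn(n, 5)]
    K = multiview_gram(views, np.eye(2 * n))
    print(K.shape)   # (12, 12)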

Here we have restricted the block metric matrix A to be positive definite and we penalize its complexity via the Frobenius norm.

Inspired by [8] we make a change of variable g = AHc in order to obtain a solution. Using the mapping (c, A) → (g, A) we obtain the equivalent learning problem:

    \min_{\mathbf{A}, g} \|y - (w^\top \otimes I_n)\, \mathbf{H} g\|^2 + \lambda \langle g, \mathbf{A}^\dagger g \rangle + \eta \|\mathbf{A}\|_F^2, \quad \text{s.t. } \mathbf{A} \succeq 0.    (7)

It is good to note that despite the misleading similarities between our work and that of [8], we use different mappings for solving our problems, which are also formulated differently. We also consider different classes of kernels, as [8] considers only separable kernels.

Remark. The optimization problem (7) is convex. The main idea is to note that \langle g, \mathbf{A}^\dagger g \rangle is jointly convex (see e.g. [32]).

We use an alternating scheme to solve our problem. We arrive at the following solution for g with fixed A:

    g = \big(\mathbf{H}(w^\top \otimes I_n)^\top (w^\top \otimes I_n)\mathbf{H} + \lambda \mathbf{A}^\dagger\big)^{-1} \mathbf{H}(w^\top \otimes I_n)^\top y.    (8)

The solution of (7) for A with fixed g is obtained by gradient descent, where the update rule is given by

    \mathbf{A}^{k+1} = (1 - 2\mu\eta)\, \mathbf{A}^k + \mu\lambda\, (\mathbf{A}^k)^\dagger g g^\top (\mathbf{A}^k)^\dagger,    (9)

where μ is the step size. Technical details of the derivations can be found in the supplementary material (Appendix A.1). It is important to note here that Equation (9) is obtained by solving the optimization problem (7) without considering the positivity constraint on A. Despite this, the obtained A is symmetric and positive when μη < 1/2 (compare to [13]), and hence the learned matrix-valued multi-view kernel is valid.

If so desired, it is also possible to learn the weights w. For fixed g and A the solution for w is

    w = (Z^\top Z)^{-1} Z^\top y,    (10)

where Z ∈ R^{n×v} is filled columnwise from Hg.

Our MVML algorithm thus iterates over solving A, g and w if weights are to be learned (see Algorithm 1, version a). The complexity of the algorithm is O(v^3 n^3), for it computes the inverse of the nv × nv matrix A, required for calculating g. We will show later how to reduce the computational complexity of our algorithm via Nyström approximation, while conserving the desirable information about the multi-view problem.

Algorithm 1 Multi-View Metric Learning: with a) full kernel matrices; b) Nyström approximation
    Initialize A ⪰ 0 and w
    while not converged do
        Update g via Equation a) 8 or b) 15
        if w is to be calculated then
            Update w via Equation a) 10 or b) 17
        if sparsity promoting method then
            Iterate A with Equation a) 12 or b) 18
        else
            Iterate A via Equation a) 9 or b) 16
    return A, g, w
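A schematic NumPy version of Algorithm 1 (version a, with the dense update (9)) is given below. It is a sketch under our own assumptions (pseudo-inverses for A†, a fixed iteration count instead of a convergence test, illustrative default step size, and function names of our choosing), not the authors' implementation. Prediction uses the fact that for test points K_test c = H_test A H c = H_test g, where H_test stacks the per-view test-to-train kernel matrices block-diagonally.

    import numpy as np

    def mvml_fit(H, y, v, lam, eta, mu=1e-3, n_iter=100, learn_w=False):
        """Sketch of Algorithm 1, version a: alternate updates of g, (optionally w,) and A.

        H : (n v) x (n v) block-diagonal matrix blockdiag(K_1, ..., K_v)
        y : length-n vector of outputs
        """
        n = H.shape[0] // v
        A = np.eye(n * v)                          # initialise A (positive definite)
        w = np.full(v, 1.0 / v)                    # uniform view weights, as in the experiments
        g = np.zeros(n * v)
        for _ in range(n_iter):                    # fixed count; a convergence test could be used
            WH = np.kron(w, np.eye(n)) @ H         # (w^T kron I_n) H, an n x (n v) operator
            A_pinv = np.linalg.pinv(A)
            # Equation (8): g for fixed A (and w)
            g = np.linalg.solve(WH.T @ WH + lam * A_pinv, WH.T @ y)
            if learn_w:
                # Equation (10): Z is Hg reshaped column-wise into an n x v matrix
                Z = (H @ g).reshape(v, n).T
                w = np.linalg.solve(Z.T @ Z, Z.T @ y)
            # Equation (9): gradient step on A; stays symmetric positive when mu * eta < 1/2
            A = (1 - 2 * mu * eta) * A + mu * lam * (A_pinv @ np.outer(g, g) @ A_pinv)
        return A, g, w

    def mvml_predict(H_test, g, w, v):
        """Per-view scores are the view-blocks of H_test g; combine them with the weights w."""
        n_test = H_test.shape[0] // v
        return (H_test @ g).reshape(v, n_test).T @ w

Here H_test = blockdiag(K_1^test, ..., K_v^test), with K_l^test the n_test × n kernel matrix between test and training points of view l.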
3.3 Illustration

We illustrate with simple toy data the effects of learning both within- and between-view metrics. We compare our method, MVML, to MKL, which considers only within-view dependencies, and to output kernel learning (OKL) [8, 9], where separable kernels are learnt. We generated an extremely simple dataset of two classes and two views in R^2, allowing for visualization and understanding of the way the methods perform classification with multi-view data. The second view in the dataset is completely generated from the first, through a linear transformation (a shear mapping followed by a rotation). The generated data and the transformations arising from applying the algorithms are shown in Figure 1. The space for the transformed data is R^2 since we used linear kernels for simplicity. Our MVML is the only method giving a linear separation of the two classes. This means that it groups the data points into groups based on their class, not their view, and thus is able to construct a good approximation of the initial data transformations by which we generated the second view.

Figure 1: Simple two-view dataset and its transformations. Left: original data, where one of the views is completely generated from the other by a linear transformation (a shear mapping followed by a rotation); left middle: MKL transformation; right middle: MVML transformation; right: OKL transformation. MVML shows a linear separation of the classes (blue/pale red) of the views (circles/triangles), while MKL and OKL do not.
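For readers who want to reproduce a dataset in the spirit of Section 3.3, the snippet below sketches one way to build it (our own construction; the class layout, noise level, and the shear and rotation parameters are illustrative assumptions, since the paper does not specify them).

    import numpy as np

    def make_two_view_toy(n_per_class=50, seed=0):
        """Two Gaussian classes in R^2; view 2 is a shear followed by a rotation of view 1."""
        rng = np.random.default_rng(seed)
        X1 = np.vstack([rng.normal([-1.0, 0.0], 0.4, size=(n_per_class, 2)),
                        rng.normal([+1.0, 0.0], 0.4, size=(n_per_class, 2))])
        y = np.hstack([-np.ones(n_per_class), np.ones(n_per_class)])
        shear = np.array([[1.0, 0.8],          # illustrative shear, not the paper's values
                          [0.0, 1.0]])
        theta = np.pi / 4                      # illustrative rotation angle
        rotation = np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta),  np.cos(theta)]])
        X2 = X1 @ shear.T @ rotation.T         # second view fully determined by the first
        return [X1, X2], y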

3.4 Rademacher complexity bound

We now provide a generalization analysis of the MVML algorithm using Rademacher complexities [2]. The notion of Rademacher complexity has been generalized to vector-valued hypothesis spaces [18, 27, 24]. Previous work has analyzed the case where the matrix-valued kernel is fixed prior to learning, while our analysis considers the kernel learning problem. It provides a Rademacher bound for our algorithm when both the vector-valued function f and the metric between views A are learnt.

We start by recalling that the feature map associated to the matrix-valued kernel K is the mapping Γ: X → L(Y, H), where X is the input space, Y = R^v, and L(Y, H) is the set of bounded linear operators from Y to H (see, e.g., [19, 7] for more details). It is known that K(x, z) = Γ(x)*Γ(z). We denote by Γ_A the feature map associated to our multi-view kernel (Equation 4). The hypothesis class of MVML is

    \mathcal{H}_\lambda = \{ x \mapsto f_{u,\mathbf{A}}(x) = \Gamma_{\mathbf{A}}(x)^* u : \mathbf{A} \in \Delta, \|u\|_{\mathcal{H}} \le \beta \},

with Δ = {A : A ⪰ 0, ‖A‖_F ≤ α}, and β is a regularization parameter. Let σ_1, ..., σ_n be an i.i.d. family of vectors of independent Rademacher variables, where σ_i ∈ R^v, ∀ i = 1, ..., n. The empirical Rademacher complexity of the vector-valued class H_λ is the function R̂_n(H_λ) defined as

    \hat{R}_n(\mathcal{H}_\lambda) = \frac{1}{n}\, \mathbb{E}\left[ \sup_{f \in \mathcal{H}} \sup_{\mathbf{A} \in \Delta} \sum_{i=1}^n \sigma_i^\top f_{u,\mathbf{A}}(x_i) \right].

Theorem 1. The empirical Rademacher complexity of H_λ can be upper bounded as follows:

    \hat{R}_n(\mathcal{H}_\lambda) \le \frac{\beta \sqrt{\alpha \|q\|_1}}{n},

where q = (tr(K_l^2))_{l=1}^v, and K_l is the Gram matrix computed from the training set {x_1, ..., x_n} with the kernel k_l defined on the view l. For kernels k_l such that tr(K_l^2) ≤ τn, we have

    \hat{R}_n(\mathcal{H}_\lambda) \le \beta \sqrt{\frac{\alpha \tau v}{n}}.

The proof of the theorem can be found in the supplementary material (Appendix A.2). Using well-known results [22, chapter 10], this bound on the Rademacher complexity can be used to obtain a generalization bound for our algorithm. It is worth mentioning that in our multi-view setting the matrix-valued kernel is computed from the product of the kernel matrices defined over the views. This is why our assumption is on the trace of the square of the kernel matrices K_l. It is more restrictive than the usual one in the one-view setting (tr(K_l) ≤ τn), but it is satisfied in some cases, for example for diagonally dominant kernel matrices [25]. It is interesting to investigate whether our Rademacher bound could be obtained under a much less restrictive assumption on the kernels over the views, and this will be investigated in future work.

3.5 Block-sparsity and efficient implementation via block-Nyström approximation

In this section we consider a variant of our formulation (6) that allows block-sparse solutions for the metric matrix A, and we further show how to reduce the complexity of the required computations for our algorithms.

Block-sparsity   We formulate a second optimization problem to study the effect of sparsity over A. Instead of having, for example, an l1-norm regularizer over the whole matrix, we consider sparsity on a group level, so that whole blocks corresponding to pairs of views are put to zero. Intuitively, the block-sparse result will give insight as to which views are interesting and worth taking into account in learning. For example, by tuning the parameter controlling the sparsity level one could derive, in some sense, an order of importance of the views and their combinations. The convex optimization problem is as follows:

    \min_{\mathbf{A}, c} \|y - (w^\top \otimes I_n)\mathbf{H}\mathbf{A}\mathbf{H}c\|^2 + \lambda \langle \mathbf{H}\mathbf{A}\mathbf{H}c, c \rangle + \eta \sum_{\gamma \in \mathcal{G}} \|\mathbf{A}_\gamma\|_F,    (11)

where we have an l1/l2-regularizer over the set of groups G we consider for sparsity. In our multi-view setting these groups correspond to combinations of views; e.g. with three views the matrix A would consist of six groups. We note that when we speak of combinations of views we include both blocks of the matrix that the combination corresponds to. Using this group regularization, in essence, allows us to have view-sparsity in our multi-view kernel matrix.

To solve this optimization problem we introduce the same mapping as before, and obtain the same solutions for g and w. However, (11) does not have an obvious closed-form solution for A, so it is solved with a proximal gradient method, the update rule being

    [\mathbf{A}^{k+1}]_\gamma = \left(1 - \frac{\eta}{\|[\mathbf{A}^k - \mu^k \nabla h(\mathbf{A}^k)]_\gamma\|_F}\right)_+ [\mathbf{A}^k - \mu^k \nabla h(\mathbf{A}^k)]_\gamma,    (12)

where μ^k is the step size, h(A^k) = λ⟨g, (A^k)†g⟩, and ∇h(A^k) = −λ(A^k)^{-1} g g^⊤ (A^k)^{-1}.

We note that even if we begin the iteration with a positive definite (pd) matrix, the next iterate is not guaranteed to always be pd, and this is the reason for omitting the positivity constraint in the formulation of the sparse problem (Equation 11). Nevertheless, all block-diagonal results are pd, and so are other results if certain conditions hold. In experiments we have observed that the solution is positive semidefinite. The full derivation of the proximal algorithm and notes about the positiveness of A are in the supplementary material (Appendix A.1).

Nyström approximation   As a way to reduce the complexity of the required computations we propose using the Nyström approximation on each one-view kernel matrix. In the Nyström approximation method [30], a (scalar-valued) kernel matrix M is divided in four blocks,

    M = \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix},

and is approximated by M ≈ QW†Q^⊤, where Q = (M_{11}\; M_{12})^⊤ and W = M_{11}. Denote by p the number of rows of M chosen to build W. This scheme gives a low-rank approximation of M by sampling p examples, and only the last block, M_{22}, will be approximated.

We could approximate the block kernel matrix K directly by applying the Nyström approximation, but this would have the effect of removing the block structure in the kernel matrix, and consequently the useful multi-view information might be lost. Instead, we proceed in a way that is consistent with the multi-view problem and approximate each kernel matrix defined over one view as K_l ≈ Q_l W_l† Q_l^⊤ = Q_l (W_l†)^{1/2} ((W_l†)^{1/2})^⊤ Q_l^⊤ = U_l U_l^⊤, ∀ l = 1, ..., v. The goodness of the approximation depends on the chosen p. Before performing the approximation a random ordering of the samples is calculated. We note that in our multi-view setting we have to impose the same ordering over all the views. We introduce the Nyström approximation to all our single-view kernels and define U = blockdiag(U_1, ..., U_v). We can now approximate our multi-view kernel (5) as

    \mathbf{K} = \mathbf{H}\mathbf{A}\mathbf{H} \approx \mathbf{U}\mathbf{U}^\top \mathbf{A}\, \mathbf{U}\mathbf{U}^\top = \mathbf{U}\tilde{\mathbf{A}}\mathbf{U}^\top,

where we have written Ã = U^⊤AU. Using this scheme, we obtain a block-wise Nyström approximation of K that preserves the multi-view structure of the kernel matrix while allowing substantial computational gains.

We introduce the Nyström approximation into (6) and (11) and write g̃ = ÃU^⊤c, resulting in

    \min_{\tilde{\mathbf{A}}, \tilde{g}, w} \|y - (w^\top \otimes I_n)\mathbf{U}\tilde{g}\|^2 + \lambda \langle \tilde{g}, \tilde{\mathbf{A}}^\dagger \tilde{g} \rangle + \eta \|\tilde{\mathbf{A}}\|_F^2, \quad \text{s.t. } \tilde{\mathbf{A}} \succeq 0,    (13)

and

    \min_{\tilde{\mathbf{A}}, \tilde{g}, w} \|y - (w^\top \otimes I_n)\mathbf{U}\tilde{g}\|^2 + \lambda \langle \tilde{g}, \tilde{\mathbf{A}}^\dagger \tilde{g} \rangle + \eta \sum_{\gamma \in \mathcal{G}} \|\tilde{\mathbf{A}}_\gamma\|_F.    (14)

We note that these optimization problems are not strictly equivalent to the ones before; namely, we impose the Frobenius norm regularization over Ã rather than over A. The obtained solution for (13) will again satisfy the positivity condition when μη < 1/2. For the sparse solution the positivity is unfortunately not always guaranteed, but it is achieved if certain conditions hold.

We solve the problems as before, and obtain:

    \tilde{g} = \big(\mathbf{U}^\top(w^\top \otimes I_n)^\top (w^\top \otimes I_n)\mathbf{U} + \lambda \tilde{\mathbf{A}}^\dagger\big)^{-1} \mathbf{U}^\top (w^\top \otimes I_n)^\top y,    (15)

    \tilde{\mathbf{A}}^{k+1} = (1 - 2\mu\eta)\, \tilde{\mathbf{A}}^k + \mu\lambda\, (\tilde{\mathbf{A}}^k)^\dagger \tilde{g}\tilde{g}^\top (\tilde{\mathbf{A}}^k)^\dagger,    (16)

and

    w = (\tilde{Z}^\top \tilde{Z})^{-1} \tilde{Z}^\top y.    (17)

Here Z̃ ∈ R^{n×v} is filled columnwise from Ug̃. For our block-sparse method we get the update rule

    [\tilde{\mathbf{A}}^{k+1}]_\gamma = \left(1 - \frac{\eta}{\|[\tilde{\mathbf{A}}^k - \mu^k \nabla f(\tilde{\mathbf{A}}^k)]_\gamma\|_F}\right)_+ [\tilde{\mathbf{A}}^k - \mu^k \nabla f(\tilde{\mathbf{A}}^k)]_\gamma,    (18)

where ∇f(Ã^k) = −λ(Ã^k)^{-1} g̃ g̃^⊤ (Ã^k)^{-1} and μ^k is the step size.

We follow the same algorithm as before for calculating the solution, now over Ã and g̃ (Algorithm 1, version b). The complexity is now of order O(v^3 p^3) rather than O(v^3 n^3), where p ≪ n is the number of samples chosen for the Nyström approximation in each block. From the obtained solution it is possible to calculate the original g and A if needed.

To also reduce the complexity of predicting with our multi-view kernel framework, our block-wise Nyström approximation is used again on the test kernel matrices K_s^{test} computed with the test examples. Let us recall that for each of our single-view kernels, we have an approximation K_s ≈ U_s U_s^⊤ = Q_s W_s^† Q_s^⊤. We choose Q_s^{test} to be the p first columns of the matrix K_s^{test}, and define the approximation for the test kernel to be

    K_s^{test} \approx Q_s^{test} W_s^\dagger Q_s^\top = Q_s^{test} W_s^{\dagger 1/2} U_s^\top.

In such an approximation, the error is in the last n − p columns of K_s^{test}. We gain in complexity, as if we were forced to use the test kernel as is, we would need to calculate A from Ã in O(vn^3) operations.

Figure 2: Regression on the SARCOS dataset (first task). Left: normalized mean squared errors (the lower the better); middle: R2-score (the higher the better); right: running times; all as functions of the Nyström approximation level (note the logarithmic scales). The results for KRR are calculated without approximation and are shown as horizontal dashed lines. Results for view 2 and early fusion are worse than the others and fall outside the range of the two plots.

4 Experiments

Here we evaluate the proposed multi-view metric learning (MVML) method on real-world datasets and compare it to relevant methods. The chosen datasets are "pure" multi-view datasets, that is to say, the view division arises naturally from the data.

We perform two sets of experiments with two goals. First, we evaluate our method in a regression setting with a large range of Nyström approximation levels in order to understand the effect the approximation has on our algorithm. Secondly, we compare MVML to relevant state-of-the-art methods in classification. In both cases, we use non-multi-view methods to justify the multi-view approach. The methods we use in addition to our own are:

• MVML_Cov and MVML_I: we use pre-set kernels in our framework: MVML_Cov uses the kernel from [14], and MVML_I refers to the case where we have only one-view kernel matrices in the diagonal of the multi-view kernel.[2]
• lpMKL is an algorithm for learning weights for the MKL kernel [16]. We apply it to kernel regression.
• OKL [8, 9] is a kernel learning method for separable kernels.
• MLKR [29] is an algorithm for metric learning in the kernel setting.
• KRR and SVM: we use kernel ridge regression and support vector machines with one view as well as in early fusion (ef) and late fusion (lf) in order to validate the benefits of using multi-view methods.

We perform our experiments with Python, but for OKL and MLKR we use the MATLAB codes provided by the authors.[3] In MVML we set the weights uniformly to 1/v. For all the datasets we use Gaussian kernels, k(x, z) = exp(−‖x − z‖² / (2σ²)).

4.1 Effect of Nyström approximation

For our first experiment we consider the SARCOS dataset,[4] where the task is to map a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Here we present results for the first task.

The results with various levels of Nyström approximation - averaged over four approximations - from 1% to 100% of the data are shown in Figure 2. Regularization parameters were cross-validated over the values λ ∈ [1e-08, 10] and η ∈ [1e-04, 100]. The kernel parameter γ = 1/(2σ²) was fixed to 1/number of features as a trade-off between overfitting and underfitting. We used only 1000 data samples of the available 44484 in training (all 4449 in testing) to be feasibly able to show the effect of approximating the matrices at all levels, and wish to note that using more data samples with a moderate approximation level we can obtain a lower error than presented here: for example, with 2000 training samples and a Nyström approximation level of 8% we obtain an error of 0.3915. However, the main goal of our experiment was to see how our algorithm behaves with various Nyström approximation levels, and because of the high complexity of our algorithm trained on the full dataset without approximation, we performed this experiment with a low number of samples.

The lowest error was obtained with our MVMLsparse algorithm at the 8% Nyström approximation level. All the multi-view results seem to benefit from using the approximation. Indeed, approximating the kernel matrices can be seen as a form of regularization, and our results reflect that [23]. Overall, our MVML learning methods have a much higher computational cost with large Nyström parameters, as can be seen from Figure 2, rightmost plot. However, with the smaller approximation levels with which the methods are intended to be used, the computing time is competitive.

[2] Code for MVML is available at https://lives.lif.univ-mrs.fr/?page_id=12
[3] https://www.cs.cornell.edu/~kilian/code/code.html and https://github.com/cciliber/matMTL.
[4] http://www.gaussianprocess.org/gpml/data.
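The two kernel-width rules used in the experiments (γ fixed to 1/number of features for SARCOS, and σ set to the mean pairwise distance for the classification datasets, as described in Section 4.2) can be written as a small helper. This is our own sketch with assumed function names, not the released MVML code.

    import numpy as np

    def gaussian_gram(X, gamma=None):
        """Gaussian Gram matrix k(x, z) = exp(-||x - z||^2 / (2 sigma^2)) = exp(-gamma ||x - z||^2).

        If gamma is None, sigma is the mean pairwise distance of X (the rule used for the
        classification datasets); pass gamma = 1.0 / X.shape[1] to mimic the SARCOS setting.
        """
        sq = np.sum(X**2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
        if gamma is None:
            sigma = np.mean(np.sqrt(d2))           # (1 / n^2) * sum_ij ||x_i - x_j||
            gamma = 1.0 / (2 * sigma**2)
        return np.exp(-gamma * d2)

    # one Gram matrix per view, e.g. for the classification experiments:
    # grams = [gaussian_gram(X_view) for X_view in views]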

Table 1: Classification accuracies ± standard deviation. The number after the dataset indicates the level of approximation of the kernel. The results for efSVM classification for Flower17-dataset are missing as only similarity matrices for each view were provided. Last column reports the best result obtained when using only one view.

METHOD            MVML            MVMLsp.         MVML_Cov        MVML_I          lpMKL           OKL             MLKR
Flower17 (6%)     75.98 ± 2.62    75.71 ± 2.48    75.71 ± 2.19    76.03 ± 2.36    75.54 ± 2.61    68.73 ± 1.95    63.82 ± 2.51
Flower17 (12%)    77.89 ± 2.41    77.43 ± 2.44    77.30 ± 2.36    78.36 ± 2.52    77.87 ± 2.52    75.19 ± 1.97    64.41 ± 2.41
Flower17 (24%)    78.60 ± 1.41    78.60 ± 1.36    79.00 ± 1.75    79.19 ± 1.51    78.75 ± 1.58    76.76 ± 1.62    65.44 ± 1.36
uWaveG. (6%)      92.67 ± 0.21    92.68 ± 0.17    92.34 ± 0.20    92.34 ± 0.19    92.34 ± 0.18    70.09 ± 1.07    71.09 ± 0.94
uWaveG. (12%)     93.03 ± 0.11    92.86 ± 0.26    92.53 ± 0.18    92.59 ± 0.13    92.48 ± 0.21    74.07 ± 0.26    80.22 ± 0.38
uWaveG. (24%)     92.59 ± 0.99    93.26 ± 0.15    92.66 ± 0.05    93.10 ± 0.11    92.85 ± 0.13    76.65 ± 0.33    86.38 ± 0.31

METHOD            efSVM           lfSVM           1 view SVM
Flower17 (6%)     -               15.32 ± 1.94    11.59 ± 1.54
Flower17 (12%)    -               23.82 ± 2.38    15.74 ± 1.54
Flower17 (24%)    -               38.24 ± 2.31    22.79 ± 0.79
uWaveG. (6%)      80.00 ± 0.74    71.24 ± 0.41    56.54 ± 0.38
uWaveG. (12%)     82.29 ± 0.63    72.53 ± 0.16    57.50 ± 0.17
uWaveG. (24%)     84.07 ± 0.23    72.99 ± 0.06    58.01 ± 0.05

4.2 Classification results

In our classification experiments we use two real-world multi-view datasets: Flower17[5] (7 views, 17 classes, 80 samples per class) and uWaveGesture[6] (3 views, 8 classes, 896 data samples for training and 3582 samples for testing). We set the kernel parameter to be the mean of distances, σ = (1/n²) Σ_{i,j=1}^n ‖x_i − x_j‖. The regularization parameters were obtained by cross-validation over the values λ ∈ [1e-08, 10] and η ∈ [1e-03, 100]. The results are averaged over four approximations.

[5] http://www.robots.ox.ac.uk/~vgg/data/flowers/17.
[6] http://www.cs.ucr.edu/~eamonn/time_series_data.

We adopted a one-vs-all classification approach for multi-class classification. The results are displayed in Table 1. The MVML results are always notably better than the SVM results, or the results obtained with OKL or MLKR. Compared to MVML, the OKL and MLKR accuracies decrease more with low approximation levels. We can see that all MVML methods perform very similarly; sometimes the best result is obtained with a fixed multi-view kernel, sometimes when A is learned.

As an example of our sparse output with MVML, we note that running the algorithm on the Flower17 dataset with 12% approximation often resulted in an spd matrix as in Figure 3. Indeed, the resulting sparsity is very interesting and tells us about the importance of the views and their interactions.

Figure 3: An example of a learned Ã with MVMLsparse from the Flower17 (12%) experiments.

5 Conclusion

We have introduced a general class of matrix-valued multi-view kernels for which we have presented two methods for simultaneously learning a multi-view function and a metric in vector-valued kernel spaces. We provided iterative algorithms for the two resulting optimization problems, and have been able to significantly lower the high computational cost associated with kernel methods by introducing a block-wise Nyström approximation. We have illustrated the behaviour of our approach on a simple dataset which reflects the objective of learning the within-view and between-view correlation metrics. The performance of our approach was illustrated in experiments on real multi-view datasets by comparing our method to standard multi-view approaches, as well as to methods for metric learning and kernel learning. Our sparse method is especially promising in the sense that it could give us information about the importance of the views. It would be interesting to investigate the applicability of our framework to problems involving missing data in views, as well as its generalization properties with the Nyström approximation. We would also like to continue investigating the theoretical properties of our sparse algorithm in order to prove the positiveness of the learned metric matrix that we observed experimentally.

Acknowledgements

We thank the anonymous reviewers for their relevant and helpful comments. This work was supported by the Lives Project (ANR-15-CE23-0026).

References

[1] Mauricio A. Alvarez, Lorenzo Rosasco, Neil D. Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

[2] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[3] Aurélien Bellet, Amaury Habrard, and Marc Sebban. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1):1–151, 2015.

[4] Avrim B. Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In 11th Annual Conference on Computational Learning Theory (COLT), pages 92–100, 1998.

[5] Céline Brouard, Marie Szafranski, and Florence d'Alché-Buc. Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research, 17(176):1–48, 2016.

[6] Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. Universal multi-task kernels. Journal of Machine Learning Research, 9(Jul):1615–1646, 2008.

[7] Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanita. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 08(01):19–61, 2010.

[8] Carlo Ciliberto, Youssef Mroueh, Tomaso Poggio, and Lorenzo Rosasco. Convex learning of multiple tasks and their structure. In International Conference on Machine Learning (ICML), 2015.

[9] Francesco Dinuzzo, Cheng S. Ong, Gianluigi Pillonetto, and Peter V. Gehler. Learning output kernels with block coordinate descent. In International Conference on Machine Learning (ICML), pages 49–56, 2011.

[10] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

[11] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.

[12] Matthias Günther and Lutz Klotz. Schur's theorem for a block Hadamard product. Linear Algebra and its Applications, 437(3):948–956, 2012.

[13] Pratik Jawanpuria, Maksim Lapin, Matthias Hein, and Bernt Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems (NIPS), pages 1189–1197, 2015.

[14] Hachem Kadri, Stéphane Ayache, Cécile Capponi, Sokol Koço, François-Xavier Dupé, and Emilie Morvant. The multi-task learning view of multimodal data. In Asian Conference on Machine Learning (ACML), pages 261–276, 2013.

[15] Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, Alain Rakotomamonjy, and Julien Audiffren. Operator-valued kernels for learning from functional response data. Journal of Machine Learning Research, 16:1–54, 2016.

[16] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lp-norm multiple kernel learning. Journal of Machine Learning Research, 12(Mar):953–997, 2011.

[17] Néhémy Lim, Florence d'Alché-Buc, Cédric Auliac, and George Michailidis. Operator-valued kernel-based vector autoregressive models for network inference. Machine Learning, 99(3):489–513, 2015.

[18] Andreas Maurer. The Rademacher complexity of linear transformation classes. In International Conference on Computational Learning Theory (COLT), pages 65–78, 2006.

[19] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.

[20] Ha Quang Minh, Loris Bazzani, and Vittorio Murino. A unifying framework for vector-valued manifold regularization and multi-view learning. In International Conference on Machine Learning (ICML), 2013.

[21] Ha Quang Minh, Loris Bazzani, and Vittorio Murino. A unifying framework in vector-valued reproducing kernel Hilbert spaces for manifold regularization and co-regularized multi-view learning. Journal of Machine Learning Research, 17(25):1–72, 2016.

[22] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[23] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems (NIPS), pages 1657–1665, 2015.

[24] Maxime Sangnier, Olivier Fercoq, and Florence d'Alché-Buc. Joint quantile regression in vector-valued RKHSs. In Advances in Neural Information Processing Systems (NIPS), pages 3693–3701, 2016.

[25] Bernhard Schölkopf, Jason Weston, Eleazar Eskin, Christina Leslie, and William Stafford Noble. A kernel approach for learning from almost orthogonal patterns. In European Conference on Machine Learning, pages 511–528. Springer, 2002.

[26] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views, pages 74–79, 2005.

[27] Vikas Sindhwani, Minh Ha Quang, and Aurélie C. Lozano. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and Granger causality. In Uncertainty in Artificial Intelligence (UAI), 2012.

[28] Vikas Sindhwani and David S. Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In International Conference on Machine Learning (ICML), pages 976–983, 2008.

[29] Kilian Q. Weinberger and Gerald Tesauro. Metric learning for kernel regression. In Artificial Intelligence and Statistics, pages 612–619, 2007.

[30] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 682–688, 2001.

[31] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

[32] Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Uncertainty in Artificial Intelligence (UAI), 2012.

Multi-view Metric Learning in Vector-valued Kernel Spaces: Supplementary Material


A Appendix

A.1 MVML optimization

Here we go through the derivations of the solutions g, A and w for our optimization problem. The presented derivations are for the case without Nyström approximation; however, the derivations with Nyström approximation are done in exactly the same way.

Solving for g and w

Let us first focus on the case where A and w are fixed and we solve for g. We calculate the derivative of the expression in Equation (7):

    \frac{d}{dg} \left[ \|y - (w^\top \otimes I_n)\mathbf{H}g\|^2 + \lambda \langle g, \mathbf{A}^\dagger g \rangle \right]
      = \frac{d}{dg} \left[ \langle y, y \rangle - 2\langle y, (w^\top \otimes I_n)\mathbf{H}g \rangle + \langle (w^\top \otimes I_n)\mathbf{H}g, (w^\top \otimes I_n)\mathbf{H}g \rangle + \lambda \langle g, \mathbf{A}^\dagger g \rangle \right]
      = -2\mathbf{H}(w^\top \otimes I_n)^\top y + 2\mathbf{H}(w^\top \otimes I_n)^\top (w^\top \otimes I_n)\mathbf{H}g + 2\lambda \mathbf{A}^\dagger g.

By setting this to zero we obtain the solution

    g = \big(\mathbf{H}(w^\top \otimes I_n)^\top (w^\top \otimes I_n)\mathbf{H} + \lambda \mathbf{A}^\dagger\big)^{-1} \mathbf{H}(w^\top \otimes I_n)^\top y.

As for w, when A and g are fixed we need only consider optimizing

    \min_w \|y - (w^\top \otimes I_n)\mathbf{H}g\|^2.    (19)

If we denote by Z ∈ R^{n×v} the matrix obtained by reshaping Hg, taking the elements of the vector and arranging them onto the columns of Z, we obtain the following form:

    \min_w \|y - Zw\|^2.    (20)

One can easily see by taking the derivative and setting it to zero that the solution for this is

    w = (Z^\top Z)^{-1} Z^\top y.    (21)

Solving for A in (6)

When we consider g (and w) to be fixed in the MVML framework (6), for A we have the following minimization problem:

    \min_{\mathbf{A}} \lambda \langle g, \mathbf{A}^\dagger g \rangle + \eta \|\mathbf{A}\|_F^2.

Differentiating this with respect to A gives us

    \frac{d}{d\mathbf{A}} \left[ \lambda \langle g, \mathbf{A}^\dagger g \rangle + \eta \|\mathbf{A}\|_F^2 \right]
      = \frac{d}{d\mathbf{A}} \left[ \lambda \langle g, \mathbf{A}^\dagger g \rangle + \eta\, \mathrm{tr}(\mathbf{A}\mathbf{A}) \right]
      = -\lambda \mathbf{A}^\dagger g g^\top \mathbf{A}^\dagger + 2\eta \mathbf{A},

where the derivative of the first term follows from the Matrix Cookbook (Equation 61): https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf. Thus the gradient descent step will be

    \mathbf{A}^{k+1} = (1 - 2\mu\eta)\, \mathbf{A}^k + \mu\lambda\, (\mathbf{A}^k)^\dagger g g^\top (\mathbf{A}^k)^\dagger

when moving in the direction of the negative gradient with step size μ.

Solving for A in (11)

To solve for A from Equation (11) we use proximal minimization. Let us recall the optimization problem after the change of variable:

    \min_{\mathbf{A}, g, w} \|y - (w^\top \otimes I_n)\mathbf{H}g\|^2 + \lambda \langle g, \mathbf{A}^\dagger g \rangle + \eta \sum_{\gamma \in \mathcal{G}} \|\mathbf{A}_\gamma\|_F,

and denote

    h(\mathbf{A}) = \lambda \langle g, \mathbf{A}^\dagger g \rangle \quad \text{and} \quad \Omega(\mathbf{A}) = \eta \sum_{\gamma \in \mathcal{G}} \|\mathbf{A}_\gamma\|_F

for the two terms in our optimization problem that contain the matrix A.

Without going into the detailed theory of proximal operators and proximal minimization, we remark that the proximal minimization algorithm update takes the form

    \mathbf{A}^{k+1} = \mathrm{prox}_{\mu^k \Omega}\big(\mathbf{A}^k - \mu^k \nabla h(\mathbf{A}^k)\big).

It is well known that in the traditional group-lasso situation the proximal operator is

    [\mathrm{prox}_{\mu^k \Omega}(z)]_\gamma = \left(1 - \frac{\eta}{\|z_\gamma\|_2}\right)_+ z_\gamma,

where z is a vector and (·)_+ denotes the maximum of zero and the value inside the brackets. In our case we are solving for a matrix, but due to the equivalence of the Frobenius norm to the vector 2-norm we can use this exact same operator. Thus we get as the proximal update

    [\mathbf{A}^{k+1}]_\gamma = \left(1 - \frac{\eta}{\|[\mathbf{A}^k - \mu^k \nabla h(\mathbf{A}^k)]_\gamma\|_F}\right)_+ [\mathbf{A}^k - \mu^k \nabla h(\mathbf{A}^k)]_\gamma,

where

    \nabla h(\mathbf{A}^k) = -\lambda (\mathbf{A}^k)^{-1} g g^\top (\mathbf{A}^k)^{-1}.

We can see from the update formula and the derivative that if A^k is a positive matrix, the update without the block multiplication, A^k − μ^k ∇h(A^k), will be positive, too. This is unfortunately not enough to guarantee the general positivity of A^{k+1}. However, we note that it is, indeed, positive if it is block-diagonal, and in general whenever the matrix of multipliers

    \alpha_{st} = \left(1 - \frac{\eta}{\|[\mathbf{A}^k - \mu^k \nabla h(\mathbf{A}^k)]_{st}\|_2}\right)_+

is positive, then A^{k+1} is, too (see [12] for reference - this is a blockwise Hadamard product where the blocks commute).

A.2 Proof of Theorem 1

Theorem 1. Let H be a vector-valued RKHS associated with the multi-view kernel K defined by Equation 4. Consider the hypothesis class H_λ = {x ↦ f_{u,A}(x) = Γ_A(x)*u : A ∈ Δ, ‖u‖_H ≤ β}, with Δ = {A : A ⪰ 0, ‖A‖_F ≤ α}. The empirical Rademacher complexity of H_λ can be upper bounded as follows:

    \hat{R}_n(\mathcal{H}_\lambda) \le \frac{\beta \sqrt{\alpha \|q\|_1}}{n},

where q = (tr(K_l^2))_{l=1}^v, and K_l is the Gram matrix computed from the training set {x_1, ..., x_n} with the kernel k_l defined on the view l. For kernels k_l such that tr(K_l^2) ≤ τn, we have

    \hat{R}_n(\mathcal{H}_\lambda) \le \beta \sqrt{\frac{\alpha \tau v}{n}}.

Proof. We start by recalling that the feature map associated to the operator-valued kernel K is the mapping Γ: X → L(Y, H), where X is the input space, Y = R^v, and L(Y, H) is the set of bounded linear operators from Y to H (see, e.g., [19, 7] for more details). It is known that K(x, z) = Γ(x)*Γ(z). We denote by Γ_A the feature map associated to our multi-view kernel (Equation 4). We also define the matrix Σ = (σ_i)_{i=1}^n ∈ R^{nv}. Then

    \hat{R}_n(\mathcal{H}_\lambda)
      = \frac{1}{n} \mathbb{E}\left[ \sup_{f \in \mathcal{H}} \sup_{\mathbf{A} \in \Delta} \sum_{i=1}^n \sigma_i^\top f_{u,\mathbf{A}}(x_i) \right]
      = \frac{1}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \sup_{u} \sum_{i=1}^n \langle \sigma_i, \Gamma_{\mathbf{A}}(x_i)^* u \rangle_{\mathbb{R}^v} \right]
      = \frac{1}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \sup_{u} \Big\langle \sum_{i=1}^n \Gamma_{\mathbf{A}}(x_i)\sigma_i, u \Big\rangle_{\mathcal{H}} \right]    (1)
      \le \frac{\beta}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \Big\| \sum_{i=1}^n \Gamma_{\mathbf{A}}(x_i)\sigma_i \Big\|_{\mathcal{H}} \right]    (2)
      = \frac{\beta}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \Big( \sum_{i,j=1}^n \langle \sigma_i, K_{\mathbf{A}}(x_i, x_j)\sigma_j \rangle_{\mathbb{R}^v} \Big)^{1/2} \right]    (3)
      = \frac{\beta}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \big( \langle \Sigma, \mathbf{K}_{\mathbf{A}} \Sigma \rangle_{\mathbb{R}^{nv}} \big)^{1/2} \right]
      = \frac{\beta}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \langle \Sigma, \mathbf{H}\mathbf{A}\mathbf{H}\Sigma \rangle^{1/2} \right]
      = \frac{\beta}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \mathrm{tr}(\mathbf{H}\Sigma\Sigma^\top \mathbf{H}\mathbf{A})^{1/2} \right]
      \le \frac{\beta}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \mathrm{tr}\big([\mathbf{H}\Sigma\Sigma^\top \mathbf{H}]^2\big)^{1/4}\, \mathrm{tr}(\mathbf{A}^2)^{1/4} \right]    (4)
      \le \frac{\beta}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \mathrm{tr}(\mathbf{H}^2\Sigma\Sigma^\top)^{1/2}\, \mathrm{tr}(\mathbf{A}^2)^{1/4} \right]
      \le \frac{\beta\sqrt{\alpha}}{n} \mathbb{E}\left[ \sup_{\mathbf{A}} \mathrm{tr}(\mathbf{H}^2\Sigma\Sigma^\top)^{1/2} \right]
      = \frac{\beta\sqrt{\alpha}}{n} \mathbb{E}\left[ \mathrm{tr}(\mathbf{H}^2\Sigma\Sigma^\top)^{1/2} \right]
      \le \frac{\beta\sqrt{\alpha}}{n} \left( \mathbb{E}\left[ \mathrm{tr}(\mathbf{H}^2\Sigma\Sigma^\top) \right] \right)^{1/2}    (5)
      = \frac{\beta\sqrt{\alpha}}{n} \left( \mathrm{tr}\big(\mathbf{H}^2\, \mathbb{E}(\Sigma\Sigma^\top)\big) \right)^{1/2}
      = \frac{\beta\sqrt{\alpha}}{n} \sqrt{\|(\mathrm{tr}(K_1^2), \ldots, \mathrm{tr}(K_v^2))\|_1}.

Here (1) and (3) are obtained with the reproducing property, (2) and (4) with the Cauchy-Schwarz inequality, and (5) with Jensen's inequality. The last equality follows from the fact that tr(H^2) = \sum_{l=1}^v tr(K_l^2). For kernels k_l that satisfy tr(K_l^2) ≤ τn, l = 1, ..., v, we obtain that

    \hat{R}_n(\mathcal{H}_\lambda) \le \beta \sqrt{\frac{\alpha \tau v}{n}}.
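The group soft-thresholding derived above translates directly into code. The sketch below (our own NumPy illustration with assumed names; a step-size factor can be folded into η, and the matrix inverse is replaced by a pseudo-inverse) performs one proximal-gradient step of Equation (12) using the view-combination groups described in Section 3.5.

    import numpy as np

    def make_view_groups(n, v):
        """One group per view combination (l, m), l <= m: the n x n block A_lm and,
        for l != m, also the symmetric block A_ml (both blocks shrink together)."""
        idx = lambda l: np.arange(l * n, (l + 1) * n)
        groups = []
        for l in range(v):
            for m in range(l, v):
                blocks = [(idx(l), idx(m))] if l == m else [(idx(l), idx(m)), (idx(m), idx(l))]
                groups.append(blocks)
        return groups

    def prox_group_step(A, g, groups, lam, eta, mu):
        """One step of Equation (12): gradient step on h(A), then group soft-thresholding."""
        A_pinv = np.linalg.pinv(A)
        grad_h = -lam * (A_pinv @ np.outer(g, g) @ A_pinv)      # nabla h(A)
        B = A - mu * grad_h                                     # gradient step
        A_new = np.zeros_like(A)
        for blocks in groups:
            norm = np.sqrt(sum(np.linalg.norm(B[np.ix_(r, c)])**2 for r, c in blocks))
            shrink = max(0.0, 1.0 - eta / norm) if norm > 0 else 0.0   # (1 - eta/||.||)_+
            for r, c in blocks:
                A_new[np.ix_(r, c)] = shrink * B[np.ix_(r, c)]
        return A_new

With three views this produces the six groups mentioned in Section 3.5; a group whose shrinkage factor reaches zero removes the corresponding view combination from the learned kernel.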