
MultiView Diffusion Maps

Ofir Lindenbaum^1 · Arie Yeredor^1 · Moshe Salhov^2 · Amir Averbuch^2

^1 School of Electrical Engineering, Tel Aviv University, Israel
^2 School of Computer Science, Tel Aviv University, Israel

Received: date / Accepted: date

Abstract In this paper, a reduced-dimensionality representation is learned from multiple views of the processed data. Such multiple views can be obtained, for example, when the same underlying process is observed using several different modalities, or measured with different instrumentation. The goal is to effectively utilize the availability of such multiple views for various purposes, such as non-linear embedding, manifold learning, spectral clustering, anomaly detection and non-linear system identification. The proposed method, which is called multiview, exploits the intrinsic relations within each view as well as the mutual relations between views. This is achieved by defining a cross-view model in which an implied diffusion process between objects is restrained to hop between the different views. The multiview method is robust to scaling and is insensitive to small structural changes in the data. Within this framework, new diffusion distances are defined and the spectra of the implied kernels are analyzed. The applicability of the multiview approach is demonstrated for clustering, classification and manifold learning using both artificial and real data.

1 Introduction

High dimensional big data exist in various fields, and analyzing them directly is difficult. Extracted features are useful for analyzing such datasets, but some prior knowledge or modeling is required in order to identify the essential features. Other methods are purely unsupervised, aiming to find a low dimensional representation that is based on the intrinsic geometry of the analyzed dataset, which includes the connectivities among the multidimensional data points within the dataset. A "good" dimensionality reduction methodology reduces the complexity of data processing while preserving the coherency of the original data, such that clustering, classification, manifold learning and many other data analysis tasks can be applied effectively in the reduced space. Many methods, such as Principal Component Analysis (PCA) [1], Multidimensional Scaling (MDS) [2], Local Linear Embedding [3], Laplacian Eigenmaps [4], Diffusion Maps (DM) [5] and more, have been proposed to achieve dimensionality reduction that preserves the data coherency. Exploiting the low dimensional representation yields various applications, such as face recognition based on Laplacian Eigenmaps [4], non-linear independent component analysis with DM [6], musical key extraction using DM [7], and many more. The DM framework extends and enhances ideas from other methods by utilizing a stochastic Markov matrix, based on local affinities between multidimensional data points, to identify a lower dimensional representation for the data. None of the mentioned methods considers the possibility of having more than one view to represent the same process. An additional view can provide meaningful insight regarding the dynamical process that has generated and governed the data.

In this paper, we consider learning from data that is observed via multiple views. The goal is to effectively utilize the multiple views for tasks such as non-linear embedding, multi-view manifold learning, spectral clustering, anomaly detection and non-linear system identification, to achieve better analysis of high dimensional big data. Most dimensionality reduction methods suggest concatenating the datasets into a single vector space. However, this methodology is sensitive to the scaling of each data component. It does not utilize, for example, the fact that the noise in the different datasets could be uncorrelated, and it assumes that all datasets lie in one high dimensional space, which is not always true.

The problem of learning from two views has been studied in the field of spectral clustering. Most of these studies have focused on classification and clustering based on spectral characteristics of the data while using two or more sampled views. Some approaches that address this problem are the Bilinear Model [8], Partial Least Squares [9] and Canonical Correlation Analysis [10]. These methods are powerful for learning the relation between different views, but they provide neither separate nor combined insights into the low dimensional geometry or structure of each view. Recently, a few kernel based methods (e.g., [11]) proposed a model of co-regularizing kernels in both views in a way that resembles joint diagonalization. It is done by searching for an orthogonal transformation that maximizes the diagonal terms of the kernel matrices obtained from all views. A penalty term, which incorporates the disagreement between clusters from the views, is added, and the algorithm is based on an alternating maximization procedure. A mixture of Markov chains is proposed in [12] to model multiple views in order to apply spectral clustering. It deals with two cases from graph theory, directed and undirected graphs, where the second case is related to our work. This approach reduces the undirected graph problem to an averaging of Markov chains, where each chain is constructed separately within its view. A way to incorporate multiple given metrics for the same data using a cross diffusion process is described in [13]. They define a new diffusion distance which is useful for classification, clustering or retrieval tasks; however, the proposed process is not symmetric and thus does not allow computing an embedding. An iterative algorithm for spectral clustering is proposed in [14], where the idea is to iteratively modify each view using the representation of the other view. The problem of two manifolds derived from the same data (i.e., two views) is described in [15]. This approach is similar to Canonical Correlation Analysis [16], which seeks a linear transformation that maximizes the correlation among the views, and it demonstrates the power of this method in canceling uncorrelated noise present in both views. Furthermore, [15] applies its method to a non-linear system identification task. A similar approach is proposed in [17]: it suggests modeling the data using a bipartite graph and then partitioning the dataset based on a 'minimum-disagreement' algorithm. This approach attempts to minimize the clusters' disagreement between the two views.

In this work, we present a framework based on the construction in [17] and show that this approach is a special case of a more general diffusion based process. We build and analyze a new framework that generalizes the random walk model while utilizing multiple views. Our proposed method utilizes the intrinsic relations within each view as well as the mutual relations between views. The multiview construction is achieved by defining a cross diffusion process in which a specially structured random walk is imposed between the various views. Within this framework, new diffusion distances are defined, the spectra of the new kernels are analyzed, and we compute the infinitesimal generator to which the multiview kernel converges. The constructed multiview kernel matrix is similar to a symmetric matrix, which guarantees real eigenvalues and eigenvectors; this property enables us to define a multiview embedding. The advantages of the proposed method for manifold learning and spectral clustering are explored through extensive experiments.

The paper has the following structure: background is given in Section 2. Section 3 presents and analyzes the multiview framework. Section 5 studies the asymptotic properties of the proposed kernel $\widehat{K}$ (Eq. (5)). Section 7 presents the experimental results.

2 Background

2.1 General dimensionality reduction framework

Consider a high dimensional dataset $X = \{x_1, x_2, x_3, \dots, x_M\} \in \mathbb{R}^{M \times N}$, $x_i \in \mathbb{R}^{N}$, $i = 1, \dots, M$. The goal is to find a low dimensional representation $Z = \{z_1, z_2, z_3, \dots, z_M\} \in \mathbb{R}^{M \times S}$, $z_i \in \mathbb{R}^{S}$, $i = 1, \dots, M$, such that $S \ll N$ and the local connectivities among the multidimensional data points are preserved. This problem setup is based on the assumption that the data is represented (viewed) by a single vector space (single view).

2.2 Diffusion Maps (DM)

DM [5] is a dimensionality reduction method that finds the intrinsic geometry in the data. This framework is highly effective when the data is densely sampled from some low dimensional manifold that is not linear. Given a high dimensional dataset $X$, the DM framework consists of the following steps:
1. A kernel function $\mathcal{K} : X \times X \to \mathbb{R}$ is chosen. It is represented by a matrix $K \in \mathbb{R}^{M \times M}$ which satisfies, for all $(x_i, x_j) \in X$, the following properties: symmetry, $K_{i,j} = \mathcal{K}(x_i, x_j) = \mathcal{K}(x_j, x_i)$; positive semi-definiteness, $v_i^T K v_i \geq 0$ for all $v_i \in \mathbb{R}^M$; and non-negativity, $\mathcal{K}(x_i, x_j) \geq 0$. These properties guarantee that the matrix $K$ has real eigenvectors and non-negative real eigenvalues. A Gaussian kernel is a common example, where $K_{i,j} = \exp\left\{-\frac{\|x_i - x_j\|^2}{2\sigma_x^2}\right\}$, with the $L_2$ norm as the affinity measure between two data vectors;
2. By normalizing the kernel using $D$, where $D_{i,i} = \sum_j K_{i,j}$, we compute the following matrix elements:

$$P^x_{i,j} = P(x_i, x_j) = [D^{-1}K]_{i,j}. \qquad (1)$$

The resulting matrix $P^x \in \mathbb{R}^{M \times M}$ can be viewed as the transition kernel of a (fictitious) Markov chain on $X$, such that the expression $[(P^x)^t]_{i,j} = p_t(x_i, x_j)$ describes the transition probability from point $x_i$ to point $x_j$ in $t$ steps.
3. Spectral decomposition is applied to the matrix $P^x$, or to one of its powers $(P^x)^t$, to obtain a sequence of eigenvalues $\{\lambda_m\}$ and normalized eigenvectors $\{\psi_m\}$ satisfying $P^x \psi_m = \lambda_m \psi_m$, $m = 0, \dots, M-1$;
4. Define a new representation for the dataset $X$:

$$\Psi_t(x_i) : x_i \mapsto \left[\lambda_1^t \psi_1(i), \lambda_2^t \psi_2(i), \lambda_3^t \psi_3(i), \dots, \lambda_{M-1}^t \psi_{M-1}(i)\right]^T \in \mathbb{R}^{M-1}, \qquad (2)$$

where $t$ is the selected number of steps and $\psi_m(i)$ denotes the $i$-th element of $\psi_m$. The main idea behind this representation is that the Euclidean distance between two data points in the new representation is equal to the weighted $L_2$ distance between the conditional probabilities $p_t(x_i, :)$ and $p_t(x_j, :)$, $i, j = 1, \dots, M$ (the $i$-th and $j$-th rows of $(P^x)^t$). The following is referred to as the diffusion distance:

$$D_t^2(x_i, x_j) = \|\Psi_t(x_i) - \Psi_t(x_j)\|^2 = \sum_{m \geq 1} \lambda_m^{2t}\left(\psi_m(i) - \psi_m(j)\right)^2 = \|p_t(x_i, :) - p_t(x_j, :)\|^2_{W^{-1}}, \qquad (3)$$

where $W$ is a diagonal matrix with elements $W_{i,i} = \phi_0(i) = \frac{D_{i,i}}{\sum_{i=1}^{M} D_{i,i}}$. This equality is proven in [5].
5. The desired accuracy $\delta \geq 0$ is chosen for the diffusion distance defined by Eq. (3), such that $s(\delta, t) = \max\{\ell \in \mathbb{N} \text{ such that } |\lambda_\ell|^t > \delta |\lambda_1|^t\}$. By using $\delta$, a new mapping of $s(\delta, t)$ dimensions is defined as
$$\Psi_t^{(\delta)} : X \to \left[\lambda_1^t \psi_1(i), \lambda_2^t \psi_2(i), \lambda_3^t \psi_3(i), \dots, \lambda_{s}^t \psi_{s}(i)\right]^T \in \mathbb{R}^{s(\delta,t)}.$$

This approach has been found useful in various fields. As previously noted, it is limited to a single view representation. A common extension of this approach to multiple views is to concatenate the data from all views and then apply the diffusion framework. This approach assumes orthogonality of the sampled dimensions, which is an unrealistic assumption in many cases. Furthermore, this approach can create redundancy in some dimensions and requires scaling each dimension separately, such that none is preferable over the others. Previous studies such as [21,23] apply the DM framework to each view separately and then incorporate the learned mappings from the various views. However, they do not exploit the mutual relations which might exist between the various views in order to create and utilize the correct mapping.
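For concreteness, the following is a minimal Python/NumPy sketch of the single view DM steps above (kernel, row normalization, spectral decomposition, mapping). The function and variable names are ours and the toy data is hypothetical; this is a sketch, not the authors' reference implementation.

```python
import numpy as np

def diffusion_maps(X, sigma, t=1, n_coords=2):
    """Minimal single-view Diffusion Maps sketch (steps 1-4 above)."""
    # Step 1: Gaussian kernel matrix K (symmetric, non-negative).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    # Step 2: row normalization P = D^{-1} K (Eq. (1)).
    d = K.sum(axis=1)
    # Step 3: spectral decomposition via the symmetric conjugate
    # D^{-1/2} K D^{-1/2}, which shares its eigenvalues with P.
    Ps = K / np.sqrt(np.outer(d, d))
    lam, V = np.linalg.eigh(Ps)
    order = np.argsort(-np.abs(lam))
    lam, V = lam[order], V[:, order]
    psi = V / np.sqrt(d)[:, None]          # right eigenvectors of P
    # Step 4: diffusion map, skipping the trivial first eigenvector.
    return (lam[1:n_coords + 1] ** t) * psi[:, 1:n_coords + 1]

# Usage: embed 500 noisy points sampled from a curve.
rng = np.random.default_rng(0)
s = rng.uniform(0, 3, 500)
X = np.c_[np.cos(2 * s), np.sin(2 * s), s] + 0.01 * rng.standard_normal((500, 3))
emb = diffusion_maps(X, sigma=0.5, t=1)
print(emb.shape)  # (500, 2)
```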

3 Multiview Dimensionality Reduction

Problem Formulation: Given multiple sets of observations $X^l$, $l = 1, \dots, L$, each view is a high dimensional dataset $X^l = \{x^l_1, x^l_2, x^l_3, \dots, x^l_M\} \in \mathbb{R}^{M \times N_l}$, where $N_l$ is the dimension of the $l$-th feature space. Note that a bijective correspondence between views is assumed. We seek lower dimensional representations that preserve the interactions between multidimensional data points within a given view $X^l$ and among the views $\{X^1, X^2, \dots, X^L\}$.

3.1 Multiview Diffusion Maps

We begin by generalizing the DM framework to handle a multiview scenario. Our goal is to impose a random walk model using the local connectivities between data points within all views. We generalize the DM framework by restraining the random walker to "hop" between views in each step. The construction requires choosing a symmetric positive semi-definite kernel for each view, $\mathcal{K}^l : X^l \times X^l \to \mathbb{R}$, $l = 1, \dots, L$; we use the Gaussian function

$$K^l_{i,j} = \exp\left\{-\frac{\|x^l_i - x^l_j\|^2}{2\sigma_l^2}\right\}, \qquad (4)$$

and the multiview kernel is then formed by the following matrix:

 1 2 1 3 1 L  0M×M K K K K ... K K 2 1 2 3 2 L K K 0M×M K K ... K K   3 1 3 2 3 L  Kc = K K K K 0M×M ... K K  . (5)    ::: ... :  L 1 L 2 L 3 K K K K K K ... 0M×M .

Finally, by using the diagonal matrix $\widehat{D}$, where $\widehat{D}_{i,i} = \sum_j \widehat{K}_{i,j}$, the normalized row-stochastic matrix is defined as

$$\widehat{P} = \widehat{D}^{-1}\widehat{K}, \qquad \widehat{P}_{i,j} = \frac{\widehat{K}_{i,j}}{\widehat{D}_{i,i}}, \qquad (6)$$

where the $(m, l)$ block is a square $M \times M$ matrix whose top-left corner is located at $[1 + (m-1)M,\ 1 + (l-1)M]$, $m, l = 1, \dots, L$. This block describes the probabilities of transition between view $X^m$ and view $X^l$.
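As an illustration, the following Python/NumPy sketch (our own naming, hypothetical random data) assembles the multiview kernel of Eq. (5) and its row-stochastic normalization of Eq. (6) from per-view Gaussian kernels (Eq. (4)).

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Per-view Gaussian kernel (Eq. (4))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def multiview_kernel(kernels):
    """Assemble K_hat (Eq. (5)) and P_hat (Eq. (6)) from per-view kernels."""
    L = len(kernels)
    M = kernels[0].shape[0]
    K_hat = np.zeros((L * M, L * M))
    for m in range(L):
        for l in range(L):
            if m != l:
                # (m, l) block: ordinary matrix product K^m K^l.
                K_hat[m * M:(m + 1) * M, l * M:(l + 1) * M] = kernels[m] @ kernels[l]
    d_hat = K_hat.sum(axis=1)
    P_hat = K_hat / d_hat[:, None]          # row-stochastic normalization
    return K_hat, P_hat

# Usage with two random views (hypothetical data).
rng = np.random.default_rng(1)
X1, X2 = rng.standard_normal((200, 5)), rng.standard_normal((200, 8))
K_hat, P_hat = multiview_kernel([gaussian_kernel(X1, 1.0), gaussian_kernel(X2, 1.0)])
print(np.allclose(P_hat.sum(axis=1), 1.0))  # True
```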

3.2 Alternative Multiview Approaches

In this section, we describe two additional methods for incorporating the different views. We do not analyze these approaches but use them as references for comparison in the experimental evaluations.
1. Kernel Product DM (KP): Multiplying the kernels element wise, $K^{\circ} \triangleq K^1 \circ K^2 \circ \dots \circ K^L$, i.e., $K^{\circ}_{i,j} = K^1_{i,j} \cdot K^2_{i,j} \cdots K^L_{i,j}$, and then normalizing by the sum of rows. The resulting row-stochastic matrix is

$$P^{\circ}_{i,j} = [(D^{\circ})^{-1} K^{\circ}]_{i,j}, \qquad (7)$$

where $D^{\circ}_{i,i} = \sum_j K^{\circ}_{i,j}$.

Lemma 1 In the special case of a Gaussian kernel with $\sigma_1 = \sigma_2 = \dots = \sigma_L$ in Eq. (4), the resulting matrix $K^{\circ}$ is equal to the matrix $K^w$ constructed using the concatenated vectors $w_i = [(x^1_i)^T, \dots, (x^L_i)^T]^T$, such that $K^w_{i,j} = \exp\left\{-\frac{\|w_i - w_j\|^2}{2\sigma_w^2}\right\}$. The scale should be set as

$$\sigma_w = \sqrt{\sum_{i=1}^{L} \sigma_i^2} = \sqrt{L} \cdot \sigma_1, \qquad (8)$$

where the last equality holds only for this special case.

This approach, which corresponds to [5], will be referred to as the Kernel Product DM in the following sections.
2. Kernel Sum DM (KS): Define the sum kernel $K^{+} \triangleq \sum_{l=1}^{L} K^l$ and normalize it by its sum of rows $D^{+}$ to compute

$$P^{+}_{i,j} = [(D^{+})^{-1} K^{+}]_{i,j}. \qquad (9)$$

This random walk sums the step probabilities from each view. This approach is proposed in [12].
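A minimal sketch of the two reference constructions (Eqs. (7) and (9)), assuming the per-view Gaussian kernels of Eq. (4) are available as a Python list `kernels`; the helper names are ours.

```python
import numpy as np
from functools import reduce

def kernel_product_dm(kernels):
    """Kernel Product DM (Eq. (7)): element-wise (Hadamard) product of the views."""
    K_prod = reduce(np.multiply, kernels)
    return K_prod / K_prod.sum(axis=1, keepdims=True)

def kernel_sum_dm(kernels):
    """Kernel Sum DM (Eq. (9)): sum of the per-view kernels."""
    K_sum = sum(kernels)
    return K_sum / K_sum.sum(axis=1, keepdims=True)
```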

3.3 Probabilistic interpretation of $\widehat{P}^t$

In our proposed construction (Eqs. (4), (5) and (6)), the entries $[\widehat{P}^t]_{i,j} = \hat{p}_t(x^1_i, x^1_j)$ denote, for each $i, j = 1, \dots, M$, the transition probability from node $x^1_i$ to node $x^1_j$ in $t$ time steps by "hopping" between the view $X^1$ and the views $X^l$, $l = 2, \dots, L$, in each time step. Note that due to the block-anti-diagonal structure of $\widehat{K}$ (and of $\widehat{P}$, Eq. (6)), this probability is zero for $t = 1$. However, for higher values of $t$, this probability is nonzero, describing a transition from view $X^1$ through any view $X^l$, $l = 2, \dots, L$, and back to $X^1$. In the same way, $[\widehat{P}^t]_{i+(l-1)M,\, j+(l-1)M} = \hat{p}_t(x^l_i, x^l_j)$ denotes the transition probability from node $x^l_i$ to node $x^l_j$ ($i, j = 1, \dots, M$) in $t$ time steps. Likewise, $[\widehat{P}^t]_{i+(l-1)M,\, j+(m-1)M} = \hat{p}_t(x^l_i, x^m_j)$ denotes the transition probability from node $x^l_i$ to node $x^m_j$ ($i, j = 1, \dots, M$, $l \neq m$) in $t$ time steps.

3.4 Multiview Diffusion Distance

For a variety of real data types, the Euclidean distance does not provide a sufficient indication of the intrinsic relations between data points. The Euclidean distance is highly sensitive to scaling and rotations of the multidimensional data points. Tasks such as classification, clustering or system identification require a measure of the intrinsic connectivity between data points; the Euclidean distance in the high dimensional ambient space satisfies this requirement only locally. The multiview diffusion kernel (defined in Section 3.1) captures all the small local connections between data points. The row-stochastic matrix $\widehat{P}^t$ (Eq. (6)) incorporates all the possibilities of a transition in $t$ time steps between data points that hop between the views. For a fixed value $t > 0$, two data points are intrinsically similar if the conditional distributions $\hat{p}_t(x_i, :) = [\widehat{P}^t]_{i,:}$ and $\hat{p}_t(x_j, :) = [\widehat{P}^t]_{j,:}$ are similar. This type of similarity indicates that the points $x_i$ and $x_j$ are similarly connected to several mutual points and are thus connected by a geometrical path. In many cases, a small Euclidean distance could be misleading, since two data points could be "close" without any geodesic path connecting them. The similarity between probabilities is more robust in these cases. Based on this observation, by expanding the single view construction given in [5], we define the weighted inner view diffusion distance for the first view as

$$D_t^2(x^1_i, x^1_j) = \sum_{k=1}^{L\cdot M} \frac{\left([\widehat{P}^t]_{i,k} - [\widehat{P}^t]_{j,k}\right)^2}{\phi_0(k)} = \|(e_i - e_j)^T \widehat{P}^t\|^2_{\widehat{D}^{-1}}, \qquad (10)$$

where $1 \leq i, j \leq M$, $e_i$ is the $i$-th column of an $L\cdot M \times L\cdot M$ identity matrix, $\phi_0$ is the first left eigenvector of $\widehat{P}$, whose $k$-th element is $\phi_0(k) = \widehat{D}_{k,k}$, and the weighted norm of $x$ is defined by $\|x\|^2_W = x^T W x$. Similarly, for the $l$-th view we define

$$D_t^2(x^l_i, x^l_j) = \sum_{k=1}^{L\cdot M} \frac{\left([\widehat{P}^t]_{\tilde{l}+i,k} - [\widehat{P}^t]_{\tilde{l}+j,k}\right)^2}{\phi_0(k)} = \|(e_{\tilde{l}+i} - e_{\tilde{l}+j})^T \widehat{P}^t\|^2_{\widehat{D}^{-1}}, \qquad (11)$$

where $\tilde{l} = (l-1) \cdot M$. The main advantage of these distances (Eqs. (10) and (11)) is that they can be expressed in terms of the eigenvalues and eigenvectors of the matrix $\widehat{P}$. This insight allows us to use a representation (defined in Section 3.5) in which the induced Euclidean distance is proportional to the diffusion distances defined in Eqs. (10) and (11).

Theorem 1 The inner view diffusion distance defined by Eqs. (10) and (11) is equal to

$$D_t^2(x^l_i, x^l_j) = \sum_{\ell=1}^{L\cdot M - 1} \lambda_\ell^{2t}\left(\psi_\ell(i + \tilde{l}) - \psi_\ell(j + \tilde{l})\right)^2, \quad i, j = 1, \dots, M, \qquad (12)$$

where again $\tilde{l} = (l-1) \cdot M$.

Proof We can express $\widehat{P}^t \widehat{D}^{-1} (\widehat{P}^t)^T$ as $\widehat{P}^t \widehat{D}^{-1} (\widehat{P}^t)^T = \Psi \Lambda^t \Phi^T \widehat{D}^{-1} \Phi \Lambda^t \Psi^T = \Psi \Lambda^{2t} \Psi^T$, since $\Phi^T \widehat{D}^{-1} \Phi = \Pi^T \Pi = I$. Therefore,
$$D_t^2(x^l_i, x^l_j) = \|(e_{\tilde{l}+i} - e_{\tilde{l}+j})^T \widehat{P}^t\|^2_{\widehat{D}^{-1}} = (e_{\tilde{l}+i} - e_{\tilde{l}+j})^T \widehat{P}^t \widehat{D}^{-1} (\widehat{P}^t)^T (e_{\tilde{l}+i} - e_{\tilde{l}+j})$$
$$= (e_{\tilde{l}+i} - e_{\tilde{l}+j})^T \Psi \Lambda^{2t} \Psi^T (e_{\tilde{l}+i} - e_{\tilde{l}+j}) = \sum_{\ell=0}^{L\cdot M-1} \lambda_\ell^{2t}\left(\psi_\ell(\tilde{l}+i) - \psi_\ell(\tilde{l}+j)\right)^2 = \sum_{\ell=1}^{L\cdot M-1} \lambda_\ell^{2t}\left(\psi_\ell(\tilde{l}+i) - \psi_\ell(\tilde{l}+j)\right)^2,$$
where $\ell = 0$ is excluded because $\psi_0 = \mathbf{1}$ (an all-ones vector), which holds for all stochastic matrices.
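The equality in Theorem 1 can also be checked numerically. The following self-contained Python/NumPy sketch (our own naming, hypothetical random data, two views so that $\tilde{l} = 0$ for the first view) compares the two sides of Eq. (12).

```python
import numpy as np

rng = np.random.default_rng(2)
M = 50
X1, X2 = rng.standard_normal((M, 4)), rng.standard_normal((M, 4))

def gauss(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

K1, K2 = gauss(X1), gauss(X2)
K_hat = np.block([[np.zeros((M, M)), K1 @ K2],
                  [K2 @ K1, np.zeros((M, M))]])       # Eq. (5), L = 2
d = K_hat.sum(axis=1)
P_hat = K_hat / d[:, None]                            # Eq. (6)

# Eigenvectors of P_hat via its symmetric conjugate D^{-1/2} K_hat D^{-1/2}.
lam, Pi = np.linalg.eigh(K_hat / np.sqrt(np.outer(d, d)))
order = np.argsort(-np.abs(lam))
lam, Psi = lam[order], (Pi / np.sqrt(d)[:, None])[:, order]

t, i, j = 2, 3, 7                                     # first view, l_tilde = 0
Pt = np.linalg.matrix_power(P_hat, t)
lhs = np.sum((Pt[i] - Pt[j]) ** 2 / d)                # Eq. (10), phi_0(k) = D_hat_{k,k}
rhs = np.sum(lam[1:] ** (2 * t) * (Psi[i, 1:] - Psi[j, 1:]) ** 2)  # Eq. (12)
print(np.isclose(lhs, rhs))  # expected: True
```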

3.5 Multiview data parametrization

Tasks such as classification, clustering or regression in a high dimensional sampled feature space are computationally expensive. In addition, the performance of these tasks depends strongly on the distance measure and, as explained in Section 3.4, distance measures in the original ambient space are meaningless in many real life situations. Interpreting Theorem 1 in terms of Euclidean distance enables us to define mappings for every view $X^l$, $l = 1, \dots, L$, using the right eigenvectors of $\widehat{P}$ (Eq. (6)) weighted by $\lambda_m^t$. The representation for instances in $X^l$ is given by

$$\widehat{\Psi}_t(x^l_i) : x^l_i \mapsto \left[\lambda_1^t \psi_1(i + \bar{l}), \dots, \lambda_{M-1}^t \psi_{M-1}(i + \bar{l})\right]^T \in \mathbb{R}^{M-1}, \qquad (13)$$

where $\bar{l} = (l-1) \cdot M$. These $L$ mappings capture the intrinsic geometry of the views as well as the mutual relations between them. As shown in [25], the set of eigenvalues $\{\lambda_m\}$ has a decaying property such that $1 = |\lambda_0| \geq |\lambda_1| \geq \dots \geq |\lambda_{M-1}|$. Exploiting the decaying property enables us to represent the data up to a dimension $r$, where $r \ll N_1, \dots, N_L$. The dimension $r \equiv r(\delta)$ is determined by approximating the diffusion distance (Eq. (12)) up to a desired accuracy $\delta$. This argument is expanded in Section 4.4. The reduced dimension version of $\widehat{\Psi}_t(X)$ is denoted by $\widehat{\Psi}_t^r(X)$. Using the inner view diffusion distances defined in Eqs. (10) and (11), we define a multiview diffusion distance as a linear combination of the inner view distances such that
$$D_t^{(MV)2}(i, j) = \sum_{l=1}^{L} \|\widehat{\Psi}_t^r(x^l_i) - \widehat{\Psi}_t^r(x^l_j)\|^2. \qquad (14)$$

This distance is the induced Euclidean distance in a space constructed from the concatenation of all low dimensional multiview mappings

$$\widehat{\Psi}_t(X) = [\widehat{\Psi}_t^r(X^1), \widehat{\Psi}_t^r(X^2), \dots, \widehat{\Psi}_t^r(X^L)]. \qquad (15)$$

This mapping will be used in the experimental results for clustering tasks.
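The following Python/NumPy sketch (our naming, assuming the per-view Gaussian kernels are given as a list) computes the per-view mappings of Eq. (13) and their concatenation of Eq. (15) from the eigendecomposition of the symmetric conjugate of $\widehat{P}$; it is a sketch under these assumptions, not the authors' reference implementation.

```python
import numpy as np

def multiview_embedding(kernels, t=1, r=5):
    """Per-view multiview DM embeddings (Eq. (13)) and their concatenation (Eq. (15))."""
    L, M = len(kernels), kernels[0].shape[0]
    K_hat = np.zeros((L * M, L * M))
    for m in range(L):
        for l in range(L):
            if m != l:
                K_hat[m * M:(m + 1) * M, l * M:(l + 1) * M] = kernels[m] @ kernels[l]
    d = K_hat.sum(axis=1)
    # The symmetric conjugate shares its eigenvalues with P_hat = D_hat^{-1} K_hat.
    lam, Pi = np.linalg.eigh(K_hat / np.sqrt(np.outer(d, d)))
    order = np.argsort(-np.abs(lam))
    lam, Psi = lam[order], (Pi / np.sqrt(d)[:, None])[:, order]
    # Eq. (13): drop the trivial eigenvector, keep r coordinates, weight by lambda^t.
    coords = (lam[1:r + 1] ** t) * Psi[:, 1:r + 1]
    per_view = [coords[l * M:(l + 1) * M, :] for l in range(L)]
    return per_view, np.hstack(per_view)      # Eq. (15): concatenated mapping
```

K-means can then be run on the concatenated mapping, as is done in the clustering experiments below.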

3.6 Multiview Kernel Bandwidth

When constructing the Gaussian kernels $K^l$, $l = 1, \dots, L$, in Eq. (4), one has to set the values of the scale (width) parameters $\sigma_l^2$. Setting the values too small may result in very small local neighborhoods that cannot capture the local structure around each point. On the contrary, setting the values too large may result in a fully connected graph that yields a coarse description of the data. In [22], a max-min measure is suggested, setting the scale to
$$\sigma_l^2 = C \cdot \max_{j}\left[\min_{i,\, i \neq j}\left(\|x^l_i - x^l_j\|^2\right)\right], \qquad (16)$$
where $C$ is set within the range $[1, 1.5]$. This approach attempts to set a small scale that maintains local connectivities. This single view approach can be relaxed in the multiview scenario. The multiview kernel $\widehat{K}$ (Eq. (5)) contains products of the single view kernel matrices $K^l$, $l = 1, \dots, L$ (Eq. (4)). The diagonal values of each kernel matrix $K^l$ are all 1's; therefore, a connectivity in only one view is sufficient. This insight suggests that a smaller value of the parameter $C$ could be used. Another scheme, proposed in [33], aims to find a range of valid values for $\sigma_l$: the idea is to compute the kernel $K^l$ (Eq. (4)) for various values of $\sigma$ and search for the range of values for which the Gaussian bell shape exists. The range is identified by applying a logarithmic function to the sum of the kernel. We expand this idea to the multiview scenario based on the following algorithm.

Algorithm 1 Multiview kernel bandwidth selection
Input: Multiple sets of observations $X^l$, $l = 1, \dots, L$.
Output: Scale parameters for all views $\{\sigma_1, \dots, \sigma_L\}$.
1: Compute the Gaussian kernels $K^l(\sigma_l)$, $l = 1, \dots, L$, for several values of $\sigma_l$.
2: Compute for all pairs $l \neq m$: $S^{lm}(\sigma_l, \sigma_m) = \sum_i \sum_j K^{lm}_{i,j}(\sigma_l, \sigma_m)$, where $K^{lm}(\sigma_l, \sigma_m) = K^l(\sigma_l) \cdot K^m(\sigma_m)$.
3: for $l = 1 : L$ do
4:   Find the minimal value of $\sigma_l$ such that $S^{lm}(\sigma_l, \sigma_m)$ departs from its lower asymptote for all $m \neq l$.
5: end for

Note that the two dimensional function $S(\sigma_l, \sigma_m)$ has two asymptotes: $S(\sigma_l, \sigma_m) \to \log(N)$ as $\sigma_l, \sigma_m \to 0$, and $S(\sigma_l, \sigma_m) \to \log(N^3) = 3\log(N)$ as $\sigma_l, \sigma_m \to \infty$, since for $\sigma_l, \sigma_m \to 0$ both $K^l$ and $K^m$ approach the identity matrix, and for $\sigma_l, \sigma_m \to \infty$ both $K^l$ and $K^m$ approach all-ones matrices. An example of the plot of $S(\sigma_l, \sigma_m)$ for two views ($L = 2$) is presented in Fig. 1.

Fig. 1: An example of the two dimensional function $S(\sigma_l, \sigma_m)$, and a slice at the first row ($\sigma_2 = 10^{-5}$). The asymptotes are clearly visible in both figures. Algorithm 1 exploits the multiview structure to set a small scale parameter for both views.
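A minimal Python/NumPy sketch of the surface $S^{lm}(\sigma_l, \sigma_m)$ used by Algorithm 1 (our naming, hypothetical data; following the asymptote discussion above, the logarithm of the summed product kernel is taken, and selecting the scales from the surface is left to inspection as in Fig. 1).

```python
import numpy as np

def gauss(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def cross_view_log_sum(X_l, X_m, sigmas_l, sigmas_m):
    """S^{lm}(sigma_l, sigma_m) = log sum_{i,j} [K^l(sigma_l) K^m(sigma_m)]_{i,j}
    over a grid of candidate scales (step 2 of Algorithm 1)."""
    S = np.zeros((len(sigmas_l), len(sigmas_m)))
    for a, s_l in enumerate(sigmas_l):
        Kl = gauss(X_l, s_l)
        for b, s_m in enumerate(sigmas_m):
            S[a, b] = np.log(np.sum(Kl @ gauss(X_m, s_m)))
    return S

# Usage: inspect the surface (as in Fig. 1) and pick, for each view, the
# smallest sigma whose row of S has departed from the lower asymptote log(N).
rng = np.random.default_rng(3)
X1, X2 = rng.standard_normal((100, 3)), rng.standard_normal((100, 3))
sigmas = np.logspace(-3, 2, 20)
S = cross_view_log_sum(X1, X2, sigmas, sigmas)
print(S.min(), S.max())   # approaches log(N) and 3*log(N) at the extremes
```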

4 Coupled Views L = 2

In this section, we provide analytical results for the simple case of a coupled dataset (i.e., $L = 2$). Some of the results could be extended to a larger number of views, but not in a straightforward manner. To simplify the notation in the rest of this section, we denote $X \triangleq X^1$ and $Y \triangleq X^2$.

4.1 Coupled mapping

The mappings provided by our approach (Eq. (13)) are justified by the relations given in Eq. (12). In this section, we provide another analytic justification for the proposed mapping. We begin with an analysis of a 1-dimensional mapping for each view. Let $\rho(x) = (\rho(x_1), \rho(x_2), \dots, \rho(x_M))$ and $\rho(y) = (\rho(y_1), \rho(y_2), \dots, \rho(y_M))$ denote such mappings (one for each view), and let $\hat{\rho} \triangleq (\rho(x), \rho(y))$ and $\hat{\rho}_i \triangleq (\rho(x_i), \rho(y_i))$. Define $K^z = K^x \cdot K^y$, where $K^x, K^y$ are computed based on Eq. (4). Our mapping should preserve local connectivities; therefore, we want to ensure that if the data points $i$ and $j$ are close in both views, then $\hat{\rho}_i$ and $\hat{\rho}_j$ are close. Minimization of the objective function

$$\underset{\hat{\rho}}{\operatorname{argmin}} \sum_{i,j} \left[(\rho(x_i) - \rho(x_j))^2 K^z_{i,j} + (\rho(y_i) - \rho(y_j))^2 (K^z)^T_{i,j}\right], \qquad (17)$$

with additional constraints, provides such a connectivity preserving mapping. If $K^z_{i,j}$ is small, indicating low connectivity between data points $i$ and $j$, the distance between $\hat{\rho}_i$ and $\hat{\rho}_j$ can be large. On the other hand, if $K^z_{i,j}$ is large, indicating high connectivity between points $i$ and $j$, the distance between $\hat{\rho}_i$ and $\hat{\rho}_j$ has to be small in order to minimize the objective function.

Theorem 2 Setting $\hat{\rho} = \psi_1$ minimizes the objective function in Eq. (17), where $\psi_1$ is the eigenvector corresponding to the second largest eigenvalue of the eigenvalue problem $\widehat{P}\psi_i = \lambda_i\psi_i$.

Proof
$$\sum_{i,j}\left[(\rho(x_i) - \rho(x_j))^2 K^z_{i,j} + (\rho(y_i) - \rho(y_j))^2 (K^z)^T_{i,j}\right]$$
$$= \sum_{i,j} \rho(x_i)^2 K^z_{i,j} + \sum_{i,j} \rho(x_j)^2 (K^z)^T_{i,j} - 2\sum_{i,j}\rho(x_i)\rho(x_j) K^z_{i,j} + \sum_{i,j} \rho(y_i)^2 K^z_{i,j} + \sum_{i,j} \rho(y_j)^2 (K^z)^T_{i,j} - 2\sum_{i,j}\rho(y_i)\rho(y_j) K^z_{i,j}$$
$$= \sum_{i} \rho(x_i)^2 D^{rows}_{i,i} + \sum_{j} \rho(x_j)^2 D^{cols}_{j,j} - 2\sum_{i,j}\rho(x_i)\rho(x_j) K^z_{i,j} + \sum_{i} \rho(y_i)^2 D^{cols}_{i,i} + \sum_{j} \rho(y_j)^2 D^{rows}_{j,j} - 2\sum_{i,j}\rho(y_i)\rho(y_j) K^z_{i,j}$$
$$= \begin{bmatrix}\rho(x) & \rho(y)\end{bmatrix}\left(\begin{bmatrix} D^{rows} & 0_{M\times M} \\ 0_{M\times M} & D^{cols}\end{bmatrix} - \begin{bmatrix} 0_{M\times M} & K^z \\ (K^z)^T & 0_{M\times M}\end{bmatrix}\right)\begin{bmatrix}\rho(x)^T \\ \rho(y)^T\end{bmatrix}.$$
By adding a scaling constraint, the minimization problem is rewritten as
$$\underset{\hat{\rho}\widehat{D}\hat{\rho}^T = 1}{\operatorname{argmin}}\ \hat{\rho}\,(\widehat{D} - \widehat{K})\,\hat{\rho}^T. \qquad (18)$$

This minimization problem can be solved by finding the minimal eigenvalue of $(\widehat{D} - \widehat{K})\hat{\rho}^T = \bar{\lambda}\widehat{D}\hat{\rho}^T$, since the minimization term then equals $\hat{\rho}\bar{\lambda}\widehat{D}\hat{\rho}^T = \bar{\lambda}$. This eigenproblem has a trivial solution, an all-ones eigenvector (denoted $\mathbf{1}$) with $\bar{\lambda} = 0$. The constraint $\hat{\rho}\widehat{D}\mathbf{1} = 0$ is added to remove the trivial solution, and the solution is then given by the smallest non-zero eigenvalue. Multiplying Eq. (18) by $\widehat{D}^{-1}$ reduces the problem to the eigenvalue problem of $\widehat{P}$; thus, we are looking for the eigenvector corresponding to its second largest eigenvalue. Theorem 2 therefore provides yet another justification for the proposed mapping in Eq. (13).

4.2 Spectral decomposition

In this section, we show how to efficiently compute the spectral decomposition of $\widehat{P}$ (Eq. (6)) when only two views exist ($L = 2$). The matrix $\widehat{P}$ is algebraically similar to the symmetric matrix $\widehat{P}_s = \widehat{D}^{1/2}\widehat{P}\widehat{D}^{-1/2} = \widehat{D}^{-1/2}\widehat{K}\widehat{D}^{-1/2}$. Therefore, both $\widehat{P}$ and $\widehat{P}_s$ share the same set of eigenvalues $\{\lambda_m\}$. Due to the symmetry of $\widehat{P}_s$, it has a set of $2M$ real eigenvalues $\{\lambda_m\}_{m=0}^{2M-1} \in \mathbb{R}$ and corresponding real orthogonal eigenvectors $\{\pi_m\}_{m=0}^{2M-1} \in \mathbb{R}^{2M}$; thus, $\widehat{P}_s = \Pi\Lambda\Pi^T$. By denoting $\Psi = \widehat{D}^{-1/2}\Pi$ and $\Phi = \widehat{D}^{1/2}\Pi$, we conclude that the sets $\{\psi_m\}_{m=0}^{2M-1}$ and $\{\phi_m\}_{m=0}^{2M-1} \in \mathbb{R}^{2M}$ are the right and left eigenvectors of $\widehat{P} = \Psi\Lambda\Phi^T$, respectively, satisfying $\psi_i^T\phi_j = \delta_{i,j}$. In the sequel, we use the symmetric matrix $\widehat{P}_s$ to simplify the analysis.

To avoid the spectral decomposition of the $2M \times 2M$ matrix $\widehat{P}_s$, it can be computed using the Singular Value Decomposition (SVD) of the $M \times M$ matrix $\bar{K}^z = (D^{rows})^{-1/2} K^z (D^{cols})^{-1/2}$, where $D^{rows}_{i,i} = \sum_{j=1}^{M} K^z_{i,j}$ and $D^{cols}_{j,j} = \sum_{i=1}^{M} K^z_{i,j}$ are diagonal matrices. Theorem 3 enables us to form the eigenvectors of $\widehat{P}$ as a concatenation of the singular vectors of $K^z = K^x \cdot K^y$.

Theorem 3 By using the left and right singular vectors of $K^z = V\Sigma U^T$, the eigenvectors and eigenvalues of $\widehat{K}$ are computed explicitly by
$$\Pi = \frac{1}{\sqrt{2}}\begin{bmatrix} V & V \\ U & -U \end{bmatrix}, \qquad \Lambda = \begin{bmatrix} \Sigma & 0_{M\times M} \\ 0_{M\times M} & -\Sigma \end{bmatrix}. \qquad (19)$$

Proof Both $V$ and $U$ are orthonormal sets; therefore, $u_i^T u_j = \delta_{i,j}$ and $v_i^T v_j = \delta_{i,j}$, and thus the set $\{\pi_m\}$ is orthonormal and $\Pi\Pi^T = I$. By using the construction defined in Eq. (19), $\Pi\Lambda\Pi^T$ is computed explicitly as
$$\Pi\Lambda\Pi^T = \frac{1}{2}\begin{bmatrix} V & V \\ U & -U \end{bmatrix}\begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix}\begin{bmatrix} V^T & U^T \\ V^T & -U^T \end{bmatrix} = \frac{1}{2}\begin{bmatrix} V\Sigma & -V\Sigma \\ U\Sigma & U\Sigma \end{bmatrix}\begin{bmatrix} V^T & U^T \\ V^T & -U^T \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 0 & 2K^z \\ (2K^z)^T & 0 \end{bmatrix} = \widehat{K},$$
where $0$ denotes the $M \times M$ matrix of zeros.

Thus, for $L = 2$, the proposed mapping in Eq. (13) can be computed using the SVD of $K^z$, Eq. (19), and $\Psi = \widehat{D}^{-1/2}\Pi$.
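A brief Python/NumPy sketch of this shortcut (our naming, hypothetical data): the eigenpairs of $\widehat{K}$ for $L = 2$ are assembled from the SVD of the $M \times M$ product $K^z = K^x K^y$ (Eq. (19)), instead of decomposing the full $2M \times 2M$ matrix.

```python
import numpy as np

def coupled_eigendecomposition(Kx, Ky):
    """Eigenvectors/eigenvalues of K_hat = [[0, Kz], [Kz^T, 0]] via SVD of Kz (Eq. (19))."""
    Kz = Kx @ Ky
    V, s, Ut = np.linalg.svd(Kz)                           # Kz = V diag(s) Ut
    U = Ut.T
    Pi = np.block([[V, V], [U, -U]]) / np.sqrt(2)          # eigenvectors of K_hat
    lam = np.concatenate([s, -s])                          # eigenvalues of K_hat
    return lam, Pi

# Sanity check against a direct reconstruction of K_hat.
rng = np.random.default_rng(4)
A, B = rng.standard_normal((40, 3)), rng.standard_normal((40, 3))
sq = lambda X: np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Kx, Ky = np.exp(-sq(A) / 2), np.exp(-sq(B) / 2)
lam, Pi = coupled_eigendecomposition(Kx, Ky)
K_hat = np.block([[np.zeros((40, 40)), Kx @ Ky], [(Kx @ Ky).T, np.zeros((40, 40))]])
print(np.allclose(Pi @ np.diag(lam) @ Pi.T, K_hat))        # expected: True
```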

4.3 Cross View Diffusion Distance

In some physical systems, the observed dataset, denoted by $X$, changes as a function of some underlying parameter denoted by $\alpha$. Under this model, we can obtain multiple snapshots for various values of $\alpha$, where each snapshot is denoted by $X_\alpha$. If these datasets are high dimensional, quantifying the amount of change between them is a difficult task. This scenario was recently studied in [23], which generalizes the diffusion framework to cases in which the data changes with the parameter $\alpha$. An example of such a scenario occurs in hyper-spectral images that change over time. The DM framework is applied in [23] to a fixed value of $\alpha$; then, using the extracted low dimensional mapping, the Euclidean distance is used to quantify the amount of change over $\alpha$. This approach is sensitive, since every small change in the data can result in a different mapping and the mappings are extracted independently; thus, there is no mutual influence on the extracted mappings. Our approach incorporates the mutual relations of data within each view and the relations among views. This enables us to measure, in a more robust way, the amount of variation between two datasets that correspond to a small variation in $\alpha$. We now define a new diffusion distance that measures the relation between two views, i.e., between all the corresponding data points. We measure the distance between all the coupled data points in both mappings by using the expression

$$D_t^{(CV)2}(X, Y) = \sum_{i=1}^{M} \|\widehat{\Psi}_t(x_i) - \widehat{\Psi}_t(y_i)\|^2. \qquad (20)$$

Our kernel matrix is a product of the Gaussian kernel matrices in each view. If the values of the kernel matrices ($K^x$, $K^y$) are similar, this corresponds to similarity between the views' inner geometries: the right and left singular vectors of the matrix $K^x K^y$ will be similar, and thus $D_t^{(CV)}$ will be small.

Theorem 4 The cross view diffusion distance (Eq. (20)) is invariant to orthonormal transformations between the ambient spaces $X$ and $Y$.

Proof Denote an orthonormal transformation matrix $R : X \to Y$, w.l.o.g. $y_i = Rx_i$. Then
$$K^y_{i,j} = \exp\left\{-\frac{\|y_i - y_j\|^2}{2\sigma_y^2}\right\} = \exp\left\{-\frac{\|Rx_i - Rx_j\|^2}{2\sigma_y^2}\right\} = \exp\left\{-\frac{\|x_i - x_j\|^2}{2\sigma_x^2}\right\} = K^x_{i,j}.$$
The last equality is due to the orthonormality of $R$ and to the choice $\sigma_y = \sigma_x$.

Therefore, the matrix $K^z = (K^x)^2$ (Eq. (4)) is symmetric, and its right and left singular vectors are equal, i.e., $U = V$ in Eq. (19). This creates a repetitive form in $\Psi = \widehat{D}^{-1/2}\Pi$, namely $\psi_\ell(i) = \psi_\ell(M + i)$ for $1 \leq i \leq M$ and $1 \leq \ell \leq M-1$, so that $\widehat{\Psi}_t(y_i) = \widehat{\Psi}_t(x_i)$ and thus $D_t^{(CV)2}(X, Y) = 0$.

4.4 Spectral Decay of $\widehat{K}$

The power of kernel based methods for dimensionality reduction stems from the spectral decay of the kernel's eigenvalues. In this section, we study the relation between the spectral decay of the Kernel Product (Eq. (7)) and that of our multiview kernel (Eq. (6)). In Section 7.1.1, we evaluate the spectral decay empirically using two experiments. The rest of this section is devoted to the theoretical justification of the spectral decay of our proposed framework. We start with some background.

Theorem 5 The eigenvalues of $\widehat{P}$ (Eq. (6)) are real and bounded, $|\lambda_i| \leq 1$, $i = 1, \dots, 2M$. A similar proof is given in [24].

Proof As shown in Section 4.2, $\widehat{P}$ is algebraically similar to a symmetric matrix, and thus its eigenvalues are guaranteed to be real. Denote by $\lambda$ and $\psi$ an eigenvalue and its eigenvector, such that $\lambda\psi = \widehat{P}\psi$. Define $i_0 = \operatorname{argmax}_{1 \leq i \leq 2M}|\psi(i)|$ as the index of the largest entry of $\psi$. The maximal value $\psi(i_0)$ can be written using $\widehat{P}$ from Eq. (6) such that $\lambda\psi(i_0) = \sum_{j=1}^{2M}\widehat{P}_{i_0 j}\psi(j)$, hence
$$|\lambda| = \left|\sum_{j=1}^{2M}\widehat{P}_{i_0 j}\frac{\psi(j)}{\psi(i_0)}\right| \leq \sum_{j=1}^{2M}\widehat{P}_{i_0 j}\frac{|\psi(j)|}{|\psi(i_0)|} \leq \sum_{j=1}^{2M}\widehat{P}_{i_0 j} = 1.$$
The first inequality is due to the triangle inequality and the last equality is due to the kernel normalization by $\widehat{D}^{-1}$.

Theorem 5 shows that the eigenvalues are bounded. However, bounded eigenvalues are insufficient for dimensionality reduction; dimensionality reduction is meaningful when there is a significant spectral decay.

Definition 1 Let $\mathcal{M}$ be a manifold. The intrinsic dimension $d$ of the manifold is a positive integer determined by how many independent "coordinates" are needed to describe $\mathcal{M}$. Using a parametrization to describe the manifold, the dimension of $\mathcal{M}$ is the smallest integer $d$ such that a smooth map $f(\xi) = \mathcal{M}$, $\xi \in \mathbb{R}^d$, describes the manifold.

Our framework is based on a Gaussian kernel. The spectral decay of Gaussian kernels was studied in [5]. We use Lemma 2 to evaluate the spectral decay of our kernel.

Lemma 2 Assume that the data is sampled from a manifold with intrinsic dimension $d \ll M$. Let $K^{\circ}$ (Section 3.2) denote a kernel with an exponential decay as a function of the Euclidean distance. For $\delta > 0$, the number of eigenvalues of $K^{\circ}$ above $\delta$ is proportional to $(\log(\frac{1}{\delta}))^d$.

Lemma 2 is based on Weyl's asymptotic law [25]. Let $r_\delta = r(\delta) = \max\{\ell \in \mathbb{N} \text{ such that } |\lambda_\ell| \geq \delta\}$ denote the number of eigenvalues of $K^{\circ}$ above $\delta$. Note that $K^{\circ}_{i,j} = K^x_{i,j}K^y_{i,j}$ corresponds to a single DM view as given in [5]. Theorem 6 relates the spectral decay of the kernel $\widehat{P}$ (Eq. (6)) to the decay of the Kernel Product-based DM ($P^{\circ}$, Eq. (7) and [5]).

Lemma 3 Let $A, B \in \mathbb{R}^{M \times M}$ be such that $A, B \geq 0$. Then, for any $1 \leq k \leq M-1$,
$$\prod_{\ell=k}^{M-1}\lambda_\ell(A \cdot B) \leq \prod_{\ell=k}^{M-1}\lambda_\ell(A \circ B), \qquad (21)$$
where $\circ$ denotes the Hadamard (element-wise) matrix product.

This inequality is proved in [26] and [27].

Theorem 6 The product of the last $M - 1 - r_\delta$ eigenvalues of $K^z$ is smaller than $\delta^{M-1-r_\delta}$. Formally, $\prod_{\ell=r_\delta}^{M-1}\lambda_\ell(K^x \cdot K^y) \leq \delta^{M-1-r_\delta}$.

Proof Denote by $\{\lambda_i(A)\}_{i=0}^{M-1}$ the eigenvalues of the matrix $A$, enumerated in descending order such that $\lambda_0(A) \geq \lambda_1(A) \geq \dots \geq \lambda_{M-1}(A)$. We use Lemma 3 to prove Theorem 6 by choosing $A = K^x$ and $B = K^y$, which are positive semi-definite. $K^x \circ K^y = K^{\circ}$ corresponds to the approach in [5]. By using Lemma 3 and choosing $k = r_\delta$ in Eq. (21), we get
$$\prod_{\ell=r_\delta}^{M-1}\lambda_\ell(K^x \cdot K^y) \leq \prod_{\ell=r_\delta}^{M-1}\lambda_\ell(K^{\circ}) \leq \delta^{M-1-r_\delta}.$$

Using the spectral decay of the kernel matrix, we can approximate Eq. (12) by neglecting all the eigenvalues that are smaller than $\delta$. Thus, we can compute a low dimensional mapping such that

$$\widehat{\Psi}_t^r(x_i) : x_i \mapsto \left[\lambda_1^t\psi_1(i), \lambda_2^t\psi_2(i), \lambda_3^t\psi_3(i), \dots, \lambda_{r-1}^t\psi_{r-1}(i)\right]^T \in \mathbb{R}^{r-1}. \qquad (22)$$

This mapping of dimension r provides a low dimensional space which improves the performance and the efficiency of various machine learning tasks.

5 Infinitesimal Generator

A family of diffusion operators was introduced in [5], where each operator differs in its normalization. If appropriate limits are taken, such that $M \to \infty$, $\epsilon \to 0$ with $\epsilon = 2\sigma^2$, then, as shown in [5], the DM kernel operator converges to one of the following differential operators: 1. the normalized graph Laplacian; 2. the Laplace-Beltrami diffusion; 3. the heat kernel equation. These results are proved in [5], and they are all special cases of the diffusion equation. This provides not only a physical justification for the DM framework, but also allows, in some cases, to distinguish between the geometry and the density of the data points. In this section, we study the asymptotic properties of the proposed kernel $\widehat{K}$ (Eq. (5)), using only two views, i.e., $L = 2$.

Theorem 7 The infinitesimal generator induced by the proposed kernel $\widehat{K}$ (Eq. (5)) converges, when $M \to \infty$, $\epsilon \to 0$, $\epsilon = 2\sigma^2$, to a "cross domain Laplacian operator". The convergence is to functions $f(x)$ and $g(y)$, which are the eigenfunctions of $\widehat{K}$. These functions operate on the two manifolds $\mathcal{M}_x$ and $\mathcal{M}_y$ and are the solutions of the following diffusion-like equations:

$$g(y) = m_1 f(y) + m_2 f''(y),$$
$$f(x) = m_3 g(x) + m_4 g''(x).$$

Proof Assume $x_1, x_2, \dots, x_M$ are i.i.d. uniformly distributed on some manifold $\mathcal{M}_x$ and $y_1, y_2, \dots, y_M$ are i.i.d. uniformly distributed on another manifold $\mathcal{M}_y$. The functions $f(x)$ and $g(y)$ are used to construct the vector
$$h = [f(x_1), f(x_2), \dots, f(x_M), g(y_1), g(y_2), \dots, g(y_M)] \in \mathbb{R}^{2M}.$$
The limit of the characteristic equation is
$$\lim_{\substack{M\to\infty \\ \epsilon\to 0}} \left[ g(y_i) - \frac{\sum_{j=1}^{M} K^z_{i,j} f(x_j)}{\sum_{j=1}^{M} K^z_{i,j}} \right] = \lim_{\substack{M\to\infty \\ \epsilon\to 0,\ \epsilon = 2\sigma_x^2 = 2\sigma_y^2}} \left[ g(y_i) - \frac{\sum_{j=1}^{M}\sum_{l=1}^{M} e^{-\frac{\|x_i - x_l\|^2}{2\sigma_x^2}} e^{-\frac{\|y_l - y_j\|^2}{2\sigma_y^2}} f(x_j)}{\sum_{j=1}^{M} K^z_{i,j}} \right].$$
The double sum in the numerator is approximated by using the following Riemann-based integral
$$\frac{1}{M^2}\sum_{j=1}^{M}\sum_{l=1}^{M} e^{-\frac{\|x_i - x_l\|^2}{2\sigma_x^2}} e^{-\frac{\|y_l - y_j\|^2}{2\sigma_y^2}} f(x_j) \xrightarrow[M\to\infty]{} \iint e^{-\frac{(y-t)^2}{2\sigma_y^2}} e^{-\frac{(x-t)^2}{2\sigma_x^2}} f(x)\,dx\,dt.$$
By applying the change of variables $z = \frac{x-t}{\sigma_x}$, $x = z\sigma_x + t$, $dx = \sigma_x\,dz$, we get
$$\iint e^{-\frac{(y-t)^2}{2\sigma_y^2}} e^{-\frac{(x-t)^2}{2\sigma_x^2}} f(x)\,dx\,dt = \sigma_x \iint e^{-\frac{z^2}{2}} e^{-\frac{(t-y)^2}{2\sigma_y^2}} f(t + z\sigma_x)\,dz\,dt.$$
By using a Taylor expansion of $f$ we get
$$= \sigma_x \iint e^{-\frac{z^2}{2}} e^{-\frac{(t-y)^2}{2\sigma_y^2}} \left[f(t) + f'(t)z\sigma_x + \tfrac{1}{2}f''(t)z^2\sigma_x^2 + \dots\right]dz\,dt$$
$$\approx \sigma_x \int e^{-\frac{(t-y)^2}{2\sigma_y^2}} \left[\int e^{-\frac{z^2}{2}}f(t)\,dz + \int e^{-\frac{z^2}{2}}f'(t)z\sigma_x\,dz + \tfrac{1}{2}\int e^{-\frac{z^2}{2}}f''(t)z^2\sigma_x^2\,dz\right]dt$$
$$= \sigma_x \int e^{-\frac{(t-y)^2}{2\sigma_y^2}} \left[a_0 f(t) + \tfrac{1}{2}a_1 f''(t)\sigma_x^2\right]dt,$$
where $a_0 = \int e^{-\frac{z^2}{2}}dz = \sqrt{2\pi}$, $\int e^{-\frac{z^2}{2}}z\,dz = 0$ and $a_1 = \int e^{-\frac{z^2}{2}}z^2\,dz = \sqrt{2\pi}$.
By another change of variables, $r = \frac{t-y}{\sigma_y}$, $t = r\sigma_y + y$, $dt = \sigma_y\,dr$, we get
$$= \sigma_x \int e^{-\frac{r^2}{2}} \left[a_0 f(y + r\sigma_y) + \tfrac{1}{2}a_1 f''(y + r\sigma_y)\sigma_x^2\right]dr$$
$$\approx \sigma_x \int e^{-\frac{r^2}{2}} \left[a_0\left(f(y) + f'(y)r\sigma_y + \tfrac{1}{2}f''(y)r^2\sigma_y^2\right) + \tfrac{1}{2}a_1\sigma_x^2\left(f''(y) + f'''(y)r\sigma_y + \tfrac{1}{2}f''''(y)r^2\sigma_y^2\right)\right]dr$$
$$\approx \sigma_x\left[a_0\int e^{-\frac{r^2}{2}}f(y)\,dr + \tfrac{1}{2}\int e^{-\frac{r^2}{2}}f''(y)\left(r^2\sigma_y^2 + a_1\sigma_x^2\right)dr\right]$$
$$= \sigma_x\left[a_0 f(y) + a_3 f''(y)\right],$$
where $a_3 = \int e^{-\frac{r^2}{2}}\left(r^2\sigma_y^2 + a_1\sigma_x^2\right)dr = \sqrt{2\pi}(\sigma_y^2 + \sigma_x^2)$. We neglected the order $\sigma^4$ terms. In the same way, we compute, for the eigenfunction $g$,
$$f(x) = \sigma_y\left[a_0 g(x) + a_4 g''(x)\right].$$

This infinitesimal generator is a coupled Laplacian operator acting on the manifolds. The characteristic functions induced by this operator are sines and cosines, just as in the standard diffusion equation.

6 Relation To Recent Studies

In this section, we describe other frameworks for incorporating multiple views. The first two are related to the diffusion process; however, the proposed kernels are not symmetric and thus do not ensure real eigenvalues and eigenvectors.
1. Unsupervised metric fusion by cross diffusion: The framework in [12] suggests constructing two matrices for each view: $P^l$, defined as in Eq. (1), and $\tilde{P}^l$, a stochastic matrix computed using only the K nearest neighbors of each instance. The kernels are fused such that

$$P^1_{t+1} = \tilde{P}^1 \cdot P^2_t \cdot (\tilde{P}^1)^T, \qquad (23)$$

$$P^2_{t+1} = \tilde{P}^2 \cdot P^1_t \cdot (\tilde{P}^2)^T. \qquad (24)$$

The diffusion process incorporates two steps (between the views) in each time unit. Convergence of the induced diffusion distance is proved for $t \to \infty$.
2. Common Manifold Learning Using Alternating Diffusion: An alternating diffusion process is proposed in [18]. The construction is based on fusing the stochastic matrices $P^1, P^2$ by multiplying them, such that $P^{AD} = P^1 \cdot P^2$. Assuming that a common random variable exists in both views, the authors prove new results regarding the extraction of the underlying hidden random parameters. The study was inspired by our work, as it quotes the proposed construction.
3. Kernel Canonical Correlation Analysis (KCCA): These frameworks [19], [20] extend the well-known Canonical Correlation Analysis (CCA) by applying a kernel function prior to the application of CCA. Kernels $K^1, K^2$ are constructed for each view as in Eq. (4), and the canonical vectors $v_1, v_2$ are computed by solving the following generalized eigenvalue problem

$$\begin{bmatrix} 0_{M\times M} & K^1 \cdot K^2 \\ K^2 \cdot K^1 & 0_{M\times M}\end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \rho \cdot \begin{bmatrix} (K^1)^2 & 0_{M\times M} \\ 0_{M\times M} & (K^2)^2 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix}. \qquad (25)$$

For clustering tasks, K-means is applied to the set of generalized eigenvectors (a code sketch of Eq. (25) is given at the end of this section).
4. Spectral clustering with two views: The approach in [17] generalizes the traditional normalized graph Laplacian to two views. Kernels $K^1, K^2$ are computed for each view as in Eq. (4). The kernels are multiplied, $W = K^1 \cdot K^2$, and then fused:
$$A = \begin{bmatrix} 0_{M\times M} & W \\ W^T & 0_{M\times M}\end{bmatrix}. \qquad (26)$$
Finally, using $\bar{D}$, where $\bar{D}_{i,i} = \sum_j W_{i,j}$, the normalized fused kernel is defined as
$$\bar{A} = \bar{D}^{-0.5} \cdot A \cdot \bar{D}^{-0.5}. \qquad (27)$$

Denoting the number of clusters by $N_C$ and the eigenvectors of $A$ by $\phi_i$, $i = 1, \dots, 2M$, the K-means algorithm is applied to the mapping

$$\Phi(i) \triangleq [\phi_1(i), \dots, \phi_{N_C}(i)]\,/\,s(i), \qquad (28)$$

where $s(i) = \sum_{j=1}^{N_C}(\phi_j(i))^2$. The paper focuses on spectral clustering; however, a similar version of the proposed kernel is suited for manifold learning, as we demonstrate in this paper. We use a stochastic version of the kernel and extend the construction to multiple views. The stochastic version $\widehat{P}$ (Eq. (6)) is useful as it provides various theoretical justifications for the multiview diffusion process. This approach is referred to as De Sa, and we use it as a comparison baseline for clustering experiments on various datasets.
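As an aside, the generalized eigenvalue problem of Eq. (25) (item 3 above) can be solved directly with standard linear algebra routines. The following Python/SciPy sketch uses our own function name and adds a small ridge regularization for numerical stability, which is an assumption on our part and not taken from the text above.

```python
import numpy as np
from scipy.linalg import eigh

def kcca_directions(K1, K2, reg=1e-6, n_pairs=2):
    """Solve the generalized eigenproblem of Eq. (25) for the canonical vectors."""
    M = K1.shape[0]
    Z = np.zeros((M, M))
    A = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
    # Right-hand side blocks (K^1)^2, (K^2)^2; `reg` is an assumed ridge term.
    B = np.block([[K1 @ K1, Z], [Z, K2 @ K2]]) + reg * np.eye(2 * M)
    rho, V = eigh(A, B)                      # generalized symmetric eigenproblem
    order = np.argsort(-rho)[:n_pairs]
    return rho[order], V[:M, order], V[M:, order]   # correlations, v1, v2
```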

7 Experimental Results

In this section, we present the experimental results which evaluate our frame- work. We focus the experiments on three tasks in machine learning: clustering, classification and manifold learning.

7.1 Empirical Evaluations of Theoretical Aspects

In the first group of experiments, we provide empirical evidence which supports the theoretical analysis from Section 4.

7.1.1 Spectral Decay

In Section 4.4, an upper bound on the eigenvalues' decay rate for our multiview-based approach (the matrix $\widehat{P}$, Eq. (6)) was presented. In order to empirically evaluate the decay rate, synthetic datasets are generated and a comparison to other approaches is performed. To evaluate the spectral decay of $\widehat{P}$ (Eq. (6)), $P^{\circ}$ (Eq. (7) and [5]) and $P^{+}$ (Eq. (9)), we compare the spectral decay rates of the various frameworks on synthetic clustered data drawn from Gaussian distributions. The following steps describe the generation of the two views, denoted by $(X, Y)$ and referred to as View-I ($X$) and View-II ($Y$), respectively:
1. Six vectors $\mu_j \in \mathbb{R}^9$, $j = 1, \dots, 6$, were drawn from a Gaussian distribution $N(0, 8 \cdot I_{9\times 9})$. These vectors are the centers of mass of the generated classes.

2. 100 data points were drawn for each cluster $j$, $1 \leq j \leq 6$, from a Gaussian distribution $N(\mu_j, 2 \cdot I_{9\times 9})$. Denote these 600 data points by $X$.
3. 100 data points were drawn for each cluster $j$, $1 \leq j \leq 6$, from a Gaussian distribution $N(\mu_j, 2 \cdot I_{9\times 9})$. Denote these 600 data points by $Y$.
The first 3 dimensions of both views are presented in Fig. 2. We compute the probability matrices of each view, $P^x$ and $P^y$ (Eq. (1)), the Kernel Sum probability matrix $P^{+}$ (Eq. (9)), the Kernel Product matrix $P^{\circ}$ (Eq. (7)) and the proposed matrix $\widehat{P}$. The eigendecomposition is computed for all matrices, and the resulting eigenvalue decay rates are compared with the product of the eigenvalues from both views. To get a fair comparison between all the methods, we set the Gaussian scale parameters $\sigma_x$ and $\sigma_y$ in each view and then use these scales in all the methods. The vectors' variance in the concatenation approach is the sum of variances, since we assume statistical independence; therefore, the scale parameter $\sigma_\circ^2 = \sigma_x^2 + \sigma_y^2$ is used. The experiment is repeated, but this time $X$ contains 6 clusters whereas $Y$ contains only 3; for $Y$, we use only the first 3 centers of mass and generate 200 points in each cluster. Figure 3 presents the spectral decay, on a logarithmic scale, of the eigenvalues extracted by all methods. It is evident that our proposed kernel has the strongest spectral decay.
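A short Python/NumPy sketch of this data generation (steps 1-3 above); the variable names and random seed are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(0.0, np.sqrt(8.0), size=(6, 9))      # step 1: six class centers in R^9
# Steps 2-3: 100 points per cluster and per view, variance 2 in each dimension.
X = np.vstack([rng.normal(m, np.sqrt(2.0), size=(100, 9)) for m in mu])
Y = np.vstack([rng.normal(m, np.sqrt(2.0), size=(100, 9)) for m in mu])
labels = np.repeat(np.arange(6), 100)
print(X.shape, Y.shape)   # (600, 9) (600, 9)
```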

7.1.2 Cross View Diffusion Distance

In this section, we examine the proposed Cross View Diffusion Distance (Section 4.3). A Swiss roll is generated by using the function
$$\text{View I: } X = \begin{bmatrix} x^1_i \\ x^2_i \\ x^3_i \end{bmatrix} = \begin{bmatrix} 6\theta_i\cos(\theta_i) \\ h_i \\ 6\theta_i\sin(\theta_i) \end{bmatrix} + N^1_i, \qquad (29)$$

where $\theta_i = (1.5\pi)s_i$, $i = 1, 2, 3, \dots, 1000$, and $s_i$ are 1000 data points spread linearly within the interval $[1, 3]$. The second view is generated by applying an orthonormal transformation to the Swiss roll and adding Gaussian noise. The function in Eq. (30) describes the representation of the second view:
$$\text{View II: } Y = \begin{bmatrix} y^1_i \\ y^2_i \\ y^3_i \end{bmatrix} = R\begin{bmatrix} 6\theta_i\cos(\theta_i) \\ h_i \\ 6\theta_i\sin(\theta_i) \end{bmatrix} + N^2_i, \qquad (30)$$
where $R \in \mathbb{R}^{3\times 3}$ is a random orthonormal transformation matrix, generated by drawing values from i.i.d. Gaussian variables and applying the Gram-Schmidt process. $h_i$, $i = 1, \dots, 1000$, are drawn from a uniform distribution within the interval $[0, 100]$. Each component of $N^1_i, N^2_i \in \mathbb{R}^{3\times 1}$ is drawn from a Gaussian distribution with zero mean and a variance of $\sigma_N^2$. An example of both Swiss rolls is presented in Fig. 4. A standard DM is applied to each view and a 2-dimensional embedding of the Swiss roll is extracted. The sum of distances between all the data points in the two embedding spaces is denoted as the single view diffusion distance (SVDD). The distance is computed using the following measure:
$$D_t^{(SV)2}(X, Y) = \sum_{i=1}^{M} \|\Psi_t(x_i) - \Psi_t(y_i)\|^2, \qquad (31)$$


Fig. 2: The first 3 dimensions of the Gaussian mixture. Both views share the centers of mass of the Gaussian clusters. Top: first view, denoted as X. Bottom: second view, denoted as Y. The variance of the Gaussian in each dimension is 8.

where $\Psi_t(x_i), \Psi_t(y_i)$, $i = 1, \dots, M$, are the single view diffusion mappings. Then, the proposed framework is applied to extract the coupled embedding, and a Cross View Diffusion Distance (CVDD) is computed using Eq. (20). This experiment was executed 100 times for various values of the Gaussian noise variance $\sigma_N^2$. The median of the results is presented in Fig. 5. In about 10% of the single view simulations, the embeddings' axes are permuted, which generates a large SVDD although the embeddings share similar structures; in order to remove these types of errors, we use the median over the 100 simulations.
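A minimal Python/NumPy sketch of the two-view Swiss roll construction used above (Eqs. (29)-(30)); the variable names and seed are ours, and the QR factorization is used as an equivalent of the Gram-Schmidt process.

```python
import numpy as np

rng = np.random.default_rng(5)
M = 1000
sigma_N = 0.01                                     # noise std (variance sigma_N**2)
s = np.linspace(1, 3, M)
theta = 1.5 * np.pi * s
h = rng.uniform(0, 100, M)

roll = np.c_[6 * theta * np.cos(theta), h, 6 * theta * np.sin(theta)]   # common structure
X = roll + sigma_N * rng.standard_normal((M, 3))                        # View I (Eq. (29))

# Random orthonormal R via QR of a Gaussian matrix (Gram-Schmidt equivalent).
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Y = roll @ R.T + sigma_N * rng.standard_normal((M, 3))                  # View II (Eq. (30))
print(np.allclose(R.T @ R, np.eye(3)))   # True: R is orthonormal
```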

7.2 Multiview Clustering

The task of clustering has been at the core of machine learning for many years. The goal is to divide a given data set into subsets based on the inherent structure of the data.


Fig. 3: Eigenvalues decay rate. Comparison between different mapping methods. Top: 6 clusters in each view. Bottom: 6 clusters in X and 3 clusters in Y.

We use the multiview construction to extract low dimensional mappings from multiple sets of high dimensional data points. In the following experiments, we demonstrate the advantage of the proposed approach on both artificial and real data sets.

7.2.1 Two Circles Clustering

Spectral properties of data sets are useful for clustering since they reveal information about the unknown number of clusters. The characteristics of the eigenvalues of $\widehat{P}$ (Eq. (6)) can provide insight into the number of clusters within the data set. The study in [28] relates the number of clusters to the multiplicity of the eigenvalue 1, while a different approach in [34] analyzes the relation between the eigenvalue drop and the number of clusters. In this section, we evaluate how our proposed method captures the clusters' structure when two views are available.


Fig. 4: The two Swiss rolls. Top: generated by Eq. (29). Bottom: generated by Eq. (30).

We generate two circles that represent the original clusters using the function

$$Z = \begin{bmatrix} z_1[i] \\ z_2[i] \end{bmatrix} = \begin{bmatrix} r\cos(\theta[i]) \\ r\sin(\theta[i]) \end{bmatrix}, \qquad (32)$$
where the 1600 points $\theta[i]$, $1 \leq i \leq 1600$, are spread linearly within the interval $[0, 4\pi]$. The clusters are created by changing the radius as follows: $r = 2.2$ for $1 \leq i \leq 800$ (first cluster) and $r = 4$ for $801 \leq i \leq 1600$ (second cluster). The views $X$ and $Y$ are generated by the application of the following non-linear distorting functions:
$$x_1[i] = \begin{cases} z_1[i] + 0.8 + n_2[i], & z_2[i] \geq 0 \\ z_1[i] - 0.8 + n_3[i], & z_2[i] < 0 \end{cases}, \qquad x_2[i] = z_2[i] + n_1[i], \qquad (33)$$
and
$$y_1[i] = z_1[i] + n_4[i], \qquad y_2[i] = \begin{cases} z_2[i] + 0.8 + n_5[i], & z_1[i] \geq 0 \\ z_2[i] - 0.8 + n_6[i], & z_1[i] < 0 \end{cases}, \qquad (34)$$


Fig. 5: Comparison between two cross view diffusion based distances. Simulated on two Swiss rolls with additive Gaussian noise. The results are the median of 100 simulations.


Fig. 6: Left: first view X. Right: second view Y . The ground truth clusters are represented by the marker’s shape and color.

where $n_l[i]$, $1 \leq l \leq 6$, are i.i.d. random variables drawn from a Gaussian distribution with $\mu = 0$ and $\sigma_n^2 \in [0.03, 0.6]$. This data is referred to as the Coupled Circles dataset. In Fig. 6, the views $X$ and $Y$, which were generated by Eqs. (33) and (34), are presented; color and shape indicate the ground truth clusters. Initially, DM is applied to each view and we perform clustering using K-means (k = 1) on the first diffusion coordinate. The following parameters are used: $t = 1$, $\sigma_x = \sigma_y = 0.05$, as defined in the DM framework (Section 2.2). The results are presented in Fig. 7. We further extract a 1-dimensional representation using the proposed


Fig. 7: Left: first view X. Right: second view Y . The ground truth clusters are represented by the marker’s shape. The clustering results, which are based on the single DM-based view, are represented by the marker’s color.


Fig. 8: Left: first view X. The clustering result, which originated from a DM-based Kernel Product, are represented by the marker’s color. Right: first view X. The clustering results, which are based on the proposed approach, are represented by the marker’s color. The ground truth clusters are represented by the marker’s shape.

multiview framework (Eq. (13)), the Kernel Sum DM, the Kernel Product DM (Eq. (7)), De Sa's approach and Kernel CCA (Section 6). Clustering is performed in the representation space by the application of K-means with k = 1. For the multiview mapping, the following parameters are used: $t = 1$, $\sigma_x = \sigma_y = 0.05$ (defined in Eqs. (4) and (13)) and $\sigma_\circ = 0.071$ (Eq. (8)). The clustering results are presented in Fig. 8. It is evident that the multiview-based approach outperforms the DM-based single view and the Kernel Product approaches. To evaluate the performance of our proposed mapping, 100 simulations with various values of the Gaussian noise variance (all with zero mean) were performed. The average clustering success rate is presented in Fig. 9.
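The Coupled Circles data used above (Eqs. (32)-(34)) can be generated with the following Python/NumPy sketch; the variable names, noise level and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(6)
theta = np.linspace(0, 4 * np.pi, 1600)
r = np.where(np.arange(1600) < 800, 2.2, 4.0)          # two ground-truth clusters
z1, z2 = r * np.cos(theta), r * np.sin(theta)          # Eq. (32)

var_n = 0.1                                            # noise variance, in [0.03, 0.6]
n = np.sqrt(var_n) * rng.standard_normal((6, 1600))

# Eq. (33): first distorted view X.
x1 = np.where(z2 >= 0, z1 + 0.8 + n[1], z1 - 0.8 + n[2])
x2 = z2 + n[0]
# Eq. (34): second distorted view Y.
y1 = z1 + n[3]
y2 = np.where(z1 >= 0, z2 + 0.8 + n[4], z2 - 0.8 + n[5])
X, Y = np.c_[x1, x2], np.c_[y1, y2]
print(X.shape, Y.shape)   # (1600, 2) (1600, 2)
```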


Fig. 9: Clustering results from averaging 100 trials vs. the variance of the Gaussian noise. The simulation was performed on the Coupled Circles data (Eqs. (32), (33) and (34)).

7.2.2 Handwritten Digits

For the following clustering experiment, we use the Multiple Features database from the UCI repository. The data set consists of 2000 handwritten digits from 0 to 9, spread equally among the classes. The features extracted from the images are the profile correlations (FAC), Karhunen-Loève coefficients (KAR), Zernike moments (ZER), morphological features (MOR), pixel averages in 2 × 3 windows and the Fourier coefficients (FOU), used as our feature spaces $X^1, X^2, X^3, X^4, X^5, X^6$, respectively. We perform dimensionality reduction using a single view DM, the Kernel Product DM, the Kernel Sum DM and the proposed multiview approach, and apply K-means to the reduced mapping using 6 to 20 coordinates. The clustering performance is measured using the Normalized Mutual Information (NMI) [32]. Figure 10 presents the average clustering results using K-means (K=1).

7.2.3 Isolet Data Set

The Isolet data set was constructed by recording 150 people pronouncing all 26 letters of the alphabet, each letter twice. The available feature vector is a concatenation of the following features: spectral coefficients, contour features, sonorant features, pre-sonorant features and post-sonorant features. The authors do not provide the separation into feature groups; therefore, to apply the multiview approach, we compute 3 different kernels and fuse them together. $K^1$ is the standard Gaussian kernel defined in Eq. (4). $K^2$ is a Laplacian kernel defined by the following equation:

$$K^2_{i,j} \triangleq \exp\left\{-\frac{|x_i - x_j|}{\sigma^2}\right\}. \qquad (35)$$

The third kernel, $K^3$, is an exponential kernel with a cosine-based distance as the affinity measure:

$$K^3_{i,j} \triangleq \exp\left\{\frac{T_{i,j} - 1}{2\sigma^2}\right\}, \qquad i, j = 1, \dots, M. \qquad (36)$$


Fig. 10: Average clustering accuracy running 100 simulations on the Handwritten data set. Accuracy is measured using the Normalized Mutual Information (NMI).

where $T_{i,j}$ is the correlation coefficient between the $i$-th and $j$-th feature vectors, computed using

$$T_{i,j} \triangleq \frac{\tilde{x}_i^T \tilde{x}_j}{\sqrt{(\tilde{x}_i^T \tilde{x}_i)(\tilde{x}_j^T \tilde{x}_j)}}, \qquad i, j = 1, \dots, M, \qquad (37)$$

where the average-subtracted features are $\tilde{x}_i \triangleq x_i - \eta_i \cdot \mathbf{1}$, with $\eta_i$ the mean of the entries of $x_i$. We fuse all kernels using the multiview, kernel product and kernel sum approaches and perform K-means in the extracted space. The average NMI for 26 classes is presented in Table ****. Note that the proposed approach does not provide the highest score; however, it improves upon all single view scores and is comparable to the alternatives.
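A short Python/NumPy sketch of the three kernels used here (Eqs. (4), (35), (36)-(37)); the helper names are ours, and interpreting the distance $|x_i - x_j|$ in the Laplacian kernel as the L1 norm is our assumption.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))                          # Eq. (4)

def laplacian_kernel(X, sigma):
    l1 = np.sum(np.abs(X[:, None, :] - X[None, :, :]), axis=-1)    # |x_i - x_j|, assumed L1
    return np.exp(-l1 / sigma ** 2)                                # Eq. (35)

def cosine_kernel(X, sigma):
    Xt = X - X.mean(axis=1, keepdims=True)                         # average-subtracted features
    norms = np.linalg.norm(Xt, axis=1, keepdims=True)
    T = (Xt / norms) @ (Xt / norms).T                              # correlation coefficients, Eq. (37)
    return np.exp((T - 1.0) / (2 * sigma ** 2))                    # Eq. (36)
```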

7.3 Manifold Learning

7.3.1 Artificial Manifold Learning

In the general DM approach, it is assumed that the sampled space describes a low dimensional manifold. However, this assumption can be incorrect: the sampled space may contain redundancy in the manifold or, more generally, may describe two or more manifolds generated by a common physical process. In this section, we examine the embedding extracted by our method and compare it to the Kernel Product approach (Section 3.2).

Helix A: Two coupled manifolds with a common underlying open circular structure are generated. The helix shaped manifolds were generated by the application of a


Fig. 11: Top: first Helix X (Eq. (38)). Bottom: second Helix Y (Eq. (39)). Both manifolds have some circular structure governed by the angle parameters a[i] and b[i], i = 1, 2, 3, ..., 1000, colored by the point index i.

3-dimensional function to 1000 data points spread linearly such that $a[i] \in [0, 2\pi]$ and $b[i] = a[i] + 0.5\pi \mod 2\pi$, $i = 1, 2, 3, \dots, 1000$. The functions in Eqs. (38) and (39) are used to generate the datasets for View-I and View-II, denoted as $X$ and $Y$, respectively:

$$\text{View I: } X = \begin{bmatrix} x_1[i] \\ x_2[i] \\ x_3[i] \end{bmatrix} = \begin{bmatrix} 4\cos(0.9a[i]) + 0.3\cos(20a[i]) \\ 4\sin(0.9a[i]) + 0.3\sin(20a[i]) \\ 0.1(6.3a[i] - a[i]^2) \end{bmatrix}, \quad i = 1, 2, 3, \dots, 1000, \qquad (38)$$

$$\text{View II: } Y = \begin{bmatrix} y_1[i] \\ y_2[i] \\ y_3[i] \end{bmatrix} = \begin{bmatrix} 4\cos(0.9b[i]) + 0.3\cos(20b[i]) \\ 4\sin(0.9b[i]) + 0.3\sin(20b[i]) \\ 0.1(6.3b[i] - b[i]^2) \end{bmatrix}, \quad i = 1, 2, 3, \dots, 1000. \qquad (39)$$

The 3-dimensional Helix shaped manifolds $X$ and $Y$ are presented in Fig. 11. The Kernel Product mapping (Eq. (7)) separates the manifold into a bow and a point, as shown in Fig. 13. This structure neither represents any of the original structures nor reveals the underlying parameters $a[i], b[i]$. On the other hand, our embedding (Eq. (13)) captures the two structures, one for each view: as shown in Fig. 12, one structure represents the angle $a[i]$ while the other represents the angle $b[i]$. The Euclidean distance in the new spaces preserves the mutual relations between data points based on the geometrical relations in both views. Moreover,


Fig. 12: Top: MultiView-based embedding of the first view $\hat{\Psi}(X)$. Bottom: MultiView-based embedding of the second view $\hat{\Psi}(Y)$. Both were computed using Eq. (13).
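The two coupled views shown in Fig. 11 (whose embeddings appear in Fig. 12) can be reproduced directly from Eqs. (38)-(39). The snippet below is a minimal sketch of that data generation; the function and variable names are our own illustrative choices.

```python
import numpy as np

n = 1000
a = np.linspace(0, 2 * np.pi, n)           # a[i] spread linearly over [0, 2*pi]
b = (a + 0.5 * np.pi) % (2 * np.pi)        # b[i] = (a[i] + pi/2) mod 2*pi

def helix_view(t):
    """3-dimensional helix-shaped view of Eqs. (38)-(39) for the angle vector t."""
    return np.column_stack([
        4 * np.cos(0.9 * t) + 0.3 * np.cos(20 * t),
        4 * np.sin(0.9 * t) + 0.3 * np.sin(20 * t),
        0.1 * (6.3 * t - t ** 2),
    ])

X = helix_view(a)   # View I,  shape (1000, 3)
Y = helix_view(b)   # View II, shape (1000, 3)
```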

Moreover, both manifolds lie in the same coordinate system; this is a strong advantage, as it enables us to compare the manifolds in the lower-dimensional space.

Helix B. The previous experiment was repeated with the functions in Eqs. (40) and (41), which generate the datasets for View-I and View-II, denoted by X and Y, respectively.


Fig. 13: 2-dimensional DM-based mapping of the Helix, computed using the concatenated vectors from both views, corresponding to the kernel P◦ (Eq. (7)).

$$\text{View I:}\quad X = \begin{bmatrix} x_1[i] \\ x_2[i] \\ x_3[i] \end{bmatrix} = \begin{bmatrix} 4\cos(5a[i]) \\ 4\sin(5a[i]) \\ 4a[i] \end{bmatrix}, \qquad (40)$$

$$\text{View II:}\quad Y = \begin{bmatrix} y_1[i] \\ y_2[i] \\ y_3[i] \end{bmatrix} = \begin{bmatrix} 4\cos(5b[i]) \\ 4\sin(5b[i]) \\ 4b[i] \end{bmatrix}. \qquad (41)$$

Again, 1000 points were generated using a[i] ∈ [0, 2π] and b[i] = (a[i] + 0.5π) mod 2π, i = 1, 2, 3, ..., 1000. The generated manifolds are presented in Fig. 14. As can be seen in Fig. 15, the proposed embeddings (Eq. (13)) successfully capture the governing parameters a[i] and b[i]. The Kernel Product-based embedding (Eq. (7)) is presented in Fig. 16; it again separates the data points into two disconnected structures that do not represent the parameters well.
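For reference, the single-kernel baseline used in Figs. 13 and 16 follows the standard DM recipe applied to the concatenated views: a Gaussian affinity kernel, row normalization into a Markov matrix, and an eigendecomposition whose leading non-trivial eigenvectors give the embedding. The sketch below illustrates this pipeline under assumed choices (a fixed kernel scale `eps` and a dense eigendecomposition); it is not the authors' code.

```python
import numpy as np

def diffusion_map(Z, eps, n_coords=2):
    """Standard DM pipeline: Gaussian kernel, row normalization, eigendecomposition.

    Z : (n, d) data matrix (here the concatenation [X, Y] of both views).
    Returns the n_coords leading non-trivial diffusion coordinates.
    """
    sq = np.sum(Z ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T       # pairwise squared distances
    K = np.exp(-D2 / (2 * eps))                        # Gaussian affinity kernel
    P = K / K.sum(axis=1, keepdims=True)               # row-stochastic Markov matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Skip the trivial all-ones eigenvector; scale the rest by their eigenvalues.
    return vecs[:, 1:n_coords + 1] * vals[1:n_coords + 1]

# Kernel Product / concatenation baseline for the helix example (X, Y as generated above):
# psi = diffusion_map(np.hstack([X, Y]), eps=1.0, n_coords=2)
```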

7.3.2 MultiView Video Sequence

Various examples, such as images, audio and MRI [21], [7], [35], have demonstrated the power of DM for extracting the underlying physical parameters governing real datasets. In this experiment, the multiview approach is tested on real data. Two web cameras and a toy train with a pre-designed path are used. The train's path has an "eight"-shaped structure. Extracting the underlying manifold from the set of images enables us to organize the images according to the location along the train's path and thus reveals the true underlying parameters of the process.


Fig. 14: Top: first Helix X (Eq. (40)). Bottom: second Helix Y (Eq. (41)). Both manifolds have a circular structure governed by the angle parameters a[i] and b[i], i = 1, 2, 3, ..., 1000, and are colored by the point index i.

The setting of the experiment is as follows: each camera records a set of images from a different angle. A sample frame from each view is presented in Fig. 17. The video is sampled at 30 frames per second with a resolution of 640×480 pixels per frame. M = 220 images were collected from each view (camera). Then, the R, G, B values were averaged and the images were downsampled to a 160×120 pixel resolution. The matrices were reshaped into column vectors. The resulting sets of vectors are denoted by X and Y, where $x_i, y_i \in \mathbb{R}^{19200}$, 1 ≤ i ≤ 220. The sequential order of the images is not important for the algorithm. In a normal setting, one view is sufficient to extract the parameters that govern the movement of the train and thus recover the natural order of the images. However, we use two types of interferences to create a scenario in which each view by itself is insufficient for extracting the underlying parameters. The first interference is a gap in the recording of each camera: we remove 20 consecutive frames from each view at different time locations.
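A minimal sketch of this preprocessing step is given below, assuming the raw frames are already available as 480×640×3 arrays; the simple 4× decimation used for downsampling is our own illustrative choice, not necessarily the resampling used by the authors.

```python
import numpy as np

def frames_to_vectors(frames):
    """Turn a list of RGB frames into the column-vector dataset used by the framework.

    frames : list of (480, 640, 3) uint8 arrays.
    Each frame is averaged over the R, G, B channels, downsampled to 160x120
    and flattened into a vector of length 19200, as described in the text.
    """
    vectors = []
    for f in frames:
        gray = f.astype(float).mean(axis=2)    # average the R, G, B values
        small = gray[::4, ::4]                 # naive 4x decimation: 480x640 -> 120x160
        vectors.append(small.reshape(-1))      # reshape into a length-19200 vector
    return np.stack(vectors)                   # shape (M, 19200)
```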


Fig. 15: The coupled mappings $\hat{\Psi}(X)$ (top) and $\hat{\Psi}(Y)$ (bottom), computed using our proposed parametrization in Eq. (13).


Fig. 16: A 2-dimensional mapping, extracted based on P◦ (Eq. (7)).

By doing so, the bijective correspondence of some of the images in the sequence is broken. However, even an approximate correspondence is sufficient for our proposed manifold extraction. A standard 2-dimensional DM-based mapping of each view was extracted. The results are bow-shaped manifolds, as presented in Fig. 18. Applying DM separately to each view extracts the correct order of the data points (images) along the path. However, the "missing" data points break the circular structure of the expected manifold and result in a bow-shaped embedding. We use the multiview-based methodology to overcome this interference by applying the multiview-based framework to extract two coupled mappings (Eq. (13)). The results are presented in Fig. 19. The proposed approach overcomes the interference by smoothing the gap inherited in each view through the use of connectivities from the "undistracted" view. Finally, we concatenate the vectors from both views and compute the Kernel Product embedding. The results are presented in Fig. 20. Again, the structure of the manifold is distorted and incomplete due to the missing images.
This experiment was repeated while replacing 10 frames from each view with Gaussian noise with mean µ = 0 and variance σ² = 10. A single-view DM-based mapping was computed, as well as the Kernel Product-based DM and the multiview-based DM mappings. As presented in Fig. 21, the Gaussian noise distorted the manifolds extracted from each view. The multiview approach extracted the two circular structures presented in Fig. 22. Again, the data points are ordered according to the position along the path. This time, the circular structure is unfolded and the gaps are visible in both embeddings. Applying the Kernel Product approach (Eq. (7)) yielded a distorted manifold, as presented in Fig. 23.
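The two interference scenarios can be emulated on the vectorized image sets as follows; the frame locations and the random seed are arbitrary illustrative choices, while the frame counts and noise parameters follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_frames(V, start, count=20):
    """Gap interference: remove `count` consecutive frames starting at `start`."""
    return np.delete(V, np.arange(start, start + count), axis=0)

def corrupt_frames(V, start, count=10, mu=0.0, sigma2=10.0):
    """Noise interference: replace `count` frames by Gaussian noise (mean mu, variance sigma2)."""
    W = V.copy()
    W[start:start + count] = rng.normal(mu, np.sqrt(sigma2), size=(count, V.shape[1]))
    return W

# e.g. gaps at different time locations in each view:
# X_gap, Y_gap = drop_frames(X, 50), drop_frames(Y, 150)
```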

7.4 Classification

Dimensionality reduction can not only improve classification results but also reduce the required run time. Studies such as [36], [37] and [38] focus on the role of dimensionality reduction in classification. Applications have been studied in diverse fields [39], [7], [40].

7.4.1 Classification of Seismic Events

Various studies have used machine learning for the problem of seismic event classification; some of the methods that were applied are: artificial neural networks [42], [43], [44], self-organizing maps [45], [46], hidden Markov models [47], [48] and support vector machines [49]. We use a data set collected from a seismic catalog by the Geophysical Institute of Israel. The data set includes 46 earthquakes, 62 explosions and 62 noise waveforms. Each waveform was sampled at a frequency of 40 Hz. The length of each waveform is one minute; thus each consists of 2400 samples. Data was collected from two stations, each using a three-component seismometer, so the total number of views is 6.
The features extracted from the waveforms are termed Sonograms [41]. First, a spectrogram is computed using the short-time Fourier transform (STFT). The frequency scale is rearranged to be equally tempered on a logarithmic scale.

Fig. 17: Top: a sample image from the first camera (X). Bottom: a sample image from the second camera (Y ).

The final spectrogram contains 11 frequency bands. The bins are normalized such that the sum of energy in every frequency band is equal to 1. The resulting sets of sonograms, denoted as X1, X2, X3, X4, X5, X6, are the input views for our framework. We apply the proposed framework, as well as single-view DM, kernel product DM and kernel sum DM. Classification is performed using K-NN (K = 1), based on 3 or 4 coordinates from the reduced mapping. The results are presented in Table 1.
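A rough sketch of this feature-extraction and classification recipe is shown below. The STFT window length, the exact band edges and the classifier settings are our own assumptions; only the overall structure (spectrogram, 11 logarithmically spaced bands, per-band energy normalization, 1-NN on a few coordinates of the reduced mapping) follows the text.

```python
import numpy as np
from scipy.signal import stft
from sklearn.neighbors import KNeighborsClassifier

def sonogram(waveform, fs=40, n_bands=11):
    """Sonogram features: STFT energies regrouped into logarithmically spaced bands,
    each band normalized to unit total energy (band edges are an assumption)."""
    f, t, Z = stft(waveform, fs=fs, nperseg=128)
    S = np.abs(Z) ** 2
    edges = np.logspace(np.log10(f[1]), np.log10(f[-1]), n_bands + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (f >= lo) & (f < hi)
        bands.append(S[mask].sum(axis=0) if mask.any() else np.zeros_like(t))
    B = np.vstack(bands)                                       # (11, time) sonogram
    B = B / np.maximum(B.sum(axis=1, keepdims=True), 1e-12)    # per-band normalization
    return B.reshape(-1)                                       # one feature vector per waveform

# 1-NN classification on a few coordinates of the reduced mapping (psi_* are hypothetical):
# clf = KNeighborsClassifier(n_neighbors=1).fit(psi_train[:, :3], labels_train)
# acc = clf.score(psi_test[:, :3], labels_test)
```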


Fig. 18: Top: DM-based single-view mapping Ψ(X). Bottom: DM-based single-view mapping Ψ(Y). The removed images caused a bow-shaped structure.

Table 1: Classification accuracy using 1-fold cross validation.

Method                 Accuracy [%]
Single View DM (X1)    89.9
Single View DM (X2)    88.6
Single View DM (X3)    89.3
Single View DM (X4)    89.3
Single View DM (X5)    89.3
Single View DM (X6)    88.6
Kernel Sum DM          93.7
Kernel Product DM      94.3
Multiview DM           97.5


Fig. 19: Top: Mapping $\hat{\Psi}(X)$. Bottom: Mapping $\hat{\Psi}(Y)$, as extracted by the multiview-based framework. Two small gaps, which correspond to the removed images, are visible.


Fig. 20: A standard diffusion mapping (Kernel Product-based) that was computed using the concatenated vectors from both views, corresponding to kernel K◦.


Fig. 21: Top: DM-based single-view mapping Ψ(X). Bottom: DM-based single-view mapping Ψ(Y). The Gaussian noise deformed the circular structure.


Fig. 22: Top: Mapping $\hat{\Psi}(X)$. Bottom: Mapping $\hat{\Psi}(Y)$, as extracted by the multiview framework. Two gaps, which correspond to the Gaussian noise, are visible.


Fig. 23: A standard diffusion mapping (Kernel Product-based) computed using the concatenated vectors from both views (corresponding to kernel P◦, Eq. (7)).

8 Conclusions

In this paper, we presented a multiview-based framework for dimensionality reduction. The method enables us to extract simultaneous embeddings from coupled views. We enforce a cross-domain probabilistic model at a single time step, in which the transition probabilities depend on the connectivities in both views. We derived various theoretical aspects of the proposed method and demonstrated its applicability to both artificial and real data. The experimental results demonstrate the strength of the proposed framework in cases where data is missing in each view or where each of the manifolds is deformed by an unknown function. The framework is applicable to various real-life machine learning tasks that involve multiple views or multiple modalities. We also considered the influence of various normalization methods on the extracted embedding. Future work will include both the application and the theoretical analysis of such normalizations.

References

1. I. Jolliffe, Principal Component Analysis, 2005, vol. 21.
2. J. B. Kruskal and M. Wish, "Multidimensional scaling," Sage Publications, Beverly Hills, 1977.
3. S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.
4. W. Luo, "Face recognition based on Laplacian eigenmaps," 2011, pp. 416-419.
5. R. R. Coifman and S. Lafon, "Diffusion maps," Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 5-30, 2006.
6. A. Singer and R. R. Coifman, "Non-linear independent component analysis with diffusion maps," Applied and Computational Harmonic Analysis, vol. 25, no. 2, pp. 226-239, 2008.
7. O. Lindenbaum, A. Yeredor, and I. Cohen, "Musical key extraction using diffusion maps," Signal Processing, 2015.
8. W. T. Freeman and J. B. Tenenbaum, "Learning bilinear models for two-factor problems in vision," in Computer Vision and Pattern Recognition, Proceedings, IEEE Computer Society Conference on. IEEE, 1997, pp. 554-560.
9. I. S. Helland, "Partial least squares regression and statistical models," Scandinavian Journal of Statistics, pp. 97-114, 1990.
10. K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, "Multi-view clustering via canonical correlation analysis," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 129-136.
11. A. Kumar, P. Rai, and H. Daumé, "Co-regularized multi-view spectral clustering," in Advances in Neural Information Processing Systems, 2011, pp. 1413-1421.
12. D. Zhou and C. Burges, "Spectral clustering and transductive learning with multiple views," in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 1159-1166.
13. B. Wang, J. Jiang, W. Wang, Z.-H. Zhou, and Z. Tu, "Unsupervised metric fusion by cross diffusion," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2997-3004.
14. A. Kumar and H. Daumé, "A co-training approach for multi-view spectral clustering," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 393-400.
15. B. Boots and G. Gordon, "Two-manifold problems with applications to nonlinear system identification," 2012.
16. H. Hotelling, "Relations between two sets of variates," Biometrika, pp. 321-377, 1936.
17. V. R. de Sa, "Spectral clustering with two views," in ICML Workshop on Learning with Multiple Views, 2005.
18. R. R. Lederman and R. Talmon, "Common manifold learning using alternating-diffusion," submitted, Tech. Report YALEU/DCS/TR1497, 2014.
19. P. L. Lai and C. Fyfe, "Kernel and nonlinear canonical correlation analysis," International Journal of Neural Systems, vol. 10, no. 05, pp. 365-377, 2000.
20. S. Akaho, "A kernel method for canonical correlation analysis," arXiv preprint cs/0609071, 2006.
21. S. Lafon, Y. Keller, and R. Coifman, "Data fusion and multicue data matching by diffusion maps," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1784-1797, 2006.
22. Y. Keller, R. R. Coifman, S. Lafon, and S. W. Zucker, "Audio-visual group recognition using diffusion maps," IEEE Transactions on Signal Processing, vol. 58, no. 1, pp. 403-413, 2010.
23. M. Hirn and R. Coifman, "Diffusion maps for changing data," Applied and Computational Harmonic Analysis, 2013.
24. Y. Shkolnisky, Lecture notes on spectral methods in data analysis, https://sites.google.com/site/yoelshkolnisky/teaching.
25. S. Lafon, "Diffusion maps and geometric harmonics," Ph.D. dissertation, Yale University, 2004.
26. T. Ando, "Majorization relations for Hadamard products," Linear Algebra and its Applications, pp. 57-64, 1995.
27. G. Visick, "A weak majorization involving the matrices A∘B and AB," Linear Algebra and its Applications, vol. 223/224, pp. 731-744, 1995.
28. A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, vol. 14. Cambridge, MA: MIT Press, 2001, pp. 849-856.
29. M. A. Mouchet, S. Villéger, N. W. Mason, and D. Mouillot, "Functional diversity measures: an overview of their redundancy and their ability to discriminate community assembly rules," Functional Ecology, vol. 24, no. 4, pp. 867-876, 2010.
30. P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.
31. S. Petrovic, "A comparison between the silhouette index and the Davies-Bouldin index in labelling IDS clusters," in Proceedings of the 11th Nordic Workshop of Secure IT Systems, 2006, pp. 53-64.
32. P. A. Estévez, M. Tesmer, C. A. Perez, and J. M. Zurada, "Normalized mutual information feature selection," IEEE Transactions on Neural Networks, vol. 20, no. 2, pp. 189-201, 2009.
33. A. Singer, R. Erban, I. Kevrekidis, and R. R. Coifman, "Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diffusion maps," vol. 106, no. 38, 2009, pp. 16090-16095.
34. M. Polito and P. Perona, "Grouping and dimensionality reduction by locally linear embedding," in NIPS, 2001, pp. 1255-1262.
35. G. Piella, "Diffusion maps for multimodal registration," Sensors, vol. 14, no. 6, pp. 10562-10577, 2014.
36. W. Wang and M. A. Carreira-Perpiñán, "The role of dimensionality reduction in classification," in AAAI, 2014, pp. 2128-2134.
37. T. S. Tian, "Dimensionality reduction for classification with high-dimensional data," Ph.D. dissertation, Citeseer, 2009.
38. F. Plastria, S. De Bruyne, and E. Carrizosa, "Dimensionality reduction for classification," in International Conference on Advanced Data Mining and Applications. Springer, 2008, pp. 411-418.
39. B. Li, C.-H. Zheng, D.-S. Huang, L. Zhang, and K. Han, "Gene expression data classification using locally linear discriminant embedding," Computers in Biology and Medicine, vol. 40, no. 10, pp. 802-810, 2010.
40. X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, "Face recognition using Laplacianfaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 328-340, 2005.
41. J. H. Kurz, C. U. Grosse, and H.-W. Reinhardt, "Strategies for reliable automatic onset time picking of acoustic emissions and of ultrasound signals in concrete," Ultrasonics, vol. 43, no. 7, pp. 538-546, 2005.
42. T. Tiira, "Discrimination of nuclear explosions and earthquakes from teleseismic distances with a local network of short period seismic stations using artificial neural networks," Physics of the Earth and Planetary Interiors, vol. 97, no. 1, pp. 247-268, 1996.
43. E. Del Pezzo, A. Esposito, F. Giudicepietro, M. Marinaro, M. Martini, and S. Scarpetta, "Discrimination of earthquakes and underwater explosions using neural networks," Bulletin of the Seismological Society of America, vol. 93, no. 1, pp. 215-223, 2003.
44. A. Esposito, F. Giudicepietro, S. Scarpetta, L. D'Auria, M. Marinaro, and M. Martini, "Automatic discrimination among landslide, explosion-quake, and microtremor seismic signals at Stromboli volcano using neural networks," Bulletin of the Seismological Society of America, vol. 96, no. 4A, pp. 1230-1240, 2006.
45. B. Sick, M. Guggenmos, and M. Joswig, "Chances and limits of single-station seismic event clustering by unsupervised pattern recognition," Geophysical Journal International, vol. 201, pp. 1801-1813, 2015.
46. A. Köhler, M. Ohrnberger, and F. Scherbaum, "Unsupervised pattern recognition in continuous seismic wavefield records using self-organizing maps," Geophysical Journal International, vol. 185, pp. 1619-1630, 2010.
47. M. Beyreuther, C. Hammer, M. Wassermann, M. Ohrnberger, and M. Megies, "Constructing a hidden Markov model based earthquake detector: application to induced seismicity," Geophysical Journal International, vol. 189, pp. 602-610, 2012.
48. C. Hammer, M. Ohrnberger, and D. Fäh, "Classifying seismic waveforms from scratch: a case study in the Alpine environment," Geophysical Journal International, vol. 192, pp. 425-439, 2013.
49. J. Kortström, M. Uski, and T. Tiira, "Automatic classification of seismic events within a regional seismograph network," Computers and Geosciences, vol. 87, pp. 22-30, 2016.