Arxiv:1508.05550V7 [Cs.LG] 5 Jun 2019 Tions Based on the Intrinsic Geometry of the Analyzed Data

Multi-View Diﬀusion Maps

Oﬁr Lindenbauma, Arie Yeredora, Moshe Salhovb, Amir Averbuchb

aSchool of Electrical Engineering, Tel Aviv University, Israel bSchool of Computer Science, Tel Aviv University, Israel

Abstract In this paper, we address the challenging task of achieving multi-view dimensionality reduction. The goal is to effectively use the availability of multiple views for extracting a coherent low-dimensional representation of the data. The proposed method exploits the intrinsic relation within each view, as well as the mutual relations between views. The multi-view dimensionality reduction is achieved by defining a cross-view model in which an implied random walk process is restrained to hop between objects in the different views. The method is robust to scaling and insensitive to small structural changes in the data. We define new diffusion distances and analyze the spectra of the proposed kernel. We show that the proposed framework is useful for various machine learning applications such as clustering, classification, and manifold learning. Finally, by fusing multi-sensor seismic data we present a method for automatic identification of seismic events. Keywords: Dimensionality reduction, Manifold learning, Diffusion Maps, Multi-view.

1. Introduction

high-dimensional big data are becoming ubiquitous in a growing variety of fields, and pose new challenges in their analysis. Extracted features are useful in analyzing these datasets. However, some prior knowledge or modeling is required in order to identify the essential features. Unsupervised dimensionality reduction methods, on the other hand, aim to find low-dimensional representa- arXiv:1508.05550v7 [cs.LG] 5 Jun 2019 tions based on the intrinsic geometry of the analyzed data. A “good” dimensionality reduction methodology reduces the complexity of the data, while pre- serving its coherency, thereby facilitating data analysis tasks (such as clustering, classification, manifold learning and more) in the reduced space. Many methods such as Principal Component Analysis (PCA) [1], Multidimensional Scaling (MDS) [2], Local Linear Embedding [3], Laplacian Eigenmaps (LE) [4], Diffusion

Email address: [email protected] (Oﬁr Lindenbaum)

Preprint submitted to Elsevier June 6, 2019 Maps (DM) [5], p-Laplacian Regularization [6], T-distributed Stochastic Neigh- bor Embedding (t-SNE) [7], Uniform manifold approximation [8] and more have been proposed to achieve robust dimensionality reduction. low-dimensional representations have been shown to be useful in various applications such as face recognition based on LE [9], Non-linear independent component analysis using DM [10], Musical Key extraction employing DM [11], and many more. Frameworks such as [1, 2, 3, 4, 5, 6, 7, 8] do not consider the possibility to exploit more than one view representing the same process. Such multiple views can be obtained, for example, when the same underlying process is observed using several different modalities, or measured with different instrumentations. An additional view can provide meaningful insights regarding the dynamical process behind the data, whenever the data is indeed generated or governed by such a latent process. This study is devoted to the development of a framework for obtaining a low- dimensional parametrization from multiple measurements organized as “views”. Our approach essentially relies on quantifying of the speed of a random walk restricted to “hop” between views. The analysis of such a random walk provides a natural representation of the data using joint l organization of the multiple measurements. The problem of learning from multiple views has been studied in several domains. Some prominent statistical approaches, for addressing this problem are Bilinear Models [12], Partial Least Squares [13] and Canonical Correlation Analysis (CCA) [14]. These methods are powerful for learning the relations among the different views, but are restricted to linear transformation for each view. Some methods such as [15, 16, 17, 18, 19, 20] extend the CCA to nonlinear transformations by introducing a kernel, and demonstrate the advantages of fusing multiple views for clustering and classification. A Hessian multiset canonical correlations is presented in [21], the authors propose to overcome lim- itations of the Laplacian by defining local Hessians on the samples to capture structures of multiple views. Markov based methods such as [22, 23, 24] use diffusion distances for classification, clustering or retrieval tasks. The “agree- ment” (also called “consensus”) between different views is used in [25] to extract the geometric information from all views. A sparsity-based learning algorithm for cross-view dimensionality reduction is proposed in [26]. A neural network which extracts maximally correlated representations is studied in [27]. A re- view comprehensive article [28] presents recent progress and challenges in this domain. In this work we extend the concepts established in [19, 29] to devise a diffusion-based learning framework for fusing multiple views, by seeking the implied underlying low-dimensional structure in the ambient space. Our contri- butions can be summarized as follows. 1. We present a natural generalization of the DM method [5] for handling multiple views. The generalization is attained by combining the intrinsic relations within each view with the mutual relations between views, so as

2 to construct a multi-view kernel matrix. The proposed kernel defines a cross-view diffusion process, and related diffusion distances, which impose a structured random walk between the various views. In addition, we show that the spectral decomposition of the proposed kernel can be used to find an embedding of the data that offers improved exploitation of the information encapsulated in the multiple views. 2. For the coupled views setting, we analyze theoretical properties of the proposed method. The associated parametrization is justified by being recast as a minimizer of a kernel-based objective function. Then, we relate the spectrum of the proposed kernel to the spectrum of a single-view kernel which is based on na¨ıve concatenation of the views. Finally, under some simplification, we find the infinitesimal generator of the proposed kernel. 3. For practical use, we present an automated method for setting the kernel bandwidth parameter and a proposed procedure for recursively augment- ing the representation with each new data point. 4. We demonstrate the applicability of our multi-view method to classification, clustering and manifold learning. Furthermore, by fusing data from multiple seismic sensors we demonstrate an automatic extraction of latent seismic parameters using real-world data. The paper is structured as follows. Some essential background is provided in section 2. In section 3 we formulate the multi-view dimensionality reduction problem and discuss prior work. Then, in Section 4 we present our proposed multi-view DM method and discuss some of its basic properties. Section 5 studies additional theoretical properties of the proposed kernel for the particular (more simple) case of a coupled setting (where only two views are available). Section 6 presents the experimental results. Potential applications are described in Section 7 and concluding remarks appear in Sections 8 and 9 (respectively).

2. Background

2.1. General dimensionality reduction D×M Consider a high-dimensional dataset X = {x1, x2,..., xM } ∈ R , xi ∈ RD, i = 1, 2,...,M. The goal is to ﬁnd a low-dimensional representation Z = r×M r {z1, z2,..., zM } ∈ R , zi ∈ R , i = 1, 2,...,M, such that r D, while latent “inner relations” (if any) among the multidimensional data points are preserved as closely as possible, in some sense. This problem setup is based on the assumption that the data is represented (viewed) in a single vector space (single view).

2.2. Diﬀusion Maps (DM) DM [5] is a dimensionality reduction method which aims at extracting the intrinsic geometry of the data. The DM framework is highly eﬀective when the data is densely sampled from some low-dimensional manifold, so that its “inner

3 relations” are the local connectivities (or proximities) on the manifold. Given a high-dimensional dataset X, the DM framework consists of the following steps: 1. A kernel function K : X × X → R is chosen, so as to construct a matrix M×M 4 K ∈ R with elements Kij = K(xi, xj), satisfying the following properties: (i) Symmetry: K = KT; (ii) Positive semi-defeniteness: K 0, namely ∀v ∈ RM : vTKv ≥ 0; and (iii) Non-negativity: K ≥ 0, namely Ki,j ≥ 0 ∀i, j ∈ [1,M]. These properties guarantee that the matrix K has real-valued eigenvectors and non-negative eigenvalues. A Gaussian kernel 2 4 n kxi−xj k o is a common example, in which K(xi, xj) = exp − 2 , where k·k 2σx 2 denotes the L2 norm and σx is a user-selected width (scale) parameter. 2. By normalizing the rows of K, the matrix

x 4 −1 M×M P = D K ∈ R (1) M×M P is obtained, where D ∈ R is a diagonal matrix with Di,i = j Ki,j. P x can be interpreted as the transition probabilities of a (ﬁctitious) Markov x t 4 chain on X, such that [(P ) ]i,j = pt(xi, xj) (where t is an integer power) describes the implied probability of transition from point xi to point xj in t steps. 3. Spectral decomposition is applied to P x, obtaining a set of M eigenvalues {λm} (in descending order) and associated normalized eigenvectors {ψm} x x satisfying P ψm = λmψm, m = 0,...,M − 1, or P Ψ = ΨΛ, where Ψ and Λ are (resp.) the eigenvectors and diagonal eigenvalues matrices. 4. A new representation is deﬁned for the dataset X, representing each xi by the i-th row of (P x)tΨ = ΨΛt, namely:

t t T M−1 Ψt(xi): xi 7→ λ1ψ1[i], . . . , λM−1ψM−1[i] ∈ R , i ∈ [1,M] (2)

where t is a selected number of steps and ψm[i] denotes the i-th element of ψm. Note that the trivial eigenvector ψ0 = 1 (with corresponding eigenvalue λ0 = 1) was omitted from the representation as it does not carry information about the data. The main idea behind this representation is that the Euclidean distance between two data points in the new representation equals a weighted L2 distance between the conditional probability x t vectors pt(xi, :) and pt(xj, :), i, j ∈ [1,M] (the i-th and j-th rows of (P ) ). This weighted Euclidean distance is referred to as the diﬀusion distance, denoted

2 4 2 Dt (xi, xj) = kΨt(xi) − Ψt(xj)k M−1 X 2t 2 2 = λm(ψm[i] − ψm[j]) = kpt(xi, :) − pt(xj, :)kW −1 , (3) m=1

T 4 T where W = D/Trace{D} and where kz kQ = z Qz denotes a Q- weighted Euclidean norm (the last equality is shown in [5]).

4 5. A desired accuracy level δ ≥ 0 is chosen for the diffusion distance de- t t fined by Eq. (3), such that r(δ, t) = max{` : |λ`| > δ|λ1| }. The new (truncated) r(δ, t)-dimensional mapping, which leads to the desired representation Z, is then defined as

δ t t t T 4 r(δ,t) Ψt (xi): xi 7→ λ1ψ1[i], λ2ψ2[i]. . . . , λrψr[i] = zi ∈ R . (4)

This dimensionality reduction approach was found to be useful in various applications in diverse ﬁelds. However, as previously noted, it is limited to a single-view representation.

3. Multi-view dimensionality reduction - Problem formulation and prior work

Assume now that we are given multiple sets (views) of observations X` , ` = ` ` ` ` D`×M 1, ..., L. Each view is a high-dimensional dataset X = {x1, x2, ..., xM } ∈ R , where D` is the dimension of each feature space. Note that a bijective correspondence between views is assumed. These views can be a result of different measurement devices or different types of features extracted from raw data. For each view ` = 1, ..., L, we seek a lower-dimensional representation that preserves the “inner relations” between multidimensional data points within each view X`, as well as among all views {X1, ..., XL}. Our goal is to find L repre- 1 L D r sentations Φ`(X , ..., X ): R ` → R , such that r D`, ` = 1, ..., L. Before turning to present our proposed approach, we describe existing alternative approaches for incorporating multiple views. In particular, we detail here the methods which we would later use as benchmarks for comparison to our proposed framework. All of these methods begin by computing the kernel matrix K` for each (`-th) view (` = 1,...L), and differ in the way of combining these kernels for generating the DM.

3.1. Kernel Product DM (KP) A na¨ıve generalization of DM [5] may be computed using an element-wise ◦ 4 1 2 L kernel product, namely by constructing K = K ◦ K ◦ ... ◦ K ∈ RM×M , ◦ 4 where ◦ denotes Hadamard’s (element-wise) matrix product, such that Ki,j = 1 2 L ◦ −1 ◦ M×M Ki,j · Ki,j · ... · Kij, followed by row-normalization: P = (D ) K ∈ R , ◦ ◦ 4 P ◦ where D is diagonal, with Di,i = Ki,j. j Note that in the special case of using Gaussian kernels with equal width parameters σ, the resulting matrix K◦ is equal to the matrix Kw constructed from the concatenated observations vector

T h 1T LTi wi = xi , ..., xi (5)

5 such that ( ) kw − w k2 Kw = exp − i j . (6) i,j 2σ2

3.2. Kernel Sum DM (KS) An average diﬀusion process was used in [23], where the sum kernel is deﬁned as L 4 X K+ = K`, (7) `=1 + +−1 + + P + followed by row-normalization: P = D K , where Di,i = Ki,j. The j implied random walk sums up (and normalizes) the step probabilities from each view.

3.3. Kernel Canonical Correlation Analysis (KCCA) The frameworks [15, 16] extend the well-known Canonical Correlation Anal- ysis (CCA) by applying a kernel function prior to the application of CCA. Kernels K1 and K2 are constructed for each view, and the canonical vectors v1 and v2 are obtained by solving the following generalized eigenvalue problem

1 2 1 2 0M×M K · K v1 (K + γI) 0M×M v1 2 1 = ρ · 2 2 , (8) K · K 0M×M v2 0M×M (K + γI) v2 where γI are regularization terms added to prevent overﬁtting and thus improve generalization of the method. Usually the Incomplete Cholesky Decomposition (ICD) [15, 16, 30] is used to reduce the run time required for solving (8). For clustering tasks, K-means clustering is applied to the set of generalized eigenvectors.

3.4. Spectral clustering with two views The approach in [19] generalizes the traditional normalized graph Laplacian for two views. Kernels K1 and K2 are computed in each view and multiplied, yielding W = K1 · K2, from which 4 0M×M W 2M×2M A = T ∈ R (9) W 0M×M is obtained. Symmetric normalization is applied by using the diagonal matrix P D¯ with diagonal elements D¯ i,i = Wi,j, such that the normalized fused kernel j is deﬁned as A¯ = D¯ −0.5 · A · D¯ −0.5. (10)

6 Assuming that the data can be clustered into NC clusters, and denoting the 2M ¯ eigenvectors of A as φi, i = 1, ..., 2M (in descending order of the corresponding eigenvalues), the mapping

4 1 T Φ[i] = [φ [i], ..., φ [i]] ∈ NC (11) s[i] 1 NC R

PNC 2 (where s[i] = j=1(φj [i])) is useful for applying K-means clustering. The work in [19] is focused on spectral clustering, however, a similar version of the kernel from Eq. (10) is suitable for manifold learning, as we demonstrate in the sequel. In section 6 this approach is referred to as de Sa’s. As We show in the next section, our proposed approach essentially uses a stochastic matrix version of this kernel, and extends the construction to multiple views. We shall show that our stochastic matrix version is useful, as it provides various theoretical justiﬁcations for the implied multi-view diﬀusion process.

4. Our Proposed Approach for Multi-view Diﬀusion Maps

Here we propose our generalization of the DM framework for handling a multi-view scenario. This is done by imposing an implied (fictitious) random walk model using the local connectivities between data points within all views. Our way to generalize the DM framework is by restraining the random walker to “hop” between different views in each step. The construction requires to choose a set of L symmetrical positive semi-definite, non-negative kernels, one for each l l view Kl : X × X → R , l = 1, ..., L. We use the Gaussian kernel function, so l that the (i, j)-th element of each K ∈ RM×M is given by

( l l 2 ) l ||xi − xj|| Ki,j = exp − 2 , i, j = 1, . . . , M, l = 1, . . . , L, (12) 2σl

2 L where σl l=1 are a set of selected parameters. The decaying property of the Gaussian is useful, as it removes the inﬂuence of large Euclidean distances. The multi-view kernel is formed by constructing the following matrix

 1 2 1 3 1 L 0M×M K K K K ... K K 2 1 2 3 2 L K K 0M×M K K ... K K   3 1 3 2 3 L LM×LM Kc = K K K K 0M×M ... K K  ∈ R . (13)    ::: ... :  L 1 L 2 L 3 K K K K K K ... 0M×M .

LM×LM Finally, using the diagonal row-normalization matrix Db ∈ R with Dbi,i = P Kbi,j, the normalized row-stochastic matrix is deﬁned as j

−1 LM×LM Pb = Db Kc ∈ R . (14)

7 We refer to the (l, m)-th block of Pb as the square M × M matrix starting at [1 + (l − 1)M, 1 + (m − 1)M], l, m = 1, ..., L. Thus, the (i, j)-th element of the l m (l, m)-th block describes a (ﬁctitious) probability of transition from xi to xj . This construction takes into account all possibilities to “hop” between views, under the constraint that staying in the same view (namely, a transition from l l xi to xj) is forbidden.

t 4.1. Probabilistic interpretation of Pb Subsequent to our proposed construction (Eqs. (12), (13) and (14)), each t element of the power-t matrix Pb ,

h ti 4 l m Pb = pt(xi, xj ) (15) i+(l−1)M,j+(m−1)M b

l m denotes the probability of transition from xi to xj in t time-steps. Note that, as mentioned above, due to the block-oﬀ-diagonal structure of Pb , which forbids l l a transition into the same view within a single time-step, we have pb1(xi, xj) = l l 0, l = 1,...,L, although for t > 1 pbt(xi, xj) may be non-zero. 4.1.1. Smoothing eﬀect t = 1 l m For simplicity let us examine the term pbt(xi, xj ), l 6= m for t = 1. The transition probability for t = 1 is

P l m l m s Ki,sKs,j pb1(xi, xj ) = . Dbi,i This probability takes into consideration all the various connectivities of node l l m xi to node xs and the connectivities of the corresponding node xs to the m destination node xj . The proposed multi-view approach has a “smoothing effect” in terms of the transition probabilities, meaning that the probability of l m l m transitioning from xi to xj could be positive even if Ki,j = 0 and Ki,j = 0: Assume that a non-empty subset S = {s1, ..., sF } ⊆ [1,M] exists, such that Kl > 0 and Km > 0, f = 1, .., F . Then by definition of the multi-view i,sf sf ,j l m probability we have pb1(xi, xj ) > 0. Note that although with the Gaussian kernel all elements of Kl are positive, the smoothing effect can still be significant l m l m when despite negligibly low values in Ki,j and in Ki,j, pb1(xi, xj ) can become relatively high. Figure 1 illustrates the multi-view transition probabilities compared to a single-view approach using two (L = 2) deformed “Swiss Roll” manifolds. In each view, there is no probability of transition from one side of its observed “gap” to the other side. And yet, the multi-view (one-step) transition probability is non-zero for points at both sides of the gaps. This smoothing effect occurs because the gaps are located at a different position (near different points) on each view, thus allowing the multi-view kernel to smooth out the nonlinear gaps.

8 Figure 1: Top-left: A non-smooth Swiss Roll sampled from View-I (X1), colored by the single-view probability of transition (t = 1) from x1 to x:. Top-right: A second Swiss Roll, sampled from View-II (X2), colored by the single-view probability of transition (t = 1) from y1 to y:. Bottom-left: the ﬁrst Swiss Roll, colored by the multi-view probabilities of transition (t = 1) from xi to y:. In the top-left ﬁgure, we highlight x1 using an arrow. Bottom-right: a low-dimensional representation extracted based on the multi-view transition matrix Pb (Eq. 14).

4.1.2. Increasing the diffusion step t Under the stochastic Markov model assumption, raising Pb to a higher power (by increasing the diffusion step t) spreads the probability mass function along its rows based on the connectivities in all views. As described in [5], this probability spread reduces the influence of eigenvectors associated with high-indexed (smaller) eigenvalues on the diffusion distance (Eq. (16)). This implies that the eigenvectors corresponding to low-indexed (large) eigenvalues have a low- frequency content, whereas the eigenvectors corresponding to the high-indexed (small) eigenvalues describe the oscillatory behavior of the data [5]. In Fig. 2, t we present the eigenvalues of the matrix Pb with different values of t. For this experiment we generated L = 3 Swiss Rolls with M = 1,200 data points each. t It is evident that the numerical rank of Pb decreases for higher values of t.

Figure 2: The decay of the eigenvalues for increasing powers of the matrix Pb .

9 4.2. Multi-view diffusion distance In a variety of real-world data types, the Euclidean distance between observed data points (vectors) does not provide sufficient information about the intrinsic relations between these vectors, and is highly sensitive to non-unitary transformations. Common tasks such as classification, clustering or system identification often require a measure for the intrinsic connectivity between data points, which is only locally expressed by the Euclidean distance in the high-dimensional ambient space. The multi-view diffusion kernel (defined in section 4) describes all the small local connections between data points. The t row stochastic matrix Pb (Eq. (14)) accounts for all possible transitions between data points in t time steps while hopping between the views. For a fixed value t > 0, two data points are intrinsically similar if the conditional distributions t t pbt(xi, :) = [Pb ]i,: and pbt(xj, :) = [Pb ]j,: are similar. This type of similarity measure indicates that the points xi and xj are similarly connected (in the sense of similar probabilities of transitions) to several mutual points. Thus, they are connected by some geometrical path. In many cases, a small Euclidean distance can be misleading due to the fact that two data points can be “close” without having any geodesic path that connects them. Comparing the transition probabilities is more robust, as it takes into consideration all of the local connectivities between the compared points. Therefore, even if two points seem far from one another in the Euclidean sense, they may still share many common neighbors and thus be “close” in the sense of having a small diffusion distance. Based on this observation, by expanding the single-view construction given in [5], we define the weighted inner view diffusion distances for the first view as

L·M 2 1 h ti h ti t 2 2 1 1 4 X T Dt (xi , xj ) = Pb − Pb = (ei − ej) Pb −1 , (16) ˜ i,k j,k Db k=1 φ0(k) where 1 ≤ i, j ≤ M, ei is the i-th column of an L · M × L · M identity matrix, ˜ φ0 is the (non-normalized) ﬁrst left eigenvector of Pb, whose k-th element is ˜ φ0(k) = Dbk,k. A similarly weighted norm is deﬁned for the l-th view

L·M 2 1 h ti h ti t 2 2 l l 4 X T Dt (xi, xj) = Pb − Pb = (e˜l+i − e˜l+j) Pb −1 , ˜ i+˜l,k j+˜l,k Db k=1 φ0(k) (17) where ˜l = (l − 1) · M. The main advantage of these distances (Eqs. (16) and (17)) is that they can be expressed in terms of the eigenvalues and eigenvectors of the matrix Pb. This insight allows us to use a representation (defined in section 4.3 below) where the induced Euclidean distance is proportional to the diffusion distances defined in Eqs. (16) and (17). Indeed, let Pb = ΨΛΦT denote the eigenvalues decomposition of Pb , where Ψ = [ψ0, . . . , ψL·M−1] and Φ = [φ0, . . . , φL·M−1] denote the matrices of (normalized) right- and left-eigenvectors 4 (resp.), and Λ = Diag(λ0, . . . , λL·M−1) denotes the diagonal matrix of respective eigenvalues. Then

10 Theorem 1. The inner view diﬀusion distance deﬁned by Eqs. (16) and (17) can also be expresses as

L·M−1 2 2 l l X 2t h ˜i h ˜i Dt (xi, xj) = λk ψk i + l − ψk j + l , i, j = 1, ..., M, (18) k=1 where ˜l = (l − 1) · M.

Proof. Using the eigenvalues decomposition of Pb we have

t −1 tT −1 T Pb Db Pb = ΨΛtΦTDb ΦΛtΨb . (19)

4 −1/2 −1/2 1/2 −1/2 Deﬁne the symmetric matrix Pb s = Db KcDb = Db Pb Db , and note that this matrix is algebraically similar to Pb . Therefore, both matrices share the same set of eigenvalues Λ. Additionally, let Π denote the (left- and right-) T orthonormal eigenvectors matrix of Pb s, namely Pb s = ΠΛΠ . The left- and −1/2 1/2 right-eigenvectors matrices of Pb = Db Pb sDb are then easily identiﬁed 1/2 −1/2 −1 as Φ = Db Π and Ψ = Db Ψ (resp.), so that ΦTDb Φ = ΠTΠ = I. Therefore,

t 2 t −1 tT 2 l l T T Dt (xi, xj) = (e˜l+i − e˜l+j) Pb −1 = (e˜l+i−e˜l+j) Pb Db Pb (e˜l+i−e˜l+j) Db L·M−1 2 T 2t T X 2t h ˜i h ˜i = (e˜l+i−e˜l+j) ΨΛ Ψ (e˜l+i−e˜l+j) = λk ψk i + l − ψk j + l . k=1 (20)

The term with k = 0 can be excluded from the sum since ψ0 = 1 (an all-ones vector) is always the ﬁrst right-eigenvector (with eigenvalue 1) for any stochastic matrix, so the term corresponding to k = 0 in the sum would vanish anyway.

4.3. Multi-view Data Parametrization Tasks such as classiﬁcation, clustering or regression in a high-dimensional feature space are considered to be computationally expensive. In addition, the performance in such tasks is highly dependent on the distance measure used. As explained in section 4.2, distance measures in the original ambient space are often meaningless in many real life situations. Interpreting Theorem 1 in terms of Euclidean distance enables us to deﬁne mappings for every view Xl, l = 1, ..., L, t using the right eigenvectors of Pb (Eq. (14)) weighted by λi. The representation for instances in Xl is given by

l l t ˜ t ˜T M−1 Ψb t(xi): xi 7−→ λ1ψ1[i + l], ..., λM−1ψM−1[i + l] ∈ R , (21) where ˜l = (l − 1) · M. These L mappings capture the intrinsic geometry of the views as well as the mutual relations between them. As shown in [31], the set of

11 eigenvalues λm has a decaying property such that 1 = |λ0| ≥ |λ1| ≥ ... ≥ |λM−1|. Exploiting the decaying property enables us to represent data up to a dimension r where r D1, ..., Dl. The dimension r ≡ r(δ) is determined by approximating the diffusion distance (Eq. (18)) up to a desired accuracy δ (we elaborate on this issue in section 5.4). The reduced dimension version of Ψb t(X) is denoted r by Ψb t (X). Using the inner view diffusion distances defined in Eqs. (16) and (17), we define a multi-view diffusion distance as the sum of inner views distances,

L 2 r r 2 (MV ) 4 X l l Dt (i, j) = Ψb t (xi) − Ψb t (xj) . (22) l=1 This distance is the induced Euclidean distance in a space constructed from the concatenation of all low-dimensional multi-view mappings

h r 1 r 2 r L i L·r×M Ψb t(X) = Ψb t (X ); Ψb t (X ); ...; Ψb t (X ) ∈ R . (23)

This mapping is used in section 7.1 for the experimental evaluation of clustering.

4.4. Multi-view kernel bandwidth When constructing the Gaussian kernels Kl, l = 1, ..., L, in Eq. (12), the 2 values of the scale (width) parameter σl have to be set. Setting these values to be too small may result in very small local neighborhoods that are unable to capture the local structures around the data points. Conversely, setting the values to be too large may result in a fully connected graph that may generate a too coarse description of the data. Several approaches have been proposed in the literature for determining the kernel width (scale): In [32], a max-min measure is suggested such that the scale becomes 2 l l 2 σl = C · max min(||xi − xj|| ) . (24) j i,i6=j

Based on empirical results, the authors in [32] suggest to set C within the range [1, 1.5]. The specific choice of C should be sufficient to capture all local connectivities. This single-view approach could be relaxed in the context of our multi-view scenario. The multi-view kernel Kc (Eq. 13) consists of products of single-view kernel matrices Kl, l = 1, ..., L (Eq. 12). The diagonal values of each kernel matrix Kl are all 1’s, therefore, in order to have connectivity in all rows of Kc a connectivity in only one of the views is sufficient. By connectivity, we mean that there is at least one nonzero value in the affinity kernel. This insight suggests that a smaller value for the parameter C could be used in the multi view setting. In [33], the authors provide a probabilistic interpretation of the choice of C, and present a grid-search algorithm for optimizing the choice of C such that each point in the dataset is connected to at least one other point.

12 Figure 3: Left: an example of the two dimensional function S(σl, σm). Right: a slice at the −5 ﬁrst row (σ2 = 10 ). The asymptotes are clearly visible in both ﬁgures. Algorithm 1 exploits the multi-view to set a small scale parameter for both views.

Another scheme, proposed in [34], aims to find a range of values for σl. The idea is to compute the kernel Kl (Eq. (12)) for various values of σ and search for the range of values where the Gaussian bell shape is more pronounced. To find such range of valid values for σ, first a logarithmic function is applied to the sum of all elements of the kernel matrix, next the range is identified as the maximal range where the logarithmic plot is linear. We expand this idea for a multi-view scenario based on the following algorithm:

Algorithm 1 Multi-view kernel bandwidth selection Input: Multiple sets of observations (views) Xl, l = 1, ..., L. Output: Scale parameters {σ1, ..., σL}. l 1: Compute Gaussian kernels K (σl), l = 1, ..., L for several values of σl ∈ [10−5, 105]. lm PP lm 2: Compute for all pairs l 6= m: S (σl, σm) = Ki,j (σl, σm), where i j lm l m K (σl, σm) = K (σl) · K (σm). 3: for l = 1 : L do lm 4: Find the minimal value for σl such that S (σl, σm) is linear for all m 6= l. 5: end for

lm Note that the two dimensional function S (σl, σm) consists of two asymp- lm σl,σm→0 lm σl,σm→∞ 3 totes, S (σl, σm) −→ log(N), and S (σl, σm) −→ log(N ) = 3log(N), l m since for σl, σm → 0, both K and K approach the Identity matrix, and for l m σl, σm → ∞, both K and K approach all-ones matrices. An example of the lm plot S (σl, σm) for two views (L = 2) is presented in Fig. 3. The range of each σl should reﬂect the asymptotic behaviour, and may be determined empirically l l l l by choosing min(σl) median{||xi −xj||} and max(σl) median{||xi −xj||}. In practice the range [10−5, 105] is suﬃcient.

13 4.5. Computational Complexity Let us now consider the computational complexity required for extracting 1 L r×M l l l l l Dl×M Φl(X , ..., X ) ∈ R from X = {x1, x2, x3, ..., xM } ∈ R . The com- P 2 putational complexity of computing L kernels is O( l M Dl). Computing the L(L−1) 3 proposed normalized kernel Pb (Eq. 14) adds a complexity of O( 4 M ). The ﬁnal step requires the spectral decomposition of Pb , thus it is the most computationally expensive step and adds a complexity of O(L3M 3). However, due to the low rank, sparse nature of Pb , this spectral decomposition could be approximated using random projection methods such as in [35, 36, 37]. Using random projection the complexity of the spectral decomposition is improved to 2 2 O(L M log r). For the optional step proposed in Algorithm 1, if n0 values are chosen for σ, the additional complexity cost is of O(MLn0).

5. Coupled views L = 2

In this section we provide analytical results for the special case of a coupled data set (i.e L = 2). Some of the results could be expanded to a larger number of views but not in a straightforward manner. To simplify the notation in the 4 4 rest of this section we denote X = X1 and Y = X2.

5.1. Coupled mapping The mappings provided by our approach (Eq. (21)) are justiﬁed by the relations given by Eq. (18). In this subsection we provide some intuition on the relation between the mappings of X and Y . We focus on the analysis of a 1- x 4 T dimensional mapping for each view. Let ρ = ρ(X) = [ρ(x1), ρ(x2), ..., ρ(xM )] y 4 T and ρ = ρ(Y ) = [ρ(y1), ρ(y2), ..., ρ(yM )] denote such 1-dimensional map- 4 pings (one for each view) and deﬁne Kz = Kx·Ky, where Kx, Ky are computed from X and Y (resp.) based on Eq. (12). The following theorem characterizes the desirable 1-dimensional mappings ρx and ρy. Theorem 2. A 1-dimensional zero-mean representation which minimizes the following objective function

X 2 z min (ρ(xi) − ρ(yj)) Ki,j ρx,ρy i,j x 2 y 2 X s.t. kρ k + kρ k = 1, (ρ(xi) + ρ(yi)) = 0, (25) i

4 xT yT T is obtained by setting ρb = [ρ ρ ] = ψ1, where ψ1 is the ﬁrst non-trivial normalized eigenvector of the graph Laplacian Db − Kc. The scaling constraint avoids arbitrary scaling of the representation, while the zero-mean constraint (implying orthogonality of ρb to the constant all-ones vector 1) eliminates any presence of a trivial (constant) component in the representations.

14 Note that Kz is an aﬃnity measure of points based on Kx ·Ky. This means z that a small value of Ki,j indicates weak connectivity between data points i and j (possibly through an intermediate point ` ∈ [1,M]), implying that the distance z between ρ(xi) and ρ(yj) can be large. Conversely, if Ki,j is large, indicating strong connectivity (possibly through an intermediate point) between points i and j, the distance between ρ(xi) and ρ(yj) should be small when aiming to minimize the objective function. Proof. By expanding the objective function in Eq. (25) we get

X 2 z X 2 z X 2 z X z (ρ(xi)−ρ(yj)) Ki,j = ρ (xi)Ki,j+ ρ (yj)Ki,j−2 ρ(xi)ρ(yj)Ki,j i,j i,j i,j i,j X 2 X z X 2 X z X z X z = ρ (xi) Ki,j+ ρ (yj) Ki,j− ρ(xi)ρ(yj)Ki,j− ρ(xj)ρ(yi)Kj,i i j j i i,j i,j X 2 rows X 2 cols X z X z = ρ (xi)Di,i + ρ (yj)Dj,j − ρ(xi)ρ(yj)Ki,j − ρ(xj)ρ(yi)Kj,i i j i,j j,i " rows z # x xT yT D 0M×M 0M×M K ρ = ρ ρ cols − z T y , (26) 0M×M D (K ) 0M×M ρ

rows PM z cols PM z where Di,i = j=1 Ki,j and Dj,j = i=1 Ki,j are diagonal matrices. The same minimization problem (Eq. (25)) can therefore be rewritten as

T T min ρ (Db − Kc)ρ s.t. kρk = 1, ρ 1 = 0. (27) ρˆ b b b b Without the orthogonality constraint, this minimization problem could be solved T ¯ T by ﬁnding the minimal eigenvalue of (Db − Kc)ρb = λρb . This eigenproblem ¯ has a trivial solution which is the all-ones eigenvector ψ0 = 1 with λ = 0. How- ever, to satisfy the orthogonality constraint, we must use a diﬀerent eigenvector (which would naturally be orthogonal to ψ0 due to the symmetry of the matrix), leading to the second-smallest (smallest non-zero) eigenvalue λ1 with its corresponding eigenvector ψ1.

5.2. Spectral decomposition In this section, we show how to eﬃciently compute the spectral decomposition of Pb (Eq. (14)) when only two view exist (L = 2). As already mentioned above, the matrix Pb is algebraically similar (conjugation) to the symmetric ma- 4 1/2 −1/2 −1/2 −1/2 trix Pbs = Db PbDb = Db KcDb . Therefore, both Pb and Pbs share the same set of eigenvalues. Due to symmetry of the matrix Pbs, it has a set 2M−1 of 2M real eigenvalues {λi}i=0 ∈ R and corresponding real orthogonal eigen- −1/2 2M−1 2M T vectors {πm}m=0 ∈ R , thus, Pbs = ΠΛΠ . By denoting Ψ = Db Π 1/2 2M−1 2M and Φ = Db Π, we conclude that the set {ψm, φm}m=0 ∈ R are the right T T and the left eigenvectors of Pb = ΨΛΦ , respectively, satisfying ψi φj = δi,j

15 (Kronecker’s delta function). In the sequel, we use the symmetric matrix Pbs to simplify the analysis. To avoid the spectral decomposition of a 2M × 2M matrix Pb s, the spectral decomposition of Pbs can be computed using the Singular Value Decomposition (SVD) of the matrix K¯ z = (Drows)−1/2Kz(Dcols)−1/2 of size M × M where rows PM z cols PM z Di,i = j=1 Ki,j and Dj,j = i=1 Ki,j are diagonal matrices. Theorem 3 enables us to form the eigenvectors of Pb as a concatenation of the singular vectors of Kz = Kx · Ky. Theorem 3. By using the left and right singular vectors of Kz = V ΣU T , the eigenvectors and the eigenvalues of Kc are given explicitly by 1 VV Σ 0 Π = √ , Λ = M×M . (28) 2 U −U 0M×M −Σ

T Proof. Both V and U are orthonormal sets, therefore, ui uj = ∆i,j, and T T vi vj = ∆i,j, thus, the set {πm} is orthonormal. Therefore, ΠΠ = I. By direct substitution of Eq. (28), ΠΛΠT can be computed explicitly as

T T T 1 VV Σ 0M×M V U ΠΛΠ = T T 2 U −U 0M×M −Σ V −U T T z 1 V Σ −V Σ V U 1 0M×M 2K = T T = z T = Kc, (29) 2 UΣ UΣ V −U 2 (2K ) 0M×M

. Thus the proposed mapping in Eq. (21) could be computed for L = 2 using the SVD of Kz, Eq. (28) and Ψ = Db −1/2Π.

5.3. Cross-view (CV) diffusion distance In some physical systems the observed dataset X may depend on some underlying parameter, say α. Under this model, multiple snapshots (views) can typically be obtained for various values of α, each denoted Xα. An example of such a scenario occurs in hyper-spectral images that change over time. Quantifying the amount of change in the datasets due to a change in α in such models is often desired, but if the datasets are high-dimensional, this can be quite a challenging task. This scenario was recently studied in [38], where the DM framework was applied to each fixed value of α. Then, by using the extracted low-dimensional mappings, the Euclidean distance enables to quantify the extent of changes due to α. However, this approach may be rather unstable, since every small change in the data can result in different mappings and the mappings are extracted independently of any mutual influence. Our multiview approach, on the other hand, incorporates the mutual relations of data within each view, as well as the relations between views. This point of view facilitates a more robust measure of the extent of variation between two

16 datasets that correspond to a small variation in α. To this end, we define a new diffusion distance, which measures the relation between two views, i.e. between all the data points obtained for different values of α. We measure the distance between all the coupled data points among the mappings of the snapshots Xαl and Xαm by using the expression

M 2 4 X 2 (CV) αl αm αl αm Dt (X , X ) = Ψb t(xi ) − Ψb t(xi ) . (30) i=1 Our kernel matrix is a product of the Gaussian kernel matrices in each view. If α α these values of the kernel matrices (Kx l , Kx m ) are similar, this corresponds to similarity between the views’ inner geometry. The right- and left-singular xαl xαm (CV) vectors of the matrix K K will be similar, thus, Dt will be small. Theorem 4. Using Gaussian kernels with identical σ values, the cross manifold distance (deﬁned in Eq. (30)) is invariant to orthonormal transformations between the ambient spaces Xαl and Xαm .

Proof. Denote an orthonormal transformation matrix R : Xαl → Xαm , w.l.o.g. αm αl by xi = Rxi . Then

( αm αm 2 ) ( αl αl 2 ) xαm kxi − xj k kRxi − Rxj k Ki,j = exp − 2 = exp − 2 2σm 2σm

( αl αl 2 ) kx − x k α i j x l = exp − 2 = Ki,j . (31) 2σl

The penultimate transition is due to the orthonomality of R and to the iden- z xαm 2 tical σl = σm values. Therefore, the matrix K = (K ) from Eq. (12) is symmetric and its right- and left-singular vectors are equal, i.e. U = V in Eq. −1/2 (28). This induces a repetitive form in Ψ = Db Π → ψl[i] = ψl[M + i], 1 ≤ 2 αl αm (CM) αl αm i, l ≤ M − 1 → Ψt(xi ) = Ψt(xi ), thus, Dt (X , X ) = 0.

5.4. Spectral decay of Kc The power of kernel based methods for dimensionality reduction stems from the spectral decay of the kernel matrix’ eigenvalues. In this subsection we study the relation between the spectral decay of the Kernel Product (Eq. (6)) and our multi-view kernel (Eq. (14)).

Theorem 5. All 2M eigenvalues of Pb (Eq. (14)) are real-valued and bounded, |λi| ≤ 1, i = 0, ..., 2M − 1.

17 Proof. 1 As shown in section 5.2, Pb is algebraically similar to a symmetric matrix, thus its eigenvalues are guaranteed to be real-valued. Denote by λ and ψ an eigenvalue and an eigenvector, resp., such that λψ = Pb ψ. 4 Deﬁne i0 = arg maxi |ψ[i]| (the index of the largest element of ψ). The maximal value ψ[i0] can be computed using Pb from Eq. (14),

2M−1 2M−1 2M−1 2M X X ψ[j] X |ψ[j]| X λψ[i0] = Pbi0jψ[j] ⇒ |λ| = Pbi0j ≤ Pbi0j ≤ Pbi0j = 1. ψ[i ] |ψ[i ]| j=0 j=0 0 j=0 0 j=1 (32) The ﬁrst inequality is due to the triangle inequality and the second equality is due to the deﬁnition of i0.

Although, according to this Theorem, the eigenvalues are all real-valued and bounded, their mere boundedness is generally insufficient for dimensionality reduction. Dimensionality reduction is meaningful only in the presence of a significant spectral decay. Definition 1. Let M be a manifold. The intrinsic dimension d of the manifold is a positive integer determined by how many independent “coordinates” are needed to describe M. Using a parametrization to describe a manifold, the dimension of M is the smallest integer d such that a smooth map f(ξ) = M describes the manifold, where ξ ∈ Rd. Our framework employs a Gaussian kernel, the spectral decay of which was studied in [5]. We use Lemma 1 (which is based on Weyl’s asymptotic law and appears in [31]) to evaluate the spectral decay of our kernel.

Lemma 1. Assume that the data is sampled from a manifold with intrinsic dimension d M. Let K◦ ∈ RM×M (a kernel constructed based on the concatenation of views) denote the kernel with an exponential decay as a function of the Euclidean distance. For δ > 0, the number of eigenvalues of K◦ above δ 1 d is proportional to (log( δ )) . Assume that the eigenvalues of K◦ are arranged in descending order of their absolute values, |λ0| ≥ |λ1| ≥ · · · ≥ |λM−1|, set δ ∈ (0, 1) and deﬁne 4 ◦ rδ = max{` ∈ [1,M]: |λ`−1| > δ} (denoting the number of eigenvalues of K above δ). Recall now that K◦ = Kx ◦ Ky corresponds to a single DM view (formed of the concatenation of two views) as addressed in [5]. Theorem 6 relates the spectral decay of our proposed kernel Pb (Eq. (14)) to the decay of the Kernel Product-based DM (P ◦).

1A similar proof is given in https://sites.google.com/site/yoelshkolnisky/teaching

18 Lemma 2. Let A, B ∈ RM×M be any two positive semi-deﬁnite (PSD) matrices. Then for any 1 ≤ k ≤ M − 1

M−1 M−1 Y Y λ`(A · B) ≤ λ`(A ◦ B) (33) `=k `=k where λ`(·) denotes the `-th eigenvalue (in descending order) of the enclosed matrix. This inequality is proved in [39] and [40].

z Theorem 6. The product of the last M − 1 − rδ eigenvalues of K is smaller M−1 M−1−rδ Q x y M−1−rδ or equal to δ . Formally, λ`(K · K ) ≤ δ . `=rδ Proof. Substitute the PSD matrices A = Kx and B = Ky in Lemma 2 and choose ` = rδ in Eq. 33 to obtain

M−1 M−1 Y x y Y ◦ M−1−rδ λ`(K · K ) ≤ λ`(K ) ≤ δ . (34)

`=rδ `=rδ

Relying on the spectral decay of the kernel matrix, we can approximate Eq. (18) by neglecting all eigenvalues smaller than δ. Thus, we can compute a low dimensional mapping such that

r t t t t T r−1 Ψb t (xi): xi 7−→ λ1ψ1[i], λ2ψ2[i], λ3ψ3[i], ..., λr−1ψr−1[i] ∈ R . (35) The following Lemma introduces an error bound for using this low-dimensional mapping:

Lemma 3. The truncated diﬀusion distance up to coordinate r deﬁned as

r r r 2 r 2 4 X 2t 2 [Dt (xi, xj)] = Ψb t (xi) − Ψb t (xj) = λs (ψs[i] − ψs[j]) , (36) s=1 is bounded by the inner view diﬀusion distance (deﬁned in Eq. (16))

"M−1 !# M−1 X 2t 2 2t 1 − ∆i,j r 2 X 2t 2 2· λ (ψs[i] − ψs[j]) − δ · ≤ [D (xi, xj)] ≤ 2· λ (ψs[i]−ψs[j]) , s min t s s=1 Dbi,j s=1

min where Dbi,j is the minimal value of Dbi,i and Dbj,j, ∆i,j is the Kronecker delta function and δ is the accuracy threshold.

19 Proof. For the last inequality, clearly

M−1 r 2 2M 2 X 2t 2 [Dt (xi, xj)] ≤ [Dt (xi, xj)] = 2 · λs (ψs[i] − ψs[j]) , s=1 the equality is a result of the repetitive form of Ψ and Λ, which were defined in Theorem 3 for L = 2. Note that s = 0 was excluded from the sum as −1/2 ψ0 = 1 is constant. For the first inequality, using Ψ = Db Π where Π is an orthonormal basis defined in Eq. (28), we get

−1/2 −1/2 −1 ΨΨT = Db ΠΠT Db = Db , which means that

2M−1 X 2 1 1 2∆i,j (ψs[i] − ψs[j]) = + − . (37) s=0 Dbi,i Dbj,j Dbi,i Using the deﬁnition of the truncated diﬀusion distance we have

2M−1 2M−1 r 2 X 2t 2 X 2t 2 [Dt (xi, xj)] = λs (ψs[i] − ψs[j]) − λs (ψs[i] − ψs[j]) s=0 s=r+1 M−1 2M−1 X 2t 2 2t X 2 ≥ 2 λs (ψs[i] − ψs[j]) − δ (ψs[i] − ψs[j]) s=1 s=0 "2M−1 !# X 2t 2 2t 1 − ∆i,j ≥ 2 λ (ψs[i] − ψs[j]) − δ · . (38) s min s=1 Dbi,j In the same way, a similar bound for the truncated diﬀusion distance between yi and yj can be derived.

The dimension r is typically determined by looking at the decay rate of the eigenvalues and on choosing a threshold level δ. In this subsection, we demonstrated how the dimension r and the threshold δ are related to the approximation of the diffusion distance. By removing all coordinates with eigenvalues smaller than δ, the error of the diffusion distance is bounded according to Lemma 3. The decaying property of the eigenvalues could be quantified by the numerical rank of the Gaussian kernel matrix. Next, we present some existing results that shed some light onto the practicality of the truncation of the diffusion coordinates up to dimension r. In practice, there usually is no single optimal value for r, and the number of necessary coordinates depends on the application.

20 5.4.1. Intrinsic Dimension and Numerical Rank In [41] Bermanis et al. prove that the numerical rank of a Gaussian kernel matrix is typically independent of its size. Their results state that the numerical rank is proportional to the volume of the data. By using boxes with side-length of D/2 = σD (D is the dimension of the data), it is shown that the numerical rank is bounded from above by the minimal number of such cubes required to cover the data. As the number of eigenvalues |λ`| > δ is proportional to the numerical rank, this means that for a prescribed δ, the number of coordinates rδ required is independent of the number of samples N. Practically speaking, this result is useful if the numerical rank is small. Furthermore, as there are several methods for estimating the numerical rank, using the numerical rank to choose rδ is not very expensive. Another practical method, which we found useful for setting r without directly choosing δ, is based on estimating the intrinsic dimension of the data d. Various methods have been proposed for estimating the intrinsic dimension of high-dimensional data. Here, we provide a brief description of a few. Popular methods, such as [42, 43] apply local PCA to the data, and estimate the intrinsic dimension as the median of the number of principal components corresponding to singular values larger than some threshold. Based on our experience, we found that a method called Dimensionality from Angle and Norm Concentration (DANCo) [44] provides a robust estimation of the intrinsic dimension d. DANCo generates artiﬁcial data of dimension d and attempts to minimize the KullbackLeibler divergence between the estimated probability distribution functions (pdf-s) of the observed data and the artiﬁcially-generated samples.

5.5. Out-of-sample extension To extend the diffusion coordinates to new data points without re-applying a large-scale eigendecomposition [32], the Nyströmextension [45] is widely used. Here we formulate the extension method for a multi-view scenario. Given the data sets X and Y and new points x˜ ∈/ X and y˜ ∈/ Y , we want to extend the multi-view diffusion mapping to x˜ and y˜ without re-applying the batch procedure. First, we describe the explicit form for the eigenvalue problem for two views X and Y . The eigenvector ψk and its associated eigenvalue λk satisfiy λkψk = Pb ψk. Substituting the definition of Pb from Eqs. (12) and (14) we get that

x y −1 0M×M K · K 2M λkψk = Db y x · ψk ∈ R , K · K 0M×M due to the block form of the matrix Pb

x X y M λkψk[i] = pb(xi, yj)ψk[j] ∈ R , j

y X x M λkψk[i] = pb(yi, xj)ψk[j] ∈ R , j

21 x 4 y 4 where ψk[i] = ψk[i], i = 1, ..., M and ψk[i] = ψk[i + M], i = 1, ..., M. The transition matrices are P x y P y x s Ki,sKs,j s Ki,sKs,j p(xi, y ) = , and p(y , xj) = . b j rows b i cols Dbi,i Dbi,i The Nystr¨omextension is is similarly obtained by computing a weighted sum of the original eigenvectors. The weights are computed by applying the kernel K to the extended data points, followed by row-normalization. For the proposed mapping, the extension is deﬁned by

1 X ψˆ (x˜) = pˆ(x˜, y )ψy[j] (39) k λ j k k j

1 X ψˆ (y˜) = Pˆ(y˜, x )ψx[j] (40) k λ j k k j where the “missing” probabilitiesp ˆ(x˜, yj) andp ˆ(y˜, xj) are approximated using the original kernel function as

2 2 X ||x˜ − xs|| ||ys − yj|| 1 pˆ(x˜, y ) = exp − exp − (41) j 2σ2 2σ2 s x y Dbj+M,j+M

2 2 X ||y˜ − y || ||xs − xj|| 1 pˆ(y˜, x ) = exp − s exp − (42) j 2σ2 2σ2 s y x Dbj+M,j+M x y and where ψk [j] = ψk[j] and ψk [j] = ψk[m + j]. The new mapping vector for the new data point is then given by

ˆ ˆ ˆ ˆ ˆ M−1 Ψ(x˜) = λ1ψ1(x˜), λ2ψ2(x˜), λ3ψ3(x˜), ..., λM−1ψM−1(x˜) ∈ R . (43) The new coordinates in the diﬀusion space are approximated and the new data points x˜, y˜ have no eﬀect on the original map’s structure.

5.6. Infinitesimal generator In the subsequent analysis we only consider kernel functions operating on the scaled difference between their two arguments, namely kernel functions which can be written as x − x K(x , x ) = K i√ j , (44) i j where is a positive constant, and K(z): RD 7→ R is a non-negative symmetric function, namely R(z) ≥ 0 and R(z) = R(−z) ∀z ∈ RD. For example, the 2 n kxi−xj k o Gaussian kernel (see subsection 2.2) K(xi, xj) = exp − 2 satisfies 2σx this property with = σ2.

22 A family of differently normalized diffusion operators was introduced in [5]. If appropriate limits are taken such that M → ∞, → 0, then from [5] it follows that the DM kernel operator will converge to one of the following differential operators: 1. Normalized graph Laplacian; 2. Laplace-Beltrami diffusion; or 3. Heat kernel equation. These are proved in [5]. The operators are all special cases of the diffusion equation. This convergence provides not only a physical justification for the DM framework, but allows in some cases to distinguish between the geometry and the density of the data points. In this subsection we study the asymptotic properties of the proposed kernel Kc (Eq. (13)), limiting the discussion to only two views, i.e. L = 2. We are interested in understanding the properties of the eigenfunctions of the proposed multi-view kernel Pb (Eqs. (13), (14)) for two views. We assume that there is some unknown mapping β : Rd → Rd from view X to view Y that satisfies yi = β(xi), i = 1, ..., M. Each view-specific kernel applies the same function, namely Kx(z) = Ky(z) = K(z), and K(z) is normalized such that R d K(z)dz = 1. Note that the Gaussian kernel can be propely normalized to R D satisfy this requirement. The analysis relates to data points {x1, ..., xM } ∈ R sampled from a uniform distribution over a bounded domain in Rd. The image of the function β is a bounded domain in Rd with distribution α(z). Theorem 7. The infinitesimal generator induced by the proposed kernel matrix Kc (Eq. (13)) after row-normalization, denoted in here as Pb , converges when 2 2 M → ∞, → 0 (with = σx = σy) to a “cross domain Laplacian operator”. The functions f(x) and g(y) converge to eigenfunctions of Pb. These functions are the solutions of the following diffusion-like equations:

3/2 (Pb f)(xi) = g(β(xi)) + 4γ(β(xi))/α(β(xi)) + O( ), (45) −1 −1 3/2 (Pb g)(yi) = f(β (yi)) + 4η(β (yi))/α(β(yi) + O( ), (46)

4 4 where the functions γ, η are deﬁned as γ(z) = g(z)α(z), η(z) = f(z)α(z). The proof of theorem 7 is deferred to the Appendix. Although the interpretation of the result is hardly intuitive, it provides evidence that the mapping f and g are indeed coupled and are related to the second derivative on the manifold. However, the distribution of the points α has an impact on the mapping, which we hope to reduce in future research.

5.7. The convergence rate In Theorem 7, we assume the number of data points M → ∞ while the scale parameter → 0. In practice we cannot expect to have an inﬁnite number of data points. It was shown in [5, 46] (and elsewehere) that a single-view graph Laplacian converges to the laplacian operator on a manifold. It is demonstrated in [47, 48] that the variance of the error for such an operator decreases as M → ∞, but increases as → 0. The study in [47] proves that for a uniform distribution of data points, the variance of the error is bounded by

23 O( 1 , 1/2). This bound was improved in [48] by an asymptotic factor √M 1/21+d/4 of based on the correlation between D−1 and K. We now turn our attention to the variance of the multi-view kernel for a ﬁnite number of points. Given x1, ...., xM independent uniformly distributed data points sampled from a bounded domain in Rd, deﬁne the multi-view Parzen Window density estimator by

M M 4 1 X X 1 x − x` y` − yj K˙ (x) = K √ K √ . (47) M, M 2 d `=1 j=1

We are interested in ﬁnding a bound for the variance of K˙ M,(x) for a ﬁnite number of data points:

M ! 1 X x − x` y` − yj var(K˙ (x)) = · M · var K √ K √ M, M 42d ` 1 3 x y 1 x y y x ≤ · M · var (K K ) ≤ var K · ||K ∞ + var K · ||K M 42d M2d ∞ 1 m + m · α(x) ≤ · [d/2 · m · 1 + d/2 · m · α(x) ≤ 1 2 , (48) M2d 1 2 M · 1.5d where the constants m1 and m2 are functions of the chosen kernels and of the pdf α(·) of the points yi, i = 1, ..., M. This bound helps to choose an optimal value for the scaling factor given the number of data points M and the intrinsic dimension d.

5.8. Generalized multi-view kernel One can consider a more general multi-view kernel which does not preclude a transition within views X and Y in each time step. Such a kernel will take the form η · (Kx)2 (1 − η) · KxKy Kc = , (49) c (1 − η) · KyKx η · (Ky)2 where the parameter η ∈ [0, 1] prescribes the (implied) probability of within- view transitions. This kernel is normalized using the sum of rows diagonal −1 matrix Db , such that Pb = Db Kc. For large values of η, the kernel favors the within-view transitions, thereby sharing characteristics with the single-view diﬀusion process. For small values of η, the kernel tends to behave like our multi-view kernel Kc (Eq. (13)).

5.9. Additive noise In this sub section we explore the eﬀect of noise by following the analysis presented in [18]: Assuming that the noise is additive, we evaluate how the proposed method preserves the statistics of the latent model parameters Θ ∈ Rd×M . We begin by considering the case of a single-view linear model, then

24 consider a multi-view (linear model) and finally use the kernel trick to relax the linearity assumption. The single-view linear model is defined by xi = Aθi +ξi, i = 1, ..., M, where d the vectors θi ∈ R , i = 1, ..., M (columns of Θ) describe latent parameters, modeled as i.i.d., zero-mean random vectors with a finite covariance matrix d Σθ, and ξi ∈ R are i.i.d., zero-mean random “noise” vectors with a finite D D×d covariance matrix Σξ. The observations are xi ∈ R and the matrix A ∈ R has full column rank. A standard approach for reducing the dimension is to apply PCA. However, in the presence of noise, even if the covariance of Θ is full rank PCA will provide a biased representation of Θ. In applying PCA to the set X, one first 1 T computes the sample covariance M XX , which at the limit M → ∞ converges T to AΣθA + Σξ. This means that using the top principal components of X embeds the data into a biased representation of Θ. Nonetheless, by introducing D×d a second set of measurements yi = Bθi + ηi, i = 1, ..., M, where B ∈ R is another full-rank matrix and ηi are i.i.d. zero-mean noise vectors which are all T uncorrelated with the respective ξi (namely, E ξi · ηi = 0, i = 1, ..., M), one can remove the bias term. Essentially, the corresponding additional measurements, along with the noise decorrelation assumption, provide the information required for retrieving an unbiased representation of Θ. Removing the bias term is possible by applying CCA [49] to X and Y . CCA not only removes the bias, but enables to uncover a reduced representation that preserves the statistics of θ. This is because the sample cross covariance of X and Y is an unbiased estimate T 1 T of AΣθB . The top d right- and left-singular vectors of M XY are used to embed the data into a d dimensional space, such that the statistics of Θ are pre- ˆ ˆ T 1 T served. By denoting the top d singular vectors as (U, S, V ) = SVDd( M XY ), T T the reduced representations are defined as Uˆ X and Vˆ Y , respectively. Although CCA is an effective, powerful tool in this context, it is based on a linear model, and thus limited to linear transformations of the data. A widely used solution to capture non-linear relations in the data are Kernel matrices [50, 51, 52]. Using a kernel matrix in the ambient space is a natural way to capture the sample-covariance in some unknown high-dimensional feature space [53]. Therefore, using a kernel generalizes the results provided by CCA to the nonlinear case. We now demonstrate how under mild assumptions the proposed kernel Kz = KxKy mitigates the bias effect of the additive uncorrelated noise. Assume that xi, yi, i = 1, ..., M are noisy observations of the same low- dimensional latent variable θi, such that xi = r(θi) + ξi and yi = h(θi) + d D x y ηi. The functions r, h : R → R . The affinity values of K , K are the sample covariance in two high-dimensional feature spaces Γ, ∆. The inaccesible feature maps for both views are defined by γ(xi) and δ(yi). This means that by computing the SVD of Kz = KxKy, we would essentially be applying CCA in the inaccessible features spaces Γ, ∆. This implies that if the following conditions hold 1. The functions r, h ∈ C∞ are invertible.

25 2. The feature representations Γ(X) and ∆(Y ) are centered. T 3. The noise terms are uncorrelated: E ξi · ηi = 0, i = 1, ..., M. 4. E[γ(xi)|θi] = a0θi + a1, with some real valued constants a0 and a1.

5. E[δ(yi)|θi] = b0θi + b1, with some real valued constants b0 and b1. then, applying SVD to Kz enables to cancel out the view-specific noise terms. Condition (1) implies that the latent parameters Θ lie on a noisy manifold in both observed spaces. Based on condition (2) and on the choice of kernels, x 1 T the empirical covariance matrices in the feature space are K = M ΓΓ and y 1 T K = M ∆∆ . We remind that Γ and ∆ are inaccessible and the kernels are computed based on X and Y . As in the linear case, using Kx or Ky alone is insufficient for obtaining an unbiased estimate of the representation of Θ. We now turn our attention to demonstrate that the matrix Kz = KxKy = 1 T T M 2 ΓΓ ∆∆ enables to remove the uncorrelated additive noise. A right eigen- z vector of K , denoted by τ i, satisfies 1 Kzτ = ΓΓT ∆∆T τ = λ τ , (50) i M 2 i i i by multiplying Eq. 50 on the left by ∆T we get 1 ∆T ΓΓT ∆∆T τ = λ ∆T τ . (51) M 2 i i i

T ¯ 1 T ¯ 1 T Substituting ui = ∆ τ i, Σγδ = M Γ ∆ and Σδγ = M ∆ Γ yields ¯ ¯ ΣδγΣγδui = λiui, (52) ¯ so up to scaling the left- and right-singular vectors of Σγδ provide a low- dimensional representation that captures the statistics of Θ. Thus, based on conditions (3-5), by taking the number of points to inﬁnity, one can extract an unbiased estimate of the representation of Θ.

6. Experimental results

In this section we present experimental results to evaluate our proposed framework. First we empirically evaluate the theoretical properties derived in section 5. Then, we demonstrate how the proposed framework can be used for learning coupled manifolds even in the presence of noise.

6.1. Empirical evaluations of theoretical aspects In the ﬁrst group of experiments we provide empirical evidence corroborating the theoretical analysis from Section 5.

26 6.1.1. Spectral decay In Section 5.4, an upper bound on the eigenvalues’ rate of decay for our multi-view-based approach (matrix Pb Eq. (14)) was presented. In order to empirically evaluate the spectral decay for Pb , P ◦ (Eq. (6) and [5]) and P +, we generated synthetically-clustered data drawn from Gaussian-Mixtures distributions. The following steps describe the generation of both views, denoted (X,Y ) and referred to as View-I (X) and View-II (Y ), resp.: 9 1. Six vectors µj ∈ R , j = 1,..., 6 were drawn from a Gaussian distribution N(0, 8 · I9×9). These vectors would serve as the centers of masses of the generated classes. 2. One hundred data points were drawn for each cluster j = 1, ..., 6 from a Gaussian distribution N(µj, 2 · I9×9). Denote these 600 data points X ∈ R9×600. 3. One hundred additional data points were similarly drawn from each of the six Gaussian distributions N(µj, 2 · I9×9). Denote these 600 data pointsY ∈ R9×600. The ﬁrst 3 dimensions of both views are depicted in Fig. 4. We compute the probability matrix for each view P x and P y, the Kernel Sum approach probability matrix P +, the Kernel Product approach P ◦ (Eq. (6)) and the proposed approach Pb. The eigendecomposition is computed for all matrices. The resulting eigenvalues’ decay rate are compared with the eigenvalues product from both views. To get a fair comparison between all the methods, we set the Gaus- sian scale parameters σx and σy in each view and then use these scales in all the methods. The vectors’ variance in the concatenation approach is the sum of variances since we assume statistical independence. Therefore, the following 2 2 2 scale parameters σ◦ = σx + σy are used. The experiment is repeated but this time X contains 6 clusters whereas Y contains only 3. For Y , we use only the ﬁrst 3 centers of masses and generate 200 points in each cluster. Figure 5 presents a logarithmic scale of the spectral decay for eigenvalues extracted from all methods. It is evident that our proposed kernel has the strongest spectral decay.

6.1.2. Cross view diﬀusion distance In this section, we examine the proposed Cross View Diﬀusion Distance (Section 5.3). A Swiss Roll is generated by using the function     xi[1] 6θi cos(θi) (1) View I: xi = xi[2] =  hi  + ni , (53) xi[3] 6θi sin(θi) with θi = (1.5π)si, i = 1, 2, ..., 1,000, where si are 1000 data points evenly spread (1) along the segment [1, 3], and where ni are i.i.d. zero-mean Gaussian noise 2 vectors with covariance σN · I3×3. The second view is generated by applying an

27 Figure 4: The ﬁrst 3 dimensions of the Gaussian mixture. Both views share the center of masses of the Gaussian spread. Left: ﬁrst view denoted as X. Right: second view denoted as Y . The variance of the Gaussian in each dimension is 8.

0 10

−1 10 Kernel product (element wise) Kernel sum MultiView Diffusion

−2 10 View 1 View 2

Einavalues magnitude Eigenvalues multiplication

−3 10 1 2 3 4 5 6 7 8 9 Number of eigenvalue

0 10

−1 10

Kernel product (element wise) −2 10 Kernel sum Cross diffusion Eigenvalues magnitude View 1 View 2 Eigenvalues multiplication −3 10 1 2 3 4 5 6 7 8 9 Number of eigenvalue

Figure 5: Eigenvalues decay rate. Comparison between diﬀerent mapping methods. Top: 6 clusters in each view. Bottom: 6 clusters in X and 3 clusters in Y .

28 Figure 6: The two Swiss Rolls, Left - a Swiss Roll generated by Eq. (53), Right - a Swiss Roll generated by Eq. (54). orthonormal transformation to the (noiseless) Swiss Roll and similarly adding Gaussian noise:     yi[1] 6θi cos(θi) (2) View II: yi = yi[2] = R  hi  + ni , (54) yi[3] 6θi sin(θi) where R ∈ R3×3 is a random orthonormal transformation matrix, and where (2) 2 ni are i.i.d. N(0, σN ·I3×3. The matrix R is generated by independently draw- ing its elements from a standard Gaussian distribution, followed by applying the Gram-Schmidt orthogonalization procedure. The variables hi, i = 1, ..., 1000 are drawn from a uniform distribution in the interval [0, 100]. An example for both Swiss Rolls is shown in Fig. 6. A standard DM is applied to each view and a 2-dimensional embedding of the Swiss Roll is extracted. The sum of distances between all the data points in the embedding spaces is denoted as a single-view diﬀusion distance (SVDD). The distance is computed using the measure

M (SV)2 X 2 Dt (X,Y ) = ||Ψt(xi) − Ψt(yi)|| , (55) i=1 where Ψt(xi), Ψt(yi), i = 1, ..., M are the single-view diffusion mappings. Then, the proposed framework is applied to extract the coupled embedding. A Cross View Diffusion Distance (CVDD) is computed using Eq. (30). This experiment 2 was executed 100 times for various values of the Gaussian noise variance σN . In about 10% of the single-view trials the embeddings’ axis are flipped. This generates a large SVDD although the embeddings share similar structures. In order to mitigate the effect of this type of errors we used the Median of the measures taken from the 100 trialss, presented in Fig. 7.

6.2. Manifold learning In this subsection we demonstrate how our proposed Multi-view DM framework allows to simultaneously extract L low-dimensional representations for L

29 Cross View Diffusion Distance 105

100

10-5

10-10

10-15 Multiview Diffusion Distance 10-20 Single View Diffusion Distance

10-25 -10 -8 -6 -4 -2 0

Distances in embbeding space 10 10 10 10 10 10 Gaussian noise power

Figure 7: Comparison between two cross view diﬀusion based distances. Simulated on two Swiss Rolls with additive Gaussian noise. The results are the median of 100 simulations. datasets.

6.2.1. Artiﬁcial manifold learning The general DM approach is based on an underlying assumption, that the sampled space describes a single low-dimensional manifold. However, this assumption may be incorrect if the sampled space describes the existence of redundancy in the manifold, or more generally, if the sampled space can describe two or more manifolds generated by a common physical process. In this subsection we consider such cases. We examine the extracted embedding computed using our method and compare it to the Kernel Product approach (Subsection 3.1).

Helix A Two coupled manifolds with a common underlying open circular structure are generated. The helix shaped manifolds were generated by the application of a M 3-dimensional function to M = 1,000 data points {ai, bi}i=1, such that the {ai} are evenly spread in [0, 2π] and bi = (ai + 0.5π) mod 2π, i = 1, ..., M. The following functions are used to generate the datasets for View-I and View-II denoted as X and Y , resp.:

    xi[1] 4 cos(0.9ai) + 0.3 cos(20ai) View I: xi = xi[2] = 4 sin(0.9ai) + 0.3 sin(20ai) , i = 1, 2, ..., 1,000, (56) 2 3 xi[3] 0.1(6.3ai − ai )

    yi[1] 4 cos(0.9bi) + 0.3 cos(20bi) View II: yi = yi[2] = 4 sin(0.9bi) + 0.3 sin(20bi) , i = 1, 2, ..., 1,000. (57) 2 yi[3] 0.1(6.3bi − bi )

30 X− First View Y− Second View

3 1 2.5 0.8 2

0.6

1.5 3 3

Y 0.4

X 1

0.5 0.2

0 0 5 5 4 4 3 3 2 5 5 4 2 4 1 3 1 3 0 2 0 2 1 1 −1 −1 0 −2 0 Y −2 −1 −3 −1 −3 −2 −2 −3 X −4 −3 2 −4 −4 Y 2 −5 −4 −5 −5 X 1 1 Figure 8: Left: ﬁrst Helix X (Eq. (56)). Right: second Helix Y (Eq. (57)). Both manifolds have some circular structure governed by the angle parameter a[i] and b[i], i = 1, 2, ..., 1,000 colored by the points index i.

MultiView Diﬀusion- Ψˆ (X) MultiView Diﬀusion- Ψˆ (Y) 1.5 1.5

1 1

0.5 0.5 [2]

ˆ 0

0 Ψ [2] ˆ Ψ

-0.5 -0.5

-1 -1

-1.5 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 -1.5 -1 -0.5Ψˆ 0 0.5 1 1.5 Ψˆ [1] [1]

Figure 9: Left: Multi-View based embedding of the ﬁrst view Ψb (X). Right: Multi-View based embedding of the second view Ψb (Y ). They were computed by using Eq. (21)), respectively.

Kernel Product Diﬀusion Map- Ψ◦ 300

200

100

◦ 2 0 Ψ

-100

-200

-300 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 Ψ◦ 1

Figure 10: 2-dimensional DM-based mapping of the Helix computed using the concatenated vector from both views that correspond to the kernel P ◦ (Eq. (6)).

31 Figure 11: Left: ﬁrst Helix X (Eq. (58)). Right: second Helix Y (Eq. (59)). Both manifolds have some circular structure governed by the angle parameter a[i] and bi, i = 1, 2, ..., 1,000, as colored by the point’s index i.

The resulting 3-dimensional Helix-shaped manifolds X and Y are shown in Fig. 8. The Kernel Product mapping (Eq. (6)) separates the manifold to a bow and a point as shown in Fig. 10. This structure neither represents any of the original structures nor reveals the underlying parameters ai, bi. On the other hand, our embedding (Eq. (21)) captures the two structures. one for each view. As shown in Fig. 9, one structure represents the angle of ai while the other represents the angle of bi. The Euclidean distance in the new spaces preserves the mutual relations between data points based on the geometrical relation in both views. Moreover, both manifolds are in the same coordinate system and this is a strong advantage as it enables to compare the manifolds in the lower-dimensional space. The Euclidean distance in the new spaces preserves the mutual relations between data points that are based on the geometrical structure of both views. Helix B The previous experiment was repeated using the following alternative functions:     xi[1] 4 cos(5ai) View I: xi = xi[2] = 4 sin(5ai) , (58) xi[3] 4ai     yi[1] 4 cos(5bi) View II: yi = yi[2] = 4 sin(5bi) . (59) yi[3] 4bi

Again, M = 1,000 points were generated using ai ∈ [0, 2π], bi = (ai + 0.5π) mod 2π, i = 1, ..., M. The generated manifolds are presented in Fig. 11. As can be viewed in Fig. 12, our proposed embeddings (Eq. (21) has success- fully captured the governing parameters ai and bi. The Kernel Product based embedding (Eq. (6)), as evident in Fig. 13, again separated the data points into two unconnected structures that do not represent well the parameters.

32 Figure 12: The coupled mappings computed using our proposed parametrization in Eq. 21

Figure 13: A 2-dimensional mapping, extracted based on P ◦ (Eq. (6)).

6.2.2. MultiView video sequence Various examples involving datasets in diverse fields, such as images, audio, MRI ([54], [11] and [55], resp.) have demonstrated the power of DM for the extraction of underlying changing physical parameters from real datasets. In this experiment, the multi-view approach is tested on a real video data, in what can literally be termed a “toy example”: Two web cameras and a toy train with preset tracks are used. The train’s tracks have an “eight” shape structure. Extracting the underlying manifold from the set of images enables to organize the images according to the location along the train’s path and thus reveals the true underlying parameters of the processes. The setting of the experiment is as follows: each camera records a set of images from a different angle. A sample frame from each view is shown in Fig. 14. The video is sampled at 30 frames per second with a resolution of 640 × 480 pixels per frame. M = 220 images were collected from each view (camera). Then, the R,G,B values were averaged and downsampled to 160 × 120 pixels resolution. The matrices were reshaped into column vectors. The resulted set 19,200 of vectors are denoted by X and Y where xi, yi ∈ R , 1 ≤ i ≤ 220. The sequential order of the images is not important for the algorithm. In a normal setting, one view is sufficient to extract the parameters that govern the move- ment of the train and thus extract the natural order of the images. However,

33 Figure 14: Left: a sample image from the first camera (X). Right: a sample image from the second camera (Y ). we use two types of interferences to create a scenario in which each view by itself is insufficient for the extraction of the underlying parameters. The first interference is a gap in the recording of each camera. We remove 20 consecutive frames from each view at different time intervals. By removing frames, the bijective correspondence of some of the images in the sequence is broken. However, even an approximated correspondence is sufficient for our proposed manifold extraction. A standard 2-dimensional DM based mapping of each view was extracted. The results are bow-shaped manifolds as presented in Fig. 15. Applying DM separately to each view extracts the correct order of the data points (images) along the path. However, the “missing” data points broke the circular structure of the expected manifold and resulted in a bow-shaped embedding. We use the multi-view based methodology to overcome this interference by application of the multi-view framework to extract two coupled mappings (Eq. (21)). The results are shown in Fig. 16. The proposed approach overcomes the interferences by smoothing out the gap inherited in each view through the use of connectivities from the “unonbstructed” view. Finally, we concatenate the vectors from both views and compute the Kernel Product embedding The results are presented in Fig. 17. Again, the structure of the manifold is distorted and incomplete due to the missing images. This experiment was then repeated, replacing 10 frames from each view with “noise frames” consisting of pixel-wise i.i.d. zero-mean Gaussian noise with variance 10. A single-view DM-based mapping was computed. The Kernel Product- based DM and the multi-view based DM mappings were computed as well. As presented in Fig. 18, the Gaussian noise distorted the manifolds extracted in each view. The multi-view approach extracted two circular structures presented in Fig. 19. Again, the data points are ordered according to the position along the path. This time, the circular structure is unfolded and the gaps are visible in both embeddings. Applying the Kernel Product approach (Eq. 6) has yielded a distorted manifold as presented in Fig. 20.

34 Figure 15: Left: DM-based single-view mapping Ψ(X). Right: DM-based single-view mapping Ψ(Y )). The removed images caused a bow shaped structure.

Figure 16: Left: Mapping Ψ(ˆ X). Right: Mapping Ψ(ˆ Y ) as extracted by the multi-view based framework. Two small gaps, which correspond to the removed images, are visible.

Kernel Product Diﬀusion Map- Ψ◦ 3

0 ◦ 2 Ψ -1

-2

-3 ◦ -1.5 -1 -0.5Ψ 0 0.5 1 1

Figure 17: A standard diﬀusion mapping (Kernel Product-based) that was computed by using the concatenated vector from both views that correspond to kernel K◦.

35 Diﬀusion Map- Ψ(X) Diﬀusion Map- Ψ(X) ×1010 ×1012 10 1.5

1 6

4 0.5 2 2 2 Ψ Ψ 0 0 -2

-4 -0.5

-6

-8 -1 -1 0 1 2 3 4 5 6 7 8 9 -12 -10 -8 -6 -4 -2 0 2 Ψ1 Ψ1 ×1010 ×1011

Figure 18: Left: DM-based single-view mapping Ψ(X). Right: DM-based single-view mapping Ψ(Y )). The Gaussian noise deformed the circular structure

ﬀ Ψˆ ﬀ Ψˆ 2 MultiView Di usion- (X) 2 MultiView Di usion- (Y)

1 1

0 0

-1 -1 2 2 ˆ ˆ Ψ Ψ -2 -2

-3 -3

-4 -4 -1 -0.5 0 0.5Ψˆ 1 1.5 2 2.5 3 3.5 -1 -0.5 0 0.5 1 1.5 2 2.5 3 3.5 1 Ψˆ 1

Figure 19: Left: Mapping Ψ(ˆ X). Right: Mapping Ψ(ˆ Y ) as extracted by the multi-view framework. Two gaps are visible that correspond to Gaussian noise.

Kernel Product Diﬀusion Map- Ψ◦ ×1011 0.5

0 200

-0.5 180 160 -1 140 -1.5

◦ 2 120 -2 Ψ 100 -2.5 80 -3 60 -3.5 40

-4 20

-4.5 ◦ -150 -100 -50Ψ 0 50 100 150 1

Figure 20: Computation of a standard diﬀusion mapping (Kernel Product) by using the concatenation vector from both views (corresponding to kernel P ◦ Eq. (6)).

36 7. Applications

7.1. Multi-view clustering The task of clustering has been in the core of machine learning for many years. The goal is to divide a given dataset into subsets based on the inherited structure of the data. We use the multi-view construction to extract low- dimensional mappings from multiple sets of high-dimensional data points. In the following experiments we expand the examples presented in [56] to cluster artiﬁcial and real data sets. For the real data sets applying the multi-view approach requires an eigen decomposition of large matrices. To reduce the runtime of experiments we use an approximate matrix decomposition based on sparse random projections [35].

7.1.1. Two circles clustering Spectral properties of data sets are useful for clustering since they reveal information about the unknown number of clusters. The characteristic of the eigenvalues of Pb (Eq. (14)) can provide insight into the number of clusters within the data set. The study in [57] relates the number of clusters to the multiplicity of the eigenvalue 1. A different approach in [58] provides an analysis about the relation between the eigenvalue drop to the number of clusters. In this section, we evaluate how our proposed method captures the clusters’ structure when two views are available. We generate two circles that represent the original clusters using the function zi[1] r · cos(θi)) zi = = , (60) zi[2] r · sin(θi) where M = 1,600 points θi, 1 ≤ i ≤ M, are evenly spread in [0, 4π]. The clusters are created by changing the radius as follows: r = 2, 1 ≤ i ≤ 800 (first cluster) , r = 4,801 ≤ i ≤ 1,600 (second cluster). The views X (Eq. (61)) and Y (Eq. (62)) are generated by the application of the following non-linear functions that produce the distorted views z1[i] + 1 + ni[2]|zi[2] ≥ 0 xi[1] = , xi[2] = zi[2] + ni[1] (61) z1[i] + ni[3]|zi[2] < 0 and zi[2] + 1 + ni[6]|zi[1] ≥ 0 yi[1] = zi[1] + ni[4], yi[2] = , (62) zi[2] + ni[6]|zi[1] < 0 where ni[`], 1 ≤ ` ≤ 6, are i.i.d. random variables drawn from a Gaussian 2 distribution with µ = 0 and σn ∈ [0.03, 0.6]. This data is referred to as the Coupled Circles dataset. In Fig. 21, the views X and Y , which were generated by Eqs. (61) and (62), are shown. Color and shape indicate the ground truth clusters. Initially, DM is applied to each view and clustering is applied using K-means (K = 2) within the first diffusion coordinate. The kernel bandwidths σx and σy for all

37 Figure 21: Left: ﬁrst view X. Right: second view Y . The ground truth clusters are represented by the marker’s shape and color.

100 Multi-view DM Single view DM (X) 95 Single view DM (Y)

90 Kernel Product DM Kernel Sum DM 85 De Sa Kernel CCA 80 CCA

65 Clustering accuracy [%]

50 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 Variance of Gaussian noise- 2 n

Figure 22: Clustering results from averaging 200 trials vs. the variance of the Gaussian noise. The simulation performed on the Coupled Circles data (Eqs. (60), (61) and (62)). methods are set using the min-max method described in Eq. (24). We use t = 1 since it is optimal for clustering tasks. For the kernel product method we q 2 2 use σ◦ = σx + σy. We further extract a 1-dimensional representation using the proposed multi-view framework (Eq. (21)), the Kernel Sum DM (Eq. (7)), Kernel Product DM (Eq. (6)), de Sa’s approach (Eq. (10)) and Kernel CCA (Eq. (8)) described in Section 3. The regularization parameter is γ = 0.01 for KCCA and we use 100 components for the Incomplete Cholesky Decomposition [15, 16]. Clustering is performed in the representation space by the application of K-means where K = 2. To evaluate the performance of our proposed map 100 simulations with various values of the Gaussian’s noise variance (all with zero mean) were performed. The average clustering success rate is presented in Fig. 22. It is evident that the multi-view based approach outperforms the DM-based single-view and the Kernel Product approaches. The performance of kernel methods is highly dependent on setting an appropriate kernel bandwidth σx, σy, in Algorithm 1 we have presented a method for setting such parameters. To evaluate the inﬂuence of these parameters on the clustering quality we set σn = 0.16 and extract the multi-view, Kernel Sum and Kernel Product diﬀusion mapping for various values of σx, σy. The average clustering performance using K-means (K = 2) are presented in Fig. 23.

38 Figure 23: Clustering results from averaging 20 trails using various values of σx, σy based on diﬀerent mappings. The standard deviation of the noise is σn = 0.16. Top left- Kernel Sum DM, top right- Kernel Product DM, bottom- Multi View DM.

7.1.2. Handwritten digits For the following clustering experiment, we use the Multiple Features database 2 from the UCI repository. The data set consists of 2, 000 handwritten digits from 0 to 9 that are equally spread. The extracted features from these images are the profile correlations (FAC), Karhunen-Loéve coefficients (KAR), Zerkine moment (ZER), morphological (MOR), pixel averages in 2 × 3 windows and the Fourier coefficients (Fou) as our feature spaces X1, X2,X3, X4, X5, X6 respectively. We apply dimensionality reduction using a single-view DM, Kernel Product DM, Kernel Sum DM and the proposed Multi-view. We apply K-means to the reduced mapping using 6 to 20 coordinates. The clustering performance is measured using the Normalized Mutual Information [59] (NMI). Figure 24 presents the average clustering results using K-Means. Next, we attempt to cluster 40K grey scale images of handwritten 2’s and 3’s. The images with dimension 28 × 28 were collected from the infinity MNIST dataset [60]. We generate two independent noisy versions of the 40K samples. By adding pixel Gaussian noise N(0, 0.5) to each image we create the first noisy view which is denoted as X1. The second view X2 is created by randomly and independently zeroing each pixel with probability 0.5. In Fig. 25. we present 25 examples from each view. To evaluate the clustering performance, we apply K-means 100 times to the reduced representations with dimensions 5, 10, 15 and 20 and report the top results for each method. In table 1 we present the top Normalized Mutual Information (NMI) and clustering accuracy’s the proposed multi-view and various alternative methods.

2http://archive.ics.uci.edu/ml

39 0.8

0.75

0.7

0.65 Multiview DM 0.6 X1 Single View DM 2 0.55 X Single View DM X3 Single View DM 0.5 X4 Single View DM X5 Single View DM 0.45 X6 Single View DM 0.4 Kernel Sum DM Kernel Product DM 0.35 6 8 10 12 14 16 18 20 Normalized mutual information Emmbeding dimension- r

Figure 24: Average clustering accuracy running 100 simulations on the Handwritten data set. Accuracy is measured using the Normalized Mutual Information (NMI).

Figure 25: Random samples from both views. Left- X1 generated by adding Gaussian noise. Right- X2 generated by randomly dropping out 50% of the pixels.

40 Method NMI Accuracy DM X1 0.38 84.4% DM X2 0.38 84.1% Kernel Prod 0.41 85.2% Kernel Sum 0.39 84.6% Kernel CCA 0.54 90.5% de Sa 0.59 91.2% Multiview 0.70 94.7%

Table 1: Accuracy of clustering and the normalized mutual information (NMI) on the inﬁnity MNIST dataset. Each view consists a noisy version based on 40K images of handwritten 2’s and 3’s. The ﬁrst view X1 is generated by adding Gaussian noise, while X2 is generated by zeroing out random pixels with probability 0.5.

7.1.3. Isolet data set The Isolet data set was constructed by recording 150 people pronouncing each letter twice for all 26 letters. The feature vector available is a concatenation of the following features: spectral coefficients, contour, sonorant, pre-sonorant and post-sonorant. The authors do not provide the feature’s separation, therefore, the dimension of the feature vector is 617. We use a subset of the data with 1,599 instances, thus the features space is X ∈ R1,559×617. To apply the multi-view approach we compute 3 different kernels and fuse them together. The first kernel K1 is the standard Gaussian kernel defined in Eq. (12). K2 is a Laplacian kernel defined by 2 4 −|xi − xj| Ki,j = exp . (63) σ2

The third kernel K3 is an exponent with a correlation distance as the aﬃnity measure, given by

4 T − 1 K3 = exp i,j , i, j = 1, ..., M, (64) i,j 2σ2 where Ti,j is the correlation coeﬃcient between the i-th and j-th feature vectors, computed by T 4 x˜i · x˜j Ti,j = , i, j = 1, ..., M. (65) q T T (x˜i · x˜i)(x˜j · x˜j)

4 The average subtracted features are x˜i = xi − ηi · 1, where ηi is the average of the features for instance i. We fuse the kernels using multi-view, kernel product and kernel sum approach, we then apply K-Means to the extracted space. The average NMI for 26 classes is presented in Fig. 26.

41 Figure 26: Clustering accuracy measured with Normalized Mutual Information (NMI) on the Isolet data set by using 3 diﬀerent kernel matrices. Clustering was performed in the r dimensional embedding space.

7.1.4. Caltech 101 For this experiment we use an image dataset which consists of 101 categories collected in [61] for an object recognition problem. We use two subsets of the dataset, with 10 and 15 instances in each category. The views X1, X2 are represented by the following features: Bag-of-words SIFT descriptors [62] and Pyramid Histogram of Oriented Gradients (PHOG) [63]. We use Kernel matrices computed by [64], therefore, we do no set the scale parameters. The quality of the clustering is again measured using Calinski-Harabasz Criterion [65] and the average silhouette width [66], for both metrics higher values indicates better separation. The silhouette is deﬁned within the range of [−1, 1]. The average clustering results based on 10 to 15 coordinates are presented in Table 2.

Method Silhouette(10) Calinski(10) Silhouette(15) Calinski(15) DM X1 (SIFT) −0.27 183 −0.27 293 DM X2 (PHOG) −0.23 169 −0.19 329 Kernel Prod −0.08 348 −0.02 683 Kernel Sum −0.11 322 −0.05 650 de Sa −0.35 134 −0.34 227 Multiview 0.32 727 0.32 1275

Table 2: Clustering results on two subsets of Caltech-101 data set. Columns represent measures for the quality of clustering. First two rows are scores based on a single-view DM using diﬀerent features. The last four rows present scores based on the proposed and alternative schemes for a Multiview DM based mapping.

7.2. Learning from multi sensor seismic data Automatic detection and identiﬁcation of seismic events is an important task, it is carried out constantly for seismic and nuclear monitoring. The monitoring

42 process results in a seismic event bulletin that contains information about the detected events, their locations, magnitudes and type (natural or man made event). Seismic stations usually consist of multiple sensors recording contin- uously at a low frequency. The amount of available data is huge and only a fraction of the recordings contains the signal of interest. Thus, automatic tools for monitoring are of great interest. A suspect event is usually identified based on the energy of the signal, then, typical discrimination algorithms extract seismic parameters. A simple seismic parameter is the focal depth. Its drawback is that its estimation is usually inaccurate without the depth phases. Other widely used seismic discrimination methods are Ms:mb (surface wave magnitude versus body wave magnitude) and spectral amplitude ratios of different seismic phases [67] [68]. Current automatic seismic bulletins comprise a large number of false alarms, which have to be manually corrected by an analyst. In this subsection we apply the proposed method to extract essential latent seismic parameters. A suspected event is identified based on a short and long time average ratio (STA/LTA) [69]. Then, a time-frequency representation is computed to which multi-view is applied to fuse the data from multiple seismic sensors. Using the multi-view low-dimensional embedding and simple classi- fiers, we demonstrate capabilities classification of event type (earthquakes vs. explosions) and quarry source for explosions.

7.2.1. Description of the data set The dataset consists of recordings from two diﬀerent broad band seismic stations MMLI (Malkishua) and HRFI (Harif). Both stations are operated by the Geophysical Institute of Israel (GII) and they are part of the Israel National Seismic Network [70]. MMLI and HRFI stations are located to the north and to the south of the analyzed region, respectively. Each station is equipped with a three component STS-2 seismometer, thus the total number of views is L = 6. All recording are sampled at Fs = 40Hz. The HRFI data includes 1654 explosions and 105 earthquakes, while MMLI includes a subset of 46 earthquakes, 62 explosions. The explosions occurred in the south of Israel between the years 2,005-2,015.

7.2.2. Feature Extraction by Normalized Sonograms A seismic event typically generates two underground traveling waves. A primary wave (P) and secondary wave (S). The two waves (P-S) arrive at the recording station with some time delay. A time frequency representation of the recording captures the spectral properties of the event while maintaining the P-S time gap. Here we use a time frequency representation termed sonogram [71] with some modiﬁcations. Each single-trace seismic waveform, denoted by N y[n] ∈ R is a time series signal sampled at the rate of Fs = 40Hz. The waveform y[n] is decomposed into a set of overlapping windows of length N0 = 256 using an overlap of s = 0.9. Thus, the sift between consecutive windows is NS = b0.1 · 256c = 25. A short-time Fourier transform (STFT) is applied to y[n] in each time window. Then, power spectral densities are computed. The

43 resulting spectrogram is denoted by R(f, t), where f is a frequency bin and t is a number of time window (bin). Thus, it contains T = 192 time bins and N0/2 = 128 frequency bins. The sonogram is obtained by summing the spectrogram in equally tempered logarithmically scaled frequency bands, this is done for every time bin. Finally the sonogram is normalized such that the sum of energy in every frequency band is equal to 1. The result is a normalized sonogram and it is denoted by S(k, t), where k is the frequency band number and t is the time window number. The resulting set of sonograms are denoted X1, X2, X3, X4, X5, X6. These are the input views for our framework.

7.2.3. Event classification Each sensor records information from the seismic event as well as nuisance noise. We can assume that the noise at each station is independent. Thus, by fusing the measurements from different sensors we may be able to improve detection level. To evaluate how well the proposed approach fuses the information, we use a set which includes 46 earthquakes, 62 explosions recorded at MMLI (Malkishua) and HRFI (Harif). After we extract the sonogram, the proposed MVDM framework is applied, as well as the single-view DM, kernel product DM and kernel sum DM. Classification is performed by using K-NN (K=1), based on 3 or 4 coordinates from the reduced mapping. The results are presented in table 3. This experiment demonstrates that applying multi-view DM to seismic recordings extracts a meaningful representation. In the following test we check how each view affects the detection rate. We do this by applying the MV to subsets of the L = 6 views. We per- form classification using representation computed based on all pairs of view Xl, Xm, l, m = 1, ..., 6, l 6= m. The accuracy of classification based on the multi-view representation Ψb (Xl), l = 1, ..., 6 given that Xm, m = 1, ..., 6, m 6= l is presented in Fig. 27.

Table 3: Classiﬁcation accuracy using 1-fold cross validation. r is the number of coordinates used in the embedding space.

Method Accuracy [%] (r = 3) Accuracy [%] (r = 4) single-view DM (X1) 89.9 93.0 single-view DM (X2) 88.6 92.4 single-view DM (X3) 89.3 91.1 single-view DM (X4) 89.3 89.2 single-view DM (X5) 89.3 90.5 single-view DM (X6) 88.6 91.1 Kernel Sum DM 93.7 94.9 Kernel Product DM 94.3 93.0 Multi-view DM 97.5 98.1

44 Figure 27: Classification accuracy using K-nn (K=1) for all pairs of views Xl, Xm, l 6= m. The y-axis is the number of the first view used, while the x-axis is the number of the second view. Classification is performed in the multi-view low-dimensional embedding (r = 4). The diagonal terms are presented as zero since we did not simulate for l = m.

Table 4: Classiﬁcation accuracy using 1-fold cross validation. r is the number of coordinates used in the embedding space.

Method Accuracy [%] (r = 3) Accuracy [%] (r = 4) single-view DM (X1) 80.3 80.6 single-view DM (X2) 79.1 79.2 single-view DM (X3) 76.4 77.8 Kernel Sum DM 82.2 82.8 Kernel Product DM 80.8 81.2 Multi-view DM 86.2 86.4

Identification and separation of quarries by attributing the explosions to the known sources is a challenging task [72, 73]. Quarry blast have a similar spectral properties, they are usually classified by a triangulation process. Such a process requires to compare the arrival time between distinct seismic stations. Here we attempt to classify the source of the explosions, using 3-channels from the same station. For this experiment 602 seismograms of explosions are used. The explosions occurred in 4 quarry clusters in Israel and 1 quarry in Jordan. All events were recorded in HRFI station, the distances from the event to the station vary between 50-130Km. The association of each blast to quarry (labeling) is performed manually by an analyst from the GII. After extracting the sonograms, we apply multi-view DM and present the first 2 coordinates in Fig. 28. In table 4 we summarize the classification results by applying K-NN (k = 1) in a leave one out procedure. Our method is compared to a single-view DM, kernel product and kernel sum.

45 Figure 28: Left- 2 dimensional multi-view DM of the 602 quarry blasts. Points are colored by quarry label. Right- a map with the approximated source location.

8. Discussion

We presented a multi-view based framework for dimensionality reduction. The framework enables to extract simultaneous embeddings from coupled embeddings. Our approach is based on imposing an implied cross-domain transition in each single time-step. The transition probabilities depend on the connectivities in both views. We reviewed various theoretical aspects of the proposed method and demonstrated their applicability to both synthetic and real data. The experimental results demonstrate the strength of the proposed framework in cases where data is missing in each view or each of the manifolds is deformed by an unknown function. The framework is applicable to various real life machine learning tasks that consist of multiple views or multiple modalities.

9. Appendix

We prove Theorem 7 (subsection 5.6), repeated here for convenience:

Theorem 7. The inﬁnitesimal generator induced by the proposed kernel matrix Kc (Eq. (13)) after row-normalization, denoted in here as Pb , converges when 2 2 M → ∞, → 0 (with = σx = σy) to a “cross domain Laplacian operator”. The functions f(x) and g(y) converge to eigenfunctions of Pb. These functions are the solutions of the following diﬀusion-like equations:

3/2 (Pb f)(xi) = g(β(xi)) + 4γ(β(xi))/α(β(xi)) + O( ), (66) −1 −1 3/2 (Pb g)(yi) = f(β (yi)) + 4η(β (yi))/α(β(yi) + O( ), (67)

4 4 where the functions γ, η are deﬁned as γ(z) = g(z)α(z), η(z) = f(z)α(z).

Proof. By extending the single-view construction presented in [5], the eigen- function of the limit-operator Pb is deﬁned using the functions f(x) and g(y)

46 by concatenating the vectors such that

2M h = [f(x1), f(x2), ..., f(xM ), g(y1), g(y2), ..., g(yM )] ∈ R . The limit of the top-half (Eq. (66)) of the characteristic equation is given by

2M M M P P P x y Kbi,jhj Ki,`K`,jg(yj) j=1 j=1 `=1 lim (P h ) = lim h − = lim f(x )− , i = 1, ..., M. b i i 2M i M M M→∞ M→∞ M→∞ y →0 →0 P →0 P P x Kbi,j Ki,`K`,j j=1 j=1 `=1 (68) We approximate the summations using a Riemann integral. Beginning with the denominator, we have

M M 1 X X 4 1 Z Z s − x y − β(s) x y √ √ 2 d Ki,`K`,j −→ D(x) = d K K α(y)dsdy. M M→∞ d d j=1 `=1 →0 R R

y − β(s) √ √ d/2 Using a change of variables z = , y = β(s) + z, dz = dy we get

1 Z Z s − x √ D(x) = K √ K(z)α(β(s) + z)dsdz. d/2 d d R R Using a ﬁrst order Taylor expansion of α(·) we get √ 1 Z Z s − x D(x) ≈ K √ K(z) α(β(s)) + zT∇α(β(s)) + O() dsdz, d/2 d d 2 R R using the symmetry of the kernel K(z) we have Z K(z)zTdz = 0T, d R √ s√−x therefore, applying another change of variables t = , s = t + x, dt = dsd/2 we get Z √ D(x) ≈ K(t)[α(β(x + t)) + O()]dt ≈ α(β(x)) + O() d R (independent of x), where the last transition is again based on a Taylor expansion (of β(·)) and on zeroing out the odd (1st) moment of K(t). Turning to the numerator (of Eq. (68)),

M M 1 X X 4 1 Z Z s − x y − β(s) x y √ √ 2 d Ki,`K`,jg(yj) −→ N(x) = d K K g(y)α(y)dsdy. M M→∞ d d j=1 `=1 →0 R R

47 y − β(s) √ √ d/2 By applying a change of variables z = , y = β(s) + z, dz = dy we get 1 Z Z s − x √ N(x) = K √ K(z)γ(β(s) + z)dsdz. d/2 d d R R Using Taylor’s expansion of γ(·) we get √ 1 Z Z s − x N(x) ≈ K √ K(z) γ(β(s)) + zT ∇γ(β(s)) + zT Hz + O(3/2) dsdz d/2 d d 2 2 R R

4 ∂2γ(β(s)) where Hi,j = is the Hessian. The ﬁrst term yields the integral over ∂si∂sj K(z), while the second term vanishes due to integration over an odd (1st) moment of the symmetric kernel K(z). The last term yields

Z 2 2 Z T ∂ γ(β(s)) X ∂ γ(β(s)) K(z)z zdz = zizjK(z)dz d ∂s ∂s ∂s ∂s d R i j i,j i j R 2 Z X ∂ γ(β(s)) 2 = 2 zi K(z)dz = 4γ(β(s)), ∂s d i i R R where 4 denotes the Laplacian operator, and where we assumed that d zizjK(z)dz R vanishes for i 6= j and equals 1 for i = j. Note that this is naturally satisﬁed, e.g., by the Gaussian kernel function K(z) = c · exp(−0.5kzk2) (with c = (2π)−d/2, as per the scaling requirement). Substituting into N(x) we get 1 Z s − x h i N(x) ≈ K √ γ(β(s)) + 4γ(β(s)) + O(3/2) ds. d/2 d 2 R √ s√−x d/2 Using a change of variables t = , s = t + x, dt = ds and γ(y) = g(y)α(y) we get Z h √ √ i N(x) ≈ K(t) γ(β(x + t)) + 4γ(β(x + t)) + O(3/2) dt. d 2 R Using Taylor’s expansion (of β(·)) once again we get Z h i N(x) ≈ K(t) γ(β(x)) + 4γ(β(x)) + tT Ht + O(3/2) dt, d 2 2 R here we neglected terms involving to a power higher than 3/2 and terms with odd order of t due to the symmetry of the kernel K. This leads to

N(x) ≈ γ(β(x)) + 4γ(β(x)) + O(3/2), dividing by the denominator we get

3/2 (Pb f)(xi) ≈ g(β(xi)) + 4γ(β(xi))/α(β(xi)) + O( )

48 In the same way, we compute the convergence on g(yi)

−1 −1 3/2 (Pb g)(yi) = f(β (yi)) + 4η(β (yi))/α(β(yi) + O( ).

The following issues were ignored in the proof: • Errors due to approximating the sum by an integral; An upper bound on the associated errors in the single-view DM is derived in [34]. • Deformation due to the fact that the data is sampled from a non uniform density. This changes the result by some constant.

• The data lies on some manifold. This could be dealt by changing the coordinate system and integrating on the manifold. • When assuming that the data lies on some manifold, the Euclidean distance should be replaced by the geodesic distance along the manifold. As in the analysis of [5], this introduces a factor to the integral.

10. References

References

[1] Hans-Peter Deutsch. Principle component analysis. In Derivatives and Internal Models, pages 539–547. Springer, 2002. [2] Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129, 1964. [3] Sam T. Roweis and Lawrence K. Sau. Nonlinear dimensionality reduction by local linear embedding. Science, 290.5500:2323–2326, 200. [4] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.

[5] Ronald R Coifman and St´ephaneLafon. Diﬀusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006. [6] Weifeng Liu, Xueqi Ma, Yicong Zhou, Dapeng Tao, and Jun Cheng. p- laplacian regularization for scene recognition. IEEE Transactions on Cy- bernetics, (8):5120–5129, 2018.

[7] Laurens van der Maaten and Geoﬀrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

49 [8] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. Umap: uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861, 2018. [9] Xiaofei He, Shuicheng Yan, Yuxiao Hu, Partha Niyogi, and Hong-Jiang Zhang. Face recognition using laplacianfaces. IEEE transactions on pattern analysis and machine intelligence, 27(3):328–340, 2005. [10] A. Singer and R. R. Coifman. Non linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008. [11] Ofir Lindenbaum, Arie Yeredor, and Israel Cohen. Musical key extraction using diffusion maps. Signal Processing, 117:198–207, 2015. [12] William T Freeman and Joshua B Tenenbaum. Learning bilinear models for two-factor problems in vision. In Computer Vision and Pattern Recogni- tion,Proceedings., IEEE Computer Society Conference on, pages 554–560. IEEE, 1997. [13] Inge S Helland. Partial least squares regression and statistical models. Scandinavian Journal of Statistics, pages 97–114, 1990. [14] Kamalika Chaudhuri, Sham M Kakade, Karen Livescu, and Karthik Srid- haran. Multi-view clustering via canonical correlation analysis. In Pro- ceedings of the 26th annual international conference on machine learning, pages 129–136. ACM, 2009. [15] Pei Ling Lai and Colin Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(05):365–377, 2000. [16] Francis R Bach and Michael I Jordan. Kernel independent component analysis. Journal of machine learning research, 3(Jul):1–48, 2002. [17] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 393–400, 2011. [18] Byron Boots and Geoff Gordon. Two-manifold problems with applications to nonlinear system identification. arXiv preprint, page 1206.4648, 2012. [19] Virginia R de Sa. Spectral clustering with two views. In ICML workshop on learning with multiple views, pages 20–27, 2005. [20] Shiliang Sun, Xijiong Xie, and Mo Yang. Multiview uncorrelated discrim- inant analysis. IEEE transactions on cybernetics, 46(12):3272–3284, 2016. [21] Weifeng Liu, Xinghao Yang, Dapeng Tao, Jun Cheng, and Yuanyan Tang. Multiview dimension reduction via hessian multiset canonical correlations. Information Fusion, 41:119–128, 2018.

50 [22] Bo Wang, Jiayan Jiang, Wei Wang, Zhi-Hua Zhou, and Zhuowen Tu. Unsu- pervised metric fusion by cross diﬀusion. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2997–3004. IEEE, 2012. [23] D. Zhou and C.J.C Burges. Spectral clustering and transductive learning with multiple views. Proceedings of the 24th international conference on Machine learning, pages 1159–1166, 2007. [24] Roy R Lederman and Ronen Talmon. Learning the geometry of common latent variables using alternating-diﬀusion. Applied and Computational Har- monic Analysis, 44(3):509–536, 2018.

[25] Weiran Wang and Miguel A Carreira-Perpin´an.The role of dimensionality reduction in classiﬁcation. In AAAI, pages 2128–2134, 2014. [26] Huawen Liu, Lin Liu, Thuc Duy Le, Ivan Lee, Shiliang Sun, Jiuyong Li, et al. Non-parametric sparse matrix decomposition for cross-view dimensionality reduction. IEEE Trans. Multimedia, 15:138–145, 2017.

[27] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013. [28] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43– 54, 2017. [29] Ofir Lindenbaum, Arie Yeredor, and Moshe Salhov. Learning coupled embedding using multiview diffusion maps. In International Conference on Latent Variable Analysis and Signal Separation, pages 127–134. Springer, 2015. [30] David S Kershaw. The incomplete choleskyconjugate gradient method for the iterative solution of systems of linear equations. Journal of Computa- tional Physics, 26(1):43–65, 1978. [31] S. Lafon. Diffusion maps and geometric harmonics. Ph.D dissertation Yale, pages 33–34, 2004. [32] Yosi Keller, Ronald R Coifman, StéphaneLafon, and Steven W Zucker. Audio-visual group recognition using diffusion maps. IEEE Transactions on Signal Processing, 58(1):403–413, 2010.

[33] David Dov, Ronen Talmon, and Israel Cohen. Kernel-based sensor fusion with application to audio-visual voice activity detection. IEEE Transac- tions on Signal Processing, 64(24):6406–6416, 2016.

51 [34] Amit Singer, Radek Erban, Ioannis G Kevrekidis, and Ronald R Coif- man. Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diﬀusion maps. Proceedings of the National Academy of Sciences, 106(38):16090–16095, 2009. [35] Yariv Aizenbud and Amir Averbuch. Matrix decompositions using sub- gaussian random matrices. arXiv preprint, page 1602.03360, 2016. [36] Michael Holmes, Alexander Gray, and Charles Isbell. Fast svd for large- scale matrices. In Workshop on Eﬃcient Machine Learning at NIPS, volume 58, pages 249–252, 2007.

[37] Michael P Holmes, Jr Isbell, Charles Lee, and Alexander G Gray. Quic-svd: Fast svd using cosine trees. In Advances in Neural Information Processing Systems, pages 673–680, 2009. [38] Ronald R Coifman and Matthew J Hirn. Diﬀusion maps for changing data. Applied and computational harmonic analysis, 36(1):79–107, 2014.

[39] T. Ando. Majorization relations for hadamard products. Linear Algebra and its Applications, pages 57–64, 1995. [40] G. Visick. A weak majorization involving the matrices a b and ab. Linear Algebra and its Applications, 224/224:731–744, 1995.

[41] Amit Bermanis, Amir Averbuch, and Ronald R Coifman. Multiscale data sampling and function extension. Applied and Computational Harmonic Analysis, 34(1):15–29, 2013. [42] Keinosuke Fukunaga and David R Olsen. An algorithm for ﬁnding intrinsic dimensionality of data. IEEE Transactions on Computers, 100(2):176–183, 1971. [43] Peter J. Verveer and Robert P. W. Duin. An evaluation of intrinsic dimensionality estimators. IEEE Transactions on pattern analysis and machine intelligence, 17(1):81–86, 1995. [44] Claudio Ceruti, Simone Bassis, Alessandro Rozza, Gabriele Lombardi, Elena Casiraghi, and Paola Campadelli. Danco: An intrinsic dimensionality estimator exploiting angle and norm concentration. Pattern recognition, 47(8):2569–2581, 2014. [45] Amit Bermanis, Amir Averbuch, and Ronald R Coifman. Multiscale data sampling and function extension. Applied and Computational Harmonic Analysis, 34(1):15–29, 2013. [46] Mikhail Belkin and Partha Niyogi. Convergence of laplacian eigenmaps. In NIPS, pages 129–136, 2006.

52 [47] Matthias Hein, Jean-Yves Audibert, and Ulrike Von Luxburg. From graphs to manifolds–weak and strong pointwise consistency of graph laplacians. In International Conference on Computational Learning Theory, pages 470– 485. Springer, 2005. [48] Amit Singer. From graph to manifold laplacian: The convergence rate. Applied and Computational Harmonic Analysis, 21(1):128–134, 2006. [49] Harold Hotelling. Relations between two sets of variates. Biometrika, pages 321–377, 1936. [50] Kilian Q Weinberger, Fei Sha, and Lawrence K Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the twenty- first international conference on Machine learning, page 106. ACM, 2004. [51] Yuka Shiokawa, Yasuhiro Date, and Jun Kikuchi. Application of kernel principal component analysis and computational machine learning to ex- ploration of metabolites strongly associated with diet. Scientific reports, 8(1):3426, 2018. [52] Qinyi Zhang, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. Large- scale kernel methods for independence testing. Statistics and Computing, 28(1):113–130, 2018. [53] Bernhard Schölkopf. The kernel trick for distances. In Advances in neural information processing systems, pages 301–307, 2001. [54] S. Lafon, Y. Keller, and R. Coifman. Data fusion and multicue data match- ing by diffusion maps. IEEE Trans. Pattern Anal. Mach. Intell., 28 no. 11:17841797, 2006.

[55] Gemma Piella. Diffusion maps for multimodal registration. Sensors, 14(6):10562–10577, 2014. [56] Ofir Lindenbaum, Arie Yeredor, and Amir Averbuch. Clustering based on multiview diffusion maps. In Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on, pages 740–747. IEEE, 2016.

[57] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering1 analysis and an algorithm. Proceedings of Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 14:849–856, 2001. [58] Marzia Polito and Pietro Perona. Grouping and dimensionality reduction by locally linear embedding. In NIPS, pages 1255–1262, 2001.

[59] Pablo A Est´evez, Michel Tesmer, Claudio A Perez, and Jacek M Zurada. Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2):189–201, 2009.

53 [60] GaëlleLoosli, StéphaneCanu, and LéonBottou. Training invariant support vector machines using selective sampling. Large scale kernel machines, 2, 2007. [61] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understand- ing, 106(1):59–70, 2007. [62] Yu-Gang Jiang, Chong-Wah Ngo, and Jun Yang. Towards optimal bag-of- features for object categorization and semantic video retrieval. In Proceed- ings of the 6th ACM international conference on Image and video retrieval, pages 494–501. ACM, 2007. [63] Yang Bai, Lihua Guo, Lianwen Jin, and Qinghua Huang. A novel feature extraction method using pyramid histogram of orientation gradients for smile recognition. In Image Processing (ICIP), 2009 16th IEEE Interna- tional Conference on, pages 3305–3308. IEEE, 2009.

[64] P Gehler and S Nowozin. On feature combination for multiclass object detection. In International Conference on Computer Vision, volume 2, page 6, 2009. [65] Maud A Mouchet, S´ebastienVill´eger, Norman WH Mason, and David Mouillot. Functional diversity measures: an overview of their redundancy and their ability to discriminate community assembly rules. Functional Ecology, 24(4):867–876, 2010. [66] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied math- ematics, 20:53–65, 1987.

[67] R. Blandford. Seismic event discrimination. Bulletin of the Seismological Society of America, 72:569–587, 1982. [68] Arthur J Rodgers, Thorne Lay, William R Walter, and Kevin M Mayeda. A comparison of regional-phase amplitude ratio measurement techniques. Bulletin of the Seismological Society of America, 87(6):1613–1621, 1997. [69] Ludger K¨uperkoch, Thomas Meier, and Tobias Diehl. Automated event and phase identiﬁcation. New manual of seismological observatory practice, 2:1–52, 2012. [70] R. Hofstetter, Y. Gitterman, V. Pinsky, N. Kraeva, and L. Feldman. Seis- mological observations of the northern Dead Sea basin earthquake on 11 February 2004 and its associated activity. Israel Journal of Earth Sciences, 57:101–124, 2008.

54 [71] Manfred Joswig. Automated classiﬁcation of local earthquake data in the bug small array. Geophysical Journal International, 120(2):262–286, 1995. [72] David B Harris. A waveform correlation method for identifying quarry explosions. Bulletin of the Seismological Society of America, 81(6):2395– 2418, 1991.

[73] David B Harris and Tormod Kvaerna. Superresolution with seismic arrays using empirical matched ﬁeld processing. Geophysical Journal Interna- tional, 182(3):1455–1477, 2010.