
Multi-View Clustering via Canonical Correlation Analysis

Keywords: multi-view learning, clustering, canonical correlation analysis

Abstract

Clustering in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Here, we consider constructing such projections using multiple views of the data, via Canonical Correlation Analysis (CCA). Under the assumption that the views are uncorrelated conditioned on the cluster label, we show that the separation conditions required for the algorithm to be successful are rather mild (significantly weaker than previous results in the literature). We provide results for mixtures of Gaussians and mixtures of log-concave distributions. We also provide empirical support from audio-visual speaker clustering (where we desire the clusters to correspond to speaker ID) and from hierarchical Wikipedia document clustering (where one view is the words in the document and the other is the link structure).

1. Introduction

The multi-view approach to learning is one in which we have 'views' of the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest [BM98, KF07, AZ07]. In this work, we explore how having 'two views' of the data makes the clustering problem significantly more tractable.

Much recent work has been done on understanding under what conditions we can learn a mixture model. The basic problem is as follows: we are given independent samples from a mixture of $k$ distributions, and our task is to either 1) infer properties of the underlying mixture model (e.g. the mixing weights, means, etc.) or 2) classify a random sample according to which distribution it was generated from.

Under no restrictions on the underlying distribution, this problem is considered to be hard. However, in many applications, we are only interested in clustering the data when the component distributions are "well separated". In fact, the focus of recent clustering algorithms [Das99, VW02, AM05, BV08] is on efficiently learning with as little separation as possible. Typically, these separation conditions are such that, given a random sample from the mixture model, the Bayes optimal classifier is able to reliably, with high probability, recover which cluster generated that point.

This work assumes a rather natural multi-view assumption: the views are (conditionally) uncorrelated if we condition on which mixture distribution generated the views. There are many natural applications for which this assumption applies. For example, we can consider multi-modal views, with one view being a video stream and the other an audio stream of a speaker: here, conditioned on the speaker identity and maybe the phoneme (both of which could label the generating cluster), the views may be uncorrelated. A second example is the words and link structure in a document from a corpus such as Wikipedia: here, conditioned on the category of each document, the words in it and its link structure may be uncorrelated. In this paper, we provide experiments for both settings.

Under this multi-view assumption, we provide a simple and efficient subspace learning method based on Canonical Correlation Analysis (CCA). This algorithm is affine invariant and is able to learn with some of the weakest separation conditions to date. The intuitive reason for this is that, under our multi-view assumption, we are able to (approximately) find the low-dimensional subspace spanned by the means of the component distributions. This subspace is important because, when the data are projected onto it, the means of the distributions are well-separated, yet the typical distance between points from the same distribution is smaller than in the original space. The number of samples we require to cluster correctly scales as O(d), where d is the ambient dimension. Finally, we show through experiments that CCA-based algorithms consistently provide better performance than PCA-based clustering methods when applied to datasets in two different domains: audio-visual speaker clustering and hierarchical Wikipedia document clustering by category.

Our work shows how the multi-view framework can provide substantial improvements to the clustering problem, adding to the growing body of results which show how the multi-view framework can alleviate the difficulty of learning problems.

Related Work. Most provably efficient clustering algorithms first project the data down to some low-dimensional space and then cluster the data in this lower-dimensional space (typically, an algorithm such as single linkage suffices here). Typically, these algorithms also work under a separation requirement, which is measured by the minimum distance between the means of any two mixture components.

One of the first provably efficient algorithms for learning mixture models is due to [Das99], who learns a mixture of spherical Gaussians by randomly projecting the mixture onto a low-dimensional subspace. [VW02] provide an algorithm with an improved separation requirement that learns a mixture of $k$ spherical Gaussians by projecting the mixture down to the $k$-dimensional subspace of highest variance. [KSV05, AM05] extend this result to mixtures of general Gaussians; however, they require a separation proportional to the maximum directional standard deviation of any mixture component.

[CR08] use a canonical-correlations-based algorithm to learn mixtures of axis-aligned Gaussians with a separation proportional to $\sigma^*$, the maximum directional standard deviation in the subspace containing the centers of the distributions. Their algorithm requires the coordinate-independence property and an additional "spreading" condition. None of these algorithms are affine invariant.

Finally, [BV08] provides an affine-invariant algorithm for learning mixtures of general Gaussians, so long as the mixture has a suitably low Fisher coefficient when in isotropic position. However, their separation involves a rather large polynomial dependence on $1/w_{\min}$.

The two results most closely related to ours are the work of [VW02] and [CR08]. [VW02] shows that it is sufficient to find the subspace spanned by the means of the distributions in the mixture for effective clustering. Like our algorithm, [CR08] use a projection onto the top $k-1$ singular value decomposition subspace of the canonical correlations matrix. They also require a spreading condition, which is related to our requirement on the rank. We borrow techniques from both these papers.

[?] propose a similar algorithm for multi-view clustering, in which data is projected onto the top few directions obtained by a kernel CCA of the multi-view data. They show empirically that for clustering images using associated text (where the two views are an image and the text associated with it, and the target clustering is a human-defined category), CCA-based methods outperform PCA-based clustering algorithms.

This Work. In this paper, we study the problem of multi-view clustering. We have data on a fixed set of objects from two sources, which we call the two views, and our goal is to use this fact to cluster more effectively than with data from a single source. Following prior theoretical work, our goal is to show that our algorithm recovers the correct clustering when the input obeys certain conditions.

Our input is data on a fixed set of objects from two views, where View $j$ is generated by a mixture of $k$ Gaussians $(D_1^j, \ldots, D_k^j)$, for $j = 1, 2$. To generate a sample, a source $i$ is picked with probability $w_i$, and $x^{(1)}$ and $x^{(2)}$ in Views 1 and 2 are drawn from distributions $D_i^1$ and $D_i^2$.

We impose two requirements on these mixtures. First, we require that, conditioned on the source distribution in the mixture, the two views are uncorrelated. Notice that this is a weaker restriction than the condition that, given source $i$, the samples from $D_i^1$ and $D_i^2$ are drawn independently. Moreover, this condition allows the distributions in the mixture within each view to be completely general, so long as they are uncorrelated across views. Although we do not show it theoretically, we suspect that our algorithm is robust to small deviations from this assumption.

Second, we require the rank of the CCA matrix across the views to be at least $k-1$ when each view is in isotropic position, and the $(k-1)$-th singular value of this matrix to be at least $\lambda_{\min}$. This condition ensures that there is sufficient correlation between the views. If these two conditions hold, then we can recover the subspace containing the means of the distributions in both views.

In addition, for mixtures of Gaussians, if in at least one view, say View 1, we have that for every pair of distributions $i$ and $j$ in the mixture,
\[
\|\mu_i^1 - \mu_j^1\| > C \sigma^* k^{1/4} \sqrt{\log(n/\delta)}
\]
for some constant $C$, where $\mu_i^1$ is the mean of the $i$-th component distribution in View 1 and $\sigma^*$ is the maximum directional standard deviation in the subspace containing the means of the distributions in View 1, then our algorithm can also determine which distribution each sample came from. Moreover, the number of samples required to learn this mixture grows (almost) linearly with $d$.

This separation condition is considerably weaker than in previous results, in that $\sigma^*$ depends only on the directional variance in the subspace spanned by the means, which can be considerably lower than the maximum directional variance over all directions. The only other algorithm which provides affine-invariant guarantees is due to [BV08]; while that result does not explicitly state a separation between the means (it uses a Fisher-coefficient concept), the implied separation is rather large and grows with decreasing $w_{\min}$, the minimum mixing weight. To get our improved bounds, we use a result due to [RV07], which may be of independent interest.

We stress that our improved results are really due to the multi-view condition. Had we simply combined the data from both views and applied previous algorithms to the combined data, we could not have obtained our guarantees. We also emphasize that, for our algorithm to cluster successfully, it is sufficient for the distributions in the mixture to obey the separation condition in one view, so long as the multi-view and rank conditions are obeyed.

Finally, we study through experiments the performance of CCA-based algorithms on datasets from two different domains. First, we experiment with audio-visual speaker clustering, in which the two views are audio and face images of a speaker, and the target cluster variable is the speaker. Our experiments show that CCA-based algorithms perform better than PCA-based algorithms on audio data, just as well on image data, and are more robust to occlusions and translations of the images. For our second experiment, we cluster documents in Wikipedia. The two views are the words and the link structure of a document, and the target cluster is the category. Our experiments show that a CCA-based hierarchical clustering algorithm provides much higher performance than PCA-based hierarchical clustering for this data.

2. The Setting

We assume that our data is generated by a mixture of $k$ distributions. In particular, we assume we obtain samples $x = (x^{(1)}, x^{(2)})$, where $x^{(1)}$ and $x^{(2)}$ are the two views of the data, which live in the vector spaces $V_1$ of dimension $d_1$ and $V_2$ of dimension $d_2$, respectively. We let $d = d_1 + d_2$. Let $\mu_i^j$, for $i = 1, \ldots, k$ and $j = 1, 2$, be the center of distribution $i$ in view $j$, and let $w_i$ be the mixing weight for distribution $i$.

For simplicity, we assume that the data have mean 0. We denote the covariance matrices of the data as:
\[
\Sigma = E[x x^\top], \quad \Sigma_{11} = E[x^{(1)} (x^{(1)})^\top], \quad \Sigma_{22} = E[x^{(2)} (x^{(2)})^\top], \quad \Sigma_{12} = E[x^{(1)} (x^{(2)})^\top].
\]
Hence, we have:
\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \qquad (1)
\]

The multi-view assumption we work with is as follows:

Assumption 1 (Multi-View Condition) We assume that, conditioned on the source distribution $l$ in the mixture (where $l = i$ is picked with probability $w_i$), the two views are uncorrelated. More precisely, we assume that for all $i \in [k]$,
\[
E[x^{(1)} (x^{(2)})^\top \mid l = i] = E[x^{(1)} \mid l = i]\, E[(x^{(2)})^\top \mid l = i].
\]

This assumption implies that
\[
\Sigma_{12} = \sum_i w_i\, \mu_i^1 (\mu_i^2)^\top.
\]
To see this, observe that
\[
E[x^{(1)} (x^{(2)})^\top] = \sum_i E_{D_i}[x^{(1)} (x^{(2)})^\top] \Pr[D_i]
= \sum_i w_i\, E_{D_i}[x^{(1)}] \cdot E_{D_i}[(x^{(2)})^\top]
= \sum_i w_i\, \mu_i^1 (\mu_i^2)^\top. \qquad (2)
\]
As the distributions are in isotropic position, we observe that $\sum_i w_i \mu_i^1 = \sum_i w_i \mu_i^2 = 0$. Therefore, the above equation shows that the rank of $\Sigma_{12}$ is at most $k-1$. We now assume that it has rank precisely $k-1$.

Assumption 2 (Non-Degeneracy Condition) We assume that $\Sigma_{12}$ has rank $k-1$ and that the minimal non-zero singular value of $\Sigma_{12}$ is $\lambda_{\min} > 0$ (where we are working in a coordinate system where $\Sigma_{11}$ and $\Sigma_{22}$ are identity matrices).
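As a concrete illustration of these definitions (our own sketch, not part of the paper's method statement; the helper names `whiten` and `cca_cross_covariance` are hypothetical), the following NumPy code puts each view into isotropic position, forms the empirical cross-covariance matrix, and reads off its singular values; under Assumption 2 the top $k-1$ of these stay above $\lambda_{\min}$ while the rest are (approximately) zero.

    import numpy as np

    def whiten(X, eps=1e-8):
        # Center X (n x d) and map it to isotropic position (sample covariance = identity).
        X = X - X.mean(axis=0)
        cov = X.T @ X / X.shape[0]
        vals, vecs = np.linalg.eigh(cov)
        W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T   # cov^{-1/2}
        return X @ W

    def cca_cross_covariance(X1, X2):
        # Empirical cross-covariance of the whitened views; its SVD gives the CCA directions.
        Z1, Z2 = whiten(X1), whiten(X2)
        Sigma12 = Z1.T @ Z2 / Z1.shape[0]
        U, s, Vt = np.linalg.svd(Sigma12, full_matrices=False)
        # Under Assumption 2, s[k-2] (the (k-1)-th singular value) should be at least
        # lambda_min, and s[k-1:] should be close to zero.
        return U, s, Vt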

For clarity of exposition, we also work in an isotropic coordinate system in each view. Specifically, the expected covariance of the data, in each view, is the identity matrix, i.e. $\Sigma_{11} = I_{d_1}$, $\Sigma_{22} = I_{d_2}$. As our analysis shows, our algorithm is robust to errors, so we assume that the data is whitened as a pre-processing step.

One way to view the Non-Degeneracy Assumption is in terms of correlation coefficients. Recall that for two directions $u \in V_1$ and $v \in V_2$, the correlation coefficient is defined as
\[
\rho(u, v) = \frac{E[(u \cdot x^{(1)})(v \cdot x^{(2)})]}{\sqrt{E[(u \cdot x^{(1)})^2]\, E[(v \cdot x^{(2)})^2]}}.
\]
An alternative definition of $\lambda_{\min}$ is just the minimal non-zero correlation coefficient, i.e.
\[
\lambda_{\min} = \min_{u, v :\, \rho(u,v) \neq 0} \rho(u, v).
\]
Note $1 \geq \lambda_{\min} > 0$.

We use $\widehat{\Sigma}_{11}$ and $\widehat{\Sigma}_{22}$ to denote the sample covariance matrices in Views 1 and 2 respectively. We use $\widehat{\Sigma}_{12}$ to denote the sample covariance matrix combined across Views 1 and 2. We assume these are obtained through empirical averages from i.i.d. samples from the underlying distribution.

3. The Clustering Algorithm

The following lemma provides the intuition for our algorithm.

Lemma 1 Under Assumption 2, if $U, D, V$ is the 'thin' SVD of $\Sigma_{12}$ (where the thin SVD removes all zero entries from the diagonal), then the subspace spanned by the means in View 1 is precisely the column span of $U$ (and we have the analogous statement for View 2).

The lemma is a consequence of Equation 2 and the rank assumption. Since samples from a mixture are well-separated in the space containing the means of the distributions, the lemma suggests the following strategy: use CCA to (approximately) project the data down to the subspace spanned by the means to get an easier clustering problem, and then apply standard clustering algorithms in this space.

Our clustering algorithm, based on the above idea, is stated below. We can show that this algorithm clusters correctly with high probability when the data in at least one of the views obeys a separation condition, in addition to our assumptions. The input to the algorithm is a set of samples $S$ and a number $k$, and the output is a clustering of these samples into $k$ clusters. For this algorithm, we assume that the data obeys the separation condition in View 1; an analogous algorithm can be applied when the data obeys the separation condition in View 2 as well.

Algorithm 1.

1. Randomly partition $S$ into two subsets of equal size, $A$ and $B$.

2. Let $\widehat{\Sigma}_{12}(A)$ ($\widehat{\Sigma}_{12}(B)$, respectively) denote the empirical covariance matrix between Views 1 and 2, computed from the sample set $A$ ($B$, respectively). Compute the top $k-1$ left singular vectors of $\widehat{\Sigma}_{12}(A)$ ($\widehat{\Sigma}_{12}(B)$, respectively), and project the samples in $B$ ($A$, respectively) onto the subspace spanned by these vectors.

3. Apply single-linkage clustering [?] (for mixtures of log-concave distributions), or the algorithm in Section 3.5 of [AK05] (for mixtures of Gaussians), to the projected examples in View 1.

We note that in Step 3 we apply either single linkage or the algorithm of [AK05]; this allows us to show theoretically that, given that the distributions in the mixture are of a certain type, and given the right separation conditions, the clusters can be recovered correctly. In practice, however, these algorithms do not perform as well due to lack of robustness, and one would use an algorithm such as k-means or EM to cluster in this low-dimensional subspace. In particular, a variant of the EM algorithm has been shown [DS00] to correctly cluster mixtures of Gaussians under certain conditions.

Moreover, in Step 1, we divide the dataset into two halves to ensure independence between Steps 2 and 3 for our analysis; in practice, however, these steps can be executed on the same sample set. A practical variant along these lines (k-means on the CCA projection) is sketched below.
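For concreteness, here is a minimal sketch of Algorithm 1 in NumPy/scikit-learn (our own illustrative code; the function name `multiview_cca_cluster` is ours, the views are assumed to be already whitened, and k-means stands in for the single-linkage/[AK05] step used in the analysis, as discussed above):

    import numpy as np
    from sklearn.cluster import KMeans

    def multiview_cca_cluster(X1, X2, k, seed=0):
        # CCA-based multi-view clustering in the spirit of Algorithm 1.
        # X1, X2: (n, d1) and (n, d2) arrays holding the two (whitened) views of the same n objects.
        rng = np.random.default_rng(seed)
        n = X1.shape[0]
        perm = rng.permutation(n)
        A, B = perm[: n // 2], perm[n // 2:]                  # Step 1: random equal-size split

        labels = np.empty(n, dtype=int)
        for fit_idx, proj_idx in [(A, B), (B, A)]:
            S12 = X1[fit_idx].T @ X2[fit_idx] / len(fit_idx)  # empirical cross-covariance
            U, _, _ = np.linalg.svd(S12, full_matrices=False) # Step 2: top k-1 left singular vectors
            P = X1[proj_idx] @ U[:, :k - 1]                   # project the *other* half (View 1)
            # Step 3, practical variant: k-means in place of single linkage / [AK05].
            labels[proj_idx] = KMeans(n_clusters=k, n_init=5).fit_predict(P)
        # Note: the cluster IDs assigned to the two halves are not aligned with each other;
        # match them (e.g. by nearest centroids) if a single consistent labeling is needed.
        return labels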
3.1. The Main Result

Our main theorem can be stated as follows.

Theorem 1 (Gaussians) Suppose the source distribution is a mixture of Gaussians, and suppose Assumptions 1 and 2 hold. Let $\sigma^*$ be the maximum directional standard deviation of any distribution in the subspace spanned by $\{\mu_i^1\}_{i=1}^k$. If, for each pair $i$ and $j$ and for a fixed constant $C$,
\[
\|\mu_i^1 - \mu_j^1\| \geq C \sigma^* k^{1/4} \sqrt{\log(kn/\delta)},
\]
then, with probability $1-\delta$, Algorithm 1 correctly classifies the examples if the number of examples used is
\[
c \cdot \frac{d}{(\sigma^*)^2 \lambda_{\min}^2 w_{\min}^2} \log^2\!\Big(\frac{d}{\sigma^* \lambda_{\min} w_{\min}}\Big) \log^2(1/\delta)
\]
for some constant $c$.

Here we assume that a separation condition holds in View 1, but a similar theorem also applies to View 2. An analogous theorem can also be shown for mixtures of log-concave distributions.

Theorem 2 (Log-concave Distributions) Suppose the source distribution is a mixture of log-concave distributions, and suppose Assumptions 1 and 2 hold. Let $\sigma^*$ be the maximum directional standard deviation of any distribution in the subspace spanned by $\{\mu_i^1\}_{i=1}^k$. If, for each pair $i$ and $j$ and for a fixed constant $C$,
\[
\|\mu_i^1 - \mu_j^1\| \geq C \sigma^* \sqrt{k \log(kn/\delta)},
\]
then, with probability $1-\delta$, Algorithm 1 correctly classifies the examples if the number of examples used is
\[
c \cdot \frac{d}{(\sigma^*)^2 \lambda_{\min}^2 w_{\min}^2} \log^3\!\Big(\frac{d}{\sigma^* \lambda_{\min} w_{\min}}\Big) \log^2(1/\delta)
\]
for some constant $c$.

4. Analyzing Our Algorithm

In this section, we prove our main theorems. First, we define some notation.

Notation. In the sequel, we assume that we are given samples from a mixture which obeys Assumptions 1 and 2. We use the notation $S^1$ (resp. $S^2$) to denote the subspace containing the centers of the distributions in the mixture in View 1 (resp. View 2), and the notation $S'^1$ (resp. $S'^2$) to denote the orthogonal complement of the subspace containing the centers of the distributions in the mixture in View 1 (resp. View 2). For any matrix $A$, we use $\|A\|$ to denote the $L_2$ norm, or maximum singular value, of $A$.

Proofs. Now we are ready to prove our main theorem. First, we show the following two lemmas, which demonstrate properties of the expected cross-correlational matrix across the views. Their proofs are immediate from Assumptions 1 and 2.

Lemma 2 Let $v^1$ and $v^2$ be any vectors in $S^1$ and $S^2$ respectively. Then, $|(v^1)^\top \Sigma_{12} v^2| > \lambda_{\min}$.

Lemma 3 Let $v^1$ (resp. $v^2$) be any vector in $S'^1$ (resp. $S'^2$). Then, for any $u^1 \in V_1$ and $u^2 \in V_2$, $(v^1)^\top \Sigma_{12} u^2 = (u^1)^\top \Sigma_{12} v^2 = 0$.

Next, we show that, given sufficiently many samples, the subspace spanned by the top $k-1$ singular vectors of $\widehat{\Sigma}_{12}$ still approximates the subspace containing the means of the distributions comprising the mixture. Finally, we use this fact, along with some results in [AK05], to prove Theorem 1. Our main lemma of this section is the following.

Lemma 4 (Projection Subspace Lemma) Let $v^1$ (resp. $v^2$) be any vector in $S^1$ (resp. $S^2$). If the number of samples
\[
n > c\, \frac{d}{\tau^2 \lambda_{\min}^2 w_{\min}} \log^2\!\Big(\frac{d}{\tau \lambda_{\min} w_{\min}}\Big) \log^2\!\Big(\frac{1}{\delta}\Big)
\]
for some constant $c$, then, with probability $1-\delta$, the length of the projection of $v^1$ (resp. $v^2$) onto the subspace spanned by the top $k-1$ left (resp. right) singular vectors of $\widehat{\Sigma}_{12}$ is at least $\sqrt{1-\tau^2}\,\|v^1\|$ (resp. $\sqrt{1-\tau^2}\,\|v^2\|$).

The main tool in the proof of Lemma 4 is the following lemma, which uses a result due to [RV07].

Lemma 5 (Sample Complexity Lemma) If the number of samples
\[
n > c \cdot \frac{d}{\epsilon^2 w_{\min}} \log^2\!\Big(\frac{d}{\epsilon w_{\min}}\Big) \log^2\!\Big(\frac{1}{\delta}\Big)
\]
for some constant $c$, then, with probability at least $1-\delta$, $\|\widehat{\Sigma}_{12} - \Sigma_{12}\| \leq \epsilon$, where $\|\cdot\|$ denotes the $L_2$-norm of a matrix.

A consequence of Lemma 5 and Lemmas 2 and 3 is the following lemma.

Lemma 6 Let $n > C \frac{d}{\epsilon^2 w_{\min}} \log^2(\frac{d}{\epsilon w_{\min}}) \log^2(\frac{1}{\delta})$, for some constant $C$. Then, with probability $1-\delta$, the top $k-1$ singular values of $\widehat{\Sigma}_{12}$ have value at least $\lambda_{\min} - \epsilon$. The remaining $\min(d_1, d_2) - k + 1$ singular values of $\widehat{\Sigma}_{12}$ have value at most $\epsilon$.

The proof follows by a combination of Lemmas 2, 3, 5 and a triangle inequality.

Proof (of Lemma 5): To prove this lemma, we apply Lemma 7. Observe the block representation of $\Sigma$ in Equation 1. Moreover, with $\Sigma_{11}$ and $\Sigma_{22}$ in isotropic position, we have that the $L_2$ norm of $\Sigma_{12}$ is at most 1. Using the triangle inequality, we can write:
\[
\|\widehat{\Sigma}_{12} - \Sigma_{12}\| \leq \frac{1}{2}\Big(\|\widehat{\Sigma} - \Sigma\| + \|\widehat{\Sigma}_{11} - \Sigma_{11}\| + \|\widehat{\Sigma}_{22} - \Sigma_{22}\|\Big)
\]
(where we have applied the triangle inequality to the $2 \times 2$ block matrix with off-diagonal entries $\widehat{\Sigma}_{12} - \Sigma_{12}$ and with 0 diagonal entries). We now apply Lemma 7 three times: on $\widehat{\Sigma}_{11} - \Sigma_{11}$, on $\widehat{\Sigma}_{22} - \Sigma_{22}$, and on a scaled version of $\widehat{\Sigma} - \Sigma$. The first two applications follow directly.

For the third application, we observe that Lemma 7 is rotation invariant, and that scaling each covariance value by some factor $s$ scales the norm of the matrix by at most $s$. We claim that we can apply Lemma 7 to $\widehat{\Sigma} - \Sigma$ with $s = 4$. Since the covariance of any two random variables is at most the product of their standard deviations, and since $\Sigma_{11}$ and $\Sigma_{22}$ are $I_{d_1}$ and $I_{d_2}$ respectively, the maximum singular value of $\Sigma_{12}$ is at most 1; the maximum singular value of $\Sigma$ is therefore at most 4. Our claim follows. The lemma now follows by plugging in $n$ as a function of $\epsilon$, $d$ and $w_{\min}$.

Lemma 7 Let $X$ be a set of $n$ points generated by a mixture of $k$ Gaussians over $\mathbb{R}^d$, scaled such that $E[x \cdot x^\top] = I_d$. If $M$ is the sample covariance matrix of $X$, then, for $n$ large enough, with probability at least $1-\delta$,
\[
\|M - E[M]\| \leq C \cdot \frac{\sqrt{d \log n\, \log(2n/\delta)\, \log(1/\delta)}}{\sqrt{w_{\min}\, n}}
\]
where $C$ is a fixed constant, and $w_{\min}$ is the minimum mixing weight of any Gaussian in the mixture.

Proof: To prove this lemma, we use a concentration result on the $L_2$-norms of matrices due to [RV07]. We observe that each vector $x_i$ in the scaled space is generated by a Gaussian with some mean $\mu$ and maximum directional variance $\sigma^2$. As the total variance of the mixture along any direction is at most 1, $w_{\min}(\|\mu\|^2 + \sigma^2) \leq 1$. Therefore, for all samples $x_i$, with probability at least $1-\delta/2$, $\|x_i\| \leq \|\mu\| + \sigma\sqrt{d \log(2n/\delta)}$.

We condition on the event that $\|x_i\| \leq \|\mu\| + \sigma\sqrt{d \log(2n/\delta)}$ for all $i = 1, \ldots, n$; the probability of this event is at least $1-\delta/2$. Conditioned on this event, the distributions of the vectors $x_i$ are independent. Therefore, we can apply Theorem 3.1 in [RV07] on these conditional distributions to conclude that
\[
\Pr[\|M - E[M]\| > t] \leq 2 e^{-c n t^2 / (\Lambda^2 \log n)}
\]
where $c$ is a constant, and $\Lambda$ is an upper bound on the norm of any vector $\|x_i\|$. The lemma follows by plugging in $t = \sqrt{\frac{\Lambda^2 \log(4/\delta) \log n}{c n}}$ and $\Lambda \leq \sqrt{\frac{2 d \log(2n/\delta)}{w_{\min}}}$.

Now we are ready to prove Lemma 4.

Proof (of Lemma 4): For the sake of contradiction, suppose there exists a vector $v^1 \in S^1$ such that the projection of $v^1$ on the top $k-1$ left singular vectors of $\widehat{\Sigma}_{12}$ is equal to $\sqrt{1-\tilde\tau^2}\,\|v^1\|$, where $\tilde\tau > \tau$. Then there exists some unit vector $u^1$ in $V_1$, in the orthogonal complement of the space spanned by the top $k-1$ left singular vectors of $\widehat{\Sigma}_{12}$, such that the projection of $v^1$ on $u^1$ is equal to $\tilde\tau\|v^1\|$. This vector $u^1$ can be written as $u^1 = \tilde\tau v^1 + (1-\tilde\tau^2)^{1/2} y^1$, where $y^1$ is in the orthogonal complement of $S^1$. From Lemma 2, there exists some vector $u^2$ in $S^2$ such that $(v^1)^\top \Sigma_{12} u^2 \geq \lambda_{\min}$; from Lemma 3, for this vector $u^2$, $(u^1)^\top \Sigma_{12} u^2 \geq \tilde\tau \lambda_{\min}$. If $n > c \frac{d}{\tilde\tau^2 \lambda_{\min}^2 w_{\min}} \log^2(\frac{d}{\tilde\tau \lambda_{\min} w_{\min}}) \log^2(\frac{1}{\delta})$, then, from Lemma 6, $(u^1)^\top \widehat{\Sigma}_{12} u^2 \geq \frac{\tilde\tau}{2} \lambda_{\min}$.

Now, since $u^1$ is in the orthogonal complement of the subspace spanned by the top $k-1$ left singular vectors of $\widehat{\Sigma}_{12}$, for any vector $y$ in the subspace spanned by the top $k-1$ right singular vectors of $\widehat{\Sigma}_{12}$, $(u^1)^\top \widehat{\Sigma}_{12} y = 0$. This, in turn, means that there exists a vector $z^2$ in $V_2$, in the orthogonal complement of the subspace spanned by the top $k-1$ right singular vectors of $\widehat{\Sigma}_{12}$, such that $(u^1)^\top \widehat{\Sigma}_{12} z^2 \geq \frac{\tilde\tau}{2} \lambda_{\min}$. This implies that the $k$-th singular value of $\widehat{\Sigma}_{12}$ is at least $\frac{\tilde\tau}{2} \lambda_{\min}$. However, from Lemma 6, all except the top $k-1$ singular values of $\widehat{\Sigma}_{12}$ are at most $\frac{\tau}{3} \lambda_{\min}$, which is a contradiction.

Finally, we are ready to prove our main theorem.

Proof (of Theorem 1): From Lemma 4, if $n > C \cdot \frac{d}{\tau^2 \lambda_{\min}^2 w_{\min}} \log^2(\frac{d}{\tau \lambda_{\min} w_{\min}}) \log^2(1/\delta)$, then, with probability at least $1-\delta$, the projection of any vector $v$ in $S^1$ or $S^2$ onto the subspace returned by Step 2 of Algorithm 1 has length at least $\sqrt{1-\tau^2}\,\|v\|$. Therefore, the maximum directional variance of any $D_i$ in this subspace is at most $(1-\tau^2)(\sigma^*)^2 + \tau^2 \sigma^2$, where $\sigma^2$ is the maximum directional variance of any $D_i$. When $\tau \leq \sigma^*/\sigma$, this is at most $2(\sigma^*)^2$. From the isotropic condition, $\sigma \leq \frac{1}{\sqrt{w_{\min}}}$. Therefore, when $n > C \cdot \frac{d}{(\sigma^*)^2 \lambda_{\min}^2 w_{\min}^2} \log^2(\frac{d}{\sigma^* \lambda_{\min} w_{\min}}) \log^2(1/\delta)$, the maximum directional variance of any $D_i$ in the mixture, in the space output by Step 2 of the algorithm, is at most $2(\sigma^*)^2$.

Since $A$ and $B$ are random partitions of the sample set $S$, the subspace produced by the action of Step 2 of Algorithm 1 on the set $A$ is independent of $B$, and vice versa. Therefore, when projected onto the top $k-1$ SVD subspace of $\widehat{\Sigma}_{12}(A)$, the samples from $B$ are distributed as a mixture of $(k-1)$-dimensional Gaussians. The theorem follows from the bounds in the previous paragraph and Theorem 1 of [AK05].
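As an informal sanity check of Lemma 1 and Lemma 4 (a toy simulation of our own; all parameter values are arbitrary), the following script draws a two-view mixture whose views are independent given the source, and verifies that each view-1 mean lies almost entirely inside the span of the top $k-1$ left singular vectors of the empirical cross-covariance:

    import numpy as np

    rng = np.random.default_rng(1)
    k, d1, d2, n = 3, 20, 30, 200_000
    w = np.array([0.5, 0.3, 0.2])                       # mixing weights
    mu1 = rng.normal(size=(k, d1))                      # view-1 means
    mu2 = rng.normal(size=(k, d2))                      # view-2 means
    mu1 -= w @ mu1                                      # make the mixture zero-mean in each view
    mu2 -= w @ mu2

    z = rng.choice(k, size=n, p=w)                      # latent source
    X1 = mu1[z] + rng.normal(scale=0.5, size=(n, d1))   # noise independent across views,
    X2 = mu2[z] + rng.normal(scale=0.5, size=(n, d2))   # so the views are uncorrelated given z

    S12 = X1.T @ X2 / n                                 # empirical cross-covariance
    U, s, _ = np.linalg.svd(S12)
    proj = U[:, :k - 1] @ U[:, :k - 1].T                # projector onto top k-1 left singular vectors
    for i in range(k):
        ratio = np.linalg.norm(proj @ mu1[i]) / np.linalg.norm(mu1[i])
        print(f"mean {i}: fraction inside the learned subspace = {ratio:.4f}")  # close to 1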

5. Experiments

5.1. Audio-visual speaker clustering

In the first set of experiments, we consider clustering either audio or face images of speakers. We use a subset of the VidTIMIT database [?] consisting of 41 speakers, speaking 10 sentences (about 20 seconds) each, recorded at 25 frames per second in a studio environment with no significant lighting or pose variation. The audio features are standard 12-dimensional mel cepstra [?] computed every 10 ms over a 20 ms window, and finally concatenated over a window of 200 ms before and after the current frame, for a total of 1584 dimensions. The video features are the pixels of the face region extracted from each image (2394 dimensions). We consider the target cluster variable to be the speaker.

We use either CCA or PCA to project the data to some lower dimensionality $N$. In the case of CCA, we initially project the data to an intermediate dimensionality $M$ using PCA to reduce the effects of spurious correlations. For the results reported here, $N$ is typically 40, and $M$ is typically 100 for images and 1000 for audio. The parameters were selected using a held-out set. For CCA, we randomize the vectors of one view in each sentence, to reduce correlations between the views due to certain other latent variables such as the current phoneme. We then cluster either view using k-means into 82 clusters (2 per speaker). To alleviate the problem of local minima found by k-means, each clustering consists of 5 runs of k-means, and the one with the lowest k-means score is taken as the final cluster assignment.

Similarly to [?], we measure clustering performance using the conditional entropy of the speaker $s$ given the cluster $c$, $H(s|c)$. We report the results in terms of conditional perplexity, $2^{H(s|c)}$, which is the mean number of speakers corresponding to each cluster. Table 1 shows results on the raw data, as well as with synthetic occlusions and translations of the image data.

                           PCA     CCA
    Images                 1.1     1.4
    Audio                 35.3    12.5
    Images + occlusion     6.1     1.4
    Audio + occlusion     35.3    12.5
    Images + translation   3.4     3.4
    Audio + translation   35.3    13.4

Table 1. Conditional perplexities of the speaker given the cluster, using PCA or CCA bases. "+ occlusion" and "+ translation" indicate that the images are corrupted with occlusion/translation; the audio is unchanged, however.
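The conditional perplexity $2^{H(s|c)}$ used above can be computed directly from the two label sequences; a minimal sketch (our own helper, not the authors' evaluation code):

    import numpy as np

    def conditional_perplexity(speakers, clusters):
        # 2**H(s|c): the mean effective number of speakers per cluster (lower is better).
        s_vals, s = np.unique(speakers, return_inverse=True)
        c_vals, c = np.unique(clusters, return_inverse=True)
        joint = np.zeros((len(s_vals), len(c_vals)))
        np.add.at(joint, (s, c), 1.0)
        joint /= joint.sum()                            # p(s, c)
        p_c = joint.sum(axis=0)                         # p(c); positive for every observed cluster
        p_s_given_c = joint / p_c                       # p(s | c), broadcast over rows
        h = -(joint * np.log2(np.where(joint > 0, p_s_given_c, 1.0))).sum()   # H(s|c) in bits
        return 2.0 ** h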
Considering the extremely clean visual environment, we expect PCA to do very well on the image data. Indeed, PCA provides an almost perfect clustering of the raw images, and CCA does not improve on it. However, CCA far outperforms PCA when clustering the more challenging audio view. When synthetic occlusions or translations are applied to the images, the performance of PCA-based clustering is greatly degraded. CCA is unaffected in the case of occlusion; in the case of translation, CCA-based image clustering is degraded similarly to PCA, but audio clustering is almost unaffected. In other words, even when the image data are degraded, CCA is able to recover a good clustering in at least one of the views. For a more detailed look at the clustering behavior, Figures 1(a-d) show the distributions of clusters for each speaker in several conditions.

5.2. Clustering Wikipedia articles

Next we consider the task of clustering Wikipedia articles, based on either their text or their incoming and outgoing links. The link structure $L$ is represented as a concatenation of "to" and "from" link incidence vectors, where each element $L(i)$ is the number of times the current article links to/from article $i$. The article text is represented as a bag-of-words feature vector, i.e. the raw count of each word in the article. A lexicon of about 8 million words and a list of about 12 million articles were used to construct the two feature vectors. Since the dimensionality of the feature vectors is very high (over 20 million for the link view), we first reduce the dimensionality to a computationally manageable level.

We present clustering experiments on a subset of Wikipedia consisting of 128,327 articles. We use either PCA or CCA to reduce the feature vectors to the final desired dimensionality, followed by clustering. In these experiments, we use a hierarchical clustering procedure, as a flat clustering is rather poor with either PCA or CCA (CCA still usually outperforms PCA, however). In the hierarchical procedure, all points are initially considered to be in a single cluster. Next, we iteratively pick the largest cluster, reduce the dimensionality using PCA or CCA on the points in this cluster, and use k-means to break the cluster into smaller sub-clusters (for some fixed k), until we reach the total desired number of clusters. The intuition for this is that different clusters may have different natural subspaces. A sketch of this procedure follows below.
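A compact sketch of this hierarchical procedure for a single view (our own simplification: `TruncatedSVD` stands in for the per-cluster PCA step, and a CCA variant would instead recompute the cross-view covariance on the points of the cluster):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD

    def hierarchical_cluster(X, n_clusters, k_split=2, dim=40):
        # Repeatedly split the currently largest cluster inside its own low-dimensional subspace.
        clusters = [np.arange(X.shape[0])]              # start with everything in one cluster
        while len(clusters) < n_clusters:
            clusters.sort(key=len)
            big = clusters.pop()                        # take the largest cluster
            n_comp = min(dim, len(big) - 1, X.shape[1] - 1)
            Z = TruncatedSVD(n_components=n_comp).fit_transform(X[big])
            sub = KMeans(n_clusters=k_split, n_init=10).fit_predict(Z)
            clusters.extend(big[sub == j] for j in range(k_split))
        labels = np.empty(X.shape[0], dtype=int)
        for lbl, idx in enumerate(clusters):
            labels[idx] = lbl
        return labels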

[Figure 1 appears here. Panels: (a) AV: Audio, PCA basis; (b) AV: Audio, CCA basis; (c) AV: Images + occlusion, PCA basis; (d) AV: Images + occlusion, CCA basis; (e) Wikipedia: Category perplexity (hierarchical CCA vs. hierarchical PCA); (f) Wikipedia: Cluster perplexity (balanced clustering, hierarchical CCA, hierarchical PCA). Panels (a-d) plot speaker vs. cluster; panels (e-f) plot perplexity vs. number of clusters.]

Figure 1. (a-d) Distributions of cluster assignments per speaker in audio-visual experiments. The color of each cell (s, c) corresponds to the empirical probability p(c|s) (darker = higher). (e-f) Wikipedia experiments: (e) conditional perplexity of article category given cluster, $2^{H(a|c)}$, as a function of the number of clusters; (f) perplexity of the cluster distribution, $2^{H(c)}$, as a function of the number of clusters.

As before, we evaluate the clustering using the conditional perplexity of the article category $a$ (as given by Wikipedia) given the cluster $c$, $2^{H(a|c)}$. For each article we use the first category listed in the article. The 128,327 articles include roughly 15,000 categories, of which we use the 500 most frequent ones, which cover 73,145 articles. While the clustering is performed on all 128,327 articles, the reported entropies are for the 73,145 articles covered by the 500 categories. Each sub-clustering consists of 10 runs of k-means, and the one with the lowest k-means score is taken as the final cluster assignment.

Figure 1(e) shows the conditional perplexity versus the number of clusters for PCA- and CCA-based hierarchical clustering. For any number of clusters, CCA produces better clusterings, i.e. ones with lower perplexity. (Note that the reduction in entropy with a larger number of clusters is expected for any approach; the relevant point is the difference for any given number of clusters.) In addition, the tree structures of the PCA- and CCA-based clusterings are qualitatively different. With PCA-based clustering, most points are assigned to a few large clusters, with the remaining clusters consisting of only a few points. CCA-based hierarchical clustering produces more balanced clusters. To see this, in Figure 1(f) we show the perplexity of the cluster distribution, $2^{H(c)}$, versus the number of clusters. For about 25 or more clusters, the CCA-based clusterings have higher perplexity, indicating a more uniform distribution of clusters than in PCA-based clustering.

References

[AK05] S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability, 15(1A):69–92, 2005.

[AM05] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory, pages 458–469, 2005.

[AZ07] R. K. Ando and T. Zhang. Two-view feature generation model for semi-supervised learning. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 25–32. ACM, 2007.

[BM98] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, pages 92–100. Morgan Kaufmann, 1998.

[BV08] S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In Proceedings of the IEEE Symposium on Foundations of Computer Science, 2008.

[CR08] K. Chaudhuri and S. Rao. Learning mixtures of distributions using correlations and independence. In Proceedings of the Conference on Learning Theory, 2008.

[Das99] S. Dasgupta. Learning mixtures of Gaussians. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, pages 634–644, 1999.

[DS00] S. Dasgupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2000.

[KF07] S. M. Kakade and D. P. Foster. Multi-view regression via canonical correlation analysis. In COLT, volume 4539 of Lecture Notes in Computer Science, pages 82–96. Springer, 2007.

[KSV05] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.

[RV07] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 2007.

[VW02] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In Proceedings of the 43rd IEEE Symposium on Foundations of Computer Science, pages 113–123, 2002.