Multi-View Clustering via Canonical Correlation Analysis

Kamalika Chaudhuri  [email protected]
ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA

Sham M. Kakade  [email protected]
Karen Livescu  [email protected]
Karthik Sridharan  [email protected]
Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, IL

Abstract

Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Here, we consider constructing such projections using multiple views of the data, via Canonical Correlation Analysis (CCA).

Under the assumption that the views are uncorrelated given the cluster label, we show that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature. We provide results for mixtures of Gaussians and mixtures of log concave distributions. We also provide empirical support from audio-visual speaker clustering (where we desire the clusters to correspond to speaker ID) and from hierarchical Wikipedia document clustering (where one view is the words in the document and the other is the link structure).

1. Introduction

The multi-view approach to learning is one in which we have 'views' of the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest [BM98, KF07, AZ07]. In this work, we explore how having two 'views' makes the clustering problem significantly more tractable.

Much recent work has been done on understanding under what conditions we can learn a mixture model. The basic problem is as follows: we are given independent samples from a mixture of k distributions, and our task is to either 1) infer properties of the underlying mixture model (e.g. the mixing weights, means, etc.) or 2) classify a random sample according to which distribution in the mixture it was generated from.

Under no restrictions on the underlying mixture, this problem is considered to be hard. However, in many applications, we are only interested in clustering the data when the component distributions are "well separated". In fact, the focus of recent clustering algorithms [Das99, VW02, AM05, BV08] is on efficiently learning with as little separation as possible. Typically, the separation conditions are such that, given a random sample from the mixture model, the Bayes optimal classifier is able to reliably recover which cluster generated that point.

This work makes a natural multi-view assumption: that the views are (conditionally) uncorrelated, conditioned on which mixture component generated the views. There are many natural applications for which this assumption applies. For example, we can consider multi-modal views, with one view being a video stream and the other an audio stream of a speaker; here, conditioned on the speaker identity and maybe the phoneme (both of which could label the generating cluster), the views may be uncorrelated. A second example is the words and link structure in a document from a corpus such as Wikipedia; here, conditioned on the category of each document, the words in it and its link structure may be uncorrelated. In this paper, we provide experiments for both settings.
Under this multi-view assumption, we provide a simple and efficient subspace learning method, based on Canonical Correlation Analysis (CCA). This algorithm is affine invariant and is able to learn with some of the weakest separation conditions to date. The intuitive reason for this is that under our multi-view assumption, we are able to (approximately) find the low-dimensional subspace spanned by the means of the component distributions. This subspace is important because, when projected onto this subspace, the means of the distributions are well-separated, yet the typical distance between points from the same distribution is smaller than in the original space. The number of samples we require to cluster correctly scales as O(d), where d is the ambient dimension. Finally, we show through experiments that CCA-based algorithms consistently provide better performance than standard PCA-based clustering methods when applied to datasets in the two quite different domains of audio-visual speaker clustering and hierarchical Wikipedia document clustering by category.

Our work adds to the growing body of results which show how the multi-view framework can alleviate the difficulty of learning problems.

Related Work. Most provably efficient clustering algorithms first project the data down to some low-dimensional space and then cluster the data in this lower dimensional space (an algorithm such as single linkage usually suffices here). Typically, these algorithms also work under a separation requirement, which is measured by the minimum distance between the means of any two mixture components.

One of the first provably efficient algorithms for learning mixture models is due to [Das99], who learns a mixture of spherical Gaussians by randomly projecting the mixture onto a low-dimensional subspace. [VW02] provide an algorithm with an improved separation requirement that learns a mixture of k spherical Gaussians, by projecting the mixture down to the k-dimensional subspace of highest variance. [KSV05, AM05] extend this result to mixtures of general Gaussians; however, they require a separation proportional to the maximum directional standard deviation of any mixture component. [CR08] use a canonical correlations-based algorithm to learn mixtures of axis-aligned Gaussians with a separation proportional to σ*, the maximum directional standard deviation in the subspace containing the means of the distributions. Their algorithm requires a coordinate-independence property, and an additional "spreading" condition. None of these algorithms are affine invariant. Finally, [BV08] provide an affine-invariant algorithm for learning mixtures of general Gaussians, so long as the mixture has a suitably low Fisher coefficient when in isotropic position. However, their separation involves a large polynomial dependence on 1/w_min.

The two results most closely related to ours are the work of [VW02] and [CR08]. [VW02] show that it is sufficient to find the subspace spanned by the means of the distributions in the mixture for effective clustering. Like our algorithm, [CR08] use a projection onto the top k − 1 singular value decomposition subspace of the canonical correlations matrix. They also require a spreading condition, which is related to our requirement on the rank. We borrow techniques from both of these papers.

[BL08] propose a similar algorithm for multi-view clustering, in which data is projected onto the top directions obtained by kernel CCA across the views. They show empirically that for clustering images using the associated text as a second view (where the target clustering is a human-defined category), CCA-based clustering methods out-perform PCA-based algorithms.

This Work. Our input is data on a fixed set of objects from two views, where View j is assumed to be generated by a mixture of k Gaussians (D_1^j, ..., D_k^j), for j = 1, 2. To generate a sample, a source i is picked with probability w_i, and x^(1) and x^(2) in Views 1 and 2 are drawn from distributions D_i^1 and D_i^2. Following prior theoretical work, our goal is to show that our algorithm recovers the correct clustering, provided the input mixture obeys certain conditions.

We impose two requirements on these mixtures. First, we require that conditioned on the source, the two views are uncorrelated. Notice that this is a weaker restriction than the condition that given source i, the samples from D_i^1 and D_i^2 are drawn independently. Moreover, this condition allows the distributions in the mixture within each view to be completely general, so long as they are uncorrelated across views. Although we do not prove this, our algorithm seems robust to small deviations from this assumption.

Second, we require the rank of the CCA matrix across the views to be at least k − 1, when each view is in isotropic position, and the (k − 1)-th singular value of this matrix to be at least λ_min. This condition ensures that there is sufficient correlation between the views. If the first two conditions hold, then we can recover the subspace containing the means in both views.

In addition, for mixtures of Gaussians, if in at least one view, say View 1, we have that for every pair of distributions i and j in the mixture,

    ||µ_i^1 − µ_j^1|| > C σ* k^{1/4} √(log(n/δ))

for some constant C, then our algorithm can also determine which component each sample came from. Here µ_i^1 is the mean of the i-th component in View 1 and σ* is the maximum directional standard deviation in the subspace containing the means in View 1. Moreover, the number of samples required to learn this mixture grows (almost) linearly with d.

This separation condition is considerably weaker than previous results in that σ* only depends on the directional variance in the subspace spanned by the means, which can be considerably lower than the maximum directional variance over all directions. The only other algorithm which provides affine-invariant guarantees is due to [BV08]; the implied separation in their work is rather large and grows with decreasing w_min, the minimum mixing weight. To get our improved sample complexity bounds, we use a result due to [RV07] which may be of independent interest.
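To make the generative process described under "This Work" concrete, here is a small illustrative sketch (ours, not from the paper) of the two-view mixture model. It draws the two views independently given the source, which is stronger than the uncorrelatedness actually required, but is the simplest way to satisfy the multi-view condition in simulation; all names and parameter values are our own.

```python
import numpy as np

def sample_two_view_mixture(n, means1, means2, weights, sigma=1.0, seed=0):
    """Draw n samples (x1, x2, label) from a two-view mixture of spherical Gaussians.

    Conditioned on the source i, x1 ~ N(means1[i], sigma^2 I) and
    x2 ~ N(means2[i], sigma^2 I) are drawn independently, so the views
    are uncorrelated given the source.
    """
    rng = np.random.default_rng(seed)
    k, d1 = means1.shape
    _, d2 = means2.shape
    labels = rng.choice(k, size=n, p=weights)            # source i picked with probability w_i
    x1 = means1[labels] + sigma * rng.standard_normal((n, d1))
    x2 = means2[labels] + sigma * rng.standard_normal((n, d2))
    return x1, x2, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    k, d1, d2 = 3, 20, 30
    means1 = 3.0 * rng.standard_normal((k, d1))          # well-separated means, View 1
    means2 = 3.0 * rng.standard_normal((k, d2))          # well-separated means, View 2
    weights = np.array([0.5, 0.3, 0.2])
    x1, x2, labels = sample_two_view_mixture(5000, means1, means2, weights)
    print(x1.shape, x2.shape, np.bincount(labels))
```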

We stress that our improved results are really due to the multi-view condition. Had we simply combined the data from both views, and applied previous algorithms on the combined data, we could not have obtained our guarantees. We also emphasize that for our algorithm to cluster successfully, it is sufficient for the distributions in the mixture to obey the separation condition in one view, so long as the multi-view and rank conditions are obeyed.

Finally, we study through experiments the performance of CCA-based algorithms on data sets from two different domains. First, we experiment with audio-visual speaker clustering, in which the two views are audio and face images of a speaker, and the target cluster variable is the speaker. Our experiments show that CCA-based algorithms perform better than PCA-based algorithms on audio data and just as well on image data, and are more robust to occlusions of the images. For our second experiment, we cluster documents in Wikipedia. The two views are the words and the link structure in a document, and the target cluster is the category. Our experiments show that a CCA-based hierarchical clustering algorithm out-performs PCA-based hierarchical clustering for this data.

2. The Setting

We assume that our data is generated by a mixture of k distributions. In particular, we assume that we obtain samples x = (x^(1), x^(2)), where x^(1) and x^(2) are the two views, which live in the vector spaces V1 of dimension d1 and V2 of dimension d2, respectively. We let d = d1 + d2. Let µ_i^j, for i = 1, ..., k and j = 1, 2, be the mean of distribution i in view j, and let w_i be the mixing weight for distribution i.

For simplicity, we assume that the data have mean 0. We denote the covariance matrix of the data as:

    Σ = E[x x^T],   Σ_11 = E[x^(1) (x^(1))^T],   Σ_22 = E[x^(2) (x^(2))^T],   Σ_12 = E[x^(1) (x^(2))^T]

Hence, we have:

    Σ = [ Σ_11  Σ_12 ]
        [ Σ_21  Σ_22 ]                                                        (1)

The multi-view assumption we work with is as follows:

Assumption 1 (Multi-View Condition) We assume that conditioned on the source distribution l in the mixture (where l = i is picked with probability w_i), the two views are uncorrelated. More precisely, we assume that for all i ∈ [k],

    E[x^(1) (x^(2))^T | l = i] = E[x^(1) | l = i] · E[(x^(2))^T | l = i]

This assumption implies that Σ_12 = ∑_i w_i µ_i^1 (µ_i^2)^T. To see this, observe that

    E[x^(1) (x^(2))^T] = ∑_i E_{D_i}[x^(1) (x^(2))^T] Pr[D_i]
                       = ∑_i w_i E_{D_i}[x^(1)] · E_{D_i}[(x^(2))^T]
                       = ∑_i w_i µ_i^1 (µ_i^2)^T                              (2)

As the distributions are in isotropic position, we observe that ∑_i w_i µ_i^1 = ∑_i w_i µ_i^2 = 0. Therefore, the above equation shows that the rank of Σ_12 is at most k − 1. We now assume that it has rank precisely k − 1.

Assumption 2 (Non-Degeneracy Condition) We assume that Σ_12 has rank k − 1 and that the minimal non-zero singular value of Σ_12 is λ_min > 0 (where we are working in a coordinate system where Σ_11 and Σ_22 are identity matrices).

For clarity of exposition, we also work in an isotropic coordinate system in each view. Specifically, the expected covariance matrix of the data, in each view, is the identity matrix, i.e. Σ_11 = I_{d1}, Σ_22 = I_{d2}. As our analysis shows, our algorithm is robust to errors, so we assume that the data is whitened as a pre-processing step.

One way to view the Non-Degeneracy Assumption is in terms of correlation coefficients. Recall that for two directions u ∈ V1 and v ∈ V2, the correlation coefficient is defined as:

    ρ(u, v) = E[(u · x^(1))(v · x^(2))] / √( E[(u · x^(1))^2] E[(v · x^(2))^2] )

An alternative definition of λ_min is the minimal non-zero correlation coefficient, λ_min = min_{u,v : ρ(u,v) ≠ 0} ρ(u, v). Note that 1 ≥ λ_min > 0.

We use Σ̂_11 and Σ̂_22 to denote the sample covariance matrices in views 1 and 2 respectively. We use Σ̂_12 to denote the sample covariance matrix combined across views 1 and 2. We assume these are obtained through empirical averages from i.i.d. samples from the underlying distribution.
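The following sketch (ours) is a numerical sanity check of Equation (2) and the rank argument: after whitening each view to isotropic position, the empirical cross-covariance of a synthetic mixture like the one sketched earlier should have only about k − 1 singular values that are far from zero. The whitening routine and parameter values are our own illustrative choices.

```python
import numpy as np

def whiten(x, eps=1e-8):
    """Center x and map it to (approximately) isotropic position (covariance -> identity)."""
    x = x - x.mean(axis=0)
    cov = x.T @ x / len(x)
    evals, evecs = np.linalg.eigh(cov)
    return x @ (evecs / np.sqrt(np.maximum(evals, eps))) @ evecs.T

rng = np.random.default_rng(0)
k, d1, d2, n = 3, 20, 30, 20000
mu1 = 3.0 * rng.standard_normal((k, d1))      # illustrative means, View 1
mu2 = 3.0 * rng.standard_normal((k, d2))      # illustrative means, View 2
w = np.array([0.5, 0.3, 0.2])
lab = rng.choice(k, size=n, p=w)
x1 = mu1[lab] + rng.standard_normal((n, d1))  # views drawn independently given the source,
x2 = mu2[lab] + rng.standard_normal((n, d2))  # so Assumption 1 holds

z1, z2 = whiten(x1), whiten(x2)
sigma12 = z1.T @ z2 / n                        # empirical cross-covariance in isotropic coordinates
svals = np.linalg.svd(sigma12, compute_uv=False)
print(np.round(svals[:6], 3))                  # only about k - 1 = 2 values stand well above the rest
```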

3. The Clustering Algorithm

The following lemma provides the intuition for our algorithm.

Lemma 1 Under Assumption 2, if U, D, V is the 'thin' SVD of Σ_12 (where the thin SVD removes all zero entries from the diagonal), then the subspace spanned by the means in view 1 is precisely the column span of U (and we have the analogous statement for view 2).

The lemma is a consequence of Equation 2 and the rank assumption. Since samples from a mixture are well-separated in the space containing the means of the distributions, the lemma suggests the following strategy: use CCA to (approximately) project the data down to the subspace spanned by the means to get an easier clustering problem, and then apply standard clustering algorithms in this space.

Our clustering algorithm, based on the above idea, is stated below. We can show that this algorithm clusters correctly with high probability, when the data in at least one of the views obeys a separation condition, in addition to our assumptions.

The input to the algorithm is a set of samples S and a number k, and the output is a clustering of these samples into k clusters. For this algorithm, we assume that the data obeys the separation condition in View 1; an analogous algorithm can be applied when the data obeys the separation condition in View 2 as well.

Algorithm 1.

1. Randomly partition S into two subsets A and B of equal size.

2. Let Σ̂_12(A) (Σ̂_12(B) resp.) denote the empirical covariance matrix between views 1 and 2, computed from the sample set A (B resp.). Compute the top k − 1 left singular vectors of Σ̂_12(A) (Σ̂_12(B) resp.), and project the samples in B (A resp.) on the subspace spanned by these vectors.

3. Apply single linkage clustering [DE04] (for mixtures of log-concave distributions), or the algorithm in Section 3.5 of [AK05] (for mixtures of Gaussians), on the projected examples in View 1.

We note that in Step 3, we apply either single linkage or the algorithm of [AK05]; this allows us to show theoretically that if the distributions in the mixture are of a certain type, and given the right separation conditions, the clusters can be recovered correctly. In practice, however, these algorithms do not perform as well due to lack of robustness, and one would use an algorithm such as k-means or EM to cluster in this low-dimensional subspace. In particular, a variant of the EM algorithm has been shown [DS00] to correctly cluster mixtures of Gaussians, under certain conditions.

Moreover, in Step 1, we divide the data set into two halves to ensure independence between Steps 2 and 3 for our analysis; in practice, however, these steps can be executed on the same sample set.
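Below is a minimal sketch (ours, not the authors' code) of one direction of Algorithm 1, with k-means standing in for single linkage or the procedure of [AK05] in Step 3, as suggested above. It assumes the two views have already been centered and whitened; names and defaults are our own.

```python
import numpy as np
from sklearn.cluster import KMeans

def cca_cluster_view1(x1, x2, k, seed=0):
    """Sketch of Algorithm 1 (one half of the symmetric procedure).

    x1, x2: centered, whitened samples from Views 1 and 2 (rows are paired).
    Half the data (A) estimates the projection; the other half (B) is
    projected and clustered. Returns the indices of B and their cluster labels.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x1))
    A, B = perm[: len(x1) // 2], perm[len(x1) // 2:]       # Step 1: random split

    sigma12_A = x1[A].T @ x2[A] / len(A)                   # Step 2: empirical cross-covariance on A
    U, _, _ = np.linalg.svd(sigma12_A, full_matrices=False)
    proj_B = x1[B] @ U[:, :k - 1]                          # project View 1 of B on top k-1 left directions

    # Step 3 (practical variant): k-means in the projected space
    labels_B = KMeans(n_clusters=k, n_init=5).fit_predict(proj_B)
    return B, labels_B
```

The symmetric half, estimating the subspace on B and projecting A, is handled analogously.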

Main Results. Our main theorem is as follows.

Theorem 1 (Gaussians) Suppose the source distribution is a mixture of Gaussians, and suppose Assumptions 1 and 2 hold. Let σ* be the maximum directional standard deviation of any distribution in the subspace spanned by {µ_i^1}_{i=1}^k. If, for each pair i and j and for a fixed constant C,

    ||µ_i^1 − µ_j^1|| ≥ C σ* k^{1/4} √(log(kn/δ))

then, with probability 1 − δ, Algorithm 1 correctly classifies the examples if the number of examples used is

    c · (d / ((σ*)^2 λ_min^2 w_min^2)) · log^2(d / (σ* λ_min w_min)) · log^2(1/δ)

for some constant c.

Here we assume that a separation condition holds in View 1, but a similar theorem also applies to View 2. An analogous theorem can also be shown for mixtures of log-concave distributions.

Theorem 2 (Log-concave Distributions) Suppose the source distribution is a mixture of log-concave distributions, and suppose Assumptions 1 and 2 hold. Let σ* be the maximum directional standard deviation of any distribution in the subspace spanned by {µ_i^1}_{i=1}^k. If, for each pair i and j and for a fixed constant C,

    ||µ_i^1 − µ_j^1|| ≥ C σ* √k log(kn/δ)

then, with probability 1 − δ, Algorithm 1 correctly classifies the examples if the number of examples used is

    c · (d / ((σ*)^2 λ_min^2 w_min^2)) · log^3(d / (σ* λ_min w_min)) · log^2(1/δ)

for some constant c.

The proof follows from the proof of Theorem 1, along with standard results on log-concave probability distributions; see [KSV05, AM05]. We do not provide a proof here due to space constraints.

4. Analyzing Our Algorithm

In this section, we prove our main theorems.

Notation. In the sequel, we assume that we are given samples from a mixture which obeys Assumptions 1 and 2. We use the notation S^1 (resp. S^2) to denote the subspace containing the centers of the distributions in the mixture in View 1 (resp. View 2), and the notation S'^1 (resp. S'^2) to denote the orthogonal complement of the subspace containing the centers of the distributions in the mixture in View 1 (resp. View 2). For any matrix A, we use ||A|| to denote the L2 norm or maximum singular value of A.

Proofs. Now, we are ready to prove our main theorem. First, we show the following two lemmas, which demonstrate properties of the expected cross-correlational matrix across the views. Their proofs are immediate from Assumptions 1 and 2.

Lemma 2 Let v^1 and v^2 be any vectors in S^1 and S^2 respectively. Then, |(v^1)^T Σ_12 v^2| > λ_min.

Lemma 3 Let v^1 (resp. v^2) be any vector in S'^1 (resp. S'^2). Then, for any u^1 ∈ V1 and u^2 ∈ V2, (v^1)^T Σ_12 u^2 = (u^1)^T Σ_12 v^2 = 0.

Next, we show that given sufficiently many samples, the subspace spanned by the top k − 1 singular vectors of Σ̂_12 still approximates the subspace containing the means of the distributions comprising the mixture. Finally, we use this fact, along with some results in [AK05], to prove Theorem 1. Our main lemma of this section is the following.

Lemma 4 (Projection Subspace Lemma) Let v^1 (resp. v^2) be any vector in S^1 (resp. S^2). If the number of samples n > c · (d / (τ^2 λ_min^2 w_min)) · log^2(d / (τ λ_min w_min)) · log^2(1/δ) for some constant c, then, with probability 1 − δ, the length of the projection of v^1 (resp. v^2) in the subspace spanned by the top k − 1 left (resp. right) singular vectors of Σ̂_12 is at least √(1 − τ^2) ||v^1|| (resp. √(1 − τ^2) ||v^2||).

The main tool in the proof of Lemma 4 is the following lemma, which uses a result due to [RV07].

Lemma 5 (Sample Complexity Lemma) If the number of samples

    n > c · (d / (ε^2 w_min)) · log^2(d / (ε w_min)) · log^2(1/δ)

for some constant c, then, with probability at least 1 − δ, ||Σ̂_12 − Σ_12|| ≤ ε.

A consequence of Lemmas 5, 2 and 3 is the following.

Lemma 6 Let n > C · (d / (ε^2 w_min)) · log^2(d / (ε w_min)) · log^2(1/δ), for some constant C. Then, with probability 1 − δ, the top k − 1 singular values of Σ̂_12 have value at least λ_min − ε. The remaining min(d1, d2) − k + 1 singular values of Σ̂_12 have value at most ε.

The proof follows by a combination of Lemmas 2, 3 and 5.
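As an illustrative check of Lemma 4 (ours, under the same synthetic setup as the earlier sketches), a direction lying in the whitened mean subspace of View 1 should retain almost all of its length when projected onto the top k − 1 left singular vectors of the empirical cross-covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
k, d1, d2, n = 3, 20, 30, 50000
mu1 = 3.0 * rng.standard_normal((k, d1))
mu2 = 3.0 * rng.standard_normal((k, d2))
w = np.array([0.5, 0.3, 0.2])
lab = rng.choice(k, size=n, p=w)
x1 = mu1[lab] + rng.standard_normal((n, d1))
x2 = mu2[lab] + rng.standard_normal((n, d2))

def whitening_matrix(x, eps=1e-8):
    evals, evecs = np.linalg.eigh(np.cov(x, rowvar=False))
    return (evecs / np.sqrt(np.maximum(evals, eps))) @ evecs.T

W1, W2 = whitening_matrix(x1), whitening_matrix(x2)
z1 = (x1 - x1.mean(0)) @ W1
z2 = (x2 - x2.mean(0)) @ W2

U, s, Vt = np.linalg.svd(z1.T @ z2 / n, full_matrices=False)
top = U[:, :k - 1]                                   # top k-1 left singular vectors of the sample cross-covariance

# a direction in the (whitened) mean subspace S^1: the difference of two class means
v = (x1[lab == 0].mean(0) - x1[lab == 1].mean(0)) @ W1
v /= np.linalg.norm(v)
print(np.linalg.norm(top.T @ v))                     # close to 1, consistent with Lemma 4
```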

Proof: (Of Lemma 5) To prove this lemma, we apply Lemma 7. Observe the block representation of Σ in Equation 1. Moreover, with Σ_11 and Σ_22 in isotropic position, we have that the L2 norm of Σ_12 is at most 1. Using the triangle inequality, we can write:

    ||Σ̂_12 − Σ_12|| ≤ (1/2) ( ||Σ̂ − Σ|| + ||Σ̂_11 − Σ_11|| + ||Σ̂_22 − Σ_22|| )

(where we applied the triangle inequality to the 2 × 2 block matrix with off-diagonal entries Σ̂_12 − Σ_12 and with 0 diagonal entries). We now apply Lemma 7 three times, on Σ̂_11 − Σ_11, Σ̂_22 − Σ_22, and a scaled version of Σ̂ − Σ. The first two applications follow directly. For the third application, we observe that Lemma 7 is rotation invariant, and that scaling each covariance value by some factor s scales the norm of the matrix by at most s. We claim that we can apply Lemma 7 on Σ̂ − Σ with s = 4. Since the covariance of any two random variables is at most the product of their standard deviations, and since Σ_11 and Σ_22 are I_{d1} and I_{d2} respectively, the maximum singular value of Σ_12 is at most 1; so the maximum singular value of Σ is at most 4. Our claim follows. The lemma follows by plugging in n as a function of ε, d and w_min. □

Lemma 7 Let X be a set of n points generated by a mixture of k Gaussians over R^d, scaled such that E[x x^T] = I_d. If M is the sample covariance matrix of X, then, for n large enough, with probability at least 1 − δ,

    ||M − E[M]|| ≤ C · √( d log n · log(2n/δ) · log(1/δ) / (w_min n) )

where C is a fixed constant, and w_min is the minimum mixing weight of any Gaussian in the mixture.

Proof: To prove this lemma, we use a concentration result on the L2-norms of matrices due to [RV07]. We observe that each vector x_i in the scaled space is generated by a Gaussian with some mean µ and maximum directional variance σ^2. As the total variance of the mixture along any direction is at most 1, w_min (||µ||^2 + σ^2) ≤ 1. Therefore, for all samples x_i, with probability at least 1 − δ/2, ||x_i|| ≤ ||µ|| + σ √(d log(2n/δ)).

We condition on the fact that the event ||x_i|| ≤ ||µ|| + σ √(d log(2n/δ)) happens for all i = 1, ..., n. The probability of this event is at least 1 − δ/2. Conditioned on this event, the distributions of the vectors x_i are independent. Therefore, we can apply Theorem 3.1 in [RV07] on these conditional distributions, to conclude that:

    Pr[ ||M − E[M]|| > t ] ≤ 2 e^{−c n t^2 / (Λ^2 log n)}

where c is a constant, and Λ is an upper bound on the norm of any vector ||x_i||. The lemma follows by plugging in t = √( Λ^2 log(4/δ) log n / (c n) ) and Λ ≤ 2 √(d log(2n/δ)) / √(w_min). □
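The concentration behaviour behind Lemma 5 can also be seen empirically (our own illustration): in the synthetic model, the population cross-covariance of the centered views is ∑_i w_i (µ_i^1 − m^1)(µ_i^2 − m^2)^T, where m^j = ∑_i w_i µ_i^j, and the spectral-norm error of its empirical estimate shrinks roughly like 1/√n. Centering is used here instead of full whitening only to keep the closed form simple.

```python
import numpy as np

rng = np.random.default_rng(3)
k, d1, d2 = 3, 20, 30
mu1 = 3.0 * rng.standard_normal((k, d1))
mu2 = 3.0 * rng.standard_normal((k, d2))
w = np.array([0.5, 0.3, 0.2])

# population cross-covariance of the centered views: sum_i w_i (mu1_i - m1)(mu2_i - m2)^T
c1, c2 = mu1 - w @ mu1, mu2 - w @ mu2
sigma12_true = (c1 * w[:, None]).T @ c2

for n in (1_000, 10_000, 100_000):
    lab = rng.choice(k, size=n, p=w)
    x1 = mu1[lab] + rng.standard_normal((n, d1))
    x2 = mu2[lab] + rng.standard_normal((n, d2))
    x1c, x2c = x1 - x1.mean(0), x2 - x2.mean(0)
    err = np.linalg.norm(x1c.T @ x2c / n - sigma12_true, ord=2)   # spectral (L2) norm
    print(n, round(err, 4))   # error decays roughly like 1/sqrt(n)
```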

Proof: (Of Lemma 4) For the sake of contradiction, suppose there exists a vector v^1 ∈ S^1 such that the projection of v^1 on the top k − 1 left singular vectors of Σ̂_12 is equal to √(1 − τ̃^2) ||v^1||, where τ̃ > τ. Then, there exists some unit vector u^1 in V1, in the orthogonal complement of the space spanned by the top k − 1 left singular vectors of Σ̂_12, such that the projection of v^1 on u^1 is equal to τ̃ ||v^1||. This vector u^1 can be written as u^1 = τ̃ v^1 + (1 − τ̃^2)^{1/2} y^1, where y^1 is in the orthogonal complement of S^1. From Lemma 2, there exists some vector u^2 in S^2 such that (v^1)^T Σ_12 u^2 ≥ λ_min; from Lemma 3, for this vector u^2, (u^1)^T Σ_12 u^2 ≥ τ̃ λ_min. If n > c · (d / (τ̃^2 λ_min^2 w_min)) · log^2(d / (τ̃ λ_min w_min)) · log^2(1/δ), then, from Lemma 6, (u^1)^T Σ̂_12 u^2 ≥ (τ̃/2) λ_min.

Now, since u^1 is in the orthogonal complement of the subspace spanned by the top k − 1 left singular vectors of Σ̂_12, for any vector y^2 in the subspace spanned by the top k − 1 right singular vectors of Σ̂_12, (u^1)^T Σ̂_12 y^2 = 0. This means that there exists a vector z^2 ∈ V2, in the orthogonal complement of the subspace spanned by the top k − 1 right singular vectors of Σ̂_12, such that (u^1)^T Σ̂_12 z^2 ≥ (τ̃/2) λ_min. This implies that the k-th singular value of Σ̂_12 is at least (τ̃/2) λ_min. However, from Lemma 6, all but the top k − 1 singular values of Σ̂_12 are at most (τ/3) λ_min, which is a contradiction. □

Proof: (Of Theorem 1) From Lemma 4, if n > C · (d / (τ^2 λ_min^2 w_min)) · log^2(d / (τ λ_min w_min)) · log^2(1/δ), then, with probability at least 1 − δ, the projection of any vector v in S^1 or S^2 onto the subspace returned by Step 2 of Algorithm 1 has length at least √(1 − τ^2) ||v||. Therefore, the maximum directional variance of any D_i in this subspace is at most (1 − τ^2)(σ*)^2 + τ^2 σ^2, where σ^2 is the maximum directional variance of any D_i. When τ ≤ σ*/σ, this is at most 2(σ*)^2. From the isotropic condition, σ ≤ 1/√(w_min). Therefore, when n > C · (d / ((σ*)^2 λ_min^2 w_min^2)) · log^2(d / (σ* λ_min w_min)) · log^2(1/δ), the maximum directional variance of any D_i in the mixture in the space output by Step 2 is at most 2(σ*)^2.

Since A and B are random partitions of the sample set S, the subspace produced by the action of Step 2 of Algorithm 1 on the set A is independent of B, and vice versa. Therefore, when projected onto the top k − 1 SVD subspace of Σ̂_12(A), the samples from B are distributed as a mixture of (k − 1)-dimensional Gaussians. The theorem follows from the previous paragraph and Theorem 1 of [AK05]. □

5. Experiments

5.1. Audio-visual speaker clustering

In the first set of experiments, we consider clustering either audio or face images of speakers. We use 41 speakers from the VidTIMIT database [San08], speaking 10 sentences (about 20 seconds) each, recorded at 25 frames per second in a studio environment with no significant lighting or pose variation. The audio features are standard 12-dimensional mel cepstra [DM80] and their derivatives and double derivatives, computed every 10ms over a 20ms window and finally concatenated over a window of 440ms centered on the current frame, for a total of 1584 dimensions. The video features are the pixels of the face region extracted from each image (2394 dimensions).
We consider the target cluster variable to be the speaker. We use either CCA or PCA to project the data to a lower dimensionality N. In the case of CCA, we initially project to an intermediate dimensionality M using PCA to reduce the effects of spurious correlations. For the results reported here, typical values (selected using a held-out set) are N = 40 and M = 100 for images and 1000 for audio. For CCA, we randomize the vectors of one view in each sentence, to reduce correlations between the views due to other latent variables such as the current phoneme. We then cluster either view using k-means into 82 clusters (2 per speaker). To alleviate the problem of local minima found by k-means, each clustering consists of 5 runs of k-means, and the one with the lowest score is taken as the final clustering.

Similarly to [BL08], we measure clustering performance using the conditional entropy of the speaker s given the cluster c, H(s|c). We report the results in terms of conditional perplexity, 2^{H(s|c)}, which is the mean number of speakers corresponding to each cluster.
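A minimal sketch (ours) of this evaluation metric, computing the conditional perplexity 2^{H(s|c)} from paired speaker and cluster labels:

```python
import numpy as np

def conditional_perplexity(speakers, clusters):
    """Return 2^{H(s|c)} for paired label sequences (speakers must be non-negative ints)."""
    speakers, clusters = np.asarray(speakers), np.asarray(clusters)
    h = 0.0
    for c in np.unique(clusters):
        mask = clusters == c
        p_c = mask.mean()                                  # P(c)
        counts = np.bincount(speakers[mask])
        p_s = counts[counts > 0] / mask.sum()              # P(s | c)
        h -= p_c * (p_s * np.log2(p_s)).sum()              # accumulate P(c) * H(s | c)
    return 2.0 ** h

# a perfect clustering (one speaker per cluster) has perplexity 1.0:
print(conditional_perplexity([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 7, 7]))
```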

Table 1. Conditional perplexities of the speaker given the cluster, using PCA or CCA bases. "+ occlusion" and "+ translation" indicate that the images are corrupted with occlusion/translation; the audio is unchanged, however.

                          PCA     CCA
    Images                 1.1     1.4
    Audio                 35.3    12.5
    Images + occlusion     6.1     1.4
    Audio + occlusion     35.3    12.5
    Images + translation   3.4     3.4
    Audio + translation   35.3    13.4

Table 1 shows results on the raw data, as well as with synthetic occlusions and translations of the image data. Considering the clean visual environment, we expect PCA to do very well on the image data. Indeed, PCA provides an almost perfect clustering of the raw images and CCA does not improve it. However, CCA far outperforms PCA when clustering the more challenging audio view. When synthetic occlusions or translations are applied to the images, the performance of PCA-based clustering is greatly degraded. CCA is unaffected in the case of occlusion; in the case of translation, CCA-based image clustering is degraded similarly to PCA, but audio clustering is almost unaffected. In other words, even when the image data are degraded, CCA is able to recover a good clustering in at least one of the views.¹ For a more detailed look at the clustering behavior, Figures 1(a-d) show the distributions of clusters for each speaker.

¹ The audio task is unusually challenging, as each feature vector corresponds to only a few phonemes. A typical speaker classification setting uses entire sentences. If we force the cluster identity to be constant over each sentence (the most frequent cluster label in the sentence), performance improves greatly; e.g., in the "audio+occlusion" case, the perplexity improves to 8.5 (PCA) and 2.1 (CCA).

5.2. Clustering Wikipedia articles

Next we consider the task of clustering Wikipedia articles, based on either their text or their incoming and outgoing links. The link structure L is represented as a concatenation of "to" and "from" link incidence vectors, where each element L(i) is the number of times the current article links to/from article i. The article text is represented as a bag-of-words feature vector, i.e. the raw count of each word in the article. A lexicon of about 8 million words and a list of about 12 million articles were used to construct the two feature vectors. Since the dimensionality of the feature vectors is very high (over 20 million for the link view), we use random projection to reduce the dimensionality to a computationally manageable level.

We present clustering experiments on a subset of Wikipedia consisting of 128,327 articles. We use either PCA or CCA to reduce the feature vectors to the final dimensionality, followed by clustering. In these experiments, we use a hierarchical clustering procedure, as a flat clustering is poor with either PCA or CCA (CCA still usually outperforms PCA, however). In the hierarchical procedure, all points are initially considered to be in a single cluster. Next, we iteratively pick the largest cluster, reduce the dimensionality using PCA or CCA on the points in this cluster, and use k-means to break the cluster into smaller sub-clusters (for some fixed k), until we reach the total desired number of clusters. The intuition for this is that different clusters may have different natural subspaces.
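A sketch of this hierarchical procedure (our own simplification, with PCA as the per-cluster dimensionality reduction; the CCA variant would replace that step with a projection computed from both views):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def hierarchical_cluster(x, total_clusters, split_k=2, dim=40, seed=0):
    """Repeatedly split the largest cluster with k-means after per-cluster PCA."""
    labels = np.zeros(len(x), dtype=int)
    next_label = 1
    while labels.max() + 1 < total_clusters:
        big = np.bincount(labels).argmax()                 # pick the largest current cluster
        idx = np.where(labels == big)[0]
        n_comp = min(dim, len(idx) - 1, x.shape[1])        # reduce dimensionality on this cluster only
        z = PCA(n_components=n_comp).fit_transform(x[idx])
        sub = KMeans(n_clusters=split_k, n_init=10, random_state=seed).fit_predict(z)
        for s in range(1, split_k):                        # sub-cluster 0 keeps the old label
            labels[idx[sub == s]] = next_label
            next_label += 1
    return labels
```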
As before, we evaluate the clustering using the conditional perplexity of the article category a (as given by Wikipedia) given the cluster c, 2^{H(a|c)}. For each article we use the first category listed in the article. The 128,327 articles include roughly 15,000 categories, of which we use the 500 most frequent ones, which cover 73,145 articles. While the clustering is performed on all 128,327 articles, the reported entropies are for the 73,145 articles. Each sub-clustering consists of 10 runs of k-means, and the one with the lowest k-means score is taken as the final cluster assignment.

Figure 1(e) shows the conditional perplexity versus the number of clusters for PCA and CCA based hierarchical clustering. For any number of clusters, CCA produces better clusterings, i.e. ones with lower perplexity. In addition, the tree structures of the PCA/CCA-based clusterings are qualitatively different. With PCA based clustering, most points are assigned to a few large clusters, with the remaining clusters being very small. CCA-based hierarchical clustering produces more balanced clusters. To see this, in Figure 1(f) we show the perplexity of the cluster distribution versus number of clusters. For about 25 or more clusters, the CCA-based clustering has higher perplexity, indicating a more uniform distribution of clusters.

[Figure 1 omitted. Panels: (a) AV: Audio, PCA basis; (b) AV: Audio, CCA basis; (c) AV: Images + occlusion, PCA basis; (d) AV: Images + occlusion, CCA basis; (e) Wikipedia: Category perplexity; (f) Wikipedia: Cluster perplexity.]

Figure 1. (a-d) Distributions of cluster assignments per speaker in audio-visual experiments. The color of each cell (s, c) corresponds to the empirical probability p(c|s) (darker = higher). (e-f) Wikipedia experiments: (e) Conditional perplexity of article category given cluster (2^{H(a|c)}). (f) Perplexity of the cluster distribution (2^{H(c)}).

References

[AK05] S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. Ann. Applied Prob., 15(1A):69–92, 2005.

[AM05] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Conf. on Learning Thy., pages 458–469, 2005.

[AZ07] R. Kubota Ando and T. Zhang. Two-view feature generation model for semi-supervised learning. In Int. Conf. on Machine Learning, pages 25–32, 2007.

[BL08] M. B. Blaschko and C. H. Lampert. Correlational spectral clustering. In Conf. on Comp. Vision and Pattern Recognition, 2008.

[BM98] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conf. on Learning Thy., pages 92–100, 1998.

[BV08] S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In Found. of Comp. Sci., pages 551–560, 2008.

[CR08] K. Chaudhuri and S. Rao. Learning mixtures of distributions using correlations and independence. In Conf. on Learning Thy., pages 9–20, 2008.

[Das99] S. Dasgupta. Learning mixtures of Gaussians. In Found. of Comp. Sci., pages 634–644, 1999.

[DE04] G. Dunn and B. Everitt. An Introduction to Math. Taxonomy. Dover Books, 2004.

[DM80] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics, Speech, and Signal Proc., 28(4):357–366, 1980.

[DS00] S. Dasgupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In Uncertainty in Art. Int., pages 152–159, 2000.

[KF07] S. M. Kakade and D. P. Foster. Multi-view regression via canonical correlation analysis. In Conf. on Learning Thy., pages 82–96, 2007.

[KSV05] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Conf. on Learning Thy., pages 444–457, 2005.

[RV07] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. JACM, 2007.

[San08] C. Sanderson. Biometric Person Recognition: Face, Speech and Fusion. VDM-Verlag, 2008.

[VW02] V. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In Found. of Comp. Sci., pages 113–123, 2002.