
Multi-View Clustering via Canonical Correlation Analysis

Kamalika Chaudhuri [email protected]
UC San Diego, 9500 Gilman Drive, La Jolla, CA

Sham M. Kakade [email protected]
Karen Livescu [email protected]
Karthik Sridharan [email protected]
Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, IL

Keywords: multi-view learning, clustering, canonical correlation analysis

Abstract

Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Here, we consider constructing such projections using multiple views of the data, via Canonical Correlation Analysis (CCA).

Under the assumption that the views are uncorrelated given the cluster label, we show that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature. We provide results for mixtures of Gaussians and mixtures of log concave distributions. We also provide empirical support from audio-visual speaker clustering (where we desire the clusters to correspond to speaker ID) and from hierarchical Wikipedia document clustering (where one view is the words in the document and the other is the link structure).

1. Introduction

The multi-view approach to learning is one in which we have multiple 'views' of the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest [BM98, KF07, AZ07]. In this work, we explore how having two 'views' makes the clustering problem significantly more tractable.

Much recent work has been done on understanding under what conditions we can learn a mixture model. The basic problem is as follows: we are given independent samples from a mixture of k distributions, and our task is to either 1) infer properties of the underlying mixture model (e.g. the mixing weights, means, etc.) or 2) classify a random sample according to which distribution in the mixture it was generated from.

Under no restrictions on the underlying mixture, this problem is considered to be hard. However, in many applications, we are only interested in clustering the data when the component distributions are "well separated". In fact, the focus of recent clustering algorithms [Das99, VW02, AM05, BV08] is on efficiently learning with as little separation as possible. Typically, the separation conditions are such that, given a random sample from the mixture model, the Bayes optimal classifier is able to reliably recover which cluster generated that point.

This work makes a natural multi-view assumption: that the views are (conditionally) uncorrelated, conditioned on which mixture component generated the views. There are many natural applications for which this assumption applies. For example, we can consider multi-modal views, with one view being a video stream and the other an audio stream of a speaker; here, conditioned on the speaker identity and maybe the phoneme (both of which could label the generating cluster), the views may be uncorrelated.

A second example is the words and link structure in a document from a corpus such as Wikipedia: here, conditioned on the category of each document, the words in it and its link structure may be uncorrelated. In this paper, we provide experiments for both settings.

Under this multi-view assumption, we provide a simple and efficient subspace learning method, based on Canonical Correlation Analysis (CCA). This algorithm is affine invariant and is able to learn with some of the weakest separation conditions to date. The intuitive reason for this is that under our multi-view assumption, we are able to (approximately) find the low-dimensional subspace spanned by the means of the component distributions. This subspace is important because, when projected onto it, the means of the distributions are well-separated, yet the typical distance between points from the same distribution is smaller than in the original space. The number of samples we require to cluster correctly scales as O(d), where d is the ambient dimension. Finally, we show through experiments that CCA-based algorithms consistently provide better performance than standard PCA-based clustering methods when applied to datasets in the two quite different domains of audio-visual speaker clustering and hierarchical Wikipedia document clustering by category.

Our work adds to the growing body of results which show how the multi-view framework can alleviate the difficulty of learning problems.

Related Work. Most provably efficient clustering algorithms first project the data down to some low-dimensional space and then cluster the data in this lower-dimensional space (an algorithm such as single linkage usually suffices here). Typically, these algorithms also work under a separation requirement, which is measured by the minimum distance between the means of any two mixture components.

One of the first provably efficient algorithms for learning mixture models is due to [Das99], who learns a mixture of spherical Gaussians by randomly projecting the mixture onto a low-dimensional subspace. [VW02] provide an algorithm with an improved separation requirement that learns a mixture of k spherical Gaussians, by projecting the mixture down to the k-dimensional subspace of highest variance. [KSV05, AM05] extend this result to mixtures of general Gaussians; however, they require a separation proportional to the maximum directional standard deviation of any mixture component. [CR08] use a canonical correlations-based algorithm to learn mixtures of axis-aligned Gaussians with a separation proportional to $\sigma^*$, the maximum directional standard deviation in the subspace containing the means of the distributions. Their algorithm requires a coordinate-independence property, and an additional "spreading" condition. None of these algorithms are affine invariant.

Finally, [BV08] provide an affine-invariant algorithm for learning mixtures of general Gaussians, so long as the mixture has a suitably low Fisher coefficient when in isotropic position. However, their separation involves a large polynomial dependence on $1/w_{\min}$.

The two results most closely related to ours are the work of [VW02] and [CR08]. [VW02] show that it is sufficient to find the subspace spanned by the means of the distributions in the mixture for effective clustering. Like our algorithm, [CR08] use a projection onto the top $k-1$ singular value decomposition subspace of the canonical correlations matrix. They also require a spreading condition, which is related to our requirement on the rank. We borrow techniques from both of these papers.

[BL08] propose a similar algorithm for multi-view clustering, in which data is projected onto the top directions obtained by kernel CCA across the views. They show empirically that for clustering images using the associated text as a second view (where the target clustering is a human-defined category), CCA-based clustering methods out-perform PCA-based algorithms.

This Work. Our input is data on a fixed set of objects from two views, where View $j$ is assumed to be generated by a mixture of $k$ Gaussians $(D_1^j, \ldots, D_k^j)$, for $j = 1, 2$. To generate a sample, a source $i$ is picked with probability $w_i$, and $x^{(1)}$ and $x^{(2)}$ in Views 1 and 2 are drawn from distributions $D_i^1$ and $D_i^2$. Following prior theoretical work, our goal is to show that our algorithm recovers the correct clustering, provided the input mixture obeys certain conditions.

We impose two requirements on these mixtures. First, we require that, conditioned on the source, the two views are uncorrelated. Notice that this is a weaker restriction than the condition that, given source $i$, the samples from $D_i^1$ and $D_i^2$ are drawn independently. Moreover, this condition allows the distributions in the mixture within each view to be completely general, so long as they are uncorrelated across views. Although we do not prove this, our algorithm seems robust to small deviations from this assumption.

Second, we require the rank of the CCA matrix across the views to be at least $k-1$ when each view is in isotropic position, and the $(k-1)$-th singular value of this matrix to be at least $\lambda_{\min}$. This condition ensures that there is sufficient correlation between the views. If these two conditions hold, then we can recover the subspace containing the means in both views.

In addition, for mixtures of Gaussians, if in at least one view, say View 1, we have that for every pair of distributions $i$ and $j$ in the mixture,
$$\|\mu_i^1 - \mu_j^1\| > C \sigma^* k^{1/4} \sqrt{\log(n/\delta)}$$
for some constant $C$, then our algorithm can also determine which component each sample came from. Here $\mu_i^1$ is the mean of the $i$-th component in View 1 and $\sigma^*$ is the maximum directional standard deviation in the subspace containing the means in View 1. Moreover, the number of samples required to learn this mixture grows (almost) linearly with $d$.

This separation condition is considerably weaker than previous results in that $\sigma^*$ only depends on the directional variance in the subspace spanned by the means, which can be considerably lower than the maximum directional variance over all directions. The only other algorithm which provides affine-invariant guarantees is due to [BV08]; the implied separation in their work is rather large and grows with decreasing $w_{\min}$, the minimum mixing weight. To get our improved sample complexity bounds, we use a result due to [RV07] which may be of independent interest.

We stress that our improved results are really due to the multi-view condition. Had we simply combined the data from both views and applied previous algorithms on the combined data, we could not have obtained our guarantees. We also emphasize that for our algorithm to cluster successfully, it is sufficient for the distributions in the mixture to obey the separation condition in one view, so long as the multi-view and rank conditions are obeyed.

Finally, we study through experiments the performance of CCA-based algorithms on data sets from two different domains. First, we experiment with audio-visual speaker clustering, in which the two views are audio and face images of a speaker, and the target cluster variable is the speaker. Our experiments show that CCA-based algorithms perform better than PCA-based algorithms on audio data and just as well on image data, and are more robust to occlusions of the images. For our second experiment, we cluster documents in Wikipedia. The two views are the words and the link structure in a document, and the target cluster is the category. Our experiments show that a CCA-based hierarchical clustering algorithm out-performs PCA-based hierarchical clustering for this data.

2. The Setting

We assume that our data is generated by a mixture of $k$ distributions. In particular, we assume that we obtain samples $x = (x^{(1)}, x^{(2)})$, where $x^{(1)}$ and $x^{(2)}$ are the two views, which live in the vector spaces $V_1$ of dimension $d_1$ and $V_2$ of dimension $d_2$, respectively. We let $d = d_1 + d_2$. Let $\mu_i^j$, for $i = 1, \ldots, k$ and $j = 1, 2$, be the mean of distribution $i$ in view $j$, and let $w_i$ be the mixing weight for distribution $i$.

For simplicity, we assume that the data have mean 0. We denote the covariance matrices of the data as
$$\Sigma = E[x x^\top], \quad \Sigma_{11} = E[x^{(1)} (x^{(1)})^\top], \quad \Sigma_{22} = E[x^{(2)} (x^{(2)})^\top], \quad \Sigma_{12} = E[x^{(1)} (x^{(2)})^\top].$$
Hence, we have
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \quad (1)$$

The multi-view assumption we work with is as follows:

Assumption 1 (Multi-View Condition) We assume that, conditioned on the source distribution $l$ in the mixture (where $l = i$ is picked with probability $w_i$), the two views are uncorrelated. More precisely, we assume that for all $i \in [k]$,
$$E[x^{(1)} (x^{(2)})^\top \mid l = i] = E[x^{(1)} \mid l = i]\, E[(x^{(2)})^\top \mid l = i].$$

This assumption implies that $\Sigma_{12} = \sum_i w_i \mu_i^1 (\mu_i^2)^\top$. To see this, observe that
$$E[x^{(1)} (x^{(2)})^\top] = \sum_i E_{D_i}[x^{(1)} (x^{(2)})^\top] \Pr[D_i] = \sum_i w_i E_{D_i}[x^{(1)}] \, E_{D_i}[(x^{(2)})^\top] = \sum_i w_i \mu_i^1 (\mu_i^2)^\top. \quad (2)$$

As the distributions are in isotropic position, we observe that $\sum_i w_i \mu_i^1 = \sum_i w_i \mu_i^2 = 0$. Therefore, the above equation shows that the rank of $\Sigma_{12}$ is at most $k-1$. We now assume that it has rank precisely $k-1$.

Assumption 2 (Non-Degeneracy Condition) We assume that $\Sigma_{12}$ has rank $k-1$ and that the minimal non-zero singular value of $\Sigma_{12}$ is $\lambda_{\min} > 0$ (where we are working in a coordinate system where $\Sigma_{11}$ and $\Sigma_{22}$ are identity matrices).
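To make Equation (2) concrete, here is a small simulation (our illustration, not part of the original paper; all variable names and parameter values are ours). It draws from a two-view mixture whose views are conditionally independent given the source (a stronger condition than the uncorrelatedness of Assumption 1), and checks that the empirical cross-covariance is close to $\sum_i w_i \mu_i^1 (\mu_i^2)^\top$ and has rank $k-1$ once the mixture is centered, as Assumption 2 posits.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d1, d2, n = 3, 10, 8, 200_000
w = np.array([0.5, 0.3, 0.2])          # mixing weights (hypothetical values)
mu1 = rng.normal(size=(k, d1))         # component means, view 1
mu2 = rng.normal(size=(k, d2))         # component means, view 2
mu1 -= w @ mu1                         # center so that sum_i w_i mu_i = 0 in each view
mu2 -= w @ mu2

src = rng.choice(k, size=n, p=w)       # latent source for each sample
x1 = mu1[src] + rng.normal(scale=0.5, size=(n, d1))   # view-1 samples; noise independent
x2 = mu2[src] + rng.normal(scale=0.5, size=(n, d2))   # of view-2 noise given the source

Sigma12_hat = x1.T @ x2 / n                     # empirical E[x1 x2^T]
Sigma12 = (w[:, None] * mu1).T @ mu2            # sum_i w_i mu_i^1 (mu_i^2)^T, as in Eq. (2)

print(np.linalg.norm(Sigma12_hat - Sigma12))    # small for large n
print(np.linalg.matrix_rank(Sigma12))           # k - 1 = 2
```

Note that the rank drop comes purely from the centering $\sum_i w_i \mu_i^j = 0$; the within-view covariances play no role in Equation (2).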

For clarity of exposition, we also work in an isotropic coordinate system in each view. Specifically, the expected covariance matrix of the data, in each view, is the identity matrix, i.e. $\Sigma_{11} = I_{d_1}$, $\Sigma_{22} = I_{d_2}$. As our analysis shows, our algorithm is robust to errors, so we assume that the data is whitened as a pre-processing step.

One way to view the Non-Degeneracy Assumption is in terms of correlation coefficients. Recall that for two directions $u \in V_1$ and $v \in V_2$, the correlation coefficient is defined as
$$\rho(u, v) = \frac{E[(u \cdot x^{(1)})(v \cdot x^{(2)})]}{\sqrt{E[(u \cdot x^{(1)})^2]\, E[(v \cdot x^{(2)})^2]}}.$$
An alternative definition of $\lambda_{\min}$ is the minimal non-zero correlation coefficient, $\lambda_{\min} = \min_{u,v:\rho(u,v)\neq 0} \rho(u, v)$. Note that $1 \ge \lambda_{\min} > 0$.

We use $\hat{\Sigma}_{11}$ and $\hat{\Sigma}_{22}$ to denote the sample covariance matrices in views 1 and 2 respectively, and $\hat{\Sigma}_{12}$ to denote the sample covariance matrix across views 1 and 2. We assume these are obtained through empirical averages from i.i.d. samples from the underlying distribution.
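The quantities above are straightforward to estimate from data. The following sketch (ours; the function names are not from the paper) whitens each view using its sample covariance and then takes an SVD of the cross-covariance in the whitened coordinates: the singular values are the empirical correlation coefficients, and the top $k-1$ left singular vectors estimate the view-1 subspace used in the next section.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Map centered data X (n x d) to coordinates where its sample covariance is I."""
    cov = X.T @ X / len(X)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T   # cov^{-1/2}
    return X @ W

def cca_subspace(X1, X2, k):
    """Return the top k-1 left singular vectors of the whitened cross-covariance,
    together with all singular values (= empirical correlation coefficients)."""
    Z1 = whiten(X1 - X1.mean(axis=0))
    Z2 = whiten(X2 - X2.mean(axis=0))
    S12 = Z1.T @ Z2 / len(Z1)            # cross-covariance in whitened coordinates
    U, s, _ = np.linalg.svd(S12)
    return U[:, :k - 1], s
```

On data satisfying Assumptions 1 and 2, all but the top $k-1$ entries of `s` should be close to zero, and the $(k-1)$-th entry is an empirical proxy for $\lambda_{\min}$.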

3. The Clustering Algorithm

The following lemma provides the intuition for our algorithm.

Lemma 1 Under Assumption 2, if $U$, $D$, $V$ is the 'thin' SVD of $\Sigma_{12}$ (where the thin SVD removes all zero entries from the diagonal), then the subspace spanned by the means in view 1 is precisely the column span of $U$ (and we have the analogous statement for view 2).

The lemma is a consequence of Equation 2 and the rank assumption. Since samples from a mixture are well-separated in the space containing the means of the distributions, the lemma suggests the following strategy: use CCA to (approximately) project the data down to the subspace spanned by the means to get an easier clustering problem, and then apply standard clustering algorithms in this space.

Our clustering algorithm, based on the above idea, is stated below. We can show that this algorithm clusters correctly with high probability when the data in at least one of the views obeys a separation condition, in addition to our assumptions. The input to the algorithm is a set of samples S and a number k, and the output is a clustering of these samples into k clusters. For this algorithm, we assume that the data obeys the separation condition in View 1; an analogous algorithm can be applied when the data obeys the separation condition in View 2.

Algorithm 1.

1. Randomly partition S into two subsets A and B of equal size.

2. Let $\hat{\Sigma}_{12}(A)$ ($\hat{\Sigma}_{12}(B)$ resp.) denote the empirical covariance matrix between views 1 and 2, computed from the sample set A (B resp.). Compute the top $k-1$ left singular vectors of $\hat{\Sigma}_{12}(A)$ ($\hat{\Sigma}_{12}(B)$ resp.), and project the samples in B (A resp.) onto the subspace spanned by these vectors.

3. Apply single linkage clustering [DE04] (for mixtures of log-concave distributions), or the algorithm in Section 3.5 of [AK05] (for mixtures of Gaussians), on the projected examples in View 1.

We note that in Step 3 we apply either single linkage or the algorithm of [AK05]; this allows us to show theoretically that if the distributions in the mixture are of a certain type, and given the right separation conditions, the clusters can be recovered correctly. In practice, however, these algorithms do not perform as well due to lack of robustness, and one would use an algorithm such as k-means or EM to cluster in this low-dimensional subspace. In particular, a variant of the EM algorithm has been shown [DS00] to correctly cluster mixtures of Gaussians under certain conditions.

Moreover, in Step 1 we divide the data set into two halves to ensure independence between Steps 2 and 3 for our analysis; in practice, however, these steps can be executed on the same sample set.
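The following is a minimal rendering of Algorithm 1 (our sketch, not the authors' code), using the `whiten` and `cca_subspace` helpers from the Section 2 sketch and, as discussed above, k-means in place of single linkage / the [AK05] procedure for the final step; scikit-learn is assumed to be available. Only the samples in B are labeled here; the symmetric application (project A onto the subspace learned from B) labels the other half.

```python
import numpy as np
from sklearn.cluster import KMeans   # practical stand-in for the clustering step

def multiview_cluster(X1, X2, k, seed=0):
    """Algorithm 1 (sketch): learn a CCA projection on one half of the data,
    then cluster the view-1 projection of the other half."""
    rng = np.random.default_rng(seed)
    n = len(X1)
    perm = rng.permutation(n)
    A, B = perm[: n // 2], perm[n // 2:]                   # Step 1: random split

    # Step 2: top k-1 left singular vectors of the (whitened) cross-covariance on A
    U_A, _ = cca_subspace(X1[A], X2[A], k)
    proj_B = whiten(X1[B] - X1[B].mean(axis=0)) @ U_A      # project view 1 of B

    # Step 3 (practical variant): k-means on the projected examples
    labels_B = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(proj_B)
    return B, labels_B
```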

Main Results. Our main theorem is as follows.

Theorem 1 (Gaussians) Suppose the source distribution is a mixture of Gaussians, and suppose Assumptions 1 and 2 hold. Let $\sigma^*$ be the maximum directional standard deviation of any distribution in the subspace spanned by $\{\mu_i^1\}_{i=1}^k$. If, for each pair $i$ and $j$ and for a fixed constant $C$,
$$\|\mu_i^1 - \mu_j^1\| \ge C \sigma^* k^{1/4} \sqrt{\log(kn/\delta)},$$
then, with probability $1-\delta$, Algorithm 1 correctly classifies the examples if the number of examples used is
$$c \cdot \frac{d}{(\sigma^*)^2 \lambda_{\min}^2 w_{\min}^2} \log^2\!\left(\frac{d}{\sigma^* \lambda_{\min} w_{\min}}\right) \log^2\!\left(\frac{1}{\delta}\right)$$
for some constant $c$.

Here we assume that the separation condition holds in View 1, but a similar theorem also applies to View 2. An analogous theorem can also be shown for mixtures of log-concave distributions.

Theorem 2 (Log-concave Distributions) Suppose the source distribution is a mixture of log-concave distributions, and suppose Assumptions 1 and 2 hold. Let $\sigma^*$ be the maximum directional standard deviation of any distribution in the subspace spanned by $\{\mu_i^1\}_{i=1}^k$. If, for each pair $i$ and $j$ and for a fixed constant $C$,
$$\|\mu_i^1 - \mu_j^1\| \ge C \sigma^* \sqrt{k} \log(kn/\delta),$$
then, with probability $1-\delta$, Algorithm 1 correctly classifies the examples if the number of examples used is
$$c \cdot \frac{d}{(\sigma^*)^2 \lambda_{\min}^2 w_{\min}^2} \log^3\!\left(\frac{d}{\sigma^* \lambda_{\min} w_{\min}}\right) \log^2\!\left(\frac{1}{\delta}\right)$$
for some constant $c$.

The proof of this theorem is very similar to, and follows from, the proof of Theorem 1, along with standard results on log-concave probability distributions; see [KSV05, AM05]. We do not provide a proof here due to space constraints.

4. Analyzing Our Algorithm

In this section, we prove our main theorems.

Notation. In the sequel, we assume that we are given samples from a mixture which obeys Assumptions 1 and 2. We use the notation $S^1$ (resp. $S^2$) to denote the subspace containing the centers of the distributions in the mixture in View 1 (resp. View 2), and the notation $S'^1$ (resp. $S'^2$) to denote the orthogonal complement of the subspace containing the centers of the distributions in the mixture in View 1 (resp. View 2). For any matrix $A$, we use $\|A\|$ to denote the $L_2$ norm, i.e. the maximum singular value of $A$.

Proofs. Now we are ready to prove our main theorem. First, we show the following two lemmas, which demonstrate properties of the expected cross-correlation matrix across the views. Their proofs are immediate from Assumptions 1 and 2.

Lemma 2 Let $v^1$ and $v^2$ be any vectors in $S^1$ and $S^2$ respectively. Then $|(v^1)^\top \Sigma_{12} v^2| > \lambda_{\min}$.

Lemma 3 Let $v^1$ (resp. $v^2$) be any vector in $S'^1$ (resp. $S'^2$). Then, for any $u^1 \in V_1$ and $u^2 \in V_2$, $(v^1)^\top \Sigma_{12} u^2 = (u^1)^\top \Sigma_{12} v^2 = 0$.

Next, we show that, given sufficiently many samples, the subspace spanned by the top $k-1$ singular vectors of $\hat{\Sigma}_{12}$ still approximates the subspace containing the means of the distributions comprising the mixture. Finally, we use this fact, along with some results in [AK05], to prove Theorem 1. Our main lemma of this section is the following.

Lemma 4 (Projection Subspace Lemma) Let $v^1$ (resp. $v^2$) be any vector in $S^1$ (resp. $S^2$). If the number of samples
$$n > c \cdot \frac{d}{\tau^2 \lambda_{\min}^2 w_{\min}} \log^2\!\left(\frac{d}{\tau \lambda_{\min} w_{\min}}\right) \log^2\!\left(\frac{1}{\delta}\right)$$
for some constant $c$, then, with probability $1-\delta$, the length of the projection of $v^1$ (resp. $v^2$) onto the subspace spanned by the top $k-1$ left (resp. right) singular vectors of $\hat{\Sigma}_{12}$ is at least $\sqrt{1-\tau^2}\,\|v^1\|$ (resp. $\sqrt{1-\tau^2}\,\|v^2\|$).

The main tool in the proof of Lemma 4 is the following lemma, which uses a result due to [RV07].

Lemma 5 (Sample Complexity Lemma) If the number of samples
$$n > c \cdot \frac{d}{\epsilon^2 w_{\min}} \log^2\!\left(\frac{d}{\epsilon w_{\min}}\right) \log^2\!\left(\frac{1}{\delta}\right)$$
for some constant $c$, then, with probability at least $1-\delta$, $\|\hat{\Sigma}_{12} - \Sigma_{12}\| \le \epsilon$.

A consequence of Lemmas 5, 2 and 3 is the following.

Lemma 6 Let $n > C \cdot \frac{d}{\epsilon^2 w_{\min}} \log^2\!\left(\frac{d}{\epsilon w_{\min}}\right) \log^2\!\left(\frac{1}{\delta}\right)$ for some constant $C$. Then, with probability $1-\delta$, the top $k-1$ singular values of $\hat{\Sigma}_{12}$ have value at least $\lambda_{\min} - \epsilon$. The remaining $\min(d_1, d_2) - k + 1$ singular values of $\hat{\Sigma}_{12}$ have value at most $\epsilon$.

The proof follows by a combination of Lemmas 2, 3, 5 and a triangle inequality.
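Before turning to the proofs, a quick numerical illustration of the spectral picture behind Lemma 6 (our sketch, reusing the mixture generator and the `cca_subspace` helper from the Section 2 sketches): as $n$ grows, the top $k-1$ singular values of the whitened $\hat{\Sigma}_{12}$ stay bounded away from zero while the remaining ones shrink.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d1, d2 = 3, 10, 8
w = np.array([0.5, 0.3, 0.2])
mu1 = rng.normal(size=(k, d1)); mu1 -= w @ mu1
mu2 = rng.normal(size=(k, d2)); mu2 -= w @ mu2

for n in (1_000, 10_000, 100_000):
    src = rng.choice(k, size=n, p=w)
    X1 = mu1[src] + rng.normal(scale=0.5, size=(n, d1))
    X2 = mu2[src] + rng.normal(scale=0.5, size=(n, d2))
    _, s = cca_subspace(X1, X2, k)     # helper from the Section 2 sketch
    print(n, np.round(s[:k], 3))       # first k-1 values stay large; the k-th decays
```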

Proof (of Lemma 5): To prove this lemma, we apply Lemma 7. Observe the block representation of $\Sigma$ in Equation 1. Moreover, with $\Sigma_{11}$ and $\Sigma_{22}$ in isotropic position, the $L_2$ norm of $\Sigma_{12}$ is at most 1. Using the triangle inequality, we can write
$$\|\hat{\Sigma}_{12} - \Sigma_{12}\| \le \frac{1}{2}\left(\|\hat{\Sigma} - \Sigma\| + \|\hat{\Sigma}_{11} - \Sigma_{11}\| + \|\hat{\Sigma}_{22} - \Sigma_{22}\|\right)$$
(where we applied the triangle inequality to the $2 \times 2$ block matrix with off-diagonal entries $\hat{\Sigma}_{12} - \Sigma_{12}$ and with 0 diagonal entries). We now apply Lemma 7 three times: on $\hat{\Sigma}_{11} - \Sigma_{11}$, $\hat{\Sigma}_{22} - \Sigma_{22}$, and a scaled version of $\hat{\Sigma} - \Sigma$. The first two applications follow directly. For the third application, we observe that Lemma 7 is rotation invariant, and that scaling each covariance value by some factor $s$ scales the norm of the matrix by at most $s$. We claim that we can apply Lemma 7 on $\hat{\Sigma} - \Sigma$ with $s = 4$. Since the covariance of any two random variables is at most the product of their standard deviations, and since $\Sigma_{11}$ and $\Sigma_{22}$ are $I_{d_1}$ and $I_{d_2}$ respectively, the maximum singular value of $\Sigma_{12}$ is at most 1, so the maximum singular value of $\Sigma$ is at most 4. Our claim follows. The lemma follows by plugging in $n$ as a function of $\epsilon$, $d$ and $w_{\min}$.

Lemma 7 Let $X$ be a set of $n$ points generated by a mixture of $k$ Gaussians over $\mathbb{R}^d$, scaled such that $E[x x^\top] = I_d$. If $M$ is the sample covariance matrix of $X$, then, for $n$ large enough, with probability at least $1-\delta$,
$$\|M - E[M]\| \le C \cdot \frac{\sqrt{d \log n \,\log(2n/\delta)\, \log(1/\delta)}}{\sqrt{w_{\min}\, n}}$$
where $C$ is a fixed constant and $w_{\min}$ is the minimum mixing weight of any Gaussian in the mixture.

Proof: To prove this lemma, we use a concentration result on the $L_2$ norms of matrices due to [RV07]. We observe that each vector $x_i$ in the scaled space is generated by a Gaussian with some mean $\mu$ and maximum directional variance $\sigma^2$. As the total variance of the mixture along any direction is at most 1, $w_{\min}(\mu^2 + \sigma^2) \le 1$. Therefore, for all samples $x_i$, with probability at least $1-\delta/2$, $\|x_i\| \le \|\mu\| + \sigma\sqrt{d \log(2n/\delta)}$. We condition on the event that $\|x_i\| \le \|\mu\| + \sigma\sqrt{d \log(2n/\delta)}$ for all $i = 1, \ldots, n$; the probability of this event is at least $1-\delta/2$.

Conditioned on this event, the distributions of the vectors $x_i$ are independent. Therefore, we can apply Theorem 3.1 in [RV07] on these conditional distributions to conclude that
$$\Pr[\|M - E[M]\| > t] \le 2 e^{-c n t^2 / (\Lambda^2 \log n)}$$
where $c$ is a constant and $\Lambda$ is an upper bound on the norm of any vector $\|x_i\|$. The lemma follows by plugging in $t = \sqrt{\frac{\Lambda^2 \log(4/\delta)\log n}{cn}}$ and $\Lambda \le 2\sqrt{\frac{d \log(2n/\delta)}{w_{\min}}}$.

Proof (of Lemma 4): For the sake of contradiction, suppose there exists a vector $v^1 \in S^1$ such that the projection of $v^1$ onto the top $k-1$ left singular vectors of $\hat{\Sigma}_{12}$ is equal to $\sqrt{1-\tilde{\tau}^2}\,\|v^1\|$, where $\tilde{\tau} > \tau$. Then there exists some unit vector $u^1$ in $V_1$, in the orthogonal complement of the space spanned by the top $k-1$ left singular vectors of $\hat{\Sigma}_{12}$, such that the projection of $v^1$ onto $u^1$ is equal to $\tilde{\tau}\|v^1\|$. This vector $u^1$ can be written as $u^1 = \tilde{\tau} v^1 + (1-\tilde{\tau}^2)^{1/2} y^1$, where $y^1$ is in the orthogonal complement of $S^1$. From Lemma 2, there exists some vector $u^2$ in $S^2$ such that $(v^1)^\top \Sigma_{12} u^2 \ge \lambda_{\min}$; from Lemma 3, for this vector $u^2$, $(u^1)^\top \Sigma_{12} u^2 \ge \tilde{\tau}\lambda_{\min}$. If
$$n > c \cdot \frac{d}{\tilde{\tau}^2 \lambda_{\min}^2 w_{\min}} \log^2\!\left(\frac{d}{\tilde{\tau} \lambda_{\min} w_{\min}}\right) \log^2\!\left(\frac{1}{\delta}\right),$$
then, from Lemma 6, $(u^1)^\top \hat{\Sigma}_{12} u^2 \ge \frac{\tilde{\tau}}{2}\lambda_{\min}$.

Now, since $u^1$ is in the orthogonal complement of the subspace spanned by the top $k-1$ left singular vectors of $\hat{\Sigma}_{12}$, for any vector $y^2$ in the subspace spanned by the top $k-1$ right singular vectors of $\hat{\Sigma}_{12}$ we have $(u^1)^\top \hat{\Sigma}_{12} y^2 = 0$. This, in turn, means that there exists a vector $z^2 \in V_2$, in the orthogonal complement of the subspace spanned by the top $k-1$ right singular vectors of $\hat{\Sigma}_{12}$, such that $(u^1)^\top \hat{\Sigma}_{12} z^2 \ge \frac{\tilde{\tau}}{2}\lambda_{\min}$. This implies that the $k$-th singular value of $\hat{\Sigma}_{12}$ is at least $\frac{\tilde{\tau}}{2}\lambda_{\min}$. However, from Lemma 6, all except the top $k-1$ singular values of $\hat{\Sigma}_{12}$ are at most $\frac{\tilde{\tau}}{3}\lambda_{\min}$, which is a contradiction.

Proof (of Theorem 1): From Lemma 4, if
$$n > C \cdot \frac{d}{\tau^2 \lambda_{\min}^2 w_{\min}} \log^2\!\left(\frac{d}{\tau \lambda_{\min} w_{\min}}\right) \log^2\!\left(\frac{1}{\delta}\right),$$
then, with probability at least $1-\delta$, the projection of any vector $v$ in $S^1$ or $S^2$ onto the subspace returned by Step 2 of Algorithm 1 has length at least $\sqrt{1-\tau^2}\,\|v\|$. Therefore, the maximum directional variance of any $D_i$ in this subspace is at most $(1-\tau^2)(\sigma^*)^2 + \tau^2\sigma^2$, where $\sigma^2$ is the maximum directional variance of any $D_i$. When $\tau \le \sigma^*/\sigma$, this is at most $2(\sigma^*)^2$. From the isotropic condition, $\sigma \le 1/\sqrt{w_{\min}}$. Therefore, when
$$n > \frac{Cd}{(\sigma^*)^2 \lambda_{\min}^2 w_{\min}^2} \log^2\!\left(\frac{d}{\sigma^* \lambda_{\min} w_{\min}}\right) \log^2\!\left(\frac{1}{\delta}\right),$$
the maximum directional variance of any $D_i$ in the mixture in the space output by Step 2 is at most $2(\sigma^*)^2$.

Since A and B are random partitions of the sample set S, the subspace produced by the action of Step 2 of Algorithm 1 on the set A is independent of B, and vice versa. Therefore, when projected onto the top $k-1$ SVD subspace of $\hat{\Sigma}_{12}(A)$, the samples from B are distributed as a mixture of $(k-1)$-dimensional Gaussians. The theorem follows from the bounds in the previous paragraph and Theorem 1 of [AK05].

5. Experiments

5.1. Audio-visual speaker clustering

In the first set of experiments, we consider clustering either audio or face images of speakers. We use 41 speakers from the VidTIMIT database [San08], speaking 10 sentences (about 20 seconds) each, recorded at 25 frames per second in a studio environment with no significant lighting or pose variation. The audio features are standard 12-dimensional mel cepstra [DM80] and their derivatives and double derivatives, computed every 10ms over a 20ms window and finally concatenated over a window of 440ms centered on the current frame, for a total of 1584 dimensions. The video features are the pixels of the face region extracted from each image (2394 dimensions). We consider the target cluster variable to be the speaker.

We use either CCA or PCA to project the data to a lower dimensionality N. In the case of CCA, we initially project to an intermediate dimensionality M using PCA, to reduce the effects of spurious correlations. For the results reported here, typical values (selected using a held-out set) are N = 40, and M = 100 for images and 1000 for audio. For CCA, we randomize the vectors of one view within each sentence, to reduce correlations between the views due to other latent variables such as the current phoneme. We then cluster either view using k-means into 82 clusters (2 per speaker). To alleviate the problem of local minima found by k-means, each clustering consists of 5 runs of k-means, and the one with the lowest score is taken as the final clustering.

Similarly to [BL08], we measure clustering performance using the conditional entropy of the speaker s given the cluster c, H(s|c). We report the results in terms of conditional perplexity, $2^{H(s|c)}$, which is the mean number of speakers corresponding to each cluster.
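As a concrete sketch of this pipeline (ours, not the authors' experimental code; the dimensionalities follow the typical values quoted above, scikit-learn is assumed available, and `whiten` is from the Section 2 sketch):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def audio_visual_clustering(X_audio, X_video, M_audio=1000, M_video=100, N=40, k=82):
    """Sketch of the Section 5.1 pipeline: per-view PCA to an intermediate dimension,
    CCA across the views, keep the top N canonical directions of the audio view,
    then k-means with 5 restarts."""
    A = PCA(n_components=M_audio).fit_transform(X_audio)
    V = PCA(n_components=M_video).fit_transform(X_video)
    Za, Zv = whiten(A), whiten(V)                      # PCA output is already centered
    U, _, _ = np.linalg.svd(Za.T @ Zv / len(Za))       # CCA via whitened cross-covariance
    proj = Za @ U[:, :N]                               # top-N canonical directions (audio view)
    return KMeans(n_clusters=k, n_init=5).fit_predict(proj)

def conditional_perplexity(speakers, clusters):
    """2**H(speaker | cluster): the mean effective number of speakers per cluster."""
    speakers, clusters = np.asarray(speakers), np.asarray(clusters)
    H = 0.0
    for c in np.unique(clusters):
        s_in_c = speakers[clusters == c]
        _, counts = np.unique(s_in_c, return_counts=True)
        p = counts / counts.sum()
        H += (len(s_in_c) / len(clusters)) * -(p * np.log2(p)).sum()
    return 2.0 ** H
```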

Table 1. Conditional perplexities of the speaker given the cluster, using PCA or CCA bases. "+ occlusion" and "+ translation" indicate that the images are corrupted with occlusion/translation; the audio is unchanged, however.

                         PCA    CCA
  Images                 1.1    1.4
  Audio                 35.3   12.5
  Images + occlusion     6.1    1.4
  Audio + occlusion     35.3   12.5
  Images + translation   3.4    3.4
  Audio + translation   35.3   13.4

Table 1 shows results on the raw data, as well as with synthetic occlusions and translations of the image data. Considering the clean visual environment, we expect PCA to do very well on the image data. Indeed, PCA provides an almost perfect clustering of the raw images, and CCA does not improve on it. However, CCA far outperforms PCA when clustering the more challenging audio view.

When synthetic occlusions or translations are applied to the images, the performance of PCA-based clustering is greatly degraded. CCA is unaffected in the case of occlusion; in the case of translation, CCA-based image clustering is degraded similarly to PCA, but audio clustering is almost unaffected. In other words, even when the image data are degraded, CCA is able to recover a good clustering in at least one of the views. (The audio task is unusually challenging, as each feature vector corresponds to only a few phonemes, whereas a typical speaker classification setting uses entire sentences. If we force the cluster identity to be constant over each sentence, taking the most frequent cluster label in the sentence, performance improves greatly; e.g., in the audio + occlusion case, the perplexity improves to 8.5 for PCA and 2.1 for CCA.) For a more detailed look at the clustering behavior, Figures 1(a-d) show the distributions of clusters for each speaker.

5.2. Clustering Wikipedia articles

Next we consider the task of clustering Wikipedia articles, based on either their text or their incoming and outgoing links. The link structure L is represented as a concatenation of "to" and "from" link incidence vectors, where each element L(i) is the number of times the current article links to/from article i. The article text is represented as a bag-of-words feature vector, i.e. the raw count of each word in the article. A lexicon of about 8 million words and a list of about 12 million articles were used to construct the two feature vectors. Since the dimensionality of the feature vectors is very high (over 20 million for the link view), we use random projection to reduce the dimensionality to a computationally manageable level.

We present clustering experiments on a subset of Wikipedia consisting of 128,327 articles. We use either PCA or CCA to reduce the feature vectors to the final dimensionality, followed by clustering. In these experiments, we use a hierarchical clustering procedure, as a flat clustering is poor with either PCA or CCA (CCA still usually outperforms PCA, however). In the hierarchical procedure, all points are initially considered to be in a single cluster. Next, we iteratively pick the largest cluster, reduce the dimensionality using PCA or CCA on the points in this cluster, and use k-means to break the cluster into smaller sub-clusters (for some fixed k), until we reach the total desired number of clusters. The intuition for this is that different clusters may have different natural subspaces.

As before, we evaluate the clustering using the conditional perplexity of the article category a (as given by Wikipedia) given the cluster c, $2^{H(a|c)}$. For each article we use the first category listed in the article. The 128,327 articles include roughly 15,000 categories, of which we use the 500 most frequent ones, which cover 73,145 articles. While the clustering is performed on all 128,327 articles, the reported entropies are for the 73,145 covered articles. Each sub-clustering consists of 10 runs of k-means, and the one with the lowest k-means score is taken as the final cluster assignment.

Figure 1(e) shows the conditional perplexity versus the number of clusters for PCA-based and CCA-based hierarchical clustering. For any number of clusters, CCA produces better clusterings, i.e. ones with lower perplexity. In addition, the tree structures of the PCA- and CCA-based clusterings are qualitatively different. With PCA-based clustering, most points are assigned to a few large clusters, with the remaining clusters being very small. CCA-based hierarchical clustering produces more balanced clusters. To see this, in Figure 1(f) we show the perplexity of the cluster distribution versus the number of clusters. For about 25 or more clusters, the CCA-based clustering has higher perplexity, indicating a more uniform distribution of clusters.
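A sketch of the hierarchical procedure just described (ours, not the authors' code; `whiten` is from the Section 2 sketch, and the split factor k and subspace dimension N are illustrative defaults rather than values from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_cca_clustering(X1, X2, total_clusters, k=2, N=40):
    """Repeatedly split the largest cluster, re-estimating a CCA subspace
    from that cluster's points alone before each k-means split."""
    clusters = [np.arange(len(X1))]                       # start: one cluster of all points
    while len(clusters) < total_clusters:
        clusters.sort(key=len)
        idx = clusters.pop()                              # the largest remaining cluster
        Z1 = whiten(X1[idx] - X1[idx].mean(axis=0))
        Z2 = whiten(X2[idx] - X2[idx].mean(axis=0))
        U, _, _ = np.linalg.svd(Z1.T @ Z2 / len(idx))     # CCA on this cluster only
        proj = Z1 @ U[:, :N]
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(proj)   # 10 runs, best kept
        clusters.extend(idx[labels == j] for j in range(k))
    return clusters
```

The PCA baseline reported above is obtained by swapping the per-cluster projection step for a PCA of the corresponding view.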

[Figure 1 appears here. Panels: (a) AV: Audio, PCA basis; (b) AV: Audio, CCA basis; (c) AV: Images + occlusion, PCA basis; (d) AV: Images + occlusion, CCA basis; (e) Wikipedia: Category perplexity; (f) Wikipedia: Cluster perplexity.]

Figure 1. (a-d) Distributions of cluster assignments per speaker in audio-visual experiments. The color of each cell (s, c) corresponds to the empirical probability p(c|s) (darker = higher). (e-f) Wikipedia experiments: (e) conditional perplexity of article category given cluster ($2^{H(a|c)}$); (f) perplexity of the cluster distribution ($2^{H(c)}$).

References

[AK05] S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability, 15(1A):69–92, 2005.

[AM05] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In COLT, pages 458–469, 2005.

[AZ07] R. Kubota Ando and T. Zhang. Two-view feature generation model for semi-supervised learning. In ICML, pages 25–32, 2007.

[BL08] M. B. Blaschko and C. H. Lampert. Correlational spectral clustering. In CVPR, 2008.

[BM98] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.

[BV08] S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In FOCS, pages 551–560, 2008.

[CR08] K. Chaudhuri and S. Rao. Learning mixtures of distributions using correlations and independence. In COLT, pages 9–20, 2008.

[Das99] S. Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634–644, 1999.

[DE04] G. Dunn and B. Everitt. An Introduction to Mathematical Taxonomy. Dover Books, 2004.

[DM80] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980.

[DS00] S. Dasgupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In UAI, pages 152–159, 2000.

[KF07] S. M. Kakade and D. P. Foster. Multi-view regression via canonical correlation analysis. In COLT, pages 82–96, 2007.

[KSV05] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In COLT, pages 444–457, 2005.

[RV07] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 2007.

[San08] C. Sanderson. Biometric Person Recognition: Face, Speech and Fusion. VDM-Verlag, 2008.

[VW02] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In FOCS, pages 113–123, 2002.