Multiview Clustering via Canonical Correlation Analysis
Karen Livescu, Sham Kakade, Karthik Sridharan Toyota Technological Institute at Chicago
Kamalika Chaudhuri University of California at San Diego

Motivation
Clustering is a stand-alone task or pre-processing step in:
- Data analysis
- Vector quantization (for compression, speech recognition, ...)
- Density estimation
- Information retrieval
- Semi-supervised learning for classification/regression
Clustering in high dimensions is "hard", so data may be projected to a lower-dimensional space via PCA or random projection:
- Does not differentiate between noise and signal dimensions
- PCA behavior depends on the coordinate system
- Theoretical guarantees depend on stringent separation requirements

Motivation (2)
Can take advantage of multiple views to find meaningful dimensions (e.g. audio + video, text and link structure, images + captions, ...)
Given a data set of paired vectors $\{(x_1^{(1)}, x_1^{(2)}), \ldots, (x_n^{(1)}, x_n^{(2)})\}$, can think of each $(x_i^{(1)}, x_i^{(2)})$ as a sample from the same cluster (class), plus (high-dimensional) additive noise
We think of the two views as independent given the hidden class
If noise is independent in the two views (e.g. audio noise vs. video lighting), then the correlated dimensions are related to the hidden class
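The generative picture above can be made concrete with a small simulation. This is a minimal sketch, assuming Gaussian cluster means and isotropic Gaussian noise; the function name `sample_two_views` and all parameter names and values are hypothetical and do not come from the slides.

```python
import numpy as np

def sample_two_views(n, k, d1, d2, mean_scale=3.0, noise_std=1.0, seed=0):
    """Draw n paired samples (x1_i, x2_i) sharing one hidden cluster label each.

    All parameter names and default values are hypothetical.
    """
    rng = np.random.default_rng(seed)
    # One mean vector per cluster in each view.
    means1 = mean_scale * rng.standard_normal((k, d1))
    means2 = mean_scale * rng.standard_normal((k, d2))
    # The hidden class is drawn once per pair and shared by both views.
    labels = rng.integers(0, k, size=n)
    # Additive noise is drawn independently for each view.
    noise1 = noise_std * rng.standard_normal((n, d1))
    noise2 = noise_std * rng.standard_normal((n, d2))
    return means1[labels] + noise1, means2[labels] + noise2, labels

x1, x2, labels = sample_two_views(n=500, k=3, d1=20, d2=30)
print(x1.shape, x2.shape, labels.shape)
```

Because the two noise terms are drawn separately, any correlation between `x1` and `x2` in this sketch comes only from the shared `labels`.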
[Figure: cluster, view 1, view 2]

Algorithm
CCA finds directions $w_i^{(1)}$ and $w_i^{(2)}$ that maximize the correlations between the projections of $X^{(1)} = \{x_i^{(1)}\}$ and $X^{(2)} = \{x_i^{(2)}\}$

The first pair of directions is
$$(w_1^{(1)}, w_1^{(2)}) = \arg\max_{w^{(1)}, w^{(2)}} \mathrm{corr}\big(w^{(1)\top} x^{(1)},\ w^{(2)\top} x^{(2)}\big)$$
Subsequent direction vectors maximize the correlation subject to orthogonality with the previous vectors
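For reference, the correlation objective can be written out with covariance matrices, which also gives the usual closed-form solution. The symbols $\Sigma_{11} = \mathrm{cov}(x^{(1)})$, $\Sigma_{22} = \mathrm{cov}(x^{(2)})$, $\Sigma_{12} = \mathrm{cov}(x^{(1)}, x^{(2)})$, $T$, $U$, $S$, $V$ and $\rho_i$ are notation introduced here, not taken from the slides, and $\Sigma_{11}$, $\Sigma_{22}$ are assumed invertible.

```latex
\mathrm{corr}\big(w^{(1)\top} x^{(1)},\, w^{(2)\top} x^{(2)}\big)
  = \frac{w^{(1)\top} \Sigma_{12}\, w^{(2)}}
         {\sqrt{w^{(1)\top} \Sigma_{11}\, w^{(1)}}\ \sqrt{w^{(2)\top} \Sigma_{22}\, w^{(2)}}}

% Whitened cross-covariance and its singular value decomposition:
T = \Sigma_{11}^{-1/2}\, \Sigma_{12}\, \Sigma_{22}^{-1/2} = U S V^{\top}

% i-th pair of directions and i-th canonical correlation:
w_i^{(1)} = \Sigma_{11}^{-1/2} u_i, \qquad
w_i^{(2)} = \Sigma_{22}^{-1/2} v_i, \qquad
\rho_i = s_i
```

In this standard formulation, successive projections within a view are uncorrelated, i.e. $w_i^{(1)\top} \Sigma_{11} w_j^{(1)} = 0$ for $i \neq j$, which is one common reading of the orthogonality constraint.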
Algorithm: Given paired vectors $\{(x_1^{(1)}, x_1^{(2)}), \ldots, (x_n^{(1)}, x_n^{(2)})\}$, with $X^{(1)} = \{x_i^{(1)}\}$ and $X^{(2)} = \{x_i^{(2)}\}$:
1. Find the top k CCA directions $W^{(1)} = \{w_{1:k}^{(1)}\}$, $W^{(2)} = \{w_{1:k}^{(2)}\}$
2. Project samples: $X_p^{(1)} = W^{(1)} X^{(1)}$, $X_p^{(2)} = W^{(2)} X^{(2)}$
3. Cluster $X_p^{(1)}$ or $X_p^{(2)}$, e.g. via k-means

Theoretical results [Chaudhuri and Kakade 2008]
Assumptions:
1. Uncorrelated views conditioned on the class
2. Nondegeneracy: $\mathrm{cov}(X^{(1)}, X^{(2)})$ has rank equal to the number of clusters k
Theorem: Suppose the source distribution is a mixture of k Gaussians and Assumptions 1 and 2 hold. If in at least one view v ∈ {1,2},
$$\|\mu_v^{(l)} - \mu_v^{(l')}\| > C \sigma^* k^{1/4} \big(\log(kn/\delta)\big)^{1/2}$$
for every pair of distinct clusters $l \neq l'$, where $\sigma^*$ is the maximum standard deviation in the subspace containing the means, then with probability $1 - \delta$ the algorithm correctly classifies all examples in view v given a data set of size $O^*(d/\lambda_{\min}^2)$
Theorem: If the source distribution is a mixture of log-concave distributions, then the separation requirement is
$$\|\mu_v^{(l)} - \mu_v^{(l')}\| > C \sigma^* k^{1/2} \log(kn/\delta)$$
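To get a feel for how the two separation requirements scale, the sketch below evaluates only the parts that depend on k, n and δ. The constant C and σ* are left out (C need not be the same in both statements), and the values of `k`, `n` and `delta` are hypothetical, so only the relative growth is meaningful.

```python
import math

def gaussian_factor(k, n, delta):
    """k^(1/4) * (log(k n / delta))^(1/2): Gaussian-mixture requirement without C and sigma*."""
    return k ** 0.25 * math.sqrt(math.log(k * n / delta))

def log_concave_factor(k, n, delta):
    """k^(1/2) * log(k n / delta): log-concave requirement without C and sigma*."""
    return math.sqrt(k) * math.log(k * n / delta)

# Hypothetical problem sizes, chosen only for illustration.
k, n, delta = 10, 10_000, 0.05
g = gaussian_factor(k, n, delta)
lc = log_concave_factor(k, n, delta)
print(f"Gaussian: {g:.2f}  log-concave: {lc:.2f}  ratio: {lc / g:.2f}")
```

Up to constants, the log-concave factor is the square of the Gaussian one, so the requirement is noticeably stronger once k or n is large.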
In the case of PCA, the separation requirement depends on $\sigma_d^*$, the maximum deviation along any dimension

Experiment 1: Audio-visual clustering of speakers
Data: VidTIMIT
- 41 speakers, speaking 10 utterances each
- Audio + face video recorded in studio environment with no significant lighting/pose variation
- 25 image frames per second

Audio features: Spectra computed over 40 ms frames (1501 dimensions)
Image features: Pixels of face region (2394 dimensions)

CCA/PCA to 45 dimensions
K-means clustering into 82 clusters (2 per speaker)
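The CCA-then-k-means pipeline used here might be sketched as follows. This is an illustrative sketch, not the authors' code: it runs on small synthetic stand-in data rather than VidTIMIT features, the dimensions are scaled down, and the helper names (`cca_directions`, `kmeans`), the ridge term `reg`, and the plain Lloyd's-iteration k-means are all assumptions.

```python
import numpy as np

def cca_directions(X1, X2, n_dims, reg=1e-6):
    """Top n_dims CCA directions for centered data X1 (n x d1), X2 (n x d2).

    reg is a small ridge term, an assumption added for numerical stability.
    """
    n = X1.shape[0]
    S11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / n

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive definite matrix.
        vals, vecs = np.linalg.eigh(S)
        return (vecs / np.sqrt(vals)) @ vecs.T

    A, B = inv_sqrt(S11), inv_sqrt(S22)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(A @ S12 @ B)
    return A @ U[:, :n_dims], B @ Vt[:n_dims].T, s[:n_dims]

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's iterations with random initial centers; returns cluster indices."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist2.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

# Synthetic stand-in for the two views (hypothetical sizes, not the VidTIMIT setup).
rng = np.random.default_rng(0)
n, n_clusters, d1, d2, n_dims = 600, 3, 20, 30, 2
labels = rng.integers(0, n_clusters, size=n)
X1 = 3.0 * rng.standard_normal((n_clusters, d1))[labels] + rng.standard_normal((n, d1))
X2 = 3.0 * rng.standard_normal((n_clusters, d2))[labels] + rng.standard_normal((n, d2))
X1c, X2c = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)

W1, W2, corrs = cca_directions(X1c, X2c, n_dims)
Xp1 = X1c @ W1                      # view 1 projected onto its CCA directions
assign = kmeans(Xp1, n_clusters)    # cluster in the projected space
print("canonical correlations:", np.round(corrs, 3))
```

Swapping `cca_directions` for a PCA of one view alone would give the baseline that the experiments compare against; only the choice of projection changes, not the clustering step.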
Note: Many clusterings possible (speakers, phonemes, clothing, ...)
- Here we consider the "target" clusters to be speakers
- Images expected to cluster well
- Audio not expected to cluster well

Experiment 1 selected results: Image clustering
                            PCA     CCA
H(sp|clus)                  0.27    0.35
perplexity 2^H(sp|clus)     1.21    1.27

Experiment 1 selected results: Image clustering with occlusions
PCA results are greatly degraded; CCA results are not
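The two quantities in these results tables, the conditional entropy H(sp|clus) of the speaker given the cluster and the perplexity 2^H(sp|clus), can be computed from paired label lists. The sketch below assumes the standard empirical (plug-in) estimate in bits; the slides do not say exactly how the numbers were estimated, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter

def conditional_entropy(speakers, clusters):
    """Empirical H(speaker | cluster) in bits from two equal-length label lists."""
    n = len(speakers)
    joint = Counter(zip(clusters, speakers))
    cluster_sizes = Counter(clusters)
    h = 0.0
    for (c, _), count in joint.items():
        # p(cluster, speaker) * log2 p(speaker | cluster)
        h -= (count / n) * math.log2(count / cluster_sizes[c])
    return h

def perplexity(h_bits):
    """Perplexity 2^H for an entropy given in bits."""
    return 2.0 ** h_bits

# Toy labels: clusters 0 and 1 are pure, cluster 2 mixes two speakers.
speakers = ["a", "a", "b", "b", "b", "c"]
clusters = [0, 0, 1, 1, 2, 2]
h = conditional_entropy(speakers, clusters)
print(h, perplexity(h))
```

A perplexity of 1 means every cluster contains a single speaker; larger values can be read as the effective number of speakers mixed within a typical cluster.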
                            PCA     CCA
H(sp|clus)                  2.72    0.33
perplexity 2^H(sp|clus)     6.59    1.26

Experiment 1 selected results: Image clustering with translations
PCA results are greatly degraded; CCA results are not
                            PCA     CCA
H(sp|clus)                  1.77    0.83
perplexity 2^H(sp|clus)     3.41    1.77

Experiment 1 selected results: Audio clustering
Consistent 0.2-0.3 bit reduction in conditional entropy from PCA to CCA across experiments; currently investigating additional audio features
                            PCA     CCA
H(sp|clus)                  4.98    4.76
perplexity 2^H(sp|clus)     31.6    27.1

Experiment 2: Clustering Wikipedia articles
Data: 128,000 Wikipedia pages
Text features: Count of each word in the article (~8 million dimensions)
Link features: Number of links from/to each article (~12 million dimensions)
Initial dimensionality reduction to 1000 dimensions using random projections
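A random projection of this kind could look like the sketch below. The slides do not say which random projection was used, so a dense Gaussian matrix is assumed here; the function name `random_projection` and the toy sizes are hypothetical, and the real word and link count matrices would be sparse and far larger.

```python
import numpy as np

def random_projection(X, target_dim, seed=0):
    """Project the rows of X (n x d) to target_dim dimensions with a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    # Entries are N(0, 1/target_dim), so squared norms are preserved in expectation.
    R = rng.standard_normal((X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R

# Toy data standing in for the high-dimensional count features.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 500))
X_low = random_projection(X, target_dim=100)
ratio = (X_low ** 2).sum(axis=1) / (X ** 2).sum(axis=1)
print(X_low.shape, float(ratio.mean()))
```

The projection matrix is drawn without looking at the data, which is why a further data-dependent step such as CCA or PCA is still applied afterwards.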
Projection via CCA/PCA to 20 dimensions
Used a hierarchical clustering procedure:
1. Perform CCA/PCA and find N clusters using k-means
2. For each cluster in (1) larger than a threshold, perform CCA/PCA on data in this cluster, and cluster into N smaller clusters

Experiment 2: Example CCA-space clusters
Creationism; War; Hanoi Hilton; USS Florence Nightingale; Roswell, SD; Natural number; Afrocentrism; Hornet; Battle of Texaco; Poltergeist; LaFayette, KY; Old Linear subspace; Wahhabism; Harvey Peleliu; House Isadora Duncan; John Ripley, IL; Hainesville, Cauchy distribution; Milk; Saunders Lewis; (astrology); Hua Allan Muhammad; IL; Belleville, WI; Wrench; List of Jean‐Jacques Mulan; 1972 in 1935 in music; 1934 in Bethany, IL; South matrices; Rousseau; William sports; Moselle; Duke music; 1939 in music; Point, Ohio; Hydrogenation; Alpha Joice; Opera; Idi Amin; of York; Spam (Monty 1921 in music; Star Ashwaubenon, WI; helix; Heat pump; John Donne; History Python); History of Trek; 1776 (musical); North Miami, FL; Soil; Bicarbonate; of the telescope; Vanuatu; Liliopsida; The Godfather Part II; Davenport, IA; Campfire; Thermal Asceticism; Afterlife; 1908 Summer Dial M for Murder; Spartanburg, SC; depolymerization; Information; Satire; Olympics; Lucy Liu; Jimmy Durante; 2006 Hurricane Lili; Cardiac arrest; Notary public; Celia Cruz; National Commonwealth Scranton, PA; Myopia; Neuron; Transformer; Romantic nationalism; Cartoonists Society; Games; 1968 in Influenza; Motion Hepatocellular Thomas Aquinas; List Balearic Islands; SCSI; music; 1977 in music; picture rating system; carcinoma; Yellow of Estonians; List of Asexuality; Copyright Comet Hale‐Bopp; FEMA; Vineland, NJ; fever; Isopropyl fantasy authors; List infringement of High Speech Rail; Fair Lawn, NJ; alcohol; Gorilla; Make of Germans; Vatican software; Lamb of Steve Wozniak; James Medford, NJ; (software); Ostinato; City; Torah; Old God; Calendar date; Carville; David Palisades Park, NJ; Control flow; PHP; Catholic Church; Lapland War; Copperfield; Lynyrd South River, NJ; HTML; Objective‐C; Apostles’ Creed; Totalitarianism; Skynyrd; Poison Gillespie County, TX; Vienna Development Mahmoud Abbas; History of Sweden; (band); LL Cool J; RZA; Hidalgo County, NM; Method; Comparison Boris Yeltsin; Ulysses European Youth 
Stand; Bob Dole; Finney County, KS; of Java C++; Dot S. Grant; William Parliament; Roman Sergei Prokofiev; Onslow County; NC; product; Elliptic Jennings Bryan; Pipe mythology; Kingdom Peter Hain; Ravana; Brady Township, PA; integral; Catalan organ; George IV of of Israel; List of Michael Porillo; Argentine Township, number; Expected the UK; George V of railway companies; George Pickett MI; Augusta Gharter value; Coenzyme; the UK; Emperor Maundy money Township, MI Neutron; History of Taizong of Tang; geodesy; Orbital David I of Scotland period

Experiment 2: Example PCA-space clusters
Hugh the Great; Eddy Meeme, WI; West Lisp (programming Cary Grant; Hormel; List of abbreviations Duchin; Spoilt Point, WI; Wesley, language); Species; Wilson Pickett; in the CIA world fact Bastard; Irish ME; Charlotte, VT; Alkane; Hindu Microsoft Windows; book; List of rural presidential election, Glen Ridge, FL; calendar; Sexual Louis Kahn; Sugar districts of Germany; 1990; Hadley, NY; Geraldine, MA; harassment; Ray Leonard; Bilge; Orthoptera; Bill Medicine Park, OK; Taylorsville, MI; Ovid, Symmetry; Book of UNIVAC; Nat King Fitch; Muirne; Council Grove, KS; Colorado; Conway, Job; Marginalism; Cole; Supergrass; Ministry; 1770 in Perryville, KY; Albert ND; Grano, ND; Creativity; Rumi; Whoopi Goldberg; literature; John Bell of Sweden; Onamia, MN; Assam; Classical Baby boomer; Magi; Williams; Brooke Ferdinand I of Geneva, IA; Margaret liberalism; NATO; Received Burke; Yeardley Romania; Luigi Avison; Bodomi; Republic of Pronunciation; Smith; Anastasios II Boccherini; Daniel Uncle Scrooge Macedonia; Action Common Pheasant; (emperor); Mary Bernoulli; Market Adventure; 14; Novel; Lombards; Samuel Mudd; River National Park; research; Gram; Demographics of Wars of the Roses; Charlie Brown; Super List of diseases (S); Metastability; Saint Helena; William History of Bolivia; 8 mm film; Loki; Lake Township, MN; Corticosteroid; Prout; Union Hill, IL; Thabo Mbeki; W. H. 
Equus (play); Second List of Acer species; Misdemeanor; 19 Waldo, OH; East Auden; Apollo; Battle of Fort Fisher; Yucca; List of oboists; (number); Kent Nassau, NY; Gustav Mahler; Romanos I; IBM List of male film Brockman; Bourne Rochester, OH; Mill James Joyce; New Personal System/2; actors; Sparidae; 58 shell; So I Married an Hall, PA; Attu Station, Mexico; Islamic art; Action figure; BC; Xanthine; 19; Axe Murderer; Peyo; AL; Roxbury, NH; Nevada; Metro Albertus Magnus; Pyrrhic; 100 Lipschitz continuity; Lone Rock, WI; Baca Manila; Mount Roland Freisler; gigametres; 177 BC; Equations of Motion; County, CO; Athos; Avignon; Giacomo Puccini; Businessperson; GNU Conjugate transpose; Comanche County, Bangalore; Kolkata; Frederick William II compiler for JAVA; Square number; OK; Polk County, GA Jabba the Hutt; of Prussia; Wiki; Strawberry Field; NJ Datamax UV-1; Andrea Dworkin; Borzoi; Dynamic Host Route 63; Tribes of Galway; Configuration Cucurbitales; Rita Brethren; Iwi Protocol; Yinglish Johnston

Conclusions
In the right conditions, CCA outperforms PCA as a dimensionality reduction technique before k‐means clustering
Ongoing work: Application to semi‐supervised learning for speaker and speech recognition, analysis of hierarchical clustering