Multiview Clustering via Canonical Correlation Analysis

Karen Livescu, Sham Kakade, Karthik Sridharan
Toyota Technological Institute at Chicago

Kamalika Chaudhuri
University of California at San Diego

Motivation

Clustering is a stand-alone task or pre-processing step in
• Data analysis
• Vector quantization (for compression, speech recognition, ...)
• Density estimation
• Information retrieval
• Semi-supervised learning for classification/regression

Clustering in high dimensions is "hard", so data may be projected to a lower-dimensional space via PCA or random projection
• Does not differentiate between noise and signal dimensions
• PCA behavior depends on the coordinate system
• Theoretical guarantees depend on stringent separation requirements

Motivation (2)

• Can take advantage of multiple views to find meaningful dimensions (e.g. audio + video, text and link structure, images + captions, ...)

• Given a data set of paired vectors $\{(x_1^{(1)}, x_1^{(2)}), \ldots, (x_n^{(1)}, x_n^{(2)})\}$, can think of each $(x_i^{(1)}, x_i^{(2)})$ as a sample from the same cluster (class), plus (high-dimensional) additive noise

• We think of the two views as independent given the hidden class

• If noise is independent in the two views (e.g. audio noise vs. video lighting), then the correlated dimensions are related to the hidden class

[Figure: diagram relating the hidden cluster to view 1 and view 2]

Algorithm

CCA finds directions $w_i^{(1)}$ and $w_i^{(2)}$ that maximize the correlations between the projections of $X^{(1)} = \{x_i^{(1)}\}$ and $X^{(2)} = \{x_i^{(2)}\}$

• The first pair of directions is

$$(w_1^{(1)}, w_1^{(2)}) = \arg\max_{w^{(1)},\, w^{(2)}} \operatorname{corr}\left(w^{(1)\top} x^{(1)},\; w^{(2)\top} x^{(2)}\right)$$

• Subsequent direction vectors maximize the correlation subject to orthogonality with the previous vectors
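As a rough numerical illustration of the CCA step described above, the sketch below computes the top k pairs of directions by whitening each view and taking an SVD of the whitened cross-covariance. This is one standard way to solve CCA and is not necessarily what was used for the experiments in these slides. The function name `cca_directions`, the small ridge term `reg`, and the synthetic data are assumptions made for the example. In this formulation, each later pair is constrained so that its projections are uncorrelated with those of the earlier pairs.

```python
import numpy as np


def cca_directions(X1, X2, k, reg=1e-6):
    """Top-k CCA directions (as columns) and canonical correlations.

    X1: (n, d1) and X2: (n, d2) arrays whose rows are paired samples.
    reg: small ridge term (an assumption) that keeps covariances invertible.
    """
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    C11 = X1.T @ X1 / (n - 1) + reg * np.eye(X1.shape[1])
    C22 = X2.T @ X2 / (n - 1) + reg * np.eye(X2.shape[1])
    C12 = X1.T @ X2 / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    K1, K2 = inv_sqrt(C11), inv_sqrt(C22)
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(K1 @ C12 @ K2)
    W1 = K1 @ U[:, :k]
    W2 = K2 @ Vt.T[:, :k]
    return W1, W2, s[:k]


# Hypothetical two-view data: a shared 2-D signal plus independent noise per view.
rng = np.random.default_rng(0)
z = rng.standard_normal((500, 2))
X1 = z @ rng.standard_normal((2, 5)) + 0.5 * rng.standard_normal((500, 5))
X2 = z @ rng.standard_normal((2, 4)) + 0.5 * rng.standard_normal((500, 4))
W1, W2, corrs = cca_directions(X1, X2, k=2)
X1p = (X1 - X1.mean(axis=0)) @ W1  # projected view 1, shape (500, 2)
```

Here samples are stored as rows, so projection is written as a right-multiplication by the direction matrix.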

Given paired vectors $\{(x_1^{(1)}, x_1^{(2)}), \ldots, (x_n^{(1)}, x_n^{(2)})\}$, with $X^{(1)} = \{x_i^{(1)}\}$ and $X^{(2)} = \{x_i^{(2)}\}$:

1. Find the top k CCA directions $W^{(1)} = \{w^{(1)}_{1:k}\}$, $W^{(2)} = \{w^{(2)}_{1:k}\}$

2. Project samples: $X_p^{(1)} = W^{(1)} X^{(1)}$, $X_p^{(2)} = W^{(2)} X^{(2)}$

3. Cluster $X_p^{(1)}$ or $X_p^{(2)}$, e.g. via k-means

Theoretical results [Chaudhuri and Kakade 2008]

Assumptions:

1. Uncorrelated views conditioned on the class

2. Nondegeneracy: $\operatorname{cov}(X^{(1)}, X^{(2)})$ has rank equal to the number of clusters k

• Theorem: Suppose the source distribution is a mixture of k Gaussians and Assumptions 1 and 2 hold. If in at least one view $v \in \{1,2\}$,

$$\|\mu_v^{(l)} - \mu_v^{(l')}\| > C\,\sigma^* k^{1/4} \left(\log(kn/\delta)\right)^{1/2}$$

for distinct clusters $l \neq l'$, where $\sigma^*$ is the maximum standard deviation in the subspace containing the means, then with probability $1-\delta$ the algorithm correctly classifies all examples in view v given a data set of size $O^*(d/\lambda_{\min}^2)$

• Theorem: If the source distribution is a mixture of log-concave distributions, then the separation requirement is

$$\|\mu_v^{(l)} - \mu_v^{(l')}\| > C\,\sigma^* k^{1/2} \log(kn/\delta)$$

• In the case of PCA, the separation requirement depends on $\sigma_d^*$, the maximum deviation along any dimension

Experiment 1: Audio-visual clustering of speakers

Data: VidTIMIT
• 41 speakers, speaking 10 utterances each
• Audio + face video recorded in studio environment with no significant lighting/pose variation
• 25 image frames per second

Audio features: Spectra computed over 40 ms frames (1501 dimensions)
Image features: Pixels of face region (2394 dimensions)

CCA/PCA to 45 dimensions
K-means clustering into 82 clusters (2 per speaker)
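The clustering step applied to the projected samples can be sketched as plain Lloyd's k-means. The slides do not say which k-means variant or initialization was used, so the farthest-point initialization, the function name `kmeans`, and the parameters `n_iter` and `seed` below are assumptions for illustration only.

```python
import numpy as np


def kmeans(Xp, n_clusters, n_iter=100, seed=0):
    """Lloyd's k-means on already-projected samples Xp of shape (n, k)."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialization (an assumption): start from a random
    # sample, then repeatedly add the sample farthest from chosen centers.
    centers = [Xp[rng.integers(len(Xp))]]
    for _ in range(1, n_clusters):
        d2 = np.min([((Xp - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Xp[d2.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign each sample to its nearest center (squared Euclidean distance).
        d2 = ((Xp[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(n_clusters):
            members = Xp[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

In the setting of this experiment, `Xp` would be the 45-dimensional CCA or PCA projection and `n_clusters` would be 82.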

Note:
• Many clusterings possible (speakers, phonemes, clothing, ...)
• Here we consider the "target" clusters to be speakers
• Images expected to cluster well
• Audio not expected to cluster well

Experiment 1 selected results: Image clustering
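The results for Experiment 1 are reported as the conditional entropy of the speaker given the cluster, H(sp|clus), in bits, together with the perplexity 2^H(sp|clus). A minimal sketch of how these two quantities can be estimated from empirical counts is given here; the function names and the toy labels are hypothetical, and the slides do not describe the exact estimator used.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy_bits(speakers, clusters):
    """Empirical H(speaker | cluster) in bits from paired label sequences."""
    by_cluster = defaultdict(Counter)
    for s, c in zip(speakers, clusters):
        by_cluster[c][s] += 1
    n = len(speakers)
    h = 0.0
    for counts in by_cluster.values():
        n_c = sum(counts.values())
        for m in counts.values():
            # p(cluster) * p(speaker | cluster) * -log2 p(speaker | cluster)
            h -= (n_c / n) * (m / n_c) * math.log2(m / n_c)
    return h


def perplexity(h_bits):
    """Perplexity 2^H, given a conditional entropy in bits."""
    return 2.0 ** h_bits


# Toy labels (hypothetical): cluster 0 is pure, cluster 1 mixes two speakers.
speakers = ["a", "a", "b", "c"]
clusters = [0, 0, 1, 1]
h = conditional_entropy_bits(speakers, clusters)  # 0.5 bits
```

Lower is better for both quantities: a clustering in which every cluster contains a single speaker gives H(sp|clus) = 0 and perplexity 1.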

                           PCA    CCA
H(sp|clus)                 0.27   0.35
perplexity 2^H(sp|clus)    1.21   1.27

Exp't 1 selected results: Image clustering with occlusions

PCA results are greatly degraded; CCA results are not

                           PCA    CCA
H(sp|clus)                 2.72   0.33
perplexity 2^H(sp|clus)    6.59   1.26

Exp't 1 selected results: Image clustering with translations

PCA results are greatly degraded; CCA results are not

                           PCA    CCA
H(sp|clus)                 1.77   0.83
perplexity 2^H(sp|clus)    3.41   1.77

Exp't 1 selected results: Audio clustering

Consistent 0.2-0.3 bit reduction in conditional entropy from PCA to CCA across experiments; currently investigating additional audio features

                           PCA    CCA
H(sp|clus)                 4.98   4.76
perplexity 2^H(sp|clus)    31.6   27.1

Experiment 2: Clustering Wikipedia articles

Data: 128,000 Wikipedia pages

Text features: Count of each word in the article (~8 million dimensions)

Link features: Number of links from/to each article (~12 million dimensions)

Initial dimensionality reduction to 1000 dimensions using random projections
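The initial random-projection step can be sketched as multiplication by a random matrix. The slides do not state which kind of random projection was used, so the dense Gaussian matrix, the function name `random_projection`, and the small dimensions below are assumptions; the actual data has millions of sparse dimensions and would call for a sparse implementation.

```python
import numpy as np


def random_projection(X, out_dim, seed=0):
    """Project rows of X (n, d) to out_dim dimensions with a Gaussian matrix.

    Entries are scaled by 1/sqrt(out_dim) so that squared norms are
    preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], out_dim)) / np.sqrt(out_dim)
    return X @ R


# Hypothetical small example: 20 samples, 2000 dimensions reduced to 500.
X = np.random.default_rng(1).standard_normal((20, 2000))
Y = random_projection(X, out_dim=500)
```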

Projection via CCA/PCA to 20 dimensions

Used a hierarchical clustering procedure:
1. Perform CCA/PCA and find N clusters using k-means
2. For each cluster in (1) larger than a threshold, perform CCA/PCA on data in this cluster, and cluster into N smaller clusters

Experiment 2: Example CCA-space clusters

Creationism; War; Hanoi Hilton; USS Florence Nightingale; Roswell, SD; Natural number; Afrocentrism; Hornet; Battle of ; Poltergeist; LaFayette, KY; Old Linear subspace; Wahhabism; Harvey Peleliu; House ; John Ripley, IL; Hainesville, Cauchy distribution; Milk; Saunders Lewis; (astrology); Hua Allan Muhammad; IL; Belleville, WI; Wrench; List of Jean‐Jacques Mulan; 1972 in 1935 in music; 1934 in Bethany, IL; South matrices; Rousseau; William sports; Moselle; Duke music; 1939 in music; Point, Ohio; Hydrogenation; Alpha Joice; Opera; Idi Amin; of York; Spam (Monty 1921 in music; Star Ashwaubenon, WI; helix; Heat pump; John Donne; History Python); History of Trek; 1776 (musical); North , FL; Soil; Bicarbonate; of the telescope; Vanuatu; Liliopsida; The Godfather Part II; Davenport, IA; Campfire; Thermal Asceticism; Afterlife; 1908 Summer Dial M for Murder; Spartanburg, SC; depolymerization; Information; Satire; Olympics; Lucy Liu; Jimmy Durante; 2006 Hurricane Lili; Cardiac arrest; Notary public; Celia Cruz; National Commonwealth Scranton, PA; Myopia; Neuron; Transformer; Romantic nationalism; Cartoonists Society; Games; 1968 in Influenza; Motion Hepatocellular Thomas Aquinas; List Balearic Islands; SCSI; music; 1977 in music; picture rating system; carcinoma; Yellow of Estonians; List of Asexuality; Copyright Comet Hale‐Bopp; FEMA; Vineland, NJ; fever; Isopropyl fantasy authors; List infringement of High Speech Rail; Fair Lawn, NJ; alcohol; Gorilla; Make of Germans; Vatican software; Lamb of Steve Wozniak; James Medford, NJ; (software); Ostinato; City; Torah; Old God; Calendar date; Carville; David Palisades Park, NJ; Control flow; PHP; ; Lapland War; Copperfield; Lynyrd South River, NJ; HTML; Objective‐C; Apostles’ Creed; Totalitarianism; Skynyrd; Poison Gillespie County, TX; Vienna Development Mahmoud Abbas; History of Sweden; (band); LL Cool J; RZA; Hidalgo County, NM; Method; Comparison Boris Yeltsin; Ulysses European Youth Stand; Bob Dole; Finney County, KS; of 
Java C++; Dot S. Grant; William Parliament; Roman Sergei Prokofiev; Onslow County; NC; product; Elliptic Jennings Bryan; Pipe mythology; Kingdom Peter Hain; Ravana; Brady Township, PA; integral; Catalan organ; George IV of of Israel; List of Michael Porillo; Argentine Township, number; Expected the UK; George V of railway companies; George Pickett MI; Augusta Gharter value; Coenzyme; the UK; Emperor Maundy money Township, MI Neutron; History of Taizong of Tang; geodesy; Orbital David I of Scotland period

Experiment 2: Example PCA-space clusters

Hugh the Great; Eddy Meeme, WI; West Lisp (programming Cary Grant; Hormel; List of abbreviations Duchin; Spoilt Point, WI; Wesley, language); Species; Wilson Pickett; in the CIA world fact Bastard; Irish ME; Charlotte, VT; Alkane; Hindu Microsoft Windows; book; List of rural presidential election, Glen Ridge, FL; calendar; Sexual Louis Kahn; Sugar districts of Germany; 1990; Hadley, NY; Geraldine, MA; harassment; Ray Leonard; Bilge; Orthoptera; Bill Medicine Park, OK; Taylorsville, MI; Ovid, Symmetry; Book of UNIVAC; Nat King Fitch; Muirne; Council Grove, KS; Colorado; Conway, Job; Marginalism; Cole; Supergrass; Ministry; 1770 in Perryville, KY; Albert ND; Grano, ND; Creativity; Rumi; ; literature; John Bell of Sweden; Onamia, MN; Assam; Classical Baby boomer; Magi; Williams; Brooke Ferdinand I of Geneva, IA; Margaret liberalism; NATO; Received Burke; Yeardley Romania; Luigi Avison; Bodomi; Republic of Pronunciation; Smith; Anastasios II Boccherini; Daniel Uncle Scrooge Macedonia; Action Common Pheasant; (emperor); Mary Bernoulli; Market Adventure; 14; Novel; Lombards; Samuel Mudd; River National Park; research; Gram; Demographics of Wars of the Roses; Charlie Brown; Super List of diseases (S); Metastability; Saint Helena; William History of Bolivia; 8 mm film; Loki; Lake Township, MN; Corticosteroid; Prout; Union Hill, IL; Thabo Mbeki; W. H. 
Equus (play); Second List of Acer species; Misdemeanor; 19 Waldo, OH; East Auden; Apollo; Battle of Fort Fisher; Yucca; List of oboists; (number); Kent Nassau, NY; Gustav Mahler; Romanos I; IBM List of male film Brockman; Bourne Rochester, OH; Mill James Joyce; New Personal System/2; actors; Sparidae; 58 shell; So I Married an Hall, PA; Attu Station, ; Islamic art; Action figure; BC; Xanthine; 19; Axe Murderer; Peyo; AL; Roxbury, NH; Nevada; Metro Albertus Magnus; Pyrrhic; 100 Lipschitz continuity; Lone Rock, WI; Baca Manila; Mount Roland Freisler; gigametres; 177 BC; Equations of Motion; County, CO; Athos; Avignon; Giacomo Puccini; Businessperson; GNU Conjugate transpose; Comanche County, Bangalore; Kolkata; Frederick William II compiler for JAVA; Square number; OK; Polk County, GA Jabba the Hutt; of Prussia; Wiki; Strawberry Field; NJ Datamax UV-1; Andrea Dworkin; Borzoi; Dynamic Host Route 63; Tribes of Galway; Configuration Cucurbitales; Rita Brethren; Iwi Protocol; Yinglish Johnston

Conclusions

• Under the right conditions, CCA outperforms PCA as a dimensionality reduction technique before k-means clustering

• Ongoing work: Application to semi-supervised learning for speaker and speech recognition, analysis of hierarchical clustering