dimensionality reduction for k-means and low rank approximation

Michael Cohen, Sam Elder, Cameron Musco, Christopher Musco, Mădălina Persu

Massachusetts Institute of Technology

overview

Simple techniques to accelerate algorithms for:

∙ k-means clustering
∙ principal component analysis (PCA)
∙ constrained low rank approximation

dimensionality reduction

Replace large, high dimensional dataset with low dimensional sketch.

[Diagram: the n × d data matrix A is replaced by an n × d′ sketch Ã, with d′ ≪ d features]

∙ Simultaneously improves runtime, memory requirements, communication cost, etc.

dimensionality reduction

Solution on sketch Ã should approximate original solution.

[3D scatter plot of clustered data points]


k-means clustering

\[\min_{C}\ \mathrm{Cost}(A, C) = \sum_{i=1}^{n} \big\| a_i - \mu(C[a_i]) \big\|_2^2\]

∙ Choose k clusters to minimize total intra-cluster variance
∙ Extremely common objective function for clustering

[Illustration: a point a_i and its cluster centroid µ]

∙ NP-hard, even for fixed dimension d or fixed k.
∙ Several (1 + ϵ) and constant factor approximation algorithms.
∙ In practice: Lloyd's heuristic (i.e. "the k-means algorithm") with k-means++ initialization is used. O(log k) approximation guaranteed, typically performs much better.

Dimensionality reduction can speed up any of these algorithms.
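To make the objective concrete, here is a minimal numpy/scikit-learn sketch (not from the talk; the data, k, and seeds are illustrative) that evaluates Cost(A, C) for the assignment returned by Lloyd's heuristic with k-means++ initialization:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(A, labels):
    """Total intra-cluster variance: sum_i ||a_i - mu(C[a_i])||_2^2."""
    cost = 0.0
    for c in np.unique(labels):
        points = A[labels == c]
        cost += ((points - points.mean(axis=0)) ** 2).sum()
    return cost

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))   # n = 1000 points in d = 50 dimensions (toy data)
k = 5

# Lloyd's heuristic with k-means++ initialization ("the k-means algorithm").
km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(A)
print(kmeans_cost(A, km.labels_), km.inertia_)   # the two values should closely agree
```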

key observation

k-means clustering == low rank approximation

\[\min \sum_{i=1}^{n} \|a_i - \mu(a_i)\|_2^2 = \|A - C(A)\|_F^2\]

[Illustration: C(A) has the same shape as A, with each row a_i replaced by its assigned cluster centroid µ(a_i) ∈ {µ_1, …, µ_k}]

key observation

C(A) is actually a projection of A's columns onto a rank k subspace [Boutsidis, Drineas, Mahoney, Zouzias '11]

[Illustration: C(A) = X_C X_C^⊤ A, where X_C is the n × k cluster indicator matrix with X_C(i, j) = 1/√|C_j| if a_i is assigned to cluster j and 0 otherwise]
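A small numpy check of this fact (the matrix, the clustering, and the helper names are illustrative, not from the paper): build the indicator matrix X_C with entries 1/√|C_j| and verify that X_C X_C^⊤ A reproduces C(A).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 30, 4
A = rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)           # an arbitrary clustering (assumed non-empty clusters)

# Cluster indicator matrix: X_C[i, j] = 1/sqrt(|C_j|) if point i is in cluster j, else 0.
X_C = np.zeros((n, k))
for j in range(k):
    members = np.where(labels == j)[0]
    X_C[members, j] = 1.0 / np.sqrt(len(members))

# C(A): each row of A replaced by its cluster centroid.
centroids = np.vstack([A[labels == j].mean(axis=0) for j in range(k)])
C_A = centroids[labels]

assert np.allclose(X_C @ X_C.T @ A, C_A)      # C(A) is a rank-k projection of A's columns
print(((A - C_A) ** 2).sum())                 # ||A - C(A)||_F^2, the k-means cost of this clustering
```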

key observation

\[\min_{C} \sum_{i=1}^{n} \|a_i - \mu(C[a_i])\|_2^2 \;\Longrightarrow\; \min_{\mathrm{rank}(X)=k,\, X \in S} \|A - XX^{\top}A\|_F^2,\]

where X is a rank k orthonormal matrix and, for k-means, S is the set of all clustering indicator matrices.

∙ Set S = {All rank k orthonormal matrices} for principal component analysis/singular value decomposition.
∙ General form for constrained low rank approximation.
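For the unconstrained case S = {all rank k orthonormal matrices}, the minimizer is the top k left singular vectors of A, which recovers PCA/SVD; a quick numerical check with toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 30))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]
pca_cost = ((A - U_k @ (U_k.T @ A)) ** 2).sum()   # ||A - A_k||_F^2
assert np.isclose(pca_cost, (s[k:] ** 2).sum())   # equals the sum of tail squared singular values

# Any other rank-k orthonormal X (here a random one) can only do worse.
X, _ = np.linalg.qr(rng.standard_normal((200, k)))
assert ((A - X @ (X.T @ A)) ** 2).sum() >= pca_cost
```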


so what?

Can solve any problem like \(\min_{\mathrm{rank}(X)=k,\,X\in S} \|A - XX^{\top}A\|_F^2\) if:

\[\|\tilde{A} - XX^{\top}\tilde{A}\|_F^2 \approx \|A - XX^{\top}A\|_F^2 \quad \text{for all rank } k\ X.\]

[Diagram: A, with d columns, is replaced by a sketch Ã with only O(k) columns]

Projection-Cost Preserving Sketch
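One way to probe this property empirically is to compare the two costs over a batch of random rank k projections; the helper below is a hypothetical illustration (a real test would also include cluster indicator matrices and account for the additive shift c introduced next):

```python
import numpy as np

def projection_cost_ratios(A, A_tilde, k, trials=20, seed=0):
    """Ratios ||A_tilde - X X^T A_tilde||_F^2 / ||A - X X^T A||_F^2 for random rank-k X."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    ratios = []
    for _ in range(trials):
        X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # random n x k orthonormal matrix
        num = ((A_tilde - X @ (X.T @ A_tilde)) ** 2).sum()
        den = ((A - X @ (X.T @ A)) ** 2).sum()
        ratios.append(num / den)
    return np.array(ratios)   # for a projection-cost preserving sketch, these stay near 1
```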


projection-cost preservation

Specifically, we want:

\[\|\tilde{A} - XX^{\top}\tilde{A}\|_F^2 + c = (1 \pm \epsilon)\,\|A - XX^{\top}A\|_F^2 \quad \text{for all } X.\]

If we find an X that gives:

\[\|\tilde{A} - XX^{\top}\tilde{A}\|_F^2 \;\le\; \gamma\,\|\tilde{A} - \tilde{X}_{\mathrm{opt}}\tilde{X}_{\mathrm{opt}}^{\top}\tilde{A}\|_F^2,\]

then:

\[\|A - XX^{\top}A\|_F^2 \;\le\; \gamma\,(1 + \epsilon)\,\|A - X_{\mathrm{opt}}X_{\mathrm{opt}}^{\top}A\|_F^2.\]

See coresets of Feldman, Schmidt, Sohler '13.
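This implication is what makes sketching useful in practice: cluster the small sketch, then read the cost off the original data. A minimal version of that pipeline, using the SVD-based sketch with ⌈k/ϵ⌉ dimensions from the table that follows (all sizes here are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
A = rng.standard_normal((5000, 200))
k, eps = 10, 0.5
m = int(np.ceil(k / eps))                     # sketch dimension ceil(k/eps)

_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_tilde = A @ Vt[:m].T                        # n x m projection-cost preserving sketch

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_tilde)

def cost(M, labels):
    return sum(((M[labels == c] - M[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

print(cost(A, labels))   # evaluated on A, even though clustering was done on A_tilde
```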


recap

What we have seen so far:

∙ k-means clustering is just constrained rank k approximation.
∙ We can construct a projection-cost preserving sketch Ã that approximates the distance from A to any rank k subspace.
∙ Stronger guarantee than has been sought in prior work on approximate PCA via sketching.

our results

What techniques give (1 + ϵ) projection-cost preservation?

Technique | Reference | Previous Work: Dimensions | Previous Work: Error | Our Results: Dimensions | Our Results: Error
SVD | Feldman, Schmidt, Sohler '13 | O(k/ϵ²) | 1 + ϵ | ⌈k/ϵ⌉ | 1 + ϵ
Approximate SVD | Boutsidis, Drineas, Mahoney, Zouzias '11 | k | 2 + ϵ | ⌈k/ϵ⌉ | 1 + ϵ
Random Projection | Boutsidis, Drineas, Mahoney, Zouzias '11 | O(k/ϵ²) | 1 + ϵ | O(k/ϵ²) or O(log k/ϵ²) | 2 + ϵ or 9 + ϵ
Column Sampling | Boutsidis, Drineas, Mahoney, Zouzias '11 | O(k log k/ϵ²) | 3 + ϵ | O(k log k/ϵ²) | 1 + ϵ
Deterministic Column Selection | Boutsidis, Magdon-Ismail '13 | r > k | O(n/r) | O(k/ϵ²) | 1 + ϵ
Non-oblivious Projection | NA | NA | NA | O(k/ϵ) | 1 + ϵ

our results

SVD | Feldman, Schmidt, Sohler '13 | O(k/ϵ²) | 1 + ϵ | ⌈k/ϵ⌉ | 1 + ϵ
Approximate SVD | Boutsidis, Drineas, Mahoney, Zouzias '11 | k | 2 + ϵ | ⌈k/ϵ⌉ | 1 + ϵ

[Diagram: Ã = A V_{k/ε}, i.e. A projected onto its top ⌈k/ε⌉ (possibly approximate) right singular vectors, giving a sketch with k/ε columns]

∙ Practically very useful for k-means
∙ No constant factors on k/ϵ, and typically many fewer dimensions are required (Kappmeier, Schmidt, Schmidt '15)
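Since an approximate SVD suffices, the sketch can be built with a randomized SVD rather than an exact one; a possible sketch of that, using scikit-learn's randomized_svd (sizes are illustrative):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(4)
A = rng.standard_normal((10000, 500))
k, eps = 10, 0.5
m = int(np.ceil(k / eps))

# Top ceil(k/eps) approximate right singular vectors of A.
_, _, Vt = randomized_svd(A, n_components=m, random_state=0)
A_tilde = A @ Vt.T        # n x ceil(k/eps) sketch; cluster A_tilde instead of A
```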


our results

Random Projection | Boutsidis, Drineas, Mahoney, Zouzias '11 | O(k/ϵ²) | 1 + ϵ | O(k/ϵ²) or O(log k/ϵ²) | 2 + ϵ or 9 + ϵ

[Diagram: Ã = A Π, where Π is a d × O(k/ε²) random ±1 matrix]

∙ First sketch with dimension sublinear in k. Can (9 + ϵ) be improved?
∙ Sketch is data oblivious
∙ Lowest communication distributed k-means (improves on Balcan, Kanchanapally, Liang, Woodruff '14)
∙ Streaming principal component analysis in a single pass
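Obliviousness means Π can be drawn before ever seeing A, which is what enables the distributed and streaming applications above. A minimal dense ±1 version (the constant hidden in O(k/ϵ²) is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10000, 500))
k, eps = 10, 0.5
d_prime = int(np.ceil(k / eps**2))            # O(k/eps^2) columns, constant factor omitted

Pi = rng.choice([-1.0, 1.0], size=(A.shape[1], d_prime)) / np.sqrt(d_prime)
A_tilde = A @ Pi                              # data-oblivious n x d' sketch (2 + eps guarantee for k-means)
```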




oblivious projection-cost preserving sketches

Standard sketches for low rank approximation (Sarlós '06, Clarkson, Woodruff '13, etc.):

[Diagram: Ã = A Π, where Π is a d × O(k/ε) random ±1 matrix]

\[\|A - (P_{\tilde{A}}A)_k\|_F^2 \le (1 + \epsilon)\,\|A - A_k\|_F^2.\]

span(Ã) contains a good low rank approximation for A, but we must return to A to find it.

Projection-cost preserving sketch for low rank approximation:

[Diagram: Ã = A Π, where Π is a d × O(k/ε²) random ±1 matrix]

\[\|A - V_kV_k^{\top}A\|_F^2 \le (1 + \epsilon)\,\|A - A_k\|_F^2.\]

Additional ϵ dependence for stronger sketch. Is it required for oblivious approximate PCA? (See Ghashami, Liberty, Phillips, Woodruff '14/15)
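The contrast between the two guarantees can be made concrete: with a standard sketch we must return to A (project onto span(Ã), then take the best rank k), while a projection-cost preserving sketch lets us read an approximate PCA directly off Ã. A rough sketch of both, assuming the top left singular vectors of the wider sketch are used as the approximate principal components (sizes and constants are illustrative):

```python
import numpy as np

def sign_sketch(A, d_prime, seed=0):
    rng = np.random.default_rng(seed)
    Pi = rng.choice([-1.0, 1.0], size=(A.shape[1], d_prime)) / np.sqrt(d_prime)
    return A @ Pi

rng = np.random.default_rng(6)
A = rng.standard_normal((2000, 400))
k = 10

# (1) Standard use: span(A_tilde) contains a good rank-k approximation, but we must return to A.
A_tilde = sign_sketch(A, 2 * k)                        # O(k/eps) columns
Q, _ = np.linalg.qr(A_tilde)                           # basis for span of A_tilde's columns
P = Q @ (Q.T @ A)                                      # project A onto that span (second pass over A)
U, s, Vt = np.linalg.svd(P, full_matrices=False)
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]          # (P_A~ A)_k

# (2) Projection-cost preserving use: approximate PCA straight from the (wider) sketch.
A_tilde2 = sign_sketch(A, 8 * k, seed=1)               # O(k/eps^2) columns
V = np.linalg.svd(A_tilde2, full_matrices=False)[0][:, :k]   # top-k left singular vectors of the sketch
pca_cost = ((A - V @ (V.T @ A)) ** 2).sum()            # should be <= (1 + eps) * ||A - A_k||_F^2
```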

our results

Non-oblivious Randomized Projection | Sarlós '06 | NA | NA | O(k/ϵ) | 1 + ϵ

[Diagram: compute AΠ for a random Π, then form Ã by projecting A onto span(AΠ), an O(k/ε)-dimensional subspace]

∙ Very good practical performance: can get away with narrower Π than approximate SVD, although final point dimension is a constant factor higher.

our results

Column Sampling | Boutsidis, Drineas, Mahoney, Zouzias '11 | O(k log k/ϵ²) | 3 + ϵ | O(k log k/ϵ²) | 1 + ϵ
Deterministic Column Selection | Boutsidis, Magdon-Ismail '13 | r > k | O(n/r) | O(k/ϵ²) | 1 + ϵ

[Diagram: Ã is formed from Õ(k/ε²) sampled and reweighted columns of A]

∙ A not only contains a small set of columns that span a good low rank approximation, but a small (reweighted) set whose top principal components approximate those of A.
∙ First single shot sampling based dimensionality reduction for (1 + ϵ) error low rank approximation.
∙ Applications to streaming column subset selection? Iterative sampling algorithms for the SVD?

analysis for svd

Singular Value Decomposition (SVD):

[Diagram: A = U Σ V^⊤, with Σ = diag(σ_1, σ_2, …, σ_d)]

U, V have orthonormal columns. σ_1 ≥ σ_2 ≥ … ≥ σ_d. Same as Principal Component Analysis (PCA) if we mean center A's columns/rows.

analysis for svd

Partial Singular Value Decomposition (SVD):

[Diagram: A_m = U_m Σ_m V_m^⊤, keeping only the top m singular values and vectors]

\[\|A - A_m\|_F^2 = \min_{\mathrm{rank}(X)=m} \|A - XX^{\top}A\|_F^2\]

Claim: \(\|A_{k/\epsilon} - XX^{\top}A_{k/\epsilon}\|_F^2 + c = (1 \pm \epsilon)\,\|A - XX^{\top}A\|_F^2\)

(Equivalently, the claim can be stated with the sketch \(U_{k/\epsilon}\Sigma_{k/\epsilon} = A V_{k/\epsilon}\) in place of \(A_{k/\epsilon}\); the two differ only by the rotation \(V_{k/\epsilon}^{\top}\).)

We work with A_m for simplicity. Denote (A − A_m) as A_{\setminus m}.

analysis for svd

Claim: \(\|(I - XX^{\top})A_{k/\epsilon}\|_F^2 + c = (1 \pm \epsilon)\,\|(I - XX^{\top})A\|_F^2\)

Split into (row) orthogonal pairs: A = A_{k/ε} + A_{\setminus k/ε}. Then

\begin{align*}
\|(I - XX^{\top})A\|_F^2 &= \|(I - XX^{\top})A_{k/\epsilon}\|_F^2 + \|(I - XX^{\top})A_{\setminus k/\epsilon}\|_F^2 \\
&= \|(I - XX^{\top})A_{k/\epsilon}\|_F^2 + \|A_{\setminus k/\epsilon}\|_F^2 - \|XX^{\top}A_{\setminus k/\epsilon}\|_F^2 \\
&= \|(I - XX^{\top})A_{k/\epsilon}\|_F^2 + c - \|XX^{\top}A_{\setminus k/\epsilon}\|_F^2,
\end{align*}

taking \(c = \|A_{\setminus k/\epsilon}\|_F^2\).
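A quick numerical check of this decomposition (numpy; m plays the role of k/ϵ, and X is an arbitrary rank k orthonormal matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((300, 100))
k, m = 5, 50

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_top = U[:, :m] @ np.diag(s[:m]) @ Vt[:m]     # A_{k/eps}: top of the spectrum
A_tail = A - A_top                             # A_{\k/eps}: bottom of the spectrum

X, _ = np.linalg.qr(rng.standard_normal((300, k)))
proj = lambda M: M - X @ (X.T @ M)             # applies (I - XX^T)

lhs = (proj(A) ** 2).sum()
rhs = (proj(A_top) ** 2).sum() + (A_tail ** 2).sum() - ((X @ (X.T @ A_tail)) ** 2).sum()
assert np.isclose(lhs, rhs)                    # c is exactly ||A_tail||_F^2
```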


analysis for svd

In words, the projection cost for A is explained by the cost over A_{k/ϵ} plus the cost over A_{\k/ϵ}. We want to argue that ignoring the tail term is fine. Need to show that:

\[\|XX^{\top}A_{\setminus k/\epsilon}\|_F^2 \;\le\; \epsilon\,\|(I - XX^{\top})A\|_F^2,\]

for which it suffices to show

\[\|XX^{\top}A_{\setminus k/\epsilon}\|_F^2 \;\le\; \epsilon\,\|A - A_k\|_F^2.\]

[Diagram: the squared singular value spectrum σ_1² ≥ … ≥ σ_d² of A. Since XX^⊤A_{\k/ε} has rank at most k, ‖XX^⊤A_{\k/ε}‖_F² is at most σ²_{k/ε+1} + … + σ²_{k/ε+k}, while ‖A − A_k‖_F² = σ²_{k+1} + … + σ²_d contains k/ε terms that are each at least σ²_{k/ε+1}. Recall that ‖B‖_F² = Σ_i σ_i²(B).]

Σ 2 A σd ∑ ∥ ∥2 2 Recall that B F = i σi (B) 21 ∙ In practice, spectrum decay allows for a much smaller sketching dimension.

5 x 10

2

1.5

1

Squared Singular Value 0.5

0 5 10 15 20 25 30 35 Index

analysis for svd

∙ Analysis is very worst case since we assume σk/ϵ+k is just as big as σk+1.

22 5 x 10

2

1.5

1

Squared Singular Value 0.5

0 5 10 15 20 25 30 35 Index

analysis for svd

∙ Analysis is very worst case since we assume σk/ϵ+k is just as big as σk+1. ∙ In practice, spectrum decay allows for a much smaller sketching dimension.

22 analysis for svd

∙ Analysis is very worst case since we assume σk/ϵ+k is just as big as σk+1. ∙ In practice, spectrum decay allows for a much smaller sketching dimension.

5 x 10

2

1.5

1

Squared Singular Value 0.5

0 5 10 15 20 25 30 35 Index

22 ∥ − ⊤ ∥2 ∙ We have some hope of approximating A\k XX A\k F

proof sketch for random projection

All proofs have a similar flavor: compare

\[\|A_k - XX^{\top}A_k\|_F^2 + \|A_{\setminus k} - XX^{\top}A_{\setminus k}\|_F^2 \quad\text{vs.}\quad \|A_k\Pi - XX^{\top}A_k\Pi\|_F^2 + \|A_{\setminus k}\Pi - XX^{\top}A_{\setminus k}\Pi\|_F^2\]

Except that:

∙ We have to worry about cross terms: \(\|A\Pi - XX^{\top}A\Pi\|_F^2 \ne \|A_k\Pi - XX^{\top}A_k\Pi\|_F^2 + \|A_{\setminus k}\Pi - XX^{\top}A_{\setminus k}\Pi\|_F^2\)
∙ We have some hope of approximating \(\|A_{\setminus k} - XX^{\top}A_{\setminus k}\|_F^2\)

proof sketch for random projection

Rely on standard sketching tools:

∙ Approximate Matrix Multiplication: \(\|A\Pi\Pi^{\top}B - AB\|_F \le \epsilon\,\|A\|_F\,\|B\|_F\)
∙ Subspace Embedding: \(\|YA\Pi\|_F^2 = (1 \pm \epsilon)\,\|YA\|_F^2\) for all Y, if A has rank k
∙ Frobenius Norm Preservation: \(\|A\Pi\|_F^2 = (1 \pm \epsilon)\,\|A\|_F^2\)
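These properties are easy to sanity-check for a dense ±1 sketch; a rough illustration of the last two (one fixed Y only, whereas the actual guarantee is for all Y simultaneously; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
d, d_prime, k = 400, 200, 5
Pi = rng.choice([-1.0, 1.0], size=(d, d_prime)) / np.sqrt(d_prime)

# Frobenius norm preservation: ||A Pi||_F^2 should be close to ||A||_F^2.
A = rng.standard_normal((300, d))
print(((A @ Pi) ** 2).sum() / (A ** 2).sum())

# Subspace embedding for a rank-k matrix B: ||Y B Pi||_F^2 close to ||Y B||_F^2.
B = rng.standard_normal((300, k)) @ rng.standard_normal((k, d))   # rank-k data matrix
Y = rng.standard_normal((50, 300))
print(((Y @ B @ Pi) ** 2).sum() / ((Y @ B) ** 2).sum())
```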

proof sketch for random projection

See paper for the details!


theoretical takeaway

∙ Dimensionality reduction for any constrained low rank approximation can be unified.
∙ Projection-cost preserving sketches guarantee \(\|\tilde{A} - XX^{\top}\tilde{A}\|_F^2 \approx \|A - XX^{\top}A\|_F^2\) for all rank k X.
∙ Standard proof approach: Split A into orthogonal pairs. Top of spectrum can be preserved multiplicatively; only the top singular values of the bottom of the spectrum matter.

proof sketch for sublinear result

Split using optimal clustering:

[Illustration: A = C*(A) + (A − C*(A)), where row i of C*(A) is the optimal cluster centroid µ(a_i) and row i of the remainder is the error e_i]

∙ C*(A) can be approximated with O(log k/ϵ²) dimensions.
∙ Not row orthogonal – have to use the triangle inequality, which leads to the (9 + ϵ) factor.

practical considerations

Experiments, implementations, etc. in Cameron Musco '15

[Bar charts: total runtime split into dimensionality reduction time and clustering time, comparing the full dataset against SVD, approximate SVD, sampling, approximate sampling, random projection, and non-oblivious random projection (NORP) sketches]

Time for ϵ = .01 compared to baseline Lloyd's w/ k-means++.
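A rough timing harness in the same spirit (scikit-learn; the dataset, k, and sketch size below are placeholders rather than the configurations used in the referenced experiments):

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.utils.extmath import randomized_svd

def timed_kmeans(M, k):
    t0 = time.time()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(M)
    return time.time() - t0, km.labels_

rng = np.random.default_rng(9)
A = rng.standard_normal((20000, 1000))
k, m = 25, 50

# Baseline: Lloyd's + k-means++ on the full dataset.
t_full, _ = timed_kmeans(A, k)

# Approximate SVD sketch, then cluster the sketch.
t0 = time.time()
_, _, Vt = randomized_svd(A, n_components=m, random_state=0)
A_tilde = A @ Vt.T
t_reduce = time.time() - t0
t_cluster, _ = timed_kmeans(A_tilde, k)

print(f"full: {t_full:.1f}s   sketch: {t_reduce:.1f}s + {t_cluster:.1f}s clustering")
```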

Thank you!
