dimensionality reduction for k-means and low rank approximation
Michael Cohen, Sam Elder, Cameron Musco, Christopher Musco, Mădălina Persu
Massachusetts Institute of Technology
0 overview
Simple techniques to accelerate algorithms for:
∙ k-means clustering
∙ principal component analysis (PCA)
∙ constrained low rank approximation
1 dimensionality reduction
Replace a large, high-dimensional dataset with a low-dimensional sketch.
[Figure: the data matrix A (n data points × d features) is replaced by a sketch Ã (n data points × d′ features, with d′ ≪ d).]
∙ Simultaneously improves runtime, memory requirements, communication cost, etc.

Solution on sketch Ã should approximate original solution.
k-means clustering

min_C Cost(A, C) = ∑_{i=1}^n ∥a_i − µ(C[a_i])∥²_2

[Figure: clustered points a_i with their cluster centroids µ.]

∙ Extremely common objective function for clustering
∙ Choose k clusters to minimize total intra-cluster variance
∙ NP-hard, even for fixed dimension d or fixed k
∙ Several (1 + ϵ) and constant factor approximation algorithms
∙ In practice: Lloyd’s heuristic (i.e. “the k-means algorithm”) with k-means++ initialization is used. O(log k) approximation guaranteed, typically performs much better.

Dimensionality reduction can speed up any of these algorithms.
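To make the objective above concrete, here is a minimal NumPy sketch (illustrative only, not from the paper) that evaluates Cost(A, C) for a given assignment; Lloyd’s heuristic alternates recomputing these centroids with reassigning each point to its nearest centroid.

```python
import numpy as np

def kmeans_cost(A, assign, k):
    """Total intra-cluster variance: sum_i ||a_i - mu(C[a_i])||_2^2."""
    cost = 0.0
    for j in range(k):
        pts = A[assign == j]                 # rows of A assigned to cluster j
        if len(pts) > 0:
            mu = pts.mean(axis=0)            # centroid of cluster j
            cost += ((pts - mu) ** 2).sum()  # squared distances to the centroid
    return cost

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))          # n = 1000 points, d = 50 features
assign = rng.integers(0, 3, size=1000)       # an arbitrary assignment into k = 3 clusters
print(kmeans_cost(A, assign, k=3))
```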
key observation

k-means clustering == low rank approximation

min ∑_{i=1}^n ∥a_i − µ(a_i)∥²_2 = ∥A − C(A)∥²_F

where C(A) is the matrix obtained from A by replacing each row a_i with its cluster centroid µ(a_i).

C(A) is actually a projection of A’s columns onto a rank k subspace [Boutsidis, Drineas, Mahoney, Zouzias ‘11]:

C(A) = X_C X_C^⊤ A,

where X_C is the n × k scaled cluster indicator matrix with X_C(i, j) = 1/√|C_j| if a_i belongs to cluster j and 0 otherwise.
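A small NumPy check of this identity (illustrative only): build the scaled indicator matrix X_C for an arbitrary clustering and compare the clustering cost with ∥A − X_C X_C^⊤A∥²_F.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 4
A = rng.standard_normal((n, d))
assign = rng.integers(0, k, size=n)              # any clustering C, not necessarily optimal

# Scaled cluster indicator matrix: X_C[i, j] = 1/sqrt(|C_j|) if point i is in cluster j.
sizes = np.bincount(assign, minlength=k)
X = np.zeros((n, k))
X[np.arange(n), assign] = 1.0 / np.sqrt(sizes[assign])

CA = X @ (X.T @ A)                               # X_C X_C^T A: every row becomes its centroid

cost = 0.0                                       # direct k-means cost of this clustering
for j in range(k):
    pts = A[assign == j]
    if len(pts) > 0:
        cost += ((pts - pts.mean(axis=0)) ** 2).sum()

print(np.allclose(cost, ((A - CA) ** 2).sum()))  # True: cost == ||A - X_C X_C^T A||_F^2
```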
key observation

min_C ∑_{i=1}^n ∥a_i − µ(C[a_i])∥²_2 = min_{rank(X)=k, X∈S} ∥A − XX^⊤A∥²_F

where X is a rank k orthonormal matrix and, for k-means, S is the set of all cluster indicator matrices.

∙ Set S = {all rank k orthonormal matrices} for principal component analysis / singular value decomposition (see the small sketch below).
∙ General form for constrained low rank approximation.
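For that unconstrained case the minimizer is the span of the top k left singular vectors of A; a minimal NumPy illustration (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = U[:, :k]                                      # optimal unconstrained rank-k X (PCA/SVD)
pca_cost = ((A - X @ (X.T @ A)) ** 2).sum()       # ||A - XX^T A||_F^2
print(np.isclose(pca_cost, (s[k:] ** 2).sum()))   # equals the sum of tail squared singular values
```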
so what?

Can solve any problem like min_{rank(X)=k, X∈S} ∥A − XX^⊤A∥²_F if:

∥Ã − XX^⊤Ã∥²_F ≈ ∥A − XX^⊤A∥²_F for all rank k X.

[Figure: A with d columns next to Ã with only O(k) columns satisfying this guarantee — a Projection-Cost Preserving Sketch.]
projection-cost preservation

Specifically, we want:

∥Ã − XX^⊤Ã∥²_F + c = (1 ± ϵ) ∥A − XX^⊤A∥²_F for all X.

If we find an X̃ that gives:

∥Ã − X̃X̃^⊤Ã∥²_F ≤ γ · min_X ∥Ã − XX^⊤Ã∥²_F

then:

∥A − X̃X̃^⊤A∥²_F ≤ γ · (1 + ϵ) · min_X ∥A − XX^⊤A∥²_F

See coresets of Feldman, Schmidt, Sohler ‘13.
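A toy numerical illustration of this use pattern (not from the paper; it uses the random sign sketch analyzed later in the talk, and the sketch dimension m = 200 here is an arbitrary choice): compute the best unconstrained X̃ from Ã alone, then evaluate its cost on the full matrix A against the true optimum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 400, 5
A = rng.standard_normal((n, d)) * (1.0 / np.arange(1, d + 1))   # columns with decaying scale

m = 200                                                          # sketch dimension (illustrative)
Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)
A_t = A @ Pi                                                     # candidate sketch Ã

X = np.linalg.svd(A_t, full_matrices=False)[0][:, :k]            # optimal rank-k X for Ã
cost_on_A = ((A - X @ (X.T @ A)) ** 2).sum()                     # evaluate that X on A
opt = (np.linalg.svd(A, compute_uv=False)[k:] ** 2).sum()        # true optimum for A
print(cost_on_A / opt)                                           # close to 1 when Ã is a good sketch
```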
recap

What we have seen so far:

∙ k-means clustering is just constrained rank k approximation.
∙ We can construct a projection-cost preserving sketch Ã that approximates the distance from A to any rank k subspace.
∙ Stronger guarantee than has been sought in prior work on approximate PCA via sketching.

our results
What techniques give (1 + ϵ) projection-cost preservation?

∙ SVD — Feldman, Schmidt, Sohler ‘13: previously O(k/ϵ²) dimensions, 1 + ϵ error; ours ⌈k/ϵ⌉ dimensions, 1 + ϵ error.
∙ Approximate SVD — Boutsidis, Drineas, Mahoney, Zouzias ‘11: previously k dimensions, 2 + ϵ error; ours ⌈k/ϵ⌉ dimensions, 1 + ϵ error.
∙ Random Projection — Boutsidis, Drineas, Mahoney, Zouzias ‘11: previously O(k/ϵ²) dimensions, 2 + ϵ error; ours O(k/ϵ²) dimensions, 1 + ϵ error, and O(log k/ϵ²) dimensions, 9 + ϵ error.
∙ Column Sampling — Boutsidis, Drineas, Mahoney, Zouzias ‘11: previously O(k log k/ϵ²) dimensions, 3 + ϵ error; ours O(k log k/ϵ²) dimensions, 1 + ϵ error.
∙ Deterministic Column Selection — Boutsidis, Magdon-Ismail ‘13: previously r > k dimensions, O(n/r) error; ours O(k/ϵ²) dimensions, 1 + ϵ error.
∙ Non-oblivious Projection — no previous projection-cost preserving bound; ours O(k/ϵ) dimensions, 1 + ϵ error.
SVD — Feldman, Schmidt, Sohler ‘13: previously O(k/ϵ²) dimensions, 1 + ϵ error; ours ⌈k/ϵ⌉ dimensions, 1 + ϵ error.

[Figure: Ã = A V_{k/ϵ}, an n × ⌈k/ϵ⌉ sketch formed from the top ⌈k/ϵ⌉ right singular vectors of A.]
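A minimal NumPy sketch of this construction (illustrative; the specific values of k and ϵ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 200))
k, eps = 10, 0.5
m = int(np.ceil(k / eps))                 # ceil(k/eps) dimensions

_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_tilde = A @ Vt[:m].T                    # project onto the top-m right singular vectors
print(A_tilde.shape)                      # (1000, 20)
```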
Approximate SVD — Boutsidis, Drineas, Mahoney, Zouzias ‘11: previously k dimensions, 2 + ϵ error; ours ⌈k/ϵ⌉ dimensions, 1 + ϵ error.

[Figure: Ã = A Ṽ_{k/ϵ}, formed from an approximation to the top ⌈k/ϵ⌉ right singular vectors of A.]

∙ Practically very useful for k-means
∙ No constant factors on k/ϵ, and typically many fewer dimensions are required (Kappmeier, Schmidt, Schmidt ‘15)
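One standard way to compute such an approximate subspace is randomized subspace iteration; the sketch below is a generic version (an assumption — the exact approximate-SVD routine used in the paper’s experiments may differ).

```python
import numpy as np

def approx_top_right_singular_vectors(A, m, n_iter=4, seed=0):
    """Approximate the top-m right singular vectors of A via randomized subspace iteration."""
    rng = np.random.default_rng(seed)
    Y = A @ rng.standard_normal((A.shape[1], m + 10))   # random start with small oversampling
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(Y)                          # re-orthonormalize for stability
        Y = A @ (A.T @ Y)                               # power iteration sharpens the subspace
    Q, _ = np.linalg.qr(Y)                              # orthonormal basis for the range
    _, _, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Vt[:m].T                                     # d x m

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 200))
V = approx_top_right_singular_vectors(A, m=20)
A_tilde = A @ V                                         # approximate SVD sketch with 20 columns
```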
Random Projection — Boutsidis, Drineas, Mahoney, Zouzias ‘11: previously O(k/ϵ²) dimensions, 2 + ϵ error; ours O(k/ϵ²) dimensions, 1 + ϵ error, and O(log k/ϵ²) dimensions, 9 + ϵ error.

[Figure: Ã = AΠ, where Π is a d × O(k/ϵ²) random ±1 matrix.]
∙ First sketch with dimension sublinear in k. Can (9 + ϵ) be improved?
∙ Sketch is data oblivious
∙ Lowest communication distributed k-means (improves on Balcan, Kanchanapally, Liang, Woodruff ‘14)
∙ Streaming principal component analysis in a single pass
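A minimal NumPy sketch of the random projection (illustrative; the constant hidden in O(k/ϵ²) is an assumption). Because Π never looks at A, the same sketch applies in streaming and distributed settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, eps = 1000, 500, 10, 0.5
m = int(k / eps**2)                                       # O(k/eps^2) dimensions, constant omitted

A = rng.standard_normal((n, d))
Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)    # dense random sign matrix
A_tilde = A @ Pi                                          # oblivious sketch: independent of A
print(A_tilde.shape)                                      # (1000, 40)
```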
oblivious projection-cost preserving sketches

Standard sketches for low rank approximation (Sarlós ‘06, Clarkson, Woodruff ‘13, etc.):

[Figure: Ã = AΠ, where Π is a d × O(k/ϵ) random ±1 matrix.]

∥A − (P_Ã A)_k∥²_F ≤ (1 + ϵ)∥A − A_k∥²_F. span(Ã) contains a good low rank approximation for A, but we must return to A to find it (illustrated in the sketch below).

Projection-cost preserving sketch for low rank approximation:

[Figure: Ã = AΠ, where Π is a d × O(k/ϵ²) random ±1 matrix.]

∥A − Ṽ_kṼ_k^⊤A∥²_F ≤ (1 + ϵ)∥A − A_k∥²_F, where Ṽ_k spans the top rank k subspace of Ã. Additional ϵ dependence for the stronger sketch. Is it required for oblivious approximate PCA? (See Ghashami, Liberty, Phillips, Woodruff ‘14/15)
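A minimal NumPy illustration (not from the paper) of the “return to A” step for the standard sketch: project A onto span(AΠ) and then take the best rank k part of that projection; the sketch width 2k is an arbitrary constant-ϵ choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 400, 10
A = rng.standard_normal((n, d)) * (1.0 / np.arange(1, d + 1))   # decaying spectrum
m = 2 * k                                                       # O(k/eps) columns, constants omitted

Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)
Q, _ = np.linalg.qr(A @ Pi)                                     # orthonormal basis for span(A Pi)

B = Q.T @ A                                                     # small m x d matrix
Ub, sb, Vbt = np.linalg.svd(B, full_matrices=False)
A_k_approx = Q @ (Ub[:, :k] * sb[:k]) @ Vbt[:k]                 # (P_A~ A)_k, a rank-k approximation of A

opt = (np.linalg.svd(A, compute_uv=False)[k:] ** 2).sum()
print(((A - A_k_approx) ** 2).sum() / opt)                      # ratio >= 1, near 1 for a good sketch
```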
Non-oblivious Randomized Projection — Sarlós ‘06: no previous projection-cost preserving bound; ours O(k/ϵ) dimensions, 1 + ϵ error.

[Figure: Ã is obtained by projecting A onto span(AΠ), giving O(k/ϵ) dimensions.]

∙ Very good practical performance: can get away with a narrower Π than approximate SVD, although the final point dimension is a constant factor higher.
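A minimal NumPy sketch of one natural reading of this construction — project A onto span(AΠ) and re-express the result in that O(k/ϵ)-dimensional basis; the precise construction and constants used in the paper are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, eps = 800, 400, 10, 0.5
m = int(k / eps)                                  # O(k/eps) columns, constant omitted

A = rng.standard_normal((n, d)) * (1.0 / np.arange(1, d + 1))
Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)

Q, _ = np.linalg.qr(A @ Pi)                       # orthonormal basis for span(A Pi)
W, s, _ = np.linalg.svd(Q.T @ A, full_matrices=False)
A_tilde = Q @ (W * s)                             # n x m: Q Q^T A written in m coordinates
print(A_tilde.shape)                              # (800, 20)
```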
Column Sampling — Boutsidis, Drineas, Mahoney, Zouzias ‘11: previously O(k log k/ϵ²) dimensions, 3 + ϵ error; ours O(k log k/ϵ²) dimensions, 1 + ϵ error.
Deterministic Column Selection — Boutsidis, Magdon-Ismail ‘13: previously r > k dimensions, O(n/r) error; ours O(k/ϵ²) dimensions, 1 + ϵ error.

[Figure: Ã is a reweighted sample of Õ(k/ϵ²) of A’s columns.]

∙ A not only contains a small set of columns that span a good low rank approximation, but a small (reweighted) set whose top principal components approximate those of A.
∙ First single-shot sampling-based dimensionality reduction for (1 + ϵ) error low rank approximation.
∙ Applications to streaming column subset selection? Iterative sampling algorithms for the SVD?
analysis for svd

Singular Value Decomposition (SVD):

A = U Σ V^⊤, with Σ = diag(σ_1, σ_2, …, σ_d).

U, V have orthonormal columns. σ_1 ≥ σ_2 ≥ … ≥ σ_d. Same as Principal Component Analysis (PCA) if we mean center A’s columns/rows.
analysis for svd

Partial Singular Value Decomposition (SVD):

A_m = U_m Σ_m V_m^⊤

∥A − A_m∥²_F = min_{rank(X)=m} ∥A − XX^⊤A∥²_F

Claim: ∥A_{k/ϵ} − XX^⊤A_{k/ϵ}∥²_F + c = (1 ± ϵ) ∥A − XX^⊤A∥²_F

(The actual sketch is Ã = U_{k/ϵ}Σ_{k/ϵ} = A V_{k/ϵ}, which has the same projection costs as A_{k/ϵ}.) We work with A_m for simplicity. Denote (A − A_m) as A_{\m}.
analysis for svd

Claim: ∥(I − XX^⊤)A_{k/ϵ}∥²_F + c = (1 ± ϵ) ∥(I − XX^⊤)A∥²_F

Split into (row) orthogonal pairs:

A = A_{k/ϵ} + A_{\k/ϵ}

∥(I − XX^⊤)A∥²_F = ∥(I − XX^⊤)A_{k/ϵ}∥²_F + ∥(I − XX^⊤)A_{\k/ϵ}∥²_F
                = ∥(I − XX^⊤)A_{k/ϵ}∥²_F + ∥A_{\k/ϵ}∥²_F − ∥XX^⊤A_{\k/ϵ}∥²_F,

so taking c = ∥A_{\k/ϵ}∥²_F, the claim reduces to controlling the term ∥XX^⊤A_{\k/ϵ}∥²_F.
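A quick NumPy check (illustrative only) that the cost really does split over this orthogonal pair for an arbitrary rank k projection:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, m = 300, 200, 5, 25                        # m plays the role of k/eps
A = rng.standard_normal((n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_m = (U[:, :m] * s[:m]) @ Vt[:m]                   # A_{k/eps}
A_tail = A - A_m                                    # A_{\k/eps}

X, _ = np.linalg.qr(rng.standard_normal((n, k)))    # an arbitrary rank-k orthonormal X
P = np.eye(n) - X @ X.T                             # projection I - XX^T

lhs = ((P @ A) ** 2).sum()
rhs = ((P @ A_m) ** 2).sum() + ((P @ A_tail) ** 2).sum()
print(np.allclose(lhs, rhs))                        # True: the two costs add up exactly
```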
analysis for svd

In words, the projection cost for A is explained by the cost over A_{k/ϵ} plus the cost over A_{\k/ϵ}. We want to argue that ignoring the tail term is fine.

Need to show that:

∥XX^⊤A_{\k/ϵ}∥²_F ≤ ϵ ∥(I − XX^⊤)A∥²_F, for which ∥XX^⊤A_{\k/ϵ}∥²_F ≤ ϵ ∥A − A_k∥²_F suffices.

[Figure: the squared spectrum σ²_1 ≥ … ≥ σ²_d of A. Recall that ∥B∥²_F = ∑_i σ²_i(B). Since XX^⊤ has rank k, ∥XX^⊤A_{\k/ϵ}∥²_F is at most the top k squared singular values of the tail, σ²_{k/ϵ+1} + … + σ²_{k/ϵ+k}, while ∥A − A_k∥²_F includes the roughly k/ϵ values σ²_{k+1}, …, σ²_{k/ϵ}, each at least as large — comparing the block of k values against the block of k/ϵ values gives the ϵ factor.]
analysis for svd

∙ Analysis is very worst case since we assume σ_{k/ϵ+k} is just as big as σ_{k+1}.
∙ In practice, spectrum decay allows for a much smaller sketching dimension.

[Figure: squared singular value versus index for a sample dataset, decaying rapidly over the first ~35 indices.]
proof sketch for random projection

All proofs have a similar flavor — compare:

∥A_k − XX^⊤A_k∥²_F + ∥A_{\k} − XX^⊤A_{\k}∥²_F
vs.
∥A_kΠ − XX^⊤A_kΠ∥²_F + ∥A_{\k}Π − XX^⊤A_{\k}Π∥²_F

Except that:

∙ We have to worry about cross terms:
∥AΠ − XX^⊤AΠ∥²_F ≠ ∥A_kΠ − XX^⊤A_kΠ∥²_F + ∥A_{\k}Π − XX^⊤A_{\k}Π∥²_F
∙ We have some hope of approximating ∥A_{\k} − XX^⊤A_{\k}∥²_F

Rely on standard sketching tools:

∙ Approximate Matrix Multiplication: ∥AΠΠ^⊤B − AB∥_F ≤ ϵ ∥A∥_F ∥B∥_F
∙ Subspace Embedding: ∥YAΠ∥²_F = (1 ± ϵ)∥YA∥²_F for all Y, if A has rank k
∙ Frobenius Norm Preservation: ∥AΠ∥²_F = (1 ± ϵ)∥A∥²_F
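A small NumPy check of these three properties for a random sign Π (illustrative; the error magnitudes depend on the sketch size m, which is chosen arbitrarily here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 2000, 400
Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)

A = rng.standard_normal((30, d))
B = rng.standard_normal((d, 30))

# Frobenius norm preservation: ||A Pi||_F^2 close to ||A||_F^2.
print(((A @ Pi) ** 2).sum() / (A ** 2).sum())

# Approximate matrix multiplication: A Pi Pi^T B close to A B relative to ||A||_F ||B||_F.
amm_err = np.linalg.norm(A @ Pi @ Pi.T @ B - A @ B)
print(amm_err / (np.linalg.norm(A) * np.linalg.norm(B)))        # small, roughly 1/sqrt(m)

# Subspace embedding for a rank-k row span: singular values of V^T Pi are all close to 1.
V, _ = np.linalg.qr(rng.standard_normal((d, 10)))               # basis for a 10-dimensional subspace
sv = np.linalg.svd(V.T @ Pi, compute_uv=False)
print(sv.min(), sv.max())                                       # both near 1, within about sqrt(k/m)
```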
24 proof sketch for random projection
See paper for the details!
theoretical takeaway

∙ Dimensionality reduction for any constrained low rank approximation can be unified.
∙ Projection-cost preserving sketches guarantee ∥Ã − XX^⊤Ã∥²_F ≈ ∥A − XX^⊤A∥²_F for all rank k X.
∙ Standard proof approach: Split A into orthogonal pairs. Top of spectrum can be preserved multiplicatively; only top singular values of bottom of spectrum matter.
proof sketch for sublinear result

Split using optimal clustering:

A = C*(A) + (A − C*(A))

[Figure: each row a_i splits into its optimal cluster centroid µ_j plus a residual row e_j.]

∙ C*(A) can be approximated with O(log k/ϵ²) dimensions.
∙ Not row orthogonal – have to use the triangle inequality, which leads to the (9 + ϵ) factor.

practical considerations
Experiments, implementations, etc. in Cameron Musco ‘15.

[Figure: bar chart of dimensionality reduction time plus clustering time (seconds, up to ~3500) for Full Dataset, SVD, Approx. SVD, Sampling, Approx. Sampling, Rand. Proj., and NORP. Time for ϵ = .01 compared to baseline Lloyd’s w/ k-means++.]

[Figure: the same comparison on a smaller time scale (up to ~70 seconds). Time for ϵ = .01 compared to baseline Lloyd’s w/ k-means++.]

Thank you!