dimensionality reduction for k-means and low rank approximation

Michael Cohen, Sam Elder, Cameron Musco, Christopher Musco, Mădălina Persu

Massachusetts Institute of Technology

overview

Simple techniques to accelerate algorithms for:

∙ k-means clustering
∙ principal component analysis (PCA)
∙ constrained low rank approximation

dimensionality reduction

Replace large, high dimensional dataset with low dimensional sketch.

[Diagram: the n × d data matrix A is replaced by an n × d′ sketch Ã, with d′ ≪ d features]

∙ Simultaneously improves runtime, memory requirements, communication cost, etc.

dimensionality reduction

Solution on sketch Ã should approximate original solution.

[3D scatter plot of clustered data points]


k-means clustering

\[\min_{C}\ \mathrm{Cost}(A, C) = \sum_{i=1}^{n} \big\| a_i - \mu(C[a_i]) \big\|_2^2\]

∙ Choose k clusters to minimize total intra-cluster variance
∙ Extremely common objective function for clustering

[Illustration: a point a_i and its cluster centroid µ]

∙ NP-hard, even for fixed dimension d or fixed k.
∙ Several (1 + ϵ) and constant factor approximation algorithms.
∙ In practice: Lloyd's heuristic (i.e. "the k-means algorithm") with k-means++ initialization is used. O(log k) approximation guaranteed, typically performs much better.

Dimensionality reduction can speed up any of these algorithms.
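To make the objective concrete, here is a minimal numpy/scikit-learn sketch (not from the talk; the data, k, and seeds are illustrative) that evaluates Cost(A, C) for the assignment returned by Lloyd's heuristic with k-means++ initialization:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(A, labels):
    """Total intra-cluster variance: sum_i ||a_i - mu(C[a_i])||_2^2."""
    cost = 0.0
    for c in np.unique(labels):
        points = A[labels == c]
        cost += ((points - points.mean(axis=0)) ** 2).sum()
    return cost

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))   # n = 1000 points in d = 50 dimensions (toy data)
k = 5

# Lloyd's heuristic with k-means++ initialization ("the k-means algorithm").
km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(A)
print(kmeans_cost(A, km.labels_), km.inertia_)   # the two values should closely agree
```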

key observation

k-means clustering == low rank approximation

\[\min \sum_{i=1}^{n} \|a_i - \mu(a_i)\|_2^2 = \|A - C(A)\|_F^2\]

[Illustration: C(A) has the same shape as A, with each row a_i replaced by its assigned cluster centroid µ(a_i) ∈ {µ_1, …, µ_k}]

key observation

C(A) is actually a projection of A's columns onto a rank k subspace [Boutsidis, Drineas, Mahoney, Zouzias '11]

[Illustration: C(A) = X_C X_C^⊤ A, where X_C is the n × k cluster indicator matrix with X_C(i, j) = 1/√|C_j| if a_i is assigned to cluster j and 0 otherwise]
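A small numpy check of this fact (the matrix, the clustering, and the helper names are illustrative, not from the paper): build the indicator matrix X_C with entries 1/√|C_j| and verify that X_C X_C^⊤ A reproduces C(A).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 30, 4
A = rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)           # an arbitrary clustering (assumed non-empty clusters)

# Cluster indicator matrix: X_C[i, j] = 1/sqrt(|C_j|) if point i is in cluster j, else 0.
X_C = np.zeros((n, k))
for j in range(k):
    members = np.where(labels == j)[0]
    X_C[members, j] = 1.0 / np.sqrt(len(members))

# C(A): each row of A replaced by its cluster centroid.
centroids = np.vstack([A[labels == j].mean(axis=0) for j in range(k)])
C_A = centroids[labels]

assert np.allclose(X_C @ X_C.T @ A, C_A)      # C(A) is a rank-k projection of A's columns
print(((A - C_A) ** 2).sum())                 # ||A - C(A)||_F^2, the k-means cost of this clustering
```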

key observation

\[\min_{C} \sum_{i=1}^{n} \|a_i - \mu(C[a_i])\|_2^2 \;\Longrightarrow\; \min_{\mathrm{rank}(X)=k,\, X \in S} \|A - XX^{\top}A\|_F^2,\]

where X is a rank k orthonormal matrix and, for k-means, S is the set of all clustering indicator matrices.

∙ Set S = {All rank k orthonormal matrices} for principal component analysis/singular value decomposition.
∙ General form for constrained low rank approximation.
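For the unconstrained case S = {all rank k orthonormal matrices}, the minimizer is the top k left singular vectors of A, which recovers PCA/SVD; a quick numerical check with toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 30))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]
pca_cost = ((A - U_k @ (U_k.T @ A)) ** 2).sum()   # ||A - A_k||_F^2
assert np.isclose(pca_cost, (s[k:] ** 2).sum())   # equals the sum of tail squared singular values

# Any other rank-k orthonormal X (here a random one) can only do worse.
X, _ = np.linalg.qr(rng.standard_normal((200, k)))
assert ((A - X @ (X.T @ A)) ** 2).sum() >= pca_cost
```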


so what?

Can solve any problem like \(\min_{\mathrm{rank}(X)=k,\,X\in S} \|A - XX^{\top}A\|_F^2\) if:

\[\|\tilde{A} - XX^{\top}\tilde{A}\|_F^2 \approx \|A - XX^{\top}A\|_F^2 \quad \text{for all rank } k\ X.\]

[Diagram: A, with d columns, is replaced by a sketch Ã with only O(k) columns]

Projection-Cost Preserving Sketch
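One way to probe this property empirically is to compare the two costs over a batch of random rank k projections; the helper below is a hypothetical illustration (a real test would also include cluster indicator matrices and account for the additive shift c introduced next):

```python
import numpy as np

def projection_cost_ratios(A, A_tilde, k, trials=20, seed=0):
    """Ratios ||A_tilde - X X^T A_tilde||_F^2 / ||A - X X^T A||_F^2 for random rank-k X."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    ratios = []
    for _ in range(trials):
        X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # random n x k orthonormal matrix
        num = ((A_tilde - X @ (X.T @ A_tilde)) ** 2).sum()
        den = ((A - X @ (X.T @ A)) ** 2).sum()
        ratios.append(num / den)
    return np.array(ratios)   # for a projection-cost preserving sketch, these stay near 1
```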


projection-cost preservation

Specifically, we want:

\[\|\tilde{A} - XX^{\top}\tilde{A}\|_F^2 + c = (1 \pm \epsilon)\,\|A - XX^{\top}A\|_F^2 \quad \text{for all } X.\]

If we find an X that gives:

\[\|\tilde{A} - XX^{\top}\tilde{A}\|_F^2 \;\le\; \gamma\,\|\tilde{A} - \tilde{X}_{\mathrm{opt}}\tilde{X}_{\mathrm{opt}}^{\top}\tilde{A}\|_F^2,\]

then:

\[\|A - XX^{\top}A\|_F^2 \;\le\; \gamma\,(1 + \epsilon)\,\|A - X_{\mathrm{opt}}X_{\mathrm{opt}}^{\top}A\|_F^2.\]

See coresets of Feldman, Schmidt, Sohler '13.
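This implication is what makes sketching useful in practice: cluster the small sketch, then read the cost off the original data. A minimal version of that pipeline, using the SVD-based sketch with ⌈k/ϵ⌉ dimensions from the table that follows (all sizes here are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
A = rng.standard_normal((5000, 200))
k, eps = 10, 0.5
m = int(np.ceil(k / eps))                     # sketch dimension ceil(k/eps)

_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_tilde = A @ Vt[:m].T                        # n x m projection-cost preserving sketch

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_tilde)

def cost(M, labels):
    return sum(((M[labels == c] - M[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

print(cost(A, labels))   # evaluated on A, even though clustering was done on A_tilde
```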


recap

What we have seen so far:

∙ k-means clustering is just constrained rank k approximation.
∙ We can construct a projection-cost preserving sketch Ã that approximates the distance from A to any rank k subspace.
∙ Stronger guarantee than has been sought in prior work on approximate PCA via sketching.

our results

What techniques give (1 + ϵ) projection-cost preservation?

Technique | Reference | Previous Work: Dimensions | Previous Work: Error | Our Results: Dimensions | Our Results: Error
SVD | Feldman, Schmidt, Sohler '13 | O(k/ϵ²) | 1 + ϵ | ⌈k/ϵ⌉ | 1 + ϵ
Approximate SVD | Boutsidis, Drineas, Mahoney, Zouzias '11 | k | 2 + ϵ | ⌈k/ϵ⌉ | 1 + ϵ
Random Projection | Boutsidis, Drineas, Mahoney, Zouzias '11 | O(k/ϵ²) | 1 + ϵ | O(k/ϵ²) or O(log k/ϵ²) | 2 + ϵ or 9 + ϵ
Column Sampling | Boutsidis, Drineas, Mahoney, Zouzias '11 | O(k log k/ϵ²) | 3 + ϵ | O(k log k/ϵ²) | 1 + ϵ
Deterministic Column Selection | Boutsidis, Magdon-Ismail '13 | r > k | O(n/r) | O(k/ϵ²) | 1 + ϵ
Non-oblivious Projection | NA | NA | NA | O(k/ϵ) | 1 + ϵ

our results

SVD | Feldman, Schmidt, Sohler '13 | O(k/ϵ²) | 1 + ϵ | ⌈k/ϵ⌉ | 1 + ϵ
Approximate SVD | Boutsidis, Drineas, Mahoney, Zouzias '11 | k | 2 + ϵ | ⌈k/ϵ⌉ | 1 + ϵ

[Diagram: Ã = A V_{k/ε}, i.e. A projected onto its top ⌈k/ε⌉ (possibly approximate) right singular vectors, giving a sketch with k/ε columns]

∙ Practically very useful for k-means
∙ No constant factors on k/ϵ, and typically many fewer dimensions are required (Kappmeier, Schmidt, Schmidt '15)
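Since an approximate SVD suffices, the sketch can be built with a randomized SVD rather than an exact one; a possible sketch of that, using scikit-learn's randomized_svd (sizes are illustrative):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(4)
A = rng.standard_normal((10000, 500))
k, eps = 10, 0.5
m = int(np.ceil(k / eps))

# Top ceil(k/eps) approximate right singular vectors of A.
_, _, Vt = randomized_svd(A, n_components=m, random_state=0)
A_tilde = A @ Vt.T        # n x ceil(k/eps) sketch; cluster A_tilde instead of A
```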


our results

Random Projection | Boutsidis, Drineas, Mahoney, Zouzias '11 | O(k/ϵ²) | 1 + ϵ | O(k/ϵ²) or O(log k/ϵ²) | 2 + ϵ or 9 + ϵ

[Diagram: Ã = A Π, where Π is a d × O(k/ε²) random ±1 matrix]

∙ First sketch with dimension sublinear in k. Can (9 + ϵ) be improved?
∙ Sketch is data oblivious
∙ Lowest communication distributed k-means (improves on Balcan, Kanchanapally, Liang, Woodruff '14)
∙ Streaming principal component analysis in a single pass
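Obliviousness means Π can be drawn before ever seeing A, which is what enables the distributed and streaming applications above. A minimal dense ±1 version (the constant hidden in O(k/ϵ²) is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10000, 500))
k, eps = 10, 0.5
d_prime = int(np.ceil(k / eps**2))            # O(k/eps^2) columns, constant factor omitted

Pi = rng.choice([-1.0, 1.0], size=(A.shape[1], d_prime)) / np.sqrt(d_prime)
A_tilde = A @ Pi                              # data-oblivious n x d' sketch (2 + eps guarantee for k-means)
```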




oblivious projection-cost preserving sketches

Standard sketches for low rank approximation (Sarlós '06, Clarkson, Woodruff '13, etc.):

[Diagram: Ã = A Π, where Π is a d × O(k/ε) random ±1 matrix]

\[\|A - (P_{\tilde{A}}A)_k\|_F^2 \le (1 + \epsilon)\,\|A - A_k\|_F^2.\]

span(Ã) contains a good low rank approximation for A, but we must return to A to find it.

Projection-cost preserving sketch for low rank approximation:

[Diagram: Ã = A Π, where Π is a d × O(k/ε²) random ±1 matrix]

\[\|A - V_kV_k^{\top}A\|_F^2 \le (1 + \epsilon)\,\|A - A_k\|_F^2.\]

Additional ϵ dependence for stronger sketch. Is it required for oblivious approximate PCA? (See Ghashami, Liberty, Phillips, Woodruff '14/15)
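The contrast between the two guarantees can be made concrete: with a standard sketch we must return to A (project onto span(Ã), then take the best rank k), while a projection-cost preserving sketch lets us read an approximate PCA directly off Ã. A rough sketch of both, assuming the top left singular vectors of the wider sketch are used as the approximate principal components (sizes and constants are illustrative):

```python
import numpy as np

def sign_sketch(A, d_prime, seed=0):
    rng = np.random.default_rng(seed)
    Pi = rng.choice([-1.0, 1.0], size=(A.shape[1], d_prime)) / np.sqrt(d_prime)
    return A @ Pi

rng = np.random.default_rng(6)
A = rng.standard_normal((2000, 400))
k = 10

# (1) Standard use: span(A_tilde) contains a good rank-k approximation, but we must return to A.
A_tilde = sign_sketch(A, 2 * k)                        # O(k/eps) columns
Q, _ = np.linalg.qr(A_tilde)                           # basis for span of A_tilde's columns
P = Q @ (Q.T @ A)                                      # project A onto that span (second pass over A)
U, s, Vt = np.linalg.svd(P, full_matrices=False)
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]          # (P_A~ A)_k

# (2) Projection-cost preserving use: approximate PCA straight from the (wider) sketch.
A_tilde2 = sign_sketch(A, 8 * k, seed=1)               # O(k/eps^2) columns
V = np.linalg.svd(A_tilde2, full_matrices=False)[0][:, :k]   # top-k left singular vectors of the sketch
pca_cost = ((A - V @ (V.T @ A)) ** 2).sum()            # should be <= (1 + eps) * ||A - A_k||_F^2
```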

our results

Non-oblivious Randomized Projection | Sarlós '06 | NA | NA | O(k/ϵ) | 1 + ϵ

[Diagram: compute AΠ for a random Π, then form Ã by projecting A onto span(AΠ), an O(k/ε)-dimensional subspace]

∙ Very good practical performance: can get away with narrower Π than approximate SVD, although final point dimension is a constant factor higher.

our results

Column Sampling | Boutsidis, Drineas, Mahoney, Zouzias '11 | O(k log k/ϵ²) | 3 + ϵ | O(k log k/ϵ²) | 1 + ϵ
Deterministic Column Selection | Boutsidis, Magdon-Ismail '13 | r > k | O(n/r) | O(k/ϵ²) | 1 + ϵ

[Diagram: Ã is formed from Õ(k/ε²) sampled and reweighted columns of A]

∙ A not only contains a small set of columns that span a good low rank approximation, but a small (reweighted) set whose top principal components approximate those of A.
∙ First single shot sampling based dimensionality reduction for (1 + ϵ) error low rank approximation.
∙ Applications to streaming column subset selection? Iterative sampling algorithms for the SVD?

analysis for svd

Singular Value Decomposition (SVD):

[Diagram: A = U Σ V^⊤, with Σ = diag(σ_1, σ_2, …, σ_d)]

U, V have orthonormal columns. σ_1 ≥ σ_2 ≥ … ≥ σ_d. Same as Principal Component Analysis (PCA) if we mean center A's columns/rows.

analysis for svd

Partial Singular Value Decomposition (SVD):

[Diagram: A_m = U_m Σ_m V_m^⊤, keeping only the top m singular values and vectors]

\[\|A - A_m\|_F^2 = \min_{\mathrm{rank}(X)=m} \|A - XX^{\top}A\|_F^2\]

Claim: \(\|A_{k/\epsilon} - XX^{\top}A_{k/\epsilon}\|_F^2 + c = (1 \pm \epsilon)\,\|A - XX^{\top}A\|_F^2\)

(Equivalently, the claim can be stated with the sketch \(U_{k/\epsilon}\Sigma_{k/\epsilon} = A V_{k/\epsilon}\) in place of \(A_{k/\epsilon}\); the two differ only by the rotation \(V_{k/\epsilon}^{\top}\).)

We work with A_m for simplicity. Denote (A − A_m) as A_{\setminus m}.

analysis for svd

Claim: \(\|(I - XX^{\top})A_{k/\epsilon}\|_F^2 + c = (1 \pm \epsilon)\,\|(I - XX^{\top})A\|_F^2\)

Split into (row) orthogonal pairs: A = A_{k/ε} + A_{\setminus k/ε}. Then

\begin{align*}
\|(I - XX^{\top})A\|_F^2 &= \|(I - XX^{\top})A_{k/\epsilon}\|_F^2 + \|(I - XX^{\top})A_{\setminus k/\epsilon}\|_F^2 \\
&= \|(I - XX^{\top})A_{k/\epsilon}\|_F^2 + \|A_{\setminus k/\epsilon}\|_F^2 - \|XX^{\top}A_{\setminus k/\epsilon}\|_F^2 \\
&= \|(I - XX^{\top})A_{k/\epsilon}\|_F^2 + c - \|XX^{\top}A_{\setminus k/\epsilon}\|_F^2,
\end{align*}

taking \(c = \|A_{\setminus k/\epsilon}\|_F^2\).
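A quick numerical check of this decomposition (numpy; m plays the role of k/ϵ, and X is an arbitrary rank k orthonormal matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((300, 100))
k, m = 5, 50

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_top = U[:, :m] @ np.diag(s[:m]) @ Vt[:m]     # A_{k/eps}: top of the spectrum
A_tail = A - A_top                             # A_{\k/eps}: bottom of the spectrum

X, _ = np.linalg.qr(rng.standard_normal((300, k)))
proj = lambda M: M - X @ (X.T @ M)             # applies (I - XX^T)

lhs = (proj(A) ** 2).sum()
rhs = (proj(A_top) ** 2).sum() + (A_tail ** 2).sum() - ((X @ (X.T @ A_tail)) ** 2).sum()
assert np.isclose(lhs, rhs)                    # c is exactly ||A_tail||_F^2
```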


analysis for svd

In words, the projection cost for A is explained by the cost over A_{k/ϵ} plus the cost over A_{\k/ϵ}. We want to argue that ignoring the tail term is fine. Need to show that:

\[\|XX^{\top}A_{\setminus k/\epsilon}\|_F^2 \;\le\; \epsilon\,\|(I - XX^{\top})A\|_F^2,\]

for which it suffices to show

\[\|XX^{\top}A_{\setminus k/\epsilon}\|_F^2 \;\le\; \epsilon\,\|A - A_k\|_F^2.\]

[Diagram: the squared singular value spectrum σ_1² ≥ … ≥ σ_d² of A. Since XX^⊤A_{\k/ε} has rank at most k, ‖XX^⊤A_{\k/ε}‖_F² is at most σ²_{k/ε+1} + … + σ²_{k/ε+k}, while ‖A − A_k‖_F² = σ²_{k+1} + … + σ²_d contains k/ε terms that are each at least σ²_{k/ε+1}. Recall that ‖B‖_F² = Σ_i σ_i²(B).]

Σ 2 A σd ∑ ∥ ∥2 2 Recall that B F = i σi (B) 21 ∙ In practice, spectrum decay allows for a much smaller sketching dimension.

5 x 10

2

1.5

1

Squared Singular Value 0.5

0 5 10 15 20 25 30 35 Index

analysis for svd

∙ Analysis is very worst case since we assume σk/ϵ+k is just as big as σk+1.

22 5 x 10

2

1.5

1

Squared Singular Value 0.5

0 5 10 15 20 25 30 35 Index

analysis for svd

∙ Analysis is very worst case since we assume σk/ϵ+k is just as big as σk+1. ∙ In practice, spectrum decay allows for a much smaller sketching dimension.

22 analysis for svd

∙ Analysis is very worst case since we assume σk/ϵ+k is just as big as σk+1. ∙ In practice, spectrum decay allows for a much smaller sketching dimension.

5 x 10

2

1.5

1

Squared Singular Value 0.5

0 5 10 15 20 25 30 35 Index

22 ∥ − ⊤ ∥2 ∙ We have some hope of approximating A\k XX A\k F

proof sketch for random projection

All proofs have a similar flavor: compare

\[\|A_k - XX^{\top}A_k\|_F^2 + \|A_{\setminus k} - XX^{\top}A_{\setminus k}\|_F^2 \quad\text{vs.}\quad \|A_k\Pi - XX^{\top}A_k\Pi\|_F^2 + \|A_{\setminus k}\Pi - XX^{\top}A_{\setminus k}\Pi\|_F^2\]

Except that:

∙ We have to worry about cross terms: \(\|A\Pi - XX^{\top}A\Pi\|_F^2 \ne \|A_k\Pi - XX^{\top}A_k\Pi\|_F^2 + \|A_{\setminus k}\Pi - XX^{\top}A_{\setminus k}\Pi\|_F^2\)
∙ We have some hope of approximating \(\|A_{\setminus k} - XX^{\top}A_{\setminus k}\|_F^2\)

proof sketch for random projection

Rely on standard sketching tools:

∙ Approximate Matrix Multiplication: \(\|A\Pi\Pi^{\top}B - AB\|_F \le \epsilon\,\|A\|_F\,\|B\|_F\)
∙ Subspace Embedding: \(\|YA\Pi\|_F^2 = (1 \pm \epsilon)\,\|YA\|_F^2\) for all Y, if A has rank k
∙ Frobenius Norm Preservation: \(\|A\Pi\|_F^2 = (1 \pm \epsilon)\,\|A\|_F^2\)
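These properties are easy to sanity-check for a dense ±1 sketch; a rough illustration of the last two (one fixed Y only, whereas the actual guarantee is for all Y simultaneously; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
d, d_prime, k = 400, 200, 5
Pi = rng.choice([-1.0, 1.0], size=(d, d_prime)) / np.sqrt(d_prime)

# Frobenius norm preservation: ||A Pi||_F^2 should be close to ||A||_F^2.
A = rng.standard_normal((300, d))
print(((A @ Pi) ** 2).sum() / (A ** 2).sum())

# Subspace embedding for a rank-k matrix B: ||Y B Pi||_F^2 close to ||Y B||_F^2.
B = rng.standard_normal((300, k)) @ rng.standard_normal((k, d))   # rank-k data matrix
Y = rng.standard_normal((50, 300))
print(((Y @ B @ Pi) ** 2).sum() / ((Y @ B) ** 2).sum())
```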

proof sketch for random projection

See paper for the details!


theoretical takeaway

∙ Dimensionality reduction for any constrained low rank approximation can be unified.
∙ Projection-cost preserving sketches guarantee \(\|\tilde{A} - XX^{\top}\tilde{A}\|_F^2 \approx \|A - XX^{\top}A\|_F^2\) for all rank k X.
∙ Standard proof approach: Split A into orthogonal pairs. Top of spectrum can be preserved multiplicatively; only the top singular values of the bottom of the spectrum matter.

proof sketch for sublinear result

Split using optimal clustering:

[Illustration: A = C*(A) + (A − C*(A)), where row i of C*(A) is the optimal cluster centroid µ(a_i) and row i of the remainder is the error e_i]

∙ C*(A) can be approximated with O(log k/ϵ²) dimensions.
∙ Not row orthogonal – have to use the triangle inequality, which leads to the (9 + ϵ) factor.

practical considerations

Experiments, implementations, etc. in Cameron Musco '15

[Bar charts: total runtime split into dimensionality reduction time and clustering time, comparing the full dataset against SVD, approximate SVD, sampling, approximate sampling, random projection, and non-oblivious random projection (NORP) sketches]

Time for ϵ = .01 compared to baseline Lloyd's w/ k-means++.
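A rough timing harness in the same spirit (scikit-learn; the dataset, k, and sketch size below are placeholders rather than the configurations used in the referenced experiments):

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.utils.extmath import randomized_svd

def timed_kmeans(M, k):
    t0 = time.time()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(M)
    return time.time() - t0, km.labels_

rng = np.random.default_rng(9)
A = rng.standard_normal((20000, 1000))
k, m = 25, 50

# Baseline: Lloyd's + k-means++ on the full dataset.
t_full, _ = timed_kmeans(A, k)

# Approximate SVD sketch, then cluster the sketch.
t0 = time.time()
_, _, Vt = randomized_svd(A, n_components=m, random_state=0)
A_tilde = A @ Vt.T
t_reduce = time.time() - t0
t_cluster, _ = timed_kmeans(A_tilde, k)

print(f"full: {t_full:.1f}s   sketch: {t_reduce:.1f}s + {t_cluster:.1f}s clustering")
```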

Thank you!
