The Kernel Trick, Gram Matrices, and Feature Extraction

The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 — Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 — Fall 2017 Principle Component Analysis • Setting: find the dominant eigenvalue-eigenvector pair of a positive semidefinite symmetric matrix A. xT Ax u1 = arg max x xT x • Many ways to write this problem, e.g. B is Frobenius norm k kF B 2 = B2 T 2 k kF i,j λ1u1 = arg min xx A F i j x k − k X X p PCA: A Non-Convex Problem • PCA is not convex in any of these formulations • Why? Think about the solutions to the problem: u and –u • Two distinct solutions à can’t be convex • Can we still use momentum to run PCA more quickly? Power Iteration • Before we apply momentum, we need to choose what base algorithm we’re using. • Simplest algorithm: power iteration • Repeatedly multiply by the matrix A to get an answer xt+1 = Axt Why does Power Iteration Work? n T • Let eigendecomposition of A be A = λiuiui i=1 X • For λ > λ λ λ 1 2 ≥ 1 ≥ ···≥ n • PI converges in direction because cosine-squared of angle to u1 is (uT x )2 (uT Atx )2 λ2t(uT x )2 cos2(✓)= 1 t = 1 0 = 1 1 0 x 2 Atx 2 n λ2t(uT x )2 k tk k 0k i=1 i i 0 n λ2t(uT x )2 λ 2t =1 i=2 i i 0 =1P ⌦ 2 − n λ2t(uT x )2 − λ Pi=1 i i 0 ✓ 1 ◆ ! P What about a more general algorithm? • Use both current iterate, and history of past iterations xt+1 = ↵tAxt + βt,1xt 1 + βt,2xt 2 + + βt,tx0 − − ··· • for fixed parameters ⍺ and β • What class of functions can we express in this form? • Notice: x t is always a degree-t polynomial in A times x0 • Can prove by induction that we can express ANY polynomial Power Iteration and Polynomials • Can also think of power iteration as a degree-t polynomial of A t xt = A x0 t • Is there a better degree-t polynomial to use than f t ( x )= x ? • If we use a different polynomial, then we get n T xt = ft(A)x0 = ft(λi)uiui x0 i=1 X • Ideal solution: choose polynomial with zeros at all non-dominant eigenvalues • Practically, make f ( λ ) to be as large as possible and f (λ) < 1 for all λ < λ t 1 | t | | | 2 Chebyshev Polynomials Again • It turns out that Chebyshev polynomials solve this problem. • Recall: T 0 ( x )=0 ,T 1 ( x )= x and Tn+1(x)=2xTn(x) Tn 1(x) − − • Nice properties: x 1 T (x) 1 | | ) | n | Chebyshev Polynomials 1.5 1 0.5 0 T0(u)=1 -0.5 -1 -1.5 0 0.5 1 1.5 -1.5 -1 -0.5 Chebyshev Polynomials 1.5 1 0.5 0 T1(u)=u -0.5 -1 -1.5 0 0.5 1 1.5 -1.5 -1 -0.5 Chebyshev Polynomials 1.5 1 0.5 0 T (u)=2u2 1 2 − -0.5 -1 -1.5 0 0.5 1 1.5 -1.5 -1 -0.5 Chebyshev Polynomials 1.5 1 0.5 0 -0.5 -1 -1.5 0 0.5 1 1.5 -1.5 -1 -0.5 Chebyshev Polynomials 1.5 1 0.5 0 -0.5 -1 -1.5 0 0.5 1 1.5 -1.5 -1 -0.5 Chebyshev Polynomials 1.5 1 0.5 0 -0.5 -1 -1.5 0 0.5 1 1.5 -1.5 -1 -0.5 Chebyshev Polynomials 1.5 1 0.5 0 -0.5 -1 -1.5 0 0.5 1 1.5 -1.5 -1 -0.5 Chebyshev Polynomials Again • It turns out that Chebyshev polynomials solve this problem. • Recall: T 0 ( x )=0 ,T 1 ( x )= x and Tn+1(x)=2xTn(x) Tn 1(x) − − • Nice properties: n x 1 T (x) 1 Tn(1 + ✏) ⇥ 1+p2✏ | | ) | n | ⇡ ⇣⇣ ⌘ ⌘ Using Chebyshev Polynomials • So we can choose our polynomial f in terms of T • Want: f ( λ ) to be as large as possible, subject to f (λ) < 1 for all λ < λ t 1 | t | | | 2 • To make this work, set x f (x)=T n n λ ✓ 2 ◆ • Can do this by running the update 2A A xt+1 = xt xt 1 xt = Tt x0 λ − − ) λ 2 ✓ 2 ◆ Convergence of Momentum PCA n λi T Tt uiu x0 x i=1 λ2 i t = xt P n ⇣ ⌘ k k T 2 λi (uT x )2 i=1 t λ2 i 0 r P ⇣ ⌘ • Cosine-squared of angle to dominant component: 2 λ1 T 2 T 2 Tt (u1 x0) 2 (u1 xt) λ2 cos (✓)= 2 = xt n ⇣2 ⌘λi T 2 T (u x0) k k i=1 t λ2 i P ⇣ ⌘ Convergence of Momentum PCA (continued) 2 λ1 T 2 T 2 Tt (u1 x0) 2 (u1 xt) λ2 cos (✓)= 2 = xt n ⇣2 ⌘λi T 2 T (u x0) k k i=1 t λ2 i ⇣ ⌘ n P2 λi T 2 i=2 Tt λ (ui x0) =1 2 n ⇣ λ ⌘ − P T 2 i (uT x )2 i=1 t λ2 i 0 n ⇣T ⌘2 P i=2(ui x0) 2 λ1 1 =1 ⌦ Tt− ≥ − T 2 λ1 (uT x )2 − λ2 tP λ2 1 0 ✓ ✓ ◆◆ ⇣ ⌘ Convergence of Momentum PCA (continued) 2 2 λ1 2 λ1 λ2 cos (✓) 1 ⌦ T − =1 ⌦ T − 1+ − ≥ − t λ − t λ ✓ ✓ 2 ◆◆ ✓ ✓ 2 ◆◆ 2t λ λ − =1 ⌦ 1+ 2 1 − 2 λ − 0 r 2 ! 1 • Recall that standard power@ iteration had: A 2t 2t λ λ λ − cos2(✓)=1 ⌦ 2 =1 ⌦ 1+ 1 − 2 − λ − λ ✓ 1 ◆ ! ✓ 2 ◆ ! • So the momentum rate is asymptotically faster than power iteration Questions? The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 — Fall 2017 Basic Linear Models • For classification using model vector w output = sign(wT x) • Optimization methods vary; here’s logistic regression (yi 1, 1 ) 2 {− } 1 n minimize log 1 + exp( wT x y ) w n − i i i=1 X Benefits of Linear Models • Fast classification: just one dot product • Fast training/learning: just a few basic linear algebra operations • Drawback: limited expressivity • Can only capture linear classification boundaries à bad for many problems • How do we let linear models represent a broader class of decision boundaries, while retaining the systems benefits? The Kernel Method • Idea: in a linear model we can think about the similarity between two training examples x and y as being xT y • This is related to the rate at which a random classifier will separate x and y • Kernel methods replace this dot-product similarity with an arbitrary Kernel function that computes the similarity between x and y K(x, y): R X ⇥ X ! Kernel Properties • What properties do kernels need to have to be useful for learning? • Key property: kernel must be symmetric K(x, y)=K(y, x) • Key property: kernel must be positive semi-definite n n ci R,xi , cicjK(xi,xj) 0 8 2 2 X ≥ i=1 j=1 X X • Can check that the dot product has this property Facts about Positive Semidefinite Kernels • Sum of two PSD kernels is a PSD kernel K(x, y)=K1(x, y)+K2(x, y)isaPSDkernel • Product of two PSD kernels is a PSD kernel K(x, y)=K1(x, y)K2(x, y)isaPSDkernel • Scaling by any function on both sides is a kernel K(x, y)=f(x)K1(x, y)f(y)isaPSDkernel Other Kernel Properties • Useful property: kernels are often non-negative K(x, y) 0 ≥ • Useful property: kernels are often scaled such that K(x, y) 1, and K(x, y)=1 x = y , • These properties capture the idea that the kernel is expressing the similarity between x and y Common Kernels • Gaussian kernel/RBF kernel: de-facto kernel in machine learning K(x, y)=exp γ x y 2 − k − k • We can validate that this is a kernel • Symmetric? ✅ • Positive semi-definite? ✅ WHY? • Non-negative? ✅ • Scaled so that K(x,x) = 1? ✅ Common Kernels (continued) • Linear kernel: just the inner product K(x, y)=xT y • Polynomial kernel: K(x, y)=(1+xT y)p • Laplacian kernel: K(x, y)=exp( β x y ) − k − k1 • Last layer of a neural network: if last layer outputs φ(x), then kernel is K(x, y)=φ(x)T φ(y) Classifying with Kernels • An equivalent way of writing a linear model on a training set is n T output(x) = sign w x x 0 i i 1 i=1 ! X @ A • We can kernel-ize this by replacing the dot products with kernel evals n output(x) = sign wiK(xi,x) i=1 ! X Learning with Kernels • An equivalent way of writing linear-model logistic regression is T 1 n n minimize log 1 + exp w x x y w n 0 0− 0 j j1 i i11 i=1 j=1 X B B X CC @ @ @ A AA • We can kernel-ize this by replacing the dot products with kernel evals 1 n n minimize log 1 + exp w y K(x ,x ) w n 0 0− j i j i 11 i=1 j=1 X X @ @ AA The Computational Cost of Kernels • Recall: benefit of learning with kernels is that we can express a wider class of classification functions • Recall: another benefit is linear classifier learning problems are “easy” to solve because they are convex, and gradients easy to compute • Major cost of learning naively with Kernels: have to evaluate K(x, y) • For SGD, need to do this effectively n times per update • Computationally intractable unless K is very simple The Gram Matrix • Address this computational problem by pre-computing the kernel function for all pairs of training examples in the dataset.

Load more