
The Kernel Trick, Gram Matrices, and Feature Extraction
CS6787 Lecture 4 — Fall 2017

Momentum for Principal Component Analysis
CS6787 Lecture 3.1 — Fall 2017

Principal Component Analysis
• Setting: find the dominant eigenvalue-eigenvector pair of a positive semidefinite symmetric matrix A:
  $u_1 = \arg\max_x \frac{x^T A x}{x^T x}$
• There are many ways to write this problem. For example, with the Frobenius norm $\|B\|_F = \sqrt{\sum_i \sum_j B_{ij}^2}$:
  $\lambda_1 u_1 u_1^T = \arg\min_x \left\| x x^T - A \right\|_F^2$

PCA: A Non-Convex Problem
• PCA is not convex in any of these formulations.
• Why? Think about the solutions to the problem: $u_1$ and $-u_1$ are both optimal.
• Two distinct, isolated solutions → the problem can't be convex.
• Can we still use momentum to run PCA more quickly?

Power Iteration
• Before we apply momentum, we need to choose what base algorithm we're using.
• Simplest algorithm: power iteration.
• Repeatedly multiply by the matrix A to get an answer: $x_{t+1} = A x_t$

Why does Power Iteration Work?
• Let the eigendecomposition of A be $A = \sum_{i=1}^n \lambda_i u_i u_i^T$, with $\lambda_1 > \lambda_2 \ge \cdots \ge \lambda_n$.
• Power iteration converges in direction because the cosine-squared of the angle to $u_1$ is
  $\cos^2(\theta) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{(u_1^T A^t x_0)^2}{\|A^t x_0\|^2} = \frac{\lambda_1^{2t} (u_1^T x_0)^2}{\sum_{i=1}^n \lambda_i^{2t} (u_i^T x_0)^2} = 1 - \frac{\sum_{i=2}^n \lambda_i^{2t} (u_i^T x_0)^2}{\sum_{i=1}^n \lambda_i^{2t} (u_i^T x_0)^2} = 1 - O\!\left(\left(\tfrac{\lambda_2}{\lambda_1}\right)^{2t}\right) \to 1$

What about a more general algorithm?
• Use both the current iterate and the history of past iterates,
  $x_{t+1} = \alpha_t A x_t + \beta_{t,1} x_{t-1} + \beta_{t,2} x_{t-2} + \cdots + \beta_{t,t} x_0$
  for fixed parameters α and β.
• What class of functions can we express in this form?
• Notice: $x_t$ is always a degree-$t$ polynomial in A times $x_0$.
• Can prove by induction that updates of this form can express ANY degree-$t$ polynomial in A applied to $x_0$.

Power Iteration and Polynomials
• Can also think of power iteration as a degree-$t$ polynomial of A: $x_t = A^t x_0$.
• Is there a better degree-$t$ polynomial to use than $f_t(x) = x^t$?
• If we use a different polynomial, then we get $x_t = f_t(A) x_0 = \sum_{i=1}^n f_t(\lambda_i) u_i u_i^T x_0$.
• Ideal solution: choose a polynomial with zeros at all non-dominant eigenvalues.
• Practically, make $f_t(\lambda_1)$ as large as possible while keeping $|f_t(\lambda)| \le 1$ for all $|\lambda| \le \lambda_2$.

Chebyshev Polynomials Again
• It turns out that Chebyshev polynomials solve this problem.
• Recall: $T_0(x) = 1$, $T_1(x) = x$, and $T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)$.
• Nice property: $|x| \le 1 \Rightarrow |T_n(x)| \le 1$.

Chebyshev Polynomials
[Figure slides: plots of the Chebyshev polynomials $T_0(u) = 1$, $T_1(u) = u$, $T_2(u) = 2u^2 - 1$, and higher orders on roughly $u \in [-1.5, 1.5]$.]

Chebyshev Polynomials Again
• It turns out that Chebyshev polynomials solve this problem.
• Recall: $T_0(x) = 1$, $T_1(x) = x$, and $T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)$.
• Nice properties: $|x| \le 1 \Rightarrow |T_n(x)| \le 1$, and $T_n(1+\epsilon) = \Theta\!\left(\big(1 + \sqrt{2\epsilon}\big)^n\right)$ for small $\epsilon$.

Using Chebyshev Polynomials
• So we can choose our polynomial $f$ in terms of $T$.
• Want: $f_t(\lambda_1)$ as large as possible, subject to $|f_t(\lambda)| \le 1$ for all $|\lambda| \le \lambda_2$.
• To make this work, set $f_n(x) = T_n\!\left(\frac{x}{\lambda_2}\right)$.
• Can do this by running the update
  $x_{t+1} = \frac{2A}{\lambda_2} x_t - x_{t-1} \;\;\Rightarrow\;\; x_t = T_t\!\left(\frac{A}{\lambda_2}\right) x_0$

Convergence of Momentum PCA
• With $x_t = T_t(A/\lambda_2)\, x_0$, the normalized iterate is
  $\frac{x_t}{\|x_t\|} = \frac{\sum_{i=1}^n T_t(\lambda_i/\lambda_2)\, u_i u_i^T x_0}{\sqrt{\sum_{i=1}^n T_t(\lambda_i/\lambda_2)^2\, (u_i^T x_0)^2}}$
• Cosine-squared of the angle to the dominant component:
  $\cos^2(\theta) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{T_t(\lambda_1/\lambda_2)^2\, (u_1^T x_0)^2}{\sum_{i=1}^n T_t(\lambda_i/\lambda_2)^2\, (u_i^T x_0)^2}$

Convergence of Momentum PCA (continued)
• $\cos^2(\theta) = 1 - \frac{\sum_{i=2}^n T_t(\lambda_i/\lambda_2)^2\, (u_i^T x_0)^2}{\sum_{i=1}^n T_t(\lambda_i/\lambda_2)^2\, (u_i^T x_0)^2} \;\ge\; 1 - \frac{\sum_{i=2}^n (u_i^T x_0)^2}{T_t(\lambda_1/\lambda_2)^2\, (u_1^T x_0)^2} \;=\; 1 - O\!\left(T_t\!\left(\tfrac{\lambda_1}{\lambda_2}\right)^{-2}\right)$
  (the inequality uses $|T_t(\lambda_i/\lambda_2)| \le 1$ for $\lambda_i \le \lambda_2$).

Convergence of Momentum PCA (continued)
• $\cos^2(\theta) \ge 1 - O\!\left(T_t\!\left(\tfrac{\lambda_1}{\lambda_2}\right)^{-2}\right) = 1 - O\!\left(T_t\!\left(1 + \tfrac{\lambda_1 - \lambda_2}{\lambda_2}\right)^{-2}\right) = 1 - O\!\left(\left(1 + \sqrt{\tfrac{2(\lambda_1 - \lambda_2)}{\lambda_2}}\right)^{-2t}\right)$
• Recall that standard power iteration had
  $\cos^2(\theta) = 1 - O\!\left(\left(\tfrac{\lambda_2}{\lambda_1}\right)^{2t}\right) = 1 - O\!\left(\left(1 + \tfrac{\lambda_1 - \lambda_2}{\lambda_2}\right)^{-2t}\right)$
• So the momentum rate is asymptotically faster than power iteration.
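To make the comparison concrete, here is a minimal sketch, assuming NumPy, that runs plain power iteration and the Chebyshev/momentum update $x_{t+1} = \frac{2A}{\lambda_2} x_t - x_{t-1}$ side by side and reports $\cos^2(\theta)$ against the true dominant eigenvector. The test matrix, its spectrum, and the iteration count are arbitrary illustrative choices, and $\lambda_2$ is taken from a dense eigensolver rather than estimated; none of this comes from the lecture itself.

```python
# A minimal sketch comparing plain power iteration with the Chebyshev/momentum
# update (assumes NumPy; matrix, spectrum, and iteration count are illustrative,
# and lambda_2 comes from a dense eigensolver rather than being estimated).
import numpy as np

rng = np.random.default_rng(0)
n, t_max = 100, 100
vals = np.linspace(0.0, 1.0, n)              # eigenvalues with a small gap at the top
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(vals) @ Q.T                  # symmetric PSD test matrix

eigvals, eigvecs = np.linalg.eigh(A)         # eigenvalues in ascending order
u1, lam2 = eigvecs[:, -1], eigvals[-2]       # dominant eigenvector and lambda_2

def cos2(x):
    """Cosine-squared of the angle between x and the dominant eigenvector u_1."""
    return (u1 @ x) ** 2 / (x @ x)

x0 = rng.standard_normal(n)

# Plain power iteration: x_{t+1} = A x_t (renormalized for numerical stability;
# rescaling does not change the direction of the iterate).
x = x0.copy()
for _ in range(t_max):
    x = A @ x
    x /= np.linalg.norm(x)

# Momentum / Chebyshev iteration: x_{t+1} = (2A / lambda_2) x_t - x_{t-1},
# which gives x_t = T_t(A / lambda_2) x_0.
prev, cur = x0.copy(), (A @ x0) / lam2       # x_0 and x_1 = T_1(A / lambda_2) x_0
for _ in range(t_max - 1):
    prev, cur = cur, (2.0 / lam2) * (A @ cur) - prev

print(f"power iteration: cos^2(theta) = {cos2(x):.10f}")
print(f"momentum update: cos^2(theta) = {cos2(cur):.10f}")
```

On a spectrum with a small relative gap like this one, the momentum iterate should end up visibly closer to $u_1$ after the same number of matrix-vector products, consistent with the rates above.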
Questions?

The Kernel Trick, Gram Matrices, and Feature Extraction
CS6787 Lecture 4 — Fall 2017

Basic Linear Models
• For classification using a model vector w: $\text{output} = \operatorname{sign}(w^T x)$.
• Optimization methods vary; here's logistic regression (with labels $y_i \in \{-1, +1\}$):
  $\min_w \; \frac{1}{n} \sum_{i=1}^n \log\!\left(1 + \exp(-w^T x_i y_i)\right)$

Benefits of Linear Models
• Fast classification: just one dot product.
• Fast training/learning: just a few basic linear algebra operations.
• Drawback: limited expressivity.
• Can only capture linear classification boundaries → bad for many problems.
• How do we let linear models represent a broader class of decision boundaries, while retaining the systems benefits?

The Kernel Method
• Idea: in a linear model we can think of the similarity between two training examples x and y as the dot product $x^T y$.
• This is related to the rate at which a random classifier will separate x and y.
• Kernel methods replace this dot-product similarity with an arbitrary kernel function that computes the similarity between x and y: $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.

Kernel Properties
• What properties do kernels need to have to be useful for learning?
• Key property: the kernel must be symmetric: $K(x, y) = K(y, x)$.
• Key property: the kernel must be positive semi-definite:
  $\forall c_i \in \mathbb{R},\; \forall x_i \in \mathcal{X}: \;\; \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \ge 0$
• Can check that the dot product has this property.

Facts about Positive Semidefinite Kernels
• The sum of two PSD kernels is a PSD kernel: $K(x, y) = K_1(x, y) + K_2(x, y)$ is a PSD kernel.
• The product of two PSD kernels is a PSD kernel: $K(x, y) = K_1(x, y)\, K_2(x, y)$ is a PSD kernel.
• Scaling by any function on both sides gives a kernel: $K(x, y) = f(x)\, K_1(x, y)\, f(y)$ is a PSD kernel.

Other Kernel Properties
• Useful property: kernels are often non-negative: $K(x, y) \ge 0$.
• Useful property: kernels are often scaled such that $K(x, y) \le 1$, and $K(x, y) = 1 \Leftrightarrow x = y$.
• These properties capture the idea that the kernel expresses the similarity between x and y.

Common Kernels
• Gaussian kernel/RBF kernel: the de-facto kernel in machine learning:
  $K(x, y) = \exp\!\left(-\gamma \|x - y\|^2\right)$
• We can validate that this is a kernel:
  • Symmetric? ✅
  • Positive semi-definite? ✅ WHY?
  • Non-negative? ✅
  • Scaled so that K(x, x) = 1? ✅
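As a quick numerical spot-check of these properties (it illustrates positive semi-definiteness on one sample rather than proving it), here is a minimal sketch, assuming NumPy and an arbitrary random set of points, that builds the Gaussian-kernel matrix and checks symmetry, the $K(x, x) = 1$ scaling, and the absence of negative eigenvalues.

```python
# A small numerical sanity check of the Gaussian/RBF kernel properties
# (assumes NumPy; the random sample points and gamma are illustrative only).
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian/RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = x - y
    return np.exp(-gamma * np.dot(diff, diff))

rng = np.random.default_rng(0)
xs = rng.standard_normal((20, 5))                # 20 arbitrary points in R^5

K = np.array([[rbf_kernel(a, b) for b in xs] for a in xs])

print(np.allclose(K, K.T))                       # symmetric: True
print(np.allclose(np.diag(K), 1.0))              # K(x, x) = 1: True
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # no negative eigenvalues on this sample: True
```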
Common Kernels (continued)
• Linear kernel: just the inner product: $K(x, y) = x^T y$.
• Polynomial kernel: $K(x, y) = (1 + x^T y)^p$.
• Laplacian kernel: $K(x, y) = \exp\!\left(-\beta \|x - y\|_1\right)$.
• Last layer of a neural network: if the last layer outputs $\phi(x)$, then the kernel is $K(x, y) = \phi(x)^T \phi(y)$.

Classifying with Kernels
• An equivalent way of writing a linear model on a training set is
  $\text{output}(x) = \operatorname{sign}\!\left(\sum_{i=1}^n w_i\, x_i^T x\right)$
• We can kernel-ize this by replacing the dot products with kernel evaluations:
  $\text{output}(x) = \operatorname{sign}\!\left(\sum_{i=1}^n w_i\, K(x_i, x)\right)$

Learning with Kernels
• An equivalent way of writing linear-model logistic regression is
  $\min_w \; \frac{1}{n} \sum_{i=1}^n \log\!\left(1 + \exp\!\left(-\Big(\sum_{j=1}^n w_j x_j\Big)^T x_i\, y_i\right)\right)$
• We can kernel-ize this by replacing the dot products with kernel evaluations:
  $\min_w \; \frac{1}{n} \sum_{i=1}^n \log\!\left(1 + \exp\!\left(-y_i \sum_{j=1}^n w_j\, K(x_j, x_i)\right)\right)$

The Computational Cost of Kernels
• Recall: one benefit of learning with kernels is that we can express a wider class of classification functions.
• Recall: another benefit is that linear classifier learning problems are "easy" to solve because they are convex and their gradients are easy to compute.
• Major cost of learning naively with kernels: we have to evaluate K(x, y).
• For SGD, we need to do this effectively n times per update.
• Computationally intractable unless K is very simple.

The Gram Matrix
• Address this computational problem by pre-computing the kernel function for all pairs of training examples in the dataset.
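To make the pre-computation idea concrete, here is a minimal sketch, assuming NumPy, that builds the Gram matrix $G_{ij} = K(x_i, x_j)$ once and then evaluates the kernelized logistic-regression loss from the earlier slide using only rows of G. The dataset, labels, and gamma are hypothetical placeholders, not anything from the lecture.

```python
# A minimal sketch of precomputing the Gram matrix and reusing it during
# training (assumes NumPy; data, labels, and hyperparameters are placeholders).
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 100, 10, 0.1
X = rng.standard_normal((n, d))              # training examples x_1, ..., x_n
y = rng.choice([-1.0, 1.0], size=n)          # labels y_i in {-1, +1}

# Precompute G[i, j] = K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) once.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
G = np.exp(-gamma * sq_dists)

def kernel_logistic_loss(w, G, y):
    """(1/n) * sum_i log(1 + exp(-y_i * sum_j w_j K(x_j, x_i)))."""
    margins = y * (G @ w)                    # sum_j w_j K(x_j, x_i) for every i
    return np.mean(np.log1p(np.exp(-margins)))

w = np.zeros(n)                              # one weight per training example
print(kernel_logistic_loss(w, G, y))         # log(2) at w = 0
```

Once G is in memory, a stochastic-gradient step for example i only needs the row G[i, :], instead of n fresh kernel evaluations per update.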