
ECE595 / STAT598 I, Lecture 03: Regression with Kernels

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering Purdue University

Outline

Mathematical Background
Lecture 1: Linear regression: A basic data analytic tool
Lecture 2: Regularization: Constraining the solution
Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method
Kernel Method: Dual Form, Kernel Trick, Algorithm
Examples: Radial Basis Function (RBF), Regression using RBF, Kernel Methods in Classification

Why Another Method?

Linear regression: Pick a global model, best fit globally.
Kernel method: Pick a local model, best fit locally.
In the kernel method, instead of picking a line or a quadratic equation, we pick a kernel. A kernel is a measure of distance between training samples.
The kernel method buys us the ability to handle nonlinearity.
Ordinary regression is based on the columns (features) of A. The kernel method is based on the rows (samples) of A.

Pictorial Illustration

Overview of the Method

Model Parameter: We want the model parameter $\widehat{\theta}$ to look like (How? Question 1):
$$\widehat{\theta} = \sum_{n=1}^{N} \alpha_n x^n.$$
This model expresses $\widehat{\theta}$ as a combination of the samples. The trainable parameters are $\alpha_n$, where $n = 1, \ldots, N$. If we can make $\alpha_n$ local, i.e., non-zero for only a few of them, then we can achieve our goal: localized, sample-dependent.

Predicted Value: The predicted value of a new sample $x$ is
$$\widehat{y} = \widehat{\theta}^T x = \sum_{n=1}^{N} \alpha_n \langle x, x^n \rangle.$$
We want this model to encapsulate nonlinearity. (How? Question 2)

Dual Form of Linear Regression

Goal: Addresses Question 1: Express $\widehat{\theta}$ as
$$\widehat{\theta} = \sum_{n=1}^{N} \alpha_n x^n.$$

We start by listing out a technical lemma.

Lemma. For any $A \in \mathbb{R}^{N \times d}$, $y \in \mathbb{R}^{N}$, and $\lambda > 0$,

$$(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y. \tag{1}$$

Proof: See Appendix.

Remark: The dimension of $I$ on the left is $d \times d$; on the right it is $N \times N$. If $\lambda = 0$, then the above is true only when $A$ is invertible.

Dual Form of Linear Regression

Using the Lemma, we can show that
$$\widehat{\theta} = (A^T A + \lambda I)^{-1} A^T y \qquad \text{(Primal Form)}$$
$$= A^T \underbrace{(A A^T + \lambda I)^{-1} y}_{\stackrel{\text{def}}{=}\ \alpha} \qquad \text{(Dual Form)}$$
$$= \begin{bmatrix} (x^1)^T \\ (x^2)^T \\ \vdots \\ (x^N)^T \end{bmatrix}^T \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{bmatrix} = \sum_{n=1}^{N} \alpha_n x^n, \qquad \alpha_n \stackrel{\text{def}}{=} [(A A^T + \lambda I)^{-1} y]_n.$$
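As a quick numerical sanity check (my own sketch, not from the slides; the sizes and $\lambda$ below are arbitrary), one can verify that the primal and dual forms agree:

```python
import numpy as np

# Hypothetical sizes: N samples, d features, arbitrary ridge parameter.
np.random.seed(0)
N, d, lam = 50, 5, 0.1
A = np.random.randn(N, d)          # rows are the samples (x^n)^T
y = np.random.randn(N)

theta_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)
theta_dual   = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(N), y)

print(np.allclose(theta_primal, theta_dual))   # True, as the Lemma predicts
```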

The Kernel Trick

Goal: Addresses Question 2: Introduce nonlinearity to
$$\widehat{y} = \widehat{\theta}^T x = \sum_{n=1}^{N} \alpha_n \langle x, x^n \rangle.$$
The Idea: Replace the inner product $\langle x, x^n \rangle$ by $k(x, x^n)$:

$$\widehat{y} = \widehat{\theta}^T x = \sum_{n=1}^{N} \alpha_n k(x, x^n).$$
$k(\cdot, \cdot)$ is called a kernel. A kernel is a measure of the distance between two samples $x^i$ and $x^j$. $\langle x^i, x^j \rangle$ measures distance in the ambient space, whereas $k(x^i, x^j)$ measures distance in a transformed space. In particular, a valid kernel takes the form $k(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$ for some nonlinear transform $\phi$.

Kernels Illustrated

A kernel typically lifts the ambient dimension to a higher one. For example, mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$:
$$x^n = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \text{and} \quad \phi(x^n) = \begin{bmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix}.$$

Relationship between Kernel and Transform

Consider the following kernel $k(u, v) = (u^T v)^2$. What is the transform?
Suppose $u$ and $v$ are in $\mathbb{R}^2$. Then $(u^T v)^2$ is
$$(u^T v)^2 = \left( \sum_{i=1}^{2} u_i v_i \right) \left( \sum_{j=1}^{2} u_j v_j \right) = \sum_{i=1}^{2} \sum_{j=1}^{2} (u_i u_j)(v_i v_j) = \begin{bmatrix} u_1^2 & u_1 u_2 & u_2 u_1 & u_2^2 \end{bmatrix} \begin{bmatrix} v_1^2 \\ v_1 v_2 \\ v_2 v_1 \\ v_2^2 \end{bmatrix}.$$
So if we define $\phi$ as
$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \mapsto \phi(u) = \begin{bmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{bmatrix},$$
then $(u^T v)^2 = \langle \phi(u), \phi(v) \rangle$.
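A short numerical check of this identity (my own sketch): for random $u, v \in \mathbb{R}^2$, $(u^T v)^2$ equals $\langle \phi(u), \phi(v) \rangle$ with $\phi$ defined as above.

```python
import numpy as np

def phi(u):
    # Explicit feature map for the kernel k(u, v) = (u^T v)^2 in R^2.
    return np.array([u[0]**2, u[0]*u[1], u[1]*u[0], u[1]**2])

u, v = np.random.randn(2), np.random.randn(2)
print(np.isclose((u @ v)**2, phi(u) @ phi(v)))   # True
```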

Radial Basis Function

A useful kernel is the radial basis function (RBF) kernel:
$$k(u, v) = \exp\left( -\frac{\|u - v\|^2}{2\sigma^2} \right).$$
The corresponding nonlinear transform of the RBF is infinite dimensional; see the Appendix. $\|u - v\|^2$ measures the distance between two data points $u$ and $v$. $\sigma$ is the standard deviation, defining "far" and "close". The RBF enforces local structure; only a few samples are used.

Kernel Method

Given the choice of the kernel function, we can write down the algorithm as follows.
1. Pick a kernel function $k(\cdot, \cdot)$.
2. Construct a kernel matrix $K \in \mathbb{R}^{N \times N}$, where $[K]_{ij} = k(x^i, x^j)$, for $i = 1, \ldots, N$ and $j = 1, \ldots, N$.
3. Compute the coefficients $\alpha \in \mathbb{R}^{N}$, with

$$\alpha_n = [(K + \lambda I)^{-1} y]_n.$$

4. Estimate the predicted value for a new sample $x$:

$$g_{\theta}(x) = \sum_{n=1}^{N} \alpha_n k(x, x^n).$$
Therefore, the choice of the regression function is shifted to the choice of the kernel.
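Below is a minimal NumPy sketch of these four steps, assuming an RBF kernel and made-up 1D data; the function names (rbf_kernel, fit_alpha, predict) are mine, not from the slides.

```python
import numpy as np

def rbf_kernel(u, v, sigma=5.0):
    # k(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def fit_alpha(X, y, lam=0.1, sigma=5.0):
    # Step 2: kernel matrix K with [K]_ij = k(x^i, x^j)
    N = len(X)
    K = np.array([[rbf_kernel(X[i], X[j], sigma) for j in range(N)] for i in range(N)])
    # Step 3: alpha = (K + lam I)^{-1} y
    return np.linalg.solve(K + lam * np.eye(N), y)

def predict(x_new, X, alpha, sigma=5.0):
    # Step 4: g(x) = sum_n alpha_n k(x, x^n)
    return sum(a * rbf_kernel(x_new, xn, sigma) for a, xn in zip(alpha, X))

# Toy example: noisy samples on time stamps t = 0, ..., 99
t = np.arange(100, dtype=float).reshape(-1, 1)
y = np.sin(t.ravel() / 15.0) + 0.1 * np.random.randn(100)
alpha = fit_alpha(t, y)
print(predict(np.array([50.0]), t, alpha))
```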

Outline

Mathematical Background
Lecture 1: Linear regression: A basic data analytic tool
Lecture 2: Regularization: Constraining the solution
Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method
Kernel Method: Dual Form, Kernel Trick, Algorithm
Examples: Radial Basis Function (RBF), Regression using RBF, Kernel Methods in Classification

Example

Goal: Use the kernel method to fit the data points shown below.
What is the input feature vector $x^n$? $x^n = t_n$: the time stamps.
What is the output $y^n$? $y^n$ is the height.
Which kernel to choose? Let us consider the RBF.

(Scatter plot of the noisy data points: time stamps from 0 to 100 on the horizontal axis, heights roughly between -0.1 and 0.8 on the vertical axis.)

Example (using RBF)

Define the fitted function as $g_{\theta}(t)$. [Here, $\theta$ refers to $\alpha$.] The RBF is defined as $k(t_i, t_j) = \exp\{-(t_i - t_j)^2 / 2\sigma^2\}$, for some $\sigma$. The matrix $K$ looks like the following:

$$[K]_{ij} = \exp\{-(t_i - t_j)^2 / 2\sigma^2\}.$$

(Image of the kernel matrix $K$, with row and column indices from 1 to 100.)

$K$ is a banded diagonal matrix. (Why?) The coefficient vector is given by $\alpha_n = [(K + \lambda I)^{-1} y]_n$.

Example (using RBF)

Using the RBF, the predicted value is given by

$$g_{\theta}(t^{\mathrm{new}}) = \sum_{n=1}^{N} \alpha_n k(t^{\mathrm{new}}, t_n) = \sum_{n=1}^{N} \alpha_n e^{-\frac{(t^{\mathrm{new}} - t_n)^2}{2\sigma^2}}.$$

Pictorially, the predicted function $g_{\theta}$ can be viewed as a linear combination of Gaussian kernels centered at the training time stamps.

Effect of σ

Large $\sigma$: flat kernel, over-smoothing. Small $\sigma$: narrow kernel, under-smoothing. Below is an example of the fitting and the kernel matrix $K$.

(Three fitting results and their kernel matrices, for $\sigma$ too large, about right, and too small. Each plot shows the noisy samples, the kernel regression fit, and the ground truth.)
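To see the same tradeoff numerically, here is a small sketch of my own that sweeps $\sigma$ (reusing the hypothetical fit_alpha and predict helpers from the earlier code block):

```python
import numpy as np

# Assumes fit_alpha and predict from the earlier sketch are in scope.
t = np.arange(100, dtype=float).reshape(-1, 1)
truth = np.sin(t.ravel() / 15.0)
y = truth + 0.1 * np.random.randn(100)

for sigma in [30.0, 5.0, 0.5]:               # too large, about right, too small
    alpha = fit_alpha(t, y, lam=0.1, sigma=sigma)
    fit = np.array([predict(x, t, alpha, sigma) for x in t])
    mse = np.mean((fit - truth) ** 2)
    print(f"sigma = {sigma:5.1f}, MSE against ground truth = {mse:.4f}")
    # Error tends to be larger at both extremes of sigma.
```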

Any Improvement?

We can improve the above kernel by considering $x^n = [y_n, t_n]^T$. Define the kernel as
$$k(x^i, x^j) = \exp\left\{ -\left( \frac{(y_i - y_j)^2}{2\sigma_r^2} + \frac{(t_i - t_j)^2}{2\sigma_s^2} \right) \right\}.$$
This new kernel is adaptive (edge-aware).
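A minimal sketch of this adaptive kernel (my own illustration; the name adaptive_kernel and the parameter values $\sigma_r$, $\sigma_s$ are placeholders, not from the slides):

```python
import numpy as np

def adaptive_kernel(xi, xj, sigma_r=0.1, sigma_s=5.0):
    # x^n = [y_n, t_n]: compare samples in both value (range) and time (space).
    yi, ti = xi
    yj, tj = xj
    return np.exp(-((yi - yj) ** 2 / (2 * sigma_r ** 2)
                    + (ti - tj) ** 2 / (2 * sigma_s ** 2)))

# Two samples close in time but far apart in height get a small weight,
# so the fit does not smooth across sharp jumps in the signal.
print(adaptive_kernel(np.array([0.1, 40.0]), np.array([0.7, 42.0])))
```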

Any Improvement?

Here is a comparison.

(Two fitting results on the same noisy data: the left panel is without the improvement, the right panel is with the improvement. Each plot shows the noisy samples, the kernel regression fit, and the ground truth.)

This idea is known as the bilateral filter in the computer vision literature. It can be further extended to 2D images, where $x^n = [y_n, s_n]$ for some spatial coordinate $s_n$. Many applications; see the Reading List.

Kernel Methods in Classification

The concept of lifting the data to a higher dimension is useful for classification. [1]

[1] Image source: https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78

Kernels in Support Vector Machines

Example: RBF for SVM. (We will discuss SVM later in the semester.) The radial basis function is often used in support vector machines. A poor choice of parameter can lead to low training loss, but with the risk of over-fitting. Under-fitted data can sometimes give better generalization.

Reading List

Kernel Method:
Learning from Data (Chapter 3.4) https://work.caltech.edu/telecourse
CMU 10701 Lecture 4 https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kernels_SVM_04_7_2011-ann.pdf
Berkeley CS 194 Lecture 7 https://people.eecs.berkeley.edu/~russell/classes/cs194/
Oxford C19 Lecture 3 http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

Kernel Regression in Computer Vision:
Bilateral Filter https://people.csail.mit.edu/sparis/bf_course/course_notes.pdf
Takeda and Milanfar, "Kernel regression for image processing and reconstruction", IEEE Trans. Image Process. (2007) https://ieeexplore.ieee.org/document/4060955

Appendix

Proof of Lemma

Lemma. For any matrix $A \in \mathbb{R}^{N \times d}$, $y \in \mathbb{R}^{N}$, and $\lambda > 0$,

$$(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y. \tag{2}$$

The left hand side is the solution to the normal equation, which means $A^T A \theta + \lambda \theta = A^T y$.
Rearranging terms gives $\theta = A^T \left( \frac{1}{\lambda}(y - A\theta) \right)$.
Define $\alpha = \frac{1}{\lambda}(y - A\theta)$; then $\theta = A^T \alpha$.
Substituting $\theta = A^T \alpha$ into $\alpha = \frac{1}{\lambda}(y - A\theta)$, we have
$$\alpha = \frac{1}{\lambda}(y - A A^T \alpha).$$
Rearranging terms gives $(A A^T + \lambda I)\alpha = y$, which yields $\alpha = (A A^T + \lambda I)^{-1} y$.
Substituting into $\theta = A^T \alpha$ gives $\theta = A^T (A A^T + \lambda I)^{-1} y$.

Non-Linear Transform for RBF

Let us consider a scalar $u \in \mathbb{R}$.
$$k(u, v) = \exp\{-(u - v)^2\} = \exp\{-u^2\} \exp\{2uv\} \exp\{-v^2\}$$
$$= \exp\{-u^2\} \exp\{-v^2\} \left( \sum_{k=0}^{\infty} \frac{2^k u^k v^k}{k!} \right)$$
$$= \exp\{-u^2\} \left( 1, \sqrt{\tfrac{2^1}{1!}}\, u, \sqrt{\tfrac{2^2}{2!}}\, u^2, \sqrt{\tfrac{2^3}{3!}}\, u^3, \ldots \right)^T \left( 1, \sqrt{\tfrac{2^1}{1!}}\, v, \sqrt{\tfrac{2^2}{2!}}\, v^2, \sqrt{\tfrac{2^3}{3!}}\, v^3, \ldots \right) \exp\{-v^2\}.$$
So $\phi$ is
$$\phi(x) = \exp\{-x^2\} \left( 1, \sqrt{\tfrac{2^1}{1!}}\, x, \sqrt{\tfrac{2^2}{2!}}\, x^2, \sqrt{\tfrac{2^3}{3!}}\, x^3, \ldots \right).$$
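A numerical check of this expansion (my own sketch, not part of the slides): truncating the series at a moderate number of terms already reproduces $k(u, v) = \exp\{-(u - v)^2\}$.

```python
import math

def k_exact(u, v):
    return math.exp(-(u - v) ** 2)

def k_series(u, v, terms=30):
    # exp(-u^2) exp(-v^2) * sum_k 2^k u^k v^k / k!
    s = sum(2 ** k * u ** k * v ** k / math.factorial(k) for k in range(terms))
    return math.exp(-u ** 2) * math.exp(-v ** 2) * s

u, v = 0.7, -1.3
print(k_exact(u, v), k_series(u, v))   # the two values agree to many decimal places
```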

Kernels are Positive Semi-Definite

Given $\{x_j\}_{j=1}^{N}$, construct an $N \times N$ matrix $K$ such that

$$[K]_{ij} = K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j).$$

Claim: K is positive semi-definite.

Let z be an arbitrary vector. Then,

$$z^T K z = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i K_{ij} z_j = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i \Phi(x_i)^T \Phi(x_j) z_j$$
$$= \sum_{i=1}^{N} \sum_{j=1}^{N} z_i \left( \sum_{k} [\Phi(x_i)]_k [\Phi(x_j)]_k \right) z_j \stackrel{(a)}{=} \sum_{k} \left( \sum_{i=1}^{N} [\Phi(x_i)]_k z_i \right)^2 \geq 0,$$

where $[\Phi(x_i)]_k$ denotes the $k$-th element of the vector $\Phi(x_i)$.
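As a quick numerical illustration of this claim (my own sketch, with arbitrary data), one can build an RBF kernel matrix and verify that $z^T K z \geq 0$, or equivalently that the eigenvalues of $K$ are non-negative up to rounding:

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(20, 3)                        # 20 arbitrary samples in R^3
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dist / 2.0)                        # RBF kernel matrix

z = np.random.randn(20)
print(z @ K @ z >= 0)                             # True
print(np.linalg.eigvalsh(K).min() >= -1e-10)      # True: eigenvalues are non-negative
```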

Existence of Nonlinear Transform

We just showed that: If $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$ for any $x_1, \ldots, x_N$, then $K$ is symmetric positive semi-definite. The converse also holds: If $K$ is symmetric positive semi-definite for any $x_1, \ldots, x_N$, then there exists a $\Phi$ such that $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. This converse is difficult to prove; it is called the Mercer Condition. Kernels satisfying Mercer's condition have a $\Phi$. You can use the condition to rule out invalid kernels, but proving that a kernel is valid is still hard.
