
ECE595 / STAT598 I, Lecture 03: Regression with Kernels

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering Purdue University

Outline

Mathematical Background
Lecture 1: Linear regression: A basic data analytic tool
Lecture 2: Regularization: Constraining the solution
Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method
Kernel Method: Dual Form, Kernel Trick, Algorithm
Examples: Radial Basis Function (RBF), Regression using RBF, Kernel Methods in Classification

Why Another Method?

Linear regression: Pick a global model, best fit globally.
Kernel method: Pick a local model, best fit locally.
In the kernel method, instead of picking a line or a quadratic equation, we pick a kernel. A kernel is a measure of distance between training samples.
The kernel method buys us the ability to handle nonlinearity.
Ordinary regression is based on the columns (features) of A. The kernel method is based on the rows (samples) of A.

Pictorial Illustration

Overview of the Method

Model Parameter: We want the model parameter $\widehat{\theta}$ to look like (How? Question 1):
$$\widehat{\theta} = \sum_{n=1}^{N} \alpha_n x^n.$$
This model expresses $\widehat{\theta}$ as a combination of the samples. The trainable parameters are $\alpha_n$, where $n = 1, \ldots, N$. If we can make $\alpha_n$ local, i.e., non-zero for only a few of them, then we can achieve our goal: localized, sample-dependent.

Predicted Value: The predicted value of a new sample $x$ is
$$\widehat{y} = \widehat{\theta}^T x = \sum_{n=1}^{N} \alpha_n \langle x, x^n \rangle.$$
We want this model to encapsulate nonlinearity. (How? Question 2)

Dual Form of Linear Regression

Goal: Addresses Question 1: Express $\widehat{\theta}$ as
$$\widehat{\theta} = \sum_{n=1}^{N} \alpha_n x^n.$$

We start by listing out a technical lemma.

Lemma. For any $A \in \mathbb{R}^{N \times d}$, $y \in \mathbb{R}^{N}$, and $\lambda > 0$,

$$(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y. \tag{1}$$

Proof: See Appendix.

Remark: The dimension of $I$ on the left is $d \times d$; on the right it is $N \times N$. If $\lambda = 0$, then the above is true only when $A$ is invertible.

Dual Form of Linear Regression

Using the Lemma, we can show that
$$\widehat{\theta} = (A^T A + \lambda I)^{-1} A^T y \qquad \text{(Primal Form)}$$
$$= A^T \underbrace{(A A^T + \lambda I)^{-1} y}_{\stackrel{\text{def}}{=}\ \alpha} \qquad \text{(Dual Form)}$$
$$= \begin{bmatrix} (x^1)^T \\ (x^2)^T \\ \vdots \\ (x^N)^T \end{bmatrix}^T \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{bmatrix} = \sum_{n=1}^{N} \alpha_n x^n, \qquad \alpha_n \stackrel{\text{def}}{=} [(A A^T + \lambda I)^{-1} y]_n.$$
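As a quick numerical sanity check (my own sketch, not from the slides; the sizes and $\lambda$ below are arbitrary), one can verify that the primal and dual forms agree:

```python
import numpy as np

# Hypothetical sizes: N samples, d features, arbitrary ridge parameter.
np.random.seed(0)
N, d, lam = 50, 5, 0.1
A = np.random.randn(N, d)          # rows are the samples (x^n)^T
y = np.random.randn(N)

theta_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)
theta_dual   = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(N), y)

print(np.allclose(theta_primal, theta_dual))   # True, as the Lemma predicts
```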

The Kernel Trick

Goal: Addresses Question 2: Introduce nonlinearity to
$$\widehat{y} = \widehat{\theta}^T x = \sum_{n=1}^{N} \alpha_n \langle x, x^n \rangle.$$
The Idea: Replace the inner product $\langle x, x^n \rangle$ by $k(x, x^n)$:

$$\widehat{y} = \widehat{\theta}^T x = \sum_{n=1}^{N} \alpha_n k(x, x^n).$$
$k(\cdot, \cdot)$ is called a kernel. A kernel is a measure of the distance between two samples $x^i$ and $x^j$. $\langle x^i, x^j \rangle$ measures distance in the ambient space, whereas $k(x^i, x^j)$ measures distance in a transformed space. In particular, a valid kernel takes the form $k(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$ for some nonlinear transform $\phi$.

Kernels Illustrated

A kernel typically lifts the ambient dimension to a higher one. For example, mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$:
$$x^n = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \text{and} \quad \phi(x^n) = \begin{bmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix}.$$

Relationship between Kernel and Transform

Consider the following kernel $k(u, v) = (u^T v)^2$. What is the transform?
Suppose $u$ and $v$ are in $\mathbb{R}^2$. Then $(u^T v)^2$ is
$$(u^T v)^2 = \left( \sum_{i=1}^{2} u_i v_i \right) \left( \sum_{j=1}^{2} u_j v_j \right) = \sum_{i=1}^{2} \sum_{j=1}^{2} (u_i u_j)(v_i v_j) = \begin{bmatrix} u_1^2 & u_1 u_2 & u_2 u_1 & u_2^2 \end{bmatrix} \begin{bmatrix} v_1^2 \\ v_1 v_2 \\ v_2 v_1 \\ v_2^2 \end{bmatrix}.$$
So if we define $\phi$ as
$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \mapsto \phi(u) = \begin{bmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{bmatrix},$$
then $(u^T v)^2 = \langle \phi(u), \phi(v) \rangle$.
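A short numerical check of this identity (my own sketch): for random $u, v \in \mathbb{R}^2$, $(u^T v)^2$ equals $\langle \phi(u), \phi(v) \rangle$ with $\phi$ defined as above.

```python
import numpy as np

def phi(u):
    # Explicit feature map for the kernel k(u, v) = (u^T v)^2 in R^2.
    return np.array([u[0]**2, u[0]*u[1], u[1]*u[0], u[1]**2])

u, v = np.random.randn(2), np.random.randn(2)
print(np.isclose((u @ v)**2, phi(u) @ phi(v)))   # True
```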

Radial Basis Function

A useful kernel is the radial basis function (RBF) kernel:
$$k(u, v) = \exp\left( -\frac{\|u - v\|^2}{2\sigma^2} \right).$$
The corresponding nonlinear transform of the RBF is infinite dimensional; see the Appendix. $\|u - v\|^2$ measures the distance between two data points $u$ and $v$. $\sigma$ is the standard deviation, defining "far" and "close". The RBF enforces local structure; only a few samples are used.

Kernel Method

Given the choice of the kernel function, we can write down the algorithm as follows.
1. Pick a kernel function $k(\cdot, \cdot)$.
2. Construct a kernel matrix $K \in \mathbb{R}^{N \times N}$, where $[K]_{ij} = k(x^i, x^j)$, for $i = 1, \ldots, N$ and $j = 1, \ldots, N$.
3. Compute the coefficients $\alpha \in \mathbb{R}^{N}$, with

$$\alpha_n = [(K + \lambda I)^{-1} y]_n.$$

4. Estimate the predicted value for a new sample $x$:

$$g_{\theta}(x) = \sum_{n=1}^{N} \alpha_n k(x, x^n).$$
Therefore, the choice of the regression function is shifted to the choice of the kernel.
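Below is a minimal NumPy sketch of these four steps, assuming an RBF kernel and made-up 1D data; the function names (rbf_kernel, fit_alpha, predict) are mine, not from the slides.

```python
import numpy as np

def rbf_kernel(u, v, sigma=5.0):
    # k(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def fit_alpha(X, y, lam=0.1, sigma=5.0):
    # Step 2: kernel matrix K with [K]_ij = k(x^i, x^j)
    N = len(X)
    K = np.array([[rbf_kernel(X[i], X[j], sigma) for j in range(N)] for i in range(N)])
    # Step 3: alpha = (K + lam I)^{-1} y
    return np.linalg.solve(K + lam * np.eye(N), y)

def predict(x_new, X, alpha, sigma=5.0):
    # Step 4: g(x) = sum_n alpha_n k(x, x^n)
    return sum(a * rbf_kernel(x_new, xn, sigma) for a, xn in zip(alpha, X))

# Toy example: noisy samples on time stamps t = 0, ..., 99
t = np.arange(100, dtype=float).reshape(-1, 1)
y = np.sin(t.ravel() / 15.0) + 0.1 * np.random.randn(100)
alpha = fit_alpha(t, y)
print(predict(np.array([50.0]), t, alpha))
```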

Outline

Mathematical Background
Lecture 1: Linear regression: A basic data analytic tool
Lecture 2: Regularization: Constraining the solution
Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method
Kernel Method: Dual Form, Kernel Trick, Algorithm
Examples: Radial Basis Function (RBF), Regression using RBF, Kernel Methods in Classification

Example

Goal: Use the kernel method to fit the data points shown below.
What is the input feature vector $x^n$? $x^n = t_n$: the time stamps.
What is the output $y^n$? $y^n$ is the height.
Which kernel to choose? Let us consider the RBF.

(Scatter plot of the noisy data points: time stamps from 0 to 100 on the horizontal axis, heights roughly between -0.1 and 0.8 on the vertical axis.)

Example (using RBF)

Define the fitted function as $g_{\theta}(t)$. [Here, $\theta$ refers to $\alpha$.] The RBF is defined as $k(t_i, t_j) = \exp\{-(t_i - t_j)^2 / 2\sigma^2\}$, for some $\sigma$. The matrix $K$ looks like the following:

$$[K]_{ij} = \exp\{-(t_i - t_j)^2 / 2\sigma^2\}.$$

(Image of the kernel matrix $K$, with row and column indices from 1 to 100.)

$K$ is a banded diagonal matrix. (Why?) The coefficient vector is given by $\alpha_n = [(K + \lambda I)^{-1} y]_n$.

Example (using RBF)

Using the RBF, the predicted value is given by

$$g_{\theta}(t^{\mathrm{new}}) = \sum_{n=1}^{N} \alpha_n k(t^{\mathrm{new}}, t_n) = \sum_{n=1}^{N} \alpha_n e^{-\frac{(t^{\mathrm{new}} - t_n)^2}{2\sigma^2}}.$$

Pictorially, the predicted function $g_{\theta}$ can be viewed as a linear combination of Gaussian kernels centered at the training time stamps.

Effect of σ

Large $\sigma$: flat kernel, over-smoothing. Small $\sigma$: narrow kernel, under-smoothing. Below is an example of the fitting and the kernel matrix $K$.

(Three fitting results and their kernel matrices, for $\sigma$ too large, about right, and too small. Each plot shows the noisy samples, the kernel regression fit, and the ground truth.)
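To see the same tradeoff numerically, here is a small sketch of my own that sweeps $\sigma$ (reusing the hypothetical fit_alpha and predict helpers from the earlier code block):

```python
import numpy as np

# Assumes fit_alpha and predict from the earlier sketch are in scope.
t = np.arange(100, dtype=float).reshape(-1, 1)
truth = np.sin(t.ravel() / 15.0)
y = truth + 0.1 * np.random.randn(100)

for sigma in [30.0, 5.0, 0.5]:               # too large, about right, too small
    alpha = fit_alpha(t, y, lam=0.1, sigma=sigma)
    fit = np.array([predict(x, t, alpha, sigma) for x in t])
    mse = np.mean((fit - truth) ** 2)
    print(f"sigma = {sigma:5.1f}, MSE against ground truth = {mse:.4f}")
    # Error tends to be larger at both extremes of sigma.
```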

Any Improvement?

We can improve the above kernel by considering $x^n = [y_n, t_n]^T$. Define the kernel as
$$k(x^i, x^j) = \exp\left\{ -\left( \frac{(y_i - y_j)^2}{2\sigma_r^2} + \frac{(t_i - t_j)^2}{2\sigma_s^2} \right) \right\}.$$
This new kernel is adaptive (edge-aware).
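A minimal sketch of this adaptive kernel (my own illustration; the name adaptive_kernel and the parameter values $\sigma_r$, $\sigma_s$ are placeholders, not from the slides):

```python
import numpy as np

def adaptive_kernel(xi, xj, sigma_r=0.1, sigma_s=5.0):
    # x^n = [y_n, t_n]: compare samples in both value (range) and time (space).
    yi, ti = xi
    yj, tj = xj
    return np.exp(-((yi - yj) ** 2 / (2 * sigma_r ** 2)
                    + (ti - tj) ** 2 / (2 * sigma_s ** 2)))

# Two samples close in time but far apart in height get a small weight,
# so the fit does not smooth across sharp jumps in the signal.
print(adaptive_kernel(np.array([0.1, 40.0]), np.array([0.7, 42.0])))
```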

Any Improvement?

Here is a comparison.

(Two fitting results on the same noisy data: the left panel is without the improvement, the right panel is with the improvement. Each plot shows the noisy samples, the kernel regression fit, and the ground truth.)

This idea is known as the bilateral filter in the computer vision literature. It can be further extended to 2D images, where $x^n = [y_n, s_n]$ for some spatial coordinate $s_n$. Many applications; see the Reading List.

Kernel Methods in Classification

The concept of lifting the data to a higher dimension is useful for classification. [1]

[1] Image source: https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78

Kernels in Support Vector Machines

Example: RBF for SVM. (We will discuss SVM later in the semester.) The radial basis function is often used in support vector machines. A poor choice of parameter can lead to low training loss, but with the risk of over-fitting. Under-fitted data can sometimes give better generalization.

Reading List

Kernel Method:
Learning from Data (Chapter 3.4) https://work.caltech.edu/telecourse
CMU 10701 Lecture 4 https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kernels_SVM_04_7_2011-ann.pdf
Berkeley CS 194 Lecture 7 https://people.eecs.berkeley.edu/~russell/classes/cs194/
Oxford C19 Lecture 3 http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

Kernel Regression in Computer Vision:
Bilateral Filter https://people.csail.mit.edu/sparis/bf_course/course_notes.pdf
Takeda and Milanfar, "Kernel regression for image processing and reconstruction", IEEE Trans. Image Process. (2007) https://ieeexplore.ieee.org/document/4060955

Appendix

Proof of Lemma

Lemma. For any matrix $A \in \mathbb{R}^{N \times d}$, $y \in \mathbb{R}^{N}$, and $\lambda > 0$,

$$(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y. \tag{2}$$

The left hand side is the solution to the normal equation, which means $A^T A \theta + \lambda \theta = A^T y$.
Rearranging terms gives $\theta = A^T \left( \frac{1}{\lambda}(y - A\theta) \right)$.
Define $\alpha = \frac{1}{\lambda}(y - A\theta)$; then $\theta = A^T \alpha$.
Substituting $\theta = A^T \alpha$ into $\alpha = \frac{1}{\lambda}(y - A\theta)$, we have
$$\alpha = \frac{1}{\lambda}(y - A A^T \alpha).$$
Rearranging terms gives $(A A^T + \lambda I)\alpha = y$, which yields $\alpha = (A A^T + \lambda I)^{-1} y$.
Substituting into $\theta = A^T \alpha$ gives $\theta = A^T (A A^T + \lambda I)^{-1} y$.

Non-Linear Transform for RBF

Let us consider a scalar $u \in \mathbb{R}$.
$$k(u, v) = \exp\{-(u - v)^2\} = \exp\{-u^2\} \exp\{2uv\} \exp\{-v^2\}$$
$$= \exp\{-u^2\} \exp\{-v^2\} \left( \sum_{k=0}^{\infty} \frac{2^k u^k v^k}{k!} \right)$$
$$= \exp\{-u^2\} \left( 1, \sqrt{\tfrac{2^1}{1!}}\, u, \sqrt{\tfrac{2^2}{2!}}\, u^2, \sqrt{\tfrac{2^3}{3!}}\, u^3, \ldots \right)^T \left( 1, \sqrt{\tfrac{2^1}{1!}}\, v, \sqrt{\tfrac{2^2}{2!}}\, v^2, \sqrt{\tfrac{2^3}{3!}}\, v^3, \ldots \right) \exp\{-v^2\}.$$
So $\phi$ is
$$\phi(x) = \exp\{-x^2\} \left( 1, \sqrt{\tfrac{2^1}{1!}}\, x, \sqrt{\tfrac{2^2}{2!}}\, x^2, \sqrt{\tfrac{2^3}{3!}}\, x^3, \ldots \right).$$
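A numerical check of this expansion (my own sketch, not part of the slides): truncating the series at a moderate number of terms already reproduces $k(u, v) = \exp\{-(u - v)^2\}$.

```python
import math

def k_exact(u, v):
    return math.exp(-(u - v) ** 2)

def k_series(u, v, terms=30):
    # exp(-u^2) exp(-v^2) * sum_k 2^k u^k v^k / k!
    s = sum(2 ** k * u ** k * v ** k / math.factorial(k) for k in range(terms))
    return math.exp(-u ** 2) * math.exp(-v ** 2) * s

u, v = 0.7, -1.3
print(k_exact(u, v), k_series(u, v))   # the two values agree to many decimal places
```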

Kernels are Positive Semi-Definite

Given $\{x_j\}_{j=1}^{N}$, construct an $N \times N$ matrix $K$ such that

$$[K]_{ij} = K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j).$$

Claim: K is positive semi-definite.

Let z be an arbitrary vector. Then,

$$z^T K z = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i K_{ij} z_j = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i \Phi(x_i)^T \Phi(x_j) z_j$$
$$= \sum_{i=1}^{N} \sum_{j=1}^{N} z_i \left( \sum_{k} [\Phi(x_i)]_k [\Phi(x_j)]_k \right) z_j \stackrel{(a)}{=} \sum_{k} \left( \sum_{i=1}^{N} [\Phi(x_i)]_k z_i \right)^2 \geq 0,$$

where $[\Phi(x_i)]_k$ denotes the $k$-th element of the vector $\Phi(x_i)$.
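As a quick numerical illustration of this claim (my own sketch, with arbitrary data), one can build an RBF kernel matrix and verify that $z^T K z \geq 0$, or equivalently that the eigenvalues of $K$ are non-negative up to rounding:

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(20, 3)                        # 20 arbitrary samples in R^3
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dist / 2.0)                        # RBF kernel matrix

z = np.random.randn(20)
print(z @ K @ z >= 0)                             # True
print(np.linalg.eigvalsh(K).min() >= -1e-10)      # True: eigenvalues are non-negative
```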

Existence of Nonlinear Transform

We just showed that: If $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$ for any $x_1, \ldots, x_N$, then $K$ is symmetric positive semi-definite. The converse also holds: If $K$ is symmetric positive semi-definite for any $x_1, \ldots, x_N$, then there exists a $\Phi$ such that $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. This converse is difficult to prove; it is called the Mercer Condition. Kernels satisfying Mercer's condition have a $\Phi$. You can use the condition to rule out invalid kernels, but proving that a kernel is valid is still hard.
