18-661 Introduction to Machine Learning – II

Spring 2020

ECE – Carnegie Mellon University

Announcements

• Homework 1 is due today.
• If you are not able to access Gradescope, the entry code is 9NEDVR.
• Python code will be graded for correctness, not efficiency.
• The class waitlist is almost clear now. Any questions about registration?
• The next few classes will be taught by Prof. Carlee Joe-Wong and broadcast from SV to Pittsburgh & Kigali.

Today’s Class: Practical Issues with Using Linear Regression and How to Address Them

Outline

1. Review of Linear Regression

2. Gradient Descent Methods

3. Feature Scaling

4. Ridge regression

5. Non-linear Basis Functions

6. Overfitting

Review of Linear Regression

Example: Predicting house prices

Sale price ≈ price per sqft × square footage + fixed expense

Minimize squared errors

Our model:

Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

Training data (squared errors in K²):

sqft | sale price | prediction | error | squared error
2000 | 810K   | 720K   | 90K  | 8100
2100 | 907K   | 800K   | 107K | 11449
1100 | 312K   | 350K   | 38K  | 1444
5500 | 2,600K | 2,600K | 0    | 0
···  | ···    | ···    | ···  | ···

Total squared error: 8100 + 11449 + 1444 + 0 + ···

Aim: adjust price per sqft and fixed expense so that the sum of the squared errors is minimized, i.e., the unexplainable stuff is minimized.

Linear regression

Setup:

• Input: x ∈ R^D (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f: x → y, with

f(x) = w_0 + \sum_{d=1}^{D} w_d x_d = w_0 + w^\top x

• w = [w_1 \; w_2 \; \cdots \; w_D]^\top: the weights, parameters, or parameter vector
• w_0 is called the bias.
• Sometimes we also call the augmented vector w = [w_0 \; w_1 \; \cdots \; w_D]^\top the parameters.
• Training data: D = {(x_n, y_n), n = 1, 2, …, N}

Minimize the residual sum of squares:

RSS(w) = \sum_{n=1}^{N} \big[y_n - f(x_n)\big]^2 = \sum_{n=1}^{N} \Big[y_n - \big(w_0 + \sum_{d=1}^{D} w_d x_{nd}\big)\Big]^2

A simple case: x is just one-dimensional (D = 1)

Residual sum of squares:

RSS(w) = \sum_n \big[y_n - f(x_n)\big]^2 = \sum_n \big[y_n - (w_0 + w_1 x_n)\big]^2

Stationary points: take the derivative with respect to each parameter and set it to zero:

\frac{\partial RSS(w)}{\partial w_0} = 0 \;\Rightarrow\; -2 \sum_n \big[y_n - (w_0 + w_1 x_n)\big] = 0,

\frac{\partial RSS(w)}{\partial w_1} = 0 \;\Rightarrow\; -2 \sum_n \big[y_n - (w_0 + w_1 x_n)\big] x_n = 0.

Simplify these expressions to get the “Normal Equations”:

\sum_n y_n = N w_0 + w_1 \sum_n x_n

\sum_n x_n y_n = w_0 \sum_n x_n + w_1 \sum_n x_n^2

Solving the system, we obtain the least-squares coefficient estimates:

w_1 = \frac{\sum_n (x_n - \bar{x})(y_n - \bar{y})}{\sum_n (x_n - \bar{x})^2} \quad \text{and} \quad w_0 = \bar{y} - w_1 \bar{x},

where \bar{x} = \frac{1}{N}\sum_n x_n and \bar{y} = \frac{1}{N}\sum_n y_n.
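These closed-form estimates are easy to check numerically. A minimal NumPy sketch using the four training points from the deck's house-price example (sqft in 1000's, prices in 100k's; the variable names are mine):

```python
import numpy as np

# Running example: square footage (1000's) and sale price (100k's)
x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

# Closed-form least-squares estimates for D = 1
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

print(w0, w1)  # w0 ≈ 0.45, w1 = 1.6
```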


Least Squares when x is D-dimensional

RSS(w) in matrix form:

RSS(w) = \sum_n \Big[y_n - \big(w_0 + \sum_d w_d x_{nd}\big)\Big]^2 = \sum_n \big[y_n - w^\top x_n\big]^2,

where we have redefined some variables (by augmenting):

x \leftarrow [1 \; x_1 \; x_2 \; \ldots \; x_D]^\top, \qquad w \leftarrow [w_0 \; w_1 \; w_2 \; \ldots \; w_D]^\top

Design matrix and target vector:

X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{bmatrix} \in \mathbb{R}^{N \times (D+1)}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^N

Compact expression:

RSS(w) = \|Xw - y\|_2^2 = w^\top X^\top X w - 2\,(X^\top y)^\top w + \text{const}

Example: RSS(w) in compact form

sqft (1000's) | bedrooms | bathrooms | sale price (100k)
1   | 2 | 1   | 2
2   | 2 | 2   | 3.5
1.5 | 3 | 2   | 3
2.5 | 4 | 2.5 | 4.5

Design matrix and target vector:

X = \begin{bmatrix} 1 & 1 & 2 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 1.5 & 3 & 2 \\ 1 & 2.5 & 4 & 2.5 \end{bmatrix}, \qquad y = \begin{bmatrix} 2 \\ 3.5 \\ 3 \\ 4.5 \end{bmatrix}

Compact expression:

RSS(w) = \|Xw - y\|_2^2 = w^\top X^\top X w - 2\,(X^\top y)^\top w + \text{const}
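The compact form can be evaluated directly; a small NumPy sketch using the table above (the helper name `rss` is mine):

```python
import numpy as np

# Design matrix from the table: columns are [1, sqft, bedrooms, bathrooms]
X = np.array([[1, 1.0, 2, 1.0],
              [1, 2.0, 2, 2.0],
              [1, 1.5, 3, 2.0],
              [1, 2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

def rss(w):
    """RSS(w) = ||Xw - y||_2^2, the compact expression."""
    r = X @ w - y
    return float(r @ r)

# Sanity check: at w = 0 the residual is just y, so RSS is the sum of y_n^2
print(rss(np.zeros(4)))  # → 45.5
```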

Three Optimization Methods

We want to minimize

RSS(w) = \|Xw - y\|_2^2 = w^\top X^\top X w - 2\,(X^\top y)^\top w + \text{const}

• Least-squares solution: take the derivative and set it to zero
• Batch gradient descent
• Stochastic gradient descent

Least-Squares Solution

Compact expression:

RSS(w) = \|Xw - y\|_2^2 = w^\top X^\top X w - 2\,(X^\top y)^\top w + \text{const}

Gradients of linear and quadratic functions:

• \nabla_x (b^\top x) = b
• \nabla_x (x^\top A x) = 2Ax (symmetric A)

Normal equation:

\nabla_w RSS(w) = 2 X^\top X w - 2 X^\top y = 0

This leads to the least-mean-squares (LMS) solution

w_{LMS} = \big(X^\top X\big)^{-1} X^\top y

Gradient Descent Methods

Outline

Review of Linear Regression

Gradient Descent Methods

Feature Scaling

Ridge regression

Non-linear Basis Functions

Overfitting

Three Optimization Methods

We want to minimize

RSS(w) = \|Xw - y\|_2^2 = w^\top X^\top X w - 2\,(X^\top y)^\top w + \text{const}

• Least-squares solution: take the derivative and set it to zero
• Batch gradient descent
• Stochastic gradient descent

Computational complexity

What is the bottleneck of computing the solution

w = \big(X^\top X\big)^{-1} X^\top y\,?

How many operations do we need?

• O(ND^2) for the matrix multiplication X^\top X
• O(D^3) (e.g., using Gauss-Jordan elimination) or O(D^{2.373}) (recent theoretical advances) for inverting X^\top X
• O(ND) for the matrix multiplication X^\top y
• O(D^2) for multiplying \big(X^\top X\big)^{-1} by X^\top y

Total: O(ND^2) + O(D^3) – impractical for very large D or N

Alternative method: Batch Gradient Descent

(Batch) gradient descent:

• Initialize w to w^{(0)} (e.g., randomly); set t = 0; choose η > 0
• Loop until convergence:
  1. Compute the gradient \nabla RSS(w) = X^\top \big(X w^{(t)} - y\big)
  2. Update the parameters: w^{(t+1)} = w^{(t)} - \eta \nabla RSS(w)
  3. t ← t + 1

What is the complexity of each iteration? O(ND)
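The loop above fits in a few lines of NumPy; a sketch (the function name, step size, and iteration count are my choices, not the slides'):

```python
import numpy as np

def batch_gd(X, y, eta=0.1, iters=1000):
    """Minimize RSS(w) = ||Xw - y||_2^2 by (batch) gradient descent."""
    w = np.zeros(X.shape[1])           # w^(0); a random init also works
    for _ in range(iters):
        grad = X.T @ (X @ w - y)       # gradient, an O(ND) computation
        w = w - eta * grad             # w^(t+1) = w^(t) - eta * grad
    return w

# Running example: augmented inputs [1, sqft] and sale prices
X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(batch_gd(X, y))  # approaches the least-squares solution [0.45, 1.6]
```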

Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion, because RSS(w) is a convex function of its parameters w.

Hessian of RSS:

RSS(w) = w^\top X^\top X w - 2\,(X^\top y)^\top w + \text{const} \;\Rightarrow\; \frac{\partial^2 RSS(w)}{\partial w \, \partial w^\top} = 2 X^\top X

X^\top X is positive semidefinite, because for any v

v^\top X^\top X v = \|Xv\|_2^2 \ge 0

Three Optimization Methods

We want to minimize

RSS(w) = \|Xw - y\|_2^2 = w^\top X^\top X w - 2\,(X^\top y)^\top w + \text{const}

• Least-squares solution: take the derivative and set it to zero
• Batch gradient descent
• Stochastic gradient descent

Stochastic gradient descent (SGD)

Widrow-Hoff rule: update the parameters using one example at a time.

• Initialize w to some w^{(0)}; set t = 0; choose η > 0
• Loop until convergence:
  1. Randomly choose a training sample x_t
  2. Compute its contribution to the gradient: g_t = \big(x_t^\top w^{(t)} - y_t\big)\, x_t
  3. Update the parameters: w^{(t+1)} = w^{(t)} - \eta\, g_t
  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(ND) for gradient descent versus O(D) for SGD
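The Widrow-Hoff loop can be sketched as follows (function name, rate, iteration count, and seed are mine):

```python
import numpy as np

def sgd(X, y, eta=0.05, iters=5000, seed=0):
    """Widrow-Hoff rule: update w from one randomly chosen example at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        t = rng.integers(len(y))           # 1. pick a random training sample
        g = (X[t] @ w - y[t]) * X[t]       # 2. its gradient contribution, O(D)
        w = w - eta * g                    # 3. update the parameters
    return w

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(sgd(X, y))  # hovers near the least-squares solution [0.45, 1.6]
```

With a fixed η the iterates do not settle exactly at the minimizer; they fluctuate around it, which is the noise the next slide refers to.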

SGD versus Batch GD

• SGD reduces the per-iteration complexity from O(ND) to O(D)
• But it is noisier and can take longer to converge

Example: Comparing the Three Methods

sqft (1000's) | sale price (100k)
1   | 2
2   | 3.5
1.5 | 3
2.5 | 4.5

[Figure: price ($) versus house size, showing the training data and a fitted regression line]

Example: Least Squares Solution

sqft (1000's) | sale price (100k)
1   | 2
2   | 3.5
1.5 | 3
2.5 | 4.5

The w_0 and w_1 that minimize RSS are given by w_{LMS} = \big(X^\top X\big)^{-1} X^\top y:

\begin{bmatrix} w_0 \\ w_1 \end{bmatrix} =
\left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 1.5 & 2.5 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 1.5 \\ 1 & 2.5 \end{bmatrix} \right)^{-1}
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 1.5 & 2.5 \end{bmatrix}
\begin{bmatrix} 2 \\ 3.5 \\ 3 \\ 4.5 \end{bmatrix}
= \begin{bmatrix} 0.45 \\ 1.6 \end{bmatrix}

The minimum RSS is \|X w_{LMS} - y\|_2^2 = 0.05 (equivalently, \|X w_{LMS} - y\|_2 \approx 0.2236).
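The same numbers fall out of a direct linear solve; note that solving the normal equations is generally preferable to forming the inverse explicitly:

```python
import numpy as np

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

# Solve X^T X w = X^T y rather than inverting X^T X
w = np.linalg.solve(X.T @ X, X.T @ y)
rss = float(np.sum((X @ w - y) ** 2))
print(w, rss)  # w ≈ [0.45, 1.6], minimum RSS ≈ 0.05
```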

Example: Batch Gradient Descent

sqft (1000's) | sale price (100k)
1   | 2
2   | 3.5
1.5 | 3
2.5 | 4.5

w^{(t+1)} = w^{(t)} - \eta \nabla RSS(w) = w^{(t)} - \eta X^\top \big(X w^{(t)} - y\big)

[Figure: RSS value versus number of iterations for η = 0.01]

Larger η gives faster convergence

[Figure: RSS value versus number of iterations for η = 0.01 and η = 0.1; the larger rate converges in fewer iterations]

But too large an η makes GD unstable

[Figure: RSS value versus number of iterations for η = 0.01, 0.1, and 0.12; with η = 0.12 the RSS blows up]

Example: Stochastic Gradient Descent

sqft (1000's) | sale price (100k)
1   | 2
2   | 3.5
1.5 | 3
2.5 | 4.5

w^{(t+1)} = w^{(t)} - \eta\, g_t = w^{(t)} - \eta\,\big(x_t^\top w^{(t)} - y_t\big)\, x_t

[Figure: RSS value versus number of iterations for η = 0.05]

Larger η gives faster convergence

[Figure: RSS value versus number of iterations for η = 0.05 and η = 0.1]

But too large an η makes SGD unstable

[Figure: RSS value versus number of iterations for η = 0.05, 0.1, and 0.25; with η = 0.25 the iterates oscillate wildly]

How to Choose Learning Rate η in practice?

• Try 0.0001, 0.001, 0.01, 0.1, etc. on a validation dataset (more on this later) and choose the one that gives the fastest stable convergence.
• Reduce η by a constant factor (e.g., 10) when learning saturates, so that we can reach closer to the true minimum.
• More advanced learning-rate schedules such as AdaGrad, Adam, and AdaDelta are used in practice.

Summary of Gradient Descent Methods

• Batch gradient descent computes the exact gradient.
• Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient.
• Mini-batch variant: set the batch size to trade off between the accuracy of the gradient estimate and the computational cost.
• Similar ideas extend to other ML optimization problems.
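The mini-batch variant mentioned above can be sketched as follows (batch size, rate, epoch count, and names are illustrative choices of mine):

```python
import numpy as np

def minibatch_gd(X, y, batch_size=2, eta=0.05, epochs=500, seed=0):
    """Estimate the gradient from `batch_size` examples per update,
    interpolating between SGD (size 1) and batch GD (size N)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)                       # shuffle each epoch
        for s in range(0, n, batch_size):
            b = order[s:s + batch_size]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)   # averaged mini-batch gradient
            w = w - eta * grad
    return w

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(minibatch_gd(X, y))  # close to the least-squares solution [0.45, 1.6]
```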

Feature Scaling

Outline

Review of Linear Regression

Gradient Descent Methods

Feature Scaling

Ridge regression

Non-linear Basis Functions

Overfitting

Batch Gradient Descent: Scaled Features

sqft (1000's) | sale price (100k)
1   | 2
2   | 3.5
1.5 | 3
2.5 | 4.5

w^{(t+1)} = w^{(t)} - \eta \nabla RSS(w) = w^{(t)} - \eta X^\top \big(X w^{(t)} - y\big)

[Figure: RSS value versus number of iterations for η = 0.01 and η = 0.1]

Batch Gradient Descent: Without Feature Scaling

sqft | sale price
1000 | 200,000
2000 | 350,000
1500 | 300,000
2500 | 450,000

• The least-squares solution is (w_0^*, w_1^*) = (45000, 160)
• \nabla RSS(w^{(t)}) = X^\top \big(X w^{(t)} - y\big) becomes HUGE, causing instability
• We need a tiny η to compensate, but this leads to slow convergence

[Figure: RSS value versus number of iterations for η = 0.0000001 and η = 0.00001]

How to Scale Features?

• Min-max normalization:

x_d' = \frac{x_d - \min_n x_d}{\max_n x_d - \min_n x_d}

The min and max are taken over the values x_d^{(1)}, \ldots, x_d^{(N)} of x_d in the dataset. This results in all scaled features satisfying 0 \le x_d' \le 1.

• Mean normalization:

x_d' = \frac{x_d - \operatorname{avg}(x_d)}{\max_n x_d - \min_n x_d}

This results in all scaled features satisfying -1 \le x_d' \le 1.

There are several other methods, e.g., dividing by the standard deviation (Z-score normalization). The labels y^{(1)}, \ldots, y^{(N)} should be similarly re-scaled.
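Both normalizations are one-liners applied column-wise; a NumPy sketch (function names are mine; both assume no column is constant, so the denominator is nonzero), run here on the unscaled housing data with the label column included:

```python
import numpy as np

def min_max_scale(X):
    """Column-wise min-max normalization: each feature mapped into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def mean_scale(X):
    """Column-wise mean normalization: each feature mapped into [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - X.mean(axis=0)) / (hi - lo)

X = np.array([[1000.0, 200000.0],
              [2000.0, 350000.0],
              [1500.0, 300000.0],
              [2500.0, 450000.0]])
print(min_max_scale(X))   # every entry in [0, 1]
print(mean_scale(X))      # every entry in [-1, 1]
```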

Ridge regression

Outline

Review of Linear Regression

Gradient Descent Methods

Feature Scaling

Ridge regression

Non-linear Basis Functions

Overfitting


What if X^\top X is not invertible?

w_{LMS} = \big(X^\top X\big)^{-1} X^\top y

Why might this happen?

• Answer 1: N < D. There is not enough data to estimate all of the parameters, and X^\top X is not full-rank.
• Answer 2: The columns of X are not linearly independent, e.g., some features are linear functions of other features. In this case, the solution is not unique. Examples:
  • A feature is a re-scaled version of another, for example, two features giving length in meters and in feet respectively
  • The same feature is repeated twice – this can happen when there are many features
  • A feature has the same value for all data points
  • The sum of two features equals a third feature


Example: Matrix X^\top X is not invertible

sqft (1000's) | bathrooms | sale price (100k)
1   | 2 | 2
2   | 2 | 3.5
1.5 | 2 | 3
2.5 | 2 | 4.5

Design matrix and target vector:

X = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 2 & 2 \\ 1 & 1.5 & 2 \\ 1 & 2.5 & 2 \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}, \qquad y = \begin{bmatrix} 2 \\ 3.5 \\ 3 \\ 4.5 \end{bmatrix}

The ’bathrooms’ feature is redundant, so we don’t need w_2:

y = w_0 + w_1 x_1 + w_2 x_2
  = w_0 + w_1 x_1 + 2 w_2, \quad \text{since } x_2 \text{ is always } 2!
  = w_{0,\text{eff}} + w_1 x_1, \quad \text{where } w_{0,\text{eff}} = w_0 + 2 w_2

What does the RSS loss function look like?

• When X^\top X is not invertible, the RSS objective function has a ridge; that is, the minimum is a line instead of a single point.
• In our example, this line is parameterized by w_{0,\text{eff}} = w_0 + 2 w_2
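The rank deficiency is easy to see numerically; a NumPy sketch on the example's design matrix:

```python
import numpy as np

# Design matrix from the example: the 'bathrooms' column is constant (= 2),
# i.e. exactly twice the bias column
X = np.array([[1, 1.0, 2],
              [1, 2.0, 2],
              [1, 1.5, 2],
              [1, 2.5, 2]])
y = np.array([2.0, 3.5, 3.0, 4.5])

A = X.T @ X
print(np.linalg.matrix_rank(A))  # → 2, not 3: A is singular

# lstsq still returns one least-squares solution (the minimum-norm one);
# every point on the ridge of minimizers has w1 = 1.6 and w0 + 2*w2 = 0.45
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```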

How do you fix this issue?

sqft (1000's) | bathrooms | sale price (100k)
1   | 2 | 2
2   | 2 | 3.5
1.5 | 2 | 3
2.5 | 2 | 4.5

• Manually remove redundant features
• But this can be tedious and non-trivial, especially when a feature is a linear combination of several other features

We need a general way that doesn’t require manual feature engineering. SOLUTION: Ridge Regression


Ridge regression

Intuition: what does a non-invertible X^\top X mean? Consider the SVD of this matrix:

X^\top X = V \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 & 0 \\ 0 & \lambda_2 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & \lambda_r & 0 \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix} V^\top

where \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0 and r < D. We will have a divide-by-zero issue when computing (X^\top X)^{-1}.

Fix the problem by ensuring that all singular values are non-zero:

X^\top X + \lambda I = V \operatorname{diag}(\lambda_1 + \lambda,\; \lambda_2 + \lambda,\; \cdots,\; \lambda)\, V^\top

where \lambda > 0 and I is the identity matrix.


Regularized least squares (ridge regression)

Solution:

w = \big(X^\top X + \lambda I\big)^{-1} X^\top y

This is equivalent to adding an extra term to RSS(w):

\underbrace{\tfrac{1}{2}\big(w^\top X^\top X w - 2\,(X^\top y)^\top w\big)}_{RSS(w)} \; + \; \underbrace{\tfrac{\lambda}{2}\, \|w\|_2^2}_{\text{regularization}}

Benefits:

• Numerically more stable: the matrix is invertible
• Forces w to be small
• Prevents overfitting — more on this later
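As a sketch (the helper name is mine), the regularized solve goes through even when X^\top X itself is singular:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: solve (X^T X + lam*I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Rank-deficient design matrix from the running example (constant column)
X = np.array([[1, 1.0, 2], [1, 2.0, 2], [1, 1.5, 2], [1, 2.5, 2]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(ridge_fit(X, y, lam=0.5))  # ≈ [0.208, 1.247, 0.417]
```

For λ = 0.5 this reproduces the numbers worked out on the next slide.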


y = w0 + w1x1 + w2x2

= w0 + w1x1 + w2 2, since x2 is always 2! × = w0,eff + w1x1, where w0,eff = (w0 + 2w2)

= 0.45 + 1.6x1 Should get this

Applying this to our example

sqft (1000’s) bathrooms sale price (100k) 1 2 2 2 2 3.5 1.5 2 3 2.5 2 4.5

45 Applying this to our example

sqft (1000’s) bathrooms sale price (100k) 1 2 2 2 2 3.5 1.5 2 3 2.5 2 4.5

The ’bathrooms’ feature is redundant, so we don’t need w2

y = w0 + w1x1 + w2x2

= w0 + w1x1 + w2 2, since x2 is always 2! × = w0,eff + w1x1, where w0,eff = (w0 + 2w2)

= 0.45 + 1.6x1 Should get this

Applying this to our example

The 'bathrooms' feature is redundant, so we don't need w2:

  y = w0 + w1 x1 + w2 x2
    = w0 + w1 x1 + 2 w2        (since x2 is always 2!)
    = w0,eff + w1 x1,  where w0,eff = w0 + 2 w2
    = 0.45 + 1.6 x1            (the fit we should get)

Compute the ridge solution for λ = 0.5:

  [w0, w1, w2]⊤ = (X⊤X + λI)⁻¹ X⊤y = [0.208, 1.247, 0.4166]⊤

How does λ affect the solution?

  [w0, w1, w2]⊤ = (X⊤X + λI)⁻¹ X⊤y

Let us plot w0,eff = w0 + 2 w2 and w1 for different λ ∈ [0.01, 20].

[Figure: parameter values w0,eff and w1 plotted against the hyperparameter λ over [0, 20].]

Setting a small λ gives almost the least-squares solution, but it can cause numerical instability in the inversion.

How to choose λ?

λ is referred to as a hyperparameter:

• It is associated with the estimation method, not the dataset
• In contrast, w is the parameter vector
• Use a validation set or cross-validation to find a good choice of λ (more on this in the next lecture)

[Figure: the same plot of w0,eff and w1 versus the hyperparameter λ.]
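
A sketch of how the plotted curves can be re-created (assuming the same housing data; the exact plotting code is not in the slides):

```python
import numpy as np

# Re-create the slide's plot data: sweep the hyperparameter lambda and
# track w0,eff = w0 + 2*w2 and w1. X rows are [1, sqft, bathrooms].
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 2.0],
              [1.0, 1.5, 2.0],
              [1.0, 2.5, 2.0]])
y = np.array([2.0, 3.5, 3.0, 4.5])

def ridge_solve(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

lams = np.linspace(0.01, 20, 200)
w0_eff, w1 = [], []
for lam in lams:
    w = ridge_solve(X, y, lam)
    w0_eff.append(w[0] + 2 * w[2])
    w1.append(w[1])

# Near lam = 0 we recover the least-squares fit y = 0.45 + 1.6*x1;
# as lam grows, w1 is shrunk well below its least-squares value.
print(w0_eff[0], w1[0])  # close to 0.45 and 1.6
```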

Why is it called Ridge Regression?

• When X⊤X is not invertible, the RSS objective function has a ridge: the minimum is a line instead of a single point
• Adding the regularizer term (λ/2)‖w‖₂² yields a unique minimum, thus avoiding instability in the matrix inversion
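
A quick numerical check (a sketch using the earlier housing example, where the constant bathrooms column makes X⊤X singular):

```python
import numpy as np

# Sketch: in the housing example the bathrooms column is always 2, i.e. twice
# the bias column, so X^T X is rank-deficient and least squares has no unique
# minimizer. Adding lambda*I restores full rank.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 2.0],
              [1.0, 1.5, 2.0],
              [1.0, 2.5, 2.0]])

A = X.T @ X
print(np.linalg.matrix_rank(A))                    # 2: singular, not invertible
print(np.linalg.matrix_rank(A + 0.5 * np.eye(3)))  # 3: invertible for lambda > 0
```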

Probabilistic Interpretation of Ridge Regression

Add a term to the objective function: choose the parameters to not just minimize risk, but also avoid being too large.

  (1/2)(w⊤X⊤Xw − 2(X⊤y)⊤w) + (λ/2)‖w‖₂²

Probabilistic interpretation: place a prior on our weights.

• Interpret w as a random variable
• Assume that each wd is centered around zero
• Use the observed data D to update our prior belief on w

Gaussian priors lead to ridge regression.

Review: Probabilistic interpretation of Linear Regression

Linear Regression model: Y = w⊤X + η, where η ∼ N(0, σ0²) is a Gaussian random variable, so Y ∼ N(w⊤X, σ0²).

Frequentist interpretation: we assume that w is fixed.

• The likelihood function maps parameters to probabilities:

  L : (w, σ0²) ↦ p(D | w, σ0²) = p(y | X, w, σ0²) = ∏n p(yn | xn, w, σ0²)

• Maximizing the likelihood with respect to w minimizes the RSS and yields the LMS solution:

  w_LMS = w_ML = argmax_w L(w, σ0²)

Probabilistic interpretation of Ridge Regression

Ridge Regression model: Y = w⊤X + η

• Y ∼ N(w⊤X, σ0²) is a Gaussian random variable (as before)
• wd ∼ N(0, σ²) are i.i.d. Gaussian random variables (unlike before)
• Note that all wd share the same variance σ²

To find w given data D, compute the posterior distribution of w:

  p(w | D) = p(D | w) p(w) / p(D)

Maximum a posteriori (MAP) estimate:

  w_MAP = argmax_w p(w | D) = argmax_w p(D | w) p(w)

Estimating w

Let x1, . . . , xN be i.i.d. with y | w, x ∼ N(w⊤x, σ0²) and wd ∼ N(0, σ²).

Joint likelihood of data and parameters (given σ0, σ):

  p(D, w) = p(D | w) p(w) = ∏n p(yn | xn, w) · ∏d p(wd)

Plugging in the Gaussian PDF, we get:

  log p(D, w) = Σn log p(yn | xn, w) + Σd log p(wd)
              = − Σn (w⊤xn − yn)² / (2σ0²) − Σd wd² / (2σ²) + const

MAP estimate:

  w_MAP = argmax_w log p(D, w) = argmin_w [ Σn (w⊤xn − yn)² / (2σ0²) + ‖w‖₂² / (2σ²) ]

Maximum a posteriori (MAP) estimate

  E(w) = Σn (w⊤xn − yn)² + λ‖w‖₂²

where λ > 0 is used to denote σ0²/σ². The extra term ‖w‖₂² is called the regularization term (regularizer) and controls the magnitude of w.

Intuitions:

• If λ → +∞, then σ0² ≫ σ²: the variance of the noise is far greater than what our prior model can allow for w. In this case, the prior forces w to be close to zero. Numerically,

  w_MAP → 0

• If λ → 0, then we trust our data more. Numerically,

  w_MAP → w_LMS = argmin_w Σn (w⊤xn − yn)²
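
As a sanity check (a sketch, not from the slides), one can verify numerically that the closed-form ridge solution minimizes this MAP objective E(w):

```python
import numpy as np

# Sketch: check that w_ridge = (X^T X + lam*I)^{-1} X^T y minimizes
# E(w) = sum_n (w.x_n - y_n)^2 + lam * ||w||^2, the MAP objective with
# lam = sigma0^2 / sigma^2. Data is the housing example from earlier slides.
rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 2.0],
              [1.0, 1.5, 2.0],
              [1.0, 2.5, 2.0]])
y = np.array([2.0, 3.5, 3.0, 4.5])
lam = 0.5

def E(w):
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Every random perturbation of w_ridge has a strictly larger objective,
# since E is strictly convex for lam > 0.
worse = all(E(w_ridge + 0.1 * rng.standard_normal(3)) > E(w_ridge)
            for _ in range(1000))
print(worse)  # True
```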

Outline

1. Review of Linear Regression

2. Gradient Descent Methods

3. Feature Scaling

4. Ridge regression

5. Non-linear Basis Functions

6. Overfitting

Non-linear Basis Functions

Is a linear modeling assumption always a good idea?

[Figure 1: Sale price can saturate as sq. footage increases.]

[Figure 2: Temperature has cyclic variations over each year.]

General nonlinear basis functions

We can use a nonlinear mapping to a new feature vector:

  φ(x) : x ∈ R^D → z ∈ R^M

• M is the dimensionality of the new features z (or φ(x))
• M could be greater than, less than, or equal to D

We can apply existing learning methods on the transformed data:

• linear methods: prediction is based on w⊤φ(x)
• other methods: nearest neighbors, decision trees, etc.

Regression with nonlinear basis

Residual sum of squares:

  Σn [w⊤φ(xn) − yn]²

where w ∈ R^M has the same dimensionality as the transformed features φ(x).

The LMS solution can be formulated with the new design matrix:

  Φ = [φ(x1)⊤; φ(x2)⊤; . . . ; φ(xN)⊤] ∈ R^(N×M),  w_LMS = (Φ⊤Φ)⁻¹ Φ⊤y
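
The Φ-based recipe above can be sketched as follows (assuming noisy samples of sin(2πx), the running example in these figures):

```python
import numpy as np

# Sketch: least squares with a polynomial basis phi(x) = [1, x, ..., x^M],
# fit to noisy samples of sin(2*pi*x).
rng = np.random.default_rng(1)
N, M = 10, 3
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

def design(x, M):
    # N x (M+1) design matrix Phi with rows [1, x_n, x_n^2, ..., x_n^M]
    return np.vander(x, M + 1, increasing=True)

Phi = design(x, M)
w_lms = np.linalg.lstsq(Phi, y, rcond=None)[0]  # solves (Phi^T Phi) w = Phi^T y

# Predict at a new point with the same basis.
x_new = np.array([0.25])
y_hat = design(x_new, M) @ w_lms
```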

Example: Lots of Flexibility in Designing New Features!

  x1, Area (1k sqft)   x1², Area²   Price (100k)
  1.0                  1.00         2.0
  2.0                  4.00         3.5
  1.5                  2.25         3.0
  2.5                  6.25         4.5

Figure 3: Add x1² as a feature to allow us to fit quadratic, instead of linear, functions of the house area x1.
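
Both feature-engineering ideas — the squared-area feature here and the lot-area product on the next slide — are one line each in NumPy; this is an illustrative sketch around the tables shown:

```python
import numpy as np

# Sketch of the two feature-engineering ideas on these slides: add a squared
# area term, or replace frontage/depth by their product (the lot area).
area = np.array([1.0, 2.0, 1.5, 2.5])      # 1k sqft
price = np.array([2.0, 3.5, 3.0, 4.5])     # $100k

# Quadratic-in-area design matrix: [1, x1, x1^2]
Phi = np.column_stack([np.ones_like(area), area, area ** 2])
w = np.linalg.lstsq(Phi, price, rcond=None)[0]

# Lot-area feature from the next slide's table: 10 * frontage * depth.
frontage = np.array([0.5, 0.5, 0.8, 1.0])  # 100 ft
depth = np.array([0.5, 1.0, 1.5, 1.5])     # 100 ft
lot_area = 10 * frontage * depth           # 1k sqft, matching the table
```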

Example: Lots of Flexibility in Designing New Features!

  x1, front (100 ft)   x2, depth (100 ft)   10·x1·x2, Lot (1k sqft)   Price (100k)
  0.5                  0.5                  2.5                       2.0
  0.5                  1.0                  5.0                       3.5
  0.8                  1.5                  12.0                      3.0
  1.0                  1.5                  15.0                      4.5

Figure 4: Instead of having frontage and depth as two separate features, it may be better to consider the lot area, which is equal to frontage × depth.

Example with regression

Polynomial basis functions:

  φ(x) = [1, x, x², . . . , x^M]⊤  ⇒  f(x) = w0 + Σ_{m=1}^{M} wm x^m

Fitting samples from a sine function: underfitting, since f(x) is too simple.

[Figure: the M = 0 and M = 1 fits to the sine samples.]

Adding high-order terms

[Figure: the M = 3 fit (reasonable) and the M = 9 fit (overfitting).]

More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!

Overfitting

Parameters for higher-order polynomials are very large:

        M = 0    M = 1    M = 3        M = 9
  w0     0.19     0.82     0.31         0.35
  w1             -1.27     7.99       232.37
  w2                      -25.43     -5321.83
  w3                       17.37     48568.31
  w4                                -231639.30
  w5                                 640042.26
  w6                               -1061800.52
  w7                                1042400.18
  w8                                -557682.99
  w9                                 125201.43

Overfitting can be quite disastrous

Fitting the housing price data with large M: the predicted price goes to zero (and is ultimately negative) if you buy a big enough house!

This is called poor generalization/overfitting.

Detecting overfitting

Plot model complexity versus the objective function:

• X axis: model complexity, e.g., M
• Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

Compute the objective on a training and a test dataset.

[Figure: training and test E_RMS plotted against M, for M = 0 to 9.]

As a model increases in complexity:

• Training error keeps improving
• Test error may first improve but eventually will deteriorate
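
The detection recipe above can be sketched as follows (an illustration: noisy sine data, a held-out test set, and RMS error per degree M):

```python
import numpy as np

# Sketch: fit polynomials of increasing degree M to noisy sine samples
# and compare RMS error on training data versus held-out test data.
rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

def rms(Phi, w, y):
    return np.sqrt(np.mean((Phi @ w - y) ** 2))

x_tr, y_tr = make_data(10)
x_te, y_te = make_data(100)

train_err, test_err = [], []
for M in range(10):
    Phi_tr = np.vander(x_tr, M + 1, increasing=True)
    Phi_te = np.vander(x_te, M + 1, increasing=True)
    w = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]
    train_err.append(rms(Phi_tr, w, y_tr))
    test_err.append(rms(Phi_te, w, y_te))

# Training error keeps shrinking with M; test error typically does not.
```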

Dealing with overfitting

Try to use more training data.

[Figure: the M = 9 fit recomputed with N = 15 and N = 100 samples; with more data the M = 9 fit behaves much better.]

What if we do not have a lot of data?

Regularization methods

Intuition: give preference to 'simpler' models

• How do we define a simple linear regression model w⊤x?
• Intuitively, the weights should not be "too large"

Recall the table of polynomial coefficients above: the M = 9 fit has weights as large as ~10⁶.
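
A sketch of this intuition (assuming the noisy-sine setup from the earlier slides): ridge keeps the weight vector small where plain least squares lets it blow up.

```python
import numpy as np

# Sketch: fit a degree-9 polynomial to noisy sine samples with plain least
# squares and with ridge, then compare the size of the learned weights.
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
Phi = np.vander(x, 10, increasing=True)  # [1, x, ..., x^9]

w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
lam = 1e-3
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)

# The ridge penalty can never increase the norm of the solution, and here
# it typically shrinks it dramatically.
print(np.linalg.norm(w_ls), np.linalg.norm(w_ridge))
```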

Next Class: Overfitting, Regularization and the Bias-variance trade-off