Linear Regression with Shrinkage

Some slides are due to Tommi Jaakkola, MIT AI Lab.

Introduction

• The goal of regression is to make quantitative (real-valued) predictions on the basis of a (vector of) features or attributes.
• Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.

(Example: predicting vehicle fuel efficiency (mpg) from 8 attributes.)

A generic regression problem

• The input attributes are given as fixed-length vectors x ∈ ℝ^M that could come from different sources: inputs, transformations of inputs (log, square root, etc.), or basis functions.
• The outputs are assumed to be real-valued, y ∈ ℝ (with some possible restrictions).

• Given n i.i.d. training samples D = {(x_1, y_1), ..., (x_n, y_n)} from an unknown distribution P(x, y), the goal is to minimize the prediction error (loss) on new examples (x, y) drawn at random from the same P(x, y).
• An example of a loss function is the squared loss:

    L(f(x), y) = (f(x) - y)^2

(Here f(x) is our prediction for x.)

Regression Function

• We need to define a class of functions (the types of predictions we make).

Linear prediction:

    f(x; w_0, w_1) = w_0 + w_1 x

where w_0, w_1 are the parameters we need to estimate.

Linear Regression

• Typically we have a set of training data (x_1, y_1), ..., (x_n, y_n) from which we estimate the parameters w_0, w_1.

• Each x_i = (x_{i1}, ..., x_{iM}) is a vector of measurements for the i-th case,

or

    h(x_i) = {h_0(x_i), ..., h_M(x_i)}  is a basis expansion of x_i.

    y(x) ≈ f(x; w) = w_0 + \sum_{j=1}^{M} w_j h_j(x) = w^T h(x)    (where we define h_0(x) = 1)

Basis Functions

• There are many basis functions we can use, e.g.:
  • Polynomial: h_j(x) = x^{j-1}
  • Radial basis functions: h_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
  • Sigmoidal: h_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)
  • Splines, Fourier bases, wavelets, etc.
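A minimal numpy sketch of such a basis expansion (the RBF centers mu_j and width s below are illustrative choices, not values from the slides):

    import numpy as np

    def design_matrix(x, centers, s=1.0):
        """Map 1-D inputs x to features [h_0(x), h_1(x), ..., h_M(x)] with h_0 = 1."""
        x = np.asarray(x, dtype=float).reshape(-1, 1)                  # n x 1
        rbf = np.exp(-(x - centers.reshape(1, -1))**2 / (2 * s**2))    # n x M radial basis features
        return np.hstack([np.ones((x.shape[0], 1)), rbf])              # prepend the constant h_0(x) = 1

    x = np.linspace(-3, 3, 50)
    centers = np.linspace(-2, 2, 5)          # assumed RBF centers mu_j
    H = design_matrix(x, centers, s=0.8)     # n x (M+1) matrix of h_j(x_i)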

Estimation Criterion

• We need a fitting/estimation criterion to select appropriate values for the parameters w_0, w_1 based on the training set D = {(x_1, y_1), ..., (x_n, y_n)}.
• For example, we can use the empirical loss:

    J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w_0, w_1))^2

  (note: the loss is the same as the one used in evaluation).

Empirical loss: motivation

• Ideally, we would like to find the parameters w_0, w_1 that minimize the expected loss (i.e., with unlimited training data):

    J(w_0, w_1) = E_{(x,y) \sim P}\left[ (y - f(x; w_0, w_1))^2 \right]

  where the expectation is over samples from P(x, y).
• When the number of training examples n is large:

    E_{(x,y) \sim P}\left[ (y - f(x; w_0, w_1))^2 \right] \approx \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w_0, w_1))^2

  (left-hand side: expected loss; right-hand side: empirical loss)
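A small numpy sketch of this approximation (the generating distribution, candidate parameters, and sample sizes are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    w0_true, w1_true, sigma = 1.0, 2.0, 0.5       # assumed data-generating model

    def sample(n):
        x = rng.uniform(-1, 1, n)
        y = w0_true + w1_true * x + rng.normal(0, sigma, n)
        return x, y

    w0, w1 = 0.8, 1.9                              # some candidate parameters
    x_small, y_small = sample(50)                  # a modest training set
    x_big, y_big = sample(200_000)                 # large sample as a proxy for the expectation over P(x, y)

    emp_loss = np.mean((y_small - (w0 + w1 * x_small))**2)
    exp_loss = np.mean((y_big - (w0 + w1 * x_big))**2)
    print(emp_loss, exp_loss)   # the empirical loss fluctuates around the expected loss; agreement improves as n grows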

Linear Regression Estimation

• Minimize the empirical squared loss:

    J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w_0, w_1))^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2

• By setting the derivatives with respect to w_1, w_0 to zero, we get necessary conditions for the "optimal" parameter values:

    \frac{\partial}{\partial w_1} J_n(w_0, w_1) = \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-x_i) = 0

    \frac{\partial}{\partial w_0} J_n(w_0, w_1) = \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-1) = 0

Interpretation

The optimality conditions

    \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-x_i) = 0

    \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-1) = 0

ensure that the prediction error \epsilon_i = y_i - w_0 - w_1 x_i is uncorrelated with any linear function of the inputs.

Linear Regression: Matrix Form

• With n samples, stack the data into

    X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad
    y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad
    w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}

  (in general, X is an n × (M+1) matrix and w ∈ ℝ^{M+1}).

n 1 2 t J n (w0 , w1) = ∑()()()yi − w0 − w1xi = Xw − y Xw − y n i=1 Linear Regression Solution

• By setting the derivatives of (Xw - y)^T (Xw - y) to zero,

    \frac{2}{n} (X^T y - X^T X w) = 0,

we get the solution:

    \hat{w} = (X^T X)^{-1} X^T y
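A minimal numpy sketch of this closed-form solution (the data are randomly generated for illustration; the assumed true parameters are w_0 = 1, w_1 = 2):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = rng.uniform(-1, 1, n)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.3, n)     # assumed true model plus Gaussian noise

    X = np.column_stack([np.ones(n), x])          # n x 2 design matrix with rows (1, x_i)
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)     # solves X^T X w = X^T y, i.e. (X^T X)^{-1} X^T y
    print(w_hat)                                  # approximately [1.0, 2.0]

In practice, np.linalg.lstsq(X, y, rcond=None) is numerically preferable to forming X^T X explicitly.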

• The solution is a linear function of the outputs y.

Statistical view of linear regression

• In a statistical regression model we model both the function and the noise:

    Observed output = function + noise

    y(x) = f(x; w) + \epsilon

  where, e.g., \epsilon \sim N(0, \sigma^2).

• Whatever we cannot capture with our chosen family of functions will be interpreted as noise.

Statistical view of linear regression

• f(x; w) is trying to capture the mean of the observations y given the input x:

    E[y | x] = E[f(x; w) + \epsilon | x] = f(x; w)

• Here E[y | x] is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution of X).

• According to our statistical model

    y(x) = f(x; w) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2),

  the outputs y given x are normally distributed with mean f(x; w) and variance \sigma^2:

    p(y | x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y - f(x; w))^2 \right)

• (We model the uncertainty in the predictions, not just the mean.)

Maximum likelihood estimation

• Given observations D = {(x_1, y_1), ..., (x_n, y_n)}, we find the parameters w that maximize the likelihood of the outputs:

    L(w, \sigma^2) = \prod_{i=1}^{n} p(y_i | x_i, w, \sigma^2)
                   = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 \right)

• Equivalently, maximize the log-likelihood:

    \log L(w, \sigma^2) = n \log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; w))^2

  Maximizing over w amounts to minimizing the sum of squared errors.

Maximum likelihood estimation

• Thus

    w_{MLE} = \arg\min_{w} \sum_{i=1}^{n} (y_i - f(x_i; w))^2

• But the empirical squared loss is

    J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2

• Least-squares linear regression is MLE for Gaussian noise!
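A quick numpy check of this equivalence (data, sigma, and the candidate parameter vectors are illustrative assumptions): for a fixed noise level sigma, the Gaussian negative log-likelihood is an increasing affine function of the squared loss, so the two criteria rank candidate weights identically and share the same minimizer.

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma = 200, 0.5
    x = rng.uniform(-1, 1, n)
    y = 1.0 + 2.0 * x + rng.normal(0, sigma, n)
    X = np.column_stack([np.ones(n), x])

    def sq_loss(w):
        return np.sum((y - X @ w)**2)

    def neg_log_lik(w):
        # Gaussian NLL: n/2 * log(2*pi*sigma^2) + (1 / (2*sigma^2)) * squared loss
        return 0.5 * n * np.log(2 * np.pi * sigma**2) + sq_loss(w) / (2 * sigma**2)

    candidates = [np.array([1.0, 2.0]), np.array([0.5, 1.5]), np.array([1.2, 2.3])]
    print([round(sq_loss(w), 2) for w in candidates])
    print([round(neg_log_lik(w), 2) for w in candidates])   # same ordering, same argmin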

Linear Regression (is it "good"?)

• Simple model
• Straightforward solution

BUT:
• Least squares is not necessarily a good estimator in terms of prediction error.
• The matrix X^T X could be ill-conditioned when:
  • the inputs are correlated,
  • the input dimension is large,
  • the training set is small.

Linear Regression - Example

In this example the output is generated by the model

    y = 0.8 x_1 + 0.2 x_2 + \epsilon

where \epsilon is small white Gaussian noise. Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.

And the results are…

    x_1, x_2 uncorrelated:          RSS ≈ 0.090,   w ≈ (0.81, 0.21)
    x_1, x_2 correlated:            RSS ≈ 0.105,   w ≈ (0.43, 0.57)
    x_1, x_2 strongly correlated:   RSS ≈ 0.073,   w ≈ (30.24, -29.22)

    (Scatter plots of the three training sets are omitted.)

Strong correlation can cause the coefficients to become very large, and they cancel each other out to achieve a good RSS.
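A small numpy sketch that reproduces this effect qualitatively (the correlation levels, noise scale, and sample size are illustrative assumptions, not the exact settings behind the plots above):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 30

    def fit_ls(corr):
        x1 = rng.normal(0, 1, n)
        x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(0, 1, n)   # inputs with correlation ~ corr
        y = 0.8 * x1 + 0.2 * x2 + rng.normal(0, 0.1, n)
        X = np.column_stack([x1, x2])
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        rss = np.sum((y - X @ w)**2)
        return w, rss

    for corr in [0.0, 0.9, 0.999]:
        w, rss = fit_ls(corr)
        print(f"corr={corr:5.3f}  w={np.round(w, 2)}  RSS={rss:.3f}")
    # As corr approaches 1, the two coefficients tend to blow up and cancel while RSS stays small.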

Linear Regression: what can be done?

Shrinkage methods

• Ridge Regression
• PCR (Principal Components Regression)
• PLS (Partial Least Squares)

Shrinkage methods

Before we proceed:
• Since the following methods are not invariant under input scaling, we will assume the inputs are normalized (mean 0, variance 1):

    x'_{ij} \leftarrow x_{ij} - \bar{x}_j, \qquad x_{ij} \leftarrow \frac{x'_{ij}}{\sigma_j}

• The offset w_0 is always estimated as w_0 = \frac{1}{N} \sum_i y_i, and we will work with centered outputs (meaning y \leftarrow y - w_0).
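A minimal numpy sketch of this preprocessing step (the helper name and the random data are illustrative):

    import numpy as np

    def standardize(X, y):
        """Scale inputs to mean 0 / variance 1 and center the outputs."""
        X_std = (X - X.mean(axis=0)) / X.std(axis=0)
        w0 = y.mean()              # offset estimated as the mean of y
        return X_std, y - w0, w0

    rng = np.random.default_rng(0)
    X = 5.0 * rng.normal(size=(10, 3)) + 2.0
    y = rng.normal(size=10) + 4.0
    X_std, y_centered, w0 = standardize(X, y)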

Ridge Regression

• Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay).
• In ridge regression, we add a quadratic penalty on the weights:

    J(w) = \sum_{i=1}^{N} \left( y_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \right)^2 + \lambda \sum_{j=1}^{M} w_j^2

  where \lambda \ge 0 is a tuning parameter that controls the amount of shrinkage.

• The size constraint prevents the phenomenon of wildly large coefficients that cancel each other.

Ridge Regression Solution

• Ridge regression in matrix form:

    J(w) = (y - Xw)^T (y - Xw) + \lambda w^T w

• The solution is

    \hat{w}^{ridge} = (X^T X + \lambda I_M)^{-1} X^T y    (compare with \hat{w}^{LS} = (X^T X)^{-1} X^T y)
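A minimal numpy sketch of this closed form (the data, correlation level, and lambda are illustrative; the inputs are assumed already standardized and y centered, as above):

    import numpy as np

    def ridge_fit(X, y, lam):
        """Closed-form ridge solution (X^T X + lambda * I)^{-1} X^T y."""
        M = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

    rng = np.random.default_rng(3)
    n = 30
    x1 = rng.normal(0, 1, n)
    x2 = 0.999 * x1 + 0.045 * rng.normal(0, 1, n)   # strongly correlated inputs
    y = 0.8 * x1 + 0.2 * x2 + rng.normal(0, 0.1, n)
    X = np.column_stack([x1, x2])

    print(ridge_fit(X, y, lam=0.0))   # plain LS: coefficients can be huge and cancel
    print(ridge_fit(X, y, lam=1.0))   # ridge: coefficients are shrunk to moderate values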

• The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem nonsingular even if X does not have full column rank.
• For orthogonal inputs, the ridge estimates are scaled versions of the least squares estimates:

    \hat{w}^{ridge} = \gamma \hat{w}^{LS}, \qquad 0 \le \gamma \le 1

Ridge Regression (insights)

The matrix X can be represented by its SVD:

    X = U D V^T

• U is an N × M matrix; its columns span the column space of X.
• D is an M × M diagonal matrix of singular values.
• V is an M × M matrix; its columns span the row space of X.

Let's see how this method looks in the principal component coordinates of X.

Ridge Regression (insights)

    X' = XV, \qquad Xw = (XV)(V^T w) = X' w'

The least squares solution is given by \hat{w}^{ls} = (X^T X)^{-1} X^T y, so

    \hat{w}'^{ls} = V^T \hat{w}^{ls} = V^T (V D U^T U D V^T)^{-1} V D U^T y = D^{-1} U^T y

The ridge regression solution is similarly given by:

    \hat{w}'^{ridge} = V^T \hat{w}^{ridge} = V^T (V D U^T U D V^T + \lambda I)^{-1} V D U^T y
                     = (D^2 + \lambda I)^{-1} D U^T y = (D^2 + \lambda I)^{-1} D^2 \hat{w}'^{ls}

(Note that D^2 + \lambda I is diagonal, so this is a coordinate-wise rescaling.)

Ridge Regression (insights)

    \hat{w}'^{ridge}_j = \frac{d_j^2}{d_j^2 + \lambda} \hat{w}'^{ls}_j

In the PCA axes, the ridge coefficients are just scaled LS coefficients!

The coefficients that correspond to smaller input variance directions are scaled down more.
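A quick numpy check of this identity on random data (the shapes and lambda below are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(4)
    N, M, lam = 50, 3, 2.0
    X = rng.normal(size=(N, M))
    y = rng.normal(size=N)

    U, d, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(d) V^T

    w_ls = np.linalg.solve(X.T @ X, X.T @ y)
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

    # In the principal-component coordinates: ridge_j = d_j^2 / (d_j^2 + lam) * ls_j
    print(Vt @ w_ridge)
    print((d**2 / (d**2 + lam)) * (Vt @ w_ls))           # the two lines should match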

(Illustration omitted: the principal directions of the inputs, with singular values d_1, d_2.)

Ridge Regression

In the following simulations, quantities are plotted versus the quantity

    df(\lambda) = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}

This monotonically decreasing function of \lambda is the effective degrees of freedom of the ridge regression fit.
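A small numpy sketch of this quantity (the singular values below are illustrative):

    import numpy as np

    def effective_df(d, lam):
        """Effective degrees of freedom of a ridge fit with singular values d."""
        return np.sum(d**2 / (d**2 + lam))

    d = np.array([3.0, 1.5, 0.5, 0.1])         # assumed singular values of X
    for lam in [0.0, 0.1, 1.0, 10.0]:
        print(lam, effective_df(d, lam))       # decreases from M toward 0 as lambda grows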

Ridge Regression

Simulation results: 4 dimensions (true weights w = 1, 2, 3, 4).

(Plots omitted: the estimated coefficients w and the RSS versus the effective degrees of freedom, for the no-correlation and strong-correlation cases.)

Ridge regression is MAP with Gaussian prior

    J(w) = -\log P(D | w) P(w)

n  Ν t 2 Ν 2  = −log ∏ (yi | w xi ,σ ) (w ,0| τ )  i=1  1 1 = (y − Xw )t (y − Xw ) + wt w + const 2σ 2 2τ 2

This is the same objective function that ridge regression minimizes, with \lambda = \sigma^2 / \tau^2:

Ridge: J(w) = (y - Xw)^T (y - Xw) + \lambda w^T w
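A quick numerical check of this correspondence (the data, sigma, and tau are illustrative; scipy is used only to minimize the MAP objective directly and compare it with the ridge closed form):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(5)
    n, M = 40, 3
    X = rng.normal(size=(n, M))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.3, n)

    sigma, tau = 0.3, 1.0
    lam = sigma**2 / tau**2

    def map_objective(w):
        # negative log posterior (up to constants): Gaussian likelihood + Gaussian prior
        return (np.sum((y - X @ w)**2) / (2 * sigma**2)
                + np.sum(w**2) / (2 * tau**2))

    w_map = minimize(map_objective, np.zeros(M)).x
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
    print(w_map)       # these two should agree up to optimizer tolerance
    print(w_ridge)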

Lasso

Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by

2  N  M   wˆ lasso = arg min  y − x w  ∑ i ∑ ij j   β  i=1  j =1   M subject to ∑ wj ≤ t j=1 There is a similarity to the ridge regression problem: the

L2 ridge regression penalty is replaced by the L1 lasso penalty.

The L1 penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
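In practice the lasso is usually solved in its equivalent penalized (Lagrangian) form rather than by explicit quadratic programming; the sketch below uses scikit-learn's coordinate-descent solver (the data and the penalty strength alpha are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(6)
    n, M = 60, 6
    X = rng.normal(size=(n, M))
    w_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0])   # a sparse true weight vector
    y = X @ w_true + rng.normal(0, 0.5, n)

    # scikit-learn's Lasso minimizes (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1,
    # the Lagrangian form of the constrained problem above.
    model = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
    print(model.coef_)   # coefficients on the irrelevant inputs are typically driven exactly to zero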

Lasso

If t is chosen to be larger than t_0, where

    t_0 \equiv \sum_{j=1}^{M} |\hat{w}^{ls}_j|,

then the lasso estimate is identical to least squares. On the other hand, for, say, t = t_0/2, the least squares coefficients are shrunk by about 50% on average.

For convenience we will plot the simulation results against the shrinkage factor s:

    s \equiv \frac{t}{t_0} = \frac{t}{\sum_{j=1}^{M} |\hat{w}^{ls}_j|}

Lasso

Simulation results:

(Plots omitted: the estimated coefficients w and the RSS versus the shrinkage factor s, for the no-correlation and strong-correlation cases.)

L2 vs L1 penalties

(Figure omitted: contours of the LS error function together with the constraint regions L1: |\beta_1| + |\beta_2| \le t and L2: \beta_1^2 + \beta_2^2 \le t^2; the lasso solution shown hits the corner where \beta_1 = 0.)

In the lasso, the constraint region has corners; when the solution hits a corner, the corresponding coefficient becomes 0 (when M > 2, possibly more than one).

Problem: Figure 1 plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.