Linear Regression with Shrinkage

Some slides are due to Tommi Jaakkola, MIT AI Lab.

Introduction

• The goal of regression is to make quantitative (real-valued) predictions on the basis of a (vector of) features or attributes.
• Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.

(Example: predicting vehicle fuel efficiency (mpg) from 8 attributes.)

A generic regression problem

• The input attributes are given as fixed-length vectors x ∈ ℝ^M that could come from different sources: inputs, transformations of inputs (log, square root, etc.), or basis functions.
• The outputs are assumed to be real-valued, y ∈ ℝ (with some possible restrictions).

• Given n i.i.d. training samples D = {(x_1, y_1), ..., (x_n, y_n)} from an unknown distribution P(x, y), the goal is to minimize the prediction error (loss) on new examples (x, y) drawn at random from the same P(x, y).
• An example of a loss function is the squared loss:

    L(f(x), y) = (f(x) - y)^2

(Here f(x) is our prediction for x.)

Regression Function

• We need to define a class of functions (the types of predictions we make).

Linear prediction:

    f(x; w_0, w_1) = w_0 + w_1 x

where w_0, w_1 are the parameters we need to estimate.

Linear Regression

• Typically we have a set of training data (x_1, y_1), ..., (x_n, y_n) from which we estimate the parameters w_0, w_1.

• Each x_i = (x_{i1}, ..., x_{iM}) is a vector of measurements for the i-th case,

or

    h(x_i) = {h_0(x_i), ..., h_M(x_i)}  is a basis expansion of x_i.

    y(x) ≈ f(x; w) = w_0 + \sum_{j=1}^{M} w_j h_j(x) = w^T h(x)    (where we define h_0(x) = 1)

Basis Functions

• There are many basis functions we can use, e.g.:
  • Polynomial: h_j(x) = x^{j-1}
  • Radial basis functions: h_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
  • Sigmoidal: h_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)
  • Splines, Fourier bases, wavelets, etc.
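A minimal numpy sketch of such a basis expansion (the RBF centers mu_j and width s below are illustrative choices, not values from the slides):

    import numpy as np

    def design_matrix(x, centers, s=1.0):
        """Map 1-D inputs x to features [h_0(x), h_1(x), ..., h_M(x)] with h_0 = 1."""
        x = np.asarray(x, dtype=float).reshape(-1, 1)                  # n x 1
        rbf = np.exp(-(x - centers.reshape(1, -1))**2 / (2 * s**2))    # n x M radial basis features
        return np.hstack([np.ones((x.shape[0], 1)), rbf])              # prepend the constant h_0(x) = 1

    x = np.linspace(-3, 3, 50)
    centers = np.linspace(-2, 2, 5)          # assumed RBF centers mu_j
    H = design_matrix(x, centers, s=0.8)     # n x (M+1) matrix of h_j(x_i)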

Estimation Criterion

• We need a fitting/estimation criterion to select appropriate values for the parameters w_0, w_1 based on the training set D = {(x_1, y_1), ..., (x_n, y_n)}.
• For example, we can use the empirical loss:

    J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w_0, w_1))^2

  (note: the loss is the same as the one used in evaluation).

Empirical loss: motivation

• Ideally, we would like to find the parameters w_0, w_1 that minimize the expected loss (i.e., with unlimited training data):

    J(w_0, w_1) = E_{(x,y) \sim P}\left[ (y - f(x; w_0, w_1))^2 \right]

  where the expectation is over samples from P(x, y).
• When the number of training examples n is large:

    E_{(x,y) \sim P}\left[ (y - f(x; w_0, w_1))^2 \right] \approx \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w_0, w_1))^2

  (left-hand side: expected loss; right-hand side: empirical loss)
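A small numpy sketch of this approximation (the generating distribution, candidate parameters, and sample sizes are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    w0_true, w1_true, sigma = 1.0, 2.0, 0.5       # assumed data-generating model

    def sample(n):
        x = rng.uniform(-1, 1, n)
        y = w0_true + w1_true * x + rng.normal(0, sigma, n)
        return x, y

    w0, w1 = 0.8, 1.9                              # some candidate parameters
    x_small, y_small = sample(50)                  # a modest training set
    x_big, y_big = sample(200_000)                 # large sample as a proxy for the expectation over P(x, y)

    emp_loss = np.mean((y_small - (w0 + w1 * x_small))**2)
    exp_loss = np.mean((y_big - (w0 + w1 * x_big))**2)
    print(emp_loss, exp_loss)   # the empirical loss fluctuates around the expected loss; agreement improves as n grows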

Linear Regression Estimation

• Minimize the empirical squared loss:

    J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w_0, w_1))^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2

• By setting the derivatives with respect to w_1, w_0 to zero, we get necessary conditions for the "optimal" parameter values:

    \frac{\partial}{\partial w_1} J_n(w_0, w_1) = \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-x_i) = 0

    \frac{\partial}{\partial w_0} J_n(w_0, w_1) = \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-1) = 0

Interpretation

The optimality conditions

    \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-x_i) = 0

    \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-1) = 0

ensure that the prediction error \epsilon_i = y_i - w_0 - w_1 x_i is uncorrelated with any linear function of the inputs.

Linear Regression: Matrix Form

• With n samples, stack the data into

    X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad
    y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad
    w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}

  (in general, X is an n × (M+1) matrix and w ∈ ℝ^{M+1}).

n 1 2 t J n (w0 , w1) = ∑()()()yi − w0 − w1xi = Xw − y Xw − y n i=1 Linear Regression Solution

• By setting the derivatives of (Xw - y)^T (Xw - y) to zero,

    \frac{2}{n} (X^T y - X^T X w) = 0,

we get the solution:

    \hat{w} = (X^T X)^{-1} X^T y
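A minimal numpy sketch of this closed-form solution (the data are randomly generated for illustration; the assumed true parameters are w_0 = 1, w_1 = 2):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = rng.uniform(-1, 1, n)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.3, n)     # assumed true model plus Gaussian noise

    X = np.column_stack([np.ones(n), x])          # n x 2 design matrix with rows (1, x_i)
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)     # solves X^T X w = X^T y, i.e. (X^T X)^{-1} X^T y
    print(w_hat)                                  # approximately [1.0, 2.0]

In practice, np.linalg.lstsq(X, y, rcond=None) is numerically preferable to forming X^T X explicitly.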

• The solution is a linear function of the outputs y.

Statistical view of linear regression

• In a statistical regression model we model both the function and the noise:

    Observed output = function + noise

    y(x) = f(x; w) + \epsilon

  where, e.g., \epsilon \sim N(0, \sigma^2).

• Whatever we cannot capture with our chosen family of functions will be interpreted as noise.

Statistical view of linear regression

• f(x; w) is trying to capture the mean of the observations y given the input x:

    E[y | x] = E[f(x; w) + \epsilon | x] = f(x; w)

• Here E[y | x] is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution of X).

• According to our statistical model

    y(x) = f(x; w) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2),

  the outputs y given x are normally distributed with mean f(x; w) and variance \sigma^2:

    p(y | x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y - f(x; w))^2 \right)

• (We model the uncertainty in the predictions, not just the mean.)

Maximum likelihood estimation

• Given observations D = {(x_1, y_1), ..., (x_n, y_n)}, we find the parameters w that maximize the likelihood of the outputs:

    L(w, \sigma^2) = \prod_{i=1}^{n} p(y_i | x_i, w, \sigma^2)
                   = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 \right)

• Equivalently, maximize the log-likelihood:

    \log L(w, \sigma^2) = n \log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; w))^2

  Maximizing over w amounts to minimizing the sum of squared errors.

Maximum likelihood estimation

• Thus

    w_{MLE} = \arg\min_{w} \sum_{i=1}^{n} (y_i - f(x_i; w))^2

• But the empirical squared loss is

    J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2

• Least-squares linear regression is MLE for Gaussian noise!
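A quick numpy check of this equivalence (data, sigma, and the candidate parameter vectors are illustrative assumptions): for a fixed noise level sigma, the Gaussian negative log-likelihood is an increasing affine function of the squared loss, so the two criteria rank candidate weights identically and share the same minimizer.

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma = 200, 0.5
    x = rng.uniform(-1, 1, n)
    y = 1.0 + 2.0 * x + rng.normal(0, sigma, n)
    X = np.column_stack([np.ones(n), x])

    def sq_loss(w):
        return np.sum((y - X @ w)**2)

    def neg_log_lik(w):
        # Gaussian NLL: n/2 * log(2*pi*sigma^2) + (1 / (2*sigma^2)) * squared loss
        return 0.5 * n * np.log(2 * np.pi * sigma**2) + sq_loss(w) / (2 * sigma**2)

    candidates = [np.array([1.0, 2.0]), np.array([0.5, 1.5]), np.array([1.2, 2.3])]
    print([round(sq_loss(w), 2) for w in candidates])
    print([round(neg_log_lik(w), 2) for w in candidates])   # same ordering, same argmin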

Linear Regression (is it "good"?)

• Simple model
• Straightforward solution

BUT:
• Least squares is not necessarily a good estimator in terms of prediction error.
• The matrix X^T X could be ill-conditioned when:
  • the inputs are correlated,
  • the input dimension is large,
  • the training set is small.

Linear Regression - Example

In this example the output is generated by the model

    y = 0.8 x_1 + 0.2 x_2 + \epsilon

where \epsilon is small white Gaussian noise. Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.

And the results are…

    x_1, x_2 uncorrelated:          RSS ≈ 0.090,   w ≈ (0.81, 0.21)
    x_1, x_2 correlated:            RSS ≈ 0.105,   w ≈ (0.43, 0.57)
    x_1, x_2 strongly correlated:   RSS ≈ 0.073,   w ≈ (30.24, -29.22)

    (Scatter plots of the three training sets are omitted.)

Strong correlation can cause the coefficients to become very large, and they cancel each other out to achieve a good RSS.
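A small numpy sketch that reproduces this effect qualitatively (the correlation levels, noise scale, and sample size are illustrative assumptions, not the exact settings behind the plots above):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 30

    def fit_ls(corr):
        x1 = rng.normal(0, 1, n)
        x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(0, 1, n)   # inputs with correlation ~ corr
        y = 0.8 * x1 + 0.2 * x2 + rng.normal(0, 0.1, n)
        X = np.column_stack([x1, x2])
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        rss = np.sum((y - X @ w)**2)
        return w, rss

    for corr in [0.0, 0.9, 0.999]:
        w, rss = fit_ls(corr)
        print(f"corr={corr:5.3f}  w={np.round(w, 2)}  RSS={rss:.3f}")
    # As corr approaches 1, the two coefficients tend to blow up and cancel while RSS stays small.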

Linear Regression: what can be done?

Shrinkage methods

• Ridge Regression
• PCR (Principal Components Regression)
• PLS (Partial Least Squares)

Shrinkage methods

Before we proceed:
• Since the following methods are not invariant under input scaling, we will assume the inputs are normalized (mean 0, variance 1):

    x'_{ij} \leftarrow x_{ij} - \bar{x}_j, \qquad x_{ij} \leftarrow \frac{x'_{ij}}{\sigma_j}

• The offset w_0 is always estimated as w_0 = \frac{1}{N} \sum_i y_i, and we will work with centered outputs (meaning y \leftarrow y - w_0).
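A minimal numpy sketch of this preprocessing step (the helper name and the random data are illustrative):

    import numpy as np

    def standardize(X, y):
        """Scale inputs to mean 0 / variance 1 and center the outputs."""
        X_std = (X - X.mean(axis=0)) / X.std(axis=0)
        w0 = y.mean()              # offset estimated as the mean of y
        return X_std, y - w0, w0

    rng = np.random.default_rng(0)
    X = 5.0 * rng.normal(size=(10, 3)) + 2.0
    y = rng.normal(size=10) + 4.0
    X_std, y_centered, w0 = standardize(X, y)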

Ridge Regression

• Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay).
• In ridge regression, we add a quadratic penalty on the weights:

    J(w) = \sum_{i=1}^{N} \left( y_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \right)^2 + \lambda \sum_{j=1}^{M} w_j^2

  where \lambda \ge 0 is a tuning parameter that controls the amount of shrinkage.

• The size constraint prevents the phenomenon of wildly large coefficients that cancel each other.

Ridge Regression Solution

• Ridge regression in matrix form:

    J(w) = (y - Xw)^T (y - Xw) + \lambda w^T w

• The solution is

    \hat{w}^{ridge} = (X^T X + \lambda I_M)^{-1} X^T y    (compare with \hat{w}^{LS} = (X^T X)^{-1} X^T y)
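A minimal numpy sketch of this closed form (the data, correlation level, and lambda are illustrative; the inputs are assumed already standardized and y centered, as above):

    import numpy as np

    def ridge_fit(X, y, lam):
        """Closed-form ridge solution (X^T X + lambda * I)^{-1} X^T y."""
        M = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

    rng = np.random.default_rng(3)
    n = 30
    x1 = rng.normal(0, 1, n)
    x2 = 0.999 * x1 + 0.045 * rng.normal(0, 1, n)   # strongly correlated inputs
    y = 0.8 * x1 + 0.2 * x2 + rng.normal(0, 0.1, n)
    X = np.column_stack([x1, x2])

    print(ridge_fit(X, y, lam=0.0))   # plain LS: coefficients can be huge and cancel
    print(ridge_fit(X, y, lam=1.0))   # ridge: coefficients are shrunk to moderate values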

• The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem nonsingular even if X does not have full column rank.
• For orthogonal inputs, the ridge estimates are scaled versions of the least squares estimates:

    \hat{w}^{ridge} = \gamma \hat{w}^{LS}, \qquad 0 \le \gamma \le 1

Ridge Regression (insights)

The matrix X can be represented by its SVD:

    X = U D V^T

• U is an N × M matrix; its columns span the column space of X.
• D is an M × M diagonal matrix of singular values.
• V is an M × M matrix; its columns span the row space of X.

Let's see how this method looks in the principal component coordinates of X.

Ridge Regression (insights)

    X' = XV, \qquad Xw = (XV)(V^T w) = X' w'

The least squares solution is given by \hat{w}^{ls} = (X^T X)^{-1} X^T y, so

    \hat{w}'^{ls} = V^T \hat{w}^{ls} = V^T (V D U^T U D V^T)^{-1} V D U^T y = D^{-1} U^T y

The ridge regression solution is similarly given by:

    \hat{w}'^{ridge} = V^T \hat{w}^{ridge} = V^T (V D U^T U D V^T + \lambda I)^{-1} V D U^T y
                     = (D^2 + \lambda I)^{-1} D U^T y = (D^2 + \lambda I)^{-1} D^2 \hat{w}'^{ls}

(Note that D^2 + \lambda I is diagonal, so this is a coordinate-wise rescaling.)

Ridge Regression (insights)

    \hat{w}'^{ridge}_j = \frac{d_j^2}{d_j^2 + \lambda} \hat{w}'^{ls}_j

In the PCA axes, the ridge coefficients are just scaled LS coefficients!

The coefficients that correspond to smaller input variance directions are scaled down more.
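A quick numpy check of this identity on random data (the shapes and lambda below are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(4)
    N, M, lam = 50, 3, 2.0
    X = rng.normal(size=(N, M))
    y = rng.normal(size=N)

    U, d, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(d) V^T

    w_ls = np.linalg.solve(X.T @ X, X.T @ y)
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

    # In the principal-component coordinates: ridge_j = d_j^2 / (d_j^2 + lam) * ls_j
    print(Vt @ w_ridge)
    print((d**2 / (d**2 + lam)) * (Vt @ w_ls))           # the two lines should match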

(Illustration omitted: the principal directions of the inputs, with singular values d_1, d_2.)

Ridge Regression

In the following simulations, quantities are plotted versus the quantity

    df(\lambda) = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}

This monotonically decreasing function of \lambda is the effective degrees of freedom of the ridge regression fit.
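A small numpy sketch of this quantity (the singular values below are illustrative):

    import numpy as np

    def effective_df(d, lam):
        """Effective degrees of freedom of a ridge fit with singular values d."""
        return np.sum(d**2 / (d**2 + lam))

    d = np.array([3.0, 1.5, 0.5, 0.1])         # assumed singular values of X
    for lam in [0.0, 0.1, 1.0, 10.0]:
        print(lam, effective_df(d, lam))       # decreases from M toward 0 as lambda grows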

Ridge Regression

Simulation results: 4 dimensions (true weights w = 1, 2, 3, 4).

(Plots omitted: the estimated coefficients w and the RSS versus the effective degrees of freedom, for the no-correlation and strong-correlation cases.)

Ridge regression is MAP with Gaussian prior

    J(w) = -\log P(D | w) P(w)

n  Ν t 2 Ν 2  = −log ∏ (yi | w xi ,σ ) (w ,0| τ )  i=1  1 1 = (y − Xw )t (y − Xw ) + wt w + const 2σ 2 2τ 2

This is the same objective function that ridge regression minimizes, with \lambda = \sigma^2 / \tau^2:

Ridge: J(w) = (y - Xw)^T (y - Xw) + \lambda w^T w
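A quick numerical check of this correspondence (the data, sigma, and tau are illustrative; scipy is used only to minimize the MAP objective directly and compare it with the ridge closed form):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(5)
    n, M = 40, 3
    X = rng.normal(size=(n, M))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.3, n)

    sigma, tau = 0.3, 1.0
    lam = sigma**2 / tau**2

    def map_objective(w):
        # negative log posterior (up to constants): Gaussian likelihood + Gaussian prior
        return (np.sum((y - X @ w)**2) / (2 * sigma**2)
                + np.sum(w**2) / (2 * tau**2))

    w_map = minimize(map_objective, np.zeros(M)).x
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
    print(w_map)       # these two should agree up to optimizer tolerance
    print(w_ridge)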

Lasso

Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by

2  N  M   wˆ lasso = arg min  y − x w  ∑ i ∑ ij j   β  i=1  j =1   M subject to ∑ wj ≤ t j=1 There is a similarity to the ridge regression problem: the

L2 ridge regression penalty is replaced by the L1 lasso penalty.

The L1 penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
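In practice the lasso is usually solved in its equivalent penalized (Lagrangian) form rather than by explicit quadratic programming; the sketch below uses scikit-learn's coordinate-descent solver (the data and the penalty strength alpha are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(6)
    n, M = 60, 6
    X = rng.normal(size=(n, M))
    w_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0])   # a sparse true weight vector
    y = X @ w_true + rng.normal(0, 0.5, n)

    # scikit-learn's Lasso minimizes (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1,
    # the Lagrangian form of the constrained problem above.
    model = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
    print(model.coef_)   # coefficients on the irrelevant inputs are typically driven exactly to zero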

Lasso

If t is chosen to be larger than t_0, where

    t_0 \equiv \sum_{j=1}^{M} |\hat{w}^{ls}_j|,

then the lasso estimate is identical to least squares. On the other hand, for, say, t = t_0/2, the least squares coefficients are shrunk by about 50% on average.

For convenience we will plot the simulation results against the shrinkage factor s:

    s \equiv \frac{t}{t_0} = \frac{t}{\sum_{j=1}^{M} |\hat{w}^{ls}_j|}

Lasso

Simulation results:

(Plots omitted: the estimated coefficients w and the RSS versus the shrinkage factor s, for the no-correlation and strong-correlation cases.)

L2 vs L1 penalties

(Figure omitted: contours of the LS error function together with the constraint regions L1: |\beta_1| + |\beta_2| \le t and L2: \beta_1^2 + \beta_2^2 \le t^2; the lasso solution shown hits the corner where \beta_1 = 0.)

In the lasso, the constraint region has corners; when the solution hits a corner, the corresponding coefficient becomes 0 (when M > 2, possibly more than one).

Problem: Figure 1 plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.