
Linear, Ridge Regression, and Principal Component Analysis


Jia Li

Department of Statistics, The Pennsylvania State University

Email: [email protected] http://www.stat.psu.edu/∼jiali


Introduction to Regression

- Input vector: $X = (X_1, X_2, \ldots, X_p)$.
- Output $Y$ is real-valued.
- Predict $Y$ from $X$ by $f(X)$ so that the expected loss $E(L(Y, f(X)))$ is minimized.
- Squared loss: $L(Y, f(X)) = (Y - f(X))^2$.
- The optimal predictor is

$$f^*(X) = \arg\min_{f(X)} E(Y - f(X))^2 = E(Y \mid X).$$

- The function $E(Y \mid X)$ is the regression function.
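
A minimal simulation sketch of this fact, using a purely illustrative toy model $Y = \sin(X) + \epsilon$ with $\epsilon \sim N(0, 0.25)$; the regression function $E(Y \mid X) = \sin(X)$ attains (up to Monte Carlo error) the smallest expected squared loss:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy model (illustrative assumption): Y = sin(X) + eps, eps ~ N(0, 0.25)
    N = 100_000
    X = rng.uniform(-3, 3, size=N)
    Y = np.sin(X) + rng.normal(scale=0.5, size=N)

    def expected_loss(f):
        """Monte Carlo estimate of E(Y - f(X))^2."""
        return np.mean((Y - f(X)) ** 2)

    print(expected_loss(np.sin))                      # ~0.25 = Var(eps): the regression function
    print(expected_loss(lambda x: 0.8 * np.sin(x)))   # any other predictor does worse
    print(expected_loss(lambda x: np.zeros_like(x)))  # worse still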


Example

The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by $Y$, is expected to be related to total population ($X_1$, measured in thousands), land area ($X_2$, measured in square miles), and total personal income ($X_3$, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.

     i:      1      2      3   ...    139    140    141
    X1:   9387   7031   7017   ...    233    232    231
    X2:   1348   4069   3719   ...   1011    813    654
    X3:  72100  52737  54542   ...   1337   1589   1148
     Y:  25627  15389  13326   ...    264    371    140

Goal: Predict $Y$ from $X_1$, $X_2$, and $X_3$.


Linear Methods

- The model:

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j.$$

- What if the model is not true?
  - It is a good approximation.
  - Because of the lack of training data and/or smarter algorithms, it is the most we can extract robustly from the data.

- Comments on $X_j$:
  - Quantitative inputs.
  - Transformations of quantitative inputs, e.g., $\log(\cdot)$, $\sqrt{\cdot}$.
  - Basis expansions: $X_2 = X_1^2$, $X_3 = X_1^3$, $X_3 = X_1 \cdot X_2$.


Estimation

- The problem of finding the regression function $E(Y \mid X)$ reduces to estimating $\beta_j$, $j = 0, 1, \ldots, p$.

- Training data:

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\},$$

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$.

- Denote $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$.
- The loss function $E(Y - f(X))^2$ is approximated by the empirical loss $\mathrm{RSS}(\beta)/N$:

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2 = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2.$$


Notation

- The input matrix $\mathbf{X}$ of dimension $N \times (p+1)$:

$$\mathbf{X} = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N,1} & x_{N,2} & \cdots & x_{N,p} \end{pmatrix}$$

- Output vector $\mathbf{y}$:

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$


- The estimated $\beta$ is $\hat{\beta}$.
- The fitted values at the training inputs:

$$\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} x_{ij}\hat{\beta}_j$$

and

$$\hat{\mathbf{y}} = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{pmatrix}$$


Point Estimate

- The least squares estimate of $\beta$ is

$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.$$

- The fitted value vector is

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.$$

- Hat matrix:

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T.$$
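
A minimal numpy sketch of these formulas; the simulated data, dimensions, and variable names are assumptions for illustration (this is not the SMSA data):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: N observations, p predictors, plus an intercept column.
    N, p = 141, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1)
    beta_true = np.array([-1.0, 0.3, -0.02, 0.25])
    y = X @ beta_true + rng.normal(size=N)

    # Least squares estimate: beta_hat = (X^T X)^{-1} X^T y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Hat matrix H = X (X^T X)^{-1} X^T and fitted values y_hat = X beta_hat = H y
    H = X @ np.linalg.solve(X.T @ X, X.T)
    y_hat = X @ beta_hat

    print(beta_hat)
    print(np.allclose(y_hat, H @ y))   # True: both give the projection of y
    print(np.sum((y - y_hat) ** 2))    # RSS(beta_hat)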


Geometric Interpretation

- Each column of $\mathbf{X}$ is a vector in an $N$-dimensional space (NOT the $p$-dimensional feature vector space):

$$\mathbf{X} = (\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_p)$$

- The fitted output vector $\hat{\mathbf{y}}$ is a linear combination of the column vectors $\mathbf{x}_j$, $j = 0, 1, \ldots, p$.
- $\hat{\mathbf{y}}$ lies in the subspace spanned by $\mathbf{x}_j$, $j = 0, 1, \ldots, p$.
- $\mathrm{RSS}(\hat{\beta}) = \|\mathbf{y} - \hat{\mathbf{y}}\|^2$.
- $\mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to the subspace, i.e., $\hat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto the subspace.
- The geometric interpretation is very helpful for understanding coefficient shrinkage and subset selection.


Example Results for the SMSA Problem

- $\hat{Y}_i = -143.89 + 0.341\,X_{i1} - 0.0193\,X_{i2} + 0.255\,X_{i3}$.
- $\mathrm{RSS}(\hat{\beta}) = 52942336$.


If the Linear Model Is True

- $E(Y \mid X) = \beta_0 + \sum_{j=1}^{p} X_j\beta_j$.
- The least squares estimate of $\beta$ is unbiased:

$$E(\hat{\beta}_j) = \beta_j, \quad j = 0, 1, \ldots, p.$$

- To draw inferences about $\beta$, further assume

$$Y = E(Y \mid X) + \epsilon,$$

where $\epsilon \sim N(0, \sigma^2)$ and is independent of $X$.
- The $X_{ij}$ are regarded as fixed; the $Y_i$ are random due to $\epsilon$.
- Estimation accuracy: $\mathrm{Var}(\hat{\beta}) = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2$.
- Under these assumptions, $\hat{\beta} \sim N\big(\beta, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\big)$.
- Confidence intervals can be computed and significance tests can be performed.
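
A sketch of how these quantities might be computed, continuing the simulated setup above; the plug-in estimate of $\sigma^2$ (with $N - p - 1$ degrees of freedom) and the $t$-based intervals follow standard linear-model practice, and the data are again illustrative:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    N, p = 141, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    beta_true = np.array([-1.0, 0.3, -0.02, 0.25])
    y = X @ beta_true + rng.normal(size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat

    # Unbiased estimate of sigma^2, then Var(beta_hat) = (X^T X)^{-1} sigma^2
    sigma2_hat = resid @ resid / (N - p - 1)
    cov_beta = np.linalg.inv(X.T @ X) * sigma2_hat
    se = np.sqrt(np.diag(cov_beta))

    # 95% confidence intervals from the t distribution with N - p - 1 df
    t = stats.t.ppf(0.975, df=N - p - 1)
    for j, (b, s) in enumerate(zip(beta_hat, se)):
        print(f"beta_{j}: {b:.3f}   95% CI: [{b - t * s:.3f}, {b + t * s:.3f}]")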


Gauss-Markov Theorem

- Assume the linear model is true.
- For any linear combination of the parameters $\beta_0, \ldots, \beta_p$, denoted by $\theta = a^T\beta$, $a^T\hat{\beta}$ is an unbiased estimate since $\hat{\beta}$ is unbiased.
- The estimate of $\theta$ is

$$\hat{\theta} = a^T\hat{\beta} = a^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \triangleq \tilde{a}^T\mathbf{y},$$

which is linear in $\mathbf{y}$.


- Suppose $c^T\mathbf{y}$ is another unbiased linear estimate of $\theta$, i.e., $E(c^T\mathbf{y}) = \theta$.
- The least squares estimate achieves the minimum variance among all linear unbiased estimates:

$$\mathrm{Var}(\tilde{a}^T\mathbf{y}) \leq \mathrm{Var}(c^T\mathbf{y}).$$

- The $\beta_j$, $j = 0, 1, \ldots, p$, are special cases of $a^T\beta$, where $a$ has only one non-zero element, equal to 1.


Subset Selection and Coefficient Shrinkage

- Biased estimation may yield better prediction accuracy.
- Squared loss, assuming $\hat{\beta} \sim N(1, 1)$: $E(\hat{\beta} - 1)^2 = \mathrm{Var}(\hat{\beta}) = 1$. For $\tilde{\beta} = \hat{\beta}/a$, $a \geq 1$,

$$E(\tilde{\beta} - 1)^2 = \mathrm{Var}(\tilde{\beta}) + \big(E(\tilde{\beta}) - 1\big)^2 = \frac{1}{a^2} + \Big(\frac{1}{a} - 1\Big)^2.$$

- Practical consideration: interpretation. Sometimes we are not satisfied with a “black box”.


Assume $\hat{\beta} \sim N(1, 1)$. The squared error loss is reduced by shrinking the estimate.
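
For instance, evaluating the expression above at a few values of $a$:

$$a = 1:\; 1 + 0 = 1, \qquad a = 2:\; \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}, \qquad a = 4:\; \tfrac{1}{16} + \tfrac{9}{16} = \tfrac{5}{8}.$$

Any $a > 1$ reduces the loss below 1 (the loss of the unbiased estimate), and setting the derivative of $1/a^2 + (1/a - 1)^2$ to zero shows that the minimum value $\tfrac{1}{2}$ is attained at $a = 2$.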


Subset Selection

- To choose $k$ predictor variables from the total of $p$ variables, search for the subset yielding the minimum $\mathrm{RSS}(\hat{\beta})$.
- Forward stepwise selection: start with the intercept, then sequentially add to the model the predictor that most improves the fit (see the sketch below).
- Backward stepwise selection: start with the full model, and sequentially delete predictors.
- How to choose $k$: stop forward or backward stepwise selection when no predictor produces an $F$-ratio greater than a threshold.
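
A rough sketch of forward stepwise selection along these lines; the function names, the exact form of the F-to-enter statistic, and the threshold value 4.0 are illustrative choices, not prescribed by the slides:

    import numpy as np

    def rss_of(design, y):
        """RSS of the least squares fit of y on the columns of design."""
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        r = y - design @ beta
        return r @ r

    def forward_stepwise(X, y, f_threshold=4.0):
        """Start with the intercept; repeatedly add the predictor that most reduces
        RSS; stop when its F-to-enter ratio falls below f_threshold."""
        N, p = X.shape
        selected, remaining = [], list(range(p))
        design = np.ones((N, 1))               # intercept-only model
        rss_cur = rss_of(design, y)
        while remaining:
            cand = {j: rss_of(np.column_stack([design, X[:, j]]), y) for j in remaining}
            best = min(cand, key=cand.get)
            df = N - (len(selected) + 1) - 1   # residual df of the enlarged model
            f_ratio = (rss_cur - cand[best]) / (cand[best] / df)
            if f_ratio < f_threshold:
                break
            selected.append(best)
            remaining.remove(best)
            design = np.column_stack([design, X[:, best]])
            rss_cur = cand[best]
        return selected

For the SMSA example, one would call forward_stepwise on the 141 x 3 predictor matrix; backward stepwise selection works analogously, deleting the predictor with the smallest F-ratio at each step.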


Ridge Regression

Centered inputs

- Suppose the means $\bar{x}_j$, $j = 1, \ldots, p$, are removed from the inputs, i.e., the inputs are centered.
- $\hat{\beta}_0 = \bar{y} = \sum_{i=1}^{N} y_i/N$.
- If we also remove the mean of the $y_i$, we can assume

$$E(Y \mid X) = \sum_{j=1}^{p}\beta_j X_j.$$

- The input matrix $\mathbf{X}$ then has $p$ (rather than $p+1$) columns.
- $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
- $\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$


Singular Value Decomposition (SVD)

- If the column vectors of $\mathbf{X}$ are orthonormal, i.e., the variables $X_j$, $j = 1, 2, \ldots, p$, are uncorrelated and have unit norm, then the $\hat{\beta}_j$ are simply the coordinates of $\mathbf{y}$ with respect to the orthonormal basis formed by the columns of $\mathbf{X}$.
- In general,

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T.$$

  - $\mathbf{U} = (\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p)$ is an $N \times p$ orthogonal matrix; $\mathbf{u}_j$, $j = 1, \ldots, p$, form an orthonormal basis for the space spanned by the column vectors of $\mathbf{X}$.
  - $\mathbf{V} = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_p)$ is a $p \times p$ orthogonal matrix; $\mathbf{v}_j$, $j = 1, \ldots, p$, form an orthonormal basis for the space spanned by the row vectors of $\mathbf{X}$.
  - $\mathbf{D} = \mathrm{diag}(d_1, d_2, \ldots, d_p)$, where $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$ are the singular values of $\mathbf{X}$.
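
A minimal numpy check of the decomposition, using randomly generated centered inputs (illustrative data):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 141, 3
    X = rng.normal(size=(N, p))
    X = X - X.mean(axis=0)                       # centered inputs

    # Thin SVD: U is N x p, d holds d_1 >= ... >= d_p >= 0, Vt is p x p.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)

    print(np.allclose(X, U @ np.diag(d) @ Vt))   # X = U D V^T
    print(np.allclose(U.T @ U, np.eye(p)))       # columns of U are orthonormal
    print(np.allclose(Vt @ Vt.T, np.eye(p)))     # columns of V are orthonormal
    print(d)                                     # singular values, in decreasing order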


Principal Components

- The sample covariance matrix of the (centered) inputs is

$$\mathbf{S} = \mathbf{X}^T\mathbf{X}/N.$$

- Eigen decomposition of $\mathbf{X}^T\mathbf{X}$:

$$\mathbf{X}^T\mathbf{X} = (\mathbf{U}\mathbf{D}\mathbf{V}^T)^T(\mathbf{U}\mathbf{D}\mathbf{V}^T) = \mathbf{V}\mathbf{D}\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{V}^T = \mathbf{V}\mathbf{D}^2\mathbf{V}^T.$$

- The eigenvectors of $\mathbf{X}^T\mathbf{X}$, $\mathbf{v}_j$, are called the principal component directions of $\mathbf{X}$.


- It is easy to see that $\mathbf{z}_j = \mathbf{X}\mathbf{v}_j = \mathbf{u}_j d_j$. Hence $\mathbf{u}_j$, scaled by $d_j$, is simply the projection of the row vectors of $\mathbf{X}$, i.e., the input predictor vectors, onto the direction $\mathbf{v}_j$. For example,

$$\mathbf{z}_1 = \begin{pmatrix} X_{1,1}v_{1,1} + X_{1,2}v_{1,2} + \cdots + X_{1,p}v_{1,p} \\ X_{2,1}v_{1,1} + X_{2,2}v_{1,2} + \cdots + X_{2,p}v_{1,p} \\ \vdots \\ X_{N,1}v_{1,1} + X_{N,2}v_{1,2} + \cdots + X_{N,p}v_{1,p} \end{pmatrix}$$

- The principal components of $\mathbf{X}$ are $\mathbf{z}_j = d_j\mathbf{u}_j$, $j = 1, \ldots, p$.
- The first principal component of $\mathbf{X}$, $\mathbf{z}_1$, has the largest sample variance among all normalized linear combinations of the columns of $\mathbf{X}$:

$$\mathrm{Var}(\mathbf{z}_1) = d_1^2/N.$$

- Subsequent principal components $\mathbf{z}_j$ have maximum variance $d_j^2/N$, subject to being orthogonal to the earlier ones.
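
A short sketch relating the SVD to the principal components and their sample variances; the correlated toy inputs are an illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 141, 3
    A = np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 0.3],
                  [0.0, 0.0, 0.2]])
    X = rng.normal(size=(N, p)) @ A              # correlated toy inputs
    X = X - X.mean(axis=0)                       # centered

    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T

    Z = X @ V                                    # principal components z_j = X v_j
    print(np.allclose(Z, U * d))                 # z_j = d_j u_j
    print(np.allclose(Z.var(axis=0), d**2 / N))  # sample variances d_j^2 / N, decreasing

    # The v_j are also eigenvectors of X^T X, with eigenvalues d_j^2.
    eigvals = np.linalg.eigvalsh(X.T @ X)
    print(np.allclose(np.sort(eigvals)[::-1], d**2))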


Ridge Regression

- Minimize a penalized residual sum of squares:

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}.$$

- Equivalently,

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \leq s.$$

- $\lambda$ or $s$ controls the model complexity.


Solution

- With centered inputs,

$$\mathrm{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\beta^T\beta,$$

and

$$\hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}.$$

- A solution exists even when $\mathbf{X}^T\mathbf{X}$ is singular, i.e., has zero eigenvalues.
- When $\mathbf{X}^T\mathbf{X}$ is ill-conditioned (nearly singular), the ridge regression solution is more robust.
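
A minimal sketch of this closed-form solution with simulated centered data; the data and the values of $\lambda$ are illustrative:

    import numpy as np

    def ridge(X, y, lam):
        """Ridge estimate (X^T X + lam I)^{-1} X^T y for centered X and y."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(0)
    N, p = 141, 3
    X = rng.normal(size=(N, p))
    X = X - X.mean(axis=0)
    y = X @ np.array([0.3, -0.02, 0.25]) + rng.normal(size=N)
    y = y - y.mean()

    print(ridge(X, y, lam=0.0))    # lambda = 0 recovers ordinary least squares
    print(ridge(X, y, lam=10.0))   # coefficients shrunk toward zero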


Geometric Interpretation

- Center the inputs.
- Consider the fitted response:

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\beta}^{\mathrm{ridge}} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{U}\mathbf{D}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y} = \sum_{j=1}^{p}\mathbf{u}_j\,\frac{d_j^2}{d_j^2 + \lambda}\,\mathbf{u}_j^T\mathbf{y},$$

where $\mathbf{u}_j$ are the normalized principal components of $\mathbf{X}$.
- Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components.
- A coordinate along a principal component with smaller variance is shrunk more.


- Instead of using $X = (X_1, X_2, \ldots, X_p)$ as predictor variables, use the transformed variables $(X\mathbf{v}_1, X\mathbf{v}_2, \ldots, X\mathbf{v}_p)$ as predictors.
- The input matrix becomes $\tilde{\mathbf{X}} = \mathbf{U}\mathbf{D}$ (note $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$).
- Then, for the new inputs,

$$\hat{\beta}_j^{\mathrm{ridge}} = \frac{d_j}{d_j^2 + \lambda}\,\mathbf{u}_j^T\mathbf{y}, \qquad \mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{d_j^2},$$

where $\sigma^2$ is the variance of the error term $\epsilon$ in the linear model.
- The factor of shrinkage given by ridge regression is

$$\frac{d_j^2}{d_j^2 + \lambda}.$$
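
A short sketch checking the shrinkage factors and the SVD expression for the ridge fit given above; the simulated data and $\lambda = 10$ are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p, lam = 141, 3, 10.0
    X = rng.normal(size=(N, p))
    X = X - X.mean(axis=0)
    y = rng.normal(size=N)
    y = y - y.mean()

    U, d, Vt = np.linalg.svd(X, full_matrices=False)

    # Shrinkage factor applied to each principal-component coordinate
    print(d**2 / (d**2 + lam))

    # Ridge fit computed directly and via sum_j u_j d_j^2/(d_j^2 + lam) u_j^T y
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    yhat_direct = X @ beta_ridge
    yhat_svd = U @ ((d**2 / (d**2 + lam)) * (U.T @ y))
    print(np.allclose(yhat_direct, yhat_svd))    # True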


The geometric interpretation of principal components and shrinkage by ridge regression.


Compare the squared loss $E(\beta_j - \hat{\beta}_j)^2$:

- Without shrinkage: $\sigma^2/d_j^2$.
- With shrinkage: $\mathrm{Bias}^2 + \mathrm{Variance}$:

$$\Big(\beta_j - \beta_j\cdot\frac{d_j^2}{d_j^2+\lambda}\Big)^2 + \frac{\sigma^2}{d_j^2}\cdot\Big(\frac{d_j^2}{d_j^2+\lambda}\Big)^2 = \frac{\sigma^2}{d_j^2}\cdot\frac{d_j^2\big(d_j^2 + \lambda^2\,\beta_j^2/\sigma^2\big)}{(d_j^2+\lambda)^2}.$$

- Consider the ratio between the squared losses with and without shrinkage:

$$\frac{d_j^2\big(d_j^2 + \lambda^2\,\beta_j^2/\sigma^2\big)}{(d_j^2+\lambda)^2}.$$


The ratio between the squared loss with and without shrinkage. The amount of shrinkage is set by $\lambda = 1.0$. The four curves correspond to $\beta^2/\sigma^2 = 0.5, 1.0, 2.0, 4.0$. When $\beta^2/\sigma^2 = 0.5, 1.0, 2.0$, shrinkage always leads to lower squared loss. When $\beta^2/\sigma^2 = 4.0$, shrinkage leads to lower squared loss when $d_j \leq 0.71$. Shrinkage is more beneficial when $d_j$ is small.
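
As a check on the 0.71 figure, set the ratio above equal to 1 with $\lambda = 1$ and $\beta_j^2/\sigma^2 = 4$:

$$\frac{d_j^2(d_j^2 + 4)}{(d_j^2 + 1)^2} = 1 \;\Longleftrightarrow\; d_j^4 + 4d_j^2 = d_j^4 + 2d_j^2 + 1 \;\Longleftrightarrow\; d_j^2 = \tfrac{1}{2},$$

i.e., $d_j = 1/\sqrt{2} \approx 0.71$.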

Principal Components Regression (PCR)

- Instead of smoothly shrinking the coordinates on the principal components, PCR either does not shrink a coordinate at all or shrinks it to zero.
- Principal components regression forms the derived input columns $\mathbf{z}_m = \mathbf{X}\mathbf{v}_m$, and then regresses $\mathbf{y}$ on $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_M$ for some $M \leq p$ (see the sketch below).
- Principal components regression discards the $p - M$ components with the smallest eigenvalues.
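
A rough sketch of PCR along these lines; the helper name pcr and the simulated data are illustrative, and X and y are assumed centered:

    import numpy as np

    def pcr(X, y, M):
        """Regress y on the first M principal components z_m = X v_m, then express
        the result as coefficients on the original (centered) inputs."""
        U, d, Vt = np.linalg.svd(X, full_matrices=False)
        V = Vt.T
        Z = X @ V[:, :M]                   # derived inputs z_1, ..., z_M
        theta = (Z.T @ y) / (d[:M] ** 2)   # LS coefficients on the orthogonal z_m
        return V[:, :M] @ theta            # back to coefficients on X_1, ..., X_p

    rng = np.random.default_rng(0)
    N, p = 141, 3
    X = rng.normal(size=(N, p))
    X = X - X.mean(axis=0)
    y = X @ np.array([0.3, -0.02, 0.25]) + rng.normal(size=N)
    y = y - y.mean()

    print(pcr(X, y, M=p))   # M = p recovers ordinary least squares
    print(pcr(X, y, M=2))   # discards the smallest-eigenvalue component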


The Lasso

- The lasso estimate is defined by

$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \leq s.$$

- Comparison with ridge regression: the $L_2$ penalty $\sum_{j=1}^{p}\beta_j^2$ is replaced by the $L_1$ lasso penalty $\sum_{j=1}^{p}|\beta_j|$.
- Some of the coefficients may be shrunk to exactly zero.
- Orthonormal columns in $\mathbf{X}$ are assumed in the following figure.
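
A minimal sketch contrasting the two penalties in the orthonormal case; for the penalized objectives $\mathrm{RSS}(\beta) + \lambda\sum_j\beta_j^2$ and $\mathrm{RSS}(\beta) + \lambda\sum_j|\beta_j|$ with $\mathbf{X}^T\mathbf{X} = \mathbf{I}$, ridge rescales the least squares coefficients while the lasso soft-thresholds them (the data and $\lambda$ are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # Orthonormal design: columns of X satisfy X^T X = I.
    N, p = 100, 5
    X, _ = np.linalg.qr(rng.normal(size=(N, p)))
    beta_true = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
    y = X @ beta_true + rng.normal(scale=0.5, size=N)

    beta_ls = X.T @ y        # least squares, since X^T X = I
    lam = 1.0

    beta_ridge = beta_ls / (1.0 + lam)   # ridge: every coefficient rescaled, none exactly zero
    beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2.0, 0.0)  # soft-threshold

    print(beta_ls)
    print(beta_ridge)
    print(beta_lasso)        # small coefficients are set exactly to zero

The soft threshold of $\lambda/2$ corresponds to writing the objective with the full RSS; different texts absorb the factor of 2 into $\lambda$ differently.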
