
Lecture 4: Regression Methods I

Wenbin Lu

Department of Statistics, North Carolina State University

Fall 2019

Outline

1. Regression: Supervised Learning with Continuous Responses
2. Linear Models and Multiple Linear Regression
   - Statistical inferences
   - Computational algorithms

Regression Models

If the response Y takes real values, we refer to this type of supervised learning problem as a regression problem.
- Linear regression models: parametric models
- Nonparametric regression: splines, kernel estimators, local polynomial regression
- Semiparametric regression
- Broad coverage: penalized regression, regression trees, support vector regression, quantile regression

Linear Regression Models

A standard linear regression model assumes

y_i = x_i^T β + ε_i,   ε_i i.i.d., E(ε_i) = 0, Var(ε_i) = σ^2,

- y_i is the response for the ith observation; x_i ∈ R^d is the covariate vector
- β ∈ R^d is the d-dimensional parameter vector
Common model assumptions:
- independence of errors
- constant error variance (homoscedasticity)
- ε independent of X
Normality is not needed.

About Linear Models

Linear models have been a mainstay of statistics for the past 30 years and remain one of the most important tools. The covariates may come from different sources:
- quantitative inputs
- dummy coding of qualitative inputs
- transformed inputs: log(X), X^2, √X, ...
- basis expansions: X1, X1^2, X1^3, ... (polynomial representation)
- interactions between variables: X1 X2, ...
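All of these constructions keep the model linear in β, so in practice they just add columns to the design matrix. A minimal numpy sketch (the variables income, group, x1, x2 below are made-up illustrations, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
income = rng.lognormal(mean=10, sigma=0.5, size=n)   # quantitative input
group = rng.integers(0, 3, size=n)                   # qualitative input with 3 levels
x1, x2 = rng.normal(size=n), rng.normal(size=n)

# Dummy coding of the qualitative input (level 0 is the baseline)
dummies = np.column_stack([(group == k).astype(float) for k in (1, 2)])

# Design matrix: intercept, transformed input, polynomial basis, dummies, interaction.
# Every column is a fixed function of the inputs, so the model stays linear in beta.
X = np.column_stack([
    np.ones(n),           # intercept
    np.log(income),       # transformed input log(X)
    x1, x1**2, x1**3,     # basis expansion (polynomial in x1)
    dummies,              # dummy-coded qualitative input
    x1 * x2,              # interaction between variables
])
print(X.shape)            # (100, 8)
```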

Review on Matrix Theory (I) - Notations

A is an m × m matrix. col(A): the subspace of R^m spanned by the columns of A.

I_m is the identity matrix of size m.

Vectors: x is a nonzero m × 1 vector; 0 is the m × 1 zero vector.

e_i, i = 1, ..., m, is the m × 1 unit vector with 1 in the ith position and zeros elsewhere.

The ith column of A can be expressed as A e_i, for i = 1, ..., m.

Basic Concepts

The determinant of A is det(A) = |A|. The trace of A, tr(A), is the sum of the diagonal elements of A. The roots of the mth-degree polynomial equation in λ,

|λIm − A| = 0,

denoted by λ_1, ..., λ_m, are called the eigenvalues of A.

The collection {λ1, ··· , λm} is called the spectrum of A.

Any nonzero m × 1 vector x_i ≠ 0 such that

Axi = λi xi

is an eigenvector of A corresponding to the eigenvalue λi .


Let B be another m × m matrix, then

|AB| = |A||B|, tr(AB) = tr(BA).

A is symmetric if A^T = A.

Review on Matrix Theory (II)

The following are equivalent:
- |A| ≠ 0
- rank(A) = m
- A^{-1} exists

Linear transformation: Ax generates a vector in col(A)

Orthogonal Matrices

An m × m matrix P is called an orthogonal matrix if

P P^T = P^T P = I_m, or P^{-1} = P^T.

- If P is an orthogonal matrix, then |PP^T| = |P||P^T| = |P|^2 = |I_m| = 1, so |P| = ±1.
- For any A, we have tr(PAP^T) = tr(AP^T P) = tr(A).
- PAP^T and A have the same eigenvalues, since

|λI_m − PAP^T| = |λPP^T − PAP^T| = |P|^2 |λI_m − A| = |λI_m − A|.

Spectral Decomposition of Symmetric Matrices

If A is symmetric, there exists an orthogonal matrix P such that

P^T A P = Λ = diag{λ_1, ..., λ_m},

λi ’s are the eigenvalues of A. The eigenvectors of A are the column vectors of P.

Denote the ith column of P by p_i. Then
P P^T = Σ_{i=1}^m p_i p_i^T = I_m.
The spectral decomposition of A is
A = P Λ P^T = Σ_{i=1}^m λ_i p_i p_i^T.
tr(A) = tr(Λ) = Σ_{i=1}^m λ_i and |A| = |Λ| = Π_{i=1}^m λ_i.

Idempotent Matrices

An m × m matrix A is idempotent if

A2 = AA = A.

The eigenvalues of an idempotent matrix are either zero or one:

λx = Ax = A(Ax) = A(λx) = λ^2 x,

⟹ λ = λ^2.

If A is idempotent, so is Im − A.


A symmetric, idempotent matrix A is called a projection matrix.

If A is symmetric and idempotent with rank(A) = r, then
- A has r eigenvalues equal to 1 and m − r eigenvalues equal to 0
- tr(A) = rank(A)

Im − A is also symmetric idempotent, of rank m − r.

Projection Matrices

Given x ∈ R^m, define y = Ax and z = (I − A)x = x − y. Then y ⊥ z.
- y is the orthogonal projection of x onto the subspace col(A).
- z = (I − A)x is the orthogonal projection of x onto the complementary (orthogonal) subspace, so that

x = y + z = Ax + (I − A)x.

Matrix Notations for Linear Regression

- The response vector y = (y_1, ..., y_n)^T.
- The design matrix X. Assume the first column of X is 1; the dimension of X is n × (1 + d).
- The regression coefficients β = (β_0, β_1^T)^T.
- The error vector ε = (ε_1, ..., ε_n)^T.
The linear model is written as

y = Xβ + ε

The estimated coefficients: β̂. The predicted response: ŷ = Xβ̂.

Ordinary Least Squares (OLS)

The most popular method for fitting the linear model is the ordinary least squares (OLS):

min_β RSS(β) = (y − Xβ)^T (y − Xβ).

- Normal equations: X^T(y − Xβ) = 0.
- β̂ = (X^T X)^{-1} X^T y and ŷ = X(X^T X)^{-1} X^T y.
- Residual vector: r = y − ŷ = (I − P_X)y.
- Residual sum of squares: RSS = r^T r.
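A minimal numpy sketch of these OLS quantities on simulated data (the dimensions, coefficients, and noise level are illustrative assumptions, not from the lecture). It solves the normal equations rather than forming (X^T X)^{-1} explicitly, and checks the answer against numpy's least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # first column is 1
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations X^T X beta = X^T y, solved without an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat            # fitted values
r = y - y_hat                   # residual vector
RSS = r @ r                     # residual sum of squares

# Matches numpy's least-squares solver
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]), RSS)
```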

Projection Matrix

Call the following square matrix the projection or hat matrix:

P_X = X (X^T X)^{-1} X^T.

Properties:
- symmetric and non-negative definite
- idempotent: P_X^2 = P_X; the eigenvalues are 0's and 1's
- P_X X = X, (I − P_X)X = 0
We have
r = (I − P_X)y, RSS = y^T (I − P_X) y.
Note that
X^T r = X^T (I − P_X) y = 0.
The residual vector is orthogonal to the column space spanned by X, col(X).
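A quick numerical check of these properties (same kind of simulated design; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix P_X = X (X^T X)^{-1} X^T

print(np.allclose(P, P.T))               # symmetric
print(np.allclose(P @ P, P))             # idempotent
eig = np.sort(np.linalg.eigvalsh(P))
print(np.allclose(eig[-(d + 1):], 1.0),  # d + 1 eigenvalues equal to 1
      np.allclose(eig[:-(d + 1)], 0.0))  # the remaining n - d - 1 equal to 0
print(np.allclose(P @ X, X))             # P_X X = X

r = (np.eye(n) - P) @ y                  # residual vector
print(np.allclose(X.T @ r, 0.0))         # residuals orthogonal to col(X)
```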

[Figure: geometry of least squares — the response vector y is projected orthogonally onto the plane spanned by x1 and x2, giving the fitted vector ŷ.]

Sampling Properties of β̂

Var(β̂) = σ^2 (X^T X)^{-1}. The variance σ^2 can be estimated as

σˆ2 = RSS/(n − d − 1).

This is an unbiased estimator, i.e., E(σ̂^2) = σ^2.

Inferences for Gaussian Errors

Under the Normal assumption on the error ε, we have
- β̂ ∼ N(β, σ^2 (X^T X)^{-1})
- (n − d − 1) σ̂^2 ∼ σ^2 χ^2_{n−d−1}
- β̂ is independent of σ̂^2

To test H_0: β_j = 0, we use
- if σ^2 is known, z_j = β̂_j / (σ √v_j), which has a Z distribution under H_0;
- if σ^2 is unknown, t_j = β̂_j / (σ̂ √v_j), which has a t_{n−d−1} distribution under H_0;
where v_j is the jth diagonal element of (X^T X)^{-1}.

Confidence Interval for Individual Coefficients

Under the Normal assumption, the 100(1 − α)% C.I. of β_j is
β̂_j ± t_{n−d−1; α/2} σ̂ √v_j,
where t_{k; ν} is the 1 − ν percentile of the t_k distribution.

In practice, we use the approximate 100(1 − α)% C.I. of β_j,
β̂_j ± z_{α/2} σ̂ √v_j,
where z_{α/2} is the 1 − α/2 percentile of the standard Normal distribution. Even if the Gaussian assumption does not hold, this interval is approximately right, with coverage probability 1 − α as n → ∞.
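A sketch of these t-statistics and coefficient confidence intervals with numpy/scipy (the design, sample size, and α = 0.05 are illustrative choices; the second true coefficient is set to zero so its H_0 holds):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 0.0, 2.0])            # beta_1 = 0: H0 true for that coefficient
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
dof = n - d - 1
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / dof    # sigma_hat^2 = RSS / (n - d - 1)
v = np.diag(XtX_inv)                                  # v_j = jth diagonal of (X^T X)^{-1}
se = np.sqrt(sigma2_hat * v)

t_stat = beta_hat / se                                # t_j = beta_hat_j / (sigma_hat sqrt(v_j))
p_val = 2 * stats.t.sf(np.abs(t_stat), df=dof)        # two-sided p-values

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=dof)           # t_{n-d-1; alpha/2}
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(np.round(t_stat, 2), np.round(p_val, 3))
print(np.round(ci, 3))
```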

Review on Multivariate Normal Distributions

Distributions of Quadratic Forms (Non-central χ^2):

If X ∼ Np(µ, Ip), then

W = X^T X = Σ_{i=1}^p X_i^2 ∼ χ^2_p(λ), λ = μ^T μ / 2.

Special case: if X ∼ N_p(0, I_p), then W = X^T X ∼ χ^2_p.
If X ∼ N_p(μ, V) where V is nonsingular, then
W = X^T V^{-1} X ∼ χ^2_p(λ), λ = μ^T V^{-1} μ / 2.

If X ∼ N_p(μ, V) with V nonsingular, and if A is symmetric and AV is idempotent with rank s, then
W = X^T A X ∼ χ^2_s(λ), λ = μ^T A μ / 2.

Cochran's Theorem

Let y ∼ N_n(μ, σ^2 I_n) and let A_j, j = 1, ..., J, be symmetric idempotent matrices with rank s_j. Furthermore, assume that Σ_{j=1}^J A_j = I_n and Σ_{j=1}^J s_j = n. Then
(i) W_j = y^T A_j y / σ^2 ∼ χ^2_{s_j}(λ_j), where λ_j = μ^T A_j μ / (2σ^2);
(ii) the W_j's are mutually independent.
Essentially, we decompose y^T y into the (scaled) sum of quadratic forms:
Σ_{i=1}^n y_i^2 = y^T I_n y = Σ_{j=1}^J y^T A_j y.

Application of Cochran's Theorem to Linear Models

Example: Assume y ∼ N_n(Xβ, σ^2 I_n). Define A = I − P_X and
- the residual sum of squares: RSS = y^T A y = ||r||^2
- the sum of squares regression: SSR = y^T P_X y = ||ŷ||^2
By Cochran's Theorem, we have
(i) RSS/σ^2 ∼ χ^2_{n−d−1} and SSR/σ^2 ∼ χ^2_{d+1}(λ), where λ = (Xβ)^T(Xβ)/(2σ^2);
(ii) RSS is independent of SSR. (Note r ⊥ ŷ.)
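A small Monte Carlo check of these two facts (simulation settings are arbitrary): across repeated draws of y with X held fixed, RSS/σ^2 should look like a χ^2_{n−d−1} sample, and RSS should be essentially uncorrelated with SSR.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, d, sigma = 30, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta = np.array([1.0, -1.0, 0.5])
P = X @ np.linalg.solve(X.T @ X, X.T)       # P_X, projection onto col(X)
A = np.eye(n) - P                           # I - P_X

rss, ssr = [], []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=n)
    rss.append(y @ A @ y)                   # RSS = y^T (I - P_X) y
    ssr.append(y @ P @ y)                   # SSR = y^T P_X y
rss, ssr = np.array(rss), np.array(ssr)

# RSS / sigma^2 should be compatible with chi^2_{n-d-1}
print(stats.kstest(rss / sigma**2, "chi2", args=(n - d - 1,)).pvalue)
# RSS and SSR should be (nearly) uncorrelated, reflecting their independence
print(np.corrcoef(rss, ssr)[0, 1])
```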

F Distribution

If U_1 ∼ χ^2_p, U_2 ∼ χ^2_q, and U_1 ⊥ U_2, then

F = (U_1/p) / (U_2/q) ∼ F_{p,q}.

If U_1 ∼ χ^2_p(λ), U_2 ∼ χ^2_q, and U_1 ⊥ U_2, then

F = (U_1/p) / (U_2/q) ∼ F_{p,q}(λ) (noncentral F).

Example: Assume y ∼ N_n(Xβ, σ^2 I_n). Let A = I − P_X, and

RSS = y^T A y = ||r||^2, SSR = y^T P_X y = ||ŷ||^2. Then
F = [SSR/(d + 1)] / [RSS/(n − d − 1)] ∼ F_{d+1, n−d−1}(λ), λ = ||Xβ||^2/(2σ^2).

Making Inferences about Multiple Parameters

Assume X = [X_0, X_1], where X_0 consists of the first k columns. Correspondingly, β = (β_0^T, β_1^T)^T. To test H_0: β_0 = 0, use
F = [(RSS_1 − RSS)/k] / [RSS/(n − d − 1)],

where
- RSS_1 = y^T (I − P_{X_1}) y (reduced model)
- RSS = y^T (I − P_X) y (full model)
- RSS ∼ σ^2 χ^2_{n−d−1}
- RSS_1 − RSS = y^T (P_X − P_{X_1}) y

Testing Multiple Parameters

Applying Cochran's Theorem to RSS_1, RSS, and RSS_1 − RSS:
- RSS and RSS_1 − RSS are independent;
- RSS_1, RSS, and RSS_1 − RSS respectively follow (noncentral) χ^2 distributions, with noncentralities (Xβ)^T(I − P_{X_1})(Xβ)/(2σ^2), 0, and (Xβ)^T(P_X − P_{X_1})(Xβ)/(2σ^2).
Then we have F ∼ F_{k, n−d−1}(λ), with λ = (Xβ)^T(P_X − P_{X_1})(Xβ)/(2σ^2).
Under H_0, Xβ = X_1 β_1, so λ = 0 and F ∼ F_{k, n−d−1}.

Nested Model Selection

To test for significance of groups of coefficients simultaneously, we use the F-statistic
F = [(RSS_0 − RSS_1)/(d_1 − d_0)] / [RSS_1/(n − d_1 − 1)],
where

- RSS_1 is the RSS for the bigger model with d_1 + 1 parameters

- RSS_0 is the RSS for the nested smaller model with d_0 + 1 parameters, i.e., with d_1 − d_0 parameters constrained to zero.
The F-statistic measures the change in RSS per additional parameter in the bigger model, normalized by σ̂^2. Under the assumption that the smaller model is correct,

F ∼ F_{d_1−d_0, n−d_1−1}.
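A sketch of this nested-model F-test on simulated data (the models, dimensions, and coefficients are illustrative; the d_1 − d_0 = 2 extra coefficients are zero in truth, so the smaller model is correct and the p-value should typically be unremarkable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, d1 = 120, 4
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, d1))])  # bigger model: d1 + 1 parameters
X0 = X1[:, :3]                                                # nested smaller model: d0 + 1 = 3
beta = np.array([1.0, 2.0, -1.0, 0.0, 0.0])                   # last two coefficients are zero
y = X1 @ beta + rng.normal(size=n)

def rss(Xm, y):
    """Residual sum of squares from the OLS fit of y on Xm."""
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ b
    return r @ r

d0 = X0.shape[1] - 1
RSS0, RSS1 = rss(X0, y), rss(X1, y)
F = ((RSS0 - RSS1) / (d1 - d0)) / (RSS1 / (n - d1 - 1))
p_value = stats.f.sf(F, d1 - d0, n - d1 - 1)     # upper tail of F_{d1-d0, n-d1-1}
print(round(F, 3), round(p_value, 3))
```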

Confidence Set

The approximate confidence set of β is

C_β = {β : (β̂ − β)^T (X^T X)(β̂ − β) ≤ σ̂^2 χ^2_{d+1; 1−α}},

where χ^2_{k; 1−α} is the 1 − α percentile of the χ^2_k distribution. The confidence interval for the true function f(x) = x^T β is

{x^T β : β ∈ C_β}.

Gauss-Markov Theorem

Assume s^T β is linearly estimable, i.e., there exists a linear estimator b + c^T y such that E(b + c^T y) = s^T β. A function s^T β is linearly estimable iff s = X^T a for some a.
Theorem: If s^T β is linearly estimable, then s^T β̂ is the best linear unbiased estimator (BLUE) of s^T β: for any c^T y satisfying E(c^T y) = s^T β, we have

Var(s^T β̂) ≤ Var(c^T y).

s^T β̂ is in fact the best among all unbiased estimators. (It is a function of the complete and sufficient statistic (y^T y, X^T y).)
Question: Is it possible to find a slightly biased linear estimator with smaller variance? (Trade a little bias for a large reduction in variance.)

Linear Regression with Orthogonal Design

If X is univariate (a single column x), the least squares estimate is
β̂ = Σ_i x_i y_i / Σ_i x_i^2 = <x, y> / <x, x>.

If X = [x1, ..., xd] has orthogonal columns, i.e.,

<x_j, x_k> = 0, ∀ j ≠ k,

or equivalently, X^T X = diag{||x_1||^2, ..., ||x_d||^2}, then the OLS estimates are given as
β̂_j = <x_j, y> / <x_j, x_j>, for j = 1, ..., d.

Each input has no effect on the estimation of the other parameters: multiple linear regression reduces to d univariate regressions.
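A small check of this reduction (on a synthetic design whose columns are made exactly orthogonal; illustrative only): the multiple-regression coefficients coincide with the d univariate slopes <x_j, y>/<x_j, x_j>.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 60, 4

# Build a design with exactly orthogonal (not necessarily unit-norm) columns
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = Q * rng.uniform(1.0, 5.0, size=d)        # rescaling columns keeps them orthogonal
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

beta_multiple = np.linalg.lstsq(X, y, rcond=None)[0]
beta_univariate = (X.T @ y) / np.sum(X**2, axis=0)   # <x_j, y> / <x_j, x_j> per column

print(np.allclose(beta_multiple, beta_univariate))   # True: the regressions decouple
```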

How to orthogonalize X?

Consider the simple model y = β_0 1 + β_1 x + ε. We regress x onto 1 and obtain the residual

z = x − x̄ 1.

Orthogonalization process:
- The residual z is orthogonal to the regressor 1.
- The column space of X is span{1, x}. Note: ŷ ∈ span{1, x} = span{1, z}, because

β_0 1 + β_1 x = β_0 1 + β_1 [x̄ 1 + (x − x̄ 1)]
            = β_0 1 + β_1 [x̄ 1 + z]
            = (β_0 + β_1 x̄) 1 + β_1 z
            = η_0 1 + β_1 z.

{1, z} form an orthogonal basis for the column space of X .

How to orthogonalize X? (continued)

Estimation Process:

First, we regress y onto z for the OLS estimate of the slope β̂_1:
β̂_1 = <y, z> / <z, z> = <y, x − x̄ 1> / <x − x̄ 1, x − x̄ 1>.

Second, we regress y onto 1 and get the coefficient η̂_0 = ȳ. The OLS fit is given as

ŷ = η̂_0 1 + β̂_1 z
  = η̂_0 1 + β̂_1 (x − x̄ 1) = (η̂_0 − β̂_1 x̄) 1 + β̂_1 x.

Therefore, the OLS intercept is obtained as

β̂_0 = η̂_0 − β̂_1 x̄ = ȳ − β̂_1 x̄.

[Figure: orthogonalization in the two-variable case — x2 is regressed on x1, leaving the residual z orthogonal to x1; y is then projected onto span{x1, z} to obtain ŷ.]

How to orthogonalize X? (d = 2)

Consider y = β1 x1 + β2 x2 + β3 x3 + ε (with x1 = 1). Orthogonalization process:

1. We regress x2 onto x1 and compute the residual
   z1 = x2 − γ12 x1 (note z1 ⊥ x1).
2. We regress x3 onto (x1, z1) and compute the residual
   z2 = x3 − γ13 x1 − γ23 z1 (note z2 ⊥ {x1, z1}).

Note: span{x1, x2, x3} = span{x1, z1, z2}, because

β1x1 + β2x2 + β3x3 = β1x1 + β2(γ12x1 + z1) + β3(γ13x1 + γ23z1 + z2)

= (β1 + β2γ12 + β3γ13)x1 + (β2 + β3γ23)z1 + β3z2

= η1x1 + η2z1 + β3z2.

Estimation Process

We project y onto the orthogonal basis {x1, z1, z2} one by one, and then recover the coefficients corresponding to the original columns of X.

First, we regress y onto z2 for the OLS estimate of the slope βˆ3

β̂3 = <y, z2> / <z2, z2>.

Second, we regress y onto z1, leading to the coefficient η̂2, and

β̂2 = η̂2 − β̂3 γ23.

Third, we regress y onto x1, leading to the coefficient η̂1, and

β̂1 = η̂1 − β̂3 γ13 − β̂2 γ12.

Gram-Schmidt Procedure (Successive Orthogonalization)

1. Initialize z_0 = x_0 = 1.
2. For j = 1, ..., d: regress x_j on z_0, z_1, ..., z_{j−1} to produce coefficients γ̂_kj = <z_k, x_j> / <z_k, z_k>, for k = 0, ..., j − 1, and the residual vector z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k. ({z_0, z_1, ..., z_{j−1}} are orthogonal.)
3. Regress y on the residual z_d to get

β̂_d = η̂_d = <y, z_d> / <z_d, z_d>.

4. Compute β̂_j, j = d − 1, ..., 0, in that order successively, based on

η̂_j = <y, z_j> / <z_j, z_j>.

{z_0, z_1, ..., z_d} forms an orthogonal basis for col(X). The multiple regression coefficient β̂_j is the additional contribution of x_j to y, after x_j has been adjusted for x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_d.
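A direct implementation sketch of this procedure (written from the steps above; the function name and test data are mine, not from the lecture). Step 4's successive computation is just back-substitution in Γβ = η, done here with a triangular solve, and the result is checked against a standard least squares fit:

```python
import numpy as np

def gram_schmidt_regression(X, y):
    """Successive orthogonalization; returns the OLS coefficients of y on the columns of X.

    Assumes X has full column rank; column 0 plays the role of x_0 = 1.
    """
    n, p = X.shape
    Z = np.empty((n, p))
    Gamma = np.eye(p)                        # upper triangular, ones on the diagonal
    for j in range(p):
        z = X[:, j].astype(float).copy()
        for k in range(j):                   # regress x_j on z_0, ..., z_{j-1}
            Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            z -= Gamma[k, j] * Z[:, k]
        Z[:, j] = z                          # residual vector z_j
    eta = (Z.T @ y) / np.sum(Z**2, axis=0)   # eta_j = <y, z_j> / <z_j, z_j>
    beta = np.linalg.solve(Gamma, eta)       # back-substitute Gamma beta = eta (step 4)
    return beta, Z, Gamma

rng = np.random.default_rng(7)
n, d = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_gs, Z, Gamma = gram_schmidt_regression(X, y)
print(np.allclose(beta_gs, np.linalg.lstsq(X, y, rcond=None)[0]))   # True
print(np.allclose(Z.T @ Z, np.diag(np.sum(Z**2, axis=0))))          # columns of Z are orthogonal
```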

Collinearity Issue

The dth coefficient is
β̂_d = <z_d, y> / <z_d, z_d>.
If x_d is highly correlated with some of the other x_j's, then
- the residual vector z_d is close to zero
- the coefficient β̂_d will be very unstable
The variance of the estimate is
Var(β̂_d) = σ^2 / ||z_d||^2.

The precision for estimating β̂_d depends on the length of z_d, or, how much of x_d is unexplained by the other x_k's.
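A small numerical illustration of this point (settings are arbitrary): as x2 becomes nearly collinear with x1, the residual z shrinks and Var(β̂_2) = σ^2/||z||^2 blows up.

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma = 100, 1.0
x1 = rng.normal(size=n)

for rho in (0.0, 0.9, 0.999):
    # x2 becomes increasingly collinear with x1 as rho -> 1
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])

    # Residual of x2 after regressing on {1, x1}: the z_d of this slide
    X01 = X[:, :2]
    z = x2 - X01 @ np.linalg.solve(X01.T @ X01, X01.T @ x2)

    var_beta2 = sigma**2 * np.linalg.inv(X.T @ X)[2, 2]      # Var(beta_hat_2)
    print(rho, round(z @ z, 4), round(var_beta2, 4),
          np.isclose(var_beta2, sigma**2 / (z @ z)))         # equals sigma^2 / ||z||^2
```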

Two Computational Algorithms for Multiple Regression

Consider the normal equations

X^T X β = X^T y.

We would like to avoid computing (X^T X)^{-1} directly.
1. QR decomposition of X: X = QR, where Q has orthonormal columns and R is upper triangular. Essentially, a process of orthogonal matrix triangularization.
2. Cholesky decomposition of X^T X: X^T X = R̃ R̃^T, where R̃ is lower triangular.

Matrix Formulation of Orthogonalization

In Step 2 of the Gram-Schmidt procedure, for j = 1, ..., d,

z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k  ⟹  x_j = Σ_{k=0}^{j−1} γ̂_kj z_k + z_j.

In matrix form X = [x1, ..., xd ] and Z = [z1, ..., zd ],

X = ZΓ

- The columns of Z are orthogonal to each other.
- The matrix Γ is upper triangular, with 1's on the diagonal.

Standardizing Z using D = diag{||z_1||, ..., ||z_d||},

X = ZΓ = ZD^{-1} DΓ ≡ QR, with Q = ZD^{-1}, R = DΓ.

QR Decomposition

- The columns of Q form an orthonormal basis for the column space of X.
- Q is an n × d matrix with orthonormal columns, satisfying Q^T Q = I.
- R is upper triangular, d × d, and of full rank.
- X^T X = (QR)^T (QR) = R^T Q^T Q R = R^T R.
The least squares solutions are

β̂ = (X^T X)^{-1} X^T y = R^{-1} R^{-T} R^T Q^T y = R^{-1} Q^T y,
ŷ = X β̂ = (QR)(R^{-1} Q^T y) = Q Q^T y.

QR Algorithm for Normal Equations

Regard β̂ as the solution of the linear system

Rβ = Q^T y.

1. Conduct the QR decomposition X = QR (Gram-Schmidt orthogonalization).
2. Compute Q^T y.
3. Solve the triangular system Rβ = Q^T y.
The computational complexity: nd^2.
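A sketch of this QR route with numpy/scipy on simulated data (illustrative data; scipy's solve_triangular performs the back-substitution in step 3):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(9)
n, d = 500, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ rng.normal(size=d + 1) + rng.normal(size=n)

Q, R = np.linalg.qr(X)                    # reduced QR: Q is n x (d+1), R is (d+1) x (d+1)
beta_hat = solve_triangular(R, Q.T @ y)   # solve R beta = Q^T y (R is upper triangular)
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))   # True
```

Working through QR avoids forming X^T X, whose condition number is the square of that of X, which is one reason the QR route is usually preferred for numerical stability.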

Cholesky Decomposition Algorithm

For any positive definite square matrix A, we have

A = RR^T, where R is a lower triangular matrix of full rank.
1. Compute X^T X and X^T y.
2. Factor X^T X = RR^T; then β̂ = (R^T)^{-1} R^{-1} X^T y.
3. Solve the triangular system Rw = X^T y for w.
4. Solve the triangular system R^T β = w for β.
The computational complexity: d^3 + nd^2/2 (can be faster than QR for small d, but can be less stable).

Var(ŷ_0) = Var(x_0^T β̂) = σ^2 x_0^T (R^T)^{-1} R^{-1} x_0.
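A matching sketch of the Cholesky route (same kind of illustrative data), including Var(ŷ_0) at a new point x_0 computed through the triangular factor as in the formula above:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(10)
n, d = 500, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ rng.normal(size=d + 1) + rng.normal(size=n)

R = np.linalg.cholesky(X.T @ X)                       # X^T X = R R^T, R lower triangular
w = solve_triangular(R, X.T @ y, lower=True)          # step 3: solve R w = X^T y
beta_hat = solve_triangular(R.T, w, lower=False)      # step 4: solve R^T beta = w
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))   # True

# Var(yhat_0) = sigma^2 * x0^T (R^T)^{-1} R^{-1} x0, via one triangular solve
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - d - 1)
x0 = np.ones(d + 1)                                   # hypothetical new covariate vector
u = solve_triangular(R, x0, lower=True)               # u = R^{-1} x0
print(sigma2_hat * (u @ u))                           # estimated Var(yhat_0)
```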
