Lecture 4: Regression Methods I (Linear Regression)
Wenbin Lu
Department of Statistics, North Carolina State University
Fall 2019
Outline
1. Regression: Supervised Learning with Continuous Responses
2. Linear Models and Multiple Linear Regression
   - Ordinary Least Squares
   - Statistical inferences
   - Computational algorithms
Regression Models
If the response $Y$ takes real values, we refer to this type of supervised learning problem as a regression problem.
- linear regression models (parametric models)
- nonparametric regression: splines, kernel estimators, local polynomial regression
- semiparametric regression
- Broad coverage: penalized regression, regression trees, support vector regression, quantile regression
Linear Regression Models
A standard linear regression model assumes
$$y_i = x_i^T \beta + \epsilon_i, \qquad \epsilon_i \ \text{i.i.d.}, \quad E(\epsilon_i) = 0, \quad \mathrm{Var}(\epsilon_i) = \sigma^2,$$
where
- $y_i$ is the response for the $i$th observation, $x_i \in \mathbb{R}^d$ is the covariate vector,
- $\beta \in \mathbb{R}^d$ is the $d$-dimensional parameter vector.

Common model assumptions: independence of errors; constant error variance (homoscedasticity); errors independent of $X$. Normality is not needed.
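As a concrete illustration (not from the original slides), here is a minimal NumPy sketch simulating data from this model; the sample size, dimension, and coefficient values are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
beta = np.array([1.5, -2.0, 0.5])    # true coefficient vector (illustrative choice)
X = rng.normal(size=(n, d))          # rows are the covariate vectors x_i in R^d
eps = rng.normal(scale=1.0, size=n)  # i.i.d. errors, E(eps_i) = 0, Var(eps_i) = sigma^2 = 1
y = X @ beta + eps                   # y_i = x_i^T beta + eps_i
```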
About Linear Models
Linear models have been a mainstay of statistics for the past 30 years and remain one of the most important tools. The covariates may come from different sources (a sketch of assembling such features follows this list):
- quantitative inputs
- dummy coding of qualitative inputs
- transformed inputs: $\log(X)$, $X^2$, $\sqrt{X}$, ...
- basis expansions: $X_1, X_1^2, X_1^3, \ldots$ (polynomial representation)
- interactions between variables: $X_1 X_2$, ...
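A minimal sketch of how such derived covariates might be assembled into columns of a feature matrix with NumPy; the variable names and ranges are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(1.0, 10.0, size=50)  # keep inputs positive so log/sqrt are defined
x2 = rng.uniform(1.0, 10.0, size=50)

# Each derived feature enters the *linear* model as one more column:
features = np.column_stack([
    np.log(x1), np.sqrt(x1),  # transformed inputs
    x1, x1**2, x1**3,         # polynomial basis expansion of X1
    x1 * x2,                  # interaction between X1 and X2
])
print(features.shape)         # (50, 6)
```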
Review on Matrix Theory - Notations
- $A$ is an $m \times m$ matrix. $\mathrm{col}(A)$: the subspace of $\mathbb{R}^m$ spanned by the columns of $A$.
- $I_m$ is the identity matrix of size $m$.
- Vectors: $x$ is a nonzero $m \times 1$ vector; $0$ is the $m \times 1$ zero vector.
- $e_i$, $i = 1, \ldots, m$, is the $m \times 1$ unit vector with 1 in the $i$th position and zeros elsewhere.
- The $i$th column of $A$ can be expressed as $A e_i$, for $i = 1, \ldots, m$.
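A one-line numerical check of this last fact (the matrix is an arbitrary example):

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
e2 = np.zeros(3)
e2[1] = 1.0                          # e_2: 1 in the 2nd position, zeros elsewhere
assert np.allclose(A @ e2, A[:, 1])  # A e_i extracts the ith column of A
```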
Basic Concepts
- The determinant of $A$ is $\det(A) = |A|$.
- The trace of $A$ is $\mathrm{tr}(A)$, the sum of the diagonal elements of $A$.
- The roots of the $m$th-degree polynomial equation in $\lambda$,
$$|\lambda I_m - A| = 0,$$
denoted by $\lambda_1, \cdots, \lambda_m$, are called the eigenvalues of $A$.
- The collection $\{\lambda_1, \cdots, \lambda_m\}$ is called the spectrum of $A$.
- Any nonzero $m \times 1$ vector $x_i \neq 0$ such that
$$A x_i = \lambda_i x_i$$
is an eigenvector of $A$ corresponding to the eigenvalue $\lambda_i$.
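These definitions can be checked numerically, e.g. with NumPy's `np.linalg.eig`; the matrix below is an arbitrary example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)  # spectrum of A and corresponding eigenvectors

m = A.shape[0]
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)                     # A x_i = lambda_i x_i
    assert abs(np.linalg.det(lam * np.eye(m) - A)) < 1e-8  # |lambda I_m - A| = 0
```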
- Let $B$ be another $m \times m$ matrix; then
$$|AB| = |A||B|, \qquad \mathrm{tr}(AB) = \mathrm{tr}(BA).$$
- $A$ is symmetric if $A^T = A$.
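A quick numerical check of these identities on arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))

assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
S = A + A.T                  # a simple way to build a symmetric matrix
assert np.allclose(S, S.T)   # S^T = S
```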
Review on Matrix Theory (II)
The following are equivalent:
- $|A| \neq 0$
- $\mathrm{rank}(A) = m$
- $A^{-1}$ exists.
Linear transformation: $Ax$ generates a vector in $\mathrm{col}(A)$.
Orthogonal Matrix
An $m \times m$ matrix $P$ is called an orthogonal matrix if
$$PP^T = P^T P = I_m, \quad \text{or} \quad P^{-1} = P^T.$$
If $P$ is an orthogonal matrix, then (these facts are checked numerically below):
- $|PP^T| = |P||P^T| = |P|^2 = |I_m| = 1$, so $|P| = \pm 1$.
- For any $A$, we have $\mathrm{tr}(PAP^T) = \mathrm{tr}(AP^T P) = \mathrm{tr}(A)$.
- $PAP^T$ and $A$ have the same eigenvalues, since
$$|\lambda I_m - PAP^T| = |\lambda PP^T - PAP^T| = |P|^2\, |\lambda I_m - A| = |\lambda I_m - A|.$$
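A sketch verifying these properties; the $Q$ factor of a QR factorization of a random matrix is used as a convenient source of an orthogonal matrix, and $A$ is symmetrized so `eigvalsh` applies:

```python
import numpy as np

rng = np.random.default_rng(3)
P, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # Q from a QR factorization is orthogonal
A = rng.normal(size=(4, 4))
A = A + A.T                                   # symmetrize A

assert np.allclose(P @ P.T, np.eye(4))                 # P P^T = I_m
assert np.isclose(abs(np.linalg.det(P)), 1.0)          # |P| = +/- 1
assert np.isclose(np.trace(P @ A @ P.T), np.trace(A))  # tr(P A P^T) = tr(A)
assert np.allclose(np.linalg.eigvalsh(P @ A @ P.T),
                   np.linalg.eigvalsh(A))              # same spectrum
```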
Spectral Decomposition of Symmetric Matrix
If $A$ is symmetric, there exists an orthogonal matrix $P$ such that
$$P^T A P = \Lambda = \mathrm{diag}\{\lambda_1, \cdots, \lambda_m\},$$
where the $\lambda_i$'s are the eigenvalues of $A$. The eigenvectors of $A$ are the column vectors of $P$.
- Denote the $i$th column of $P$ by $p_i$; then
$$PP^T = \sum_{i=1}^m p_i p_i^T = I_m.$$
- The spectral decomposition of $A$ is
$$A = P \Lambda P^T = \sum_{i=1}^m \lambda_i p_i p_i^T.$$
- $\mathrm{tr}(A) = \mathrm{tr}(\Lambda) = \sum_{i=1}^m \lambda_i$ and $|A| = |\Lambda| = \prod_{i=1}^m \lambda_i$.
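A numerical check of the spectral decomposition via `np.linalg.eigh` (the test matrix is an arbitrary symmetrized Gaussian):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
A = (A + A.T) / 2                      # symmetrize

lam, P = np.linalg.eigh(A)             # eigenvalues and orthonormal eigenvectors
assert np.allclose(P @ np.diag(lam) @ P.T, A)    # A = P Lambda P^T
# Rank-one expansion: A = sum_i lambda_i p_i p_i^T
A_sum = sum(l * np.outer(p, p) for l, p in zip(lam, P.T))
assert np.allclose(A_sum, A)
assert np.isclose(np.trace(A), lam.sum())        # tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), lam.prod())  # |A| = product of eigenvalues
```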
Idempotent Matrices

An $m \times m$ matrix $A$ is idempotent if
$$A^2 = AA = A.$$
The eigenvalues of an idempotent matrix are either zero or one: if $Ax = \lambda x$, then
$$\lambda x = Ax = A(Ax) = A(\lambda x) = \lambda^2 x \implies \lambda = \lambda^2.$$
If $A$ is idempotent, so is $I_m - A$.
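As a check, the sketch below builds a matrix of the form $X(X^T X)^{-1} X^T$, which a later slide shows to be idempotent, and verifies these facts numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 3))
A = X @ np.linalg.inv(X.T @ X) @ X.T    # symmetric and idempotent (see later slides)

assert np.allclose(A @ A, A)            # A^2 = A
B = np.eye(6) - A
assert np.allclose(B @ B, B)            # I_m - A is idempotent too
lam = np.linalg.eigvalsh(A)
assert np.allclose(lam, np.round(lam))  # eigenvalues are (numerically) 0 or 1
```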
Projection Matrix
A symmetric, idempotent matrix $A$ is called a projection matrix.

If $A$ is symmetric and idempotent:
- If $\mathrm{rank}(A) = r$, then $A$ has $r$ eigenvalues equal to 1 and $m - r$ eigenvalues equal to 0; hence $\mathrm{tr}(A) = \mathrm{rank}(A)$.
- $I_m - A$ is also symmetric and idempotent, of rank $m - r$.
Projection Matrices
Given $x \in \mathbb{R}^m$, define $y = Ax$ and $z = (I - A)x = x - y$. Then $y \perp z$:
- $y$ is the orthogonal projection of $x$ onto the subspace $\mathrm{col}(A)$;
- $z = (I - A)x$ is the orthogonal projection of $x$ onto the complementary subspace, so that
$$x = y + z = Ax + (I - A)x.$$
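A numerical illustration of this decomposition, using an arbitrary projection matrix onto $\mathrm{col}(X)$:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(5, 2))
A = X @ np.linalg.inv(X.T @ X) @ X.T      # projection onto col(X)

x = rng.normal(size=5)
y = A @ x                                 # projection of x onto col(A)
z = x - y                                 # projection onto the orthogonal complement
assert np.isclose(y @ z, 0.0, atol=1e-10) # y is orthogonal to z
assert np.allclose(y + z, x)              # x = Ax + (I - A)x
```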
Matrix Notations for Linear Regression
- The response vector $y = (y_1, \cdots, y_n)^T$.
- The design matrix $X$. Assume the first column of $X$ is $\mathbf{1}$. The dimension of $X$ is $n \times (1 + d)$.
- The regression coefficients $\beta = \binom{\beta_0}{\beta_1}$.
- The error vector $\epsilon = (\epsilon_1, \cdots, \epsilon_n)^T$.
- The linear model is written as
$$y = X\beta + \epsilon.$$
- The estimated coefficients are $\hat{\beta}$; the predicted response is $\hat{y} = X\hat{\beta}$.
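A small sketch assembling these objects in NumPy; the dimensions and coefficient values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 8, 2
beta0 = 1.0                              # intercept
beta1 = np.array([2.0, -1.0])            # slope coefficients
beta = np.concatenate([[beta0], beta1])  # beta = (beta_0, beta_1^T)^T

X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # n x (1+d), first column 1
eps = rng.normal(size=n)                 # error vector
y = X @ beta + eps                       # y = X beta + eps
```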
Ordinary Least Squares (OLS)
The most popular method for fitting the linear model is ordinary least squares (OLS):
$$\min_{\beta} \ \mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta).$$
- Normal equations: $X^T(y - X\beta) = 0$.
- $\hat{\beta} = (X^T X)^{-1} X^T y$ and $\hat{y} = X(X^T X)^{-1} X^T y$.
- The residual vector is $r = y - \hat{y} = (I - P_X)y$.
- The residual sum of squares is $\mathrm{RSS} = r^T r$.
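A minimal sketch of OLS via the normal equations on simulated data; names and values are illustrative, and `np.linalg.lstsq` is the numerically safer standard routine for the same problem:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([2.0, 1.5, -1.0, 0.5]) + rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                 # fitted values
r = y - y_hat                        # residual vector
RSS = r @ r                          # residual sum of squares

# lstsq solves the same least-squares problem and agrees:
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```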
Projection Matrix
Call the following square matrix the projection or hat matrix:
$$P_X = X(X^T X)^{-1} X^T.$$
Properties:
- symmetric and non-negative definite
- idempotent: $P_X^2 = P_X$; the eigenvalues are 0's and 1's
- $P_X X = X$, $(I - P_X) X = 0$

We have
$$r = (I - P_X) y, \qquad \mathrm{RSS} = y^T (I - P_X) y.$$
Note that
$$X^T r = X^T (I - P_X) y = 0,$$
i.e., the residual vector is orthogonal to the column space spanned by $X$, $\mathrm{col}(X)$.
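A numerical verification of these hat-matrix properties on an arbitrary design matrix:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
P_X = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(P_X, P_X.T)            # symmetric
assert np.allclose(P_X @ P_X, P_X)        # idempotent
assert np.allclose(P_X @ X, X)            # P_X X = X

y = rng.normal(size=n)
r = (np.eye(n) - P_X) @ y                 # residual vector
assert np.allclose(X.T @ r, 0.0)          # X^T r = 0: r is orthogonal to col(X)
assert np.isclose(r @ r, y @ (np.eye(n) - P_X) @ y)   # RSS = y^T (I - P_X) y
```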