
Lecture 4: Regression Methods I

Wenbin Lu

Department of Statistics, North Carolina State University

Fall 2019

Outline

1. Regression: Supervised Learning with Continuous Responses
2. Linear Models and Multiple Linear Regression
   - Statistical inferences
   - Computational algorithms

Regression Models

If the response Y takes real values, we refer to this type of supervised learning problem as a regression problem.
- Linear regression models: parametric models
- Nonparametric regression: splines, kernel estimators, local polynomial regression
- Semiparametric regression
- Broad coverage: penalized regression, regression trees, support vector regression, quantile regression

Linear Regression Models

A standard linear regression model assumes

y_i = x_i^T β + ε_i,   ε_i i.i.d., E(ε_i) = 0, Var(ε_i) = σ^2,

- y_i is the response for the ith observation; x_i ∈ R^d is the covariate vector
- β ∈ R^d is the d-dimensional parameter vector
Common model assumptions:
- independence of errors
- constant error variance (homoscedasticity)
- ε independent of X
Normality is not needed.

About Linear Models

Linear models have been a mainstay of statistics for the past 30 years and remain one of the most important tools. The covariates may come from different sources:
- quantitative inputs
- dummy coding of qualitative inputs
- transformed inputs: log(X), X^2, √X, ...
- basis expansions: X1, X1^2, X1^3, ... (polynomial representation)
- interactions between variables: X1 X2, ...
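All of these constructions keep the model linear in β, so in practice they just add columns to the design matrix. A minimal numpy sketch (the variables income, group, x1, x2 below are made-up illustrations, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
income = rng.lognormal(mean=10, sigma=0.5, size=n)   # quantitative input
group = rng.integers(0, 3, size=n)                   # qualitative input with 3 levels
x1, x2 = rng.normal(size=n), rng.normal(size=n)

# Dummy coding of the qualitative input (level 0 is the baseline)
dummies = np.column_stack([(group == k).astype(float) for k in (1, 2)])

# Design matrix: intercept, transformed input, polynomial basis, dummies, interaction.
# Every column is a fixed function of the inputs, so the model stays linear in beta.
X = np.column_stack([
    np.ones(n),           # intercept
    np.log(income),       # transformed input log(X)
    x1, x1**2, x1**3,     # basis expansion (polynomial in x1)
    dummies,              # dummy-coded qualitative input
    x1 * x2,              # interaction between variables
])
print(X.shape)            # (100, 8)
```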

Review on Matrix Theory (I) - Notations

A is an m × m matrix. col(A): the subspace of R^m spanned by the columns of A.

I_m is the identity matrix of size m.

Vectors: x is a nonzero m × 1 vector; 0 is the m × 1 zero vector.

e_i, i = 1, ..., m, is the m × 1 unit vector with 1 in the ith position and zeros elsewhere.

The ith column of A can be expressed as A e_i, for i = 1, ..., m.

Basic Concepts

The determinant of A is det(A) = |A|. The trace of A, tr(A), is the sum of the diagonal elements of A. The roots of the mth-degree polynomial equation in λ,

|λIm − A| = 0,

denoted by λ_1, ..., λ_m, are called the eigenvalues of A.

The collection {λ1, ··· , λm} is called the spectrum of A.

Any nonzero m × 1 vector x_i ≠ 0 such that

Axi = λi xi

is an eigenvector of A corresponding to the eigenvalue λi .


Let B be another m × m matrix, then

|AB| = |A||B|, tr(AB) = tr(BA).

A is symmetric if A^T = A.

Review on Matrix Theory (II)

The following are equivalent:
- |A| ≠ 0
- rank(A) = m
- A^{-1} exists

Linear transformation: Ax generates a vector in col(A)

Orthogonal Matrices

An m × m matrix P is called an orthogonal matrix if

P P^T = P^T P = I_m, or P^{-1} = P^T.

- If P is an orthogonal matrix, then |PP^T| = |P||P^T| = |P|^2 = |I_m| = 1, so |P| = ±1.
- For any A, we have tr(PAP^T) = tr(AP^T P) = tr(A).
- PAP^T and A have the same eigenvalues, since

|λI_m − PAP^T| = |λPP^T − PAP^T| = |P|^2 |λI_m − A| = |λI_m − A|.

Spectral Decomposition of Symmetric Matrices

If A is symmetric, there exists an orthogonal matrix P such that

P^T A P = Λ = diag{λ_1, ..., λ_m},

λi ’s are the eigenvalues of A. The eigenvectors of A are the column vectors of P.

Denote the ith column of P by p_i. Then
P P^T = Σ_{i=1}^m p_i p_i^T = I_m.
The spectral decomposition of A is
A = P Λ P^T = Σ_{i=1}^m λ_i p_i p_i^T.
tr(A) = tr(Λ) = Σ_{i=1}^m λ_i and |A| = |Λ| = Π_{i=1}^m λ_i.

Idempotent Matrices

An m × m matrix A is idempotent if

A2 = AA = A.

The eigenvalues of an idempotent matrix are either zero or one:

λx = Ax = A(Ax) = A(λx) = λ^2 x,

⟹ λ = λ^2.

If A is idempotent, so is Im − A.


A symmetric, idempotent matrix A is called a projection matrix.

If A is symmetric and idempotent with rank(A) = r, then
- A has r eigenvalues equal to 1 and m − r eigenvalues equal to 0
- tr(A) = rank(A)

Im − A is also symmetric idempotent, of rank m − r.

Projection Matrices

Given x ∈ R^m, define y = Ax and z = (I − A)x = x − y. Then y ⊥ z.
- y is the orthogonal projection of x onto the subspace col(A).
- z = (I − A)x is the orthogonal projection of x onto the complementary (orthogonal) subspace, so that

x = y + z = Ax + (I − A)x.

Matrix Notations for Linear Regression

- The response vector y = (y_1, ..., y_n)^T.
- The design matrix X. Assume the first column of X is 1; the dimension of X is n × (1 + d).
- The regression coefficients β = (β_0, β_1^T)^T.
- The error vector ε = (ε_1, ..., ε_n)^T.
The linear model is written as

y = Xβ + ε

The estimated coefficients: β̂. The predicted response: ŷ = Xβ̂.

Ordinary Least Squares (OLS)

The most popular method for fitting the linear model is the ordinary least squares (OLS):

min_β RSS(β) = (y − Xβ)^T (y − Xβ).

- Normal equations: X^T(y − Xβ) = 0.
- β̂ = (X^T X)^{-1} X^T y and ŷ = X(X^T X)^{-1} X^T y.
- Residual vector: r = y − ŷ = (I − P_X)y.
- Residual sum of squares: RSS = r^T r.
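A minimal numpy sketch of these OLS quantities on simulated data (the dimensions, coefficients, and noise level are illustrative assumptions, not from the lecture). It solves the normal equations rather than forming (X^T X)^{-1} explicitly, and checks the answer against numpy's least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # first column is 1
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations X^T X beta = X^T y, solved without an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat            # fitted values
r = y - y_hat                   # residual vector
RSS = r @ r                     # residual sum of squares

# Matches numpy's least-squares solver
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]), RSS)
```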

Projection Matrix

Call the following square matrix the projection or hat matrix:

P_X = X (X^T X)^{-1} X^T.

Properties:
- symmetric and non-negative definite
- idempotent: P_X^2 = P_X; the eigenvalues are 0's and 1's
- P_X X = X, (I − P_X)X = 0
We have
r = (I − P_X)y, RSS = y^T (I − P_X) y.
Note that
X^T r = X^T (I − P_X) y = 0.
The residual vector is orthogonal to the column space spanned by X, col(X).
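A quick numerical check of these properties (same kind of simulated design; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix P_X = X (X^T X)^{-1} X^T

print(np.allclose(P, P.T))               # symmetric
print(np.allclose(P @ P, P))             # idempotent
eig = np.sort(np.linalg.eigvalsh(P))
print(np.allclose(eig[-(d + 1):], 1.0),  # d + 1 eigenvalues equal to 1
      np.allclose(eig[:-(d + 1)], 0.0))  # the remaining n - d - 1 equal to 0
print(np.allclose(P @ X, X))             # P_X X = X

r = (np.eye(n) - P) @ y                  # residual vector
print(np.allclose(X.T @ r, 0.0))         # residuals orthogonal to col(X)
```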

[Figure: geometry of least squares — the response vector y is projected orthogonally onto the plane spanned by x1 and x2, giving the fitted vector ŷ.]

Sampling Properties of β̂

Var(β̂) = σ^2 (X^T X)^{-1}. The variance σ^2 can be estimated as

σˆ2 = RSS/(n − d − 1).

This is an unbiased estimator, i.e., E(σ̂^2) = σ^2.

Inferences for Gaussian Errors

Under the Normal assumption on the error ε, we have
- β̂ ∼ N(β, σ^2 (X^T X)^{-1})
- (n − d − 1) σ̂^2 ∼ σ^2 χ^2_{n−d−1}
- β̂ is independent of σ̂^2

To test H_0: β_j = 0, we use
- if σ^2 is known, z_j = β̂_j / (σ √v_j), which has a Z distribution under H_0;
- if σ^2 is unknown, t_j = β̂_j / (σ̂ √v_j), which has a t_{n−d−1} distribution under H_0;
where v_j is the jth diagonal element of (X^T X)^{-1}.

Confidence Interval for Individual Coefficients

Under the Normal assumption, the 100(1 − α)% C.I. of β_j is
β̂_j ± t_{n−d−1; α/2} σ̂ √v_j,
where t_{k; ν} is the 1 − ν percentile of the t_k distribution.

In practice, we use the approximate 100(1 − α)% C.I. of β_j,
β̂_j ± z_{α/2} σ̂ √v_j,
where z_{α/2} is the 1 − α/2 percentile of the standard Normal distribution. Even if the Gaussian assumption does not hold, this interval is approximately right, with coverage probability 1 − α as n → ∞.
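A sketch of these t-statistics and coefficient confidence intervals with numpy/scipy (the design, sample size, and α = 0.05 are illustrative choices; the second true coefficient is set to zero so its H_0 holds):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 0.0, 2.0])            # beta_1 = 0: H0 true for that coefficient
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
dof = n - d - 1
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / dof    # sigma_hat^2 = RSS / (n - d - 1)
v = np.diag(XtX_inv)                                  # v_j = jth diagonal of (X^T X)^{-1}
se = np.sqrt(sigma2_hat * v)

t_stat = beta_hat / se                                # t_j = beta_hat_j / (sigma_hat sqrt(v_j))
p_val = 2 * stats.t.sf(np.abs(t_stat), df=dof)        # two-sided p-values

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=dof)           # t_{n-d-1; alpha/2}
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(np.round(t_stat, 2), np.round(p_val, 3))
print(np.round(ci, 3))
```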

Review on Multivariate Normal Distributions

Distributions of Quadratic Forms (Non-central χ^2):

If X ∼ Np(µ, Ip), then

W = X^T X = Σ_{i=1}^p X_i^2 ∼ χ^2_p(λ), λ = μ^T μ / 2.

Special case: if X ∼ N_p(0, I_p), then W = X^T X ∼ χ^2_p.
If X ∼ N_p(μ, V) where V is nonsingular, then
W = X^T V^{-1} X ∼ χ^2_p(λ), λ = μ^T V^{-1} μ / 2.

If X ∼ N_p(μ, V) with V nonsingular, and if A is symmetric and AV is idempotent with rank s, then
W = X^T A X ∼ χ^2_s(λ), λ = μ^T A μ / 2.

Cochran's Theorem

Let y ∼ N_n(μ, σ^2 I_n) and let A_j, j = 1, ..., J, be symmetric idempotent matrices with rank s_j. Furthermore, assume that Σ_{j=1}^J A_j = I_n and Σ_{j=1}^J s_j = n. Then
(i) W_j = y^T A_j y / σ^2 ∼ χ^2_{s_j}(λ_j), where λ_j = μ^T A_j μ / (2σ^2);
(ii) the W_j's are mutually independent.
Essentially, we decompose y^T y into the (scaled) sum of quadratic forms:
Σ_{i=1}^n y_i^2 = y^T I_n y = Σ_{j=1}^J y^T A_j y.

Application of Cochran's Theorem to Linear Models

Example: Assume y ∼ N_n(Xβ, σ^2 I_n). Define A = I − P_X and
- the residual sum of squares: RSS = y^T A y = ||r||^2
- the sum of squares regression: SSR = y^T P_X y = ||ŷ||^2
By Cochran's Theorem, we have
(i) RSS/σ^2 ∼ χ^2_{n−d−1} and SSR/σ^2 ∼ χ^2_{d+1}(λ), where λ = (Xβ)^T(Xβ)/(2σ^2);
(ii) RSS is independent of SSR. (Note r ⊥ ŷ.)
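A small Monte Carlo check of these two facts (simulation settings are arbitrary): across repeated draws of y with X held fixed, RSS/σ^2 should look like a χ^2_{n−d−1} sample, and RSS should be essentially uncorrelated with SSR.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, d, sigma = 30, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta = np.array([1.0, -1.0, 0.5])
P = X @ np.linalg.solve(X.T @ X, X.T)       # P_X, projection onto col(X)
A = np.eye(n) - P                           # I - P_X

rss, ssr = [], []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=n)
    rss.append(y @ A @ y)                   # RSS = y^T (I - P_X) y
    ssr.append(y @ P @ y)                   # SSR = y^T P_X y
rss, ssr = np.array(rss), np.array(ssr)

# RSS / sigma^2 should be compatible with chi^2_{n-d-1}
print(stats.kstest(rss / sigma**2, "chi2", args=(n - d - 1,)).pvalue)
# RSS and SSR should be (nearly) uncorrelated, reflecting their independence
print(np.corrcoef(rss, ssr)[0, 1])
```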

F Distribution

If U_1 ∼ χ^2_p, U_2 ∼ χ^2_q, and U_1 ⊥ U_2, then

F = (U_1/p) / (U_2/q) ∼ F_{p,q}.

If U_1 ∼ χ^2_p(λ), U_2 ∼ χ^2_q, and U_1 ⊥ U_2, then

F = (U_1/p) / (U_2/q) ∼ F_{p,q}(λ) (noncentral F).

Example: Assume y ∼ N_n(Xβ, σ^2 I_n). Let A = I − P_X, and

RSS = y^T A y = ||r||^2, SSR = y^T P_X y = ||ŷ||^2. Then
F = [SSR/(d + 1)] / [RSS/(n − d − 1)] ∼ F_{d+1, n−d−1}(λ), λ = ||Xβ||^2/(2σ^2).

Making Inferences about Multiple Parameters

Assume X = [X_0, X_1], where X_0 consists of the first k columns. Correspondingly, β = (β_0^T, β_1^T)^T. To test H_0: β_0 = 0, use
F = [(RSS_1 − RSS)/k] / [RSS/(n − d − 1)],

where
- RSS_1 = y^T (I − P_{X_1}) y (reduced model)
- RSS = y^T (I − P_X) y (full model)
- RSS ∼ σ^2 χ^2_{n−d−1}
- RSS_1 − RSS = y^T (P_X − P_{X_1}) y

Testing Multiple Parameters

Applying Cochran's Theorem to RSS_1, RSS, and RSS_1 − RSS:
- RSS and RSS_1 − RSS are independent;
- RSS_1, RSS, and RSS_1 − RSS respectively follow (noncentral) χ^2 distributions, with noncentralities (Xβ)^T(I − P_{X_1})(Xβ)/(2σ^2), 0, and (Xβ)^T(P_X − P_{X_1})(Xβ)/(2σ^2).
Then we have F ∼ F_{k, n−d−1}(λ), with λ = (Xβ)^T(P_X − P_{X_1})(Xβ)/(2σ^2).
Under H_0, Xβ = X_1 β_1, so λ = 0 and F ∼ F_{k, n−d−1}.

Nested Model Selection

To test for significance of groups of coefficients simultaneously, we use the F-statistic
F = [(RSS_0 − RSS_1)/(d_1 − d_0)] / [RSS_1/(n − d_1 − 1)],
where

- RSS_1 is the RSS for the bigger model with d_1 + 1 parameters

- RSS_0 is the RSS for the nested smaller model with d_0 + 1 parameters, i.e., with d_1 − d_0 parameters constrained to zero.
The F-statistic measures the change in RSS per additional parameter in the bigger model, normalized by σ̂^2. Under the assumption that the smaller model is correct,

F ∼ F_{d_1−d_0, n−d_1−1}.
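A sketch of this nested-model F-test on simulated data (the models, dimensions, and coefficients are illustrative; the d_1 − d_0 = 2 extra coefficients are zero in truth, so the smaller model is correct and the p-value should typically be unremarkable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, d1 = 120, 4
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, d1))])  # bigger model: d1 + 1 parameters
X0 = X1[:, :3]                                                # nested smaller model: d0 + 1 = 3
beta = np.array([1.0, 2.0, -1.0, 0.0, 0.0])                   # last two coefficients are zero
y = X1 @ beta + rng.normal(size=n)

def rss(Xm, y):
    """Residual sum of squares from the OLS fit of y on Xm."""
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ b
    return r @ r

d0 = X0.shape[1] - 1
RSS0, RSS1 = rss(X0, y), rss(X1, y)
F = ((RSS0 - RSS1) / (d1 - d0)) / (RSS1 / (n - d1 - 1))
p_value = stats.f.sf(F, d1 - d0, n - d1 - 1)     # upper tail of F_{d1-d0, n-d1-1}
print(round(F, 3), round(p_value, 3))
```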

Confidence Set

The approximate confidence set of β is

C_β = {β : (β̂ − β)^T (X^T X)(β̂ − β) ≤ σ̂^2 χ^2_{d+1; 1−α}},

where χ^2_{k; 1−α} is the 1 − α percentile of the χ^2_k distribution. The confidence interval for the true function f(x) = x^T β is

{x^T β : β ∈ C_β}.

Gauss-Markov Theorem

Assume s^T β is linearly estimable, i.e., there exists a linear estimator b + c^T y such that E(b + c^T y) = s^T β. A function s^T β is linearly estimable iff s = X^T a for some a.
Theorem: If s^T β is linearly estimable, then s^T β̂ is the best linear unbiased estimator (BLUE) of s^T β: for any c^T y satisfying E(c^T y) = s^T β, we have

Var(s^T β̂) ≤ Var(c^T y).

s^T β̂ is in fact the best among all unbiased estimators. (It is a function of the complete and sufficient statistic (y^T y, X^T y).)
Question: Is it possible to find a slightly biased linear estimator with smaller variance? (Trade a little bias for a large reduction in variance.)

Linear Regression with Orthogonal Design

If X is univariate (a single column x), the least squares estimate is
β̂ = Σ_i x_i y_i / Σ_i x_i^2 = <x, y> / <x, x>.

If X = [x1, ..., xd] has orthogonal columns, i.e.,

<x_j, x_k> = 0, ∀ j ≠ k,

or equivalently, X^T X = diag{||x_1||^2, ..., ||x_d||^2}, then the OLS estimates are given as
β̂_j = <x_j, y> / <x_j, x_j>, for j = 1, ..., d.

Each input has no effect on the estimation of the other parameters: multiple linear regression reduces to d univariate regressions.
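A small check of this reduction (on a synthetic design whose columns are made exactly orthogonal; illustrative only): the multiple-regression coefficients coincide with the d univariate slopes <x_j, y>/<x_j, x_j>.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 60, 4

# Build a design with exactly orthogonal (not necessarily unit-norm) columns
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = Q * rng.uniform(1.0, 5.0, size=d)        # rescaling columns keeps them orthogonal
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

beta_multiple = np.linalg.lstsq(X, y, rcond=None)[0]
beta_univariate = (X.T @ y) / np.sum(X**2, axis=0)   # <x_j, y> / <x_j, x_j> per column

print(np.allclose(beta_multiple, beta_univariate))   # True: the regressions decouple
```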

How to orthogonalize X?

Consider the simple model y = β_0 1 + β_1 x + ε. We regress x onto 1 and obtain the residual

z = x − x̄ 1.

Orthogonalization process:
- The residual z is orthogonal to the regressor 1.
- The column space of X is span{1, x}. Note: ŷ ∈ span{1, x} = span{1, z}, because

β_0 1 + β_1 x = β_0 1 + β_1 [x̄ 1 + (x − x̄ 1)]
            = β_0 1 + β_1 [x̄ 1 + z]
            = (β_0 + β_1 x̄) 1 + β_1 z
            = η_0 1 + β_1 z.

{1, z} form an orthogonal basis for the column space of X .

How to orthogonalize X? (continued)

Estimation Process:

First, we regress y onto z for the OLS estimate of the slope β̂_1:
β̂_1 = <y, z> / <z, z> = <y, x − x̄ 1> / <x − x̄ 1, x − x̄ 1>.

Second, we regress y onto 1 and get the coefficient η̂_0 = ȳ. The OLS fit is given as

ŷ = η̂_0 1 + β̂_1 z
  = η̂_0 1 + β̂_1 (x − x̄ 1) = (η̂_0 − β̂_1 x̄) 1 + β̂_1 x.

Therefore, the OLS intercept is obtained as

β̂_0 = η̂_0 − β̂_1 x̄ = ȳ − β̂_1 x̄.

[Figure: orthogonalization in the two-variable case — x2 is regressed on x1, leaving the residual z orthogonal to x1; y is then projected onto span{x1, z} to obtain ŷ.]

How to orthogonalize X? (d = 2)

Consider y = β1 x1 + β2 x2 + β3 x3 + ε (with x1 = 1). Orthogonalization process:

1. We regress x2 onto x1 and compute the residual
   z1 = x2 − γ12 x1 (note z1 ⊥ x1).
2. We regress x3 onto (x1, z1) and compute the residual
   z2 = x3 − γ13 x1 − γ23 z1 (note z2 ⊥ {x1, z1}).

Note: span{x1, x2, x3} = span{x1, z1, z2}, because

β1x1 + β2x2 + β3x3 = β1x1 + β2(γ12x1 + z1) + β3(γ13x1 + γ23z1 + z2)

= (β1 + β2γ12 + β3γ13)x1 + (β2 + β3γ23)z1 + β3z2

= η1x1 + η2z1 + β3z2.

Estimation Process

We project y onto the orthogonal basis {x1, z1, z2} one by one, and then recover the coefficients corresponding to the original columns of X.

First, we regress y onto z2 for the OLS estimate of the slope βˆ3

β̂3 = <y, z2> / <z2, z2>.

Second, we regress y onto z1, leading to the coefficient η̂2, and

β̂2 = η̂2 − β̂3 γ23.

Third, we regress y onto x1, leading to the coefficient η̂1, and

β̂1 = η̂1 − β̂3 γ13 − β̂2 γ12.

Gram-Schmidt Procedure (Successive Orthogonalization)

1. Initialize z_0 = x_0 = 1.
2. For j = 1, ..., d: regress x_j on z_0, z_1, ..., z_{j−1} to produce coefficients γ̂_kj = <z_k, x_j> / <z_k, z_k>, for k = 0, ..., j − 1, and the residual vector z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k. ({z_0, z_1, ..., z_{j−1}} are orthogonal.)
3. Regress y on the residual z_d to get

β̂_d = η̂_d = <y, z_d> / <z_d, z_d>.

4. Compute β̂_j, j = d − 1, ..., 0, in that order successively, based on

η̂_j = <y, z_j> / <z_j, z_j>.

{z_0, z_1, ..., z_d} forms an orthogonal basis for col(X). The multiple regression coefficient β̂_j is the additional contribution of x_j to y, after x_j has been adjusted for x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_d.
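A direct implementation sketch of this procedure (written from the steps above; the function name and test data are mine, not from the lecture). Step 4's successive computation is just back-substitution in Γβ = η, done here with a triangular solve, and the result is checked against a standard least squares fit:

```python
import numpy as np

def gram_schmidt_regression(X, y):
    """Successive orthogonalization; returns the OLS coefficients of y on the columns of X.

    Assumes X has full column rank; column 0 plays the role of x_0 = 1.
    """
    n, p = X.shape
    Z = np.empty((n, p))
    Gamma = np.eye(p)                        # upper triangular, ones on the diagonal
    for j in range(p):
        z = X[:, j].astype(float).copy()
        for k in range(j):                   # regress x_j on z_0, ..., z_{j-1}
            Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            z -= Gamma[k, j] * Z[:, k]
        Z[:, j] = z                          # residual vector z_j
    eta = (Z.T @ y) / np.sum(Z**2, axis=0)   # eta_j = <y, z_j> / <z_j, z_j>
    beta = np.linalg.solve(Gamma, eta)       # back-substitute Gamma beta = eta (step 4)
    return beta, Z, Gamma

rng = np.random.default_rng(7)
n, d = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_gs, Z, Gamma = gram_schmidt_regression(X, y)
print(np.allclose(beta_gs, np.linalg.lstsq(X, y, rcond=None)[0]))   # True
print(np.allclose(Z.T @ Z, np.diag(np.sum(Z**2, axis=0))))          # columns of Z are orthogonal
```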

Collinearity Issue

The dth coefficient is
β̂_d = <z_d, y> / <z_d, z_d>.
If x_d is highly correlated with some of the other x_j's, then
- the residual vector z_d is close to zero
- the coefficient β̂_d will be very unstable
The variance of the estimate is
Var(β̂_d) = σ^2 / ||z_d||^2.

The precision for estimating β̂_d depends on the length of z_d, or, how much of x_d is unexplained by the other x_k's.
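A small numerical illustration of this point (settings are arbitrary): as x2 becomes nearly collinear with x1, the residual z shrinks and Var(β̂_2) = σ^2/||z||^2 blows up.

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma = 100, 1.0
x1 = rng.normal(size=n)

for rho in (0.0, 0.9, 0.999):
    # x2 becomes increasingly collinear with x1 as rho -> 1
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])

    # Residual of x2 after regressing on {1, x1}: the z_d of this slide
    X01 = X[:, :2]
    z = x2 - X01 @ np.linalg.solve(X01.T @ X01, X01.T @ x2)

    var_beta2 = sigma**2 * np.linalg.inv(X.T @ X)[2, 2]      # Var(beta_hat_2)
    print(rho, round(z @ z, 4), round(var_beta2, 4),
          np.isclose(var_beta2, sigma**2 / (z @ z)))         # equals sigma^2 / ||z||^2
```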

Two Computational Algorithms for Multiple Regression

Consider the normal equations

X^T X β = X^T y.

We would like to avoid computing (X^T X)^{-1} directly.
1. QR decomposition of X: X = QR, where Q has orthonormal columns and R is upper triangular. Essentially, a process of orthogonal matrix triangularization.
2. Cholesky decomposition of X^T X: X^T X = R̃ R̃^T, where R̃ is lower triangular.

Matrix Formulation of Orthogonalization

In Step 2 of the Gram-Schmidt procedure, for j = 1, ..., d,

z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k  ⟹  x_j = Σ_{k=0}^{j−1} γ̂_kj z_k + z_j.

In matrix form X = [x1, ..., xd ] and Z = [z1, ..., zd ],

X = ZΓ

- The columns of Z are orthogonal to each other.
- The matrix Γ is upper triangular, with 1's on the diagonal.

Standardizing Z using D = diag{||z_1||, ..., ||z_d||},

X = ZΓ = ZD^{-1} DΓ ≡ QR, with Q = ZD^{-1}, R = DΓ.

QR Decomposition

- The columns of Q form an orthonormal basis for the column space of X.
- Q is an n × d matrix with orthonormal columns, satisfying Q^T Q = I.
- R is upper triangular, d × d, and of full rank.
- X^T X = (QR)^T (QR) = R^T Q^T Q R = R^T R.
The least squares solutions are

β̂ = (X^T X)^{-1} X^T y = R^{-1} R^{-T} R^T Q^T y = R^{-1} Q^T y,
ŷ = X β̂ = (QR)(R^{-1} Q^T y) = Q Q^T y.

QR Algorithm for Normal Equations

Regard β̂ as the solution of the linear system

Rβ = Q^T y.

1. Conduct the QR decomposition X = QR (Gram-Schmidt orthogonalization).
2. Compute Q^T y.
3. Solve the triangular system Rβ = Q^T y.
The computational complexity: nd^2.
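A sketch of this QR route with numpy/scipy on simulated data (illustrative data; scipy's solve_triangular performs the back-substitution in step 3):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(9)
n, d = 500, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ rng.normal(size=d + 1) + rng.normal(size=n)

Q, R = np.linalg.qr(X)                    # reduced QR: Q is n x (d+1), R is (d+1) x (d+1)
beta_hat = solve_triangular(R, Q.T @ y)   # solve R beta = Q^T y (R is upper triangular)
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))   # True
```

Working through QR avoids forming X^T X, whose condition number is the square of that of X, which is one reason the QR route is usually preferred for numerical stability.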

Cholesky Decomposition Algorithm

For any positive definite square matrix A, we have

A = RR^T, where R is a lower triangular matrix of full rank.
1. Compute X^T X and X^T y.
2. Factor X^T X = RR^T; then β̂ = (R^T)^{-1} R^{-1} X^T y.
3. Solve the triangular system Rw = X^T y for w.
4. Solve the triangular system R^T β = w for β.
The computational complexity: d^3 + nd^2/2 (can be faster than QR for small d, but can be less stable).

Var(ŷ_0) = Var(x_0^T β̂) = σ^2 x_0^T (R^T)^{-1} R^{-1} x_0.
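A matching sketch of the Cholesky route (same kind of illustrative data), including Var(ŷ_0) at a new point x_0 computed through the triangular factor as in the formula above:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(10)
n, d = 500, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ rng.normal(size=d + 1) + rng.normal(size=n)

R = np.linalg.cholesky(X.T @ X)                       # X^T X = R R^T, R lower triangular
w = solve_triangular(R, X.T @ y, lower=True)          # step 3: solve R w = X^T y
beta_hat = solve_triangular(R.T, w, lower=False)      # step 4: solve R^T beta = w
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))   # True

# Var(yhat_0) = sigma^2 * x0^T (R^T)^{-1} R^{-1} x0, via one triangular solve
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - d - 1)
x0 = np.ones(d + 1)                                   # hypothetical new covariate vector
u = solve_triangular(R, x0, lower=True)               # u = R^{-1} x0
print(sigma2_hat * (u @ u))                           # estimated Var(yhat_0)
```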
