
Serik Sagitov, Chalmers and GU, March 9, 2016

Chapter 14. Linear regression

1 Simple model

We consider a linear model for the random response $Y = Y(x)$ to an independent variable $X = x$. For a given set of values $(x_1, \ldots, x_n)$ of the independent variable put

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n,$$

assuming that the noise vector $(\epsilon_1, \ldots, \epsilon_n)$ has independent $N(0, \sigma^2)$ random components. Given the data $(y_1, \ldots, y_n)$, the model is characterised by the likelihood function of the three parameters $\beta_0, \beta_1, \sigma^2$:

$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big\{-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\Big\} = (2\pi)^{-n/2}\,\sigma^{-n}\, e^{-\frac{S(\beta_0,\beta_1)}{2\sigma^2}},$$

where $S(\beta_0, \beta_1) = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$. Observe that

$$n^{-1} S(\beta_0, \beta_1) = n^{-1}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 = \beta_0^2 + 2\beta_0\beta_1\bar{x} - 2\beta_0\bar{y} - 2\beta_1\overline{xy} + \beta_1^2\,\overline{x^2} + \overline{y^2}.$$

Least squares estimates

Regression lines: true $y = \beta_0 + \beta_1 x$ and fitted $y = b_0 + b_1 x$. We want to find $(b_0, b_1)$ such that the observed responses $y_i$ are approximated by the predicted responses $\hat{y}_i = b_0 + b_1 x_i$ in an optimal way.

Least squares method: find $(b_0, b_1)$ minimising the sum of squares $S(b_0, b_1) = \sum (y_i - \hat{y}_i)^2$.

From $\partial S/\partial b_0 = 0$ and $\partial S/\partial b_1 = 0$ we get the so-called normal equations
$$b_0 + b_1\bar{x} = \bar{y}, \qquad b_0\bar{x} + b_1\overline{x^2} = \overline{xy},$$
implying
$$b_1 = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{r s_y}{s_x}, \qquad b_0 = \bar{y} - b_1\bar{x}.$$

The least squares regression line $y = b_0 + b_1 x$ takes the form
$$y = \bar{y} + r\frac{s_y}{s_x}(x - \bar{x}),$$
where we use the
sample variances $s_x^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$, $s_y^2 = \frac{1}{n-1}\sum(y_i - \bar{y})^2$,
sample covariance $s_{xy} = \frac{1}{n-1}\sum(x_i - \bar{x})(y_i - \bar{y})$,
sample correlation coefficient $r = \frac{s_{xy}}{s_x s_y}$.
The least squares estimates $(b_0, b_1)$ are the maximum likelihood estimates of $(\beta_0, \beta_1)$. The least squares estimates $(b_0, b_1)$ are not robust: outliers exert leverage on the fitted line.
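As a numerical illustration, here is a minimal Python/NumPy sketch of the above formulas. The function name and the input arrays x, y are placeholders, not part of the notes.

```python
import numpy as np

def least_squares_line(x, y):
    """Least squares estimates (b0, b1) via the sample statistics above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)       # sample standard deviations
    r = np.corrcoef(x, y)[0, 1]                 # sample correlation coefficient
    b1 = r * sy / sx                            # slope b1 = r * sy / sx
    b0 = ybar - b1 * xbar                       # intercept b0 = ybar - b1 * xbar
    return b0, b1
```

The same estimates are returned by np.polyfit(x, y, 1), with the coefficients listed in the order (b1, b0).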

2 Residuals

The estimated regression line predicts the responses to the values of the explanatory variable by $\hat{y}_i = \bar{y} + r\frac{s_y}{s_x}(x_i - \bar{x})$. The noise in the observed responses $y_i$ is represented by the residuals
$$e_i = y_i - \hat{y}_i = y_i - \bar{y} - r\frac{s_y}{s_x}(x_i - \bar{x}),$$
which satisfy
$$e_1 + \ldots + e_n = 0, \quad x_1 e_1 + \ldots + x_n e_n = 0, \quad \hat{y}_1 e_1 + \ldots + \hat{y}_n e_n = 0.$$

Residuals $e_i$ have normal distributions with zero mean and

$$\mathrm{Var}(e_i) = \sigma^2\Big(1 - \frac{\sum_k (x_k - x_i)^2}{n(n-1)s_x^2}\Big), \qquad \mathrm{Cov}(e_i, e_j) = -\sigma^2\cdot\frac{\sum_k (x_k - x_i)(x_k - x_j)}{n(n-1)s_x^2}.$$

Error sum of squares

$$SSE = \sum_i e_i^2 = \sum_i (y_i - \bar{y})^2 - 2r\frac{s_y}{s_x}\,n(\overline{xy} - \bar{y}\bar{x}) + r^2\frac{s_y^2}{s_x^2}\sum_i (x_i - \bar{x})^2 = (n-1)s_y^2(1 - r^2).$$

Corrected maximum likelihood estimate of $\sigma^2$: $s^2 = \frac{SSE}{n-2} = \frac{n-1}{n-2}\,s_y^2(1 - r^2)$.

Using $y_i - \bar{y} = \hat{y}_i - \bar{y} + e_i$ we obtain the decomposition SST = SSR + SSE, where
$SST = \sum_i (y_i - \bar{y})^2 = (n-1)s_y^2$ is the total sum of squares,
$SSR = \sum_i (\hat{y}_i - \bar{y})^2 = (n-1)b_1^2 s_x^2$ is the regression sum of squares.
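Continuing the sketch above (reusing the illustrative helper least_squares_line), the decomposition can be checked numerically:

```python
def sums_of_squares(x, y):
    """SST, SSR and SSE for the fitted line; SST = SSR + SSE up to rounding."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b0, b1 = least_squares_line(x, y)
    yhat = b0 + b1 * x                          # predicted responses
    sst = np.sum((y - y.mean()) ** 2)           # total sum of squares
    ssr = np.sum((yhat - y.mean()) ** 2)        # regression sum of squares
    sse = np.sum((y - yhat) ** 2)               # error sum of squares
    return sst, ssr, sse
```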

Coefficient of determination: $r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$.

The coefficient of determination is the proportion of the variation in Y explained by the main factor X. Thus $r^2$ has a more transparent meaning than the correlation coefficient $r$.

To check the normality assumption, use the normal probability plot for the standardised residuals $e_i/s_i$, where
$$s_i = s\sqrt{1 - \frac{\sum_k (x_k - x_i)^2}{n(n-1)s_x^2}}$$
are the estimated standard deviations of $e_i$. The expected plot of the standardised residuals versus $x_i$ is a horizontal blur: no systematic pattern (linearity) and a spread that does not depend on $x$ (homoscedasticity).
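A sketch of the standardised residuals, again reusing the illustrative helper least_squares_line:

```python
def standardised_residuals(x, y):
    """Residuals e_i divided by their estimated standard deviations s_i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b0, b1 = least_squares_line(x, y)
    e = y - b0 - b1 * x                                   # residuals
    s = np.sqrt(np.sum(e ** 2) / (n - 2))                 # corrected estimate of sigma
    denom = n * (n - 1) * x.var(ddof=1)
    si = s * np.sqrt(1 - np.array([np.sum((x - xi) ** 2) for xi in x]) / denom)
    return e / si                                         # plot these against x_i
```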

Example (flow rate vs stream depth). For this example with n = 10, the scatter plot looks slightly non-linear. The residual plot gives a clearer picture, showing a U-shape. After a log-log transformation, the scatter plot is closer to linear and the residual plot has a horizontal profile.

3 Confidence intervals and hypothesis testing

The least squares estimators $(b_0, b_1)$ are unbiased and consistent. Due to the normality assumption we have the following exact distributions:
$$b_0 \sim N(\beta_0, \sigma_0^2), \quad \sigma_0^2 = \frac{\sigma^2\sum x_i^2}{n(n-1)s_x^2}, \qquad \frac{b_0 - \beta_0}{s_{b_0}} \sim t_{n-2}, \quad s_{b_0} = \frac{s\sqrt{\sum x_i^2}}{s_x\sqrt{n(n-1)}},$$
$$b_1 \sim N(\beta_1, \sigma_1^2), \quad \sigma_1^2 = \frac{\sigma^2}{(n-1)s_x^2}, \qquad \frac{b_1 - \beta_1}{s_{b_1}} \sim t_{n-2}, \quad s_{b_1} = \frac{s}{s_x\sqrt{n-1}}.$$
There is a weak dependence between the two estimators: $\mathrm{Cov}(b_0, b_1) = -\frac{\sigma^2\bar{x}}{(n-1)s_x^2}$.

Exact 100(1 − α)% confidence interval for $\beta_i$: $b_i \pm t_{n-2}(\frac{\alpha}{2})\cdot s_{b_i}$.
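A sketch of these intervals, assuming SciPy is available for the t quantile and reusing least_squares_line from above; the function name is illustrative:

```python
from scipy import stats

def slope_intercept_ci(x, y, alpha=0.05):
    """Exact 100(1 - alpha)% confidence intervals for beta_0 and beta_1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b0, b1 = least_squares_line(x, y)
    e = y - b0 - b1 * x
    s = np.sqrt(np.sum(e ** 2) / (n - 2))                    # estimate of sigma
    sx = x.std(ddof=1)
    sb1 = s / (sx * np.sqrt(n - 1))                          # standard error of b1
    sb0 = s * np.sqrt(np.sum(x ** 2) / (n * (n - 1))) / sx   # standard error of b0
    tq = stats.t.ppf(1 - alpha / 2, df=n - 2)                # t_{n-2} quantile
    return (b0 - tq * sb0, b0 + tq * sb0), (b1 - tq * sb1, b1 + tq * sb1)
```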

Hypothesis testing of $H_0$: $\beta_i = \beta_{i0}$: test statistic $T = \frac{b_i - \beta_{i0}}{s_{b_i}}$, exact null distribution $T \sim t_{n-2}$.

Model utility test and zero-intercept test

$H_0$: $\beta_1 = 0$ (no relationship between X and Y), test statistic $T = b_1/s_{b_1}$, null distribution $T \sim t_{n-2}$.

$H_0$: $\beta_0 = 0$, test statistic $T = b_0/s_{b_0}$, null distribution $T \sim t_{n-2}$.
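The model utility test can be carried out along the same lines (illustrative helper, reusing the pieces above):

```python
def model_utility_test(x, y):
    """T = b1 / s_b1 and its two-sided p-value under H0: beta_1 = 0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b0, b1 = least_squares_line(x, y)
    e = y - b0 - b1 * x
    s = np.sqrt(np.sum(e ** 2) / (n - 2))
    sb1 = s / (x.std(ddof=1) * np.sqrt(n - 1))
    t_obs = b1 / sb1
    return t_obs, 2 * stats.t.sf(abs(t_obs), df=n - 2)   # null distribution t_{n-2}
```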

Intervals for individual observations

Given $x$, we wish to predict the value $y$ of the random variable $Y = \beta_0 + \beta_1 x + \epsilon$. Its mean $\mu = \beta_0 + \beta_1 x$ has the least squares estimate $\hat{\mu} = b_0 + b_1 x$. The standard error of $\hat{\mu}$ is computed as the square root of
$$\mathrm{Var}(\hat{\mu}) = \frac{\sigma^2}{n} + \frac{\sigma^2}{n-1}\Big(\frac{x - \bar{x}}{s_x}\Big)^2.$$

Exact 100(1 − α)% confidence interval for the mean $\mu$: $b_0 + b_1 x \pm t_{n-2}(\frac{\alpha}{2})\cdot s\sqrt{\frac{1}{n} + \frac{1}{n-1}\big(\frac{x - \bar{x}}{s_x}\big)^2}$.

Exact 100(1 − α)% prediction interval for $y$: $b_0 + b_1 x \pm t_{n-2}(\frac{\alpha}{2})\cdot s\sqrt{1 + \frac{1}{n} + \frac{1}{n-1}\big(\frac{x - \bar{x}}{s_x}\big)^2}$.

The prediction interval has wider limits, since
$$\mathrm{Var}(Y - \hat{\mu}) = \sigma^2 + \mathrm{Var}(\hat{\mu}) = \sigma^2\Big(1 + \frac{1}{n} + \frac{1}{n-1}\Big(\frac{x - \bar{x}}{s_x}\Big)^2\Big)$$
contains the additional uncertainty due to the noise term. Compare the two formulas by drawing the confidence bands around the regression line, both for an individual observation $y$ and for the mean $\mu$.
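A sketch comparing the two intervals at a new point x0 (names are placeholders; reuses least_squares_line and scipy.stats from above):

```python
def mean_and_prediction_intervals(x, y, x0, alpha=0.05):
    """Exact intervals at x0: for the mean mu and for a new observation y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b0, b1 = least_squares_line(x, y)
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    tq = stats.t.ppf(1 - alpha / 2, df=n - 2)
    h = 1 / n + ((x0 - x.mean()) / x.std(ddof=1)) ** 2 / (n - 1)
    fit = b0 + b1 * x0
    half_mean = tq * s * np.sqrt(h)                   # half-width of the CI for mu
    half_pred = tq * s * np.sqrt(1 + h)               # wider prediction half-width
    return (fit - half_mean, fit + half_mean), (fit - half_pred, fit + half_pred)
```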

4 Linear regression and ANOVA

Recall the two independent samples case from Chapter 11:
first sample $\mu_1 + \epsilon_1, \ldots, \mu_1 + \epsilon_n$,
second sample $\mu_2 + \epsilon_{n+1}, \ldots, \mu_2 + \epsilon_{n+m}$,
where the noise variables are independent and identically distributed, $\epsilon_i \sim N(0, \sigma^2)$. This setting is equivalent to the model

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad x_1 = \ldots = x_n = 0, \quad x_{n+1} = \ldots = x_{n+m} = 1,$$
with $\mu_1 = \beta_0$ and $\mu_2 = \beta_0 + \beta_1$.

The model utility test $H_0$: $\beta_1 = 0$ is equivalent to the equality test $H_0$: $\mu_1 = \mu_2$. More generally, for the one-way ANOVA setting with $I = p$ levels for the main factor and $n = pJ$ observations

$$\beta_0 + \epsilon_i, \quad i = 1, \ldots, J,$$
$$\beta_0 + \beta_1 + \epsilon_i, \quad i = J + 1, \ldots, 2J,$$
$$\vdots$$
$$\beta_0 + \beta_{p-1} + \epsilon_i, \quad i = (p-1)J + 1, \ldots, n,$$
we need a multiple linear regression model

$$Y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1} x_{i,p-1} + \epsilon_i, \quad i = 1, \ldots, n,$$
with dummy variables $x_{i,j}$ taking values 0 and 1 so that

$x_{i,1} = 1$ only for $i = J + 1, \ldots, 2J$,
$x_{i,2} = 1$ only for $i = 2J + 1, \ldots, 3J$,
$\vdots$
$x_{i,p-1} = 1$ only for $i = (p-1)J + 1, \ldots, n$.
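A small sketch of how such a design matrix with dummy variables can be built (the function name is illustrative):

```python
def anova_design_matrix(p, J):
    """Intercept column plus p-1 dummy columns for one-way ANOVA with p levels
    and J observations per level (rows j*J, ..., (j+1)*J - 1 belong to level j+1)."""
    n = p * J
    X = np.zeros((n, p))
    X[:, 0] = 1                                  # intercept column of ones
    for j in range(1, p):
        X[j * J:(j + 1) * J, j] = 1              # x_{i,j} = 1 only for level j + 1
    return X
```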

5 Multiple linear regression

Consider a linear regression model

$$Y = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1} + \epsilon, \quad \epsilon \sim N(0, \sigma^2),$$
with $p - 1$ explanatory variables and homoscedastic noise. This is an extension of the simple linear regression model, which corresponds to $p = 2$. The corresponding data set consists of observations $(y_1, \ldots, y_n)$ with $n > p$, which are realisations of $n$ independent random variables

$$Y_1 = \beta_0 + \beta_1 x_{1,1} + \ldots + \beta_{p-1} x_{1,p-1} + \epsilon_1,$$
$$\vdots$$
$$Y_n = \beta_0 + \beta_1 x_{n,1} + \ldots + \beta_{p-1} x_{n,p-1} + \epsilon_n.$$

In matrix notation the column vector $y = (y_1, \ldots, y_n)^T$ is a realisation of $Y = X\beta + \epsilon$, where

$$Y = (Y_1, \ldots, Y_n)^T, \quad \beta = (\beta_0, \ldots, \beta_{p-1})^T, \quad \epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$$
are column vectors, and $X$ is the so-called design matrix
$$X = \begin{pmatrix} 1 & x_{1,1} & \ldots & x_{1,p-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \ldots & x_{n,p-1} \end{pmatrix},$$

assumed to have rank $p$. The least squares estimates $b = (b_0, \ldots, b_{p-1})^T$ minimise $S(b) = \|y - Xb\|^2$, where $\|a\|$ is the length of a vector $a$. Solving the normal equations $X^T X b = X^T y$ we find the least squares estimates
$$b = M X^T y, \qquad M = (X^T X)^{-1}.$$

Least squares multiple regression: predicted responses $\hat{y} = Xb = Py$, where $P = XMX^T$.
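A minimal NumPy sketch of these matrix formulas; in practice a numerically safer solver such as np.linalg.lstsq would be preferred over forming the inverse explicitly, but the sketch follows the notation above:

```python
def multiple_regression(X, y):
    """Least squares estimates b = M X^T y with M = (X^T X)^{-1}, plus fitted values."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    M = np.linalg.inv(X.T @ X)                   # M = (X^T X)^{-1}
    b = M @ X.T @ y                              # least squares estimates
    yhat = X @ b                                 # predicted responses P y
    return b, yhat, M
```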

The covariance matrix of the least squares estimates $\Sigma_{bb} = \sigma^2 M$ is a $p \times p$ matrix with elements $\mathrm{Cov}(b_i, b_j)$. The vector of residuals $e = y - \hat{y} = (I - P)y$ has covariance matrix $\Sigma_{ee} = \sigma^2(I - P)$.

An unbiased estimate of $\sigma^2$ is given by $s^2 = \frac{SSE}{n-p}$, where $SSE = \|e\|^2$. The standard error of $b_j$ is computed as $s_{b_j} = s\sqrt{m_{jj}}$, where $m_{jj}$ is the corresponding diagonal element of $M$.

Exact sampling distributions: $\frac{b_j - \beta_j}{s_{b_j}} \sim t_{n-p}$, $j = 1, \ldots, p-1$.
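Continuing the sketch, the standard errors and t values follow from the diagonal of M (illustrative helper built on multiple_regression above):

```python
def coefficient_table(X, y):
    """Estimates b_j, standard errors s_bj = s * sqrt(m_jj), and t values b_j / s_bj."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    b, yhat, M = multiple_regression(X, y)
    n, p = X.shape
    s = np.sqrt(np.sum((y - yhat) ** 2) / (n - p))   # unbiased estimate of sigma
    se = s * np.sqrt(np.diag(M))                     # standard errors
    return b, se, b / se                             # compare b_j / s_bj with t_{n-p}
```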

Inspect the normal probability plot for the standardised residuals $\frac{y_i - \hat{y}_i}{s\sqrt{1 - p_{ii}}}$, where $p_{ii}$ are the diagonal elements of $P$.

The coefficient of multiple determination can be computed similarly to the simple linear regression model as $R^2 = 1 - \frac{SSE}{SST}$, where $SST = (n-1)s_y^2$. The problem with $R^2$ is that it increases even if irrelevant variables are added to the model. To penalise irrelevant variables it is better to use the adjusted coefficient of multiple determination

$$R_a^2 = 1 - \frac{n-1}{n-p}\cdot\frac{SSE}{SST} = 1 - \frac{s^2}{s_y^2}.$$

The adjustment factor $\frac{n-1}{n-p}$ gets larger for larger values of $p$.
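A sketch of both coefficients of determination, reusing multiple_regression from above:

```python
def determination_coefficients(X, y):
    """R^2 and the adjusted R_a^2 for a fitted multiple regression."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    b, yhat, _ = multiple_regression(X, y)
    n, p = X.shape
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)                # (n - 1) * s_y^2
    r2 = 1 - sse / sst
    r2_adj = 1 - (n - 1) / (n - p) * sse / sst       # penalises extra variables
    return r2, r2_adj
```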

Example (flow rate vs stream depth). The multiple linear regression framework covers the quadratic model $y = \beta_0 + \beta_1 x + \beta_2 x^2$. The residuals show no sign of systematic misfit, and both the linear and the quadratic terms are statistically significant:

Coefficient    Estimate    Standard error    t value
$\beta_0$      1.68        1.06              1.52
$\beta_1$      −10.86      4.52              −2.40
$\beta_2$      23.54       4.27              5.51
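The quadratic fit itself is an ordinary multiple regression with design matrix columns 1, x, x². A sketch reusing coefficient_table from above; the arrays x, y stand for the stream data, which are not reproduced in these notes:

```python
def quadratic_fit(x, y):
    """Fit y = b0 + b1*x + b2*x^2 as a multiple regression with columns 1, x, x^2."""
    x = np.asarray(x, float)
    X = np.column_stack([np.ones_like(x), x, x ** 2])   # quadratic design matrix
    return coefficient_table(X, y)                      # estimates, std errors, t values
```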

An empirical relationship developed in one region might break down if extrapolated to a wider region in which no data have been observed.

Example (catheter length). Doctors want to determine the appropriate heart catheter length depending on a child's height and weight. The pairwise scatterplots for the data of size n = 12 suggest two simple linear regressions:

Estimate             Height         t value    Weight         t value
$b_0$ ($s_{b_0}$)    12.1 (4.3)     2.8        25.6 (2.0)     12.8
$b_1$ ($s_{b_1}$)    0.60 (0.10)    6.0        0.28 (0.04)    7.0
$s$                  4.0                       3.8
$r^2$ ($R_a^2$)      0.78 (0.76)               0.80 (0.78)

The plots of standardised residuals do not contradict the normality assumptions.

The simple regression models should be compared to the multiple regression model $L = \beta_0 + \beta_1 H + \beta_2 W$, which gives

$b_0 = 21$, $s_{b_0} = 8.8$, $b_0/s_{b_0} = 2.39$,
$b_1 = 0.20$, $s_{b_1} = 0.36$, $b_1/s_{b_1} = 0.56$,
$b_2 = 0.19$, $s_{b_2} = 0.17$, $b_2/s_{b_2} = 1.12$,
$s = 3.9$, $R^2 = 0.81$, $R_a^2 = 0.77$.

In contrast to the simple models, we can reject neither $H_1$: $\beta_1 = 0$ nor $H_2$: $\beta_2 = 0$. This paradox is explained by the different meaning of the slope parameters in the simple and multiple regression models. In the multiple model, $\beta_1$ is the expected change in L when H is increased by one unit and W is held constant.

Collinearity problem: height and weight have a strong linear relationship. The fitted plane has a well resolved slope along the line about which the (H, W) points fall, and poorly resolved slopes along the H and W axes.

Conclusion: since the simple model $L = \beta_0 + \beta_1 W$ gives the highest adjusted coefficient of determination, there is little or no gain from adding H to the regression model with the single explanatory variable W.
