Stat331: Applied Linear Models
Stat331: Applied Linear Models
Helena S. Ven
05 Sept. 2019

Instructor: Leilei Zeng ([email protected])
Office: TTh, 4:00–5:00, at M3.4223

Topics: Modeling the relationship between a response variable and several explanatory variables via regression models.
1. Parameter estimation and inference
2. Confidence intervals
3. Prediction
4. Model assumption checking, methods dealing with violations
5. Variable selection
6. Applied issues: outliers, influential points, multi-collinearity (highly correlated explanatory variables)

Grading: 20% Assignments (4), 20% Midterms (2), 50% Final
Midterms are on 22 Oct. and 12 Nov.
Textbook: Abraham B., Ledolter J. Introduction to Regression Modeling

Index
1 Simple Linear Regression
  1.1 Regression Model
  1.2 Simple Linear Regression
  1.3 Method of Least Squares
  1.4 Method of Maximum Likelihood
  1.5 Properties of Estimators
2 Confidence Intervals and Hypothesis Testing
  2.1 Pivotal Quantities
  2.2 Estimation of the Mean Response
  2.3 Prediction of a Single Response
  2.4 Analysis of Variance and F-test
  2.5 Coëfficient of Determination
3 Multiple Linear Regression
  3.1 Random Vectors
    3.1.1 Multilinear Regression
    3.1.2 Parameter Estimation
  3.2 Estimation of Variance
  3.3 Inference and Prediction
  3.4 Maximum Likelihood Estimation
  3.5 Geometric Interpretation of Least Squares
  3.6 ANOVA for Multilinear Regression
  3.7 Hypothesis Testing
    3.7.1 Testing on Subsets of Coëfficients
  3.8 Testing General Linear Hypotheses
  3.9 Interactions in Regression and Categorical Variables
4 Model Checking and Residual Analysis
  4.1 Errors and Residuals
    4.1.1 Checking Higher Order Terms
    4.1.2 Checking Constant Variance
  4.2 Q-Q Plot and Data Transformation
  4.3 Data Transformation
    4.3.1 Box-Cox Power Transformation
  4.4 Weighted Least-Squares Regression
  4.5 Outliers and Influential Cases
    4.5.1 Influential Cases
5 Model/Variable Selection
  5.1 Automated Methods
  5.2 All Subsets Regressions

Caput 1  Simple Linear Regression

Regression is a statistical technique for investigating and modeling the relationship between variables. The response variable (dependent variable, outcome) is the variable of interest, and we evaluate how it changes depending on the explanatory variables (independent variables, covariates, predictors), which affect the response. By convention, the explanatory variables are represented by x and the response variable by y.

1.1 Regression Model

This course studies only linear models. A general linear model on d explanatory variables can be written as
\[
y = \underbrace{\beta_0 + \beta^\top x}_{\text{Deterministic Part}} + \underbrace{\varepsilon}_{\text{Random Error}} = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d + \varepsilon
\]
$\beta_0, \beta$ are the regression parameters or coëfficients. Since the right hand side is a linear function, y must be a continuous variable. There are ways to convert this to a binary variable, e.g. via the logistic function.

The objective of linear regression is to determine the unknown parameters $\beta_0, \beta$. Fortunately, all of the optimisation problems in this course are solvable analytically; we do not need e.g. gradient descent to find the unknown parameters.

1.2 Simple Linear Regression

The simplest case of linear regression is one with only one explanatory variable:
\[
y = \beta_0 + \beta_1 x + \varepsilon
\]
The 2D plot of $(x, y)$ pairs is a scatter plot. Suppose from a random sample of n subjects we obtain $(x_i, y_i)$. Substituting $x_i, y_i$ into the linear regression equation gives
\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]
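As a quick illustration of this setup, here is a minimal Python sketch (not part of the original notes) that simulates a sample $(x_i, y_i)$ from the simple linear regression model. The parameter values $\beta_0 = 1$, $\beta_1 = 2.5$, $\sigma = 0.8$ are arbitrary choices for illustration, and the normal errors anticipate the distributional assumption listed next.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
rng = np.random.default_rng(0)
n = 50
beta0, beta1, sigma = 1.0, 2.5, 0.8

x = rng.uniform(0, 10, size=n)       # explanatory variable
eps = rng.normal(0, sigma, size=n)   # random errors, eps_i ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps          # responses under y_i = beta0 + beta1*x_i + eps_i
```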
Assumptions:
1. The expectation of a random error is 0: $E[\varepsilon_i] = 0$.
2. The random errors have equal variance: $\operatorname{var}(\varepsilon_i) = \sigma^2$.
3. Independence: $\{\varepsilon_1, \dots, \varepsilon_n\}$ are independent.
4. Distributional assumption: $\varepsilon_i \sim N(0, \sigma^2)$.
Hence the $\varepsilon_i$ are IID normal. The normality assumption on $\varepsilon_i$ implies the normality of the response $y_i$, because
\[
E[Y_i] = E[\beta_0 + \beta_1 x_i + \varepsilon_i] = \beta_0 + \beta_1 x_i + E[\varepsilon_i] = \beta_0 + \beta_1 x_i
\]
and also $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$.

What is the interpretation of $\beta_0$ and $\beta_1$?
1. $\beta_0$ is the average outcome when $x_i = 0$.
2. $\beta_1$ is more important: it is the expected/average change in y when x increases by 1.
N.B. Unfortunately, in some cases the explanatory variable x is not allowed to take values close to 0, or moving x by 1 causes x to fall outside the range of the sample. Scaling is then needed to produce a sensible interpretation.

1.3 Method of Least Squares

A common method of determining $\beta_0$ and $\beta_1$ is the least-squares estimate, i.e. choose $\beta_0, \beta_1$ such that the sum of squares of the sample errors
\[
\text{Sum of Squares of Errors} := \sum_{i=1}^n \varepsilon_i^2
\]
is minimised. The least-squares estimate amounts to solving
\[
\operatorname*{arg\,min}_{\beta_0, \beta_1} \sum_{i=1}^n \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2
\]
Fortunately there is an analytic solution to this nonlinear optimisation problem.

1.3.1 Proposition. The $\beta_0, \beta_1$ which attain the above minimum are
\[
\hat\beta_0 := \bar y - \hat\beta_1 \bar x, \qquad
\hat\beta_1 := \frac{\sum_i x_i y_i - n \bar x \bar y}{\sum_i x_i^2 - n \bar x^2}
\]
Proof. Differentiating w.r.t. $\beta_0$,
\[
\frac{\partial}{\partial \beta_0} \sum_{i=1}^n \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 = 2 \sum_{i=1}^n \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)(-1)
\]
Similarly for $\beta_1$,
\[
\frac{\partial}{\partial \beta_1} \sum_{i=1}^n \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 = 2 \sum_{i=1}^n \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)(-x_i)
\]
Setting these derivatives to zero at $\beta_j = \hat\beta_j$ and dividing by $-2$,
\[
0 = \sum_{i=1}^n \bigl(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\bigr), \qquad
0 = \sum_{i=1}^n \bigl(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\bigr) x_i
\]
Expanding,
\[
\sum_i y_i - n\hat\beta_0 - \hat\beta_1 \sum_i x_i = 0
\]
\[
\sum_i x_i y_i - \hat\beta_0 \sum_i x_i - \hat\beta_1 \sum_i x_i^2 = 0
\]
If we solve for $\hat\beta_0$ in the first equation,
\[
\hat\beta_0 = \frac{1}{n}\sum_i y_i - \hat\beta_1 \frac{1}{n}\sum_i x_i = \bar y - \hat\beta_1 \bar x
\]
Substituting this into the second equation,
\[
0 = \sum_i x_i y_i - (\bar y - \hat\beta_1 \bar x) n \bar x - \hat\beta_1 \sum_i x_i^2
  = \sum_i x_i y_i - n \bar x \bar y + \hat\beta_1 n \bar x^2 - \hat\beta_1 \sum_i x_i^2
\]
Hence
\[
\hat\beta_1 = \frac{\sum_i x_i y_i - n \bar x \bar y}{\sum_i x_i^2 - n \bar x^2}
\]
We can verify that the objective function is convex, so this is indeed the global minimum. ∎

Definition. The sums of squares are
\[
S_{xx} := \sum_i (x_i - \bar x)^2, \qquad S_{xy} := \sum_i (x_i - \bar x)(y_i - \bar y)
\]
Observe that $S_{xx}$ can be expanded:
\[
S_{xx} = \sum_{i=1}^n (x_i - \bar x) x_i - \bar x \underbrace{\sum_{i=1}^n (x_i - \bar x)}_{=0} = \sum_{i=1}^n x_i^2 - n \bar x^2
\]
Likewise,
\[
S_{xy} = \sum_{i=1}^n (x_i y_i - x_i \bar y - \bar x y_i + \bar x \bar y)
       = \sum_{i=1}^n x_i y_i - n \bar x \bar y - n \bar x \bar y + n \bar x \bar y
       = \sum_{i=1}^n x_i y_i - n \bar x \bar y
\]
Hence
\[
\hat\beta_1 = \frac{S_{xy}}{S_{xx}}
\]

For each subject i,
\[
\hat y_i := \hat\beta_0 + \hat\beta_1 x_i
\]
is the estimated mean or fitted value for $x_i$. The difference
\[
r_i := \text{Observed} - \text{Fitted} = y_i - \hat y_i
\]
is the residual.

From the minimisation conditions for $\hat\beta_0, \hat\beta_1$ we know that
\[
\sum_i r_i = 0, \qquad \sum_i r_i x_i = 0, \qquad \sum_i r_i \hat y_i = 0
\]
The last condition follows from
\[
\sum_i r_i \hat y_i = \sum_i r_i (\hat\beta_0 + \hat\beta_1 x_i) = \hat\beta_0 \sum_i r_i + \hat\beta_1 \sum_i r_i x_i = 0
\]

What is the variance of the $r_i$? Recall that the sample variance is an unbiased estimator of the population variance:
\[
s_x^2 := \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2
\]
One degree of freedom is lost because $\bar x$ is used as an estimator of the population mean. Another way to view this: if we know $x_1, \dots, x_{n-1}$ and $\bar x$, then $x_n$ is automatically known.

We must recognise that each $y_i$ has a different distribution, i.e. $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$. The mean square error (MSE) is a random variable:
\[
S^2 := \frac{1}{n-2} \sum_{i=1}^n \bigl(Y_i - (\tilde\beta_0 + \tilde\beta_1 x_i)\bigr)^2 = \frac{1}{n-2} \sum_{i=1}^n R_i^2
\]
Two degrees of freedom are lost since we are estimating $\beta_0$ and $\beta_1$ using $\tilde\beta_0, \tilde\beta_1$.

1.3.2 Proposition. The MSE $S^2$ is an unbiased estimator of $\sigma^2$.
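The closed-form estimates above translate directly into code. The following is a minimal sketch (not from the notes) computing $\hat\beta_1 = S_{xy}/S_{xx}$, $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, the fitted values, the residuals, and the MSE $S^2$ with $n-2$ degrees of freedom; the function name and return layout are my own choices.

```python
import numpy as np

def least_squares_fit(x, y):
    """Closed-form least-squares fit of y = beta0 + beta1*x + eps."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    Sxx = np.sum((x - x.mean()) ** 2)                # = sum x_i^2 - n*xbar^2
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))    # = sum x_i*y_i - n*xbar*ybar
    b1 = Sxy / Sxx                                   # slope estimate
    b0 = y.mean() - b1 * x.mean()                    # intercept estimate
    fitted = b0 + b1 * x                             # yhat_i
    resid = y - fitted                               # residuals r_i
    s2 = np.sum(resid ** 2) / (n - 2)                # MSE, unbiased for sigma^2
    return b0, b1, fitted, resid, s2
```

As a sanity check, `np.sum(resid)` and `np.sum(resid * x)` should both be numerically zero, matching the minimisation conditions above.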
1.4 Method of Maximum Likelihood

Suppose there is a random variable Y with density function $f(y; \theta)$ or mass function $P(Y = y; \theta)$. We infer $\theta$ by sampling Y. If we observe y, the plausibility of a parameter value $\theta$ is measured by the likelihood function
\[
L(\theta) = \begin{cases} P(Y = y; \theta) & \text{if discrete} \\ f(y; \theta) & \text{if continuous} \end{cases}
\]
In essence we choose the value of $\theta$ under which the observed y is most probable. The maximum likelihood estimator is
\[
\hat\theta := \operatorname*{arg\,max}_\theta L(\theta)
\]
It is often easier to work with the logarithmic likelihood, since $L(\theta)$ involves chains of multiplications:
\[
\ell(\theta) := \log L(\theta)
\]
The score function is the derivative of $\ell$:
\[
s(\theta) := \frac{\partial}{\partial \theta} \ell(\theta)
\]

In simple linear regression we assume each sample $Y_i$ is independent and normally distributed, $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$. The joint distribution of $Y_1, \dots, Y_n$ is
\[
L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n f(y_i; \beta_0 + \beta_1 x_i, \sigma^2)
 = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right)
\]
The logarithmic likelihood is
\[
\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2
\]
The scores for $\beta_0, \beta_1$ are
\[
\frac{\partial \ell}{\partial \beta_0} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i), \qquad
\frac{\partial \ell}{\partial \beta_1} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)\, x_i
\]
Setting these scores to zero yields exactly the normal equations from the least-squares derivation, so the maximum-likelihood estimators $\hat\beta_0, \hat\beta_1$ coincide with the least-squares estimators.
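To see this coincidence numerically, here is a minimal sketch (my own, assuming SciPy is available) that maximises the log-likelihood above by minimising its negative and compares the resulting slope with the closed-form $S_{xy}/S_{xx}$; the simulated data and true parameter values are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, x, y):
    """Negative log-likelihood for Y_i ~ N(beta0 + beta1*x_i, sigma^2)."""
    beta0, beta1, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)      # optimise log(sigma) so that sigma > 0
    resid = y - beta0 - beta1 * x
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / (2 * sigma2)

# Simulated data with arbitrary (hypothetical) true parameters.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.5 * x + rng.normal(0, 0.8, size=50)

fit = minimize(neg_log_likelihood, x0=np.zeros(3), args=(x, y), method="Nelder-Mead")
b0_ml, b1_ml = fit.x[0], fit.x[1]       # maximum-likelihood estimates

# Closed-form least-squares slope for comparison.
b1_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(b1_ml, b1_ls)                     # the two slope estimates agree numerically
```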