
Chapter 5: Ordinary Regression
Part 1: Simple Linear Regression: Introduction and Estimation

• Methods for studying the relationship of two or more quantitative variables

• Examples: – predict salary from years of experience

– find effect of lead exposure on school performance

– predict force at which a metal alloy rod bends based on iron content

Simple Linear Regression: Linear regression model

• The basic model

Yi = β0 + β1xi + εi

– Yi is the response or dependent variable

– xi is the observed predictor, explanatory variable, independent variable, covariate

– xi is treated as a fixed quantity (or if random it is conditioned upon)

– εi is the error term

– εi are iid N(0, σ²)

So, E[Yi] = β0 + β1xi + 0 = β0 + β1xi

Simple Linear Regression: Linear regression model

• Key assumptions (will check these later)

– linear relationship (between Y and x)

*we say the relationship between Y and x is linear if the means of the conditional distributions of Y|x lie on a straight line

– independent errors (independent observations in SLR)

– constant variance of errors

– normally distributed errors

Simple Linear Regression: Interpreting the model

• Model can also be written as:

Yi | Xi = xi ∼ N(β0 + β1xi, σ²)

– mean of Y given X = x is β0 + β1x (known as conditional mean)

– β0 + β1x is the mean value of all the Y ’s for the given value of x

– β0 is conditional mean when x=0

– β1 is slope, change in mean of Y per 1 unit change in x

– σ² is the variance of the responses at x (i.e. dispersion around the conditional mean)
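To make the model concrete, here is a minimal simulation sketch (the values β0 = 1, β1 = 2, and σ = 0.5 are made up for illustration, not taken from any data set in these notes): each Yi is drawn from a normal distribution centered at the conditional mean β0 + β1xi.

## Simulate responses from Yi = beta0 + beta1*xi + eps_i, with eps_i ~ N(0, sigma^2)
set.seed(1)
beta0 <- 1; beta1 <- 2; sigma <- 0.5      # illustrative values only
x <- seq(0, 5, length.out = 50)           # fixed predictor values
y <- beta0 + beta1*x + rnorm(length(x), mean = 0, sd = sigma)
plot(x, y)                                # points scatter around the line
abline(a = beta0, b = beta1)              # true conditional mean E[Y|x] = beta0 + beta1*x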

Simple Linear Regression: Estimation of β0 and β1

We wish to use the sample data to estimate the population parameters: the slope β1 and the intercept β0

• Least squares estimation

– choose β̂0 = b0 and β̂1 = b1 such that we minimize the sum of the squared residuals, i.e. minimize Σ_{i=1}^{n} (Yi − Ŷi)²

– minimize g(b0, b1) = Σ_{i=1}^{n} (Yi − (b0 + b1xi))²

– Take derivative of g(b0, b1) with respect to b0 and b1, set equal to zero, and solve

– Results:

b0 = Ȳ − b1x̄

b1 = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^{n} (xi − x̄)²

The point (x̄, Ȳ) will always be on the least squares line. (A numerical check of these formulas appears after the figure below.)

– b0 and b1 are best linear unbiased estimators (“best” meaning smallest variance among linear unbiased estimators)

Notation for fitted line:

Ŷi = β̂0 + β̂1xi or Ŷi = b0 + b1xi

or, in the text, Ŷi = A + Bxi

– predicted (fitted) value: Ŷi = b0 + b1xi

– residual: ei = Yi − Ŷi

[Figure: scatterplot of Y vs. X with the fitted least squares line]

The least squares regression line minimizes the residual sum of squares (RSS) = Σ_{i=1}^{n} (Yi − Ŷi)²
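As referenced above, here is a quick numerical check (a sketch on simulated data with made-up values, since no real data set has been introduced yet) that minimizing the residual sum of squares numerically reproduces the closed-form b0 and b1:

## Minimize g(b0, b1) = sum((Yi - (b0 + b1*xi))^2) directly with optim()
set.seed(42)
x <- runif(30, 0, 10)
y <- 2 + 0.5*x + rnorm(30, sd = 1)        # made-up "true" line plus noise
g <- function(b) sum((y - (b[1] + b[2]*x))^2)
optim(c(0, 0), g)$par                     # numerical minimizer of the RSS
lm(y ~ x)$coefficients                    # closed-form least squares estimates
## The two agree up to optim()'s numerical tolerance.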

Example: Cigarette data

Measurements of weight and tar, nicotine, and carbon monoxide content are given for 25 brands of domestic cigarettes.

VARIABLE DESCRIPTIONS:
Brand name
Tar content (mg)
Nicotine content (mg)
Weight (g)
Carbon monoxide content (mg)

Mendenhall, William, and Sincich, Terry (1992), Statistics for Engineering and the Sciences (3rd ed.), New York: Dellen Publishing

Do a scatterplot, then fit the best fitting line according to least squares estimation.

> cig.data=as.data.frame(read.delim("cig.txt",sep=" ", header=FALSE))
> dim(cig.data)
[1] 25 5

## This data set had no header, so I will assign
## the column names here:
> dimnames(cig.data)[[2]]=c("Brand","Tar","Nic","Weight","CO")
> head(cig.data)
          Brand  Tar  Nic Weight   CO
1        Alpine 14.1 0.86 0.9853 13.6
2 Benson-Hedges 16.0 1.06 1.0938 16.6
3    BullDurham 29.8 2.03 1.1650 23.5
4   CamelLights  8.0 0.67 0.9280 10.2
5       Carlton  4.1 0.40 0.9462  5.4
6  Chesterfield 15.0 1.04 0.8885 15.0

> plot(cig.data$Tar,cig.data$Nic)

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar]

## Fit a simple linear regression of Nicotine on Tar.
> lm.out=lm(Nic~Tar,data=cig.data)

## Get the estimated slope and intercept:
> lm.out$coefficients
(Intercept)         Tar
 0.13087532  0.06102854

You can do this manually too...

b1 = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^{n} (xi − x̄)²

R easily works with vectors and matrices.

> numerator=sum((cig.data$Tar-mean(cig.data$Tar))*(cig.data$Nic-mean(cig.data$Nic)))
> denominator=sum((cig.data$Tar-mean(cig.data$Tar))^2)
> b1=numerator/denominator
> b1
[1] 0.06102854

b0 = Ȳ − b1x̄

> b0=mean(cig.data$Nic)-mean(cig.data$Tar)*b1
> b0
[1] 0.1308753

The fitted line for this data:

Ŷi = 0.1309 + 0.0610xi
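With b0 and b1 in hand, the fitted values and residuals can also be computed by hand and checked against what lm() stores (a sketch, assuming cig.data, b0, b1, and lm.out from above are still in the workspace):

## Fitted values and residuals for the cigarette data, computed by hand
fitted.manual <- b0 + b1*cig.data$Tar
resid.manual  <- cig.data$Nic - fitted.manual
max(abs(fitted.manual - lm.out$fitted.values))  # essentially 0
max(abs(resid.manual - lm.out$residuals))       # essentially 0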

## Add the fitted line to the original plot:
> plot(cig.data$Tar,cig.data$Nic)
> abline(lm.out)

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar with the fitted least squares line]

Simple Linear Regression: Estimating σ²

• One of the assumptions of linear regression is that the variance for each of the conditional distributions of Y|x is the same at all x-values.

[Figure: scatterplot of Y vs. X]

• In this case, it makes sense to pool all the error information to come up with a common estimate for σ²

Recall the model:

Yi = β0 + β1xi + εi, with εi iid N(0, σ²)

• We use the sum of the squares of the residuals to estimate σ²

Acronyms:
RSS ≡ Residual sum of squares
SSE ≡ Sum of squared errors
RSS ≡ SSE

σ̂² = RSS/(n − 2) = Σ_{i=1}^{n} (Yi − Ŷi)² / (n − 2)

RSS = Σ_{i=1}^{n} (Yi − Ŷi)²

E[RSS/(n − 2)] = σ²

σ̂ = SE = √σ̂²; SE is called the standard error for the regression (a phrase used by this author)

– ‘2’ is subtracted from n in the denominator because we’ve used 2 degrees of freedom for estimating the slope and intercept (i.e. there were 2 parameters estimated in the mean structure).

– When we estimate σ² in a one-sample problem, we divide Σ_{i=1}^{n} (Yi − Ȳ)² by (n − 1) because we only estimate 1 parameter in the mean structure, namely µ.
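A sketch of this estimate for the cigarette regression, assuming lm.out and cig.data from earlier are still available (summary(lm.out)$sigma reports σ̂, the residual standard error):

## sigma^2-hat = RSS/(n - 2) for the cigarette fit
n   <- nrow(cig.data)                 # n = 25
RSS <- sum(lm.out$residuals^2)
sigma2.hat <- RSS/(n - 2)
sigma2.hat                            # estimate of sigma^2
sqrt(sigma2.hat)                      # sigma-hat, the "standard error of the regression"
summary(lm.out)$sigma                 # same value as reported by summary()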

Simple Linear Regression: Total sums of squares (TSS)

• Total sums of squares (TSS) quantifies the overall squared distance of the Y-values from the overall mean of the responses Ȳ

[Figure: scatterplot of y vs. x with a horizontal line at Ȳ = 10.91]

• TSS = Σ_{i=1}^{n} (Yi − Ȳ)²

• For regression, we can ‘decompose’ this distance and write:

Yi − Ȳ = (Yi − Ŷi) + (Ŷi − Ȳ)

where (Yi − Ŷi) is the distance from the observation to the fitted line, and (Ŷi − Ȳ) is the distance from the fitted line to the overall mean

• Which leads to the equation¹:

Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Yi − Ŷi)² + Σ_{i=1}^{n} (Ŷi − Ȳ)²

or

TSS = RSS + RegSS

where RegSS is the regression sum of squares

¹ (a + b)² ≠ a² + b². You must square both sides, then include the summation terms, and then the cross terms will cancel out due to properties of the fitted line.
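A quick numerical check of the footnote’s claim that the cross terms cancel, using the cigarette fit from earlier (a sketch, assuming lm.out and cig.data are still in the workspace):

## The cross term in the decomposition sums to (numerically) zero
Yhat <- lm.out$fitted.values
Ybar <- mean(cig.data$Nic)
sum((cig.data$Nic - Yhat)*(Yhat - Ybar))   # ~ 0, so TSS = RSS + RegSS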

• Total variability has been decomposed into “explained” and “unexplained” variability

• In general, when the proportion of total variability that is explained is high, we have a good-fitting model

• The R² value (coefficient of determination):

– the proportion of variation in the response that is explained by the model

– R² = RegSS / TSS

– R² = 1 − RSS / TSS

– also stated as r² in simple linear regression

– the square of the correlation coefficient ‘r’

– 0 ≤ R² ≤ 1

– R² near 1 suggests a good fit to the data

– if R² = 1, ALL points fall exactly on the line

– different disciplines have different views on what is a high R², in other words what is a good model

∗ social scientists may get excited about an R² near 0.30

∗ a researcher with a designed experiment may want to see an R² near 0.80

Simple Linear Regression: Analysis of Variance (ANOVA)

The decomposition of total variance into parts is part of ANOVA.

As was stated before:

TSS = RSS + RegSS

Example: cigarette data

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar]

Look at the ANOVA table:
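In R, the table for the cigarette fit can be requested with the anova() function (output not reproduced here):

> anova(lm.out)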

You can get these sums of squares manually too...
> sum((lm.out$fitted.values-mean(cig.data$Nic))^2)
[1] 2.869467
> sum(lm.out$residuals^2)
[1] 0.1391091
> sum((cig.data$Nic-mean(cig.data$Nic))^2)
[1] 3.008576

Get the R² value (2 ways shown):

> summary(lm.out)
look for....
Multiple R-Squared: 0.9538

> summary(lm.out)$r.squared
[1] 0.9537625

Example: Lifespan and Thorax of fruitflies

LONGEVITY: Lifespan, in days
THORAX: Length of thorax, in mm
n = 125

[Figure: scatterplot of data$Longevity vs. data$Thorax]

“Sexual Activity and the Lifespan of Male Fruitflies” by Linda Partridge and Marion Farquhar. Nature, 294, 580-581, 1981.

The data and the variables:

> ff.data=as.data.frame(read.delim("/fruitfly.txt",sep="\t",header=FALSE))
> dimnames(ff.data)[[2]]=c("ID","Partners","Type","Longevity","Thorax","Sleep")
> head(ff.data)
  ID Partners Type Longevity Thorax Sleep
1  1        8    0        35   0.64    22
2  2        8    0        37   0.68     9
3  3        8    0        49   0.68    49
4  4        8    0        46   0.72     1
5  5        8    0        63   0.72    23
6  6        8    0        39   0.76    83

See how many different Partner values there are:
> unique(ff.data$Partners)
[1] 8 0 1

Fit the simple linear regression model:
> lm.fruitflies=lm(ff.data$Longevity~ff.data$Thorax)
> summary(lm.fruitflies)

. . .

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      -61.05      13.00  -4.695 7.00e-06 ***
ff.data$Thorax   144.33      15.77   9.152 1.50e-15 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 13.6 on 123 degrees of freedom
Multiple R-Squared: 0.4051, Adjusted R-squared: 0.4003
F-statistic: 83.76 on 1 and 123 DF,  p-value: 1.497e-15

Slope interpretation: For every 1 mm increase in the thorax length of a fruitfly, the average lifespan increases by 144.3 days.
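Since the observed thorax lengths only span roughly 0.65 to 0.95 mm, a per-0.1 mm statement of the slope (about 14.4 days) may be more natural; a small sketch using the fitted coefficients:

## Change in fitted mean longevity per 0.1 mm of thorax length
b <- coef(lm.fruitflies)
b[1] + b[2]*0.80      # fitted mean longevity at thorax = 0.80 mm
b[1] + b[2]*0.90      # fitted mean longevity at thorax = 0.90 mm
b[2]*0.1              # difference: about 14.4 days per 0.1 mm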

Intercept interpretation: When the thorax length is 0 mm, the average lifespan is -61.05 days (?!).

This doesn’t make sense for this data set for 2 reasons... a fruitfly wouldn’t have a 0 mm thorax, and x=0 is far outside of the range of observed x-values.

> plot(ff.data$Thorax,ff.data$Longevity)
> abline(lm.fruitflies,lwd=2)

[Figure: scatterplot of ff.data$Longevity vs. ff.data$Thorax with the fitted line]

> anova(lm.fruitflies)
Analysis of Variance Table

Response: ff.data$Longevity
                Df Sum Sq Mean Sq F value    Pr(>F)
ff.data$Thorax   1  15497   15497  83.761 1.497e-15 ***
Residuals      123  22756     185
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Regression sum of squares  RegSS  15497
Residual sum of squares    RSS    22756
Total sum of squares       TSS    38253

R² = RegSS/TSS = 15497/38253 = 0.4051186

> summary(lm.fruitflies)$r.squared
[1] 0.4051113

R2 interpretation: 40.5% of the total variability in lifespan for fruitflies is explained by the length of the thorax.

Simple Linear Regression: Correlation coefficient

• The correlation coefficient r measures the strength of a linear relationship

r = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √[ Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² ]

  = [ √( Σ_{i=1}^{n} (Xi − X̄)² / (n − 1) ) / √( Σ_{i=1}^{n} (Yi − Ȳ)² / (n − 1) ) ] · b1

  = (SX / SY) · b1

– it is the standardized slope, a unitless measure

– can be thought of as the value we would get for the slope if the standard deviations of X and Y were equal (similar spreads)

– would be the slope if X and Y had been standardized before fitting the regression

– −1 ≤ r ≤ 1

– r near -1 or +1 shows a strong linear relationship

– a negative (positive) r is associated with an estimated negative (positive) slope

– the sample correlation coefficient r estimates the population correlation coefficient ρ

– r is NOT used to measure strength of a curved line
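A sketch verifying the “standardized slope” form r = (SX/SY)·b1 on the cigarette data (assuming cig.data and b1 from earlier are still in the workspace):

## r computed directly and via the standardized-slope identity
r.direct <- cor(cig.data$Tar, cig.data$Nic)
r.from.slope <- (sd(cig.data$Tar)/sd(cig.data$Nic)) * b1
c(r.direct, r.from.slope)     # both are about 0.9766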

Common mistake

People often think that as the estimated slope of the regression line, β̂1, gets larger (steeper), so does r. But r really measures how close all the data points are to our estimated regression line.

You could have a steep fitted line with a small r (noisy relationship), or a fairly flat fitted line with large r (less noisy relationship).

This can be confusing because when the estimated slope is actually 0, then r is 0 no matter how close the points are to the regression line (see formulas for r on the previous pages).
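A small simulated illustration of this point (made-up numbers, just a sketch): a steep but noisy relationship can have a smaller r than a shallow but tight one.

## Steep slope with lots of noise vs. shallow slope with little noise
set.seed(7)
x <- 1:50
y.steep <- 10*x  + rnorm(50, sd = 200)   # slope 10, very noisy
y.flat  <- 0.1*x + rnorm(50, sd = 0.5)   # slope 0.1, tight around the line
cor(x, y.steep)   # smaller r despite the steeper fitted slope
cor(x, y.flat)    # larger r despite the nearly flat fitted slope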

Example: Cigarette data

> cor(cig.data$Tar,cig.data$Nic)
[1] 0.9766076

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar]

Standardize the dependent and independent variables and fit the simple linear regression model to the standardized variables...

Standardizing the variables:
> std.Y=(cig.data$Nic-mean(cig.data$Nic))/sqrt(var(cig.data$Nic))
> std.X=(cig.data$Tar-mean(cig.data$Tar))/sqrt(var(cig.data$Tar))

Fitting the model to the standardized variables:
> (lm(std.Y~std.X))$coefficients
  (Intercept)         std.X
 5.420460e-17  9.766076e-01

The slope in the standardized regression is 0.9766, which is the correlation between the original two variables (as we saw on the previous slide).

[Figure: Standardized Y vs. Standardized X]

A very strong curved relationship can have an r value near 0.
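The x and y used below are not constructed earlier in these notes; one hypothetical way to generate data with this kind of strong curved relationship (a sketch assuming a quadratic mean with noise) is:

## Hypothetical construction of a strong but curved (quadratic) relationship
set.seed(3)
x <- seq(-6, 6, length.out = 80)
y <- 4*x^2 + rnorm(80, sd = 10)   # strong relationship, but not a linear one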

> plot(x,y)

[Figure: scatterplot of y vs. x showing a strong curved relationship]

> abline(lm(y~x))
> cor(x,y)
[1] -0.2789433

The correlation coefficient measures the strength of a linear relationship.
