Linear Regression: A Simple Approach to Supervised Learning

Total Pages: 16

File Type: PDF, Size: 1020 KB

Linear regression

• Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, ..., Xp is linear.
• True regression functions are never linear! The linear model η(x) = β0 + β1x1 + β2x2 + ... + βpxp is almost always thought of as an approximation to the truth; functions in nature are rarely linear.

[Figure: "Linearity assumption?" A true regression function f(X) plotted against X, with a linear approximation.]

• Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

Linear regression for the advertising data

Consider the advertising data shown on the next slide. Questions we might ask:
• Is there a relationship between advertising budget and sales?
• How strong is the relationship between advertising budget and sales?
• Which media contribute to sales?
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?

Advertising data

[Figure: scatterplots of Sales against TV, Radio, and Newspaper advertising budgets.]

Simple linear regression using a single predictor X

• We assume a model
  Y = β0 + β1X + ε,
  where β0 and β1 are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and ε is the error term.
• Given some estimates β̂0 and β̂1 for the model coefficients, we predict future sales using
  ŷ = β̂0 + β̂1x,
  where ŷ indicates a prediction of Y on the basis of X = x. The hat symbol denotes an estimated value.

Estimation of the parameters by least squares

• Let ŷi = β̂0 + β̂1xi be the prediction for Y based on the ith value of X. Then ei = yi − ŷi represents the ith residual.
• We define the residual sum of squares (RSS) as
  RSS = e1² + e2² + ··· + en²,
  or equivalently as
  RSS = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + ... + (yn − β̂0 − β̂1xn)².
• The least squares approach chooses β̂0 and β̂1 to minimize the RSS. The minimizing values can be shown to be
  β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,
  β̂0 = ȳ − β̂1 x̄,
  where ȳ = (1/n) Σ yi and x̄ = (1/n) Σ xi are the sample means.

Example: advertising data

[Figure 3.1: for the Advertising data, the least squares fit for the regression of sales onto TV. The fit is found by minimizing the sum of squared errors; each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.]

For this fit, the least squares estimates are β̂0 = 7.03 and β̂1 = 0.0475.
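Not part of the original slides: a minimal NumPy sketch of these closed-form least squares estimates. The small TV-budget/sales-style arrays are invented purely for illustration.

```python
import numpy as np

# Invented TV-budget/sales-style data (x = budget, y = sales).
x = np.array([10.0, 45.0, 70.0, 120.0, 180.0, 230.0, 290.0])
y = np.array([7.2, 9.0, 10.1, 12.5, 15.3, 17.8, 20.9])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates from the slide:
#   beta1_hat = sum_i (x_i - x_bar)(y_i - y_bar) / sum_i (x_i - x_bar)^2
#   beta0_hat = y_bar - beta1_hat * x_bar
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x     # fitted values
residuals = y - y_hat                 # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)          # residual sum of squares

print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}, RSS = {rss:.4f}")
```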
Assessing the Accuracy of the Coefficient Estimates

• The standard error of an estimator reflects how it varies under repeated sampling. We have
  SE(β̂1)² = σ² / Σ (xi − x̄)²,
  SE(β̂0)² = σ² [ 1/n + x̄² / Σ (xi − x̄)² ],
  where σ² = Var(ε).
• These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form
  β̂1 ± 2 · SE(β̂1).

Confidence intervals (continued)

That is, there is approximately a 95% chance that the interval
  [ β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1) ]
will contain the true value of β1 (under a scenario where we got repeated samples like the present sample). For the advertising data, the 95% confidence interval for β1 is [0.042, 0.053].

Hypothesis testing

• Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis
  H0: There is no relationship between X and Y
  versus the alternative hypothesis
  HA: There is some relationship between X and Y.
• Mathematically, this corresponds to testing
  H0: β1 = 0 versus HA: β1 ≠ 0,
  since if β1 = 0 then the model reduces to Y = β0 + ε, and X is not associated with Y.

Hypothesis testing (continued)

• To test the null hypothesis, we compute a t-statistic, given by
  t = (β̂1 − 0) / SE(β̂1).
• This will have a t-distribution with n − 2 degrees of freedom, assuming β1 = 0.
• Using statistical software, it is easy to compute the probability of observing any value equal to |t| or larger. We call this probability the p-value.

Results for the advertising data

              Coefficient   Std. Error   t-statistic   p-value
  Intercept   7.0325        0.4578       15.36         < 0.0001
  TV          0.0475        0.0027       17.67         < 0.0001

Assessing the Overall Accuracy of the Model

• We compute the Residual Standard Error
  RSE = sqrt( RSS / (n − 2) ) = sqrt( (1/(n − 2)) Σ (yi − ŷi)² ),
  where the residual sum of squares is RSS = Σ (yi − ŷi)².
• R-squared, or fraction of variance explained, is
  R² = (TSS − RSS) / TSS = 1 − RSS / TSS,
  where TSS = Σ (yi − ȳ)² is the total sum of squares.
• It can be shown that in this simple linear regression setting R² = r², where r is the correlation between X and Y:
  r = Σ (xi − x̄)(yi − ȳ) / [ sqrt( Σ (xi − x̄)² ) · sqrt( Σ (yi − ȳ)² ) ].

Advertising data results

  Quantity                   Value
  Residual Standard Error    3.26
  R²                         0.612
  F-statistic                312.1
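Again not part of the slides: a hedged sketch of how the standard error, confidence interval, t-statistic, RSE and R² formulas above can be computed directly, reusing the same invented data as the earlier sketch (SciPy is assumed only for the t-distribution p-value).

```python
import numpy as np
from scipy import stats

# Same invented data as in the previous sketch.
x = np.array([10.0, 45.0, 70.0, 120.0, 180.0, 230.0, 290.0])
y = np.array([7.2, 9.0, 10.1, 12.5, 15.3, 17.8, 20.9])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)

sigma2_hat = rss / (n - 2)                      # estimate of sigma^2 = Var(eps)
se_beta1 = np.sqrt(sigma2_hat / sxx)            # SE(beta1_hat)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))

# Approximate 95% confidence interval: beta1_hat +/- 2 * SE(beta1_hat)
ci_low, ci_high = beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1

# t-statistic and two-sided p-value for H0: beta1 = 0 (t distribution, n - 2 df)
t_stat = (beta1_hat - 0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

rse = np.sqrt(rss / (n - 2))                    # residual standard error
tss = np.sum((y - y_bar) ** 2)
r_squared = 1 - rss / tss                       # fraction of variance explained

print(f"SE(beta0)={se_beta0:.4f}  SE(beta1)={se_beta1:.4f}")
print(f"95% CI for beta1: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"t={t_stat:.2f}  p={p_value:.3g}  RSE={rse:.3f}  R^2={r_squared:.3f}")
```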
Multiple Linear Regression

• Here our model is
  Y = β0 + β1X1 + β2X2 + ··· + βpXp + ε.
• We interpret βj as the average effect on Y of a one unit increase in Xj, holding all other predictors fixed.
• In the advertising example, the model becomes
  sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε.

Interpreting regression coefficients

• The ideal scenario is when the predictors are uncorrelated (a balanced design):
  - Each coefficient can be estimated and tested separately.
  - Interpretations such as "a unit change in Xj is associated with a βj change in Y, while all the other variables stay fixed" are possible.
• Correlations amongst predictors cause problems:
  - The variance of all coefficients tends to increase, sometimes dramatically.
  - Interpretations become hazardous: when Xj changes, everything else changes.
• Claims of causality should be avoided for observational data.

The woes of (interpreting) regression coefficients

"Data Analysis and Regression", Mosteller and Tukey 1977
• A regression coefficient βj estimates the expected change in Y per unit change in Xj, with all other predictors held fixed. But predictors usually change together!
• Example: Y = total amount of change in your pocket; X1 = # of coins; X2 = # of pennies, nickels and dimes. By itself, the regression coefficient of Y on X2 will be > 0. But how about with X1 in the model?
• Y = number of tackles by a football player in a season; W and H are his weight and height. The fitted regression model is Ŷ = b0 + 0.50W − 0.10H. How do we interpret β̂2 < 0?

Two quotes by famous Statisticians

"Essentially, all models are wrong, but some are useful." (George Box)

"The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively." (Fred Mosteller and John Tukey, paraphrasing George Box)

Estimation and Prediction for Multiple Regression

• Given estimates β̂0, β̂1, ..., β̂p, we can make predictions using the formula
  ŷ = β̂0 + β̂1x1 + β̂2x2 + ··· + β̂pxp.
• We estimate β0, β1, ..., βp as the values that minimize the sum of squared residuals
  RSS = Σ (yi − ŷi)² = Σ (yi − β̂0 − β̂1xi1 − β̂2xi2 − ··· − β̂pxip)².
• This is done using standard statistical software.
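As a sketch of what the "standard statistical software" step does, here is a minimal NumPy example that minimizes the RSS for a three-predictor model; the advertising-style numbers are invented for illustration.

```python
import numpy as np

# Invented advertising-style data: columns are TV, radio, newspaper budgets.
X = np.array([
    [210.0, 35.0, 60.0],
    [ 50.0, 40.0, 45.0],
    [ 20.0, 46.0, 70.0],
    [150.0, 41.0, 58.0],
    [180.0, 11.0, 58.0],
    [ 10.0, 49.0, 75.0],
    [ 95.0, 25.0, 20.0],
    [120.0, 18.0, 35.0],
])
y = np.array([20.5, 11.0, 9.5, 18.0, 13.0, 7.5, 12.0, 13.5])

# Add a column of ones so that beta_hat[0] plays the role of the intercept beta_0.
X1 = np.column_stack([np.ones(len(y)), X])

# Least squares: choose beta to minimize RSS = sum_i (y_i - x_i' beta)^2.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat                 # fitted values
rss = np.sum((y - y_hat) ** 2)        # residual sum of squares at the minimum

print("beta_hat (intercept, TV, radio, newspaper):", np.round(beta_hat, 4))
print("RSS:", round(rss, 4))
```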
Recommended publications
  • A Generalized Linear Model for Principal Component Analysis of Binary Data
A Generalized Linear Model for Principal Component Analysis of Binary Data
Andrew I. Schein, Lawrence K. Saul, Lyle H. Ungar
Department of Computer and Information Science, University of Pennsylvania, Moore School Building, 200 South 33rd Street, Philadelphia, PA 19104-6389
{ais,lsaul,ungar}@cis.upenn.edu

Abstract
We investigate a generalized linear model for dimensionality reduction of binary data. The model is related to principal component analysis (PCA) in the same way that logistic regression is related to linear regression. Thus we refer to the model as logistic PCA. In this paper, we derive an alternating least squares method to estimate the basis vectors and generalized linear coefficients of the logistic PCA model. The resulting updates have a simple closed form and are guaranteed at each iteration to improve the model's likelihood. We evaluate the performance of logistic PCA as ...

... they are not generally appropriate for other data types. Recently, Collins et al. [5] derived generalized criteria for dimensionality reduction by appealing to properties of distributions in the exponential family. In their framework, the conventional PCA of real-valued data emerges naturally from assuming a Gaussian distribution over a set of observations, while generalized versions of PCA for binary and nonnegative data emerge respectively by substituting the Bernoulli and Poisson distributions for the Gaussian. For binary data, the generalized model's relationship to PCA is analogous to the relationship between logistic and linear regression [12]. In particular, the model exploits the log-odds as the natural parameter of the Bernoulli distribution and the logistic function as its canonical link.
  • Ordinary Least Squares
Ordinary least squares

In statistics, ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model. This method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation. The resulting estimator can be expressed by a simple formula, especially in the case of a single regressor on the right-hand side.

The OLS estimator is consistent when the regressors are exogenous and there is no multicollinearity, and optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors be normally distributed, OLS is the maximum likelihood estimator. OLS is used in economics (econometrics) and electrical engineering (control theory and signal processing), among many areas of application.

[Figure: Okun's law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.]

Linear model

Suppose the data consists of n observations {yᵢ, xᵢ}. Each observation includes a scalar response yᵢ and a vector of predictors (or regressors) xᵢ. In a linear regression model the response variable is a linear function of the regressors:

  yᵢ = xᵢ′β + εᵢ,

where β is a p×1 vector of unknown parameters; the εᵢ are unobserved scalar random variables (errors) which account for the discrepancy between the actually observed responses yᵢ and the "predicted outcomes" xᵢ′β; and ′ denotes matrix transpose, so that x′β is the dot product between the vectors x and β.
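Not part of the excerpt above: a small NumPy sketch of the closed-form OLS estimator b = (X′X)⁻¹X′y and its estimated covariance matrix σ̂²(X′X)⁻¹, on invented data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: n observations, one regressor plus an intercept column.
n = 50
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])          # model matrix with intercept
beta_true = np.array([1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# OLS estimator b = (X'X)^{-1} X'y (solve is preferred over an explicit inverse).
XtX = X.T @ X
b_ols = np.linalg.solve(XtX, X.T @ y)

# Unbiased estimate of the error variance and of Var(b) = sigma^2 (X'X)^{-1}.
resid = y - X @ b_ols
sigma2_hat = resid @ resid / (n - X.shape[1])
cov_b = sigma2_hat * np.linalg.inv(XtX)

print("b_ols:", np.round(b_ols, 3))
print("standard errors:", np.round(np.sqrt(np.diag(cov_b)), 3))
```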
  • Linear Regression
DATA MINING 2: Linear and Logistic Regression
Riccardo Guidotti, a.a. 2019/2020

Regression
• Given a dataset containing N observations (Xᵢ, Yᵢ), i = 1, 2, …, N, regression is the task of learning a target function f that maps each input attribute set X into an output Y.
• The goal is to find the target function that can fit the input data with minimum error.
• The error function can be expressed as
  Absolute Error = Σᵢ |yᵢ − f(xᵢ)|
  Squared Error = Σᵢ (yᵢ − f(xᵢ))²   (the differences yᵢ − f(xᵢ) are the residuals)

Linear Regression
• Linear regression is a linear approach to modeling the relationship between a dependent variable Y and one or more independent (explanatory) variables X.
• The case of one explanatory variable is called simple linear regression.
• For more than one explanatory variable, the process is called multiple linear regression.
• For multiple correlated dependent variables, the process is called multivariate linear regression.

What does it mean to predict Y?
• Look at X = 5. There are many different Y values at X = 5. When we say predict Y at X = 5, we are really asking: what is the expected value (average) of Y at X = 5?
• Formally, the regression function is given by E(Y|X = x). This is the expected value of Y at X = x.
• The ideal or optimal predictor of Y based on X is thus f(X) = E(Y | X = x).

Simple Linear Regression
• Linear model (dependent variable Y, independent variable X): Y = mX + b, i.e. y = β₁x + β₀, with slope β₁ and intercept (bias) β₀.
• In general, such a relationship may not hold exactly for the largely unobserved population. We call the unobserved deviations from Y the errors.
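Not part of the excerpt: a short sketch, on invented data, of the absolute-error and squared-error criteria defined above for a candidate linear target function.

```python
import numpy as np

# Invented observations (X_i, Y_i) and a candidate linear target function f(x) = m*x + b.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def f(x, m=2.0, b=0.0):
    """Candidate linear model Y = mX + b."""
    return m * x + b

residuals = Y - f(X)                            # Y_i - f(X_i)
absolute_error = np.sum(np.abs(residuals))      # sum_i |Y_i - f(X_i)|
squared_error = np.sum(residuals ** 2)          # sum_i (Y_i - f(X_i))^2

print(f"absolute error = {absolute_error:.3f}, squared error = {squared_error:.3f}")
```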
  • Linear, Ridge Regression, and Principal Component Analysis
Linear, Ridge Regression, and Principal Component Analysis
Jia Li, Department of Statistics, The Pennsylvania State University
Email: [email protected], http://www.stat.psu.edu/∼jiali

Introduction to Regression
• Input vector: X = (X1, X2, ..., Xp).
• Output Y is real-valued.
• Predict Y from X by f(X) so that the expected loss function E(L(Y, f(X))) is minimized.
• Square loss: L(Y, f(X)) = (Y − f(X))².
• The optimal predictor is f*(X) = argmin_f E(Y − f(X))² = E(Y | X). The function E(Y | X) is the regression function.

Example
The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y, is expected to be related to total population (X1, measured in thousands), land area (X2, measured in square miles), and total personal income (X3, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.

  i:    1      2      3      ...   139    140    141
  X1:   9387   7031   7017   ...   233    232    231
  X2:   1348   4069   3719   ...   1011   813    654
  X3:   72100  52737  54542  ...   1337   1589   1148
  Y:    25627  15389  13326  ...   264    371    140

Goal: Predict Y from X1, X2, and X3.

Linear Methods
• The linear regression model: f(X) = β0 + Σⱼ₌₁ᵖ Xⱼ βⱼ.
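Not part of the excerpt: a minimal sketch of the standard ridge regression estimator β̂ = (XᵀX + λI)⁻¹Xᵀy (not shown in the preview itself), compared with ordinary least squares on invented data; the SMSA values above are not used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: predict Y from three predictors, in the spirit of the example above.
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

# Center y and X so the intercept can be handled separately (a common convention).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 1.0  # ridge penalty; lam = 0 recovers ordinary least squares

# Ridge estimate: beta = (X'X + lam * I)^{-1} X'y
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
```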
  • Simple Linear Regression with Least Square Estimation: an Overview
Aditya N More et al. / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 7 (6), 2016, 2394-2396

Simple Linear Regression with Least Square Estimation: An Overview
Aditya N More#1, Puneet S Kohli*2, Kshitija H Kulkarni#3
#1-2 Information Technology Department, #3 Electronics and Communication Department
College of Engineering Pune, Shivajinagar, Pune 411005, Maharashtra, India

Abstract: Linear Regression involves modelling a relationship amongst dependent and independent variables in the form of a linear equation. Least Square Estimation is a method to determine the constants in a Linear model in the most accurate way without much complexity of solving. Metrics such as Coefficient of Determination and Mean Square Error determine how good the estimation is. Statistical Packages such as R and Microsoft Excel have built in tools to perform Least Square Estimation over a given data set.

Keywords: Linear Regression, Machine Learning, Least Squares Estimation, R programming

I. INTRODUCTION
Linear Regression involves establishing linear relationships between dependent and independent variables.

... (2.1) where yᵢ is the ith value of the sample data point and ŷᵢ is the ith value of y on the predicted regression line. The above equation can be geometrically depicted by figure 2.1. If we draw a square at each point whose length is equal to the absolute difference between the sample data point and the predicted value as shown, each of the squares would then represent the residual error in placing the regression line. The aim of the least square method would be to place the regression line so as to minimize the sum of the squares.
  • Linear Regression
eesc BC 3017 statistics notes

1 LINEAR REGRESSION

Systematic variation in the true value

Up to now, we have been thinking about measurement as sampling of values from an ensemble of all possible outcomes in order to estimate the true value (which would, according to our previous discussion, be well approximated by the mean of a very large sample). Given a sample of outcomes, we have sometimes checked the hypothesis that it is a random sample from some ensemble of outcomes, by plotting the data points against some other variable, such as ordinal position. Under the hypothesis of random selection, no clear trend should appear. However, the contrary case, where one finds a clear trend, is very important. A clear trend can be a discovery, rather than a nuisance! Whether it is a discovery or a nuisance (or both) depends on what one finds out about the reasons underlying the trend. In either case one must be prepared to deal with trends in analyzing data.

Figure 2.1 (a) shows a plot of (hypothetical) data in which there is a very clear trend. The y axis scales concentration of coliform bacteria sampled from rivers in various regions (units are colonies per liter). The x axis is a hypothetical index of regional urbanization, ranging from 1 to 10. The hypothetical data consist of 6 different measurements at each level of urbanization. The mean of each set of 6 measurements gives a rough estimate of the true value for coliform bacteria concentration for rivers in a region with that urbanization level. The jagged dark line drawn on the graph connects these estimates of true value and makes the trend quite clear: more extensive urbanization is associated with higher true values of bacteria concentration.
  • Principal Component Analysis (PCA) As a Statistical Tool for Identifying Key Indicators of Nuclear Power Plant Cable Insulation Degradation
Iowa State University Capstones, Theses and Dissertations: Graduate Theses and Dissertations, 2017

Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation
Chamila Chandima De Silva, Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd
Part of the Materials Science and Engineering Commons, Mechanics of Materials Commons, and the Statistics and Probability Commons

Recommended Citation
De Silva, Chamila Chandima, "Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation" (2017). Graduate Theses and Dissertations. 16120. https://lib.dr.iastate.edu/etd/16120

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation

by Chamila C. De Silva

A thesis submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE. Major: Materials Science and Engineering. Program of Study Committee: Nicola Bowler, Major Professor; Richard LeSar; Steve Martin.

The student author and the program of study committee are solely responsible for the content of this thesis. The Graduate College will ensure this thesis is globally accessible and will not permit alterations after a degree is conferred.
  • Generalized Linear Models
CHAPTER 6. Generalized linear models

6.1 Introduction
Generalized linear modeling is a framework for statistical analysis that includes linear and logistic regression as special cases. Linear regression directly predicts continuous data y from a linear predictor Xβ = β0 + X1β1 + ··· + Xkβk. Logistic regression predicts Pr(y = 1) for binary data from a linear predictor with an inverse-logit transformation. A generalized linear model involves:
1. A data vector y = (y1, ..., yn)
2. Predictors X and coefficients β, forming a linear predictor Xβ
3. A link function g, yielding a vector of transformed data ŷ = g⁻¹(Xβ) that are used to model the data
4. A data distribution, p(y | ŷ)
5. Possibly other parameters, such as variances, overdispersions, and cutpoints, involved in the predictors, link function, and data distribution.

The options in a generalized linear model are the transformation g and the data distribution p.
• In linear regression, the transformation is the identity (that is, g(u) ≡ u) and the data distribution is normal, with standard deviation σ estimated from data.
• In logistic regression, the transformation is the inverse-logit, g⁻¹(u) = logit⁻¹(u) (see Figure 5.2a on page 80), and the data distribution is defined by the probability for binary data: Pr(y = 1) = ŷ.

This chapter discusses several other classes of generalized linear model, which we list here for convenience:
• The Poisson model (Section 6.2) is used for count data; that is, where each data point yi can equal 0, 1, 2, .... The usual transformation g used here is the logarithmic, so that g(u) = exp(u) transforms a continuous linear predictor Xiβ to a positive ŷi. The data distribution is Poisson. It is usually a good idea to add a parameter to this model to capture overdispersion, that is, variation in the data beyond what would be predicted from the Poisson distribution alone.
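Not part of the excerpt: a hedged sketch of fitting the Poisson model with a logarithmic link described above, using Newton/IRLS iterations on invented count data (no overdispersion parameter).

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented count data generated from a Poisson model with a log link.
n = 200
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 1.2])
y = rng.poisson(np.exp(X @ beta_true))

# Newton / IRLS iterations for the Poisson log-likelihood.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)                      # inverse link: mu = exp(X beta)
    grad = X.T @ (y - mu)                      # score vector
    hess = X.T @ (X * mu[:, None])             # Fisher information X' diag(mu) X
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("estimated beta:", np.round(beta, 3), " true beta:", beta_true)
```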
  • Generalized Linear Models
Generalized Linear Models
Advanced Methods for Data Analysis (36-402/36-608), Spring 2014

1 Generalized linear models

1.1 Introduction: two regressions
• So far we've seen two canonical settings for regression. Let X ∈ Rᵖ be a vector of predictors. In linear regression, we observe Y ∈ R, and assume a linear model
  E(Y|X) = βᵀX,
  for some coefficients β ∈ Rᵖ. In logistic regression, we observe Y ∈ {0, 1}, and we assume a logistic model
  log[ P(Y = 1|X) / (1 − P(Y = 1|X)) ] = βᵀX.
• What's the similarity here? Note that in the logistic regression setting, P(Y = 1|X) = E(Y|X). Therefore, in both settings, we are assuming that a transformation of the conditional expectation E(Y|X) is a linear function of X, i.e.,
  g( E(Y|X) ) = βᵀX,
  for some function g. In linear regression, this transformation was the identity transformation g(u) = u; in logistic regression, it was the logit transformation g(u) = log(u/(1 − u)).
• Different transformations might be appropriate for different types of data. E.g., the identity transformation g(u) = u is not really appropriate for logistic regression (why?), and the logit transformation g(u) = log(u/(1 − u)) is not appropriate for linear regression (why?), but each is appropriate in its own intended domain.
• For a third data type, it is entirely possible that neither transformation is really appropriate. What to do then? We think of another transformation g that is in fact appropriate, and this is the basic idea behind a generalized linear model.

1.2 Generalized linear models
• Given predictors X ∈ Rᵖ and an outcome Y, a generalized linear model is defined by three components: a random component, that specifies a distribution for Y|X; a systematic component, that relates a parameter η to the predictors X; and a link function, that connects the random and systematic components.
• The random component specifies a distribution for the outcome variable (conditional on X).
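Not part of the excerpt: a minimal sketch of fitting the logistic case of g(E(Y|X)) = βᵀX, with the logit link, via Newton's method (equivalently IRLS) on invented binary data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented binary data from a logistic model: logit(P(Y=1|X)) = beta^T X.
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 2.0])
p = 1 / (1 + np.exp(-(X @ beta_true)))         # inverse logit of the linear predictor
y = rng.binomial(1, p)

# Newton's method (IRLS) for the logistic log-likelihood.
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))         # P(Y = 1 | X) under the current beta
    W = mu * (1 - mu)                          # variance weights
    grad = X.T @ (y - mu)
    hess = X.T @ (X * W[:, None])
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("estimated beta:", np.round(beta, 3), " true beta:", beta_true)
```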
  • Time-Series Regression and Generalized Least Squares in R*
Time-Series Regression and Generalized Least Squares in R*
An Appendix to An R Companion to Applied Regression, third edition
John Fox & Sanford Weisberg
last revision: 2018-09-26

Abstract
Generalized least-squares (GLS) regression extends ordinary least-squares (OLS) estimation of the normal linear model by providing for possibly unequal error variances and for correlations between different errors. A common application of GLS estimation is to time-series regression, in which it is generally implausible to assume that errors are independent. This appendix to Fox and Weisberg (2019) briefly reviews GLS estimation and demonstrates its application to time-series data using the gls() function in the nlme package, which is part of the standard R distribution.

1 Generalized Least Squares
In the standard linear model (for example, in Chapter 4 of the R Companion),
  E(y|X) = Xβ
or, equivalently,
  y = Xβ + ε
where y is the n×1 response vector; X is an n×(k + 1) model matrix, typically with an initial column of 1s for the regression constant; β is a (k + 1)×1 vector of regression coefficients to estimate; and ε is an n×1 vector of errors. Assuming that ε ∼ Nₙ(0, σ²Iₙ), or at least that the errors are uncorrelated and equally variable, leads to the familiar ordinary-least-squares (OLS) estimator of β,
  b_OLS = (X′X)⁻¹X′y
with covariance matrix
  Var(b_OLS) = σ²(X′X)⁻¹.
More generally, we can assume that ε ∼ Nₙ(0, Σ), where the error covariance matrix Σ is symmetric and positive-definite. Different diagonal entries in Σ correspond to error variances that are not necessarily all equal, while nonzero off-diagonal entries correspond to correlated errors.
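Not part of the appendix: a small NumPy sketch of the GLS estimator b = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y for a time-series regression with AR(1) errors. Here the autocorrelation ρ is treated as known for illustration, whereas gls() in nlme estimates it from the data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented time-series regression with AR(1) disturbances.
n = 120
t = np.arange(n)
X = np.column_stack([np.ones(n), t / n])
beta_true = np.array([1.0, 2.0])

rho = 0.7
eps = np.zeros(n)
for i in range(1, n):                          # AR(1) errors: eps_i = rho*eps_{i-1} + noise
    eps[i] = rho * eps[i - 1] + rng.normal(scale=0.5)
y = X @ beta_true + eps

# Error covariance for an AR(1) process is proportional to rho^|i-j| (scale cancels in GLS).
Sigma = rho ** np.abs(np.subtract.outer(t, t))
Sigma_inv = np.linalg.inv(Sigma)

# GLS estimator: b = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
b_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("OLS:", np.round(b_ols, 3), " GLS:", np.round(b_gls, 3))
```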
  • Simple Linear Regression
Regression Analysis: Basic Concepts
Allin Cottrell

The simple linear model
Represents the dependent variable, yᵢ, as a linear function of one independent variable, xᵢ, subject to a random "disturbance" or "error", uᵢ:
  yᵢ = β0 + β1xᵢ + uᵢ.
The error term uᵢ is assumed to have a mean value of zero, a constant variance, and to be uncorrelated with its own past values (i.e., it is "white noise"). The task of estimation is to determine regression coefficients β̂0 and β̂1, estimates of the unknown parameters β0 and β1 respectively. The estimated equation will have the form
  ŷᵢ = β̂0 + β̂1xᵢ.

OLS
The basic technique for determining the coefficients β̂0 and β̂1 is Ordinary Least Squares (OLS): values for the coefficients are chosen to minimize the sum of the squared estimated errors or residual sum of squares (SSR). The estimated error associated with each pair of data-values (xᵢ, yᵢ) is defined as
  ûᵢ = yᵢ − ŷᵢ = yᵢ − β̂0 − β̂1xᵢ.
We use a different symbol for this estimated error (ûᵢ) as opposed to the "true" disturbance or error term (uᵢ). These two coincide only if β̂0 and β̂1 happen to be exact estimates of the regression parameters α and β. The estimated errors are also known as residuals. The SSR may be written as
  SSR = Σ ûᵢ² = Σ (yᵢ − ŷᵢ)² = Σ (yᵢ − β̂0 − β̂1xᵢ)².

Picturing the residuals
The residual, ûᵢ, is the vertical distance between the actual value of the dependent variable, yᵢ, and the fitted value, ŷᵢ = β̂0 + β̂1xᵢ.
  • Simple Linear Regression: Straight Line Regression Between an Outcome Variable (Y) and a Single Explanatory or Predictor Variable (X)
1 Introduction to Regression

"Regression" is a generic term for statistical methods that attempt to fit a model to data, in order to quantify the relationship between the dependent (outcome) variable and the predictor (independent) variable(s). Assuming it fits the data reasonably well, the estimated model may then be used either to merely describe the relationship between the two groups of variables (explanatory), or to predict new values (prediction).

There are many types of regression models; here are a few most common to epidemiology:

Simple Linear Regression: straight line regression between an outcome variable (Y) and a single explanatory or predictor variable (X).
  E(Y) = α + β × X

Multiple Linear Regression: same as Simple Linear Regression, but now with possibly multiple explanatory or predictor variables.
  E(Y) = α + β1 × X1 + β2 × X2 + β3 × X3 + ...
A special case is polynomial regression.
  E(Y) = α + β1 × X + β2 × X² + β3 × X³ + ...

Generalized Linear Model: same as Multiple Linear Regression, but with a possibly transformed Y variable. This introduces considerable flexibility, as non-linear and non-normal situations can be easily handled.
  G(E(Y)) = α + β1 × X1 + β2 × X2 + β3 × X3 + ...
In general, the transformation function G(Y) can take any form, but a few forms are especially common:
• Taking G(Y) = logit(Y) describes a logistic regression model:
  log( E(Y) / (1 − E(Y)) ) = α + β1 × X1 + β2 × X2 + β3 × X3 + ...
• Taking G(Y) = log(Y) is also very common, leading to Poisson regression for count data, and other so called "log-linear" models.
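Not part of the note: a brief sketch showing that the polynomial regression special case above is just multiple linear regression on powers of X, fit by least squares on invented data.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented data with a curved relationship:
#   E(Y) = alpha + beta1*X + beta2*X^2 + beta3*X^3
x = rng.uniform(-2, 2, size=80)
y = 1.0 + 0.5 * x - 1.5 * x**2 + 0.3 * x**3 + rng.normal(scale=0.4, size=80)

# Polynomial regression as multiple linear regression on powers of X.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("alpha, beta1, beta2, beta3 =", np.round(coef, 3))
```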