Linear Regression: A Simple Approach to Supervised Learning

Total Pages: 16

File Type: PDF, Size: 1020 KB

Linear regression

• Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, ..., Xp is linear.
• True regression functions are never linear! The linear model η(x) = β0 + β1x1 + β2x2 + ... + βpxp is almost always thought of as an approximation to the truth; functions in nature are rarely linear.

[Figure: "Linearity assumption?" A true regression function f(X) plotted against X, with a linear approximation.]

• Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

Linear regression for the advertising data

Consider the advertising data shown on the next slide. Questions we might ask:
• Is there a relationship between advertising budget and sales?
• How strong is the relationship between advertising budget and sales?
• Which media contribute to sales?
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?

Advertising data

[Figure: scatterplots of Sales against TV, Radio, and Newspaper advertising budgets.]

Simple linear regression using a single predictor X

• We assume a model
  Y = β0 + β1X + ε,
  where β0 and β1 are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and ε is the error term.
• Given some estimates β̂0 and β̂1 for the model coefficients, we predict future sales using
  ŷ = β̂0 + β̂1x,
  where ŷ indicates a prediction of Y on the basis of X = x. The hat symbol denotes an estimated value.

Estimation of the parameters by least squares

• Let ŷi = β̂0 + β̂1xi be the prediction for Y based on the ith value of X. Then ei = yi − ŷi represents the ith residual.
• We define the residual sum of squares (RSS) as
  RSS = e1² + e2² + ··· + en²,
  or equivalently as
  RSS = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + ... + (yn − β̂0 − β̂1xn)².
• The least squares approach chooses β̂0 and β̂1 to minimize the RSS. The minimizing values can be shown to be
  β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,
  β̂0 = ȳ − β̂1 x̄,
  where ȳ = (1/n) Σ yi and x̄ = (1/n) Σ xi are the sample means.

Example: advertising data

[Figure 3.1: for the Advertising data, the least squares fit for the regression of sales onto TV. The fit is found by minimizing the sum of squared errors; each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.]

For this fit, the least squares estimates are β̂0 = 7.03 and β̂1 = 0.0475.
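Not part of the original slides: a minimal NumPy sketch of these closed-form least squares estimates. The small TV-budget/sales-style arrays are invented purely for illustration.

```python
import numpy as np

# Invented TV-budget/sales-style data (x = budget, y = sales).
x = np.array([10.0, 45.0, 70.0, 120.0, 180.0, 230.0, 290.0])
y = np.array([7.2, 9.0, 10.1, 12.5, 15.3, 17.8, 20.9])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates from the slide:
#   beta1_hat = sum_i (x_i - x_bar)(y_i - y_bar) / sum_i (x_i - x_bar)^2
#   beta0_hat = y_bar - beta1_hat * x_bar
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x     # fitted values
residuals = y - y_hat                 # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)          # residual sum of squares

print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}, RSS = {rss:.4f}")
```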
Assessing the Accuracy of the Coefficient Estimates

• The standard error of an estimator reflects how it varies under repeated sampling. We have
  SE(β̂1)² = σ² / Σ (xi − x̄)²,
  SE(β̂0)² = σ² [ 1/n + x̄² / Σ (xi − x̄)² ],
  where σ² = Var(ε).
• These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form
  β̂1 ± 2 · SE(β̂1).

Confidence intervals (continued)

That is, there is approximately a 95% chance that the interval
  [ β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1) ]
will contain the true value of β1 (under a scenario where we got repeated samples like the present sample). For the advertising data, the 95% confidence interval for β1 is [0.042, 0.053].

Hypothesis testing

• Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis
  H0: There is no relationship between X and Y
  versus the alternative hypothesis
  HA: There is some relationship between X and Y.
• Mathematically, this corresponds to testing
  H0: β1 = 0 versus HA: β1 ≠ 0,
  since if β1 = 0 then the model reduces to Y = β0 + ε, and X is not associated with Y.

Hypothesis testing (continued)

• To test the null hypothesis, we compute a t-statistic, given by
  t = (β̂1 − 0) / SE(β̂1).
• This will have a t-distribution with n − 2 degrees of freedom, assuming β1 = 0.
• Using statistical software, it is easy to compute the probability of observing any value equal to |t| or larger. We call this probability the p-value.

Results for the advertising data

              Coefficient   Std. Error   t-statistic   p-value
  Intercept   7.0325        0.4578       15.36         < 0.0001
  TV          0.0475        0.0027       17.67         < 0.0001

Assessing the Overall Accuracy of the Model

• We compute the Residual Standard Error
  RSE = sqrt( RSS / (n − 2) ) = sqrt( (1/(n − 2)) Σ (yi − ŷi)² ),
  where the residual sum of squares is RSS = Σ (yi − ŷi)².
• R-squared, or fraction of variance explained, is
  R² = (TSS − RSS) / TSS = 1 − RSS / TSS,
  where TSS = Σ (yi − ȳ)² is the total sum of squares.
• It can be shown that in this simple linear regression setting R² = r², where r is the correlation between X and Y:
  r = Σ (xi − x̄)(yi − ȳ) / [ sqrt( Σ (xi − x̄)² ) · sqrt( Σ (yi − ȳ)² ) ].

Advertising data results

  Quantity                   Value
  Residual Standard Error    3.26
  R²                         0.612
  F-statistic                312.1
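Again not part of the slides: a hedged sketch of how the standard error, confidence interval, t-statistic, RSE and R² formulas above can be computed directly, reusing the same invented data as the earlier sketch (SciPy is assumed only for the t-distribution p-value).

```python
import numpy as np
from scipy import stats

# Same invented data as in the previous sketch.
x = np.array([10.0, 45.0, 70.0, 120.0, 180.0, 230.0, 290.0])
y = np.array([7.2, 9.0, 10.1, 12.5, 15.3, 17.8, 20.9])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)

sigma2_hat = rss / (n - 2)                      # estimate of sigma^2 = Var(eps)
se_beta1 = np.sqrt(sigma2_hat / sxx)            # SE(beta1_hat)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))

# Approximate 95% confidence interval: beta1_hat +/- 2 * SE(beta1_hat)
ci_low, ci_high = beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1

# t-statistic and two-sided p-value for H0: beta1 = 0 (t distribution, n - 2 df)
t_stat = (beta1_hat - 0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

rse = np.sqrt(rss / (n - 2))                    # residual standard error
tss = np.sum((y - y_bar) ** 2)
r_squared = 1 - rss / tss                       # fraction of variance explained

print(f"SE(beta0)={se_beta0:.4f}  SE(beta1)={se_beta1:.4f}")
print(f"95% CI for beta1: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"t={t_stat:.2f}  p={p_value:.3g}  RSE={rse:.3f}  R^2={r_squared:.3f}")
```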
Multiple Linear Regression

• Here our model is
  Y = β0 + β1X1 + β2X2 + ··· + βpXp + ε.
• We interpret βj as the average effect on Y of a one unit increase in Xj, holding all other predictors fixed.
• In the advertising example, the model becomes
  sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε.

Interpreting regression coefficients

• The ideal scenario is when the predictors are uncorrelated (a balanced design):
  - Each coefficient can be estimated and tested separately.
  - Interpretations such as "a unit change in Xj is associated with a βj change in Y, while all the other variables stay fixed" are possible.
• Correlations amongst predictors cause problems:
  - The variance of all coefficients tends to increase, sometimes dramatically.
  - Interpretations become hazardous: when Xj changes, everything else changes.
• Claims of causality should be avoided for observational data.

The woes of (interpreting) regression coefficients

"Data Analysis and Regression", Mosteller and Tukey 1977
• A regression coefficient βj estimates the expected change in Y per unit change in Xj, with all other predictors held fixed. But predictors usually change together!
• Example: Y = total amount of change in your pocket; X1 = # of coins; X2 = # of pennies, nickels and dimes. By itself, the regression coefficient of Y on X2 will be > 0. But how about with X1 in the model?
• Y = number of tackles by a football player in a season; W and H are his weight and height. The fitted regression model is Ŷ = b0 + 0.50W − 0.10H. How do we interpret β̂2 < 0?

Two quotes by famous Statisticians

"Essentially, all models are wrong, but some are useful." (George Box)

"The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively." (Fred Mosteller and John Tukey, paraphrasing George Box)

Estimation and Prediction for Multiple Regression

• Given estimates β̂0, β̂1, ..., β̂p, we can make predictions using the formula
  ŷ = β̂0 + β̂1x1 + β̂2x2 + ··· + β̂pxp.
• We estimate β0, β1, ..., βp as the values that minimize the sum of squared residuals
  RSS = Σ (yi − ŷi)² = Σ (yi − β̂0 − β̂1xi1 − β̂2xi2 − ··· − β̂pxip)².
• This is done using standard statistical software.
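As a sketch of what the "standard statistical software" step does, here is a minimal NumPy example that minimizes the RSS for a three-predictor model; the advertising-style numbers are invented for illustration.

```python
import numpy as np

# Invented advertising-style data: columns are TV, radio, newspaper budgets.
X = np.array([
    [210.0, 35.0, 60.0],
    [ 50.0, 40.0, 45.0],
    [ 20.0, 46.0, 70.0],
    [150.0, 41.0, 58.0],
    [180.0, 11.0, 58.0],
    [ 10.0, 49.0, 75.0],
    [ 95.0, 25.0, 20.0],
    [120.0, 18.0, 35.0],
])
y = np.array([20.5, 11.0, 9.5, 18.0, 13.0, 7.5, 12.0, 13.5])

# Add a column of ones so that beta_hat[0] plays the role of the intercept beta_0.
X1 = np.column_stack([np.ones(len(y)), X])

# Least squares: choose beta to minimize RSS = sum_i (y_i - x_i' beta)^2.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat                 # fitted values
rss = np.sum((y - y_hat) ** 2)        # residual sum of squares at the minimum

print("beta_hat (intercept, TV, radio, newspaper):", np.round(beta_hat, 4))
print("RSS:", round(rss, 4))
```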
Recommended publications
  • A Generalized Linear Model for Principal Component Analysis of Binary Data
A Generalized Linear Model for Principal Component Analysis of Binary Data
Andrew I. Schein, Lawrence K. Saul, Lyle H. Ungar
Department of Computer and Information Science, University of Pennsylvania, Moore School Building, 200 South 33rd Street, Philadelphia, PA 19104-6389
{ais,lsaul,ungar}@cis.upenn.edu

Abstract
We investigate a generalized linear model for dimensionality reduction of binary data. The model is related to principal component analysis (PCA) in the same way that logistic regression is related to linear regression. Thus we refer to the model as logistic PCA. In this paper, we derive an alternating least squares method to estimate the basis vectors and generalized linear coefficients of the logistic PCA model. The resulting updates have a simple closed form and are guaranteed at each iteration to improve the model's likelihood. We evaluate the performance of logistic PCA as ...

... they are not generally appropriate for other data types. Recently, Collins et al. [5] derived generalized criteria for dimensionality reduction by appealing to properties of distributions in the exponential family. In their framework, the conventional PCA of real-valued data emerges naturally from assuming a Gaussian distribution over a set of observations, while generalized versions of PCA for binary and nonnegative data emerge respectively by substituting the Bernoulli and Poisson distributions for the Gaussian. For binary data, the generalized model's relationship to PCA is analogous to the relationship between logistic and linear regression [12]. In particular, the model exploits the log-odds as the natural parameter of the Bernoulli distribution and the logistic function as its canonical link.
  • Ordinary Least Squares
Ordinary least squares

In statistics, ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model. This method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation. The resulting estimator can be expressed by a simple formula, especially in the case of a single regressor on the right-hand side.

The OLS estimator is consistent when the regressors are exogenous and there is no multicollinearity, and optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors be normally distributed, OLS is the maximum likelihood estimator. OLS is used in economics (econometrics) and electrical engineering (control theory and signal processing), among many areas of application.

[Figure: Okun's law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.]

Linear model

Suppose the data consists of n observations {yᵢ, xᵢ}. Each observation includes a scalar response yᵢ and a vector of predictors (or regressors) xᵢ. In a linear regression model the response variable is a linear function of the regressors:

  yᵢ = xᵢ′β + εᵢ,

where β is a p×1 vector of unknown parameters; the εᵢ are unobserved scalar random variables (errors) which account for the discrepancy between the actually observed responses yᵢ and the "predicted outcomes" xᵢ′β; and ′ denotes matrix transpose, so that x′β is the dot product between the vectors x and β.
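Not part of the excerpt above: a small NumPy sketch of the closed-form OLS estimator b = (X′X)⁻¹X′y and its estimated covariance matrix σ̂²(X′X)⁻¹, on invented data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: n observations, one regressor plus an intercept column.
n = 50
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])          # model matrix with intercept
beta_true = np.array([1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# OLS estimator b = (X'X)^{-1} X'y (solve is preferred over an explicit inverse).
XtX = X.T @ X
b_ols = np.linalg.solve(XtX, X.T @ y)

# Unbiased estimate of the error variance and of Var(b) = sigma^2 (X'X)^{-1}.
resid = y - X @ b_ols
sigma2_hat = resid @ resid / (n - X.shape[1])
cov_b = sigma2_hat * np.linalg.inv(XtX)

print("b_ols:", np.round(b_ols, 3))
print("standard errors:", np.round(np.sqrt(np.diag(cov_b)), 3))
```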
  • Linear Regression
DATA MINING 2: Linear and Logistic Regression
Riccardo Guidotti, a.a. 2019/2020

Regression
• Given a dataset containing N observations (Xᵢ, Yᵢ), i = 1, 2, …, N, regression is the task of learning a target function f that maps each input attribute set X into an output Y.
• The goal is to find the target function that can fit the input data with minimum error.
• The error function can be expressed as
  Absolute Error = Σᵢ |yᵢ − f(xᵢ)|
  Squared Error = Σᵢ (yᵢ − f(xᵢ))²   (the differences yᵢ − f(xᵢ) are the residuals)

Linear Regression
• Linear regression is a linear approach to modeling the relationship between a dependent variable Y and one or more independent (explanatory) variables X.
• The case of one explanatory variable is called simple linear regression.
• For more than one explanatory variable, the process is called multiple linear regression.
• For multiple correlated dependent variables, the process is called multivariate linear regression.

What does it mean to predict Y?
• Look at X = 5. There are many different Y values at X = 5. When we say predict Y at X = 5, we are really asking: what is the expected value (average) of Y at X = 5?
• Formally, the regression function is given by E(Y|X = x). This is the expected value of Y at X = x.
• The ideal or optimal predictor of Y based on X is thus f(X) = E(Y | X = x).

Simple Linear Regression
• Linear model (dependent variable Y, independent variable X): Y = mX + b, i.e. y = β₁x + β₀, with slope β₁ and intercept (bias) β₀.
• In general, such a relationship may not hold exactly for the largely unobserved population. We call the unobserved deviations from Y the errors.
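Not part of the excerpt: a short sketch, on invented data, of the absolute-error and squared-error criteria defined above for a candidate linear target function.

```python
import numpy as np

# Invented observations (X_i, Y_i) and a candidate linear target function f(x) = m*x + b.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def f(x, m=2.0, b=0.0):
    """Candidate linear model Y = mX + b."""
    return m * x + b

residuals = Y - f(X)                            # Y_i - f(X_i)
absolute_error = np.sum(np.abs(residuals))      # sum_i |Y_i - f(X_i)|
squared_error = np.sum(residuals ** 2)          # sum_i (Y_i - f(X_i))^2

print(f"absolute error = {absolute_error:.3f}, squared error = {squared_error:.3f}")
```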
  • Linear, Ridge Regression, and Principal Component Analysis
Linear, Ridge Regression, and Principal Component Analysis
Jia Li, Department of Statistics, The Pennsylvania State University
Email: [email protected], http://www.stat.psu.edu/∼jiali

Introduction to Regression
• Input vector: X = (X1, X2, ..., Xp).
• Output Y is real-valued.
• Predict Y from X by f(X) so that the expected loss function E(L(Y, f(X))) is minimized.
• Square loss: L(Y, f(X)) = (Y − f(X))².
• The optimal predictor is f*(X) = argmin_f E(Y − f(X))² = E(Y | X). The function E(Y | X) is the regression function.

Example
The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y, is expected to be related to total population (X1, measured in thousands), land area (X2, measured in square miles), and total personal income (X3, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.

  i:    1      2      3      ...   139    140    141
  X1:   9387   7031   7017   ...   233    232    231
  X2:   1348   4069   3719   ...   1011   813    654
  X3:   72100  52737  54542  ...   1337   1589   1148
  Y:    25627  15389  13326  ...   264    371    140

Goal: Predict Y from X1, X2, and X3.

Linear Methods
• The linear regression model: f(X) = β0 + Σⱼ₌₁ᵖ Xⱼ βⱼ.
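Not part of the excerpt: a minimal sketch of the standard ridge regression estimator β̂ = (XᵀX + λI)⁻¹Xᵀy (not shown in the preview itself), compared with ordinary least squares on invented data; the SMSA values above are not used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: predict Y from three predictors, in the spirit of the example above.
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

# Center y and X so the intercept can be handled separately (a common convention).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 1.0  # ridge penalty; lam = 0 recovers ordinary least squares

# Ridge estimate: beta = (X'X + lam * I)^{-1} X'y
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
```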
  • Simple Linear Regression with Least Square Estimation: an Overview
Aditya N More et al. / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 7 (6), 2016, 2394-2396

Simple Linear Regression with Least Square Estimation: An Overview
Aditya N More#1, Puneet S Kohli*2, Kshitija H Kulkarni#3
#1-2 Information Technology Department, #3 Electronics and Communication Department
College of Engineering Pune, Shivajinagar, Pune 411005, Maharashtra, India

Abstract: Linear Regression involves modelling a relationship amongst dependent and independent variables in the form of a linear equation. Least Square Estimation is a method to determine the constants in a Linear model in the most accurate way without much complexity of solving. Metrics such as Coefficient of Determination and Mean Square Error determine how good the estimation is. Statistical Packages such as R and Microsoft Excel have built in tools to perform Least Square Estimation over a given data set.

Keywords: Linear Regression, Machine Learning, Least Squares Estimation, R programming

I. INTRODUCTION
Linear Regression involves establishing linear relationships between dependent and independent variables.

... (2.1) where yᵢ is the ith value of the sample data point and ŷᵢ is the ith value of y on the predicted regression line. The above equation can be geometrically depicted by figure 2.1. If we draw a square at each point whose length is equal to the absolute difference between the sample data point and the predicted value as shown, each of the squares would then represent the residual error in placing the regression line. The aim of the least square method would be to place the regression line so as to minimize the sum of the squares.
  • Linear Regression
eesc BC 3017 statistics notes

1 LINEAR REGRESSION

Systematic variation in the true value

Up to now, we have been thinking about measurement as sampling of values from an ensemble of all possible outcomes in order to estimate the true value (which would, according to our previous discussion, be well approximated by the mean of a very large sample). Given a sample of outcomes, we have sometimes checked the hypothesis that it is a random sample from some ensemble of outcomes, by plotting the data points against some other variable, such as ordinal position. Under the hypothesis of random selection, no clear trend should appear. However, the contrary case, where one finds a clear trend, is very important. A clear trend can be a discovery, rather than a nuisance! Whether it is a discovery or a nuisance (or both) depends on what one finds out about the reasons underlying the trend. In either case one must be prepared to deal with trends in analyzing data.

Figure 2.1 (a) shows a plot of (hypothetical) data in which there is a very clear trend. The y axis scales concentration of coliform bacteria sampled from rivers in various regions (units are colonies per liter). The x axis is a hypothetical index of regional urbanization, ranging from 1 to 10. The hypothetical data consist of 6 different measurements at each level of urbanization. The mean of each set of 6 measurements gives a rough estimate of the true value for coliform bacteria concentration for rivers in a region with that urbanization level. The jagged dark line drawn on the graph connects these estimates of true value and makes the trend quite clear: more extensive urbanization is associated with higher true values of bacteria concentration.
  • Principal Component Analysis (PCA) As a Statistical Tool for Identifying Key Indicators of Nuclear Power Plant Cable Insulation Degradation
Iowa State University Capstones, Theses and Dissertations: Graduate Theses and Dissertations, 2017

Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation
Chamila Chandima De Silva, Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd
Part of the Materials Science and Engineering Commons, Mechanics of Materials Commons, and the Statistics and Probability Commons

Recommended Citation
De Silva, Chamila Chandima, "Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation" (2017). Graduate Theses and Dissertations. 16120. https://lib.dr.iastate.edu/etd/16120

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation

by Chamila C. De Silva

A thesis submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE. Major: Materials Science and Engineering. Program of Study Committee: Nicola Bowler, Major Professor; Richard LeSar; Steve Martin.

The student author and the program of study committee are solely responsible for the content of this thesis. The Graduate College will ensure this thesis is globally accessible and will not permit alterations after a degree is conferred.
  • Generalized Linear Models
CHAPTER 6. Generalized linear models

6.1 Introduction
Generalized linear modeling is a framework for statistical analysis that includes linear and logistic regression as special cases. Linear regression directly predicts continuous data y from a linear predictor Xβ = β0 + X1β1 + ··· + Xkβk. Logistic regression predicts Pr(y = 1) for binary data from a linear predictor with an inverse-logit transformation. A generalized linear model involves:
1. A data vector y = (y1, ..., yn)
2. Predictors X and coefficients β, forming a linear predictor Xβ
3. A link function g, yielding a vector of transformed data ŷ = g⁻¹(Xβ) that are used to model the data
4. A data distribution, p(y | ŷ)
5. Possibly other parameters, such as variances, overdispersions, and cutpoints, involved in the predictors, link function, and data distribution.

The options in a generalized linear model are the transformation g and the data distribution p.
• In linear regression, the transformation is the identity (that is, g(u) ≡ u) and the data distribution is normal, with standard deviation σ estimated from data.
• In logistic regression, the transformation is the inverse-logit, g⁻¹(u) = logit⁻¹(u) (see Figure 5.2a on page 80), and the data distribution is defined by the probability for binary data: Pr(y = 1) = ŷ.

This chapter discusses several other classes of generalized linear model, which we list here for convenience:
• The Poisson model (Section 6.2) is used for count data; that is, where each data point yi can equal 0, 1, 2, .... The usual transformation g used here is the logarithmic, so that g(u) = exp(u) transforms a continuous linear predictor Xiβ to a positive ŷi. The data distribution is Poisson. It is usually a good idea to add a parameter to this model to capture overdispersion, that is, variation in the data beyond what would be predicted from the Poisson distribution alone.
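Not part of the excerpt: a hedged sketch of fitting the Poisson model with a logarithmic link described above, using Newton/IRLS iterations on invented count data (no overdispersion parameter).

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented count data generated from a Poisson model with a log link.
n = 200
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 1.2])
y = rng.poisson(np.exp(X @ beta_true))

# Newton / IRLS iterations for the Poisson log-likelihood.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)                      # inverse link: mu = exp(X beta)
    grad = X.T @ (y - mu)                      # score vector
    hess = X.T @ (X * mu[:, None])             # Fisher information X' diag(mu) X
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("estimated beta:", np.round(beta, 3), " true beta:", beta_true)
```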
  • Generalized Linear Models
Generalized Linear Models
Advanced Methods for Data Analysis (36-402/36-608), Spring 2014

1 Generalized linear models

1.1 Introduction: two regressions
• So far we've seen two canonical settings for regression. Let X ∈ Rᵖ be a vector of predictors. In linear regression, we observe Y ∈ R, and assume a linear model
  E(Y|X) = βᵀX,
  for some coefficients β ∈ Rᵖ. In logistic regression, we observe Y ∈ {0, 1}, and we assume a logistic model
  log[ P(Y = 1|X) / (1 − P(Y = 1|X)) ] = βᵀX.
• What's the similarity here? Note that in the logistic regression setting, P(Y = 1|X) = E(Y|X). Therefore, in both settings, we are assuming that a transformation of the conditional expectation E(Y|X) is a linear function of X, i.e.,
  g( E(Y|X) ) = βᵀX,
  for some function g. In linear regression, this transformation was the identity transformation g(u) = u; in logistic regression, it was the logit transformation g(u) = log(u/(1 − u)).
• Different transformations might be appropriate for different types of data. E.g., the identity transformation g(u) = u is not really appropriate for logistic regression (why?), and the logit transformation g(u) = log(u/(1 − u)) is not appropriate for linear regression (why?), but each is appropriate in its own intended domain.
• For a third data type, it is entirely possible that neither transformation is really appropriate. What to do then? We think of another transformation g that is in fact appropriate, and this is the basic idea behind a generalized linear model.

1.2 Generalized linear models
• Given predictors X ∈ Rᵖ and an outcome Y, a generalized linear model is defined by three components: a random component, that specifies a distribution for Y|X; a systematic component, that relates a parameter η to the predictors X; and a link function, that connects the random and systematic components.
• The random component specifies a distribution for the outcome variable (conditional on X).
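Not part of the excerpt: a minimal sketch of fitting the logistic case of g(E(Y|X)) = βᵀX, with the logit link, via Newton's method (equivalently IRLS) on invented binary data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented binary data from a logistic model: logit(P(Y=1|X)) = beta^T X.
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 2.0])
p = 1 / (1 + np.exp(-(X @ beta_true)))         # inverse logit of the linear predictor
y = rng.binomial(1, p)

# Newton's method (IRLS) for the logistic log-likelihood.
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))         # P(Y = 1 | X) under the current beta
    W = mu * (1 - mu)                          # variance weights
    grad = X.T @ (y - mu)
    hess = X.T @ (X * W[:, None])
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("estimated beta:", np.round(beta, 3), " true beta:", beta_true)
```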
  • Time-Series Regression and Generalized Least Squares in R*
Time-Series Regression and Generalized Least Squares in R*
An Appendix to An R Companion to Applied Regression, third edition
John Fox & Sanford Weisberg
last revision: 2018-09-26

Abstract
Generalized least-squares (GLS) regression extends ordinary least-squares (OLS) estimation of the normal linear model by providing for possibly unequal error variances and for correlations between different errors. A common application of GLS estimation is to time-series regression, in which it is generally implausible to assume that errors are independent. This appendix to Fox and Weisberg (2019) briefly reviews GLS estimation and demonstrates its application to time-series data using the gls() function in the nlme package, which is part of the standard R distribution.

1 Generalized Least Squares
In the standard linear model (for example, in Chapter 4 of the R Companion),
  E(y|X) = Xβ
or, equivalently,
  y = Xβ + ε
where y is the n×1 response vector; X is an n×(k + 1) model matrix, typically with an initial column of 1s for the regression constant; β is a (k + 1)×1 vector of regression coefficients to estimate; and ε is an n×1 vector of errors. Assuming that ε ∼ Nₙ(0, σ²Iₙ), or at least that the errors are uncorrelated and equally variable, leads to the familiar ordinary-least-squares (OLS) estimator of β,
  b_OLS = (X′X)⁻¹X′y
with covariance matrix
  Var(b_OLS) = σ²(X′X)⁻¹.
More generally, we can assume that ε ∼ Nₙ(0, Σ), where the error covariance matrix Σ is symmetric and positive-definite. Different diagonal entries in Σ correspond to error variances that are not necessarily all equal, while nonzero off-diagonal entries correspond to correlated errors.
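Not part of the appendix: a small NumPy sketch of the GLS estimator b = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y for a time-series regression with AR(1) errors. Here the autocorrelation ρ is treated as known for illustration, whereas gls() in nlme estimates it from the data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented time-series regression with AR(1) disturbances.
n = 120
t = np.arange(n)
X = np.column_stack([np.ones(n), t / n])
beta_true = np.array([1.0, 2.0])

rho = 0.7
eps = np.zeros(n)
for i in range(1, n):                          # AR(1) errors: eps_i = rho*eps_{i-1} + noise
    eps[i] = rho * eps[i - 1] + rng.normal(scale=0.5)
y = X @ beta_true + eps

# Error covariance for an AR(1) process is proportional to rho^|i-j| (scale cancels in GLS).
Sigma = rho ** np.abs(np.subtract.outer(t, t))
Sigma_inv = np.linalg.inv(Sigma)

# GLS estimator: b = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
b_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("OLS:", np.round(b_ols, 3), " GLS:", np.round(b_gls, 3))
```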
  • Simple Linear Regression
Regression Analysis: Basic Concepts
Allin Cottrell

The simple linear model
Represents the dependent variable, yᵢ, as a linear function of one independent variable, xᵢ, subject to a random "disturbance" or "error", uᵢ:
  yᵢ = β0 + β1xᵢ + uᵢ.
The error term uᵢ is assumed to have a mean value of zero, a constant variance, and to be uncorrelated with its own past values (i.e., it is "white noise"). The task of estimation is to determine regression coefficients β̂0 and β̂1, estimates of the unknown parameters β0 and β1 respectively. The estimated equation will have the form
  ŷᵢ = β̂0 + β̂1xᵢ.

OLS
The basic technique for determining the coefficients β̂0 and β̂1 is Ordinary Least Squares (OLS): values for the coefficients are chosen to minimize the sum of the squared estimated errors or residual sum of squares (SSR). The estimated error associated with each pair of data-values (xᵢ, yᵢ) is defined as
  ûᵢ = yᵢ − ŷᵢ = yᵢ − β̂0 − β̂1xᵢ.
We use a different symbol for this estimated error (ûᵢ) as opposed to the "true" disturbance or error term (uᵢ). These two coincide only if β̂0 and β̂1 happen to be exact estimates of the regression parameters α and β. The estimated errors are also known as residuals. The SSR may be written as
  SSR = Σ ûᵢ² = Σ (yᵢ − ŷᵢ)² = Σ (yᵢ − β̂0 − β̂1xᵢ)².

Picturing the residuals
The residual, ûᵢ, is the vertical distance between the actual value of the dependent variable, yᵢ, and the fitted value, ŷᵢ = β̂0 + β̂1xᵢ.
  • Simple Linear Regression: Straight Line Regression Between an Outcome Variable (Y) and a Single Explanatory or Predictor Variable (X)
1 Introduction to Regression

"Regression" is a generic term for statistical methods that attempt to fit a model to data, in order to quantify the relationship between the dependent (outcome) variable and the predictor (independent) variable(s). Assuming it fits the data reasonably well, the estimated model may then be used either to merely describe the relationship between the two groups of variables (explanatory), or to predict new values (prediction).

There are many types of regression models; here are a few most common to epidemiology:

Simple Linear Regression: straight line regression between an outcome variable (Y) and a single explanatory or predictor variable (X).
  E(Y) = α + β × X

Multiple Linear Regression: same as Simple Linear Regression, but now with possibly multiple explanatory or predictor variables.
  E(Y) = α + β1 × X1 + β2 × X2 + β3 × X3 + ...
A special case is polynomial regression.
  E(Y) = α + β1 × X + β2 × X² + β3 × X³ + ...

Generalized Linear Model: same as Multiple Linear Regression, but with a possibly transformed Y variable. This introduces considerable flexibility, as non-linear and non-normal situations can be easily handled.
  G(E(Y)) = α + β1 × X1 + β2 × X2 + β3 × X3 + ...
In general, the transformation function G(Y) can take any form, but a few forms are especially common:
• Taking G(Y) = logit(Y) describes a logistic regression model:
  log( E(Y) / (1 − E(Y)) ) = α + β1 × X1 + β2 × X2 + β3 × X3 + ...
• Taking G(Y) = log(Y) is also very common, leading to Poisson regression for count data, and other so called "log-linear" models.
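Not part of the note: a brief sketch showing that the polynomial regression special case above is just multiple linear regression on powers of X, fit by least squares on invented data.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented data with a curved relationship:
#   E(Y) = alpha + beta1*X + beta2*X^2 + beta3*X^3
x = rng.uniform(-2, 2, size=80)
y = 1.0 + 0.5 * x - 1.5 * x**2 + 0.3 * x**3 + rng.normal(scale=0.4, size=80)

# Polynomial regression as multiple linear regression on powers of X.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("alpha, beta1, beta2, beta3 =", np.round(coef, 3))
```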