Lecture 2. Linear Regression: A Model for the Mean. Sharyn O'Halloran


A closer look at:
- the linear regression model
- the least squares procedure
- inferential tools
- confidence and prediction intervals
- assumptions and robustness
- model checking
- log transformations (of Y, X, or both)

Linear Regression: Introduction
Data: (Yi, Xi) for i = 1, ..., n. Interest is in the probability distribution of Y as a function of X. The linear regression model says that the mean of Y is a straight-line function of X, plus an error term (residual). The goal is to find the best-fitting line, the one that minimizes the sum of squared errors.

Estimated regression line: steer example (see Display 7.3, p. 177)
Create a new variable ltime = log(time) and regress pH on ltime. The equation for the estimated regression line is
  Y-hat = 6.98 - 0.73*ltime,
with intercept 6.98 and slope -0.73. (The original slide shows the fitted line through the scatter of pH against ltime, with the error term drawn as the vertical distance from an observation to the line, together with the Stata regression output.)

Regression Terminology
Regression: the mean of a response variable as a function of one or more explanatory variables, written mu{Y | X}. A regression model is an ideal formula to approximate the regression. The simple linear regression model is
  mu{Y | X} = beta0 + beta1*X,
read "the mean of Y given X" or "the regression of Y on X", with unknown intercept parameter beta0 and slope parameter beta1.

Y is called the dependent, explained, or response variable; X is the independent, explanatory, or control variable. Y's probability distribution is to be explained by X. beta0 and beta1 are the regression coefficients (see Display 7.5, p. 180). Note that Y = beta0 + beta1*X by itself is NOT simple regression: Y is not an exact linear function of X, and the model describes only the mean of Y.

Regression Terminology: Estimated coefficients
The unknown regression line is beta0 + beta1*X; its estimate is beta0-hat + beta1-hat*X. Choose beta0-hat and beta1-hat to make the residuals small.

Fitted values and residuals
The fitted value for observation i is its estimated mean:
  Yi-hat = fit_i = mu-hat{Y | Xi} = beta0-hat + beta1-hat*Xi.
The residual for observation i is
  res_i = Yi - fit_i, i.e., e_i = Yi - Yi-hat.
Least squares is the statistical estimation method that finds the estimates minimizing the sum of squared residuals,
  sum_i (yi - (beta0-hat + beta1-hat*xi))^2 = sum_i (yi - yi-hat)^2.
The solution (from calculus) is on p. 182 of the Sleuth.

Least Squares Procedure
The least-squares procedure obtains estimates of the coefficients beta0 and beta1 in the model yi-hat = beta0 + beta1*xi by minimizing the sum of the squared residuals (errors e_i):
  SSE = sum e_i^2 = sum (yi - yi-hat)^2 = sum (yi - (beta0 + beta1*xi))^2.
Choose beta0 and beta1 so that this quantity is minimized.

The slope estimator is
  beta1-hat = sum(xi - x-bar)(yi - y-bar) / sum(xi - x-bar)^2 = r_xy * (s_Y / s_X),
that is, the correlation between X and Y times the ratio of the standard deviation of Y to the standard deviation of X. The intercept estimator is
  beta0-hat = y-bar - beta1-hat * x-bar.

Note that the regression line always goes through the point of means (x-bar, y-bar). Think of the regression line as the expected value of Y for a given value of X: for any value of the independent variable there is a single most likely value of the dependent variable. (The original slide illustrates this with a trend line relating yield in bushels/acre to fertilizer in lb/acre.)
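As a minimal sketch of the least-squares formulas above: the lecture itself works in Stata, but the same computation in Python with NumPy is shown below. The (x, y) values are made up purely for illustration; they are not the steer data from the lecture.

```python
import numpy as np

# Hypothetical (x, y) data for illustration only.
x = np.array([0.0, 0.3, 0.7, 1.0, 1.4, 1.8, 2.1])
y = np.array([7.0, 6.8, 6.5, 6.3, 6.0, 5.8, 5.5])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-products over sum of squared deviations of x.
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: forces the line through the point of means (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

# Equivalent form of the slope: correlation times the ratio of standard deviations.
r_xy = np.corrcoef(x, y)[0, 1]
b1_alt = r_xy * y.std(ddof=1) / x.std(ddof=1)

fitted = b0 + b1 * x        # fitted values
residuals = y - fitted      # residuals e_i = y_i - yhat_i

print(f"b1 = {b1:.4f}  (check: r*sY/sX = {b1_alt:.4f})")
print(f"b0 = {b0:.4f}")
print(f"line passes through the means: {np.isclose(b0 + b1 * x_bar, y_bar)}")
print(f"sum of residuals (should be ~0): {residuals.sum():.2e}")
```

In Stata the same fit would come from the regress command followed by predict for the fitted values, which is what the slides' screenshots show.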
Tests and Confidence Intervals for beta0, beta1
Degrees of freedom: n - 2 = sample size minus the number of coefficients. The estimated variance of Y given X is
  sigma-hat^2 = (sum of squared residuals) / (n - 2).
Standard errors are given on p. 184. Under the ideal normal model, the sampling distributions of beta0-hat and beta1-hat have the shape of a t-distribution on (n - 2) degrees of freedom, so t-tests and confidence intervals are done as usual (df = n - 2). (The slide shows Stata output with p-values for H0: coefficient = 0 and the corresponding confidence intervals.)

Inference Tools
Hypothesis test and confidence interval for the mean of Y at some X: estimate the mean of Y at X = X0 by
  mu-hat{Y | X0} = beta0-hat + beta1-hat * X0.
Its standard error is
  SE[mu-hat{Y | X0}] = sigma-hat * sqrt( 1/n + (X0 - x-bar)^2 / ((n - 1) * s_x^2) ).
Conduct the t-test and confidence interval in the usual way (df = n - 2).

Confidence bands for conditional means
Confidence bands in simple regression have an hourglass shape, narrowest at the mean of X. The lfitci command automatically calculates and graphs the confidence bands.

Prediction
Prediction of a future Y at X = X0:
  Pred(Y | X0) = mu-hat{Y | X0}.
Standard error of prediction:
  SE[Pred(Y | X0)] = sqrt( sigma-hat^2 + (SE[mu-hat{Y | X0}])^2 ),
which combines the variability of Y about its mean with the uncertainty in the estimated mean. The 95% prediction interval is
  Pred(Y | X0) +/- t_df(.975) * SE[Pred(Y | X0)].

Residuals vs. predicted values plot
After any regression analysis we can automatically draw a residual-versus-fitted plot just by typing a single command (rvfplot in Stata). After any regression, the predict command can create a new variable yhat containing the predicted Y values, and predict with the resid option can create a new variable e containing the residuals. The residual-versus-predicted-values plot could therefore also be drawn "by hand" by plotting e against yhat.

Second type of confidence interval for regression prediction: the "prediction band"
This expresses our uncertainty in estimating the unknown value of Y for an individual observation with a known X value. Command: lfitci with the stdf option. Note that predict can generate two kinds of standard errors for the predicted y value, which have two different applications:
- stdp: confidence bands for conditional means, i.e., a set of confidence intervals for mu{Y | X0};
- stdf: confidence bands for individual-case predictions, i.e., prediction intervals for Y at a given X (for example, a 95% prediction interval for Y at X = 1000 in the distance-versus-velocity example).
A calibration interval gives the values of X for which Y0 is in a prediction interval. (The slides illustrate both kinds of bands on plots of Distance against Velocity.)

Notes about confidence and prediction bands
Both are narrowest at the mean of X. Beware of extrapolation. The width of the confidence interval shrinks to zero if n is large enough; this is not true of the prediction interval.
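To make the mean-versus-prediction distinction concrete, here is a small Python sketch of the two standard errors above, a stand-in for Stata's stdp and stdf options. The data and the point x0 are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data; x0 is the point at which we want intervals.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])
x0 = 5.5
n = len(x)

# Least-squares fit.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# sigma_hat^2 = SSE / (n - 2), on n - 2 degrees of freedom.
sigma2 = np.sum(resid ** 2) / (n - 2)
sx2 = np.var(x, ddof=1)

# SE of the estimated mean at x0 (the "stdp"-type standard error).
se_mean = np.sqrt(sigma2 * (1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * sx2)))

# SE of a future observation at x0 (the "stdf"-type standard error).
se_pred = np.sqrt(sigma2 + se_mean ** 2)

yhat0 = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)

print(f"95% CI for the mean of Y at x0:  {yhat0 - t_crit*se_mean:.2f} to {yhat0 + t_crit*se_mean:.2f}")
print(f"95% prediction interval at x0:   {yhat0 - t_crit*se_pred:.2f} to {yhat0 + t_crit*se_pred:.2f}")
```

The prediction interval is always wider because it adds sigma-hat^2, the variability of an individual Y about its mean, to the uncertainty in the estimated mean.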
Review of simple linear regression
1. Model with constant variance:
   mu{Y | X} = beta0 + beta1*X,  var{Y | X} = sigma^2.
2. Least squares: choose the estimators beta0-hat and beta1-hat to minimize the sum of squared residuals:
   beta1-hat = sum_{i=1..n} (Xi - x-bar)(Yi - y-bar) / sum_{i=1..n} (Xi - x-bar)^2,
   beta0-hat = y-bar - beta1-hat * x-bar,
   res_i = Yi - beta0-hat - beta1-hat*Xi  (i = 1, ..., n).
3. Properties of the estimators:
   sigma-hat^2 = sum res_i^2 / (n - 2),
   SE(beta1-hat) = sigma-hat / sqrt((n - 1) * s_x^2),
   SE(beta0-hat) = sigma-hat * sqrt(1/n + x-bar^2 / ((n - 1) * s_x^2)).

Assumptions of Linear Regression
A linear regression model assumes:
- Linearity: mu{Y | X} = beta0 + beta1*X
- Constant variance: var{Y | X} = sigma^2
- Normality: the distribution of Y at any X is normal
- Independence: given the Xi's, the Yi's are independent

Examples of Violations: Non-Linearity
The true relation between the independent and dependent variables may not be linear. For example, consider campaign fundraising and the probability of winning an election: the probability of winning increases with each additional dollar spent and then levels off after $50,000. If linearity is violated, misleading conclusions may occur (however, the degree of the problem depends on the degree of non-linearity).

Examples of Violations: Constant Variance
The constant variance (homoskedasticity) assumption implies that, on average, we do not expect to get larger errors in some cases than in others. Of course, due to the luck of the draw, some errors will turn out to be larger than others; homoskedasticity is violated only when this happens in a predictable manner. Example: income and spending on certain goods. People with higher incomes have more choices about what to buy, so we would expect their consumption of certain goods to be more variable than for families with lower incomes. (The slide plots spending against income: as income increases, so do the errors, i.e., the vertical distances eps_i = Yi - (a + b*Xi) from the fitted line. A simulated illustration appears at the end of this section.)
If constant variance is violated, the least-squares estimates are still unbiased, but standard errors, tests, confidence intervals, and prediction intervals are incorrect; again, the severity depends on the degree of the violation.

Violation of Normality
Non-normality: nicotine use, for example, is characterized by a large number of people not smoking at all and another large number of people who smoke every day.
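Returning to the constant-variance violation above, a minimal simulation shows how a residual-versus-fitted plot reveals it. The income/spending numbers here are invented for illustration; a fan shape in the plot is the visual signature of heteroskedasticity.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical income/spending-style data: the error spread grows with income,
# so the constant-variance assumption is violated by construction.
income = rng.uniform(20, 120, size=200)                      # thousands of dollars
spending = 5 + 0.3 * income + rng.normal(0, 0.05 * income)   # noise scales with income

# Least-squares fit.
b1 = np.sum((income - income.mean()) * (spending - spending.mean())) / np.sum((income - income.mean()) ** 2)
b0 = spending.mean() - b1 * income.mean()
fitted = b0 + b1 * income
resid = spending - fitted

# Residual-versus-fitted plot: widening spread as fitted values increase.
plt.scatter(fitted, resid, s=10)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted: fanning indicates heteroskedasticity")
plt.show()
```

The analogous check in Stata is the rvfplot command after a regression, as described earlier in the lecture.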