Lecture 14 Multiple Linear Regression and Logistic Regression

Total Page:16

File Type:pdf, Size:1020Kb

Lecture 14 Multiple Linear Regression and Logistic Regression Lecture 14 Multiple Linear Regression and Logistic Regression Fall 2013 Prof. Yao Xie, [email protected] H. Milton Stewart School of Industrial Systems & Engineering Georgia Tech H1 Outline • Multiple regression • Logistic regression H2 Simple linear regression Based on the scatter diagram, it is probably reasonable to assume that the mean of the random variable Y is related to X by the following simple linear regression model: Response Regressor or Predictor Yi = β0 + β1X i + εi i =1,2,!,n 2 ε i 0, εi ∼ Ν( σ ) Intercept Slope Random error where the slope and intercept of the line are called regression coefficients. • The case of simple linear regression considers a single regressor or predictor x and a dependent or response variable Y. H3 Multiple linear regression • Simple linear regression: one predictor variable x • Multiple linear regression: multiple predictor variables x1, x2, …, xk • Example: • simple linear regression property tax = a*house price + b • multiple linear regression property tax = a1*house price + a2*house size + b • Question: how to fit multiple linear regression model? H4 JWCL232_c12_449-512.qxd 1/15/10 10:06 PM Page 450 450 CHAPTER 12 MULTIPLE LINEAR REGRESSION 12-2 HYPOTHESIS TESTS IN MULTIPLE 12-4 PREDICTION OF NEW LINEAR REGRESSION OBSERVATIONS 12-2.1 Test for Significance of 12-5 MODEL ADEQUACY CHECKING Regression 12-5.1 Residual Analysis 12-2.2 Tests on Individual Regression 12-5.2 Influential Observations Coefficients and Subsets of Coefficients 12-6 ASPECTS OF MULTIPLE REGRESSION MODELING 12-3 CONFIDENCE INTERVALS IN MULTIPLE LINEAR 12-6.1 Polynomial Regression Models REGRESSION 12-6.2 Categorical Regressors and 12-3.1 Confidence Intervals on Indicator Variables JWCL232_c12_449-512.qxdIndividual Regression 1/15/10 10:0712-6.3 PM Selection Page 451 of Variables and Coefficients Model Building JWCL232_c12_449-512.qxd 1/15/10 10:07 PM Page 451 JWCL232_c12_449-512.qxd 1/15/10 10:07 PM Page 451 12-3.2 Confidence Interval on 12-6.4 Multicollinearity JWCL232_c12_449-512.qxd 1/15/10 10:07 PM Pagethe 451Mean Response LEARNING OBJECTIVES 12-1 MULTIPLE LINEAR REGRESSION MODEL 451 12-1 MULTIPLE LINEAR REGRESSION MODEL 451 12-1 MULTIPLE LINEAR REGRESSION MODEL 451 After careful study of this chapter you should be able to do the following: x2 1. Use multiple regression techniques12-1 MULTIPLEto build LINEARempirical REGRESSION models MODELto engineering 451 and scientific10 x2 220 data x2 10 10 240x2 8 2. Understand how the method of least squares extends to fitting220 multiple220 regression models 20010 203 240 220 240 3. Assess regression88 model adequacy 240 160 203 6 200200 8 203 4. Test hypotheses and construct confidence intervals on the regression coefficients 186 200 E(Y) 120 203 160 66 160 4 160 5. Use the regression model to estimate806 the mean response and to make predictions and to construct E(Y)E(Y)120120 186 186 E(Y) 120 confidence intervals and prediction40 intervals 186 10 80 44 169 80 4 8 2 80 6. Build regression10 models with polynomial0 terms 6 JWCL232_c12_449-512.qxd4040 1/15/10 10:07 PM 10 Page 451 169 169 4 40 88 2 10 0 7. Use indicator variables2 to model categorical2 regressors 2 169x2 67 84 101 118 135 152 0 66 8 2 4 6 0 4 6 8 0 0 0 0 0 4 x1 10 2 2 8.2 Use stepwisex2 regression4 and67 other84 model101 building118 techniques135 to152 select the appropriate set of vari-0 246810x1 44 60 2 2 x2 67 84 10167 11884 101135 118152 135 152 6 8 4 0 002 x2 (a) x1 8 10 0 6 0 2468100 x (b) x1 10x ables for8 a regression10 0 0 model2468101 x (a) 1 0 2468101 x1 (a) (a) (b) Figure 12-1 (a)(b) The regression(b )plane for the model E(Y ) ϭ 50 ϩ 10x1 ϩ 7x2. (b) The contour plot. Figure 12-1 (a) The regression plane for the model E(Y ) ϭ 50 ϩ 10x1 ϩ 7x2. (b) The contour plot. Figure 12-1 (a) The regressionFigure 12-1 plane(a) for The the regression model E plane(Y ) ϭ for50 theϩ model10x1 ϩE(Y7)x2ϭ. (b)50 Theϩ 10 contourx1 ϩ 7x 2plot.. (b) The contour plot. 12-1 MULTIPLE LINEAR REGRESSION MODEL The regression model in Equation 12-1 describes a plane in the three-dimensional space The regression modelThe in regression Equation 12-1model describes in Equation a plane 12-1 in describes the three-dimensional a planeof Y in, x the, and three-dimensional space x . Figure 12-1(a) space shows this plane for the regression model 12-1.1of YThe, x , Introductionregressionand x . Figure model 12-1(a) in Equation shows this 12-1 plane describes for the regression a plane in model the three-dimensional1 space2 1 2 of Y, x1, and x2. Figure 12-1(a) shows this plane for the regression model of Y, x1, and x2. Figure 12-1(a) shows this plane for the regression model 12-1 MULTIPLE LINEAR REGRESSION MODEL 451 E Y ϭ 50 ϩ 10x ϩ 7x E Y ϭ 50 ϩ 10x ϩ 7x 1 2 Many applications of regression1E Y analysisϭ2 50 ϩ 10involvex1 ϩ 7 xsituations2 in which there are more than one E Y ϭ 50 ϩ 10x1 ϩ 7x2 regressor or predictorMultiple variable. Alinear regression modelregressionwhere that contains we have moreassumed modelthan that one the regressor expected1 2 vari- value of the error term is zero; that is E(⑀) ϭ 0. The where we have assumedwhere that we the have expected assumed1 2 value that ofthe the expected error1 2 term value is of zero; the errorthat is term xE2(⑀ )isϭ zero;0. The that is E(⑀) ϭ 0. The able is called 1a multiple2 regression model. whereparameter we have ␤ is assumed the intercept that the␤of expectedthe plane. value We sometimes of the error call term ␤ andis zero; ␤ partial thatparameter is ␤Eregression(⑀) ϭ␤␤00.is The the intercept of the plane. We sometimes call ␤1 and ␤2 partial regression 0 parameterAs an0 example,is• theMultiple linear regression model with two regressors intercept supposeof the that plane. the effective We1 sometimes life2 of call10 a cutting 1 and tool2 partial depends regression on the cutting speed parametercoefficients, ␤0 becauseis the coefficients,intercept ␤ measuresofbecause the the plane. expected ␤ measures We changesometimes the in expected Y percall unit ␤ change1 andchange ␤in2coefficients, partialYinper x whenunit regression change x becauseis in x ␤when1 measures x is the expected change in Y per unit change in x1 when x2 is and1 the tool angle.JWCL232_c12_449-512.qxd A1 multiple regression 1/15/10 model 10:07 that PM might Page1 describe451 2 this relationship1 2 is 220 coefficients,held constant,because and ␤held2 measures␤1 constant,measures the and expectedthe ␤(predictor variables) 2expectedmeasures change change the in expected Y per in unitY perchange change unit in change inY perxheld2 when unit in constant, changex x11whenis held in andx x22iswhen ␤2 measures x1 is held the expected change in Y per unit change in x2 when x1 is held 240 heldconstant. constant, Figure and 12-1(b) ␤constant.measures shows Figure athe contour expected12-1(b) plot shows changeof the a contour regression in Y per plot unit model—thatof changethe regression inconstant. is, x lineswhen model—that of Figure x con-is held 12-1(b) is, lines showsof con- a contour plot of the regression model—that is, lines of con- 2 ϭ␤ ϩ␤ ϩ␤82 ϩ⑀1 stant E(Y ) as a functionstant of E (xY )and as ax function. Notice ofthat x theand contour x . NoticeY lines that 0in the this contour 1plotx1 are lines straight2x 2in this lines. plot are straight lines. (12-1) 200 constant. Figure 12-1(b) shows1 a contour2 plot1 of the2 regression model—thatstant is, E (linesY ) as of a con-function of x1 and x2. Notice that the contour lines in this 203plot are straight lines. stant EIn( Ygeneral,) as a function the dependent ofIn xgeneral,1 and variable x 2the. Notice dependentor responsethat the variable contourY mayor belines response related in this toY plotmayk independent arebeIn relatedstraight general, to orlines. k theindependent dependentor variable or response Y may be related to k independent or where Y represents the tool life, x1 represents the cutting speed, x2 represents the tool angle, 160 regressor variables.regressorThe model variables. The model 6 12-1 MULTIPLE LINEAR REGRESSION MODEL In general, the dependentand ⑀ is a variablerandom erroror response term. ThisY may is abe multiple related tolinearregressor k independent regression variables. modelor The with model two regressors. 451 E(Y) regressor variables. The model 186 120 p ˛ ˛ The term linear is usedϭ␤ becauseϩ␤ Equation˛ ϩ␤˛ 12-1ϩ isϩ␤ a linearϩ⑀ function of the unknown parameters Y ϭ␤0 ϩ␤1x1 ϩ␤Y2x2 ϩ0 p ϩ␤1x1 k xk ϩ⑀2x2 k xk (12-2) x2 (12-2) p ˛ ˛ 10 Y ϭ␤0 ϩ␤1x1 ϩ␤2x2 ϩ ϩ␤k xk ϩ⑀ (12-2) ␤0, ␤1, and ␤2. 4 220 80 ϭ␤ ϩ␤ ϩ␤ ϩ p ϩ␤˛ ˛ ϩ⑀ is calledY a multiple0 1 xlinear1 regression2x2 modelk x withk k regressor variables.(12-2) The parameters ␤ , is called a multiple linear regression model with k regressor240 variables. The parameters ␤j, 8 j is called a multiple linear regression model with k regressor203 variables. The parameters ␤ , ϭ j ϭ 0, 1, p , k, are called the regression200 coefficients.10 This model describes a hyperplane in j 40 j 0, 1, p , k, are called the regression coefficients. This model describes a hyperplane in ␤ 169 is called a multiple thelinear k-dimensional regression spacemodel of with the regressork160regressor variables8 variables.
Recommended publications
  • Understanding Linear and Logistic Regression Analyses
    EDUCATION • ÉDUCATION METHODOLOGY Understanding linear and logistic regression analyses Andrew Worster, MD, MSc;*† Jerome Fan, MD;* Afisi Ismaila, MSc† SEE RELATED ARTICLE PAGE 105 egression analysis, also termed regression modeling, come). For example, a researcher could evaluate the poten- Ris an increasingly common statistical method used to tial for injury severity score (ISS) to predict ED length-of- describe and quantify the relation between a clinical out- stay by first producing a scatter plot of ISS graphed against come of interest and one or more other variables. In this ED length-of-stay to determine whether an apparent linear issue of CJEM, Cummings and Mayes used linear and lo- relation exists, and then by deriving the best fit straight line gistic regression to determine whether the type of trauma for the data set using linear regression carried out by statis- team leader (TTL) impacts emergency department (ED) tical software. The mathematical formula for this relation length-of-stay or survival.1 The purpose of this educa- would be: ED length-of-stay = k(ISS) + c. In this equation, tional primer is to provide an easily understood overview k (the slope of the line) indicates the factor by which of these methods of statistical analysis. We hope that this length-of-stay changes as ISS changes and c (the “con- primer will not only help readers interpret the Cummings stant”) is the value of length-of-stay when ISS equals zero and Mayes study, but also other research that uses similar and crosses the vertical axis.2 In this hypothetical scenario, methodology.
    [Show full text]
  • Application of General Linear Models (GLM) to Assess Nodule Abundance Based on a Photographic Survey (Case Study from IOM Area, Pacific Ocean)
    minerals Article Application of General Linear Models (GLM) to Assess Nodule Abundance Based on a Photographic Survey (Case Study from IOM Area, Pacific Ocean) Monika Wasilewska-Błaszczyk * and Jacek Mucha Department of Geology of Mineral Deposits and Mining Geology, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, 30-059 Cracow, Poland; [email protected] * Correspondence: [email protected] Abstract: The success of the future exploitation of the Pacific polymetallic nodule deposits depends on an accurate estimation of their resources, especially in small batches, scheduled for extraction in the short term. The estimation based only on the results of direct seafloor sampling using box corers is burdened with a large error due to the long sampling interval and high variability of the nodule abundance. Therefore, estimations should take into account the results of bottom photograph analyses performed systematically and in large numbers along the course of a research vessel. For photographs taken at the direct sampling sites, the relationship linking the nodule abundance with the independent variables (the percentage of seafloor nodule coverage, the genetic types of nodules in the context of their fraction distribution, and the degree of sediment coverage of nodules) was determined using the general linear model (GLM). Compared to the estimates obtained with a simple Citation: Wasilewska-Błaszczyk, M.; linear model linking this parameter only with the seafloor nodule coverage, a significant decrease Mucha, J. Application of General in the standard prediction error, from 4.2 to 2.5 kg/m2, was found. The use of the GLM for the Linear Models (GLM) to Assess assessment of nodule abundance in individual sites covered by bottom photographs, outside of Nodule Abundance Based on a direct sampling sites, should contribute to a significant increase in the accuracy of the estimation of Photographic Survey (Case Study nodule resources.
    [Show full text]
  • The Simple Linear Regression Model
    The Simple Linear Regression Model Suppose we have a data set consisting of n bivariate observations {(x1, y1),..., (xn, yn)}. Response variable y and predictor variable x satisfy the simple linear model if they obey the model yi = β0 + β1xi + ǫi, i = 1,...,n, (1) where the intercept and slope coefficients β0 and β1 are unknown constants and the random errors {ǫi} satisfy the following conditions: 1. The errors ǫ1, . , ǫn all have mean 0, i.e., µǫi = 0 for all i. 2 2 2 2. The errors ǫ1, . , ǫn all have the same variance σ , i.e., σǫi = σ for all i. 3. The errors ǫ1, . , ǫn are independent random variables. 4. The errors ǫ1, . , ǫn are normally distributed. Note that the author provides these assumptions on page 564 BUT ORDERS THEM DIFFERENTLY. Fitting the Simple Linear Model: Estimating β0 and β1 Suppose we believe our data obey the simple linear model. The next step is to fit the model by estimating the unknown intercept and slope coefficients β0 and β1. There are various ways of estimating these from the data but we will use the Least Squares Criterion invented by Gauss. The least squares estimates of β0 and β1, which we will denote by βˆ0 and βˆ1 respectively, are the values of β0 and β1 which minimize the sum of errors squared S(β0, β1): n 2 S(β0, β1) = X ei i=1 n 2 = X[yi − yˆi] i=1 n 2 = X[yi − (β0 + β1xi)] i=1 where the ith modeling error ei is simply the difference between the ith value of the response variable yi and the fitted/predicted valuey ˆi.
    [Show full text]
  • Chapter 2 Simple Linear Regression Analysis the Simple
    Chapter 2 Simple Linear Regression Analysis The simple linear regression model We consider the modelling between the dependent and one independent variable. When there is only one independent variable in the linear regression model, the model is generally termed as a simple linear regression model. When there are more than one independent variables in the model, then the linear model is termed as the multiple linear regression model. The linear model Consider a simple linear regression model yX01 where y is termed as the dependent or study variable and X is termed as the independent or explanatory variable. The terms 0 and 1 are the parameters of the model. The parameter 0 is termed as an intercept term, and the parameter 1 is termed as the slope parameter. These parameters are usually called as regression coefficients. The unobservable error component accounts for the failure of data to lie on the straight line and represents the difference between the true and observed realization of y . There can be several reasons for such difference, e.g., the effect of all deleted variables in the model, variables may be qualitative, inherent randomness in the observations etc. We assume that is observed as independent and identically distributed random variable with mean zero and constant variance 2 . Later, we will additionally assume that is normally distributed. The independent variables are viewed as controlled by the experimenter, so it is considered as non-stochastic whereas y is viewed as a random variable with Ey()01 X and Var() y 2 . Sometimes X can also be a random variable.
    [Show full text]
  • The Conspiracy of Random Predictors and Model Violations Against Classical Inference in Regression 1 Introduction
    The Conspiracy of Random Predictors and Model Violations against Classical Inference in Regression A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, M. Traskin, K. Zhang, L. Zhao March 10, 2014 Dedicated to Halbert White (y2012) Abstract We review the early insights of Halbert White who over thirty years ago inaugurated a form of statistical inference for regression models that is asymptotically correct even under \model misspecification.” This form of inference, which is pervasive in econometrics, relies on the \sandwich estimator" of standard error. Whereas the classical theory of linear models in statistics assumes models to be correct and predictors to be fixed, White permits models to be \misspecified” and predictors to be random. Careful reading of his theory shows that it is a synergistic effect | a \conspiracy" | of nonlinearity and randomness of the predictors that has the deepest consequences for statistical inference. It will be seen that the synonym \heteroskedasticity-consistent estimator" for the sandwich estimator is misleading because nonlinearity is a more consequential form of model deviation than heteroskedasticity, and both forms are handled asymptotically correctly by the sandwich estimator. The same analysis shows that a valid alternative to the sandwich estimator is given by the \pairs bootstrap" for which we establish a direct connection to the sandwich estimator. We continue with an asymptotic comparison of the sandwich estimator and the standard error estimator from classical linear models theory. The comparison shows that when standard errors from linear models theory deviate from their sandwich analogs, they are usually too liberal, but occasionally they can be too conservative as well.
    [Show full text]
  • Design and Analysis of Ecological Data Landscape of Statistical Methods: Part 1
    Design and Analysis of Ecological Data Landscape of Statistical Methods: Part 1 1. The landscape of statistical methods. 2 2. General linear models.. 4 3. Nonlinearity. 11 4. Nonlinearity and nonnormal errors (generalized linear models). 16 5. Heterogeneous errors. 18 *Much of the material in this section is taken from Bolker (2008) and Zur et al. (2009) Landscape of statistical methods: part 1 2 1. The landscape of statistical methods The field of ecological modeling has grown amazingly complex over the years. There are now methods for tackling just about any problem. One of the greatest challenges in learning statistics is figuring out how the various methods relate to each other and determining which method is most appropriate for any particular problem. Unfortunately, the plethora of statistical methods defy simple classification. Instead of trying to fit methods into clearly defined boxes, it is easier and more meaningful to think about the factors that help distinguish among methods. In this final section, we will briefly review these factors with the aim of portraying the “landscape” of statistical methods in ecological modeling. Importantly, this treatment is not meant to be an exhaustive survey of statistical methods, as there are many other methods that we will not consider here because they are not commonly employed in ecology. In the end, the choice of a particular method and its interpretation will depend heavily on whether the purpose of the analysis is descriptive or inferential, the number and types of variables (i.e., dependent, independent, or interdependent) and the type of data (e.g., continuous, count, proportion, binary, time at death, time series, circular).
    [Show full text]
  • Chapter 11 Autocorrelation
    Chapter 11 Autocorrelation One of the basic assumptions in the linear regression model is that the random error components or disturbances are identically and independently distributed. So in the model y Xu , it is assumed that 2 u ifs 0 Eu(,tts u ) 0 if0s i.e., the correlation between the successive disturbances is zero. 2 In this assumption, when Eu(,tts u ) u , s 0 is violated, i.e., the variance of disturbance term does not remain constant, then the problem of heteroskedasticity arises. When Eu(,tts u ) 0, s 0 is violated, i.e., the variance of disturbance term remains constant though the successive disturbance terms are correlated, then such problem is termed as the problem of autocorrelation. When autocorrelation is present, some or all off-diagonal elements in E(')uu are nonzero. Sometimes the study and explanatory variables have a natural sequence order over time, i.e., the data is collected with respect to time. Such data is termed as time-series data. The disturbance terms in time series data are serially correlated. The autocovariance at lag s is defined as sttsEu( , u ); s 0, 1, 2,... At zero lag, we have constant variance, i.e., 22 0 Eu()t . The autocorrelation coefficient at lag s is defined as Euu()tts s s ;s 0,1,2,... Var() utts Var ( u ) 0 Assume s and s are symmetrical in s , i.e., these coefficients are constant over time and depend only on the length of lag s . The autocorrelation between the successive terms (and)uu21, (and),....uu32 (and)uunn1 gives the autocorrelation of order one, i.e., 1 .
    [Show full text]
  • Simple Statistics, Linear and Generalized Linear Models
    Simple Statistics, Linear and Generalized Linear Models Rajaonarifara Elinambinina PhD Student: Sorbonne Universt´e January 8, 2020 Rajaonarifara Elinambinina ( PhD Student: SorbonneSimple Statistics, Universt´e) Linear and Generalized Linear Models January 8, 2020 1 / 20 Outline 1 Basic statistics 2 Data Analysis 3 Linear model: linear regression 4 Generalized linear models Rajaonarifara Elinambinina ( PhD Student: SorbonneSimple Statistics, Universt´e) Linear and Generalized Linear Models January 8, 2020 2 / 20 Basic statistics A statistical variable is either quantitative or qualitative quantitative variable: numerical variable: discrete or continuous I Discrete distribution: countable I Continuous: not countable qualitative variable: not numerical variable: nominal or ordinal( cathegories) I ordinal: can be ranked I nominal: cannot be ranked Rajaonarifara Elinambinina ( PhD Student: SorbonneSimple Statistics, Universt´e) Linear and Generalized Linear Models January 8, 2020 3 / 20 Example of discrete distribution I Binomial distribution F binary data:there are only two issues (0 or 1) F repeat the experiment several times I Poisson distribution F Count of something in a limited time or space: positive F The probability of realising the event is small F The variance is equal to the mean I Binomial Negative F Number of experiment needed to get specific number of success Rajaonarifara Elinambinina ( PhD Student: SorbonneSimple Statistics, Universt´e) Linear and Generalized Linear Models January 8, 2020 4 / 20 Example of continuous distribution
    [Show full text]
  • Binary Response and Logistic Regression Analysis
    Chapter 3 Binary Response and Logistic Regression Analysis February 7, 2001 Part of the Iowa State University NSF/ILI project Beyond Traditional Statistical Methods Copyright 2000 D. Cook, P. Dixon, W. M. Duckworth, M. S. Kaiser, K. Koehler, W. Q. Meeker and W. R. Stephenson. Developed as part of NSF/ILI grant DUE9751644. Objectives This chapter explains • the motivation for the use of logistic regression for the analysis of binary response data. • simple linear regression and why it is inappropriate for binary response data. • a curvilinear response model and the logit transformation. • the use of maximum likelihood methods to perform logistic regression. • how to assess the fit of a logistic regression model. • how to determine the significance of explanatory variables. Overview Modeling the relationship between explanatory and response variables is a fundamental activity encoun- tered in statistics. Simple linear regression is often used to investigate the relationship between a single explanatory (predictor) variable and a single response variable. When there are several explanatory variables, multiple regression is used. However, often the response is not a numerical value. Instead, the response is simply a designation of one of two possible outcomes (a binary response) e.g. alive or dead, success or failure. Although responses may be accumulated to provide the number of successes and the number of failures, the binary nature of the response still remains. Data involving the relationship between explanatory variables and binary responses abound in just about every discipline from engineering to, the natural sciences, to medicine, to education, etc. How does one model the relationship between explanatory variables and a binary response variable? This chapter looks at binary response data and its analysis via logistic regression.
    [Show full text]
  • Skedastic: Heteroskedasticity Diagnostics for Linear Regression
    Package ‘skedastic’ June 14, 2021 Type Package Title Heteroskedasticity Diagnostics for Linear Regression Models Version 1.0.3 Description Implements numerous methods for detecting heteroskedasticity (sometimes called heteroscedasticity) in the classical linear regression model. These include a test based on Anscombe (1961) <https://projecteuclid.org/euclid.bsmsp/1200512155>, Ramsey's (1969) BAMSET Test <doi:10.1111/j.2517-6161.1969.tb00796.x>, the tests of Bickel (1978) <doi:10.1214/aos/1176344124>, Breusch and Pagan (1979) <doi:10.2307/1911963> with and without the modification proposed by Koenker (1981) <doi:10.1016/0304-4076(81)90062-2>, Carapeto and Holt (2003) <doi:10.1080/0266476022000018475>, Cook and Weisberg (1983) <doi:10.1093/biomet/70.1.1> (including their graphical methods), Diblasi and Bowman (1997) <doi:10.1016/S0167-7152(96)00115-0>, Dufour, Khalaf, Bernard, and Genest (2004) <doi:10.1016/j.jeconom.2003.10.024>, Evans and King (1985) <doi:10.1016/0304-4076(85)90085-5> and Evans and King (1988) <doi:10.1016/0304-4076(88)90006-1>, Glejser (1969) <doi:10.1080/01621459.1969.10500976> as formulated by Mittelhammer, Judge and Miller (2000, ISBN: 0-521-62394-4), Godfrey and Orme (1999) <doi:10.1080/07474939908800438>, Goldfeld and Quandt (1965) <doi:10.1080/01621459.1965.10480811>, Harrison and McCabe (1979) <doi:10.1080/01621459.1979.10482544>, Harvey (1976) <doi:10.2307/1913974>, Honda (1989) <doi:10.1111/j.2517-6161.1989.tb01749.x>, Horn (1981) <doi:10.1080/03610928108828074>, Li and Yao (2019) <doi:10.1016/j.ecosta.2018.01.001>
    [Show full text]
  • Statistical Analysis of Corpus Data with R a Short Introduction to Regression and Linear Models
    Statistical Analysis of Corpus Data with R A short introduction to regression and linear models Designed by Marco Baroni1 and Stefan Evert2 1Center for Mind/Brain Sciences (CIMeC) University of Trento 2Institute of Cognitive Science (IKW) University of Onsabr¨uck Baroni & Evert (Trento/Osnabr¨uck) SIGIL: Linear Models 1 / 15 Outline Outline 1 Regression Simple linear regression General linear regression 2 Linear statistical models A statistical model of linear regression Statistical inference 3 Generalised linear models Baroni & Evert (Trento/Osnabr¨uck) SIGIL: Linear Models 2 / 15 Least-squares regression minimizes prediction error n X 2 Q = yi − (β0 + β1xi ) i=1 for data points (x1; y1);:::; (xn; yn) Regression Simple linear regression Linear regression Can random variable Y be predicted from r. v. X ? + focus on linear relationship between variables Linear predictor: Y ≈ β0 + β1 · X I β0 = intercept of regression line I β1 = slope of regression line Baroni & Evert (Trento/Osnabr¨uck) SIGIL: Linear Models 3 / 15 Regression Simple linear regression Linear regression Can random variable Y be predicted from r. v. X ? + focus on linear relationship between variables Linear predictor: Y ≈ β0 + β1 · X I β0 = intercept of regression line I β1 = slope of regression line Least-squares regression minimizes prediction error n X 2 Q = yi − (β0 + β1xi ) i=1 for data points (x1; y1);:::; (xn; yn) Baroni & Evert (Trento/Osnabr¨uck) SIGIL: Linear Models 3 / 15 Mathematical derivation of regression coefficients I minimum of Q(β0; β1) satisfies @Q=@β0 =
    [Show full text]
  • Chapter 10 Heteroskedasticity
    Chapter 10 Heteroskedasticity In the multiple regression model yX , it is assumed that VI() 2 , i.e., 22 Var()i , Cov(ij ) 0, i j 1, 2,..., n . In this case, the diagonal elements of the covariance matrix of are the same indicating that the variance of each i is same and off-diagonal elements of the covariance matrix of are zero indicating that all disturbances are pairwise uncorrelated. This property of constancy of variance is termed as homoskedasticity and disturbances are called as homoskedastic disturbances. In many situations, this assumption may not be plausible, and the variances may not remain the same. The disturbances whose variances are not constant across the observations are called heteroskedastic disturbance, and this property is termed as heteroskedasticity. In this case 2 Var()ii , i 1,2,..., n and disturbances are pairwise uncorrelated. The covariance matrix of disturbances is 2 00 1 00 2 Vdiag( ) (22 , ,..., 2 )2 . 12 n 2 00 n Regression Analysis | Chapter 10 | Heteroskedasticity | Shalabh, IIT Kanpur 1 Graphically, the following pictures depict homoskedasticity and heteroskedasticity. Homoskedasticity Heteroskedasticity (Var(y) increases with x) Heteroskedasticity (Var(y) decreases with x) Examples: Suppose in a simple linear regression model, x denote the income and y denotes the expenditure on food. It is observed that as the income increases, the expenditure on food increases because of the choice and varieties in food increase, in general, up to a certain extent. So the variance of observations on y will not remain constant as income changes. The assumption of homoscedasticity implies that the consumption pattern of food will remain the same irrespective of the income of the person.
    [Show full text]