Leverage (Statistics)

In statistics, and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points are those observations, if any, made at extreme or outlying values of the independent variables, such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation.[1] Modern statistical packages include, among their regression diagnostics, various quantitative measures for identifying influential observations; among these measures is partial leverage, which quantifies how an individual variable contributes to the leverage of a datum.

Linear regression model

Definition

In the linear regression model, the leverage score for the i-th data unit is defined as

    h_{ii} = [H]_{ii},

the i-th diagonal element of the projection (hat) matrix H = X (X^T X)^{-1} X^T, where X is the design matrix. The leverage score is also known as the observation self-sensitivity or self-influence,[2] since

    h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i},

where \hat{y}_i and y_i are the fitted and measured values of the i-th observation, respectively.

Bounds on leverage

    0 \le h_{ii} \le 1.

Proof

First, note that H is an idempotent matrix: H^2 = X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T = X (X^T X)^{-1} X^T = H. It is also symmetric. Equating the ii-th element of H to that of H^2 gives

    h_{ii} = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2.

Since the right-hand side is a sum of squares, h_{ii} \ge 0; and since h_{ii} \ge h_{ii}^2, it follows that h_{ii} \le 1.

Effect on residual variance

If we are in an ordinary least squares setting with fixed design matrix X and homoscedastic, uncorrelated regression errors with \operatorname{Var}(\varepsilon) = \sigma^2 I, then

    \operatorname{Var}(e_i) = (1 - h_{ii}) \sigma^2,

where e_i = y_i - \hat{y}_i is the i-th regression residual. In other words, if the model errors are homoscedastic, an observation's leverage score determines the degree of noise in the model's misprediction of that observation.

Proof

First, note that I - H is idempotent and symmetric. This gives

    e = y - \hat{y} = (I - H) y = (I - H) \varepsilon,

so \operatorname{Var}(e) = (I - H) \operatorname{Var}(\varepsilon) (I - H)^T = \sigma^2 (I - H)^2 = \sigma^2 (I - H). Thus \operatorname{Var}(e_i) = (1 - h_{ii}) \sigma^2.

Studentized residuals

The corresponding studentized residual (the residual adjusted for its observation-specific estimated residual variance) is then

    t_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}},

where \hat{\sigma} is an appropriate estimate of \sigma.

See also

Projection matrix – whose main diagonal entries are the leverages of the observations
Mahalanobis distance – a measure of leverage of a datum
Cook's distance – a measure of changes in regression coefficients when an observation is deleted
DFFITS
Outlier – observations with extreme Y values

References

1. Everitt, B. S. (2002). Cambridge Dictionary of Statistics. Cambridge University Press. ISBN 0-521-81099-X.
2. Cardinali, C. (June 2013). "Data Assimilation: Observation influence diagnostic of a data assimilation system" (PDF). https://www.ecmwf.int/sites/default/files/elibrary/2013/16938-observation-influence-diagnostic-data-assimilation-system.pdf
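As an illustration of these definitions, the following is a minimal R sketch (on simulated data, not taken from the article) that computes the hat matrix explicitly, compares its diagonal with R's built-in hatvalues(), and reproduces the studentized residuals defined above.

set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Hat matrix H = X (X'X)^{-1} X' and its diagonal (the leverage scores)
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)
h <- diag(H)

# Same values from the built-in helper
all.equal(unname(h), unname(hatvalues(fit)))    # TRUE

# Bounds: every leverage lies in [0, 1]; they sum to the number of parameters
range(h)
sum(h)                                          # = ncol(X), i.e. trace(H)

# Studentized residuals t_i = e_i / (sigma_hat * sqrt(1 - h_ii))
e         <- resid(fit)
sigma_hat <- summary(fit)$sigma
t_i       <- e / (sigma_hat * sqrt(1 - h))
all.equal(unname(t_i), unname(rstandard(fit)))  # TRUE (internally studentized)

Because the leverages sum to the number of fitted parameters, p/n is often used as a reference value for an "average" leverage.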
Cook's distance

In statistics, Cook's distance or Cook's D is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis. In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate influential data points that are particularly worth checking for validity, or to indicate regions of the design space where it would be good to be able to obtain more data points. It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.

Definition

Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Cook's distance measures the effect of deleting a given observation. Points with a large Cook's distance are considered to merit closer examination in the analysis.

For the algebraic expression, first define the model

    y = X \beta + \varepsilon,

where \varepsilon is the error term, \beta is the coefficient vector, p is the number of covariates or predictors for each observation, and X is the design matrix including a constant. The least squares estimator is then b = (X^T X)^{-1} X^T y, and consequently the fitted (predicted) values for the mean of y are

    \hat{y} = X b = H y,

where H = X (X^T X)^{-1} X^T is the projection matrix (or hat matrix). The i-th diagonal element of H, h_{ii}, is the leverage of the i-th observation; similarly, the i-th element of the residual vector e = y - \hat{y} is denoted by e_i.

Cook's distance D_i of observation i is defined as the sum of all the changes in the regression model when observation i is removed from it:

    D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p \, s^2},

where \hat{y}_{j(i)} is the fitted response value obtained when excluding observation i, and s^2 is the mean squared error of the regression model. Equivalently, it can be expressed using the leverage:

    D_i = \frac{e_i^2}{p \, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}.

Detecting highly influential observations

There are different opinions regarding what cut-off values to use for spotting highly influential points. Since Cook's distance is in the metric of an F distribution with p and n - p degrees of freedom (as defined for the design matrix X above), the median point, i.e. F_{0.5}(p, n - p), can be used as a cut-off. Since this value is close to 1 for large n, a simple operational guideline of D_i > 1 has been suggested. Note that the Cook's distance measure does not always correctly identify influential observations.

Interpretation

Specifically, D_i can be interpreted as the distance one's estimates move within the confidence ellipsoid that represents a region of plausible values for the parameters. This is shown by an alternative but equivalent representation of Cook's distance in terms of changes to the estimates of the regression parameters between the cases where the particular observation is either included in or excluded from the regression analysis.
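A brief R sketch follows (simulated data; the planted outlier is an illustrative assumption, not content from the article). It uses the built-in cooks.distance() and checks it against the leverage-based expression above.

set.seed(2)
n <- 40
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
x[1] <- 4                       # give one observation high leverage ...
y[1] <- -6                      # ... and a response far from the trend
fit <- lm(y ~ x)

D <- cooks.distance(fit)        # built-in Cook's distance

# Recompute from residuals and leverages: D_i = e_i^2/(p s^2) * h_i/(1-h_i)^2
p  <- length(coef(fit))         # number of estimated parameters (incl. intercept)
s2 <- summary(fit)$sigma^2      # mean squared error of the regression
h  <- hatvalues(fit)
e  <- resid(fit)
D_manual <- e^2 / (p * s2) * h / (1 - h)^2
all.equal(unname(D), unname(D_manual))      # TRUE

head(sort(D, decreasing = TRUE), 3)         # the most influential observations
which(D > 1)                                # points above the D_i > 1 guideline, if any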
Recommended publications
  • LSAR: Efficient Leverage Score Sampling Algorithm for the Analysis of Big Time Series Data
    LSAR: Efficient Leverage Score Sampling Algorithm for the Analysis of Big Time Series Data
    Ali Eshragh, Fred Roosta, Asef Nazari, Michael W. Mahoney. December 30, 2019. arXiv:1911.12321v2 [stat.ME], 26 Dec 2019.
    Abstract. We apply methods from randomized numerical linear algebra (RandNLA) to develop improved algorithms for the analysis of large-scale time series data. We first develop a new fast algorithm to estimate the leverage scores of an autoregressive (AR) model in big data regimes. We show that the accuracy of approximations lies within (1 + O(ε)) of the true leverage scores with high probability. These theoretical results are subsequently exploited to develop an efficient algorithm, called LSAR, for fitting an appropriate AR model to big time series data. Our proposed algorithm is guaranteed, with high probability, to find the maximum likelihood estimates of the parameters of the underlying true AR model and has a worst-case running time that significantly improves those of the state-of-the-art alternatives in big data regimes. Empirical results on large-scale synthetic as well as real data highly support the theoretical results and reveal the efficacy of this new approach.
    1 Introduction
    A time series is a collection of random variables indexed according to the order in which they are observed in time. The main objective of time series analysis is to develop a statistical model to forecast the future behavior of the system. At a high level, the main approaches for this include the ones based on considering the data in its original time domain and those arising from analyzing the data in the corresponding frequency domain [31, Chapter 1].
  • Remedies for Assumption Violations and Multicollinearity
    Newsom, Psy 522/622 Multiple Regression and Multivariate Quantitative Methods, Winter 2021
    Remedies for Assumption Violations and Multicollinearity
    Outliers
    • If the outlier is due to a data entry error, just correct the value. This is a good reason why raw data should be retained for many years after it is collected, as some professional associations recommend. With direct computer entry during interviews, entry errors cannot always be determined for certain.
    • If the outlier is due to an invalid case because the protocol was not followed or the inclusion criteria were incorrectly applied for that case, the researcher may be able to just eliminate that case. I recommend at least reporting the number of cases excluded and the reason. Consider analyzing the data and/or reporting analyses with and without the deleted case(s).
    • Transformation of the data might be considered (see overhead on transformations). Square root transformations, for instance, bring outlying cases closer to the others. Transformations, however, can sometimes make results difficult to interpret.
    • Analyze the data with and without the outlier(s), as in the sketch after this list. If the implications of the results do not differ, one could simply state that the analyses were conducted with and without the outlier(s) and that the results do not differ substantially. If they do differ, both results could be presented and the potential causes of the outlier(s) could be discussed (e.g., different subgroup of cases, potential moderator effect).
    • An alternative estimation method could be used, such as least absolute residuals, weighted least squares, bootstrapping, or jackknifing. See discussion of these below.
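    To make the "with and without" comparison concrete, here is a minimal R sketch on simulated data; the planted outlier and the 4/n screening threshold are illustrative assumptions, not part of the handout.

    set.seed(3)
    n <- 30
    x <- rnorm(n)
    y <- 2 + x + rnorm(n)
    x[n] <- 4                     # plant a single influential point
    y[n] <- -6

    full <- lm(y ~ x)
    keep <- cooks.distance(full) <= 4 / n    # one common screening rule
    reduced <- lm(y ~ x, subset = keep)

    # Report both fits so readers can judge whether the conclusions change
    rbind(with_outlier = coef(full), without_outlier = coef(reduced))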
  • Leveraging for Big Data Regression
    Focus Article: Leveraging for big data regression
    Ping Ma and Xiaoxiao Sun
    Rapid advance in science and technology in the past decade brings an extraordinary amount of data, offering researchers an unprecedented opportunity to tackle complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing resources still lags far behind the exponential growth of databases. To facilitate scientific discoveries using current computing resources, one may use an emerging family of statistical methods, called leveraging. Leveraging methods are designed under a subsampling framework, in which one samples a small proportion of the data (subsample) from the full sample, and then performs intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods is to construct nonuniform sampling probabilities so that influential data points are sampled with high probabilities. These methods stand as the very unique development of their type in big data analytics and allow pervasive access to massive amounts of information without resorting to high performance computing and cloud computing. © 2014 Wiley Periodicals, Inc.
    How to cite this article: WIREs Comput Stat 2014. doi: 10.1002/wics.1324
    Keywords: big data; subsampling; estimation; least squares; leverage
    INTRODUCTION
    Rapid advance in science and technology in the past decade brings an extraordinary amount of data that were inaccessible just a decade ago, offering researchers an unprecedented opportunity to ...
    ... of FPGA (field-programmable gate array)-based accelerators are ongoing in both academia and industry (for example, with TimeLogic and Convey Computers). However, none of these technologies by themselves can address the big data problem accurately and at scale.
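    The subsampling idea described above can be sketched in a few lines of R. This is a generic illustration of leverage-based sampling with inverse-probability reweighting on simulated data; it is not the specific estimator studied in the article.

    set.seed(4)
    n <- 10000
    X <- cbind(1, rnorm(n), rexp(n))            # design matrix with an intercept
    beta <- c(1, 2, -1)
    y <- drop(X %*% beta) + rnorm(n)

    h <- rowSums(qr.Q(qr(X))^2)                 # exact leverage scores (diagonal of the hat matrix)
    pi_i <- h / sum(h)                          # leverage-based sampling probabilities
    m <- 500                                    # subsample size
    idx <- sample(n, m, replace = TRUE, prob = pi_i)
    w <- 1 / (m * pi_i[idx])                    # inverse-probability weights for the sampled rows

    fit_sub  <- lm.wfit(X[idx, ], y[idx], w = w)   # weighted LS on the subsample
    fit_full <- lm.fit(X, y)                       # ordinary LS on the full sample
    rbind(full = fit_full$coefficients, subsample = fit_sub$coefficients)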
  • Elemental Estimates, Influence, and Algorithmic Leveraging
    Leverage in penalized least squares estimation
    Keith Knight, University of Toronto, [email protected]
    Abstract. The German mathematician Carl Gustav Jacobi showed in 1841 that least squares estimates can be written as an expected value of elemental estimates with respect to a probability measure on subsets of the predictors; this result is trivial to extend to penalized least squares where the penalty is quadratic. This representation is used to define the leverage of a subset of observations in terms of this probability measure. This definition can be applied to define the influence of the penalty term on the estimates. Applications to the Hodrick-Prescott filter, backfitting, and ridge regression as well as extensions to non-quadratic penalties are considered.
    1 Introduction
    We consider estimation in the linear regression model
        Y_i = x_i^T \beta + \varepsilon_i   (i = 1, ..., n),
    where \beta is a vector of p unknown parameters. Given observations {(x_i, y_i) : i = 1, ..., n}, we define the penalized least squares (PLS) estimate of \beta as the minimizer of
        \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \beta^T A \beta,   (1)
    where A is a non-negative definite matrix with A = L^T L for some r × p matrix L. Note that L is in general not uniquely specified, since for any r × r orthogonal matrix O, A = (OL)^T (OL). In the context of this paper, we would typically choose L so that its rows are interpretable in some sense. Examples of PLS estimates include ridge estimates (Hoerl and Kennard, 1970), smoothing splines (Craven and Wahba, 1978), the Hodrick-Prescott filter (Hodrick and Prescott, 1997), and (trivially, when A = 0) ordinary least squares (OLS) estimates.
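    For the quadratic-penalty objective in equation (1), a small R sketch (simulated data, with the ridge penalty A = λI as an assumed example) shows the penalized analogue of the hat matrix, whose diagonal entries generalize ordinary leverages. Note that the paper itself defines leverage through elemental estimates, which is a different construction; this is only the standard smoother-matrix view.

    set.seed(5)
    n <- 100; p <- 5
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% rnorm(p)) + rnorm(n)

    lambda <- 2
    A <- lambda * diag(p)                                 # quadratic penalty; ridge corresponds to A = lambda * I

    beta_ols <- solve(crossprod(X), crossprod(X, y))      # ordinary least squares
    beta_pls <- solve(crossprod(X) + A, crossprod(X, y))  # PLS estimate minimizing (1)
    cbind(ols = drop(beta_ols), pls = drop(beta_pls))     # the penalty shrinks the estimates

    H_pen <- X %*% solve(crossprod(X) + A) %*% t(X)       # penalized "hat" (smoother) matrix
    h_pen <- diag(H_pen)                                  # its diagonal: generalized leverages
    sum(h_pen)                                            # effective degrees of freedom, less than p = 5
    range(h_pen)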
  • A Statistical Perspective on Algorithmic Leveraging
    Journal of Machine Learning Research 16 (2015) 861-911. Submitted 1/14; Revised 10/14; Published 4/15.
    A Statistical Perspective on Algorithmic Leveraging
    Ping Ma ([email protected]), Department of Statistics, University of Georgia, Athens, GA 30602
    Michael W. Mahoney ([email protected]), International Computer Science Institute and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720
    Bin Yu ([email protected]), Department of Statistics, University of California at Berkeley, Berkeley, CA 94720
    Editor: Alexander Rakhlin
    Abstract. One popular method for dealing with large-scale data sets is sampling. For example, by using the empirical statistical leverage scores as an importance sampling distribution, the method of algorithmic leveraging samples and rescales rows/columns of data matrices to reduce the data size before performing computations on the subproblem. This method has been successful in improving computational efficiency of algorithms for matrix problems such as least-squares approximation, least absolute deviations approximation, and low-rank matrix approximation. Existing work has focused on algorithmic issues such as worst-case running times and numerical issues associated with providing high-quality implementations, but none of it addresses statistical aspects of this method. In this paper, we provide a simple yet effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model with a fixed number of predictors. In particular, for several versions of leverage-based sampling, we derive results for the bias and variance, both conditional and unconditional on the observed data. We show that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other.
  • Chapter 10 NOTES Model Building – II: Diagnostics
    STAT 571A — Advanced Statistical Regression Analysis. Chapter 10 NOTES: Model Building – II: Diagnostics.
    © 2018 University of Arizona Statistics GIDP. All rights reserved, except where previous rights exist. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means — electronic, online, mechanical, photoreproduction, recording, or scanning — without the prior written consent of the course instructor.
    §10.1: Added-Variable Plots
    • Added-variable plots visualize the potential value of adding a new Xk-variable to an existing MLR model.
    • Find the residuals from the existing fit of Y on the Xk variables (k = 2, ..., p–1); call these ei(Y|X2,...,Xp–1).
    • If the new variable is X1, regress X1 on the Xk variables (k = 2, ..., p–1); find the residuals ei(X1|X2,...,Xp–1).
    • Plot ei(Y|X2,...,Xp–1) against ei(X1|X2,...,Xp–1) and look for patterns.
    Versions of added-variable plots (from Fig. 10.1): (a) nothing new in X1; (b) additional linear term in X1; (c) additional curvilinear term in X1.
    Example: Life Insurance data (CH10TA01). Y = life insurance carried, X1 = risk aversion score, X2 = annual income. The existing model has the single predictor X2; build an added-variable plot for the new variable X1. For this simple case, program it directly in R:
    > plot( resid(lm(Y~X2)) ~ resid(lm(X1~X2)) )
    > abline( lm(resid(lm(Y~X2)) ~ resid(lm(X1~X2))) )
    More generally, use avPlot() from the car package:
    > library( car )
    > avPlot( model=lm( Y~X1+X2 ), variable=X1 )
  • Calculating Standard Errors for Linear Regression in Randomized Evaluations
    Calculating Standard Errors for Linear Regression in Randomized Evaluations
    The Office of Evaluation Sciences often analyzes the results of RCTs by estimating a parametric statistical model — typically an ordinary least squares (OLS) regression — where one of the parameters represents the effect of a treatment or intervention. In order to infer whether this treatment effect is statistically significant, we typically depend on an estimate of the standard error associated with this parameter. This document describes OES's preferred method for calculating parametric standard errors for OLS regressions — in particular, the reasons for using HC2 standard errors — and how to calculate them in R and Stata.
    When to Apply This Guidance
    One reason for preferring HC2 standard errors is that they tend to represent variation resulting from random assignment of a set of units to different experimental conditions (for example, treatment and control). This guidance thus applies most directly to RCTs. HC2 standard errors may or may not be the most appropriate for other types of analyses, including quasi-experimental and other observational analyses (which we often conduct to establish background conditions during the design phase of a project).
    Why Do We Prefer HC2 Standard Errors?
    The usual standard errors calculated for OLS regression are not grounded in randomization. Since we tend to analyze randomized experiments with fixed samples, we prefer standard errors that are design-based — that reflect the variation we expect due to assigning units from a fixed sample to one or more experimental conditions. The usual standard errors in OLS regression are also not consistent under heteroskedasticity. That is, when the variance in the outcome changes (with treatment and/or with covariates), larger samples do not lead the usual estimates of the standard error to collapse to a spike at the true value of the standard error.
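    The excerpt is cut off before the code, so here is a generic R sketch of one common way to obtain HC2 standard errors, assuming the sandwich and lmtest packages are available; this is an illustration, not necessarily the exact code the guidance document recommends.

    # install.packages(c("sandwich", "lmtest"))   # if not already installed
    library(sandwich)
    library(lmtest)

    set.seed(6)
    n <- 200
    treat <- rbinom(n, 1, 0.5)                        # random assignment to treatment/control
    y <- 1 + 0.3 * treat + rnorm(n, sd = 1 + treat)   # heteroskedastic outcome

    fit <- lm(y ~ treat)

    # Conventional (IID) standard errors vs. HC2 heteroskedasticity-robust ones
    coeftest(fit)
    coeftest(fit, vcov = vcovHC(fit, type = "HC2"))
    # (The estimatr package's lm_robust() defaults to HC2-type standard errors as well.)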
  • Tests for Leverage and Influential Points
    LINEAR REGRESSION ANALYSIS, Module VI, Lecture 24: Tests for Leverage and Influential Points
    Dr. Shalabh, Department of Mathematics and Statistics, Indian Institute of Technology Kanpur
    2. DFFITS and DFBETAS
    Cook's distance measure is a deletion diagnostic, i.e., it measures the influence of the i-th observation if it is removed from the sample. There are two more such statistics:
    (i) DFBETAS, which indicates how much a regression coefficient changes if the i-th observation were deleted. Such change is measured in terms of standard deviation units. This statistic is
        DFBETAS_{j,i} = (b_j - b_{j(i)}) / \sqrt{S^2_{(i)} C_{jj}},
    where C_{jj} is the j-th diagonal element of (X'X)^{-1} and b_{j(i)} is the j-th regression coefficient computed without use of the i-th observation. A large (in magnitude) value of DFBETAS_{j,i} indicates that the i-th observation has considerable influence on the j-th regression coefficient.
    i. The values of DFBETAS_{j,i} can be expressed in an n x k matrix that conveys similar information to the composite influence information in Cook's distance measure.
    ii. The n elements in the j-th row of R produce the leverage that the n observations in the sample have on \hat{\beta}_j.
    DFBETAS_{j,i} is the j-th element of (b - b_{(i)}) divided by a standardization factor:
        b - b_{(i)} = (X'X)^{-1} x_i' e_i / (1 - h_{ii}).
    The j-th element of (b - b_{(i)}) can be expressed as
        b_j - b_{j(i)} = r_{j,i} e_i / (1 - h_{ii}),
    where r_{j,i} denotes the (j, i) element of R = (X'X)^{-1} X'. Since
        R R' = (X'X)^{-1} X' X (X'X)^{-1} = (X'X)^{-1} = C,
    we have C_{jj} = r_j' r_j, so S^2_{(i)} C_{jj} = S^2_{(i)} r_j' r_j and
        DFBETAS_{j,i} = (b_j - b_{j(i)}) / \sqrt{S^2_{(i)} C_{jj}} = [ r_{j,i} / \sqrt{r_j' r_j} ] \cdot [ t_i / \sqrt{1 - h_{ii}} ],
    where t_i is the i-th R-student residual. The first factor measures the leverage effect of the i-th observation, and the second measures the impact of a large residual.
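    Both statistics are available in base R. The following sketch uses simulated data and conventional size-adjusted cut-offs; these choices are illustrative assumptions, not content of the lecture.

    set.seed(7)
    n <- 50
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 1 + x1 - 0.5 * x2 + rnorm(n)
    y[3] <- y[3] + 5                      # perturb one response

    fit <- lm(y ~ x1 + x2)

    dfb <- dfbetas(fit)                   # n x k matrix: standardized change in each coefficient
    dff <- dffits(fit)                    # standardized change in the i-th fitted value

    # Flag observations whose deletion moves any coefficient by a large amount;
    # common size-adjusted cut-offs are 2/sqrt(n) for DFBETAS and 2*sqrt(p/n) for DFFITS.
    which(apply(abs(dfb), 1, max) > 2 / sqrt(n))
    which(abs(dff) > 2 * sqrt(length(coef(fit)) / n))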
  • Outliers, Leverage, and Influence
    Outliers, Leverage, and Influence
    James H. Steiger, Department of Psychology and Human Development, Vanderbilt University
    Outline:
    1 Introduction
    2 Significance Tests for Outliers and Influential Cases (An Outlier Test; A Significance Test for Influence)
    3 Problems with Multiple Outliers (Masking; Swamping)
    4 What Should We Do with Outliers? (Contaminated Observations; Rare Cases; An Incorrect Model)
    5 Remedial Actions
    Introduction
    After an initial fit of a model (including possible nonlinear transformations of X and/or Y), it may become clear that certain observations are unusual in the sense that they are extremely atypical of X and/or Y, either in the univariate or bivariate sense. In this module, we develop and discuss some of the common techniques and related terminology connected with the identification and evaluation of unusual observations.
    In our examination of diagnostics based on residuals or possibly rescaled residuals, we begin by recalling that our basic model includes a model for the regression errors and, in a sense, the residuals themselves. If the fitted model does not give a set of residuals that appear to be reasonably in agreement with the model for those residuals, then we must question the model and/or its assumptions.
    A related issue is the importance of each case on estimation and other aspects of the analysis.
  • Unusual and Influential Data
    Sociology 740, John Fox, Lecture Notes 6: Unusual and Influential Data. Copyright © 2014 by John Fox.
    1. Introduction
    • Linear statistical models make strong assumptions about the structure of data, which often do not hold in applications.
    • The method of least squares is very sensitive to the structure of the data, and can be markedly influenced by one or a few unusual observations.
    • We could abandon linear models and least-squares estimation in favor of nonparametric regression and robust estimation.
    • Alternatively, we can adapt and extend methods for examining and transforming data to diagnose problems with a linear model, and to suggest solutions.
    2. Goals
    • To distinguish among regression outliers, high-leverage observations, and influential observations.
    • To show how outlyingness, leverage, and influence can be measured.
    • To introduce added-variable ('partial-regression') plots as a means of displaying leverage and influence on particular coefficients.
    3. Outliers, Leverage, and Influence
    • Unusual data are problematic in linear models fit by least squares because they can unduly influence the results of the analysis, and because their presence may be a signal that the model fails to capture important characteristics of the data.
    • Some central distinctions are illustrated in Figure 1 for the simple regression model Y = α + βX + ε.
    • In simple regression, an outlier is an observation whose response-variable value is conditionally unusual given the value of the explanatory variable.
  • Outliers, Durbin-Watson and Interactions for Regression in SPSS Dependent Variable: Continuous (Scale/Interval/Ratio) Independent Variables: Continuous/ Binary
    Further regression in SPSS. statstutor Community Project, stcp-marshall-furtherRegression.
    The following resources are associated: Simple and Multiple linear regression in SPSS, and the SPSS dataset 'Birthweight_reduced.sav'.
    Outliers, Durbin-Watson and interactions for regression in SPSS
    Dependent variable: Continuous (scale/interval/ratio). Independent variables: Continuous/binary.
    Data: The data set 'Birthweight reduced.sav' contains details of 42 babies and their parents at birth. The dependent variable is birthweight (pounds = lbs) and the two independent variables are the gestational age of the baby at birth (in weeks) and whether or not the mother smokes (0 = non-smoker, 1 = smoker).
    Investigating outliers and influential observations
    An assumption of regression is that there are no influential observations. These are extreme values which pull the regression line towards them, therefore having a significant impact on the coefficients of the model.
    Outliers: Outliers are observations where the observed dependent value does not follow the general trend given the independent value (unusual y given x). In this situation, the residual for that observation is likely to be large unless it is also influential and has pulled the line towards it. Approximately 5% of standardised residuals will be outside ±1.96, but extreme outliers will be outside ±3. Deleted residuals are the residuals obtained if the regression was repeated without the individual observation. (Figure: an outlier without leverage changes the intercept only; fitted line y = -6.05 + 0.33x.)
    Leverage: An observation with high leverage will pull the regression line towards it. The average leverage score is calculated as (k + 1)/n, where k is the number of independent variables in the model and n is the number of observations.
  • Studentized Residuals and Jackknife Residuals (Chapter 7: Diagnostics)
    Chapter 7: Diagnostics
    Regression model building is often an iterative and interactive process. The first model we try may prove to be inadequate. Regression diagnostics are used to detect problems with the model and suggest improvements. This is a hands-on process.
    7.1 Residuals and Leverage
    We start with some basic diagnostic quantities: the residuals and the leverages. Recall that
        \hat{y} = X (X^T X)^{-1} X^T y = H y,
    where H is the hat matrix. Now
        \hat{\varepsilon} = y - \hat{y} = (I - H) y = (I - H) X \beta + (I - H) \varepsilon = (I - H) \varepsilon,
    so \operatorname{var}(\hat{\varepsilon}) = \operatorname{var}((I - H) \varepsilon) = (I - H) \sigma^2, assuming that \operatorname{var}(\varepsilon) = \sigma^2 I. We see that although the errors may have equal variance and be uncorrelated, the residuals do not.
    The h_i = H_{ii} are called leverages and are useful diagnostics. We see that \operatorname{var}(\hat{\varepsilon}_i) = \sigma^2 (1 - h_i), so that a large leverage h_i will make \operatorname{var}(\hat{\varepsilon}_i) small; in other words, the fit will be "forced" to be close to y_i. The h_i depend only on X; knowledge of y is required for a full interpretation. Some facts:
        \sum_i h_i = p,    h_i \ge 1/n.
    An average value for h_i is p/n, and a rule of thumb is that leverages of more than 2p/n should be looked at more closely. Large values of h_i are due to extreme values in X. h_i corresponds to a Mahalanobis distance defined by X, which is (x - \bar{x})^T \hat{\Sigma}^{-1} (x - \bar{x}), where \hat{\Sigma} is the estimated covariance of X.
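    A short R sketch follows (simulated data; the two-predictor setup is an illustrative assumption) verifying the facts above: the leverages sum to p, the 2p/n rule flags unusual design points, and with an intercept each leverage is an affine function of the squared Mahalanobis distance of the predictor values.

    set.seed(8)
    n <- 100
    Z <- cbind(rnorm(n), rnorm(n))               # two predictors
    y <- drop(1 + Z %*% c(1, -1) + rnorm(n))
    fit <- lm(y ~ Z)

    h <- hatvalues(fit)
    p <- length(coef(fit))                       # number of parameters, here 3

    sum(h)                                       # equals p
    mean(h)                                      # average leverage p/n
    which(h > 2 * p / n)                         # the 2p/n rule of thumb

    # With an intercept, h_i = 1/n + d_i^2 / (n - 1), where d_i^2 is the squared
    # Mahalanobis distance computed from the sample mean and covariance of Z.
    d2 <- mahalanobis(Z, colMeans(Z), cov(Z))
    all.equal(unname(h), 1 / n + d2 / (n - 1))   # TRUE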