Chapter 11: Unusual and Influential Data
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
LSAR: Efficient Leverage Score Sampling Algorithm for the Analysis
LSAR: Efficient Leverage Score Sampling Algorithm for the Analysis of Big Time Series Data Ali Eshragh∗ Fred Roostay Asef Nazariz Michael W. Mahoneyx December 30, 2019 Abstract We apply methods from randomized numerical linear algebra (RandNLA) to de- velop improved algorithms for the analysis of large-scale time series data. We first develop a new fast algorithm to estimate the leverage scores of an autoregressive (AR) model in big data regimes. We show that the accuracy of approximations lies within (1 + O (")) of the true leverage scores with high probability. These theoretical results are subsequently exploited to develop an efficient algorithm, called LSAR, for fitting an appropriate AR model to big time series data. Our proposed algorithm is guaranteed, with high probability, to find the maximum likelihood estimates of the parameters of the underlying true AR model and has a worst case running time that significantly im- proves those of the state-of-the-art alternatives in big data regimes. Empirical results on large-scale synthetic as well as real data highly support the theoretical results and reveal the efficacy of this new approach. 1 Introduction A time series is a collection of random variables indexed according to the order in which they are observed in time. The main objective of time series analysis is to develop a statistical model to forecast the future behavior of the system. At a high level, the main approaches for this include the ones based on considering the data in its original time domain and arXiv:1911.12321v2 [stat.ME] 26 Dec 2019 those arising from analyzing the data in the corresponding frequency domain [31, Chapter 1]. -
Residuals, Part II
Biostatistics 650 Mon, November 5 2001 Residuals, Part II Key terms External Studentization Outliers Added Variable Plot — Partial Regression Plot Partial Residual Plot — Component Plus Residual Plot Key ideas/results ¢ 1. An external estimate of ¡ comes from refitting the model without observation . Amazingly, it has an easy formula: £ ¦!¦ ¦ £¥¤§¦©¨ "# 2. Externally Studentized Residuals $ ¦ ¦©¨ ¦!¦)( £¥¤§¦&% ' Ordinary residuals standardized with £*¤§¦ . Also known as R-Student. 3. Residual Taxonomy Names Definition Distribution ¦!¦ ¦+¨-,¥¦ ,0¦ 1 Ordinary ¡ 5 /. &243 ¦768¤§¦©¨ ¦ ¦!¦ 1 ¦!¦ PRESS ¡ &243 9' 5 $ Studentized ¦©¨ ¦ £ ¦!¦ ¤0> % = Internally Studentized : ;<; $ $ Externally Studentized ¦+¨ ¦ £?¤§¦ ¦!¦ ¤0>¤A@ % R-Student = 4. Outliers are unusually large observations, due to an unmodeled shift or an (unmodeled) increase in variance. 5. Outliers are not necessarily bad points; they simply are not consistent with your model. They may posses valuable information about the inadequacies of your model. 1 PRESS Residuals & Studentized Residuals Recall that the PRESS residual has a easy computation form ¦ ¦©¨ PRESS ¦!¦ ( ¦!¦ It’s easy to show that this has variance ¡ , and hence a standardized PRESS residual is 9 ¦ ¦ ¦!¦ ¦ PRESS ¨ ¨ ¨ ¦ : £ ¦!¦ £ ¦!¦ £ ¦ ¦ % % % ' When we standardize a PRESS residual we get the studentized residual! This is very informa- ,¥¦ ¦ tive. We understand the PRESS residual to be the residual at ¦ if we had omitted from the 3 model. However, after adjusting for it’s variance, we get the same thing as a studentized residual. Hence the standardized residual can be interpreted as a standardized PRESS residual. Internal vs External Studentization ,*¦ ¦ The PRESS residuals remove the impact of point ¦ on the fit at . But the studentized 3 ¢ ¦ ¨ ¦ £?% ¦!¦ £ residual : can be corrupted by point by way of ; a large outlier will inflate the residual mean square, and hence £ . -
Testing for Uncorrelated Residuals in Dynamic Count Models with an Application to Corporate Bankruptcy
Munich Personal RePEc Archive Testing for Uncorrelated Residuals in Dynamic Count Models with an Application to Corporate Bankruptcy Sant’Anna, Pedro H. C. Universidad Carlos III de Madrid May 2013 Online at https://mpra.ub.uni-muenchen.de/48429/ MPRA Paper No. 48429, posted 19 Jul 2013 13:15 UTC Testing for Uncorrelated Residuals in Dynamic Count Models with an Application to Corporate Bankruptcy Pedro H. C. Sant’Anna Universidad Carlos III de Madrid, Getafe, 28903, Spain [email protected] Abstract This article proposes a new diagnostic test for dynamic count models, which is well suited for risk management. Our test proposal is of the Portmanteau-type test for lack of residual autocorrelation. Unlike previous proposals, the resulting test statistic is asymptotically pivotal when innovations are uncorrelated, but not necessarily iid nor a martingale difference. Moreover, the proposed test is able to detect local alter- natives converging to the null at the parametric rate T −1/2, with T the sample size. The finite sample performance of the test statistic is examined by means of a Monte Carlo experiment. Finally, using a dataset on U.S. corporate bankruptcies, we apply our test proposal to check if common risk models are correctly specified. Keywords: Time Series of counts; Residual autocorrelation function; Model checking; Credit risk management. 1 1. Introduction Credit risk affects virtually every financial contract. Therefore the measurement, pricing and management of credit risk have received much attention from economists, bank supervisors and regulators, and financial market practitioners. A widely used measure of credit risk is the probability of corporate default (PD). -
Research Paper Commerce Statistics Inference in Nonlinear Regression Model with Heteroscedastic Errors B. Mahaboob Research Scho
Volume-5, Issue-5, May - 2016 • ISSN No 2277 - 8160 IF : 3.62 | IC Value 70.36 Commerce Research Paper Statistics Inference in Nonlinear Regression Model With Heteroscedastic Errors B. Mahaboob Research Scholar, Department of Mathematics Dr. M. Ramesh Senior Data Scientist, Tech Mahindra, Hyderabad. Dr. Sk. Khadar Assistant Professor (Senior), Department of Mathematics, VIT, Vellore, Babu Tamilnadu. Academic Consultant, Department of CSE, SPMVV Engineering P. Srivyshnavi College, Tirupati. Prof. P. B.Sireesha Research Scholar, Department of Statistics, S.V.University, Tirupati. Balasiddamuni Rtd. Professor ABSTRACT An advent of modern computer science has made it possible for the applied mathematician to study the inferential aspects of an increasing number of nonlinear regression models in recent years. The inferential aspects for nonlinear regression models and the error assumptions are usually analogous to those made for linear regression models. The tests for the hypotheses on parameters of nonlinear regression models are usually based on nonlinear least squares estimates and the normal assumptions of error variables. A more common problem with data that are best fit by a nonlinear regression model than with data that can be adequately fit by the linear regression model is the problem of heteroscedasticity of errors. The problem of heteroscedasticity can be studied with reference to nonlinear regression models in a variety of ways. In the present study, an attempt has been made by developing inferential method for heteroscedastic nonlinear regression model. KEYWORDS : Intrinsically Nonlinear model; Violation of assumptions on Errors; Problem of Heteroscedasticity. I. INTRODUCTION In estimating the linear regression models, the least squares method of estimation is often applied to estimate the unknown parameters of the linear regression model and errors are usually assumed to be independent and identically distributed normal random variables with zero mean and unknown constant variance. -
Remedies for Assumption Violations and Multicollinearity Outliers • If the Outlier Is Due to a Data Entry Error, Just Correct the Value
Newsom Psy 522/622 Multiple Regression and Multivariate Quantitative Methods, Winter 2021 1 Remedies for Assumption Violations and Multicollinearity Outliers • If the outlier is due to a data entry error, just correct the value. This is a good reason why raw data should be retained for many years after it is collected as some professional associations recommend. With direct computer entry during interviews, entry errors cannot always be determined for certain. • If the outlier is due to an invalid case because the protocol was not followed or the inclusion criteria was incorrectly applied for that case, the researcher may be able to just eliminate that case. I recommend at least reporting the number of cases excluded and the reason. Consider analyzing the data and/or reporting analyses with and without the deleted case(s). • Transformation of the data might be considered (see overhead on transformations). Square root transformations, for instance, bring outlying cases closer to the others. Transformations, however, can make results difficult to interpret sometimes. • Analyze the data with and without the outlier(s). If the implications of the results do not differ, one could simply state that the analyses were conducted with and without the outlier(s) and that the results do not differ substantially. If they do differ, both results could be presented and a discussion about the potential causes of the outlier(s) could be discussed (e.g., different subgroup of cases, potential moderator effect). • An alternative estimation method could be used, such as least absolute residuals, weighted least squares, bootstrapping, or jackknifing. See discussion of these below. -
A Review on Empirical Likelihood Methods for Regression
Noname manuscript No. (will be inserted by the editor) A Review on Empirical Likelihood Methods for Regression Song Xi Chen · Ingrid Van Keilegom Received: date / Accepted: date Abstract We provide a review on the empirical likelihood method for regression type inference problems. The regression models considered in this review include parametric, semiparametric and nonparametric models. Both missing data and censored data are accommodated. Keywords Censored data; empirical likelihood; missing data; nonparametric regression; parametric regression; semiparametric regression; Wilks’ theorem. 1 Introduction It has been twenty years since Art Owen published his seminal paper (Owen, 1988) that introduces the notion of empirical likelihood (EL). Since then, there has been a rich body of literature on the novel idea of formulating versions of nonparametric likelihood in various settings of statistical inference. There have been two major reviews on the empirical likelihood. The first review was given by Hall and La Scala (1990) in the early years of the EL method, which summarized some key properties of the method. The second one was the book by the inventor of the methodology (Owen, 2001), which provided a comprehensive overview up to that time. The body of empirical likelihood literature is increasing rapidly, and it would be a daunting task to review the entire field in one review paper like this one. We therefore decided to concentrate our review on regression due to its prominence in statistical S.X. Chen Department of Statistics, Iowa State University, Ames, Iowa 50011-1210, USA and Guanghua School of Management, Peking University, China Tel.: 1-515-2942729 Fax: 1-515-2944040 E-mail: [email protected]; [email protected] I. -
Leveraging for Big Data Regression
Focus Article Leveraging for big data regression Ping Ma∗ and Xiaoxiao Sun Rapid advance in science and technology in the past decade brings an extraordinary amount of data, offering researchers an unprecedented opportunity to tackle complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical tools for analyzing super-large dataset are still lacking. One major challenge is that the advance of computing resources still lags far behind the exponential growth of database. To facilitate scientific discoveries using current computing resources, one may use an emerging family of statistical methods, called leveraging. Leveraging methods are designed under a subsampling framework, in which one samples a small proportion of the data (subsample) from the full sample, and then performs intended computations for the full sample using the small subsample as a surrogate. The key of the success of the leveraging methods is to construct nonuniform sampling probabilities so that influential data points are sampled with high probabilities. These methods stand as the very unique development of their type in big data analytics and allow pervasive access to massive amounts of information without resorting to high performance computing and cloud computing. © 2014 Wiley Periodicals, Inc. Howtocitethisarticle: WIREs Comput Stat 2014. doi: 10.1002/wics.1324 Keywords: big data; subsampling; estimation; least squares; leverage INTRODUCTION of FPGA (field-programmable gate array)-based accel- erators are ongoing in both academia and industry apid advance in science and technology in the (for example, with TimeLogic and Convey Comput- past decade brings an extraordinary amount R ers). However, none of these technologies by them- of data that were inaccessible just a decade ago, selves can address the big data problem accurately and offering researchers an unprecedented opportunity to at scale. -
Linear Models with R Julian J.Faraway Downloaded by [University of Toronto] at 16:20 23 May 2014 Texts in Statistical Science
Downloaded by [University of Toronto] at 16:20 23 May 2014 CHAPMAN & HALL/CRC Texts in Statistical Science Series Series Editors Chris Chatfield, University of Bath, UK Martin Tanner, Northwestern University, USA Jim Zidek, University of British Columbia, Canada Analysis of Failure and Survival Data Peter J.Smith The Analysis and Interpretation of Multivariate Data for Social Scientists David J.Bartholomew, Fiona Steele, Irini Moustaki, and Jane Galbraith The Analysis of Time Series—An Introduction, Sixth Edition Chris Chatfield Applied Bayesian Forecasting and Time Series Analysis A.Pole, M.West and J.Harrison Applied Nonparametric Statistical Methods, Third Edition P.Sprent and N.C.Smeeton Applied Statistics—Handbook of GENSTAT Analysis E.J.Snell and H.Simpson Applied Statistics—Principles and Examples D.R.Cox and E.J.Snell Bayes and Empirical Bayes Methods for Data Analysis, Second Edition Bradley P.Carlin and Thomas A.Louis Downloaded by [University of Toronto] at 16:20 23 May 2014 Bayesian Data Analysis, Second Edition Andrew Gelman, John B.Carlin, Hal S.Stern, and Donald B.Rubin Beyond ANOVA—Basics of Applied Statistics R.G.Miller, Jr. Computer-Aided Multivariate Analysis, Third Edition A.A.Afifi and V.A.Clark A Course in Categorical Data Analysis T.Leonard A Course in Large Sample Theory T.S.Ferguson Data Driven Statistical Methods P.Sprent Decision Analysis—A Bayesian Approach J.Q.Smith Elementary Applications of Probability Theory, Second Edition H.C.Tuckwell Elements of Simulation B.J.T.Morgan Epidemiology—Study Design and Data -
Elemental Estimates, Influence, and Algorithmic Leveraging
Leverage in penalized least squares estimation Keith Knight University of Toronto [email protected] Abstract The German mathematician Carl Gustav Jacobi showed in 1841 that least squares estimates can be written as an expected value of elemental estimates with respect to a probability measure on subsets of the predictors; this result is trivial to extend to penalized least squares where the penalty is quadratic. This representation is used to define the leverage of a subset of observations in terms of this probability measure. This definition can be applied to define the influence of the penalty term on the estimates. Applications to the Hodrick-Prescott filter, backfitting, and ridge regression as well as extensions to non-quadratic penalties are considered. 1 Introduction We consider estimation in the linear regression model Y = xT β + ε (i =1, , n) i i i ··· where β is a vector of p unknown parameters. Given observations (x ,y ): i = 1, , n , we define the penalized least squares (PLS) { i i ··· } estimate β as the minimizer of n (y xT β)2 + βT Aβ. (1) i − i i=1 where A is a non-negative definite matrix with A = LT L for some r p matrix L. Note that L × is in general not uniquely specified since for any r r orthogonal matrix O, A =(OL)T (OL). × In the context of this paper, we would typically choose L so that its rows are interpretable in some sense. Examples of PLS estimates include ridge estimates (Hoerl and Kennard, 1970), smoothing splines (Craven and Wahba, 1978), the Hodrick-Prescott filter (Hodrick and Prescott, 1997), and (trivially when A = 0) ordinary least squares (OLS) estimates. -
Leverage (Statistics)
Leverage (statistics) In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation.[1] Modern computer packages for statistical analysis include, as part of their facilities for regression analysis, various quantitative measures for identifying influential observations: among these measures is partial leverage, a measure of how a variable contributes to the leverage of a datum. Contents Linear regression model Definition Bounds on leverage Proof Effect on residual variance Proof Studentized residuals See also References Linear regression model Definition In the linear regression model, the leverage score for thei -th data unit is defined as: the i-th diagonal element of the projection matrix , where is the design matrix. The leverage score is also known as the observation self-sensitivity or self-influence,[2] as shown by where and are the fitted and measured observation, respectively. Bounds on leverage Proof First, note that H is an idempotent matrix: Also, observe that is symmetric. So equating theii element of H to that of H 2, we have and Effect on residual variance If we are in an ordinary least squares setting with fixed X, regression errors and then where (the i th regression residual). In other words, if the model errors are homoscedastic, an observation's leverage score determines the degree of noise in the model's misprediction of that observation. -
Analysis of the Spread of COVID-19 in the USA with a Spatio-Temporal Multivariate Time Series Model
International Journal of Environmental Research and Public Health Article Analysis of the Spread of COVID-19 in the USA with a Spatio-Temporal Multivariate Time Series Model Rongxiang Rui 1 , Maozai Tian 2 , Man-Lai Tang 3,* , George To-Sum Ho 4 and Chun-Ho Wu 4 1 School of Statistics, Renmin University of China, Beijing 100872, China; [email protected] 2 College of Medical Engineering and Technology, Xinjiang Medical University, Ürümqi 830011, China; [email protected] 3 Department of Mathematics, Statistics and Insurance, Hang Seng University of Hong Kong, Hong Kong, China 4 Department of Supply Chain and Information Management, Hang Seng University of Hong Kong, Hong Kong, China; [email protected] (G.T.-S.H.); [email protected] (C.-H.W.) * Correspondence: [email protected] Abstract: With the rapid spread of the pandemic due to the coronavirus disease 2019 (COVID-19), the virus has already led to considerable mortality and morbidity worldwide, as well as having a severe impact on economic development. In this article, we analyze the state-level correlation between COVID-19 risk and weather/climate factors in the USA. For this purpose, we consider a spatio-temporal multivariate time series model under a hierarchical framework, which is especially suitable for envisioning the virus transmission tendency across a geographic area over time. Briefly, our model decomposes the COVID-19 risk into: (i) an autoregressive component that describes the within-state COVID-19 risk effect; (ii) a spatiotemporal component that describes the across-state COVID-19 risk effect; (iii) an exogenous component that includes other factors (e.g., weather/climate) that could envision future epidemic development risk; and (iv) an endemic component that captures the function of time and other predictors mainly for individual states. -
Power Maximization and Size Control in Heteroskedasticity and Autocorrelation Robust Tests with Exponentiated Kernels
POWER MAXIMIZATION AND SIZE CONTROL IN HETEROSKEDASTICITY AND AUTOCORRELATION ROBUST TESTS WITH EXPONENTIATED KERNELS By Yixiao Sun, Peter C. B. Phillips and Sainan Jin January 2010 COWLES FOUNDATION DISCUSSION PAPER NO. 1749 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS YALE UNIVERSITY Box 208281 New Haven, Connecticut 06520-8281 Power Maximization and Size Control in Heteroskedasticity and Autocorrelation Robust Tests with Exponentiated Kernels Yixiao Sun Department of Economics University of California, San Diego Peter C. B. Phillips Yale University, University of Auckland, Singapore Management University & University of Southampton Sainan Jin School of Economics Singapore Management University, August 14, 2009 The authors gratefully acknowledge partial research support from NSF under Grant No. SES- 0752443 (Sun), SES 06-47086 (Phillips), and NSFC under grant No. 70501001 and 70601001 (Jin). ABSTRACT Using the power kernels of Phillips, Sun and Jin (2006, 2007), we examine the large sample asymptotic properties of the t-test for di¤erent choices of power parameter (). We show that the nonstandard …xed- limit distributions of the t-statistic provide more accurate approximations to the …nite sample distributions than the conventional large- limit distribution. We prove that the second-order corrected critical value based on an asymptotic expansion of the nonstandard limit distribution is also second- order correct under the large- asymptotics. As a further contribution, we propose a new practical procedure for selecting the test-optimal power parameter that addresses the central concern of hypothesis testing: the selected power parameter is test-optimal in the sense that it minimizes the type II error while controlling for the type I error. A plug-in procedure for implementing the test-optimal power parameter is suggested.