Comparing Tests of Homoscedasticity in Simple Linear Regression
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Understanding Linear and Logistic Regression Analyses
EDUCATION • ÉDUCATION METHODOLOGY Understanding linear and logistic regression analyses Andrew Worster, MD, MSc;*† Jerome Fan, MD;* Afisi Ismaila, MSc† SEE RELATED ARTICLE PAGE 105 egression analysis, also termed regression modeling, come). For example, a researcher could evaluate the poten- Ris an increasingly common statistical method used to tial for injury severity score (ISS) to predict ED length-of- describe and quantify the relation between a clinical out- stay by first producing a scatter plot of ISS graphed against come of interest and one or more other variables. In this ED length-of-stay to determine whether an apparent linear issue of CJEM, Cummings and Mayes used linear and lo- relation exists, and then by deriving the best fit straight line gistic regression to determine whether the type of trauma for the data set using linear regression carried out by statis- team leader (TTL) impacts emergency department (ED) tical software. The mathematical formula for this relation length-of-stay or survival.1 The purpose of this educa- would be: ED length-of-stay = k(ISS) + c. In this equation, tional primer is to provide an easily understood overview k (the slope of the line) indicates the factor by which of these methods of statistical analysis. We hope that this length-of-stay changes as ISS changes and c (the “con- primer will not only help readers interpret the Cummings stant”) is the value of length-of-stay when ISS equals zero and Mayes study, but also other research that uses similar and crosses the vertical axis.2 In this hypothetical scenario, methodology. -
Chapter 4: Fisher's Exact Test in Completely Randomized Experiments
1 Chapter 4: Fisher’s Exact Test in Completely Randomized Experiments Fisher (1925, 1926) was concerned with testing hypotheses regarding the effect of treat- ments. Specifically, he focused on testing sharp null hypotheses, that is, null hypotheses under which all potential outcomes are known exactly. Under such null hypotheses all un- known quantities in Table 4 in Chapter 1 are known–there are no missing data anymore. As we shall see, this implies that we can figure out the distribution of any statistic generated by the randomization. Fisher’s great insight concerns the value of the physical randomization of the treatments for inference. Fisher’s classic example is that of the tea-drinking lady: “A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. ... Our experi- ment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject in random order. ... Her task is to divide the cups into two sets of 4, agreeing, if possible, with the treatments received. ... The element in the experimental procedure which contains the essential safeguard is that the two modifications of the test beverage are to be prepared “in random order.” This is in fact the only point in the experimental procedure in which the laws of chance, which are to be in exclusive control of our frequency distribution, have been explicitly introduced. ... it may be said that the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged.” The approach is clear: an experiment is designed to evaluate the lady’s claim to be able to discriminate wether the milk or tea was first poured into the cup. -
The Detection of Heteroscedasticity in Regression Models for Psychological Data
Psychological Test and Assessment Modeling, Volume 58, 2016 (4), 567-592 The detection of heteroscedasticity in regression models for psychological data Andreas G. Klein1, Carla Gerhard2, Rebecca D. Büchner2, Stefan Diestel3 & Karin Schermelleh-Engel2 Abstract One assumption of multiple regression analysis is homoscedasticity of errors. Heteroscedasticity, as often found in psychological or behavioral data, may result from misspecification due to overlooked nonlinear predictor terms or to unobserved predictors not included in the model. Although methods exist to test for heteroscedasticity, they require a parametric model for specifying the structure of heteroscedasticity. The aim of this article is to propose a simple measure of heteroscedasticity, which does not need a parametric model and is able to detect omitted nonlinear terms. This measure utilizes the dispersion of the squared regression residuals. Simulation studies show that the measure performs satisfactorily with regard to Type I error rates and power when sample size and effect size are large enough. It outperforms the Breusch-Pagan test when a nonlinear term is omitted in the analysis model. We also demonstrate the performance of the measure using a data set from industrial psychology. Keywords: Heteroscedasticity, Monte Carlo study, regression, interaction effect, quadratic effect 1Correspondence concerning this article should be addressed to: Prof. Dr. Andreas G. Klein, Department of Psychology, Goethe University Frankfurt, Theodor-W.-Adorno-Platz 6, 60629 Frankfurt; email: [email protected] 2Goethe University Frankfurt 3International School of Management Dortmund Leipniz-Research Centre for Working Environment and Human Factors 568 A. G. Klein, C. Gerhard, R. D. Büchner, S. Diestel & K. Schermelleh-Engel Introduction One of the standard assumptions underlying a linear model is that the errors are inde- pendently identically distributed (i.i.d.). -
Pearson-Fisher Chi-Square Statistic Revisited
Information 2011 , 2, 528-545; doi:10.3390/info2030528 OPEN ACCESS information ISSN 2078-2489 www.mdpi.com/journal/information Communication Pearson-Fisher Chi-Square Statistic Revisited Sorana D. Bolboac ă 1, Lorentz Jäntschi 2,*, Adriana F. Sestra ş 2,3 , Radu E. Sestra ş 2 and Doru C. Pamfil 2 1 “Iuliu Ha ţieganu” University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, Cluj-Napoca 400349, Romania; E-Mail: [email protected] 2 University of Agricultural Sciences and Veterinary Medicine Cluj-Napoca, 3-5 M ănăş tur, Cluj-Napoca 400372, Romania; E-Mails: [email protected] (A.F.S.); [email protected] (R.E.S.); [email protected] (D.C.P.) 3 Fruit Research Station, 3-5 Horticultorilor, Cluj-Napoca 400454, Romania * Author to whom correspondence should be addressed; E-Mail: [email protected]; Tel: +4-0264-401-775; Fax: +4-0264-401-768. Received: 22 July 2011; in revised form: 20 August 2011 / Accepted: 8 September 2011 / Published: 15 September 2011 Abstract: The Chi-Square test (χ2 test) is a family of tests based on a series of assumptions and is frequently used in the statistical analysis of experimental data. The aim of our paper was to present solutions to common problems when applying the Chi-square tests for testing goodness-of-fit, homogeneity and independence. The main characteristics of these three tests are presented along with various problems related to their application. The main problems identified in the application of the goodness-of-fit test were as follows: defining the frequency classes, calculating the X2 statistic, and applying the χ2 test. -
Assessing Heterogeneity in Meta-Analysis: Q Statistic Or I2 Index? Tania Huedo-Medina University of Connecticut, [email protected]
University of Connecticut OpenCommons@UConn Center for Health, Intervention, and Prevention CHIP Documents (CHIP) 6-1-2006 Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Tania Huedo-Medina University of Connecticut, [email protected] Julio Sanchez-Meca University of Murcia, Spane Fulgencio Marin-Martinez University of Murcia, Spain Juan Botella Autonoma University of Madrid, Spain Follow this and additional works at: https://opencommons.uconn.edu/chip_docs Part of the Psychology Commons Recommended Citation Huedo-Medina, Tania; Sanchez-Meca, Julio; Marin-Martinez, Fulgencio; and Botella, Juan, "Assessing heterogeneity in meta-analysis: Q statistic or I2 index?" (2006). CHIP Documents. 19. https://opencommons.uconn.edu/chip_docs/19 ASSESSING HETEROGENEITY IN META -ANALYSIS: Q STATISTIC OR I2 INDEX? Tania B. Huedo -Medina, 1 Julio Sánchez -Meca, 1 Fulgencio Marín -Martínez, 1 and Juan Botella 2 Running head: Assessing heterogeneity in meta -analysis 2006 1 University of Murcia, Spa in 2 Autónoma University of Madrid, Spain Address for correspondence: Tania B. Huedo -Medina Dept. of Basic Psychology & Methodology, Faculty of Psychology, Espinardo Campus, Murcia, Spain Phone: + 34 968 364279 Fax: + 34 968 364115 E-mail: [email protected] * This work has been supported by Plan Nacional de Investigaci ón Cient ífica, Desarrollo e Innovaci ón Tecnol ógica 2004 -07 from the Ministerio de Educaci ón y Ciencia and by funds from the Fondo Europeo de Desarrollo Regional, FEDER (Proyect Number: S EJ2004 -07278/PSIC). Assessing heterogeneity in meta -analysis 2 ASSESSING HETEROGENEITY IN META -ANALYSIS: Q STATISTIC OR I2 INDEX? Abstract In meta -analysis, the usual way of assessing whether a set of single studies are homogeneous is by means of the Q test. -
Generalized Linear Models
Generalized Linear Models A generalized linear model (GLM) consists of three parts. i) The first part is a random variable giving the conditional distribution of a response Yi given the values of a set of covariates Xij. In the original work on GLM’sby Nelder and Wedderburn (1972) this random variable was a member of an exponential family, but later work has extended beyond this class of random variables. ii) The second part is a linear predictor, i = + 1Xi1 + 2Xi2 + + ··· kXik . iii) The third part is a smooth and invertible link function g(.) which transforms the expected value of the response variable, i = E(Yi) , and is equal to the linear predictor: g(i) = i = + 1Xi1 + 2Xi2 + + kXik. ··· As shown in Tables 15.1 and 15.2, both the general linear model that we have studied extensively and the logistic regression model from Chapter 14 are special cases of this model. One property of members of the exponential family of distributions is that the conditional variance of the response is a function of its mean, (), and possibly a dispersion parameter . The expressions for the variance functions for common members of the exponential family are shown in Table 15.2. Also, for each distribution there is a so-called canonical link function, which simplifies some of the GLM calculations, which is also shown in Table 15.2. Estimation and Testing for GLMs Parameter estimation in GLMs is conducted by the method of maximum likelihood. As with logistic regression models from the last chapter, the generalization of the residual sums of squares from the general linear model is the residual deviance, Dm 2(log Ls log Lm), where Lm is the maximized likelihood for the model of interest, and Ls is the maximized likelihood for a saturated model, which has one parameter per observation and fits the data as well as possible. -
Application of General Linear Models (GLM) to Assess Nodule Abundance Based on a Photographic Survey (Case Study from IOM Area, Pacific Ocean)
minerals Article Application of General Linear Models (GLM) to Assess Nodule Abundance Based on a Photographic Survey (Case Study from IOM Area, Pacific Ocean) Monika Wasilewska-Błaszczyk * and Jacek Mucha Department of Geology of Mineral Deposits and Mining Geology, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, 30-059 Cracow, Poland; [email protected] * Correspondence: [email protected] Abstract: The success of the future exploitation of the Pacific polymetallic nodule deposits depends on an accurate estimation of their resources, especially in small batches, scheduled for extraction in the short term. The estimation based only on the results of direct seafloor sampling using box corers is burdened with a large error due to the long sampling interval and high variability of the nodule abundance. Therefore, estimations should take into account the results of bottom photograph analyses performed systematically and in large numbers along the course of a research vessel. For photographs taken at the direct sampling sites, the relationship linking the nodule abundance with the independent variables (the percentage of seafloor nodule coverage, the genetic types of nodules in the context of their fraction distribution, and the degree of sediment coverage of nodules) was determined using the general linear model (GLM). Compared to the estimates obtained with a simple Citation: Wasilewska-Błaszczyk, M.; linear model linking this parameter only with the seafloor nodule coverage, a significant decrease Mucha, J. Application of General in the standard prediction error, from 4.2 to 2.5 kg/m2, was found. The use of the GLM for the Linear Models (GLM) to Assess assessment of nodule abundance in individual sites covered by bottom photographs, outside of Nodule Abundance Based on a direct sampling sites, should contribute to a significant increase in the accuracy of the estimation of Photographic Survey (Case Study nodule resources. -
05 36534Nys130620 31
Monte Carlos study on Power Rates of Some Heteroscedasticity detection Methods in Linear Regression Model with multicollinearity problem O.O. Alabi, Kayode Ayinde, O. E. Babalola, and H.A. Bello Department of Statistics, Federal University of Technology, P.M.B. 704, Akure, Ondo State, Nigeria Corresponding Author: O. O. Alabi, [email protected] Abstract: This paper examined the power rate exhibit by some heteroscedasticity detection methods in a linear regression model with multicollinearity problem. Violation of unequal error variance assumption in any linear regression model leads to the problem of heteroscedasticity, while violation of the assumption of non linear dependency between the exogenous variables leads to multicollinearity problem. Whenever these two problems exist one would faced with estimation and hypothesis problem. in order to overcome these hurdles, one needs to determine the best method of heteroscedasticity detection in other to avoid taking a wrong decision under hypothesis testing. This then leads us to the way and manner to determine the best heteroscedasticity detection method in a linear regression model with multicollinearity problem via power rate. In practices, variance of error terms are unequal and unknown in nature, but there is need to determine the presence or absence of this problem that do exist in unknown error term as a preliminary diagnosis on the set of data we are to analyze or perform hypothesis testing on. Although, there are several forms of heteroscedasticity and several detection methods of heteroscedasticity, but for any researcher to arrive at a reasonable and correct decision, best and consistent performed methods of heteroscedasticity detection under any forms or structured of heteroscedasticity must be determined. -
Auditor: an R Package for Model-Agnostic Visual Validation and Diagnostics
auditor: an R Package for Model-Agnostic Visual Validation and Diagnostics APREPRINT Alicja Gosiewska Przemysław Biecek Faculty of Mathematics and Information Science Faculty of Mathematics and Information Science Warsaw University of Technology Warsaw University of Technology Poland Faculty of Mathematics, Informatics and Mechanics [email protected] University of Warsaw Poland [email protected] May 27, 2020 ABSTRACT Machine learning models have spread to almost every area of life. They are successfully applied in biology, medicine, finance, physics, and other fields. With modern software it is easy to train even a complex model that fits the training data and results in high accuracy on test set. The problem arises when models fail confronted with the real-world data. This paper describes methodology and tools for model-agnostic audit. Introduced tech- niques facilitate assessing and comparing the goodness of fit and performance of models. In addition, they may be used for analysis of the similarity of residuals and for identification of outliers and influential observations. The examination is carried out by diagnostic scores and visual verification. Presented methods were implemented in the auditor package for R. Due to flexible and con- sistent grammar, it is simple to validate models of any classes. K eywords machine learning, R, diagnostics, visualization, modeling 1 Introduction Predictive modeling is a process that uses mathematical and computational methods to forecast outcomes. arXiv:1809.07763v4 [stat.CO] 26 May 2020 Lots of algorithms in this area have been developed and are still being develop. Therefore, there are countless possible models to choose from and a lot of ways to train a new new complex model. -
Robust Bayesian General Linear Models ⁎ W.D
www.elsevier.com/locate/ynimg NeuroImage 36 (2007) 661–671 Robust Bayesian general linear models ⁎ W.D. Penny, J. Kilner, and F. Blankenburg Wellcome Department of Imaging Neuroscience, University College London, 12 Queen Square, London WC1N 3BG, UK Received 22 September 2006; revised 20 November 2006; accepted 25 January 2007 Available online 7 May 2007 We describe a Bayesian learning algorithm for Robust General Linear them from the data (Jung et al., 1999). This is, however, a non- Models (RGLMs). The noise is modeled as a Mixture of Gaussians automatic process and will typically require user intervention to rather than the usual single Gaussian. This allows different data points disambiguate the discovered components. In fMRI, autoregressive to be associated with different noise levels and effectively provides a (AR) modeling can be used to downweight the impact of periodic robust estimation of regression coefficients. A variational inference respiratory or cardiac noise sources (Penny et al., 2003). More framework is used to prevent overfitting and provides a model order recently, a number of approaches based on robust regression have selection criterion for noise model order. This allows the RGLM to default to the usual GLM when robustness is not required. The method been applied to imaging data (Wager et al., 2005; Diedrichsen and is compared to other robust regression methods and applied to Shadmehr, 2005). These approaches relax the assumption under- synthetic data and fMRI. lying ordinary regression that the errors be normally (Wager et al., © 2007 Elsevier Inc. All rights reserved. 2005) or identically (Diedrichsen and Shadmehr, 2005) distributed. In Wager et al. -
Generalized Linear Models Outline for Today
Generalized linear models Outline for today • What is a generalized linear model • Linear predictors and link functions • Example: estimate a proportion • Analysis of deviance • Example: fit dose-response data using logistic regression • Example: fit count data using a log-linear model • Advantages and assumptions of glm • Quasi-likelihood models when there is excessive variance Review: what is a (general) linear model A model of the following form: Y = β0 + β1X1 + β2 X 2 + ...+ error • Y is the response variable • The X ‘s are the explanatory variables • The β ‘s are the parameters of the linear equation • The errors are normally distributed with equal variance at all values of the X variables. • Uses least squares to fit model to data, estimate parameters • lm in R The predicted Y, symbolized here by µ, is modeled as µ = β0 + β1X1 + β2 X 2 + ... What is a generalized linear model A model whose predicted values are of the form g(µ) = β0 + β1X1 + β2 X 2 + ... • The model still include a linear predictor (to right of “=”) • g(µ) is the “link function” • Wide diversity of link functions accommodated • Non-normal distributions of errors OK (specified by link function) • Unequal error variances OK (specified by link function) • Uses maximum likelihood to estimate parameters • Uses log-likelihood ratio tests to test parameters • glm in R The two most common link functions Log • used for count data η η = logµ The inverse function is µ = e Logistic or logit • used for binary data µ eη η = log µ = 1− µ The inverse function is 1+ eη In all cases log refers to natural logarithm (base e). -
Heteroscedastic Errors
Heteroscedastic Errors ◮ Sometimes plots and/or tests show that the error variances 2 σi = Var(ǫi ) depend on i ◮ Several standard approaches to fixing the problem, depending on the nature of the dependence. ◮ Weighted Least Squares. ◮ Transformation of the response. ◮ Generalized Linear Models. Richard Lockhart STAT 350: Heteroscedastic Errors and GLIM Weighted Least Squares ◮ Suppose variances are known except for a constant factor. 2 2 ◮ That is, σi = σ /wi . ◮ Use weighted least squares. (See Chapter 10 in the text.) ◮ This usually arises realistically in the following situations: ◮ Yi is an average of ni measurements where you know ni . Then wi = ni . 2 ◮ Plots suggest that σi might be proportional to some power of 2 γ γ some covariate: σi = kxi . Then wi = xi− . Richard Lockhart STAT 350: Heteroscedastic Errors and GLIM Variances depending on (mean of) Y ◮ Two standard approaches are available: ◮ Older approach is transformation. ◮ Newer approach is use of generalized linear model; see STAT 402. Richard Lockhart STAT 350: Heteroscedastic Errors and GLIM Transformation ◮ Compute Yi∗ = g(Yi ) for some function g like logarithm or square root. ◮ Then regress Yi∗ on the covariates. ◮ This approach sometimes works for skewed response variables like income; ◮ after transformation we occasionally find the errors are more nearly normal, more homoscedastic and that the model is simpler. ◮ See page 130ff and check under transformations and Box-Cox in the index. Richard Lockhart STAT 350: Heteroscedastic Errors and GLIM Generalized Linear Models ◮ Transformation uses the model T E(g(Yi )) = xi β while generalized linear models use T g(E(Yi )) = xi β ◮ Generally latter approach offers more flexibility.