The Effects of Heteroscedasticity on Tests of Equivalence," Journal of Modern Applied Statistical Methods: Vol

Journal of Modern Applied Statistical Methods Volume 6 | Issue 1 Article 13 5-1-2007 The ffecE ts of Heteroscedasticity on Tests of Equivalence Jamie A. Gruman University of Guelph Robert A. Cribbie York University, [email protected] Chantal A. Arpin-Cribbie York University Follow this and additional works at: http://digitalcommons.wayne.edu/jmasm Part of the Applied Statistics Commons, Social and Behavioral Sciences Commons, and the Statistical Theory Commons Recommended Citation Gruman, Jamie A.; Cribbie, Robert A.; and Arpin-Cribbie, Chantal A. (2007) "The Effects of Heteroscedasticity on Tests of Equivalence," Journal of Modern Applied Statistical Methods: Vol. 6 : Iss. 1 , Article 13. DOI: 10.22237/jmasm/1177992720 Available at: http://digitalcommons.wayne.edu/jmasm/vol6/iss1/13 This Regular Article is brought to you for free and open access by the Open Access Journals at DigitalCommons@WayneState. It has been accepted for inclusion in Journal of Modern Applied Statistical Methods by an authorized editor of DigitalCommons@WayneState. Journal of Modern Applied Statistical Methods Copyright © 2007 JMASM, Inc. May, 2007, Vol. 6, No. 1, 133-140 1538 – 9472/07/$95.00 The Effects of Heteroscedasticity on Tests of Equivalence Jamie A. Gruman Robert A. Cribbie Chantal A. Arpin-Cribbie University of Guelph York University Tests of equivalence, which are designed to assess the similarity of group means, are becoming more popular, yet very little is known about the statistical properties of these tests. Monte Carlo methods are used to compare the test of equivalence proposed by Schuirmann with modified tests of equivalence that incorporate a heteroscedastic error term. It was found that the latter were more accurate than the Schuirmann test in detecting equivalence when sample sizes and variances were unequal. Key words: Null hypothesis testing, heteroscedasticity, tests of equivalence. Introduction questions involving two groups as tests of this null hypothesis is almost a conditioned reflex Over a half century ago, Hotelling, Bartky, among scholars, even though such an hypothesis Deming, Friedman & Hoel (1948) wrote that is frequently irrelevant to the research question “Unfortunately, too many people like to do their (Westlake, 1976). Testing the null hypothesis of statistical work as they say their prayers – no difference is inappropriate for studies in merely substitute in a formula found in a highly which the primary objective is to demonstrate respected book written a long time ago” (p. that two groups are equivalent, rather than 103). This quote, which can be found cited in different, on a particular measure. More The Task Force on Statistical Inference in specifically, when the research question deals Psychology’s report outlining recommendations with the equivalence of groups on a dependent for the effective use of statistics (Wilkinson, measure, an equivalence test is the appropriate 1999), underscores the fact that many (and necessary) statistical method to be used. researchers apply statistical methods The present article will highlight the importance thoughtlessly, without considering the methods’ of equivalence tests in behavioral research and appropriateness to the research questions under use a Monte Carlo study to compare tests of consideration. equivalence when the variances of the groups Many empirical questions in behavioral are not equal. research involve testing the null hypothesis of no Researchers frequently conduct studies difference between groups on a specific in which assessing the equivalence of two dependent variable. In fact, formulating research groups is the main purpose. For example, consider an investigation of two therapies for dealing with perfectionism. One therapy is Jamie A. Gruman is Assistant Professor of lengthy and expensive; the other short and Organizational Behavior. E-mail: inexpensive. The pertinent research question [email protected] Robert A. Cribbie is an may be to determine whether the therapies are Associate Professor of Psychology at York equivalent in terms of their effectiveness. If they University in Toronto, Ontario. He specializes in are equivalent, then the shorter, less expensive robust test statistics and multiplicity control. method can be implemented with considerable Chantal A. Arpin-Cribbie is a doctoral student in cost and time savings. Traditional statistical the Department of Psychology at York procedures such as t-tests and ANOVAs are ill- University in Toronto, Ontario. She specializes suited to answering these questions because they in clinical health psychology. focus, conceptually and statistically, on assessing group differences. For research 133 134 TESTS OF EQUIVALENCE questions pertaining to the equivalence of Vessey (1993) note, the rejection or nonrejection conditions, researchers require a statistical of the null hypothesis of traditional tests tells us technique designed specifically to test the degree very little about the potential equivalence of the to which different conditions produce similar groups in question. Effectively establishing results. Tests of equivalence serve this purpose. whether the computer adapted version of the When employing tests of equivalence DAT produced subtest scores that were the goal is not to show that treatment conditions equivalent to the paper and pencil version would are perfectly identical, but only that the have required the use of a statistical technique differences between the treatments are too small that could assess the degree to which these to be considered meaningful. Consider, for measures produced similar results. This can be example, an investigation in which an attempt is accomplished through the use of equivalence made to demonstrate that scores on a computer- testing, the purpose of which is to demonstrate based test are equivalent to those from a paper that two (or more) conditions are functionally and pencil based test (e.g., Epstein, Klinkenberg, the same (Stegner, Bostrom & Greenfield, Wiley & McKinley, 2001). In this example, the 1996). researchers may not need to show that the test This approach to statistical analysis has scores are exactly equivalent (as with the been popular for many years in biology, where traditional null hypothesis Ho: µ1 = µ2, but only researchers interested in the interchangeability that differences in test scores are inconsequential of genetically equivalent drugs have used the (i.e., |µ1 - µ2| < D, where D represents an a priori technique to determine drugs’ comparative critical difference for determining equivalence). bioavailability, or bioequivalence (Westlake, A specific example may elucidate this 1976). However, researchers outside of biology issue more clearly. Alkhadher, Clarke & have been slow to recognize the utility of this Anderson (1998) conducted an investigation procedure and continue to use inappropriate designed to assess the equivalence of the paper- statistics when conducting studies that consider and-pencil version and a computer adaptive the similarity of alternative conditions, tests, version of three subtests from the Differential treatments, or procedures. Aptitude Tests (DAT), namely numerical ability One of the more commonly discussed (NA), abstract reasoning (AR) and mechanical tests of equivalence was developed by reasoning (MR). It is noteworthy that the title of Schuirmann (1987). Schuirmann’s test of their article specifically underscores the equivalence has been introduced to the equivalence of these subtests and that in their behavioral sciences through influential articles introduction they highlight that “their by Rogers et al. (1993), Seaman & Serlin (1998) equivalence must be established empirically” and others. The first step in applying (p.206). However, as a means of demonstrating Schuirmann’s test of equivalence is to establish the equivalence of the measures, Alkhader et al. a critical mean difference for declaring two proceeded to conduct ANOVAs, which are population means equivalent (D). Any mean expressly designed to detect statistically difference smaller than D would be considered significant group differences. Based on their meaningless within the framework of the analyses they claimed to have demonstrated the experiment. The selection of an equivalency equivalence of two of the three subtests (AR and interval (D) is an important aspect of MR). However, what Alkhader et al. in fact equivalence testing that is primarily dependent demonstrated was merely that scores on the NA on a subjective level of confidence with which subtest on the computer adapted version of the to declare two (or more) populations equivalent. DAT were statistically significantly different This level of confidence can take on many from the paper and pencil method as different forms including a raw value (e.g., mean traditionally defined. test scores different by 10 points), a percentage The question of the equivalence of the difference (e.g., +/- 10%), a percentage of the different administration methods on subtest pooled standard deviation difference, etc. scores remains a mystery. As Cribbie, Gruman Researchers debating an appropriate & Arpin-Cribbie (2004) and Rogers, Howard & value of D should consider the nature of the GRUMAN, CRIBBIE, & ARPIN-CRIBBIE 135 research. For example, if the paper-and-pencil One concern with the adoption of test discussed above was ten times more Schuirmann’s test of equivalence is the potential expensive to administer than the computer-based effects of variance heterogeneity on the standard test, even a very

Load more