
Journal of Modern Applied Statistical Methods

Volume 2 | Issue 2 Article 30

11-1-2003
Vol. 2, No. 2 (Full Issue)
JMASM Editors

Follow this and additional works at: http://digitalcommons.wayne.edu/jmasm

Recommended Citation Editors, JMASM (2003) "Vol. 2, No. 2 (Full Issue)," Journal of Modern Applied Statistical Methods: Vol. 2 : Iss. 2 , Article 30. DOI: 10.22237/jmasm/1067644800 Available at: http://digitalcommons.wayne.edu/jmasm/vol2/iss2/30

This Full Issue is brought to you for free and open access by the Open Access Journals at DigitalCommons@WayneState. It has been accepted for inclusion in Journal of Modern Applied Statistical Methods by an authorized editor of DigitalCommons@WayneState.

DataMineIt℠ announces PermuteIt™ v2.0

The fastest, most comprehensive and robust

permutation test software on the market today.

Permutation tests increasingly are the statistical method of choice for addressing business questions and research hypotheses across a broad range of industries. Their distribution-free nature maintains test validity where many parametric tests (and even other nonparametric tests), encumbered by restrictive and often inappropriate data assumptions, fail miserably. The computational demands of permutation tests, however, have severely limited other vendors' attempts at providing usable permutation test software for anything but highly stylized situations or small datasets and few tests. PermuteIt™ addresses this unmet need by utilizing a combination of algorithms to perform non-parametric permutation tests very quickly – often more than an order of magnitude faster than widely available commercial alternatives when one sample is large and many tests and/or multiple comparisons are being performed (which is when runtimes matter most). PermuteIt™ can make the difference between making deadlines, or missing them, since data inputs often need to be revised, resent, or recleaned, and one hour of runtime quickly can become 10, 20, or 30 hours.

In addition to its speed even when one sample is large, some of the unique and powerful features of PermuteIt™ include:

• the availability to the user of a wide range of test statistics for performing permutation tests on continuous, count, & binary data, including: pooled-variance t-test; separate-variance Behrens-Fisher t-test, scale test, and joint tests for scale and location coefficients using nonparametric combination methodology; Brownie et al. "modified" t-test; skew-adjusted "modified" t-test; Cochran-Armitage test; exact inference; Poisson normal-approximate test; Fisher's exact test; Freeman-Tukey Double Arcsine test

• extremely fast exact inference (no confidence intervals – just exact p-values) for most count data and high-frequency continuous data, often several orders of magnitude faster than the most widely available commercial alternative

• the availability to the user of a wide range of multiple testing procedures, including: Bonferroni, Sidak, Stepdown Bonferroni, Stepdown Sidak, Stepdown Bonferroni and Stepdown Sidak for discrete distributions, Hochberg Stepup, FDR, Dunnett’s one-step (for MCC under ANOVA assumptions), Single-step Permutation, Stepdown Permutation, Single-step and Stepdown Permutation for discrete distributions, Permutation-style adjustment of permutation p-values

• fast, efficient, and automatic generation of all pairwise comparisons

• efficient variance-reduction under conventional Monte Carlo via self-adjusting permutation sampling when confidence intervals contain the user-specified critical value of the test

• maximum power, and the shortest confidence intervals, under conventional Monte Carlo via a new sampling optimization technique (see Opdyke, JMASM, Vol. 2, No. 1, May, 2003)

• fast permutation-style p-value adjustments for multiple comparisons (the code is designed to provide an additional speed premium for many of these resampling-based multiple testing procedures)

• simultaneous permutation testing and permutation-style p-value adjustment, although for relatively few tests at a time (this capability is not even provided as a preprogrammed option with any other software currently on the market)

For Telecommunications, Pharmaceuticals, fMRI data, Financial Services, Clinical Trials, Insurance, Bioinformatics, and just about any data-rich industry where large numbers of distributional null hypotheses need to be tested on samples that are not extremely small and parametric assumptions are either uncertain or inappropriate, PermuteIt™ is the optimal, and only, solution.

To learn more about how PermuteIt™ can be used for your enterprise, and to obtain a demo version, please contact its author, J.D. Opdyke, President, DataMineIt℠, at [email protected] or www.DataMineIt.com.

DataMineIt℠ is a technical consultancy providing statistical data mining, econometric analysis, and data warehousing services and expertise to the industry, consulting, and research sectors. PermuteIt™ is its flagship product.

Journal of Modern Applied Statistical Methods, November 2003, Vol. 2, No. 2, 281-534. Copyright © 2003 JMASM, Inc. ISSN 1538-9472.

Journal Of Modern Applied Statistical Methods

Invited Articles
281 - 292   C. Mitchell Dayton   Model Comparisons Using Information Measures
293 - 305   Gregory R. Hancock   Fortune Cookies, Measurement Error, And Experimental Design

Regular Articles
306 - 313   Charles F. Bond, Jr., F. D. Richard   P* Index Of Segregation: Distribution Under Reassignment
314 - 328   Kimberly T. Perry   A Critical Examination Of The Use Of Preliminary Tests In Two-Sample Tests Of Location
329 - 340   Christopher J. Mecklin   A Comparison Of Equivalence Testing In Combination With Hypothesis Testing And Effect Sizes
341 - 349   Ayman Baklizi   Confidence Intervals For P(X < Y)
350 - 358   Vincent A. R. Camara   Approximate Bayesian Confidence Intervals For The Variance Of A Gaussian Distribution
359 - 370   Alfred A. Bartolucci, Shimin Zheng, Sejong Bae, Karan P. Singh   Random Regression Models Based On The Elliptically Contoured Distribution Assumptions With Applications To Longitudinal Data
371 - 379   Dudley L. Poston, Jr., Sherry L. McKibben   Using Zero-inflated Count Regression Models To Estimate The Fertility Of U.S. Women
380 - 388   Felix Famoye, Daniel E. Rothe   Variable Selection For Poisson Regression Model
389 - 399   Chengjie Xiong, Yan Yan, Ming Ji   Test Of Homogeneity For Umbrella Alternatives In Dose-Response Relationship For Poisson Variables
400 - 413   Stephanie Wehry, James Algina   Type I Error Rates Of Four Methods For Analyzing Data Collected In A Groups vs Individuals Design
414 - 424   Terry Hyslop, Paul J. Lupinacci   A Nonparametric Fitted Test For The Behrens-Fisher Problem


425 - 432   David A. Walker, Denise Y. Young   Example Of The Impact Of Weights And Design Effects On Contingency Tables And Chi-Square Analysis
433 - 442   Guillermo Montes, Bohdan S. Lotyczewski   Correcting Publication Bias In Meta-Analysis: A Truncation Approach
443 - 450   Chin-Shang Li, Hua Liang, Ying-Hen Hsieh, Shiing-Jer Twu   Comparison Of Viral Trajectories In AIDS Studies By Using Nonparametric Mixed-Effects Models
451 - 466   Stephanie Wehry   Alphabet Letter Recognition And Emergent Abilities Of Rising Kindergarten Children Living In Low-Income Families
467 - 474   Shlomo S. Sawilowsky   Deconstructing Arguments From The Case Against Hypothesis Testing

Brief Reports
475 - 477   W. J. Hurley   A Note On MLEs For Normal Distribution Parameters Based On Disjoint Partial Sums Of A Random Sample
478 - 480   W. Gregory Thatcher, J. Wanzer Drane   On Treating A Survey Of Convenience Sample As A Simple Random Sample

Early Scholars
481 - 496   Katherine Fradette, J. Keselman, Lisa Lix, James Algina, Rand R. Wilcox   Conventional And Robust Paired And Independent-Samples t Tests: Type I Error And Power Rates
497 - 511   A. Gemperli, P. Vounatsou   Fitting Generalized Linear Mixed Models For Point-Referenced Spatial Data
512 - 519   Jason E. King   Bootstrapping Confidence Intervals For Robust Measures Of Association

JMASM Algorithms and Code
520 - 524   Scott J. Richter, Mark E. Payton   JMASM8: Using SAS To Perform Two-Way Analysis Of Variance Under Variance Heterogeneity (SAS)
525 - 530   David A. Walker   JMASM9: Converting Kendall's Tau For Correlational Or Meta-Analytic Analyses (SPSS)


Letters To The Editor
531         Vance W. Berger   Additional Reflections On Significance Testing
531 - 532   Thomas R. Knapp   Predictor Importance In Multiple Regression

End Matter
533 - 534   Statistical Pronouncements II

JMASM is an independent print and electronic journal (http://tbf.coe.wayne.edu/jmasm) designed to provide an outlet for the scholarly works of applied nonparametric or parametric statisticians, data analysts, researchers, classical or modern psychometricians, quantitative or qualitative evaluators, and methodologists. Work appearing in Regular Articles, Brief Reports, and Early Scholars is externally peer reviewed, with input from the Editorial Board; work appearing in Statistical Software Applications and Review and in JMASM Algorithms and Code is internally reviewed by the Editorial Board. Three areas are appropriate for JMASM: (1) development or study of new statistical tests or procedures, or the comparison of existing statistical tests or procedures, using computer-intensive Monte Carlo, bootstrap, jackknife, or resampling methods; (2) development or study of nonparametric, robust, permutation, exact, and approximate randomization methods; and (3) applications of computer programming, preferably in Fortran (all other programming environments are welcome), related to statistical algorithms, pseudo-random number generators, simulation techniques, and self-contained executable code to carry out new or interesting statistical methods. Elegant derivations, as well as articles with no take-home message to practitioners, have low priority. Articles based on Monte Carlo (and other computer-intensive) methods designed to evaluate new or existing techniques or practices, particularly as they relate to novel applications of modern methods to everyday data analysis problems, have high priority. Problems may arise from applied statistics and data analysis; experimental and nonexperimental research design; psychometry, testing, and measurement; and quantitative or qualitative evaluation. They should relate to the social and behavioral sciences, especially education and psychology. Applications from other traditions, such as actuarial statistics, biometrics or biostatistics, chemometrics, econometrics, environmetrics, jurimetrics, quality control, and sociometrics are welcome. Applied methods from other disciplines (e.g., astronomy, business, engineering, genetics, logic, nursing, marketing, medicine, oceanography, pharmacy, physics, political science) are acceptable if the demonstration holds promise for the social and behavioral sciences.

Editorial Assistant: Patric R. Spence
Professional Staff: Bruce Fay, Business Manager; Joe Musial, Marketing Director
Production Staff: Jack Sawilowsky; Christina Gase
Internet Sponsor: Paula C. Wood, Dean, College of Education, Wayne State University

Entire Reproductions and Imaging Solutions. Phone: 248.299.8900; Fax: 248.299.8916; e-mail: [email protected]; Internet: www.entire-repro.com

Editorial Board of Journal of Modern Applied Statistical Methods

Subhash Chandra Bagui, Department of Mathematics & Statistics, University of West Florida
Chris Barker, MEDTAP International, Redwood City, CA
J. Jackson Barnette, Community and Behavioral Health, University of Iowa
Vincent A. R. Camara, Department of Mathematics, University of South Florida
Ling Chen, Department of Statistics, Florida International University
Christopher W. Chiu, Test Development & Psychometric Rsch, Law School Admission Council, PA
Jai Won Choi, National Center for Health Statistics, Hyattsville, MD
Rahul Dhanda, Forest Pharmaceuticals, New York, NY
John N. Dyer, Dept. of Information System & Logistics, Georgia Southern University
Matthew E. Elam, Dept. of Industrial Engineering, University of Alabama
Mohammed A. El-Saidi, Accounting, Finance, Economics & Statistics, Ferris State University
Carol J. Etzel, University of Texas M. D. Anderson Cancer Center
Felix Famoye, Department of Mathematics, Central Michigan University
Barbara Foster, Academic Computing Services, UT Southwestern Medical Center, Dallas
Shiva Gautam, Department of Preventive Medicine, Vanderbilt University
Dominique Haughton, Mathematical Sciences Department, Bentley College
Scott L. Hershberger, Department of Psychology, California State University, Long Beach
Joseph Hilbe, Departments of Statistics/Sociology, Arizona State University
Peng Huang, Dept. of Biometry & Epidemiology, Medical University of South Carolina
Sin-Ho Jung, Dept. of Biostatistics & Bioinformatics, Duke University
Harry Khamis, Statistical Consulting Center, Wright State University
Jong-Min Kim, Statistics, Division of Science & Math, University of Minnesota
Kallappa M. Koti, Food and Drug Administration, Rockville, MD
Tomasz J. Kozubowski, Department of Mathematics, University of Nevada
Kwan R. Lee, GlaxoSmithKline Pharmaceuticals, Collegeville, PA
Hee-Jeong Lim, Dept. of Math & Computer Science, Northern Kentucky University
Devan V. Mehrotra, Merck Research Laboratories, Blue Bell, PA
Prem Narain, Freelance Researcher, Farmington Hills, MI
Balgobin Nandram, Department of Mathematical Sciences, Worcester Polytechnic Institute
J. Sunil Rao, Dept. of Epidemiology & Biostatistics, Case Western Reserve University
Brent Jay Shelton, Department of Biostatistics, University of Alabama at Birmingham
Karan P. Singh, University of North Texas Health Science Center, Fort Worth
Jianguo (Tony) Sun, Department of Statistics, University of Missouri, Columbia
Joshua M. Tebbs, Department of Statistics, Kansas State University
Dimitrios D. Thomakos, Department of Economics, Florida International University
Justin Tobias, Department of Economics, University of California-Irvine
Jeffrey E. Vaks, Beckman Coulter, Brea, CA
Dawn M. VanLeeuwen, Agricultural & Extension Education, New Mexico State University
David Walker, Educational Tech, Rsrch, & Assessment, Northern Illinois University
J. J. Wang, Dept. of Advanced Educational Studies, California State University, Bakersfield
Dongfeng Wu, Dept. of Mathematics & Statistics, Mississippi State University
Chengjie Xiong, Division of Biostatistics, Washington University in St. Louis
Andrei Yakovlev, Biostatistics and Computational Biology, University of Rochester
Heping Zhang, Dept. of Epidemiology & Public Health, Yale University

International
Mohammed Ibrahim Ali Ageel, Department of Mathematics, King Khalid University, Saudi Arabia
Mohammad Fraiwan Al-Saleh, Department of Statistics, Yarmouk University, Irbid-Jordan
Keumhee Chough (K.C.) Carriere, Mathematical & Statistical Sciences, University of Alberta, Canada
Debasis Kundu, Department of Mathematics, Indian Institute of Technology, India
Christos Koukouvinos, Department of Mathematics, National Technical University, Greece
Lisa M. Lix, Dept. of Community Health Sciences, University of Manitoba, Canada
Takis Papaioannou, Statistics and Insurance Science, University of Piraeus, Greece
Mohammad Z. Raqab, Department of Mathematics, University of Jordan, Jordan
Nasrollah Saebi, School of Mathematics, Kingston University, UK
Keming Yu, Statistics, University of Plymouth, UK

INVITED ARTICLES
Model Comparisons Using Information Measures

C. Mitchell Dayton
University of Maryland

Methodologists have criticized the use of significance tests in the behavioral sciences but have failed to provide alternative data analysis strategies that appeal to applied researchers. For purposes of comparing alternate models for data, information-theoretic measures such as Akaike AIC have advantages in comparison with significance tests. Model-selection procedures based on a min(AIC) strategy, for example, are holistic rather than dependent upon a series of sometimes contradictory binary (accept/reject) decisions.

Key words: Akaike AIC, significance tests, information measures

C. Mitchell Dayton is a Professor of Measurement & Statistics at the University of Maryland. His major research interests deal with the topics of latent class analysis and simultaneous inference. He recently published a Sage book dealing with latent class scaling models, a topic on which he has published widely. His long standing interest in simultaneous inference has led to a focus on model-comparison approaches utilizing information theory and posterior Bayes factors.

Introduction

Quantitative researchers have been trained to evaluate effects of interest utilizing the methods of statistical inference. In a single research study it is not unusual to see several dozen, or even several hundred, significance tests applied to assess, for example, multiple correlations, differences among multiple correlations and regression coefficients. However, the appropriateness of the use of significance tests in social and behavioral research settings has been debated for more than 40 years. In particular, Rozeboom (1960) summarized criticisms of significance testing that have resurfaced in various guises from time to time. Generally, these criticisms have focused on the issue of binary decision-making (e.g., accept/reject null hypotheses) as opposed to considerations related to weight of evidence (e.g., measures of strength of effect or effect sizes).

The fundamental error, as seen by Rozeboom, "…lies in mistaking the aim of a scientific investigation to be a decision, rather than a cognitive evaluation of propositions (op. cit., page 212)." Although distinctions can be drawn between significance testing in the Fisherian (1959) sense and hypothesis testing in the Neyman-Pearson (1933) sense, current teaching and practice in the behavioral sciences blur these distinctions and the terms can be considered as essentially interchangeable in practice. However, it is likely that Fisher himself would concur with many of the criticisms as suggested by the following quotes (Fisher, 1959):

…the calculation {of significance levels} is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. (page 42)

On the whole the ideas (a) that a test of significance must be regarded as one of a series of similar tests applied to a succession of similar bodies of data, and (b) that the purpose of the test is to discriminate or 'decide' between two or more hypotheses, have greatly obscured their understanding, when taken not as contingency possibilities but as elements essential in their logic. (page 42)

Advocates for change have urged minimizing (or, even eliminating) the role of significance tests in behavioral research and elevating the roles of procedures such as confidence intervals, measures of effect size (e.g., Carver, 1993) or replicability measures (e.g., Thompson, 1994). Although these advocacy positions have been well articulated and widely disseminated among applied statisticians, there is scant evidence for change in practice by applied researchers in the behavioral sciences.

For example, the Fall 1995, Winter 1995 and Spring 1996 issues of the American Educational Research Journal contained 11 data-based articles in the Teaching, Learning and Human Development section of the journal. The numbers of significance tests per article (with some allowances for counting errors) are, in rank order: 3, 29, 33, 35, 40, 48, 94, 212, 290, 335 and 448 for a total of 1567 tests, or an average of 522 significance tests per issue of the journal.

Although the lowest number, 3, might lead to useful interpretations within a single research study, it is highly doubtful that 29, much less 448, such tests in a single study can be interpreted in a manner that provides much scientific value. Indeed, the lack of popularity of alternative procedures to significance testing has, itself, a long history as evidenced by Heermann and Braskamp (1970) who wrote in the Introduction to Part 4, Testing Statistical Hypotheses, of their book of readings:

there is considerable agreement among statisticians and behavioral scientists that there has been an unfortunate emphasis on the part of the latter on hypothesis testing to the exclusion of other inferential techniques….In spite of this widely known fact, behavioral scientists continue to employ significance tests to the exclusion of other more informative techniques. (page 154)

It can be argued that a major reason for the apparent resistance to change from significance tests to other techniques is that the alternatives that have been proposed are unattractive to applied researchers. Consider the relatively simple example of multiple comparisons among a set of, say, five sample means. A typical traditional approach would be the use of Tukey tests (or one of the plethora of variations such as Games-Howell tests). In effect, 10 significance tests would be conducted and referred to the appropriate theoretical distribution (e.g., studentized range).

If a researcher were to follow Carver's (1993) advice, the Tukey tests would be replaced by "…estimates of effect size and of sampling error such as standard errors and confidence intervals (p. 89)." However, the q statistic per se can be viewed as an effect size (i.e., difference between two means divided by the estimated standard error of a mean) and how does the researcher arrive at a unified interpretation of the 10 confidence intervals? But Carver (1993) has additional advice: "Better yet, by conducting multiple studies, replication of results can replace statistical significance testing." This is not a particularly attractive option given the obstacles that may exist to replication and the fact that the researcher really needs to interpret the present study in order to decide whether or not replication is a worthwhile expenditure of time and resources.

A premise of this paper is that significance tests are appropriate for only certain, highly constrained purposes but have enjoyed much wider use because of the failure of methodologists to popularize other, more appropriate statistical methods. In particular, significance tests are useful for interpreting data that arise from controlled experimental or quasi-experimental designs in which the role of specific hypotheses is well-defined. For non-experimental settings, researchers typically utilize significance tests for purposes of comparing alternate models for data or for interpreting effects within specific models. It is this application that is better served by procedures specifically designed for comparisons among models and is ill-served by significance tests.

An increasingly popular technique for comparing models involves information-theoretic measures such as Akaike (1973, 1978) AIC or measures based on posterior Bayes factors such as Schwarz (1978) BIC. In either case, these measures may be viewed as penalized log-likelihoods and are computed separately for each model under consideration. Then, a preferred model, among those being compared, can be selected. This permits a very wide range of applications and even avoids some technical issues in applying statistical tests for model comparisons (e.g., for comparing number of components for discrete mixture models such as latent class models). Model comparison procedures are holistic in the sense that a variety of competing models can be assessed simultaneously and a best model selected by applying a single rule. Attempting to compare models using significance tests is, by contrast, piece-meal, with the final selection of a model based on results from sometimes conflicting outcomes.

Consider, for example, the procedure that is often used when fitting polynomial regression models to bivariate data. Assume there are five distinct levels of a quantitative independent variable, X, so that models corresponding to linear, quadratic, cubic and quartic regression can meaningfully be fit to the data. Typically, the differences in fit of increasingly more complex models are evaluated by significance tests based on differences in multiple correlations (or, equivalently, differences in explained variability). Thus, four distinct hypotheses are tested with, say, four hierarchical F statistics each at some specified level of significance. Since four independent tests are being conducted, an initial decision is whether or not to control the Type I error rate for the set of tests or, simply, to use a conventional .05 level for each test. This decision, it should be noted, can dramatically affect the interpretation of results. On the other hand, a holistic, model-comparison approach entails computing, say, an Akaike AIC statistic for each regression model and then selecting a "best" model corresponding to the minimum value of AIC.

Another consideration in selecting an approach to comparing models is the logic of the decision-making strategy itself. In applying significance tests, the null hypothesis corresponds to some restricted form of a model (e.g., a test for quadratic regression involves a null hypothesis stating that the regression coefficient for the quadratic term is zero, and this corresponds to a simpler, linear regression model). The validity of the test depends upon assuming that the simpler model is true and that deviations from the model are due to chance. But this is a gross over-simplification of the scientific process. In a holistic, model-comparison approach the underlying goal is to select the best approximating model from among the models under consideration. It is not necessary to assume that any given model is "true" and there is no need to posit that a true model exists among those being compared.

In this article, the rationale for information-theoretic model comparison procedures is presented and two specific areas of application are discussed – pairwise comparisons and analysis of finite mixtures.

Information Criteria

Akaike (1973) suggested that the Kullback-Leibler (1951) information measure provides a natural criterion for ordering alternate models for data. He developed a sample-based estimate, AIC, for this information measure that he incorporated into a decision-making strategy. For any specific model, the form of AIC is −2LL + 2p, where LL is the log-likelihood for the model and p is the number of independent parameters that are estimated in fitting the model to data.

For example, assuming normally distributed residuals for a homogeneous linear regression model for three independent variables, p would equal five and comprise three partial regression coefficients, the mean of the dependent variable (or the Y-intercept) and the variance of the residuals.

A summary of the technical development for the AIC measure can be found in Dayton (2003a) whereas a detailed analysis of the measure is presented by de Leeuw (1992). In general terms, Kullback-Leibler information is a measure of the discrepancy between the true distribution for a random variable (possibly vector-valued) and the distribution specified by some particular model. Although the true model is never known, Akaike managed to derive an estimate of this discrepancy by considering the distribution of a future sample conditional on knowing the maximum-likelihood estimator for the parameters in the model.

Fundamentally, AIC involves the notion of cross-validation, but only in a theoretical sense. Given AIC values for two or more alternate models, the model satisfying min(AIC) is, in this information-theoretic sense, most representative of the true model and, on this basis, may be interpreted as the best approximating model among those being considered. A useful interpretation of AIC is that it estimates the loss of precision (or, increase in information) that results from substituting maximum likelihood estimates for the true parametric values in the likelihood function. Thus, among the models under consideration, it can be argued that the preferred model (i.e., the min(AIC) model) has the smallest expected loss of precision relative to the true, but unknown, model.

It should be noted that AIC does not depend directly on sample size. Bozdogan (1987) noted that, because of this, AIC lacks certain properties of asymptotic consistency and he proposed a related measure, CAIC, by applying his own heuristic to the development of the estimate for Kullback-Leibler information. In particular, for a sample of N cases, CAIC = −2LL + (ln(N) + 1)p. This measure is very similar to the BIC measure proposed by Schwarz (1978) that takes the form BIC = −2LL + ln(N)p, although Schwarz developed his measure as an estimate for a particular posterior Bayes factor, not directly related to Kullback-Leibler information. In any case, both CAIC and BIC reflect sample size and have properties of asymptotic consistency, although the importance of this property for the interpretation of data for any specific sample setting can be disputed since, unlike significance tests, the interpretation of AIC does not depend on long-range sampling notions. AIC, CAIC and BIC may each be viewed as a penalized log-likelihood (Sclove, 1987) with penalties per parameter of 2, ln(N)+1 and ln(N), respectively. For all reasonable sample sizes, CAIC and BIC apply larger penalties than AIC and, thus, other factors being equal, they tend to select simpler models than does AIC.

Among the reasons for preferring the use of a model selection procedure such as AIC in comparison to traditional significance tests are:

(a) A single, holistic decision can be made concerning the model that is best supported by the data in contrast to what is usually a series of possibly conflicting significance tests. Moreover, models can be ranked from best to worst supported by the data, thus extending the possibilities of interpretation.

(b) Models with various parameterizations can be compared even when the models do not obey hierarchic relations.

(c) Homogeneous and heterogeneous versions of models can be compared; in particular, the homogeneity of variance (homoscedasticity) assumptions required by many significance tests can be circumvented and the selection of the most appropriate model can be based on the information criteria.

(d) Considerations related to underlying distributions for random variables can be incorporated into the decision-making process rather than being treated as an assumption whose robustness must be considered (e.g., models based on normal densities and on log-normal densities can be compared).

Various arguments have been presented against the use of information criteria such as AIC although some of these are difficult to follow. For example, McDonald and Marsh (1990) seem to argue as follows: major premise – the saturated model is always the true model; minor premise – for sufficiently large sample size, AIC will always select the saturated model; conclusion – AIC is defective and cannot be used in practice. In a context such as paired comparisons among K means, a saturated model based on normal densities would comprise K unique means and variances. Thus, no other model could possibly fit the data better in an absolute sense (i.e., yield a larger log-likelihood). However, if two of the group means are truly equal and very large samples are involved, measures such as AIC will tend to select the correct model, not the saturated model.

As noted above, others are concerned with the fact that AIC does not directly depend upon sample size and, therefore, lacks properties of asymptotic consistency (Bozdogan, 1987). However, variations on AIC such as Schwarz's (1978) BIC and Bozdogan's (1987) CAIC do reflect sample size considerations. In practice, it is not necessarily the case that the property of asymptotic consistency leads to a better procedure in a true-model identification sense. For example, in the context of comparing non-nested latent class (mixture) models, Lin and Dayton (1997) found that AIC was superior to BIC when the "true model" was relatively complex (i.e., was based on a relatively large number of parameters). Similarly, Huang and Dayton (1995) report that, for multiple comparisons among bivariate mean vectors, AIC tended to outperform BIC and CAIC when "the null case was excluded and, in general, for heterogeneous cases." However, for multiple regression analysis, the results for AIC and BIC reported by Gagné and Dayton (2002) are more complex but consistent with the observation that AIC is more successful with more complex models.

Clearly, further research around the issue of competing information measures is needed, but that does not alter the fact that this class of procedures often provides a highly desirable alternative to traditional significance testing techniques. Finally, it should be pointed out that information measures themselves depend upon certain asymptotic properties of chi-square statistics and, thus, issues of robustness must be considered. This is a researchable topic about which little is known at present. Of course, very similar distributional issues must be considered for significance tests and, despite years of research, the best advice has always been to use large samples.

A technical point about the calculation of AIC (or CAIC or BIC) is that the log-likelihood, LL, often involves the estimation of theoretical variances. The maximum-likelihood estimate for a variance is biased since the denominator for the computation is the sample size, N, regardless of the number of parameters that are estimated in fitting the model to data. In regression analysis with p independent variables, for example, the unbiased estimate for the residual variance is computed by dividing the residual sum of squares by N − p − 1, but in the context of computing AIC the divisor for the maximum likelihood estimate is N.

The computation of AIC for any specific model requires the specification of a distributional form (e.g., univariate normal, multivariate normal, multinomial, Poisson, etc.). Then, the log-likelihood, LL, for the sample is computed based on the model and the specified distributional form. In multiple regression analysis, for example, residuals may be assumed to follow a univariate normal density with variances that are homogeneous conditional on the independent variables. However, unlike conventional significance tests, the set of alternate models being considered may include different specifications and different distributional assumptions. For example, residuals may be characterized as heterogeneous or dependent on the independent variables in various ways. On the other hand, residuals may be assumed to follow a mixture of homogeneous univariate normal densities. In any case, the min(AIC) criterion can be used to order and select among these models.
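To make the penalty structure concrete, here is a minimal Python sketch (not taken from the article; the model names, log-likelihoods and parameter counts are hypothetical) that computes AIC, CAIC and BIC for a set of candidate models and orders them by min(AIC):

import math

def aic(ll, p):
    """Akaike's criterion: -2*LL + 2*p (penalty of 2 per parameter)."""
    return -2.0 * ll + 2.0 * p

def caic(ll, p, n):
    """Bozdogan's consistent AIC: -2*LL + (ln(N) + 1)*p."""
    return -2.0 * ll + (math.log(n) + 1.0) * p

def bic(ll, p, n):
    """Schwarz's criterion: -2*LL + ln(N)*p."""
    return -2.0 * ll + math.log(n) * p

# Hypothetical log-likelihoods and parameter counts for three candidate models.
n = 1092
models = {"linear": (-4067.0, 3), "quadratic": (-4046.8, 4), "cubic": (-4045.7, 5)}

for name, (ll, p) in sorted(models.items(), key=lambda kv: aic(*kv[1])):
    print(f"{name:10s} AIC={aic(ll, p):9.2f}  CAIC={caic(ll, p, n):9.2f}  BIC={bic(ll, p, n):9.2f}")
# The model with min(AIC) is the preferred (best approximating) model; CAIC and
# BIC, with their larger penalties, may prefer a simpler model for the same data.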

To illustrate these ideas in the context of real data, consider the plot (Figure 1) for mathematics achievement scores as a function of weekend television watching activity based on a 5% random sample of cases from the public use data for the National Education Longitudinal Study (NELS). The distinct non-linear trend based on 1092 cases seems to invite a quadratic regression model (the television watching categories were coded at their upper values except that the final category was coded 6). Conventional F tests for increments to explained variability (ΔR²), using a direct notation, are Flinear = 5.34, Fquad = 41.05 and Fcubic = 1.30. The linear and quadratic terms are significant at the conventional 5% level whereas the cubic term is non-significant. Thus, the three significance tests can be interpreted as supporting the selection of a quadratic model for the data. As reported in Gagné and Dayton (2002), the log-likelihood for homogeneous multiple regression models can be computed directly from the residual sum of squares (SSe) and sample size:

LL = -(N/2) [ ln(2π) + ln(SSe/N) + 1 ].    (1)

The AIC values for linear, quadratic and cubic models are, respectively, 8140.02, 8101.62 and 8127.30, leading to the choice of the quadratic model as the best approximating model among these three models (using BIC leads to the same preferred model). But other models might be explored for these data. For example, using the reciprocal of weekend television watching as a predictor (actually, the reciprocal of X+1 due to the presence of 0's), the AIC value is 8144.16, which is less preferred than any of the polynomial models. Note that from a conventional point of view, a test of significance can be run for the regression coefficient in the reciprocal regression model (t = -1.095, p = .274) but there is no direct way of testing the difference in fit between, say, the linear model and the reciprocal model since they are not nested.

[Figure 1. Mathematics achievement scores as a function of weekend television watching. Y axis: mean mathematics standardized score (approximately 47 to 55); X axis: number of hours the respondent watches TV on weekends (Don't watch TV; LT 1 hour a day; 1-2 hours; 2-3 hours; 3-4 hours; 4-5 hours; over 5 hrs a day).]
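A small sketch of the computation behind equation (1) and the resulting AIC comparison may be helpful; the residual sums of squares below are placeholders rather than the NELS values, and the function names are ours:

import math

def loglik_from_sse(sse, n):
    """Equation (1): LL = -(N/2)*[ln(2*pi) + ln(SSe/N) + 1] for a homogeneous
    normal regression model, using the biased (ML) variance estimate SSe/N."""
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sse / n) + 1.0)

def aic_from_sse(sse, n, p):
    """AIC = -2*LL + 2*p, where p counts the regression coefficients, the
    intercept, and the residual variance."""
    return -2.0 * loglik_from_sse(sse, n) + 2.0 * p

n = 1092
# Placeholder residual sums of squares for linear, quadratic, and cubic fits.
candidates = {"linear": (76000.0, 3), "quadratic": (73300.0, 4), "cubic": (73200.0, 5)}
for name, (sse, p) in candidates.items():
    print(name, round(aic_from_sse(sse, n, p), 2))
# min(AIC) identifies the preferred polynomial degree without hierarchical F tests,
# and non-nested alternatives (e.g., a reciprocal predictor) can be compared too.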

Paired Comparisons Information Criterion

Dayton (1998, 2003a) proposed a method for comparisons among means using information criteria such as Akaike's AIC. He advocated this approach rather than standard pairwise-comparison procedures such as Tukey tests in order to avoid or minimize the following problems with conventional procedures.

(a) Tukey tests (and variations) have been proposed based on some arbitrary method for controlling the family-wise type I error rate for the set of correlated pairwise contrasts. Release 11.5 of SPSS, for example, provides options for 18 different post hoc pairwise comparison procedures that are based on several different approaches to controlling type I error.

(b) Unequal sample sizes and heterogeneity of variance pose difficulties for many procedures. The classic Tukey test, for example, assumes constant sample size and homogeneous variances, an often unrealistic set of assumptions. Modifications of Tukey tests such as Games-Howell tests allow for both unequal sample sizes and heterogeneous variances but only provide approximate control of the family-wise type I error rates by means of an adjustment to degrees of freedom.

(c) Intransitive decisions are routinely encountered with pairwise-comparison procedures in general and pose serious interpretive problems if some overall conclusion is desired for the set of means. For three means in rank order, an intransitive decision entails rejecting the difference between the highest and lowest mean but retaining the null hypotheses for comparisons of these means with the middle mean. It has been argued that this really doesn't pose a problem if the main concern of a study is to draw conclusions about the separate pairwise differences. However, if the focus is on individual pairwise contrasts, what rationale is there for sacrificing power and adopting a family-wise error rate rather than simply running separate t tests for each pair of means?

The method based on information criteria described below, known as the paired-comparisons information criterion, or PCIC, has been the topic of simulations by Cribbie & Keselman (2003) who suggest that PCIC has all-pairs power that is typically superior to standard pairwise comparison procedures (e.g., Tukey HSD). The method has been extended to repeated observations as well as to data in the form of proportions.

(A) Independent Samples of Means

Consider a design comprising J independent, random groups of respondents with sample sizes n_j, sample means Ȳ_j and unbiased variance estimates S²_j, with N = Σ n_j. In PCIC, AIC (or a similar measure) is computed for each possible, different ordered subset of means. Thus, only non-overlapping subsets of means are compared rather than all possible subsets. In general, for J groups there are 2^(J−1) distinct patterns of subsets based on ordered means. For example, with three groups with the means ranked and labeled 1, 2, 3, the ordered subsets are {123}, {1,23}, {12,3}, and {1,2,3}, where a comma is used to separate subsets with unequal means. Focusing on ordered subsets of means and using a min(AIC) [or min(BIC)] strategy avoids the intransitivity problem that may arise when using traditional paired-comparisons techniques without sacrificing interpretability of results.

Assuming homogeneity of variance, the log-likelihood for the mth model can be written as:

LL_m = -(N/2) ln(2π) - (N/2) ln(σ̂²_W) - [1/(2σ̂²_W)] Σ_{j=1..J} Σ_{i=1..n_j} (Y_ij - μ̂_mj)²    (2)

where σ̂²_W is computed from the ANOVA within-groups sum of squares but with denominator N rather than N − J. Means for the mth model are estimated assuming that the model is correct. The independent parameters estimated for a model comprise the variance and means, as necessary. If variances are assumed to be equal in the same pattern as means, the case is termed the restricted heterogeneous variance case (for other cases, see Dayton, 1998). Assuming the restricted heterogeneous variance case, an estimated variance for a subset of means can be obtained either by pooling variance estimates as appropriate from the separate groups or by computing the (biased) sample variance from the appropriate combined group. For the latter preferred case, the sample variance for a {23} subset of means, for example, would be

σ̂²_23 = [ Σ_{i=1..n_2} (Y_i2 - μ̂_23)² + Σ_{i=1..n_3} (Y_i3 - μ̂_23)² ] / (n_2 + n_3).    (3)

Assume that, for the mth model, the pattern of sample means has been partitioned into K non-overlapping subsets. Then,

LL_m = -(N/2) [ ln(2π) + 1 ] - (1/2) Σ_{k=1..K} n_mk ln(σ̂²_mk)    (4)

where σ̂²_mk is the (biased) variance estimate and n_mk is the sample size for the kth subset.

Table 1 (following page) summarizes NELS data for reading standardized scores for five racial/ethnic groups as identified in the data base. Tukey tests, as well as Games-Howell tests that lack the homogeneity of variance assumption, yield a typical intransitive pattern of differences with three overlapping, non-significant ranges comprising, in rank order of means from high to low, {123}, {34} and {45}. The three smallest AIC values assuming homogeneity of variance and not making this assumption are shown in Table 1. Note that min(AIC) occurs for the pattern {12,345} assuming the restricted heterogeneous variance case, although several models show quite similar AIC values. An interesting feature of model comparisons with AIC and related information measures is that, although a single preferred model is identified, a ranking of alternative models is provided. Additional illustrative analyses for both the homogeneous and heterogeneous cases are presented in Dayton (1998, 2003a) as well as in connection with a Gauss program (Aptech Systems, 1997) for conducting these tests (Dayton, 2001a).

Table 1. NELS Reading Standardized Scores

"Race"                 n      Mean     Variance
White Non-Hispanic     798    52.55    98.21
API                    75     50.40    97.66
Hispanic               140    47.36    92.13
Black Non-Hispanic     152    46.16    77.68
American Indian        44     46.00    70.39
Total                  1209

Three smallest AIC values by variance assumption:
Homogeneity:              {1,2,345}  8926.90   {1,2,3,45}  8927.59   {12,34,5}  8928.30
Restricted Heterogeneity: {12,345}   8926.62   {1,2,345}   8927.37   {12,3,45}  8927.56
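The enumeration of the 2^(J−1) ordered patterns and the equation (2) log-likelihood are straightforward to script. The sketch below (Python; the function names and the homogeneity-of-variance simplification are ours, not part of the published SUBSET/Gauss program) ranks all candidate patterns by AIC:

import math
from itertools import product

def ordered_partitions(J):
    """All 2**(J-1) ways to cut the rank-ordered groups 0..J-1 into
    contiguous subsets (a cut either is or is not placed after each group)."""
    for cuts in product([False, True], repeat=J - 1):
        part, block = [], [0]
        for g in range(1, J):
            if cuts[g - 1]:
                part.append(block)
                block = []
            block.append(g)
        part.append(block)
        yield part

def pcic_homogeneous(groups):
    """groups: list of lists of observations, assumed ordered by sample mean.
    Returns (AIC, pattern) pairs following equation (2)."""
    N = sum(len(g) for g in groups)
    group_means = [sum(g) / len(g) for g in groups]
    # Pooled within-group sum of squares with divisor N (biased ML estimate).
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    var_w = ss_within / N
    results = []
    for part in ordered_partitions(len(groups)):
        # Subset mean = weighted mean of the groups the subset contains.
        mu = {}
        for block in part:
            tot = sum(sum(groups[g]) for g in block)
            cnt = sum(len(groups[g]) for g in block)
            for g in block:
                mu[g] = tot / cnt
        dev = sum((y - mu[g]) ** 2 for g in range(len(groups)) for y in groups[g])
        ll = -0.5 * N * math.log(2 * math.pi) - 0.5 * N * math.log(var_w) - dev / (2 * var_w)
        p = len(part) + 1            # one mean per subset plus the common variance
        results.append((-2 * ll + 2 * p, part))
    return sorted(results)

Calling pcic_homogeneous with the raw observations for each group returns the candidate patterns ranked from smallest to largest AIC, so the first element is the preferred pattern of means.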

(B) Means Based on Repeated Observations

Consider a cohort of individuals that is measured on the same variable at several points in time. Assuming multivariate normality, the parameters of the distribution are means and variances for the occasions of measurement as well as covariances among occasions. As with independent observations, attention is focused on the distinct ordered subsets of means, and AIC, or a related information measure, can be used to select a preferred pattern.

As for the case of independent groups, variances and covariances can be homogeneous, heterogeneous or restricted heterogeneous. However, the situation is more complex since these conditions can be applied separately to the variances, the covariances or both. In addition, various patterned covariance matrices may be considered to be appropriate (e.g., a simplex pattern with observations closer in time more highly correlated than those further apart in time). Dayton (2003a) presents more detailed information about this case along with illustrative data.

(C) Independent Samples of Proportions

Consider J groups of sizes n_j with sample proportions p_1, p_2, ..., p_J for a dichotomous dependent variable. The theoretical model for the data is that responses represent a series of 0/1 Bernoulli trials with a true population probability, π_j, of a favorable outcome (e.g., 1 or positive) for the jth group. The log-likelihood for any specific ordered outcome (e.g., 0110 for proportions based on four outcomes) in the jth group is n_j p_j ln(p_j) + n_j (1 − p_j) ln(1 − p_j), and the log-likelihood for the total sample is found by summing across the J groups:

LL = Σ_{j=1..J} [ n_j p_j ln(p_j) + n_j (1 − p_j) ln(1 − p_j) ].    (5)

Note that n_j p_j is the expected number of favorable outcomes and n_j (1 − p_j) is the expected number of unfavorable outcomes. The sample proportion, p_j, is the MLE for the corresponding population proportion. Unlike the situation for sample means, there is no need to consider homogeneous and heterogeneous cases since each Bernoulli process is based on a single parameter, π_j. Otherwise, model selection follows the same reasoning as for independent sample means (Dayton, 2001a). That is, there is a total of 2^(J−1) distinct patterns of subsets of proportions to evaluate, and proportions for a model are estimated assuming that the model is correct. Illustrative analyses for this case are presented in Dayton (2001a, 2003a).
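The same 2^(J−1) enumeration used for means applies here; only the log-likelihood changes. A brief sketch of equation (5) with made-up counts, where the pooled proportion within a subset serves as the estimate for each group it contains (our naming, not the article's code):

import math

def ll_proportions(counts, sizes, fitted):
    """Equation (5): LL = sum_j [ n_j p_j ln(p_j) + n_j (1 - p_j) ln(1 - p_j) ],
    written with the observed counts x_j = n_j p_j and the fitted (possibly
    pooled) proportion for group j under the pattern being evaluated."""
    return sum(x * math.log(q) + (n - x) * math.log(1.0 - q)
               for x, n, q in zip(counts, sizes, fitted))

# Pattern {12,3} for three groups: groups 1 and 2 share a pooled proportion.
counts, sizes = [12, 18, 30], [50, 50, 60]
pooled12 = (12 + 18) / (50 + 50)
fitted = [pooled12, pooled12, 30 / 60]
aic = -2.0 * ll_proportions(counts, sizes, fitted) + 2.0 * 2   # two free proportions
print(round(aic, 2))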


PCIC for Distributions

Standard pairwise comparison procedures, such as Tukey HSD and its many variations, have been the subject of a good deal of research directed toward assessing their robustness with respect to distributional assumptions. Typically, non-normal distributions with varying degrees of skew and kurtosis are selected for comparison (e.g., Keselman, Lix & Kowalchuk, 1998, report simulations with normal distributions, three-degree-of-freedom chi-square distributions and a highly non-normal distribution with skewness and kurtosis indices equal to 6.2 and 114, respectively). The issue, then, is the degree of sensitivity of the multiple comparison procedures to departures from normality. Also, a number of simulations have dealt with the relative power of pairwise comparison procedures (e.g., Ramsey, 2002).

An alternative approach is to directly model the underlying distributions for observed data and then compute appropriate likelihoods for candidate distributions of interest. Once these distributions have been selected, procedures comparable to PCIC can be implemented. In practice, identifying the set of candidate distributions is a non-trivial problem. Two classes of plausible models that have credibility in practice and that can be compared are normal and log-normal densities.

The motivation for log-normal models arises from the fact that, in contrast to an additive effect, a multiplicative effect for an independent variable can be modeled in log-linear terms. For example, the usual additive model for a response in a one-way ANOVA design can be represented as Y_ij = μ + τ_j + ε_ij, where μ is a grand mean effect, τ_j is the effect of the jth treatment and ε_ij is a residual error term. Alternatively, assuming a multiplicative, rather than an additive, treatment effect yields the model Y_ij = μ* × τ*_j × ε*_ij, or ln(Y_ij) = μ* + τ*_j + ε*_ij, where the * superscript denotes a parameter on a logarithmic scale. In practice, many positively skewed distributions of observations are reasonably well approximated by log-linear models.

Some preliminary simulation results have been carried out for two-sample and a limited number of three-sample cases to assess how well the AIC and BIC information measures distinguish between samples based on normal and log-normal distributions (Dayton, 2003b). In one series of simulations, theoretical log-normal densities with means and standard deviations of (0, .1), (0, .5) and (0, 1.0) in log units, corresponding to (1.00, .10), (1.13, .60) and (1.65, 2.16) in raw units, were considered. The first distribution is slightly skewed (index = .30) and modestly kurtotic (index = 3.16), the second distribution is moderately skewed (index = 1.75) and somewhat peaked (index = 8.89), while the third distribution is both highly skewed (index = 6.18) and highly kurtotic (index = 113.94). In a second series of simulations, information criteria were compared assuming only log-normal densities but the generated data were either normal or log-normal.

Typical results for two groups are, in addition to the expected sample size differences: (a) BIC selected the correct model more often than AIC in virtually all simulated cases, with an average difference ranging from about 6% to 13%; (b) both information criteria were much more successful in selecting models when the true distribution was log-normal as opposed to when it was normal. This latter result occurs because, as the median increases, log-normal distributions assume a nearly symmetric shape that approximates normality. Limited results for three samples suggest that, as was true for two groups, BIC tends to select the correct pattern of means more often than does AIC, and both criteria were more successful for log-normal than for normal distributions. The superiority of BIC over AIC should not be generalized at this time, however, since Dayton (1998) found for cases with several groups that neither criterion was uniformly superior to the other.

Number of Components in Mixture Models

An emerging area of interest in applied research is the use of finite mixture models when distributions such as normal, Poisson and binomial fail to provide satisfactory fit to data. An impetus for considering mixtures is the phenomenon of over-dispersion, which is manifested by, for example, distributions with "heavy tails." For situations of this sort it is often reasonable to assume that observations represent a mixture from two or more sub-populations rather than arising from a single population. In general, a mixture of J distributions for some dependent variable, Y, can be represented by

g(Y | β) = Σ_{j=1..J} θ_j × g(Y | β_j)

where the θ_j are mixing fractions such that Σ_{j=1..J} θ_j = 1, g( ) is some specified probability (e.g., binomial) or density (e.g., normal) function based on a vector of parameters, β_j, and β is a vector containing all relevant parameters.

For a mixture of two heterogeneous normal densities, for example, g_1(Y | μ_1, σ²_1) and g_2(Y | μ_2, σ²_2) would represent normal densities with unique means and variances that are mixed in proportions θ_1 and θ_2 = 1 − θ_1. To fit such models to data, the parameters for the separate components as well as the mixing fractions must be estimated. For a mixture of two normal densities this would entail estimating five unique parameters (two means, two variances and one mixing proportion). Some relatively simple mixtures (e.g., normal densities) can be estimated using available statistical software such as Mplus (Muthén and Muthén, 1998), but specialized programs such as LEM (Vermunt, 1993) are required in more complex cases such as latent class models.

A persistent dilemma for applications of mixture models is that models with varying numbers of components cannot be compared using conventional significance tests even though these models are hierarchical. For example, the comparison of a mixture of two normal densities to a single normal density could, seemingly, be based on a difference-chi-square test since the single normal density is a restricted form of the mixture (e.g., obtained by setting θ_2 = 0). However, as noted by Everitt and Hand (1981) and Titterington, Smith and Makov (1985), among others, this difference-chi-square statistic fails to satisfy theoretical requirements related to boundaries of the parameter space and is not distributed as expected (nor is its distribution known). Some insight into the problem can be seen from observing that the single restriction θ_2 = 0 is equivalent to the two restrictions μ_1 = μ_2 and σ²_1 = σ²_2 since, in either case, the resulting model is a single normal density. In fact, the mixture is based on five parameters whereas the single normal distribution is based on only two parameters, yet only one restriction is required to obtain the simpler from the more complex model.

Given the failure of conventional significance tests to provide a basis for assessing the number of components in a mixture, information measures such as AIC present an attractive alternative. Information criteria provide a single summary statistic for each model being compared and avoid the asymptotic distributional issues faced by difference-chi-square tests for mixture models. Some preliminary work on assessing AIC, BIC and related measures was reported by Dayton (2001b), who focused on the issue of selecting the appropriate number of mixtures in binomial models (restricted latent class models) with four and six binary variables.

Simulations were based on sample sizes ranging from 80 to 1280; binomial probabilities for mixtures of two and three processes were selected to represent varying degrees of discriminability of the components; and mixing proportions were varied from equal splits to cases where one component represented only 20% of the cases. Cases with high discriminability involved, for two components, cases with binomial probabilities of .1, .5 and .1, .8, whereas low discriminability involved cases with binomial probabilities of .1, .2. All of the measures studied provided reasonable correct identification rates for the high discriminability cases (e.g., 80% and above across the conditions) but very poor correct identification rates for the low discriminability cases (e.g., 10% or less across the conditions). Dayton (2001b) concludes that this area of analysis requires "…reasonably large sample sizes and the realization that poorly defined latent structures will almost certainly go undetected."
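As an illustration of the kind of comparison being described (not the restricted latent class models studied by Dayton, 2001b), the following sketch fits a single normal density and a two-component normal mixture to simulated data with a crude EM algorithm and compares AIC and BIC; all function names, starting values and settings are ours:

import math, random

def ll_single_normal(y):
    n = len(y); m = sum(y) / n
    v = sum((x - m) ** 2 for x in y) / n
    return -0.5 * n * (math.log(2 * math.pi) + math.log(v) + 1.0)

def norm_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ll_two_normal_mixture(y, iters=200):
    """Crude EM for a two-component heterogeneous normal mixture
    (theta, mu1, mu2, var1, var2 = five free parameters)."""
    y = sorted(y)
    n = len(y)
    mu1, mu2 = y[n // 4], y[3 * n // 4]          # rough starting values
    grand = sum(y) / n
    var1 = var2 = max(sum((x - grand) ** 2 for x in y) / n, 1e-6)
    theta = 0.5
    for _ in range(iters):
        # E-step: posterior probability that each point belongs to component 1.
        w = [theta * norm_pdf(x, mu1, var1) /
             (theta * norm_pdf(x, mu1, var1) + (1 - theta) * norm_pdf(x, mu2, var2))
             for x in y]
        # M-step: update the mixing fraction, means, and variances.
        s1 = sum(w); s2 = n - s1
        theta = s1 / n
        mu1 = sum(wi * x for wi, x in zip(w, y)) / s1
        mu2 = sum((1 - wi) * x for wi, x in zip(w, y)) / s2
        var1 = max(sum(wi * (x - mu1) ** 2 for wi, x in zip(w, y)) / s1, 1e-6)
        var2 = max(sum((1 - wi) * (x - mu2) ** 2 for wi, x in zip(w, y)) / s2, 1e-6)
    return sum(math.log(theta * norm_pdf(x, mu1, var1) +
                        (1 - theta) * norm_pdf(x, mu2, var2)) for x in y)

random.seed(1)
data = [random.gauss(0, 1) for _ in range(150)] + [random.gauss(3, 1) for _ in range(150)]
n = len(data)
ll1, ll2 = ll_single_normal(data), ll_two_normal_mixture(data)
print("AIC:", round(-2 * ll1 + 2 * 2, 1), round(-2 * ll2 + 2 * 5, 1))
print("BIC:", round(-2 * ll1 + math.log(n) * 2, 1), round(-2 * ll2 + math.log(n) * 5, 1))

Because the one- and two-component models are compared through min(AIC) or min(BIC) rather than a difference-chi-square test, the boundary problem described above is sidestepped.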

Conclusion

Although the recommendation has been repeated often in the past, researchers should become aware of modern alternatives to the use of significance tests when comparing alternate models is the focus of analysis. Information-theoretic procedures such as Akaike AIC provide a holistic approach to ordering and selecting among competing models that avoids the piece-meal and potentially inconsistent outcomes that arise from applying multiple significance tests. This paper has summarized applications of these measures to multiple comparisons, including the possibility of varying distributional assumptions, and to mixture models where traditional significance tests are known to be inappropriate.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csake (eds.), Second International Symposium on Information Theory. Budapest: Akademiai Kiado, 267-281.
Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30, Part A, 9-14.
Aptech Systems, Inc. (1997). GAUSS for Windows NT/95: Version 3.2.32. Maple Valley, WA.
Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.
Carver, R. P. (1993). The case against statistical significance testing, revisited. The Journal of Experimental Education, 61, 287-292.
Cribbie, R. A. & Keselman, H. J. (2003). A power comparison of pairwise multiple comparison procedures: A model testing approach versus stepwise procedures. British Journal of Mathematical & Statistical Psychology, 56, 157-182.
Dayton, C. M. (1998). Information criteria for the paired-comparisons problem. American Statistician, 52, 144-151.
Dayton, C. M. (2001a). SUBSET: Best subsets using information criteria. Journal of Statistical Software, Vol. 6, Issue 02, April.
Dayton, C. M. (2001b). Performance of information criteria for number of components for product-binomial processes. Paper presented at the Mixtures 2001: Recent Developments in Mixture Modeling conference at Universitat der Bundeswehr, Hamburg, Germany, July.
Dayton, C. M. (2003a). Information criteria for pairwise comparisons. Psychological Methods, 8, 61-71.
Dayton, C. M. (2003b). A modeling approach to post-hoc comparisons of means and proportions. Paper presented at Second Workshop on Research Methodology (RM2003), Vrije University, Amsterdam, The Netherlands, June.
de Leeuw, J. (1992). Introduction to Akaike (1973) Information theory and an extension of the maximum likelihood principle. In S. Kotz & N. L. Johnson (Eds.), Breakthroughs in Statistics Volume I: Foundations and Basic Theory. New York: Springer-Verlag.
Everitt, B. S. & Hand, D. J. (1981). Finite Mixture Models. New York: Chapman & Hall, Ltd.
Fisher, R. A. (1959). Statistical Methods and Scientific Inference (2nd Edition). New York: Hafner.
Gagné, P. & Dayton, C. M. (2002). Best regression model using information criteria. Journal of Modern Applied Statistical Methods, 1, 479-488.
Heermann, E. & Braskamp, L. A. (Eds.) (1970). Readings in Statistics for the Behavioral Sciences. New Jersey: Prentice-Hall.
Huang, C-C & Dayton, C. M. (1995). Detecting patterns of bivariate mean vectors using model-selection criteria. British Journal of Mathematical & Statistical Psychology, 48, 129-147.
Keselman, H. J., Lix, L. M., & Kowalchuk, R. K. (1998). Multiple comparison procedures for trimmed means. Psychological Methods, 3, 123-141.
Kullback, S. & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79-86.
Lin, T. S. & Dayton, C. M. (1997). Model-selection information criteria for non-nested latent class models. Journal of Educational and Behavioral Statistics, 22, 249-264.
McDonald, R. P. & Marsh, H. W. (1990). Choosing a multivariate model: noncentrality and goodness of fit. Psychological Bulletin, 107, 247-255.
Muthén, L. K., & Muthén, B. O. (1998). Mplus User's Guide. Los Angeles: Muthén and Muthén.
Neyman, J. & Pearson, E. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society (A), 231, 289-337.
Ramsey, P. H. (2002). Comparison of closed testing procedures for pairwise testing of means. Psychological Methods, 7, 504-523.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333-343.
Thompson, B. (1994). The pivotal role of replication in psychological research: Empirically evaluating the replicability of sample results. Journal of Personality, 62, 157-176.
Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Models. New York: John Wiley & Sons.
Vermunt, J. K. (1993). Log-linear & event history analysis with missing data using the EM algorithm. WORC Paper, Tilburg University, The Netherlands.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 293-305 1538 – 9472/03/$30.00

Fortune Cookies, Measurement Error, And Experimental Design

Gregory R. Hancock University of Maryland

This article pertains to the theoretical and practical detriments of measurement error in traditional univariate and multivariate experimental design, and points toward modern methods that facilitate greater accuracy in effect size estimates and power in hypothesis testing.

Keywords: measurement error, latent variables, multivariate analysis, experimental design

Introduction

Whichever leg of my post-secondary academic journey, and with whichever campus I have had the privilege of affiliating, the vast majority of my midday meals have ended with a fortune cookie. Since my college days, in fact, I estimate that I have had lunch at some inexpensive Asian restaurant near campus well over a thousand times. My graduate school office mates and the many students and faculty whom I served as teaching assistant might even remember all the little strips of paper taped to the top of my desk, filling the entire surface with fortunes by the time I finished my doctorate.

Today, a little more reserved in my decorative zeal, though no less so in my meal predilection, I have but a single fortune tacked outside of my office door. Amidst aging cartoons and family pictures is an enlarged photocopy of the one little rectangle of wisdom I have saved over these last decades. It reads:

Love truth
but pardon error.
Lucky Numbers 7, 8, 13, 31, 32, 44

Although my quantitative training precludes me from seeking fortune based on the third line, not so with the first two. Their aphorism seems replete with insight and potential on many levels, personal and professional, with the latter level serving as the inspiration for this article.

Gregory R. Hancock is Professor in the Department of Measurement, Statistics and Evaluation at the University of Maryland. His research appears in such journals as Psychometrika, Structural Equation Modeling, and Journal of Educational and Behavioral Statistics. He serves on several journal editorial boards, and regularly conducts workshops around the U.S. Email: [email protected].

Less obtusely, in so many applied statistical analyses there seems to be a schism between the variables we have and the variables we wish we had. This is apparent in statements of theory preceding and justifying those analyses and in the interpretations and purported implications that follow. Educational policy researchers, for example, might analyze measures of teachers' job satisfaction and absenteeism and then make proclamations regarding the apparent degree of teacher burnout. Those studying child development might start by eliciting new mothers' responses to rating scale items regarding interactions with their infants, and conclude by making inferences about those mothers' emerging maternal warmth. Health care researchers might want an understanding of AIDS patients' sense of hopelessness while in group therapy, and choose measures of patients' treatment compliance to help facilitate that understanding. Such is the nature of so much applied research, particularly within the social sciences — constructs of interest such as burnout, maternal warmth, or hopelessness are generally latent, so our analyses seem resigned to rely upon the fallible measured variables as surrogates.

And therein lies the schism, in the operationalization of true constructs as error-laden measured variables. At best the imperfect connection might lead us to a distorted image of the critical relations in a population; at worst we might not even have sufficient power to draw inference at all. Within the context of experimental design specifically, the primary focus of this treatise, the implication is that treatment effectiveness might be severely underestimated, or perhaps even undetected. Of course this is not unknown. In fact, nothing written here will be new knowledge. But it is important and often-overlooked knowledge, bearing clarification and amplification. It will thus be my purpose to drive home the often underestimated (if not entirely disregarded) importance of constructs and measurement error in our univariate and multivariate experimental analyses, and to point the applied researcher toward more modern strategies for dealing with measurement error in experimental design.

Love Truth

The purpose of applied statistics seems to be to gain insight into some truth bearing practical consequence. Drawing upon a few familiar test statistics, we attempt to use observed relations among measured variables in samples to make educated guesses about unobserved relations in the populations of which each sample serves as assumed microcosm. But what, precisely, is the population relation we hope to understand in order to have practical consequence? What is the truth into which we seek insight?

As we learn and practice the many methods huddled under the general linear model umbrella, we typically hold as our goal a correct inference about, and often estimation of, some population relation among observed variables – a true correlation between X and Y (ρXY), a true predictive relation of X3 to Y holding X1 and X2 constant (β3), a true standardized effect size for the mean difference between Populations 1 and 2 on Y (dY), and so on. But what does any measured X or Y variable really represent, and what information do any relations among such variables convey?

In the physical sciences, variables such as temperature, pressure, mass, and volume, when considered in sufficient quantities, are in their measurement as they are in name. That is, there tends to be a strong correspondence between the measurement and the entity it represents. In the social sciences, some such variables exist as well – biological sex, treatment group assignment, and political party affiliation, for example. Except for data recording or entry errors, we expect each variable to represent precisely that which its name implies. Other social science variables would also seem to have such identity, being determinable largely without interference – number of therapy sessions attended, number of children's books in the home, and the like. However, a fundamental question in many disciplines, particularly those in the social sciences, is the following: What is the underlying construct that each variable has been selected to represent?

The univariate scenario

Consider a researcher who is truly interested in a construct contrived here as In-Home Reading Resources. In that case, number of children's books in the home is indeed a fairly proximal operationalization of the construct of interest. As such, estimates regarding population mean differences in number of children's books in the home, or regarding the population relations this measured variable has with other such proximal operationalizations, provide direct insight into some truth for the construct of In-Home Reading Resources. On the other hand, if a researcher is interested in a construct designated as Parental Commitment to Literacy, and has attempted to capture the spirit of this construct using the number of children's books in the home, then we expect the measured variable to be a more distal operationalization of the desired construct. As such, inference about population differences in Parental Commitment to Literacy, or of its relation to other variables (proximally or distally operationalized), is compromised by typical general linear model analyses. Thus, the truths we seek – constructs and their population relations – are often not directly accessible.

The issue at hand, of course, is one of measurement error in our variables. As an indicator of In-Home Reading Resources, number of children's books in the home has virtually no measurement error; when reflecting Parental Commitment to Literacy, however, it has considerable error. Imagine a researcher employing a control and treatment group to draw inference about the impact of a treatment designed to enhance Parental Commitment to Literacy. Figure 1 displays hypothetical population distributions for the measured variable of number of children's books in the home (Y), as well as for the latent construct of Parental Commitment to Literacy (η). Notice that while the means of the two populations are expected to be the same for Y and η, the relative magnitude of the treatment effect on the Parental Commitment to Literacy construct would be underestimated. The standardized effect size for Y, which is the familiar

dY = (µ1Y − µ2Y)/σY   (1)

(Cohen, 1988), is depicted as approximately .65; meanwhile, the standardized effect size for η,

dη = (µ1η − µ2η)/ση,   (2)

is near .95. For this disparity to occur, the construct's standard deviation would have to be 68.4% of the size of the standard deviation of Y, meaning its variance is roughly 46.8% (.684²) that of Y. That is, 46.8% of the variability in Y reflects η, while 53.2% is error with respect to the construct of interest. Put directly,

d²Y = ρYY d²η,   (3)

where ρYY is the reliability of Y (.468 in the above example). Thus, while the number of children's books in the home may accurately reflect In-Home Reading Resources, with regard to Parental Commitment to Literacy it could be a relative overestimate or underestimate for any given individual.
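The arithmetic behind Equation 3 is easy to re-trace; the minimal check below (not part of the original article) simply plugs in the illustrative values used above (dη near .95, reliability .468) and recovers the attenuated effect size of about .65.

```python
import math

d_eta = 0.95      # standardized effect size at the construct level (illustrative value from the text)
rho_yy = 0.468    # reliability of Y with respect to the construct (illustrative value from the text)

# Equation 3: the squared measured-variable effect size equals the squared
# construct-level effect size attenuated by the reliability of Y.
d_y = math.sqrt(rho_yy * d_eta**2)
print(round(d_y, 3))                  # about 0.65, matching the figure described above
print(round(math.sqrt(rho_yy), 3))    # sigma_eta / sigma_Y, about 0.684
```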

Figure 1. Univariate population difference on measured variable and underlying construct. (The two panels display the population distributions for the measured variable and for the construct, respectively.)

As mentioned previously, the implications of measurement error for inference are two-fold. First, as seen in Figure 1, we would underestimate the magnitude of the treatment effect on Parental Commitment to Literacy. That is, we would make an incorrect estimate of the truth we seek. Second, the presence of the error variance would decrease the power of a two-sample test to detect the presence of that treatment effect. An understanding of this loss of power may be communicated in terms of additional subjects needed in each group to compensate for the presence of measurement error. Assuming a desired level of power (e.g., .80) and a specific standardized effect size at the construct level (e.g., dη = .30), the number of subjects per group for a two-sample z-test can easily be shown to be:

nY = (1/ρYY) nη.   (4)

For example, conducting the test using a valid measure with reliability of ρYY = .50 would require twice as many subjects as a test that could, hypothetically, be conducted directly at the construct level. This result holds for t-tests as well for all but the smallest sample sizes, where appreciable changes in the critical value make the relation only approximate. Further, except for very small samples, Equations 3 and 4 hold for k-group between-subjects analysis of variance (ANOVA) as well using the more general k-group effect size measures (see Cohen, 1988).

The scenario for the univariate outcome may also be depicted symbolically using a path diagram. In Figure 2 we see the measured variable Y being defined by two components, the construct of interest η and measurement error ε. The connection between η and Y, labeled as λ in Figure 2, symbolically reflects the (square root of the) measured variable's reliability. The stronger the relation λ, the more proximal Y's operationalization of η and thus the less error variance it contains; conversely, the weaker λ, the more distal Y's operationalization of η and thus the more error variance it contains. On the left we see a grouping variable representing population membership and whose influence is being assessed; this could be a single variable for k=2 groups, or k−1 group code variables for the general k-group case. As depicted, population membership X has a potential bearing γ on the construct η underlying the measured variable Y, while the remaining variance in η is accounted for by other independent but latent residual influences ζ. Thus, an observed population difference on the measured variable Y is actually the attenuated manifestation of a population difference on the true underlying construct of interest. The weaker the connection between η and Y (i.e., the weaker the reliability), the less well the population difference on the construct of interest is propagated to, and thus reflected in, the observed variable.
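The sample-size inflation of Equation 4 can be made concrete with the standard two-sample z-test formula; the sketch below (an illustration, not part of the article) assumes the .80 power, dη = .30, and ρYY = .50 figures mentioned above and uses scipy's normal quantile function.

```python
from scipy.stats import norm

def n_per_group_z(d, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-sample z-test on a standardized difference d."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 / d ** 2

n_eta = n_per_group_z(d=0.30)     # hypothetical test conducted directly at the construct level
rho_yy = 0.50                     # reliability of the measured variable (illustrative)
n_y = n_eta / rho_yy              # Equation 4: n_Y = (1 / rho_YY) * n_eta
print(round(n_eta), round(n_y))   # about 174 and 349 per group
```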

Figure 2. Path model for univariate case.


This simple univariate example underscores two needs regarding truth in experimental design and analysis. First, we must seek measured operationalizations as proximal to their constructs as possible. Certainly in the social sciences perfect operationalization is generally unrealistic, particularly given the vagaries of human behavior, affect, and attitude. Notwithstanding, researchers should expend considerable effort to select or construct the most valid and reliable measures feasible. Second, to the extent that measurement error remains, we must employ analytic methods that maximize the accuracy of inference and estimation, thereby portraying population truths with the greatest clarity. These analytic methods must, to every extent possible, penetrate the measurement noise to achieve the same fidelity to truth as the theoretical questions that preceded and the practical proclamations we hope to follow. One attempt to do so lies within a multivariate scenario.

The multivariate scenario

Researchers often attempt to enhance their ability to make inference about population differences by gathering several pieces of evidence to be employed within a multivariate experimental design. In multivariate analysis of variance (MANOVA) with outcome measures Y1 through Ym, the hope is that the signal of population differences on some combination of variables will be detected above the noise of their measurement error. This portion of the article will address MANOVA in the presence of measurement error, and highlight its somewhat misguided attempt to get closer to truth.

Consider the multivariate scenario with k=2 populations, often analyzed using Hotelling's T². An example is depicted in Figure 3 using m=2 outcomes for simplicity, and with extremely large population differences for clarity. As before, assume that each Yi measure is an operationalization of its own specific construct ηi, with individual standardized effect sizes of dYi and dηi for the univariate measured and latent population mean differences, respectively. The assessment of the multivariate population difference between centroids µY1 and µY2 is tantamount to evaluating the univariate mean difference on the maximally differentiating discriminant function W = w1Y1 + w2Y2 = w′Y, with weights w commonly (but not necessarily) chosen so the within-group variance σ²W = w′ΣYw equals 1. Observed and latent variable distributions on each Yi axis, as well as on the W axis, are depicted in Figure 3.

Given that W is a linear combination of the observed variables, the measurement error of each Yi is propagated to the linear composite W. The standardized effect size along the W axis, the effect of interest in MANOVA, is dW = (µ1W − µ2W)/σW; it appears as approximately 3. The square of this effect size, d²W, may be computed as the squared Mahalanobis distance

D²W = [µ1Y − µ2Y]′ ΣY(within)⁻¹ [µ1Y − µ2Y],   (5)

where ΣY(within) is the pooled (within-group) variance-covariance matrix reflecting the observed Yi measures' m-dimensional dispersion, and within lurks the influence of measurement error. Specifically, ΣY(within) = Ση(within) + Σε(within), where Ση(within) is the pooled (within-groups) variance-covariance matrix of the specific constructs ηi and Σε(within) is a diagonal matrix of within-group error variances, assumed independent and each equal to σ²Yi(1 − ρYiYi). Thus,

D²W = [µ1Y − µ2Y]′ [Ση(within) + Σε(within)]⁻¹ [µ1Y − µ2Y].   (6)

As seen in Figure 3, the population mean difference on the W axis mirrors the univariate case, where the standardized effect size on the measured composite W underestimates the standardized effect size on the corresponding underlying construct. In this case, the construct underlying W, denoted here as ηW, is a linear combination of the ηi constructs underlying the respective measured Yi variables: ηW = w1η1 + w2η2 = w′η, where η is the vector of ηi constructs.


Figure 3. Multivariate population difference on measured variables and underlying constructs.

Whereas the measured standardized effect size on W was depicted as near 3, the latent standardized effect size for ηW, dηW = (µ1ηW − µ2ηW)/σηW, is approximately 5. The square of this effect size, d²ηW, is also the squared Mahalanobis distance

D²ηW = [µ1η − µ2η]′ Ση(within)⁻¹ [µ1η − µ2η],   (7)

which corresponds to Equation 6 with the error variance Σε(within) removed. In fact, the reliability of the composite W could be determined as D²W / D²ηW, which is just a multivariate restatement and rearrangement of Equation 3.

To get a sense of the impact of measurement error on the multivariate effect size, consider a simple scenario in which the ηi constructs are uncorrelated (and hence so too are the Yi variables). In this case the matrix ΣY(within) is diagonal, and thus Equation 5 may be shown to simplify to

D²W = dY′dY = Σi d²Yi,   (8)

where dY is the vector of standardized effect sizes for Yi (i = 1…m) as per Equation 1. The same logic would also yield a parallel result for the latent effect size:

D²ηW = dη′dη = Σi d²ηi,   (9)

where dη is the vector of latent standardized effect sizes for ηi (i = 1…m) as per Equation 2. Taking each Yi variable's measurement error into account following Equation 3, Equation 8 yields

D²W = Σi (ρYiYi d²ηi).   (10)
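To make Equations 6 through 10 concrete, the sketch below (an illustration with made-up values, not taken from the article) uses two uncorrelated constructs with unit within-group variances, so that a reliability of ρ implies an error variance of 1/ρ − 1; it shows that adding the error variance to the within-group matrix attenuates the squared Mahalanobis distance exactly as Equation 10 indicates.

```python
import numpy as np

# Illustrative (made-up) values for m = 2 uncorrelated constructs.
d_eta = np.array([0.8, 0.6])       # latent standardized effect sizes (Equation 2)
rho = np.array([0.7, 0.5])         # reliabilities of Y1 and Y2
mean_diff = d_eta                  # mean differences, with within-group construct SDs fixed at 1
Sigma_eta = np.eye(2)              # pooled within-group construct covariance matrix
Sigma_eps = np.diag(1 / rho - 1)   # error variances implied by the reliabilities

# Equation 7: squared Mahalanobis distance at the construct level.
D2_eta = mean_diff @ np.linalg.inv(Sigma_eta) @ mean_diff

# Equation 6: the same distance on the measured variables, error variance included.
D2_W = mean_diff @ np.linalg.inv(Sigma_eta + Sigma_eps) @ mean_diff

print(round(float(D2_eta), 3), round(float(D2_W), 3))      # 1.0 and 0.628
print(round(float(np.sum(rho * d_eta**2)), 3))             # Equation 10 also gives 0.628 here
```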


If all Yi variables were of the same reliability ρYY, it further follows that

D²W = (ρYY) D²ηW   (11)

(again, given uncorrelated ηi constructs and homogeneous reliabilities). Assuming a desired level of power (e.g., .80) and a specific effect size at the latent multivariate level (e.g., dηW = .30), the number of subjects per group for a two-sample test can be shown to be inversely proportional to measured variable reliability for all but the smallest sample sizes (in this highly restrictive example). That is,

nW = (1/ρYY) nηW.   (12)

More generally, given any correlational pattern among the ηi constructs (and resulting attenuated correlations among the Yi variables), the resulting reliability ρWW of the composite W would yield the corresponding relation

nW = (1/ρWW) nηW.   (13)

Thus, the more reliable the composite W, the more MANOVA's power tends toward that of a theoretical test directly on the underlying construct.

In the univariate case, two implications of measurement error were highlighted: underestimating the magnitude of the treatment effect on the underlying construct of interest, and decreased power to detect the treatment effect. As illustrated, these hold as well for the multivariate case. However, while we may tend to gain power by accommodating multiple measured variables simultaneously, it is here that we must remind ourselves of our purpose, of precisely what truth we seek. That is – what, exactly, is the construct of interest in MANOVA?

Figure 4, a conceptual path diagram for the multivariate case, will help this discussion. On the left is a group code variable (e.g., dummy) representing population membership and whose influence is being assessed. As depicted, population membership has a potential bearing on the ηi constructs underlying the measured Yi variables. Portions of the constructs not explained by population membership are represented in the latent residual influences ζi, which are likely to be correlated (shown in Figure 4 by shared two-headed arrows). Population differences on the measured variables are the observable manifestations of differences on the true underlying constructs of interest. The connection between each ηi and Yi reflects the (square root of the) reliability of each variable; the weaker such a relation the less well the population differences on a construct are propagated to, and thus reflected in, the observed variables. As a result of each variable's imperfect operationalization of its construct, error εi contributes to the variable as well. Finally, in the case of multiple outcomes, a discriminant function W is represented as a composite of the measured variables. The weights determining this composite are optimal in the sense that they maximize the relation between W and X. Note that W, as a weighted sum of measured variables, is also a weighted sum of constructs and errors. That is, unless all variables are perfect operationalizations of their constructs, the composite W will contain measurement error which thus hampers its ability to reflect population differences propagated by X.

Figure 4. Path model for multivariate case, with m constructs.

So if W contains measurement error, with respect to what construct does that measurement error exist? The answer, as utilized previously, is the composite implicitly formed by MANOVA of the constructs underlying the variables. But what truth does a composite of univariate constructs represent? To this critical question there seem to be three common answers, none of which is entirely satisfactory. Each will be presented in turn, along with the concerns it inspires.

Position 1: The composite is not itself intended to be a construct; rather, it is merely a vehicle for the simultaneous examination of the m individual constructs of interest.

Response 1: If the separate constructs are of interest, then a MANOVA is inconsistent with that interest. Rather, a collection of individual ANOVAs, however seemingly inelegant, would address each construct directly. An omnibus MANOVA is not generally appropriate as a Type I error control mechanism since a single false univariate null hypothesis renders the multivariate null hypothesis false, and thus control over other true univariate ("partial") nulls becomes ungoverned. If one wishes to invoke an error control mechanism at the level of the constructs of interest, such as that of Bonferroni or his descendants, it may be applied across ANOVAs.

Position 2: The univariate constructs are facets of a single meaningful whole, as represented by the discriminant function and upon which knowledge of population differences is sought.

Response 2: Measured variables having a deterministic and defining bearing on a construct have been referred to as constituting an emergent variable system (e.g., Bollen & Lennox, 1991; Cohen, Cohen, Teresi, Marchi, & Velez, 1990). For example, one could imagine an unmeasured construct representing stress, contributed to and defined by such variables as relationship with parents, relationship with spouse, and demands of the workplace. In this case population differences in stress might indeed be of interest. However, the formation of the discriminant function is not done in a manner reflecting any relative theoretical contributions of the three measured variables. If population differences existed only in terms of demands of the workplace, for example, then the discriminant function would be composed of only that variable. But does that then mean that stress is only a function of demands in the workplace? Surely not. Thus, while the notion is reasonable that variables combine to define a composite with a meaningful underlying construct, those variables' combination is not informed by the theoretical soundness of the construct, but rather only by measured variable mean differences. Forming a meaningful composite and then conducting an ANOVA on the resulting scores would seem more consistent with the beliefs underlying this variable system.

Position 3: The univariate constructs are actually a single meaningful underlying construct; the discriminant function represents that construct and allows for the assessment of population differences thereon.

Response 3: Contrary to the emergent variable system described in Response 2, the variable system here is latent. That is, all measured variables are believed to be undergirded by the same construct (but perhaps varying in the quality of their reflection), and it is on this common construct that inference is desired. Still, although a single construct exists, MANOVA remains clouded in its ability to address this construct directly.

Consider Figure 5, where X codes population membership and has a potential bearing γ on the common construct η underlying the measured Yi variables. Thus, population mean differences on the measured variables are the observable manifestations of a population difference on the true underlying construct of interest. Again, the connections between the η construct and Yi variables (λi) embody the (square root of the) reliability of each variable; the weaker such a relation the less well the group differences will be reflected in the observed variables. Finally, the discriminant function W is again shown as an optimal composite of the measured variables, where every variable in the composite contributes some part η and some part εi. So the discriminant function has succeeded to some extent in being a reflection of a construct of interest; however, it has still failed to eradicate error. Further, the function has used group mean differences to guide its definition rather than proximity of construct operationalization. Thus, even if a single common construct underlies the measured variables, measurement error within this multivariate approach will continue to compromise the accuracy of a treatment effect's assessment as well as the power to detect that effect. That is, we must continue the search for methods that attempt to pardon error.

Figure 5. Path model for multivariate case, with one construct.

Pardon Error

Having cursed the varying degrees of darkness inherent in traditional univariate and multivariate experimental analyses, I now wish to light a candle – or more accurately, introduce the candle others have lit (e.g., Muthén, 1989; Sörbom, 1974). The foundation for this illumination may be seen in Figure 5, already presented. Our real goal is not to be able to detect an overall relation between the population membership X and the discriminant function W, but rather between X and the construct η. That is, we desire an estimate of the path denoted as γ, making the discriminant function W irrelevant. Fortunately, under the umbrella of structural equation modeling, a clearer attempt at a solution exists.

In Figure 5 the relations between the construct and its measured operationalizations may be expressed in a system of m structural equations of the form Yi = λiη + εi (i = 1…m). These measurement equations may in turn be represented collectively as

Y = Λη + ε,   (14)

where Y is a subject's m×1 vector of Yi scores, Λ is an m×1 vector of unstandardized factor loadings generally assumed to hold for all subjects in both populations (homogeneity of measurement), and ε is a subject's m×1 vector of εi measured variable residuals. More interestingly, the theoretical relation of our current focus is contained in the structural equation relating population membership to the construct,

η = γX + ζ.   (15)

These structural equations, along with the simplifying (but not mandatory) assumption of independence of all exogenous elements (X, ε, and ζ), have implications for the partitioned variance-covariance matrix Σ containing the X and Yi variables for all populations combined. Specifically, for the Yi variables alone, Equation 14 implies

ΣY = Λ φη Λ′ + Θε,   (16)

where φη is the total construct variance for both populations combined, and Θε is the m×m variance-covariance matrix for the εi residuals. Equation 15 has implications for φη, such that

φη = γ²σ²X + ψ,   (17)

where σ²X is the variance of X and ψ is the variance of the construct residual ζ. That is, ψ is the part of the construct variance that is not explained by population membership; as such, it is the pooled within-groups variance for the construct. Finally, the portion of the covariance matrix relating the vector Y of Yi variables to X, as following from Equations 14 and 15, is

ΣXY = γσ²X Λ′.   (18)

As implied by the model in Figure 5, the full partitioned matrix for the X and Yi variables (respectively) is:

Σ = [ σ²X          γσ²X Λ′
      Λγσ²X    Λ(γ²σ²X + ψ)Λ′ + Θε ].   (19)

Using maximum likelihood estimation within structural equation modeling (see, e.g., Bollen, 1989), and after fixing one factor loading to a value of 1 so as to give the construct η a unit of measurement (i.e., that of the corresponding indicator variable), population values for all parameters in Equation 19 are chosen so as to maximize the likelihood of the observations giving rise to the sample covariance matrix S. After conducting an assessment of the data-model fit as represented by the degree of correspondence between the observed matrix S and the expected matrix Σ̂ (after substituting the optimum parameter values into Equation 19), satisfactory fit allows one to proceed to the question at hand. That question involves the estimation of, and statistical test of, the population mean difference(s) on the construct η.

For the two-group case, the path from the single dummy variable X to the construct η is an estimate of the population difference on the construct. This path, γ, will also have a maximum likelihood standard error as a by-product of the estimation process, which will allow a statistical test of the difference between the two population means on the construct η. If X is coded 0/1, then a statistically significant and positive estimate of γ implies the population coded X=1 has a higher mean on the construct η, whereas a negative value would imply superiority of the population coded X=0. An interpretation of the value of γ itself is not generally useful because it reflects the metric that η has been assigned by fixing a variable loading to 1. However, given that the pooled within-groups construct variance ψ has been estimated as well, we may derive an estimate of the latent standardized effect size dη, where

dη = γ / √ψ.   (20)

Thus, if a single construct underlies our measured variables, we are able to conduct a statistical test on the construct mean difference as well as estimate the standardized effect size associated with that difference in latent means.

The simple process described above, which may be conducted using any structural equation modeling software (e.g., AMOS, EQS, LISREL, Mplus), is part of a larger class of models known as multiple-indicator multiple-cause (MIMIC) models suggested for assessing latent population differences (Muthén, 1989). The procedure is not without its own assumptions and restrictions, some of which may be softened in a somewhat more complicated strategy known as structured means modeling (Sörbom, 1974). Those details are left for the interested reader, and are summarized didactically elsewhere (e.g., Hancock, in press). More important is that these methods exist to put the construct back at center stage, in terms of hypothesis testing and effect size estimation, and as such the theoretical benefits over a MANOVA approach should be clear.
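As a stripped-down illustration of the estimation just described (in practice one would use the SEM packages named above), the sketch below builds the model-implied matrix of Equation 19 for a balanced 0/1 grouping variable and three indicators, then minimizes the usual normal-theory maximum-likelihood discrepancy against a sample covariance matrix S. The covariance matrix, starting values, and helper names are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

m = 3  # number of indicators

def implied_sigma(theta):
    """Model-implied covariance of (X, Y1..Ym) per Equation 19.
    theta = (gamma, psi, lambda_2..lambda_m, theta_eps_1..theta_eps_m); lambda_1 is fixed to 1."""
    gamma, psi = theta[0], theta[1]
    lam = np.concatenate(([1.0], theta[2:2 + (m - 1)]))
    theta_eps = np.diag(theta[2 + (m - 1):])
    var_x = 0.25                                          # variance of a balanced 0/1 dummy
    phi = gamma**2 * var_x + psi                          # Equation 17
    sigma = np.empty((m + 1, m + 1))
    sigma[0, 0] = var_x
    sigma[0, 1:] = gamma * var_x * lam                    # Equation 18
    sigma[1:, 0] = sigma[0, 1:]
    sigma[1:, 1:] = phi * np.outer(lam, lam) + theta_eps  # Equation 16
    return sigma

def ml_discrepancy(theta, S):
    """Normal-theory ML fit function: log|Sigma| + tr(S Sigma^-1) - log|S| - p."""
    sigma = implied_sigma(theta)
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        return np.inf
    p = S.shape[0]
    return logdet + np.trace(S @ np.linalg.inv(sigma)) - np.linalg.slogdet(S)[1] - p

# Hypothetical sample covariance matrix for (X, Y1, Y2, Y3).
S = np.array([[0.25, 0.10, 0.09, 0.11],
              [0.10, 1.00, 0.48, 0.52],
              [0.09, 0.48, 0.95, 0.50],
              [0.11, 0.52, 0.50, 1.05]])

start = np.array([0.3, 0.5, 1.0, 1.0, 0.5, 0.5, 0.5])
fit = minimize(ml_discrepancy, start, args=(S,), method="Nelder-Mead")
gamma_hat, psi_hat = fit.x[0], fit.x[1]
print(round(gamma_hat, 3), round(gamma_hat / np.sqrt(psi_hat), 3))   # gamma and d_eta (Equation 20)
```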

We may also take a practical approach in comparing the MIMIC and MANOVA strategies by determining the sample sizes required to detect a specific latent standardized effect size in order to achieve a desired level of statistical power. In Table 1 we see the cases of m=2, 3, and 4 measured variables, crossed with homogeneous sets of standardized loadings of λ=.4, .6, and .8. The standardized latent effect sizes included were dη=.2, .5, and .8. In all conditions the necessary sample size was assessed for both MANOVA and the MIMIC approach in order to achieve .80 power using the equivalent of a two-tailed test at the .05 level. For MANOVA, sample size determination in each case followed strategies for Hotelling's T² outlined by Cohen (1988; Section 10.3.2.1), while for the MIMIC approach the methods derived by Hancock (2001) were used.

Table 1. Sample Size Required For Two-Group .05-Level Tests With Power = .80

                   MIMIC                     MANOVA
            λ=.4    λ=.6    λ=.8       λ=.4    λ=.6    λ=.8
m=2  dη=.2  1424     742     504       1748     912     619
     dη=.5   229     120      81        281     148     101
     dη=.8    90      47      32        111      59      41
m=3  dη=.2  1080     626     467       1502     871     650
     dη=.5   174     101      76        242     141     106
     dη=.8    68      40      30         96      57      43
m=4  dη=.2   909     568     449       1383     865     684
     dη=.5   146      92      73        224     141     112
     dη=.8    58      36      29         89      57      45

Many points are noteworthy in Table 1. As expected, for both the MIMIC and MANOVA methods the necessary sample size decreases as effect size increases (holding all else constant). Specifically, sample size decreases were approximately proportional to corresponding increases in the square of dη (e.g., from dη=.2 to dη=.5, the necessary sample size decreases by a multiplicative factor of .2²/.5² = .16). Sample size also decreases for both methods as loading magnitude increases (holding all else constant). In particular, sample size decreases were approximately proportional to corresponding increases in construct reliability as measured by coefficient H (also known as maximal reliability), where for the case of homogeneous loadings H mirrors the Spearman-Brown prophecy formula as

H = mλ² / [1 + (m − 1)λ²]   (21)

(see Hancock, 2001). For example, with m=2 variables, H=.276 for λ=.4 and H=.529 for λ=.6; sample size thus decreases by a multiplicative factor of .276/.529 = .521 for both the MIMIC and MANOVA strategies. For MANOVA this sample size decrease is due to the increased presence of the construct in the discriminant function; for the MIMIC approach, which already operates at the construct level, this sample size decrease is due to a decrease in the standard error associated with the γ path.

With regard to increasing the number of variables, for the MIMIC strategy sample size decreases correspondingly (holding all else constant); this is because distributional noncentrality varies directly with construct reliability as measured by H (Hancock, 2001), which increases with the addition of any nonzero loading. For MANOVA, sample size decreases with additional variables for λ=.4 and .6, but an increase in required sample size is observed for λ=.8. This is because at some point additional variables do not contribute sufficient new information about the construct to justify the additional degree of freedom expenditure. This was seen in a supplemental analysis as well using λ=.9 (not shown in Table 1), where for m=2, 3 and 4 the necessary sample size per group for MANOVA increased from 540 to 590 to 635, respectively.

Overall, as expected the sample size required for MANOVA was always greater than for MIMIC. For the m=2 case MANOVA sample sizes were always about 23% larger than for the MIMIC approach. For m=3 that number increased to around 39%, while for m=4 required sample sizes for MANOVA were approximately 52% larger than for the MIMIC strategy. Thus, not only has the MIMIC approach's estimation and inference operated directly at the level of the construct of interest, it has done so with the same power for a considerable savings in sample size (or with greater power for the same sample size expenditure). And interestingly, at no point did we need to estimate variables' reliability; this information was implicit within the MIMIC process in the estimation of the λi loadings.
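The behavior of coefficient H in Equation 21 is simple to verify numerically; the short check below (an illustration, not part of the article) reproduces the .276 and .529 values for two indicators and the resulting .521 sample-size ratio visible in the m=2 row of Table 1 (1424 versus 742).

```python
def coefficient_h(m, lam):
    """Coefficient H (maximal reliability) for m indicators with equal standardized loadings."""
    return m * lam**2 / (1 + (m - 1) * lam**2)

h4, h6 = coefficient_h(2, 0.4), coefficient_h(2, 0.6)
print(round(h4, 3), round(h6, 3))   # 0.276 and 0.529
print(round(h4 / h6, 3))            # about 0.521, the multiplicative sample-size factor
```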
Extensions to this latent approach exist both internally and externally, where the former refers to methods for answering the same questions under less restrictive assumptions and the latter refers to methods for addressing more complex questions. With regard to internal extensions, the primary assumption implicit in MIMIC modeling is that, because the data from the groups are combined and only one model results, the same measurement model holds across populations. This includes loadings, construct variance, and error variances. In effect, all sources of covariation among observed variables are assumed to be equal in all populations, making the assumption of identical measurement models tantamount to an assumption of equal variance/covariance matrices (as is actually assumed in MANOVA as well). As alluded to previously, a more flexible approach to assessing latent means exists in structured means modeling (Sörbom, 1974), where only the corresponding loadings are commonly constrained across populations in the complete covariance model. Further, additional flexibility may exist to allow for some loading differences across populations under particular configurations of partial measurement invariance (Byrne, Shavelson, & Muthén, 1989).

Externally, the methods of assessing latent means may be extended greatly. Within the MIMIC framework, the creative use of group code predictors of the latent construct of interest (e.g., dummy variables) can fairly easily facilitate inferences that parallel those of more complex one-way and factorial ANOVA designs. Also, covariates may be introduced along with the group code variables. In fact, like all other variables covariates have underlying constructs; as such, given multiple indicator variables a latent covariate construct may be incorporated into the model along with the group code variables. The disattenuation of measurement error in the covariate provides greater accuracy in the assessment and testing of the covariate's predictive role in the design, as well as of population mean differences on the outcome construct after exacting such latent control.

Seeking Your Fortune

Inspired jointly by ancient wisdom and modern analytical methods, this article has attempted to return our focus to the constructs that underlie our experimental research endeavors. Certainly those constructs must be grounded in observable measures, but the proximity of those measures' operationalization of the construct(s) should be acknowledged and even accommodated. I have attempted to highlight the theoretical and practical costs of imperfect operationalization within traditional experimental analyses, and pointed toward reasonably accessible strategies that circumvent our measures' necessary imperfections.

But there is no free lunch, so to speak. Although the latent variable approaches to experimental design can pardon error and thus attempt to correct for unreliability, researchers are not thereby absolved of expending considerable effort in choosing or constructing quality measures. Poor reliability in measures yields less stability in the constructs and in estimates of their relations with other variables (e.g., group code variables), as well as larger standard errors for the statistical assessment of estimated relations. Thus, the methods described briefly herein serve to complement sound principles of instrument selection and construction.

These methods also signal the potential to reframe other aspects of the multivariate general linear model as well. Although this article has focused on experimental design, the canonical correlation model suffers from some of the same problems as MANOVA. Specifically, while X and Y variables are generally chosen by researchers with some constructs in mind, X and Y composites are formed whose primary allegiance is to the maximization of XY relations. If one used variables to define constructs in separate X and Y measurement models, the relations between constructs would be directly couched in theory, disattenuated of measurement error, and detectable with considerably more power than within the canonical framework. Expositions similar to those provided here for experimental design could be crafted, and would be equally compelling.

In sum, although constructs and their relations are the beloved truths that motivate most applied statistics, so many of our analytical efforts are hindered in their inferential estimation and hypothesis testing by our measures' inability to reflect those constructs satisfactorily. The current article has illustrated the detriments of failing to pardon error from our experimental inference, and has directed the applied researcher toward more modern methods that can assist researchers in getting closer to the truths they seek. It is my hope that they will pursue these and related methods as they seek their research fortunes. In the mean time, I believe I have a lunch appointment….

References

Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley & Sons.
Bollen, K. A., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation perspective. Psychological Bulletin, 110, 305-314.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-466.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Cohen, P., Cohen, J., Teresi, M., Marchi, M., & Velez, C. N. (1990). Problems in the measurement of latent variables in structural equations causal models. Applied Psychological Measurement, 14, 183-196.
Hancock, G. R. (2001). Effect size, power, and sample size determination for structured means modeling and MIMIC approaches to between-groups hypothesis testing of means on a single latent construct. Psychometrika, 66, 373-388.
Hancock, G. R. (in press). Experimental, quasi-experimental, and nonexperimental design and analysis with latent variables. In D. Kaplan (Ed.), Handbook of Quantitative Methodology for the Social Sciences. Thousand Oaks, CA: SAGE Publications.
Muthén, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54, 557-585.
Sörbom, D. (1974). A general method for studying differences in factor means and factor structure between groups. British Journal of Mathematical and Statistical Psychology, 27, 229-239.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 306-313 1538 – 9472/03/$30.00

Regular Articles

P* Index of Segregation: Distribution Under Reassignment

Charles F. Bond, Texas Christian University
F. D. Richard, University of North Florida

Students of intergroup relations have measured segregation with a P* index. In this article, we describe the distribution of this index under a stochastic model. We derive exact, closed-form expressions for the mean, variance, and skewness of P* under random segregation. These yield equivalent expressions for a second segregation index: η2. Our analytic results reveal some of the distributional properties of these indices, inform new standardizations of the indices, and enable small-sample significance testing. Two illustrative examples are presented.

Key words: Segregation index, randomization methods, quadratic assignment, Eta-squared

Introduction

Bell (1954) developed a way to measure the amount of contact between two groups. This widely-used measure has gone by several names. It has been called the exposure index (James, 1986) and the interaction index (Massey & Denton, 1986). We, however, follow Lieberson (1980) in referring to a measure of the sort devised by Bell as a P* index. It is intended to measure the probability that individuals from two different groups will have contact with one another.

The P* index has been used in studies of residential segregation – when data are available on the number of members of a minority group (j) and a majority group (k) who live in a particular spatially-defined unit (on the same city block, for example, or in the same census tract). It requires data on the number of minority and majority residents in a number of such units. Then P* is the probability that a randomly selected member of group j lives in the same unit as a member of group k. The index is defined as

jP*k = (1/N•j) Σi Nij Nik / Ni•,   (1)

where N•j is the total number of members of group j, u is the number of units, and Nij, Nik, and Ni• are the number of members of group j in unit i, the number of members of group k in unit i, and the total number of people in unit i, respectively, with the sum taken over the u units.

Address correspondence to Charles F. Bond, Jr., Box 298920, Department of Psychology, Texas Christian University, Fort Worth, TX, 76129, or e-mail: [email protected].

P* plays a role in the study of segregation. It has been used to document school, as well as residential, segregation (Coleman, Kelly, & Moore, 1975; Krivo & Kaufman, 1999). It complements alternative indices by tapping a distinct dimension of segregation (Massey, White, & Phua, 1996; Stearns & Logan, 1986). Despite recurrent criticism (Taeuber & Taeuber, 1965), the P* index of segregation has found application in a variety of contexts for nearly 50 years.

Researchers who measure segregation with the P* index have an obligation to interpret their results. P* is a probability. It varies between 0 and 1. However, the probability of a member of one group being exposed to a member of another group could be misleading, depending (as it does) on relative group size. To facilitate interpretation, researchers often compare an observed value of jP*k with the value that would have been observed if there had been no segregation – that is, if the proportion of

306 BOND & RICHARD 307 members of group j and group k within each unit determine the distribution of the P* index equaled the overall proportion of members of under all possible assignments of individuals to * units – assuming that each assignment that those groups across all units. Then j Pk would -1 preserves the marginal totals is as likely as every equal N•k N•• , and the probability of a member other such assignment. of group j being in the same unit as a member of If all possible assignments of individuals group k would be the proportion of members of to units could be made, then the distribution of group k across all units (Lieberson, 1980). P* could be constructed empirically. Ordinarily, No doubt, students of segregation the number of assignments will be prohibitive, enhance understanding by providing comparison however, and other methods will be required. values for their measures. We wonder, however, Monte Carlo techniques could be used (cf. if it is best to compare P* to a value which Taeuber & Taeuber, 1965); but these are assumes that there is no segregation. After all, computationally intensive and yield no exact even if the members of two groups made distributional information. Here we derive the residential choices entirely at random, some exact mean, variance, and skewness of P* under degree of segregation could be expected by all possible assignments of individuals to units. chance (cf., Winship, 1977). In some contexts it Our derivation treats the distribution of would be informative to compare P* to the value * P as a quadratic assignment problem (Hubert, it would attain under a random degree of * 1987). We begin by representing P in a form segregation. Unfortunately, this comparison has that is amenable to quadratic assignment not been possible to date because the value of P* methods, so that we can draw on existing that would be produced by random segregation analytic results. has not been known. Denoting the total number of individuals In the current article, we describe the * in the analysis as N♦♦ , P is represented in two distribution of P* under random segregation. N♦♦ × N♦♦ matrices. Each row of each matrix We develop an analytic method for determining will denote a particular individual, as will the whether the amount of intergroup contact in a corresponding column of the matrix. Hence, particular setting differs from the amount that each entry in each matrix will denote a pair of would be expected by chance. Exact, closed- individuals, matrix element s,t denoting the dyad form expressions for the expected value, that consists of individual s and individual t. variance, and skewness of P* under random This representation is familiar to students of segregation are presented. These imply social networks (Wasserman & Faust, 1994). equivalent expressions for a second segregation We define the two N × N matrices: Q index: Bell’s eta-square. Our analytic results ♦♦ ♦♦ (which we call the cross-group membership reveal some of the distributional properties of matrix) and R (the unit co-occupancy matrix). these segregation indices, inform new Both matrices are symmetric. standardizations of the indices, and enable The cross-group membership matrix Q small-sample significance testing. For statistical identifies dyads in which the intergroup contact characterizations of P*, see Zoloth (1974). For of interest could, in principle, occur. 
If the distributional analyses of the widely-used index researcher wishes to measure the likelihood that of dissimilarity, see the papers by Winship a member of group j will have contact with a (1977) and Inman and Bradley (1991). member of group k, the entry in the sth row and

tth column of this matrix is set to ()2 N −1 Formulation of the Problem • j Our goal is to determine the distribution whenever one of the two individuals in the dyad of the statistic in equation 1) under a stochastic (s or t) belongs to group j and the other model. We begin by assuming that the total individual belongs to group k. All other entries number of individuals in each of u units is fixed of the Q matrix are set to 0. – as is the total number of members of group j The unit co-occupancy matrix R and group k. Our model is that each individual is identifies individuals who are in the same unit. randomly assigned to a unit. We seek to The entry in the sth row and tth column of the R 308 P* INDEX OF SEGREGATION matrix is set to N −1 if the two individuals (s where i• and t) are in unit i. All elements along the −1 diagonal of R are set to 0, as are any off- a=N•k [N• j N•• (N•• −1)(N•• − 2)] diagonal elements that denote two individuals u −1 who are in different units. b=(N•• − 2)∑[(Ni• −1)Ni• ] Denoting element s,t of matrix Q by qst i=1 u and the corresponding element of matrix R as rs −1 - algebra reveals that c=(N• j + N•k −2)∑[(Ni• −1)(Ni• − 2)Ni• ] i=1

N••• N• and * j Pk =∑∑qst rst (2) s==11t d=[(N − 1)(N − 1)(N − 3)− 1] • j • k • • u Thus, the P* index can be expressed as {(N − u)2−2 [(N − 1)(2N − 3)N − 1]}. • • ∑ i • i • i • the sum of the products of corresponding i = 1 elements of two matrices. Such statistics can be analyzed with quadratic assignment methods We have derived the coefficient of (Hubert, 1987). * skewness of jPk under all possible assignments Our goal is describe the distribution of of individuals to units. Appendix A presents an * P under all possible assignments of individuals analytic expression for this statistic, which we to units. In our formulation, individuals are * symbolize γ1( jPk ). implicitly assigned to units by the R matrix. We Because these analytic expressions are could change the assignment of individuals to intricate, it may be helpful to begin by noting units by reordering the rows and corresponding some quantities they omit. Neither the mean, the columns of R. * variance, nor the skewness of jPk are affected Having expressed P* as the sum of the by the number of members of group j or k in any products of corresponding elements of two particular unit. These expressions reflect only matrices, we can draw on formulas that have marginal totals – the size of the two groups, and been derived for the mean, variance, and the size of the u units. Values that would appear skewness of such statistics under all possible re- as entries in a unit x group contingency table do orderings of the rows and columns of one of not enter into the equations because these are those matrices (Hubert, 1987) These provide the moments of a distribution of the possible values desired distributional information. * of jPk over all possible entries that would preserve the marginal totals. Analytic Results Equation (3) yields insight into the Our quadratic assignment formulation * impact of random segregation on P*. In the yields the following results. The mean of jPk absence of any segregation, the probability of a under all possible assignments of individuals to member of group j being in the same unit with a units is −1 member of group k equals N•k N•• , as earlier * researchers noted. This probability is lower E( P )=[N N −1][(N − u)(N −1)−1] (3) j k •k •• •• •• under random segregation. Relative to the probability of intergroup exposure under no * all symbols having been defined above. segregation, the random expectation for jPk is * The variance of P under all possible −1 j k lower by a factor of (N − u)(N −1) , as assignments of individuals to units has a more •• •• complicated mathematical expression. In fact, equation 3) indicates. This difference might be negligible if the units under analysis were * * 2 sufficiently large; it could be appreciable if the Var( j Pk )=a(b+c+d) −[E( j Pk )] (4) units were sufficiently small.


Other P* Indices

The probability of a member of one group interacting with a member of a second group will not, in general, equal the probability of a member of the second group interacting with a member of the first (Lieberson, 1980). However, these probabilities have a simple relationship to one another:

kP*j = N•j N•k^(-1) jP*k   (5)

This implies that

E(kP*j) = N•j N•k^(-1) E(jP*k)   (6)
Var(kP*j) = N•j^2 N•k^(-2) Var(jP*k)
and γ1(jP*k) = γ1(kP*j)

These equations permit a comparison of the distributions of complementary exposure indices. In skewness, the distributions of jP*k and kP*j are identical. In expectation and variance, these two distributions are identical if groups j and k are the same size. If group j is smaller than group k, then jP*k will have a higher expectation and greater variability than kP*j. If group j is larger than group k, then jP*k will have a lower expectation and less variability than kP*j. The greater the difference in the size of two groups, the greater will be the difference in expectation and variability of the two exposure indices involving those groups.

Often students of segregation wish to measure the likelihood that a member of a group will be in the same unit as other members of that group. They have done so with an isolation index (Bell, 1954):

jP*j = (1/N•j) Σ(i=1 to u) (Nij Nij / Ni•)   (7)

The methods above must be adapted to describe the distribution of jP*j. One begins by applying the formulas above to an index for the exposure of individuals who are members of group j to individuals who are not members of group j. Having obtained results for the exposure index jP*not-j, results for the corresponding isolation index follow when one recognizes that

jP*j = 1 - jP*not-j   (8)

Then it should be apparent that

E(jP*j) = 1 - E(jP*not-j)   (9)
Var(jP*j) = Var(jP*not-j)
and γ1(jP*j) = -γ1(jP*not-j)

Eta-square

Bell (1954) also proposed a revised index of isolation

Ij = (jP*j - N•j/N••)(1 - N•j/N••)^(-1)   (10)

noting that the value of this statistic would invariably lie between 0 and 1.

As Duncan and Duncan (1955) observed, Bell's revised index of isolation is identical to η² for predicting a dichotomous group membership variable (1 = member of group j, 0 = not a member of group j) from unit. η², the correlation ratio, equals the percentage of variance in group membership accounted for by unit, a familiar metric for describing strength of association.

Having derived the mean, variance, and skewness of the distribution of jP*j under random segregation, we can use equation 10) to obtain equivalent expressions for Bell's η² measure:

E(η²) = (u - 1)(N•• - 1)^(-1)   (11)
Var(η²) = Var(jP*not-j) N••^2 (N•• - N•j)^(-2)
γ1(η²) = -γ1(jP*not-j)

where Var(jP*not-j) can be obtained from equation 4) above and γ1(jP*not-j) can be obtained from Appendix A.

Standardization and Significance Testing

Often, researchers want to compare the levels of intergroup contact in different locales. Locales may differ from one another in a number of ways – in group composition, for example, and in the size of units. If some researchers want their comparisons of intergroup contact to reflect differences in group composition across locales (Massey, White, & Phua, 1996), others would prefer to make these comparisons in a standardized metric.

Many scholars have treated Bell's η² index as a standardized measure of intergroup contact. In this role, η² has some limitations. Neither the expectation nor the variance of η² is fixed under the assumption of random segregation, as the equations in 11) reveal. Given completely random segregation in two locales, the expected value of η² in the two locales would not in general be equal. Ordinarily, one of the locales would have larger units than the other, hence a lower expected η².

For standardized comparisons of intergroup contact, we propose the following measure:

Z = [jP*k - E(jP*k)] [Var(jP*k)]^(-.5)   (12)

or the analogous Z-statistic for η². Under random segregation, these statistics would have an expected value of 0 and a variance of 1 in any locale – regardless of group composition or unit size.

Researchers may wish to determine whether an observed level of intergroup contact differs to a statistically significant degree from the level that would be produced by random segregation. Although in principle an exact test might be constructed with the multiple hypergeometric distribution (Agresti, 1990), we propose a less cumbersome alternative. We suggest that segregation researchers refer the Z-statistic of equation 12) to some reference value. Liberal reference values could be taken from the standard normal distribution, conservative reference values from Chebyschev's inequality. These would imply that intergroup contact departs from the level expected under random segregation at an alpha-level of .05 if the absolute value of Z exceeds 1.96 (by the normal criterion) or 4.47 (by the Chebyschev criterion). Intermediate reference values could be obtained by incorporating the skewness of the segregation measure into a Type III Fisher's distribution. See Hubert (1987) for details.

For samples of the size analyzed in many studies of residential segregation, significance testing may be moot. In such large samples, every departure from expectation may be highly significant (Taeuber & Taeuber, 1965), and associations between group membership and unit occupancy may be amenable to traditional chi-square tests. Our standardization methods would nonetheless be of value. The inferential test we are proposing should be more useful for small data sets, where the statistical significance of intergroup contact is not a foregone conclusion, and chi-square approximations would be suspect. Such data sets may be uniquely suited to a P* analysis – the members of a unit being most likely to have contact with one another when the units are small.

Examples

For illustrative purposes, we will analyze intergroup contact at a mid-sized American University. We will consider two examples – an example of contact between minority and non-minority faculty members, and an example of contact between minority and non-minority students.

Table 1 presents data on the number of minority and non-minority faculty members serving in eight different units of this University, as published by the University's Office of Institutional Research. These units are housed in different buildings. Faculty tend to interact within these units of the University, not across units.

To assess intergroup contact in this setting, we begin by computing the probability of a minority faculty member serving in the same unit of the University as a non-minority faculty member. Results show that mP*non-m = .8724, a sizeable probability. Of course, one needs to consider that the overall proportion of non-minority faculty members is .8868. It is noteworthy that the observed probability of a minority faculty member serving in the same unit as a non-minority is slightly lower than the proportion of non-minorities as a whole. Does this imply that minority faculty members tend to be isolated from non-minorities? Is this tendency greater than would be expected if these faculty members were distributed across the eight units at random?

Table 1. Faculty Members at an American University, By Educational Unit and Minority Status

Educational Unit     Minority   Non-Minority

Humanities              11          48
Social Science           5          47
Natural Science          7          76
Fine Arts                6          42
Nursing                  1          21
Business                 7          42
Education                3          21
Divinity                 2          12

To answer these questions, we used the present analytic methods. Application of equation 3) above shows that the observed level of minority exposure to non-minorities (mP*non-m = .8724) is slightly greater than what would be produced by random segregation: E(mP*non-m) = .8700. There is little dispersion in the values of mP*non-m across all possible assignments of these faculty members to the 8 units; the square root of equation 4) yields S.D.(mP*non-m) = .0089. Applying the equations in the Appendix, we find that the distribution of mP*non-m is negatively skewed: γ1(mP*non-m) = -1.25. Plugging into equation 12), a standardized measure of minority faculty exposure to non-minorities is Z = +.27. By any significance testing criterion, this level of intergroup contact could have been produced by chance.

Although the isolation of minority faculty members could be expressed in terms of a complementary P* index (mP*m = .1132 with Z = -.27), we will express it in terms of Bell's η². The observed value of η² = .0162 – a value that is close to what would be expected under random segregation: E(η²) = .0189. For an analogue to the Z-statistic in equation 12), we could divide the difference between observed and expected values of η² by .01004 (the standard deviation of η²) and find that in this standardized metric Z = -.27 – the same value that was found for the P* isolation index. These values will always be the same.

Even if minority faculty are integrated at this institution, students may be segregated. We checked for segregation among some undergraduates who were enrolled in a Psychology course. Weekly, students choose to attend any one of the six laboratory sessions that are taught in conjunction with the course. Table 2 depicts the number of minority and non-minority students who attended different laboratory sessions one week during the Spring semester of 2000. Each student's minority status was reported by a laboratory supervisor who was unaware of the purpose of the report.


Table 2. Students Enrolled in a Psychology Class, By Laboratory and Minority Status

Laboratory Attended     Minority   Non-Minority

Monday 1:00 PM              1           7
Monday 3:00 PM              0           6
Wednesday 1:00 PM           0           2
Wednesday 3:00 PM           2           4
Friday 1:00 PM              2           4
Friday 3:00 PM              2           0
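To make the isolation index of equation 7) concrete, the following short Python sketch (a minimal illustration, not the authors' code) computes mP*m and the overall minority proportion from the Table 2 counts; it reproduces the values of .494 and .233 discussed below, and equation 10) then gives Bell's revised index of .340.

# Isolation index jP*j (equation 7) for the laboratory data in Table 2.
# Each row is one laboratory session: (minority count, non-minority count).
counts = [
    (1, 7),  # Monday 1:00 PM
    (0, 6),  # Monday 3:00 PM
    (0, 2),  # Wednesday 1:00 PM
    (2, 4),  # Wednesday 3:00 PM
    (2, 4),  # Friday 1:00 PM
    (2, 0),  # Friday 3:00 PM
]

n_minority = sum(m for m, _ in counts)        # N.j = 7
n_total = sum(m + n for m, n in counts)       # N.. = 30

# Equation 7): sum over units of (Nij / N.j) * (Nij / Ni.)
isolation = sum((m / n_minority) * (m / (m + n)) for m, n in counts)

# Equation 10): Bell's revised index of isolation
p_j = n_minority / n_total
bell = (isolation - p_j) / (1 - p_j)

print(round(isolation, 3))  # 0.494
print(round(p_j, 3))        # 0.233
print(round(bell, 3))       # 0.340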

Do students avoid intergroup contact by choosing to attend laboratories with peers of their own ethnicity? To address this question, we computed the probability of a minority student attending the same laboratory session as another minority student. Computations showed that the isolation index mP*m = .494 – far greater than the total proportion of minority students in this sample (.233), and somewhat greater than the isolation index that random laboratory choices would have produced: E(mP*m) = .366. In this sample, random laboratory choices produce sufficient variability in values of the isolation index [S.D.(mP*m) = .073] that the observed degree of minority isolation would not (by a two-tailed test) differ significantly from its expected value (Z = +1.75). Bell's η² index (.340) also exceeds its expected value (.172) by an amount that yields the same value of Z (+1.75, with S.D. = .095). Of course, these small-scale examples are only illustrative. Larger data sets might yield different conclusions.

Conclusion

We hope that these analytic techniques will be useful to students of segregation. They require no assumption about the sampling of observations, or the form of any population distribution. They reflect randomizations of the data at hand (Edington, 1995).

References

Agresti, A. (1990). Categorical data analysis. New York: John Wiley.

Bell, W. (1954). A probability model for the measurement of ecological segregation. Social Forces, 32, 357-364.

Coleman, J. S., Kelly, S. D., & Moore, J. A. (1975). Trends in school segregation, 1968-73. Washington, DC: Urban Institute.

Duncan, O. D., & Duncan, B. (1955). A methodological analysis of segregation indices. American Sociological Review, 20, 210-217.

Edington, E. S. (1995). Randomization tests (Third edition). New York: Marcel Dekker.


Hubert, L. J. (1987). Assignment methods in combinatorial data analysis. New York: Marcel Dekker.

Inman, H. F., & Bradley, E. L., Jr. (1991). Approximations to the mean and variance of the index of dissimilarity in 2 x C tables under a random allocation model. Sociological Methods & Research, 20, 242-255.

James, F. J. (1986). A new generalized "exposure-based" segregation index. Sociological Methods & Research, 14, 301-316.

Krivo, L. J., & Kaufman, R. L. (1999). How low can it go? Declining black-white segregation in a multiethnic context. Demography, 36, 93-109.

Lieberson, S. (1980). A piece of the pie: Blacks and white immigrants since 1880. Berkeley and Los Angeles: University of California Press.

Massey, D. S., & Denton, N. A. (1988). The dimensions of residential segregation. Social Forces, 67, 281-315.

Massey, D. S., White, M. J., & Phua, V. C. (1996). The dimensions of segregation revisited. Sociological Methods & Research, 25, 172-206.

Sterns, L. B., & Logan, J. R. (1986). Measuring trends in desegregation: Three dimensions, three measures. Urban Affairs Quarterly, 22, 124-150.

Taeuber, K. E., & Taeuber, A. F. (1965). Negroes in cities: Residential segregation and neighborhood change. Chicago: Aldine.

Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. New York: Cambridge University Press.

Winship, C. (1977). A revaluation of indexes of residential segregation. Social Forces, 55, 1058-1066.

Zoloth, B. S. (1974). An investigation of alternative measures of school segregation. Madison: University of Wisconsin Institute for Research on Poverty.

Appendix

The coefficient of skewness of P* is defined as γ1(jP*k) = E[jP*k - E(jP*k)]^3 [Var(jP*k)]^(-3/2). Below we present an expression for E[(jP*k)^3]. Skewness can be computed from

γ1(jP*k) = {E[(jP*k)^3] - [E(jP*k)]^3 - 3 E(jP*k) Var(jP*k)} [Var(jP*k)]^(-3/2)

E[(jP*k)^3] = N•k N•j^(-1) [N••(N•• - 1)(N•• - 2)]^(-1) [N•j A + B + C + D + F + G], where

A = Σ(i=1 to u) (Ni• - 1) Ni•^(-2),

B = 3(N•j + N•k - 2)(N•• - 2)^(-1) {u + Σ(i=1 to u) (2 - 3Ni•) Ni•^(-2)},

C = 3(N•j - 1)(N•k - 1)[(N•• - 2)(N•• - 3)]^(-1) [u(u - N••) Σ(i=1 to u) Ni•^(-1) + 2(N•• - 8u + 16 Σ(i=1 to u) Ni•^(-1) - 9 Σ(i=1 to u) Ni•^(-2))],

D = [(N•j - 1)(N•j - 2) + (N•k - 1)(N•k - 2)][(N•• - 2)(N•• - 3)]^(-1) (N•• - 6u + 11 Σ(i=1 to u) Ni•^(-1) - 6 Σ(i=1 to u) Ni•^(-2)),

F = 3(N•j - 1)(N•k - 1)(N•j + N•k - 4)[(N•• - 2)(N•• - 3)(N•• - 4)]^(-1) Σ(i=1 to u) (Ni• - 1)(Ni• - 2)[Ni•(N•• - u) - 6(Ni• - 2)] Ni•,

G = (N•j - 1)(N•j - 2)(N•k - 1)(N•k - 2)[(N•• - 2)(N•• - 3)(N•• - 4)(N•• - 5)]^(-1) [(N•• - u)^3 - 6(N•• - u)(2N•• - 5u + 3 Σ(i=1 to u) Ni•^(-1)) + 8(5N•• - 22u + 32 Σ(i=1 to u) Ni•^(-1) - 15 Σ(i=1 to u) Ni•^(-2))].

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 314-328 1538 – 9472/03/$30.00

A Critical Examination Of The Use Of Preliminary Tests In Two-Sample Tests Of Location

Kimberly T. Perry Pfizer Inc. Kalamazoo, Michigan

This paper explores the appropriateness of testing the equality of two means using either a t test, the Welch test, or the Wilcoxon-Mann-Whitney test for two independent samples based on the results of using two classes of preliminary tests (i.e., tests for population variance equality and symmetry in underlying distributions).

Key words: t test, Welch, Wilcoxon-Mann-Whitney, Levene, preliminary test for variance, triples test, test of symmetry, test selection

Introduction 3. The Wilcoxon-Mann-Whitney test is In practice, the two-sample t test is widely used robust when the distributions are asymmetric to test the equality of two means. However, it is and the variances are equivalent. well known that the assumptions of 4. None of the above three methods are independence (which will not be discussed in robust when the distributions are asymmetric this paper), variance homogeneity and normality and the variances are unequal. must be met for the two-sample t test to perform well. Results from Zimmerman and Williams Therefore it would be useful to use the (1989), Gans (1981), Murphy (1976), and results from two classes of preliminary test to Snedecor & Cochran (1967) have demonstrated determine which of the three tests, the t test, the that the Welch test or the Wilcoxon-Mann- Welch test, or the Wilcoxon-Mann-Whitney test,

Whitney (WMW) test is more robust in certain should be used to test the hypothesis Ho: µ1 = cases of variance heterogeneity or non- µ2. One class of preliminary tests determines normality. whether the population variances differ, and the Based on the above results for testing other class ascertains if the underlying the equality of means, we conclude the distributions are symmetric or skewed. following: Tests of Variances Used as Preliminary Tests 1. The t test is robust when the The goal of the preliminary test for distributions are symmetric and the variances are variance heterogeneity is to indicate when to equivalent. avoid using mean tests that are sensitive to 2. The Welch test is robust when the variance heterogeneity. distributions are symmetric and the variances are Many methods for testing variance unequal. homogeneity have been developed and compared. Brown and Forsythe (1974), Conover, M.E. Johnson, and M.M. Johnson Kimberly T. Perry is a Senior Research Advisor, (1981), Loh (1987), and O’Brien (1979) have Pfizer Inc., Kalamazoo, Michigan. Her areas of conducted simulations to examine the robustness interest are innovated clinical study designs, of many popular methods for testing variance multiple endpoint analysis, and interim analysis. homogeneity. The L50, the Levene test using the Email: [email protected]. Michael median, was found to be robust for the non- Stoline is acknowledged for his support. normal cases and was one of the procedures

314 KIMBERLY T. PERRY 315 recommended by Conover et al. (1981) as well Test Selection Procedure as the other authors cited above. Based on the The test selection procedure, hereafter above cited literature, the Levene test using the denoted as the TS procedure, will select either a median might be a robust preliminary test t test, the Welch test, or the Wilcoxon-Mann- procedure. Whitney test based on the results of the two Furthermore, Olejnik (1987) conducted preliminary tests. One class of preliminary tests a study where the Levene test using the median determines whether the population variances was compared to the O’Brien procedure (1979) differ, and the other class ascertains if the as a preliminary test procedure preceding the underlying distributions are symmetric or means test. His results showed the Levene test skewed. The "recommended" L50 test (hereafter and the O’Brien procedure used as preliminary denoted Levene test) will be assessed as tests of variance homogeneity were only slightly preliminary test for variance homogeneity, more robust than using the t test alone. It is whereas, the Triples test will be assessed as a noted that Olejnik (1987) used significance preliminary test of symmetry/skewness. Based levels of 5% and 10% for testing variance on the results of the two preliminary tests, the homogeneity in the preliminary test procedure. TS procedure is constructed in the following It is of interest to examine the way: performance of the L50 test as a preliminary test procedure with a higher significance level. A 1. The t test is used to test the equality higher significance level would aid in of means if symmetry is accepted and variance controlling the Type II error. For this simulation homogeneity is accepted. the Levene test at a significance level of 25% 2. The Welch test is used to test the was arbitrary selected. equality of means if symmetry is accepted and variance homogeneity is rejected. Test of Symmetry Used as Preliminary Tests 3. The Wilcoxon-Mann-Whitney test is Randles, Fligner, Policello, and Wolfe used to test the equality of means if symmetry is (1980) compared three procedures for testing rejected and variance homogeneity is accepted. whether a univariate population is symmetric 4. The Welch test is used to test the about some unspecified value compared to a equality of means if symmetry is rejected and large class of asymmetric distribution variance homogeneity is rejected. alternatives. These are the Triples test, Gupta’s skewness test (Gupta, 1967) and Gupta’s It is noted that robust methods exist for testing nonparametric procedure (Gupta, 1967). Their Ho: µ1 = µ2 for cases #1-3 above, but no robust results show that the Triples test is superior to method exists for case #4 . either competitor for testing the hypothesis of symmetry while possessing good power for Methodology detecting asymmetric alternative distributions (Randles et al., 1980). This section contains the details describing the In addition, Cabilio & Masaro (1996) two-sample methodology used to test the and Perry and Stoline (2002) compared the equality of means and variance homogeneity Triples test to other tests of symmetry and the under selected distributions. Triples test continued to perform well both on Let x11, . . ., x1n1 be a random sample robustness and power. Based on the above with sample size of n1 from a distribution studies, the Triples test is selected as a possible denoted f 1(x; µ1, σ1); and x21, . . 
., x2n be a preliminary test of symmetry/skewness prior to 2 random sample with sample size of n from a the testing of means equality in a test selection 2 distribution denoted f 2(x; µ2, σ2). It is assumed procedure. A significance level of 5% for testing 2 of symmetry was arbitrary chosen for this that E(xij) = µi and Var(xij) = σi for each i= 1, 2 simulation. and j = 1,…, ni.. The two samples are assumed to be independent. Let the sample mean and


sample variance for xi1, . . ., xi ni be denoted as x̄i and si² for i = 1, 2, respectively.

Testing the Equality of Means
The t test, the Welch test, and the Wilcoxon-Mann-Whitney test procedures of Ho: µ1 = µ2 vs. H1: µ1 ≠ µ2 are now described.

The t test is given as

t = (X̄1 - X̄2) / √[s²(1/n1 + 1/n2)],   (1a)

where

s² = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2)   (1b)

is the pooled estimate of σ², assuming σ1² = σ2² = σ².

The Welch test statistic is

tw = (X̄1 - X̄2) / √(s1²/n1 + s2²/n2),   (2a)

which uses Satterthwaite's (1946) approximation for the degrees of freedom:

df = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1)].   (2b)

The Wilcoxon-Mann-Whitney statistic is

z = [S - n1(n1 + 1)/2 - n1 n2/2] / √[n1 n2 (n1 + n2 + 1)/12],   (3)

where S is the sum of the ranks assigned to the sample observations from group 1, and z is an approximate normal deviate.

The α-level tests of Ho: µ1 = µ2 vs. H1: µ1 ≠ µ2 are |t| > tα/2, n1+n2-2, |tw| > tα/2, df, and |z| > zα/2 for the t test, the Welch test, and the Wilcoxon-Mann-Whitney test, respectively, where zα is the upper α-point of the standard unit normal distribution and tα, r is the upper α-point of a t distribution with r degrees of freedom.

Testing the Equality of Variances
The Levene test of Ho: σ1² = σ2² vs. H1: σ1² ≠ σ2² is now described, assuming the sampling conditions described above hold.

The Levene α-level test is given as

L = Σ ni(z̄i. - z̄..)² / [Σ Σ (zij - z̄i.)² / (n1 + n2 - 2)] > Fα, 1, n1+n2-2,   (2.5)

which is the one-way analysis of variance F-test computed on the zij values, where zij = |xij - median of group i|.

Testing of Symmetry
The Triples test, as described in a paper by Randles, Fligner, Policello, and Wolfe (1980), is a test to determine if a distribution is symmetric. The procedure used to obtain the test statistic is outlined in Perry and Stoline (2002) and is not repeated here.

Selected Configurations of Distributions, Sample Sizes and Variance Ratios Used in the Simulation
Type I error rates for testing the equality of means were simulated under a variety of conditions using four probability distributions. Each of these four distributions is classified into one of two groups: (1) symmetric and (2) asymmetric.

The Results section examines the use of the TS procedure using two classes of preliminary tests (i.e., testing for variance homogeneity and testing for symmetry) preceding the test of equality of means, Ho: µ1 = µ2, for the two symmetric distributions: (1) normal and (2) double exponential. In addition, the Results section examines the TS procedure for the two asymmetric distributions: (1) lognormal and (2) gamma.

To evaluate the performance of the preliminary test of variance homogeneity, the following standard deviation ratios R = σ1/σ2 are used: 0.25, 0.50, 1.0, 2.0, and 4.0. Clearly the standard deviations are equal when R = 1. Sample size configurations (n1:n2) used in the simulations are: (10:10), (10:20), (10:40), and (20:20). This allows for both direct and indirect pairings to be examined.
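The three candidate tests of equations 1a) through 3) are straightforward to compute directly. The following short Python sketch (an illustration of the formulas above, not the simulation code used in this study) returns the pooled-variance t, the Welch t with its Satterthwaite degrees of freedom, and the normal-approximation Wilcoxon-Mann-Whitney z.

import numpy as np
from scipy import stats

def two_sample_statistics(x1, x2):
    # Pooled-variance t (1a)-(1b), Welch t with Satterthwaite df (2a)-(2b),
    # and the normal-approximation Wilcoxon-Mann-Whitney z (3).
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)

    s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)            # (1b)
    t = (x1.mean() - x2.mean()) / np.sqrt(s2 * (1/n1 + 1/n2))       # (1a)

    tw = (x1.mean() - x2.mean()) / np.sqrt(v1/n1 + v2/n2)           # (2a)
    df = (v1/n1 + v2/n2) ** 2 / ((v1/n1) ** 2 / (n1 - 1)
                                 + (v2/n2) ** 2 / (n2 - 1))         # (2b)

    ranks = stats.rankdata(np.concatenate([x1, x2]))
    S = ranks[:n1].sum()                                            # rank sum of group 1
    z = (S - n1*(n1 + 1)/2 - n1*n2/2) / np.sqrt(n1*n2*(n1 + n2 + 1)/12)  # (3)

    return t, tw, df, z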


Direct pairing occurs when either R = 0.25 or 0.50 holds with either of the imbalanced samples (10:20) and (10:40): direct pairing occurs when the group with the smaller σ is associated with the group with the smaller sample size. Indirect pairing occurs when either R = 2.0 or 4.0 holds with either of the imbalanced sample sizes (10:20) and (10:40): indirect pairing occurs when the group with the smaller σ is associated with the group with the larger sample size.

Generation of Random Realizations
This section contains an outline of how the random realizations are generated for each specified distribution. As before, let x11, . . ., x1 n1 be a random sample of size n1 from the distribution f1(x; µ1, σ1), and x21, . . ., x2 n2 be a random sample of size n2 from the distribution f2(x; µ2, σ2), where it is assumed that the two samples are independent.

The random realizations from the standardized distribution f2(x; µ2, σ2) are generated for each of the selected distributions. For the first sample, f1(x; µ1, σ1), the random realizations are generated in the same fashion, but shape parameters and scale parameters are adjusted to yield the desired standard deviation ratio R = σ1/σ2. Details on each of the four selected distributions are outlined in Perry and Stoline (2002). The IMSL random number generator RNSET, which initializes the seed, is used in all of the simulations.

Testing the Equality of Means Using the TS Procedure
The TS procedure has been described in the Introduction section. Figure 1 is a diagram of how the TS procedure is constructed.

Figure 1. Components of the TS procedure ______

                               Ho: Symmetry
                               Accepted        Rejected

Ho: Variance Homogeneity
(Ho: σ1 = σ2)
     Accepted   →   t test           WMW test
     Rejected   →   Welch test       Welch test

Notes: WMW = Wilcoxon-Mann-Whitney.
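The selection logic of Figure 1 is simple to script. The following Python sketch (an illustration, not the author's code) uses the Levene test with the group medians at the 25% level; the Triples test of symmetry is not available in scipy, so a user-supplied symmetry p-value function is assumed as an input, applied at the 5% level used in this simulation.

from scipy import stats

def ts_procedure(x1, x2, symmetry_pvalue, alpha_sym=0.05, alpha_var=0.25):
    # Test-selection (TS) procedure of Figure 1.
    # symmetry_pvalue is an assumed, user-supplied function returning the p-value of a
    # symmetry test (e.g., the Triples test) for one sample; it is not part of scipy.
    asymmetric = (symmetry_pvalue(x1) < alpha_sym) or (symmetry_pvalue(x2) < alpha_sym)
    heterogeneous = stats.levene(x1, x2, center='median').pvalue < alpha_var  # L50 test

    if not asymmetric and not heterogeneous:
        return 't test', stats.ttest_ind(x1, x2, equal_var=True).pvalue
    if not asymmetric and heterogeneous:
        return 'Welch test', stats.ttest_ind(x1, x2, equal_var=False).pvalue
    if asymmetric and not heterogeneous:
        return 'WMW test', stats.mannwhitneyu(x1, x2, alternative='two-sided').pvalue
    return 'Welch test', stats.ttest_ind(x1, x2, equal_var=False).pvalue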

Asymmetry is concluded if at least one of the samples is declared skewed. An alternative would be to declare skewness significant only if both samples are skewed. For this simulation it was arbitrarily chosen to use the former approach, with asymmetry being concluded if at least one of the samples is declared skewed.

Results

In this section, the performance of the TS procedure is evaluated. The "TS procedure" denotes the results of the test selection procedure using the 5% Triples test for testing symmetry and the 25% Levene test for testing variance homogeneity.


Symmetric Distributions
For each of the two symmetric distributions (i.e., normal and double exponential) as defined in Perry and Stoline (2002), the simulations are conducted for the four selected sample size combinations (n1:n2) = (10:10), (10:20), (10:40), and (20:20). For each of the four sample size combinations, the simulated null rejection rate is generated for the specified ratio R = σ1/σ2. These are: (1) R = 0.25, (2) R = 0.50, (3) R = 1 (equal variance), (4) R = 2.0, and (5) R = 4.0.

The results of the simulations for the two symmetric distributions are combined in Table 1. The proportions of rejections are expressed as a percent for the t test, the Welch test, the Wilcoxon-Mann-Whitney test, and the TS procedure. These proportions are tabulated for each R grouping combined over all (8) combinations of sample size pairs (4) and distributions (2) for the five categories listed below:

1. x ≤ 2.5 (extremely conservative)
2. 2.5 < x ≤ 4.0 (conservative)
3. 4.0 < x ≤ 6.0 (robust)
4. 6.0 < x ≤ 10.0 (liberal)
5. x > 10.0 (extremely liberal)

The value x represents the percentage of rejections for testing Ho: µ1 = µ2 based on 10,000 simulations for each sample size. Each entry in the following tables denotes the frequency at which a < x ≤ b occurs. The outcome of the "test" is defined to be robust if the simulated null rejection rate is > 4.0 and ≤ 6.0.

Equal Variance Cases (R=1)
Table 1 shows, as anticipated, that the t test is robust for the equal variance cases. However, the other procedures are also robust. None of the procedures examined show simulated rejection rates ≤ 4.0% or > 6%.

Unequal Variance Cases
Table 1 shows the t test is extremely conservative in 50% of the simulations for the R = 0.25 and 0.50 cases. The WMW test is liberal for the R = 0.50 cases and can be extremely conservative for both the R = 0.25 and the R = 0.50 cases. The Welch test and the TS procedure are robust for both the R = 0.25 and R = 0.50 cases.

For the R = 2.0 cases the t test is extremely liberal. The WMW test tends to be liberal and can be extremely liberal. The TS procedure is reasonably robust. The Welch test is robust.

For the R = 4.0 cases, the t test and the WMW test are extremely liberal in 50% of the simulations. The Welch test and the TS procedure are reasonably robust.
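The five-way classification above is simply a bucketing of the simulated rejection percentage x. A minimal Python sketch (an illustration, not the original simulation code):

def classify(x):
    # Classify a simulated null rejection rate x (in percent, nominal 5% level)
    # into the five categories used in Tables 1 and 2.
    if x <= 2.5:
        return "extremely conservative"
    if x <= 4.0:
        return "conservative"
    if x <= 6.0:
        return "robust"
    if x <= 10.0:
        return "liberal"
    return "extremely liberal"

print(classify(5.1))  # robust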


Table 1. Summary Of Symmetric Distributions Using TS Procedure: Frequency (%) Of Simulated Null Rejection Rate (%) With Nominal 5% Level.

R    Test    Extremely Conservative (x ≤ 2.5)    Conservative (2.5 < x ≤ 4)    Robust (4 < x ≤ 6)    Liberal (6 < x ≤ 10)    Extremely Liberal (x > 10)

σ1 = σ2 1.00 t 0.0 0.0 100.0 0.0 0.0 W 0.0 0.0 100.0 0.0 0.0 WMW 0.0 0.0 100.0 0.0 0.0 TS 0.0 0.0 100.0 0.0 0.0 ______

σ1 ≠ σ2 0.50 t 50.0 0.0 50.0 0.0 0.0 W 0.0 0.0 100.0 0.0 0.0 WMW 25.0 25.0 0.0 50.0 0.0 TS 0.0 0.0 100.0 0.0 0.0

0.25 t 50.0 0.0 50.0 0.0 0.0 W 0.0 0.0 100.0 0.0 0.0 WMW 25.0 25.0 50.0 0.0 0.0 TS 0.0 0.0 100.0 0.0 0.0

2.0 t 0.0 0.0 50.0 12.5 37.5 W 0.0 0.0 100.0 0.0 0.0 WMW 0.0 0.0 50.0 37.5 12.5 TS 0.0 0.0 75.0 25.0 0.0

4.0 t 0.0 0.0 37.5 12.5 50.0 W 0.0 12.5 87.5 0.0 0.0 WMW 0.0 0.0 0.0 50.0 50.0 TS 0.0 12.5 87.5 0.0 0.0 ______Notes: Table is based on the two symmetric distributions (normal and double exponential) and four sample sizes. W = Welch, WMW = Wilcoxon-Mann-Whitney.

Based on the above simulation results, the lognormal (0, 0.40) distribution, the the Welch test and the TS procedure are coefficient of skewness ranged from 0.3 when R reasonably robust for testing the Ho: µ1 = µ2 for = 0.25 to approximately 9.6 when R = 4.0. For the symmetric cases examined. each value of R within the gamma and lognorma1 case, a skewness ratio has been Results For Asymmetric Distributions calculated. The skewness ratio is the skewness To evaluate the overall performance of of distribution #1 divided by the skewness of the procedures for varying degrees of variance distribution #2 within each gamma and heterogeneity, the results of the simulation for lognormal case. The skewness ratios are the two asymmetric distributions as defined in displayed in Table 2. Perry and Stoline (2002) are combined in Table 2 using the same format as previously defined Equal Variance Cases (R=1) for the symmetric distributions. A summary of the simulated null For the gamma (2,1) distribution the rejection rates for the two asymmetric coefficient of skewness ranged from 0.4 when R distributions for the equal variance cases are = 0.25 to approximately 5.7 when R = 4.0. For presented in Table 2. The WMW test and t test 320 EXAMINATION OF TWO-SAMPLE TESTS OF LOCATION are robust for the R = 1 cases. The Welsh test is robust for approximately 88% of the R = 1 cases. Frequency (%) Each Means Test Is Used The TS procedure tends to be liberal for In addition to the simulated null rejection approximately 38% of these cases. None of the rates, the TS procedure can report the frequency procedures are extremely liberal, extremely (%) at which each of the test procedures is used conservative, or conservative. for a given sample size and R value. Results for the imbalanced case n1 = 10 and n2 = 20, and the Unequal Variance Cases balanced case n1 = n2 = 20 are summarized for the Table 2 shows the Welch test is robust two symmetric distribution cases combined and in approximately 75% of the R = 0.50 cases. The the two asymmetric distribution cases combined. Welch test can be liberal for some R = 0.50 Tables 3 and 4 summarize the frequency cases. The t test is conservative or extremely (%) at which each of the test procedures is used conservative for approximately 50% of the R = for the two symmetric distributions cases 0.50 cases. Furthermore, the t test is liberal in combined, and the two asymmetric cases approximately 38% of the simulations for the R combined, respectively. The format for Tables 3 = 0.50 cases. The WMW test and the TS and 4 is as follows. For each R value, the procedure are liberal or extremely liberal in at frequency at which the t test, the Welch-S test, the approximately 63% and 50%, respectively, for WMW test, and the Welch-AS test was selected the R= 0.50 cases. by the TS procedure is reported. In these tables, For the R = 0.25 cases, none of the test the t test, Welch-S, WMW, and Welch-AS denote procedures are robust. The Welch test and the the following: TS procedure tend to be liberal. The t test is liberal (50%) as well as extremely conservative t test: The t test was used because the TS (50%). The WMW test is liberal or extremely procedure concluded σ1 = σ2 and symmetry was liberal in approximately 88% of the simulations accepted. for the R = 0.25 cases. Table 2 shows all procedures tend to be Welch-S: The Welch test was used liberal or extremely liberal for the R = 2.00 because the TS procedure concluded σ1 ≠ σ2 and cases. 
Furthermore, all procedures are extremely symmetry was accepted. liberal for 100% of the R = 4 cases. In summary for the R = 1 cases, the t test, WMW: The WMW test was used the Welsh test, and the WMW test are robust in at because the TS procedure concluded σ1 = σ2 and least 87% of the simulations. The TS procedure is symmetry was rejected. robust in approximately 63% of the simulations for the R = 1 cases. For the R = 0.50 cases, the Welch-AS: The Welch test was used

Welch test is robust for approximately 75% of the because the TS procedure concluded σ1 ≠ σ2 and simulated cases. For the R = 0.25, 2.0 and 4.0 symmetry was rejected. cases, all procedures tend to be liberal. The degree of liberal bias increases as the degree of variance heterogeneity increases.


Table 2. Summary Of Asymmetric Distributions Using TS Procedure: Frequency (%) Of Simulated Null Rejection Rate With Nominal 5% Level.

R    Skewness Ratio (Gamma, LN)    Test    Extremely Conservative (x ≤ 2.5)    Conservative (2.5 < x ≤ 4)    Robust (4 < x ≤ 6)    Liberal (6 < x ≤ 10)    Extremely Liberal (x > 10)

σ1 = σ2 1.00 1,1 t 0.0 0.0 100.0 0.0 0.0 W 0.0 0.0 87.5 12.5 0.0 WMW 0.0 0.0 100.0 0.0 0.0 TS 0.0 0.0 62.5 37.5 0.0 ______

σ1 ≠ σ2 0.25 0.29,0.23 t 50.0 0.0 0.0 50.0 0.0 W 0.0 0.0 37.5 62.5 0.0 WMW 0.0 0.0 12.5 37.5 50.0 TS 0.0 0.0 37.5 62.5 0.0

0.50 0.50,0.46 t 25.0 25.0 12.5 37.5 0.0 W 0.0 0.0 75.0 25.0 0.0 WMW 0.0 12.5 25.0 50.0 12.5 TS 0.0 0.0 50.0 50.0 0.0

2.0 2.0, 2.39 t 0.0 0.0 0.0 50.0 50.0 W 0.0 0.0 0.0 75.0 25.0 WMW 0.0 0.0 0.0 0.0 100.0 TS 0.0 0.0 0.0 12.5 87.5

4.0 4.04, 7.4 t 0.0 0.0 0.0 0.0 100.0 W 0.0 0.0 0.0 0.0 100.0 WMW 0.0 0.0 0.0 0.0 100.0 TS 0.0 0.0 0.0 0.0 100.0 ______Notes: Table is based on the two asymmetric distributions [lognormal (0, 0.40) & G(2,1)]and four sample sizes. The skewness ratio is the skewness for distribution #1/distribution #2 for each gamma and lognormal case, respectively, at each R value. W = Welch, WMW = Wilcoxon-Mann-Whitney.

for approximately 69% of the simulations. The Symmetric Cases Welch-S test was incorrectly selected for Table 3 contains the frequency (%) at approximately 22% of the simulations when using which each of the test procedures is used in the the TS procedure. The WMW test was incorrectly two symmetric distributions combined for the selected for only 7% of the simulations when balanced and imbalanced cases, respectively. using the TS procedure. For the R = 1.00 case with unequal Equal Variances (Includes the Imbalanced and sample sizes, Table 3 shows that the TS procedure Balanced Cases) selected the t test for 70% of the simulations. The For the R = 1.00 case with equal sample TS procedure incorrectly selected the Welch-S sizes, the t test is known to be robust for the test for nearly 23% of the simulations. However, symmetric distributions. Results in Table 3 show the WMW test was incorrectly selected for less that the TS procedure correctly selected the t test 322 EXAMINATION OF TWO-SAMPLE TESTS OF LOCATION than 6% of the simulations when using the TS approximately 10% of the simulations and procedure. incorrectly concluded asymmetry in approximately 9% of the simulations. Unequal Variances (Includes the Imbalanced and For the R = 0.50 and 2.0 cases with Balanced Cases) unequal sample sizes, Table 3 shows the TS For the R = 0.50 and 2.0 cases with equal procedure correctly selected the Welch-S test for sample sizes, Table 3 shows the TS procedure about 70%-73% of the simulations. The TS correctly selected the Welch-S test for procedure incorrectly selected the t test for about approximately 81% of the simulations. The TS 20%-23% of the simulations. procedure incorrectly selected the t test in

Table 3. Frequency (%) At Which Each Means Test Is Used In The TS Procedure For The Symmetric Distributions.

n1,n2 σ1 , σ2 R t test Welch-S WMW Welch-AS

20,20 σ1 = σ2 1.00 68.91 21.96 7.10 2.04

σ1 ≠ σ2 0.25 0.09 90.78 <0.01 9.13 0.50 10.44 80.43 1.30 7.84 2.00 10.33 80.54 1.28 7.86 4.00 0.07 90.80 0.02 9.12

10,20 σ1 = σ2 1.00 70.30 22.69 5.64 1.38

σ1 ≠ σ2 0.25 0.58 92.41 0.05 7.00 0.50 20.02 72.97 2.06 4.96 2.00 22.76 70.23 1.92 5.10 4.00 0.97 92.02 0.15 6.87

For the R = 0.25, and 4.0 symmetric tions with unequal sample sizes. For the R = 0.25 cases, the Welch test is known to be robust. Table and 4.0 cases, regardless of sample size 3 shows the TS procedure correctly used the configuration, the TS procedure correctly used the Welch-S test for about 90%-92% of the Welch-S test for about 90%-92% of the simulations regardless of the sample size simulations. It is noted for the R ≠ 1 cases, the TS configurations. The Welch-AS test was procedure incorrectly concluded asymmetry for incorrectly used for about 7-9% of the simulations about 7%-9% of the simulations. for each of these same cases. In summary, for the combined symmetric Asymmetric Cases cases, the TS procedure correctly selected the t Table 4 contains the frequency (%) at test for approximately 70% of the R = 1 cases which each of the test procedures is used in the regardless of the sample size configuration. For two asymmetric distributions combined for the the R = 0.50 and 2.0 cases, the TS procedure balanced and imbalanced cases, respectively. correctly selected the Welch-S test for approximately 81% of the simulations with equal sample sizes and about 70% - 73% of the simula- KIMBERLY T. PERRY 323

Equal Variances (Includes the Imbalanced and Results in Table 4 for the unequal sample Balanced Cases) size case show that the TS procedure correctly For the R = 1 case with equal sample used the Welch-AS test for approximately 35% of sizes, the WMW test is known to be robust for the the R = 0.25 cases, whereas the WMW test and asymmetric distributions. Results in Table 4 the Welch-S test were each incorrectly selected shows the TS procedure correctly selected the for about 20% and 65%, respectively, of the R = WMW test for approximately 42% of the 0.25 cases. simulations. The TS procedure incorrectly The TS procedure incorrectly used the selected the Welch-AS test in approximately 12% WMW test for approximately 43% of the R = 4.0 of the simulations with homogeneous variances. cases and the Welch-AS test was correctly used The t test was incorrectly selected by the TS for about 52% of the R = 4.0 equal sample size. procedure in approximately 33% of the For the R= 4.0 unequal sample size cases, the TS simulations. procedure incorrectly used the WMW test for For the R = 1 cases with unequal sample approximately 32% of the simulations and the sizes, Table 4 shows the TS procedure correctly Welsh-AS test was correctly used for selected the WMW test for approximately 31% of approximately 43% of the simulations. the simulations. As also seen for the balanced In summary, for the R = 1 cases sample size cases, the TS procedure incorrectly regardless of the sample size configuration, the TS selected the Welch-AS test in approximately 8% procedure used the WMW test correctly for about of these cases. In addition, the t test was 31%-42% of the simulations. For the R = 0.50 incorrectly selected by the TS procedure in cases, the WMW test was incorrectly selected for approximately 45% of the simulations. about 6%-10% of the simulations when using the TS procedure. The TS procedure generally Unequal Variances (Imbalanced and Balanced correctly used the Welch-AS test for about 35%- Cases) 37% of the 0.25 cases. For the R= 2.0 cases, the For the equal sample size cases, Table 4 TS procedure selected the Welsh-AS test correctly shows the TS procedure incorrectly selected the for about 25%-47% of the simulations and the Welsh-S in approximately 50% of the R=0.50 WMW test incorrectly for about 28%-36% of the cases and approximately 10% of the R=2.0 cases. simulations. The TS procedure selected the Furthermore, the TS procedure incorrectly Welch-AS test correctly for about 43%-52% of selected the WMW test in approximately 6% of the simulations and the WMW test incorrectly the R = 0.50 cases and approximately 36% in the each for about 32%-43% of the simulations for the R = 2.0 cases. The Welch-AS test was correctly R= 4.0 cases. selected for approximately 35% and 47% of the R = 0.50 and 2.0 cases, respectively, when using the Summary of the TS Procedure Using an Alpha TS procedure. Level of 5% of the Triple’s Test For the R = 0.50 and 2.0 cases with For the cases where variance imbalanced sample sizes, results in Table 4 shows homogeneity and symmetry each are unknown to the same trends as was seen for the equal sample the practicing statistician, an overall test using the size cases. 
The TS procedure incorrectly used the TS procedure yielded improved results with WMW test for approximately 10% of the R = 0.50 respect to robustness over using the t test or the and approximately 28% in the R = 2.0 cases; and Wilcoxon-Mann-Whitney test alone, except for correctly selected the Welch-AS test for about 25- the asymmetry unequal variance cases, where no 26% of the R = 0.50 and 2.0 cases. method maintained the stated Type I error rate. Results in Table 4 shows for the balanced The Welch test is recommended as a robust test case that the TS procedure correctly selected the for testing Ho: µ1 = µ2 for the symmetric cases Welch-AS test for approximately 37% of the R = examined. The TS procedure is also reasonably 0.25 cases. The WMW test was incorrectly used robust. for about 2% of the R = 0.25 cases.


Table 4. Frequency (%) At Which Each Means Test Is Used In The TS Procedure For The Asymmetric Distributions.

n1,n2 σ1 , σ2 R t test Welch-S WMW Welch-AS

20,20 σ1 = σ2 1.00 32.98 12.43 42.22 12.38

σ1 ≠ σ2 0.25 0.02 63.31 0.02 36.66 0.50 9.47 49.95 5.90 34.69 2.00 6.88 9.81 36.21 47.11 4.00 1.79 2.63 43.26 52.33

10,20 σ1 = σ2 1.00 45.02 16.59 30.50 7.90

σ1 ≠ σ2 0.25 0.43 64.58 0.20 34.79 0.50 17.73 46.61 9.79 25.88 2.00 21.73 24.83 28.32 25.13 4.00 9.74 15.74 31.55 42.98

The performance of the TS procedure was of an TS procedure using the Triples test for also evaluated by the frequency at which the TS testing of symmetry at a higher significance level procedure selected the most appropriate test of such as α = 0.25. means. For the symmetric equal variance cases, the TS procedure correctly selected the t test for Further Investigation of the TS Procedure Using approximately 70% of the simulated. For the an Alpha Level of 25% for the Test of Symmetry symmetric cases with unequal variances (R = As the results above showed that the TS 0.25, 0.50, 2.0, and 4.0), the frequency at which procedure was concluding symmetry too often, the Welch test was correctly selected was about the simulations were repeated using the TS 70%-92% for the TS procedure. Asymmetry was procedure with the alpha level set at 25% for the incorrectly concluded for about 7%-9% of the Triples test. To compare the TS procedure using simulated symmetric cases when using the TS the Triples test at alpha level 25% versus 5%, only procedure. the results of the frequency (%) at which each The TS procedure correctly concluded means test is used are displayed. asymmetry for about 35%-96% of the simulated Tables 5 and 6 summarize the frequency cases for the families of asymmetric distributions (%) at which each of the test procedures is used examined. For the asymmetric equal variance for the two symmetric distributions cases cases, the TS procedure correctly selected the combined, and the two asymmetric cases Wilcoxon-Mann-Whitney test for about 31%-42% combined, respectively. The format for Tables 5 of the simulations. For the asymmetric cases with and 6 is the same as described above in section unequal variances, the TS procedure correctly “Frequency (%) at Which Each Mean Test is concluded asymmetry and variance heterogeneity Used.” for about 25%-52% of the simulations. Results showed that the TS procedure Frequency (%) Each Means Test is Used For concluded symmetry too often (for 45%-62% of Symmetric Cases the asymmetric cases with equal variances). Table 5 contains the frequency (%) at Since the TS procedure examined in this which each of the test procedures is used in the simulation study concluded symmetry too often, it two symmetric distributions combined for the would be of interest to examine the performance balanced and imbalanced cases, respectively. KIMBERLY T. PERRY 325

Equal Variances (Imbalanced and Balanced Unequal Variances (Imbalanced and Balanced Cases) Cases) For the R = 1.00 case with equal sample For the R = 0.50 and 2.0 cases with equal sizes, the t test is known to be robust for the sample sizes, Table 5 shows the TS procedure symmetric distributions. Results in Table 5 show correctly selected the Welch-S test for that the TS procedure correctly selected the t test approximately 58% of the simulations. The TS for approximately 46% of the simulations. The procedure incorrectly selected the t test in Welch-S test was incorrectly selected for approximately 2% of the simulations and approximately 14% of the simulations when using incorrectly concluded asymmetry in the TS procedure. The WMW test was incorrectly approximately 40% of the simulations. selected for only 32% of the simulations when For the R = 0.50 and 2.0 cases with using the TS procedure. unequal sample sizes, Table 5 shows the TS For the R = 1.00 case with unequal procedure correctly selected the Welch-S test for sample sizes, Table 5 shows that the TS procedure about 54%-57% of the simulations. The TS selected the t test for approximately 48% of the procedure incorrectly selected the t test for about simulations. The TS procedure incorrectly 6%-9% of the simulations. selected the Welch-S test for approximately 15% For the R = 0.25, and 4.0 symmetric of the simulations. However, the WMW test was cases, the Welch test is known to be robust. Table incorrectly selected for about 30% of the 5 shows the TS procedure correctly used the simulations when using the TS procedure. Welch-S test for about 60%-63% of the simulations regardless of the sample size configurations. The Welch-AS test was incorrectly used for about 37%-40% of the simulations for each of these same cases.

Table 5. Frequency (%) At Which Each Means Test Is Used In The TS Procedure For The Symmetric Distributions.

n1,n2 σ1 , σ2 R t test Welch-S WMW Welch-AS

20,20 σ1 = σ2 1.00 46.38 13.77 32.42 7.44

σ1 ≠ σ2 0.25 0.00 60.15 0.01 39.85 0.50 2.38 57.77 1.78 38.08 2.00 2.46 57.69 1.67 38.19 4.00 0.00 60.15 0.00 39.81

10,20 σ1 = σ2 1.00 47.89 15.14 29.97 7.01

σ1 ≠ σ2 0.25 0.02 63.00 0.01 36.98 0.50 6.24 56.78 4.31 32.68 2.00 9.25 53.77 6.22 30.76 4.00 0.11 62.92 0.07 36.91


In summary, for the combined symmetric the TS procedure correctly selected the WMW cases, the TS procedure correctly selected the t test for approximately 67% of the simulations. test for approximately 47% of the R = 1 cases The TS procedure incorrectly selected the Welch- regardless of the sample size configuration. For AS test in approximately 22% of the simulations the R = 0.50 and 2.0 cases, the TS procedure with homogeneous variances. The t test was correctly selected the Welch-S test for incorrectly selected by the TS procedure in approximately 58% of the simulations with equal approximately 8% of the simulations. sample sizes and about 54%-57% of the For the R = 1 cases with unequal sample simulations with unequal sample sizes. For the R sizes, Table 6 shows the TS procedure correctly = 0.25 and 4.0 cases, regardless of sample size selected the WMW test for approximately 60% of configuration, the TS procedure correctly used the the simulations. As also seen for the balanced Welch-S test for about 60%-63% of the sample size cases, the TS procedure incorrectly simulations. It is noted for the R ≠ 1 cases, the TS selected the Welch-AS test in approximately 19% procedure incorrectly concluded asymmetry for of these cases. In addition, the t test was about 37%-40% of the simulations. incorrectly selected by the TS procedure in approximately 15% of the simulations. Frequency (%) Each Means Test is Used For Asymmetric Cases Unequal Variances (Imbalanced and Balanced Table 6 contains the frequency (%) at Cases) which each of the test procedures is used in the For the equal sample size cases, Table 6 two asymmetric distributions combined for the shows the TS procedure incorrectly selected the balanced and imbalanced cases, respectively. Welsh-S in approximately 25% of the R=0.25 cases. Furthermore, the TS procedure incorrectly Equal Variances (Imbalanced and Balanced selected the WMW test in approximately 12% of Cases) the R = 0.50 cases and approximately 43% in the For the R = 1 case with equal sample R = 2.0 cases. The Welch-AS test was correctly sizes, the WMW test is known to be robust for the selected for approximately 67% and 55% of the R asymmetric distributions. Results in Table 6 show = 0.50 and 2.0 cases, respectively, when using the TS procedure.

Table 6. Frequency (%) For Means Test In The TS Procedure For The Asymmetric Distributions.

n1,n2 σ1 , σ2 R t test Welch-S WMW Welch-AS

20,20 σ1 = σ2 1.00 8.05 3.48 66.95 21.53

σ1 ≠ σ2 0.25 0.01 24.85 0.03 75.12 0.50 3.43 17.52 11.85 67.17 2.00 1.16 1.49 42.72 54.65 4.00 0.16 0.22 45.08 54.55

10,20 σ1 = σ2 1.00 14.78 6.34 60.04 18.85

σ1 ≠ σ2 0.25 0.22 27.49 0.40 71.90 0.50 7.15 18.80 20.83 53.23 2.00 5.36 6.27 44.27 44.11 4.00 1.88 3.05 39.18 55.90 KIMBERLY T. PERRY 327

For the R = 0.50 and 2.0 cases with imbalanced sample sizes, results in Table 6 show the same trends as were seen for the equal sample size cases. The TS procedure incorrectly used the WMW test for approximately 21% of the R = 0.50 cases and approximately 44% of the R = 2.0 cases, and correctly selected the Welch-AS test for about 44%-53% of the R = 0.50 and 2.0 cases.

Results in Table 6 show for the balanced case that the TS procedure correctly selected the Welch-AS test for approximately 75% of the R = 0.25 cases. The Welch-S test was incorrectly used for about 25% of the R = 0.25 cases. Results in Table 6 for the unequal sample size case show that the TS procedure correctly used the Welch-AS test for approximately 72% of the R = 0.25 cases, whereas the Welch-S test was incorrectly selected for about 27% of the R = 0.25 cases.

The TS procedure incorrectly used the WMW test for approximately 45% of the R = 4.0 cases, and the Welch-AS test was correctly used for about 55% of the R = 4.0 equal sample size cases. For the R = 4.0 unequal sample size cases, the TS procedure incorrectly used the WMW test for approximately 39% of the simulations and the Welch-AS test was correctly used for approximately 56% of the simulations.

In summary, for the R = 1 cases, regardless of the sample size configuration, the TS procedure used the WMW test correctly for about 60%-67% of the simulations. For the R = 0.50 cases, the WMW test was incorrectly selected for about 12%-21% of the simulations when using the TS procedure. The TS procedure generally correctly used the Welch-AS test for about 72%-75% of the R = 0.25 cases. For the R = 2.0 cases, the TS procedure selected the Welch-AS test correctly for about 44%-55% of the simulations and the WMW test incorrectly for about 43%-44% of the simulations. The TS procedure selected the Welch-AS test correctly for about 55%-56% of the simulations and the WMW test incorrectly for about 39%-45% of the simulations for the R = 4.0 cases.

Conclusion

For the TS procedure using the Triples test with an alpha level of 5%, results showed that the TS procedure concluded symmetry too often (for 45%-62% of the asymmetric cases with equal variances). For the TS procedure using the Triples test at an alpha level of 25%, results showed that the TS procedure concluded asymmetry for the symmetric distributions in 37%-40% of the R ≠ 1 cases.

A recommended alternative approach for the future would be to examine the performance of a TS procedure that concludes asymmetry at an alpha level between 5% and 25% (e.g., 15%), or that concludes asymmetry only if both samples were judged to be nonsymmetric at α = 0.25. In addition, there was a trend, especially in the asymmetric distributions, of concluding variance homogeneity too often for the R ≠ 1 cases. Therefore, it would be recommended to increase the alpha level for testing variance homogeneity beyond α = 0.25.

References

Brown, M. B., & Forsythe, A. B. (1974, June). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364-367.

Brown, M. B., & Forsythe, A. B. (1974b). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129-132.

Cabilio, P., & Masaro, J. (1996). A simple test of symmetry about an unknown median. The Canadian Journal of Statistics, 24(3), 349-361.

Conover, W. J., Johnson, M. E., & Johnson, M. M. (1981). A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics, 23(4), 351-361.

Gans, D. J. (1981). Use of a preliminary test in comparing two sample means. Communication in Statistics B, Simulation & Computation, B10(2), 163-174.

Gupta, M. K. (1967). An asymptotically nonparametric test of symmetry. Annals of Mathematical Statistics, 38, 849-866.

IMSL. (1989, January). Math/Library User's Manual (Version 1.1). Houston, Texas: Author.

IMSL. (1989, December). Math/Library User's Manual (Version 1.1). Houston, Texas: Author.

Loh, W. (1987, May 21). Some modifications of Levene's test of variance homogeneity. Journal of Statistical Computation & Simulation, 28, 213-226.

Murphy, B. P. (1976). Comparison of some two sample means tests by simulation. Communication in Statistics B, Simulation and Computation, B5(1), 23-32.

O'Brien, R. G. (1979). A general ANOVA method for robust tests of additive models for variances. Journal of the American Statistical Association, 74, 877-880.

Olejnik, S. (1987). Conditional ANOVA for mean differences when population variances are unknown. Journal of Experimental Education, 55, 141-148.

Perry, K. T., & Stoline, M. R. (2002). A comparison of the D'Agostino SU test to the Triples test for testing of symmetry versus asymmetry as a preliminary test to testing the equality of means. Journal of Modern Applied Statistical Methods, 1(2), 316-325.

Randles, R. H., Fligner, M. A., Policello II, G. E., & Wolfe, D. A. (1980, March). An asymptotically distribution-free test for symmetry versus asymmetry. Journal of the American Statistical Association, 75(369), 168-172.

Snedecor, G. W., & Cochran, W. G. (1967). Statistical methods. Ames, Iowa: The Iowa State University Press.

Zimmerman, D. W., & Williams, R. H. (1989). Power comparisons of the student t-test and two approximations when variances and sample sizes are unequal. Journal of Indian Society Agricultural Statistics, 41(2), 206-217.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 329-340 1538 – 9472/03/$30.00

A Comparison Of Equivalence Testing In Combination With Hypothesis Testing And Effect Sizes

Christopher J. Mecklin Department of Mathematics and Statistics Murray State University

Equivalence testing, an alternative to testing for statistical significance, is little used in educational research. Equivalence testing is useful in situations where the researcher wishes to show that two means are not significantly different. A simulation study assessed the relationships between effect size, sample size, statistical significance, and statistical equivalence.

Key words: Equivalence testing, statistical significance, effect size

Introduction The references included here are by no means close to an exhaustive list. This debate is The use of statistical inference, particularly via not limited to educational research and the social null hypothesis significance testing, is an sciences; for instance, it is also being argued in extremely common but contentious practice in ecology (McBride, 1999; Anderson, Burnham, educational research. Both the pros and the cons & Thompson, 2000). Many in the statistical of hypothesis testing have been argued in the community outside of the niche of educational literature for several decades. A recent and psychological research, though, are either monograph edited by Harlow, Mulaik, and unaware of this debate or feel that it is trivial Steiger (1997) was devoted to these arguments. (Krantz, 1999). Some classic references criticizing standard The objective of this paper is not to hypothesis testing include Boring (1919), continue this heated argument, but rather to Berkson (1938, 1942), Rozeboom (1960), Meehl borrow the method of equivalence testing from (1967, 1978), and Carver (1978). More recently, biostatistics, as suggested by Bartko (1991), and some support the continued usage of using it in conjunction with standard hypothesis significance testing (Abelson, 1997; Hagan, testing in educational research. Lehmann (1959) 1997, 1998; Harris, 1997; McLean & Ernest, anticipated the need for interval testing in his 1998), while others desire a greater reliance on classic volume on the theory of hypothesis alternatives such as confidence intervals or testing. Many of the currently employed effect sizes (Cohen, 1992, 1994; Knapp, 1998, methods of equivalence testing were developed 2002; Meehl, 1997; Serlin, 2002; Thompson, in the 1970’s and 1980’s to address biostatistical 1998, 2001; Vacha-Haase, 2001), and still others and pharmaceutical problems (Westlake, 1976, advocate an outright ban on significance testing 1979; Schuirmann, 1981, 1987; Anderson & (Carver, 1993; Falk, 1998; Hunter, 1997; Nix & Hauck, 1983; Patel & Gupta, 1984). Rogers, Barnette, 1998; Schmidt & Hunter, 1997). Howard, and Vessey (1993) introduced the use of equivalence testing methods to the social sciences. Serlin (1993) essentially suggested Christopher Mecklin is an Assistant Professor of equivalence testing when he suggested the use of Mathematics & Statistics. His research interests range, rather than point, null hypotheses. include goodness-of-fit, educational statistics and statistical ecology. He enjoys working with Methodology faculty and students from various disciplines. Email:[email protected]. Standard null hypothesis significance testing dates back to the pioneering theoretical work of


Fisher, Neyman, and Pearson. Hypothesis testing can be found in almost every textbook of statistical methods and thus will not be further elaborated on here. Equivalence testing, on the other hand, is a newer technique and one that is unfamiliar to most researchers in education and the social sciences.

Equivalence testing was developed in biostatistics to address the situation where the goal is not to show that the mean of one group is greater than the mean of another group (i.e., the superiority of one treatment to another), but rather to establish that two methods are equal to one another. A common application of this idea in biostatistics is to show that a less expensive "generic" medication is as effective as the more expensive "brand-name" medication. In equivalence testing, the null hypothesis is that the two groups are not equivalent to one another, and hence rejection of the null indicates that the two groups are equivalent. This differs from standard significance testing, where the null hypothesis states that the group means are equal and rejection of the null indicates that the two groups are statistically different. A common methodological mistake in research is to conclude that the null hypothesis is true (i.e., two groups have equal means) based on the failure to reject it. This action fails to recognize that the failure to reject the null is often merely a Type II error, especially when the sample sizes are small and the power of the test is low.

An explanation of the theory of equivalence testing can be found in Berger and Hsu (1996); Blair and Cole (2002) give a less technical explanation. Here, we will merely review the most commonly implemented method used for establishing the equivalence of two population means for an additive model, where the difference of means is considered. The multiplicative model, which looks at the ratio of means, will not be considered further in this paper.

The commonly used procedure in biostatistics for this problem is the "two one-sided tests" procedure, or TOST (Westlake, 1976, 1979; Schuirmann, 1981, 1987). With the TOST, the researcher will consider two groups equivalent if he can show that they differ by less than some constant τ, the equivalence bound, in both directions. The constant τ is often chosen to be a percentage (such as 10% or 20%) of the mean of the control group, although τ can also be chosen to be a constant that is the smallest absolute difference between two means that is large enough to be practically important.

The null hypothesis (i.e., the means are different) for the TOST is H0: |µ1 - µ2| ≥ τ. The alternative hypothesis (i.e., the means are equivalent) is H1: |µ1 - µ2| < τ. The first one-sided test seeks to reject the null hypothesis that the difference between the two means is less than or equal to -τ; similarly, the second one-sided test seeks to reject the null hypothesis that the difference in the means is greater than or equal to τ. If the one-sided test with the larger p-value leads to rejection, then the two groups are considered to be equivalent.

For the first one-sided test, we compute the test statistic

t1 = (x̄1 - x̄2 + τ) / [sp √(1/n1 + 1/n2)]

where sp is the pooled standard deviation of the two samples, and compute the p-value as

p1 = P(tν > t1)

where tν is a random variable from the t-distribution with ν = n1 + n2 - 2 degrees of freedom.

The second one-sided test is similar to the first. The test statistic is

t2 = (x̄1 - x̄2 - τ) / [sp √(1/n1 + 1/n2)]

and the p-value is

p2 = P(tν < t2).

If we let p = max(p1, p2), then the null hypothesis of nonequivalence is rejected if p < α.

The choice of τ is a difficult one that is up to the researcher. This choice is analogous to the selection of an appropriate alpha level in standard significance testing, an appropriate level of confidence in interval estimation, or a sufficiently large effect size, and should be made carefully. Knowledge of the situation at hand should be used to specify the maximum difference between population means that would be considered clinically trivial. Researchers in biostatistics typically have the choice made for them by government regulation.

As in standard hypothesis testing, an equivalency confidence interval can also be constructed. If the entire confidence interval is within (-τ, τ), then equivalence between the groups is indicated. If the entire confidence interval is within either (-τ, 0) or (0, τ) (i.e., zero is not in the interval), then we would reject the null hypotheses of both a significance and an equivalence test. In that case, we could make the somewhat discomforting conclusion that the difference of means was both statistically significant and equivalent.

It is important to note that the equivalency confidence interval is expressed at the 100(1 - 2α)% level of confidence. Rogers et al. (1993) noted that if one performs both a standard significance test and an equivalence test on the same data set, making either a "reject" or "fail to reject" decision, there are four possibilities. These four conditions are given in Table 1.

Table 1. Possible Combinations of Significance and Equivalence Testing

Significance Test   Equivalence Test   Term
Fail to reject      Reject             Equivalent
Reject              Reject             Equivalent and Different
Reject              Fail to reject     Different
Fail to reject      Fail to reject     Equivocal
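The TOST just described is easy to carry out from summary statistics. The following R sketch implements the two one-sided tests exactly as defined in the preceding section; the function name, its interface, the default α, and the illustrative call are ours, not the author's.

tost_two_mean <- function(m1, s1, n1, m2, s2, n2, tau, alpha = 0.05) {
  # pooled standard deviation of the two samples
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  se <- sp * sqrt(1 / n1 + 1 / n2)
  df <- n1 + n2 - 2
  t1 <- (m1 - m2 + tau) / se            # first one-sided test statistic
  t2 <- (m1 - m2 - tau) / se            # second one-sided test statistic
  p1 <- pt(t1, df, lower.tail = FALSE)  # p1 = P(t_df > t1)
  p2 <- pt(t2, df, lower.tail = TRUE)   # p2 = P(t_df < t2)
  p  <- max(p1, p2)                     # reject nonequivalence if p < alpha
  list(t1 = t1, t2 = t2, p.value = p, equivalent = (p < alpha))
}
# Illustrative call with made-up summary statistics (not data from this article):
tost_two_mean(m1 = 10.1, s1 = 2.0, n1 = 50, m2 = 10.4, s2 = 2.2, n2 = 50, tau = 1)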

The second condition, "equivalent and different", a simultaneous rejection of both inferential procedures, could happen in a situation where large samples provide "too much power", resulting in a trivial difference in means being statistically significant. The equivalence test (and the effect size) should detect the small magnitude of these mean differences. The fourth condition indicates that there is insufficient evidence to conclude that the groups are either equivalent or different. This would most likely occur when the samples are very small and/or the group variances are very large.

The effect size for the difference of means is the standardized difference between the groups (Fan, 2001). We will use the parameter

\delta = (\mu_1 - \mu_2) / \sigma

to represent the effect size of the population, where \mu_1 and \mu_2 are the population means and \sigma^2 is the common variance. Of course, δ is typically unknown and needs to be estimated. Cohen's d (1988) is a statistic often used for this purpose. The effect size (ES) is found with

d = (\bar{x}_1 - \bar{x}_2) / s_{pooled},

where

s_{pooled} = \sqrt{ [(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2] / (n_1 + n_2 - 2) }

is the pooled standard deviation of the two samples. We stress that Cohen's d is a sample statistic and has a sampling distribution like other estimates.

Cohen (1988) gave some suggestions for interpreting d. An effect size of d = 0.2 is deemed "small", d = 0.5 is "medium", and d = 0.8 is "large". It is becoming, rather regrettably in our opinion, common for researchers to rigidly apply Cohen's suggestions. Absolute reliance on Cohen's rule of thumb is as misguided as blind adherence to a particular level of significance (e.g., α = .05).
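A small R sketch of this computation (an illustrative helper of our own, not code from the article):

cohens_d <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  # pooled standard deviation, as defined above
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / s_pooled
}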

As Thompson (2001) said, "we would merely be being stupid in another metric."

Results

Rogers et al. (1993) provided empirical examples of the application of equivalence testing on data from the psychological literature. We will do the same with an example from the educational research literature. This will demonstrate that there often exist situations where a statistically significant difference between groups coincides with the groups being statistically equivalent. This is the "equivalent and different" condition that is typically associated with a small to moderate effect size, as opposed to the strong effect sizes that typically occur with the "different" condition and the weak effect sizes that occur with the "equivalent" condition.

Benson (1989), in a study concerning statistical test anxiety, presented means and variances for a sample of 94 males and 123 females on seven variables. Using standard hypothesis testing methods (i.e., t-tests), significant group differences were found for: prior math courses, math self-concept, self-efficacy, and statistical test anxiety. However, after calculating Cohen's d as an effect size (ES) measure and using the TOST equivalence test, we see that only prior math courses and statistical test anxiety are "different" between males and females. Not surprisingly, the two largest effect sizes are found for these two variables. Table 2 shows results of both traditional significance and equivalence tests for the Benson data.

Statistical significance was defined as a rejection of H_0 with α = .05, and equivalence was defined as a rejection of H_0 with α = .10. The reason for the two different significance levels is that a traditional significance test at level α corresponds to a 100(1 - α)% confidence interval, whereas an equivalence test at level α corresponds to a 100(1 - 2α)% equivalence interval. We selected τ = 0.2 (i.e., 20% of the mean of the female group). This choice was arbitrary and by no means should be taken as a choice recommended for all equivalence problems. The results could differ with different choices of τ.

Table 2. Comparing Significance and Equivalence Testing for the Benson Data

Descriptive Statistics

Variable                   Males (N=94)       Females (N=123)     Effect   Sig.      Equiv.    Category
                           M        SD        M        SD         Size     p-value   p-value
GPA                        3.05     0.44      3.16     0.47       -0.24    0.040     <0.001    Equiv. & Diff.
Prior Math Courses         3.45     2.14      2.20     2.01       0.60     <0.001    0.998     Different
Math Self-Concept          25.77    5.96      23.20    7.05       0.39     0.002     0.012     Equiv. & Diff.
Self-efficacy              12.68    1.77      11.62    2.30       0.51     <0.001    <0.001    Equiv. & Diff.
General Test Anxiety       36.38    0.49      40.62    12.25      -0.37    0.004     0.007     Equiv. & Diff.
Achievement                32.56    5.68      32.26    7.55       0.04     0.374     <0.001    Equivalent
Statistical Test Anxiety   32.65    12.57     41.84    14.83      -0.66    <0.001    0.663     Different


For a test of statistical significance, power is the probability of rejecting the null hypothesis that the population means are equal when they are in fact not equal. The power of an equivalence test is the probability of rejecting that the means are different by at least some equivalence bound τ when the means are in fact equivalent (i.e., differ by less than τ). Of interest to us is the probability of rejecting both of the null hypotheses (of non-significance and non-equivalence) simultaneously. We designed a small simulation study to assess the power of simultaneously concluding that two means are both statistically different and equivalent.

As is always the case with Monte Carlo studies, the choices of simulation parameters are difficult to make and are somewhat arbitrary. We endeavored to simulate situations that were likely to be encountered in actual quantitative data analysis. We also made some simplifying assumptions to keep the number of simulations and associated tables and figures to a reasonable level.

We assumed that both of our populations were always normally distributed with a common variance σ² = 1. Six different sample sizes per group (n = 10, 20, 50, 100, 200, 500) were chosen; only equally sized groups were used in this study. Six different values of the effect size parameter (δ = 0, 0.1, 0.2, 0.3, 0.4, 0.5) were used, reflecting situations from no effect (i.e., equivalent population means) to a "medium" effect size (i.e., population means that differ by one half of a standard deviation). Three different equivalence bounds (τ = 0.1, 0.2, 0.4) were used, defining the minimum difference between means that is practically important (i.e., non-equivalent) to be 10%, 20%, or 40% of µ₁.

Hence, we have a fully crossed design with 6 x 6 x 3 = 108 cells. Within each cell (i.e., combination of sample size, effect size, and equivalence bound), 10000 simulations were run. The R statistical computing environment was used to conduct the simulations. Each simulation consisted of generating n random normal variates with mean 0 + δ and variance 1 and a second, independent set of n random normal variates with mean 0 and variance 1. The independent samples t-test and the TOST with equivalence bound τ were conducted for each simulation, and the number of rejections of each test, along with the number of simultaneous rejections of both procedures and the number of failures to reject either procedure, were noted.

Tables 3 through 8 show the number of rejections of the null hypotheses of the equivalence test, both tests, the significance test, and neither test. Note that the power of the equivalence test for each situation can be found by dividing the sum of the "Equivalent" and "Both" columns by 10000 (these columns were set in italics in the original tables). Similarly, the power of the significance test is obtained by dividing the sum of the "Both" and "Different" columns (set in boldface in the original) by 10000.
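The article states that the simulations were run in R. The following is a minimal sketch, under our own assumptions, of what one cell of that design could look like; it is not the author's code, and the significance and equivalence alpha levels used here (0.05 for both) are assumptions, since the text does not state them.

simulate_cell <- function(n, delta, tau, alpha_sig = 0.05, alpha_eq = 0.05,
                          nsim = 10000) {
  both <- equiv_only <- diff_only <- neither <- 0
  for (i in seq_len(nsim)) {
    x <- rnorm(n, mean = delta, sd = 1)
    y <- rnorm(n, mean = 0,     sd = 1)
    # pooled-variance t-test for a difference in means
    sig <- t.test(x, y, var.equal = TRUE)$p.value < alpha_sig
    # TOST for equivalence, following the formulas in the Methodology section
    sp <- sqrt(((n - 1) * var(x) + (n - 1) * var(y)) / (2 * n - 2))
    se <- sp * sqrt(2 / n)
    df <- 2 * n - 2
    p1 <- pt((mean(x) - mean(y) + tau) / se, df, lower.tail = FALSE)
    p2 <- pt((mean(x) - mean(y) - tau) / se, df, lower.tail = TRUE)
    eq <- max(p1, p2) < alpha_eq
    if (sig && eq)      both       <- both + 1
    else if (eq)        equiv_only <- equiv_only + 1
    else if (sig)       diff_only  <- diff_only + 1
    else                neither    <- neither + 1
  }
  c(Equivalent = equiv_only, Both = both, Different = diff_only, Neither = neither)
}
# Example cell (illustrative): n = 100 per group, delta = 0.1, tau = 0.2
simulate_cell(100, 0.1, 0.2, nsim = 1000)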


Table 3. Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size δ = 0

Number of Rejections (10000 Simulations)

Equivalence Bound τ   N     Equivalent   Both   Different   Neither
0.1                   10    0            0      506         9494
0.1                   20    0            0      500         9500
0.1                   50    0            0      476         9524
0.1                   100   0            0      535         9465
0.1                   200   0            0      504         9496
0.1                   500   2337         0      511         7152
0.2                   10    0            0      496         9504
0.2                   20    0            0      507         9493
0.2                   50    0            0      485         9515
0.2                   100   1063         0      546         8391
0.2                   200   5121         0      514         4365
0.2                   500   9386         3      490         121
0.4                   10    10           0      486         9504
0.4                   20    370          0      469         9161
0.4                   50    5279         0      481         4240
0.4                   100   8757         0      457         786
0.4                   200   9493         444    63          0
0.4                   500   9483         517    0           0
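As a small worked example of the power calculation described before Table 3, take the τ = 0.2, N = 200 row of Table 3:

# Power of the equivalence test for that cell: (Equivalent + Both) / 10000
(5121 + 0) / 10000    # = 0.5121
# Power of the significance test for the same cell: (Both + Different) / 10000
(0 + 514) / 10000     # = 0.0514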

Table 4. Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size δ = 0.1

Number of Rejections (10000 Simulations)

Equivalence Bound τ   N     Equivalent   Both   Different   Neither
0.1                   10    0            0      535         9494
0.1                   20    0            0      606         9500
0.1                   50    0            0      817         9524
0.1                   100   0            0      1118        9465
0.1                   200   0            0      1652        9496
0.1                   500   709          0      3366        7152
0.2                   10    0            0      521         9504
0.2                   20    0            0      605         9493
0.2                   50    1            0      786         9515
0.2                   100   793          0      1090        8391
0.2                   200   3452         0      1687        4365
0.2                   500   6192         15     3486        121
0.4                   10    11           0      565         9424
0.4                   20    347          0      622         9031
0.4                   50    4759         0      772         4469
0.4                   100   7902         0      1044        1054
0.4                   200   8361         1196   443         0
0.4                   500   6521         3475   4           0


Table 5. Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size δ = 0.2

Number of Rejections (10000 Simulations)

Equivalence Bound τ   N     Equivalent   Both   Different   Neither
0.1                   10    0            0      727         9273
0.1                   20    0            0      962         9038
0.1                   50    0            0      1727        8273
0.1                   100   0            0      2865        7135
0.1                   200   0            0      5193        4807
0.1                   500   16           0      8880        1104
0.2                   10    0            0      699         9301
0.2                   20    0            0      950         9050
0.2                   50    0            0      1678        8322
0.2                   100   408          0      2908        6684
0.2                   200   951          0      5207        3842
0.2                   500   915          7      8924        154
0.4                   10    8            0      734         9258
0.4                   20    296          0      967         8737
0.4                   50    3397         0      1677        4926
0.4                   100   5485         0      2890        1625
0.4                   200   4886         2800   2314        0
0.4                   500   1167         8534   299         0

Table 6. Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size δ = 0.3

Number of Rejections (10000 Simulations)

Equivalence Bound τ   N     Equivalent   Both   Different   Neither
0.1                   10    0            0      947         9053
0.1                   20    0            0      1540        8460
0.1                   50    0            0      3144        6856
0.1                   100   0            0      5594        4406
0.1                   200   0            0      8482        1518
0.1                   500   0            0      9973        27
0.2                   10    0            0      985         9015
0.2                   20    0            0      1501        8499
0.2                   50    0            0      3203        6797
0.2                   100   104          0      5681        4215
0.2                   200   95           0      8524        1381
0.2                   500   19           1      9973        7
0.4                   10    11           0      991         8998
0.4                   20    225          0      1563        8212
0.4                   50    2061         0      3133        4806
0.4                   100   2796         0      5602        1602
0.4                   200   1516         2374   6110        2167
0.4                   500   23           6115   3862        0


Table 7. Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size δ = 0.4

Number of Rejections (10000 Simulations)

Equivalence Bound τ   N     Equivalent   Both   Different   Neither
0.1                   10    0            0      1335        8665
0.1                   20    0            0      2333        7667
0.1                   50    0            0      5015        4985
0.1                   100   0            0      8069        1931
0.1                   200   0            0      9769        231
0.1                   500   0            0      10000       0
0.2                   10    0            0      1344        8656
0.2                   20    0            0      2341        7659
0.2                   50    0            0      5077        4923
0.2                   100   23           0      8110        1867
0.2                   200   1            0      9784        215
0.2                   500   0            0      10000       0
0.4                   10    9            0      1402        8589
0.4                   20    164          0      2346        7490
0.4                   50    933          0      5099        3968
0.4                   100   932          0      8075        993
0.4                   200   232          806    8962        0
0.4                   500   0            1025   8975        0

Table 8. Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size δ = 0.5

Number of Rejections (10000 Simulations)

Equivalence Bound τ   N     Equivalent   Both   Different   Neither
0.1                   10    0            0      1897        8103
0.1                   20    0            0      3383        6617
0.1                   50    0            0      6981        3019
0.1                   100   0            0      9428        572
0.1                   200   0            0      9985        15
0.1                   500   0            0      10000       0
0.2                   10    0            0      1804        8196
0.2                   20    0            0      3437        6563
0.2                   50    0            0      6905        3095
0.2                   100   1            0      9429        570
0.2                   200   0            0      9987        13
0.2                   500   0            0      10000       0
0.4                   10    7            0      1866        8127
0.4                   20    117          0      3425        6458
0.4                   50    370          0      6936        2692
0.4                   100   236          0      9378        386
0.4                   200   13           108    9879        0
0.4                   500   0            28     9972        0


Conclusion

The data originally collected and analyzed with traditional significance tests by Benson (1989) showed a statistically significant difference between the means of male and female statistics students on six variables (GPA, number of prior math courses, math self-concept, self-efficacy, general test anxiety, and statistical test anxiety) and failed to find significance for only one variable (achievement). We computed Cohen's d as an effect size. Not surprisingly, the smallest absolute effect size of 0.04 was found for the non-significant variable, while the absolute effect sizes of the six significant variables ranged from 0.24 to 0.66.

We then re-analyzed Benson's data using the TOST procedure for testing for statistical equivalence. This analysis showed that only two variables, number of prior math courses and statistical test anxiety, were "different" (i.e., significant and not equivalent). Not coincidentally, these were the two variables with the strongest absolute effect sizes of 0.60 and 0.66. The non-significant variable (achievement) was found to be statistically equivalent, and the absolute effect size was virtually zero. Four of the variables (GPA, math self-concept, self-efficacy, and general test anxiety) yielded conflicting results of "equivalent and different" since they rejected the null hypotheses of both the statistical and equivalence tests. It is likely that the difference in the means of these four variables, while statistically significant, is trivial. The absolute effect sizes of these four variables ranged from 0.24 to 0.51. This encompasses a range of effect sizes that is often classified as "small" to "medium" (Cohen, 1988), notwithstanding Lenth's (2001) warnings against using "canned" effect sizes.

We noticed that whenever the effect size δ is less than the equivalence bound τ, the power of the equivalence test approached unity as n increased. This convergence was slow when δ was nearly equal to τ. Essentially, if the effect size parameter is less than the minimum difference that the researcher considers to be practically important (i.e., the minimum difference between means large enough to matter), we will reject the null of the TOST and conclude equivalence, with power increasing to unity with larger sample sizes.

If δ > τ, the power of the significance test approaches unity and the power of the equivalence test approaches zero as the sample size increases. This is the situation where the effect size parameter exceeds the specified maximum for practical importance; we will reject the t-test and conclude statistical significance, with power increasing to unity as the sample size increases.

When δ = τ, the power of the equivalence test will approach twice the nominal alpha level. This occurs because the effect size parameter happens to coincide with the specified equivalence bound. Rejecting the TOST (i.e., concluding equivalence) is a Type I error, made with probability 2α. The probability is twice the nominal α since an equivalence test at level α corresponds to a 100(1 - 2α)% equivalence interval.

When 0 < δ < τ, the power of both the significance and equivalence tests approaches unity (often slowly) as n increases. This is the situation where the null hypothesis of a significance test is false (i.e., the difference of means is not equal to zero), but the true difference is too small to be considered practically significant, where τ is the minimum difference between means that is considered important.

It appears to be somewhat common with real data to have situations where the tests of statistical significance and equivalence are simultaneously rejected for reasonable choices of significance level α and equivalence bound τ. Our re-analysis of the Benson (1989) data yielded 4 simultaneous rejections out of 7 variables.

The simulated power of simultaneous rejection showed that the probability of simultaneous rejection was low when the assumptions of the inferential tests (i.e., normality, equal variances, equal sample sizes between groups) were true, except when both n and τ were large. It is possible that "simultaneous rejection" will be more likely with real data than (at least our) simulated data because real data will surely violate the normality and homoscedasticity assumptions. We speculate that simultaneous rejection will be more common, and thus potentially more problematic for the researcher using equivalence testing in conjunction with standard hypothesis testing, when the data is non-normal and heteroscedastic.

Sawilowsky and Yoon (2002) demonstrated that large effect sizes could be found in situations where the results of a hypothesis test are "not significant" (i.e., p > .05). Similarly, we found the magnitude of effect sizes obtained from the statistical re-analysis of typical educational research data to be troubling. Benson's data was of a decent size (groups of 94 and 123 subjects), but an effect size as large as 0.51 yielded both statistical significance (rejecting that the male mean was equal to the female mean) and equivalence (concluding that the absolute difference of the male and female means was within the constant τ). We make the conjecture that the effect size conventions of Cohen (i.e., 0.2 is small, 0.5 is medium, 0.8 is large) might not be large enough. It is even possible that making any recommendation about the desired magnitude of an effect size independent of the sample sizes and variability of the populations might be futile (Lenth, 2001).

It would be desirable to extend the simulation study to consider several scenarios ignored here. In particular, more attention needs to be given to situations where one or more of the following conditions are true:

1. The populations are non-normal.
2. The variances are not equal.
3. The sample sizes of the groups are not equal.

It would also be desirable to analytically determine the power function for simultaneous rejection of the significance and equivalence tests, if possible. We will continue to strive for a greater understanding of the link between the effect size and the results of the significance and equivalence tests. It appears that sole reliance on any standard methodology, be it hypothesis testing, confidence intervals, effect sizes, or equivalence testing, is ill advised.

References

Abelson, R. (1997). A retrospective on the significance test ban of 1999 (if there were no significance tests, they would have to be invented). In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 117-141). Mahwah, NJ: Lawrence Erlbaum Associates.
Anderson, D., Burnham, K., & Thompson, W. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64(4), 912-923.
Anderson, S., & Hauck, W. (1983). A new procedure for testing equivalence in comparative bioavailability and other clinical trials. Communications in Statistics: Theory and Methods, 12, 2663-2692.
Bartko, J. (1991). Proving the null hypothesis. American Psychologist, 46(10), 801-803.
Benson, J. (1989). Structural components of statistical test anxiety in adults: An exploratory model. Journal of Experimental Education, 57(3), 247-261.
Berger, R., & Hsu, J. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science, 11(4), 283-319.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33, 526-536.
Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37, 325-335.
Blair, R. C., & Cole, S. R. (2002). Two-sided equivalence testing of the difference between two means. Journal of Modern Applied Statistical Methods, 1(1), 139-142.
Boring, E. (1919). Mathematical vs. scientific importance. Psychological Bulletin, 16, 335-338.
Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Carver, R. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61(4), 287-292.


Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49(12), 997-1003.
Falk, R. (1998). In criticism of the null hypothesis statistical test. American Psychologist, 53, 798-799.
Fan, X. (2001). Statistical significance and effect size in education research: Two sides of a coin. Journal of Educational Research, 94(5), 275-282.
Hagan, R. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52(1), 15-24.
Hagan, R. (1998). A further look at wrong reasons to abandon statistical testing. American Psychologist, 53(7), 801-803.
Harlow, L., Mulaik, S., & Steiger, J. (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.
Harris, R. (1997). Reforming significance testing via three-valued logic. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 145-174). Mahwah, NJ: Lawrence Erlbaum Associates.
Hunter, J. (1997). Needed: A ban on the significance test. Psychological Science, 8(1), 3-7.
Knapp, T. R. (1998). Comments on the statistical significance testing articles. Research in the Schools, 5(2), 39-41.
Knapp, T. R. (2002). Some reflections on significance testing. Journal of Modern Applied Statistical Methods, 1(2), 240-242.
Krantz, D. (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association, 94, 1372-1381.
Lehmann, E. (1959). Testing statistical hypotheses (1st ed.). New York: Wiley.
Lenth, R. (2001). Some practical guidelines for effective sample size determination. The American Statistician, 55(3), 187-193.
McBride, G. (1999). Equivalence tests can enhance environmental science and management. Australian and New Zealand Journal of Statistics, 41(1), 19-29.
McLean, J., & Ernest, J. (1998). The role of statistical significance testing in educational research. Research in the Schools, 5(2), 15-22.
Meehl, P. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Meehl, P. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 393-426). Mahwah, NJ: Lawrence Erlbaum Associates.
Nix, T., & Barnette, J. (1998). The data analysis dilemma: Ban or abandon. A review of null hypothesis significance testing. Research in the Schools, 5(2), 3-14.
Patel, H., & Gupta, G. (1984). A problem of equivalence in clinical trials. Biometrical Journal, 26, 471-474.
Rogers, J., Howard, K., & Vessey, J. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113(3), 553-565.
Rozeboom, W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Sawilowsky, S. S., & Yoon, J. (2002). The trouble with trivials (p > .05). Journal of Modern Applied Statistical Methods, 1(1), 143-144.
Schmidt, F., & Hunter, J. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.


Schuirmann, D. (1981). On hypothesis testing to determine if the mean of the normal distribution is contained in a known interval. Biometrics, 37, 617.
Schuirmann, D. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15, 657-680.
Serlin, R. (1993). Confidence intervals and the scientific method: A case for Holm on the range. Journal of Experimental Education, 61, 350-360.
Serlin, R. (2002). Constructive criticism. Journal of Modern Applied Statistical Methods, 1(2), 202-227.
Thompson, B. (1998). Statistical significance and effect size reporting: Portrait of a possible future. Research in the Schools, 5(2), 33-38.
Thompson, B. (2001). Significance, effect sizes, stepwise methods, and other issues: Strong arguments move the field. Journal of Experimental Education, 70(1), 80-93.
Vacha-Haase, T. (2001). Statistical significance should not be considered one of life's guarantees: Effect sizes are needed. Educational and Psychological Measurement, 61, 219-224.
Westlake, W. (1976). Symmetric confidence intervals for bioequivalence trials. Biometrics, 32, 741-744.
Westlake, W. (1979). Statistical aspects of comparative bioequivalence trials. Biometrics, 35, 273-280.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 341-349 1538 – 9472/03/$30.00

Confidence Intervals For P(X < Y)

Ayman Baklizi Department of Statistics Yarmouk University Irbid – Jordan

The problem considered is interval estimation of the stress-strength reliability R = P(X < Y).

Key words: Bootstrap, exponential distribution, interval estimation, stress-strength model

Introduction

The problem of making inference about R = P(X < Y) … The problem of developing confidence intervals for the stress-strength probability has received relatively little attention; Halperin …


The Model and Maximum Likelihood Estimation
In this study, X and Y are independently exponentially distributed random variables with scale parameters θ and λ respectively and a common location parameter µ, that is,

f_X(x; \theta, \mu) = \theta e^{-\theta(x-\mu)}, x \ge \mu;   f_Y(y; \lambda, \mu) = \lambda e^{-\lambda(y-\mu)}, y \ge \mu.

Let X_1, ..., X_{n_1} be a random sample for X and Y_1, ..., Y_{n_2} be a random sample for Y. The parameter we want to estimate is R = P(X < Y) = \theta/(\theta + \lambda). The likelihood function can be written as

L(\theta, \lambda, \mu) = \theta^{n_1} \lambda^{n_2} \exp( -\theta \sum_{i=1}^{n_1}(x_i - \mu) - \lambda \sum_{i=1}^{n_2}(y_i - \mu) ) I(z \ge \mu),

where z = min(x_1, ..., x_{n_1}, y_1, ..., y_{n_2}) and I(.) indicates the usual indicator function. The maximum likelihood estimators of θ, λ, and µ are given by (Ghosh & Razmpour, 1984) \hat{\mu} = z, \hat{\theta} = n_1/T_1, and \hat{\lambda} = n_2/T_2, where T_1 = \sum_{i=1}^{n_1}(x_i - z) and T_2 = \sum_{i=1}^{n_2}(y_i - z). The maximum likelihood estimator of R is therefore

\hat{R} = n_1 T_2 / (n_2 T_1 + n_1 T_2).

Now we will describe the various intervals under study.

Confidence Intervals for R
Exact confidence intervals that are convenient to use for R are not available, and hence approximate methods that exist in a simple closed form are needed. In this section and the following section we shall develop various types of intervals for the stress-strength reliability R.

Intervals Based on the Asymptotic Normality of the MLE (AN Intervals)
Bai and Hong (1992) showed that if n = n_1 + n_2 \to \infty such that n_1/n \to \gamma, 0 < \gamma < 1, then \sqrt{n}(\hat{R} - R) \to N(0, \sigma^2), where \sigma^2 = R^2(1-R)^2 / (\gamma(1-\gamma)). This fact can be used to construct approximate confidence intervals for R. The intervals are of the form

\hat{R} \pm z_{1-\alpha/2} \, \hat{R}(1-\hat{R}) / \sqrt{ n (n_1/n)(1 - n_1/n) },

where z_{1-\alpha/2} is the 1-\alpha/2 quantile of the standard normal distribution.

Intervals Based on the Asymptotic Normality of the Transformed MLE (TRAN Intervals)
When the maximum likelihood estimator of the parameter of interest has its range in only a part of the real line, a monotone transformation of this parameter with continuous derivatives and range in the entire real line will generally be better approximated by an asymptotic normal distribution, as suggested by many authors including Lawless (1982) and Nelson (1982). Let K(R) be a monotone function of R and let K'(R) be its first derivative; then, by applying the delta method (Serfling, 1980), we get

\sqrt{n}( K(\hat{R}) - K(R) ) \to N( 0, K'(R)^2 V(\hat{R}) ).

Using this, a 1-\alpha confidence interval for R may be obtained as

( K^{-1}( K(\hat{R}) - z_{1-\alpha/2} \sqrt{ K'(\hat{R})^2 V(\hat{R}) } ),  K^{-1}( K(\hat{R}) + z_{1-\alpha/2} \sqrt{ K'(\hat{R})^2 V(\hat{R}) } ) ).
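A short R sketch of the maximum likelihood estimate of R and the AN interval, written from the formulas above; the function name and interface are illustrative and not from the article.

an_interval <- function(x, y, alpha = 0.05) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  z  <- min(c(x, y))                    # common location estimate, mu-hat
  T1 <- sum(x - z); T2 <- sum(y - z)
  R  <- n1 * T2 / (n2 * T1 + n1 * T2)   # MLE of R = theta / (theta + lambda)
  gam <- n1 / n
  se <- R * (1 - R) / sqrt(n * gam * (1 - gam))
  c(R = R, lower = R - qnorm(1 - alpha / 2) * se,
    upper = R + qnorm(1 - alpha / 2) * se)
}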

An appropriate transform is the tan^{-1} (Jeng & Meeker, 2003). Using this transform, a 1-\alpha confidence interval for R is given by

\tan\{ \tan^{-1}(\hat{R}) \pm z_{1-\alpha/2} \, \hat{R}(1-\hat{R}) / [ (1+\hat{R}^2) \sqrt{ n (n_1/n)(1-n_1/n) } ] \}.

Bai and Hong's Intervals (BH Intervals)
Ghosh and Razmpour (1984) showed that (T_1, T_2, Z) is a complete sufficient statistic for (\theta, \lambda, \mu) and that the joint probability density function of (T_1, T_2), which is independent of Z, is

g(t_1, t_2) = \frac{\theta^{n_1} \lambda^{n_2}}{n_1\theta + n_2\lambda} \left[ \frac{n_2 t_1^{n_1-1} t_2^{n_2-2}}{\Gamma(n_1)\Gamma(n_2-1)} + \frac{n_1 t_1^{n_1-2} t_2^{n_2-1}}{\Gamma(n_1-1)\Gamma(n_2)} \right] \exp(-\theta t_1 - \lambda t_2),  t_1, t_2 > 0.

Using standard transformation techniques, it can be shown that the probability density function of the random variable U = \theta T_1 / (\theta T_1 + \lambda T_2) is given by (Bai and Hong, 1992)

g(u, \pi, n_1, n_2) = \pi\, b(u, n_1 - 1, n_2) + (1-\pi)\, b(u, n_1, n_2 - 1),  0 \le u \le 1,

where \pi = n_1\theta / (n_1\theta + n_2\lambda) and

b(u, r, s) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} u^{r-1}(1-u)^{s-1},  0 \le u \le 1,

is the beta probability density function with parameters r and s. Bai and Hong (1992) showed that an approximate 1-\alpha interval for R is of the form

\left( \frac{k_{\alpha/2}\, t_2}{(1-k_{\alpha/2}) t_1 + k_{\alpha/2} t_2},\; \frac{k_{1-\alpha/2}\, t_2}{(1-k_{1-\alpha/2}) t_1 + k_{1-\alpha/2} t_2} \right),

where t_1 and t_2 are the observed values of T_1 and T_2 respectively, and k_\alpha is such that G(k_\alpha, \hat{\pi}, n_1, n_2) = \alpha. Here \hat{\pi} is an estimator of \pi obtained by substituting the maximum likelihood estimators of θ and λ in the formula for \pi, and G is the distribution function of the mixed beta random variable U.

Intervals Based on the Likelihood Ratio Statistic (LR Intervals)
The likelihood function of (\theta, \lambda, \mu) is given by

L(\theta, \lambda, \mu) = \theta^{n_1} \lambda^{n_2} \exp( -\theta \sum_{i=1}^{n_1}(x_i - \mu) - \lambda \sum_{i=1}^{n_2}(y_i - \mu) ) I(z \ge \mu).

The likelihood ratio statistic for testing H_0: R = R_0 is defined as (Barndorff-Nielsen and Cox, 1994) W = 2( l(\hat{\Omega}) - l(\hat{\omega}) ), where l(\hat{\Omega}) is the log-likelihood function evaluated at the values of the unrestricted maximum likelihood estimator of (\theta, \lambda, \mu), while l(\hat{\omega}) is the log-likelihood function evaluated at the values of the restricted maximum likelihood estimator under the null hypothesis. Recall that the unrestricted maximum likelihood estimators are \hat{\mu} = z, \hat{\theta} = n_1/T_1, and \hat{\lambda} = n_2/T_2, where T_1 = \sum_{i=1}^{n_1}(x_i - z) and T_2 = \sum_{i=1}^{n_2}(y_i - z). Under the null hypothesis H_0: R = R_0 we find readily that \lambda = \frac{1-R_0}{R_0}\theta, and thus the maximum likelihood estimator of θ is

\tilde{\theta} = \frac{n_1 + n_2}{T_1 + \frac{1-R_0}{R_0} T_2}.

Substituting in the formula

… simulation.

The Percentile Interval (PRC Interval)
Here we simulate the bootstrap distribution of \hat{R} by resampling repeatedly from the parametric model of the original data and calculating \hat{R}^*_i, i = 1, ..., B, where B is the number of bootstrap samples. Let \hat{H} be the cumulative distribution function of \hat{R}^*; then the 1-\alpha interval is given by

( \hat{H}^{-1}(\alpha/2), \hat{H}^{-1}(1-\alpha/2) ).

The Bias Corrected and Accelerated Interval (BCa Interval)
The bias corrected and accelerated interval is also calculated using the percentiles of the bootstrap distribution of \hat{R}^*, but it is not necessarily identical with the percentile interval described in the previous subsection. The percentiles depend on two numbers \hat{a} and \hat{z}_0, called the acceleration and the bias correction. The 1-\alpha interval is given by (\hat{G}^{-1}(\alpha_1), \hat{G}^{-1}(\alpha_2)) where

\alpha_1 = \Phi\left( \hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})} \right),  \alpha_2 = \Phi\left( \hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})} \right),

\Phi(.) is the standard normal cumulative distribution function, and z_\alpha is the α quantile of the standard normal distribution. The values of \hat{a} and \hat{z}_0 are calculated as follows:

\hat{a} = \frac{\sum_{i=1}^{n} (\hat{R}(\cdot) - \hat{R}(i))^3}{6\left\{ \sum_{i=1}^{n} (\hat{R}(\cdot) - \hat{R}(i))^2 \right\}^{3/2}},

where \hat{R}(i) is the maximum likelihood estimator of R using the original data excluding the i-th observation and \hat{R}(\cdot) = \sum_{i=1}^{n} \hat{R}(i) / n. The value of \hat{z}_0 is given by

\hat{z}_0 = \Phi^{-1}\left( \frac{\#\{\hat{R}^* < \hat{R}\}}{B} \right).
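The following R sketch shows one way the parametric-bootstrap percentile and BCa intervals outlined above could be computed; it is illustrative only, with our own function names, and is not the author's implementation.

mle_R <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  z <- min(c(x, y)); T1 <- sum(x - z); T2 <- sum(y - z)
  n1 * T2 / (n2 * T1 + n1 * T2)
}

bca_interval <- function(x, y, B = 1000, alpha = 0.05) {
  n1 <- length(x); n2 <- length(y)
  z <- min(c(x, y)); T1 <- sum(x - z); T2 <- sum(y - z)
  theta <- n1 / T1; lambda <- n2 / T2; Rhat <- mle_R(x, y)
  # parametric bootstrap from the fitted exponential model
  Rstar <- replicate(B, {
    xs <- z + rexp(n1, rate = theta)
    ys <- z + rexp(n2, rate = lambda)
    mle_R(xs, ys)
  })
  z0 <- qnorm(mean(Rstar < Rhat))        # bias correction
  # jackknife for the acceleration constant
  dat <- c(x, y); grp <- rep(1:2, c(n1, n2))
  Rjack <- sapply(seq_along(dat), function(i) {
    xi <- dat[-i][grp[-i] == 1]; yi <- dat[-i][grp[-i] == 2]
    mle_R(xi, yi)
  })
  d <- mean(Rjack) - Rjack
  a <- sum(d^3) / (6 * sum(d^2)^1.5)
  zl <- qnorm(alpha / 2); zu <- qnorm(1 - alpha / 2)
  a1 <- pnorm(z0 + (z0 + zl) / (1 - a * (z0 + zl)))
  a2 <- pnorm(z0 + (z0 + zu) / (1 - a * (z0 + zu)))
  list(percentile = quantile(Rstar, c(alpha / 2, 1 - alpha / 2)),
       bca = quantile(Rstar, c(a1, a2)))
}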

Small Sample Performance of the Intervals
For the confidence intervals with nominal confidence coefficient 1-\alpha, we use the criterion of attainment of lower and upper error probabilities, which are both equal to \alpha/2. Attainment of the lower and upper nominal error probabilities is important because otherwise we will use an interval with unknown error probabilities, and our conclusions therefore are imprecise and can be misleading. Attainment of nominal error probabilities (assumed equal) means that if the interval fails to contain the true value of the parameter, it is equally likely to be above as to be below the true value. Users of two-sided confidence intervals expect the lower and upper error probabilities to be symmetric because they are using symmetric percentiles of the approximating distributions to form their confidence intervals. However, symmetry of error probabilities may not occur due to skewness of the actual sampling distribution (Jennings, 1987). Another criterion for comparing confidence intervals is their expected lengths; obviously, the shortest confidence interval among intervals having the same confidence level is the best. We have simulated the expected lengths of the considered intervals.

A simulation study is conducted to investigate the performance of the intervals. The indices of our simulations are:

(n_1, n_2) = (10,10), (20,20), (30,30), (40,40), (10,40), (40,10), (20,40), (40,20); the true values of R = P(X < Y) are 0.50, 0.70, 0.90, and 0.95.

For each combination of n_1, n_2, and R, 2000 samples were generated for X taking θ = 1, µ = 0, and 2000 samples for Y with λ = 1/R - 1, µ = 0. The intervals were calculated, and we used B = 1000 for the bootstrap calculations. The following quantities are simulated for each case. Lower error rates (L): the fraction of intervals that fall entirely above the true parameter. Upper error rates (U): the fraction of intervals that fall entirely below the true parameter. Total error rates (T): the fraction of intervals that did not contain the true parameter value. The expected widths (W) of the intervals were also recorded.
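A minimal sketch, under our own naming, of the error-rate tabulation just described for one (n_1, n_2, R) combination, using the AN interval as the illustrative method (the study itself evaluated eight interval methods):

error_rates <- function(n1, n2, R, nrep = 2000, alpha = 0.05) {
  lower_err <- upper_err <- 0
  lambda <- 1 / R - 1                 # theta = 1, mu = 0, so R = 1 / (1 + lambda)
  n <- n1 + n2; gam <- n1 / n; zq <- qnorm(1 - alpha / 2)
  for (i in seq_len(nrep)) {
    x <- rexp(n1, rate = 1)
    y <- rexp(n2, rate = lambda)
    z <- min(c(x, y)); T1 <- sum(x - z); T2 <- sum(y - z)
    Rhat <- n1 * T2 / (n2 * T1 + n1 * T2)
    se <- Rhat * (1 - Rhat) / sqrt(n * gam * (1 - gam))
    if (Rhat - zq * se > R) lower_err <- lower_err + 1   # interval entirely above R
    if (Rhat + zq * se < R) upper_err <- upper_err + 1   # interval entirely below R
  }
  c(L = lower_err / nrep, U = upper_err / nrep, T = (lower_err + upper_err) / nrep)
}
error_rates(10, 10, 0.5)   # illustrative call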

Table 1: Simulated error rates and expected lengths of the intervals

R AN TRAN BH LR BTST TRBTST PRC ( n1,n2 ) BCa (10, 10) 0.50 L 0.0425 0.0500 0.0245 0.0275 0.0070 0.0110 0.0310 0.0160 U 0.0455 0.0255 0.0305 0.0340 0.0110 0.0030 0.0285 0.0235 T 0.0880 0.0755 0.0550 0.0615 0.0180 0.0140 0.0595 0.0395 W 0.4160 0.4160 0.4230 0.3870 0.5240 0.4940 0.4210 0.4270 0.70 L 0.0175 0.0395 0.0245 0.0210 0.0355 0.0095 0.0115 0.0235 U 0.0475 0.0550 0.0425 0.0200 0.0650 0.0095 0.0100 0.0200 T 0.0650 0.0945 0.0670 0.0410 0.1010 0.0190 0.0215 0.0435 W 0.3610 0.3570 0.3230 0.4270 0.6000 0.4500 0.4470 0.3760 0.90 L 0.0010 0.0095 0.0075 0.0180 0.0035 0.0080 0.0550 0.0255 U 0.1080 0.0975 0.0565 0.0425 0.0195 0.0145 0.0095 0.0240 T 0.1090 0.1070 0.0640 0.0605 0.0230 0.0225 0.0645 0.0495 W 0.1550 0.1560 0.1630 0.1060 0.2030 0.2170 0.1640 0.1890 0.95 L 0.0000 0.0030 0.0120 0.0175 0.0145 0.0095 0.0655 0.0230 U 0.1370 0.1110 0.0655 0.0480 0.0270 0.0230 0.0150 0.0185 T 0.1370 0.1140 0.0775 0.0655 0.0415 0.0325 0.0805 0.0415 W 0.0813 0.0825 0.0863 0.0772 0.1080 0.1160 0.0872 0.1080 (20, 20) 0.50 L 0.0390 0.0500 0.0250 0.0290 0.0140 0.0145 0.0340 0.0290 U 0.0450 0.0295 0.0305 0.0340 0.0175 0.0215 0.0325 0.0195 T 0.0840 0.0795 0.0555 0.0630 0.0315 0.0360 0.0665 0.0485 W 0.3018 0.3010 0.3028 0.3355 0.3354 0.3250 0.3028 0.3050 0.70 L 0.0175 0.0275 0.0200 0.0225 0.0145 0.0125 0.0385 0.0170 U 0.0605 0.0420 0.0430 0.0365 0.0205 0.0160 0.0195 0.0235 T 0.0780 0.0695 0.0630 0.0590 0.0350 0.0285 0.0580 0.0405 W 0.2546 0.2560 0.2594 0.2305 0.2835 0.2830 0.2570 0.2630 0.90 L 0.0030 0.0115 0.0110 0.0155 0.0130 0.0170 0.0485 0.0195 U 0.0800 0.0605 0.0455 0.0430 0.0325 0.0160 0.0135 0.0275 T 0.0830 0.0720 0.0565 0.0585 0.0455 0.0330 0.0620 0.0470 W 0.1103 0.1110 0.1135 0.0845 0.1249 0.1300 0.1134 0.1230 0.95 L 0.0035 0.0060 0.0125 0.0190 0.0200 0.0235 0.0490 0.0270 U 0.0850 0.0855 0.0470 0.0380 0.0225 0.0240 0.0125 0.0260 T 0.0885 0.0915 0.0595 0.0570 0.0425 0.0475 0.0615 0.0530 W 0.0585 0.0586 0.0610 0.0558 0.0665 0.0682 0.0604 0.0662 (30, 30) 0.50 L 0.0305 0.0455 0.0255 0.0265 0.0175 0.0205 0.0290 0.0220 AYMAN BAKLIZI 347

R AN TRAN BH LR BTST TRBTST PRC ( n1,n2 ) BCa U 0.0310 0.0265 0.0265 0.0270 0.0190 0.0210 0.0280 0.0255 T 0.0615 0.0720 0.0520 0.0535 0.0365 0.0415 0.0570 0.0475 W 0.2488 0.2480 0.2461 0.3249 0.2663 0.2600 0.2492 0.2500 0.70 L 0.0205 0.0320 0.0225 0.0225 0.0180 0.0210 0.0400 0.0230 U 0.0565 0.0435 0.0345 0.0355 0.0240 0.0230 0.0230 0.0255 T 0.0770 0.0755 0.0570 0.0580 0.0420 0.0440 0.0630 0.0485 W 0.2097 0.2100 0.2129 0.2060 0.2249 0.2240 0.2110 0.2140 0.90 L 0.0035 0.0100 0.0090 0.0155 0.0155 0.0180 0.0365 0.0320 U 0.0600 0.0610 0.0395 0.0305 0.0205 0.0220 0.0125 0.0255 T 0.0635 0.0710 0.0485 0.0460 0.0360 0.0400 0.0490 0.0575 W 0.0903 0.0907 0.0922 0.0762 0.0977 0.0999 0.0919 0.0968 0.95 L 0.0030 0.0080 0.0145 0.0210 0.0225 0.0225 0.0425 0.0235 U 0.0700 0.0645 0.0470 0.0320 0.0270 0.0250 0.0175 0.0275 T 0.0730 0.0725 0.0615 0.0530 0.0495 0.0475 0.0600 0.0510 W 0.0479 0.0480 0.0480 0.0449 0.0520 0.0529 0.0489 0.0526 (40, 40) 0.50 L 0.0300 0.0380 0.0295 0.0260 0.0180 0.0210 0.0255 0.0240 U 0.0335 0.0185 0.0335 0.0290 0.0230 0.0155 0.0280 0.0205 T 0.0635 0.0565 0.0630 0.0550 0.0410 0.0365 0.0535 0.0445 W 0.2163 0.2160 0.2164 0.2989 0.2271 0.2240 0.2162 0.2170 0.70 L 0.0170 0.0320 0.0165 0.0255 0.0210 0.0270 0.0350 0.0220 U 0.0470 0.0280 0.0345 0.0295 0.0235 0.0170 0.0245 0.0260 T 0.0640 0.0600 0.0510 0.0550 0.0445 0.0440 0.0595 0.0480 W 0.1811 0.1830 0.1819 0.2003 0.1906 0.1920 0.1819 0.1850 0.90 L 0.0090 0.0165 0.0170 0.0210 0.0220 0.0225 0.0395 0.0225 U 0.0605 0.0470 0.0405 0.0350 0.0230 0.0235 0.0190 0.0270 T 0.0695 0.0635 0.0575 0.0560 0.0450 0.0460 0.0585 0.0495 W 0.0782 0.0791 0.0796 0.0687 0.0829 0.0848 0.0792 0.0824 0.95 L 0.0050 0.0035 0.0145 0.0215 0.0265 0.0160 0.0325 0.0265 U 0.0560 0.0575 0.0300 0.0390 0.0190 0.0245 0.0165 0.0255 T 0.0610 0.0610 0.0445 0.0605 0.0455 0.0405 0.0490 0.0520 W 0.0415 0.0414 0.0425 0.0400 0.0441 0.0443 0.0421 0.0442 (10, 40) 0.50 L 0.0270 0.0345 0.0415 0.0250 0.0175 0.0205 0.0530 0.0485 U 0.0490 0.0335 0.0165 0.0295 0.0100 0.0105 0.0080 0.0055 T 0.0760 0.0680 0.0580 0.0545 0.0275 0.0310 0.0610 0.0540 W 0.3359 0.3350 0.3345 0.3369 0.3828 0.3700 0.3351 0.3370 0.70 L 0.0105 0.0215 0.0375 0.0205 0.0115 0.0185 0.0790 0.0435 U 0.0855 0.0585 0.0230 0.0395 0.0155 0.0080 0.0065 0.0085 T 0.0960 0.0800 0.0605 0.0600 0.0270 0.0265 0.0855 0.0520 W 0.2790 0.2820 0.3033 0.2630 0.3358 0.3400 0.2770 0.2810 0.90 L 0.0020 0.0055 0.0220 0.0175 0.0145 0.0130 0.1055 0.0625 U 0.1185 0.0945 0.0520 0.0550 0.0195 0.0160 0.0025 0.0045 T 0.1205 0.1000 0.0740 0.0725 0.0340 0.0290 0.1080 0.0670 W 0.1190 0.1190 0.1440 0.0913 0.1557 0.1640 0.1181 0.1250 0.95 L 0.0010 0.0015 0.0125 0.0175 0.0190 0.0230 0.1120 0.0700 U 0.1265 0.1330 0.0475 0.0535 0.0170 0.0225 0.0015 0.0065 T 0.1275 0.1340 0.0600 0.0710 0.0360 0.0455 0.1135 0.0765 W 0.0625 0.0615 0.0720 0.0631 0.0843 0.0864 0.0614 0.0686


(20, 40) 0.50 L 0.0320 0.0260 0.0320 0.0280 0.0210 0.0110 0.0370 0.0370 U 0.0395 0.0290 0.0215 0.0300 0.0170 0.0205 0.0175 0.0115 T 0.0715 0.0550 0.0535 0.0580 0.0380 0.0315 0.0545 0.0485 W 0.2632 0.2630 0.2666 0.3312 0.2838 0.2780 0.2631 0.2650 0.70 L 0.0175 0.0285 0.0260 0.0240 0.0185 0.0255 0.0470 0.0310 U 0.0620 0.0415 0.0235 0.0325 0.0180 0.0160 0.0155 0.0125 T 0.0795 0.0700 0.0495 0.0565 0.0365 0.0415 0.0625 0.0435 W 0.2214 0.2220 0.2304 0.2086 0.2434 0.2430 0.2216 0.2240 0.90 L 0.0025 0.0055 0.0195 0.0170 0.0180 0.0110 0.0625 0.0340 U 0.0830 0.0840 0.0390 0.0360 0.0230 0.0215 0.0075 0.0180 T 0.0855 0.0895 0.0585 0.0530 0.0410 0.0325 0.0700 0.0520 W 0.0950 0.0942 0.0987 0.0797 0.1077 0.1090 0.0953 0.1010 0.95 L 0.0040 0.0025 0.0135 0.0185 0.0160 0.0190 0.0715 0.0410 U 0.0940 0.0825 0.0420 0.0410 0.0240 0.0245 0.0090 0.0180 T 0.0980 0.0850 0.0555 0.0595 0.0400 0.0435 0.0805 0.0590 W 0.0498 0.0494 0.0524 0.0462 0.0571 0.0576 0.0499 0.0540 (40, 20) 0.50 L 0.0430 0.0500 0.0230 0.0325 0.0170 0.0160 0.0220 0.0210 U 0.0315 0.0170 0.0315 0.0310 0.0230 0.0200 0.0470 0.0230 T 0.0745 0.0670 0.0545 0.0635 0.0400 0.0360 0.0690 0.0440 W 0.2631 0.2630 0.2666 0.3289 0.2839 0.2770 0.2631 0.2640 0.70 L 0.0205 0.0360 0.0145 0.0245 0.0135 0.0170 0.0235 0.0195 U 0.0465 0.0340 0.0475 0.0265 0.0225 0.0235 0.0305 0.0260 T 0.0670 0.0700 0.0620 0.0510 0.0360 0.0405 0.0540 0.0455 W 0.2227 0.2240 0.2171 0.2084 0.2373 0.2370 0.2258 0.2260 0.90 L 0.0070 0.0135 0.0140 0.0230 0.0190 0.0175 0.0275 0.0180 U 0.0550 0.0500 0.0470 0.0305 0.0240 0.0235 0.0290 0.0330 T 0.0620 0.0635 0.0610 0.0535 0.0430 0.0410 0.0565 0.0510 W 0.0973 0.0982 0.0944 0.0795 0.1031 0.1060 0.1015 0.1040 0.95 L 0.0045 0.0080 0.0125 0.0265 0.0145 0.0180 0.0195 0.0235 U 0.0550 0.0510 0.0400 0.0300 0.0205 0.0220 0.0230 0.0250 T 0.0595 0.0590 0.0525 0.0565 0.0350 0.0400 0.0425 0.0485 W 0.0518 0.0521 0.0506 0.0467 0.0548 0.0560 0.0545 0.0558 (40, 10) 0.50 L 0.0525 0.0605 0.0185 0.0325 0.0090 0.0145 0.0120 0.0160 U 0.0260 0.0140 0.0400 0.0255 0.0210 0.0175 0.0625 0.0370 T 0.0785 0.0745 0.0585 0.0580 0.0300 0.0320 0.0745 0.0530 W 0.3354 0.3350 0.3345 0.3402 0.3839 0.3670 0.3349 0.3330 0.70 L 0.0370 0.0425 0.0175 0.0335 0.0120 0.0120 0.0080 0.0160 U 0.0380 0.0295 0.0425 0.0245 0.0240 0.0235 0.0610 0.0340 T 0.0750 0.0720 0.0600 0.0580 0.0360 0.0355 0.0690 0.0500 W 0.2900 0.2890 0.2774 0.2740 0.3194 0.3130 0.2993 0.2890 0.90 L 0.0070 0.0275 0.0105 0.0240 0.0100 0.0185 0.0110 0.0205 U 0.0565 0.0520 0.0505 0.0285 0.0260 0.0255 0.0460 0.0325 T 0.0635 0.0795 0.0610 0.0525 0.0360 0.0440 0.0570 0.0530 W 0.1293 0.1320 0.1277 0.0901 0.1377 0.1440 0.1430 0.1370 0.95 L 0.0055 0.0140 0.0065 0.0240 0.0170 0.0185 0.0125 0.0190 U 0.0600 0.0585 0.0495 0.0325 0.0275 0.0285 0.0435 0.0345 T 0.0655 0.0725 0.0560 0.0565 0.0445 0.0470 0.0560 0.0535 W 0.0703 0.0706 0.0698 0.0672 0.0745 0.0768 0.0798 0.0755


Conclusion

Our simulations indicate that the performance of intervals based on asymptotic normality (AN …). For the bootstrap intervals, it appears that the bootstrap-t intervals (BTST and TRBTST) are symmetric but tend to be conservative for small sample sizes, while the percentile interval (PRC) attains the nominal level but tends to be asymmetric for values of R far from 0.5. The bias corrected and accelerated interval appears to be the best interval based on the bootstrap principle; it attains the nominal level and is symmetric in almost all situations considered. With regard to interval widths, our simulation results suggest that all intervals have about equal performance. No intervals appear to be uniformly shorter or longer than the others. Overall, the BCa interval appears to have the best performance according to the criteria of attainment of coverage probability, symmetry, and expected length, followed by the LR intervals. The other intervals (especially the AN intervals) are anti-conservative and sometimes extremely asymmetric, which limits their usefulness, especially when lower or upper confidence bounds are desired.

References

Bai, D. S., & Hong, Y. W. (1992). Estimation of P(X < Max(Y1, Y2, …, Yp−1)) in the exponential case. Communications in Statistics, A17, 911-924.
Jennings, D. (1987). How do we judge confidence intervals adequacy? The American Statistician, 41(4), 335-337.
Johnson, N., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions (Vol. 1). New York: Wiley.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 350-358 1538 – 9472/03/$30.00

Approximate Bayesian Confidence Intervals For The Variance Of A Gaussian Distribution

Vincent A. R. Camara University of South Florida

The aim of the present study is to obtain and compare confidence intervals for the variance of a Gaussian distribution. Considering respectively the square error and the Higgins-Tsokos loss functions, approximate Bayesian confidence intervals for the variance of a normal population are derived. Using normal data and SAS software, the obtained approximate Bayesian confidence intervals are then compared to the ones obtained with the well known classical method. It is shown that the proposed approximate Bayesian approach relies only on the observations, and that the classical method, which uses the Chi-square statistic, does not always yield the best confidence intervals.

Key words: Estimation, loss functions, statistical analysis

Introduction

There is a significant amount of research in Bayesian analysis and modeling, which has been published in the last twenty-five years; see references. A Bayesian analysis implies the exploitation of a suitable prior information and the choice of a loss function in association with Bayes' Theorem. It rests on the notion that a parameter within a model is not merely an unknown quantity but rather behaves as a random variable, which follows some distribution. In the area of life testing, it is indeed realistic to assume that a life parameter is stochastically dynamic. This assertion is supported by the fact that the complexity of electronic and structural systems is likely to cause undetected component interactions resulting in an unpredictable fluctuation of the life parameter. Drake (1966) gave an excellent account for the use of Bayesian statistics in reliability problems. As he pointed out, "He (Bayesian) realizes … that his selection of a prior (distribution) to express his present state of knowledge will necessarily be somewhat arbitrary. But he greatly appreciates this opportunity to make his entire assumptive structure clear to the world…"

In the present study, we shall consider a classical and useful underlying model. That is, we shall consider the normal underlying model characterized by

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2},  -\infty < x < \infty, -\infty < \mu < \infty, \sigma > 0.   (1)

As we well know, once the underlying model is found to be normally or approximately normally distributed, the classical approach uses the Chi-square statistic and considers the following confidence interval for the population variance σ²:

\left[ \frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}},\; \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}} \right].   (2)

Vincent A. R. Camara earned a Ph.D. in Mathematics/Statistics. His research interests are in the theory and applications of Bayesian and empirical Bayes analyses with emphasis on the computational aspect of modeling. He is featured in the 2003 edition of Marquis Who's Who in America. (E-mail: [email protected])

For the above model (1), approximate Bayesian confidence bounds for the parameter


σ² will be derived to challenge the classical approach (2). In this study, we shall denote the inverse of the population variance σ² by θ and its corresponding estimate by \hat{\theta}. Although there is no specific analytical procedure that allows us to identify the appropriate loss function to be used, the most commonly used is the square error loss function. One of the reasons for selecting this loss function is its analytical tractability in Bayesian analysis. As will be shown, selecting the square error loss does not always lead to the best approximate Bayesian confidence intervals. However, the obtained approximate Bayesian confidence intervals corresponding to the square error and the Higgins-Tsokos loss functions will be respectively used to challenge the classical method (2). The loss functions that will be used are given below, along with a statement of their key characteristics.

Square error loss function
The popular square error loss function places a small weight on estimates near the true value and proportionately more weight on extreme deviations from the true value of the parameter. Its popularity is due to its analytical tractability in Bayesian modeling. The square error loss is defined as follows:

L_{SE}(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2.   (3)

Higgins-Tsokos loss function
The Higgins-Tsokos loss function places a heavy penalty on extreme over- or underestimation. That is, it places an exponential weight on extreme errors. The Higgins-Tsokos loss function is defined as follows:

L_{HT}(\hat{\theta}, \theta) = \frac{f_1 e^{f_2(\hat{\theta}-\theta)} + f_2 e^{-f_1(\hat{\theta}-\theta)}}{f_1 + f_2} - 1,  f_1, f_2 > 0.   (4)

We shall assume that θ behaves as a random variable and is characterized by the Pareto probability density function given by

f_1(\theta) = \frac{a}{b}\left(\frac{b}{\theta}\right)^{a+1},  \theta \ge b > 0,\; a > 0,   (5)

where θ = 1/σ². The Pareto prior has been selected because of its mathematical tractability. Using observations from normal distributions, we will approximate the Pareto prior (5) in such a way that good approximate Bayesian estimates of θ are obtained.

Preliminaries
Let x_1, x_2, ..., x_n denote the observations of a given system that are being characterized by the normal distribution. Replacing 1/σ² by θ, we obtain the following characterization of the normal underlying model defined in (1):

f(x) = \frac{\theta^{1/2}}{\sqrt{2\pi}} e^{-\frac{\theta}{2}(x-\mu)^2},  -\infty < x < \infty,\; \theta > 0.   (6)

This leads to the following posterior distribution:

h(\theta \mid x) = \frac{\theta^{\frac{n}{2}-a-1} e^{-\theta \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2}}}{\int_{b}^{\infty} \theta^{\frac{n}{2}-a-1} e^{-\theta \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2}}\, d\theta},  \theta > b.   (7)

Methodology

Approximate Bayesian confidence bounds of σ² when the population mean µ is known
With, respectively, the following approximate priors for the square error and the Higgins-Tsokos loss functions, good approximate Bayesian estimates of θ are obtained.

Approximate prior for the square error loss:

\bar{g}(\theta) = \frac{a_1}{b_1}\left(\frac{b_1}{\theta}\right)^{a_1+1};\quad a_1 = \frac{n}{2} - 1,\; b_1 = \frac{n-2}{\sum_{i=1}^{n}(x_i-\mu)^2}.   (8)

Approximate prior for the Higgins-Tsokos loss:

\bar{g}_1(\theta) = \frac{a_0}{b_0}\left(\frac{b_0}{\theta}\right)^{a_0+1};\quad a_0 = \frac{n}{2} - 1,\; b_0 = \frac{n-1}{\sum_{i=1}^{n}(x_i-\mu)^2} - F(n,\mu),   (9)

where

F(n,\mu) = \frac{1}{f_1+f_2}\,\ln\!\left( \frac{\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2} + f_2}{\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2} - f_1} \right),  with  f_1 < \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2}.

It is easily shown that the approximate Bayesian estimate of the parameter θ subject to the square error loss is the same as the Bayesian estimate of θ under the Higgins-Tsokos loss. They are equal to

\frac{n-1}{\sum_{i=1}^{n}(x_i-\mu)^2}.

Using respectively the approximate posterior distributions that correspond to (8) and (9), along with the equalities P(\theta \ge L \mid x) = 1 - \alpha/2 and P(\theta \ge U \mid x) = \alpha/2, we respectively obtain the following lower and upper confidence bounds for θ.

Approximate Bayesian confidence bounds of θ corresponding to the square error loss function when µ is known:

L_{\theta(SE)} = \frac{n - 2 - 2\ln(1-\alpha/2)}{\sum_{i=1}^{n}(x_i-\mu)^2},\qquad U_{\theta(SE)} = \frac{n - 2 - 2\ln(\alpha/2)}{\sum_{i=1}^{n}(x_i-\mu)^2}.   (10)

Approximate Bayesian confidence bounds of θ corresponding to the Higgins-Tsokos loss function when µ is known:

L_{\theta(HT)} = \frac{n - 1 - 2\ln(1-\alpha/2)}{\sum_{i=1}^{n}(x_i-\mu)^2} - F(n,\mu),\qquad U_{\theta(HT)} = \frac{n - 1 - 2\ln(\alpha/2)}{\sum_{i=1}^{n}(x_i-\mu)^2} - F(n,\mu).   (11)

Thus, when the population mean is known, (10) and (11) respectively yield 100(1-α)% approximate Bayesian confidence bounds for the normal population variance σ², given in (12) and (13) below.


Confidence bounds corresponding to the square error loss:

L_{\sigma^2(SE)} = \frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n - 2 - 2\ln(\alpha/2)},\qquad U_{\sigma^2(SE)} = \frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n - 2 - 2\ln(1-\alpha/2)}.   (12)

Confidence bounds corresponding to the Higgins-Tsokos loss:

L_{\sigma^2(HT)} = \frac{1}{\frac{n - 1 - 2\ln(\alpha/2)}{\sum_{i=1}^{n}(x_i-\mu)^2} - F(n,\mu)},\qquad U_{\sigma^2(HT)} = \frac{1}{\frac{n - 1 - 2\ln(1-\alpha/2)}{\sum_{i=1}^{n}(x_i-\mu)^2} - F(n,\mu)}.   (13)

Approximate Bayesian confidence bounds of σ² when the population mean µ is unknown
In the case where the population mean µ is unknown, it is estimated by the sample mean \bar{x}, and we obtain the following.

Approximate Bayesian confidence bounds of θ corresponding to the square error loss function when µ is unknown:

L_{\theta(SE)} = \frac{n - 2 - 2\ln(1-\alpha/2)}{\sum_{i=1}^{n}(x_i-\bar{x})^2},\qquad U_{\theta(SE)} = \frac{n - 2 - 2\ln(\alpha/2)}{\sum_{i=1}^{n}(x_i-\bar{x})^2}.   (14)

Approximate Bayesian confidence bounds of θ corresponding to the Higgins-Tsokos loss function when µ is unknown:

L_{\theta(HT)} = \frac{n - 1 - 2\ln(1-\alpha/2)}{\sum_{i=1}^{n}(x_i-\bar{x})^2} - F(n,\bar{x}),\qquad U_{\theta(HT)} = \frac{n - 1 - 2\ln(\alpha/2)}{\sum_{i=1}^{n}(x_i-\bar{x})^2} - F(n,\bar{x}).   (15)

Thus, when µ is unknown, (14) and (15) respectively yield the following 100(1-α)% approximate Bayesian confidence bounds for the normal population variance σ².

Confidence bounds corresponding to the square error loss:

L_{\sigma^2(SE)} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n - 2 - 2\ln(\alpha/2)},\qquad U_{\sigma^2(SE)} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n - 2 - 2\ln(1-\alpha/2)}.   (16)

Confidence bounds corresponding to the Higgins-Tsokos loss:

L_{\sigma^2(HT)} = \frac{1}{\frac{n - 1 - 2\ln(\alpha/2)}{\sum_{i=1}^{n}(x_i-\bar{x})^2} - F(n,\bar{x})},\qquad U_{\sigma^2(HT)} = \frac{1}{\frac{n - 1 - 2\ln(1-\alpha/2)}{\sum_{i=1}^{n}(x_i-\bar{x})^2} - F(n,\bar{x})}.   (17)

Results

In order to compare the proposed approximate Bayesian approach to the classical method, samples that have been obtained from normally distributed populations (Examples 1, 2, 3, 4, 7) as well as approximately normal populations (Examples 5, 6) will be considered. SAS software is used to obtain the normal population parameters µ and σ

Approximate Bayesian confidence bounds of Confidence bounds corresponding to the σ 2 when the population mean µ is unknown. Higgins-Tsokos loss: In the case where the population mean 1 L 2 = µ is unknown, it is estimated by the sample σ (HT ) n −1− 2Ln(α / 2) _ − F(n, x) _ n _ mean x and we obtain the following: 2 ∑(xi − x) Approximate Bayesian confidence bounds of i=1 θ corresponding to the square error loss 1 U 2 = . function when µ is unknown: σ (HT ) n −1− 2Ln(1−α / 2) _ − F(n, x) n _ 2 n − 2 − 2Ln(1−α / 2) ∑(xi − x) L = i=1 θ (SE) n _ 2 (17) ∑(xi − x ) i=1 n − 2 − 2Ln(α / 2) Results U = (14) θ (SE) n _ In order to compare the proposed approximate 2 Bayesian approach to the classical method, ∑(xi − x) i=1 samples that have been obtained from normally distributed populations (Examples 1, 2, 3, .4, Approximate Bayesian confidence bounds of 7) as well as approximately normal populations θ corresponding to the Higgins-Tsokos loss (Examples 5, 6) will be considered. SAS function when µ is unknown: software is used to obtain the normal population parameters µ and σ 354 BAYESIAN CONFIDENCE FOR GAUSSIAN DISTRIBUTION corresponding to each of the examples. The Example 2. Data obtained from Prem S. Mann, proposed approximate Bayesian estimates of Introductory Statistics, Third edition, page 504, the variance (16) (17) will be used. For the 1998. Higgins-Tsokos loss function, we will consider 13, 11, 9, 12, 8, 10, 5, 10, 9, 12, 13. f1 = 1, f 2 = 1. The lengths of the classical and approximate Bayesian confidence intervals are Normal population distribution respectively denoted by l , l and l . C SE HT obtained with SAS: N(µ = 10.182,σ = 2.4008) . Population and Example 1. (Data obtained from Prem S. 2 Mann, Introductory Statistics, Third edition, sample variances: σ = 5.76384 , page 504, 1998). s 2 = 5.763636 .

24, 28, 22, 25, 24, 22, 29, 26, 25, 28, 19, 29. Normal population distribution Table 2: Classical and approximate Bayesian obtained with SAS: confidence intervals of σ 2 corresponding to N(µ = 25.083,σ = 3.1176) Population and the second data set. sample variances:σ 2 = 9.71943 , C L. Classical Approx.Bayes. Approx.Bayes. s 2 = 9.719696. %. Bounds Bounds (SE) Bounds (HT) Table 1. Classical and approximate Bayesian 80 3.60 – 4.23 – 6.25 4.71– 6.10 confidence intervals of σ 2 corresponding to the first data set. 11.84

90 C L. Classical Approx.Bayes. Approx.Bayes. 3.14 – 3.84 – 6.33 4.23– 6.25

%. Bounds Bounds (SE) Bounds (HT) 14.62 80 6.18 – 7.32 – 10.47 8.08– 10.23 95 2.81 – 3.51 – 6.36 3.84 – 6.33 19.16 17.75 90 5.43 – 6.68 – 10.58 7.32 –10.47 99 2.28 – 2.94 – 6.39 3.16 – 6.38 23.36 26.73 95 4.87 – 6.15 – 10.63 6.68 –10.58

Confidence ( ) ( ) 28.01 level lC ÷ lC ÷

( l ) ( l ) 99 3.99– 5.19 – 10.68 5.56 –10.67 SE HT 80% 4.0777 5.9530 90% 41.07 4.6157 5.6804 95% 5.2426 6.0051 99% 7.0734 7.5801 Confidence ( l ) ÷ ( l ) ( l ) ÷ level C SE C

( lHT ) Example 3. Data obtained from Prem S. Mann, 80% 4.1193 6.0455 Introductory Statistics, Third edition, page 504, 90% 4.6021 5.6927 1998. 95% 5.1589 5.9373 99% 6.7538 7.2636 16, 14, 11, 19, 14, 17, 13, 16, 17, 18, 19, 12.


Normal population distribution obtained with SAS: N(µ = 15.5, σ = 2.6799). Population and sample variances: σ² = 7.18186, s² = 7.181818.

Table 3. Classical and approximate Bayesian confidence intervals of σ² corresponding to the third data set.

C.L. %   Classical Bounds   Approx. Bayes. Bounds (SE)   Approx. Bayes. Bounds (HT)
80       4.57 – 14.16       5.40 – 7.73                  5.97 – 7.56
90       4.01 – 17.26       4.94 – 7.81                  5.40 – 7.73
95       3.60 – 20.70       4.54 – 7.86                  4.94 – 7.81
99       2.95 – 30.34       3.83 – 7.89                  4.11 – 7.88

Confidence level   (lC) ÷ (lSE)   (lC) ÷ (lHT)
80%                4.1194         6.0456
90%                4.6022         5.6926
95%                5.1592         5.9375
99%                6.7539         7.2636

Example 4. Data obtained from Prem S. Mann, Introductory Statistics, Third edition, page 504, 1998.

27, 31, 25, 33, 21, 35, 30, 26, 25, 31, 33, 30, 28.

Normal population distribution obtained with SAS: N(µ = 28.846, σ = 3.9549). Population and sample variances: σ² = 15.64123, s² = 15.641025.

Table 4. Classical and approximate Bayesian confidence intervals of σ² corresponding to the fourth data set.

C.L. %   Classical Bounds   Approx. Bayes. Bounds (SE)   Approx. Bayes. Bounds (HT)
80       10.11 – 29.77      12.02 – 16.74                13.20 – 16.39
90       8.92 – 35.91       11.04 – 16.90                12.02 – 16.74
95       8.04 – 42.61       10.21 – 16.98                11.04 – 16.90
99       6.63 – 61.05       8.69 – 17.04                 9.28 – 17.03

Confidence level   (lC) ÷ (lSE)   (lC) ÷ (lHT)
80%                4.1688         6.1471
90%                4.6063         5.7243
95%                5.1059         5.9013
99%                6.5129         7.0273

Example 5. Data obtained from James T. McClave/Terry Sincich, A First Course in Statistics, page 301, Sixth edition, 1997.

52, 33, 42, 44, 41, 50, 44, 51, 45, 38, 37, 40, 44, 50, 43.

Normal population distribution obtained with SAS: N(µ = 43.6, σ = 5.4746). Population and sample variances: σ² = 29.97124, s² = 29.971428.


Table 5. Classical and approximate Bayesian confidence intervals of σ² corresponding to the fifth data set.

C.L. %   Classical Bounds   Approx. Bayes. Bounds (SE)   Approx. Bayes. Bounds (HT)
80       19.92 – 53.86      23.83 – 31.76                25.87 – 31.20
90       17.71 – 63.85      22.09 – 32.02                23.83 – 31.76
95       16.06 – 74.54      20.59 – 32.15                22.09 – 32.02
99       13.40 – 102.96     17.87 – 32.25                18.94 – 32.22

Confidence level   (lC) ÷ (lSE)   (lC) ÷ (lHT)
80%                4.2814         6.3629
90%                4.6465         5.8198
95%                5.0583         5.8889
99%                6.1902         6.7170

Example 6. Data obtained from James T. McClave/Terry Sincich, A First Course in Statistics, page 301, Sixth edition, 1997.

52, 43, 47, 56, 62, 53, 61, 50, 56, 52, 53, 60, 50, 48, 60, 55.

Normal population distribution obtained with SAS: N(µ = 53.625, σ = 5.4145). Population and sample variances: σ² = 29.31681, s² = 29.316666.

Table 6. Classical and approximate Bayesian confidence intervals of σ² corresponding to the sixth data set.

C.L. %   Classical Bounds   Approx. Bayes. Bounds (SE)   Approx. Bayes. Bounds (HT)
80       19.71 – 51.45      23.63 – 30.94                25.53 – 30.44
90       17.59 – 60.56      21.99 – 31.18                23.63 – 30.94
95       15.99 – 70.22      20.57 – 31.29                21.99 – 31.18
99       13.39 – 95.57      17.78 – 31.38                18.89 – 31.36

Confidence level   (lC) ÷ (lSE)   (lC) ÷ (lHT)
80%                4.3422         6.4743
90%                4.6781         5.8754
95%                5.0551         5.9036
99%                6.0822         6.6163

Example 7. The following observations have been obtained from the collection of SAS data sets.

50, 65, 100, 45, 111, 32, 45, 28, 60, 66, 114, 134, 150, 120, 77, 108, 112, 113, 80, 77, 69, 91, 116, 122, 37, 51, 53, 131, 49, 69, 66, 46, 131, 103, 84, 78.

Normal population distribution obtained with SAS: N(µ = 82.861, σ = 33.226). Population and sample variances: σ² = 1103.96716, s² = 1103.951587.


Table 7. Classical and approximate Bayesian confidence intervals of σ² corresponding to the seventh data set.

C.L. %   Classical Bounds   Approx. Bayes. Bounds (SE)   Approx. Bayes. Bounds (HT)
80       839.4 – 1556.4     1000.8 – 1129.4              1038.1 – 1121.6
90       776.4 – 1717.1     966.1 – 1133.0               1000.8 – 1129.4
95       726.8 – 1874.5     933.7 – 1134.7               966.1 – 1133.0
99       641.6 – 2240.2     866.3 – 1136.0               894.1 – 1135.7

Confidence level   (lC) ÷ (lSE)   (lC) ÷ (lHT)
80%                5.5772         8.5808
90%                5.6388         7.3176
95%                5.7119         6.8792
99%                5.9277         6.6181

All seven tables show that the proposed approximate Bayesian confidence intervals contain the population variance σ². Also, the lengths of the obtained classical confidence intervals are more than four times greater than the ones corresponding to the proposed approach.

Conclusion

In the present study, approximate Bayesian confidence intervals for the variance of a normal population under two different loss functions have been derived. The loss functions that are employed are the square error and the Higgins-Tsokos loss functions. Based on the above numerical results we can conclude the following:

The classical method used to construct confidence intervals for the variance of a normal population does not always yield the best coverage accuracy. In fact, each of the obtained approximate Bayesian confidence intervals contains the population variance and is strictly included in the corresponding confidence interval obtained with the classical method.

Contrary to the classical method, which uses the chi-square statistic, the proposed approach relies only on the observations.

With the proposed approach, approximate Bayesian confidence intervals for a normal population variance are easily computed for any level of significance.

The approximate Bayesian approach under the popular square error loss function does not always yield the best approximate Bayesian results. In fact, the Higgins-Tsokos loss function performs better in the above examples.

Bayesian analysis contributes to reinforcing well-known statistical theories such as the estimation theory.

References

Bhattacharya, S. K. (1967). Bayesian approach to life testing and reliability estimation. Journal of the American Statistical Association, 62, 48-62.

Bernard, H. (1976). A survey of statistical methods in system reliability using Bernoulli sampling of components. Proceedings of the Conference on the Theory and Applications of Reliability with Emphasis on Bayesian and Nonparametric Methods. NY: Academic.

Britney, R. R., & Winkler, R. L. (1968). Bayesian point estimation under various loss functions. American Statistical Association, 356-364.

Camara, V. A. R., & Tsokos, C. P. (1996). Effect of loss functions on Bayesian reliability analysis. Proceedings of the International Conference on Nonlinear Problems in Aviation and Aerospace, 75-90.

Camara, V. A. R., & Tsokos, C. P. (1998). Bayesian reliability modeling with applications. UMI Publishing Company.

Camara, V. A. R., & Tsokos, C. P. (1999). Bayesian estimate of a parameter and choice of the loss function. Nonlinear Studies Journal, 6, 55-64.

Camara, V. A. R., & Tsokos, C. P. (1999). The effect of loss functions on empirical Bayes reliability analysis. Journal of Engineering Problems, Boston University, http://bujep.bu.edu.

Camara, V. A. R., & Tsokos, C. P. (1999, November). Sensitivity behavior of Bayesian reliability analysis for different loss functions. International Journal of Applied Mathematics.

Camara, V. A. R. (2002). Approximate Bayesian confidence intervals for the variance of a Gaussian distribution. 2002 Proceedings of the American Statistical Association, Statistical Computing Section. New York, NY: American Statistical Association.

Canfield, R. V. (1970). A Bayesian approach to reliability estimation using a loss function. IEEE Transactions on Reliability, R-19(1), 13-16.

Drake, A. W. (1966). Bayesian statistics for the reliability engineer. Proceedings of the Annual Symposium on Reliability, 315-320.

Higgins, J. J., & Tsokos, C. P. (1976). Comparison of Bayes estimates of failure intensity for fitted priors of life data. Proceedings of the Conference on the Theory and Applications of Reliability with Emphasis on Bayesian and Nonparametric Methods. NY: Academic.

Higgins, J. J., & Tsokos, C. P. (1976). On the behavior of some quantities used in Bayesian reliability demonstration tests. IEEE Transactions on Reliability, R-25(4), 261-264.

Higgins, J. J., & Tsokos, C. P. (1980). A study of the effect of the loss function on Bayes estimates of failure intensity, MTBF, and reliability. Applied Mathematics and Computation, 6, 145-166.

McClave, J. T., & Sincich, T. (1997). A first course in statistics (6th ed.). Upper Saddle River, NJ: Prentice Hall.

Mann, P. S. (1998). Introductory statistics (3rd ed.). NY: Wiley.

Schafer, et al. (1970). Bayesian reliability demonstration, Phase I - Data for the a priori distribution. Rome Air Development Center, Griffiss AFB, NY, RADC-TR-69-389.

Schafer, et al. (1971). Bayesian reliability, Phase II - Development of a priori distribution. Rome Air Development Center, Griffiss AFB, NY, RADC-YR-71-209.

Schafer, et al. (1973). Bayesian reliability demonstration, Phase III - Development of test plans. Rome Air Development Center, Griffiss AFB, NY, RADC-TR-73-39.

Shafer, R. E., & Feduccia, A. J. (1972). Prior distribution fitted to observed reliability data. IEEE Transactions on Reliability, R-21(3), 148-154.

Tsokos, C. P., & Shimi, I. (Eds.). (1976). Proceedings of the Conference on the Theory and Applications of Reliability with Emphasis on Bayesian and Nonparametric Methods, I, II. NY: Academic.

Winkler, R. L. (1972). Introduction to Bayesian inference and decision-making. NY: Holt, Rinehart and Winston.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 359-370 1538 – 9472/03/$30.00

Random Regression Models Based On The Elliptically Contoured Distribution Assumptions With Applications To Longitudinal Data

Alfred A. Bartolucci, Shimin Zheng
Department of Biostatistics
University of Alabama at Birmingham and
Nanjing Audit University, P. R. China

Sejong Bae, Karan P. Singh
Department of Biostatistics
University of North Texas
Health Science Center at Fort Worth

We generalize Lyles et al.'s (2000) random regression models for longitudinal data, accounting for both undetectable values and informative drop-outs in the distribution assumptions. Our models are constructed on the generalized multivariate theory which is based on the Elliptically Contoured Distribution (ECD). The estimation of the fixed parameters in the random regression models is invariant under the normal or the ECD assumptions. For the Human Immunodeficiency Virus Epidemiology Research Study data, ECD models fit the data better than classical normal models according to the Akaike (1974) Information Criterion. We also note that both univariate distributions of the random intercept and random slope and their joint distribution are non-normal short-tailed ECDs, and that the error term is distributed as a non-normal long-tailed ECD if we do not use the low undetectable limit, or half of it, to replace the undetectable values. Instead, we use the ECD cumulative distribution function to calculate the contribution to the likelihood due to the undetectable values.

Key words: Generalized multivariate analysis, power exponential distributions, Gamma distributions, maximum likelihood functions, censoring, informative drop-outs, empirical Bayes

Introduction

In clinical studies of human immunodeficiency virus (HIV) infection the number of copies of HIV ribonucleic acid (RNA) per milliliter of plasma is often used to measure the progression of the disease. When the number of copies per milliliter is below or equal to 500, the observation is considered as undetectable, missing, or left-censored, since the copy numbers below 500 are not quantifiable.

On the other hand, illness or death caused by an early drop-out is known as an informative drop-out. If either a left-censored or an informative drop-out is present, as Lyles et al. (2000) pointed out, random effects linear models (Laird & Ware, 1982) and generalized estimating equations (GEE) (Liang & Zeger, 1986) produce biased estimates of key parameters, such as the population average HIV RNA slope and intercept. Louis (1982) used asymptotic approximation methods to deal with the problem of left-censored and informative drop-out data. Both Hughes (1999) and Schluchter (1992) implemented Maximum Likelihood (ML) estimation via the Expectation and Maximization (EM) algorithm to handle the problem of left-censored and informative drop-out data. Lyles et al. (2000) combined the approaches of Hughes (1999) and Schluchter (1992) into a single likelihood integrating subject-specific random slopes and intercepts which took both informative drop-out and undetectable data into account. Then, they maximized the likelihood function with respect to fixed effects and other variables.

Correspondence regarding this article should be emailed to Alfred A. Bartolucci: [email protected]. The authors acknowledge assistance from Robert H. Lyles with SAS and S-Plus programming; and the HERS Study Group for providing the Human Immunodeficiency Virus (HIV) Epidemiology Research Study data.

Our approach follows Lyles et al. (2000) and we extend their normal distribution assumptions to the ECD assumptions, since when the number of undetectable observations exceeds a certain number, or when the random intercept and random slope have a bell shaped and long-tailed or short-tailed distribution, the ECD improves the fit of the data over the normal distribution.

We used the data from the study of Lyles et al. in this paper. From April 1993 to June 1998 there were 528 HIV-infected women (16-55 years old) in the HIV Epidemiology Research Study (HERS) and 1,864 RNA measurements were collected. Overall, there were 25 (4.7%) drop-out events which resulted in 77 informative drop-out observations, according to Lyles et al.'s (2000) definition. We used δ as an indicator which was set to 1 if an observation was an informative drop-out and to 0 otherwise. For these 25 individuals the time on study was set as the minimum of the time from the base-line to death or the time from the base-line to 3 months beyond the last visit. For other non-informative drop-out women the censored time was set equal to the time from the base-line to the last visit date. Overall, 745 (40%) out of 1,864 HIV RNA observations were undetectable or left-censored (below 500 copies per milliliter).

Power Exponential Distributions and Models

The power exponential distributions can be used to model both light and heavy tailed, symmetric and unimodal continuous data sets. Gomez et al. (1998) generalized the Univariate Power Exponential (UPE) distribution, which was established by Subbotin (1923), to the Multivariate Power Exponential (MPE) distribution. Both Johnson (1979) and Gomez et al. (1998) discussed the relationship between the UPE distribution and a Gamma distribution. Gomez et al. (1998) studied the properties of the MPE intensively, including the stochastic representation, the moments, the characteristic function, the marginal and conditional distributions, and asymmetry and kurtosis coefficients. Obviously, the family of MPE distributions is a subset of the class of ECDs. Gomez et al. (1998) defined the MPE distribution as follows:

f(y; \mu, \Sigma, \beta) = \frac{n\,\Gamma(n/2)}{\pi^{n/2}\,|\Sigma|^{1/2}\,\Gamma\left(1+\frac{n}{2\beta}\right)\,2^{1+\frac{n}{2\beta}}}\;\exp\left\{-\tfrac{1}{2}\left[(y-\mu)'\Sigma^{-1}(y-\mu)\right]^{\beta}\right\},   (1)

where -∞ < µ < ∞, Σ > 0, 0 < β < ∞. If y is distributed as an MPE distribution with parameters µ, Σ and β, we write y ~ MPE(µ, Σ, β), and we write y ~ UPE(µ, σ, β) if n = 1. The parameter β is called the shape parameter.
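As a quick numerical illustration (not part of the article), the UPE density obtained from (1) with n = 1 reduces to the ordinary normal density when the shape parameter equals 1. The following PROC IML sketch, with illustrative values, checks this at a single point:

proc iml;
   /* UPE density from (1) with n = 1; mu, sigma, beta are scalars */
   start upe_pdf(y, mu, sigma, beta);
      z = ((y - mu)/sigma)##2;                   /* squared standardized value */
      f = exp(-0.5*z##beta) /
          (sigma # gamma(1 + 0.5/beta) # 2##(1 + 0.5/beta));
      return(f);
   finish;

   y      = 1.3;
   upe    = upe_pdf(y, 0, 1, 1);                 /* UPE with beta = 1 */
   normal = pdf("normal", y, 0, 1);              /* N(0,1) density    */
   print upe normal;                             /* the two values agree */
quit;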

We use the following linear random-effects regression model (LRRM):

y_{ij} = \alpha + a_i + (\beta + b_i)\,t_{ij} + e_{ij}.   (2)

We take the response y_ij to be the base 10 logarithm of HIV RNA measured at the jth time point t_ij (j = 1, 2, ..., n_i) for the ith woman (i = 1, ..., 528; 1 ≤ n_i ≤ 5 for our data set). We assume that the error terms e_ij are distributed as UPE(µ, σ², ν1), the random intercept deviations a_i are distributed as UPE(µ, σ1², ν2) and the random slope deviations b_i are distributed as UPE(µ, σ2², ν2), with cov(a_i, b_i) = cσ12, where c is the correction coefficient and ν2 is a shape parameter. The joint distribution of a_i and b_i is MPE2(0, Σ2, ν2), where Σ2 = (σij). Based on the trivariate normal distribution model (Schluchter, 1992) we assume the 3-dimensional random vector (a_i, b_i, T_i⁰)' is distributed as trivariate power exponential, i.e.

(a_i, b_i, T_i^0)' \sim MPE_3(\mu, \Sigma_3, \nu_2),

where

\mu = \begin{pmatrix} 0 \\ 0 \\ \mu_t \end{pmatrix}, \qquad
\Sigma_3 = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{at} \\ \sigma_{12} & \sigma_2^2 & \sigma_{bt} \\ \sigma_{at} & \sigma_{bt} & \sigma_t^2 \end{pmatrix}, \qquad
\mathrm{rk}(\Sigma_3) = 3.

The joint pdf of (a_i, b_i, T_i⁰)' is given as

f(a_i, b_i, T_i^0) = \frac{3\,\Gamma(3/2)}{\pi^{3/2}\,|\Sigma_3|^{1/2}\,\Gamma\left(1+\frac{3}{2\nu_2}\right)\,2^{1+\frac{3}{2\nu_2}}}\;\exp\left\{-\tfrac{1}{2}\left[\left((a_i,b_i,T_i^0)'-\mu\right)'\Sigma_3^{-1}\left((a_i,b_i,T_i^0)'-\mu\right)\right]^{\nu_2}\right\},

where T_i⁰ is the natural logarithm of the "survival" time for subject i.

Maximum Likelihood Functions

In this section we utilize general integrated likelihood expressions given by Lyles et al. (2000), in order to facilitate estimation and inference for the ECD case.

(a) The Maximum Likelihood (ML) function without accounting for undetects and informative drop-outs: By the conditional probability formulae the ML function without accounting for undetects and informative drop-outs is given by

L(\theta, Y, T) = \prod_{i=1}^{k}\left[\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\prod_{j=1}^{n_i} f(Y_{ij}\mid a_i,b_i)\, f(a_i\mid b_i)\, f(b_i)\, da_i\, db_i\right],   (3)

where θ = (α, β, σ1², σ2², σ12, σ²)', Y is a vector consisting of the Y_ij, T is a vector consisting of the t_ij, and

f(Y_{ij}\mid a_i,b_i) = \frac{1}{\sigma\,\Gamma\left(1+\frac{1}{2\nu_1}\right)\,2^{1+\frac{1}{2\nu_1}}}\;\exp\left\{-\tfrac{1}{2}\left|\frac{Y_{ij}-[\alpha + a_i + (\beta+b_i)t_{ij}]}{\sigma}\right|^{2\nu_1}\right\},

f(a_i\mid b_i)\, f(b_i) = f(a_i, b_i) = \frac{2}{\pi\,|\Sigma_2|^{1/2}\,\Gamma\left(1+\frac{1}{\nu_2}\right)\,2^{1+\frac{1}{\nu_2}}}\;\exp\left\{-\tfrac{1}{2}\left[(a_i,b_i)\,\Sigma_2^{-1}(a_i,b_i)'\right]^{\nu_2}\right\}.

(b) The ML function accounting for undetectable values only: We use d to denote the operable limit of detection. We assume that the first n_i1 measurements are detectable values and there are n_i − n_i1 undetectable values for subject i. We use the probability density function (pdf) to calculate the contribution to the likelihood due to the observed values for subject i. On the other hand, we use the cumulative distribution function (cdf) to calculate the contribution to the likelihood due to the undetectable values. Therefore, the complete-data likelihood function is given by

L(\theta, Y) = \prod_{i=1}^{k}\left[\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\prod_{j=1}^{n_{i1}} f(Y_{ij}\mid a_i,b_i)\prod_{j=n_{i1}+1}^{n_i} F_Y(d\mid a_i,b_i)\, f(a_i\mid b_i)\, f(b_i)\, da_i\, db_i\right],

where f(Y_ij | a_i, b_i) and f(a_i | b_i)f(b_i) are given in (3) and

F_Y(d\mid a_i,b_i) = \int_{-\infty}^{b}\frac{1}{\sigma\,\Gamma\left(1+\frac{1}{2\nu_1}\right)\,2^{1+\frac{1}{2\nu_1}}}\;\exp\left\{-\tfrac{1}{2}\left|\frac{y}{\sigma}\right|^{2\nu_1}\right\}dy
  = \tfrac{1}{2}\left[P\left\{u \le \tfrac{1}{2}\left(\tfrac{b}{\sigma}\right)^{2\nu_1}\right\} + 1\right],  if b ≥ 0,
  = \tfrac{1}{2} - \tfrac{1}{2}\,P\left\{u \le \tfrac{1}{2}\left(-\tfrac{b}{\sigma}\right)^{2\nu_1}\right\},  if b < 0,   (4)

where b = y_i − [α + a_i + (β + b_i)t_i], y_i is the censored value for subject i, and u ~ Γ(1, 1/(2ν1)).

(c) The ML function accounting for informative drop-outs only: We use T_i⁰ to denote the natural logarithm of the "survival" time for subject i and c_i to denote the natural logarithm of the time from the base-line to the study end. Let T_i = min(T_i⁰, c_i).

(i) If subject i did not drop out early we have δ_i = 0 and use 1 − F_T(c_i | a_i, b_i) to compute the contribution to the likelihood due to the right censored values, where F_T is the cdf of T given a_i and b_i. That is,

F_T(c_i\mid a_i,b_i) = \frac{3\,\Gamma(3/2)\,|\Sigma_2|^{1/2}\,\Gamma\left(1+\frac{1}{\nu_2}\right)\,2^{1+\frac{1}{\nu_2}}}{\pi^{3/2}\,|\Sigma_3|^{1/2}\,\Gamma\left(1+\frac{3}{2\nu_2}\right)\,2^{1+\frac{3}{2\nu_2}}}\;\int_{-\infty}^{c_i}\exp\left\{-\tfrac{1}{2}\left[(a_i,b_i,z-\mu_t)\,\Sigma_3^{-1}(a_i,b_i,z-\mu_t)'\right]^{\nu_2} + \tfrac{1}{2}\left[(a_i,b_i)\begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{12} & \sigma_2^2\end{pmatrix}^{-1}(a_i,b_i)'\right]^{\nu_2}\right\}dz,   (5)

where Σ2, Σ3 and ν2 were defined in the LRRM.

(ii) If subject i dropped out early we have δ_i = 1 and T_i = T_i⁰, and use the pdf f(T_i⁰ | a_i, b_i) to compute the contribution to the likelihood due to the informative drop-out values. Therefore, the likelihood function accounting for informative drop-outs and the right censored data is given by

L(\theta, Y, T) = \prod_{i=1}^{n}\left[\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(Y_i\mid a_i,b_i)\, f(T_i^0\mid a_i,b_i)^{\delta_i}\,\left[1 - F_T(c_i\mid a_i,b_i)\right]^{1-\delta_i} f(a_i\mid b_i)\, f(b_i)\, da_i\, db_i\right],   (6)

where θ = (α, β, σ1², σ2², σ12, σ², µt, σat, σbt, σt²)'. Thus, the complete ML function is given by

L(\theta, Y, T) = \prod_{i=1}^{528}\left[\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\prod_{j=1}^{n_{i1}} f(Y_{ij}\mid a_i,b_i)\prod_{j=n_{i1}+1}^{n_i} F_Y(d\mid a_i,b_i)\; f(T_i^0\mid a_i,b_i)^{\delta_i}\,\left[1 - F_T(c_i\mid a_i,b_i)\right]^{1-\delta_i} f(a_i\mid b_i)\, f(b_i)\, da_i\, db_i\right].   (7)

Computing Empirical Bayes Estimates of Random Intercepts & Random Slopes

In this section we discuss the calculation of empirical Bayes estimates of random intercepts and random slopes in the presence of drop-outs and undetectable values based on the ECD assumptions. Specifically, we calculate the estimate of the random intercept a_i and random slope b_i by substituting the ML estimators of θ based on the ML function (7) developed in the last section into the analytic expressions for the posterior means given the observed data (Y_i, T_i). The empirical Bayes estimates of the random intercept a_i and slope b_i for subject i are given, respectively, by

\hat{a}_i = E(a_i\mid Y_i, T_i) = f^{*}(Y_i, T_i, \theta)^{-1}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_a(a_i, b_i)\, da_i\, db_i,

where

g_a(a_i, b_i) = a_i\, f^{*}(Y_i\mid a_i,b_i)\, f(T_i\mid a_i,b_i)\, f(a_i\mid b_i)\, f(b_i),

and

\hat{b}_i = E(b_i\mid Y_i, T_i) = f^{*}(Y_i, T_i, \theta)^{-1}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_b(a_i, b_i)\, da_i\, db_i,

where

g_b(a_i, b_i) = b_i\, f^{*}(Y_i\mid a_i,b_i)\, f(T_i\mid a_i,b_i)\, f(a_i\mid b_i)\, f(b_i).

The above empirical Bayes estimates were given by Lyles et al. (2000). Note that f*(Y_i | a_i, b_i) is different from f(Y_i | a_i, b_i); the one with the asterisk indicates that the data vector Y_i may include one or more undetectable values.

Computation

The software package we have used to obtain the ML estimates of variance components and fixed effects corresponding to the models discussed in this chapter is SAS PROC IML. The ML function is constructed within PROC IML first. The initial parameter estimates are obtained from Lyles et al. (2000). The ML function is maximized through the NLPQN routine in IML with respect to the parameters stated in this paper. The double integration was computed by quadrature for each subject. The Hessian matrix (the dispersion matrix of the estimated parameters) was found through the NLPFDD routine in IML. There are no built-in generic non-normal ECD functions in SAS. We used the theorems of the relationship between a UPE and a Gamma distribution developed in another paper to compute the probability of a UPE distribution below or above a certain point. However, this method can not be used to deal with the MPE distribution or the conditional and marginal MPE distributions, since there is no existing useful relationship between an MPE and a Gamma distribution, and the conditional or the marginal distributions of an MPE are not necessarily MPEs, which can be much more complicated ECD distributions. We used approximation methods to integrate such integrands. Simpson's rule has been adopted, which requires much less computing time and can reach highly accurate results. S-Plus and SAS PROC IML were used to obtain the empirical Bayes estimates of the random intercept and random slope for each subject and the critical values of the UPE distribution, and Simpson's rule has also been used for non-normal situations.
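As a toy illustration of the composite Simpson's rule used for these one-dimensional integrals (this sketch is not the authors' production code; the grid size, integrand, and parameter values below are illustrative only), the following PROC IML fragment integrates a UPE-type kernel over a finite interval:

proc iml;
   /* Composite Simpson's rule on [a, b] with an even number of panels.   */
   /* The integrand here is the UPE kernel exp(-0.5*((z/sigma)**2)**nu);  */
   /* sigma and nu are illustrative values.                               */
   start simpson(a, b, nsteps, sigma, nu);
      h = (b - a)/nsteps;
      s = 0;
      do k = 0 to nsteps;
         z = a + k*h;
         f = exp(-0.5*((z/sigma)##2)##nu);
         if k = 0 | k = nsteps then w = 1;      /* endpoint weights        */
         else if mod(k, 2) = 1 then w = 4;      /* odd nodes               */
         else w = 2;                            /* even interior nodes     */
         s = s + w*f;
      end;
      return(h/3 # s);
   finish;

   approx = simpson(-4, 4, 200, 1, 0.7);        /* numerical integral      */
   print approx;
quit;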
Results

We used the Akaike (1974) Information Criterion (AIC), which was used by Lindsey (1999) among others, to compare the classical multivariate normal model and the multivariate power exponential model. In version 8 of SAS/STAT software AIC is defined as "smaller-is-better". Specifically, AIC = -2l + 2d, where l denotes the maximum value of the log likelihood and d denotes the dimension of the model, i.e., the number of parameters estimated in the ML function. Six models were considered:

Model 1 (M1): In this model we assumed the normal distributions. There were six parameters (α, β, σ1², σ2², σ12, σ²) estimated in the ML function accounting for undetectable values, which was constructed as in equation (2) of Lyles et al. (2000, p. 488).

Model 2 (M2): As in model M1, the normal distributions were assumed. There were ten parameters (α, β, σ1², σ2², σ12, σ², µt, σat, σbt, σt²) estimated in the ML function accounting for both undetects and informative drop-outs, which was constructed as in equation (5) of Lyles et al. (2000, p. 489).

Model 3 (M3): ECDs were assumed in this model. This model accounts for undetectable values only. Furthermore, we assumed that the two shape parameters were equal, i.e., ν1 = ν2. There were seven parameters (α, β, σ1², σ2², σ12, σ², ν1) estimated using the ML function.

Model 4 (M4): This model is the same as M3 except that we don't assume ν1 = ν2.

Model 5 (M5): ECDs were assumed in this model. Undetectable, informative drop-out and right censored values were considered at the same time in this model. Also, we assume ν1 = ν2.

Model 6 (M6): This model is the same as M5 except that we don't assume ν1 = ν2.

Next, we summarize what we have found from the HERS data analysis.

(1). ECDs fit the data much better than the classical normal distributions. Among models M1, M3 and M4 we account for the undetectable values only. Model M1 is based on the normal distribution assumptions while models M3 and M4 are based on ECD assumptions. The value of AIC changes from 3932.216 to 3928.101 when model M3 is used, whereas the value reduces to 3908.833 using model M4. Overall, model M4 is the best according to the AIC standard if we consider undetectable values only in our analysis.

Among models M2, M5 and M6 we treat undetects, informative drop-outs and right censored observations simultaneously. Model M2 is based on the normal distribution assumptions, but models M5 and M6 are based on the ECD assumptions. Model M5 reduces AIC from 4083.556 of M2 to 4079.746 (see Table 2). Overall, model M6 (4064.791) is the best by the AIC standard if we consider all possible situations.

(2). The dispersion matrix of an MPE random vector is proportional to Σ as defined in section 2. Hence, multiplying the ML estimate Σ̂ by a coefficient, we transformed Σ̂ to the estimated dispersion matrix whose elements are listed in Table 1. As expected, variance and covariance estimates are very close under the six different models. This proportional relationship provides us a short cut to obtain the ML estimates. That is, we can get the ML estimate of the dispersion matrix under the normal distribution assumption first and then utilize this estimated dispersion matrix to estimate the shape parameters. This method is very useful and effective, especially when we have a large number of parameters to estimate or when we deal with a very large data set where computing CPU time and memory space are prohibiting. The estimates of the fixed intercept and the fixed slope for all subjects are almost exactly the same under the six different models. This is because α̂ and β̂ only involve the data set, which is given, and the dispersion matrices of random effects and error terms, which are invariant under the normal distribution assumptions and the ECD assumptions as we discussed.

(3). The estimates of the shape parameters in Table 2 strongly suggest that we should make the power exponential distribution assumptions instead of the classical normal distribution assumptions, since our simulations revealed that shape parameters less than 0.94 or greater than 1.15 indicate that the distribution departs significantly from the normal distribution at the α = 0.05 level. The shape parameter estimates ν̂ = 0.6574 (S.E. = 0.116) under model M3 and ν̂ = 0.6997 (S.E. = 0.099) under model M5 indicate that the 40% undetects contribute to a long tailed non-normal distribution. In models M4 and M6 we don't assume ν1 = ν2. The estimate of the second shape parameter is ν̂2 = 1.8089 (S.E. = 0.490) in model M4 and ν̂2 = 1.3706 (S.E. = 0.215) in model M6. The shape parameter estimates ν̂2 in both model M4 and model M6 are much larger than 1, which shows that both univariate distributions of the random intercept and the random slope and their joint distribution are non-normal. They are thin-tailed ECDs, concentrated around 0 means.

Possible Extensions

First, power exponential distributions are just a member of the larger ECD family. To extend the power exponential distribution assumptions for the models we have discussed is a challenging task and of great interest in both theory and practice. Second, we used approximation methods to compute probability distribution function values at a certain given point and the probability on some interval or within a certain given high dimension rectangle for the non-normal power exponential distributions. The CPU time and memory space required for this kind of task are prohibitive. This highly intensive computing problem will be eased if we could find an exact or asymptotic relationship between distributions (such as non-normal MPEs and marginal or conditional distributions of a non-normal MPE). Third, we have used simulation methods to assess different distributions, like normal or non-normal characteristics as per the shape parameter. If we could construct a statistic related to the shape parameter and get an explicit, exact or asymptotic distribution of the statistic, we could do a formal accurate hypothesis testing about the shape parameter of the distribution. This is another challenging task for future research. All source code provided in this paper is in SAS (Appendix).

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Inference Theory.

Fang, K. T., & Zhang, Y. T. (1993). Generalized multivariate analysis. Science Press.

Gomez, E., Gomez-Villegas, M. A., & Marin, J. M. (1998). A multivariate generalization of the power exponential family of distributions. Communications in Statistics, A27, 589-600.

Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72(358), 320-338.

Harville, D. (1976). Extension of the Gauss-Markov theorem to include the estimation of random effects. The Annals of Statistics, 4(2), 384-395.

Hughes, J. P. (1999). Mixed effects models with censored data with application to HIV RNA levels. Biometrics, 55, 625-629.

Laird, N., & Ware, J. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974.

Lindsey, J. K. (1999). Multivariate elliptically contoured distributions for repeated measurements. Biometrics, 55, 1277-1280.

Lyles, R. H., Lyles, C. M., & Taylor, D. J. (2000). Random regression models for human immunodeficiency virus ribonucleic acid data subject to left censoring and informative drop-outs. Applied Statistics, 49(4), 485-497.

Lyles, R. H., Lyles, C. M., & Taylor, D. J. (2000). SAS programs simulation data sets. http://www.blackwellpublishing.com/rss/Volumes/Cv49p4.htm.

Schluchter, M. D. (1992). Methods for the analysis of informatively censored longitudinal data. Statistics in Medicine, 11, 1861-1870.

Smith, D. K., Warren, D. L., Vlahov, D., Schuman, P., Stein, M. D., Greenberg, B. L., & Holmberg, S. D. (1997). Design and baseline participant characteristics of the human immunodeficiency virus epidemiology research (HER) study: A prospective cohort study of human immunodeficiency virus infection in US women. American Journal of Epidemiology, 146, 459-469.

Table 1. Results from HERS data: ML Estimates.

        α        β        σ1²      σ2²      σ12      σ²       µt       σat      σbt      σt²
M1      2.89     0.058    0.721    0.037    0.061    0.383    -        -        -        -
        (0.033)  (0.016)  (0.088)  (0.008)  (0.022)  (0.023)
M2      2.88     0.062    0.718    0.039    0.060    0.382    2.32     0.165    0.035    0.298
        (0.050)  (0.016)  (0.088)  (0.008)  (0.022)  (0.023)  (0.158)  (0.062)  (0.022)  (0.096)
M3      2.91     0.058    0.747    0.040    0.050    0.387    -        -        -        -
        (0.057)  (0.017)  (0.149)  (0.009)  (0.015)  (0.076)
M4      2.89     0.050    0.695    0.044    0.054    0.410    -        -        -        -
        (0.002)  (0.001)  (0.417)  (0.028)  (0.052)  (0.023)
M5      2.90     0.062    0.833    0.047    0.059    0.383    2.258    0.173    0.042    0.269
        (0.053)  (0.017)  (0.137)  (0.008)  (0.015)  (0.068)  (0.144)  (0.036)  (0.010)  (0.060)
M6      2.90     0.062    0.833    0.047    0.059    0.383    2.258    0.173    0.042    0.269
        (0.053)  (0.017)  (0.137)  (0.008)  (0.015)  (0.068)  (0.144)  (0.036)  (0.010)  (0.060)
Note. Numbers in parentheses are Standard Errors of the corresponding estimates.

Table 2. Results from HERS data: Shape parameter estimates and AIC.

        ν1               ν2               d    -2 log-likelihood   AIC
M1      -                -                6    3920.216            3932.216
M2      -                -                10   4063.556            4083.556
M3      0.6574 (0.116)   -                7    3914.101            3928.101
M4      0.4694 (0.060)   1.8090 (0.490)   8    3892.833            3908.833
M5      0.6997 (0.099)   -                11   4057.746            4079.746
M6      0.5173 (0.055)   1.3706 (0.215)   12   4040.791            4064.791
Note. Numbers in parentheses are Standard Errors of the corresponding estimates.
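As a quick arithmetic check of the "smaller-is-better" definition AIC = -2l + 2d used above (this check is not in the original article), each tabled AIC equals the reported -2 log-likelihood plus twice the number of estimated parameters, for example:

AIC(M4) = 3892.833 + 2 x 8  = 3908.833
AIC(M6) = 4040.791 + 2 x 12 = 4064.791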


Appendix

SAS Program for Taking Left-censored into Account ********************************************************************** Acknowledgments: The following program was created originally by Dr. Robert H. Lyles. We have changed his distribution assumptions normal to ECD and added five nonlinear constraints. We really appreciate Dr. Lyles's providing this program. Description: Calculation of the ML estimates of the fixed effects and the variance matrix. We assume the underlying distributions are ECDs. Also, we take the undetectable observations into account under the model described by the likelihood equation (4) in this paper. ********************************************************************** data test; infile '/herscens1.dat'; input obsn id time nondet response fail survtyrs logsurvt; *Compute ML estimates via PROC MIXED on complete data (which would not be available in practice). That is, using the actual values for the response and all 1864 measurements; proc mixed data=test method=ml; class id; model response=time / s ddfm=bw; random intercept time /type=un subject=id; title2 "ml estimates for full data set (unavailable in practice)"; run; data test2; set test; if nondet=1 then do; observed=0; end; else if nondet=0 then do; observed=1; end; label response="Base 10 log HIVRNA value" time ="time of measurement" id ="subject id" observed="indicator for whether value was observed" fail="indicator for whether subject dropped out" survtyrs="Years to dropout" logsurvt="Natural log of dropout time";

* Compute ML estimates ignoring left censoring and drop-outs using PROC MIXED
  with random intercept and slope. These naive estimates will be used as
  starting values for the six parameters of the mixed effects model;
proc mixed data=test2 method=ml;
class id;

model response=time / s ddfm=bw; random intercept time /type=un subject=id; title2 "ml estimates ignoring left censoring and dropouts"; run;

***Create dataset to be read into IML for maximizing likelihood in Eqn. 2, accounting for left censoring: *; data test; set test; if nondet=1 then do; observed=0; end;

else if nondet=0 then do; observed=1; end; run; proc iml worksize=999216000 symsize=999999900;

******************************************************************* * define IML function which will be used to maximize the likelihood *******************************************************************; start likeli1(parms);

* lower and upper boundaries and stepsize for numerical integration; nsteps=31; a_l =-5; a_u = 5; step_a=(a_u-a_l)/(nsteps-1); b_l = -1.5; b_u = 1.5; step_b=(b_u-b_l)/(nsteps-1); pi=2*arsin(1);

* variables corresponding to input parameters from vector 'parms'; sigsq1 =parms[1]; * random intercept effect variance; sig12 =parms[2]; * covariance between random intercept and slope; sigsq2 =parms[3]; * random slope effect variance; sigsq =parms[4]; * within subject variance; alpha =parms[5]; * fixed effect intercept; beta =parms[6]; * fixed effect slope; v =parms[7];

* determine number of subjects in dataset; use test; read all var {id} into subjects; close test;

* compute number of subjects and create vector for each subject's
  contribution to the likelihood;
subjects=ncol(unique(subjects));
terms=j(subjects,1,.);

* get vector of indicators for observed vs. censored responses for subject i; do i=1 to subjects; use test; read all var {observed} into d_i where (id=i); close test;

* number of observations, number of observed values, and number of censored values, respectively, for subject i; n_i=nrow(d_i); o_i=sum(d_i); c_i=n_i-o_i;

* create vectors of censored values and the associated time of measurement; if c_i>0 then do; use test; read all var {response} into cens_i where (id=i & observed=0); read all var {time} into c_time_i where (id=i & observed=0); close test; end;

* create vectors of observed values and the associated time of measurement; if o_i>0 then do; use test; read all var {response} into y_i where (id=i & observed=1); read all var {time} into time_i where (id=i & observed=1); close test; end;

* set initial value for likelihood contribution by subject i to zero; func_i=0;

* define quadrature points for numerical integration; do a_i=a_l to a_u by step_a; do b_i=b_l to b_u by step_b;

* contribution to likelihood due to observed values for subject i; if o_i=0 then func_i1=1; else do; t_i1=(y_i-alpha-beta*time_i); t_i2=(a_i+b_i*time_i); func_i1=(1/(sqrt(sigsq)*gamma(1+0.5/v)*(2##(1+0.5/v)))**o_i)* exp(-0.5*sum(((t_i1-t_i2)##2/sigsq)##v)); end;

BARTOLUCCI, ZHENG, BAE, & SINGH 369

* contribution to likelihood due to censored values for subject i; func_i2=1; if c_i>0 then do j=1 to c_i; b=cens_i[j,1]-alpha-a_i-beta*c_time_i[j,1]- b_i*c_time_i[j,1]; if b >= 0 then temp_i2=0.5*(1+probgam(0.5*(b/sqrt(sigsq))**(2*v),(1/(2*v)))); else temp_i2=0.5*(1-probgam(0.5*(- b/sqrt(sigsq))**(2*v),(1/(2*v)))); func_i2=func_i2*temp_i2; end;

* compute correlation coefficient between intercept and slope; r=sig12/sqrt(sigsq1*sigsq2);

* compute joint distribution of intercept and slope; w=(sigsq1||sig12)//(sig12||sigsq2); u=det(w); y=inv(w); x_i=(a_i||b_i);

func_i3=(2/(pi*sqrt(u)*gamma(1+1/v)*(2##(1+1/v))))* exp(-0.5*(x_i*y*x_i`)##v);

* compute contribution of subject 'i' to objective function;

func_i=func_i+(func_i1*func_i2*func_i3*step_a*step_b); end; end;

* add subject i's contribution to vector of likelihood terms;

terms[i,1]=func_i; end;

* compute -2 log likelihood;

loglik2=-2*sum(log(terms)); return(loglik2); finish likeli1;

**********************************************************************
The following is the main body of the program (which calls the
minimization function, computes the Hessian, etc.)
********************************************************************** ;
* initial estimates from preliminary analysis;
parms={.24 -.012 .015 .201 3.21 .039 1.0};

* options vector for minimization function;
* matrix of lower (row 1) and upper (row 2) bound constraints on parameters
  (sigsq1 > 0, sig12 <> 0, sigsq2 > 0, sigsq > 0, alpha <> 0, beta <> 0);
/* con={1E-5 . 1E-5 1E-5 . ., ...... }; */

* The following are five non-linear restrictions; start c_h(parms); c=j(5,1,0.); c[1]=parms[1]; c[2]=parms[3]; c[3]=parms[4]; c[4]=parms[1]-(parms[2]##2/parms[3]); c[5]=parms[7]; return(c); finish c_h;

* call function minimizer in IML; optn=j(1,11,.); optn[1]=0; optn[2]=3; optn[10]=5; optn[11]=0; call nlpqn(rc, xres, "likeli1", parms, optn) nlc="c_h";

* create vector of mle's computed using function minimizer; parms=xres`;

* compute numerical value of Hessian (and covariance matrix) using mle's calculated above; call NLPFDD(crit, grad, hess, "likeli1", parms); cov_mat=2*inv(hess); se_vec =sqrt(vecdiag(cov_mat)); print cov_mat se_vec;

*****************************************************;
* The following program is used to transform the MLE of the ECD Sigma matrix
  into the variance matrix;
proc iml;
sig1={ 0.221828 -0.014873 0.011839};
sig = 0.141499;
a = 2.906699;
b = 0.057614;
beta=0.657391;
c1=2**(1/beta)*gamma(2/beta)/(2*gamma(1/beta));
c2=2**(1/beta)*gamma(1.5/beta)/(gamma(0.5/beta));
sig11=c1*sig1;
sig0=c2*sig;
sigma=sig11||sig0||a||b;
print sigma; /* with ECD */
sigmaOld={0.720710 -0.060955 0.037333 0.382976 2.886360 0.058335};
print sigmaOld; /* without ECD */

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 371-379 1538 – 9472/03/$30.00

Using Zero-inflated Count Regression Models To Estimate The Fertility Of U. S. Women

Dudley L. Poston, Jr.    Sherry L. McKibben
Department of Sociology
Texas A&M University

In the modeling of count variables there is sometimes a preponderance of zero counts. This article concerns the estimation of Poisson regression models (PRM) and negative binomial regression models (NBRM) to predict the average number of children ever born (CEB) to women in the U.S. The PRM and NBRM will often under-predict zeros because they do not consider zero counts of women who are not trying to have children. The fertility of U.S. white and Mexican-origin women shows that zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models perform better in many respects than the Poisson and negative binomial models. Zero-inflated Poisson and negative binomial regression models are statistically appropriate for the modeling of fertility in low fertility populations, especially when there is a preponderance of women in the society with no children.

Key words: Poisson regression, negative binomial regression, demography, fertility, zero counts

Introduction

When analyzing variation in the number of children that women have born to them, demographers frequently use Poisson and negative binomial regression models rather than ordinary least squares models. Poisson and negative binomial regression models are statistically more appropriate for predicting a woman's children ever born (CEB), particularly in societies where mean fertility is low (Poston, 2002). Most women in such populations have children at the lower parities, including zero parity, and few have children at the higher parities. The CEB variable, which by definition is a count variable, i.e., a nonnegative integer, is hence heavily skewed with a long right tail.

The statistical modeling of these kinds of CEB data is best based on approaches other than the ordinary least squares (OLS) linear regression model because using it to predict a count outcome, such as CEB, will often "result in inefficient, inconsistent, and biased estimates" (Long, 1997, p. 217) of the regression parameters. Poisson regression models (PRM) and negative binomial regression models (NBRM) have been shown to be statistically more appropriate (Poston, 2002).

However, sometimes there are so many zeros in the count dependent variable that both the PRM and the NBRM under-predict the number of observed zeros; the resulting regression models, therefore, often do not fit the data. Zero-inflated count regression models were introduced by Lambert (1992) and Greene (1994) for those situations when the PRM and the NBRM failed to account for the excess zeros and resulted in poor fit. This paper examines the use and application of zero-inflated count regression models to predict the number of children ever born to U.S. women.

Dudley L. Poston, Jr. is Professor of Sociology, and the George T. and Gladys H. Abell Professor of Liberal Arts. He is co-editing (with Michael Micklin) the Handbook of Population (Klewer Plenum, 2004). E-mail him at: [email protected]. Sherry L. McKibben is a Lecturer in the Department of Sociology. E-mail: sherrymc.[email protected].


Methodology

The most basic approach for predicting a count variable, such as CEB, is the Poisson regression model (PRM). In the PRM, the dependent variable, namely, the number of events, i.e., in the case of this paper, the number of children ever born (CEB), is a nonnegative integer and has a Poisson distribution with a conditional mean that depends on the characteristics (the independent variables) of the women (Long, 1997; Long & Freese, 2001). The PRM incorporates observed heterogeneity according to the following structural equation:

\mu_i = \exp(a + X_{1i}b_1 + X_{2i}b_2 + \cdots + X_{ki}b_k),

where µ_i is the expected number of children ever born for the ith woman; X_1i, X_2i, ..., X_ki are her characteristics; and a, b_1, b_2, ..., b_k are the Poisson regression coefficients.

The PRM is appropriate when the mean and the variance of the count distribution are similar, and is less applicable when the variance of the distribution exceeds the mean, that is, when there is over-dispersion in the count data. If there is significant over-dispersion in the distribution of the count, the estimates from the PRM will be consistent, but inefficient. "The standard errors in the Poisson regression model will be biased downward, resulting in spuriously large z-values and spuriously small p-values" (Long & Freese, 2001; Cameron & Trivedi, 1986), which could lead the investigator to make incorrect statistical inferences about the significance of the independent variables.

This is addressed by adding to the PRM "a parameter that allows the conditional variance of (the count outcome) to exceed the conditional mean" (Long, 1997, p. 230). This extension of the Poisson regression model is the negative binomial regression model (NBRM). The NBRM adds to the Poisson regression model the error term ε according to the following structural equation:

\mu_i = \exp(a + X_{1i}b_1 + X_{2i}b_2 + \cdots + X_{ki}b_k + \varepsilon_i).

However, sometimes there are many more zeros in the count dependent variable than are predicted by the PRM or NBRM, resulting in an overall poor fit of the model to the data. Zero-inflated models respond to this problem of excess zeros "by changing the mean structure to allow zeros to be generated by two distinct processes" (Long & Freese, 2001, p. 250).

Consider a few examples of excess zeros. Suppose one wishes to survey visitors to a national park to predict the number of fish they caught. Suppose that some of the visitors did not fish, but data were not available on who fished and who did not fish. The data gathered hence have a preponderance of zeros, some of which apply to persons who fished and caught no fish, and others to persons who did not fish (Stata, 2001; Cameron & Trivedi, 1998).

Or consider the problem of predicting the number of publications written by scientists. Some scientists will never publish either because they have chosen not to do so, or, perhaps, because they are not permitted to do so. But assume that there are no data telling which scientists have a zero probability of ever publishing. As with the example of the number of fish caught, there will be a preponderance of zeros among scientists with regards to the number of articles published. Some of the zeros will apply to scientists who tried to publish but were not successful and others to scientists who did not try to publish (Long & Freese, 2001; Long, 1990).

Finally, consider the example to be addressed in this paper, namely, the number of children born to women. Some women will choose not to have children and are referred to as voluntarily childless women. Other women will try to have children but will not be successful in their attempts and are referred to as involuntarily childless women (Poston, 1976; Poston & Kramer, 1983). But, assume that it is not directly known to which group each woman belongs. Thus among women of the childbearing ages of 15-49, there will be many zeros on the CEB dependent variable; some of the zeros will apply to women who tried to produce children but were not successful, and others to women who voluntarily opted against having children.
know into which of the two groups the respondents fall. If it was known into which Results group each subject was placed, one could subtract the persons belonging to the Always-0 Data are available for 1995 for U.S. (non- Group from the total sample, and estimate Hispanic) white and Mexican-origin women, Poisson or negative binomial regression models. gathered in Cycle 5 of the National Survey of But typically one does not have this kind of Family Growth (National Center for Health information, thus requiring the introduction of Statistics, 1995). The data are based on personal zero-inflated regression. interviews conducted in the homes of a national The estimation of zero-inflated sample of 10,847 females between the ages of regression models involves three steps: 1) 14 and 44 in the civilian, non-institutionalized predicting membership in the two latent groups, population in the United States. Table 1 reports Group A and Group ~A; 2) estimating the the descriptive data on children born (CEB) for number of counts for persons in Group ~A; and U.S. white and Mexican-origin women in 1995. 3) computing “the observed probabilities as a White women have a mean CEB of 1.2 mixture of the probabilities for the two groups” with a variance of 1.6. Mean CEB for Mexican- (Long & Freese, 2001, p. 251). origin women is 1.9 with a variance of 2.8. For To analyze the fertility of U.S. women, both white and Mexican-origin women, the one would follow these steps (for detail, see variance of CEB is greater than the mean of Long & Freese, 2001. p. 251-252; Cameron & CEB. There are several ways for determining if Trivedi, 1998, p. 125-127, 211-215). there is over-dispersion in the CEB data (see In Step 1, use a logistic regression Poston, 2002). It turns out that there is not a model to predict the woman’s membership in significant amount of over-dispersion in the Group A (never have children) or Group ~A CEB data for whites, justifying the use of a (may or may not have children). The Poisson regression model. There is a significant independent variables used in the logistic amount of over-dispersion in the CEB data for equation may be “referred to as inflation Mexican-origin women, so that a negative variables since they serve to inflate the number binomial regression model will be appropriate. of 0s” (Long & Freese, 2001, p. 251). Poisson Regression versus Zero-inflated Poisson In Step 2, for women in Group ~A (may Regression or may not have children), depending on A Poisson regression model is thus whether or not there is over-dispersion in the estimated for the white women that predicts their CEB dependent variable, use either a Poisson CEB with socioeconomic and location regression model or a negative binomial characteristics that have been shown in the regression model to predict the probabilities of demographic literature to be associated with counts 0 to y (where y is the maximum number fertility. The independent variables pertain to of children born to a woman). The independent education, rural residence, poverty status, age, variables used in Step 2 may or may not be the regional location, and religion. Some are same as those used in Step 1. In the examples measured as dummy variables and others as shown below, the same independent variables interval. are used in both steps. Using the same variables in both steps is not required. Different variables could be used in each step. In Step 3, the results from the preceding steps are used to determine the overall 374 ESTIMATING THE FERTILITY OF U. S. WOMEN

Table 1. Data for Children Ever Born: U.S. White and Mexican-Origin Women, Ages 15-49. ______

Group Mean Standard Dev. Variance No. of Cases ______

White     1.2471    1.2839    1.6486    6,456
Mexican   1.8864    1.6592    2.7531    924
______
Source of Data: National Center for Health Statistics (1995).
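The article does not state which package was used for estimation (it cites Stata, 2001), but the PRM and NBRM described above can be sketched in SAS with PROC GENMOD as follows. This is illustrative only; the data set name nsfg95 and the variable names are hypothetical placeholders for the NSFG Cycle 5 variables, and the size of the negative binomial dispersion parameter reported by GENMOD gives one informal check for over-dispersion:

/* Poisson regression model (PRM) for children ever born */
proc genmod data=nsfg95;
   model ceb = educ rural poverty age midwest south west prot cath norelig
         / dist=poisson link=log;
run;

/* Negative binomial regression model (NBRM); a dispersion estimate */
/* well above zero suggests over-dispersion relative to the Poisson */
proc genmod data=nsfg95;
   model ceb = educ rural poverty age midwest south west prot cath norelig
         / dist=negbin link=log;
run;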

They are the following: X1 is the woman's education measured in years of school completed; X2 is a dummy variable indicating whether the woman lives in a rural area; X3 is a dummy variable indicating whether the woman is classified as being in poverty (poverty status is based on whether the woman's family income is below the national poverty threshold, adjusted for family size).

Continuing, X4 is the woman's age measured in years; X5 to X7 are three dummy variables representing the woman's region of residence, namely, X5 residence in the Midwest, X6 residence in the South, and X7 residence in the West; residence in the Northeast is the reference category; and X8 to X10 are three dummy variables reflecting the woman's religion, as follows: X8 indicates if the woman's religion is Protestant, X9 if she is Catholic, and X10 if she has no religion, or religion is not specified; Jewish religion is the reference category. The first panel of Table 2 reports the results of the Poisson regression equation predicting CEB for U.S. white women in 1995.

According to the Poisson coefficients shown in the first panel of Table 2, four of the ten independent variables are significantly related with the CEB of white women. The higher the woman's education, the fewer her CEB; the older her age, the higher her CEB. If she is a rural resident or in poverty, she will have more children than urban residents or women not living in poverty. The geographic location and religion variables are not statistically significant.

Using the above Poisson regression results, the predicted probabilities of each white woman may be calculated for each count of CEB from 0 to 10. The mean of the predicted probabilities at each count may then be determined, using this formula (Long & Freese, 2001):

\widehat{\Pr}(y = m) = \frac{1}{N}\sum_{i=1}^{N}\Pr(y_i = m \mid x_i),

where y = m is the count of children ever born, and x_i are the above ten independent variables.
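A minimal sketch (not the authors' code) of how these mean predicted probabilities could be computed in SAS after fitting the PRM; the data set nsfg95 and variable names are the same hypothetical placeholders used earlier:

proc genmod data=nsfg95;                     /* hypothetical data set        */
   model ceb = educ rural poverty age midwest south west prot cath norelig
         / dist=poisson link=log;
   output out=pred p=mu;                     /* mu = exp(x'b) for each woman */
run;

data probs;
   set pred;
   array pm{0:10} pm0-pm10;
   do m = 0 to 10;
      pm{m} = pdf('poisson', m, mu);         /* Pr(y_i = m | x_i)            */
   end;
run;

proc means data=probs mean;                  /* mean predicted probability   */
   var pm0-pm10;                             /* at each count m = 0,...,10   */
run;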


Table 2. Poisson Regression Model, and Zero-inflated Poisson Regression Model, U.S. White (non-Hispanic) Women, 1995.

______

                          Poisson Model          Zero-inflated Poisson Model
                                                 Logit                Poisson
Independent Variable      b         z            b         z          b         z

Panel 1 Panel 2 Panel 3

X1 Education         -.070   -15.45     .905    12.83    -.058    11.84
X2 Rural Residence    .111     3.87    -.336    -1.55     .097     3.35
X3 Poverty Status     .377    10.48    -.482    -2.06     .336     8.96
X4 Age                .076    48.41    -.781   -14.79     .034    16.01
X5 Midwest            .045     1.37    -.714    -2.63     .007      .22
X6 South             -.023     -.07    -.289    -1.06    -.048    -1.39
X7 West               .019      .53    -.187    -0.63     .011      .31
X8 Protestant         .074      .80   -2.483    -3.02    -.030     -.31
X9 Catholic           .052      .56   -1.871    -2.28    -.035     -.37
X10 No Religion      -.110    -1.14   -2.020    -2.15    -.223    -2.23
Constant            -1.504   -11.98    8.780     8.12     .096      .14

Likelihood Ratio χ²    2857.59, P = 0.000              458.66, P = 0.000
______
Vuong Test of ZIP vs. Poisson: 20.16, P = 0.000
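A two-part ZIP fit like the one summarized in Table 2 can also be sketched in SAS with PROC GENMOD's ZEROMODEL statement (a feature of later SAS/STAT releases than the one cited above, so this is illustrative rather than the authors' procedure). The data set and variable names are the same hypothetical placeholders, and, as in the article, the same covariates appear in the count and inflation equations:

proc genmod data=nsfg95;
   /* count (Poisson) part of the zero-inflated model */
   model ceb = educ rural poverty age midwest south west prot cath norelig
         / dist=zip link=log;
   /* inflation (logit) part: predicts membership in the Always-0 group */
   zeromodel educ rural poverty age midwest south west prot cath norelig
         / link=logit;
run;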

Figure 1 is a plot of the mean Poisson predicted probabilities at each count of CEB (the green x symbols), and they may be compared with the observed empirical distribution of CEB (the blue circles). Just over 40 percent (proportion of .4046) of U.S. white women have no children ever born, but the Poisson regression results predict a mean probability at zero count of .361, which is an under-prediction of the observed CEB. The Poisson regression results over-predict the observed CEB data at count one, under-predict at count two, and are more consistent with the observed CEB data at the third and higher counts.

But, a central issue for this paper is the under-prediction by the Poisson regression model of the observed zero counts of CEB for white women. In such a situation, it would be appropriate to estimate a zero-inflated Poisson regression model. The 2nd and 3rd panels of Table 2 present the results of such a model. Recall from the previous section that the first two steps in estimating a zero-inflated model involve 1) using a logistic regression model to predict the woman's group membership in Group A (never have children) or Group ~A (may or may not have children), and 2) for women in Group ~A (may or may not have children), using a Poisson regression model to predict her number of children ever born. Thus there are two panels of zero-inflated Poisson results reported in Table 2. Panel 2, titled "Logit", gives the logit coefficients obtained in Step 1, and Panel 3, titled "Poisson", gives the Poisson coefficients obtained in Step 2.
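In the standard zero-inflated Poisson notation (this formula is not written out in the article, but it is the mixture implied by Steps 1-3 above), the logit and Poisson panels combine to give the overall probabilities as

\Pr(y_i = 0 \mid x_i) = \psi_i + (1-\psi_i)\,e^{-\mu_i}, \qquad
\Pr(y_i = m \mid x_i) = (1-\psi_i)\,\frac{e^{-\mu_i}\mu_i^{m}}{m!}, \quad m = 1, 2, \ldots,

where ψ_i is the probability of membership in the Always-0 group from the logit (inflation) equation and µ_i is the Poisson mean from the count equation.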


[Figure 1. Distributions of CEB, PRM, & ZIP, U.S. White Women. The plot compares the observed CEB distribution, the Poisson Regression Model (PRM) predicted probabilities, and the Zero-inflated Poisson (ZIP) predicted probabilities. Vertical axis: Proportion or Probability (0 to .4); horizontal axis: Number of Children ever Born (0 to 9).]

The coefficients in the "Logit" panel (Panel 2) of Table 2 are the logit coefficients predicting a woman's membership in Group A (never having children). The higher her education, the greater the likelihood of her not having children. If she is in poverty, she is likely to not have children. The older her age, the less likely she will not have children. If she lives in the Midwest, she will be less likely than women living in the Northeast to not have children. And if she is a Catholic, or a Protestant, or has no religion, she will be less likely than Jewish women to have no children. The rural, South, and West variables are not significant.

For the purpose of this paper, the more relevant coefficients are shown in the "Poisson" panel (Panel 3) of Table 2; these are the zero-inflated Poisson coefficients predicting the woman's CEB. The higher her education, the fewer children she will have. If she is a rural resident, or in poverty, she will have more children. The older her age, the more children she will have. If she has no religion, she will have fewer children than Jewish women. The other variables are not significant.

A relevant comparison is between the zero-inflated Poisson coefficients (Panel 3) and the Poisson coefficients (Panel 1). Note first that the Poisson coefficients (Panel 1) for most of the independent variables are slightly larger than those for the zero-inflated Poisson coefficients (Panel 3). However, the z-scores for many of the Poisson coefficients are quite a bit larger than the z-scores for the corresponding zero-inflated Poisson coefficients. Thus although the two sets of Poisson coefficients are not too different in magnitude, the standard errors for the zero-inflated coefficients will tend to be larger than they are for the Poisson coefficients.

Regarding issues of interpretation and statistical inference, the results of the two Poisson models allow the investigator to conclude that the effects on a woman's CEB of her education, rural residence, poverty status and age are all statistically significant. However, the zero-inflated Poisson results, but not the basic Poisson results, also allow the investigator to conclude that the "no religion" variable has a statistically significant negative effect on CEB. Women who report no religion have fewer children than women in the reference (Jewish religion) category. This inference would not have been made using the results of the Poisson model (Panel 1).

Is the zero-inflated Poisson regression model (ZIP) statistically preferred over the basic Poisson regression model (PRM)? There is a formal test statistic, the Vuong test (Vuong, 1989), that determines statistically whether the zero-inflated model is a significant improvement over the Poisson model (for details, see Long, 1997, p. 248; Long & Freese, 2001, pp. 261-262). The Vuong statistic is asymptotically normal; if its value is > 1.96, the ZIP model is preferred over the PRM. If Vuong < 1.96, the PRM is preferred. The Vuong test statistic is shown at the base of Table 2, Vuong = 20.16. This is clear evidence that the zero-inflated Poisson regression results are preferred over the Poisson regression results.

Another way to judge whether ZIP is preferred over PRM is to ascertain if the results from a ZIP regression improve the prediction of the mean probability at count zero. Recall from the discussion of Figure 1 (above) that the Poisson regression results under-predicted the mean probability at count zero. Figure 1 also contains mean predicted probabilities at each count that are based on the results of the zero-inflated Poisson model.

The ZIP predictions are shown in Figure 1 as maroon diamonds. The ZIP model predicts a probability at count zero of .3988, which is very close to the observed proportion of CEB at count zero of .4046. The expected probabilities from the ZIP model at counts 1, 2, and 3 are also closer to the corresponding observed CEB counts than are those predicted by the PRM. The ZIP results seem to do a much better job predicting the observed CEB at counts 0, 1, 2, and 3 than do the results from the PRM.

Negative Binomial Regression versus Zero-inflated Negative Binomial Regression

In the above example predicting CEB for U.S. white women, it was first determined that there was not a significant amount of over-dispersion in CEB, thus justifying modeling CEB with the PRM. But recall that for the U.S. Mexican-origin women, the variance of CEB was significantly greater than the mean of CEB. In such a case, the PRM is not appropriate. Instead, a negative binomial regression model (NBRM) is preferred.

The first panel of Table 3 reports the results of a negative binomial regression (NBRM). The same independent variables are used in this regression as were used in the regressions shown in Table 2, except that age is excluded and "no religion" is used as the reference religion category. Excluding age resulted in a better fit of the negative binomial model with the data. The "no religion" variable is removed from the equation and used as the reference category because the Jewish variable was removed altogether from the equation; only 2 of the 924 Mexican-origin women were Jewish, so there was insufficient variation in this variable. Thus, regression results are shown in Table 3 for eight independent variables.

The negative binomial regression coefficients in the first panel of Table 3 indicate that only two of the eight independent variables are significantly related to the CEB of Mexican-origin women. The higher the woman's education, the lower her fertility; and if she is living in poverty, she will have more children than women not in poverty. The other independent variables are not statistically significant.

There is a large number of zeros for the CEB of Mexican-origin women. Almost 27 percent of them have zero children. Although this is not quite as high as the level of zero parity among white women (40.4 percent of the white women have no children), a zero-inflated negative binomial regression model (ZINB) was estimated to see if model fit would be improved over that of the NBRM. Its results may be compared with those of the NBRM shown in the first panel of Table 3.

Recall that zero-inflated models produce two sets of coefficients (see the discussion above). Thus, the coefficients in the "Logit" panel (Panel 2) of Table 3 are the logit coefficients predicting a Mexican-origin woman's membership in Group A (never having children). The higher her education, the greater her likelihood of not having children. And if she is Catholic, she is less likely than women with no religion to have no children. The other independent variables are not significant.
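The Vuong comparisons reported at the bases of Tables 2 and 3 can be computed directly from the pointwise log-likelihoods of the two competing fits. The sketch below is illustrative only (it assumes the `prm` and `zip_mod` objects from the earlier statsmodels snippet); it is not the authors' code.

```python
import numpy as np
from scipy import stats

def vuong(zif_res, base_res):
    """Vuong (1989) non-nested test: zero-inflated model versus base model.

    Large positive values (> 1.96) favor the zero-inflated model
    (see Long, 1997; Long & Freese, 2001).
    """
    ll_zif = zif_res.model.loglikeobs(zif_res.params)    # pointwise log-likelihoods
    ll_base = base_res.model.loglikeobs(base_res.params)
    m = ll_zif - ll_base
    n = len(m)
    v = np.sqrt(n) * m.mean() / m.std(ddof=1)
    return v, stats.norm.sf(v)                            # one-sided p-value

# v, p = vuong(zip_mod, prm)
# print(f"Vuong = {v:.2f}, p = {p:.4g}")   # the article reports 20.16 for white women
```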


Table 3. Negative Binomial Regression Model, and Zero-inflated Negative Binomial Regression Model, U.S. Mexican-origin Women, 1995

                          Negative Binomial        Zero-inflated Negative Binomial Model
                          Model (Panel 1)       Logit (Panel 2)     Negative Binomial (Panel 3)
Independent Variable          b        z            b        z            b        z
X1 Education               -.076    -9.13         .120     3.04         -.058     7.35
X2 Rural Residence          .156     1.36         .150      .33          .177     1.57
X3 Poverty Status           .231     3.84        -.089     -.29          .213     3.51
X4 Midwest                  .351      .85        9.288      .04          .698     1.76
X5 South                    .442     1.09        8.922      .04          .697     1.82
X6 West                     .478     1.18        8.951      .04          .736     1.93
X7 Protestant               .202     1.72        -.491    -1.18          .084      .69
X8 Catholic                 .180     1.71        -.903    -2.41         -.004     -.03
Constant                    .694     1.63      -11.176     -.05          .589     1.46

Likelihood Ratio χ²      126.36, P = 0.000                    98.05, P = 0.000

Vuong Test of Zero-inflated Negative Binomial versus Negative Binomial: 70.67, P = 0.000

A comparison may be made between the zero-inflated negative binomial coefficients (Panel 3) and the negative binomial coefficients (Panel 1). Note first that the NBRM coefficients (Panel 1) for four of the independent variables are slightly larger than those for the ZINB coefficients (Panel 3). And the z-scores for four of the NBRM coefficients are larger than those for the corresponding ZINB coefficients. The results of the two models in Panels 1 and 3 allow the investigator to conclude that the effects on a woman's CEB of her education and poverty status are statistically significant. However, the zero-inflated negative binomial regression results, but not the negative binomial results, also allow the investigator to conclude that Mexican-origin women living in the West have more children than those living in the Northeast. The NBRM results did not allow this inference to be made.

One may compare the ZINB regression results with the NBRM results to determine if one is statistically preferred over the other. The Vuong test statistic provided at the base of Table 3 has a value of 70.67. Clearly the zero-inflated negative binomial regression results are preferred over the basic negative binomial regression results.

Conclusion

This article considered a situation that frequently occurs when modeling count variables, namely, that there is a preponderance of zero counts. The application addressed in this paper involved the estimation of Poisson regression models (PRM) and negative binomial regression models

(NBRM) to predict the average number of children ever born (CEB) to women in the U.S. This is a count variable, and in a low fertility society such as the U.S., it is skewed with a long right tail.

It was noted in this article that many U.S. women have no children, resulting in a very large percentage of zero counts. But two groups of women have no children: one group will have zero CEB because they have chosen to never have children; another group will have no children even though they are trying to do so. PRM and NBRM are best suited to predict CEB counts among women who are having, or trying to have, children. Thus these models end up under-predicting zero counts because, strictly speaking, they are not able to consider the zero counts of women who are not trying to have children. Zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models have been proposed to handle such situations.

Analyses conducted in this paper of the fertility of U.S. white and Mexican-origin women in 1995 demonstrated that the zero-inflated models performed better in many respects than the straightforward Poisson and negative binomial models. Not only were the coefficients in the ZIP and ZINB models different from those in the PRM and NBRM, it was also shown that errors of statistical inference, in terms of failing to include significant effects, would have been made had the investigator only relied on the results of the PRM and NBRM.

It would appear that the use of zero-inflated Poisson and negative binomial regression models is statistically appropriate for the modeling of fertility in low fertility populations. This is especially the case when there is a preponderance of women in the society with no children.

References

Cameron, A. C., & Trivedi, P. K. (1986). Econometric models based on count data: comparisons and applications of some estimators and tests. Journal of Applied Econometrics, 1, 29-53.
Cameron, A. C., & Trivedi, P. K. (1998). Regression analysis of count data. Cambridge, U.K.: Cambridge University Press.
Greene, W. H. (1994). Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. Stern School of Business, Department of Economics, Working Paper Number 94-10.
Lambert, D. (1992). Zero-inflated Poisson regression with an application to defects in manufacturing. Technometrics, 34, 1-14.
Long, J. S. (1990). The origins of sex differences in science. Social Forces, 68, 1297-1315.
Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, California: Sage Publications.
Long, J. S., & Freese, J. (2001). Regression models for categorical dependent variables using Stata. College Station, Texas: Stata Press.
National Center for Health Statistics (NCHS). (1995). National Survey of Family Growth, Cycle V. Hyattsville, Maryland: Department of Health and Human Services.
Poston, D. L., Jr. (1976). Characteristics of voluntarily and involuntarily childless wives. Social Biology, 23, 198-209.
Poston, D. L., Jr. (2002). The statistical modeling of the fertility of Chinese women. Journal of Applied Statistical Methods, 1(2), 387-396.
Poston, D. L., Jr., & Kramer, K. B. (1983). Voluntary and involuntary childlessness in the United States, 1955-1973. Social Biology, 30, 290-306.
StataCorp. (2001). Stata statistical software: Release 7.0. Vol. IV. College Station, Texas: Stata Corporation, 481.
Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307-333.


Variable Selection for Poisson Regression Model

Felix Famoye
Department of Mathematics
Central Michigan University

Daniel E. Rothe
Alpena Community College

Poisson regression is useful in modeling count data. In a study with many independent variables, it is desirable to reduce the number of variables while maintaining a model that is useful for prediction. This article presents a variable selection technique for Poisson regression models. The model considered is log-linear, but the methods could be adapted to other relationships. The model parameters are estimated by the method of maximum likelihood. The use of measures of goodness-of-fit to select appropriate variables is discussed. A forward selection algorithm is presented and illustrated on a numerical data set. This algorithm performs as well as, if not better than, the method of transformation proposed by Nordberg (1982).

Key words: Transformation, goodness-of-fit, forward selection, R-square

Felix Famoye, Department of Mathematics, Central Michigan University, Mt. Pleasant, MI, 48859. E-mail: [email protected]. Daniel E. Rothe, Alpena Community College, 666 Johnson Street, Alpena, MI, 49707. Email: [email protected]. The first author acknowledges the support received from Central Michigan University FRCE Committee under Grant #48136.

Introduction

Regression models using count data have a wide range of applications in engineering, medicine, and social sciences. Other forms of regression such as logistic regression are well established in various social science and medical fields. For example, in epidemiology, researchers study the relationship between the chance of occurrence of a disease and various suspected risk factors. However, when the outcomes are counts, Signorini (1991) and others point out that Poisson regression gives adequate results.

The social sciences often perform studies that involve count data. Sociology, psychology, demography, and economics all perform studies using the type of data that can make use of the Poisson regression model. Sociology applications involve situations where researchers wish to predict an individual's behavior based on a particular group of observed characteristics and experiences. D'Unger et al. (1998) examined categories of criminal careers using Poisson latent class regression models. They assert that Poisson regression models are appropriate for modeling delinquent behavior and criminal careers.

Gourieroux et al. (1984) and Cameron and Trivedi (1986) described the use of Poisson regression in economics applications such as the daily number of oil tankers' arrivals in a port, the number of accidents at work by factory, the number of purchases per period, the number of spells of unemployment, the number of strikes in a month, or the number of patents applied for and received by firms. Gourieroux et al. (1984) concluded that the use of the Poisson regression model is justified in a situation where the dependent variable consists of counts of the occurrence of an event during a fixed time period.

Christiansen and Morris (1997) listed applications of Poisson regression in a variety of fields. Poisson regression has been used in literary analysis of Shakespeare's works and the Federalist Papers (Efron & Thisted, 1976). Home run data has been analyzed using these types of methods (Albert, 1992). Poisson regression and count data in general are very important in a wide range of fields and thus deserve special attention. Often these models

380 FAMOYE & ROTHE 381

involve many independent variables. Hence there is a need to consider variable selection for the Poisson regression model.

Variable selection techniques are well known for linear regression. See for example Efroymson (1960). Beale (1970) summarizes the various familiar methods: forward, backward, stepwise, and several other methods. Krall et al. (1975) discussed a forward selection technique for exponential survival data. They used the likelihood ratio as the criterion for adding significant variables. Greenberg et al. (1974) discussed a backward selection and used a log likelihood ratio step-down procedure for elimination of variables. For other nonlinear regressions, and Poisson regression in particular, little is available in the literature.

Nordberg (1982) considered a certain data transformation in order to change the variable selection problem for a general linear model, including the Poisson regression model, into a variable selection problem in an ordinary unweighted linear regression model. Thus, ordinary linear regression variable selection software can be used. In this article, we provide the Poisson regression model and describe some goodness-of-fit statistics. These statistics will be used as selection criteria for the variable selection method. A variable selection algorithm is described. We present the results of a simulation study to compare the variable selection algorithm with the method suggested by Nordberg (1982). The algorithm is illustrated with a numerical example and it is compared with the method suggested by Nordberg. Finally, we give some concluding remarks.

Poisson Regression Model and Goodness-of-fit Measures

The Poisson regression model assumes the response variable y_i, which is a count, has a Poisson distribution given by

P(y_i; µ_i) = µ_i^{y_i} e^{−µ_i} / y_i!,  y_i = 0, 1, 2, ...,   (1)

where µ_i = µ_i(x_i) = exp(Σ_{j=0}^{k} β_j x_ij), the x_ij (j = 0, 1, ..., k, with x_i0 = 1) are independent variables, and the β_j (j = 0, 1, 2, ..., k) are the regression parameters. The mean and variance of y_i are equal, and this is given by

E(y_i | x_i) = V(y_i | x_i) = µ_i,  i = 1, 2, ..., n, and j = 0, 1, ..., k.   (2)

Throughout this article, the log-linear relationship µ_i = exp(Σ_{j=0}^{k} β_j x_ij) will be considered. However, the results can be modified to accommodate other types of relationships. Frome et al. (1973) described the use of the maximum likelihood (ML) method to estimate the unknown parameters for the Poisson regression model.

Several measures of goodness-of-fit for the Poisson regression model have been proposed in the literature. The Akaike information criterion (AIC) is a commonly used measure (Akaike, 1973). It is defined as

AIC = −log L + (k + 1),   (3)

where k + 1 is the number of estimated parameters and L is the likelihood function. The smaller the value of the AIC statistic, the better the fit of the model. The log likelihood could be used as a measure of goodness-of-fit. However, the AIC criterion also includes k as an adjustment for the number of independent variables, so that a model with many variables included is not necessarily better using this statistic.

Merkle and Zimmermann (1992) suggested some measures similar to the R² statistic for linear regression. They define

R²_D = [l(µ̂) − l(ȳ)] / [l(y) − l(ȳ)],   (4)

where

l(µ̂) = Σ_{i=1}^{n} (y_i log µ̂_i − µ̂_i − log y_i!),
l(ȳ) = Σ_{i=1}^{n} (y_i log ȳ − ȳ − log y_i!), and
l(y) = Σ_{i=1}^{n} (y_i log y_i − y_i − log y_i!).

The quantity R²_D measures the goodness-of-fit by relating the explained increase in the log-likelihood to the maximum increase possible. The interpretation is that higher R²_D indicates a better fit of the model. The numerator of R²_D is the deviance statistic. Cameron and Windmeijer (1996) analyzed R-squared measures for count data. They establish five criteria for judging various R² measures. Among all R² measures considered, only the R²_D defined by Merkle and Zimmermann (1992) satisfies all five criteria.

Selection Criteria Statistics

Variable selection procedures need criteria for adding significant variables. We propose two selection criteria statistics (SCS). The first SCS is the Akaike information criterion (AIC) defined earlier. The smaller the value of the AIC statistic, the better the fit of the model. The second SCS is a modification of R²_D suggested by Cameron and Windmeijer (1996) obtained by taking the number of parameters into account. We define R²_adj as

R²_adj = {Σ_{i=1}^{n} [y_i log(µ̂_i/ȳ) − (µ̂_i − ȳ)] / Σ_{i=1}^{n} y_i log(y_i/ȳ)} · (n − 1)/(n − k − 1),   (5)

where n is the sample size and k is the number of independent variables. Either of the selection criteria statistics in (3) and (5) can be used to determine which variable to add in the selection procedure. These variable selection criteria measures are adjusted to include the number of parameters. In this way, an additional variable being added to the model may not necessarily result in an improvement to the measure.

Selection Algorithm

The transformation suggested by Nordberg (1982) for the log-linear Poisson regression model takes the form

u_ij = x_ij √µ̂_i,  j = 0, 1, 2, ..., k, and i = 1, 2, ..., n,   (6)

z_i = Σ_{j=0}^{k} u_ij β̂_j + (y_i − µ̂_i)/√µ̂_i,   (7)

where the µ̂_i are the estimates of the predicted values from the full Poisson regression model. The variable selection procedure is as follows. Compute the ML estimate of β in the full Poisson regression model. Transform the data using (6) and (7). Perform variable selection on the linear model with z_i as the dependent variable and the u_ij as the independent variables. Identify the subset of the u_ij variables that is selected and choose the corresponding x_ij variables. This gives the Poisson regression sub-model. Compute the maximum likelihood estimate for the Poisson regression on the chosen x_ij variables. This gives the final result of variable selection through transformation.

Nordberg (1982) indicated that the success of this technique depends on the accuracy of the approximation of the log-likelihood function given by

log L(β) ≈ log L(β̂) − [Q(β) − Q(β̂)]/2,   (8)

where Q(β) is given by

Q(β) = Σ_{i=1}^{n} (z_i − Σ_{j=0}^{k} β_j u_ij)².
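The transformation in (6) and (7) is simple to compute once the full Poisson model has been fitted. The sketch below is a minimal illustration with statsmodels and NumPy, assuming X is the full design matrix (including the constant column); it is not the authors' code.

```python
import numpy as np
import statsmodels.api as sm

def nordberg_transform(y, X, full_fit):
    """Nordberg (1982) transformation: turn Poisson variable selection into
    an ordinary (unweighted) linear-regression selection problem.

    y: counts; X: full design matrix (constant included);
    full_fit: fitted full Poisson regression model.
    """
    Xa = np.asarray(X, dtype=float)
    mu_hat = np.asarray(full_fit.predict(X))       # fitted means from the full model
    root = np.sqrt(mu_hat)
    U = Xa * root[:, None]                         # u_ij = x_ij * sqrt(mu_hat_i), eq. (6)
    z = U @ full_fit.params + (np.asarray(y) - mu_hat) / root   # eq. (7)
    return z, U

# Usage sketch: any linear-regression selection routine can now be applied
# with z as the response and the columns of U as candidate regressors.
# z, U = nordberg_transform(y, X, sm.Poisson(y, X).fit())
# ols_full = sm.OLS(z, U).fit()
```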

The error in (8) is given by

E = (1/6) Σ_{i=1}^{n} (1/√µ̂_i) [Σ_{j=0}^{k} (β_j − β̂_j) u_ij]³.   (9)

Nordberg (1982) concludes that the approximation is adequate even when 30% of the µ̂_i are less than or equal to 4. However, it is not clear what would happen in a case where, say, 70% of the µ̂_i are less than or equal to 4. We note here that Nordberg did not run simulations on such cases.

Forward Selection Algorithm

The forward selection program begins by finding all possible regression models with one variable. The one with the best selection criteria statistic is chosen as the best one-variable model. Once the best one-variable model has been chosen, all models with the first variable and one additional variable are calculated and the one with the best selection criteria statistic is chosen. In this way, a two-variable model is chosen. The process continues to add variables until the asymptotically normal Wald type "t"-value for an added variable is not significant. The process then stops and returns the previous acceptable model.

The selection criteria statistics (SCS) and a test of significance of each variable are used to determine which variable to enter. The following is the algorithm:

[Initialize: k = number of independent variables, α = significance level]
1. ν ← 1
2. Fit k Poisson regression models with the intercept and ν independent variable.
3. Select the model with the optimal SCS. Let xi be the independent variable chosen and βi be its parameter.
4. If the asymptotically normal Wald type "t"-value associated with βi is significant at level α,
   • Retain the Poisson regression model with independent variable xi and go to 5, else
   • Return "No variables are significant" and Stop.
5. Do while (k ≥ 2)
   • ν ← ν + 1
   • k ← k – 1
   • Fit k Poisson regression models, each with the intercept and ν independent variables. [The model includes all previously selected xi's and one new xj, j = 1, 2, 3, ..., k.]
   • Select the model with the optimal SCS. Let xnew be the independent variable added and βnew be its parameter.
   • If the asymptotically normal Wald type "t"-value for βnew is not significant at level α,
     o ν ← ν – 1
     o go to 6, else
     o add xnew to the Poisson regression model
     o Continue
6. The forward selection selects ν independent variables. Deliver the parameter estimates, t-values, and goodness-of-fit statistics for the selected model.

Simulation Study

In order to compare the proposed method with the method proposed by Nordberg (1982), we conduct a simulation study. The Poisson regression model in (1) is generated and both methods were used for variable selection.

We generated a set of x–data consisting of n (n = 100, 250, 500, and 1000) observations on eight explanatory variables x_ij, i = 1, 2, ..., n and j = 0, 1, 2, ..., 7, where x_i0 = 1 (a constant term). The variables x_i1, x_i2, ..., x_i7 were generated as uncorrelated standard normal variates. All simulations were done using computer programs written in Fortran, and the Institute of Mathematical Statistics Library (IMSL) was used.

The parameter vector β = (β0, β1, β2, ..., β7) used in the simulation study is chosen in such a way that β5 = β6 = β7 = 0, while β0, β1, β2, β3, and β4 are non-zero. For all simulations, we chose β1 = β2 = β3 = β4 = 0.2 and six different values of β0. We consider the six values β0 = –1.0, –0.5, 1.5, 1.7, 2.0, and 3.0. These values were chosen so that certain percentages of fitted values µ̂_i will be less than or equal to 4.0. When β0 = –1.0 or –0.5, all fitted values µ̂_i from the Poisson regression model are less than or equal to 4.0. For β0 = 1.5, about 40% of the fitted values µ̂_i are less than or equal to 4.0. When β0 = 1.7, about 20% of the fitted values µ̂_i are less than or equal to 4.0, and for β0 = 3.0, almost all fitted values µ̂_i exceed 4.0.

Using the β-vector and x_i0, x_i1, x_i2, ..., x_i7 as explanatory variables, the observations y_i, i = 1, 2, ..., n, were generated from the Poisson regression model in (1). Thus, the y–variates are Poisson distributed with mean µ_i = exp(Σ_{j=0}^{7} β_j x_ij).

The Nordberg method is used to perform variable selection on each set of data generated. The forward selection algorithm developed in this article is also used for variable selection. The result from using the AIC selection criterion is presented in this article. The result from using the R²_adj selection criterion is the same as that of using AIC, and hence the result is not given.

Each simulation was repeated 1000 times by generating new y–variates, keeping the x–data and β1, β2, ..., β7 constant. Since the parameters β5 = β6 = β7 = 0, we expect x5, x6, and x7 not to enter into the selected model. Whenever any or all of these three variables enter a selected model, it is considered an error. The error rate from the 1000 simulations was recorded in Table 1 for both selection methods.

In each simulation, the percentage of fitted values µ̂_i less than or equal to 4 is recorded. These percentage values are averaged over the 1000 simulations and the results are presented in Table 1. From Table 1, we notice some differences between the error rates from the forward selection method and the transformation method proposed by Nordberg. In general, the error rates from the forward selection method are smaller than the error rates from the Nordberg method. The error rates are much larger when the sample size is small, say n = 100 or n = 250. As the sample size increases to n = 500 or n = 1000, the two methods are closer in performance. However, the forward selection method seems to have a slight advantage over the Nordberg method. When the percentage of the fitted values µ̂_i less than or equal to 4.0 is high, the error rates from the Nordberg method seem to be high, especially when the sample size n is small.

From the simulation study, the difference between the two selection methods is not only due to whether the percentage of fitted values µ̂_i less than or equal to 4.0 is high; it also depends on the sample size n. For small sample size, the Nordberg method tends to select variables x5, x6, and/or x7 more often than the forward selection algorithm presented earlier. As the sample size increases to 1000, the Nordberg method tends to perform as well as the forward selection algorithm.
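The forward selection steps above translate directly into code. The following is a minimal sketch (not the authors' Fortran program) using statsmodels, with AIC as the selection criterion statistic and the Wald "t"-value as the stopping rule; the example data mimic, but do not reproduce, the simulation design described above.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def forward_select_poisson(y, X, alpha=0.05):
    """Forward selection for Poisson regression.

    X: (n, k) array of candidate regressors WITHOUT the constant column.
    Uses the conventional statsmodels AIC, which ranks models the same way
    as equation (3). Returns selected column indices in the order added.
    """
    n, k = X.shape
    remaining = list(range(k))
    selected = []
    crit = stats.norm.ppf(1 - alpha / 2)
    while remaining:
        # Fit one model per remaining candidate and keep the best AIC.
        fits = []
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            fits.append((sm.Poisson(y, design).fit(disp=0), j))
        best_fit, best_j = min(fits, key=lambda fj: fj[0].aic)

        # Wald "t"-value of the newly added variable (last coefficient).
        if abs(best_fit.tvalues[-1]) < crit:
            break                      # new variable not significant: stop
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Example in the spirit of the simulation study: 7 standard-normal regressors,
# the first four with coefficient 0.2, the rest zero, intercept 1.5.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 7))
y = rng.poisson(np.exp(1.5 + 0.2 * X[:, :4].sum(axis=1)))
print(forward_select_poisson(y, X))    # ideally selects only columns 0-3
```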

FAMOYE & ROTHE 385

Table 1. Error Rates For Nordberg And Forward Selection Algorithms.

   n      β0     Nordberg Method   Forward Selection   Percentage of µ̂_i ≤ 4.0
  100    –1.0        0.188              0.166               100.0
         –0.5        0.189              0.170               100.0
          1.5        0.164              0.140                42.3
          1.7        0.159              0.129                24.1
          2.0        0.145              0.129                 6.9
          3.0        0.147              0.130                 0.0
  250    –1.0        0.191              0.168               100.0
         –0.5        0.171              0.155               100.0
          1.5        0.155              0.136                43.0
          1.7        0.155              0.142                24.0
          2.0        0.155              0.143                 7.8
          3.0        0.152              0.142                 0.4
  500    –1.0        0.159              0.151               100.0
         –0.5        0.147              0.139               100.0
          1.5        0.133              0.136                41.6
          1.7        0.139              0.139                23.4
          2.0        0.149              0.147                 6.9
          3.0        0.143              0.138                 0.2
 1000    –1.0        0.144              0.144               100.0
         –0.5        0.154              0.146               100.0
          1.5        0.162              0.160                38.9
          1.7        0.153              0.147                20.5
          2.0        0.159              0.156                 5.7
          3.0        0.148              0.145                 0.1

Numerical Example

We applied the forward selection algorithm and the transformation method suggested by Nordberg (1982) to several data sets. The forward selection algorithm was implemented using AIC and R²_adj as selection criteria statistics. When the percentage of the µ̂_i less than or equal to 4 satisfied the cases considered by Nordberg (1982), both methods yielded the same sub-model. However, when the data has a much larger percentage of µ̂_i less than or equal to 4, we tend to obtain different results. We now present the results of a data set.

Wang and Famoye (1997) modeled fertility data using Poisson and generalized Poisson regression models. The data was from the Michigan Panel Study of Income Dynamics (PSID), a large national longitudinal data set. The particular portion of the data used in this paper was from 1989 and consisted of data from 2936 married women who were not head of households and with nonnegative total family income. The dependent variable was the number of children. Of the families, 1029 (35.05%) had no children under age 17. The response variable had a mean of 1.29 and a variance of 1.50. The predicted values under the full Poisson regression model were small, with 54.26% less than or equal to 1. Thus the data set was much more extreme than any of the cases considered by Nordberg (1982).

The Poisson regression model was fitted to the data using 12 covariates. The results are presented in Table 2. The forward selection algorithm was run on the data and the variables chosen are x9, x1, x4, x5, x2, and x10. The variables chosen are exactly the same variables that are significant in the full model. The transformation method proposed by Nordberg (1982) was applied to the data. The variables selected were

Table 2. Poisson Regression Model.

                      Full Model                      Forward Selection Sub-model
Parameter        Estimate ± s.e.     t-value      Estimate ± s.e.     t-value    Step added
Intercept        2.0686 ± 0.1511     13.69*       2.1226 ± .0744      28.53*        --
x1              –0.2657 ± 0.0356     –7.46*      –0.2674 ± .0351      –7.61*         2
x2              –0.0193 ± 0.0041     –4.71*      –0.0196 ± .0041      –4.82*         5
x3              –0.1226 ± 0.0651     –1.88           --                 --           --
x4              –0.2811 ± 0.0379     –7.42*      –0.2629 ± .0368      –7.15*         3
x5               0.3057 ± 0.0575      5.32*       0.3002 ± .0567       5.29*         4
x6              –0.0050 ± 0.0087     –0.57           --                 --           --
x7               0.0035 ± 0.0071      0.49           --                 --           --
x8              –0.0143 ± 0.0187     –0.76           --                 --           --
x9              –0.0211 ± 0.0038     –5.55*      –0.0217 ± .0038      –5.76*         1
x10             –0.0147 ± 0.0066     –2.23*      –0.0132 ± .0059      –2.25*         6
x11              0.0118 ± 0.0078      1.51           --                 --           --
x12             –0.0545 ± 0.0340     –1.60           --                 --           --

*Significant at 5% level.

x6, x7, x8, x11, x9, and x10. These are not the same variables chosen by the forward selection procedure. Only two of the variables are chosen by both methods. The results from the transformation method are shown in Table 3. The parameters corresponding to x6, x7, x8, and x11 are not significant in the full Poisson regression model (see Table 2), causing concerns about the accuracy of the transformation method.

Goodness-of-fit statistics for the models are provided in Table 4. The goodness-of-fit statistics for the forward selection sub-model are close to those for the full model even though the number of independent variables is now six. This is not the case for the transformation sub-model. All these results are in support of the simulation study reported earlier.
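The quantities reported in Table 4 can be computed for any fitted Poisson model. The helper below is a sketch, not the authors' code; it uses the log-likelihood terms defined with equation (4) and the adjustment factor of equation (5) as reconstructed above.

```python
import numpy as np
from scipy.special import gammaln

def poisson_gof(y, mu_hat, k):
    """Deviance, log-likelihood, AIC as in equation (3), R2_D of equation (4),
    and the adjusted version of equation (5) for a fitted Poisson model.
    y: observed counts; mu_hat: fitted means; k: number of regressors
    (excluding the intercept)."""
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    n = len(y)
    ybar = y.mean()
    logfact = gammaln(y + 1)                                  # log(y!)
    ll_mu   = np.sum(y * np.log(mu_hat) - mu_hat - logfact)   # l(mu_hat)
    ll_ybar = np.sum(y * np.log(ybar) - ybar - logfact)       # l(ybar)
    ylogy   = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0)), 0.0)
    ll_y    = np.sum(ylogy - y - logfact)                     # l(y)

    deviance = 2 * (ll_y - ll_mu)
    aic = -ll_mu + (k + 1)                                    # equation (3)
    r2_d = (ll_mu - ll_ybar) / (ll_y - ll_ybar)               # equation (4)
    r2_adj = r2_d * (n - 1) / (n - k - 1)                     # equation (5), as reconstructed
    return {"deviance": deviance, "loglik": ll_mu, "AIC": aic,
            "R2_D": r2_d, "R2_adj": r2_adj}

# e.g. poisson_gof(y, full_fit.predict(), k=12) for the full 12-covariate model.
```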

FAMOYE & ROTHE 387

Table 3. Nordberg's Transformation Method (Transformation Sub-model).

Parameter       Estimate ± s.e.     t-value    Step added
Intercept       1.8062 ± .1484      12.17*        --
x6             –0.0140 ± .0086      –1.63          1
x7              0.0075 ± .0070      –1.07          2
x8             –0.0063 ± .0186      –0.34          3
x9             –0.0376 ± .0017     –22.12*         5
x10            –0.0027 ± .0008      –3.04*         6
x11             0.0124 ± .0077       1.61          4

*Significant at 5% level.

Table 4. Goodness-of-fit For The Poisson Model.

Statistic         Full Model    Forward Selection    Nordberg
Deviance            3277.84          3286.14          3430.81
d.f.                 2923.0           2929.0           2929.0
Log-likelihood     –2410.09         –2414.24         –2486.58
AIC                 2423.09          2421.24          2493.58
R²_adj               0.2014           0.1994           0.1639

Conclusion

The size of the predicted values affected the usefulness of the transformation method. In the data set, the predicted values are relatively small (54.3% less than or equal to 1). Since the approximation error E in (9) for the transformation method involves division by the square root of the predicted value, one should be concerned when many predicted values are small. Dividing by small values may cause this error term to become large and make the approximation inaccurate. Although many other data sets analyzed indicate that the transformation method can be useful when the predicted values are large, it may run into problems when predicted values are small. Real world data may not necessarily have large predicted values. It would be ideal to have an algorithm that is not dependent on the size of the predicted values. The forward selection method presented performed well regardless of the size of the predicted values.

The forward selection algorithm may take much more computer time than the transformation method proposed by Nordberg (1982). In these days of better computer technology, more computer time should not be a reason for using a method that may not always produce an adequate result. From our simulation study, the forward selection algorithm performs as well as, if not better than, the transformation method.

In this article, a forward selection algorithm was developed. Similar methods could be developed using backward or stepwise selection for the class of generalized linear models. In addition, other selection criteria statistics could be used. Count data occur very frequently in real world applications. The size of the predicted values cannot be controlled within a particular study. Thus a selection method that can deal with any size of predicted values is desirable.

References

Akaike, H. (1973). Information theory and an extension of the likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Proceedings of the Second International Symposium of Information Theory. Budapest: Akademiai Kiado.
Albert, J. (1992). A Bayesian analysis of a Poisson random-effects model for home run hitters. The American Statistician, 46, 246-253.
Beale, E. M. L. (1970). Note on procedures for variable selection in multiple regression. Technometrics, 12, 909-914.
Cameron, A. C., & Trivedi, P. (1986). Econometric models based on count data: comparisons and applications of some estimators and tests. Journal of Applied Econometrics, 1, 29-53.
Cameron, A. C., & Windmeijer, F. A. G. (1996). R-squared measures for count data regression models with applications to health-care utilization. Journal of Business and Economic Statistics, 14, 209-220.
Christiansen, C. L., & Morris, C. N. (1997). Hierarchical Poisson regression modeling. Journal of the American Statistical Association, 92, 618-632.
D'Unger, D. V., Land, K. C., McCall, P. L., & Nagin, D. S. (1998). How many latent classes of delinquent/criminal careers? Results from mixed Poisson regression analyses. American Journal of Sociology, 103, 1593-1630.
Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: how many words did Shakespeare know? Biometrika, 63, 435-447.
Efroymson, M. A. (1960). Multiple regression analysis. In A. Ralston & H. S. Wilf (Eds.), Mathematical Methods for Digital Computers. New York: John Wiley & Sons, Inc.
Frome, E. L., Kutner, M. H., & Beauchamp, J. J. (1973). Regression analysis of Poisson-distributed data. Journal of the American Statistical Association, 68, 935-940.
Gourieroux, C., Monfort, A., & Trognon, A. (1984). Pseudo maximum likelihood methods: applications to Poisson models. Econometrica, 52, 701-720.
Greenberg, R. A., Bayard, S., & Byar, D. (1974). Selecting concomitant variables using a likelihood ratio step-down procedure and a method of testing goodness-of-fit in an exponential survival model. Biometrics, 30, 601-608.
Institute for Social Research, The University of Michigan. (1992). A Panel Study of Income Dynamics. Ann Arbor, MI.
Krall, J. M., Uthoff, V. A., & Harley, J. B. (1975). A step-up procedure for selecting variables associated with survival. Biometrics, 31, 49-57.
Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. The Canadian Journal of Statistics, 15, 209-225.
Merkle, L., & Zimmermann, K. F. (1992). The demographics of labor turnover: a comparison of ordinal probit and censored count data models. Recherches Economiques de Louvain, 58, 283-307.
Nordberg, L. (1982). On variable selection in generalized linear and related regression models. Communications in Statistics, Theory and Methods, 11, 2427-2449.
Signorini, D. F. (1991). Sample size for Poisson regression. Biometrika, 78, 446-450.
Wang, W., & Famoye, F. (1997). Modeling household fertility decisions with generalized Poisson regression. Journal of Population Economics, 10, 273-283.


Test Of Homogeneity For Umbrella Alternatives In Dose-Response Relationship For Poisson Variables

Chengjie Xiong
Division of Biostatistics
Washington University in St. Louis

Yan Yan
Division of Biostatistics, Department of Surgery
Washington University in St. Louis

Ming Ji
Graduate School of Public Health
San Diego State University

This article concerns the testing and estimation of a dose-response effect in medical studies. We study the statistical test of homogeneity against umbrella alternatives in a sequence of Poisson distributions associated with an ordered dose variable. We propose a test similar to Cochran-Armitage’s trend test and study the asymptotic null distribution and the power of the test. We also propose an estimator to the vertex point when the umbrella pattern is confirmed and study the performance of the estimator. A real data set pertaining to the number of visible revertant colonies associated with different doses of test agents in an in vitro mutagenicity assay is used to demonstrate the test and estimation process.

Key words: C(α) statistic, maximum likelihood estimate, monotone trend test, Poisson distribution, vertex point

Address correspondence to Chengjie Xiong, Division of Biostatistics, Campus Box 8067, Washington University in St. Louis, St. Louis, MO, 63110. Telephone: 314.362.3635. Email: [email protected].

Introduction

Medical studies often evaluate treatment effects at several doses of a test drug. One usually assumes a priori, based either on past experience with the test drug or on theoretical considerations, that if there is an effect on a parameter of interest, the response is likely monotonic with dose, i.e., the effect of the drug is expected to increase or decrease monotonically with increasing dose levels. Comparing several doses with a placebo in a clinical dose study is then typically performed by one-sided many-to-one comparisons or trend tests assuming an order restriction. Monotonicity of the dose-response relationship, however, is far from universal.

Instances may be found where a reversal or downturn at higher doses is likely to occur. For example, many therapies for humans become counterproductive at high doses. Similarly, in many in vitro mutagenicity assays, experimental organisms may succumb to toxic effects at high doses of the test agents, thereby reducing the number of organisms at risk of mutation and causing a downturn in the dose-response curve (Collings et al., 1981; Margolin et al., 1981). These types of non-monotonic dose-response behavior may not be caused by a random effect, but may occur due to an underlying biological mechanism. Mechanistic arguments for non-monotonic dose-response shapes can be found in many medical areas, such as toxicology (Calabrese & Baldwin, 1998), carcinogenesis (Portier & Ye, 1998), and epidemiology (Thorogood et al., 1993).

One of the simplest non-monotonic dose-responses is the so-called umbrella pattern, in which the response increases (decreases) until a certain dose level (usually unknown) and then decreases (increases). Ames, McCann and Yamasaki (1975) reported experimental data exhibiting this pattern from three replicate Ames tests in which plates containing Salmonella bacteria of strain TA98 were exposed to various doses of Acid Red 114. The number of visible revertant colonies on each plate was observed. Figure 1 is a scatter plot of the number of visible revertant colonies against dose level, which

clearly indicates an umbrella pattern peaked between the third dose and the fourth dose. This same phenomenon is also observed and discussed by Simpson and Margolin (1986).

When the dose-response curve contains an umbrella pattern, the usual statistical trend tests become inadequate because of their power loss and inherent, and possibly erroneous, decisions (Collings et al., 1981; Bretz & Hothorn, 2001). The statistical test of homogeneity in response against an umbrella alternative has been studied by many authors. Most of these discussions deal with a continuous response variable and assume normality for the associated distributions. The typical approaches under the assumption of normality are based on the framework of one-way analysis of variance and the simultaneous confidence intervals for umbrella contrasts of mean parameters (Bretz & Hothorn, 2001; Rom et al., 1994; Shi, 1988; Marcus & Genizi, 1994; Hayter & Liu, 1999). Nonparametric approaches have also been considered by several authors (Lim & Wolfe, 1997; Mack & Wolfe, 1981, 1982; Simpson & Margolin, 1994).

When data are based on counts such as those reported by Ames, McCann and Yamasaki (1975), however, a more reasonable distributional assumption might be the Poisson distribution. The statistical test of homogeneity against umbrella alternatives in a sequence of Poisson distributions associated with an ordered dose variable has not been addressed in the biostatistics literature, to the best of our knowledge. This article studies this problem using an approach based on so-called C(α) statistics as proposed and studied by Neyman (1959) and Bailey (1956). The C(α) statistics are also discussed in more detail by Moran (1970) and by Cox and Hinkley (1974) under the more general category of score statistics.

We propose a test similar to the Cochran-Armitage trend test and study the asymptotic null distribution and the power of our test. We also propose an estimator of the vertex point when the umbrella pattern is confirmed and study the performance of the estimator. A real data set reported by Ames, McCann and Yamasaki (1975), pertaining to the number of visible revertant colonies associated with different doses of test agents in an in vitro mutagenicity assay, is used to demonstrate the test and estimation process. We also present results of a simulation study about the proposed test and estimation.

Methodology

We consider an experiment in which independent random samples are taken from k distinct dose levels. Suppose that the k dose levels are meaningfully ordered. Let d_1, d_2, ..., d_k be the scores associated with these dose levels and d_1 ≤ d_2 ≤ ... ≤ d_k. We assume that at dose level i, the response follows a Poisson distribution with mean µ_i, i = 1, 2, ..., k. Let n_i be the sample size associated with dose level i and n = Σ_{i=1}^{k} n_i. Let x_i be the total response in the i-th dose level. For each i and p, 1 ≤ i, p ≤ k, let d_i^p = (d_i − d_p)² and d̄^p = Σ_{i=1}^{k} n_i d_i^p / n. Suppose that the relationship between the mean response and the score takes the form

µ_i = H[α + β(d_i − d_p)²],

where H is a monotonic function that is twice differentiable and d_p is the dose level associated with the vertex of the umbrella pattern. Notice that when p = 1 or k, this formulation reduces to the monotone trend. We consider the problem of testing H_0: β = 0 against the alternative hypothesis H_a: β ≠ 0. The likelihood function as a function of α, β, and p is

L(α, β, p) ∝ Π_{i=1}^{k} exp{−n_i H[α + β(d_i − d_p)²]} {H[α + β(d_i − d_p)²]}^{x_i}.

p Is Known

When p is given, the test is the same as the trend test based on the redefined dose scores d_i^p, i = 1, 2, ..., k. The test is based on the C(α) statistic (Moran, 1970) and is obtained by evaluating the derivative of the loglikelihood with respect to β at the maximum likelihood estimate of α under H_0:

C(α) = ∂log L / ∂β |_{α̂, β=0} = [H′(α̂) / H(α̂)] (Σ_{i=1}^{k} x_i d_i^p − x̂ Σ_{i=1}^{k} n_i d_i^p),

where x̂ = Σ_{i=1}^{k} x_i / n and α̂ = H^{−1}(x̂). The test statistic for testing H_0: β = 0 against the alternative hypothesis H_a: β ≠ 0 is obtained after dividing C(α) by its asymptotic standard deviation under H_0, computed from the information matrix of (α, β) (Tarone, 1982):

X_p² = [Σ_{i=1}^{k} x_i d_i^p − x̂ Σ_{i=1}^{k} n_i d_i^p]² / [x̂ Σ_{i=1}^{k} n_i (d_i^p − d̄^p)²].   (1)

Notice that this test statistic does not depend on the choice of the function H. Under the null hypothesis, X_p² has an asymptotic Chi-square distribution with one degree of freedom. Notice also that this test statistic is identical in formula to the test statistic for testing monotone trend with the redefined score in binomial proportions proposed by Armitage (1955). In addition, Tarone (1982) showed that, like the binomial trend test (Tarone & Gart, 1980), this Poisson trend test is asymptotically locally optimal against any choice of smooth monotone function H that satisfies µ_i = H[α + β(d_i − d_p)²], i = 1, 2, ..., k.

p Is Unknown

When p is unknown and H_0: β = 0 is tested against the alternative hypothesis H_a: β ≠ 0, we propose to reject H_0: β = 0 when X² = max_{1≤p≤k} X_p² is large. Let λ_i = lim_{n→∞} n_i / n and assume that 0 < λ_i < 1 for 1 ≤ i ≤ k. For 1 ≤ p ≤ k, let d̄^p = Σ_{i=1}^{k} λ_i d_i^p and µ̄ = Σ_{i=1}^{k} λ_i µ_i. Let ∆ be the k by k matrix whose (i, p) entry is given by

∆_i^p = (d_i^p − d̄^p) / √[µ̄ Σ_{i=1}^{k} λ_i (d_i^p − d̄^p)²].

Let A = (a_ij) be the k by k matrix such that a_ij = 0 if i ≠ j and a_ij = µ_i λ_i if i = j. The following theorem gives the limiting distribution of the proposed test when the null hypothesis is true.

Theorem 1: If H_0 is true, then for any x > 0,

lim_{n→∞} P(X² ≥ x²) = 1 − ∫_{−x}^{x} ⋯ ∫_{−x}^{x} [1 / √((2π)^k |∆′A∆|)] exp[−X′(∆′A∆)^{−1}X / 2] dx_1 dx_2 ⋯ dx_k,   (2)

where X = (x_1, x_2, ..., x_k)′ and | | denotes the matrix determinant.

The proof of Theorem 1 can be found in the Appendix. Notice that the asymptotic null distribution does not depend on the unknown common mean µ_1 = µ_2 = ... = µ_k, as the common mean µ is cancelled out in the integrand. Therefore, µ = 1 can always be assumed for the computation. The evaluation of the integration can be done by the iterative algorithm proposed by Genz (1992). This algorithm begins with a Cholesky decomposition of the covariance matrix and then uses a simple Monte-Carlo algorithm. Another possible way of evaluating the distribution of the test statistic under the null hypothesis is through a large simulation of the test statistic. We point out that the asymptotic null distribution does depend on the unknown proportions λ_i, i = 1, 2, ..., k; n_i/n can be used for λ_i in the computation, based on the consistency of n_i/n for λ_i. In addition, according to Šidák (1967), under H_0,

Pr(X² ≤ x²) ≥ [2Φ(x) − 1]^k,

where Φ is the distribution function of the standard normal distribution. Therefore, under H_0, lim_{n→∞} Pr(X² ≥ x²) ≤ 1 − [2Φ(x) − 1]^k, which then provides a conservative test of H_0 against H_a.

Estimation of the Vertex Point

If the alternative hypothesis is true, the problem of interest is then the estimation of the true vertex point. To avoid the problem of parameter identification, we assume that the umbrella pattern satisfies

µ_1 ≤ µ_2 ≤ ... ≤ µ_{l−1} < µ_l > µ_{l+1} ≥ ... ≥ µ_k

or

µ_1 ≥ µ_2 ≥ ... ≥ µ_{l−1} > µ_l < µ_{l+1} ≤ ... ≤ µ_k,

i.e., we only consider the case where a single vertex point l exists. Notice that this formulation does not rule out the possibility that the vertex point is on the boundary of the dose interval if a monotone trend is the alternative hypothesis. We propose to estimate l by l̂ such that X²_{l̂} = max_{1≤p≤k} X_p², where X_p² is given by (1). Notice that as n → ∞, for any 1 ≤ p ≤ k,

lim_{n→∞} X_p² / n = [Σ_{i=1}^{k} λ_i µ_i d_i^p − (Σ_{i=1}^{k} λ_i µ_i)(Σ_{i=1}^{k} λ_i d_i^p)]² / [(Σ_{i=1}^{k} λ_i µ_i)(Σ_{i=1}^{k} λ_i (d_i^p − d̄^p)²)] = [Σ_{i=1}^{k} λ_i (µ_i − µ̄)² / Σ_{i=1}^{k} λ_i µ_i] R²_{U_p,V},

where R²_{U_p,V} is the squared correlation coefficient between random variables U_p and V defined on the sample space {1, 2, ..., k} with the multinomial probability distribution {λ_1, λ_2, ..., λ_k}, and U_p(i) = d_i^p, V(i) = µ_i. Since

−U_p(1) ≤ −U_p(2) ≤ ... ≤ −U_p(p − 1) < −U_p(p) = 0 > −U_p(p + 1) ≥ ... ≥ −U_p(k)

and either

µ_1 ≤ µ_2 ≤ ... ≤ µ_{l−1} < µ_l > µ_{l+1} ≥ ... ≥ µ_k

or

µ_1 ≥ µ_2 ≥ ... ≥ µ_{l−1} > µ_l < µ_{l+1} ≤ ... ≤ µ_k

holds, the proposed estimator of the true vertex point l asymptotically maximizes the square of the correlation coefficient between U_p and V over p = 1, 2, ..., k.
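Equation (1) and the vertex estimator l̂ are straightforward to compute. The sketch below is illustrative only; the dose scores and counts in the usage comment are hypothetical and are not the Ames data analyzed later.

```python
import numpy as np

def umbrella_test(x, n_i, d):
    """Compute X_p^2 of equation (1) for every candidate vertex p, the overall
    statistic X^2 = max_p X_p^2, and the vertex estimate (1-based index).
    x: total Poisson response at each dose; n_i: sample sizes; d: dose scores."""
    x, n_i, d = map(np.asarray, (x, n_i, d))
    n = n_i.sum()
    xhat = x.sum() / n                         # pooled mean under H0
    x2p = []
    for dp in d:                               # candidate vertex dose d_p
        dip = (d - dp) ** 2                    # redefined scores d_i^p
        dbar = np.sum(n_i * dip) / n
        num = (np.sum(x * dip) - xhat * np.sum(n_i * dip)) ** 2
        den = xhat * np.sum(n_i * (dip - dbar) ** 2)
        x2p.append(num / den)
    x2p = np.array(x2p)
    return x2p, x2p.max(), int(x2p.argmax()) + 1

# Usage sketch with hypothetical counts at k = 6 dose levels:
# per_p, X2, vertex = umbrella_test(x=[120, 190, 260, 280, 210, 90],
#                                   n_i=[3, 3, 3, 3, 3, 3],
#                                   d=[0, 100, 333, 1000, 3333, 10000])
```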

A Real Example

In in vitro mutagenicity assays, experimental organisms may succumb to toxic effects at high doses of test agents, thereby reducing the number of organisms at risk of mutation and causing a downturn in the dose-response curve (Collings et al., 1981; Margolin et al., 1981).

Ames, McCann and Yamasaki (1975) reported experimental data exhibiting this pattern from three replicate Ames tests in which plates containing Salmonella bacteria of strain TA98 were exposed to various doses of Acid Red 114. The number of visible revertant colonies on each plate was observed. We assume a Poisson distribution for the number of visible revertant colonies and test whether an umbrella pattern in the mean exists. Figure 1 is a scatter plot of the number of visible revertant colonies against dose level, which clearly indicates an umbrella pattern peaked between the third dose and the fourth dose. The test statistic is

X² = max(75.71, 75.76, 75.90, 76.20, 69.78, 55.96) = 76.20.

The conservative test gives a p-value less than 0.00001, indicating strong evidence that an umbrella pattern exists. Since max_{1≤p≤6} X_p² is obtained when p = 4, i.e., when dose d_4 = 10000 (µg/ml) of Acid Red 114 is used, the estimated peak dose is d_4 = 10000 (µg/ml).

Simulation Studies

To understand the performance of the proposed test and the estimator for the vertex point when the alternative hypothesis is true, we have carried out a simulation study to evaluate the statistical power of the proposed test and the probability that the vertex point estimator correctly estimates the true vertex point for a set of selected parameters.

In our simulation, we assume a total of five independent Poisson distributions associated with five different dose levels d_i = i, i = 0, 1, 2, 3, 4. We also assume that the sample size of all 5 groups is the same, i.e., n_1 = n_2 = n_3 = n_4 = n_5. Theorem 1 is used to determine the x² which achieves the upper 5% percentile of the test statistic under the null hypothesis.

The empirical power of the proposed test is computed as the proportion of rejections of the null hypothesis over repeated independent tests with a selected set of umbrella patterns. The performance of the proposed estimator of the vertex point is assessed by computing the empirical probability that the proposed estimator correctly estimates the true vertex point.

Table 1 presents the empirical power of the test and the empirical probability of correct estimation of the vertex point for three different choices of the true umbrella pattern and various sample sizes. Each entry in Table 1 is computed from 10000 independent hypothesis tests and estimations. All the tests assume a significance level of 5%.

The first column in Table 1 is the true mean vector (µ_1, µ_2, µ_3, µ_4, µ_5). Notice that these umbrella patterns are chosen so that each possible interior vertex point (i.e., l = 2, 3, 4) within the boundary of the dose interval is considered. Because the monotone trend is included in the alternative hypothesis when the vertex point falls on the boundary of the dose interval, it is of interest to see how our proposed test performs in these alternatives. This is relevant given the fact that, when an umbrella pattern is likely in the dose-response relationship, the traditional statistical monotone trend tests become inadequate because of their power loss and inherent, and possibly erroneous, decisions (Collings et al., 1981; Bretz & Hothorn, 2001). We simulated the statistical power of the proposed test for detecting the monotone trend and compared that to the traditional trend test as discussed by Cochran (1954) and Tarone (1982).

Table 1: Empirical Power and Probability with an Interior Vertex Point.

Umbrella Pattern        Sample Size Per Dose    Power (%)    Correct Vertex Estimation (%)

(2,2.5,3,2.5,1.5) 10 51.98 68.81

20 84.66 77.90

30 96.56 82.85

40 99.22 87.18

50 99.79 89.71

80 100 93.59

(1.5,2,2.5,3,2.5) 10 53.31 47.84

20 85.97 57.63

30 96.54 62.92

40 99.34 66.64

50 99.86 69.15

80 100 74.32

(2.5,3,2.5,2,1.5) 10 53.23 46.85

20 85.47 58.08

30 96.60 63.88

40 99.24 66.64

50 99.79 68.66

80 100 74.09

XIONG, YAN, & JI 395

Table 2 provides the empirical power and the comparison, along with the empirical probability of the correct estimation of the vertex point. The second column in Table 2 is the empirical power based on our proposed test. The third column in Table 2 is the empirical power based on the test by Cochran (1954) and Tarone (1982). Because the vertex point for a monotone trend could be either l = 1 or l = 5, the empirical probability of correct estimation of the true vertex points reported in Table 2 refers to the proportion over repeated estimates that either l = 1 or l = 5 is correctly estimated. Each entry in Table 2 is also computed from 10000 independent hypothesis tests and estimations.

Table 2: Empirical Power and Probability with a Boundary Vertex Point.

Umbrella Pattern        Sample Size Per Dose    Power 1 (%)    Power 2 (%)    Correct Vertex Estimation (%)

(1.5,1.8,2.0,2.3,2.5)          10                  33.69          41.85            52.86

20 62.45 70.45 68.05

30 80.55 86.65 78.25

40 90.20 94.18 84.46

50 95.58 97.45 88.14

80 99.71 99.89 94.91

(3.5,3.4,3.0,2.8,2.6) 10 22.81 28.27 44.66

20 40.99 49.29 57.09

30 56.80 65.14 66.32

40 70.90 78.63 73.20

50 80.05 86.83 78.59

80 94.85 97.14 88.04

¹Proposed test, ²Cochran & Tarone’s test.

Another different type of alternative hypothesis is when a flat segment appears in the Poisson mean vector (µ_1, µ_2, µ_3, µ_4, µ_5). Table 3 presents the empirical power and the empirical probability of the correct estimation of the vertex points for several different choices of such patterns. Since the vertex point in some of these situations is not unique, the empirical probability of correct estimation reported in Table 3 refers to the proportion that one of the possible vertex points is correctly identified over 10000 independent estimates. Data simulations are done using the random number generating function RANPOI from the Statistical Analysis System (SAS Institute, Inc., 1999).

Table 3: Empirical Power and Probability with a Flat Segment in the Pattern

Umbrella Pattern        Sample Size Per Dose    Power (%)    Correct Vertex Estimation (%)

(2.5,3.0,3.0,2.5,2.0)          10                  26.18              77.21

20 50.48 87.36

30 69.76 92.19

40 82.76 95.12

50 90.61 96.64

80 98.95 98.91

(2.5,3.0,3.0,3.0,2.5) 10 11.95 83.36

20 20.81 90.08

30 30.39 93.72

40 40.36 95.98

50 49.78 97.37

80 71.38 99.36

(2.5,2.5,3.0,3.0,2.5) 10 9.71 63.19

20 15.07 70.92

30 21.47 78.14

40 28.10 81.78

50 34.82 85.59

80 52.70 92.83

from Statistical Analysis System (SAS Institute, Inc. 1999).

Conclusion

When an umbrella pattern is likely in the dose-response relationship, the usual statistical trend tests become inadequate because of their power loss and inherent, and possibly erroneous decisions (Collings et al., 1981; Bretz & Hothorn, 2001). We proposed in this paper a test of homogeneity against umbrella alternatives in a sequence of Poisson distributions associated with an ordered dose variable and studied the limiting null distribution and the statistical power.

We also proposed an estimator of the vertex point when the umbrella pattern is confirmed and studied the performance of the estimator. Although the simulation study verifies that the increase of the sample size always increases the statistical power of the test and the probability of the correct estimation to the vertex point, Table 1 seems to indicate that for the selected set of parameters, the proposed estimator to the true vertex point performs better when the vertex point (l = 3) is in the middle of the dose interval than when it is away from the middle of the dose interval (l = 2, 4). The statistical power of the proposed test, however, seems to be very comparable wherever the interior vertex is.

Our proposed test not only detects an umbrella pattern effectively based on the simulation results in Table 1, but also possesses reasonable statistical power to detect a monotone trend, which is a subset of the alternative hypothesis considered in this paper. In fact, the simulation in Table 2 shows that, although our proposed test does not have as much statistical power for detecting the monotone trend as the trend test of Cochran (1954), the difference in power between these two tests is relatively small. This is especially promising given the fact that the trend test of Cochran (1954) is asymptotically locally optimal against any choice of smooth monotone function H (Tarone, 1982).

On the other hand, the simulation results reported in Table 3 seem to indicate that the statistical power of the proposed test deteriorates when a substantial flat segment exists in the mean vector of the Poisson distributions, although the proposed vertex estimator still shows a high probability of correctly identifying one of these multiple vertex points.

Like the similarity on the test statistic for testing a monotone trend between a sequence of binomial distributions and a sequence of Poisson distributions (Armitage 1955; Cochran 1954), the proposed test and estimation techniques can be readily extended to the situation for detecting an umbrella pattern in a sequence of binomial distributions.

References

Ames, B. N., McCann, J., & Yamasaki, E. (1975). Methods for detecting carcinogens and mutagens with the Salmonella/Microsome Mutagenicity test. Mutation Research, 31, 347-364.
Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11, 375-386.
Bailey, N. T. J. (1956). Significance tests for a variable chance of infection in chain-binomial theory. Biometrika, 43, 332-336.
Bretz, F., & Hothorn, L. A. (2001). Testing dose-response relationships with a priori unknown, possibly nonmonotone shape. Journal of Biopharmaceutical Statistics, 11, 3, 193-207.
Calabrese, E. J., & Baldwin, L. A. (1998). A general classification of U-shape dose-response relationships in Toxicology and their mechanistic foundations. Human and Experimental Toxicology, 17, 353-364.
Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics, 10, 417-451.
Collings, B. J., Margolin, B. H., & Oehlert, G. W. (1981). Analysis for binomial data, with application to the fluctuation tests for mutagenicity. Biometrics, 37, 775-794.
Cox, D. R., & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1, 141-149.

Hayter, A. J., & Liu, W. (1999). A new test against an umbrella alternative and the associated simultaneous confidence intervals. Computational Statistics & Data Analysis, 30, 393-401.
Lim, D. H., & Wolfe, D. A. (1997). Nonparametric tests for comparing umbrella pattern treatment effects with a control in a randomized block design. Biometrics, 53, 410-418.
Mack, G. A., & Wolfe, D. A. (1981). K-sample rank tests for umbrella alternatives. Journal of the American Statistical Association, 76, 175-181; Correction 1982, 53, 410-418.
Marcus, R., & Genizi, A. (1994). Simultaneous confidence intervals for umbrella contrasts of normal means. Computational Statistics & Data Analysis, 17, 393-407.
Margolin, B. H., Kaplan, N., & Zeiger, E. (1981). Statistical analysis of the Ames Salmonella/Microsome test. Proceedings of the National Academy of Sciences of the United States of America, 78, 3779-3783.
Moran, P. A. P. (1970). On asymptotically optimal tests of composite hypotheses. Biometrika, 57, 47-55.
Neyman, J. (1959). Optimal asymptotic tests of composite hypotheses. In U. Grenander (Ed.), Probability and Statistics. New York: John Wiley & Sons, 213-234.
Portier, C. J., & Ye, F. (1998). U-shaped dose-response curves for carcinogens. Human and Experimental Toxicology, 17, 705-707.
Rom, D. M., Costello, R. J., & Connell, L. T. (1994). On closed test procedures for dose-response analysis. Statistics in Medicine, 13, 1583-1596.
SAS Institute, Inc. (1999). SAS Language (Version 8). Cary, NC.
Shi, N. Z. (1988). A test of homogeneity for umbrella alternatives and tables of the level probabilities. Communications in Statistics--Theory and Methods, 17, 657-670.
Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62, 626-633.
Simpson, D. G., & Margolin, B. H. (1986). Recursive nonparametric testing for dose-response relationship subject to downturns at high doses. Biometrika, 73, 3, 589-596.
Tarone, R. E. (1982). The use of historical control information in testing for a trend in Poisson means. Biometrics, 38, 457-462.
Tarone, R. E., & Gart, J. J. (1980). On the robustness of combined tests for trends in proportions. Journal of the American Statistical Association, 75, 110-116.
Thorogood, M., Mann, J., & McPherson, K. (1993). Alcohol intake and the U-shaped curve: Do non-drinkers have a higher prevalence of cardiovascular-related disease? Journal of Public Health Medicine, 15, 61-68.


Appendix

We give the proof of Theorem 1, which gives the null distribution of X̂². Let

Y_p = [ Σ_{i=1}^{k} (x_i d_i^p)/n − x̂ Σ_{i=1}^{k} λ_i d_i^p ] / sqrt( μ Σ_{i=1}^{k} λ_i (d_i^p − d̄^p)² ) = Σ_{i=1}^{k} (x_i/n) Δ_i^p,

where

d̄^p = Σ_{i=1}^{k} λ_i d_i^p,   μ = Σ_{i=1}^{k} λ_i μ_i,   Δ_i^p = (d_i^p − Σ_{i=1}^{k} λ_i d_i^p) / sqrt( μ Σ_{i=1}^{k} λ_i (d_i^p − d̄^p)² ).

Let X̂_i = x_i/n. Note that

(Y_1, Y_2, ..., Y_k)' = Δ' (X̂_1, X̂_2, ..., X̂_k)',

where Δ = (Δ_i^p) is the k by k matrix whose (i, p) entry is Δ_i^p. Since

√n [ (X̂_1, X̂_2, ..., X̂_k)' − (λ_1 μ_1, λ_2 μ_2, ..., λ_k μ_k)' ] → N(0, A),

where A = (a_ij) is the k by k matrix such that a_ij = 0 if i ≠ j and a_ij = μ_i λ_i if i = j, and the limit is in distribution, it follows that

√n [ (Y_1, Y_2, ..., Y_k)' − Δ' (λ_1 μ_1, λ_2 μ_2, ..., λ_k μ_k)' ] → N(0, Δ'AΔ).

Theorem 1 follows from the fact that √n (Y_1, Y_2, ..., Y_k)' and (X_1, X_2, ..., X_k)' are stochastically equivalent under H_0.

Journal of Modern Applied Statistical Methods, November 2003, Vol. 2, No. 2, 400-413. Copyright © 2003 JMASM, Inc. ISSN 1538-9472/03/$30.00

Type I Error Rates Of Four Methods For Analyzing Data Collected In A Groups vs Individuals Design

Stephanie Wehry James Algina University of North Florida University of Florida

Using previous work on the Behrens-Fisher problem, two approximate degrees of freedom tests that can be used when one treatment is individually administered and one is administered to groups were developed. Type I error rates are presented for these tests, for an additional approximate degrees of freedom test developed by Myers, Dicecco, and Lorch (1981), and for a mixed model test. The results indicate that the test that best controls the Type I error rate depends on the number of groups in the group-administered treatment. The mixed model test should be avoided.

Key words: groups-versus-individuals design, approximate degrees of freedom tests, mixed models

Introduction

When a groups-versus-individuals design is used to compare two treatments, one treatment is administered to J groups of n participants (for a total of N_G such participants) and one treatment is individually administered to N_I participants, or the individual participants may be in a no-treatment control group. For example, psychotherapy researchers investigating the efficacy of group therapy often use a wait-list control group (Burlingame, Kircher, and Taylor, 1994). The therapy is provided to participants in groups because the researchers believe group processes will enhance the effectiveness of the therapy. Group processes do not affect the participants in the wait-list control group because they do not receive a treatment, much less meet in groups. According to Clarke (1998) the most common design in research involves the use of a randomly assigned control condition, which can feature a variety of no-treatment control schemes.

The groups-versus-individuals design is also used when the purpose is to compare the effectiveness of an active treatment delivered to groups to an active treatment delivered individually. For example, Bates, Thompson, and Flanagan (1999) compared the effectiveness of a mood induction procedure administered to groups to the effectiveness of the same procedure administered to individuals. Boling and Robinson (1999) investigated the effects of study environment on a measure of knowledge following a distance-learning lecture. The three levels of study environment included a printed study guide accessed by individuals, an interactive multi-media study guide accessed by individuals, and a printed study guide accessed by cooperative study groups.

Stephanie Wehry is Assistant Director for Research and Evaluation for The Florida Institute for Education. Her research interests are in evaluating early childhood education programs and applied statistics. Address correspondence to Florida Institute for Education, University of North Florida University Center, 12000 Alumni Drive, Jacksonville, FL, 32224. Email: [email protected]. James Algina is Professor at the University of Florida. His research interests are in applied statistics and psychometrics. Email: [email protected].

A possible model for the data collected in a groups-versus-individuals design consists of two submodels. For participants in the individually administered treatment the submodel is


Y_{i:T_I} = μ_I + ε_{i:T_I}   (1)

where i:T_I (i = 1, ..., N_I) denotes the ith participant within the individually administered treatment. For participants in the group-administered treatment

Y_{i:j:T_G} = μ_G + α_{j:T_G} + ε_{i:j:T_G}   (2)

where i:j:T_G (i = 1, ..., n) denotes the ith participant within the jth group (j = 1, ..., J) in the group-administered treatment. An important question is whether to treat the α_{j:T_G} as fixed or random. When the researcher views the groups in the group-administered treatment as representative of a larger number of groups, α_{j:T_G} should be treated as random. In the remainder of the paper we assume that the groups in the group-administered treatment comprise a random factor, with the groups in the study representing an infinitely large number of groups.

Burlingame, Kircher, and Taylor (1994) reported that the independent samples t test, ANOVA, and ANCOVA were the most commonly used methods for analyzing data in group psychotherapy research. It is well known that these procedures require the scores for individuals to be independently distributed both between and within treatments, an assumption that is likely to be violated for the participants in the group-administered treatment when α_{j:T_G} is random. It is also well known that these procedures are not robust to violations of the independence assumption (see, for example, Scheffe, 1959). When the groups-versus-individuals design is used, lack of independence is indicated by a non-zero intraclass correlation coefficient for the participants who receive the group-administered treatment. Myers, Dicecco, and Lorch (1981), using simulated data, showed that the Type I error rate for the independent samples t test is above the nominal alpha level when the intraclass correlation is positive. Burlingame, Kircher, and Honts (1994) reported similar results. In passing we note that if the researcher believes it is appropriate to treat the α_{j:T_G} as fixed, if both error terms are normally distributed, and if the error terms have equal variances, the treatments can be compared by using an independent samples ANOVA and testing the hypothesis

H_0: μ_I = μ_G   (3)

but generalization of the results to additional groups is not warranted.

Myers et al. (1981) developed two statistical tests of the hypothesis given in equation (3). These tests take the lack of independence into account and allow generalization of the results to the population of groups represented by the groups in the group-administered treatment. (In the following, groups will always refer to the groups in the group-administered treatment.) One of these procedures used a quasi-F statistic and degrees of freedom approximated by the Satterthwaite (1941) method. Formulated as an approximate degrees of freedom (APDF) t statistic, the Myers et al. test statistic is

t_APDF = (Ȳ_I − Ȳ_G) / sqrt( MS_{S/T_I}/N_I + MS_{G/T_G}/N_G )   (4)

where Ȳ_I = Σ_{i=1}^{N_I} Y_{i:T_I} / N_I is the mean of the criterion scores and

MS_{S/T_I} = Σ_{i=1}^{N_I} (Y_{i:T_I} − Ȳ_I)² / (N_I − 1)   (5)

is the variance for participants who received the individually administered treatment; Ȳ_G = Σ_{j=1}^{J} Σ_{i=1}^{n} Y_{i:j:T_G} / N_G is the mean of the criterion scores for participants who received the group-administered treatment (i:j:T_G) and

MS_{G/T_G} = n Σ_{j=1}^{J} (Ȳ_{j:T_G} − Ȳ_G)² / (J − 1)   (6)

is the between-group mean square for these participants. It can be shown that the squared denominator of t_APDF estimates the sampling variance of the numerator assuming a correct model for the data is given by equations (1) and (2) and α_{j:T_G} is random. Assuming that ε_{i:T_I} ~ N(0, σ_I²), α_{j:T_G} ~ N(0, τ²), and ε_{i:j:T_G} ~ N(0, σ_G²), the estimated approximate degrees of freedom are

f̂₂ = ( MS_{S/T_I}/N_I + MS_{G/T_G}/N_G )² / [ (MS_{S/T_I}/N_I)²/(N_I − 1) + (MS_{G/T_G}/N_G)²/(J − 1) ].   (7)

It should be noted that in using the Satterthwaite method, the distribution of the square of the denominator of t_APDF is approximated as a multiple of a chi-square distribution with degrees of freedom estimated by f̂₂.

Based on simulated data, Myers et al. (1981) reported estimated Type I error rates for their APDF test, including results for J = 4 and J = 8 groups in the group-administered treatment. For both numbers of groups, estimated Type I error rates were very similar to the nominal level. While these results indicate that the APDF test has adequate control of the Type I error rate when J ≥ 4, it leaves open the question of how well the test works with a smaller number of groups, and the discussion in Satterthwaite (1941) and results in Scariano and Davenport (1986) suggest the test may not control the Type I error rate for J ≤ 3.

The discussion in Satterthwaite (1941) implies that the approximation of the square of the denominator of t_APDF by a multiple of a chi-square distribution improves as J − 1 or N_I − 1 increases and as

(N_I − 1)(nτ² + σ_G²) / [(J − 1)σ_I²]   (8)

becomes closer to 1.0. When there are two groups in the group-administered treatment, J − 1 is as small as it possibly can be. In addition, calculating the ratio in equation (8) for conditions in which σ_I² = τ² + σ_G² shows that the ratio can be much larger than 1. Therefore, the discussion in Satterthwaite would lead one to expect that the APDF t test in Myers et al. (1981) would not work well when there are just two groups.

Scariano and Davenport (1986) studied Type I error rates for the APDF t test that Welch (1938) proposed as a solution to the Behrens-Fisher problem:

t = (Ȳ_a − Ȳ_b) / sqrt( S_a²/N_a + S_b²/N_b ).   (9)

In t, Ȳ_a and Ȳ_b are means for two individually administered treatments, S_a² and S_b² are the sample variances, and the square of the denominator estimates the sampling variance of the numerator. The distribution of the Welch t can be approximated by a t distribution with degrees of freedom approximated by the Satterthwaite (1941) method. Thus, the Myers et al. (1981) APDF test and the Welch APDF solution to the Behrens-Fisher problem are both based on the same theoretical approach to approximating the sampling distribution of the test statistic.

Scariano and Davenport (1986) developed an analytic procedure for calculating the Type I error rate of the Welch APDF test and showed its Type I error rate can be seriously inflated when (a) there is a negative relationship between the sampling variances of the means and the degrees of freedom for the estimated sampling variances and (b) the smaller of the two degrees of freedom is small. In the Myers et al. (1981) APDF test, the sampling variances of the means are (nτ² + σ_G²)/N_G and σ_I²/N_I, and the degrees of freedom for estimates of these variances are J − 1 and N_I − 1. When N_I = N_G and σ_I² = τ² + σ_G², for example, the relationship will be negative and, when J ≤ 3, the degrees of freedom will be small. Consequently, the APDF test may not work well in these conditions. One purpose of the study is to study Type I error rates when J is small.

Satterthwaite (1941) showed how to approximate the distribution of a sum of two chi-square distributed random variables by another chi-square distribution. He determined the degrees of freedom for the approximating distribution by equating the mean and variance of the sum with the mean and variance of the approximating chi-square distribution. Thus, the Satterthwaite approach is a two-moment approach to determining the degrees of freedom. Scariano and Davenport (1986) developed a four-moment approach and showed analytically that it provides a more conservative test than does the two-moment approach. In the four-moment approach the estimated approximate degrees of freedom are

f̂₄ = [ u²/(J − 1) + 1/(N_I − 1) ]³ / [ u³/(J − 1)² + 1/(N_I − 1)³ ]²   (10)

where, in the groups-versus-individuals design,

u = ( MS_{G/T_G}/N_G ) / ( MS_{S/T_I}/N_I ).   (11)

A second purpose of the present study was to calculate the actual Type I error rate for the four-moment approach.

In Scariano and Davenport (1986), the two-moment approach was sometimes liberal when the four-moment approach was conservative. As a result, they suggested using an average of the estimated degrees of freedom produced by the two approaches. Thus, a third purpose was to analytically evaluate the actual Type I error rate for this averaged degrees of freedom approach.

An alternative to the preceding approaches is based on a mixed model with a proper inference space (McLean, Sanders, & Stroup, 1991) and Satterthwaite degrees of freedom. When the restricted maximum likelihood estimate (RMLE) of τ² is larger than zero and there are an equal number of participants in the groups, the mixed model test is equivalent to the Myers et al. (1981) two-moment test. However, if the RMLE is zero, MS_{G/T_G} and the within-group mean square MS_{S/G:T_G} are pooled and the pooled mean square replaces MS_{G/T_G} in equation (4). This statistic, which is equivalent to the Welch t test, is smaller than t_APDF and may be more conservative than the two-moment test. However, it tends to have larger degrees of freedom, which may make it more liberal than the two-moment test.

When there are an equal number of participants in the groups, the RMLE of τ² is zero when the method of moments estimate of τ² is ≤ 0 (McCulloch & Searle, 2001). The probability that the method of moments estimate of τ² is ≤ 0 is

prob{ MS_{G/T_G} ≤ MS_{S/G:T_G} } = prob{ F[J − 1, J(n − 1)] ≤ (1 − ρ_ICC) / [ρ_ICC(n − 1) + 1] }   (12)

where ρ_ICC = τ²/(τ² + σ_G²). Figure 1 displays the probability as a function of J, ρ_ICC, and n. The probability can be quite substantial and, in some conditions, we would expect the mixed model test to perform differently than the two-moment, four-moment, and averaged degrees of freedom tests. Thus, a fourth purpose of the study is to compare these tests to the mixed model test.

The research was carried out in two studies. In the first study, actual Type I error rates were calculated for the two-moment approach, the four-moment approach, and the averaged degrees of freedom approach. In the second study, simulated data were used to estimate the actual Type I error rate for the mixed model approach as well as for the two-moment approach, the four-moment approach, and the averaged degrees of freedom approach.
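For a single design cell, the probability in equation (12) can be evaluated directly from the F distribution function. The following SAS/IML sketch is illustrative only; the values of J, n, and ρ_ICC are hypothetical and are not taken from the article.

proc iml;
   /* Probability of a nonpositive method-of-moments estimate of tau-squared, equation (12) */
   J   = 4;                                   /* hypothetical number of groups              */
   n   = 8;                                   /* hypothetical participants per group        */
   rho = 0.15;                                /* hypothetical intraclass correlation        */
   fcrit = (1 - rho) / (rho*(n - 1) + 1);     /* right-hand side of equation (12)           */
   p = probf(fcrit, J - 1, J*(n - 1));        /* P{ F[J-1, J(n-1)] <= fcrit }               */
   print p;                                   /* one point of the curve plotted in Figure 1 */
quit;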
Taken together, the purposes of the studies were

Figure 1. Probability of a Negative Estimate for τ², plotted against group sample size (4 to 20) for ρ_ICC = .00, .15, and .30 and J = 2, 4, and 6.

to compare Type I error rates for the two-moment, four-moment, averaged degrees of freedom, and mixed model approaches when the number of groups in the group-administered treatment is small and to study the influence of the number of groups, number of participants in a group, and intraclass correlation on the Type I error rates for these methods.

Methodology

Study 1
Actual Type I error rates were calculated for each condition in a 5 (Number of Groups) × 4 (Intraclass Correlation) × 15 (Number of Participants in a Group) completely crossed factorial design. The levels of the factors were J = 2 to 6 for the number of groups; n = 3 and 4, and 6 to 30 in steps of 2, for the number of participants in a group; and ρ_ICC = .00, .20, .40, and .80 for the intraclass correlation. In all conditions, (τ² + σ_G²) = σ_I² = 1 and, because the design was balanced across treatments, N_I = J(n). For all calculations the nominal alpha level was .05. In the following, when we use the term Type I error rate without the actual or nominal modifier, we refer to the actual Type I error rate.

Calculating Type I Error Rates
Scariano and Davenport (1986) developed a method to calculate Type I error rates for the Welch t test. We applied their method, which we describe below, to the three APDF tests considered in this paper. It should be noted that although the method we applied was developed in the context of the Behrens-Fisher problem, that is, comparing means of independently distributed scores for two groups when the variances are not equal for the groups, we did not apply the method to the Behrens-Fisher problem. Rather, we applied the method to comparison of means for two groups when scores are not independently distributed within the sub-groups in the group-administered treatment. Thus, our work is not subject to Sawilowsky's (2002) criticisms of research on the Behrens-Fisher problem.

The Type I error rate for the APDF t test is


Pr[ t²_quasi > F_{α,1,f̂} ] = ∫₀^∞ Pr[ t²_quasi > F_{α,1,f̂} | u ] g(u) du   (13)

where f̂ is the two-moment, four-moment, or averaged degrees of freedom and α is the nominal Type I error rate. Cochran (1951) has shown that t²_quasi is the ratio of Q to C, where Q ~ F_{1, m₁+m₂}, m₁ = J − 1, m₂ = N_I − 1,

C = (1 + u)(m₁ + m₂) / { (1 + U)[ m₂ + m₁(u/U) ] },   (14)

and

U = [ (nτ² + σ_G²)/N_G ] / ( σ_I²/N_I ).   (15)

To facilitate numerical integration the variable u can be transformed to

s = (m₁u/m₂) / (1 + m₁u/m₂)   (16)

and the Type I error rate is found by numerically integrating

∫₀¹ Pr[ Q > C × F_{α,1,f̂} ] f(s) ds   (17)

where

f(s) = Γ[(m₁ + m₂)/2] U^{m₂/2} s^{(m₁−2)/2} (1 − s)^{(m₂−2)/2} / { Γ(m₁/2) Γ(m₂/2) [U(1 − s) + s]^{(m₁+m₂)/2} }.   (18)

Numerical integration was performed using the trapezoid rule. For J = 2 a singularity occurs at s = 0. Therefore, the limits of integration were .0001 and 1. The interval was divided into 1000 segments of equal width. For J = 3 a removable singularity occurs at s = 0. For J ≥ 3 the limits of integration were 0 and 1 and this interval was also divided into 1000 segments. As a check on the calculations, Type I error rates were estimated by using simulated data with 100,000 replications. The results from the simulation were consistent with the results determined by numerical integration.

Results

Study 1
Figures 2 to 6 contain plots of the Type I error rates against size of groups. The five plots are for two, three, four, five, and six groups, respectively. Plots within a figure are organized by the intraclass correlation coefficient. Inspection of Figure 2 indicates that when there are two groups, the four-moment degrees of freedom should be used, except perhaps when ρ_ICC = 0. Then the averaged degrees of freedom might be used. When there are three groups (see Figure 3), the averaged degrees of freedom might be used at the risk of a slightly liberal test when ρ_ICC is at .20 or greater. The two-moment degrees of freedom results in a test that is too liberal and the four-moment degrees of freedom results in a test that is too conservative. When there are four groups (see Figure 4), the two-moment degrees of freedom provides a test that has a slight liberal tendency that increases as ρ_ICC gets larger and as the size of the groups gets larger. Use of the averaged degrees of freedom provides a test that is slightly conservative when ρ_ICC is small, but controls the Type I error rate well as it increases. Plots for five or more groups (see Figures 5 and 6) are similar to those for four groups. However, the use of either the two-moment degrees of freedom or the averaged degrees of freedom provides reasonable control of the Type I error rate. Use of the former can result in a slightly liberal test, whereas use of the latter can result in a slightly conservative test.

Methodology

Study 2

As noted in the introduction, simulated data were used to compare the three APDF tests and the mixed model test. The design had four factors: the four tests, the number of groups, size of the groups, and level of the intraclass correlation. There were five levels of the number of groups, J = 2, 3, 4, 5, and 6; five levels of group size, n = 4, 8, 12, 16, and 20 subjects nested in the groups; and seven levels of intraclass correlation, ρ_ICC = .00 to .30 in steps of .05.

The simulation was carried out using the random number generation functions of SAS, Release 8.2. Scores for simulated participants in the individually administered treatment level were generated using equation (1), where μ_I was arbitrarily set at 100 and the ε_{i:T_I}s were pseudorandom standard normal deviates generated using RANNOR. Scores for simulated participants in the group-administered treatment level were generated using equation (2), where μ_G was arbitrarily set at 100, α_{j:T_G} was a pseudorandom normal deviate with mean zero and variance τ², and ε_{i:j:T_G} was a pseudorandom normal deviate with mean zero and variance σ_G². Each of the conditions was replicated 5,000 times and the Type I errors of the four tests were counted over the replications of each condition. The nominal Type I error rate was .05 in all conditions.
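A minimal SAS sketch of one such replication is given below. The seed, the parameter values, and the variable names (TRT, GROUP, SCORE) are illustrative assumptions rather than details taken from the article; the resulting data set could be analyzed with the PROC MIXED program shown later in the article.

data onerep;
   /* Hypothetical condition: J = 4 groups of n = 8, N_I = J*n individuals, rho_ICC = .15 */
   J = 4; n = 8; nI = J*n;
   tau2 = 0.15; sigG2 = 0.85; sigI2 = 1;      /* so that tau2 + sigG2 = sigI2 = 1          */
   seed = 12345;
   trt = 1;                                    /* individually administered treatment       */
   do i = 1 to nI;
      group = i;                               /* each individual coded as its own group    */
      score = 100 + sqrt(sigI2)*rannor(seed);  /* equation (1)                              */
      output;
   end;
   trt = 2;                                    /* group-administered treatment              */
   do g = 1 to J;
      alpha = sqrt(tau2)*rannor(seed);         /* random group effect, equation (2)         */
      do i = 1 to n;
         group = nI + g;
         score = 100 + alpha + sqrt(sigG2)*rannor(seed);
         output;
      end;
   end;
   keep trt group score;
run;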

Figure 2. Plots of Type I Error Rates by Size of Group for Two Groups


Figure 3. Plots of Type I Error Rates by Size of Group for Three Groups.

Figure 4. Plots of Type I Error Rates by Size of Group for Four Groups.


Figure 5. Plots of Type I Error Rates by Size of Group for Five Groups.

Figure 6. Plots of Type I Error Rates by Size of Group for Six Groups.


The mixed model specified in equations (1) and (2) was implemented by using SAS. The following is the SAS program; the individually administered treatment is coded 1 on the TRT variable.

PROC MIXED;
   CLASS TRT GROUP;
   MODEL SCORE = TRT / SOLUTION DDFM=SATTERTHWAITE;
   RANDOM GROUP / GROUP=TRT;          /* separate group variance for each treatment    */
   REPEATED / GROUP=TRT;              /* separate residual variance for each treatment */
   PARMS (0) (1) (1) (1) / EQCONS=1;  /* group variance held at 0 for TRT = 1          */
   ESTIMATE 'COMP' TRT 1 -1;
RUN;

The APDF tests are easily carried out in PROC IML, as the only required statistics are the means for the two groups, the variance for the treatment administered to individuals, and the mean squares within and between subgroups for the group-administered treatments.
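As a sketch of that computation, the following PROC IML step evaluates the APDF t statistic and the two-moment degrees of freedom from summary statistics. The numerical inputs are hypothetical; the four-moment degrees of freedom of equation (10), or the average of the two, can be substituted for f2 in the same way.

proc iml;
   /* Hypothetical summary statistics (not taken from the article's simulations)      */
   nI    = 40;        /* N_I: participants in the individually administered treatment */
   J     = 4;         /* groups in the group-administered treatment                   */
   n     = 10;        /* participants per group, so N_G = J*n                         */
   nG    = J*n;
   ybarI = 100.4;  ybarG = 102.1;     /* treatment means                              */
   msSI  = 1.05;      /* MS_S/T_I: variance for the individually treated participants */
   msGG  = 1.60;      /* MS_G/T_G: between-group mean square, equation (6)            */

   a = msSI/nI;  b = msGG/nG;         /* estimated sampling variances of the two means */
   tAPDF = (ybarI - ybarG)/sqrt(a + b);                /* equation (4)                 */
   f2 = (a + b)##2 / (a##2/(nI - 1) + b##2/(J - 1));   /* two-moment df, equation (7)  */
   u  = b/a;                                           /* ratio used in equation (11)  */
   p  = 2*(1 - probt(abs(tAPDF), f2));                 /* two-sided p value            */
   print tAPDF f2 u p;
quit;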

Results

Study 2
The analytic results showed that, when there were two groups, the APDF test statistic with the four-moment degrees of freedom provided the best control of the Type I error rate. Figure 7 compares Type I error rates for the four-moment test and the mixed model test for ρ_ICC = 0.00 and 0.30. Results for the APDF test statistic with the two-moment degrees of freedom are also included because the mixed model test is equivalent to the two-moment test when the estimate of τ² is non-zero. The four-moment degrees of freedom test still provides the best control of the Type I error rate. The mixed model test is more conservative than the two-moment test and is substantially more conservative in conditions in which the probability of a zero estimate for τ² is large.

Figure 7. Type I Error Rates for Two Groups


When there were three groups, the analytic results showed that the APDF test statistic with the averaged degrees of freedom provided the best control of the Type I error rate. Type I error rates for the two-moment test tended to be too large. Figure 8 compares Type I error rates for the mixed model test and the APDF tests with two-moment and averaged degrees of freedom when ρ_ICC = .00 and .30. The results indicate that the averaged degrees of freedom test still provides the best control of the Type I error rate.

According to the analytic results, when there were four or more groups, both the two-moment and averaged degrees of freedom tests provided good control over the Type I error rate, with the former test being slightly more liberal. Type I error rates are depicted in Figure 9 for the two-moment, four-moment, and mixed model tests for ρ_ICC = .00 and .30.

The results indicate that the mixed model test is conservative and less adequate than the other tests when ρ_ICC is zero. Inspection of the results for other values of ρ_ICC indicates that when ρ_ICC = .10 the performance of the averaged degrees of freedom and the mixed model tests is very similar, and as ρ_ICC increases the Type I error rates for the mixed model test become slightly larger than those for the averaged degrees of freedom test. A similar pattern of results emerged for five or six groups. In particular, when ρ_ICC was near zero the mixed model test was too conservative.


Figure 8. Type I Error Rates for Three Groups



Figure 9. Type I Error Rates for Four Groups

Conclusion

Myers et al. (1981) presented a two-moment approximate degrees of freedom test for use when one treatment is delivered to individual participants and one is delivered to groups of participants. The test was based on results in Satterthwaite (1941). Simulation results indicated that the test provided good control of the Type I error rate for both four groups and eight groups of participants.

Satterthwaite (1941) and Scariano and Davenport (1986) studied a two-moment approximate degrees of freedom test for a design in which both treatments are delivered individually. Discussion in Satterthwaite and results in Scariano and Davenport suggest that the Myers et al. test may not perform well when the number of groups is smaller than four. Using an analytic procedure developed by Scariano and Davenport, we showed that the Myers et al. test could provide relatively poor control of the Type I error rate when there are two or three groups. Using results presented in Scariano and Davenport, we developed two alternatives to the Myers et al. (1981) test, a four-moment approximate degrees of freedom test and an averaged degrees of freedom test.

Using the analytic procedure developed by Scariano and Davenport, Type I error rates were calculated for all three tests in a wide range of conditions in which the design was balanced across the individually administered treatment and the group-administered treatment and across the groups in the group-administered treatment. We also estimated Type I error rates for the mixed model test and the three APDF tests. The results indicated that the four-moment test should be used when the group-administered treatments are delivered to two groups and the averaged degrees of freedom test should be used when the group-administered treatments are delivered to three groups. When there are between four and six groups, we recommend using the averaged degrees of freedom test. However, because (a) this test is slightly conservative, with a Type I error rate between 0.045 and 0.050, and (b) the two-moment test is slightly liberal but tends to keep the Type I error rate below 0.06, some may prefer the two-moment test. Even when there are four or more groups, we do not recommend the mixed model test because of its conservative tendency when the intraclass correlation coefficient is small. These recommendations are summarized in Table 1.

Table 1. Recommended Tests by the Number of Groups in the Group-Administered Treatment

Number of Groups   Recommended Test
2                  Four-Moment Test
3                  Averaged Degrees of Freedom Test
4-6                Averaged Degrees of Freedom or Two-Moment Test

When there are two groups in the group-administered treatment, the four-moment test provides better control of the Type I error rate than do the other tests. Nevertheless, researchers should be cautious about using a groups-versus-individuals design with two groups because such designs will provide relatively low power. The true degrees of freedom for the four-moment test is

f₄ = [ U²/(J − 1) + 1/(N_I − 1) ]³ / [ U³/(J − 1)² + 1/(N_I − 1)³ ]²   (19)

where U is defined in equation (15). Calculations show that f₄ approaches 1.0 from above as U increases. Thus in many situations the degrees of freedom for the four-moment test will be very small and this will have a negative impact on power. In addition, substituting population parameters for sample statistics in the Myers et al. (1981) t statistic, we have

(μ_I − μ_G) / sqrt( σ_I²/N_I + σ_G²/N_G + τ²/J ).   (20)

Therefore even as the two sample sizes increase, power will not go to 1.0 if τ² ≠ 0. Finally, the fact that the Type I error rate for the four-moment test declines as n increases suggests power will decline as n increases because the test becomes more conservative. The predicted low power and decline in power as n increases were borne out by simulation studies. For example, when σ_I² = σ_G² + τ² = 1, ρ_ICC = .2, and μ_G − μ_I = .8, estimated power was .23, .21, and .19 as n increased from 6 to 18 in steps of 6. Comparison of these results to the power of an independent samples t test with the same overall sample size indicates how much lower power is when a groups-versus-individuals design is used. Note that because σ_I² = σ_G² + τ² = 1, μ_G − μ_I = .8 corresponds to Cohen's large effect size. Also, as n increases from 6 to 18 the sample size in a treatment increases from 12 to 36 in steps of 12. For an independent samples t test with an effect size equal to .8, power is .47, .77, and .92 as n increases from 12 to 36 by 12.

When there are three groups and the averaged degrees of freedom approach is used, power does not decline as n increases, but power can still be quite low and does not increase quickly as n increases. As n increased from 4 to 12, so that the overall sample size remained the same as in the conditions on which power results were reported for J = 2, estimated power was .29, .36, and .40 when J was 3.

As suggested by equation (20), power continues to increase as J increases. For example, with J = 6, as n increased from 2 to 6 in steps of 2 estimated power was .41, .58, and .68 using the averaged degrees of freedom test. Thus when the groups-versus-individuals test is used, it is important to have as many groups as possible and may be more important to have more groups than to have more participants per group.

At least four lines of additional research are attractive. First, the performance of the tests under non-normality should be investigated and, if performance is poor, developing the test statistic and degrees of freedom using robust estimates of the means and mean squares is of interest. Second, performance of the four tests when the design is unbalanced across the individually administered treatment and the group-administered treatment, but balanced across groups in the group-administered treatment, might be investigated. Third, calculating the averaged degrees of freedom by differentially weighting the two-moment and four-moment degrees of freedom might be investigated when there are four or more groups. Weighting the two-moment degrees of freedom more heavily will reduce the slight conservative tendency of the averaged degrees of freedom test. In general, more extensive studies of power than we have conducted would be worthwhile. Fourth, the three APDF tests should be generalized for use when the design is not balanced across groups in the group-administered treatment and Type I error rates for these tests and the mixed model test should be investigated.

References

Bates, G. W., Thompson, J. C., & Flanagan, C. (1999). The effectiveness of individual versus group induction of depressed mood. The Journal of Psychology, 33, 245-252.
Boling, N. C., & Robinson, D. H. (1999). Individual study, interactive multimedia, or cooperative learning: Which activity best supplements lecture-based distance education? Journal of Educational Psychology, 91, 169-174.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152.
Burlingame, G. M., Kircher, J. C., & Taylor, S. (1994). Methodological considerations in group psychotherapy research: Past, present, and future practices. In A. Fuhriman & G. Burlingame (Eds.), Handbook of group psychotherapy and counseling: An empirical and clinical synthesis (pp. 41-80). New York: Wiley.
Burlingame, G. M., Kircher, J. C., & Honts, C. R. (1994). Analysis of variance versus bootstrap procedures for analyzing dependent observations in small group research. Small Group Research, 25, 486-501.
Clarke, G. N. (1998). Improving the transition from basic efficacy research to effectiveness studies: Methodological issues and procedures. In A. E. Kazdin (Ed.), Methodological issues and strategies in clinical research (2nd ed.) (pp. 541-559). New York: Wiley.
Cochran, W. G. (1951). Testing a linear relationship among variances. Biometrics, 7, 17-32.
McCulloch, C. E., & Searle, S. R. (2001). Generalized, linear, and mixed models. New York: Wiley.
McLean, R. A., Sanders, W. L., & Stroup, W. W. (1991). A unified approach to mixed linear models. American Statistician, 45, 54-64.
Myers, J., Dicecco, J., & Lorch, Jr., J. (1981). Group dynamics and individual performances: Pseudogroup and Quasi-F analyses. Journal of Personality and Social Psychology, 40, 86-98.
Satterthwaite, F. W. (1941). Synthesis of variance. Psychometrika, 6, 309-316.
Scariano, S. M., & Davenport, J. M. (1986). A four-moment approach and other practical solutions to the Behrens-Fisher problem. Communications in Statistics: Theory and Methods, 15, 1467-1504.
Sawilowsky, S. (2002). The probable difference between means when σ₁ ≠ σ₂: The Behrens-Fisher problem. Journal of Modern Applied Statistical Methods, 1, 461-472.
Scheffe, H. (1959). The analysis of variance. New York: Wiley.
Welch, B. L. (1938). On the comparison of several mean values: An alternative approach. Biometrika, 38, 330-336.

Endnote

Tables containing Type I error rates for all conditions in the studies are available at http://plaza.ufl.edu/algina/index.programs.html

Journal of Modern Applied Statistical Methods, November 2003, Vol. 2, No. 2, 414-424. Copyright © 2003 JMASM, Inc. ISSN 1538-9472/03/$30.00

A Nonparametric Fitted Test For The Behrens-Fisher Problem

Terry Hyslop Paul J. Lupinacci Department of Medicine Department of Mathematical Sciences Thomas Jefferson University Villanova University

A nonparametric test for the Behrens-Fisher problem that is an extension of a test proposed by Fligner and Policello was developed. Empirical level and power estimates of this test are compared to those of alternative nonparametric and parametric tests through simulations. The results of our test were better than or comparable to all tests considered.

Key words: Behrens-Fisher problem, empirical level and power, Wilcoxon-Mann-Whitney, nonparametric, simulation study

Introduction Various authors have also considered the Behrens-Fisher problem when the normality The comparison of the means of two assumption is not appropriate. The usual independent populations has traditionally been nonparametric approaches assume that the data approached using Student’s t-test. The use of are continuous and the distributions are of the this test assumes that the observations come same shape. For these tests, such as the from a normal distribution and that the variances Wilcoxon-Mann-Whitney test (Wilcoxon 1945; of the two populations are equal. When the Mann & Whitney 1947), the level of the test will homogeneity of variances is not a reasonable not be preserved when the populations have assumption the problem has been called the different shapes or variances (Fligner & Behrens-Fisher problem. Policello 1981; Brunner & Neumann 1982, Lee and Gurland (1975) developed a 1986; Brunner & Munzel 2000). Fligner and new method for handling the Behrens-Fisher Policello (1981) and Brunner and Neumann problem and compared their test to many others (1982, 1986) considered the problem under the that have been proposed for this problem. Their assumption that the independent samples are test performed very well regarding size and from continuous distributions without the power. However, their method utilized a large assumption of equal variances or equal shapes of table of critical values to determine the correct the distributions. Brunner and Munzel (2000) region of rejection. Lee and Fineberg (1991) derived an asymptotically distribution free test sought to simplify the method proposed by Lee without the assumption that the data are and Gurland. They fit a nonlinear function to the generated from a continuous distribution critical values derived by Lee and Gurland so function. that the critical values could be estimated. Fligner and Policello developed their alternative nonparametric method for comparing two population medians without the equal Terry Hyslop is Assistant Professor of Medicine variance and equal shape assumptions. To in the Biostatistics Section, Division of Clinical implement their test, one must consult a large Pharmacology, Department of Medicine at table of critical values to determine the correct Thomas Jefferson University. Email: region of rejection. Their table is parameterized [email protected]. Paul J. Lupinacci by the test’s level of significance and the sample is Assistant Professor in the Department of sizes of the two samples. We expand on the Mathematical Sciences at Villanova University. approach of Fligner and Policello by proposing a E-mail: [email protected]. fitted test which eliminates the need for large tables or complicated derivations of critical values. We fit a nonlinear function to the critical

414 HYSLOP & LUPINACCI 415 values in their table so that the critical values approximate α level of significance versus the can be estimated. Motivation for the nonlinear one-sided alternative θ x >θ y is function came from the nonlinear function used by Lee and Fineberg. A complete description of ˆ the problem and details of the proposed test Reject H0 if Uu≥ α ; otherwise do not reject. follow in the Methodology Section. In that section, our method is demonstrated using a They provided a table of critical values for numerical example. Simulation studies are used uα for various values of m, n, and α. Values in the Results Section to compare the fitted test outside the range of their table are to be derived to some of the other parametric and or estimated by z for large sample sizes, where nonparametric tests which have been proposed α z is the 1-α percentile of the standard normal for the Behrens-Fisher problem. α distribution. Methodology Implementation of this test would be greatly simplified if the large table of critical Let X ,,… X and YY,,… be independent values was not required. In addition, sample size 1 m 1 n combinations that are not provided in their table random samples from continuous distributions would require either additional effort for with population medians θ x and θ y , derivation, or an assumption of uzα = α . We respectively. We are interested in testing the propose fitting the following function to the following hypotheses: critical values in the Fligner and Policello table so that the critical values can be estimated: H0 :θ x =θ y

versus Horor:θ ><≠θθθθθ⎡⎤. bb12bb35 b 4 ax y⎣⎦ x y x y ubα =+0 + + +22 + , ff12()ff f f 12 12

Let Pi represent the number of sample where fm12= −=−1, fn 1, and bb0,… , 5 are the observations, Yj , less than Xi , for im=1,… , . parameters of the function. We also propose that Similarly, let Q j represent the number of the parameters bb05,,… be estimated by sample observations, Xi , less than Yj , for ordinary least squares. 54 values obtained from jn=1,… , . Compute the average placement for Fligner and Policello’s table of critical values each of the samples, were used in the estimation process. Table 1 presents the parameter estimates obtained for the various α values of 0.10, 0.05, 0.025, and 0.01. 1 m 1 n PP= ∑ i and QQ= ∑ j . m i=1 n j=1 Table 1. Parameter estimates for the F-P fitted test polynomial. m n 2 2 Let VPP=− and VQQ=−, 1 ∑()i 2 ∑()j α i=1 j=1 b0 b1 b2 b3 b4 b5 0.10 1.34 -1.39 0.16 -0.03 5.20 1.17 and calculate the test statistic 0.05 1.74 -0.69 -0.87 12.53 -4.09 2.44 0.025 2.15 -0.60 -2.54 22.05 -3.50 7.36 mn 0.01 3.16 -11.43 -6.75 50.15 51.87 19.20 ∑∑PQij− ˆ ij==11 Motivation for this functional form U = 1 . 2 comes from a parametric fitted test for the 2()VV12++ PQ Behrens-Fisher problem proposed by Lee and Fineberg (1991) as an alternative to Lee and Fligner and Policello presented a test of H0 Gurland’s (1975) test that also required based on this statistic, where the procedure at an 416 NONPARAMETRIC TEST FOR THE BEHRENS-FISHER PROBLEM extensive tables of critical values. Their ˆ referred to as U f , and the critical value will be proposed function is similar to the one proposed ()f here. Other functional forms were also referred to as uα . For example, Fligner and considered, but none were found which provided Policello’s critical value for a one-sided, level a better fit to the critical values. Figure 1 0.05 test when both samples are of size 5 is displays the critical and fitted values for a level 2.063. Using our parameter estimates when 0.05 test when m = 4, 5, …, 12 and n = 3, 4, and α = 0.05 and the sample sizes, we obtain an 5. The fit is good for even these small values of estimated critical value of 2.035. n and becomes more precise as n gets larger. The test based on the fitted critical values will be

Figure 1. Plot of Fligner and Policello’s Critical Values and the Fitted Critical Values for m=4(1)12 and n=3,4,5.


Numerical Example Hox:θ =θ yversus Hax:θ >θ y.

The following example uses a data set The data in both groups were simulated that originated from the simulation studies that from uniform distributions with a mean of 100. are presented in the Results Section of the However, the variance of the second distribution manuscript. With this data set, we will test the was ten times that of the first distribution. The null hypothesis that the two population medians first data set consists of twelve observations are the same versus the alternative hypothesis while the second data set consists of only five that the median of the first population is greater observations. Thus, we are simulating a scenario than that of the second population, that is, where the data set with fewer observations HYSLOP & LUPINACCI 417 comes from an underlying distribution function significance. We will use the notation as defined with a larger variance. We will utilize this data in the Methodology Section. The data, as well as set to demonstrate the test procedure and to the placement values, Pi and Q j , for each illustrate the need for a simulation study which observation in the first and second samples, compares the various methods that are used for respectively, are given in Table 2. analyzing this type of data in terms of their power and ability to hold the level of

Table 2. Data for the Numerical Example

Group 1                                Group 2
Observation   Value      P_i          Observation   Value      Q_j
1             101.673    4            1             103.409    12
2             101.550    4            2              98.546     1
3             100.410    4            3              97.429     0
4             100.203    4            4              96.536     0
5              99.906    4            5              95.940     0
6              99.875    4
7              99.861    4
8              99.695    4
9              99.535    4
10             98.985    4
11             98.575    4
12             98.461    3
              ΣP_i = 47                              ΣQ_j = 13
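As a check on the example, the placements, the test statistic, and the fitted critical value can be recomputed directly from the Table 2 data. The SAS/IML sketch below assumes the fitted function has the form u_α = b0 + b1/f1 + b2/f2 + b3/(f1·f2) + b4/f1² + b5/f2², with f1 = m − 1 and f2 = n − 1, and uses the α = 0.05 row of Table 1; with those assumptions it reproduces the values reported in the text.

proc iml;
   /* Table 2 data */
   x = {101.673 101.550 100.410 100.203 99.906 99.875 99.861 99.695 99.535 98.985 98.575 98.461};
   y = {103.409 98.546 97.429 96.536 95.940};
   m = ncol(x);  n = ncol(y);
   P = j(1, m, 0);  Q = j(1, n, 0);
   do i = 1 to m;  P[i] = sum(y < x[i]);  end;   /* placements of the first sample  */
   do j = 1 to n;  Q[j] = sum(x < y[j]);  end;   /* placements of the second sample */
   Pbar = P[:];  Qbar = Q[:];
   V1 = ssq(P - Pbar);  V2 = ssq(Q - Qbar);
   Uhat = (sum(P) - sum(Q)) / (2*sqrt(V1 + V2 + Pbar*Qbar));   /* test statistic */
   /* fitted critical value, alpha = .05 row of Table 1 (assumed functional form) */
   b = {1.74 -0.69 -0.87 12.53 -4.09 2.44};
   f1 = m - 1;  f2 = n - 1;
   ualpha = b[1] + b[2]/f1 + b[3]/f2 + b[4]/(f1*f2) + b[5]/(f1##2) + b[6]/(f2##2);
   print Uhat ualpha;   /* about 1.54 and 1.86; the article reports 1.537 and 1.868 */
quit;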

5 2 VQQ=−=111.200 , For this example, the sum of the placements in 2 ∑()j j=1 the first data set is 47 and the sum of the placements in the second data set is 13. this and the test statistic is calculated as leads to the average placements of:

12 5 1 12 PQ− PP==3.917 ∑∑ij ∑ i ij==11 12 i=1 ˆ U ==1 1.537 . 2 2()VV12++ PQ and

For m = 12 and n = 5, the critical value 1 5 QQ==2.600 for the fitted test is u()f = 1.868, and the critical 5 ∑ j 0.05 j=1 value for the Fligner and Policello test is u = 0.05 1.923. Therefore, we fail to reject the null for each group. The values of V and V are 1 2 hypothesis using both tests. However, the calculation of the Fligner and Policello critical 12 2 value would have been much more complicated VPP1 =−=∑()i 0.917 i=1 if our sample sizes were not given in their table of critical values. Therefore, we suggest using and the critical value based on the fitted test. Let us also consider how alternative 418 NONPARAMETRIC TEST FOR THE BEHRENS-FISHER PROBLEM

tests for this type of data would have fared with X1,,… XFxm ~(), this data set. Since the data are not coming from a normal distribution, most statisticians would then we let use the nonparametric alternative to Student’s t test, the Wilcoxon-Mann-Whitney test, without y hesitation. For this example, we compared our YYGyF,,… ~ = 1 n () ()σ results to those of the Wilcoxon-Mann-Whitney test and the nonparametric test proposed by Brunner and Munzel (2000). The Brunner and for values of Munzel test led to the same conclusion as our 2 fitted test, that is, their test failed to reject the σ = {0.01,0.25,1, 4,10} . null hypothesis. The Brunner and Munzel test statistic was B = 1.437 and the corresponding p- All simulations were run in SAS version value was 0.112, which is not significant at the 8. The SAS function NORMAL was used to 0.05 level of significance. However, there was a generate random standard normal deviates which different result if one used the Wilcoxon-Mann- were then transformed to simulate the desired Whitney test. Its test statistic was W = 28 and normal distribution. The contaminated normal the corresponding p-value was 0.041, which is deviates were generated by multiplying a significant at the 0.05 level of significance. This random normal deviate by 9 with probability p = conflicting result caught our attention and 0.10. The double exponential deviates were spurred interest in a simulation study which generated using the method of Martinez and compares the various methods in terms of size Iglewicz (1984) that transforms a random and power. standard normal deviate into a double exponential deviate using the transformation Results 0.109Z 2 exp{ } For our simulation study, we considered three DE= Z 2 , nonparametric procedures and one parametric procedure. The three nonparametric procedures where Z is a random standard normal deviate. that were considered were the Wilcoxon-Mann- Random uniform and gamma deviates were Whitney test, denoted W , the Brunner and generated using the SAS functions UNIFORM Munzel test, denoted B , and our fitted test, and RANGAM, respectively. denoted Uˆ . The parametric test that we For a statistical test to be meaningful, it f must display adequate power while still included in our simulation study was the usual t maintaining its nominal level. We ran test using Satterthwaite’s approximation for the simulations to obtain estimates of the level and degrees of freedom. We used ts to denote this power for each of the tests under consideration. test. We decided not to include the Fligner and To estimate the tests’ level, we ran 15,500 Policello test in the discussion because its simulation iterations. The number of simulations empirical level and power estimates were almost provides that a 95% confidence interval for the identical to those of our fitted test. This was to estimated level will be approximately ± 0.36% be expected since we fitted a function to their for α = 0.05 . At each iteration m + n deviates critical values and the fit was very good. We of the desired type were generated from simulated data using the normal, contaminated distributions where θ =θ .The four tests were normal, double exponential, uniform, and x y gamma distributions for estimating both the performed at each interation testing Hox :θ =θ y

empirical level and power for the four tests. vs. H ax:θ >θ y. The proportion of the iterations Since we are interested in determining the effect where the null hypothesis was rejected was of different variances on the level and power recorded for each of the four tests. This estimates, we considered distributions which proportion is the empirical level estimate. The differed in scale by assuming that if HYSLOP & LUPINACCI 419 empirical levels were mulitplied by 1,000 and the four tests. The standard error was calculated these values are reported in Table 3. Table 3 lists assuming a true nominal level of 0.05. We the empirical levels for each of the five indicate an empirical level more than two distributions, for five sample size combinations, standard deviations above 0.05 by entering the at each of the five variance ratios, τ , for each of number into the table in boldface type.

Table 3. Empirical Levels Times 1,000 for α = 0.05 for Each of the 4 Tests.

m = 5, n = 5 m = 12, n = 5 m = 11, n = 10 Distribution W f B t W f B t W f B t τ Uˆ s Uˆ s Uˆ s 0.1 49 49 47 51 22 35 49 48 59 54 49 50 0.25 49 49 49 48 24 38 49 49 55 54 52 51 Normal 1 51 51 55 50 40 52 56 49 51 53 52 50 4 52 52 53 51 67 61 56 52 55 52 49 47 10 52 52 48 51 89 59 52 53 61 55 48 45 0.1 49 32 44 25 28 13 59 73 59 29 49 41 0.25 48 29 47 21 29 13 57 75 54 25 49 38 Contaminated 1 49 28 53 21 43 21 58 73 49 26 51 41 Normal 4 45 27 46 22 60 27 49 72 56 29 52 40 10 49 30 44 21 76 28 44 72 61 29 51 38 0.1 50 50 47 57 19 32 47 46 61 52 47 50 0.25 52 52 53 55 23 37 50 51 55 53 50 50 Uniform 1 47 47 49 44 42 54 56 55 48 50 49 49 4 48 48 51 52 80 66 59 59 62 56 49 49 10 49 49 48 57 103 58 53 59 70 58 50 50 0.1 51 51 46 47 22 36 50 49 58 54 50 49 0.25 49 49 44 47 25 39 50 46 49 50 48 48 Double 1 48 48 51 45 42 52 53 48 50 52 51 48 Exponential 4 50 50 51 47 66 61 56 47 56 53 49 47 10 46 46 42 42 87 59 52 47 66 58 52 50 0.1 32 32 30 32 10 19 27 38 33 30 28 37 0.25 35 35 37 33 17 27 34 44 34 33 31 38 Gamma 1 48 48 51 46 42 53 57 65 51 53 51 50 4 87 87 88 85 123 110 100 109 129 122 115 84 10 144 144 132 147 238 165 147 163 253 229 209 124

Notes: Wilcoxon-Mann-Whitney (W), Fitted Test (Uˆ f ), Brunner-Munzel (B), and Satterthwaite’s t-test

(ts). Variance of X = 1, Variance of Y = τ. The right side of this table continues on the page below.


Table 3, continued.

m = 25, n = 20 m = 40, n = 40 Distribution W f B t W f B t τ Uˆ s Uˆ s 0.1 51 47 48 48 60 44 48 49 0.25 45 44 46 49 54 43 48 46 Normal 1 47 46 48 50 52 46 52 53 4 59 47 47 47 58 47 51 52 10 71 50 49 48 66 49 52 52 0.1 53 17 48 64 62 19 51 54 0.25 50 16 49 67 55 17 50 54 Contaminated 1 48 18 49 67 50 15 50 48 Normal 4 55 19 48 66 52 16 49 52 10 65 22 49 66 63 20 54 53 0.1 58 48 49 49 67 44 48 47 0.25 50 46 48 49 61 47 50 51 Uniform 1 50 49 51 51 51 46 52 50 4 63 48 47 48 61 47 50 51 10 78 52 50 52 64 43 46 46 0.1 52 46 48 48 64 47 52 51 0.25 48 46 49 49 51 43 47 48 Double 1 52 50 53 53 47 43 48 47 Exponential 4 59 48 48 47 55 46 50 54 10 70 50 48 46 61 45 49 49 0.1 22 19 20 40 20 13 15 41 0.25 25 24 25 44 19 15 17 40 Gamma 1 47 46 48 50 52 48 53 54 4 186 158 158 76 245 215 227 72 10 408 339 332 108 590 521 535 91

Notes. Continued from previous page. Wilcoxon-Mann-Whitney (W), Fitted Test (Uˆ f ), Brunner-

Munzel (B), and Satterthwaite’s t-test (ts). Variance of X = 1, Variance of Y = τ.

There are a number of interesting generated from the contaminated normal conclusions that can be made from observing the distribution and the sample sizes were similar, values in Table 3. First, the t test using the test was very conservative. In other cases, Satterthwaite’s approximation for the degrees of such as when the data were uniformly freedom maintained its level when the data were distributed and when the sample sizes differed, generated from a normal distribution regardless the test became anti-conservative. The of the sample size combination or the ratio of the Wilcoxon-Mann-Whitney text generally does variances. This was expected since the primary not maintain its level, even under the optimal purpose of this test is to handle these situations. condition of normality. It was conservative in However, when the condition of normality was situations where the larger sample size was removed, the test became less predictable. In taken from the population with the larger some cases, such as when the data were variance, and it was anti-conservative if the reverse was true. The fitted test generally maintained its level. In most of the situations HYSLOP & LUPINACCI 421 when it did not, the test was conservative. The conservative in most scenarios, it was not Brunner-Munzel test generally maintained its surprising that the power of this test was greater level under all scenarios tested. All of the tests than the power of the other tests. However, since had trouble maintaining the 0.05 level when the this power is meaningless in the presence of an data were simulated using the gamma inflated nominal level, the Wilcoxon test will be distribution. removed from the rest of the discussion. To estimate the tests’ power, we ran Figure 2 shows the power of the 1,540 simulation iterations. This number of remaining tests under normality when the simulations assures that a 95% confidence variances are not equal and the sample sizes are interval for power will be approximately the same. Under these conditions, most ±0.025when power is around 80%. For each statisticians would use the t test with iteration, m + n deviates of the desired type were Satterthwaite’s approximation for the degrees of freedom. However, the fitted test and the generated under the condition θ xy−=θδ , where Brunner-Munzel test demonstrate comparable δ = {1,2,3,4}. Again, the proportion of the power. iterations where the null hypothesis was rejected Figure 3 illustrates the power of the tests was recorded for each of the four tests. This under normality with the added complication proportion is the test’s estimated power. Since that the smaller sample size corresponds to the the Wilcoxon-Mann-Whitney test was anti- group with the larger variance. Once again, all three tests demonstrate similar power levels.

Figure 2. Plot of the Power for the Various Tests under Normality, Equal Sample Sizes, Ratio of the Variances τ = 0.1, and α = 0.05.

[Figure: estimated power (0–100%) plotted against δ = 0, 1, 2, 3, 4 for the Fitted F-P, Brunner-Munzel, and Satterthwaite tests.]


Figure 3. Plot of the Power for the Various Tests under Normality, Different Sample Sizes, Ratio of the Variances τ = 10, and α = 0.05.

[Figure: estimated power (0–100%) plotted against δ = 0, 1, 2, 3, 4 for the Fitted F-P, Brunner-Munzel, and Satterthwaite tests.]

When symmetry is removed from the distribution, such as in the case of the contaminated normal distribution, the fitted test and the Brunner-Munzel test demonstrate superiority over Satterthwaite's t test. This is illustrated in Figures 4 and 5. Figure 4 illustrates the power of the three tests when samples of the same size are generated from contaminated normal distributions with the same variance. Figure 5 illustrates the power of the three tests when samples of different sizes are generated from contaminated normal distributions with different variances. In both of these figures, the fitted test and the Brunner-Munzel test demonstrate similar power. However, the t test using Satterthwaite's approximation for the degrees of freedom has considerably less power than the other tests. This pattern is consistent over all of the results run using the contaminated normal distribution.


Figure 4. Plot of the Power for the Various Tests under the Contaminated Normal Distribution, Equal Sample Sizes, Ratio of the Variances τ = 1, and α = 0.05.

[Figure: estimated power (0–100%) plotted against δ = 0, 1, 2, 3, 4 for the Fitted F-P, Brunner-Munzel, and Satterthwaite tests.]

Figure 5. Plot of the Power for the Various Tests under the Contaminated Normal Distribution, Different Sample Sizes, Ratio of the Variances τ = 0.1, and α = 0.05.

[Figure: estimated power (0–100%) plotted against δ = 0, 1, 2, 3, 4 for the Fitted F-P, Brunner-Munzel, and Satterthwaite tests.]

All three tests exhibited comparable power under the double exponential and uniform distributions. All three tests had increased power when the sample with fewer observations was obtained from the distribution with the smaller variance. However, all three tests exhibited decreased power when the sample with fewer observations was obtained from the distribution with the larger variance.


Conclusion

In this paper, we developed a method to test the difference between two population medians. Our fitted test was created by fitting a function to the large table of critical values presented by Fligner and Policello. Through a simulation study, we have determined that our test, and the Brunner-Munzel test, generally maintain the expected level of the test for a variety of underlying density functions.

The usual alternative to Student's t test, the Wilcoxon-Mann-Whitney test, has been shown to be anti-conservative in the simulation study under unequal variances, exhibiting empirical level estimates that are generally greater than the nominal level. Therefore, this test also exhibited artificially high power in the simulation results. In contrast, the fitted test and the Brunner-Munzel test have shown power comparable to the t test using Satterthwaite's approximation for the degrees of freedom under the ideal condition of normality. When symmetry is removed from the distribution function, such as in the contaminated normal distribution, the fitted test and the Brunner-Munzel test have shown improved power over the t test using Satterthwaite's approximation for the degrees of freedom.

All three tests exhibited comparable power in the simulation studies when the data were simulated from the double exponential or the uniform distributions. Statisticians should consider using an alternative to the Wilcoxon-Mann-Whitney test when unequal variances are possible.

References

Brunner, E., & Munzel, U. (2000). The nonparametric Behrens-Fisher problem: Asymptotic theory and small-sample approximation. Biometrical Journal, 42, 17-25.

Brunner, E., & Neumann, N. (1982). Rank tests for correlated random variables. Biometrical Journal, 24, 373-389.

Brunner, E., & Neumann, N. (1986). Two-sample rank tests in general models. Biometrical Journal, 28, 395-402.

Fligner, M. A., & Policello, G. E. (1981). Robust rank procedures for the Behrens-Fisher problem. Journal of the American Statistical Association, 76, 162-168.

Lee, A. F. S., & Fineberg, N. S. (1991). A fitted test for the Behrens-Fisher problem. Communications in Statistics – Theory and Methods, 20, 653-666.

Lee, A. F. S., & Gurland, J. (1975). Size and power of tests for equality of means of two normal populations with unequal variances. Journal of the American Statistical Association, 70, 933-941.

Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 50-60.

Martinez, J., & Iglewicz, B. (1984). Some properties of the Tukey g and h family of distributions. Communications in Statistics – Theory and Methods, 13, 353-369.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics, 1, 80-83.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 425-432 1538 – 9472/03/$30.00

Example Of The Impact Of Weights And Design Effects On Contingency Tables And Chi-Square Analysis

David A. Walker, Educational Research and Assessment, Northern Illinois University
Denise Y. Young, Institutional Research, University of Dallas

Many national data sets used in educational research are not based on simple random sampling schemes, but instead are constructed using complex sampling designs characterized by multi-stage cluster sampling and over-sampling of some groups. Incorrect results are obtained from statistical analysis if adjustments are not made for the sampling design. This study demonstrates how the use of weights and design effects affects the results of contingency tables and chi-square analysis of data from complex sampling designs.

Key words: Design effect, chi-square, weighting

Introduction

Many large-scale data sets used in educational research are constructed using complex designs characterized by multi-stage cluster sampling and over-sampling of some groups. Common statistical software packages such as SAS and SPSS yield incorrect results from such designs unless weights and design effects are used in the analysis (Broene & Rust, 2000; Thomas & Heck, 2001). The objective of this study is to demonstrate how the use of weights and design effects impacts the results of contingency tables and chi-square analysis of data from complex sampling designs.

This article is based upon work supported by the Association for Institutional Research (AIR), National Center for Education Statistics (NCES), and the National Science Foundation (NSF) through fellowship grants awarded to the authors to participate in the 2001 AIR Summer Data Policy Institute on Databases of NCES and NSF. Correspondence for this article should be sent to: David Walker, Northern Illinois University, ETRA Department, 208 Gabel, DeKalb, IL 60115, (815)-753-7886. E-mail him at: [email protected].

Methodology

In large-scale data collection, survey research applies varied sample design techniques. For example, in a single-stage simple random sample with replacement (SRS), each subject in the study has an equal probability of being selected. Thus, each subject chosen in the sample represents an equivalent total of subjects in the population. More common, however, is data collection via complex survey design (CSD) sampling, such as disproportional stratified sampling or cluster sampling, where subjects in the sample are selected with different probabilities. Each subject chosen in the sample represents a different number of subjects in the population (McMillan & Schumacher, 1997).

Complex designs often over-represent a particular subgroup, due to oversampling or selection with a higher probability, and consequently the sample does not accurately reflect proportional representation in the population of interest. This may afford more weight to a certain subgroup in the sample than would exist in the population.


As Thomas and Heck (2001) cautioned, "When using data from complex samples, the equal weighting of observations, which is appropriate with data collected through simple random samples, will bias the model's parameter estimates if there are certain subpopulations that have been oversampled" (p. 521).

The National Center for Education Statistics (NCES) conducts various national surveys that apply complex designs to projects such as the Beginning Postsecondary Students study (BPS), the National Educational Longitudinal Study of 1988 (NELS: 88), or the National Study of Postsecondary Faculty (NSOPF). Some statistical software programs, for instance SPSS or SAS, presuppose that data were accumulated through SRS. These programs tend not to apply sample weights to data amassed through complex designs as a default setting, but instead use raw expansion weights as a measure of acceptable sample size (Cohen, 1997; Muthen & Satorra, 1995). However, the complex sampling designs utilized in the collection of NCES survey data allocate larger comparative importance to some sampled elements than to others. To illustrate, a complex design identified by the NCES may have a sample selection where 1 subject out of 40 is chosen, which indicates that the selection probability is 1/40. The sample weight of 40, which is inversely proportional to the selection probability, indicates that in this particular case 1 sample subject equals 40 subjects in the population.

Because of the use of complex designs, sample weighting for disparate subject representation is employed to bring the sample variance into congruity with the population variance, which supports proper statistical inferences. The NCES incorporates as part of its data sets raw expansion weights to be applied with the data of study to ensure that the issues of sample selection by unequal probability sampling and biased estimates have been addressed. Relative weights can be computed from these raw expansion weights.

Because the NCES accrues an abundance of its data via CSD, the following formulae present how weights function:

Raw expansion weight (W_j):  Σ_{j=1}^{n} w_j = N    (1)

Weighted mean:  x̄ = Σ_{j=1}^{n} w_j x_j / Σ_{j=1}^{n} w_j    (2)

Mean weight:  w̄ = Σ_{j=1}^{n} w_j / n    (3)

Relative weight:  w_j / w̄    (4)

Notes: n = sample size, j indexes subject responses, w_j = raw weight, x_j = variable value, N = population size.

The raw expansion weight is the weight that many statistical software programs use as a default setting and should be avoided when working with the majority of NCES data. Instead, the relative weight should be used when conducting statistical analyses with NCES complex designs.

Furthermore, the lack of sample weighting with complex designs causes inaccurate estimates of population parameters. Variance estimates that underestimate the true variance of the population induce problems of imprecise confidence intervals, larger than expected degrees of freedom, and an inflation of Type I errors (Carlson, Johnson, & Cohen, 1993; Lee, Forthofer, & Lorimor, 1989).

The design effect (DEFF) indicates how the sampling design influences the computation of the statistics under study and accommodates the miscalculation of sampling error. As noted previously, since statistical software programs often produce results based on the assumption that SRS was implemented, DEFF is used to adjust for these inaccurate variances. DEFF, as defined by Kish (1965), is the ratio of the variance of a statistic from a CSD to the variance of a statistic from a SRS:

DEFF = SE²_CSD / SE²_SRS    (5)

The size of DEFF is tied to conditions such as the variables of interest and the attributes of the clusters used in the design (i.e., the extent of in-cluster homogeneity). A DEFF greater than 1.0 connotes that the sampling design decreases the precision of estimates compared to SRS, and a DEFF less than 1.0 indicates that the sampling design increases the precision of estimates compared to SRS (Kalton, 1983; Muthen & Satorra, 1995). As Thomas and Heck (2001) stated, "If standard errors are underestimated by not taking the complex sample design into account, there exists a greater likelihood of finding erroneously 'significant' parameters in the model that the a priori established alpha value indicates" (p. 529).

Procedures

Three variables were selected from the public-use database of the National Education Longitudinal Study of 1988 to demonstrate the impact of weights and design effects on contingency tables and chi-square analysis. A two-stage cluster sample design was used in NELS: 88, whereby approximately 1,000 eighth-grade schools were sampled from a universe of approximately 40,000 public and private eighth-grade schools (first stage) and 24 eighth-grade students were randomly selected from each of the participating schools (second stage). An additional 2 to 3 Asian and Hispanic students were selected from each school, which resulted in a total sample of approximately 25,000 eighth-grade students in 1988. Follow-up studies were conducted on subsamples of this cohort in 1990, 1992, 1994, and 2000. Additional details on the sampling methodology for NELS: 88 are contained in a technical report from the U.S. Department of Education (1996).

The three variables used in this example are F2RHMA_C (total Carnegie Units in mathematics taken in high school), RMATH (flag for whether one or more courses in remedial math were taken since leaving high school), and F3TRSCWT (1994 weight to be used with 1992 transcript data). Five categories for the number of Carnegie Units of math taken in high school were created (up through 1.99, 2.00 through 2.99, 3.00 through 3.99, 4.00 through 4.99, 5.00 or more). The other variable of interest was whether a student had taken a postsecondary remedial math course by the time of the 1994 follow-up study. Four chi-square contingency tables were developed for these two variables using SPSS. Differences in the four tables are due to the use of weights and DEFF.

Only those observations where RMATH > 0 and F3TRSCWT > 0 were selected for this analysis, which resulted in 6,948 students. Although there were 14,915 students in the 1994 follow-up of NELS: 88, only 12,509 had high school transcript data (F3TRSCWT > 0) from which F2RHMA_C was obtained. Of these, 6,948 participated in post-secondary education by the time of the third follow-up in 1994.

Missing values were not a problem with RMATH. Of the 14,915 students in the 1994 follow-up of NELS: 88, 6,943 had a legitimate missing value because they had not participated in postsecondary education (i.e., not of interest for this paper), 16 had missing values, and 7,956 had a value (yes or no) for postsecondary remedial math.

There were some missing values for high school transcript data, but the transcript weight (F3TRSCWT) provided in NELS: 88 takes missing transcript data into account. The Carnegie units of high school math (F2RHMA_C) came from high school transcript data. There were 14,915 students in the 1994 follow-up of NELS: 88; however, only 12,509 had high school transcript data. That is why NCES provides a separate weight (F3TRSCWT) that is to be used specifically with variables from high school transcript data. This weight has already been adjusted by NCES for missing high school transcript observations. Of the 7,956 students with a value for RMATH, 1,008 did not have high school transcript data. These 1,008 students were not included in the analysis presented here (7,956 − 1,008 = 6,948 students for analysis in this paper). After selecting the 7,956 students with a value for RMATH, only those observations with F3TRSCWT > 0 were selected. No further adjustment was necessary for missing values since F3TRSCWT had already been adjusted by NCES for missing values.

Effect sizes are reported for each chi-square statistic addressed in the research. For the chi-square statistic, a regularly used effect size is based on the coefficient of contingency (C), which is not a true correlation but a "scaled" chi-square (Sprinthall, 2000). As a caveat with the use of C, it has been noted that its highest value cannot attain 1.00, as is common with other effect sizes, which makes comparison with similar effect sizes difficult.
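As a small illustration of equations (1)-(4), the sketch below (not part of the original article) computes relative weights from a raw expansion weight. The column name F3TRSCWT follows the NELS: 88 variable discussed above; the data frame itself is hypothetical.

```python
# Compute relative weights from a raw expansion weight column.
import pandas as pd

def add_relative_weight(df, raw_weight_col="F3TRSCWT"):
    mean_weight = df[raw_weight_col].mean()          # w-bar, equation (3)
    out = df.copy()
    out["REL_WT"] = out[raw_weight_col] / mean_weight  # equation (4)
    return out

# The relative weights sum to the sample size n, whereas the raw expansion
# weights sum (approximately) to the population size N, as in equation (1).
```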

In fact, C has a maximum approaching 1.0 only for large tables. In tables smaller than 5 x 5, C may underestimate the level of association (Cohen, 1988; Ferguson, 1966). As an alternative to C, Sakoda's Adjusted C (C*) may be used, which varies from 0 to 1 regardless of table size. For chi-square related effect sizes, Cohen (1988) recommended that .10, .30, and .50 represent small, medium, and large effects.

C = SQRT [χ² / (χ² + n)]    (6)

C* = C / SQRT [(k − 1)/k]    (7)

where k = number of rows or columns, whichever is smaller.

Results

A total of 6,948 observations met the selection criteria (i.e., availability of high school transcripts and participation in post-secondary education by the time of the third follow-up in 1994). The first contingency table (Table 1), without any weights or design effects, has a total count of 6,948 and a chi-square value of 130.92. This table is useful for determining minimum cell sizes, but the percentages in each of the cells and the overall chi-square (130.92) are incorrect because the sample observations were not weighted to represent the population.
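The following short sketch (not part of the original article) implements equations (6) and (7). Using the Table 1 values (chi-square = 130.92, n = 6,948, k = 2) should reproduce approximately the C = .136 and C* = .192 reported in that table.

```python
# Coefficient of contingency C and Sakoda's adjusted C*.
import math

def contingency_c(chi_sq, n):
    return math.sqrt(chi_sq / (chi_sq + n))                      # equation (6)

def sakoda_c_star(chi_sq, n, k):
    # k = number of rows or columns, whichever is smaller
    return contingency_c(chi_sq, n) / math.sqrt((k - 1) / k)     # equation (7)

print(round(contingency_c(130.92, 6948), 3),
      round(sakoda_c_star(130.92, 6948, 2), 3))   # approximately 0.136 and 0.192
```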

Table 1. Contingency Table Without Weights: Carnegie Units of High School Math by Postsecondary Education (PSE) Remedial Math. χ2(4) = 130.92, C = .136, 95% CI (.112, .160), C* = .192, 95% CI (.168, .216).

PSE Remedial Math Units HS Math Yes No Row Total

0 – 1.99 Count 84 231 315 % of Grand Total 1.2% 3.3% 4.5%

2 – 2.99 Count 215 661 876 % of Grand Total 3.1% 9.5% 12.6%

3 – 3.99 Count 495 1,875 2,370 % of Grand Total 7.1% 27.0% 34.1%

4 – 4.99 Count 382 2,504 2,886 % of Grand Total 5.5% 36.0% 41.5%

>= 5 Count 43 458 501 % of Grand Total 0.6% 6.6% 7.2%

Column Total Count 1,219 5,729 6,948 % of Grand Total 17.5% 82.5% 100.0%


Asian and Hispanic students were over-sampled in NELS: 88, so the sample contained higher proportions of these ethnic groups than did the reference population. Sampling weights must be applied to the observations to adjust for the over-sampling. In contrast, a chi-square table without weights or design effects is appropriate for a simple random sample because each observation represents the same number of cases in the population.

The variable F3TRSCWT, a raw expansion weight, is used as the weight in Table 2. This is one of several raw expansion weights provided by NCES, and it is the weight to be used when analyzing variables from the 1994 follow-up (e.g., RMATH) in conjunction with high school transcript variables such as F2RHMA_C. The raw expansion weight is the number of cases in the population that the observation represents. Unlike simple random sampling, the weights are not the same for each subject. The weights for these 6,948 observations range from 7 to 12,940 with a mean of 228.50. The total count of 1,587,646 in this table represents the number of students from the 1988 eighth-grade cohort that met the selection criteria. This table contains correct population counts and percentages in the cells; however, the overall chi-square (27,500.88) is too high because the cell sizes are overstated. The cell sizes represent counts of the population rather than the sample.

Table 2. Contingency Table With Raw Expansion Weight F3TRSCWT: Carnegie Units of High School Math by Postsecondary Education (PSE) Remedial Math. χ2(4) = 27,500.88, C = .130, 95% CI (.128, .132), C* = .184, 95% CI (.182, .186).

PSE Remedial Math Units HS Math Yes No Row Total

0 – 1.99 Count 24,353 63,532 87,885 % of Grand Total 1.5% 4.0% 5.5%

2 – 2.99 Count 53,767 167,485 221,252 % of Grand Total 3.4% 10.5% 13.9%

3 – 3.99 Count 118,230 427,763 545,993 % of Grand Total 7.4% 26.9% 34.4%

4 – 4.99 Count 81,325 537,884 619,209 % of Grand Total 5.1% 33.9% 39.0%

>= 5 Count 14,951 98,356 113,307 % of Grand Total 0.9% 6.2% 7.1%

Column Total Count 292,626 1,295,020 1,587,646 % of Grand Total 18.4% 81.6% 100.0%

The relative weight of F3TRSCWT is used in Table 3 to bring the cell counts in Table 2 back into congruence with the sample counts. For each of the 6,948 observations, the relative weight of F3TRSCWT is computed by dividing F3TRSCWT by 228.50, which is the mean of F3TRSCWT for the 6,948 observations. The total count in Table 3 is 6,947, which differs


from Table 1 only because of rounding (note: although displayed in whole numbers by SPSS, Table 3 actually contains fractional numbers of observations in each cell). Table 3 contains correct cell percentages, but the cell sizes and chi-square (120.62) are overstated due to the two-stage clustered sample design of NELS: 88.
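The sketch below (not the authors' SPSS/SAS code) shows how the weighting steps described above could be reproduced: build a weighted contingency table using relative weights, optionally divided by an average DEFF, and compute the chi-square. The column names (including F3TRSCWT) follow the variables discussed in the text; the data frame and the row/column variable names passed in are hypothetical.

```python
# Weighted contingency table and chi-square under relative weights and DEFF.
import pandas as pd
from scipy.stats import chi2_contingency

def weighted_chi_square(df, row, col, raw_weight="F3TRSCWT", deff=1.0):
    # Relative weight (equation 4), optionally divided by an average DEFF
    weights = (df[raw_weight] / df[raw_weight].mean()) / deff
    table = (df.assign(_w=weights)
               .groupby([row, col])["_w"].sum()
               .unstack(fill_value=0))
    chi2, p, dof, _ = chi2_contingency(table)
    return table, chi2, p, dof

# deff=1.0 corresponds to a Table-3-style analysis; deff=2.94 gives the
# effective-sample-size analysis reported in Table 4.
```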

Table 3. Contingency Table With Relative Weight = F3TRSCWT / 228.5. Carnegie Units of High School Math by Postsecondary Education (PSE) Remedial Math. χ2(4) = 120.62, C= .131, 95% CI (.107, .155), C* = .185, 95% CI (.161, .209).

PSE Remedial Math Units HS Math Yes No Row Total

0 – 1.99 Count 107 278 385 % of Grand Total 1.5% 4.0% 5.5%

2 – 2.99 Count 235 733 968 % of Grand Total 3.4% 10.6% 13.9%

3 – 3.99 Count 517 1,872 2,389 % of Grand Total 7.4% 26.9% 34.4%

4 – 4.99 Count 356 2,354 2,710 % of Grand Total 5.1% 33.9% 39.0%

>= 5 Count 65 430 495 % of Grand Total 0.9% 6.2% 7.1%

Column Total Count 1,280 5,667 6,947 % of Grand Total 18.4% 81.6% 100.0%

Table 4 was obtained by dividing the relative weight for F3TRSCWT by the NELS: 88 average DEFF (2.94), extrapolated via Taylor series methods, which resulted in effective cell sizes with correctly weighted cell counts and proportions and the appropriate overall chi-square (40.81) for this clustered design. The counts in Table 4 are the effective sample size after accounting for the clustered sample design (i.e., a sample of 6,948 from this clustered design is equivalent to a sample of 2,363 randomly selected students). Essentially, a mean DEFF of 2.94 tells us that if an SRS design had been conducted, only 33% as many subjects as under the CSD would have been necessary to observe the statistic of study. DEFFs that range between 1.0 and 3.0 tend to be indicative of a well-designed study.

The current study's DEFF of 2.94 indicated that the variance of the NELS: 88 estimates was increased by 194% due to variations in the weights. The square root of DEFF, the DEFT, yields the degree by which the standard error has been increased by the CSD. The DEFT (1.71) implied that the standard error was 1.71 times as large as it would have been had the present results been obtained through an SRS design; that is, the standard error was increased by 71%. An intra-class correlation coefficient (ICC) of .20 or less is desirable for indicating the level of association between the responses of the members in a cluster. Since an ICC was not used in the computation of the Taylor series-derived average DEFF for NELS: 88, an estimated average ICC was calculated from the following formula for determining DEFF:

DEFF = 1 + δ(n − 1),    (8)

where δ is the ICC and n is the typical size of a cluster (Flores-Cervantes, Brick, & DiGaetano, 1999). The low ICC (.0844) indicated that members of the same cluster were, on average, only about 8% more likely to share corresponding characteristics than if compared to another member selected randomly from the population.
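A small check of equation (8) follows (not part of the original article): solving DEFF = 1 + δ(n − 1) for δ. Assuming the typical cluster size of 24 students per school noted in the Procedures section, the NELS: 88 average DEFF of 2.94 gives an ICC close to the .0844 reported above.

```python
# Back out the intra-class correlation from a design effect and cluster size.
def icc_from_deff(deff, cluster_size):
    return (deff - 1) / (cluster_size - 1)

print(round(icc_from_deff(2.94, 24), 4))  # approximately 0.0843
```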

Table 4. Contingency Table With Weight = (F3TRSCWT / 228.5) / 2.94: Carnegie Units of High School Math by Postsecondary Education (PSE) Remedial Math. χ2(4) = 40.81, C = .130, 95% CI (.090, .170), C* = .184, 95% CI (.144, .224).

PSE Remedial Math Units HS Math Yes No Row Total

0 – 1.99 Count 36 95 131 % of Grand Total 1.5% 4.0% 5.5%

2 – 2.99 Count 80 249 329 % of Grand Total 3.4% 10.5% 13.9%

3 – 3.99 Count 176 637 813 % of Grand Total 7.4% 27.0% 34.4%

4 – 4.99 Count 121 801 922 % of Grand Total 5.1% 33.9% 39.0%

>= 5 Count 22 146 168 % of Grand Total 0.9% 6.2% 7.1%

Column Total Count 435 1,928 2,363 % of Grand Total 18.4% 81.6% 100.0%

NELS: 88 used a clustered sample design, in which schools were randomly selected, and then students within those schools were randomly selected. Students selected from such a sampling design would be expected to be more homogeneous than students selected from a simple random design across all schools. The chi-square values from SPSS cross-tabulations and SAS Proc Freq tables presume simple random samples. One method for estimating the proper chi-square for the two variables under investigation from NELS: 88 is to divide the relative weight for F3TRSCWT by the average DEFF (2.94) and use the result as the weight in SPSS cross-tabulations or SAS Proc Freq. The results in Table 4 were obtained by such a computation, which yields effective cell sizes and correctly weighted proportions. Furthermore, the chi-square (40.81) is an appropriate approximation of the true chi-square for this clustered design. These are the values that should be used in a chi-square analysis of Carnegie Units of high school math by whether or not a student took a postsecondary education remedial math course. Notice that the cell counts and the total count in Table 4 are equal to those in Table 3 divided by 2.94. The counts in Table 4 are the effective sample size after accounting for the clustered sample design.

As was found with the chi-square statistics, weighting, or the lack thereof, also influenced effect size values. For example, the coefficient of contingency and Sakoda's Adjusted C in Table 1, where the default of no weighting occurred, had higher values than any of the reported C or C* estimates where a form of weighting transpired. It should be noted that the C values ranged from .130 to .136, and adjusted C from .184 to .192, which means that regardless of weighting scheme, or none at all, the practical implication of the chi-square statistics was that they had a small effect. Thus, although the chi-square statistics were all statistically significant, they had a small effect, which indicates that the results derived from the chi-square statistics would not be deemed very important practically or in terms of accounting for much of the total variance of the outcome.

Conclusion

Some sampling designs over-sample certain groups (i.e., their proportion in the sample is greater than their proportion in the population) in order to obtain sufficiently large numbers of observations in these categories so that statistical analyses can be conducted separately on these groups. When analyzing the entire sample, relative weights should be used to bring the sample proportions back into congruence with the population proportions. When clustered sample designs are used, relative weights should be divided by the DEFF to adjust for the fact that a sample from a clustered design is more homogeneous than if a simple random sampling scheme had been employed. The chi-square values from SPSS cross-tabulations and SAS Proc Freq tables presume simple random samples. Design effects must be used with such software in order to obtain an appropriate approximation of the true chi-square, and its accurate effect size, for a clustered design.

References

Broene, P., & Rust, K. (2000). Strengths and limitations of using SUDAAN, Stata, and WesVarPC for computing variances from NCES data sets (U.S. Department of Education, National Center for Education Statistics Working Paper No. 2000-03). Washington, DC: U.S. Department of Education.

Carlson, B. L., Johnson, A. L., & Cohen, S. B. (1993). An evaluation of the use of personal computers for variance estimation with complex survey data. Journal of Official Statistics, 9, 795-814.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Cohen, S. B. (1997). An evaluation of alternative PC-based software packages developed for the analysis of complex survey data. The American Statistician, 51, 285-292.

Ferguson, G. A. (1966). Statistical analysis in psychology and education (2nd ed.). New York: McGraw-Hill.

Flores-Cervantes, I., Brick, J. M., & DiGaetano, R. (1999). Report no. 4: 1997 NSAF variance estimation. http://newfederalism.urban.org/nsaf/methodology1997.html

Kalton, G. (1983). Introduction to survey sampling. In J. L. Sullivan & R. G. Niemi (Series Eds.), Quantitative applications in the social sciences. Beverly Hills, CA: Sage Publications.

Kish, L. (1965). Survey sampling. New York: John Wiley & Sons.

Lee, E. S., Forthofer, R. N., & Lorimor, R. J. (1989). Analyzing complex survey data. In M. S. Lewis-Beck (Series Ed.), Quantitative applications in the social sciences. Newbury Park, CA: Sage Publications.

McMillan, J. H., & Schumacher, S. (1997). Research in education: A conceptual introduction (4th ed.). New York: Longman.

Muthen, B. O., & Satorra, A. (1995). Complex sample data in structural equation modeling. In P. Marsden (Ed.), Sociological methodology (pp. 267-316). Washington, DC: American Sociological Association.

Sprinthall, R. C. (2000). Basic statistical analysis (6th ed.). Boston, MA: Allyn and Bacon.

Thomas, S. L., & Heck, R. H. (2001). Analysis of large-scale secondary data in higher education research: Potential perils associated with complex sampling designs. Research in Higher Education, 42, 517-540.

U.S. Department of Education. (1996). Methodology report, National Education Longitudinal Study: 1988-1994 (NCES 96-174). Washington, DC: Author.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 433-442 1538 – 9472/03/$30.00

Correcting Publication Bias In Meta-Analysis: A Truncation Approach

Guillermo Montes Bohdan S. Lotyczewski Children’s Institute, Inc.

Meta-analyses are increasingly used to support national policy decision making. The practical implications of publication bias in meta-analysis are discussed. Standard approaches to correcting for publication bias require knowledge of the selection mechanism that leads to publication. In this study, an alternative approach is proposed based on Cohen's corrections for a truncated normal. The approach makes fewer assumptions, is easy to implement, and performs well in simulations with small samples. The approach is illustrated with two published meta-analyses.

Key words: Meta-analysis, methods, truncation, publication bias

Address correspondence to: Guillermo Montes, Children's Institute, Inc., 274 N. Goodman St., Suite D103, Rochester, NY, 14607. E-mail: [email protected]. Telephone: 585.295.1000 ext 227.

Introduction

Publication bias presents possibly the greatest methodological threat to the validity of a meta-analysis. It can be caused by the biased and selective reporting of the results of a given study, or, more seriously, by the selective decision to publish the results of the study in the first place. Undetected publication bias is especially serious owing to the fact that the meta-analysis may not only lead to a spurious conclusion, but the aggregation of data may give the impression, with standard statistical methodology, that the conclusions are very precise. (Cooper & Hedges, 1994, p. 407)

With these words, Cooper and Hedges (1994) concluded their discussion on the detection and correction of publication bias in meta-analysis. For all its theoretical and practical importance, it is not often that one sees a meta-analysis corrected for publication bias. Undoubtedly, the reason is that the methodology available to address the problem (Vevea & Hedges, 1995; Hedges & Vevea, 1996; Cleary & Casella, 1997) is complex, not easily accessible to the average meta-analyst practitioner, and has been unable to make a strong practical case for its use. The problem is difficult because publication bias, by its own nature, is a phenomenon we know little about, and because it does not suffice to show that, theoretically, a corrected estimate exists. One must show that the correction performs better than the original biased statistics in small samples.

In spite of these practical problems, the struggle against the effects of publication bias should not be abandoned. The presence of publication bias can lead to an erroneous consensus regarding the efficacy of a class of interventions or the importance of a particular factor in a psychological process of interest. Moreover, because one cannot assume that the same level of publication bias exists across meta-analyses, even in related content areas, there is little solid ground on which to base comparisons across meta-analyses.

Not only is the scientific community in danger of conceding to the evidence what the evidence does not warrant; often social scientists are called to testify on critical allocations of funds and on the implementation of far-reaching social policies. Meta-analytic evidence plays an increasing role in those policy

discussions as legislators and other policy makers demand simple summaries of complex information. Therefore, publication bias can also lead to harm in the public policy arena.

To be widely used, a method for correcting publication bias in meta-analysis must meet the following criteria: 1) it must recover the true population parameters in large samples, 2) it must be an improvement over the biased sample statistics in small samples, and 3) it must be relatively easy to calculate and easy to use for the average meta-analytic practitioner.

Modeling Publication Bias: Two Approaches

Traditional approaches to correcting meta-analyses require some model for observed effect sizes that incorporates the selection process. Two aspects of such a model are given: the selection model and the effect size model (Hedges & Vevea, 1996). Typically, the effect size model has been constructed using the random effects model and assuming a normal compound distribution. The selection process is modeled as a complex weight function of the probability of obtaining significant results based on sample size. This approach is based on the notion that publication bias is directly related to the presence of significant results.

This approach, commonplace in the literature, has a number of problems. First, it is unclear whether significance is the only criterion that affects publication bias; effect sizes may be equally important, particularly when the sign is unexpected. Second, the selection process is an unknown and complex social phenomenon. Modeling publication bias as a function of a process we know little about seems unwise.

An alternative approach is to use a simple truncation model, based not on statistical significance but on effect size. After all, if publication bias is having an impact on the overall results of a meta-analysis, it is because the bias is systematically truncating one of the tails, typically the left tail, of the distribution of effects. Because the standard approach assumes normality, modeling publication bias with a truncated normal model may be a practical alternative to modeling selection processes without imposing additional unverified assumptions, at least until the selection processes are better understood.

The truncation approach is more practical than the standard approach for three reasons. First, detecting publication bias becomes an exercise in elementary statistics: is the observed distribution of effects normal, or is it missing one of the tails? Both a standard histogram of the observed distribution and the computation of the distance between the median and the mean in standard deviation units can be used to answer this question. Second, although we provided a rationale for our approach, the truncation model does not require us to specify a selection mechanism or to know how publication bias occurs. All we need to know is that there were no published studies below a particular effect size, and that the observed distribution of effects is skewed to the right. Truncation relies exclusively on the assumption of normality of the effect size model. Third, it simplifies the correction for publication bias considerably because it uses a long-established method already in use in other disciplines as the standard way to deal with the statistics of truncated phenomena. Since 1959, engineers, economists, cosmologists and physicists have used Cohen's (1959) estimates of the population mean and standard deviation of a truncated normal to investigate truncated phenomena.

Cosmologists observe only the brightest stars, engineers observe only products that meet tolerance checks, economists observe only portions of the income distribution, and particle physicists observe only the energy signature of higher energy particles. Similarly, highly effective programs are likely to be observed in the published literature, while less effective interventions with non-significant or negatively signed results are likely to remain unavailable. Meta-analysis can benefit from the research and development of truncation-related statistics in other fields. These include truncation regression, corrections for doubly truncated normals, and many others (Greene, 1990).
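The elementary screening step described above can be sketched in a few lines. The following is a hypothetical illustration (not the authors' code): it reports the mean-minus-median gap in standard deviation units and optionally draws a histogram of the observed effect sizes.

```python
# Screen a set of observed effect sizes for possible left truncation.
import numpy as np
import matplotlib.pyplot as plt

def truncation_screen(effects, plot=False):
    effects = np.asarray(effects, dtype=float)
    gap = (effects.mean() - np.median(effects)) / effects.std(ddof=1)
    if plot:
        plt.hist(effects, bins="auto")
        plt.xlabel("observed effect size")
        plt.ylabel("count")
        plt.show()
    return gap  # clearly positive values suggest right skew / a missing left tail
```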


Correcting for Publication Bias in Meta-Analysis

Assume that the distribution of effects is normal. One can model the distribution of effects in a variety of ways, but the simplest method is to posit a compound normal in which each study is a realization of a normal distribution with mean ∆, where ∆ represents the true effect size of each actual intervention. Each true intervention effect size ∆ is itself a random variate of a normal distribution with mean µ, where µ represents the true effect size of a class of interventions. The resulting distribution of effects is a compound normal distribution, N(∆, σ) ∧ N(µ, σ′). It can be shown (Johnson, Kotz, & Balakrishnan, 1994) that such a distribution is also normal, N(µ, σ² + σ′²).

Consider now the presence of publication bias. For the reasons described above, effect sizes below some level T are unlikely to be published. The resulting observable distribution of effect sizes will be a truncated normal. Truncation of the left tail of a normal distribution produces the following effects: 1) the sample mean will overestimate the true mean, and 2) the sample standard deviation will underestimate the true standard deviation. In other words, publication bias will result in the systematic overestimation of average effect sizes and the lowering of the associated standard deviation, producing the illusion of precision that Cooper and Hedges (1994) described as one of the greatest threats to the validity of meta-analysis.

Correction for truncation

Cohen (1959) first developed estimation procedures to recover the mean and standard deviation from a truncated observed normal distribution. Equations 1-5 describe the process. First, calculate the left-hand side of equation 5, using the minimum observed value in the truncated distribution as a proxy variable for T. Then solve for ξ and calculate θ(ξ). There are two ways of making the process less painful: one can look up the value of θ(ξ) in Cohen's book (1991, Table 2.1), or one can use a numerical solver, now standard in many applications, to solve numerically for ξ. Once θ(ξ) is known, calculate µ_R and σ²_R using equations 1 and 2. Note that the estimated degree of truncation is simply Φ(ξ).

µ_R = x̄ − θ(ξ)(x̄ − T)    (1)

σ²_R = s² + θ(ξ)(x̄ − T)²    (2)

ξ = (T − µ_R) / σ_R    (3)

θ(ξ) = Q(ξ) / [Q(ξ) − ξ],  where  Q(ξ) = φ(ξ) / [1 − Φ(ξ)]    (4)

s² / (x̄ − T)² = [1 − Q(ξ)(Q(ξ) − ξ)] / [Q(ξ) − ξ]²    (5)

Cohen's formulas for the 95% confidence intervals around the mean and standard deviation are:

φ₁₁ = 1 − Q(Q − ξ)
φ₁₂ = Q[1 − ξ(Q − ξ)]
φ₂₂ = 2 + ξφ₁₂
µ₁₁ = φ₂₂ / (φ₁₁φ₂₂ − φ₁₂²)
µ₂₂ = φ₁₁ / (φ₁₁φ₂₂ − φ₁₂²)
V(µ_R) = (σ²_R / N) µ₁₁
V(σ_R) = (σ²_R / N) µ₂₂
95% CI for µ_R = (µ_R − 2 √V(µ_R), µ_R + 2 √V(µ_R))

where Q is evaluated at ξ.
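The procedure can be solved numerically rather than with Cohen's table. The sketch below (not the authors' code, and written against the equation numbering reconstructed above) estimates the corrected mean, corrected standard deviation, and degree of truncation from an observed set of effect sizes.

```python
# Cohen's (1959) correction for a singly left-truncated normal, solved numerically.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def Q(xi):
    # Hazard function of the standard normal: phi(xi) / (1 - Phi(xi))
    return norm.pdf(xi) / norm.sf(xi)

def cohen_truncation_correction(effects):
    effects = np.asarray(effects, dtype=float)
    xbar, s2, T = effects.mean(), effects.var(ddof=1), effects.min()
    y = s2 / (xbar - T) ** 2                    # left-hand side of equation (5)
    if y >= 1:                                  # no admissible solution: little or no truncation
        return xbar, np.sqrt(s2), 0.0
    f = lambda xi: (1 - Q(xi) * (Q(xi) - xi)) / (Q(xi) - xi) ** 2 - y
    xi = brentq(f, -10, 10)                     # bracket is wide enough for typical effect-size data
    theta = Q(xi) / (Q(xi) - xi)                # equation (4)
    mu_r = xbar - theta * (xbar - T)            # equation (1)
    sigma_r = np.sqrt(s2 + theta * (xbar - T) ** 2)   # equation (2)
    return mu_r, sigma_r, norm.cdf(xi)          # Phi(xi) = estimated degree of truncation
```

The same algebra can be run from summary statistics (x̄, s, T) when raw effect sizes are not available, as in the illustrative examples later in the article.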

A Large-Sample Example

To illustrate the process, consider a class of interventions whose true effect size is 0.4 with a 0.8 standard deviation. Because of the large standard deviation, if 2000 studies were performed on this class of interventions, one would expect that 20% of the studies would have negative results, some of them with considerable effect sizes (e.g., -0.8 and below). Assume that there is no theoretical explanation for a negative effect size, so studies showing negative effect sizes are unlikely to be published. In the random sample we generated, that would leave 1393 publishable studies, some of them reporting non-significant results.

A meta-analysis performed on the 1393 studies would yield a biased sample mean of 0.82 with a biased standard deviation of 0.55. By all accounts, this class of interventions would be deemed to have large effects. Cohen's corrected estimates are 0.475 [95% CI (0.3723, 0.5776)] for the mean and 0.77 [95% CI (0.709, 0.829)] for the standard deviation. As can be seen, these estimates are quite close to the true values of 0.4 and 0.8. As mentioned before, once the original mean and standard deviation have been recovered, one can calculate the degree of truncation by simply evaluating the cumulative normal with the recovered mean and standard deviation at the truncation point, Φ(ξ). In this case, the degree of truncation was 26.84%.

Behavior of the Estimator in Small Samples

For Cohen's estimates to be useful in the correction of meta-analysis publication bias, they need to perform adequately in small samples. The standard criterion of using 95% confidence intervals does not seem appropriate in this small-sample context. Some severely truncated samples will have sample sizes below 15 observations, and therefore we expect that the 95% confidence interval of the corrected mean will contain the biased sample estimate. Other approaches to correcting publication bias have the same problem (Vevea & Hedges, 1995). Therefore, we studied the direct improvement of using Cohen's formulas in terms of distance to the true parameters.

The population parameters were picked to represent meta-analytic results of importance both for scientific and policy purposes. We chose a large effect (0.8) with a relatively small standard deviation (0.4) and a total sample size of 100 published and unpublished studies (of which only a few will be published under high truncation).

Maxwell and Cole (1995) stated that "simulation studies are experiments and must be described and interpreted in this light". Therefore, we will use the language of experiments to describe our simulations. Table 1 shows the result of an experiment designed to answer seven questions and analyze how the answers vary as the truncation level increases:

• Question 1: What is the average sample bias for µ?
• Question 2: What is the average sample bias for σ?
• Question 3: What would be the average number of studies published?
• Question 4: What is the average error in correction for µ using Cohen's estimates?
• Question 5: What is the average error in correction for σ using Cohen's estimates?
• Question 6: On average, by how much do we benefit by performing the correction?
• Question 7: In what percentage of samples would the meta-analyst practitioner benefit from using Cohen's estimates?

To answer these questions, we simulated 10,000 samples from a normal distribution of effect sizes with mean = 0.8 and sd = 0.4. We then truncated each sample to create an observed distribution, using four different points of truncation ranging from almost no truncation (two standard deviations below the mean) to severe truncation (one standard deviation above the mean). Then we used Cohen's (1959) formulas to estimate the corrected mean and standard deviation.

To answer questions 6 and 7, we defined an improvement measure as the ratio of two distances. The numerator is the distance between the sample moment from the biased distribution and the corresponding true value. The denominator is the distance between the corrected estimate and the true value. We used the absolute value measure of distance (although in the next section we also ran simulations with the Euclidean distance, without substantial differences).

IMPROVE_µ = Dist(x̄, µ) / Dist(µ_R, µ) = |x̄ − µ| / |µ_R − µ|

for the mean; and

IMPROVE_σ = Dist(s, σ) / Dist(σ_R, σ) = |s − σ| / |σ_R − σ|

for the standard deviation.

An improvement factor below one indicates that the correction gets us farther away from the true value; an improvement factor of one indicates that the correction does as badly as the biased sample moments; finally, improvement factors higher than one indicate how much closer the correction for truncation gets us to the true mean (e.g., a value of 2 would indicate that Cohen's correction gets us two times closer to the real mean than the biased estimates do). Since the improvement factors are always positive, their distribution is not likely to be symmetric; therefore, we report the median improvement as the preferred measure of central tendency. This median will be the answer provided to question 6. Because it is possible to have large average improvements while the majority of the samples would not be improved by using Cohen's corrections, question 7 asks the proportion of the 10,000 samples that benefit from the correction. Benefit is defined as having an improvement factor strictly higher than one. It is a measure of the risk that an average meta-analyst practitioner incurs by correcting the estimates of her study.
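The experimental design above could be sketched as follows (a hypothetical illustration, not the authors' code). It reuses the cohen_truncation_correction() sketch given earlier and returns the median improvement factor for µ and the proportion of samples that benefit, i.e., the quantities asked for in Questions 6 and 7.

```python
# Simulate truncated effect-size samples and compute improvement factors.
import numpy as np

def improvement_experiment(mu=0.8, sigma=0.4, T=0.8, n_studies=100,
                           n_sims=10_000, seed=1):
    rng = np.random.default_rng(seed)
    factors = []
    improved = 0
    for _ in range(n_sims):
        effects = rng.normal(mu, sigma, n_studies)
        observed = effects[effects > T]          # publication truncation at T
        if observed.size < 3:                    # skip degenerate samples
            continue
        mu_r, _, _ = cohen_truncation_correction(observed)
        factor = abs(observed.mean() - mu) / abs(mu_r - mu)
        factors.append(factor)
        improved += factor > 1
    return np.median(factors), improved / len(factors)

# T = mu corresponds to the "Serious" truncation column of Table 1.
```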

Table 1. Results of Experiment 1.

                                        Almost none   Mild         Serious    Severe
Truncation point (T)                    µ − 2σ = 0    µ − σ = 0.4  µ = 0.8    µ + σ = 1.2
Truncation level Φ(T)                   0.023         0.1586       0.5        0.841
Average observed sample size            97.73         84.12        50.03      15.88
Average sample bias (for µ)             0.022         0.114        0.319      0.610
Average sample bias (for σ)             -0.024        -0.084       -0.161     -0.227
Average error in correction (for µ)     -0.017        -0.040       -0.150     0.116
Average error in correction (for σ)     0.0123        0.016        0.033      -0.056
Median improvement factor (for µ)       1.1724        2.147        1.889      4.233
Median improvement factor (for σ)       1.375         2.378        2.202      2.597
% of samples that benefited (µ)         52.83%        75.84%       72.70%     100%
% of samples that benefited (σ)         56.68%        80.90%       80.80%     96.4%
95% CI range                            0.48          0.939        0.921      11.29
Does CI contain sample mean?            100%          100%         100%       100%
Does CI contain true mean?              96.43%        96.93%       92.08%     85.68%

Simulations based on 10,000 random samples from the normal (0.8, 0.4) for each truncation level.

Answer to Question 1: The overestimation of µ increases with the truncation level, ranging from 0.02 to 0.609.

Answer to Question 2: The underestimation bias of the standard deviation increases as the truncation gets progressively worse, but it does so at a slower rate than the sample mean. It ranges from -0.02 to -0.22.

Answer to Question 3: The observed sample size varies from 97 (almost the 100 possible publications) to about 15 in the case of severe truncation. It is, of course, a linear function of the truncation level.

Answer to Question 4: The average errors made in correcting for µ ranged from 0.02 to 0.15 in absolute value, roughly increasing in a non-linear manner with the truncation level. At all levels of truncation, the average correction error was smaller than the corresponding average bias.

Answer to Question 5: The average error made in correcting for σ ranged from 0.01 to 0.05, roughly increasing in a non-linear manner with the truncation level. At all levels of truncation, the average correction error was smaller than the corresponding average bias.

Answer to Question 6: The median improvement from using the correction ranges from 1.17 to 4.23. In other words, Cohen's estimation method got us anywhere from 1.17 to four times closer to the true mean. The improvement function is a nonlinear function of the truncation level, increasing at early stages of truncation, decreasing until past the 0.5 truncation level, and then quickly ascending again. The median improvement for the standard deviation ranged from 1.27 to 2.59. Again, the function is nonlinear in the truncation level, although less dramatically so than the improvement for the mean.

Answer to Question 7: Regardless of the level of truncation, the correction for both µ and σ was beneficial in more than half of the cases. With mild truncation the proportion of samples that benefited from the correction was over 75%; there was a small decrease in the proportion of samples that benefit as truncation nears the 0.5 point, and then a dramatic increase, so that for serious truncation virtually all samples benefited from Cohen's correction. This nonlinear risk function was carefully investigated in the next section.

When almost no truncation is present (truncation level of 0.02), just under half of the samples did not benefit from Cohen's correction. At that small level of truncation, however, both the error of the correction and the bias are unlikely to have substantial scientific or policy implications. As truncation increases, both the chances of benefiting from using Cohen's correction and the improvement in terms of distance to the true parameters are sizeable. Therefore, if truncation is detected, the use of Cohen's estimates seems warranted even for small sample sizes.

We now turn our attention to investigating in detail how the proportion of samples that benefit from correction increases as a nonlinear function of the truncation level.

Proportion of Samples that Benefit from Cohen's Correction as a Function of Truncation

The experiment of the previous section showed that the proportions of samples that benefit from Cohen's corrections are nonlinear functions of the truncation level Φ(ξ). To investigate these nonlinear functions further, we generated 1000 random samples (µ = 0.8, σ = 0.4, N = 100) for each of 121 levels of truncation ranging from T = µ − 2σ = 0 to T = µ + σ = 1.2 at 0.01 intervals. We then plotted the percentage of samples that benefit from Cohen's correction for µ as a function of the truncation level, and proceeded similarly for σ. We repeated the process for different values of µ and σ, but the nonlinear pattern remained essentially unchanged. We employed the absolute value distance function in our improvement measure as before, but also generated a complete independent set of random samples and calculated the improvement factors using the standard Euclidean distance function. Figures 1 and 2 show the results.


Figure 1. Samples Improved by Correction: The Mean.

[Figure: percentage of samples improved (50–100%) plotted against truncation level (0.0–1.0), by type of distance used (absolute value vs. Euclidean). Based on 121 sets of 1000 samples with sample size 100; distances calculated on two independent runs.]


Figure 2. Samples Improved by Correction: The Standard Deviation.

[Figure: percentage of samples improved (50–100%) plotted against truncation level (0.0–1.0), by type of distance used (absolute value vs. Euclidean). Based on 121 sets of 1000 samples with sample size 100; distances calculated on two independent runs.]

Note the following patterns: 1) for all truncation levels, the proportion of samples that benefit from Cohen's correction for both µ and σ was over 50%; 2) for mild truncation levels, the proportion of samples that benefit from Cohen's correction increases quite rapidly until about Φ(ξ) = 0.25; 3) in the case of µ, the proportion of samples decreases until the Φ(ξ) = 0.65 truncation level and then rises dramatically to 100%; and 4) in the case of σ, the proportion of samples stabilizes at about 80% until past the Φ(ξ) = 0.5 truncation level and then rises dramatically to almost 100%.

Therefore, Cohen's estimates perform adequately in small samples, with over a 60% chance of obtaining a better point estimate through Cohen's estimation method. The correction seems to be particularly beneficial for the mild levels (Φ(ξ) ≈ 0.2) of truncation commonly believed to be present in meta-analysis.

Illustrative Examples

To demonstrate the applicability of the method, we have chosen two meta-analyses. The meta-analyses were previously published in Psychological Bulletin and contained the necessary data to make the corrections. We are not presenting the corrections as substantive revisions, but simply as illustrations of the method. The two meta-analyses show different levels of truncation.

Example 1: Mild Truncation

The first example is taken from Table 3 of the Yirmiya et al. (1998) meta-analysis comparing theory of mind abilities of individuals with autism, individuals with mental retardation, and normally developing individuals. The data used here refer only to the comparison of individuals with autism versus normally developing individuals. There were 22 effect sizes, with sample mean 1.1173, standard deviation 0.9667, median 1.030, and minimum value -0.40. The authors report different average statistics because they calculated a weighted average; we had no information to replicate the weights (the sample sizes of the studies).

The histogram of the observed distribution, and the fact that the median was smaller than the sample mean, revealed mild truncation on the left side. Cohen's corrections are as follows: corrected µ = 0.689, corrected σ = 1.258, degree of truncation 0.1933. Therefore, in this case the correction would cast some doubt on the large average effect differential between normally developing individuals and those with autism.

Example 2: No Truncation

The second example is taken from Appendix A of Rind, Tromovitch, and Bauserman's (1998) controversial meta-analysis on the assumed consequences of child sexual abuse using college samples. This is an example of real-world research in which it was easier to explain the lack of significant positive findings by using a number of methodological and theoretical arguments. Because of this, one would expect less truncation to have occurred.

Using the 56 studies, the average effect size is 0.0953, with a standard deviation of 0.0947 and a minimum observation of -0.25. The histogram revealed little or no truncation, as did the fact that the median was almost identical to the sample mean. The corrected mean was 0.09531; the point estimate is essentially identical to the uncorrected mean. The corrected standard deviation is 0.0948. The estimated degree of truncation was only 0.0001.

This example illustrates how some meta-analyses may suffer very little from publication bias because negative and positive results are interpretable in the context of new theories or methodological issues. It also suggests that at least part of the controversy regarding diverse meta-analytic findings from several types of studies may be due to the degree of publication bias.
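For readers who wish to reproduce Example 1 from its summary statistics, the following hypothetical usage sketch (not the authors' code) applies the same numerical solution of equations (1)-(5) given earlier. With the inputs reported above (22 effect sizes, mean 1.1173, SD 0.9667, minimum -0.40), it should return values close to the corrected µ = 0.689, σ = 1.258, and truncation of about 0.193 reported in the text; small numerical differences are expected.

```python
# Cohen's truncation correction computed from summary statistics.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def cohen_correction_from_summary(mean, sd, minimum):
    Q = lambda xi: norm.pdf(xi) / norm.sf(xi)          # standard normal hazard
    y = sd ** 2 / (mean - minimum) ** 2                # left-hand side of equation (5)
    xi = brentq(lambda t: (1 - Q(t) * (Q(t) - t)) / (Q(t) - t) ** 2 - y, -10, 10)
    theta = Q(xi) / (Q(xi) - xi)                       # equation (4)
    mu_r = mean - theta * (mean - minimum)             # equation (1)
    sigma_r = np.sqrt(sd ** 2 + theta * (mean - minimum) ** 2)   # equation (2)
    return mu_r, sigma_r, norm.cdf(xi)                 # corrected mean, SD, truncation

print(cohen_correction_from_summary(1.1173, 0.9667, -0.40))
```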
Journal of the underlying structure, given our wide Educational and Behavioral Statistics, 21, 4, ignorance of the phenomena and the increasing 299-332. complexity of the statistics, seems to us not be a Johnson, N. L., Kotz, S., & practical approach to a problem that has Balakrishnan, N. (1994). Continuos Univariate important policy ramifications. Given the Distributions. V. 1. 2nd edition. NY: John Wiley seriousness of the potential damage publication & sons, Inc. bias may be doing both to social science and to Maxwell, S. E., & Cole, D. A. (1995). public policy finding some correction procedure Tips for Writing (and Reading) Methodological that requires minimal assumption and is easy to Articles. Psychological Bulletin, 118, 2,193-198. use seems to us as a more responsible course of McEachern, W. A. (1994). Economics: action than ignoring the problem until a A Contemporary Introduction. Cincinnati, OH: complete solution has been found. South-Western Publishing Company. Rind, B., Tromovitch, P., Bauserman, R. References (1998). A meta-analytic examination of assumed properties of child sexual abuse using college Cleary, R. J., & Casella, G. (1997). An samples. Psychological Bulletin, 124, 22-53. Application of Gibbs Sampling to Estimation in Vevea, J. L., & Hedges, V. (1995). A Meta-Analysis: Accounting for Publication Bias. General Linear Model for Estimating Effect Size Journal of Educational and Behavioral in the Presence of Publication Bias. Statistics, 22, 2, 141-154. Psychometrika, 60, 3, 419-435. Cohen, A. C. (1991). Truncated and Yirmiya, N., Osnat, E., Shaked, M., & Censored Samples: Theory and Applications. Solomonica-Levi, D. (1998). Meta-Analyses New York, NY: Marcel Dekker Inc. Comparing Theory of Mind Abilities of Individuals With Autism, Individuals With Mental Retardation, and Normally Developing Individuals. Psychological Bulletin, 124, 3, 283- 307.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 443-450 1538 – 9472/03/$30.00

Comparison Of Viral Trajectories In AIDS Studies By Using Nonparametric Mixed-Effects Models

Chin-Shang Li, Department of Biostatistics, St. Jude Children's Research Hospital
Hua Liang, Department of Biostatistics, St. Jude Children's Research Hospital
Ying-Hen Hsieh, Department of Applied Mathematics, National Chung-Hsing University, Taichung, Taiwan
Shiing-Jer Twu, National Health Research Institutes, Taipei, Taiwan

The efficacy of antiretroviral therapies for human immunodeficiency virus (HIV) infection can be assessed by studying the trajectory of the changing viral load with treatment time, but estimation of viral trajectory parameters by using the implicit function form of linear and nonlinear parametric models can be problematic. Using longitudinal viral load data from a clinical study of HIV-infected patients in Taiwan, we described the viral trajectories by applying a nonparametric mixed-effects model. We were then able to compare the efficacies of highly active antiretroviral therapy (HAART) and conventional therapy by using Young and Bowman’s (1995) test.

Key words: AIDS clinical trial, HIV dynamics, longitudinal data, kernel regression, nonparametric mixed-effects model, viral load trajectory

Introduction

Surrogate viral markers, such as the amount of HIV RNA in the plasma (the amount of HIV RNA in the patient's plasma represents the patient's viral load), currently play important roles in clinical research evaluating antiviral therapies for the acquired immunodeficiency syndrome (AIDS). Before HIV RNA assays were developed in the mid-1990s, CD4+ cell counts served as the primary surrogate marker in AIDS clinical trials. Later, the amount of HIV RNA in the patient's plasma (viral load, measured as the copy number of the viral RNA) was shown to better predict the clinical outcome (Mellors et al., 1995; Mellors et al., 1996; Saag et al., 1996), and thus replaced CD4+ cell counts as the primary surrogate marker used in most AIDS clinical trials.

It is, therefore, important to characterize the trajectory that describes the change in viral load that occurs during antiviral treatment, because it is this trajectory that is commonly used to evaluate the efficacy of the treatment. For example, if the viral load is reduced, we may infer that the treatment has successfully suppressed the replication of the virus. The differences between the viral loads resulting from different antiviral treatments may be used to compare the antiviral activities of the treatments. Appropriate analysis of the viral load is therefore very important in HIV/AIDS drug development.

Chin-Shang Li and Hua Liang are Assistant Members, St. Jude Children's Research Hospital, Memphis, TN, 38105. E-mail: [email protected]; [email protected]. Ying-Hen Hsieh is Professor, Department of Applied Mathematics, National Chung-Hsing University, Taichung, Taiwan. Shiing-Jer Twu is Visiting Professor, NHRI Forum, National Health Research Institutes, Taipei, Taiwan. The research of C.S. Li and H. Liang was partially supported by Cancer Center Support Grant CA21765 from the National Institutes of Health and by the American Lebanese Syrian Associated Charities (ALSAC). Y.H. Hsieh was supported by grant DOH91-DC-1059 from the CDC of Taiwan. The authors thank Nanyu Wang for preparing the data analyzed.

In general, it is believed that the replication of the virus is suppressed at the beginning of an antiviral treatment, but recovery of the virus (called rebound) can occur in later stages of treatment, because of drug resistance or treatment failure. Some parametric models have been developed to describe the progression of AIDS phenomenologically; among the best known of these models are the exponential models (Ho et al., 1995; Wei et al., 1995). More recently, biomathematicians and biologists have proposed a variety of complicated models that include the use of differential equations. The use of these models has led to a deeper understanding of the pathogenesis of AIDS (e.g., Perelson & Nelson, 1999; Wu and Ding, 1999).

In recent years, the necessity for appropriate models has gained more importance with the widespread use of highly active antiretroviral therapy (HAART) to treat HIV/AIDS (Ghani et al., 2003). Numerous studies have shown that HAART is effective in extending the time taken from the diagnosis of HIV infection to AIDS or death in HIV-infected patients (e.g., Detels et al., 1998; Tassie et al., 2002) as well as reducing the likelihood of perinatal HIV transmission (Cooper et al., 2002). However, in many clinical practices, combination antiviral therapy has failed to completely and durably suppress HIV replication (e.g., Deeks et al., 1999).

To determine the efficacy of treatments in suppressing HIV replication in patients, the present study focuses on the following questions: (i) Given longitudinal viral load data, how can one identify a common feature of the antiviral activities of each treatment? (ii) How can we compare the antiviral efficacies of two different treatments? If we can answer question (ii), we may be able to demonstrate that the better treatment should be evaluated in a large-scale clinical study. However, it may be difficult to answer these questions by using existing parametric or semi-parametric methods. To sufficiently consider all of the information available from the observations, and to avoid the misspecification of parametric modeling, we will use a nonparametric mixed-effects model to analyze the longitudinal viral load data, and we will incorporate the local linear approximation technique developed by Wu and Zhang (2002). The test statistic proposed by Young and Bowman (1995) will then be used to answer question (ii).

The remainder of this paper is organized as follows. In Section 2, we give details of the proposed model, with the method of estimation, and use the test statistic of Young and Bowman (1995) to determine whether there is a difference between the effects of two treatments. In Section 3, we illustrate the use of the proposed methodology with longitudinal viral load data from 30 HIV-infected patients treated with HAART alone and another 30 patients treated with monotherapy or dual therapy. Some discussion is given in Section 4.

Methodology

Nonparametric Models and Estimation Methods
We fit the viral load trajectory data of HIV-infected patients receiving a treatment by using a nonparametric mixed-effects (NPME) model:

y_i(t) = log10{V_i(t)} = η(t) + v_i(t) + ε_i(t),   i = 1, 2, ..., n,   (2.1)

where V_i(t) is the number of copies of HIV-1 RNA per mL of plasma at treatment time t for the ith patient and y_i(t) is the corresponding value on the log10 scale; η(t) is the population mean function, also called the fixed-effects or population curve; v_i(t) are individual curve variations from the population curve η(t), and these variations are called random-effects curves; and ε_i(t) are measurement errors. We assume that v_i(t) and ε_i(t) are independent, in which v_i(t) can be considered as realizations of a mean 0 process with a covariance function γ(s, t) = E(v_i(s) v_i(t)), and ε_i(t) can be considered as realizations of an uncorrelated mean 0 process with variance σ²(t).

The population curve η(t) reflects the overall trend or progress of the treatment process in an HIV-infected population and, hence, can provide an important index of the population's response to a drug or treatment in a clinical or biomedical study, so in this paper we are mainly interested in estimating η(t). In addition, an individual curve s_i(t) = η(t) + v_i(t) can represent an individual's response to a treatment in a study, so a good estimate of s_i(t) would help the investigator to make better decisions about an individual's treatment management and would enable us to classify subjects on the basis of individual response curves. Similar models have been proposed by Shi et al. (1996) and Zeger and Diggle (1994) to describe CD4+ cell counts.

Let t_gij, j = 1, 2, ..., n_gi, be the design time points for the ith individual in treatment group g. Then, NPME model (2.1) becomes

y_gi(t_gij) = η_g(t_gij) + v_gi(t_gij) + ε_gi(t_gij),   j = 1, 2, ..., n_gi; i = 1, 2, ..., n_g; g = 1, 2.   (2.2)

Here, n_g is the number of subjects in treatment group g, and n_gi is the number of measurements made from subject i in treatment group g. We now wish to estimate η_g(t) and v_gi(t) simultaneously, via a local approximation of the NPME model (2.2), by using the local linear mixed-effects model approach of Wu and Zhang (2002), which combines linear mixed-effects (LME) models (Laird & Ware, 1982) and local polynomial techniques (Fan & Gijbels, 1996). For this purpose, we assume the existence of the second derivatives of η_g(t) and v_gi(t) at t, which are then approximated locally by linear (first-order Taylor) expansions as follows:

η_g(t_gij) ≈ η_g(t) + η′_g(t)(t_gij − t) ≡ X_gij^T β_g

and

v_gi(t_gij) ≈ v_gi(t) + v′_gi(t)(t_gij − t) ≡ X_gij^T b_gi,

where X_gij = (1, (t_gij − t))^T, β_g = (η_g(t), η′_g(t))^T, and b_gi = (v_gi(t), v′_gi(t))^T.

Consequently, the NPME model (2.2) can be approximated by the following model:

y_gij = X_gij^T (β_g + b_gi) + ε_gij,   j = 1, 2, ..., n_gi; i = 1, 2, ..., n_g; g = 1, 2,   (2.3)

which is called an LME model. Note that, for simplicity of notation, y_gij = y_gi(t_gij), ε_gij = ε_gi(t_gij), ε_gi = (ε_gi1, ..., ε_gin_gi)^T ~ N(0, Σ_gi), and b_gi ~ N(0, D_g), for Σ_gi = E(ε_gi ε_gi^T) and D_g = E(b_gi b_gi^T).

To estimate η_g(t) and v_gi(t), which are the first elements of β_g and b_gi, respectively, under the standard normality assumptions for b_gi, we can minimize the following objective function:

Σ_{i=1}^{n_g} { (y_gi − X_gi(β_g + b_gi))^T K_giλ^{1/2} Σ_gi^{−1} K_giλ^{1/2} (y_gi − X_gi(β_g + b_gi)) + b_gi^T D_g^{−1} b_gi + log|Σ_gi| },

where y_gi = (y_gi1, ..., y_gin_gi)^T; X_gi = (X_gi1, ..., X_gin_gi)^T; K_giλ = diag{K_λ(t_gi1 − t), ..., K_λ(t_gin_gi − t)} is the kernel weight of the residual term, with K_λ(·) = K(·/λ)/λ, in which K(·) is a kernel function; λ is a bandwidth selected by a leave-one-subject-out cross-validation approach (Wu & Zhang, 2002); and the term b_gi^T D_g^{−1} b_gi is a penalty term to account for the random effects b_gi, taking between-subject variation into account.

Thus, for given Σ_gi and D_g, the resulting estimators can be obtained as follows:

β̂_g = (Σ_{i=1}^{n_g} X_gi^T Ω_gi X_gi)^{−1} (Σ_{i=1}^{n_g} X_gi^T Ω_gi y_gi),

b̂_gi = (X_gi^T K_giλ^{1/2} Σ_gi^{−1} K_giλ^{1/2} X_gi + D_g^{−1})^{−1} X_gi^T K_giλ^{1/2} Σ_gi^{−1} K_giλ^{1/2} (y_gi − X_gi β̂_g),   (2.4)

where Ω_gi = K_giλ^{1/2} (K_giλ^{1/2} X_gi D_g X_gi^T K_giλ^{1/2} + Σ_gi)^{−1} K_giλ^{1/2}. As a result, the estimators of η_g(t) and v_gi(t) are η̂_g(t) = (1, 0) β̂_g and v̂_gi(t) = (1, 0) b̂_gi.

The unknown variance-covariance parameters in D_g and Σ_gi can be estimated by using maximum or restricted maximum likelihood, implemented by using the EM algorithm or the Newton-Raphson method (Davidian & Giltinan, 1995; Vonesh & Chinchilli, 1996).

Of particular interest are the comparative effects of the two treatments. Therefore, we need to compare the equality of the two population curves η_1(t) and η_2(t). To do this, we fit the model η_c(t) + v_cgi(t) to all data, where η_c(t) is the fixed-effects (population) curve for the data and v_cgi(t) are random-effects curves that deviate from η_c(t). As is done when estimating η_g(t) and v_gi(t), we can use the local linear approximation approach of Wu and Zhang (2002) to obtain the estimators, η̂_c(t) and v̂_cgi(t), of η_c(t) and v_cgi(t).
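To make the estimation step concrete, the Python sketch below implements a deliberately simplified version of the kernel-weighted local linear fit of a population curve alone, ignoring the random effects and within-subject covariance structure that the full estimator in (2.4) accounts for. The bandwidth and the toy data are illustrative assumptions; only the quartic kernel matches the choice reported later in Section 3.

    # Simplified sketch: kernel-weighted local linear estimate of a population
    # curve eta(t) from pooled longitudinal data, ignoring random effects and
    # within-subject correlation (the full estimator in (2.4) accounts for both).
    import numpy as np

    def quartic_kernel(u):
        # K(u) = (15/16)(1 - u^2)^2 on |u| <= 1
        return np.where(np.abs(u) <= 1, (15 / 16) * (1 - u ** 2) ** 2, 0.0)

    def local_linear_curve(t_obs, y_obs, t_grid, bandwidth):
        estimates = []
        for t0 in t_grid:
            w = quartic_kernel((t_obs - t0) / bandwidth)
            X = np.column_stack([np.ones_like(t_obs), t_obs - t0])
            W = np.diag(w)
            # Weighted least squares: beta = (X'WX)^{-1} X'Wy; first element is eta(t0)
            beta = np.linalg.pinv(X.T @ W @ X) @ (X.T @ W @ y_obs)
            estimates.append(beta[0])
        return np.array(estimates)

    # Usage with toy pooled data (t_obs, y_obs would be all observations in one group)
    rng = np.random.default_rng(1)
    t_obs = rng.uniform(0, 720, 400)
    y_obs = 4.5 - 2.0 * (1 - np.exp(-t_obs / 120)) + rng.normal(0, 0.4, 400)
    grid = np.linspace(0, 720, 25)
    eta_hat = local_linear_curve(t_obs, y_obs, grid, bandwidth=90.0)
    print(eta_hat.round(2))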

Our main concern is how to justify that the difference between the two population curves is statistically significant. To compare the effects of two treatments, we apply the following test statistic (Young & Bowman, 1995):

TS = Σ_{g=1}^{2} Σ_{t_gj ∈ T_g} {η̂_g(t_gj) − η̂_c(t_gj)}² / σ̂²,   (2.5)

where T_g = {all distinct times t_gi in treatment g} and

σ̂² = [ Σ_{g=1}^{2} Σ_{i=1}^{n_g} (n_gi − 1) σ̂_gi² ] / (n − Σ_{g=1}^{2} n_g)

is an estimator of the variance of the measurement error, with n = Σ_{g=1}^{2} Σ_{i=1}^{n_g} n_gi; the σ̂_gi² are obtained by using the first-order difference approach proposed by Rice (1984), as follows:

σ̂_gi² = [1 / (2(n_gi − 1))] Σ_{j=1}^{n_gi − 1} (y_gi[j+1] − y_gi[j])²,   i = 1, 2, ..., n_g; g = 1, 2.

If the two population curves are equal, that is, under the null hypothesis H0: η1(t) = η2(t), the distribution of the test statistic TS in (2.5) is approximated by aχ²(b) + c, where χ²(b) is a chi-squared distribution with b degrees of freedom. Moreover, a, b, and c are constants such that the mean, variance, and skewness of aχ²(b) + c are equal to the corresponding quantities of the test statistic TS, which can be calculated directly. The distribution of aχ²(b) + c is then used to calculate the p-value.

The standard error of the difference between the estimates for the two population curves can be computed as

se_diff(t) = se{η̂_1(t) − η̂_2(t)} = √(se_1²(t) + se_2²(t)),

where se_1(t) = se{η̂_1(t)} and se_2(t) = se{η̂_2(t)} are the standard errors of the estimates of the population curves, respectively. A reference band centered at the average of the two estimated curves, with half-width 2 × se_diff(t), can be used to see how much difference there is between the two treatment groups (Young and Bowman, 1995). Note that, theoretically, we should consider correlation when using the approach of Young and Bowman (1995), but we do not, purely for mathematical simplicity. Ignoring the correlation may lose some efficiency; however, as will be seen, for the real-life data analysis given in the next section there is a significant difference between the treatment effects of the two groups even when an independence structure is used. Considering correlation may increase power but seems unnecessary.

Results

The Analysis of Longitudinal Viral Load Data
In this section, we illustrate the practical use of the proposed methodology with longitudinal viral load data from HIV-infected patients. The data set we are using includes the longitudinal viral load data obtained from 30 HIV-infected patients who received monotherapy or dual therapy and 30 HIV-infected patients who received HAART in several hospitals in Taipei, Taiwan, between 1997 and 2002. These data are subsets of a much larger cohort of 1,195 HIV-infected patients in Taipei. Most of the 1,195 HIV-infected patients received diverse treatments, so, to ensure the validity of the comparison, we chose to use data from the patients treated with HAART who had never been given any other treatment regimen and non-HAART patients who had never been treated with HAART. Treatment durations varied, because patients began receiving treatment at different times during the study period. Figure 1 presents scatter plots of viral load (in log10 scale) against treatment durations for the HIV-1-positive patients.

Figure 1: Scatter plot of viral load (log10 of copy number of HIV RNA in plasma) versus duration of treatment with HAART (left) or non-HAART (right).


After excluding missing data, we have 208 complete viral load observations in the HAART group, of which 108 have a value less than 400, and 164 complete viral load observations in the non-HAART group, of which 69 have a value less than 400. If we use the criterion that a treatment is considered successful in its antiviral effect when the viral load is below 400, the success rates in the HAART and non-HAART groups are 51.9% and 42.1%, respectively.

For data analysis, we used the quartic kernel, K(u) = (15/16)(1 − u²)² I(|u| ≤ 1). The estimates of the two population curves are depicted in Figure 2. From Figure 2, we can see that the estimates of the two population curves have different patterns, although both decrease at the beginning of treatment. The estimated curve for the HAART group shows that the viral load is maintained at a constant level until the end of the treatment, whereas that for the non-HAART group shows that the viral load decreases sharply during the first 480 days, reaching its lowest point on day 480. However, after 480 days, the viral load increases, remains constant for a short time, and increases again at the end of the treatment.

A chi-squared test for the success rates of the two treatments gives a p-value of 0.07. It is hard to say that there is a significant difference between the effects of the two treatments, although the success rate in the HAART group is greater than that in the non-HAART group. Therefore, to look more closely at the difference between the effects of the two treatments, we use the procedure described in Section 2. The p-value obtained by using this method is less than 10^-4, which indicates that the two population curves are substantially different. To confirm this conclusion, we obtained a range of reference values and plotted them with our viral load trajectory estimates in Figure 2. The two estimated population curves deviate from the reference band, and the efficacy of HAART is seen to be almost significantly superior to that of the conventional therapy that does not include HAART.
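The p-value reported above rests on matching the mean, variance, and skewness of TS to those of aχ²(b) + c. A generic Python sketch of that matching step is given below; the moment values plugged in are placeholders, since the moments of TS for these data are computed from the fitted curves and are not reported here.

    # Sketch of the a*chi^2(b) + c approximation: choose a, b, c so that the mean,
    # variance, and skewness match those of the test statistic TS, then convert an
    # observed TS into a p-value. The numerical inputs below are placeholders.
    import numpy as np
    from scipy.stats import chi2

    def matched_chi2_pvalue(ts_obs, ts_mean, ts_var, ts_skew):
        # For a*chi2(b) + c (a > 0): mean = a*b + c, var = 2*a^2*b, skew = sqrt(8/b)
        b = 8.0 / ts_skew ** 2
        a = np.sqrt(ts_var / (2.0 * b))
        c = ts_mean - a * b
        # P(a*chi2(b) + c >= ts_obs) = P(chi2(b) >= (ts_obs - c) / a)
        return chi2.sf((ts_obs - c) / a, df=b)

    # Illustrative (made-up) moments of TS under H0 and an observed statistic:
    print(matched_chi2_pvalue(ts_obs=95.0, ts_mean=40.0, ts_var=120.0, ts_skew=0.9))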

Figure 2. Estimates of the two population curves.


Discussion

To determine the efficacy of antiviral treatments by using longitudinal viral load data, we applied nonparametric mixed-effects models to estimate the patterns of the viral trajectories in the two sampled populations. This approach avoids misspecification and, thus, the occurrence of an artificial bias. By combining the between-subject and within-subject information, the models we have proposed can parsimoniously capture the features of viral response to an antiviral therapy, such that the estimated curve is able to show common features of the antiviral activity.

In implementing the estimation of population curves, we used local linear regression and the bandwidth selection method proposed by Wu and Zhang (2002) to select the bandwidth. Besides the local linear methods applied in this article, the method of regression splines may also be implemented for parameter estimation. The regression spline approach transforms the models to standard linear mixed-effects models and is easy to implement by using existing software such as SAS and S-PLUS.

The result of our illustrative example indicates that HAART has effects that are significantly different from those of treatment that did not include HAART. At the beginning of treatment, non-HAART has strong antiviral activity, which is lacking with HAART. However, during the course of the treatment, the superiority of non-HAART lessens, and this therapy ultimately fails, whereas HAART maintains a constant effect throughout treatment. This maintenance of the viral load at a constant level confirms previous findings and is preferable to the fluctuating load resulting from non-HAART. This result confirms that HAART is worth continuing, despite its inability to suppress viral replication completely (Deeks & Martin, 2001).

Finally, the reference band covers a wider range of viral loads at the end of treatment, despite the increasing difference between the two estimated curves. This is not surprising, because of the smaller sample size resulting from a shorter treatment duration for some patients at that time.

References

Cooper, E. R., Charurat, M., Mofenson, L., Hanson, I. C., Pitt, J., Diaz, C., Hayani, K., Handelsman, E., Smeriglio, V., Hoff, R., Blattner, W., & Women and Infants' Transmission Study Group (2002). Combination antiretroviral strategies for the treatment of pregnant HIV-1-infected women and prevention of perinatal HIV-1 transmission. Journal of Acquired Immune Deficiency Syndromes, 29, 484-494.

Davidian, M. & Giltinan, D. M. (1995). Nonlinear models for repeated measurement data. London: Chapman and Hall.

Deeks, S. G. & Martin, J. N. (2001). Reassessing the goal of antiretroviral therapy in the heavily pre-treated HIV-infected patient. AIDS, 15, 117-119.

Deeks, S. G., Hecht, F. M., Swanson, M., Elbeik, T., Loftus, R., Cohen, P. T., & Grant, R. M. (1999). HIV RNA and CD4 cell count response to protease inhibitor therapy in an urban AIDS clinic: response to both initial and salvage therapy. AIDS, 13, F35-43.

Detels, R., Munoz, A., McFarlane, G., Kingsley, L. A., Margolick, J. B., Giorgi, J., Schrager, L. K., & Phair, J. P. (1998). Effectiveness of potent antiretroviral therapy on time to AIDS and death in men with known HIV infection duration. Multicenter AIDS Cohort Study Investigators. The Journal of the American Medical Association, 280, 1497-1503.

Fan, J. & Gijbels, I. (1996). Local polynomial modelling and its applications. London: Chapman and Hall.

Ghani, A. C., Donnelly, C. A., & Anderson, R. M. (2003). Patterns of antiretroviral use in the United States of America: analysis of three observational databases. HIV Medicine, 4, 24-32.

Ho, D. D., Neumann, A. U., Perelson, A. S., Chen, W., Leonard, J. M., & Markowitz, M. (1995). Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature, 373, 123-126.

Laird, N. M. & Ware, J. H. (1982). Random effects models for longitudinal data. Biometrics, 38, 963-974.

Mellors, J. W., Kingsley, L. A., Rinaldo, C. R., Todd, J. A., Hoo, B. S., Kotta, R. P., & Gupta, P. (1995). Quantitation of HIV-1 RNA in plasma predicts outcome after seroconversion. Annals of Internal Medicine, 122, 573-579.

Mellors, J. W., Rinaldo, C. R., Gupta, P., White, R. M., Todd, J. A., & Kingsley, L. A. (1996). Prognosis in HIV-1 infection predicted by the quantity of virus in plasma. Science, 272, 1167-1170.

Perelson, A. S. & Nelson, P. W. (1999). Mathematical analysis of HIV-1 dynamics in vivo. SIAM Review, 41, 3-44.

Rice, J. (1984). Bandwidth choice for nonparametric regression. The Annals of Statistics, 12, 1215-1230.

Saag, M. S., Holodniy, M., Kuritzkes, D. R., O'Brien, W. A., Coombs, R., Poscher, M. E., Jacobsen, D. M., Shaw, G. M., Richman, D. D., & Volberding, P. A. (1996). HIV viral load markers in clinical practice. Nature Medicine, 2, 625-629.

Shi, M., Weiss, R. E., & Taylor, J. M. G. (1996). An analysis of pediatric CD4+ counts for acquired immune deficiency syndrome using flexible random curves. Journal of the Royal Statistical Society, Series C, 45, 151-163.

Tassie, J. M., Grabar, S., Lancar, R., Deloumeaux, J., Bentata, M., & Costagliola, D. (2002). Clinical Epidemiology Group from the French Hospital Database on HIV. Time to AIDS from 1992 to 1999 in HIV-1-infected subjects with known date of infection. Journal of Acquired Immune Deficiency Syndromes, 30, 81-87.

Vonesh, E. F. & Chinchilli, V. M. (1996). Linear and nonlinear models for the analysis of repeated measurements. New York: Marcel Dekker.

Wei, X., Ghosh, S. K., Taylor, M. E., Johnson, V. A., Emini, E. A., Deutsch, P., Lifson, J. D., Bonhoeffer, S., Nowak, M. A., Hahn, B. H., Saag, M. S., & Shaw, G. M. (1995). Viral dynamics in human immunodeficiency virus type-1 infection. Nature, 373, 117-122.

Wu, H. & Ding, A. A. (1999). Population HIV-1 dynamics in vivo: applicable models and inferential tools for virological data from AIDS clinical trials. Biometrics, 55, 410-418.

Wu, H. & Zhang, J. (2002). Local polynomial mixed-effects models for longitudinal data. Journal of the American Statistical Association, 97, 883-897.

Young, S. G. & Bowman, A. W. (1995). Non-parametric analysis of covariance. Biometrics, 51, 920-931.

Zeger, S. L. & Diggle, P. J. (1994). Semiparametric models for longitudinal data with application to CD4+ cell numbers in HIV seroconverters. Biometrics, 50, 689-699.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 451- 466 1538 – 9472/03/$30.00

Alphabet Letter Recognition And Emergent Literacy Abilities Of Rising Kindergarten Children Living In Low-Income Families

Stephanie Wehry Florida Institute of Education The University of North Florida

Alphabet letter recognition item responses from 1,299 rising kindergarten children from low-income families were used to determine the dimensionality of letter recognition ability. The rising kindergarteners were enrolled in preschool classrooms implementing a research-based early literacy curriculum. Item responses from the TERA-3 subtests were also analyzed. Results indicated alphabet letter recognition was unitary. The ability of boys and younger children was lower than that of girls and older children. Child-level letter recognition was highly associated with TERA-3 measures of letter knowledge and conventions of print. Classroom-level mean letter recognition ability accounted for most of the variance in classroom mean TERA-3 scores.

Key words: Early childhood literacy, alphabet letter knowledge, latent variable modeling, two-level modeling, categorical factor analysis.

Introduction awareness as the skill and knowledge base of emergent literacy. They further suggested The No Child Left Behind Act has focused emergent literacy consists of outside-in processes attention on reading instruction in kindergarten that include the context in which reading and through third-grade. Programs such as the writing occurs and inside-out processes that Preschool Curriculum Evaluation Research include the knowledge and skills associated with (PCER) and Early Reading First (ERF) grants the alphabetic principle, emergent writing, and expand that focus to preschool curricula that cognitive processes. Specific examples of outside- support cognitive development including emergent in processes include oral language, conceptual literacy. Literacy researchers are connecting skills, and concepts of print. The inside-out theories about the acquisition of reading and processes are letter knowledge, phonological emergent literacy skills and experiences. processing skills, and syntax awareness. A study The emergent literacy model embodies by Whitehurst et al. (1999) of 4-year-old Head more than reading readiness and is used to Start children indicated inside-out processes were describe the acquisition of literacy on a much stronger influences on first- and second- developmental continuum. The model provides a grade reading outcomes than outside-in processes. picture of the acquisition of literacy that occurs Historically, reading has been defined in from early childhood rather than beginning at two ways; code breaking and meaning making kindergarten and further suggests literacy skills (Riley, 1996) or as decoding and comprehension develop concurrently and interdependently. (Gough, Juel, & Griffin, 1992; Mason, 1980; Whitehurst and Lonigan (1998) listed Perfetti, 1984). Two stages of reading acquisition vocabulary, conventions of print, emergent relative to the code breaking definition were writing, knowledge of graphemes, grapheme- originally proposed and those models were often phoneme correspondence, and phonological refined to include three stages (Frith, 1985; Gough & Hillinger, 1980; Gough, Juel, & Griffith, 1992; Mason, 1980; Sulzby, 1992). Stephanie Wehry is the Associate Director for The first stage involves the association of Research at the Florida Institute of Education, a spoken word with some visual feature of the University of North Florida. Email her at corresponding printed word. The second stage [email protected]. involves cryptanalysis of printed words or

451 452 ALPHABET RECOGNITION OF RISING KINDERGARTNER CHILDREN phonological processing involving the manner with their phonological sensitivity correspondence of graphemes and phonemes, and (Bowley, 1994; Stahl & Murray, 1994). Stahl and the third stage involves orthographic processing Murray suggested children’s letter knowledge involving the correspondence of patterns enables them to manipulate initial sounds – a skill and printed words. Baker, Torgeson, and Wagner that leads to word recognition. (1992) studied the role of phonological and Researchers have also found measures of orthographic processing and determined that phonological awareness independently predicted orthographic skills make an independent measures of word recognition and decoding contribution to reading achievement. Goswami (McGuiness, McGuiness, & Donohue, 1995), and (1993) saw these stages as cyclical where that among preschool children from low-income orthographic skills enhance phonological skills, families, measures of phonological sensitivity which in turn enhance orthographic skills. were associated with measures of letter knowledge Mason (1980) suggested alphabet (Lonigan, Burgess, Anthony, & Barker, 1998). knowledge initiates the first level of reading Whitehurst et al. (1999) found that reading ability acquisition by facilitating the breaking down of in early elementary school was strongly related to words into letters. Later, in a critique of five measures of preschool children’s skills that studies of children’s alphabet knowledge, Ehri included items requiring them to name a pictured (1983) went further and suggested children’s letter and to identify initial letters and sounds of knowledge of the alphabet is the main skill that pictured and named objects – tasks that measure enables them to move from the first stage to the grapheme-phoneme relationships. Lonigan, alphabetic or phonological stage of reading Burgess, and Anthony (2000), in a longitudinal acquisition and that it is difficult to separate study, found letter knowledge was independent of children’s letter-sound knowledge from other phonological sensitivity, environmental print, and emergent literacy skills. Chall (1983) summarized decoding, and that 54% of the variation in 17 studies of the relationship between knowledge kindergarten and first grade children’s reading of the alphabet and future reading achievement. skills was accounted for by preschool Although causation was not claimed, knowledge phonological sensitivity and letter knowledge. of the letters of the alphabet was seen as an As Adams (1990) suggested, a child’s important predictor of reading achievement. level of phonological processing is irrelevant if the Sulzby (1983) suggested children’s letter- child cannot identify the letters of the alphabet. If name ability is integrated into a more complex set a beginning reader cannot identify the letters then of early literacy skills and that children attempt to the reader cannot associate sounds with letters use some mechanism as they learn to associate (Bond & Dykstra, 1967; Chall, 1967; Mason, letter names with their visual forms. Children learn 1980). Moreover, orthographic competency these skills from exposure to books, songs, blocks, depends on the ability to visually identify and and learning to write their names. Sulzby (1992) discriminate the individual letters of the alphabet. 
further suggested alphabet letter knowledge How children acquire this ability falls in the precedes understanding the concept of word and domain of perceptual learning theory. comprehension; however, these stages reinforce There are two prevalent theories (Adams, each other. Bialystok (1991) suggested that 1990; Gibson & Levin, 1975); the template and children who can identify letters in non-alphabetic the feature theories. In the template theory, the order and understand that letters symbolize sounds brain stores templates of the most typical are on their way to code breaking. Riley (1996) representation of the letters and stimuli are proposed the link between alphabet letter compared to the stored templates. In the feature knowledge and concepts of print is the key to why theory, the letters of the alphabet are considered a alphabet letter knowledge is such a powerful group of symbols that share common distinct predictor of reading achievement. features. The brain stores the common features of Moreover, recent studies of emergent different letters and matches features of stimuli to literacy have focused on the relationships between the stored list. Both theories involve search and phonological awareness and later reading. But comparison. children’s letter knowledge is associated in some STEPHANIE WEHRY 453

Studies of children’s alphabet letter correlations between observations nested in groups knowledge span more than four decades, involve are positive resulting in inflated Type I error rates preschool to third-grade children from low- and in significance testing. middle-income families, and use either all or a Further research is needed to estimate the sample of the letters. Sulzby (1983) suggested magnitudes of intraclass correlations in preschool knowledge of the alphabet measured in achievement data. In this study, classrooms were kindergarten, not later, is the predictor of reading studied because of the large number of single- achievement. However, Early Childhood classroom locations in the data and because Head Longitudinal Study-Kindergarten researchers Start researchers found most of the variance in reported 66% of children entering kindergarten for program quality occurred at the classroom-level. A the first time recognized most of the letters of the two-level model was used to estimate the size of alphabet (Zill & West, 2001). the intraclass correlations; however, a two-level In recent studies of children’s alphabet study confounds the effects classrooms and sites knowledge, Whitehurst et al. (1999) studied Head for sites with more than one classroom. Start children and used a sample of letters embedded as items in another measure; Lonigan et Purposes Of This Study al. (1998) studied preschool children from low- income families and used all uppercase letters; The primary purpose of this study was to analyze Lonigan et al. (2000) studied preschool children the alphabet letter recognition ability of rising from middle- to upper-income families and used kindergarten children from low-income families all uppercase letters; and Roberts (2003) studied and determine if the ability was unitary or if it preschool children whose primary language was divided along the perceptual learning or not English and used a sample of letters. instructional features (Adams, 1990; Gibson & Studies of children from low-income Levin, 1975). A second purpose of this study was families are especially important because one third to investigate the relationship between recognition of American children experience reading of the letters of the alphabet and other measures of difficulties in school (Adams, 1990), and children emergent literacy using methodology that from low-income families have comparatively developed an interval measurement scale and lower levels of emergent literacy (Whitehurst & acknowledged the nested nature of the data. The Lonigan, 1998). Because individual differences in three research questions about responses from emergent literacy at entry into kindergarten are rising kindergarten children from low-income stable or increase over school years (Baydar, families are Brooks-Gunn, & Furstenberg, 1993; Juel, 1988; Stevenson & Newman, 1986), the impact of lower 1. Is the ability to recognize upper- and levels of emergent literacy follows preschool lowercase letters of the alphabet unitary or children through school. For these reasons, this multidimensional? study analyzed responses from rising kindergarten 2. Does a latent trait model of children’s children from low-income families using all responses on the three Test of Early upper- and lowercase letters of the alphabet and Reading Ability (TERA-3) subtests other items measuring emergent literacy abilities. confirm the test publisher’s three-factor Moreover, the children studied were structure? nested in classrooms nested in locations. 
Head 3. Using children’s two-parameter normal Start researchers (Westat, 1998) found significant ogive scores on alphabet letter recognition variation in program quality across Head Start and TERA-3 subtests in a two-level model: programs, centers, and classrooms with the largest a. What is the relationship between variation occurring at the classroom level. children’s alphabet letter knowledge and Whitehurst et al. (1999) also found the the TERA-3 subtest abilities? performance of Head Start children differed across b. Do these relationships differ by the age centers. Violating the assumption of independent and/or gender of the children? observations across experimental units is a major concern with the use of nested data. In most cases, 454 ALPHABET RECOGNITION OF RISING KINDERGARTNER CHILDREN

c. What portion of the individual differences taught children alphabet letter knowledge, in the children’s scores is accounted for phonemic awareness, and print concepts. Teachers the by the classrooms in which they learn? also used dialogic reading (Valdez-Menchara, & d. Are differences in the classroom means of Whitehurst, 1992; Whitehurst, Arnold, Epstein, TERA-3 subtest scores predicted by Angell, Smith, & Fischel, 1994) and provided classroom mean alphabet letter opportunities for emergent writing, reading, and recognition scores? comprehension. All instruction occurred in print- rich environments with labeled furniture and word Methodology walls. The evaluation of the literacy curriculum used measures of alphabet letter recognition and Participants other emergent literacy abilities in a Data were collected from 1,299 4-year-old pretest/posttest design. Data used in this study children during a one-month period from April 15, were the posttest data of that evaluation. 2002 to May 17, 2002. All children were eligible to attend public school kindergarten the following Measurement year. Birth dates were available for 1,025 of the Data were collected on the children’s children and their ages as of September 1 of the ability to recognize the 52 upper- and lowercase school year ranged from 48 to 65 months with the letters of the alphabet and from Form A of the Test average and median ages of 54.7 and 55 months, of Early Reading Ability (TERA-3) (Reid, Hresko, respectively. Gender was reported for 1001 & Hammill, 2001a). Trained examiners collected children: 530 (53%) were boys. The average responses from children in school settings in age (median) ages for boys and girls were 54.7 (55) appropriate one-on-one sessions. The children’s and 54.6 (55) months, respectively. Ethnicity data responses were recorded on scannable forms. were not collected; however, nearly all of the children were African American. Alphabet Letter Recognition Uppercase letter flashcards, arranged in a Classroom Context fixed non-alphabetic order, were presented one at The children were from low-income a time to each child. The child was asked to name families; therefore, were considered at risk for the letter. Following presentation of the 26 academic failure. They were attending Head Start, uppercase letters, lowercase letter flashcards, also faith-based, subsidized, and early intervention arranged in a fixed non-alphabetic order, were preschool programs located in six counties in presented one at a time. southeastern United States. Most of the children attended classrooms in urban settings; however, a TERA-3 few classrooms were located in small towns. The TERA-3 is composed of three subtests Children with complete scores and gender measuring unique but related early literacy skills. information were enrolled in 121 classrooms at 76 Items within each subtest are arranged by locations. difficulty and each subtest has a stopping Fifty-five of the locations were single- mechanism. All children began testing with the classroom sites, 16 of the locations were two- or first item in each subtest. According to Reid, three-classroom sites, and the remaining five Hresko, and Hammill (2001b), the Alphabet locations had four or more classrooms at each site. subtest measures graphophomenic knowledge, the All children in the study experienced at least one Conventions subtest measures knowledge of semester of an intensive early literacy curriculum. 
conventions of English print, and the Meaning Classroom teachers explicitly taught the inside-out subtest measures ability to comprehend meaning early literacy skills in classroom contexts that of print. Published validity and reliability provided outside-in early literacy experiences information indicates Cronbach Alpha coefficients (Whitehurst & Lonigan, 1998). Agencies funding of internal consistency for 4-year old children (5- participation in the literacy curriculum provided year-old children) for the Alphabet, Conventions, materials, teaching strategies, and weekly and Meaning subtests are .94 (.93), .88 (.86), and coaching for preschool teachers as they explicitly .94 (.84), respectively. STEPHANIE WEHRY 455

Data Analysis
Data were analyzed using Mplus 2.13 (Muthén & Muthén, 2003). The flexibility of Mplus permits latent variable modeling with categorical indicators. The use of raw scores formed by summing correct item responses assumes all items are equally important in measuring the underlying construct and that intervals between scores are uniform across the ability continuum. In contrast, measurement modeling within the latent variable context permits a distinction between observed item scores and the underlying construct, and the continuous latent variables are free from measurement error. Categorical confirmatory factor analyses (CFAs) were conducted using the item responses from the alphabet letter recognition task and the three TERA-3 subtests. The analyses produced two-parameter normal ogive item response theory (IRT) models. The CFAs resulted in error-free continuous latent variables; however, Mplus does not have the capability to use these results directly in multilevel models. Factor scores, which are estimated as in IRT modeling, were used as continuous variables in the two-level model. This procedure reintroduced some measurement error.

Results

Alphabet Letter Recognition

Distribution of Items and Summed Scores
Item responses were available from 1,299 rising kindergarten children. Correct responses were coded one and incorrect responses were coded zero. Table 1 shows alphabet letter item means and standard deviations. Additionally, three scores were formed by summing responses: one for uppercase letters, one for lowercase letters, and one for the total of the upper- and lowercase scores. The means (standard deviations) for these summed scores were 16.41 (9.11), 13.69 (8.89), and 30.08 (17.74), respectively.

Adams (1990) suggested alphabet letter recognition instruction begins with the uppercase letters for preschool children, and the mean scores indicated rising kindergarten children recognized more uppercase than lowercase letters, and more than 22% of the children recognized all uppercase letters. Calfee, Cullenbine, DePorcel, and Royston (cited in Mason, 1980) found the distribution of children's uppercase letter recognition ability was bimodal, with most children either recognizing fewer than eight or more than 20 letters. Figure 1 shows the distribution of the children's upper- and lowercase letter recognition summed scores. Data pile up on both extremes of the distribution (ceiling and floor effects), as previously determined. The pattern at both extremes is more obvious in the distribution of lowercase letter responses.

Dimensionality of Alphabet Letter Recognition: Classical Test Theory
Traditional methods of assessing test dimensionality use factor analytic methods and coefficients of internal consistency as indicators. Cronbach's Alpha, a measure of internal consistency, for the 52 items was .98, indicating the items consistently measured a unitary construct. Factor analysis of the alphabet letter recognition data produced four eigenvalues greater than 1.00 (26.49, 1.97, 1.11, and 1.06), explaining 50.94, 3.79, 2.14, and 2.04 percent of the variance in the observations, respectively. These eigenvalues suggested the presence of one central factor with possibly up to three additional minor or difficulty factors.

Dimensionality of Alphabet Letter Recognition: Item Response Theory
Latent variable modeling permits a measurement model of data that is error free, weighs the relative importance of each item, and places measurement on an interval scale. Several theoretical measurement models of alphabet letter recognition ability were evaluated using categorical CFA.

Alphabet letter recognition often begins with the uppercase letters, as they are more visually distinct than the lowercase letters (Tinker, 1931). Therefore, Model I was a two-factor model with one factor representing the uppercase letters and one representing the lowercase letters. Model I was based on instructional strategy.
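For readers who want to reproduce the classical-test-theory screening described above on their own binary item matrix, the Python sketch below computes Cronbach's alpha and the eigenvalues of the item correlation matrix. It uses Pearson correlations of the 0/1 responses as a rough stand-in for the tetrachoric-based quantities a categorical CFA works with, and the simulated data are placeholders rather than the study data.

    # Sketch: Cronbach's alpha and eigenvalues for a binary item-response matrix
    # (rows = children, columns = letter items). Pearson correlations of 0/1 items
    # are used here as a simple approximation; the paper's categorical CFA treats
    # the items as ordinal indicators instead.
    import numpy as np

    def cronbach_alpha(items):
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    def correlation_eigenvalues(items):
        corr = np.corrcoef(np.asarray(items, dtype=float), rowvar=False)
        return np.sort(np.linalg.eigvalsh(corr))[::-1]

    # Placeholder data: 1,299 children x 52 letters driven by one latent ability
    rng = np.random.default_rng(7)
    ability = rng.normal(size=(1299, 1))
    difficulty = rng.normal(scale=0.8, size=(1, 52))
    responses = (ability + rng.normal(scale=0.9, size=(1299, 52)) > difficulty).astype(int)

    print(round(cronbach_alpha(responses), 3))
    print(correlation_eigenvalues(responses)[:4].round(2))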


Table 1. Summary Statistics and Model VII Factor Loadings for Items Measuring Recognition of the Upper- and Lowercase Letters of the Alphabet

Uppercase letters Lowercase letters Variable Mean Standard Factor Mean Standard Factor deviation loading deviation loading Aa .75 .43 .90 .51 .50 .86 Bb .81 .39 .82 .43 .50 .84 Cc .70 .46 .90 .68 .47 .91 Dd .65 .48 .89 .32 .47 .77 Ee .65 .48 .90 .58 .49 .90 Ff .57 .50 .93 .42 .49 .91 Gg .56 .50 .91 .39 .49 .89 Hh .61 .50 .88 .43 .50 .86 Ii .59 .49 .89 .60 .49 .87 Jj .59 .49 .89 .56 .50 .89 Kk .66 .47 .84 .64 .48 .84 Ll .59 .49 .90 .31 .46 .79 Mm .57 .50 .84 .55 .50 .86 Nn .57 .50 .88 .39 .49 .82 Oo .85 .36 .87 .82 .38 .86 Pp .65 .48 .91 .54 .50 .86 Qq .60 .49 .86 .36 .48 .81 Ss .65 .47 .88 .63 .48 .89 Tt .64 .48 .88 .56 .50 .89 Uu .52 .50 .89 .43 .50 .85 Vv .45 .50 .87 .46 .50 .86 Ww .63 .48 .76 .64 .48 .75 Xx .71 .45 .75 .72 .45 .76 Yy .59 .49 .85 .55 .50 .86 Zz .65 .48 .88 .63 .48 .86

Perceptual learning theory suggests other models. One theory suggests children holistically perceive the letters and form templates in their memories for each letter learned. Another theory suggests children recognize letters by a set of distinctive visual features stored in their memories. The feature theory is more mentally efficient than the template theory.

Gibson and Levin (1975) reported that both children and adults sorted the uppercase letters of the alphabet by whether or not they have only straight-line features or have curved features in possible combination with straight-line segments. The secondary sort was by whether or not the letters with curved features have places of intersection, such as B and P, or look round, such as O and Q. The tertiary sort was by whether letters with straight-line features have diagonal segments, such as M and Z, or not, such as E and F.


Figure 1. The distribution of simple summed upper- and lowercase alphabet letter recognition scores.

[Figure 1 shows paired histograms of the uppercase and lowercase summed scores (x-axis: number of items correct, 0 to 26; y-axis: percent of scores). An inset table reports the percent of scores at the extremes: fewer than 8 letters, 32.14% (lowercase) and 25.02% (uppercase); more than 20 letters, 31.79% (lowercase) and 46.42% (uppercase).]

Several models involving the distinct features of the letters were investigated. Model II was a two-factor model with one factor representing letters whose visual representation is composed of diagonal line segments with no curved features (AKMNVWXYZkvwxyz) and a factor representing the remaining letters (BCDEFGHIJLOPQRSTUabcdefghijlmnopqrstu). Model III was a two-factor model with one factor representing letters whose visual representation is composed only of line segments (AEFHIKLMNTVWXYZikltvwxyz) and one representing the remaining letters (BCDGJOPQRSUabcdefghjmnopqrsu). Model IV was a two-factor model with one factor representing letters whose visual representation exhibits line symmetry (ABCDEHIMOTUVWXYZclotvwxz) and one representing the remaining letters (FGJKLNPQRSabdefghijkmnpqrsuy).

Roberts (2003) used explicit instruction to teach alphabet letter recognition to preschool children and suggested there are 44 distinct abstract symbols children must learn. She reasoned the upper- and lowercase forms for C, O, S, U, V, W, X, and Z are the same. Model V was a two-factor model with one factor representing these eight pairs (COSUVWXZcosuvwxz) and one factor representing the remaining letters (ABDEFGHIJKLMNPQRTYabdefghijklmnpqrty).

Rotated exploratory factor analysis of the data suggested four highly correlated factors with one primary factor. Therefore, a unitary model, Model VI, was fit. Additionally, there are at least seven letters whose upper- and lowercase visual forms are identical (C, O, S, V, W, X, and Z) and four more whose upper- and lowercase visual forms are nearly identical (K, P, U, and Y); therefore, Model VII, another unitary model with the errors for these eleven pairs of letters freed to correlate, was also fit.
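For readers re-fitting these models, the letter groupings above are easy to mistranscribe. The short Python check below writes the Model II-V factor definitions out as data and verifies that, within each letter case, the two factors partition the 26 letters; the groupings are copied from the text, not from the original analysis files.

    # Sketch: the two-factor groupings for Models II-V written out as data, with a
    # quick check that each model's factors partition the 26 upper- and lowercase
    # letters. Groupings are transcribed from the text above.
    import string

    MODELS = {
        "II (diagonal vs. other)": ("AKMNVWXYZkvwxyz",
                                    "BCDEFGHIJLOPQRSTUabcdefghijlmnopqrstu"),
        "III (line segments only vs. other)": ("AEFHIKLMNTVWXYZikltvwxyz",
                                               "BCDGJOPQRSUabcdefghjmnopqrsu"),
        "IV (line symmetry vs. other)": ("ABCDEHIMOTUVWXYZclotvwxz",
                                         "FGJKLNPQRSabdefghijkmnpqrsuy"),
        "V (same-form pairs vs. other)": ("COSUVWXZcosuvwxz",
                                          "ABDEFGHIJKLMNPQRTYabdefghijklmnpqrty"),
    }

    for name, (factor1, factor2) in MODELS.items():
        for case, alphabet in [("upper", string.ascii_uppercase),
                               ("lower", string.ascii_lowercase)]:
            f1 = {c for c in factor1 if c in alphabet}
            f2 = {c for c in factor2 if c in alphabet}
            ok = f1.isdisjoint(f2) and (f1 | f2) == set(alphabet)
            print(f"Model {name:38s} {case}case partition ok: {ok}")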

Categorical confirmatory factor analysis of the seven models was conducted using Mplus. A matrix of 1,299 observations, each observation having 52 binary items, was analyzed. Weighted least squares estimation (WLSM) was used to estimate model parameters. Five fit statistics are available for WLSM estimation: the comparative fit index (CFI), the Tucker-Lewis index (TLI), the root mean square error of approximation (RMSEA), the weighted root mean square residual (WRMR), and the standardized root mean square residual (SRMR). Guidelines for good fit of categorical models suggest CFI > .95, TLI > .95, RMSEA < .06, WRMR < .90, and SRMR < .08 (Hu & Bentler, 1999; Yu & Muthén, 2002). Table 2 shows fit statistics for each of the seven models.

All seven models had CFI, TLI, and SRMR fit statistics within the limits established for good fit. None of the seven models had WRMR within the limits established by Yu and Muthén (2002). The RMSEA fit statistic of Model VII was the only one within limits, and Model VII had the lowest WRMR. Therefore Model VII, a unitary model, exhibited the best overall fit and is supported by classical test theory and parsimony. Table 1 shows factor loadings for Model VII, and factor scores from Model VII were used in the two-level model.

TERA-3

TERA-3 is composed of three subtests measuring graphophonemic knowledge (Alphabet), knowledge of conventions of English print (Conventions), and the ability to comprehend the meaning of print (Meaning), and is designed for use with children whose ages are between three years six months and eight years six months. There are 29 Alphabet items, 21 Conventions items, and 30 Meaning items. Any subtest item whose mean was less than .05 was not used in this study. TERA-3 was administered to 1,009 children in one-on-one settings by trained examiners. Correct responses were coded one and incorrect responses were coded zero. Table 3 shows TERA-3 item means and standard deviations.

Subtest Alphabet
Twenty-two Alphabet items were included in the study, and these items required children to identify pictured upper- and lowercase named letters, to name identified pictured upper- and lowercase letters, to identify initial letters and sounds of text and named words, and to choose the correct text corresponding to a pictured object. The Cronbach Alpha coefficient for the Alphabet subtest items used in the study was .93.

Table 2. Fit Indices and Factor Correlations for Seven Measurement Models of Alphabet Letter Recognition.

Model CFI TLI RMSEA WRMR SRMR Correlations I .99* .99* .09 2.52 .05* .77 II .99* .99* .09 2.39 .05* .71 III .99* .99* .09 2.48 .05* .71 IV .99* .99* .09 2.51 .05* .80 V .99* .99* .09 2.34 .05* .78 VI .99* .99* .09 2.56 .05* - VII 1.00* 1.00* .04* 1.31 .03* .14-.37

Note. * Denotes that the value meets the guideline for good model fit.
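As a small illustration of how one of these indices is obtained, the Python sketch below computes RMSEA from a model chi-square statistic, its degrees of freedom, and the sample size. The inputs shown are arbitrary placeholders rather than values from Table 2, and the other indices (CFI, TLI, WRMR, SRMR) require additional quantities from the fitted model.

    # Sketch: RMSEA from a model chi-square statistic, its degrees of freedom,
    # and the sample size N. The inputs are placeholders, not values from Table 2.
    import math

    def rmsea(chi_square, df, n_obs):
        # RMSEA = sqrt(max(chi2 - df, 0) / (df * (N - 1)))
        return math.sqrt(max(chi_square - df, 0.0) / (df * (n_obs - 1)))

    print(round(rmsea(chi_square=2150.0, df=1273, n_obs=1299), 3))  # -> about .023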


Subtest Conventions restricted to measuring TERA-3 subtests suggested Twelve Conventions items were included by test developers. However, one Conventions in the study, and these items required children to item, C3, involved alphabet letter knowledge; identify pictured books that were oriented therefore, it was freed to load on both the Alphabet correctly for reading, to distinguish pictured text and Conventions latent variables. Figure 2 from other pictured line markings, to match provides a visual representation of the model, and, pictured uppercase with corresponding lowercase as can be seen, C3 was more strongly associated letters, to distinguish between text, title, author’s with the Alphabet latent variable. Model name, and illustrations when presented pictured parameters were estimated using WLSM, and fit first pages of a story, to identify the first and last indices were CFI = .99, TLI = .99, RMSEA = .05, words of a pictured paragraph, and to follow (by and WRMR = 1.50. Three of the indices, CFI, TFI, pointing) pictured text as it was read indicating and RMSEA, indicated model fit (Yu & Muthén, knowledge that text is read from left to right, top 2002). The three latent factors were correlated to bottom, and when to turn a pictured page. with the strongest correlation occurring between Cronbach Alpha coefficient for the Conventions Alphabet and Conventions. Table 3 shows factor subtest items used in the study was .80. loadings for the TERA-3 model, and factor scores from the model were used in the two-level model. Subtest Meaning Ten Meaning items were included in the Two-Level Path Analysis of the Alphabet Letter study, and these items required children to identify Recognition and TERA-3: Emergent Literacy pictured product labels corresponding to named Abilities of the Rising Kindergartners product categories, to identify pictured upper- and Alphabet letter recognition Model VII lowercase text placed adjacent to named pictured factor scores (Letters) and the TERA-3 subtest objects, and to identify pictured text corresponding factor scores (Alphabet, Conventions, and to named pictured objects when presented Meaning) were used in a two-level path analysis. amongseveral sets of pictured objects with The within-level used the child-level data and the corresponding text. Cronbach Alpha coefficient between-level used the classroom-level data. Table for the Meaning subtest items used in the study 4 shows summary statistics for the 986 child-level was .74. and the 121 classroom-level factor scores of the four variables. Confirmatory factor analysis of these 45 The analysis in multilevel terms involved items was performed using Mplus. Items were the following variables and notations:

i is the ith child of the 986 children studied,
j is the jth classroom of the 121 classrooms studied,
Subtest_ij is the TERA-3 subtest factor score of the ith child in the jth classroom,
Letters_ij is the alphabet letter recognition Model VII factor score of the ith child in the jth classroom,
Gender_ij is the gender (girls coded 0 and boys coded 1) of the ith child in the jth classroom, and
Age_ij is the age in months on September 1 of the ith child in the jth classroom.


All three TERA-3 subtests were simultaneously analyzed. The analysis in multilevel terms involved the following child-level and classroom-level equations:

Child-Level

$$Subtest_{ij} = \beta_{0j} + \beta_{1j}(Letters_{ij}) + \beta_{2j}(Gender_{ij}) + \beta_{3j}(Age_{ij}) + r_{ij}$$

Classroom-Level

$$\beta_{0j} = \gamma_{00} + \gamma_{01}(\overline{Letters}_{.j}) + u_{j}$$

where,

β_0j is the mean TERA-3 subtest factor score of the jth classroom,

β_1j is the expected change in children's TERA-3 subtest factor scores associated with a change in their alphabet letter recognition factor scores,

β_2j is the expected difference in boys' TERA-3 subtest factor scores relative to girls',

β_3j is the expected difference in children's TERA-3 subtest factor scores associated with a difference in their age,

r_ij is the unaccounted-for individual differences in children's TERA-3 subtest ability,

u_j is the unaccounted-for classroom differences in TERA-3 factor score classroom means,

γ_00 is the grand mean of the TERA-3 subtest factor scores, and

γ_01 is the expected change in TERA-3 subtest classroom mean factor scores associated with a change in the classroom mean alphabet letter recognition factor scores.
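The same two-level structure can be sketched with general-purpose mixed-model software. This is an illustration only: the article fit the three subtests simultaneously in Mplus, whereas the sketch below fits a single subtest with a random classroom intercept and the classroom mean of Letters as the level-2 predictor; the data file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical child-level file: one row per child with a classroom id,
# the Alphabet factor score, Letters, Gender (0/1), and Age in months.
df = pd.read_csv("rising_kindergartners.csv")

# Classroom mean of Letters plays the role of the gamma_01 predictor.
df["LettersMean"] = df.groupby("classroom")["Letters"].transform("mean")

# Child level:     Subtest_ij = b0j + b1*Letters + b2*Gender + b3*Age + r_ij
# Classroom level: b0j = g00 + g01*LettersMean_j + u_j
model = smf.mixedlm(
    "Alphabet ~ Letters + Gender + Age + LettersMean",
    data=df,
    groups=df["classroom"],   # random intercept u_j for classrooms
)
result = model.fit()
print(result.summary())
```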

This set of equations was replicated for each of the three TERA-3 subtest factor scores. Figure 3 shows the child-level and classroom-level path models and results. Parameters for the multilevel path analysis were estimated using Muthén's maximum likelihood estimator for balanced data (MUML). The fit indices for the model were CFI = 1.00, TLI = .99, RMSEA = .02, and SRMR < .01 for the within model (.04 for the classroom-level model): all indicated good fit. The intraclass correlations were .19, .21, .15, and .17 for Letters, Alphabet, Conventions, and Meaning, respectively.

The analyses indicated that alphabet letter knowledge predicted all three TERA-3 subtest abilities. Not surprisingly, the strongest influence was on the Alphabet subtest scores. Both age and gender influenced the Alphabet subtest scores directly and indirectly through the Letters variable. Boys and younger children had lower Alphabet subtest ability than girls and older children. The child-level model accounted for almost 70% of the child-level variance in the Alphabet subtest scores. Alphabet letter recognition ability also influenced the Conventions subtest scores with the strength of association about two thirds as large as in the Alphabet subtest scores. Following the same pattern found with the Alphabet subtest scores, age


Table 3. Summary Statistics and CFA Factor Loadings for TERA-3 Alphabet, Conventions, and Meaning Subtests

Variable  Mean  Standard   Factor     Variable  Mean  Standard   Factor
                deviation  loading                    deviation  loading
A1        .90   .30        .82        M1        .94   .24        .29
A2        .78   .41        .72        M2        .95   .22        .52
A3        .71   .46        .60        M3        .92   .28        .61
A4        .75   .43        .86        M4        .78   .41        .94
A5        .55   .50        .61        M5        .79   .41        .93
A6        .57   .50        .78        M6        .81   .39        .74
A7        .43   .50        .70        M7        .89   .31        .87
A8        .43   .50        .93        M8        .25   .43        .66
A9        .36   .48        .87        M9        .46   .50        .77
A10       .36   .48        .90        M10       .09   .29        .53
A11       .40   .49        .94        C1        .62   .49        .66
A12       .38   .49        .89        C2        .52   .50        .43
A13       .33   .47        .91        C3        .72   .45        .20
A14       .21   .41        .78        C4        .68   .47        .75
A15       .24   .43        .88        C5        .22   .41        .69
A16       .28   .45        .90        C6        .45   .50        .58
A17       .25   .43        .94        C7        .16   .36        .85
A18       .09   .29        .79        C8        .29   .46        .80
A19       .16   .37        .87        C9        .12   .32        .75
A20       .17   .38        .89        C10       .08   .28        .82
A21       .09   .29        .85        C11       .13   .34        .99
A22       .11   .31        .83        C13       .09   .29        .90
C3        .72   .50        .64

Note. n = 1,009 rising kindergarten children; A1-A22 are Alphabet Subtest items; C1-C11, and C13 are Conventions Subtest items; and M1-M10 are Meaning Subtest items.

and gender influenced the Conventions subtest scores both directly and indirectly through the Letters variable. Boys and younger children had lower Conventions subtest ability than girls and older children. The child-level model accounted for almost 36% of the child-level variance in the Conventions subtest scores.

Alphabet letter recognition knowledge also influenced the Meaning subtest scores with the strength of the influence more than one fourth as large as in the Alphabet subtest scores. Age influenced the Meaning subtest scores both directly and indirectly through the Letters variable; older children had higher Meaning subtest ability than younger children. Gender influenced Meaning subtest scores only indirectly through the Letters variable. The child-level model accounted for almost 19% of the child-level variance in the Meaning subtest scores.

The classroom means of the Letters variable predicted the classroom means of the Alphabet, Conventions, and Meaning subtest scores. Residuals of classroom means of all three subtest scores were significantly different from zero, indicating the need for the multilevel model. The proportion of variance in TERA-3 subtest classroom means accounted for by the classroom mean ability to recognize the letters of the alphabet was 88, 60, and 27 percent, respectively, for the Alphabet, Conventions, and Meaning subtests.
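The intraclass correlations reported earlier (.19, .21, .15, and .17) have the usual variance-components form ρ = τ00 / (τ00 + σ²). A minimal sketch of that computation, continuing the illustrative MixedLM fit shown after the model equations (not the article's Mplus code):

```python
# Intraclass correlation: between-classroom variance over total variance.
tau_00 = float(result.cov_re.iloc[0, 0])   # variance of the random intercept u_j
sigma_2 = float(result.scale)              # child-level residual variance
icc = tau_00 / (tau_00 + sigma_2)
print(round(icc, 2))
```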


Figure 2. The confirmatory factor analysis measurement model of the TERA-3 subtest items. All pictured correlations were statistically significant at α = .05. The t statistics ranged from a low value of 2.81 for Conventions measured by C3 to a high value of 48.35 for Alphabet measured by A4. The complete set of factor loadings is presented in Table 3.


Figure 3. The two-level path analysis of the child- and classroom-level TERA-3 and alphabet letter recognition (Letters) factor scores. The pictured child-level correlations were all statistically significant at α = .05. The child-level t statistics ranged from a low of 1.98 for Alphabet by gender to a high value of 42.14 for Letters regressed on Alphabet. Additionally, the classroom-level t statistics ranged from a low value of 4.67 for Mean Letters regressed on Mean Meaning to a high value of 16.51 for Mean Letters regressed on Mean Alphabet.

Conclusion

The findings from this study indicated the ability to recognize the upper- and lowercase letters in non-alphabetic order in classroom environments suggested by Lonigan et al. (1998) was also highly associated with measures of graphophemic knowledge, conventions of print, and knowledge of environmental print. Moreover, the classroom mean ability to recognize the letters of the alphabet accounted for differences in classroom mean measures of other emergent literacy abilities.

Participating classrooms were sponsored by agencies that were either recruited by curriculum developers for participation or whose sponsoring agencies requested participation and funded some extent of their participation. However, the participating children form a large, mostly urban, African American population of children from low-income families who attended a variety of preschool programs.

Child-Level Path Analysis
The path analyses indicated that alphabet letter knowledge predicted all three TERA-3 subtest abilities. The TERA-3 items measured alphabet letter knowledge, conventions of print, and emergent comprehension. What is more, the link between phonological sensitivity and alphabet knowledge is especially problematic for boys from low-income families. McGuiness et al. (1995) found that deficits in phonological awareness were more problematic to future reading achievement for boys than girls. The results of this study suggest these deficits for boys from low-income families may begin at the point of learning to recognize the letters of the alphabet.

The residuals of the child-level Alphabet and Conventions subtests were correlated. Both Bialystok (1991) and Sulzby (1992) suggested the influence of alphabet letter knowledge is linked to concept of word. The relationship between the alphabet letter recognition variable and the Conventions subtest scores may reflect the influence of letter recognition ability on those Conventions items requiring children to use their concept of word to respond to items that required them to follow pictured text as it was read to them or to point to various words.

Classroom-Level Path Analysis
The classroom-level model used four variables: the Letters, Alphabet, Conventions, and Meaning variables aggregated at the classroom level. The classroom mean of the Letters variable predicted the classroom means of the Alphabet, Conventions, and Meaning subtest scores. The intraclass correlations for Letters, Alphabet, Conventions, and Meaning were .19, .21, .15, and .17, respectively. These intraclass correlations are relatively large for a homogeneous population. For instance, in heterogeneous populations, Bryk and Raudenbush (1992) estimated 18% of the variance in math achievement scores of children in the 1982 High School and Beyond Survey was between schools, and Goldstein (1987) estimated 9% and 13% of the variance in reading achievement of elementary school children was between schools and between classes, respectively.

A possible explanation for these relatively large intraclass correlations is that instruction of some of the subtest constructs is more readily adapted to the use of explicit instruction to enhance child learning. In fact, historical evaluation of the literacy curriculum used with preschool children indicated the greatest increases in mean TERA-3 subtest scores occurred with the Alphabet subtest scores. Additionally, the percent of available subtest items used in this study (items with means greater than .05) were 76, 57, and 33 percent for the Alphabet, Conventions, and Meaning subtests, respectively, and 88 and 60 percent of the classroom-level variance in the Alphabet and Conventions subtest means was accounted for by the classroom mean ability of the children to recognize the letters of the alphabet. The children in this study could correctly respond to a much greater percent of the Alphabet and Conventions than Meaning items, which suggests higher ability in those areas. That ability was directly related to their classrooms' combined ability to identify the upper- and lowercase letters of the alphabet.

Because of this evidence and the explicit teaching of letter knowledge among other skills, classroom mean letter knowledge is seen as a measure of the implementation of the literacy curriculum, especially because participation was not uniformly implemented across sites in terms of the length of involvement during the school year or in terms of previous literacy curriculum experience of classroom teachers. Some teachers were new to the curriculum, having worked with it less than a semester, and other teachers had worked with it for several years. Supporting this implementation explanation is the fact that of the classrooms with the 16 lowest mean Letters scores, 12 were new sites with teachers new to the curriculum and with participation beginning after the winter holidays. The remaining four classrooms were early intervention special education classrooms. The implications of this explanation suggest mean classroom letter recognition ability may be a simple measure of the quality of emergent literacy curricula and experiences.

Perceptual Learning Theory of Alphabet Letter Recognition
Inspection of Table 1 indicates the most frequently recognized letters were uppercase A, B, and C and upper- and lowercase X and O. This, coupled with the alphabet letter summed scores depicted in Figure 1, suggests rising kindergarten children recognized more of the uppercase letters; however, it cannot be determined from this study whether this is because the uppercase letters are more visually distinct and therefore more easily recognized (Tinker, 1931) or whether the uppercase letters are taught first to preschool children (Adams, 1990). The path analysis also indicated boys' ability to recognize letters of the alphabet was lower than girls' and older children's ability was higher than younger children's.

These findings are limited by the lack of experimental design, but the size of the sample indicates these are areas for further research. The fact that alphabet letter knowledge is an integral part of a broader set of emergent literacy skills and is frequently learned in conjunction with broader skills enhances Sulzby's (1983) call for a better understanding of how children learn letter names and the processes they use to recognize the letter forms. If children, in fact, recognize letters of the alphabet by their distinctive features, a more controlled study is needed in which data are collected earlier in the learning process and at several time points with instructional strategy modeled into the design. Perhaps as children actively engage in learning to recognize the letters of the alphabet, the construct changes from a multidimensional to a unitary one.

References

Adams, M. J. (1990). Learning about print: The first steps. In Beginning to read: Thinking and learning about print (pp. 333-367). Cambridge, MA: MIT Press.
Baker, T. A., Torgenson, J. K., & Wagner, R. K. (1992). The role of orthographic processing skills on five different reading tasks. Reading Research Quarterly, 27, 334-345.
Baydar, N., Brooks-Gunn, J., & Furstenberg, F. F. (1993). Early warning signs of functional illiteracy: Predictors in childhood and adolescence. Child Development, 64, 815-829.
Bialystok, E. (1991). Letters, sounds, and symbols: Changes in children's understanding of written language. Applied Psycholinguistics, 12, 75-89.
Bond, G. L., & Dykstra, R. (1967). The cooperative research program in first-grade reading instruction. Reading Research Quarterly, 2, 5-142.
Bowey, J. A. (1994). Phonological sensitivity in novice readers and nonreaders. Journal of Experimental Child Psychology, 58, 134-159.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analyses methods. Newbury Park, CA: Sage.
Chall, J. S. (1983). The ABC's of reading: Is the alphabet necessary? In Learning to read: The great debate (pp. 140-159). NY: McGraw Hill.
Ehri, L. C. (1983). A critique of five studies related to letter-name knowledge and learning to read. In L. M. Gentile, M. L. Kamil, and J. S. Blanchard (Eds.), Reading Research Revisited (pp. 143-153). Columbus, OH: Charles E. Morrill.
Frith, U. (1985). Beneath the surface of developmental dyslexia. In K. Patterson, J. Marshall, & M. Coltheart (Eds.), Surface dyslexia (pp. 301-330). London: Erlbaum.
Gibson, E. J., & Levin, H. (1975). A theory of perceptual learning and its relevance for understanding reading. In The psychology of reading. Cambridge, MA: MIT Press.
Goldstein, H. (1987). Multilevel models in educational and social research. New York: Oxford University Press.
Goswami, U. (1993). Orthographic analogies and reading development. Bulletin of the British Psychological Society, 6, 312-316.
Gough, P. B., & Hillinger, M. L. (1980). Learning to read: An unnatural act. Bulletin of the Orton Society, 30, 171-176.
Gough, P. B., Juel, C., & Griffin, P. L. (1992). Reading, spelling, and orthographic cipher. In P. B. Gough, L. C. Ehri, & R. Treiman (Eds.), Reading acquisition (pp. 35-48). Hillsdale, NJ: Erlbaum.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Juel, C. (1988). Learning to read and write: A longitudinal study of 54 children from first through fourth grades. Journal of Educational Psychology, 80, 437-447.
Lonigan, C. J., Burgess, S. R., & Anthony, J. L. (2000). Development of emergent literacy and early reading skills in preschool children: Evidence from a latent-variable longitudinal study. Developmental Psychology, 36, 596-613.
Lonigan, C. J., Burgess, S. R., Anthony, J. L., & Barker, T. A. (1998). Development of phonological sensitivity in 2- to 5-year-old children. Journal of Educational Psychology, 90, 294-311.


Mason, J. M. (1980). When do children begin to read: An exploration of four-year-old children's letter and word reading competencies. Reading Research Quarterly, 13, 203-327.
McGuiness, D., McGuiness, C., & Donohue, J. (1995). Phonological training and the alphabet principle: Evidence for reciprocal causality. Reading Research Quarterly, 30, 830-852.
Muthén, B. O., & Muthén, L. (2003). Mplus statistical analysis with latent variables: Version 2.13. Los Angeles: Muthén & Muthén.
Perfetti, C. A. (1984, November). Reading acquisition and beyond: Decoding includes cognition. American Journal of Education, 40-57.
Reid, D. K., Hresko, W. P., & Hammill, D. D. (2000a). The Test of Early Reading Ability – Third Edition. Austin, TX: Pro-Ed.
Reid, D. K., Hresko, W. P., & Hammill, D. D. (2000b). TERA-3 examiner's manual. Austin, TX: Pro-Ed.
Riley, J. L. (1996). The ability to label letters of the alphabet at school entry: A discussion on its value. Journal of Research in Reading, 19, 87-101.
Roberts, T. A. (2003). Effects of alphabet-letter instruction on young children's word recognition. Journal of Educational Psychology, 95, 41-51.
Stahl, S. A., & Murray, B. A. (1994). Defining phonological awareness and its relationship to early reading. Journal of Educational Psychology, 86, 221-234.
Stevenson, H. W., & Newman, R. S. (1986). Long-term predictions of achievement and attitudes in mathematics and reading. Child Development, 57, 646-659.
Sulzby, E. (1983). A commentary on Ehri's critiques of five studies relating to letter-name knowledge and learning to read: Broadening the question. In L. M. Gentile, M. L. Kamil, and J. S. Blanchard (Eds.), Reading Research Revisited (pp. 143-153). Columbus, OH: Charles E. Morrill.
Sulzby, E. (1992). Research directions: Transition from emergent to conventional writing. Language Arts, 69, 291-297.
Tinker, M. A. (1931). The influence of form of type on perception of words. Journal of Applied Psychology, 16, 167-174.
Valdez-Menchara, M. C., & Whitehurst, G. J. (1992). Accelerating language development through picture book reading: A systematic extension to Mexican day care. Developmental Psychology, 28, 1106-1114.
Westat, Inc., Rockville, MD; Abt Associates, Inc., Bethesda, MD; CDM Group, Inc.; & Ellsworth Associates, McLean, VA. (1998). Head Start Program Performance Measures: Second Progress Report. Head Start Research. U.S.; District of Columbia.
Whitehurst, G. (1999). Measurement of emergent literacy and literacy outcomes. Report to the Advisory Committee on Head Start Research and Evaluation. Retrieved January 2002 from http://www2.acf.dhhs.gov/programs/hsb/hsreac/jun99/issue4.html
Whitehurst, G. J., Arnold, D. S., Epstein, J. N., Angell, A. L., Smith, J., & Fischel, J. E. (1994). A picture book reading intervention in day care and home for children from low-income families. Developmental Psychology, 30, 542-555.
Whitehurst, G. J., & Lonigan, C. J. (1998). Child development and emergent literacy. Child Development, 68, 848-872.
Whitehurst, G. J., Zevenbergen, A. A., Crone, D. A., Schultz, M. D., Velting, O. N., & Fischel, J. E. (1999). Outcomes of an emergent literacy intervention from Head Start through second grade. Journal of Educational Psychology, 91, 261-272.
Yu, C.-Y., & Muthén, B. O. (2002). Evaluation of model fit indices for latent variable models with categorical and continuous outcomes. Technical report in preparation.
Zill, N., & West, J. (2001). Entering kindergarten: A portrait of American children when they begin school. Findings from the Condition of Education, 2000 (Rep. No. NCES2001035). U.S.; District of Columbia: ED Pubs.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 467-474 1538 – 9472/03/$30.00

Deconstructing Arguments From The Case Against Hypothesis Testing

Shlomo S. Sawilowsky Educational Evaluation and Research Wayne State University

The main purpose of this article is to contest the propositions that (1) hypothesis tests should be abandoned in favor of confidence intervals, and (2) science has not benefited from hypothesis testing. The minor purpose is to propose (1) descriptive statistics, graphics, and effect sizes do not obviate the need for hypothesis testing, (2) significance testing (reporting p values and leaving it to the reader to determine significance) is subjective and outside the realm of the scientific method, and (3) Bayesian and qualitative methods should be used for Bayesian and qualitative research studies, respectively.

Key words: Hypothesis testing, bracketed intervals, significance testing, effect size, Bayes, qualitative

Introduction

There has been an increasing amount of journal space given to the case against hypothesis testing over the past quarter of a century. The ensuing debate has taken many directions and has been graced with many forms of argumentation (see, e.g., Sawilowsky, 2003a; Knapp & Sawilowsky, 2001). Two styles of attack against hypothesis testing are contested here.

The first is the proposition that hypothesis testing should be abandoned in favor of confidence intervals. (I prefer the term “bracketed” instead of “confidence” interval for reasons noted in Sawilowsky, 2003a.) Ancillary to this attack is the proposition that hypothesis testing is tolerable if and only if it is (a) buttressed with a report of effect sizes, (b) accompanied by graphical displays, or (c) Bayesian.

The second style of attack is that hypothesis testing should be abandoned due to philosophical arguments. An example is embodied in the question if science has benefited by hypothesis testing.

______
Shlomo Sawilowsky is Professor of Education and Wayne State University Distinguished Faculty Fellow. Email: [email protected]. The author gratefully acknowledges discussions on the ether with Rabbi Chaim Moshe Bergstein.

The “Confidence” Interval Attack
Neyman (1934), who discovered the bracketed interval, equated the probabilities associated with its lower and upper bound with “the ordinary concept of probability” (1934, p. 590). Initially, he seemed to equate it with the fiducial argument promulgated by Fisher (1930). The presumed lack of difference in the derivation of bracketed intervals and fiducial probabilities was the focus of the discussion subsequent to the reading of Neyman’s (1934) paper before the Royal Statistical Society. Bowley (1934) raised the question and presented his answer, “I am not at all sure that the ‘confidence’ is not a ‘confidence trick’… Does it really take us any further?... I think it does not” (p. 609). He considered bracketed intervals to be nothing more than ordinary probabilities expressed in a new form.

Neyman (1934) replied that “questions raised in the discussion on the confidence intervals would require too much space. In fact, to clear up the matter entirely, a separate publication is needed…[and] this is in preparation” (p. 623). He alluded to the nature of the response that would follow: “It has been suggested in the discussion that I used the term ‘confidence coefficient’ instead of the term ‘fiducial probability’. This is certainly a misunderstanding” (p. 623). Did Neyman differentiate between his proposed bracketed interval and the venerable hypothesis test?


No. Neyman (1935) immediately discussion indicates that Fisher understood the disabused readers of the statistical literature of paper’s implication quite well. His response was this notion. He stated, “The problem of a terse defense of the fiducial argument as the estimation in its form of confidence intervals explanation of ordinary probability. stands entirely within the bound of the theory of Neyman (1941) was surprised! Fiducial probability” (p. 116), as does hypothesis testing. probability and the fiducial distribution of a How, then, did the claim that bracketed intervals parameter were “more or less, lapsus linguae, are superior and preferred eventually arise as a difficult to avoid in the early stages of a new weapon in the arsenal of the camp attempting to theory” (p. 129). The fiducial argument was make a case against hypothesis testing? vague, misconceived, and vacuous in explaining Neyman (1941) reviewed the ordinary probability. development of the bracketed interval, which is The aftermath took the form of translated from the Polish “przedzial ufności.” considerable and animated debate in the He mentioned this phrase in 1930 in lectures at literature on the fiducial argument. Many the University of Warsaw and the Central mathematical statisticians, regardless of College (Agriculture) in Warsaw, Poland. Prior theoretical persuasion, joined in the fray by to the redaction of the theory, Pytkowksi (1932) publishing their support or concern. Wald published a practical application. (1939), Wald and Wolfowitz (1939), and Welch Neyman (1941) recounted that he had (1939) sided with the bracketed interval. Fisher noticed numerical similarities obtained with his (1935), Starkey (1938), Sukhatme (1938), and method and that of the fiducial argument. As a Yates (1939) defended the fiducial argument. result, he had initially assumed the two Pitman (1939) opined that the two theories were paradigms were identical. Neyman was satisfied essentially the same, as did Bartlett (1939) to a with considering the bracketed interval as an lesser extent. extension of the fiducial argument because Bartlett (1936, 1939) also escalated the Fisher (1930) had priority. debate with the contention that where results Eventually, Neyman (1934) became diverge, the fault lies within the fiducial estranged from the fiducial argument. He no argument. As can be imagined, Fisher (1937, longer considered the two theories 1939a, 1939b) and Yates (1939) accepted the interchangeable. He left the reasons unstated in gauntlet. Jeffreys (1940) attempted to restore his opening presentation before the Society. calm in claiming that the bracketed interval and Fisher (1934) attended the reading as a the fiducial argument were both subsumed under discussant. Historical accounts of the exchange inverse probability in the system of Bayes. This were varied. Some expressed chagrin with had no effect on the debate, of course, because Fisher, who offered minimal comments on the few of the combatants were Bayesian. The new methodology, and instead concentrated on controversy would only die with Fisher. the relative merits of random vs purposive Neyman (1941) succinctly described the sampling selection. Others, in noting Bowley’s relationship between the two theories: “There is (1934) comment that the paper was difficult to none” (p. 130) because “the theories of fiducial understand, assumed that Fisher might have argument and of confidence intervals differ in neglected to read Neyman’s paper prior to the their basic conceptions” (p. 
149). He was: reading and simply didn’t follow it. Still others proposed that this was Fisher’s feeble attempt at inclined to think that the literature on the blocking his baton from being passed to theory of fiducial argument was born out Neyman, just as Karl Pearson had tried in vain of ideas similar to those underlying the two decades prior with Fisher. theory of confidence intervals. These These reports misrepresented Fisher’s ideas, however, seem to have been too response. Most of his comments were directed to vague to crystallize into a mathematical the sampling problem because that was the theory. Instead, they resulted in primary thesis of Neyman’s (1934) paper. misconceptions of ‘fiducial probability’ Moreover, a careful review of the published and ‘fiducial distribution of a SHLOMO S. SAWILOWSKY 469

parameter’… In this light, the theory of data relevant to the speed of light, testing fiducial inference is simply non-existent. the hypothesis that light travels through a (p. 149) medium called luminiferous ether. If this ether existed, then light should travel Return to the “confidence” interval faster when moving in the same direction attack against hypothesis testing. Fisher’s as the motion of the earth - similar to a fiducial argument as the explanation of boat traveling faster when going probability was challenged and defeated. downstream compared with upstream. However, the ordinary understanding of Michelson and Morley interpreted their probability, even in its application to Fisher’s F published data, without tests of test, was never challenged, much less defeated. significance, as indicating that light Those who have raised the bracketed interval traveled the same speed no matter what attack against hypothesis testing are merely direction it was traveling. However, I exploiting Fisher’s discredited nomenclature and subjected their published data to a simple explanation of probability as he applied it to analysis of variance (ANOVA) and found hypothesis testing. statistical significance associated with the Ordinary probability is synonymous in direction the light was traveling (p. < .01). the theories of hypothesis testing and bracketed It is interesting to speculate how the intervals. Certainly, this was Neyman’s (1934) course of history might have been view. That is why we concluded, “There is an changed if Michelson and Morley had illogical swagger associated with criticizing been trained to use this corrupt form of hypothesis testing and subsequently advocating the scientific method, that is, testing the CIs [confidence intervals]” (Compton & null hypothesis first. They might have Sawilowsky, 2003, p. 584). concluded that there was evidence of significant differences in the speed of Philosophical Attack light associated with its direction and that “Has science benefited from hypothesis therefore there was evidence for testing?” The question is silly. No reputable luminiferous ether. If this ether existed, quantitative physical, behavioral, or social then light should travel faster when scientist would overlook the breadth and depth moving in the same ether. That of scholarly knowledge and its impact on society conclusion would have set back Einstein’s that has accrued from over a century of ideas many years, because his notions hypothesis testing. The definitive evidence: about relativity are based on light William Sealy Gosset created the t test to make traveling in every direction at the same better beer. speed. Fortunately, Michelson and Morley In an invited paper in this issue of did not corrupt the scientific method by Journal of Modern Applied Statistical Methods, testing the null hypothesis before they Professor Dayton addresses alternative strategies interpreted their data with respect to their to hypothesis testing. The motivating reference, research hypothesis. (p. 288) Carver (1978), championed the case against The best research articles are those hypothesis testing. Carver’s (1978) attack was that include no tests of statistical based on a variant of the philosophical attack: significance. In a single study, these tests speculation and assertion. 
“Even if properly used can be replaced with estimates of effect in the scientific method, educational research size and of sampling error, such as would still be better off without statistical standard errors and confidence intervals. significance testing” (p. 398). Carver (1993) Better still, by conducting multiple offered an “Einstein” gambit: studies, replication of results can replace statistical significance testing. (p. 289- An example from the history of 290) science will help to illustrate this point. Michelson and Morley (1887) collected 470 DECONSTRUCTING THE CASE AGAINST HYPOTHESIS TESTING

Responses to Carver’s (1993) claims Table 1. A Sampling Of Interferometry Results. appear below. In order to understand these ______remarks, it is necessary to preface with a Velocity description of interferometer data. Carver (1993) Experimenter Date (k/s) claimed the results were null. Indeed, the 1887 Michelson & Morley 1887 5 - ≤ 7.5 Michelson-Morley experiment is nearly Morley & Miller 1902-4 8.7 ± 0.6 unanimously touted as the most famous Morley & Miller 1905 7.5 experiment that produced a null result. (See, Miller 4/1/1925 10.1 ± .33 e.g., Feynman, Leighton, & Sands, 1963.) Miller 8/1/1925 11.2 ± .33 The interferometer was invented by Miller 9/15/1925 9.6 ± .33 Michelson to estimate the speed of light. It was Miller 9/23/1925 8.22 refined by Michelson (1881) and by Michelson Miller 2/8/26 9.3 ± .33 and Morley (1887a, 1887b) in an attempt to Picard & Stahel 1926 6.9 acquire evidence on the medium of propagation Picard & Stahel 1927 1.45 ± .007 of light called ether proposed by Aristotle. The Illingworth 1927 < 3 - 5 Michelson, hypothesized value, equal to the Earth’s orbital Pease, velocity, was approximately 30 km/s. & Pearson 1929 20 Michelson and Morley (1887a) did not Joos 1930 < 1.5 use hypothesis tests (which had yet to be Kennedy & invented, not withstanding allegations regarding Thornkike 1932 24 the dating of the sign test). Initially, they Michelson, presented “the results of the observations… Pease, graphically” (p. 333). Visual inspection led to & Pearson 1935 20 the conclusion there was an observed fringe ______shift, although it was less than what would be expected if the ether existed as hypothesized. the expected shift tends to zero. Subsequent They wrote, “It seems fair to conclude from the experiments conducted by Illingsworth (1927) figure that if there is any displacement due to the with Kennedy’s equipment, but with resilvered relative motion of the earth and the luminiferous mirrors, presented nonnull results. ether, this cannot be much greater than 0.01 of A variety of technical corrections were the distance between the fringes” (Michelson & introduced to account for the non-null results. Morley, 1887a, p. 333). Experiments were carefully designed to rule out Next, they presented descriptive rival hypotheses, such as temperature, drift, sign statistics. This led to the conclusion that “the of displacement, diurnal variation, and inter- ether is probably less than one sixth the earth’s session averaging. Nevertheless, no study orbital velocity, and certainly less than one produced null results. fourth” (p. 341). Values probably less than 5 Most interferometer experiments were km/s and certainly less than 7.5 km/s are not conducted by Miller (1933). He took more than null, although different from the expected value 200,000 readings from 1902 - 1927 based on of 30 km/s. Some results on interferometer 12,500 turns of the interferometer, including a experiments conducted from 1887 - 1935 are joint effort with Morley in the early 1900s. (In compiled in Table 1. comparison, Michelson and Morley made 36 The only null results via interferometry turns in four days, and Piccard and Stahel made were obtained by Kennedy in 1926. His results 96 turns in Belgium and 60 turns in Brussels.) were criticized by Illingsworth (1927), who Yet, Miller never obtained a null result. found the equipment suffered from a “reduced Shankland (et al., 1955) was Miller’s optical system” (p. 692). 
Múnera (1998) noted assistant, and subsequently was Professor of that the Kennedy experiment was unclear physics at Case Western Reserve University regarding the local solar time of the initial (where Morley was Professor of chemistry until orientation of the interferometer, which may 1906). After the death of his boss, he criticized have been at one of the four times per day that Miller’s work on the ether, notably with SHLOMO S. SAWILOWSKY 471 assistance from Albert Einstein. DeMeo (2000, In his experiments in 1923, and from 2001) strenuously defended Miller against 1925 - 1926 at Mt. Wilson, Miller took many Shankland’s criticisms. (The reader interested in steps to control for the effects of temperature. the dissident literature on ether should read The results were consistent with earlier DeMeo, 2000, 2001; and Múnera, 1998). Later, measurements. Similarly, Miller (cited in Joos & Shankland (1973, p. 2283) cited a letter received Miller, 1934) noted, “when Morley and Miller from Einstein dated 31 August 1954: designed their interferometer in 1904 they were fully cognizant of this... Elaborate tests have I thank you very much for sending been made... especially with artificial heating, me your careful study about the Miller for the development of methods which would be experiments. Those experiments, free from this effect [of temperature]” (p. 114). conducted with so much care, merit, of The Cleveland Plain Dealer (27 January 1926) course, a very careful statistical added, “Speaking before scientists at the investigation. This is more so as the University of Berlin, Einstein said the ether drift existence of a not trivial positive effect experiments [were null in the Michelson-Morley would affect very deeply the fundament experiment but] on Mount Wilson they showed of theoretical physics as it is presently positive results”, although he attributed it to accepted. temperature and altitude. You have shown convincingly that the observed effect... has nothing to do Einstein Gambit Declined with ‘ether-wind’, but has to do with There were thousands of interferomic differences of temperature. studies conducted by dozens of physicists since 1887, and in all but one experiment the results Einstein’s letter is instructive for many were demonstrably non-null. The only known reasons. First, he believed the interferometer null result was subsequently determined to be experiments on the ether “merit, of course, a caused by a miscalibrated instrument. When the very careful statistical analysis” [emphasis instrument was resilvered, and the experiment added]. Second, as late as the year of his death, replicated in the same location, the results were Einstein still believed that the interferometer about 4 km/s. experiments were a threat to his special theory Carver (1993) conducted a simple of relativity. Third, he had not updated his analysis of variance (ANOVA) and found knowledge many years after the specter of statistical significance (p < .01). These results temperature as a confounding variable was first are tenable, assuming the null hypothesis was raised. The Cleveland Plain Dealer (27 January the observations did not differ from zero. 1926) published an exchange between Einstein Nevertheless, Carver’s (1993) analysis suffers and Miller, with the latter concluding, from a bewildering array of questions, such as:

“The trouble with Prof. Einstein is • What data set was used? Was it from the that he knows nothing about my results,” noon readings, the afternoon readings, Dr. Miller said. “He has been saying for or a combination of readings? Was it thirty years that the interferometer from July 8th, 9th, 11th, or 12th of 1887; experiments in Cleveland showed or perhaps some combination of days? negative results. We never said they gave Did it include all 36 turns of the negative results, and they did not in fact interferometer, or some subset? give negative results. He ought to give me • What was the value of F? credit for knowing that temperature • What were the degrees of freedom? differences would affect results. He wrote • Were the underlying assumptions of to me in November suggesting this. I am independence, homoscedasticity, and not so simple as to make no allowance for normality considered? temperature.”

472 DECONSTRUCTING THE CASE AGAINST HYPOTHESIS TESTING

• Were covariates such as diurnal minimized its influence by reporting this effect variation or drift considered? size measure indicating that less that 1% of the • How was intersession averaging based variance in the speed of light was associated on different calibration curves handled? with its direction” (p. 289). The fallacy of his • According to Carver’s (1993) advice analysis is Michelson and Morley’s (1887a, and recommendation, why did he fail to 1887b) experiment obtained results of 5 to 7.5 present summary statistics or a graphic km/s. Regardless of what percent of variance it display of the results (either prior to the represents, how can anyone call a speed that ANOVA or afterwards)? exceeds the Earth’s satellite orbital velocity “null” and “seek to minimize its influence”? Carver (1993) claimed that this Of paramount importance, however, significant result from the hypothesis test would Carver (1993) tested the wrong hypothesis. Data have set Einstein back many years. This is inspection and graphs demonstrated interferomic unwarranted speculation. In his lecture in Berlin, data did not support the static model of Einstein rejected the 1887 Michelson-Morley luminiferous ether as a medium of propagation results as being nonnull, despite the evidence for light. Should a hypothesis test be desired, the contained within their descriptive statistics and correct test is whether the data were statistically graphs. Similarly, he would have ignored the significantly different – not from zero – but outcome of a hypothesis test. rather, from the hypothesized value of 30 k/s. Einstein’s theory was not based on any Carver (1993) described the process of experimental evidence. At various times conducting hypothesis tests prior to examining throughout his career, Einstein reminisced that it descriptive data as a corruption of the scientific was based on the principles of Maxwell and method. This is a straw-person argument. Who Lorentz, and he had not relied on the Michelson- promotes conducting hypothesis tests as a first Morley experiment. Holton (1969, 1988) step in the analysis of data? Who objects to suggested that not only did the interferometer examining raw data (e.g., for data entry errors, experiments have little or no impact, but there is outliers), computing descriptive statistics, and evidence that Einstein was unaware of the inspecting graphics prior, or as a follow-up, to Michelson-Morley experiment prior to conducting hypothesis tests? developing the special theory of relativity. Carver (1993) stated the best research Interferometer experimenters presented articles are those that contain no hypothesis graphical displays, from simple scatter grams tests. This regressive approach would truly set and histograms to more complex time series quantitative physical, behavioral, and social charts and hodograms. All pictorial science back more than a century. Reasonable representations substantiated nonzero results. people have different expectations of what Some of the latter interferometer experimenters constitutes a rare event vs what constitutes a reported standard errors. (Obviously, those who common event expected by chance alone. This is did not were remiss.) Many of the latter true with a single study, and all the more so with experimenters also reported bracketed intervals, many replications of a study. The debate is and zero was not in them. 
Múnera (1998) diminished, and possibly vanishes, with the summarized the bulk of interferometer studies simple agreement on a threshold (i.e., nominal with a bracketed interval, and zero was not in it. alpha level) prior to conducting an experiment. If statistical tests had been invented by 1887, it Carver’s (1993) reliance on reporting would have been easy to confirm the data were effect sizes as a panacea is naïve. Effect sizes statistically significantly different from zero. are sensitive to their own underlying Even Shankland (et al., 1955; 1973) was forced assumptions. In addition, the process of to admit this. enclosing effect sizes in a bracketed interval Carver (1993) reported an effect size relies on the same probabilities as does the (eta squared) of .005. He concluded “if obtained value of a hypothesis test. Carver Michelson and Morley had been forced … to do (1993) also recommended the practice of a test of statistical significance, they could have reporting an effect size whether the hypothesis SHLOMO S. SAWILOWSKY 473 test “is significant or not” (p. 288). This leads to Bartlett, M. S. (1939). Complete the “trouble with trivials” problem (see e.g., simultaneous fiducial distributions, Annals of Sawilowsky, 2003b, 2003c). Mathematical Statistics, 10, 129-138. Currently, it is a popular slogan among Bowley, A. L. (1934). Discussion on Dr. effect size enthusiasts to warn against Neyman’s paper. The Journal of the Royal “becoming stupid in another metric.” Yet, Statistical Society, 97, 607-610. Carver (1993) interpreted an eta squared of .005 Carver, R. P. (1978). The case against as null to minimize the study outcome. The statistical significance testing. Harvard experimental results Carver (1993) sought to Educational Review, 48, 378-399. minimize were speeds of over 16,750 miles per Carver, R. P. (1993). The case against hour! statistical significance testing, revisited. Journal of Experimental Education, 61(4), 287-292. The Next Generation of Arguments Compton, S., & Sawilowsky, S. (2003). As soon as these two lines of attack Do not discourage the use of p values. Annals of against hypothesis testing falter, three more Emergency Medicine, 41(4), p. 584. assaults are quickly proffered. This is not the DeMeo, J. (2001). Dayton Miller’s place to elaborate on them, but they are parried ether-drift experiments: A fresh look. Infinite briefly below. Energy Magazine, 38, 72-82. The first is to replace hypothesis testing DeMeo, J. (2002). Dayton Miller’s with significance testing. P values are reported ether-drift experiments: A fresh look. and it is left to the reader to decide if it is http://www.orgonelab.org/miller.htm. significant. Aside from being outside the realm Feynman, R. P., Leighton, R. B., & of the scientific method, subjective significance Sands, M. (1963). The Feynman lectures on testing is, in my view, a recipe for disaster physics: Mainly mechanics, radiation, and heat. (Knapp & Sawilowsky, 2001). (Note that Vol 1. Reading, MA: Addison-Wesley, 15-5. Carver’s, 1978, 1993, attack is actually against Fisher, R. A. (1930). Inverse hypothesis testing, although he calls it a case probability. Proceedings of the Cambridge against significance testing.) Philosophical Society, 26, 528-535. The second is to abandon the frequentist Fisher, R. A. (1934). Discussion on Dr. approach and conduct a Bayesian analysis. I Neyman’s paper. The Journal of the Royal strongly promote the method of Bayes in Statistical Society, 97, 614-619. 
selecting a pinch hitter in baseball because of the Fisher, R. A. (1935). The fiducial plethora of informative priors. However, in the argument in statistical inference. Annals of absence of definitive objective priors, a Eugenics, 6, 391-398. condition that pervades most of physical, Fisher, R. A. (1937). On a point raised behavioral, and social science, Bayesian by M. S. Bartlett on fiducial probability. Annals methods are not likely to be optimal. of Eugenics, 7, 370-375. The third is to abandon quantitative Fisher, R. A. (1939a). The comparison methodology altogether in favor of qualitative of samples with possibly unequal variances. techniques. I discussed this option elsewhere Annal of Eugenics, 9, 174-180. (Sawilowsky, 1999). Qualitative methods should Fisher, R. A. (1939b). A note on fiducial be used when the research hypothesis is inference. Annals of Mathematical Statistics, 10, qualitative, not because of some perceived 383-388. limitation of a quantitative method in pursuing a Jeffries, H. (1940). Note on the Behrens- quantitative research question. Fisher formula. Annals of Eugenics, 10, 48-51. Joos, G., & Miller, D. (1934, January References 15). Letters to the Editor. Physical Review, 45, 114. Bartlett, M. S. (1936). The information Holton, G. (1969). Einstein, Michelson, available in small samples. Proceedings of the and the ‘crucial’ experiment. Isis, 60 (1969). Cambridge Philosophical Society, 32, 560-566. 474 DECONSTRUCTING THE CASE AGAINST HYPOTHESIS TESTING

Holton, G. (1988), Thematic origins of Sawilowsky, S. (1999). Quasi- scientific thought, Kepler to Einstein. (Revised experimental design: The legacy of Campbell ed.). Cambridge, MA: Harvard University Press. and Stanley. In (Bruno D. Zumbo, Ed.) Social Illingworth, K. K. (1927) A repetition of indicators/quality of life research methods: the M-M experiment using Kennedy’s Methodological developments and issues, refinement. Physics Review, 30, 692-696. Yearbook 1999. Norwell, MA: Kluwer. Knapp, T., & Sawilowsky, S. (2001). Sawilowsky, S. (2003a). A different Constructive criticisms of methodological and future for social and behavioral science research. editorial practices. Journal of Experimental Journal of Modern Applied Statistical Methods, Education, 70, 65-79. 2(1), 128-132. Michelson, A. A. (1881). The relative Sawilowsky, S. (2003b). You think motion of the earth of the luminiferous aether. you’ve got trivials? Journal of Modern Applied American Journal of Science, 22(S3), 120-129. Statistical Methods, 2(1), 218-225. Michelson, A. A., & Morley, E. W. Sawilowsky, S. (2003c). Trivials: The (1887a). On the relative motion of the Earth and birth, sale, and final production of meta-analysis. the luminiferous ether. American Journal of Journal of Modern Applied Statistical Methods, Science, 34(S3), 333-345. 2(1), 242-246. Michelson, A. A., & Morley, E. W. Shankland, R., McCuskey, S. W., (1887b). On the relative motion of the earth and Leone, F.C., & Kuerti, G. (1955). New analysis the luminiferous aether. Philosophical of the interferometer observations of Dayton C. Magazine, 24(151) S5, 449-463. Miller. Review of Modern Physics, 27(2), 167- Miller, D. (1933, July). The ether-drift 178. experiment and the determination of absolute Shankland, R. (1973). Michelson’s role motion of the Earth. Review of Modern Physics, in the development of relativity. Applied Optics, 5(2), 203-242. 12(10), 2280-2287. Múnera, H. A. (1998). Michelson- Starkey, D. M. (1938). A test of the Morley experiments revisited: Systematic errors, significance of the difference between means of consistency among different experiments, and samples from two normal populations without compatibility with absolute space. Apeiron, 5, assuming equal variances. Annals of 37-54. Mathematical Statistics, 9, 201-213. Neyman, J. (1934). On the two different Sukhatme, P. V. (1938). On Fisher and aspects of the representative method: The Behrens’ test of significance of the difference in method of stratified sampling and the method of means of two normal samples. Sankhyā, 4, 39- purposive sampling. The Journal of the Royal 48. Statistical Society, 97, 558-625. Wald, A. (1939). Contributions to the Neyman, J. (1935). On the problem of theory of statistical estimation and testing confidence intervals. Annals of Mathematical hypotheses, Annals of Mathematical Statistics, Statistics, 6, 111-116. 10, 299-326. Neyman, J. (1941). Fiducial argument Wald, A., & Wolfowitz, J. (1939). and the theory of confidence intervals. Confidence limits for continuous distribution Biometrika, 32, 128-150. functions, Annals of Mathematical Statistics, 10, Pitman, E. J. G. (1939). The estimation 105-118. of the location and scale parameters of a Welch, B. L. (1939). On confidence continuous population of any given form. limits and sufficiency, with particular reference Biometrika, 30, 391-421. to parameters of location, Annals of Pytkowski, W. (1932). The dependence Mathematical Statistics, 10, 58-69. of the income in small farms upon their area, the Yates, F. 
(1939). An apparent outlay and the capital invested in cows. Warsaw: inconsistency arising from tesets of significance Bibljoteka Pulawska. based on fiducial distributions of unknown parameters. Proceedings of the Cambridge Philosophical Society, 35, 579-591. Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 475-477 1538 – 9472/03/$30.00

Brief Reports

A Note On MLEs For Normal Distribution Parameters Based On Disjoint Partial Sums Of A Random Sample

W. J. Hurley Royal Military College of Canada

Maximum likelihood estimators are computed for the parameters of a normal distribution based on disjoint partial sums of a random sample. It has application in the disaggregation of financial data.

Introduction

Motivation
The Canadian Forces conducts much of its army individual training at the Combat Training Center (CTC) in eastern Canada. Over the 2001-2002 Training Year, 97 serials (a “serial” is an instance of a “course”) were run for a total of 2008 students. The overall expenditure on ammunition was $28.8 million. The Commander, CTC, was interested in developing a model of the ammunition dollar cost for each type of course in order to help him assess the risk of over-expending his annual ammunition budget for a given slate of serials. At the point of budgetary deliberations for a given fiscal year, the ammunition cost for any serial is uncertain due primarily to uncertain course enrollments, uncertain student failure rates, and uncertain weather (ranges are closed when it gets dry due to the threat of forest fires).

As a first pass, we conceptualized the ammunition cost of a course as a normal random variable. To estimate its mean and variance, it would be reasonable to use historical data. For some courses this is what we did. However, there were some high demand courses where a number of serials were run each year, and unfortunately, ammunition expenditures for these individual serials were aggregated into a single number for the year. The expenditures for individual serials were not tracked. Hence, for these high demand courses, the problem was to estimate the normal distribution parameters using this aggregated data.

With this background in mind, suppose the ammunition cost for a particular course is a normal random variable with mean µ and variance σ². Let

$$X = \{X_1, X_2, \ldots, X_n\}$$

be an iid sample from this distribution. Unfortunately, we cannot observe individual elements of this sample. Rather, we can only observe a sample of disjoint partial sums. Suppose the sample is partitioned into sets S_1, S_2, ..., S_m with cardinalities k_1, k_2, ..., k_m where

$$S_1 \cup S_2 \cup \cdots \cup S_m = X, \qquad S_i \cap S_j = \emptyset \ \text{for all } i \neq j, \qquad k_1 + k_2 + \cdots + k_m = n.$$

Let κ(i) be the set of indices of the elements of S_i. For instance, if S_2 = {X_2, X_3, X_7}, then κ(2) = {2, 3, 7}. Then we observe the set of partial sums, Y = {Y_1, Y_2, ..., Y_m}, where

$$Y_i = \sum_{j \in \kappa(i)} X_j \quad \text{for } i = 1, 2, \ldots, m.$$

Bill Hurley is Professor of Business Administration. His research interests are military operations research, decision analysis, game theory, logistics modeling and the application of OR to problems in sport. Contact him at [email protected]
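This setup is easy to simulate, and doing so is a quick sanity check on the estimators derived in the Solution section below. The following minimal sketch uses assumed parameter values; the variable names and the chosen µ, σ, and group sizes are illustrative, not the author's.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma, n = 240_000, 10_000, 13          # hypothetical course-cost parameters
x = rng.normal(mu, sigma, size=n)           # the unobserved iid sample X

# Disjoint partition into m = 5 groups with cardinalities k_i,
# mimicking yearly aggregation of serial costs.
k = np.array([3, 2, 3, 3, 2])
groups = np.split(x, np.cumsum(k)[:-1])
y = np.array([g.sum() for g in groups])     # the observed partial sums Y_i

# Estimators from the Solution section:
#   mu_hat    = sum(y) / sum(k)
#   sigma2    = (1/m)     * sum((y_i - k_i*ybar)^2 / k_i)   (MLE, biased)
#   sigma2_u  = (1/(m-1)) * sum((y_i - k_i*ybar)^2 / k_i)   (unbiased)
m = len(y)
mu_hat = y.sum() / k.sum()
ss = np.sum((y - k * mu_hat) ** 2 / k)
print(mu_hat, np.sqrt(ss / m), np.sqrt(ss / (m - 1)))
```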


We want to compute MLEs for µ and σ² using Y rather than X. There has been a lot of research on grouped and combined datasets. See, for example, the work of Rao (1973). However, to my knowledge, the estimation problem described above has not been mentioned in the literature.

Solution
Note first that Y_i is normally distributed with mean k_i µ and variance k_i σ². Also, the Y_i are independent since the partial sums are disjoint. Hence the likelihood function is

$$L(\mu, \sigma^2) = \prod_i \frac{1}{\sqrt{2\pi k_i \sigma^2}} \exp\!\left[-\frac{(y_i - k_i \mu)^2}{2 k_i \sigma^2}\right].$$

Maximizing the ln of this likelihood function gives

$$\hat{\mu} = \bar{y} = \frac{\sum_i y_i}{\sum_i k_i} = \frac{\sum_i y_i}{n}$$

and

$$\hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^{m} \frac{(y_i - k_i \bar{y})^2}{k_i}.$$

Note that for the special case m = n (we are working at the level of the iid sample), the last equation returns the usual MLE for variance. As for the properties of these estimators, the MLE for the mean is unbiased,

$$E(\bar{y}) = \mu,$$

but, not surprisingly, the estimator for the variance is biased:

$$E\!\left(\frac{1}{m} \sum_{i=1}^{m} \frac{(Y_i - k_i \bar{Y})^2}{k_i}\right) = \frac{m-1}{m}\,\sigma^2.$$

Hence, the estimator

$$\hat{\sigma}^2 = \frac{1}{m-1} \sum_{i=1}^{m} \frac{(y_i - k_i \bar{y})^2}{k_i}$$

is an unbiased estimate of the variance.

Another aspect of this problem is how to revise these estimates as new data becomes available. At the CTC, this new data will not be aggregated. Suppose the new sample is Z = {Z_1, Z_2, ..., Z_p}. What now are the maximum likelihood estimates (MLEs) of µ and σ² based on Y and Z? The answer is a straightforward application of the previous development. We simply think of Z_i as an additional element of Y having cardinality k = 1. Hence we have that

$$\hat{\mu}^{*}_{Y \cup Z} = \bar{y}^{*} = \frac{\sum_i y_i + \sum_i z_i}{\sum_i k_i + p} = \frac{\sum_i y_i + \sum_i z_i}{n + p}$$

and

$$\hat{\sigma}^{2*}_{Y \cup Z} = \frac{1}{m + p} \left[\sum_{i=1}^{m} \frac{(y_i - k_i \bar{y}^{*})^2}{k_i} + \sum_{j=1}^{p} (z_j - \bar{y}^{*})^2\right].$$

Another Example
Returning to the CTC problem, suppose we have the following data set for a given course:

Fiscal Year   #Serials   Total Ammunition Dollars Expended
2001          3          713,316
2002          2          486,345
2003          3          728,408
2004          3          700,843
2005          2          462,004

The MLEs for the mean and standard deviation are

$$\hat{\mu} = \frac{\sum_i y_i}{\sum_i k_i} = \frac{3{,}090{,}916}{13} = 237{,}763$$


and

$$\hat{\sigma} = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} \frac{(y_i - k_i \bar{y})^2}{k_i}} = 11{,}691,$$

respectively.

Discussion
This analysis suggests that it would be easy to find maximum likelihood estimators for the parameters of other underlying distributions. The main requirement is to identify the distributions of sums of these random variables. An interesting extension would be to calculate maximum likelihood estimators in the case where the partial sums overlapped. In this case the Y_i are no longer independent, and hence the likelihood function is more difficult to calculate.

References

Rao, C. R. (1973). Linear statistical inference and its applications. NY: Wiley.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 478-480 1538 – 9472/03/$30.00

On Treating A Survey Of Convenience Sample As A Simple Random Sample

W. Gregory Thatcher, Department of Health, University of West Florida
J. Wanzer Drane, Department of Epidemiology & Biostatistics, University of South Carolina

Threat of bias has kept many from using data gathered in less than optimal conditions. We maintain that a convenience sample can be beneficial when it represents race and gender at nearly correct proportions, as these two variables are quite often used as stratification variables. We compared a convenience sample with a proven sample. Race and sex proportions in the convenience sample were nearly the same as those found in the proven sample. We conclude that the convenience sample can be used as though it were a simple random sample.

Key words: Simple random sampling, convenience sampling

Introduction

From the first semester of Introduction to Statistics through our careers as scientists, by whatever names, we are warned of sampling and non-sampling errors and how to overcome them. Recently a question was asked: "May I treat my convenience sample as a simple random sample?" To answer the question we employed a sample of known qualities, SCYRBS99, the South Carolina Youth Risk Behavior Survey of 1999.

Representative coverage of gender and race is paramount if the sample is to be instructive when formulating health policy, and we know that SCYRBS99 and earlier YRBS samples are constructed so that the estimates of prevalence among these two variables, as well as others, are nearly unbiased (CDC, 1999). If we can show that the estimates of the percentages of gender and race are nearly the same in the convenience sample as in the weighted estimates of the YRBS sample of the same year, then we can at least increase our confidence in the treatment of our sample as simple random. Such a comparison does not, nor will it ever, PROVE the convenience sample to be totally unbiased and simple random, but it will go a long way toward our believing that the prevalences calculated are nearly unbiased.

Results

The estimates of gender and race prevalence will be compared to those obtained from the SCYRBS99 sample, which are treated as population constants. Tables 1 and 2 display those values together with the estimates from the convenience sample.

Remembering that X² is directly proportional to the sample size, which is 4421 in this case, a chi-square of 9.43 is not large at all. In order to reach a significance of only 0.05, N had to be at least (4421/9.43) × 3.84 ≈ 1800. This is a case in which we have too much power. From an administrative point of view we would require alpha to be equivalent to about four standard errors, or 0.0001. Therefore, we are able to accept a difference of 46.66 − 44.36 = 2.30% as non-significant and administratively not important. Further, we can treat this sample as a simple random sample.

W. Gregory Thatcher, Department of Health, Leisure and Exercise Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514. Phone: 850-474-2598, Fax: 850-474-2106. Email: [email protected]


Table 1: SCMS (Convenience Sample) with expected percentages and numbers obtained from SCYRBS99. Variable = GENDER. Expected F = P(F|SCYRBS99) × 4733. X² = (2409 − 2376.91)²/2376.91 + (2324 − 2356.09)²/2356.09 = 0.87, df = 1, p-value = 0.35.

GENDER   SCMS Percent   SCYRBS99 Percent   Count   Expected number
F        50.90          50.22              2409    2376.91
M        49.10          49.78              2324    2356.09
Total                                      4733    4733

Table 2: SCMS (Convenience Sample) with expected percentages and numbers obtained from SCYRBS99. Variable = RACE. Expected B = P(B|SCYRBS99) × 4733. X² = (1961 − 2022.41)²/2022.41 + (2460 − 2310.65)²/2310.65 + (312 − 399.94)²/399.94 = 30.85, df = 2, p-value = 0.0000002.

RACE   SCMS Percent   SCYRBS99 Percent   Count   Expected number
B      41.43          42.73              1961    2022.41
W      51.98          48.82              2460    2310.65
O      6.59           8.45               312     399.94
Total                                    4733    4733

Table 3: A repeat of Table 2 with the O category excluded. Expected B = P(B|SCYRBS99) × 4421. X² = (1961 − 2062.84)²/2062.84 + (2460 − 2358.16)²/2358.16 = 9.43, df = 1, p-value = 0.0021.

RACE   SCMS Percent   SCYRBS99 Percent   Count   Expected number
B      44.36          46.66              1961    2062.84
W      55.64          53.34              2460    2358.16
Total                                    4421    4421
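The figures in Tables 1-3 are ordinary goodness-of-fit chi-squares and can be reproduced in a few lines. The sketch below is ours (it assumes scipy is available); it checks the Table 3 statistic and the sample-size calculation of roughly 1800 quoted earlier.

```python
import numpy as np
from scipy.stats import chisquare

# Observed SCMS counts for B and W (Table 3) and the SCYRBS99 proportions
observed = np.array([1961, 2460])
expected = np.array([0.4666, 0.5334]) * observed.sum()   # 2062.84, 2358.16

stat, p = chisquare(observed, expected)      # df = k - 1 = 1
print(round(stat, 2), round(p, 4))           # about 9.43 and 0.0021

# Sample size at which the same proportions would just reach p = .05
n_crit = observed.sum() / stat * 3.84        # roughly 1800
```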


Between the female and male distributions the convenience sample is right on target, but the p-value of the chi-square among the three racial groups indicates a noticeable difference. An examination of actual counts versus the expectations shows there is an excess of white students at the expense of those captured as 'O', or other than Black or White. If those are omitted, as usually is the case because of small numbers in more complex analyses, we have the results in Table 3.

Conclusion

The convenience sample has nearly the same gender and racial compositions as is estimated from the SCYRBS99 data. It can then be treated as a simple random sample. For the skeptic or purist, caution should be used when generalizing across racial lines when using the SCMS data. If stratification is made along the four categories (B,F), (B,M), (W,F) and (W,M), estimates within category should be nearly unbiased. From those four strata, comparisons could still be made without hesitation. If you insist on a larger alpha, then the RACE variable should not appear in a regression, linear or logistic, in conjunction with a set of risk and confounder variables.

References

Centers for Disease Control and Prevention. (1999). Division of Adolescent and School Health Youth Risk Behavior Survey, 1999. http://www.cdc.gov/nccdphp/dash/yrbs/

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 481-496 1538 – 9472/03/$30.00

Early Scholars

Conventional And Robust Paired And Independent-Samples t Tests: Type I Error And Power Rates

Katherine Fradette and H. J. Keselman, Department of Psychology, University of Manitoba
Lisa Lix, Department of Community Health Sciences, University of Manitoba
James Algina, Department of Educational Psychology, University of Florida
Rand R. Wilcox, Department of Psychology, University of Southern California

Monte Carlo methods were used to examine Type I error and power rates of 2 versions (conventional and robust) of the paired and independent-samples t tests under nonnormality. The conventional (robust) versions employed least squares means and variances (trimmed means and Winsorized variances) to test for differences between groups.

Key words: Paired t test, independent t test, robust methods, Monte Carlo methods

Introduction

It is well known that the paired-samples t test has more power to detect a difference between the means of two groups as the correlation between the groups becomes larger. That is, as the population correlation coefficient, ρ, increases, the standard error of the difference between the means gets smaller, which in turn increases the magnitude of the t statistic (Kirk, 1999). Equation 1, the population variance of the difference between mean values, demonstrates how the standard error of the difference between the means (σ_{X̄1−X̄2}) is reduced as the value of ρ increases:

\sigma^2_{\bar{X}_1 - \bar{X}_2} = \sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2} - 2\rho\,\sigma_{\bar{X}_1}\sigma_{\bar{X}_2},   (1)

where \sigma^2_{\bar{X}_j} = \sigma^2_j / n_j is the population variance of the mean for group j (j = 1, 2).

It must be kept in mind, however, that the independent-samples t test has twice the degrees of freedom of the paired-samples t test. Generally, an increase in degrees of freedom is accompanied by an increase in power. Thus, considering the loss of degrees of freedom for the paired-samples test, there is the question of just how large ρ must be in order for the paired-samples t test to achieve more power than the independent-samples t test.

Katherine Fradette ([email protected]) is a graduate student in the Department of Psychology. H. J. Keselman is a Professor of Psychology, email: [email protected]. Lisa Lix ([email protected]) is an Assistant Professor of Community Health Sciences. James Algina ([email protected]) is a Professor of Educational Psychology. Rand R. Wilcox ([email protected]) is a Professor of Psychology.
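As a quick numeric check of Equation 1 (the values below are ours, chosen only for illustration), with equal variances and n_j = 20 per group the standard error of the mean difference falls steadily as ρ increases:

```python
import numpy as np

# Equation 1 with sigma_1 = sigma_2 = 1 and n_1 = n_2 = 20
s2 = 1.0 / 20
for rho in (0.0, 0.25, 0.5):
    se = np.sqrt(s2 + s2 - 2 * rho * np.sqrt(s2) * np.sqrt(s2))
    print(rho, round(se, 3))   # 0.316, 0.274, 0.224
```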

Vonesh (1983) demonstrated that the paired-samples t test is more powerful than the independent-samples test when the correlation between the groups is .25 or larger. Furthermore, Zimmerman (1997) observed that many authors recommend the paired-samples t test only if "the two groups are highly correlated" and recommend the independent samples test if "they are uncorrelated or only slightly correlated" (p. 350). Zimmerman argued, however, that such authors often fail to take into account an important consequence of the use of the independent t test on dependent observations. Namely, Zimmerman (1997) noted that the independence assumption is violated when the independent-samples t test is performed on groups that are correlated, even to a very small degree, and such a violation of the independence assumption distorts both Type I and Type II error rates.

Zimmerman (1997) compared the Type I error and power performance of the paired and independent-samples t tests for normally distributed data, varying the magnitude of ρ. He found that a correlation as small as .1 seriously distorted Type I error rates of the independent-samples t test. Thus, according to Zimmerman, the practice of employing the independent-samples t test when groups are slightly correlated fails to protect against distortion of the significance level, and he concluded that "a correlation coefficient of .10 or .15 is not sufficient evidence of independence, not even for relatively small sample sizes" (p. 359). Zimmerman also demonstrated an example in which, even when the correlation between two groups was as low as .1, the paired t test was more powerful than the independent-samples t test. Consequently, contrary to the recommendations of the authors he cites (e.g., Edwards, 1979; Hays, 1988; Kurtz, 1965), Zimmerman advocates the use of the paired-samples t test even when groups are only correlated to a very small degree (i.e., .1), when distributions are normal.

The question regarding how large ρ should be in order for the paired-samples t test to achieve more power than the independent-samples t test when data are not normally distributed has not been examined (Wilcox, 2002). Evaluating the performance of statistics under nonnormality is important, given that psychological data are often not normal in shape (Micceri, 1989; Wilcox, 1990). Hence, the goal of this study was to extend Zimmerman's (1997) work by examining the Type I error and power rates of both the paired-samples and the independent-samples t tests when distributions were nonnormal, again varying the magnitude of ρ.

An investigation of the performance of both the paired and independent-samples t tests under nonnormality raises a problem, however. Both tests assume normally distributed data in the population. Violation of the normality assumption leads to distortion of Type I error rates and can lead to a loss of power to detect a difference between the means (MacDonald, 1999; Wilcox, 1997). Thus, in addition to an examination of the performance of the conventional (least squares) versions of the paired and independent-samples t tests, the performance of a robust version of each of the tests was also investigated.

The robust versions of the paired and independent-samples t tests involve substituting robust measures of location and scale for their least squares counterparts. Specifically, the robust versions of the tests substitute trimmed means for least squares means, and Winsorized variances for least squares variances. Calculation of the trimmed mean, which is defined later in Equation 7, involves trimming a specified percentage of the observations from each tail of the distribution (for symmetric trimming), and then computing the average of the remaining observations. The Winsorized variance, which is defined later in Equation 8, is computed by first Winsorizing the observations (see Equation 5), which also involves removing the specified percentage of observations from each end of the distribution. However, in this case the eliminated observations are replaced with the smallest and largest observation not removed from the left and right side of the distribution, respectively. The Winsorized variance is then computed in the same manner as the conventional least squares variance, using the set of Winsorized observations.

Numerous studies have shown that, under nonnormality, replacing least squares means and variances with trimmed means and Winsorized variances leads to improved Type I error control and power rates for independent groups designs (e.g., Keselman, Kowalchuk & Lix, 1998; Keselman, Wilcox, Kowalchuk & Olejnik, 2002; Lix & Keselman, 1998; Yuen, 1974), as well as dependent groups designs (e.g., Keselman, Kowalchuk, Algina, Lix & Wilcox, 2000; Wilcox, 1993). In particular, Yuen (1974) was the first to propose that trimmed means and Winsorized variances be used with Welch's (1938) heteroscedastic statistic in order to test for differences between two independent groups, when distributions are nonnormal and variances are unequal. Thus, Yuen's method helps to protect against the consequences of violating the normality assumption and is designed to be robust to variance heterogeneity. Yuen's method reduces to Welch's (1938) heteroscedastic method when the percentage of trimming is zero (Wilcox, 2002). Yuen's method can also be extended to dependent groups.

It is important to note that while the conventional paired and independent-samples t statistics are used to test the hypothesis that the population means are equal (H_0: µ_1 = µ_2), the robust versions of the tests examine the hypothesis that the population trimmed means are equal (H_0: µ_{t1} = µ_{t2}). Although the robust versions of the procedures are not testing precisely the same hypotheses as their conventional counterparts, both the robust and conventional versions test the hypothesis that measures of the typical score are equal. In fact, according to many researchers, the trimmed mean is a better measure of the typical score than the least squares mean, when distributions are skewed (e.g., Keselman et al., 2002).

This study compared (a) the conventional (i.e., least squares means and variances) paired-samples t test, (b) the conventional independent-samples t test, (c) the robust (trimmed means and Winsorized variances) paired-samples t test, and (d) the robust independent-samples t test, based on their empirical rates of Type I error and power. As in Zimmerman's (1997) study with normal data, it was expected that as the size of the correlation between the groups increased, both the conventional and robust versions of the paired-samples t tests would perform better than their independent-samples counterparts, in terms of their ability to maximize power while maintaining empirical Type I error rates close to the nominal α level. It was also expected, based on previous findings (e.g., Keselman, et al., 1998; Keselman, et al., 2000; Keselman et al., 2002; Lix et al., 1998; Wilcox, 1993; Yuen, 1974), that the robust versions of both the paired and independent-samples t tests would perform better in terms of Type I error and power rates than the corresponding conventional versions.

Methodology

Definition of the Test Statistics

Conventional Methods

Suppose that n_j observations, X_{1j}, X_{2j}, ..., X_{n_j j}, are sampled from population j (j = 1, 2). In order to compute the conventional independent-samples t test, let X̄_j = ∑_i X_{ij} / n_j be the jth sample mean (i = 1, ..., n_j; N = ∑_j n_j). Also let S²_j = ∑_i (X_{ij} − X̄_j)² / (n_j − 1) be the jth sample variance. The estimate of the common (i.e., pooled) variance is

S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.   (2)

The test statistic for the conventional independent-samples t test is

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}},   (3)

which is distributed as a t variable with ν = n_1 + n_2 − 2 degrees of freedom, assuming normality and homogeneity of variances.

In order to compute the conventional paired-samples t test, which assumes that the two groups are dependent, let S²_{X̄j} = S²_j / n_j, where S_{X̄j} is the estimate of the standard error of the mean of group j. An estimate of the correlation between the two groups is also needed to compute the paired-samples t statistic. The correlation is defined as r = S_{12} / (S_1 S_2), where S_{12} = ∑_i (X_{i1} − X̄_1)(X_{i2} − X̄_2) / (n − 1), and n represents the total number of pairs. The paired-samples test statistic is

T_{(PAIRED)} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_{\bar{X}_1}^2 + S_{\bar{X}_2}^2 - 2 r S_{\bar{X}_1} S_{\bar{X}_2}}},   (4)

which is distributed as a t variable with ν = n − 1 degrees of freedom, assuming normality.
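Both conventional statistics are available in standard software. The minimal sketch below (our code; the simulated data are illustrative only) uses scipy.stats, whose ttest_ind routine computes the pooled-variance statistic of Equations 2-3 and whose ttest_rel routine is algebraically equivalent to Equation 4.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 20)
x2 = rng.normal(0.5, 1.0, 20)

# Conventional independent-samples t test (Equations 2-3)
t_ind, p_ind = stats.ttest_ind(x1, x2, equal_var=True)

# Conventional paired-samples t test (equivalent to Equation 4)
t_pair, p_pair = stats.ttest_rel(x1, x2)
```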

Robust Methods

Suppose, again, that n_j observations, X_{1j}, X_{2j}, ..., X_{n_j j}, are sampled from population j. For both the independent-samples and paired-samples t tests, first let X_{(1)j} ≤ X_{(2)j} ≤ ... ≤ X_{(n_j)j} be the ordered observations of group j, and let γ be the percentage of observations that are to be trimmed from each tail of the distribution. Also let g_j = [γ n_j], where [x] is the largest integer ≤ x. To calculate the robust versions of both statistics we must first Winsorize the observations by letting

Y_{ij} = X_{(g_j + 1)j}    if X_{ij} ≤ X_{(g_j + 1)j}
      = X_{ij}             if X_{(g_j + 1)j} < X_{ij} < X_{(n_j − g_j)j}
      = X_{(n_j − g_j)j}   if X_{ij} ≥ X_{(n_j − g_j)j}.   (5)

The sample Winsorized mean is defined as

\bar{Y}_{Wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij}.   (6)

The sample trimmed mean for the jth group is also required to compute the robust versions of the paired and independent-samples t tests and is defined as

\bar{X}_{tj} = \frac{1}{h_j} \sum_{i = g_j + 1}^{n_j - g_j} X_{(i)j},   (7)

where h_j = n_j − 2g_j. The sample Winsorized variance for the robust independent-samples t test is

S_{Wj}^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_{Wj})^2,   (8)

where Y_{ij} and \bar{Y}_{Wj} are defined in Equations 5 and 6, respectively. Finally, let

d_j = \frac{(n_j - 1) S_{Wj}^2}{h_j (h_j - 1)}.   (9)

Then the robust independent-samples t test is

T_Y = \frac{\bar{X}_{t1} - \bar{X}_{t2}}{\sqrt{d_1 + d_2}},   (10)

which is approximately distributed as a t variable with degrees of freedom

\nu_Y = \frac{(d_1 + d_2)^2}{\frac{d_1^2}{h_1 - 1} + \frac{d_2^2}{h_2 - 1}}.   (11)

To compute the robust paired-samples t test, as enumerated by Wilcox (2002), the paired observations must first be Winsorized, as in Equation 5. It is important to note that when Winsorizing the observations for the paired-samples t statistic, care must be taken to maintain the original pairing of the observations. The sample size for the robust version of the paired-samples t test is h = n − 2g, where n is the total number of pairs. Let

d_j = \frac{1}{h(h - 1)} \sum_i (Y_{ij} - \bar{Y}_{Wj})^2,   (12)

and

d_{12} = \frac{1}{h(h - 1)} \sum_i (Y_{i1} - \bar{Y}_{W1})(Y_{i2} - \bar{Y}_{W2}),   (13)

where Y_{ij} and \bar{Y}_{Wj} are defined in Equations 5 and 6, respectively. The test statistic for the robust paired-samples t test is

T_{Y(PAIRED)} = \frac{\bar{X}_{t1} - \bar{X}_{t2}}{\sqrt{d_1 + d_2 - 2 d_{12}}},   (14)

which is approximately distributed as a t variable with ν = h − 1 degrees of freedom.

Simulation Procedures

Empirical Type I error and power rates were collected for the conventional and robust versions of the paired and independent-samples t tests using a Monte Carlo procedure. Thus, a total of four tests were investigated: (a) the conventional paired-samples t test, (b) the conventional independent-samples t test, (c) the robust paired-samples t test, and (d) the robust independent-samples t test. Two-tailed tests were performed on each of the four procedures.

Four variables were manipulated in the study: (a) sample size, (b) magnitude of the population correlation coefficient, (c) magnitude of the difference between groups, and (d) population distribution. Following Zimmerman (1997), four sample sizes (N) were investigated: 10, 20, 40, and 80, and population correlations (ρ) ranging from −.5 to .5, in increments of .1, were induced.

The difference in the mean (trimmed mean) value for the two populations was also manipulated. When empirical Type I error rates were investigated, there was no difference between the groups. When empirical power rates were investigated, three values of the effect size were investigated; the difference between the groups was set at .25, .5, and .75. These values were chosen in order to avoid ceiling and floor effects, a practice that has been employed in other studies (e.g., Keselman, Wilcox, Algina, Fradette, & Othman, 2003).

There were two population distribution conditions. Data for both groups were generated either from an exponential distribution or a chi-squared distribution with one degree of freedom (χ²₁). Skewness and kurtosis values for the exponential distribution are γ₁ = 2 and γ₂ = 6, respectively. Skewness and kurtosis values for the χ²₁ distribution are γ₁ = √8 and γ₂ = 12, respectively.

For the robust versions of both the paired and the independent-samples t tests, the percentage of trimming was 20%; thus, 20% of the observations from each tail of the distribution were removed. This proportion of trimming was chosen because it has been used in other studies (e.g., Keselman et al., 1998; Keselman et al., 2000; Keselman et al., 2002; Lix et al., 1998) and because 20% trimming has previously been recommended (e.g., Wilcox, 1997).

In order to generate the data for each condition, the method outlined in Headrick and Sawilowsky (1999) for generating correlated multivariate nonnormal distributions was used. First, the SAS generator RANNOR (SAS Institute, 1989) was used to generate pseudo-random normal variates, Z_i (i = 1, ..., N). Next, the Z_i s were modified using the algorithm

Y_{ij} = r Z_i + \sqrt{1 - r^2}\, E_{ij},   (15)

where the E_{ij} s are pseudo-random normal variates. In the case of this study, the E_{ij} s were also generated by the SAS generator RANNOR. The variable r is determined as in Headrick and Sawilowsky (1999), and is dependent on the final desired population correlation (ρ). Both Y_{i1} and Y_{i2} are random normal deviates with a correlation of r². Finally, the Y_{ij} s generated for the study were further modified in order to obtain nonnormally distributed observations, via the algorithm

Y^*_{ij} = a + b Y_{ij} + (-a) Y_{ij}^2 + d Y_{ij}^3,   (16)

where a, b, and d are constants that depend on the desired values of skewness (γ₁) and kurtosis (γ₂) of the distribution, and can be determined by solving equations found in Fleishman (1978, p. 523). The resultant Y^*_{ij} s are nonnormal deviates with zero means and unit variances, and are correlated to the desired level of ρ, which is specified when determining r.

Observations with mean µ_j (or µ_{tj}) and variance σ²_j were obtained via X_{ij} = µ_j + σ_j × Y^*_{ij}. The means (trimmed means) varied depending on the desired magnitude of the difference between the two groups. In order to achieve the desired difference, constants were added to the observations in each group. The value of the constants, corresponding to each of the four difference conditions investigated, were (a) 0, 0, (b) .25, 0, (c) .5, 0, and (d) .75, 0. These values were added to each observation in the first and second group, respectively. Thus, µ_j (µ_{tj}) represents the value of the constants corresponding to a given desired difference. Variances were set to σ²_j = 1 in all conditions.

When using trimmed means, the empirically determined population trimmed mean µ_t was subtracted from the Y^*_{ij} variates before multiplying by σ_j (see Keselman et al., 2002, for further discussion regarding the generation of variates to be used with trimming). Ten thousand replications of the data generation procedure were performed for each of the conditions studied.

Results

Type I Error Rates

Each of the four investigated tests was evaluated based on its ability to control Type I errors under conditions of nonnormality. In the case of the two versions of the independent-samples t tests, the independence assumption was also violated when ρ was not equal to zero. In order for a test to be considered robust, its empirical rate of Type I error (α̂) had to be contained within Bradley's (1978) liberal criterion of robustness: 0.5α ≤ α̂ ≤ 1.5α. Hence, for this study, in which a five percent nominal significance level was employed, a test was considered robust in a particular condition if its empirical rate of Type I error fell within the .025 − .075 interval. A test was considered to be nonrobust in a particular condition if α̂ fell outside of this interval. Tables 1 and 2 display the range of Type I errors made by each of the investigated tests across all sample sizes (N = 10, 20, 40, 80), as a function of ρ. We felt it was acceptable to enumerate a range across all sample sizes investigated because at all values of N, a similar pattern of results was observed.
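Before turning to the tables, the robust statistics applied in each replication (Equations 5-11) can be computed directly. The sketch below is ours, with 20% trimming as in the study; the paired version of Equations 12-14 follows the same pattern with d_1 + d_2 replaced by d_1 + d_2 − 2d_12 and h − 1 degrees of freedom.

```python
import numpy as np
from scipy.stats import t as t_dist

def trim_stats(x, gamma=0.20):
    """Trimmed mean, Winsorized variance and h = n - 2g (Equations 5-8)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    g = int(gamma * n)                       # g = [gamma * n]
    h = n - 2 * g
    trimmed_mean = x[g:n - g].mean()         # Equation 7
    w = x.copy()                             # Winsorize (Equation 5)
    w[:g] = x[g]
    w[n - g:] = x[n - g - 1]
    win_var = w.var(ddof=1)                  # Equation 8
    return trimmed_mean, win_var, h

def yuen_independent(x1, x2, gamma=0.20):
    """Yuen's (1974) robust independent-samples test (Equations 9-11)."""
    parts = []
    for x in (x1, x2):
        mt, sw2, h = trim_stats(x, gamma)
        d = (len(x) - 1) * sw2 / (h * (h - 1))       # Equation 9
        parts.append((mt, d, h))
    (mt1, d1, h1), (mt2, d2, h2) = parts
    ty = (mt1 - mt2) / np.sqrt(d1 + d2)              # Equation 10
    df = (d1 + d2) ** 2 / (d1**2 / (h1 - 1) + d2**2 / (h2 - 1))  # Equation 11
    p = 2 * t_dist.sf(abs(ty), df)
    return ty, df, p
```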

Table 1: Range of Proportion of Type I Errors for All Tests Under the Exponential Distribution

Exponential Distribution

Rho (ρ) Conventional Procedure Robust Procedure

Independent Paired Independent Paired

-0.5 .116 - .143 .060 - .093 .103 - .108 .051 - .057

-0.4 .100 - .128 .056 - .085 .092 - .099 .052 - .054

-0.3 .089 - .116 .055 - .083 .078 - .092 .047 - .054

-0.2 .081 - .108 .059 - .086 .070 - .080 .049 - .057

-0.1 .071 - .091 .059 - .078 .062 - .067 .049 - .053

0 .042 - .048 .039 - .049 .038 - .046 .035 - .045

0.1 .035 - .043 .042 - .053 .031 - .038 .031 - .049

0.2 .025 - .029 .044 - .050 .024 - .031 .030 - .052

0.3 .019 - .021 .042 - .053 .017 - .021 .028 - .048

0.4 .011 - .012 .039 - .052 .012 - .016 .03 - .044

0.5 .006 - .007 .04 - .047 .006 - .01 .028 - .045


Table 2: Range of Proportion of Type I Errors for All Tests Under the χ²₁ Distribution

Chi-Squared Distribution (χ²₁)

Rho (ρ) Conventional Procedure Robust Procedure

Independent Paired Independent Paired

-0.5 .120 - .171 .068 - .129 .102 - .107 .056 - .073

-0.4 .100 - .161 .060 - .125 .090 - .096 .056 - .068

-0.3 .093 - .145 .063 - .118 .079 - .089 .051 - .066

-0.2 .087 - .135 .067 - .114 .070 - .082 .052 - .067

-0.1 .075 - .114 .064 - .102 .063 - .068 .052 - .058

0 .038 - .046 .034 - .046 .026 - .045 .025 - .042

0.1 .031 - .041 .033 - .049 .023 - .036 .022 - .042

0.2 .026 - .029 .035 - .046 .020 - .030 .023 - .044

0.3 .020 - .021 .033 - .052 .018 - .023 .023 - .043

0.4 .011 - .015 .035 - .051 .015 - .018 .022 - .046

0.5 .006 - .011 .035 - .045 .009 - .013 .020 - .042

Table 1 displays the range of empirical Type I error rates for each test, as a function of ρ, under the exponential distribution condition. It is apparent from the table that both versions of the paired-samples t test maintained Type I errors near the nominal level of significance, α. In fact, only 6 of 44 values fell outside the range of Bradley's .025 − .075 interval for the conventional paired t test; none did for the robust paired t test. Thus, for data that follow an exponential distribution, the robust paired t test was insensitive to nonnormality at every value of ρ. A comparison of the conventional and robust versions of the paired t test in Table 1 reveals that, in particular, the robust version was more effective at controlling Type I errors when the population correlation (ρ) between the groups was negative.

Table 1 also shows that the independent-samples tests were not as robust, overall, as their paired-samples counterparts. In fact, the total number of values that fell outside of the range of Bradley's liberal criterion was 30 and 26 (out of 44) for the conventional and robust versions of the independent t test, respectively. Thus, the robust independent t test was indeed slightly more robust, overall, than the conventional independent t test. Both versions of the independent-samples t test were effective at controlling Type I errors when the population correlation (ρ) was zero; however, this control was reduced the more that ρ deviated from zero.

An inspection of Table 2, which displays the range of Type I errors for the tests for the χ²₁ distribution, reveals a pattern of results similar to that for the exponential distribution. However, all of the tests were somewhat less robust under the χ²₁ distribution than the exponential distribution condition. That is, nonrobust liberal values were greater in value for χ²₁ data than for exponentially distributed data. Specifically, the total number of values that fell outside of Bradley's liberal interval for the conventional versions of the paired and independent-samples t tests were 12 and 31 (out of 44), respectively. The total number of nonrobust values for the robust versions of the paired and independent-samples t tests were five and 28, respectively.

Power Rates

The four tests were also evaluated based on empirical power rates. Therefore, each test was judged on its ability to detect a true difference between the trimmed means of the groups (in the case of the robust tests), or the least squares means of the groups (in the case of the conventional tests). Figures 1, 2, and 3 display the power of each of the investigated tests to detect a true difference between the (trimmed) means of the groups, as a function of the magnitude of the difference between the (trimmed) means. The results portrayed in these figures were averaged over all sample sizes. While the power rates of the tests increased as the size of N increased, again, we felt it was acceptable to collapse over the sample size conditions because the tests showed a similar pattern of results in relation to one another for all values of N.


[Figure 1 appears here: two line-graph panels, "Exponential Distribution" (D = 0) and "Chi-Squared Distribution (One df)" (D = 0), each plotting Probability of Rejecting Null Hypothesis against Magnitude of the Difference Between (Trimmed) Means (0 to 1) for the Conventional Paired, Conventional Independent, Robust Paired, and Robust Independent tests.]

Figure 1. Probability of rejecting H0 for the conventional and robust paired and independent-samples t tests; ρ = 0.


[Figure 2 appears here: two line-graph panels, "Exponential Distribution" (D = 0.3) and "Chi-Squared Distribution (One df)" (D = 0.3), each plotting Probability of Rejecting Null Hypothesis against Magnitude of the Difference Between (Trimmed) Means (0 to 1) for the Conventional Paired, Conventional Independent, Robust Paired, and Robust Independent tests.]

Figure 2. Probability of rejecting H0 for the conventional and robust paired and independent-samples t tests; ρ = 0.3.


[Figure 3 appears here: two line-graph panels, "Exponential Distribution" (D = -0.3) and "Chi-Squared Distribution (One df)" (D = -0.3), each plotting Probability of Rejecting Null Hypothesis against Magnitude of the Difference Between (Trimmed) Means (0 to 1) for the Conventional Paired, Conventional Independent, Robust Paired, and Robust Independent tests.]

Figure 3. Probability of rejecting H0 for the conventional and robust paired and independent-samples t tests; ρ = −0.3.


Figure 1 displays the power rates of the tests for both the χ²₁ and the exponential distributions when ρ = 0. The upper portion of the figure reveals that when data followed an exponential distribution, the power functions of the four tests were quite similar, with the empirical power of the robust versions only slightly higher than the corresponding power of the conventional versions. However, an inspection of the lower portion of Figure 1 indicates that under the χ²₁ distribution, the power functions of the robust tests were considerably higher than those of both conventional versions. In addition, Figure 1 shows that when no correlation existed between the groups, the power functions of the independent-samples t tests were slightly higher than their paired-samples counterparts.

Figure 2 shows the power functions of the tests for both the χ²₁ and exponential distributions when ρ = .3. The upper portion of Figure 2 indicates that when the data were exponentially distributed and positively correlated, the power functions of both versions of the paired-samples t test were higher than those of the independent-samples tests. The lower portion of the figure, which displays power for the χ²₁ distribution for this same value of ρ, demonstrates that while the power function of each of the paired-samples t tests was higher than its respective independent-samples counterpart, the power rates of both robust tests were higher than those of the conventional tests.

Figure 3 displays the power rates of the tests for the χ²₁ and exponential distributions when ρ = −.3. Unlike the results obtained for positively correlated data, the paired-samples t tests showed no apparent power advantage over the independent-samples t tests when the groups were negatively correlated, for either the exponential or the χ²₁ distributions. In fact, the figure shows that the power functions of the independent-samples t tests were higher than their paired-samples counterparts under both distributions. The lower portion of Figure 3 shows that under the χ²₁ distribution, while the power functions of both versions of the independent-samples t test were higher than their corresponding versions of the paired-samples test, the power rates of both robust tests were higher than the conventional tests, as was the case with the other levels of ρ.

Conclusion

Four different statistics for testing the difference between two groups were investigated based on their power to detect a true difference between two groups and their ability to control Type I errors. The primary objective for conducting the study was to determine which of the tests would perform best when the data for the two groups were correlated and the assumption of a normal distribution of the responses was violated.

Although empirical Type I error and power rates are two separate measures of a test's effectiveness, in order to evaluate the overall performance of the investigated procedures, power and Type I error rates must be considered concomitantly. The reason for this is that if a test does not maintain the rate of Type I errors at or around the nominal α level, this can cause a distortion in power. Figures 4 and 5 provide a summary of the results for the exponential and χ²₁ distributions, respectively. These figures were included to allow the reader to easily examine the Type I error and power rates of each of the distributions concurrently.


[Figure 4 appears here: two line-graph panels under the Exponential Distribution, "Difference Between (Trimmed) Means = 0" and "Difference Between (Trimmed) Means = 0.5", each plotting Probability of Rejecting the Null Hypothesis against Population Correlation (−0.6 to 0.6) for the Conventional Paired, Conventional Independent, Robust Paired, and Robust Independent tests.]

Figure 4. Probability of rejecting H0 as a function of ρ and the magnitude of the difference between (trimmed) means for the conventional and robust paired and independent-samples t tests; exponential distribution.


[Figure 5 appears here: two line-graph panels under the Chi-Squared Distribution (One df), "Difference Between (Trimmed) Means = 0" and "Difference Between (Trimmed) Means = 0.5", each plotting Probability of Rejecting the Null Hypothesis against Population Correlation (−0.6 to 0.6) for the Conventional Paired, Conventional Independent, Robust Paired, and Robust Independent tests.]

Figure 5. Probability of rejecting H0 as a function of ρ and the magnitude of the difference between (trimmed) means for the conventional and robust paired and independent-samples t tests under the χ²₁ distribution.


As the results indicated, the only time the independent tests maintained the Type I error rate close to the nominal level was when there was no correlation between the groups; this ability grew worse as ρ got larger. In fact, the Type I error control of the independent t tests began to break down when the correlation between the groups was as small as ±.1. Thus, with the exception of the ρ = 0 condition, both the robust and the conventional versions of the independent t test were quite poor at controlling Type I errors. Because of this distortion of the Type I error rate, the powers of the independent tests are not interpretable (Zimmerman, 1997) when ρ is not equal to zero.

Both versions of the paired t test, however, did a much better job of controlling Type I errors than their independent-samples counterparts when there was a correlation between the groups, for nonnormal data. Because the paired-samples t tests maintained Type I errors close to the nominal level, the empirical power rates of the paired t tests, unlike those of the independent tests, can be taken to accurately represent their ability to detect a true difference between the groups. Thus, as expected, when power and Type I error rates are both taken into account, it can be said that the paired t tests were more effective than their independent-samples counterparts when groups were correlated, even when this correlation was low (i.e., ±.1). This finding agrees with Zimmerman's (1997) results for normally distributed data.

Furthermore, the robust paired-samples t test was more effective, in terms of Type I error control, than the conventional paired test. The robust paired test was also consistently more powerful than the conventional version, and this power advantage increased as skewness and kurtosis in the population increased. Therefore, as expected, the robust version of the paired-samples t test performed better than the conventional version of the test, for nonnormal data. This result is supported by many other studies involving trimmed means and Winsorized variances (e.g., Keselman, et al., 1998; Keselman, et al., 2000; Keselman et al., 2002; Lix et al., 1998; Wilcox, 1993; Yuen, 1974).

In conclusion, there need only be a small positive or negative correlation between two groups in order for the paired t test to be more effective than the independent t test when the data are nonnormal. In fact, although Vonesh (1983) showed that there needs to be a correlation of at least .25 in the population for the paired t test to be more powerful than the independent test, when the distortion of Type I error rates, resulting from the application of the independent-samples t test on dependent data, was taken into account, the paired-samples t tests performed best when the correlation was as low as ±.1. Thus, just as Zimmerman (1997) cautions when dealing with normal data, researchers should take care to ensure that their data are not correlated in any way when using the independent t test on nonnormal data, lest the existence of even a slight dependence alter the significance level of the test. In addition, given that the population distributions were not normal in shape, the robust version of the paired t test performed the best under all the conditions that were studied. Thus, based on the results of this investigation, it is recommended that researchers use the robust paired-samples t test, which employs trimmed means and Winsorized variances, when dealing with nonnormal data.

References

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152.

Edwards, A. L. (1979). Multiple regression and the analysis of variance and covariance. New York: Freeman.

Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521-532.

Hays, W. L. (1988). Statistics (4th ed.). New York: Holt, Rinehart, & Winston.

Headrick, T. C., & Sawilowsky, S. S. (1999). Simulating correlated multivariate nonnormal distributions: Extending the Fleishman power method. Psychometrika, 64, 25-35.


Keselman, H. J., Kowalchuk, R. K., Algina, J., Lix, L. M., & Wilcox, R. R. (2000). Testing treatment effects in repeated measures designs: Trimmed means and bootstrapping. British Journal of Mathematical and Statistical Psychology, 53, 175-191.

Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses revisited: An update based on trimmed means. Psychometrika, 63, 145-163.

Keselman, H. J., Wilcox, R. R., Algina, J., Fradette, K., & Othman, A. R. (2003). A power comparison of robust test statistics based on adaptive estimators. Submitted for publication.

Keselman, H. J., Wilcox, R. R., Kowalchuk, R. K., & Olejnik, S. (2002). Comparing trimmed or least squares means of two independent skewed populations. Biometrical Journal, 44, 478-489.

Kirk, R. E. (1999). Statistics: An introduction. Orlando, FL: Harcourt Brace.

Kurtz, K. H. (1965). Foundations of psychological research. Boston: Allyn & Bacon.

Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of location equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58, 409-429.

MacDonald, P. (1999). Power, Type I error, and Type III error rates of parametric and nonparametric statistical tests. Journal of Experimental Education, 67, 367-380.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.

SAS Institute Inc. (1989). SAS/IML software: Usage and reference, version 6 (1st ed.). Cary, NC: Author.

Vonesh, E. F. (1983). Efficiency of repeated measures designs versus completely randomized designs based on multiple comparisons. Communications in Statistics A: Theory and Methods, 12, 289-301.

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350-362.

Wilcox, R. R. (1990). Comparing the means of two independent groups. Biometrical Journal, 36, 259-273.

Wilcox, R. R. (1993). Analysing repeated measures or randomized block designs using trimmed means. British Journal of Mathematical and Statistical Psychology, 46, 63-76.

Wilcox, R. R. (1997). Introduction to robust estimation and testing. San Diego: Academic Press.

Wilcox, R. R. (2002). Applying contemporary statistical methods. San Diego: Academic Press.

Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61, 165-170.

Zimmerman, D. W. (1997). A note on the interpretation of the paired-samples t test. Journal of Educational and Behavioral Statistics, 22, 349-360.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 497-511 1538 – 9472/03/$30.00

Fitting Generalized Linear Mixed Models For Point-Referenced Spatial Data

Armin Gemperli and Penelope Vounatsou, Swiss Tropical Institute, Basel, Switzerland

Non-Gaussian point-referenced spatial data are frequently modeled using generalized linear mixed models (GLMM) with location-specific random effects. Spatial dependence can be introduced in the covariance matrix of the random effects. Maximum likelihood-based or Bayesian estimation implemented via Markov chain Monte Carlo (MCMC) for such models is computationally demanding especially for large sample sizes because of the large number of random effects and the inversion of the covariance matrix involved in the likelihood. We review three fitting procedures, the Penalized Quasi Likelihood method, the MCMC, and the Sampling-Importance-Resampling method. They are assessed in terms of estimation accuracy, ease of implementation, and computational efficiency using a spatially structured dataset on infant mortality from Mali.

Key words: Geostatistics, infant mortality, kriging, Markov chain Monte Carlo (MCMC), penalized quasi likelihood (PQL), risk mapping, sampling-importance-resampling (SIR)

Introduction

Point-referenced spatial data arise from observations collected at geographical locations over a fixed continuous space. Proximity in space introduces correlations between the observations, rendering the independence assumption of standard statistical methods invalid. Ignoring spatial correlation will result in underestimation of the standard error of the parameter estimates, and therefore liberal inference, as the null hypothesis is rejected too often. A wide range of analytical tools within the field of geostatistics has been developed, concerned with the description and estimation of spatial patterns, the modeling of data in the presence of spatial correlation, and kriging, that is, spatial prediction at unobserved locations.

Statistical inference for point-referenced data often assumes that the observations arise from a Gaussian spatial stochastic process and introduces covariate information, and possibly a trend surface specification, on the mean structure, while spatial correlation is placed on the variance-covariance matrix Σ of the process. Under second order stationarity, Σ determines the well-known variogram. When isotropy is also assumed, the elements of Σ are modeled by parametric functions of the separation between the corresponding locations. For non-Gaussian data, the spatial correlation is modeled on the covariance structure of location-specific random effects introduced into the model and assumed to arise from a Gaussian stationary spatial process.

For Gaussian data, the generalized least squares (GLS) approach can be used iteratively to obtain estimates β̂ of the regression coefficients conditional on the covariance parameters. The covariance parameters θ can be estimated conditional on β̂ by fitting the semivariogram empirically or by maximum likelihood or restricted maximum likelihood methods (Zimmerman and Zimmerman, 1991).

Armin Gemperli is completing his PhD at the Biostatistics Unit. Penelope Vounatsou is Senior Statistician. We are grateful for assistance from Macro International Inc., and acknowledge discussions with Tom Smith and Marcel Tanner. This work was supported by the Swiss National Science Foundation grant Nr. 3200-057165.99.
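Fitting the semivariogram empirically, as mentioned above, amounts to binning squared half-differences by distance. The following numpy sketch (our function, not code from this study) illustrates the classical binned estimator.

```python
import numpy as np

def empirical_semivariogram(coords, z, bins):
    """Binned empirical semivariogram: mean of 0.5*(z_i - z_j)^2 over
    all location pairs whose separation falls in each distance bin."""
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    sq = 0.5 * (z[:, None] - z[None, :]) ** 2
    iu = np.triu_indices(len(z), k=1)        # each pair counted once
    d, sq = d[iu], sq[iu]
    return np.array([sq[(d >= lo) & (d < hi)].mean()
                     for lo, hi in zip(bins[:-1], bins[1:])])
```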


Statistical estimation for non-Gaussian data is based on the theory of generalized linear mixed models (GLMM). A common approach is to integrate out the random effects and proceed with maximum likelihood-based approaches for estimating the covariate and covariogram parameters. This integration can be implemented numerically (Anderson and Hinde, 1998; Preisler, 1988; Lesaffre and Spiessens, 2001) when dimensionality is low, or via approximations. Breslow and Clayton (1993) show that, for known covariance parameters, the Laplace approximation leads to the same estimator for the fixed and random effects parameters as the one arising by maximizing the penalized quasi-likelihood (PQL). Implementation of this approach requires iterating between iterated weighted least squares for estimating the fixed and random effects parameters and maximizing the profile likelihood for estimating the covariance parameters. An extension of the PQL procedure is discussed by Wolfinger and O'Connell (1993). The PQL approach is implemented in some statistical packages due to its relative simplicity; however, it provides biased estimates when the number of random effects increases (McCulloch, 1997; Booth and Hobert, 1999) or when the data are far from normal.

The generalized estimating equation methods developed by Liang and Zeger (1986) and Zeger and Liang (1986) estimate covariate effects under the assumption of independence, but correct their standard errors to account for the spatial dependence. The method is unable to estimate the spatial random effects. The EM algorithm (Dempster, Laird and Rubin, 1977) has been implemented in model fitting by treating the spatial random effects as "missing" data. The intractable integration of the random effects which is required in the E-step is overcome by simulation, such as the Metropolis-Hastings algorithm (McCulloch, 1997) or importance sampling/rejection sampling methods (Booth and Hobert, 1999). For spatial settings, particular pseudo-likelihood approaches have been established which capture solely the site-to-site variation between pairs or groups of observations (Besag, 1974). For the special case of a binary outcome, Heagerty and Lele (1998) have proposed a thresholding model using a composite likelihood approach.

A drawback of the maximum likelihood-based methods employed in geostatistical modeling is the large-sample asymptotic inference. For a spatial stochastic process {Y(u); u ∈ D}, with D ⊂ R², the asymptotic concept can be applied either to the sample size within a fixed space D (infill asymptotics) or to the space D (increasing domain asymptotics). In the latter, observations are spaced far enough apart to be considered uncorrelated. The results can differ, depending on the type of asymptotics used (see, e.g., Tubilla, 1975).

Bayesian hierarchical geostatistical models implemented via Monte Carlo methods avoid asymptotic inference as well as many computational problems in model fitting and prediction. Diggle et al. (1998) suggest inference on the posterior density via Markov chain Monte Carlo (MCMC). This iterative approach requires repeated inversions of the covariance matrix of the spatial process, which is involved in the likelihood. The size of this matrix increases with the number of locations. Inversion of large matrices can drastically slow down the running time of the algorithm and cause numerical instabilities affecting the accuracy of the estimates. To overcome this problem, Gelfand et al. (1999) suggest non-iterative simulation via the Sampling-Importance-Resampling (SIR) algorithm (Rubin, 1987). The quality of SIR hinges on the ability to formulate an easy-to-draw-from importance density which comes as close as possible to the true joint posterior distribution of the parameters.

In this article, we review three fitting procedures: the maximum likelihood-based PQL method, the MCMC simulation, and the SIR. We assess these methods in terms of estimation accuracy, ease of implementation, and computational efficiency using a spatially structured dataset on infant mortality from Mali collected over 181 locations. A description of the dataset and the applied questions which motivated this work are given in the next section. Then we describe the model as well as the three fitting approaches. A discussion on the ease of implementation of each approach and a comparison of the inferences obtained is given in the conclusion section.


Data such as logit in our mortality risk application. The data that motivated this work were However the spatial structure of the data renders collected under the Demographic and Health the independence assumption of Yij invalid, Surveys (DHS) program. The aim of the program is to collect and analyze reliable leading to narrower confidence intervals for β demographic and health data for regional and and thus to overestimation of the significance of national family and health planning. Data are the predictors. commonly collected in developing countries. One approach to take into account DHS is funded by the U.S. Agency for spatial dependence is via the generalized linear International Development (USAID) and mixed model (GLMM) reviewed by Breslow implemented by Macro International Inc. The and Clayton (1993). In particular, we introduce standard DHS methodology involves collecting the unobserved spatial variation by a latent complete birth histories from women of stationary, isotropic Gaussian process U over childbearing age, from which a record of age our study region D , such that and survival can be computed for each child. U = (,UU12 ,,… Un )~(0,) N Σ , where Σij is a The data are available to researchers via the parametric function of the distance d between internet (www.measureDHS.com). ij

Birth histories corresponding to 35,906 locations si and s j . Conditional on the random children were extracted from the data of the term U , we assume that are independent DHS-III 1995/96 household survey carried out i Yij in Mali. Additional relevant covariates extracted with EY(|)ij U i= π ij . The Ui enters the model were the year of birth, residence, mothers on the same scale as the predictors, that is education, infant’s sex, birth order, preceding birth interval and mothers age at birth. Using t location information provided by Macro gU()π ij= X ijβ + i (1) International, we were able to geo-locate 181 distinct sites by using digital maps and and captures unmeasured geographical databases, such as the African data sampler heterogeneity (small scale variation). (1995) and the Geoname Gazetteer (1995). The A commonly used parameterization for objective of data analysis was to assess the the covariance Σ is Σ=σρφ2 (;d ) where effect of birth and socio-economic parameters ij ij 2 on infant mortality and produce smooth maps of σ is the variance of the spatial process and mortality risk in Mali. These maps will help ρ(;φ dij ) a valid correlation function with a identifying areas of high mortality risk and assist scale parameter φ which controls the rate of child mortality intervention programs. correlation decay with increasing distance. In Methodology most applications a monotonic correlation function is chosen i.e. the exponential function which has the form ρ(φφ ;ddij )=− exp( ij ) . Let Yij be a binary response corresponding to Ecker and Gelfand (1997) propose several other the mortality risk of child j at site s , i parametric correlation forms, such as the in=1,..., taking value 1 if the child survived Gaussian, Cauchy, spherical and the Bessel. A separate set of location-specific the first year of life and 0 otherwise, and let Xij t be the vector of associated covariates. Within random effects, W = (,WW12 ,,… Wn ) is often the generalized linear model framework (GLM), added in Equation 1 to account for unexplained non-spatial variation (Diggle et al., 1998), where we assume Yij are i.i.d. Bernoulli random W , in=1,… , are considered to be variables with EY()= π and model predictors i ij ij independent, arising from a Normal distribution, t 2 2 as g()π ij= X ijβ where g()⋅ is a link function WNi ~(0,)τ . The τ is known in 500 FITTING MIXED MODELS FOR POINT-REFERENCED DATA

geostatistics as the nugget effect and introduces where p(|U σ 2 ,)φ is the distribution of the a discontinuity at the origin of the covariance spatial random effects, that is function: pN(|U σφ2 ,)≡ (0,)Σ . Σ=τσρφ221(ij = ) + ( ; d ) . ij ij A large number of repeated samples at the same Markov chain Monte Carlo estimation location make the nugget identifiable, otherwise Diggle et al. (1998) suggest Markov its use in the model is not justifiable because the chain Monte Carlo and in particular Gibbs extra binomial variation is already accounted for sampling for fitting GLMM for point-referenced by the spatial random effect. data. The standard implementation of the Gibbs algorithm requires sampling from the full Parameter estimation conditional posterior distributions which in our The above GLMM is highly application have the following forms: parameterized and maximum likelihood methods can fail to estimate all parameters p(,|,,)β β UY ∝ simultaneously. The estimation approach starts kk− n ni by integrating out the random effects and exp(Xijkβ kY ij ) (2) estimating the other parameters using the ∏∏1++ exp(Xt β U ) marginal likelihood ij==11 ij i

p(|,,,)YUβ σ 2 φ p(|UUσφ2 ,)d . ∫ 2 pU(|iiUY− ,,,)σφ ∝ However, this integral has analytical solution n ni exp(UY ) only for Gaussian data. For non-Gaussian data iij × the integrand can be approximated using a first- ∏∏ t ij==111exp(++Xijβ U i ) order Taylor series expansion around its 211/2−− maximizing value, after which the integration is ||σ −×ΣΣΣii,,−−− i ii feasible. This approach, known as the Laplace 1 −12 approximation, results in the penalized quasi- exp(− (Uiiiii−×ΣΣ,−−U − ) likelihood (PQL) estimator (Breslow and 2 (3) 211−− Clayton, 1993), which was shown in various ())σ − ΣΣΣii,,−−− i ii simulation studies to produce biased results (Browne and Draper, 2000; Neuhaus and Segal, p(|φσU ,21/2 )∝× |Σ |− 1997). Breslow and Lin (1995) determined the (4) asymptotic bias in variance component problems 1 t −1 −(1)a + exp(−+ (U Σ U b /φφ )) 1 for first- and second-order approximations in 2 1 comparison to McLaurin approximations. Following the Bayesian modeling p(|,)~σφ2 U specification, we need to adopt prior distributions for all model parameters. We chose InverseGamma(an2 + / 2, non-informative Uniform priors for the 1 t −1 regression coefficients, i.e. p()β ∝ 1 , and vague b2 + UR U), 2 inverse Gamma priors for the σ 2 and φ Rkl= ρ(;φ d kl ) (5) parameters: p()φ = IG ( a11 , b ) and 2 where p()σ = IG (,) a22 b . Bayesian inference is based on the joint posterior distribution t β−−+kkkK= (,,β111……ββ , ,, β ), pL(,β UY ,σφ2 , | )∝× (,β UY ; ) U = (,,UUUU…… , ,, )t , , −−+iiin111 22 t pp()β (U |σφ ,) p ( σ ) p () φ ΣΣ−−ii,,= i i = Cov(,)U−ii U and GEMPERLI & VOUNATSOU 501

Samples from p(σ² | U, φ) can be drawn easily, as this is a known distribution. The conditionals of the other parameters do not have standard forms, and a random walk Metropolis algorithm with a Gaussian proposal density, having mean equal to the estimate from the previous iteration and variance derived from the inverse second derivative of the log-posterior, could be employed for simulation.

The likelihood calculations in Equations 3, 4, and 5 require inversions of the (n−1)×(n−1) matrices Σ−i, i = 1, ..., n, and the n×n matrix Σ, respectively. Matrix inversion is an operation of order 3, which has to be repeated for evaluating the conditional distribution of all n random effects Ui and that of the φ parameter within each Gibbs sampling iteration. This leads to an enormous demand of computing capacity and makes implementation of the algorithm extremely slow (or possibly infeasible), especially for a large number of locations.

Sampling-Importance-Resampling

Gelfand et al. (1999) propose Bayesian inference for point-referenced data using non-iterative Sampling-Importance-Resampling (SIR) simulation. They replace matrix inversion with simulation by introducing a suitable importance sampling density g(·) and re-write the joint posterior as

p*(β, U, σ², φ | Y) = [ p(β, U, σ², φ | Y) / g(β, U, σ², φ; Y) ] g(β, U, σ², φ; Y).    (6)

They construct the importance sampling density (ISD) by

g(β, U, σ², φ; Y) = gs(β | U; Y) gs(U | σ², φ) gs(σ², φ),    (7)

which is easy to simulate from, and then re-sample from g(β, U, σ², φ; Y) according to the importance weights

w(β, U, σ², φ) = p(β, U, σ², φ | Y) / g(β, U, σ², φ; Y).    (8)

The density gs(σ², φ) of the ISD could be taken as a product of independent inverse Gamma distributions gs(σ²) gs(φ). It is however preferable to adopt a bivariate distribution which accounts for interrelations between the two parameters and thus approximates more closely the p(σ², φ | Y). We considered a bivariate t-distribution on log(σ²) and log(φ) with low degrees of freedom and mean around the maximum likelihood estimates of log(σ²) and log(φ). The spatial random effects can be simulated from a multivariate normal distribution,

gs(U | σ², φ) ≡ N(0, σ² ρ(φ, ·)).

This step requires matrix decomposition of σ² ρ(φ, ·), repeatedly at every iteration. This is an operation of order 2 and the most expensive numerical part of the simulation from the ISD. The density gs(β | U; Y) can be a Normal distribution, gs(β | U; Y) ≡ N(β̂U, Σ̂β), with β̂U equal to the regression coefficients estimated from an ordinary logistic regression with offset U and Σ̂β equal to the covariance matrix of β̂U.

When the ISD approximates the posterior distribution well, one expects that the standardized importance weights are Uniformly distributed. When this is not the case, the ISD would give rise to very few dominant weights, leading to an inefficient and wrong sampler.
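The re-weighting and resampling step implied by Equation (8) can be sketched as follows. The log-posterior and log-ISD evaluations are faked with random numbers purely to show the mechanics, together with an effective-sample-size diagnostic that signals the dominant-weight problem discussed above.

import numpy as np

# Schematic sketch of the SIR re-weighting/resampling step around Equation (8).
# log_post and log_isd are placeholders for the log joint posterior and the
# log ISD density evaluated at each simulated parameter set.
rng = np.random.default_rng(2)
n_draws = 10_000
log_post = rng.normal(size=n_draws)      # stand-in for log p(beta, U, sigma2, phi | Y)
log_isd = rng.normal(size=n_draws)       # stand-in for log g(beta, U, sigma2, phi; Y)

log_w = log_post - log_isd               # log importance weights, Equation (8)
log_w -= log_w.max()                     # stabilize before exponentiating
w = np.exp(log_w)
w /= w.sum()                             # standardized importance weights

# resample parameter indices with replacement, proportional to the weights
idx = rng.choice(n_draws, size=1_000, replace=True, p=w)

# crude diagnostic: effective sample size; tiny values signal a few dominant weights
ess = 1.0 / np.sum(w ** 2)
print(round(ess, 1))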

guesses of the ISD and allows after few t t with Σ11 = E()UU , Σ00= E()UU 0 0 and iterations more uniform weights. tt Point estimates of the parameters should ΣΣ01== 10E()UU 0 . The mean of the preferably be calculated from the importance Gaussian distribution in (10) is the classical weights using all sampled values, rather than kriging estimator (Matheron, 1963). from the re-sampled values, what leads to The Bayesian predictive distribution of smaller bias. For example the mean and variance Y0 is given by: ()k of β is estimated by ββ= ww/ 2 i ikik∑∑ PPP(|)YY= (|,)(|,,) Yβ UUUσ φ × kk 0000∫ ()k 2 22 and ∑∑wwki()/ββ− i k respectively, Pddddd(,β UY ,σ ,φσφ | )β UU0 (11) kk ()k where is the k th sampled value of β 2 βi i P(,β UY ,σφ , | ) is the posterior distribution from the ISD. of the parameters and obtained by the Gibbs sampler or the SIR approach. Simulation-based Spatial Prediction Bayesian spatial prediction is performed by Modeling point-referenced data is not consecutive drawing samples from the posterior only useful for identifying significant covariates distribution, the distribution of the spatial but for producing smooth maps of the outcome random effects at new locations and the by predicting it at unsampled locations. Spatial Bernoulli-distributed predicted outcome. In SIR, prediction is usually refereed as kriging. drawing is performed from the set of all sampled Let Y0 be a vector of the binary parameters with weighting given in Equation response at new, unobserved locations s , (8). 0i The maximum-likelihood predictor in=1,… , 0 . Following the maximum likelihood (Equation 9) can be interpreted as the Bayesian approach, the distribution of Y0 is given by: predictor (Equation 11), with parameters fixed at their maximum-likelihood estimates. In contrast to Bayesian kriging, classical kriging does not P(|,,,)Y βˆ Uˆ σφˆ 2 ˆ = 0 (9) account for uncertainty in estimation of β and PP(|,)(|,,)Y βˆ UUUˆ σφˆ 2 ˆ d U ∫ 000 0 the covariance parameters. where βˆ , σˆ 2 and φˆ are the maximum Results likelihood estimates of the corresponding A generalized linear mixed model was fitted to ˆ parameters. In PQL, U is derived as part of the the infant mortality data in Mali using the three iterative estimation process (Breslow and estimation approaches discussed in the ˆ methodology-section, PQL, MCMC and SIR Clayton, 1993). P(|,)Y00β U is the Bernoulli- likelihood at new locations and together with an ordinary logistic regression (GLM) which did not account for spatial ˆ 2 ˆ P(|,,)UU0 σˆ φ is the distribution of the dependence. The purpose of the analysis was to assess the effect of maternal and socio-economic spatial random effects U0 at new sites, given ˆ factors on infant mortality, produce a smooth U at observed sites and is Normal map of mortality risk in Mali and compare the results obtained from the above procedures. ˆ 2 ˆ P(|,,)UU0 σφˆ = Univariate analysis based on the ordinary (10) logistic regression revealed that the following N(,ΣΣ−−11Uˆ ΣΣΣΣ− ) 01 11 00 01 11 10 variables should be included in the model: child’s birthday, region type, mother’s degree of education, sex, birth order, preceding birth interval and mother’s age at birth. GEMPERLI & VOUNATSOU 503

Results

A generalized linear mixed model was fitted to the infant mortality data in Mali using the three estimation approaches discussed in the Methodology section, PQL, MCMC and SIR, together with an ordinary logistic regression (GLM) which did not account for spatial dependence. The purpose of the analysis was to assess the effect of maternal and socio-economic factors on infant mortality, produce a smooth map of mortality risk in Mali, and compare the results obtained from the above procedures. Univariate analysis based on the ordinary logistic regression revealed that the following variables should be included in the model: child's birthday, region type, mother's degree of education, sex, birth order, preceding birth interval and mother's age at birth.

We fitted the non-spatial logistic model (GLM) in SAS (SAS Institute Inc., Cary, NC, USA) using Proc Logistic. The spatial model with the PQL estimation method was also fitted in SAS using the %GLIMMIX macro. This macro is based on the approach of Wolfinger and O'Connell (1993) and makes subsequent calls to Proc Mixed to iteratively estimate mixed models for non-normal data. It is supported by a collection of spatial correlation functions, such as the exponential, Gaussian, linear, power and spherical. In our application, we have chosen the exponential function. MCMC and SIR estimation were implemented in software written by the authors in FORTRAN 95 (Compaq Visual Fortran v6.6) and run on a Unix AlphaServer 8400. For a small number of locations the freeware software WinBUGS (www.mrc-bsu.cam.ac.uk/bugs) can also be used to obtain MCMC simulation-based estimates. Proc Mixed for normal data supports Bayesian modeling by allowing specification of prior distributions for the parameters and MCMC simulation. However, this possibility is currently available only for variance component models and not for spatial covariances, which holds for the %GLIMMIX macro, too.

Table 1: Comparison of the computational costs for the Bayesian, simulation based approaches.

Model | Initial sample size | Final sample size from posterior | No. of batches and batch size | Iterations to convergence | Thinning* | Time per 1,000 iterations
MCMC | 50,000 | 1,720 | - | 7,000 | 25 | 7 hrs 14 min
SIR | 400,000 | 1,600 | 800 batches with 500 values (2 batches per draw) | 0 | 0 | 1 hr 23 min

*Minimum lag at which autocorrelation was not significant.

Table 2: Comparison of parameter estimates from the binary spatial model using different estimation strategies. The binary outcome is the survival of the first year of life.

Birth year Model Estimate σ 2 φ Intercept 1966-71 1972-77 1978-83 1984-89 GLM MLE - - 1.81 -0.18 0.04 0.09 0.12 95% CI - - 1.43,2.11 -0.44,0.09 -0.22,0.29 -0.16,0.34 -0.13,0.37 PQL MLE 1.05 2.07 2.59 -0.19 0.03 0.09 0.12 95% CI 0.72,1.81 0.54,4.63 1.43,3.74 -0.48,0.11 -0.26,0.31 -0.19,0.37 -0.17,0.40 MCMC Mean 1.32 0.07 1.76 -0.20 0.01 0.07 0.10 Median 0.91 0.04 1.75 -0.21 0.01 0.07 0.09 95% CI 0.22,3.89 0.008,0.24 1.47,2.09 -0.46,0.08 -0.25,0.27 -0.19,0.33 -0.16,0.36 SIR Mean 0.91 0.005 1.77 -0.19 0.03 0.08 0.11 Median 0.61 0.03 1.73 -0.18 0.03 0.08 0.11 95% CI 0.22,2.62 0.0004,0.015 0.34,3.25 -0.44,0.06 -0.21,0.27 -0.16,0.31 -0.13,0.34


Birth year Residency Education Sex Birth order Model Estimate 1990-96 Urban No Primary Male 2nd or 3rd 4th to 6th GLM MLE 0.17 0.32 -0.56 -0.66 -0.14 -1.90 -1.97 95% CI -0.08,0.42 0.22,0.36 -0.75,-0.31 -0.83,-0.43 -0.16,-0.05 -2.40,-1.32 -2.48,-1.38 PQL MLE 0.16 0.29 -0.54 -0.58 -0.14 -1.88 -1.95 95% CI -0.12,0.44 0.19,0.39 -0.75,-0.32 -0.78,-0.38 -0.20,-0.09 -2.42,-1.34 -2.50,-1.40 MCMC Mean 0.15 0.3 -0.55 -0.6 -0.14 -1.90 -2.00 Median 0.14 0.3 -0.54 -0.59 -0.14 -1.95 -2.02 95% CI -0.11,0.40 0.23,0.38 -0.74,-0.36 -0.78,-0.42 -0.16,-0.10 -2.39,-1.44 -2.48,-1.51 SIR Mean 0.16 0.33 -0.50 -0.57 -0.14 -1.88 -1.96 Median 0.16 0.34 -0.50 -0.57 -0.14 -1.88 -1.96 95% CI -0.08,0.39 0.25,0.41 -0.68,0.32 -0.75,-0.40 -0.19,-0.09 -2.34,-1.42 -2.43,-1.49

Birth Preceding birth Mothers age at birth order interval Model Estimate 7th or 2-4 > 4 20-29 30-39 39-49 higher GLM MLE -2.10 2.34 2.71 0.24 0.31 0.19 95% CI -2.62,-1.51 1.76,2.84 2.11,3.22 0.13,0.29 0.15,0.40 -0.02,0.42 PQL MLE -2.07 2.31 2.67 0.25 0.32 0.19 95% CI -2.63,-1.52 1.77,2.85 2.12,3.22 0.17,0.33 0.19,0.44 -0.07,0.44 MCMC Mean -2.10 2.37 2.73 0.26 0.33 0.20 Median -2.15 2.38 2.74 0.26 0.33 0.20 95% CI -2.16,-1.63 1.87,2.82 2.20,3.22 0.19,0.32 0.23,0.43 -0.009,0.43 SIR Mean -2.09 2.31 2.65 0.25 0.32 0.21 Median -2.09 2.30 2.65 0.25 0.32 0.21 95% CI -2.56,-1.62 1.82,2.77 2.16,3.13 0.18,0.31 0.22,0.43 0.01,0.42

Convergence of the PQL approach to the global mode of the likelihood was highly dependent on the starting values. We suggest comparing the results by running the procedure with several starting values. Computationally, the PQL is fast in comparison to the simulation-based procedures, MCMC and SIR, but it quickly runs out of workspace for larger datasets. A comparison of the computational time required for the MCMC and SIR algorithms is given in Table 1. MCMC estimation was applied using a single chain. Convergence was assessed using Geweke's (1992) criterion. The algorithm converged after 7,000 iterations. A final sample from the posterior distribution of size 1,720 was obtained by sampling every 25th iteration after convergence was reached. The SIR algorithm required extensive fine tuning in order to derive good estimates. We ran the sampler several times and adjusted the degrees of freedom and mean parameter in the bivariate t-distribution gs(σ², φ), according to those values leading to large weights. Instead of resampling from the whole sequence of parameters according to their weights, we obtained better results by dividing the generated parameters into batches and drawing an equal number of samples with replacement from every batch. The implementation of the SIR algorithm was found to be difficult. Despite the effort applied to improve the SIR estimator, the derived weights show a highly skewed distribution, with a few dominating values (Figure 1).


Figure 1: Distribution of the weights in the Sampling-Importance-Resampling (SIR) procedure.

Table 2 gives the parameter estimates obtained by the four approaches. The fixed effect coefficients β show no fundamental difference in their point estimates between the competitive models, with the exception of the intercept coefficient. The PQL estimate of the intercept is higher than from the other estimators. The standard error of β estimated from GLM is narrower than in the spatial models, as we were expecting. Discrepancies between the fitting approaches are observed in the estimates of the covariance parameters σ² and φ. The posterior density of σ² obtained from MCMC simulation was found to be highly skewed to the left. PQL overestimates φ, suggesting a lower spatial variation than the Bayesian approaches. This confirms known results about bias in the PQL estimates, especially for the covariance parameters σ² and φ, due to the bad quality of the first-order approximation of the integrand. The SIR estimates are similar to those obtained from MCMC.

Figure 2: Variogram cloud of the residuals in a non-spatial model.


Figure 2 shows three plots of the semivariogram cloud based on the Anscombe residuals obtained after fitting the GLM model. The semivariogram cloud is a plot of half the squared difference of the residuals versus the distance between their sample locations. The mean of the squared differences at each lag gives an estimator of the semivariogram. The three plots correspond to the 5%, 50% and 95% quantiles of the squared differences of the residuals. The semivariogram cloud shows high variability and an increasing trend from the origin, indicating lag-dependent variation. For a stationary spatial process, the semivariogram relates to the covariance of the random effects. Therefore we expect high variability in the covariance parameters.

Figure 3 depicts different semivariogram estimators. The classical estimator by Matheron (1963) was calculated by

γ̂(h) = (1 / |N(h)|) Σ_{N(h)} (Z(si) − Z(sj))²,

where Z(si) is the Anscombe residual at location si,

N(h) = {(si, sj): si − sj = h ± ε},

and |N(h)| is its cardinality. This estimator is sensitive to outliers, and a robust version was proposed by Cressie and Hawkins (1980), which is displayed in Figure 3, too. The MCMC, SIR and PQL based estimators were calculated by replacing the estimates of σ² and φ obtained from the three approaches in

γ(h) = σ² (1 − exp(−φ · h)).

The MCMC and SIR estimators appear to lie between the two other empirical semivariogram estimators. Because we have omitted the nugget term, they pass through the origin. Nevertheless, their values fit nicely into the graph. The PQL estimate does not capture the correlation present at large lags. It represents the classical semivariogram estimator well, but it is far off the robust version.

Regarding our application, Figure 4 displays the locations of the DHS surveys and the observed infant mortality risk in Mali. The risk factors which were found to be statistically significantly related to infant mortality (Table 2) confirm findings made by other authors. The negative association between maternal education and mortality has been described by Farah and Preston (1982) and Cleland and van Ginneken (1989). Higher education may result in higher health awareness, better utilization of health facilities (Jain, 1988), and higher income and ability to purchase goods and services which improve infants' health (Schultz, 1979).
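The following Python sketch illustrates the two ingredients described above: a binned version of the classical (Matheron-type) empirical semivariogram and the fitted exponential model γ(h) = σ²(1 − exp(−φh)). The residuals, locations and parameter values are simulated stand-ins, not the Anscombe residuals from the Mali fit.

import numpy as np

# Illustrative empirical semivariogram plus an exponential model curve.
rng = np.random.default_rng(4)
coords = rng.uniform(0, 100, size=(200, 2))     # hypothetical sample locations
z = rng.normal(size=200)                         # stand-in for residuals Z(s_i)

d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
sq_diff = (z[:, None] - z[None, :]) ** 2
iu = np.triu_indices(len(z), k=1)                # each pair (s_i, s_j) once

bins = np.linspace(0, 50, 11)                    # distance lags h +/- epsilon
emp = []
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (d[iu] >= lo) & (d[iu] < hi)        # the set N(h)
    emp.append(sq_diff[iu][in_bin].mean())       # (1/|N(h)|) * sum of squared differences
emp = np.array(emp)

h_mid = 0.5 * (bins[:-1] + bins[1:])
sigma2, phi = 1.0, 0.1                           # assumed parameters
model = sigma2 * (1.0 - np.exp(-phi * h_mid))    # gamma(h) = sigma2 * (1 - exp(-phi * h))
print(np.column_stack([h_mid, np.round(emp, 3), np.round(model, 3)]))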


Figure 3: Semivariogram estimators: Classical semivariogram estimator by Matheron (circles), Robust version by Cressie and Hawkins (triangles), MCMC (long dashed line), SIR (short dashed line) and PQL (line) fit.

The observed time trend, with higher infant survival for more recent years, was found to be not statistically significant. Longer birth intervals and low birth order reduce the risk of infant death. Mortality was related to the residency and sex of the infant, with girls and urbanites being at lower risk of dying during the first year of life. The impact mother's age has on infant mortality shows the typical J-shape (Kalipeni, 1993), with lowest risk for an age around thirty. The higher risk in young women may be explained by not fully developed maternal resources, and that in older women by the effect of ageing. The MCMC-based estimate of the φ parameter revealed strong spatial correlation which reduces to less than 5% for distances longer than 75 km.


Figure 4: Observed mortality in 36,906 infants from the DHS surveys conducted in the years 1995 and 1996 at 181 distinct locations in Mali.

Figure 5: Predicted spatial random effects from the infant mortality model using MCMC. The darker the shading, the lower the survival.


Predictions of the child mortality risk using the MCMC approach were made at 600,000 new locations on a regular grid, covering the whole area of Mali south of 18 degrees latitude north. Because the covariates are infant-specific and cannot be extrapolated to the new locations, we predict the random effects only. The map with the predictions is displayed in Figure 5. The map indicates a higher infant mortality risk mainly in the Northern part of the Niger delta. This region has low population density and water availability is seasonal. The many lakes in this region are preferred breeding sites for the malaria mosquito. Low mortality is predicted in North-Western Mali at the border to Mauritania and Senegal. In this region, the population is more active in migrating to other countries for business purposes, bringing money to the region. Health facility coverage is also reflected in the predictive map, where the coverage is low in the Northern Niger delta and high in the North-East.

Conclusion

Generalized linear mixed models for large point-referenced spatial data are highly parameterized and their estimation is hampered by computational problems. Reliable estimation methods that can be applied in standard software, or algorithms that can accurately estimate the model parameters within practical time constraints, do not exist. In this paper we compared a few recent developments using a real dataset on infant mortality in Mali.

The advantage of the PQL method is that it can be applied in standard statistical software packages. However, estimates are biased, especially those for the covariance parameters. The algorithm depends highly on the starting values and can easily converge to a local mode. For medium to large numbers of locations, implementation of this algorithm is impeded by computer memory problems.

Bayesian methods can provide flexible ways of modeling point-referenced data, give unbiased estimates of the parameters and their standard errors, and have computational advantages for problems larger than the ones the maximum likelihood methods can handle. However, for a very large number of locations, an implementation may be infeasible due to long computing time. The SIR runs considerably faster than MCMC, but it requires tedious tuning. An ISD which approximates the posterior distribution well is difficult to develop and application-specific. Rigorous methods for evaluating the suitability of the ISD do not exist. This increases the possibility of drawing misleading inference.

MCMC is the most practical and, when it comes to prediction, accurate approach to date for fitting geostatistical problems. However, it is computationally intensive, especially for datasets with a large number of locations. More research is required on ways of improving the convergence of the algorithm and the inversion of large matrices. Gilks and Roberts (1996), Mira and Sargent (2000) and Haran et al. (2001) have proposed general MCMC algorithms for improving convergence. Rue (2000) and Pace and Barry (1997) have applied innovative numerical methods using sparse matrix solvers for fitting areal data. In future, similar approaches need to be adapted and assessed for modeling point-referenced spatial data.

References

Anderson, D. A., & Hinde, J. P. (1988). Random effects in generalized linear models and the EM algorithm. Communication in Statistics: Theory and Methods, 17, 3847–3856.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, 36, 192–236.
Booth, J. G., & Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society, Series B, 61, 265–285.
Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9–25.
Breslow, N. E., & Lin, X. (1995). Bias correction in generalized linear mixed models with a single component of dispersion. Biometrika, 82, 81–91.

Browne, W. J., & Draper, D. (2000). A Haran, M., Hodges, J. S., & Carlin, B. P. comparison of Bayesian and likelihood methods (2001). Accelerating computation in Markov for fitting multilevel models. Submitted. random field models for spatial data via Cleland, J., & van Ginneken, J. K. structured MCMC. Journal of Computaional (1989). Maternal education and child survival in and Graphical Statistics, 12, 249-264. developing countries: The search for pathways Heagerty, P. J., & Lele, S. R. (1998). A of influence. Social Science and Medicine, 27, composite likelihood approach to binary spatial 1357–1368. data. Journal of the American Statistical Cressie, N., & Hawkins, D. M. (1980). Association, 93, 1099–1111. Robust estimation of the variogram. Jain, A. (1988). Determinants of Mathematical Geology, 12, 115–125. regional variation in infant mortality in rural Dempster, A. P., Laird, N. M, & Rubin India. In A. Jain, & L. Visaria (Eds.), Infant D. B. (1977). Maximum likelihood from mortality in India: Differentials and incomplete data via the EM algorithm. Journal determinants, 127–167. Sage Publications. of the Royal Statistical Society, Series B, 39, 1– Kalipeni, E. (1993). Determinants of 38. infant mortality in Malawi: a spatial perspective. Diggle, P. J.,Tawn, J. A., & Moyeed, R. Social Science and Medicine, 37, 183–198. A. (1998). Model-based geostatistics. Journal of Lesaffre, E., & Spiessens, B. (2001). On the Royal Statistical Society, Series C, 47, 299- the effect of the number of quadrature points in a 350. logistic random-effects model: An example. Ecker, M., & Gelfand, A. E. (1997). Journal of the Royal Statistical Society, Series Bayesian variogram modeling for an isotropic C, 50, 325–335. spatial process. Journal of Agricultural, Liang, K. Y., & Zeger, S. L. (1986). Biological and Environmental Statistics, 2, 347– Longitudinal data analysis using generalized 369. linear models. Biometrika, 73, 13–22. Farah, A. A., & Preston, S. H. (1982). Matheron, G. (1963). Principles of Child mortality differentials in Sudan. geostatistics. Economic Geology, 58, 1246– Population and Development Review, 8, 365– 1266. 383. McCulloch, C. E. (1997). Maximum GDE Systems Inc. (1995). Geoname likelihood algorithms for generalized linear Digital Gazetteer, Version I, CD-ROM. mixed models. Journal of the American Gelfand. A. E.,Ravishanker. N., & Statistical Association, 92, 162–170. Ecker, M. (1999). Modeling and inference for Mira, A., & Sargent, D. J. (2000). point-referenced binary spatial data. In D. Strategies for speeding Markov chain Monte Dey,S. Ghosh & B. Mallick (Eds.), Generalized Carlo algorithms. Technical Report, University linear models: a bayesian perspective. 373–386. of Insubria, Varese. Marcel Dekker Inc. Neuhaus, J. N., & Segal, M. R. (1997). Geweke, J. (1992). Evaluating the An assessment of approximate maximum accuracy of sampling-based approaches to the likelihood estimators in generalized linear mixed calculation of posterior moments. In J. M. models. In T. G. Gregoire, D. R. Brillinger, P. J. Bernardo, J. O. Berger,A.P. Dawid & A. F. M. Diggle, E. Russek-Cohen, W. G. Warren, & R. Smith (Eds.), Bayesian statistics, 4, 169–193. D. Wolfinger (Eds.), Modeling longitudinal and Oxford University Press. spatially correlated data: Lecture notes in Gilks, W. R., & Roberts, G. O. (1996). statistics, 122, 11–22. Strategies for improving MCMC. In W. R. Pace, K. R., & Barry, R. (1997). Quick Gilks, S. Richardson, & D. J. 
Spiegelhalter computation of the regressions with spatially (Eds.), Markov chain Monte Carlo in practice. autoregressive dependent variable. 89–114. London: Chapman and Hall. Geographical Analysis, 29, 232–247.


Preisler. H. K. (1988). Maximum Tubilla, A. (1975). Error convergence likelihood estimates for binary data with random rates for estimates of multidimensional integrals effects. Biometrical Journal, 3, 339–350. of random functions. Technical Report No. 72, Rubin, D. B. (1987). Comment on: The Department of Statistics, Stanford University, calculation of posterior distributions by data Stanford, CA. augmentation. by M. A. Tanner and W. H. Wolfinger, R., & O’Connell, M. (1993). Wong. Journal of the American Statistical Generalized linear mixed models: A pseudo- Association, 82, 543–546. likelihood approach. Journal of Statistical Rue, H. (2000). Fast sampling of Computation and Simulation, 48, 233–243. gaussian markov random fields. Journal of the World Resources Institute (1995) Africa Royal Statistical Society, Series B, 48, 233–243. Data Sampler. Edition I, CD-ROM. Schultz, T. P. (1979, June 19-25). Zeger, S. L., & Liang, K. Y. (1986). Interpretation of relations among mortality, Longitudinal data analysis for discrete and economics of the household and the health continuous outcomes. Biometrika, 42, 121–130. environment. Proceedings of the meeting on Zimmerman, D. L., & Zimmerman, M. socio-economic determinants and consequences B. (1991). A comparison of spatial of mortality, Mexico City. Geneva: World semivariogram estimators and corresponding Health Organization. ordinary kriging predictors. Technometrics, 33, 77–91.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 512-519 1538 – 9472/03/$30.00

Bootstrapping Confidence Intervals For Robust Measures Of Association

Jason E. King Baylor College of Medicine

A Monte Carlo simulation study compared four bootstrapping procedures in generating confidence intervals for the robust Winsorized and percentage bend correlations. Results revealed the superior resiliency of the robust correlations over r, with neither outperforming the other. Unexpectedly, the bootstrapping procedures achieved roughly equivalent outcomes for each correlation.

Key words: Robust methods, bootstrapping, percentage bend correlation, Winsorized correlation

Introduction

A number of "robust" (Box, 1953) analogs to traditional estimators, population parameters, and hypothesis-testing methods have seen development during the past 40 years. Robust procedures typically retain the statistical interpretations associated with classical procedures, but are more resistant to distributional non-normalities and outliers. The Pearson product-moment correlation is without question the most commonly used measure of linear association, yet it is not robust to departures from normality, especially when the bivariate surface is non-normal and dependence exists (King, 2003).

Two new robust alternatives to r appear promising. The Winsorized correlation (Devlin, Gnanadesikan, & Kettenring, 1975; Gnanadesikan & Kettenring, 1972; Wilcox, 1993) and the percentage bend correlation (Wilcox, 1994, 1997) yield interpretations analogous to r and asymptotically equal zero under bivariate independence, yet possess properties that curb the influence of distributional non-normalities.

This article was based on the doctoral dissertation by Jason E. King. The author acknowledges Professor Bruce Thompson and the doctoral committee for their contributions. Email address: [email protected].

The Winsorized correlation (rw) is computed in an identical fashion to r except that a specified proportion of extreme scores in each tail are first Winsorized, that is, deleted and set equal to the most extreme score remaining in the tail of the distribution. The percentage bend correlation (rpb) is based on the percentage bend measures of location and midvariance and is less intuitive. See Wilcox (1994, 1997) for the relevant equations.

Yet few researchers have explored these newer correlations, notably with respect to estimating confidence intervals and defining their sampling distributions. For statistics with no known sampling distribution, Efron's (1979, 1982) bootstrap has proven to be effective in a variety of contexts. The conjecture is that the sampling distribution of a statistic can be approximated by the distribution of a large number of resampled estimates of the statistic obtained from a single sample of observations.

The distribution of resampled estimates forms an empirically derived sampling distribution from which confidence intervals or other indices may be estimated, either for inferential or descriptive purposes (Thompson, 1993). The usefulness of bootstrapping is evident because an increasing number of disciplines are now encouraging or requiring the reporting of confidence intervals (Thompson, 2002; Vacha-Haase, Nilsson, Reetz, Lance, & Thompson, 2000; Wilkinson & APA Task Force on Statistical Inference, 1999). An "almost bewildering array" (Hall, 1988, p. 927) of bootstrapping procedures is now available. These vary in the accuracy with which the bootstrap-generated interval spans the true interval.
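A rough Python sketch of the Winsorizing idea described above is given below: each margin is Winsorized (here with tail proportion γ = .2, the value used in the simulations reported later) and Pearson's r is computed on the Winsorized scores. This mirrors the verbal description only and is not Wilcox's published computing formula; the example data are simulated.

import numpy as np

def winsorized_r(x, y, gamma=0.2):
    """Pearson r on tail-Winsorized scores - a sketch of r_w based on the
    verbal description above, not Wilcox's exact computing formula."""
    xs, ys = np.asarray(x, dtype=float).copy(), np.asarray(y, dtype=float).copy()
    for v in (xs, ys):
        order = np.argsort(v)
        g = int(np.floor(gamma * len(v)))
        if g > 0:
            v[order[:g]] = v[order[g]]        # low tail -> smallest retained score
            v[order[-g:]] = v[order[-g - 1]]  # high tail -> largest retained score
    return np.corrcoef(xs, ys)[0, 1]

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 0.5 * x + rng.standard_t(df=3, size=50)   # heavy-tailed noise
print(round(np.corrcoef(x, y)[0, 1], 3), round(winsorized_r(x, y), 3))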

Accuracy is also contingent on the type of statistic under examination. At the current level of knowledge, it is unknown which bootstrapping procedure produces the most accurate confidence intervals for rpb and rw. Although Wilcox (1993, 1994, 1997) compared Type I error rates for these robust correlations, only two studies (Wilcox, 1997; Wilcox & Muska, 2001) have examined the accuracy of bootstrapped confidence intervals for rpb, and none for rw. Clearly, more research is needed.

The goal of this simulation study was to compare various means of bootstrapping confidence intervals for rw and rpb across a variety of conditions. The study compared four bootstrapping procedures, each of which has proven useful in some contexts: the ordinary percentile bootstrap (Efron, 1979), an adjusted bootstrap (Strube, 1988), the bias-corrected bootstrap (BC; Efron, 1981, 1982, 1985), and the bias-corrected and accelerated bootstrap (BCa; Efron, 1987). The Pearson r and Fisher's inverse hyperbolic tangent transformation of r, rz, were included for comparative purposes, although the latter frequently fails to produce even asymptotically correct results (Duncan & Layard, 1973).

Methodology

The simulation procedure began by randomly generating 1,000,000 observations from a population with known characteristics, serving as a derived population. This step was necessary because the Winsorized and percentage bend correlation parameters (ρw and ρpb) will not necessarily exactly equal ρ under dependence conditions. The second step involved drawing m = 100 samples, each of size n, from the derived population and calculating sample estimates for each of the four correlational measures. Lastly, B = 500 bootstrap samples were drawn by sampling with replacement from each of the m samples and 95% confidence intervals were calculated via each of the four bootstrapping procedures.

Gamma (γ) and beta (β) are two constants that must be fixed in computing the Winsorized and percentage bend correlations, respectively. These were each set to .2 for all simulations. Real data often demonstrate excessive distributional non-normality (Bradley, 1977, 1978; Micceri, 1989; Rasmussen, 1986; Stigler, 1973; Wilcox, 1990), and such can moderate the accuracy of a bootstrapping procedure for a given statistic (Hall, 1988; Wilcox, 1997). Thus, the present study compared bootstrapped correlations across a wide range of conditions including nine distributional shape variations, one contaminated distribution, six mixed distributions, three independence and dependence conditions (i.e., population correlations of .0, .4, .8), and four sample sizes (i.e., ns of 20, 50, 100, 250).

Four indices served as points of comparison for the bootstrapped correlations: Type I error rate, bias, efficiency, and interval width. The latter was constructed by modifying a ratio proposed by Efron (1988) such that the width of each bootstrap-estimated interval was divided by the width of a "true" (i.e., Monte Carlo-estimated) confidence interval. This required drawing an additional 10,000 samples, each of size n, from each simulated population to create the "true" sampling distributions.

Simulation studies typically compare Type I error rates and other indices in an informal manner; however, a more formal analysis is useful for processing the large number of indices obtained in the present study. Analysis of Variance (ANOVA) is well suited for quantifying sources of variation. This procedure allowed for partitioning the systematic variance components affecting the indices (viz., correlational measure, bootstrapping procedure, distributional shape and type, sample size, and strength of bivariate relationship).

Results

Tables 1-5 and Figures 1-2 display representative results averaged across distributional shape. Disaggregated data and fuller explanations are available in King (2000). Efficiency varied little across the correlational measures and is not presented.

Comparisons Among Bootstrapping Procedures

As regards Type I error rate (see Tables 1, 2, and Figure 1) and bias (see Tables 3, 4, and Figure 2), no bootstrapping procedure emerged as unmistakably superior across a majority of conditions for either robust correlation (e.g., a Bootstrap by Correlation effect is absent in Tables 2 and 4).
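For reference, a compact sketch of the ordinary percentile bootstrap step used in the simulation (B = 500 resamples, 95% intervals) is shown below; Pearson's r stands in for the statistic, and rw or rpb could be substituted. This is an illustrative reimplementation, not the author's simulation code.

import numpy as np

def percentile_boot_ci(x, y, stat, B=500, alpha=0.05, seed=0):
    """Ordinary percentile bootstrap CI: resample (x, y) pairs with replacement
    B times and take the empirical 2.5th and 97.5th percentiles of the estimates."""
    rng = np.random.default_rng(seed)
    n = len(x)
    boots = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # resample cases, keeping pairs intact
        boots[b] = stat(x[idx], y[idx])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)
pearson = lambda a, b: np.corrcoef(a, b)[0, 1]   # r_w or r_pb could be plugged in here
print(percentile_boot_ci(x, y, pearson))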

Table 1. Type I Error Rates Averaged Across All Distributional Conditions

n = 20 n = 50 n = 100 n = 250 r rz rw rpb r rz rw rpb r rz rw rpb r rz rw rpb ρ = 0 Percentile .06 .06 .03 .04 .07 .07 .06 .07 .05 .05 .05 .05 .07 .07 .04 .04 Adjusted .06 .06 .03 .04 .07 .07 .06 .07 .05 .05 .05 .05 .07 .07 .04 .04 BC .05 .05 .03 .05 .07 .07 .06 .06 .05 .05 .06 .05 .06 .06 .03 .04 BCa .05 .04 .03 .05 .07 .07 .05 .06 .06 .06 .06 .05 .07 .07 .04 .04 ρ = .4 Percentile .11 .11 .03 .07 .08 .08 .04 .06 .07 .07 .04 .04 .08 .08 .05 .05 Adjusted .11 .11 .04 .08 .08 .08 .04 .06 .08 .08 .04 .04 .08 .08 .05 .05 BC .08 .08 .03 .06 .08 .08 .03 .06 .08 .08 .04 .04 .09 .09 .05 .05 BCa .09 .08 .04 .07 .09 .09 .04 .06 .10 .10 .04 .03 .11 .11 .04 .05 ρ = .8 Percentile .09 .09 .06 .07 .06 .06 .06 .06 .06 .06 .06 .04 .07 .07 .05 .05 Adjusted .15 .15 .09 .12 .06 .06 .06 .06 .08 .08 .07 .07 .08 .08 .04 .05 BC .10 .10 .05 .05 .06 .06 .06 .07 .07 .07 .06 .04 .08 .08 .06 .05 BCa .12 .12 .06 .07 .07 .07 .05 .06 .10 .10 .06 .04 .09 .09 .06 .06 Note. Italicized values are greater than two standard errors beyond the nominal .05 level.

Table 2. Analysis of Variance for Type I Error Rate by Correlation and Bootstrapping Procedure

Source df F p η2 Model 15 11.028 <.001 .088 CORR 3 50.511 <.001 .081 BOOT 3 2.735 .042 .004 CORR * BOOT 9 .631 .772 .003 Error 1712 (.002) Total 1727 Note. Mean square error enclosed in parentheses.


Figure 1. Mean Type I error rate by correlation and bootstrapping procedure. Reference line indicates the nominal alpha rate of .05.

[Line plot: mean Type I error rate (0.00 to .15) by bootstrapping procedure (Percentile, Adjusted, BC, BCa) for r, rz, rw and rpb.]

Table 3. Interval Bias Averaged Across All Distributional Conditions

n = 20 n = 50 n = 100 n = 250

r rz rw rpb r rz rw rpb r rz rw rpb r rz rw rpb ρ = 0 Percentile .33 .33 .30 .31 .24 .24 .22 .22 .17 .17 .16 .16 .11 .11 .10 .10 Adjusted .36 .36 .35 .34 .24 .24 .23 .23 .17 .17 .16 .16 .11 .11 .10 .10 BC .32 .32 .31 .31 .23 .23 .22 .22 .16 .16 .16 .16 .11 .11 .10 .10

BCa .34 .33 .31 .31 .24 .24 .22 .22 .17 .17 .16 .16 .11 .11 .10 .10 ρ = .4 Percentile .36 .36 .33 .33 .24 .24 .20 .20 .21 .21 .15 .15 .15 .15 .10 .09 Adjusted .38 .38 .37 .37 .24 .24 .21 .21 .21 .21 .15 .15 .15 .15 .10 .09 BC .36 .36 .33 .32 .24 .24 .21 .20 .21 .21 .15 .14 .16 .16 .10 .10

BCa .38 .37 .33 .33 .26 .25 .21 .20 .23 .23 .15 .15 .17 .17 .10 .10 ρ = .8 Percentile .26 .26 .25 .23 .17 .17 .14 .13 .13 .13 .09 .08 .10 .10 .06 .05 Adjusted .31 .31 .31 .30 .17 .17 .15 .15 .13 .13 .09 .09 .10 .10 .06 .06 BC .26 .26 .28 .24 .17 .17 .14 .14 .14 .14 .09 .09 .11 .11 .06 .06

BCa .28 .28 .28 .25 .18 .18 .15 .15 .15 .15 .09 .09 .12 .12 .06 .06

Table 4. Analysis of Variance for Bias by Correlation and Bootstrapping Procedure

Source df F p η2 Model 15 3.497 <.001 .030 CORR 3 15.558 <.001 .026 BOOT 3 1.551 .199 .003 CORR * BOOT 9 .125 .999 .001 Error 1712 (.010) Total 1727 Note. Mean square error enclosed in parentheses.


Figure 2. Mean bias by correlation and bootstrapping procedure.

[Line plot: mean bias (0.00 to .35) by bootstrapping procedure (Percentile, Adjusted, BC, BCa) for r, rz, rw and rpb.]

Table 5. Confidence Interval Ratios Averaged Across All Distributional Conditions

n = 20 n = 50 n = 100 n = 250 r rz rw rpb r rz rw rpb r rz rw rpb r rz rw rpb ρ = 0 Percentile .91 .91 1.11 1.02 .91 .91 1.04 1.00 .92 .92 1.03 1.00 .93 .93 1.01 .99 Adjusted .98 .98 1.20 1.10 .94 .94 1.07 1.03 .94 .94 1.04 1.02 .94 .94 1.01 1.00 BC .92 .92 1.11 1.02 .91 .91 1.04 1.00 .92 .92 1.03 1.00 .93 .93 1.01 .99 BCa .92 .92 1.11 1.02 .92 .92 1.04 1.00 .93 .93 1.02 1.00 .94 .94 1.01 .99 ρ = .4 Percentile .84 .84 1.09 1.00 .84 .84 1.04 .99 .92 .92 1.03 1.00 .86 .86 1.01 1.00 Adjusted .91 .91 1.17 1.08 .86 .86 1.07 1.02 .93 .93 1.04 1.02 .87 .87 1.02 1.00 BC .86 .86 1.11 1.02 .84 .84 1.05 1.00 .91 .91 1.02 1.00 .86 .86 1.01 1.00 BCa .85 .86 1.11 1.02 .84 .84 1.05 1.01 .92 .92 1.03 1.00 .86 .86 1.01 1.00 ρ = .8 Percentile .81 .81 1.08 .98 .85 .85 1.05 1.02 .81 .81 1.00 .98 .84 .84 1.01 1.00 Adjusted .87 .87 1.16 1.06 .88 .88 1.08 1.05 .83 .83 1.01 .99 .84 .84 1.01 1.00 BC .85 .85 1.17 1.04 .87 .87 1.08 1.04 .82 .82 1.01 .99 .84 .84 1.01 1.00 BCa .83 .86 1.16 1.05 .85 .87 1.10 1.06 .81 .82 1.02 1.01 .84 .85 1.01 1.01 Note: Ratios greater than 1.0 indicate a bootstrap-estimated interval wider than the “true” interval, and conversely.

Under a few conditions, the BC and ordinary percentile procedures procured slightly more accurate intervals than did the BCa. In addition, the adjusted bootstrap intervals were, by and large, unacceptable, regardless of the robust measure under examination.

Regarding the width of the estimated intervals (see Table 5), no bootstrapping procedure clearly bettered the others. For small sample size conditions the adjusted bootstrap averaged relatively wider intervals. This widening effect improved accuracy for the narrow r- and rz-generated intervals, but penalized rw and rpb. The BCa intervals frequently ran short, the BC intervals shorter still, and the percentile bootstrap the shortest of the four. These trends were slight and not unexpected (e.g., it is widely known that the percentile bootstrap tends to produce narrow intervals).

Comparisons Among Correlations

Confidence intervals formed for rw and rpb generally outperformed those for r and rz for both Type I error rate and bias. Although the present paper does not depict the data disaggregated by distributional shape, predictable results surfaced. Under normality all four correlations produced similar Type I error rates, although the Pearson r and its transform saw slightly lower levels of bias, at least under small sample conditions. However, as distributional shape diverged from normality or included contaminated or mixed distributions, the robust correlations surpassed r. As an aside, the bias index generally produced neater, more theoretically consistent results than did Type I error rate. This is probably due to the dichotomous nature of the latter, that is, a given interval either does or does not enclose the parameter of interest and cause a Type I error, whereas bias is measured on the more sensitive ratio scale of measurement.

Regarding interval width, r and its transform consistently underestimated the "true" endpoints, more so under non-normal conditions. At times, such intervals were little more than half the "true" width. Bootstrapped confidence intervals for the percentage bend correlation closely mimicked the "true" intervals in almost every instance, and intervals for the Winsorized correlation tended to run slightly wide.

Conclusion

This study confirmed that the Winsorized and percentage bend correlations are useful alternatives to the Pearson correlation and are preferred when resilience to distributional non-normality is needed. Results for three of the four comparative indices (efficiency was virtually a constant) confirmed the robustness of the two robust measures under non-normal, mixed, and contaminated distributional conditions, with neither outperforming the other. The percentage bend and Winsorized correlations reduced bias, more accurately reflected theoretical Type I error probabilities, and more faithfully reproduced the width of true (Monte Carlo simulated) intervals. The robust measures compared favorably to r even under the bivariate normal conditions.

Interestingly, across a wide range of simulation conditions the four bootstrapping procedures achieved roughly equivalent outcomes as applied to either robust correlation. The complex BC and BCa procedures failed to offer sizeable improvements in interval accuracy over the percentile bootstrap, and the "adjusted" bootstrap may have even inflated bias and Type I error rate. While this finding may be interpreted as disappointing because the more elaborate procedures did not offer increased accuracy, researchers can be more confident that the ordinary percentile bootstrap is capable of delivering relatively precise confidence intervals for these robust measures.

It may be that the more complicated procedures did not surpass the percentile bootstrap due to the technical specifications of the simulation. The original study design entailed drawing 1,000 samples for each condition, but this number was reduced to 100 given excessive computational demands. Even though the goal in this component of the simulation procedure is not to fully reproduce a sampling distribution, more samples may be necessary to achieve stable asymptotic dynamics. Similarly, the number of bootstrap samples had to be reduced considerably (e.g., setting B to 3,000 produced only 25 samples in eight hours due to the large number of simulated conditions and the involved calculations for rpb). However, for this simulation component, the objective is indeed to model a theoretical sampling distribution, F(θ̂), via a bootstrapped sampling distribution, F̂(θ̂*). Five hundred bootstrap samples may be sufficient for estimating standard errors (Efron, 1987; Efron & Tibshirani, 1993; but cf. Booth & Sarkar, 1998), but not for forming tight confidence bands (Lunneborg, 2000). Follow-up studies should increase these quantities if possible.

The study also revealed that Fisher's transformation of r did not appreciably improve either Type I error rate or bias. When bootstrapping the Pearson correlation, it seems that the r-to-z transformation merely increases computational time without concomitantly affecting accuracy, as supported by Sievers (1996) in his conclusions about rz.

In sum, the robust measures may be recommended for general use when it is desired to quantify the linear association underlying the majority of the sample observations, while excluding outliers. Each of the bootstrapping procedures reviewed maintained similar levels of accuracy and may be applied in estimating confidence intervals for the robust correlations, excepting the adjusted bootstrap.

References King, J. E. (2000). Bootstrapping robust measures of association (Doctoral dissertation, Box, G. E. P. (1953). Non-normality and Texas A&M University, 2000). Dissertation tests on variances. Biometrika, 40, 318-335. Abstracts International, 61(11), 5948B. Booth, J. G., & Sarkar, S. (1998). Monte King, J. E. (2003, February). What have Carlo approximation and bootstrap variances. we learned from 100 years of robustness studies American Statistician, 52, 354-357. on r? Paper presented at the annual meeting of Bradley, J. V. (1977). A common the Southwest Educational Research situation conducive to bizarre distribution Association, San Antonio, TX. shapes. American Statistician, 31, 147-50. Lunneborg, C. E., (2000). Data analysis Bradley, J. V. (1978). Robustness? by resampling: Concepts and applications. British Journal of Mathematical and Statistical Pacific Grove, CA: Brooks/Cole. Psychology, 31, 144-152. Micceri, T. (1989). The unicorn, the Devlin, S. J., Gnanadesikan, R., & normal curve, and other improbable creatures. Kettenring, J. R. (1975). Robust estimation and Psychological Bulletin, 105, 156-166. outlier detection with correlation coefficients. Rasmussen, J. L. (1986). An evaluation Biometrika, 62, 531-545. of parametric and non-parametric tests on Duncan, G. T., & Layard, M. W. (1973). modified and non-modified data. British Journal A Monte-Carlo study of asymptotically robust of Mathematical and Statistical Psychology, 39, tests for correlation. Biometrika, 60, 551-558. 213-220. Efron, B. (1979). Bootstrap methods: Sievers, W. (1996). Standard and Another look at the jackknife. Annals of bootstrap confidence intervals for the correlation Statistics, 7, 1-26. coefficient. British Journal of Mathematical and Efron, B. (1981). Nonparametric Statistical Psychology, 49, 381-396. standard errors and confidence intervals. Stigler, S. M. (1973). Simon Newcomb, Canadian Journal of Statistics, 9, 139-172. Percy Daniell, and the history of robust Efron, B. (1982). The jackknife, the estimation 1885-1920. Journal of the American bootstrap, and other resampling plans. Statistical Association, 68, 872-879. Philadelphia: Society for Industrial and Applied Strube, M. J. (1988). Bootstrap Type I Mathematics. error rates for the correlation coefficient: An Efron, B. (1985). Bootstrap confidence examination of alternate procedures. intervals for a class of parametric problems. Psychological Bulletin, 104, 290-292. Biometrika, 72, 45-58. Thompson, B. (1993). The use of Efron, B. (1987). Better bootstrap statistical significance tests in research: confidence intervals. Journal of the American Bootstrap and other alternatives. Journal of Statistical Association, 82, 171-185. Experimental Education, 61(4), 361-377. Efron, B. (1988). Bootstrap confidence Thompson, B. (2002). What future intervals: Good or bad? Psychological Bulletin, quantitative social science research could look 104, 293-296. like: Confidence intervals for effect sizes. Efron, B., & Tibshirani, R. J. (1993). An Educational Researcher, 31(3), 24-31. introduction to the bootstrap. New York: Vacha-Haase, T., Nilsson, J. E., Reetz, Chapman & Hall. D. R., Lance, T. S., & Thompson, B. (2000). Gnanadesikan, R., & Kettenring, J. R. Reporting practices and APA editorial policies (1972). Robust estimates, residuals, and outlier regarding statistical significance and effect size. detection with multiresponse data. Biometrics, Theory & Psychology, 10, 413-425. 28, 81-124. Wilcox, R. R. (1990). Comparing Hall, P. (1988). 
Theoretical comparison variances and means when distributions have of bootstrap confidence intervals. Annals of non-identical shapes. Communications in Statistics, 16, 927-953. Statistics-Simulation and Computation, 19, 155- Huber, P. J. (1981). Robust Statistics. 173. New York: Wiley.


Wilcox, R. R. (1993). Some results on a Wilcox, R. R., & Muska, J. (2001). Winsorized correlation coefficient. British Inferences about correlations when there is Journal of Mathematical and Statistical heteroscedasticity. British Journal of Psychology, 46, 339-349. Mathematical & Statistical Psychology, 54, 39- Wilcox, R. R. (1994). The percentage 47. bend correlation coefficient. Psychometrika, 59, Wilkinson, L., & APA Task Force on 601-616. Statistical Inference. (1999). Statistical methods Wilcox, R. R. (1997). Tests of in psychology journals: Guidelines and independence and zero correlations among p explanations. American Psychologist, 54, 594- random variables. Biometrical Journal, 39, 183- 604. 193.

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 520-524 1538 – 9472/03/$30.00

JMASM Algorithms and Code JMASM8: Using SAS To Perform Two-Way Analysis Of Variance Under Variance Heterogeneity

Scott J. Richter Mark E. Payton Department of Mathematical Sciences Department of Statistics University of North Carolina at Greensboro Oklahoma State University

We present SAS code to implement the method proposed by Brunner et al. (1997) for performing two- way analysis of variance under variance heterogeneity.

Key words: ANOVA, heteroscedasticity, SAS

Introduction

Brunner et al. (1997) suggested a method to perform tests on main effects and interaction in two-way analysis of variance that allows variance heterogeneity. Their approach is to use a generalization of chi-square approximations dating back to Patnaik (1949) and Box (1954). Their statistic is identical to the classical ANOVA F-statistic under homoscedasticity, and thus their method can be regarded as a robust extension of the classical ANOVA to heteroscedastic designs. They recommend that their method should always be preferred (even in the homoscedastic case) to the classical ANOVA. Richter and Payton (2003) found that the performance of their statistic compares favorably to that of the usual ANOVA F-statistic for sample sizes of at least n = 7 per factor combination.

In this article, we present a SAS program (SAS Institute, Cary, N.C.) for implementing the Brunner et al. (1997) method.

Scott Richter is an Assistant Professor in the Mathematical Sciences Department at the University of North Carolina at Greensboro. His email address is [email protected]. Mark E. Payton is a Professor in the Department of Statistics at Oklahoma State University. His email address is [email protected].

Brunner Method

The method of Brunner et al. (1997) is a small sample adjustment to the well-known Wald statistic, which permits heterogeneous variance but is known to have inflated Type I error rates for small sample sizes. Consider a two-way layout with a levels of factor A and b levels of factor B. Assume a set of independent random variables

X_ij ~ N(µ_i, σ_i²),  i = 1, ..., a·b.

Let µ = (µ1, µ2, ..., µab)′ denote the vector containing the a·b population means. Then the hypotheses of no main effects and interaction can be written as

H0(A): M_A µ = 0
H0(B): M_B µ = 0
H0(AB): M_AB µ = 0,

where

M_A = P_a ⊗ (1/b) J_b
M_B = (1/a) J_a ⊗ P_b
M_AB = P_a ⊗ P_b.


Here P_c = I_c − (1/c) J_c, where I_c is a c×c identity matrix, J_c a c×c matrix of 1's, and the symbol ⊗ represents the Kronecker product of the matrices. The vector of observed cell means is denoted by X̄ = (X̄_1, ..., X̄_ab)′ and the estimated covariance matrix is given by Ŝ_N = N · diag{S_1²/n_1, ..., S_ab²/n_ab}, where S_i² is the ith sample variance and N = Σ_{i=1}^{ab} n_i.

For a complete cross-classification, the test statistic is

F_B = N · X̄′ M X̄ / [ (n − 1)⁻¹ tr(Ŝ_N) ],

which has an approximate F distribution with

f_num = [ (n − 1)⁻¹ tr(Ŝ_N) ]² / tr( M Ŝ_N M Ŝ_N )

numerator and

f_den = [ tr(Ŝ_N) ]² / tr( Ŝ_N² Λ )

denominator degrees of freedom, where

Λ = diag{ 1/(n_1 − 1), ..., 1/(n_ab − 1) }

(Brunner, 1997).

References

Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. The Annals of Mathematical Statistics, 25, 290-302.
Brunner, E., Dette, H., & Munk, A. (1997). Box-type approximations in nonparametric factorial designs. Journal of the American Statistical Association, 92, 1494-1502.
Patnaik, P. B. (1949). The noncentral χ² and F-distributions and their applications. Biometrika, 36, 202-232.
Richter, S. J., & Payton, M. E. (2003). Performing two-way analysis of variance under variance heterogeneity. Journal of Modern Applied Statistical Methods, 2(1), 152-160.
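The computation that the SAS/IML program in the Appendix carries out can be sketched compactly in Python as follows (shown for the factor-A test). The cell sizes, means and variances are made-up example data, and Λ is taken as diag{1/n_i}, matching the Appendix code.

import numpy as np
from scipy import stats

# Illustrative transcription of the factor-A test computed by the Appendix SAS/IML code.
a, b = 2, 3                                   # levels of factors A and B
n_i = np.array([8, 9, 7, 10, 8, 9])           # cell sample sizes (row-major over A x B)
xbar = np.array([5.1, 4.8, 5.5, 6.0, 5.9, 6.4])   # cell means (made-up)
s2_i = np.array([1.2, 0.9, 2.1, 1.5, 0.7, 1.8])   # cell sample variances (made-up)
N = n_i.sum()

J = lambda c: np.ones((c, c))
P = lambda c: np.eye(c) - J(c) / c
M_A = np.kron(P(a), J(b) / b)                 # contrast matrix for the A main effect

S_hat = N * np.diag(s2_i / n_i)               # estimated covariance matrix S_N-hat
D_M = np.diag(np.diag(M_A))                   # diagonal part of M_A, as in the IML code
Lam = np.diag(1.0 / n_i)                      # Lambda as used in the Appendix code

F_B = N * xbar @ M_A @ xbar / np.trace(D_M @ S_hat)
f_num = np.trace(D_M @ S_hat) ** 2 / np.trace(M_A @ S_hat @ M_A @ S_hat)
f_den = np.trace(D_M @ S_hat) ** 2 / np.trace(D_M @ D_M @ S_hat @ S_hat @ Lam)
p_value = stats.f.sf(F_B, f_num, f_den)
print(round(F_B, 3), round(f_num, 2), round(f_den, 2), round(p_value, 4))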

Appendix

The following program can be used to perform the FB test described above.

/* Program to compute unadjusted and Box-adjusted F-ratios and p-values for a two-way layout. */ data rcht; input a b RESP @@; datalines; ; proc sort; by a b;


/* Run Proc Mixed to calculate variances */ proc mixed data=rcht; class a b; model RESP=a|b; repeated/type=un(1) group=a*b;

/* Create data sets for covariances, means and class levels for input to Proc IML */ ods listing exclude covparms; ods output covparms=tempcov; ods listing exclude classlevels; ods output classlevels=levels; ods listing exclude dimensions; ods output dimensions=sizes;

/* Suppress printing of Proc Mixed tables */ ods listing exclude fitstatistics; ods listing exclude reml; ods listing exclude ConvergenceStatus; ods listing exclude IterHistory; ods listing exclude lrt; ods listing exclude modelinfo; ods listing exclude tests3;

/* Use Proc GLM to calculate and output Type III F-statistics */ proc glm data = rcht noprint outstat=fstats; class a b; model RESP=a|b/ss3; run;

/* Use Proc Means to calculate and output cell means and sample sizes */ proc means data=rcht noprint; var RESP; by a b; output mean=means n=n out=tempss;

/* Begin Proc IML to calculate adjusted df and p-values */ proc iml;

/* Create matrices from data sets created above */ use levels; read point 1 var {levels} into nla; read point 2 var {levels} into nlb; use sizes; read point 1 var {value} into parms; read point 8 var {value} into nobs; use tempss; read all var {n} into ni; use tempcov; read all var {estimate} into tsighat; sighat=diag(tsighat); do i=1 to parms; sighat[i,i]=sighat[i,i]/ni[i]; end; shat=nobs*sighat; use tempss; read all var {means} into Xbar; results=j(3,4,0); use fstats; read point 2 var {F} into FHA; read point 3 var {F} into FHB; read point 4 var {F} into FHAB; read point 2 var {df} into dfa; read point 3 var {df} into dfb; read point 4 var {df} into dfAB; read point 1 var {df} into dfe; RESULTS[1,1]=FHA; RESULTS[2,1]=FHB; RESULTS[3,1]=FHAB;

/* Calculations for Box-type adjustment */ MA=(i(nla)-(1/nla)*j(nla))@(1/nlb*j(nlb)); DMA=diag(MA); denA=trace(DMA*Shat); dmas=dma*shat; QA=nobs*Xbar`*MA*Xbar; FNA=QA/denA; RESULTS[1,3]=FNA;

MB=((1/nla)*j(nla))@(i(nlb)-(1/nlb)*j(nlb)); DMB=diag(MB); denb=trace(DMb*Shat); Qb=nobs*Xbar`*Mb*Xbar; FNB=Qb/denb; RESULTS[2,3]=FNB;

MAB=(i(nla)-(1/nla)*j(nla))@(i(nlb)-(1/nlb)*j(nlb)); DMAB=diag(MAB); denAb=trace(DMAb*Shat); QAb=nobs*Xbar`*MAb*Xbar; FNAB=QAb/denAb; RESULTS[3,3]=FNAB; Lambda=DIAG(1/NI);

/* Calculate adjusted df */ fA=((trace(DMA*Shat))**2)/(trace((MA*Shat)*(MA*Shat))); foA=(trace(DMA*Shat))**2/(trace(DMA**2*Shat**2*Lambda)); fB=((trace(DMB*Shat))**2)/(trace((MB*Shat)*(MB*Shat))); foB=(trace(DMB*Shat))**2/(trace(DMB**2*Shat**2*Lambda)); fAB=((trace(DMAB*Shat))**2)/(trace((MAB*Shat)*(MAB*Shat))); foAB=(trace(DMAB*Shat))**2/(trace(DMAB**2*Shat**2*Lambda));

/* Calculate p-values */ adjpvalA=1-probf(FnA,fA,foA); RESULTS[1,4]=ADJPVALA; adjpvalB=1-probf(FnB,fB,foB); RESULTS[2,4]=ADJPVALB; adjpvalAB=1-probf(FnAB,fAB,foAB); RESULTS[3,4]=ADJPVALAB; pvala=1-probf(fHa,dfa,dfe); RESULTS[1,2]=PVALA; pvalb=1-probf(fHb,dfb,dfe); RESULTS[2,2]=PVALB; pvalab=1-probf(fHab,dfab,dfe); RESULTS[3,2]=PVALAB;

/* Print results */ headings={' ANOVA F' ' P-value' ' Adjusted F' ' P-value'}; EFFECT={A,B,AB}; mattrib RESULTS colname=HEADINGS FORMAT=7.3 label='Results'; print effect RESULTS; quit;

Journal of Modern Applied Statistical Methods Copyright © 2003 JMASM, Inc. November, 2003, Vol. 2, No. 2, 525-530 1538 – 9472/03/$30.00

JMASM9: Converting Kendall’s Tau For Correlational Or Meta-Analytic Analyses

David A. Walker Educational Research and Assessment Northern Illinois University

Expanding on past research, this study provides researchers with a detailed table for use in meta-analytic applications when engaged in assorted examinations of various r-related statistics, such as Kendall’s tau (τ) and Cohen’s d, that estimate the magnitude of experimental or observational effect. A program to convert from the lesser-used tau coefficient to other effect size indices when conducting correlational or meta-analytic analyses is presented.

Key words: Effect size, meta-analysis

Introduction

There is a heightened effort within the social and behavioral sciences to report effect sizes with research findings (APA, 2001; Henson & Smith, 2000; Knapp, 1998). Effect sizes show the strength and magnitude of a relationship and account for the total variance of an outcome. The American Psychological Association (APA) encouraged recently, "Always provide some effect size estimate when reporting a p value" (Wilkinson & The APA Task Force on Statistical Inference, 1999, p. 599). An analysis of effect sizes allows researchers to evaluate the importance of the result and not just the probability of the result (Kirk, 1996; Shaver, 1985).

Furthermore, effect sizes fall into two categories: d and r. The d group encompasses measures of effect size in terms of mean difference and standardized mean difference. Cohen (1988) defined the values of effect sizes for this group as small d = .20, medium d = .50, and large d = .80.

David Walker is an Assistant Professor of Educational Research and Assessment, Northern Illinois University, ETRA Department, 208 Gabel, DeKalb, IL 60115, 815-753-7886. E-mail him at: [email protected].

The r group can be considered as based on the correlation between treatment and result (Levin, 1994). For this group, "Effect size is generally reported as some proportion of the total variance accounted for by a given effect" (Stewart, 2000, p. 687). Cohen (1988) suggested that values of .01, .06, and .14 serve as indicators of small, medium, and large effect sizes for this group. However, it is at the discretion of the researcher to note the context in which small, medium, and large effects are being defined when using d and r related indices.

A review of the literature indicated that researchers have discussed the merits, or lack thereof, of employing correlation coefficients, such as Kendall's tau (τ), to assist in conducting meta-analytic studies or other forms of correlational and/or experimental analyses (Cooper & Hedges, 1994; Ferguson & Takane, 1989; Gibbons, 1985; Gilpin, 1993; Roberts & Kunst, 1990; Smithson, 2001; Wolf, 1987). Indeed, the reporting of tau in research studies has not been as prevalent as found with Spearman's rho (ρ) or Pearson's r. However, tau has been emphasized recently as a substitute for r in various research contexts. For instance, Rupinski and Dunlap (1996) looked at the accuracy of formulas for estimating r from tau.


Methodology

Purpose

Tables for transforming correlation coefficients, such as Pearson's r to Spearman's rho, have been produced in the recent past (Gilpin, 1993; Strahan, 1982). Expanding upon Gilpin and Strahan's research, this study will provide researchers with a detailed table for use in meta-analytic applications when engaged in assorted examinations of various r-related statistics, such as Fisher's Zr and Cohen's d, that estimate the magnitude of experimental or observational effect. In addition, the table will be expanded to measure values in increments of .001 from .001 to 1.000, add more commonly utilized effect size variants of r not found in the original, and, most importantly, provide SPSS syntax to convert from the lesser-used tau coefficient to other effect size indices when conducting correlational or meta-analytic analyses.

Assumptions

This research study is not intended to be an exhaustive study of effect sizes, but serves as a prologue to impart contextual reference to the internal matrix table being presented. Also, it is presupposed that researchers understand that tau and rho apply distinct metrics, which means that they cannot be likened to one another due to a great difference between their absolute values (cf. Kendall, 1970; Strahan, 1982). As noted by researchers (Kendall, 1970; Gilpin, 1993), as the values of τ and ρ increase, their numerical similitude decreases greatly. The same trend holds for these two correlation coefficients' squared indices, where "τ² does not reflect at all adequately the proportion of shared variance..." (Strahan, 1982, p. 764).

A further assumption is that the table produced by means of this procedure reflects accurate values when a normal distribution is present, as well as a relatively large sample size. Also, the true values of the squared indices will be non-negative and the non-squared index values are symmetrical (i.e., τ < 0), thus remaining the same numerically when negative.

Results

In the accompanying SPSS data set, we are given a value for tau ranging from .001 to 1.000. With a presented value of tau, the table that ensues can be created in total as an internal matrix via the SPSS syntax program provided in Appendix A or as individual, selected conditions by way of the subsequent formulas.

With a presented value of tau, we can calculate a Pearson's r using Kendall's formula (1970, p. 126).

r = sin(.5πτ)   (1)

COMPUTE r=SIN(3.141592654*τ*.5).
EXECUTE.

Further, with a known value for τ and r, we can compute a Spearman's rho statistic using Gilpin's formula (1993, p. 91),

ρ = 3[τ sin⁻¹(r/2)] / sin⁻¹(r)   (2)

COMPUTE p=3*τ*ARSIN(r/2)/ARSIN(r).
EXECUTE.

To compute a Fisher's Zr statistic from a given value of r, derived from tau, we can apply the following formula (Rosenthal, 1994, p. 237).

Zr = ½ loge[(1 + r) / (1 – r)]   (3)

COMPUTE z = .5*LN((1+r)/(1-r)).
EXECUTE.

To calculate a Cohen's d from a known r, derived from tau, the subsequent formula was employed (Rosenthal, 1994, p. 239). Note that for small to medium sample sizes, this formula will yield positively biased estimates. To correct for this, if presented with this situation, the expression should be multiplied by the factor [(n – 1) / n]^.5.

d = 2r / [(1 – r²)^.5]   (4)


COMPUTE d = 2*r/SQRT(1-r**2).
EXECUTE.

To calculate a Cohen's f statistic from a given d, derived from tau, the ensuing formula was utilized (Cohen, 1988, p. 276).

f = ½d   (5)

COMPUTE f = .5*d.
EXECUTE.

To determine the amount of variance accounted for with correlation coefficients, such as r, ρ, or τ, we square their value, which yields the extent of the effect in terms of "how much of the variability in the dependent variable(s) is associated with the variation in the independent variable(s)" (Snyder & Lawson, 1993, p. 338). A caution should be noted, however, that when using squared indices to determine effect size, a loss of directionality is an issue and also the power affiliated with these indices is often distorted when reporting research findings (Rosenthal, 1994).

With r-related squared indices, it should be mentioned that eta-squared (η²) and r² are identical numerically and r² and f² are related monotonically. Cohen (1994) determined that η² was a population correlation ratio that could be "... computed on samples and its population value estimated therefrom" (p. 281). Effect size estimates of this order have been called epsilon-squared (ε²) and omega-squared (ω²). Thus, this type of effect size tends to measure the proportion of variance in the population due to a particular effect. Cohen's formula for η² (1994, p. 281) can be used if r² is not preferred, where f² = d²/4.

η² = f² / (1 + f²)   (6)

COMPUTE eta2 = f**2/(1+f**2).
EXECUTE.

Conclusion

Methodological appropriateness is a consequential area within research that should be nearly perfect. Concepts such as effect size need to be addressed correctly in a research study to make reliable, justifiable decisions and have these decisions, based on the statistics of study, authenticated by others. Converting from the lesser-used tau coefficient to other effect size indices, the presented SPSS syntax program can create an internal matrix table and new data set to assist researchers in determining the size of an effect for commonly utilized r-related indices when engaging in correlational and meta-analytic analyses.

References

American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cooper, H., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Ferguson, G. A., & Takane, Y. (1989). Statistical analysis in psychology and education (6th ed.). New York: McGraw-Hill.
Gibbons, J. D. (1985). Nonparametric methods for quantitative analysis (2nd ed.). Columbus, OH: American Sciences Press, Inc.
Gilpin, A. R. (1993). Table for conversion of Kendall's tau to Spearman's rho within the context of measures of magnitude of effect for meta-analysis. Educational and Psychological Measurement, 53, 87-92.
Henson, R. K., & Smith, A. D. (2000). State of the art in statistical significance and effect size reporting: A review of the APA Task Force report and current trends. Journal of Research and Development in Education, 33, 285-296.
Kendall, M. G. (1970). Rank correlation methods (4th ed.). London: Charles Griffin & Co.
Kirk, R. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.

Knapp, T. R. (1998). Comments on the statistical significance testing articles. Research in Schools, 5, 39-41.
Levin, J. R. (1994). Crafting educational intervention research that's both credible and creditable. Educational Psychology Review, 6, 231-243.
Roberts, D. M., & Kunst, R. E. (1990). A case against continuing use of the Spearman formula for rank-order correlation. Psychological Reports, 66, 339-349.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 213-244). New York: Russell Sage Foundation.
Rupinski, M. T., & Dunlap, W. P. (1996). Approximating Pearson product-moment correlations from Kendall's tau and Spearman's rho. Educational and Psychological Measurement, 56, 419-429.
Shaver, J. (1985). Chance and nonsense. Phi Delta Kappan, 67, 57-60.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement, 61, 605-632.
Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 334-349.
Stewart, D. W. (2000). Testing statistical significance testing: Some observations of an agnostic. Educational and Psychological Measurement, 60, 685-690.
Strahan, R. F. (1982). Assessing magnitude of effect from rank-order correlation coefficients. Educational and Psychological Measurement, 42, 763-765.
Wilkinson, L., & The APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Wolf, F. M. (1987). Meta-analysis: Quantitative methods for research synthesis. Beverly Hills, CA: Sage Publications.


Appendix

Notes: To produce an internal matrix output and a complete table as a new data set in SPSS, access the embedded Tau Data Set and then run Tau Syntax. Create the data set in SPSS with a variable (1 column by 1000 rows) containing the values .001 - 1.000 (.001), or contact the author to obtain copies of the data set and the syntax. An example of the tabled values appears on the following page.

compute r = SIN(3.141592654 * t * .5).
compute rs = 3 * t * ARSIN(r / 2) / ARSIN(r).
compute zr = .5 * LN((1 + r) / (1 - r)).
compute d = 2 * r / SQRT(1 - r ** 2).
compute f = d * .5.
compute r2 = r ** 2.
compute f2 = d ** 2 / 4.
compute eta2 = (f2) / (1 + f2).
execute.
* FINAL REPORTS *.
FORMAT r to eta2 (f9.4).
VARIABLE LABELS t 'Tau' /r 'Pearsons r' /rs 'Spearmans Rank' /zr 'Fishers z' /d 'Cohens d'
 /f 'f (Related to d as an SD of Standardized Means when k = 2 and n = n)' /r2 'R Square' /eta2 'Eta Square'.
REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=t r rs zr
 /TITLE "Proportion of Variance-Accounted-For Effect Sizes: Measures of Relationship".
REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=d f
 /TITLE "Standardized Mean Difference Effect Sizes".
REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=r2 eta2
 /TITLE "Proportion of Variance-Accounted-For Effect Sizes: Squared Indices".
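For researchers working outside SPSS, the following is a minimal sketch of the same conversions in Python, assuming only NumPy. The function name tau_conversions and the variable names are illustrative and mirror the SPSS syntax above; they are not part of the published program.

import numpy as np

def tau_conversions(tau):
    # Formulas (1)-(6) of the article; names mirror the SPSS syntax above.
    tau = np.asarray(tau, dtype=float)
    r = np.sin(0.5 * np.pi * tau)                     # (1) Pearson's r from tau
    rs = 3 * tau * np.arcsin(r / 2) / np.arcsin(r)    # (2) Spearman's rho
    zr = 0.5 * np.log((1 + r) / (1 - r))              # (3) Fisher's Zr
    d = 2 * r / np.sqrt(1 - r ** 2)                   # (4) Cohen's d
    f = 0.5 * d                                       # (5) Cohen's f
    r2 = r ** 2
    f2 = d ** 2 / 4
    eta2 = f2 / (1 + f2)                              # (6) eta-squared
    return r, rs, zr, d, f, r2, eta2

# The same .001 to 1.000 grid used for the table (note: tau = 1 gives r = 1,
# so Zr and d are infinite there, just as in the SPSS program).
t = np.round(np.arange(0.001, 1.001, 0.001), 3)
table = np.column_stack([t] + list(tau_conversions(t)))
print(np.round(table[49], 4))   # row for tau = .050

For tau = .050, this sketch reproduces the corresponding row of the example table that follows (r ≈ .0785, rho ≈ .0749, Zr ≈ .0786, d ≈ .1574, f ≈ .0787, r² ≈ η² ≈ .0062).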

An Example Of Tabled Values

tau     r       p       Zr      d       f       R2      eta2
0.001   0.0016  0.0015  0.0016  0.0031  0.0016  0.0000  0.0000
0.002   0.0031  0.0030  0.0031  0.0063  0.0031  0.0000  0.0000
0.003   0.0047  0.0045  0.0047  0.0094  0.0047  0.0000  0.0000
0.004   0.0063  0.0060  0.0063  0.0126  0.0063  0.0000  0.0000
0.005   0.0079  0.0075  0.0079  0.0157  0.0079  0.0001  0.0001
0.006   0.0094  0.0090  0.0094  0.0189  0.0094  0.0001  0.0001
0.007   0.0110  0.0105  0.0110  0.0220  0.0110  0.0001  0.0001
0.008   0.0126  0.0120  0.0126  0.0251  0.0126  0.0002  0.0002
0.009   0.0141  0.0135  0.0141  0.0283  0.0141  0.0002  0.0002
0.010   0.0157  0.0150  0.0157  0.0314  0.0157  0.0002  0.0002
0.011   0.0173  0.0165  0.0173  0.0346  0.0173  0.0003  0.0003
0.012   0.0188  0.0180  0.0189  0.0377  0.0189  0.0004  0.0004
0.013   0.0204  0.0195  0.0204  0.0408  0.0204  0.0004  0.0004
0.014   0.0220  0.0210  0.0220  0.0440  0.0220  0.0005  0.0005
0.015   0.0236  0.0225  0.0236  0.0471  0.0236  0.0006  0.0006
0.016   0.0251  0.0240  0.0251  0.0503  0.0251  0.0006  0.0006
0.017   0.0267  0.0255  0.0267  0.0534  0.0267  0.0007  0.0007
0.018   0.0283  0.0270  0.0283  0.0566  0.0283  0.0008  0.0008
0.019   0.0298  0.0285  0.0298  0.0597  0.0299  0.0009  0.0009
0.020   0.0314  0.0300  0.0314  0.0629  0.0314  0.0010  0.0010
0.021   0.0330  0.0315  0.0330  0.0660  0.0330  0.0011  0.0011
0.022   0.0346  0.0330  0.0346  0.0691  0.0346  0.0012  0.0012
0.023   0.0361  0.0345  0.0361  0.0723  0.0361  0.0013  0.0013
0.024   0.0377  0.0360  0.0377  0.0754  0.0377  0.0014  0.0014
0.025   0.0393  0.0375  0.0393  0.0786  0.0393  0.0015  0.0015
0.026   0.0408  0.0390  0.0409  0.0817  0.0409  0.0017  0.0017
0.027   0.0424  0.0405  0.0424  0.0849  0.0424  0.0018  0.0018
0.028   0.0440  0.0420  0.0440  0.0880  0.0440  0.0019  0.0019
0.029   0.0455  0.0435  0.0456  0.0912  0.0456  0.0021  0.0021
0.030   0.0471  0.0450  0.0471  0.0943  0.0472  0.0022  0.0022
0.031   0.0487  0.0465  0.0487  0.0975  0.0487  0.0024  0.0024
0.032   0.0502  0.0480  0.0503  0.1006  0.0503  0.0025  0.0025
0.033   0.0518  0.0495  0.0519  0.1038  0.0519  0.0027  0.0027
0.034   0.0534  0.0510  0.0534  0.1069  0.0535  0.0028  0.0028
0.035   0.0550  0.0525  0.0550  0.1101  0.0550  0.0030  0.0030
0.036   0.0565  0.0540  0.0566  0.1132  0.0566  0.0032  0.0032
0.037   0.0581  0.0555  0.0582  0.1164  0.0582  0.0034  0.0034
0.038   0.0597  0.0570  0.0597  0.1195  0.0598  0.0036  0.0036
0.039   0.0612  0.0585  0.0613  0.1227  0.0613  0.0037  0.0037
0.040   0.0628  0.0600  0.0629  0.1258  0.0629  0.0039  0.0039
0.041   0.0644  0.0615  0.0644  0.1290  0.0645  0.0041  0.0041
0.042   0.0659  0.0630  0.0660  0.1321  0.0661  0.0043  0.0043
0.043   0.0675  0.0645  0.0676  0.1353  0.0676  0.0046  0.0046
0.044   0.0691  0.0660  0.0692  0.1385  0.0692  0.0048  0.0048
0.045   0.0706  0.0675  0.0707  0.1416  0.0708  0.0050  0.0050
0.046   0.0722  0.0690  0.0723  0.1448  0.0724  0.0052  0.0052
0.047   0.0738  0.0705  0.0739  0.1479  0.0740  0.0054  0.0054
0.048   0.0753  0.0719  0.0755  0.1511  0.0755  0.0057  0.0057
0.049   0.0769  0.0734  0.0770  0.1542  0.0771  0.0059  0.0059
0.050   0.0785  0.0749  0.0786  0.1574  0.0787  0.0062  0.0062


Letters To The Editor

Additional Reflections On Significance Testing

Knapp, T. R. (2002). Some reflections on significance testing. Journal of Modern Applied Statistical Methods, 1(2), 240-242.

Knapp (2002) raised good points concerning significance testing. The bottom line, "If you have hypotheses to test … [then] test them", not only makes sense, but in fact is an argument that I have often made when consulting, although the closest I have come to this issue in the literature is an allusion (Berger, 2000, Section 2.1).

Yet, the support for this assertion, based on refuting the statement that confidence intervals provide identical or non-conflicting inferences with significance tests, might benefit from elaboration. In fact, the confidence interval constructed by Knapp (2002) is but one of several that could have been constructed. To argue that it is not the best among these is to argue that its discredit cannot serve as a simultaneous discredit to the class it purports to represent (albeit not very well).

It would be a simple matter to construct a confidence region as precisely the set of parameter values which, when serving as the null hypothesis, lead to significance tests that cannot be rejected. With this definition of a confidence set (which often, but not always, reduces to a confidence interval), it is a tautology that the confidence set cannot contradict the results of the significance test.

Why, then, should hypotheses be tested? Because it is problematic to base policy decisions that affect the public on apparent directions of effect when the study is conducted by a party with a vested interest in the outcome. Requiring statistical significance is one reasonable way to operationalize the need for a preponderance of evidence, and raise the hurdle, in such a case. If an alpha level were chosen strategically, perhaps based on safety, convenience, and cost in a medical study, then the results of the significance test of efficacy would correspond to the optimal decision. Clearly, there are other ways to raise the hurdle.

Vance W. Berger, Biometry Research Group, National Cancer Institute. E-mail: [email protected].

Reference

Berger, V. W. (2000). Pros and cons of permutation tests in clinical trials. Statistics in Medicine, 19, 1319-1328.
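As a concrete illustration of the construction described in the letter above (a confidence set obtained by inverting a significance test), the following is a minimal sketch in Python, assuming SciPy 1.7+; the data (14 successes in 20 trials), the grid, and the function name confidence_set are made up for illustration and are not drawn from the letter.

import numpy as np
from scipy import stats

def confidence_set(k, n, alpha=0.05):
    # All null proportions p0 (on a fine grid) that a two-sided exact
    # binomial test of H0: p = p0 does NOT reject at level alpha.
    grid = np.linspace(0.001, 0.999, 999)
    return [p0 for p0 in grid if stats.binomtest(k, n, p0).pvalue > alpha]

# Hypothetical data: 14 successes in 20 trials.
kept = confidence_set(14, 20)
# Here the retained values form an interval, as they usually (but not always) do.
print(round(min(kept), 3), round(max(kept), 3))

By construction, a null value is in the set exactly when the corresponding test is not rejected, which is the tautology the letter points out.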


______

Predictor Importance In Multiple Regression

Whittaker, T. A., Fouladi, R. T., & Williams, N. J. (2002). Determining predictor importance in multiple regression under varied correlational and distributional conditions. Journal of Modern Applied Statistical Methods, 1(2), 354-366.

William Kruskal may have been right in noting that the relative importance of predictors in a regression analysis is meaningful to researchers, but I'm not so sure it should always be the case.

My principal concerns about the Whittaker et al. (2002) article are these:

1. Multiple regression analysis is used for prediction and for causal analysis. When a user asks: "What are the most important variables in this regression?", the answer depends upon the purpose of the analysis. Whittaker et al. failed to distinguish sufficiently between the two purposes and seem to advocate a "one size fits all" method for determining the relative importance of regressors (apparently Budescu's dominance analysis, perhaps augmented by the Johnson index).

2. I do not see the need for the Monte Carlo approach to the problem. In his text, Darlington (1990) provided a mathematical explanation for the equivalence with respect to rank-ordering of importance of their Methods 3 (the t statistics for the betas), 5 (the squared partials) and 6 (the squared semi-partials), along with two others (the p-values for the betas and the changes in R-square from the reduced model with the variable deleted to the full model with the variable included). [They do cite Darlington (1968), but do not cite his later text.] On page 364 of their article they speculated as to why Methods 3, 5, and 6 "all performed identically". Those three methods must perform identically.
As far as the other five methods are concerned, Method 1 (squared zero-order correlations, or unsquared zero-order correlations, for that matter) can be dismissed out of hand, because the other regressors are not statistically controlled.
I can't see any reason why anyone would ever use Method 2 (the betas). For one thing, the betas aren't restricted to the -1 to +1 range, so although they are standardized they are awkward to compare.
Method 4 (the beta-times-r products) has been criticized in the past (see, for example, Darlington, 1968). The fact that those products sum to the over-all R-square is a poor basis for variance partitioning and for the determination of relative importance (some of those products can even be negative--for suppressor variables).
That leaves Methods 7 (Budescu) and 8 (Johnson). I have no doubt that similar, non-Monte Carlo-based, arguments could be made regarding those methods for determining the relative importance of regressors, but even if such arguments were necessary, Whittaker et al. (2002) were interested in comparing all eight methods, not just those two.

3. The two different meanings of the word "dominance" were confusing. One of the meanings, "dominance analysis", is associated with Budescu's method. The other meaning, identifying the "dominant predictor" (p. 358), was apparently the criterion for determining which methods were best. When I first read the article I thought that the Budescu method was used as one of the methods AND as the goodness criterion, which would of course "stack the deck" in its favor. The confusion with the two meanings, however, remains.

I have a couple of other lesser concerns:

1. Their definition of "the dominant predictor" is a bit strange. In what sense is an independent variable that correlates .40 - .60 with the dependent variable dominant over other independent variables that correlate .30 with the dependent variable?

2. Their "Nursing Facility Consumer Satisfaction Survey" example is a poor example. The data are for seven-point Likert-type scales with ridiculously high means and there is an inherent regressor/regressand contamination since the three predictors are concerned with specific satisfactions and the dependent variable is over-all satisfaction.

Thomas R. Knapp, Professor Emeritus, University of Rochester & The Ohio State University.

Note: I would like to thank Richard Darlington for his very helpful suggestions regarding an earlier version of this critique.

References

Darlington, R. B. (1968). Multiple regression in psychological research and practice. Psychological Bulletin, 69(3), 161-182.
Darlington, R. B. (1990). Regression and linear models. New York: McGraw-Hill.
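A small numerical check of the equivalence noted in the letter above can be run on made-up data; the sketch below is in Python, assuming only NumPy, and none of its numbers refer to the Whittaker et al. study. For each predictor, the squared semipartial is the increase in R-square when that predictor is entered last, and the squared partial and squared t statistic are monotone functions of that same increase, so the three measures must rank the predictors identically.

import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))                                    # made-up predictors
y = X @ np.array([0.5, 0.3, 0.1, 0.0]) + rng.normal(size=n)    # made-up outcome

def r_squared(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

R2_full = r_squared(X, y)
measures = []
for j in range(p):
    R2_red = r_squared(np.delete(X, j, axis=1), y)
    sr2 = R2_full - R2_red                        # squared semipartial ("Method 6")
    pr2 = sr2 / (1 - R2_red)                      # squared partial ("Method 5")
    t2 = sr2 * (n - p - 1) / (1 - R2_full)        # squared t for the beta ("Method 3")
    measures.append((sr2, pr2, t2))

measures = np.array(measures)
# All three columns order the predictors identically.
print([np.argsort(-measures[:, c]).tolist() for c in range(3)])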


Statistical Pronouncements II

“There are some who appear to pride themselves on their absence of knowledge of mathematics. I never understood why it should be a matter of pride” - Arthur L. Bowley (1934, Discussion, The Journal of the Royal Statistical Society, 97, p. 607).

“Do we know more than was known to Todhunter?” - Arthur L. Bowley (ibid, p. 609).

“To try and state mathematics without either chalk or with a minimum of chalk [is] perhaps a hopeless task” - L. Isserlis (ibid, p. 614).

“The advantage of excluding by severe mathematical requirements many quacks is bought too dear if it shuts out a single John Graunt” - Major Greenwood (1939, Journal of the Royal Statistical Society, 102, p. 552).

“Many important applications of statistics, while employing elementary statistical techniques, demand thorough knowledge and long experience in the applied field” - William G. Cochran (1945, Training at the professional level for statistical work in agriculture and biology, Journal of the American Statistical Association, 40, p. 163).

“Youth is the time to learn mathematics” - William G. Cochran (1946, Graduate training in statistics, American Mathematics Monthly, 53(4), p. 199).

“Statistics depends primarily on mathematics and mathematicians for its future development…Such mathematicians need not be regarded as lost or strayed from the fold” - William G. Cochran (ibid, p. 199).

“The missing link is that we do not know which of the theoretical non-normal distributions that have been studied are typical of the error distributions that turn up in practice” - William G. Cochran (1947, Some consequences when the assumptions for the analysis of variance are not satisfied, Biometrics, 3(1), p. 25).

“No valid sampling error formula exists unless the selection of the sample [is] through the use of an objective method of randomization” - William G. Cochran (1947, Recent developments in sampling theory in the United States, Proceedings of the International Statistical Institute, 3(A), p. 41).

“The student should be warned that he cannot expect miracles to be wrought by the use of statistical tools” - Quinn McNemar (1949, Psychological statistics, Wiley, p. 3).

“Much ingenuity is shown by investigators in concocting possible explanations of the discrepancies among the results of different workers” - William G. Cochran (1950, The present status of biometry, Bulletin of the International Statistical Institute, 32(2), p. 133).

“Any estimate made from a sample is subject to error” - William G. Cochran (1951, Modern methods in the sampling of human populations, American Journal of Public Health, 41(6), p. 647).

“The principle that governs modern sampling practice is the familiar economic maxim that one should get the most for one’s money” - William G. Cochran (ibid, p. 648).

“Any theory is at best approximately true, but nevertheless, if we are going to reject a theory, we do so because it does not fit the data we have, not because it would not fit a much larger sample of data that we do not have” - William G. Cochran (ibid, p. 336).

“Sampling is all too often taken far too lightly” - William G. Cochran (1954, Principles of sampling, Journal of the American Statistical Association, 49, p. 13).

“[There are] unwarranted shotgun marriages between the quantitatively unsophisticated idea of sample as ‘what you get by grabbing a handful’ and the mathematically precise notion of a ‘simple random sample’” - William G. Cochran (ibid, p. 13).


“The scientist tends to think of seeing a statistician when he has some problem… mostly when something had gone wrong with the experiment or survey… As a result, statisticians… see a sorry collection of the wrecks of research projects” - William G. Cochran (1955, Research techniques in the study of human beings, Milbank Memorial Fund Quarterly, 33(2), p. 122).

“In statistical training centers, something is done to teach young statisticians how to get along with scientists” - William G. Cochran (ibid, p. 123).

“The statistician is a poor marriage risk, and may be suffering from marital strain” - William G. Cochran (ibid, p. 124).

“The ability to do experiments is one of the most powerful weapons man has for making advances in his understanding of the world” - William G. Cochran (1957, The philosophy underlying the design of experiments, Proceedings of the 1st Conference on the Design of Experiments in Army Research, Development and Testing, p. 1).

“All mathematical methods are oversimplifications” - William G. Cochran (1961, The role of mathematics in the medical sciences, New England Journal of Medicine, 265, p. 176).

“Many of the standard results in theoretical statistics were obtained without encountering really difficult mathematics” - William G. Cochran (ibid, p. 230).

“Electronic machines… can free us from overdependence on the assumption of normality and from confinement to approximate linear solutions to nonlinear problems” - William G. Cochran (ibid, p. 232).

“Nonparametric theory is elegant” - Jaroslav Hajek (1969, A course in nonparametric statistics, Holden-Day, preface).

“As regard the rejection of observations [as outliers], I distrust any slick formal rule” - David J. Finney (1970, Discussion, Statistics in endocrinology, MIT Press, p. 72).

“Time is perhaps the most mysterious thing in a mysterious universe.” - Maurice Kendall (1976, Time-series, p. 1).

“The modern theory of statistics began with the realization that, although individuals might not behave deterministically, aggregates of individuals were themselves subject to laws which could often be summarized in fairly simple mathematical terms” - Maurice Kendall (ibid, p. 4).

“Some data are not worth analyzing, even when we have the big guns of a mathematical arsenal ready for attack” - attributed to William G. Cochran by Frederick Mosteller in the Foreword to Contribution to Statistics: William G. Cochran, Wiley, 1982, p. vii.

“There were just two qualifications for membership [in the national statistics society] - first you had to have $5 [and] second you had to be willing to give it to the society” - attributed to William G. Cochran by Frederick Mosteller (ibid, p. xi).

“You usually can’t follow papers at meetings” - Frederick Mosteller (Contribution to Statistics: William G. Cochran, 1982, p. xii).

“It has been said that more time has been spent generating and testing random numbers than using them” - C. A. Whitney (1984, Generating and testing pseudo-random numbers, BYTE, October, 9(11), p. 128).

“Someone told me that each equation I included in the book would halve the sales” - Stephen Hawking (1988, A brief history of time: From the big bang to black holes, Bantam, p. vi).


NEW IN 2004

The new magazine of the Royal Statistical Society

Edited by Helen Joyce

Significance is a new quarterly magazine for anyone interested in statistics and the analysis and interpretation of data. It aims to communicate and demonstrate, in an entertaining and thought-provoking way, the practical use of statistics in all walks of life and to show how statistics benefit society.

Articles are largely non-technical and hence accessible and appealing, not only to members of the profession, but to all users of statistics.

As well as promoting the discipline and covering topics of professional relevance, Significance contains a mixture of statistics in the news, case studies, reviews of existing and newly developing areas of statistics, the application of techniques in practice and problem solving, all with an international flavour.

Special Introductory Offer: 25% discount on a new personal subscription Plus Great Discounts for Students!

Further information including submission guidelines, subscription information and details of how to obtain a free sample copy are available at

www.blackwellpublishing.com/SIGN

PASS 2002 Power Analysis and Sample Size Software from NCSS

PASS performs power analysis and calculates sample sizes. Use it before you begin a study to calculate an appropriate sample size (it meets the requirements of government agencies that want technical justification of the sample size you have used). Use it after a study to determine if your sample size was large enough. PASS calculates the sample sizes necessary to perform all of the statistical tests listed below.

A power analysis usually involves several “what if” questions. PASS lets you solve for power, sample size, effect size, and alpha level. It automatically creates appropriate tables and charts of the results.

PASS is accurate. It has been extensively verified using books and reference articles. Proof of the accuracy of each procedure is included in the extensive documentation.

PASS is a standalone system. Although it is integrated with NCSS, you do not have to own NCSS to run it. You can use it with any statistical software you want.

[Chart: Power vs N1 by Alpha with M1=20.90, M2=17.80, S1=3.67, S2=3.01, N2=N1, 2-Sided T Test]

PASS automatically displays charts and graphs along with numeric tables and text summaries in a portable format that is cut and paste compatible with all word processors so you can easily include the results in your proposal. PASS comes with two manuals that contain tutorials, examples, annotated output, references, formulas, verification, and complete instructions on each procedure. And, if you cannot find an answer in the manual, our free technical support staff (which includes a PhD statistician) is available.

System Requirements: PASS runs on Windows 95/98/ME/NT/2000/XP with at least 32 megs of RAM and 30 megs of hard disk space. PASS sells for as little as $449.95.

PASS Beats the Competition! No other program calculates sample sizes and power for as many different statistical procedures as does PASS. Specifying your input is easy, especially with the online help and manual. Choose PASS. It’s more comprehensive, easier-to-use, accurate, and less expensive than any other sample size program on the market.

Trial Copy Available: You can try out PASS by downloading it from our website. This trial copy is good for 30 days. We are sure you will agree that it is the easiest and most comprehensive power analysis and sample size program available.

Analysis of Variance Proportions T Tests Group Sequential Tests Factorial AOV Chi-Square Test Cluster Randomization Alpha Spending Functions Fixed Effects AOV Confidence Interval Confidence Intervals Lan-DeMets Approach Geisser-Greenhouse Equivalence of McNemar* Equivalence T Tests Means MANOVA* Equivalence of Proportions Hotelling’s T-Squared* Proportions Multiple Comparisons* Fisher's Exact Test Group Sequential T Tests Survival Curves One-Way AOV Group Sequential Proportions Mann-Whitney Test Equivalence Planned Comparisons Matched Case-Control One-Sample T-Tests Means Randomized Block AOV McNemar Test Paired T-Tests Proportions New Repeated Measures AOV* Odds Ratio Estimator Standard Deviation Estimator Correlated Proportions* Regression / Correlation One-Stage Designs* Two-Sample T-Tests Proportions – 1 or 2 Wilcoxon Test Miscellaneous Features Correlations (one or two) Two Stage Designs (Simon’s) Automatic Graphics Cox Regression* Survival Analysis Three-Stage Designs* Finite Population Corrections Logistic Regression Cox Regression* Solves for any parameter Multiple Regression Miscellaneous Tests Logrank Survival -Simple Text Summary Poisson Regression* Exponential Means – 1 or 2* Logrank Survival - Advanced* Unequal N's Intraclass Correlation ROC Curves – 1 or 2* Group Sequential - Survival Linear Regression Variances – 1 or 2 Post-Marketing Surveillance ROC Curves – 1 or 2* *New in PASS 2002

NCSS Statistical Software · 329 North 1000 East · Kaysville, Utah 84037
Internet (download free demo version): http://www.ncss.com · Email: [email protected]
Toll Free: (800) 898-6109 · Tel: (801) 546-0445 · Fax: (801) 546-3907

PASS 2002 adds power analysis and sample size to your statistical toolbox

WHAT’S NEW IN PASS 2002?
Thirteen new procedures have been added to PASS as well as a new home-base window and a new Guide Me facility.

MANY NEW PROCEDURES
The new procedures include a new multi-factor repeated measures program that includes multivariate tests, Cox proportional hazards regression, Poisson regression, MANOVA, equivalence testing when proportions are correlated, multiple comparisons, ROC curves, and Hotelling’s T-squared.

NEW HOME BASE
A new home base window has been added just for PASS users. This window helps you select the appropriate program module.

COX REGRESSION
A new Cox regression procedure has been added to perform power analysis and sample size calculation for this important statistical technique.

REPEATED MEASURES
A new repeated-measures analysis module has been added that lets you analyze designs with up to three grouping factors and up to three repeated factors. The analysis includes both the univariate F test and three common multivariate tests including Wilks Lambda.

RECENT REVIEW
In a recent review, 17 of 19 reviewers selected PASS as the program they would recommend to their colleagues.

TEXT STATEMENTS
The text output translates the numeric output into easy-to-understand sentences. These statements may be transferred directly into your grant proposals and reports.

GRAPHICS
The creation of charts and graphs is easy in PASS. These charts are easily transferred into other programs such as MS PowerPoint and MS Word.

NEW USER’S GUIDE II
A new, 250-page manual describes each new procedure in detail. Each chapter contains explanations, formulas, examples, and accuracy verification. The complete manual is stored in PDF format on the CD so that you can read and printout your own copy.

GUIDE ME
The new Guide Me facility makes it easy for first time users to enter parameter values. The program literally steps you through those options that are necessary for the sample size calculation.

PASS calculates sample sizes for... My Payment Options: ___ Check enclosed Please rush me my own personal license of PASS 2002. ___ Please charge my: __VISA __MasterCard __Amex Qty ___ Purchase order enclosed ___ PASS 2002 Deluxe (CD and User's Guide): $499.95...... $ _____ Card Number ___ PASS 2002 CD (electronic documentation): $449.95 ...... $ ______Expires______PASS 2002 5-User Pack (CD & 5 licenses): $1495.00...... $ _____ Signature______Please provide daytime phone: ___ PASS 2002 25-User Pack (CD & 25 licenses): $3995.00 ....$ _____ ( )______PASS 2002 User's Guide II (printed manual): $30.00...... $ _____ Ship my PASS 2002 to: ___ PASS 2002 Upgrade CD for PASS 2000 users: $149.95 ...... $ _____

Typical Shipping & Handling: USA: $9 regular, $22 2-day, $33 NAME overnight. Canada: $19 Mail. Europe: $50 Fedex...... $ _____ COMPANY Total: ...... $ _____ ADDRESS FOR FASTEST DELIVERY, ORDER ONLINE AT WWW.NCSS.COM CITY/STATE/ZIP Email your order to [email protected] COUNTRY (IF OTHER THAN U.S.) Fax your order to (801) 546-3907 NCSS, 329 North 1000 East, Kaysville, UT 84037 (800) 898-6109 or (801) 546-0445

NCSS 329 North 1000 East Kaysville, Utah 84037

[Figure: comparative histogram of SepalLength by Iris group (Setosa, Versicolor, Virginica), with Count on the vertical axis.]

Announcing NCSS 2004 Seventeen New Procedures

NCSS 2004 is a new edition of our popular NCSS statistical package that adds seventeen new procedures.

New Procedures
Two Independent Proportions
Two Correlated Proportions
One-Sample Binary Diagnostic Tests
Two-Sample Binary Diagnostic Tests
Paired-Sample Binary Diagnostic Tests
Cluster Sample Binary Diagnostic Tests
Meta-Analysis of Proportions
Meta-Analysis of Correlated Proportions
Meta-Analysis of Means
Meta-Analysis of Hazard Ratios
Curve Fitting
Tolerance Intervals
Comparative Histograms
ROC Curves
Elapsed Time Calculator
T-Test from Means and SD’s
Hybrid Appraisal (Feedback) Model

Meta-Analysis
Procedures for combining studies measuring paired proportions, means, independent proportions, and hazard ratios are available. Plots include the forest plot, radial plot, and L’Abbe plot. Both fixed and random effects models are available for combining the results.

Curve Fitting
This procedure combines several of our curve fitting programs into one module. It adds many new models such as Michaelis-Menten. It analyzes curves from several groups. It compares fitted models across groups using computer-intensive randomization tests. It computes bootstrap confidence intervals.

Tolerance Intervals
This procedure calculates one and two sided tolerance intervals using both distribution-free (nonparametric) methods and normal distribution (parametric) methods. Tolerance intervals are bounds between which a given percentage of a population falls.

Comparative Histogram
This procedure displays a comparative histogram created by interspersing or overlaying the individual histograms of two or more groups or variables. This allows the direct comparison of the distributions of several groups.

Random Number Generator
Matsumoto’s Mersenne Twister random number generator (cycle length > 10**6000) has been implemented.

Binary Diagnostic Tests
Four new procedures provide the specialized analysis necessary for diagnostic testing with binary outcome data. These provide appropriate specificity and sensitivity output. Four experimental designs can be analyzed including independent or paired groups, comparison with a gold standard, and cluster randomized.

ROC Curves
This procedure generates both binormal and empirical (nonparametric) ROC curves. It computes comparative measures such as the whole, and partial, area under the ROC curve. It provides statistical tests comparing the AUC’s and partial AUC’s for paired and independent sample designs.

Hybrid (Feedback) Model
This new edition of our hybrid appraisal model fitting program includes several new optimization methods for calibrating parameters including a new genetic algorithm. Model specification is easier. Binary variables are automatically generated from class variables.

Two Proportions
Several new exact and asymptotic techniques were added for hypothesis testing (null, noninferiority, equivalence) and calculating confidence intervals for the difference, ratio, and odds ratio. Designs may be independent or paired. Methods include: Farrington & Manning, Gart & Nam, Conditional & Unconditional Exact, Wilson’s Score, Miettinen & Nurminen, and Chen.

Documentation
The printed, 330-page manual, called NCSS User’s Guide V, is available for $29.95. An electronic (pdf) version of the manual is included on the CD and in the Help system.

Statistical Innovations Products
Through a special arrangement with Statistical Innovations (S.I.), NCSS customers will receive $100 discounts on:
Latent GOLD® - latent class modeling
SI-CHAID® - segmentation trees
GOLDMineR® - ordinal regression
For demos and other info visit: www.statisticalinnovations.com

Please rush me the following products: My Payment Option: Qty ___ Check enclosed ___ NCSS 2004 CD upgrade from NCSS 2001, $149.95 ...... $______Please charge my: __VISA __ MasterCard ___Amex ___ NCSS 2004 User’s Guide V, $29.95...... $______Purchase order attached______NCSS 2004 CD, upgrade from earlier versions, $249.95...... $_____ Card Number ______Exp ______NCSS 2004 Deluxe (CD and Printed Manuals), $599.95...... $_____ Signature______PASS 2002 Deluxe, $499.95 ...... $_____ Telephone: ___ Latent Gold® from S.I., $995 - $100 NCSS Discount = $895..... $_____ ( ) ______GoldMineR® from S.I., $695 - $100 NCSS Discount = $595 ..... $_____ Email: ______CHAID® Plus from S.I., $695 - $100 NCSS Discount = $595.... $_____ Approximate shipping--depends on which manuals are ordered (U.S: $10 Ship to: ground, $18 2-day, or $33 overnight) (Canada $24) (All other countries NAME ______$10) (Add $5 U.S. or $40 International for any S.I. product) ...... $_____ ADDRESS ______Total...... $_____ ADDRESS______TO PLACE YOUR ORDER ADDRESS______CALL: (800) 898-6109 FAX: (801) 546-3907 ONLINE: www.ncss.com CITY ______STATE ______

MAIL: NCSS, 329 North 1000 East, Kaysville, UT 84037 ZIP/POSTAL CODE ______COUNTRY ______

[Figures: example NCSS 2004 output, including a Michaelis-Menten curve fit, a histogram of SepalLength by Iris, a forest plot of odds ratios by study, and an ROC curve of Fever.]

Statistical and Graphics Procedures Available in NCSS 2004

Analysis of Variance / T-Tests Plots / Graphs Regression / Correlation Survival / Reliability Curve Fitting Meta-Analysis* Analysis of Covariance Bar Charts All-Possible Search Accelerated Life Tests Bootstrap C.I.’s* Independent Proportions* Analysis of Variance Box Plots Canonical Correlation Cox Regression Built-In Models Correlated Proportions* Barlett Variance Test Contour Plot Correlation Matrices Cumulative Incidence Group Fitting and Testing* Hazard Ratios* Crossover Design Analysis Dot Plots Cox Regression Exponential Fitting Model Searching Means* Factorial Design Analysis Error Bar Charts Kendall’s Tau Correlation Extreme-Value Fitting Nonlinear Regression Friedman Test Histograms Linear Regression Hazard Rates Randomization Tests* Binary Diagnostic Tests* Geiser-Greenhouse Correction Histograms: Combined* Logistic Regression Kaplan-Meier Curves Ratio of Polynomials One Sample* General Linear Models Percentile Plots Multiple Regression Life-Table Analysis User-Specified Models Two Samples* Mann-Whitney Test Pie Charts Nonlinear Regression Lognormal Fitting Paired Samples* MANOVA Probability Plots PC Regression Log-Rank Tests Miscellaneous Clustered Samples* Multiple Comparison Tests ROC Curves* Poisson Regression Probit Analysis Area Under Curve One-Way ANOVA Scatter Plots Response-Surface Proportional-Hazards Bootstrapping Proportions Paired T-Tests Scatter Plot Matrix Ridge Regression Reliability Analysis Chi-Square Test Tolerance Intervals* Power Calculations Surface Plots Robust Regression Survival Distributions Confidence Limits Two Independent* Repeated Measures ANOVA Violin Plots Stepwise Regression Time Calculator* Cross Tabulation Two Correlated* T-Tests – One or Two Groups Spearman Correlation Weibull Analysis Data Screening Exact Tests* T-Tests – From Means & SD’s Experimental Designs Variable Selection Fisher’s Exact Test Exact Confidence Intervals* Wilcoxon Test Balanced Inc. Block Multivariate Analysis Frequency Distributions Farrington-Manning* Box-Behnken Quality Control Cluster Analysis Mantel-Haenszel Test Fisher Exact Test Time Series Analysis Central Composite Xbar-R Chart Correspondence Analysis Nonparametric Tests Gart-Nam* Method ARIMA / Box - Jenkins D-Optimal Designs C, P, NP, U Charts Discriminant Analysis Normality Tests McNemar Test Decomposition Fractional Factorial Capability Analysis Factor Analysis Probability Calculator Miettinen-Nurminen* Exponential Smoothing Latin Squares Cusum, EWMA Chart Hotelling’s T-Squared Proportion Tests Wilson’s Score* Method Harmonic Analysis Placket-Burman Individuals Chart Item Analysis Randomization Tests Equivalence Tests* Holt - Winters Response Surface Moving Average Chart Item Response Analysis Tables of Means, Etc. Noninferiority Tests* Seasonal Analysis Screening Pareto Chart Loglinear Models Trimmed Means Spectral Analysis Taguchi R & R Studies MANOVA Univariate Statistics Mass Appraisal Trend Analysis Multi-Way Tables Comparables Reports

Multidimensional Scaling Hybrid (Feedback) Model* *New Edition in 2004 Principal Components Nonlinear Regression Sales Ratios

JOIN DIVISION 5 OF APA!

The Division of Evaluation, Measurement, and Statistics of the American Psychological Association draws together individuals whose professional activities and/or interests include assessment, evaluation, measurement, and statistics. The disciplinary affiliation of division membership reaches well beyond psychology, includes both members and non-members of APA, and welcomes graduate students.

Benefits of membership include:
• subscription to Psychological Methods or Psychological Assessment (student members, who pay a reduced fee, do not automatically receive a journal, but may do so for an additional $18)
• The Score – the division’s quarterly newsletter
• Division’s Listservs, which provide an opportunity for substantive discussions as well as the dissemination of important information (e.g., job openings, grant information, workshops)

Cost of membership: $38 (APA membership not required); student membership is only $8

For further information, please contact the Division’s Membership Chair, Yossef Ben-Porath ([email protected]) or check out the Division’s website:

http://www.apa.org/divisions/div5/ ______

ARE YOU INTERESTED IN AN ORGANIZATION DEVOTED TO EDUCATIONAL AND BEHAVIORAL STATISTICS?

Become a member of the Special Interest Group - Educational Statisticians of the American Educational Research Association (SIG-ES of AERA)!

The mission of SIG-ES is to increase the interaction among educational researchers interested in the theory, applications, and teaching of statistics in the social sciences.

Each Spring, as part of the overall AERA annual meeting, there are seven sessions sponsored by SIG-ES devoted to educational statistics and statistics education. We also publish a twice-yearly electronic newsletter.

Past issues of the SIG-ES newsletter and other information regarding SIG-ES can be found at http://orme.uark.edu/edstatsig.htm

To join SIG-ES you must be a member of AERA. Dues are $5.00 per year.

For more information, contact Joan Garfield, President of the SIG-ES, at [email protected].

Position Available: Top bio-tech company seeks a seasoned statistical manager to hire, develop and lead a team of applied statisticians. Primary role is to integrate statistical methodology and practice into product/process development, manufacturing operations and quality. This key leader will provide linkage between manufacturing, engineering, development and biostatistics. MS in statistics or related field.

Research Statistician: Established clinical group adding staff to provide dedicated preclinical support to a development center. Interact and support scientists with formulation, stability testing, bioanalytics and bio assays. PhD w/3 yrs or MS w/ 6 years industry experience required along with expertise in complicated design methods. Northeast location.

Contact Information: Eve Kriz, Smith Hanley Associates, 99 Park Avenue, New York, NY 10016, 212-687-9696 ext. 228, [email protected].

______

Statistics Through Monte Carlo Simulation

With Fortran

Shlomo S. Sawilowsky and Gail F. Fahoome Copyright © 2003 ISBN: 0-9740236-0-4

Purchase Email, CD, & Softcover Versions Online Via Secure Paypal At

http://tbf.coe.wayne.edu/jmasm

Instructions For Authors

Follow these guidelines when submitting a manuscript:

1. JMASM uses a modified American Psychological Association style guideline.
2. Submissions are accepted via e-mail only. Send them to the Editorial Assistant at [email protected]. Provide name, affiliation, address, e-mail address, and 30 word biographical statements for all authors in the body of the email message.
3. There should be no material identifying authorship except on the title page. A statement should be included in the body of the e-mail indicating that, where applicable, proper human subjects protocols were followed, including informed consent. A statement should be included in the body of the e-mail indicating the manuscript is not under consideration at another journal.
4. Provide the manuscript as an external e-mail attachment in MS Word for the PC format only. (Wordperfect and .rtf formats may be acceptable - please inquire.) Please note that Tex (in its various versions), Exp, and Adobe .pdf formats are designed to produce the final presentation of text. They are not amenable to the editing process, and are not acceptable for manuscript submission.
5. The text maximum is 20 pages double spaced, not including tables, figures, graphs, and references. Use 11 point Times Roman font. If the technical expertise is available, submit the manuscript in two column format.
6. Create tables without boxes or vertical lines. Place tables, figures, and graphs “in-line”, not at the end of the manuscript. Figures may be in .jpg, .tif, .png, and other formats readable by Adobe Illustrator or Photoshop.
7. The manuscript should contain an Abstract with a 50 word maximum, followed by a list of key words or phrases. Major headings are Introduction, Methodology, Results, Conclusion, and References. Center headings. Subheadings are left justified; capitalize only the first letter of each word. Sub-subheadings are left-justified, indent optional.
8. Do not use underlining in the manuscript. Do not use bold, except for (a) matrices, or (b) emphasis within a table, figure, or graph. Do not number sections. Number all formulas, tables, figures, and graphs, but do not use italics, bold, or underline. Do not number references. Do not use footnotes or endnotes.
9. In the References section, do not put quotation marks around titles of articles or books. Capitalize only the first letter of books. Italicize journal or book titles, and volume numbers. Use “&” instead of “and” in multiple author listings.
10. Suggestions for style: Instead of “I drew a sample of 40” write “A sample of 40 was selected”. Use “although” instead of “while”, unless the meaning is “at the same time”. Use “because” instead of “since”, unless the meaning is “after”. Instead of “Smith (1990) notes” write “Smith (1990) noted”. Do not strike the spacebar twice after a period.

Print Subscriptions
Print subscriptions, including postage, are US $60 per year for professionals, US $30 per year for graduate students, and US $195 per year for libraries, universities, and corporations. Subscribers outside of the US and Canada pay a US $10 surcharge for additional postage. Online access is currently free at http://tbf.coe.wayne.edu/jmasm. Mail subscription requests with remittances to JMASM, P. O. Box 48023, Oak Park, MI, 48237. Email journal correspondence, other than manuscript submissions, to [email protected].

Notice To Advertisers Send requests for advertising information to [email protected].

Two Years in the Making...

Intel® Visual Fortran 8.0
Now Available!
The next generation of Visual Fortran is here! Intel Visual Fortran 8.0 was developed jointly by Intel and the former DEC/Compaq Fortran engineering team.

Visual Fortran Timeline
1997 DEC releases Digital Visual Fortran 5.0
1998 Compaq acquires DEC and releases DVF 6.0
1999 Compaq ships CVF 6.1
2001 Compaq ships CVF 6.6
2001 Intel acquires CVF engineering team
2003 Intel releases Intel Visual Fortran 8.0

Performance
Outstanding performance on Intel architecture including Intel® Pentium® 4, Intel® Xeon™ and Intel Itanium® 2 processors, as well as support for Hyper-Threading Technology.

Compatibility
• Plugs into Microsoft Visual Studio* .NET
• Microsoft PowerStation4 language and library support
• Strong compatibility with Compaq* Visual Fortran

Support
1 year of free product upgrades and Intel Premier Support

Intel Visual Fortran 8.0
• CVF front-end + Intel back-end
• Better performance
• OpenMP Support
• Real*16

“The Intel Fortran Compiler 7.0 was first-rate, and Intel Visual Fortran 8.0 is even better. Intel has made a giant leap forward in combining the best features of Compaq Visual Fortran and Intel Fortran. This compiler… continues to be a ‘must-have’ tool for any Twenty-First Century Fortran migration or software development project.”
- Dr. Robert R. Trippi, Professor of Computational Finance, University of California, San Diego

FREE trials available at: programmersparadise.com/intel

To order or request additional information call: 800-423-9990 Email: [email protected]