SWEATING THE SMALL STUFF: DOES DATA CLEANING AND TESTING OF ASSUMPTIONS REALLY MATTER IN THE 21ST CENTURY?
Topic Editor Jason W. Osborne
FRONTIERS COPYRIGHT STATEMENT
© Copyright 2007-2013 Frontiers Media SA. All rights reserved.
All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers. The compilation of articles constituting this e-book, as well as all content on this site, is the exclusive property of Frontiers. Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission. Articles and other user-contributed materials may be downloaded and reproduced subject to any copyright or other notices. No financial payment or reward may be given for any such reproduction except to the author(s) of the article concerned. As author or other contributor you grant permission to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials. All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.

Cover image provided by Ibbl sarl, Lausanne CH

ABOUT FRONTIERS
Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

FRONTIERS JOURNAL SERIES
The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

DEDICATION TO QUALITY
Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews. Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

WHAT ARE FRONTIERS RESEARCH TOPICS?
Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area!

Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: [email protected]

ISSN 1664-8714
ISBN 978-2-88919-155-0
DOI 10.3389/978-2-88919-155-0
Frontiers in Psychology | August 2013

SWEATING THE SMALL STUFF: DOES DATA CLEANING AND TESTING OF ASSUMPTIONS REALLY MATTER IN THE 21ST CENTURY?
Topic Editor: Jason W. Osborne, University of Louisville, USA
Modern statistical software makes it easier than ever to do thorough data screening/cleaning and to test assumptions associated with the analyses researchers perform.
However, few authors (even in top-tier journals) seem to be reporting data cleaning/screening and testing assumptions associated with the statistical analyses being reported. Few popular textbooks seem to focus on these basics.
In the 21st Century, with our complex modern analyses, is data screening and cleaning still relevant? Do outliers or extreme scores matter any more? Does having normally distributed variables improve analyses? Are there new techniques for screening or cleaning data that researchers should be aware of?
Are most analyses robust to violations of most assumptions, to the point that researchers really don’t need to pay attention to assumptions any more?
My goal for this special issue is to examine these issues with fresh eyes and 21st-century methods. I believe that we can demonstrate that these things do still matter, even when using "robust" methods or non-parametric techniques, and perhaps identify when they matter MOST, or in what ways they can most substantially affect the results of an analysis.
I believe we can encourage researchers to change their habits through evidence-based discussions of these issues. It is possible we can even convince the editors of important journals to include these aspects in their evaluation/review criteria, as many journals in the social sciences have done with effect size reporting in recent years.
I invite you to join me in demonstrating WHY paying attention to these mundane aspects of quantitative analysis can be beneficial to researchers.
Table of Contents
05   Is Data Cleaning and the Testing of Assumptions Relevant in the 21st Century?
     Jason W. Osborne
08   Are Assumptions of Well-Known Statistical Techniques Checked, and Why (Not)?
     Rink Hoekstra, Henk A. L. Kiers and Addie Johnson
17   Statistical Conclusion Validity: Some Common Threats and Simple Remedies
     Miguel A. García-Pérez
28   Is Coefficient Alpha Robust to Non-Normal Data?
     Yanyan Sheng and Zhaohui Sheng
41   The Assumption of a Reliable Instrument and Other Pitfalls to Avoid When Considering the Reliability of Data
     Kim Nimon, Linda Reichwein Zientek and Robin K. Henson
54   Replication Unreliability in Psychology: Elusive Phenomena or "Elusive" Statistical Power?
     Patrizio E. Tressoldi
59   Distribution of Variables by Method of Outlier Detection
     W. Holmes Finch
71   Statistical Assumptions of Substantive Analyses Across the General Linear Model: A Mini-Review
     Kim F. Nimon
76   Tools to Support Interpreting Multiple Regression in the Face of Multicollinearity
     Amanda Kraha, Heather Turner, Kim Nimon, Linda Reichwein Zientek and Robin K. Henson
92   A Simple Statistic for Comparing Moderation of Slopes and Correlations
     Michael Smithson
101  Old and New Ideas for Data Screening and Assumption Testing for Exploratory and Confirmatory Factor Analysis
     David B. Flora, Cathy LaBrish and R. Philip Chalmers
122  On the Relevance of Assumptions Associated with Classical Factor Analytic Approaches
     Daniel Kasper and Ali Ünlü
142  How Predictable are "Spontaneous Decisions" and "Hidden Intentions"? Comparing Classification Results Based on Previous Responses with Multivariate Pattern Analysis of fMRI BOLD Signals
     Martin Lages and Katarzyna Jaworska
150  Using Classroom Data to Teach Students About Data Cleaning and Testing Assumptions
     Kevin Cummiskey, Shonda Kuiper and Rodney Sturdivant
EDITORIAL
published: 25 June 2013
doi: 10.3389/fpsyg.2013.00370

Is data cleaning and the testing of assumptions relevant in the 21st century?
Jason W. Osborne*
Educational and Counseling Psychology, University of Louisville, Louisville, Kentucky, USA
*Correspondence: [email protected]
Edited by: Axel Cleeremans, Université Libre de Bruxelles, Belgium
You must understand fully what your assumptions say and what they imply. You must not claim that the "usual assumptions" are acceptable due to the robustness of your technique unless you really understand the implications and limits of this assertion in the context of your application. And you must absolutely never use any statistical method without realizing that you are implicitly making assumptions, and that the validity of your results can never be greater than that of the most questionable of these (Vardeman and Morris, 2003, p. 26).

Modern quantitative studies use sophisticated statistical analyses that rely upon numerous important assumptions to ensure the validity of the results and protection from mis-estimation of outcomes. Yet casual inspection of respected journals in various fields shows a marked absence of discussion of the mundane, basic staples of quantitative methodology such as data cleaning or testing of assumptions, leaving us in the troubling position of being surrounded by intriguing quantitative findings but unable to assess the quality or reliability of the knowledge base of our field.

Few of us become scientists in order to do harm to the literature. Indeed, most of us seek to help people, to improve the world in some way, to make a difference. However, all the effort in the world will not accomplish these goals in the absence of valid, reliable, generalizable results—which can only be had with clean (non-faulty) data and the assumptions of our analyses met.

WHERE DOES THIS IDEA OF DATA CLEANING AND TESTING ASSUMPTIONS COME FROM?

Researchers have discussed the importance of assumptions since the introduction of our early modern statistical tests (e.g., Pearson, 1901; Student, 1908; Pearson, 1931). Even the most recently developed statistical tests are developed in a context of certain important assumptions about the data. The mathematicians and statisticians developing the tests we take for granted today had to make certain explicit assumptions about the data in order to formulate the operations that occur "under the hood" when we perform statistical analyses. A common example is that the data (or errors) are normally distributed, or that all groups (errors) have roughly equal variance. Without these assumptions the formulae and conclusions are not valid.

Early in the 20th century these assumptions were the focus of vigorous debate and discussion. For example, since data rarely are perfectly normally distributed, how much of a deviation from normality is acceptable? Similarly, it is rare that two groups would have exactly identical variances; how close to equal is good enough to maintain the goodness of the results?

By the middle of the 20th century, researchers had assembled some evidence that some minimal violations of some assumptions had minimal effects on error rates under certain circumstances—in other words, if your variances are not exactly identical across all groups, but are relatively close, it is probably acceptable to interpret the results of that test despite this technical violation of assumptions. Box (1953) is credited with coining the term "robust" (Boneau, 1960), which usually indicates that violation of an assumption does not substantially influence the Type I error rate of the test.¹ Thus, many authors published studies showing that analyses such as simple one-factor ANOVA are "robust" to non-normality of the populations (Pearson, 1931) and to variance inequality (Box, 1953) when group sizes are equal. This means that they concluded that modest (practical) violations of these assumptions would not increase the probability of Type I errors [although even Pearson (1931) notes that strong non-normality can bias results toward increased Type II errors]. These fundamental, important debates focused on minor (and practically insignificant) deviations from absolute normality or exactly equal variance (i.e., whether a skew of 0.01 or 0.05 would make results unreliable). Despite being relatively narrow in scope (e.g., primarily concerned with Type I error rates in the context of exactly equal sample sizes and relatively simple one-factor ANOVA analyses), these early studies appear to have given social scientists the impression that these basic assumptions are unimportant. These early studies do not mean, however, that all analyses are robust to dramatic violations of these assumptions, nor do they attest to robustness when the other conditions (e.g., exactly equal cell sizes) are not met.

These findings do not necessarily generalize to broad violations of any assumption under any condition, and leave open questions regarding Type II error rates and mis-estimation of effect sizes and confidence intervals. Unfortunately, the latter point seems to have been lost on many modern researchers. Recall that these early researchers on "robustness" were often applied statisticians working in places such as chemical and agricultural companies, as well as research labs such as Bell Telephone Labs, not in the social sciences, where data may be more likely to be messy. Thus, these authors viewed "modest deviations" as exactly that: minor deviations from mathematical models of perfect normality and perfect equality of variance that are practically unimportant. Social scientists rarely see data that are as clean as those discussed in these robustness studies.

¹Note that Type II error rates and mis-estimation of parameters are much less frequently discussed and investigated.
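The narrow scope of these classic robustness results is easy to demonstrate by simulation. The sketch below is illustrative only (it is not part of the original editorial, and assumes NumPy and SciPy are available): it estimates the Type I error rate of a one-factor ANOVA under variance inequality, first with equal group sizes and then with the largest variance placed in the smallest group.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)

def anova_type1_rate(ns, sds, reps=2000, alpha=0.05):
    """Monte Carlo Type I error rate of one-way ANOVA when all
    population means are equal but variances (and n's) may differ."""
    rejections = 0
    for _ in range(reps):
        groups = [rng.normal(0.0, sd, n) for n, sd in zip(ns, sds)]
        if f_oneway(*groups).pvalue < alpha:
            rejections += 1
    return rejections / reps

# Equal n: the F-test stays reasonably close to the nominal 5%
# even with a 3:1 ratio of standard deviations.
balanced = anova_type1_rate(ns=[20, 20, 20], sds=[1, 1, 3])

# Unequal n, with the most variable group also the smallest:
# the empirical Type I error rate inflates well past the nominal level.
unbalanced = anova_type1_rate(ns=[40, 40, 10], sds=[1, 1, 3])

print(f"equal n: {balanced:.3f}, unequal n: {unbalanced:.3f}")
```

The contrast between the two conditions mirrors the point made above: "robustness" established under equal cell sizes says little about the unbalanced designs common in social science data.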
Further, important caveats came with conclusions about "robustness"—such as adequate sample sizes, equal group sizes, and relatively simple analyses such as one-factor ANOVA. This mythology of robustness, however, appears to have taken root in the social sciences and may have been accepted as broad fact rather than narrowly, as intended. Through the latter half of the 20th century the term came to be used more often as researchers published narrowly focused studies that appeared to reinforce the mythology of robustness, perhaps inadvertently indicating that robustness was the rule rather than the exception.

In one example of this type of research, studies reported that simple statistical procedures such as the Pearson product-moment correlation and the one-way ANOVA (e.g., Feir-Walsh and Toothaker, 1974; Havlicek and Peterson, 1977) were robust to even "substantial violations" of assumptions. It is perhaps not surprising that "robustness" appears to have become unquestioned canon among quantitative social scientists, despite the caveats to these latter assertions and the important point that these assertions of robustness usually relate only to Type I error rates; other aspects of analyses (such as Type II error rates or the accuracy of the estimates of effects) might still be strongly influenced by violation of assumptions.

However, the finding that simple correlations might be robust to certain violations is not to say that similar but more complex procedures (e.g., multiple regression, path analysis, or structural equation modeling) are equally robust to these same violations. Similarly, should one-way ANOVA be robust to violations of assumptions,² it is not clear that similar but more complex procedures (e.g., factorial ANOVA or ANCOVA) would be equally robust to these violations. Yet recent surveys of quantitative research in many sciences affirm that a relatively low percentage of authors in recent years report basic information such as having checked for extreme scores, normality of the data, or having tested assumptions of the statistical procedures being used (Keselman et al., 1998; Osborne, 2008; Osborne et al., 2012). It seems, then, that this "mythology of robustness" has led a substantial percentage of social science researchers to believe it unnecessary to check the goodness of their data and the assumptions that their tests are based on (or to report having done so).

²To be clear, it is debatable whether these relatively simple procedures are as robust as previously asserted.

Recent surveys of top research journals in the social sciences³ confirm that authors (and reviewers and editors) are disconcertingly casual about data cleaning and the reporting of tests of assumptions. One prominent review of education and psychology research by Keselman et al. (1998) provided a thorough review of empirical social science during the 1990s. The authors reviewed studies from 17 prominent journals spanning different areas of education and psychology, focusing on empirical articles with ANOVA-type designs.

³Other reviewers in other sciences tend to find similar results, unfortunately.

In looking at 61 studies utilizing univariate ANOVA between-subjects designs, the authors found that only 11.48% of authors reported anything related to assessing normality, almost uniformly assessing normality through descriptive rather than inferential methods. Further, only 8.20% reported assessing homogeneity of variance, and only 4.92% assessed both distributional assumptions and homogeneity of variance. While some earlier studies asserted ANOVA to be robust to violations of these assumptions (Feir-Walsh and Toothaker, 1974), more recent work contradicts this long-held belief, particularly where designs extend beyond simple one-way ANOVA and where cell sizes are unbalanced (which seems fairly common in modern ANOVA analyses within the social sciences) (Wilcox, 1987; Lix et al., 1996).

In examining articles reporting multivariate analyses, Keselman et al. (1998) describe a more dire situation. None of the 79 studies utilizing multivariate ANOVA procedures reported examining the relevant assumptions of variance homogeneity, and in only 6.33% of the articles was there any evidence of examination of distributional assumptions (such as normality).

Similarly, in their examination of 226 articles that utilized some type of repeated-measures analysis, only 15.50% made reference to some aspect of assumptions, but none appeared to report assessing sphericity, an important assumption in these designs that, when violated, can lead to substantial inflation of error rates and mis-estimation of effects (Maxwell and Delaney, 1990, p. 474).

Finally, their assessment of articles utilizing covariance designs (N = 45) was equally disappointing—75.56% of the studies reviewed made no mention of any assumptions or sample distributions, and most (82.22%) failed to report any information about the assumption of homogeneity of regression slopes, an assumption critical to the validity of ANCOVA designs.

Another survey, of articles published in the 1998 and 1999 volumes of well-respected educational psychology journals (Osborne, 2008), showed that indicators of high-quality data cleaning in published articles were sorely lacking. Specifically, authors in these top educational psychology journals almost never reported testing any assumptions of the analyses used (only 8.30% reported having tested any assumption), only 26.0% reported the reliability of the data being analyzed, and none reported any significant data cleaning (e.g., examination of data for outliers, normality, analysis of missing data, random responding, etc.).

Finally, a recent survey of articles published in the 2009 volumes of prominent APA journals (Osborne et al., 2012) found improved, but uninspiring, results (see Figure 1.1). For example, the percentage of authors reporting anything resembling minimal data cleaning ranged from 22 to 38% across journals. This represents a marked improvement over previous surveys, but still leaves a majority of authors failing to report any type of data cleaning or testing of assumptions, a troubling state of affairs. Similarly, between 10 and 32% reported checking distributional assumptions, and 32–45% reported dealing with missing data in some way (although usually through methods considered sub-optimal). Clearly, even in the 21st century, the majority of authors in highly respected scholarly journals fail to report information about these basic issues of quantitative methods.

When I wrote a whole book on data cleaning (Osborne, 2012), my goal was to debunk this mythology of robustness and laissez-faire that seems to have seeped into the zeitgeist of quantitative methods. The challenge handed to authors in this book was to
go beyond the basics of data cleaning and testing assumptions—to show that assumptions and quality data are still relevant and important in the 21st century. They went above and beyond this challenge in many interesting—and unexpected—ways. I hope that this is the beginning—or a continuation—of an important discussion that strikes at the very heart of our quantitative disciplines; namely, whether we can trust any of the results we read in journals, and whether we can apply (or generalize) those results beyond the limited scope of the original sample.
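One assumption discussed above, homogeneity of regression slopes in ANCOVA, shows how little effort such checks actually require. The sketch below is illustrative only (it is not from the editorial, and assumes the pandas and statsmodels libraries, with simulated data): the assumption is checked by testing the group-by-covariate interaction, where a significant interaction signals unequal slopes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 120

# Simulated ANCOVA data in which the covariate's slope differs by group,
# i.e., the homogeneity-of-regression-slopes assumption is violated.
group = np.repeat(["control", "treatment"], n // 2)
x = rng.normal(0.0, 1.0, n)                       # covariate
slope = np.where(group == "treatment", 1.5, 0.5)  # unequal slopes by design
y = 2.0 + slope * x + rng.normal(0.0, 1.0, n)     # outcome

df = pd.DataFrame({"y": y, "x": x, "group": group})

# Fit the ANCOVA model including a group-by-covariate interaction.
# A significant interaction term indicates heterogeneous slopes,
# invalidating the usual (interaction-free) ANCOVA model.
fit = smf.ols("y ~ C(group) * x", data=df).fit()
p_interaction = fit.pvalues["C(group)[T.treatment]:x"]
print(f"interaction p-value: {p_interaction:.4f}")
```

If the interaction is clearly non-significant, it can be dropped and the conventional ANCOVA interpreted; if not, the covariate-adjusted group comparison is not meaningful as a single number.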
REFERENCES
Boneau, C. A. (1960). The effects of violations of assumptions underlying the t test. Psychol. Bull. 57, 49–64. doi: 10.1037/h0041412
Box, G. (1953). Non-normality and tests on variances. Biometrika 40, 318.
Feir-Walsh, B., and Toothaker, L. (1974). An empirical comparison of the ANOVA F-test, normal scores test and Kruskal-Wallis test under violation of assumptions. Educ. Psychol. Meas. 34, 789. doi: 10.1177/001316447403400406
Havlicek, L. L., and Peterson, N. L. (1977). Effect of the violation of assumptions upon significance levels of the Pearson r. Psychol. Bull. 84, 373–377. doi: 10.1037/0033-2909.84.2.373
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., et al. (1998). Statistical practices of educational researchers: an analysis of their ANOVA, MANOVA, and ANCOVA analyses. Rev. Educ. Res. 68, 350–386. doi: 10.3102/00346543068003350
Lix, L., Keselman, J., and Keselman, H. (1996). Consequences of assumption violations revisited: a quantitative review of alternatives to the one-way analysis of variance "F" test. Rev. Educ. Res. 66, 579–619.
Maxwell, S., and Delaney, H. (1990). Designing Experiments and Analyzing Data: A Model Comparison Perspective. Pacific Grove, CA: Brooks Cole Publishing Company.
Osborne, J. W. (2008). Sweating the small stuff in educational psychology: how effect size and power reporting failed to change from 1969 to 1999, and what that means for the future of changing practices. Educ. Psychol. 28, 1–10. doi: 10.1080/01443410701491718
Osborne, J. W. (2012). Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data. Thousand Oaks, CA: Sage Publications.
Osborne, J. W., Kocher, B., and Tillman, D. (2012). "Sweating the small stuff: do authors in APA journals clean data or test assumptions (and should anyone care if they do)," in Paper presented at the Annual Meeting of the Eastern Education Research Association (Hilton Head, SC).
Pearson, E. (1931). The analysis of variance in cases of non-normal variation. Biometrika 23, 114.
Pearson, K. (1901). Mathematical contribution to the theory of evolution. VII: On the correlation of characters not quantitatively measurable. Philos. Trans. R. Soc. Lond. B Biol. Sci. 195, 1–47.
Student. (1908). The probable error of a mean. Biometrika 6, 1–25.
Vardeman, S., and Morris, M. (2003). Statistics and ethics. Am. Stat. 57, 21–26. doi: 10.1198/0003130031072
Wilcox, R. (1987). New designs in analysis of variance. Annu. Rev. Psychol. 38, 29–60. doi: 10.1146/annurev.ps.38.020187.000333

Received: 16 April 2013; accepted: 06 June 2013; published online: 25 June 2013.
Citation: Osborne JW (2013) Is data cleaning and the testing of assumptions relevant in the 21st century? Front. Psychol. 4:370. doi: 10.3389/fpsyg.2013.00370
This article was submitted to Frontiers in Quantitative Psychology and Measurement, a specialty of Frontiers in Psychology.
Copyright © 2013 Osborne. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.
ORIGINAL RESEARCH ARTICLE
published: 14 May 2012
doi: 10.3389/fpsyg.2012.00137

Are assumptions of well-known statistical techniques checked, and why (not)?
Rink Hoekstra1,2*, Henk A. L. Kiers 2 and Addie Johnson2
1 GION –Institute for Educational Research, University of Groningen, Groningen, The Netherlands 2 Department of Psychology, University of Groningen, Groningen, The Netherlands
Edited by: Jason W. Osborne, Old Dominion University, USA
Reviewed by: Jason W. Osborne, Old Dominion University, USA; Jelte M. Wicherts, University of Amsterdam, The Netherlands
*Correspondence: Rink Hoekstra, GION, University of Groningen, Grote Rozenstraat 3, 9712 TG Groningen, The Netherlands. e-mail: [email protected]

A valid interpretation of most statistical techniques requires that one or more assumptions be met. In published articles, however, little information tends to be reported on whether the data satisfy the assumptions underlying the statistical techniques used. This could be due to self-selection: only manuscripts with data fulfilling the assumptions are submitted. Another explanation could be that violations of assumptions are rarely checked for in the first place. We studied whether and how 30 researchers checked fictitious data for violations of assumptions in their own working environment. Participants were asked to analyze the data as they would their own data, for which often-used and well-known techniques such as the t-procedure, ANOVA and regression (or non-parametric alternatives) were required. It was found that the assumptions of the techniques were rarely checked, and that if they were, it was regularly by means of a statistical test. Interviews afterward revealed a general lack of knowledge about assumptions, the robustness of the techniques with regard to the assumptions, and how (or whether) assumptions should be checked. These data suggest that checking for violations of assumptions is not a well-considered choice, and that the use of statistics can be described as opportunistic.
Keywords: assumptions, robustness, analyzing data, normality, homogeneity
INTRODUCTION
Most statistical techniques require that one or more assumptions be met, or, in the case that it has been proven that a technique is robust against a violation of an assumption, that the assumption is not violated too extremely. Applying statistical techniques when assumptions are not met is a serious problem when analyzing data (Olsen, 2003; Choi, 2005). Violations of assumptions can seriously influence Type I and Type II errors, and can result in overestimation or underestimation of the inferential measures and effect sizes (Osborne and Waters, 2002). Keselman et al. (1998) argue that "The applied researcher who routinely adopts a traditional procedure without giving thought to its associated assumptions may unwittingly be filling the literature with non-replicable results" (p. 351). Vardeman and Morris (2003) state "...absolutely never use any statistical method without realizing that you are implicitly making assumptions, and that the validity of your results can never be greater than that of the most questionable of these" (p. 26). According to the sixth edition of the APA Publication Manual, the methods researchers use "...must support their analytic burdens, including robustness to violations of the assumptions that underlie them..." [American Psychological Association (APA, 2009); p. 33]. The Manual does not explicitly state that researchers should check for possible violations of assumptions and report whether the assumptions were met, but it seems reasonable to assume that in the case that researchers do not check for violations of assumptions, they should be aware of the robustness of the technique.

Many articles have been written on the robustness of certain techniques with respect to violations of assumptions (e.g., Kohr and Games, 1974; Bradley, 1980; Sawilowsky and Blair, 1992; Wilcox and Keselman, 2003; Bathke, 2004), and many ways of checking to see if assumptions have been met (as well as solutions to overcoming problems associated with any violations) have been proposed (e.g., Keselman et al., 2008). Using a statistical test is one of the frequently mentioned methods of checking for violations of assumptions (for an overview of statistical methodology textbooks that directly or indirectly advocate this method, see, e.g., Hayes and Cai, 2007). However, it has also been argued that it is not appropriate to check assumptions by means of tests (such as Levene's test) carried out before deciding on which statistical analysis technique to use, because such tests compound the probability of making a Type I error (e.g., Schucany and Ng, 2006). Even if one desires to check whether or not an assumption is met, two problems stand in the way. First, assumptions are usually about the population, and in a sample the population is by definition not known. For example, it is usually not possible to determine the exact variance of the population in a sample-based study, and therefore it is also impossible to determine that two population variances are equal, as is required for the assumption of equal variances (also referred to as the assumption of homogeneity of variances) to be satisfied. Second, because assumptions are usually defined in a very strict way (e.g., all groups have equal variances in the population, or the variable is normally distributed in the population), the assumptions cannot reasonably be expected to be satisfied. Given these complications, researchers can usually only examine whether assumptions are not violated "too much" in their sample; for deciding on what is too much, information about
the robustness of the technique with regard to violations of the assumptions is necessary.

The assumptions of normality and of homogeneity of variances are required to be met for the t-test for independent group means, one of the most widely used statistical tests (Hayes and Cai, 2007), as well as for the frequently used techniques ANOVA and regression (Kashy et al., 2009). The assumption of normality requires that the scores in the population, in the case of a t-test or ANOVA, and the population residuals, in the case of regression, be normally distributed. The assumption of homogeneity of variance requires equal population variances per group in the case of a t-test or ANOVA, and equal population variances for every value of the independent variable for regression. Although researchers might be tempted to think that most statistical procedures are relatively robust against most violations, several studies have shown that this is often not the case, and that in the case of one-way ANOVA, unequal group sizes can have a negative impact on the technique's robustness (e.g., Havlicek and Peterson, 1977; Wilcox, 1987; Lix et al., 1996).

Many textbooks advise that the assumptions of normality and homogeneity of variance be checked graphically (Hazelton, 2003; Schucany and Ng, 2006), for example by making normal quantile plots to check for normality. Another method, advised in many other textbooks (Hayes and Cai, 2007), is to use a so-called preliminary test to determine whether to continue with the intended technique or to use an alternative technique instead. Preliminary tests could, for example, be used to choose between a pooled t-test and a Welch t-test, or between ANOVA and a non-parametric alternative. Following the argument that preliminary tests should not be used because, among other reasons, they can inflate the probability of making a Type I error (e.g., Gans, 1981; Wilcox et al., 1986; Best and Rayner, 1987; Zimmerman, 2004, 2011; Schoder et al., 2006; Schucany and Ng, 2006; Rochon and Kieser, 2011), it has also been argued that in many cases unconditional techniques should be the techniques of choice (Hayes and Cai, 2007). For example, the Welch t-test, which does not require homogeneity of variance, would be seen a priori as preferable to the pooled variance t-test (Zimmerman, 1996; Hayes and Cai, 2007).

Although the conclusions one can draw when analyzing a data set with statistical techniques depend on whether the assumptions for that technique are met, and, if that is not the case, whether the technique is robust against violations of the assumption, no work, to our knowledge, has been published describing whether researchers check for violations of assumptions in practice. When possible violations of assumptions are checked, and why they sometimes are not, is a relevant question given the continuing prevalence of preliminary tests. For example, an inspection of the most recent 50 articles published in 2011 in Psychological Science that contained at least one t-test, ANOVA or regression analysis revealed that in only three of these articles was the normality of the data or the homogeneity of variances discussed, leaving open the question of whether these assumptions were or were not checked in practice, and how. Keselman et al. (1998) showed in a review of 61 articles that used a between-subject univariate design that only a small minority of articles mentioned anything about the assumptions of normality (11%) or homogeneity (8%), and that only 5% of the articles mentioned something about both assumptions. In the same article, Keselman et al. present results of another study in which 79 articles with a between-subject multivariate design were checked for references to assumptions, and again the assumptions were rarely mentioned. Osborne (2008) found similar results: in 96 articles published in high-quality journals, checking of assumptions was reported in only 8% of the cases.

In the present paper a study is presented in which the behavior of researchers while analyzing data was observed, particularly with regard to checking for violations of assumptions. It was hypothesized that the checking of assumptions might not routinely occur when analyzing data. There can, of course, be rational reasons for not checking assumptions. Researchers might, for example, have knowledge about the robustness of the technique they are using with respect to violations of assumptions, and therefore consider the checking of possible violations unnecessary. In addition to observing whether or not the data were checked for violations of assumptions, we therefore also administered a questionnaire to assess why data were not always checked for possible violations. Specifically, we focused on four possible explanations for failing to check for violations of assumptions: (1) lack of knowledge of the assumption, (2) not knowing how to check whether the assumption has been violated, (3) not considering a possible violation of an assumption problematic (for example, because of the robustness of the technique), and (4) lack of knowledge of an alternative in the case that an assumption seems to be violated.

MATERIALS AND METHODS

PARTICIPANTS
Thirty Ph.D. students, 13 men and 17 women (mean age = 27, SD = 1.5), working at Psychology Departments (but not in the area of methodology or statistics) throughout The Netherlands, participated in the study. All had at least 2 years' experience conducting research at the university level. Ph.D. students were selected because of their active involvement in the collection and analysis of data. Moreover, they were likely to have had their statistical education relatively recently, assuring relatively up-to-date knowledge of statistics. They were required to have applied a t-procedure, a linear regression analysis and an ANOVA at least once, although not necessarily in their own research project. Ten participants were randomly selected from each of the Universities of Tilburg, Groningen, and Amsterdam, three cities in different regions of The Netherlands. In order to get 30 participants, 41 Ph.D. students were approached for the study, of whom 11 chose not to participate. Informed consent was obtained from all participants, and anonymity was ensured.

TASK
The task consisted of two parts: data analysis and a questionnaire. For the data analysis task, participants were asked to analyze six data sets and write down their inferential conclusions for each data set. The t-test, ANOVA and regression (or unconditional alternatives to those techniques) were intended to be used, because they are relatively simple, frequently used, and because it was expected that most participants would be familiar with those techniques. Participants could take as much time as they wanted, and no limit was given to the length of the inferential conclusions. The second
part of the task consisted of filling in a questionnaire with questions about the participants' choices during the data analysis task, and about the participants' usual behavior with respect to assumption checking when analyzing data. All participants needed between 30 and 75 min to complete the data analysis task and between 35 and 65 min to complete the questionnaire. The questionnaire also included questions about participants' customs regarding visualization of data and inference, but these data are not presented here. All but three participants performed the two tasks at their own workplace. The remaining three used an otherwise unoccupied room in their department. During task performance, the first author was constantly present.

Data analysis task
The data for the six data sets that the participants were asked to analyze were offered in SPSS format, since every participant indicated using SPSS as their standard statistical package. Before starting to analyze the data, the participants were given a short description of a research question without an explicit hypothesis, but with a brief description of the variables in the SPSS file. The participants were asked to analyze the data sets and interpret the results as they do when they analyze and interpret their own data sets. Per data set, they were asked to write down an answer to the following question: "What do these results tell you about the situation in the population? Explain how you came to your conclusion". An example of such an instruction, for which participants were expected to use linear regression analysis, translated from the Dutch, is shown in Figure 1. Participants were explicitly told that consultation of any statistical books or the internet was allowed, but only two participants availed themselves of this opportunity.

The short descriptions of the six research questions were written in such a way as to suggest that two t-tests, two linear regression analyses and two ANOVAs should be carried out, without explicitly naming the analysis techniques. Of course, non-parametric or unconditional alternatives were considered appropriate as well. The results of a pilot experiment indicated that the descriptions were indeed sufficient to guide people in using the desired technique: the five people tested in the pilot used the intended technique for each of the six data sets. All six research question descriptions, in translated form, can be found in the Appendix to this study.

The six data sets differed with respect to the effect size, the significance of the outcomes, and whether there was a "strong" violation of an assumption. Four of the six data sets contained significant effects, one of the two data sets for which a t-test was supposed to be used contained a clear violation of the assumption of normality, one of the two data sets for which ANOVA was supposed to be used contained a clear violation of the assumption of homogeneity of variance, and the effect size was relatively large in three data sets and relatively small in the other data sets (see Table 1 for an overview).

Table 1 | An overview of the properties of the six scenarios.

Scenario | Technique to be used | Effect size | p-Value | Violation of assumption
1 | t-Test | Medium | 0.04 | Normality
2 | t-Test | Very small | 0.86 | None
3 | Regression analysis | Large | 0.00 | None
4 | Regression analysis | Medium | 0.01 | None
5 | ANOVA | Large | 0.05 | Homogeneity
6 | ANOVA | Close to 0 | 0.58 | None

To get more information on which choices were made by the participants during task performance, and why these choices were made, participants were asked to "think aloud" during task performance. This was recorded on cassette. During task performance, the selections made within the SPSS program were noted by the first author, in order to be able to check whether there was any check for the assumptions relevant to the technique that was used. Furthermore, participants were asked to save the SPSS syntax files.

For the analysis of task performance, the information from the notes made by the first author, the tape recordings and the syntax files were combined to examine behavior with respect to checking for violations of the assumptions of normality and homogeneity of variance. Of course, had participants chosen unconditional techniques for which one or both of these assumptions were not required to be met, their data for the scenario in question would not have been used, provided that a preliminary test was not carried out to decide whether to use the unconditional technique. However, in no cases were unconditional techniques used or even considered. The frequency of selecting preliminary tests was recorded separately.

Each data set was scored according to whether violations of the assumptions of normality and homogeneity of variance were checked for. A graphical assessment was counted as correctly
FIGURE 1 | An example of one of the research question descriptions. In this example, participants were supposed to answer this question by means of a regression analysis.
checking for the assumption, provided that the assessment was appropriate for the technique at hand. A correct check of the assumption of normality was recorded if, for the t-test and ANOVA, a graphical representation of the different groups was requested, except when the graph was used only to detect outliers. Merely looking at the numbers, without making a visual representation, was considered insufficient. For regression analysis, making a plot of the residuals was considered to be a correct check of the assumption of normality. Deciding whether this was done explicitly was based on whether the participant made any reference to normality when thinking aloud. A second option was to make a QQ- or PP-plot of the residuals. Selecting the Kolmogorov–Smirnov test or the Shapiro–Wilk test within SPSS was considered checking for the assumption of normality using a preliminary test.

Three ways of checking for the assumption of homogeneity of variance for the t-test and ANOVA were considered adequate. The first was to make a graphical representation of the data in such a way that differences in variance between the groups were visible (e.g., boxplots or scatter plots, provided that they are given per group). A second way was to make an explicit reference to the variance of the groups. A final possibility was to compare the standard deviations of the groups in the output, with or without making use of a rule of thumb to discriminate between violations and non-violations. For regression analysis, a scatter plot or a residual plot was considered necessary to check the assumption of homogeneity of variance. Although the assumption of homogeneity of variance concerns equality of the population variances, an explicit reference to the population was not required. The preliminary tests that were recorded included Levene's test, the F-ratio test, Bartlett's test, and the Brown–Forsythe test.

The frequency of using preliminary tests was reported separately from other ways of checking for assumptions. Although the use of preliminary tests is often considered an inappropriate method for checking assumptions, their use does show awareness of the existence of the assumption. Occurrences of checking for irrelevant assumptions, such as equal group sizes for the t-test, or normality of all scores for one variable (instead of checking for normality per group), were also counted for all three techniques, but scored as incorrectly checking for an assumption.

Questionnaire
The questionnaire addressed four explanations for why an assumption was not checked: (1) unfamiliarity with the assumption, (2) unfamiliarity with how to check the assumptions, (3) violation of the assumption not being regarded as problematic, and (4) unfamiliarity with a remedy against a violation of the assumption. Each of these explanations was operationalized before the questionnaires were analyzed. The experimenter was present during questionnaire administration to stimulate the participants to answer more extensively, if necessary, or to ask them to reformulate their answer when they seemed to have misread the question.

Unfamiliarity with the assumptions. Participants were asked to write down the assumptions they thought it was necessary to check for each of the three statistical techniques used in the study. Simply mentioning the assumption of normality or homogeneity of variance was scored as being familiar with the assumption, even if the participants did not specify what, exactly, was required to follow a normal distribution or which variances were supposed to be equal. Explaining the assumptions without explicitly mentioning them was also scored as being familiar with the assumption.

Unfamiliarity with how to check the assumptions. Participants were asked if they could think of a way to investigate whether there was a violation of each of the two assumptions (normality and homogeneity of variance) for t-tests, ANOVA and regression, respectively. Thus, the assumptions per technique were explicitly given, whether or not they had been correctly reported in answer to the previous question. For normality, specifying how to visualize the data in such a way that a possible violation would be visible (for example, making a QQ-plot or a histogram) was categorized as a correct way of checking for assumption violations, even when no further information was given about how to make such a visualization. Mentioning a measure of or a test for normality was also considered correct. For studying homogeneity of variance, rules of thumb or tests, such as Levene's test for testing equality of variances, were categorized as a correct way of checking this assumption, and the same holds for eyeballing visual representations from which variances could be deduced. Note that the criteria for a correct check are lenient, since they include preliminary tests that are usually considered inappropriate.

Violation of the assumption not being regarded as problematic. For techniques that have been shown to be robust against certain assumption violations, it can be argued that it makes sense not to check for these assumptions, because the outcome of this checking process would not influence the interpretation of the data anyway. To study this explanation, participants were asked per assumption and for the three techniques whether they considered a possible violation to be influential. Afterward, the answers indicating that this influence was small or absent were scored as satisfying the criteria for this explanation.

Unfamiliarity with a remedy against a violation of an assumption. One could imagine that a possible violation of assumptions is not checked because no remedy for such violations is known. Participants were thus asked to note remedies for possible violations of normality and homogeneity of variance for each of the three statistical analysis techniques. Correct remedies were defined as transforming the data (it was not required that participants specify which transformation), using a different technique (e.g., a non-parametric technique when the assumption of normality has been violated), and increasing the sample size.

DATA ANALYSIS
All results are presented as percentages of the total number of participants or of the total number of analyzed data sets, depending on the specific research question. Confidence intervals (CIs) are given, but should be interpreted cautiously because the sample cannot be regarded as being completely random. The CIs for percentages were calculated as so-called score CIs (Wilson, 1927). All CIs are 95% CIs.
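The score interval (Wilson, 1927) used for these percentages is straightforward to compute directly from a count and a sample size. The following Python sketch is illustrative only (it is not the analysis code of the study); the example count of 22 out of the 180 analyzed data sets is hypothetical, chosen because it corresponds to roughly the 12% reported below:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a proportion (z = 1.96 gives 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical example: 22 of 180 analyses (30 participants x 6 data sets).
lo, hi = wilson_ci(22, 180)
print(f"[{lo:.1%}, {hi:.1%}]")  # → [8.2%, 17.8%]
```

Unlike the naive Wald interval, the score interval never extends below 0 or above 1, which matters for the small percentages reported in this study.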
RESULTS
Of the six data sets that the 30 participants were required to analyze, in all but three instances the expected technique was chosen. In the remaining three instances, ANOVA was used to analyze data sets that were meant to be analyzed by means of a t-test. Since ANOVA is in this case completely equivalent to an independent-samples t-test, it can be concluded that an appropriate technique was chosen for all data sets. In none of these cases was an unconditional technique chosen.

Violations of, or conformance with, the assumptions of normality and homogeneity of variance were correctly checked in 12% (95% CI = [8%, 18%]) and 23% (95% CI = [18%, 30%]), respectively, of the analyzed data sets. Figure 2 shows, for each of the three techniques, how frequently checks for possible violations of the assumptions of normality and homogeneity of variance occurred, and whether the checking was done correctly or a preliminary test was used. Note that the assumption of normality was rarely checked for regression, and never correctly. On the few occasions that normality was checked, the normality of the scores instead of the residuals was examined. Although this approach might be useful for studying the distribution of the scores, it is insufficient for determining whether the assumption of normality has been violated.

The percentages of participants giving each of the four reasons for not checking assumptions, as measured by the questionnaire, are given in Figure 3. A majority of the participants were unfamiliar with the assumptions. For each assumption, only a minority of participants mentioned at least one of the correct ways to check for a violation of the assumption. The majority of the participants failed to indicate that the alleged robustness of a technique against violations of the relevant assumption was a reason not to check these assumptions in the first place. Many participants did not know whether a violation of an assumption was important or not. Only in a minority of instances was an acceptable remedy for a violation of an assumption mentioned. No unacceptable remedies were mentioned. In general, participants indicated little knowledge of how to overcome a violation of one of the assumptions, and most participants reported never having looked for a remedy against a violation of statistical assumptions.

Participants had been told what the relevant assumptions were before they had to answer these questions. Therefore, the results for the last three explanations per assumption in Figure 3 are reported for all participants, despite the fact that many participants reported being unfamiliar with the assumption. This implies that, especially for the assumption of normality and to a lesser extent for the assumption of equal variances, the results regarding the last three explanations should be interpreted with caution.

DISCUSSION
In order to examine people's understanding of the assumptions of statistical tests and their behavior with regard to checking these assumptions, 30 researchers were asked to analyze six data sets using the t-test, ANOVA, regression or a non-parametric
FIGURE 2 | The frequency of whether the two assumptions were checked at all, whether they were checked correctly, and whether a preliminary test was used, for three often-used techniques, in percentages of the total number of cases. Between brackets are 95% CIs for the percentages.
alternative, as appropriate. All participants carried out nominally appropriate analyses, but only in a minority of cases were the data examined for possible violations of the assumptions of the chosen techniques. Preliminary test outcomes were rarely consulted, and only if given by default in the course of carrying out an analysis. The results of a questionnaire administered after the analyses were performed revealed that the failure to check for violations of assumptions could be attributed to the researchers' lack of knowledge about the assumptions, rather than to a deliberate decision not to check the data for violations of the assumptions.

FIGURE 3 | Percentages of participants giving each of the explanations for not checking assumptions, as a function of assumption and technique. Error bars indicate 95% CIs.

Homogeneity of variance was checked for in roughly a third of all cases, and the assumption of normality in less than a quarter of the data sets that were analyzed. Moreover, checks for the assumption of normality were often carried out incorrectly. An explanation for the finding that the assumption of homogeneity of variance was checked more often than the assumption of normality is the fact that a clear violation of this assumption can often be directly deduced from the standard deviations, whereas measures indicating normality are less common. Furthermore, many participants seemed familiar with a rule of thumb to check whether the assumption of homogeneity of variance for ANOVA is violated (e.g., the largest standard deviation is larger than twice the smallest standard deviation), whereas such rules of thumb for checking possible violations of the assumption of normality were unknown to our participants. It was also found that Levene's test was often used as a preliminary test to choose between the pooled t-test and the Welch t-test, despite the fact that the use of preliminary tests is often discouraged (e.g., Wilcox et al., 1986; Zimmerman, 2004, 2011; Schucany and Ng, 2006). An obvious explanation for this could be that the outcomes of Levene's test are given as a default option for the t procedure in SPSS (this was the case in all versions that were used by the participants). The presence of Levene's test together with the corresponding t-tests may have led researchers to think that they should use this information. Support for this hypothesis is that preliminary tests were not carried out in any other cases.

It is possible that researchers have well-considered reasons for not checking for possible violations of assumptions. For example, they may be aware of the robustness of a technique with respect to violations of a particular assumption, and quite reasonably chose
not to check whether the assumption is violated. Our questionnaire, however, revealed that many researchers simply do not know which assumptions should be met for the t-test, ANOVA, and regression analysis. Only a minority of the researchers correctly named both assumptions, despite the fact that the statistical techniques themselves were well known to the participants. Even when the assumptions were provided to the participants during the course of filling out the questionnaire, only a minority of the participants reported knowing a means of checking for violations, let alone which measures could be taken to remedy any possible violations or which tests could be used instead when violations could not be remedied.

A limitation of the present study is that, although researchers were asked to perform the tasks in their own working environment, the setting was nevertheless artificial, and for that reason the outcomes might have been biased. Researchers were obviously aware that they were being watched during this observation study, which may have changed their behavior. However, we expect that if they did indeed conduct the analyses differently than they normally would, they likely attempted to perform better rather than worse than usual. A second limitation of the study is the relatively small number of participants. Despite this limited number and the resulting lower power, however, the effects are large, and the CIs show that the outcomes are unlikely to be due to chance alone. A third limitation is the possible presence of selection bias. The sample was not completely random, because the selection of the universities involved could be considered a convenience sample. However, we have no reason to think that the sample is not representative of Ph.D. students at research universities. A fourth and last limitation is the fact that it is not clear what training each of the participants had on the topic of assumptions. However, all had their education in Psychology Departments in The Netherlands, where statistics is an important part of the basic curriculum. It is thus unlikely that they were not subjected to extensive discussion of the importance of meeting assumptions.

Our findings show that researchers are relatively unknowledgeable when it comes to when and how data should be checked for violations of the assumptions of statistical tests. It is notable that the scientific community tolerates this lack of knowledge. One possible explanation for this state of affairs is that the scientific community as a whole does not consider it important to verify that the assumptions of statistical tests are met. Alternatively, other scientists may assume too readily that if nothing is said about assumptions in a manuscript, any crucial assumptions were met. Our results suggest that in many cases this might be a premature conclusion. It seems important to consider how statistical education can be improved to draw attention to the place of checking for assumptions in statistics and to how to deal with possible violations (including deciding to use unconditional techniques). Requiring that authors describe how they checked for violations of assumptions when the techniques applied are not robust to violations would, as Bakker and Wicherts (2011) have proposed, force researchers on both ends of the publishing process to show more awareness of this important issue.
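Several of the practices discussed above, the standard-deviation rule of thumb, Levene's test as a default preliminary test, and the unconditional Welch t-test, can be mirrored outside SPSS. The Python sketch below is illustrative only; the two groups and their parameters are invented, and scipy's `levene` and `ttest_ind` stand in for the SPSS procedures mentioned in the text:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two hypothetical groups with clearly unequal population variances.
a = rng.normal(5.0, 1.0, size=25)
b = rng.normal(5.5, 3.0, size=25)

# Rule of thumb: flag a possible violation when the largest SD is more
# than twice the smallest SD.
sd_a, sd_b = a.std(ddof=1), b.std(ddof=1)
flagged = max(sd_a, sd_b) > 2 * min(sd_a, sd_b)

# Conditional strategy: a preliminary Levene test decides which t-test to run.
_, p_levene = stats.levene(a, b, center="mean")
equal_var = p_levene >= 0.05
_, p_conditional = stats.ttest_ind(a, b, equal_var=equal_var)

# Unconditional strategy: always use the Welch t-test, which does not
# assume homogeneity of variance.
_, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(bool(flagged), round(p_conditional, 3), round(p_welch, 3))
```

The unconditional strategy skips the preliminary test entirely, which is exactly the recommendation cited above for avoiding the inflated Type I error rates associated with conditional testing.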
REFERENCES
American Psychological Association. (2009). Publication Manual of the American Psychological Association, 6th Edn. Washington, DC: American Psychological Association.
Bakker, M., and Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behav. Res. Methods 43, 666–678.
Bathke, A. (2004). The ANOVA F test can still be used in some unbalanced designs with unequal variances and nonnormal data. J. Stat. Plan. Inference 126, 413–422.
Best, D. J., and Rayner, J. C. W. (1987). Welch's approximate solution for the Behrens–Fisher problem. Technometrics 29, 205–210.
Bradley, J. V. (1980). Nonrobustness in Z, t, and F tests at large sample sizes. Bull. Psychon. Soc. 16, 333–336.
Choi, P. T. (2005). Statistics for the reader: what to ask before believing the results. Can. J. Anaesth. 52, R1–R5.
Gans, D. J. (1981). Use of a preliminary test in comparing two sample means. Commun. Stat. Simul. Comput. 10, 163–174.
Havlicek, L. L., and Peterson, N. L. (1977). Effects of the violation of assumptions upon significance levels of the Pearson r. Psychol. Bull. 84, 373–377.
Hayes, A. F., and Cai, L. (2007). Further evaluating the conditional decision rule for comparing two independent means. Br. J. Math. Stat. Psychol. 60, 217–244.
Hazelton, M. L. (2003). A graphical tool for assessing normality. Am. Stat. 57, 285–288.
Kashy, D. A., Donnellan, M. B., Ackerman, R. A., and Russell, D. W. (2009). Reporting and interpreting research in PSPB: practices, principles, and pragmatics. Pers. Soc. Psychol. Bull. 35, 1131–1142.
Keselman, H. J., Algina, J., Lix, L. M., Wilcox, R. R., and Deering, K. N. (2008). A generally robust approach for testing hypotheses and setting confidence intervals for effect sizes. Psychol. Methods 13, 110–129.
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R., Donahue, B., Kowalchuk, R. K., Lowman, L. L., Petoskey, M. D., Keselman, J. C., and Levin, J. R. (1998). Statistical practices of educational researchers: an analysis of their ANOVA, MANOVA and ANCOVA. Rev. Educ. Res. 68, 350.
Kohr, R. L., and Games, P. A. (1974). Robustness of the analysis of variance, the Welch procedure and a Box procedure to heterogeneous variances. J. Exp. Educ. 43, 61–69.
Lix, L. M., Keselman, J. C., and Keselman, H. J. (1996). Consequences of assumption violations revisited: a quantitative review of alternatives to the one-way analysis of variance F test. Rev. Educ. Res. 66, 579–620.
Olsen, C. H. (2003). Review of the use of statistics in Infection and Immunity. Infect. Immun. 71, 6689–6692.
Osborne, J. (2008). Sweating the small stuff in educational psychology: how effect size and power reporting failed to change from 1969 to 1999, and what that means for the future of changing practices. Educ. Psychol. 28, 151–160.
Osborne, J. W., and Waters, E. (2002). Four assumptions of multiple regression that researchers should always test. Pract. Assess. Res. Evalu. 8. Available at: http://www-psychology.concordia.ca/fac/kline/495/osborne.pdf
Rochon, J., and Kieser, M. (2011). A closer look at the effect of preliminary goodness-of-fit testing for normality for the one-sample t-test. Br. J. Math. Stat. Psychol. 64, 410–426.
Sawilowsky, S. S., and Blair, R. C. (1992). A more realistic look at the robustness and type II error properties of the t test to departures from population normality. Psychol. Bull. 111, 352–360.
Schoder, V., Himmelmann, A., and Wilhelm, K. P. (2006). Preliminary testing for normality: some statistical aspects of a common concept. Clin. Exp. Dermatol. 31, 757–761.
Schucany, W. R., and Ng, H. K. T. (2006). Preliminary goodness-of-fit tests for normality do not validate the one-sample Student t. Commun. Stat. Theory Methods 35, 2275–2286.
Vardeman, S. B., and Morris, M. D. (2003). Statistics and ethics: some advice for young statisticians. Am. Stat. 57, 21.
Wilcox, R. R. (1987). New designs in analysis of variance. Annu. Rev. Psychol. 38, 29–60.
Wilcox, R. R., Charlin, V. L., and Thompson, K. L. (1986). New Monte Carlo results on the robustness of the ANOVA F, W, and F∗ statistics. Commun. Stat. Simul. Comput. 15, 933–943.
Wilcox, R. R., and Keselman, H. J. (2003). Modern robust data analysis methods: measures of central tendency. Psychol. Methods 8, 254–274.
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
Zimmerman, D. W. (1996). Some properties of preliminary tests of equality of variances in the two-sample location problem. J. Gen. Psychol. 123, 217–231.
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. Br. J. Math. Stat. Psychol. 57, 173–181.
Zimmerman, D. W. (2011). A simple and effective decision rule for choosing a significance test to protect against non-normality. Br. J. Math. Stat. Psychol. 64, 388–409.

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Received: 07 November 2011; accepted: 20 April 2012; published online: 14 May 2012.

Citation: Hoekstra R, Kiers HAL and Johnson A (2012) Are assumptions of well-known statistical techniques checked, and why (not)? Front. Psychology 3:137. doi: 10.3389/fpsyg.2012.00137

This article was submitted to Frontiers in Quantitative Psychology and Measurement, a specialty of Frontiers in Psychology.

Copyright © 2012 Hoekstra, Kiers and Johnson. This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.
APPENDIX

RESEARCH QUESTION DESCRIPTIONS

In this Appendix, the six research question descriptions are presented in translated form. Descriptions 1 and 2 were supposed to be answered by means of a t-test, Descriptions 3 and 4 by means of regression analysis, and Descriptions 5 and 6 by means of ANOVA.

1. A researcher is interested in the extent to which group A and group B differ in cognitive transcentivity. He has scores of 25 randomly selected participants from each of the two groups on a cognitive transcentivity test (with the range of possible scores from 0 to 25). In Column 1 of the SPSS file, the scores of the participants on the test are given, and in Column 2 the group membership (group A or B) is given.

2. A researcher is interested in the extent to which group C and group D differ in cognitive transcentivity. He has scores of 25 randomly selected participants from each of the two groups on a cognitive transcentivity test (with the range of possible scores from 0 to 25). In Column 1 of the SPSS file, the scores of the participants on the test are given, and in Column 2 the group membership (group C or D) is given.

3. A researcher is interested to what extent the weight of men can predict their self-esteem. She expects a linear relationship between weight and self-esteem. To study the relationship, she takes a random sample of 100 men, and administers a questionnaire to them to measure their self-esteem (on a scale from 0 to 50), and measures the participants' weight. In Column 1 of the SPSS file, the scores on the self-esteem questionnaire are given. The second column shows the weights of the men, measured in kilograms.

4. A researcher is interested to what extent the weight of women can predict their self-esteem. She expects a linear relationship between weight and self-esteem. To study the relationship, she takes a random sample of 100 women, and administers a questionnaire to them to measure their self-esteem (on a scale from 0 to 50), and measures the participants' weight. In Column 1 of the SPSS file, the scores on the self-esteem questionnaire are given. The second column shows the weights of the women, measured in kilograms.

5. A researcher is interested to what extent young people of three nationalities differ with respect to the time in which they can run the 100 meters. To study this, 20 persons between 20 and 30 years of age per nationality are randomly selected, and the times in which they run the 100 meters are measured. In Column 1 of the SPSS file, their times are given in seconds. The numbers "1," "2," and "3" in Column 2 represent the three different nationalities.

6. A researcher is interested to what extent young people of three other nationalities differ with respect to the time in which they can run the 100 meters. To study this, 20 persons between 20 and 30 years of age per nationality are randomly selected, and the times in which they run the 100 meters are measured. In Column 1 of the SPSS file, their times are given in seconds. The numbers "1," "2," and "3" in Column 2 represent the three different nationalities.
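The analyses these descriptions call for can be sketched in a few lines. The snippet below is an illustrative reconstruction on simulated data (all variable names and numeric values are invented here; the study's actual SPSS files are not reproduced, and scipy is used merely as a convenient stand-in for SPSS): a two-sample t-test for Descriptions 1–2, simple linear regression for Descriptions 3–4, and a one-way ANOVA for Descriptions 5–6.

```python
# Illustrative sketch of the three intended analyses, run on simulated
# data; every value below is invented for demonstration purposes only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Descriptions 1-2: two-sample t-test on 25 scores per group (scale 0-25).
group_a = rng.normal(15, 4, 25)
group_b = rng.normal(17, 4, 25)
t, p_t = stats.ttest_ind(group_a, group_b)

# Descriptions 3-4: simple linear regression of self-esteem (0-50) on weight (kg).
weight = rng.normal(80, 10, 100)
esteem = 10 + 0.2 * weight + rng.normal(0, 5, 100)
reg = stats.linregress(weight, esteem)

# Descriptions 5-6: one-way ANOVA on 100-m times for three groups of 20.
times = [rng.normal(mu, 1.0, 20) for mu in (13.0, 13.5, 14.0)]
f, p_f = stats.f_oneway(*times)

print(round(p_t, 3), round(reg.slope, 2), round(p_f, 4))
```

Note that the snippet runs each test unconditionally, without preliminary assumption checks, which is exactly the practice the surveyed papers in this collection debate.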
ORIGINAL RESEARCH ARTICLE
published: 29 August 2012
doi: 10.3389/fpsyg.2012.00325

Statistical conclusion validity: some common threats and simple remedies
Miguel A. García-Pérez*
Facultad de Psicología, Departamento de Metodología, Universidad Complutense, Madrid, Spain
Edited by: Jason W. Osborne, Old Dominion University, USA
Reviewed by: Megan Welsh, University of Connecticut, USA; David Flora, York University, Canada
*Correspondence: Miguel A. García-Pérez, Facultad de Psicología, Departamento de Metodología, Campus de Somosaguas, Universidad Complutense, 28223 Madrid, Spain. e-mail: [email protected]

The ultimate goal of research is to produce dependable knowledge or to provide the evidence that may guide practical decisions. Statistical conclusion validity (SCV) holds when the conclusions of a research study are founded on an adequate analysis of the data, generally meaning that adequate statistical methods are used whose small-sample behavior is accurate, besides being logically capable of providing an answer to the research question. Compared to the three other traditional aspects of research validity (external validity, internal validity, and construct validity), interest in SCV has recently grown on evidence that inadequate data analyses are sometimes carried out which yield conclusions that a proper analysis of the data would not have supported. This paper discusses evidence of three common threats to SCV that arise from widespread recommendations or practices in data analysis, namely, the use of repeated testing and optional stopping without control of Type-I error rates, the recommendation to check the assumptions of statistical tests, and the use of regression whenever a bivariate relation or the equivalence between two variables is studied. For each of these threats, examples are presented and alternative practices that safeguard SCV are discussed. Educational and editorial changes that may improve the SCV of published research are also discussed.
Keywords: data analysis, validity of research, regression, stopping rules, preliminary tests
Psychologists are well aware of the traditional aspects of research validity introduced by Campbell and Stanley (1966) and further subdivided and discussed by Cook and Campbell (1979). Despite initial criticisms of the practically oriented and somewhat fuzzy distinctions among the various aspects (see Cook and Campbell, 1979, pp. 85–91; see also Shadish et al., 2002, pp. 462–484), the four facets of research validity have gained recognition and they are currently covered in many textbooks on research methods in psychology (e.g., Beins, 2009; Goodwin, 2010; Girden and Kabacoff, 2011). Methods and strategies aimed at securing research validity are also discussed in these and other sources. To simplify the description, construct validity is sought by using well-established definitions and measurement procedures for variables, internal validity is sought by ensuring that extraneous variables have been controlled and confounds have been eliminated, and external validity is sought by observing and measuring dependent variables under natural conditions or under an appropriate representation of them. The fourth aspect of research validity, which Cook and Campbell called statistical conclusion validity (SCV), is the subject of this paper.

Cook and Campbell (1979, pp. 39–50) discussed that SCV pertains to the extent to which data from a research study can reasonably be regarded as revealing a link (or lack thereof) between independent and dependent variables as far as statistical issues are concerned. This particular facet was separated from other factors acting in the same direction (the three other facets of validity) and includes three aspects: (1) whether the study has enough statistical power to detect an effect if it exists, (2) whether there is a risk that the study will "reveal" an effect that does not actually exist, and (3) how can the magnitude of the effect be confidently estimated. They nevertheless considered the latter aspect as a mere step ahead once the first two aspects had been satisfactorily solved, and they summarized their position by stating that SCV "refers to inferences about whether it is reasonable to presume covariation given a specified α level and the obtained variances" (Cook and Campbell, 1979, p. 41). Given that mentioning "the obtained variances" was an indirect reference to statistical power and mentioning α was a direct reference to statistical significance, their position about SCV may have seemed to only entail consideration that the statistical decision can be incorrect as a result of Type-I and Type-II errors.

Perhaps as a consequence of this literal interpretation, review papers studying SCV in published research have focused on power and significance (e.g., Ottenbacher, 1989; Ottenbacher and Maas, 1999), strategies aimed at increasing SCV have only considered these issues (e.g., Howard et al., 1983), and tutorials on the topic only or almost only mention these issues along with effect sizes (e.g., Orme, 1991; Austin et al., 1998; Rankupalli and Tandon, 2010). This emphasis on issues of significance and power may also be the reason that some sources refer to threats to SCV as "any factor that leads to a Type-I or a Type-II error" (e.g., Girden and Kabacoff, 2011, p. 6; see also Rankupalli and Tandon, 2010, Section 1.2), as if these errors had identifiable causes that could be prevented. It should be noted that SCV has also occasionally been purported to reflect the extent to which pre-experimental designs provide evidence for causation (Lee, 1985) or the extent to which
www.frontiersin.org August 2012 | Volume 3 | Article 325 | 17 García-Pérez Statistical conclusion validity
meta-analyses are based on representative results that make the conclusion generalizable (Elvik, 1998).

But Cook and Campbell's (1979, p. 80) aim was undoubtedly broader, as they stressed that SCV "is concerned with sources of random error and with the appropriate use of statistics and statistical tests" (italics added). Moreover, Type-I and Type-II errors are an essential and inescapable consequence of the statistical decision theory underlying significance testing and, as such, the potential occurrence of one or the other of these errors cannot be prevented. The actual occurrence of them for the data on hand cannot be assessed either. Type-I and Type-II errors will always be with us and, hence, SCV is only trivially linked to the fact that research will never unequivocally prove or reject any statistical null hypothesis or its originating research hypothesis. Cook and Campbell seemed to be well aware of this issue when they stressed that SCV refers to reasonable inferences given a specified significance level and a given power. In addition, Stevens (1950, p. 121) forcefully emphasized that "it is a statistician's duty to be wrong the stated number of times," implying that a researcher should accept the assumed risks of Type-I and Type-II errors, use statistical methods that guarantee the assumed error rates, and consider these as an essential part of the research process. From this position, these errors do not affect SCV unless their probability differs meaningfully from that which was assumed. And this is where an alternative perspective on SCV enters the stage, namely, whether the data were analyzed properly so as to extract conclusions that faithfully reflect what the data have to say about the research question. A negative answer raises concerns about SCV beyond the triviality of Type-I or Type-II errors.

There are actually two types of threat to SCV from this perspective. One is when the data are subjected to thoroughly inadequate statistical analyses that do not match the characteristics of the design used to collect the data or that cannot logically give an answer to the research question. The other is when a proper statistical test is used but it is applied under conditions that alter the stated risk probabilities. In the former case, the conclusion will be wrong except by accident; in the latter, the conclusion will fail to be incorrect with the declared probabilities of Type-I and Type-II errors.

The position elaborated in the foregoing paragraph is well summarized in Milligan and McFillen's (1984, p. 439) statement that "under normal conditions (...) the researcher will not know when a null effect has been declared significant or when a valid effect has gone undetected (...) Unfortunately, the statistical conclusion validity, and the ultimate value of the research, rests on the explicit control of (Type-I and Type-II) error rates." This perspective on SCV is explicitly discussed in some textbooks on research methods (e.g., Beins, 2009, pp. 139–140; Goodwin, 2010, pp. 184–185) and some literature reviews have been published that reveal a sound failure of SCV in these respects.

For instance, Milligan and McFillen (1984, p. 438) reviewed evidence that "the business research community has succeeded in publishing a great deal of incorrect and statistically inadequate research" and they dissected and discussed in detail four additional cases (among many others that reportedly could have been chosen) in which a breach of SCV resulted from gross mismatches between the research design and the statistical analysis. Similarly, García-Pérez (2005) reviewed alternative methods to compute confidence intervals for proportions and discussed three papers (among many others that reportedly could have been chosen) in which inadequate confidence intervals had been computed. More recently, Bakker and Wicherts (2011) conducted a thorough analysis of psychological papers and estimated that roughly 50% of published papers contain reporting errors, although they only checked whether the reported p value was correct and not whether the statistical test used was appropriate. A similar analysis carried out by Nieuwenhuis et al. (2011) revealed that 50% of the papers reporting the results of a comparison of two experimental effects in top neuroscience journals had used an incorrect statistical procedure. And Bland and Altman (2011) reported further data on the prevalence of incorrect statistical analyses of a similar nature.

An additional indicator of the use of inadequate statistical procedures arises from consideration of published papers whose title explicitly refers to a re-analysis of data reported in some other paper. A literature search for papers including in their title the terms "a re-analysis," "a reanalysis," "re-analyses," "reanalyses," or "alternative analysis" was conducted on May 3, 2012 in the Web of Science (WoS; http://thomsonreuters.com), which rendered 99 such papers with subject area "Psychology" published in 1990 or later. Although some of these were false positives, a sizeable number of them actually discussed the inadequacy of analyses carried out by the original authors and reported the results of proper alternative analyses that typically reversed the original conclusion. This type of outcome upon re-analyses of data is more frequent than the results of this quick and simple search suggest, because the information for identification is not always included in the title of the paper or is included in some other form: For a simple example, the search for the clause "a closer look" in the title rendered 131 papers, many of which also presented re-analyses of data that reversed the conclusion of the original study.

Poor design or poor sample size planning may, unbeknownst to the researcher, lead to unacceptable Type-II error rates, which will certainly affect SCV (as long as the null is not rejected; if it is, the probability of a Type-II error is irrelevant). Although insufficient power due to lack of proper planning has consequences on statistical tests, the thread of this paper de-emphasizes this aspect of SCV (which should perhaps more reasonably fit within an alternative category labeled design validity) and emphasizes the idea that SCV holds when statistical conclusions are incorrect with the stated probabilities of Type-I and Type-II errors (whether the latter was planned or simply computed). Whether or not the actual significance level used in the research or the power that it had is judged acceptable is another issue, which does not affect SCV: The statistical conclusion is valid within the stated (or computed) error probabilities. A breach of SCV occurs, then, when the data are not subjected to adequate statistical analyses or when control of Type-I or Type-II errors is lost.

It should be noted that a further component was included into consideration of SCV in Shadish et al.'s (2002) sequel to Cook and Campbell's (1979) book, namely, effect size. Effect size relates to what has been called a Type-III error (Crawford et al., 1998), that is, a statistically significant result that has no meaningful practical implication and that only arises from the use of a huge sample. This issue is left aside in the present paper because adequate consideration and reporting of effect sizes precludes Type-III errors,
although the recommendations of Wilkinson and The Task Force on Statistical Inference (1999) in this respect are not always followed. Consider, e.g., Lippa's (2007) study of the relation between sex drive and sexual attraction. Correlations generally lower than 0.3 in absolute value were declared strong as a result of p values below 0.001. With sample sizes sometimes nearing 50,000 paired observations, even correlations valued at 0.04 turned out significant in this study. More attention to effect sizes is certainly needed, both by researchers and by journal editors and reviewers.

The remainder of this paper analyzes three common practices that result in SCV breaches, also discussing simple replacements for them.

STOPPING RULES FOR DATA COLLECTION WITHOUT CONTROL OF TYPE-I ERROR RATES

The asymptotic theory that provides justification for null hypothesis significance testing (NHST) assumes what is known as fixed sampling, which means that the size n of the sample is not itself a random variable or, in other words, that the size of the sample has been decided in advance and the statistical test is performed once the entire sample of data has been collected. Numerous procedures have been devised to determine the size that a sample must have according to planned power (Ahn et al., 2001; Faul et al., 2007; Nisen and Schwertman, 2008; Jan and Shieh, 2011), the size of the effect sought to be detected (Morse, 1999), or the width of the confidence intervals of interest (Graybill, 1958; Boos and Hughes-Oliver, 2000; Shieh and Jan, 2012). For reviews, see Dell et al. (2002) and Maxwell et al. (2008). In many cases, a researcher simply strives to gather as large a sample as possible. Asymptotic theory supports NHST under fixed sampling assumptions, whether or not the size of the sample was planned.

In contrast to fixed sampling, sequential sampling implies that the number of observations is not fixed in advance but depends by some rule on the observations already collected (Wald, 1947; Anscombe, 1953; Wetherill, 1966). In practice, data are analyzed as they come in and data collection stops when the observations collected thus far satisfy some criterion. The use of sequential sampling faces two problems (Anscombe, 1953, p. 6): (i) devising a suitable stopping rule and (ii) finding a suitable test statistic and determining its sampling distribution. The mere statement of the second problem evidences that the sampling distribution of conventional test statistics for fixed sampling no longer holds under sequential sampling. These sampling distributions are relatively easy to derive in some cases, particularly in those involving negative binomial parameters (Anscombe, 1953; García-Pérez and Núñez-Antón, 2009). The choice between fixed and sequential sampling (sometimes portrayed as the "experimenter's intention"; see Wagenmakers, 2007) has important ramifications for NHST because the probability that the observed data are compatible (by any criterion) with a true null hypothesis generally differs greatly across sampling methods. This issue is usually bypassed by those who look at the data as a "sure fact" once collected, as if the sampling method used to collect the data did not make any difference or should not affect how the data are interpreted.

There are good reasons for using sequential sampling in psychological research. For instance, in clinical studies in which patients are recruited on the go, the experimenter may want to analyze data as they come in to be able to prevent the administration of a seemingly ineffective or even hurtful treatment to new patients. In studies involving a waiting-list control group, individuals in this group are generally transferred to an experimental group midway along the experiment. In studies with laboratory animals, the experimenter may want to stop testing animals before the planned number has been reached so that animals are not wasted when an effect (or the lack thereof) seems established. In these and analogous cases, the decision as to whether data will continue to be collected results from an analysis of the data collected thus far, typically using a statistical test that was devised for use in conditions of fixed sampling. In other cases, experimenters test their statistical hypothesis each time a new observation or block of observations is collected, and continue the experiment until they feel the data are conclusive one way or the other. Software has been developed that allows experimenters to find out how many more observations will be needed for a marginally non-significant result to become significant on the assumption that sample statistics will remain invariant when the extra data are collected (Morse, 1998).

The practice of repeated testing and optional stopping has been shown to affect in unpredictable ways the empirical Type-I error rate of statistical tests designed for use under fixed sampling (Anscombe, 1954; Armitage et al., 1969; McCarroll et al., 1992; Strube, 2006; Fitts, 2011a). The same holds when a decision is made to collect further data on evidence of a marginally (non) significant result (Shun et al., 2001; Chen et al., 2004). The inaccuracy of statistical tests in these conditions represents a breach of SCV, because the statistical conclusion thus fails to be incorrect with the assumed (and explicitly stated) probabilities of Type-I and Type-II errors. But there is an easy way around the inflation of Type-I error rates from within NHST, which solves the threat to SCV that repeated testing and optional stopping entail.

In what appears to be the first development of a sequential procedure with control of Type-I error rates in psychology, Frick (1998) proposed that repeated statistical testing be conducted under the so-called COAST (composite open adaptive sequential test) rule: If the test yields p < 0.01, stop collecting data and reject the null; if it yields p > 0.36, stop also and do not reject the null; otherwise, collect more data and re-test. The low criterion at 0.01 and the high criterion at 0.36 were selected through simulations so as to ensure a final Type-I error rate of 0.05 for paired-samples t tests. Use of the same low and high criteria rendered similar control of Type-I error rates for tests of the product-moment correlation, but they yielded slightly conservative tests of the interaction in 2 × 2 between-subjects ANOVAs. Frick also acknowledged that adjusting the low and high criteria might be needed in other cases, although he did not address them. This has nevertheless been done by others who have modified and extended Frick's approach (e.g., Botella et al., 2006; Ximenez and Revuelta, 2007; Fitts, 2010a,b, 2011b). The result is sequential procedures with stopping rules that guarantee accurate control of final Type-I error rates for the statistical tests that are more widely used in psychological research.

Yet, these methods do not seem to have ever been used in actual research, or at least their use has not been acknowledged. For instance, of the nine citations to Frick's (1998) paper listed in WoS as of May 3, 2012, only one is from a paper (published in 2011) in
which the COAST rule was reportedly used, although unintendedly. And not a single citation is to be found in WoS from papers reporting the use of the extensions and modifications of Botella et al. (2006) or Ximenez and Revuelta (2007). Perhaps researchers in psychology invariably use fixed sampling, but it is hard to believe that "data peeking" or "data monitoring" was never used, or that the results of such interim analyses never led researchers to collect some more data. Wagenmakers (2007, p. 785) regretted that "it is not clear what percentage of p values reported in experimental psychology have been contaminated by some form of optional stopping. There is simply no information in Results sections that allows one to assess the extent to which optional stopping has occurred." This incertitude was quickly resolved by John et al. (2012). They surveyed over 2000 psychologists with highly revealing results: Respondents affirmatively admitted to the practices of data peeking, data monitoring, or conditional stopping in rates that varied between 20 and 60%.

Besides John et al.'s (2012) proposal that authors disclose these details in full and Simmons et al.'s (2011) proposed list of requirements for authors and guidelines for reviewers, the solution to the problem is simple: Use strategies that control Type-I error rates upon repeated testing and optional stopping. These strategies have been widely used in biomedical research for decades (Bauer and Köhne, 1994; Mehta and Pocock, 2011). There is no reason that psychological research should ignore them and give up efficient research with control of Type-I error rates, particularly when these strategies have also been adapted and further developed for use under the most common designs in psychological research (Frick, 1998; Botella et al., 2006; Ximenez and Revuelta, 2007; Fitts, 2010a,b).

It should also be stressed that not all instances of repeated testing or optional stopping without control of Type-I error rates threaten SCV. A breach of SCV occurs only when the conclusion regarding the research question is based on the use of these practices. For an acceptable use, consider the study of Xu et al. (2011). They investigated order preferences in primates to find out whether primates preferred to receive the best item first rather than last. Their procedure involved several experiments and they declared that "three significant sessions (two-tailed binomial tests per session, p < 0.05) or 10 consecutive non-significant sessions were required from each monkey before moving to the next experiment. The three significant sessions were not necessarily consecutive (...) Ten consecutive non-significant sessions were taken to mean there was no preference by the monkey" (p. 2304). In this case, the use of repeated testing with optional stopping at a nominal 95% significance level for each individual test is part of the operational definition of an outcome variable used as a criterion to proceed to the next experiment. And, in any event, the overall probability of misclassifying a monkey according to this criterion is certainly fixed at a known value that can easily be worked out from the significance level declared for each individual binomial test. One may object to the value of the resultant risk of misclassification, but this does not raise concerns about SCV.

In sum, the use of repeated testing with optional stopping threatens SCV for lack of control of Type-I and Type-II error rates. A simple way around this is to refrain from these practices and adhere to the fixed sampling assumptions of statistical tests; otherwise, use the statistical methods that have been developed for use with repeated testing and optional stopping.

PRELIMINARY TESTS OF ASSUMPTIONS

To derive the sampling distribution of test statistics used in parametric NHST, some assumptions must be made about the probability distribution of the observations or about the parameters of these distributions. The assumptions of normality of distributions (in all tests), homogeneity of variances (in Student's two-sample t test for means or in ANOVAs involving between-subjects factors), sphericity (in repeated-measures ANOVAs), homoscedasticity (in regression analyses), or homogeneity of regression slopes (in ANCOVAs) are well known cases. The data on hand may or may not meet these assumptions and some parametric tests have been devised under alternative assumptions (e.g., Welch's test for two-sample means, or correction factors for the degrees of freedom of F statistics from ANOVAs). Most introductory statistics textbooks emphasize that the assumptions underlying statistical tests must be formally tested to guide the choice of a suitable test statistic for the null hypothesis of interest. Although this recommendation seems reasonable, serious consequences on SCV arise from following it.

Numerous studies conducted over the past decades have shown that the two-stage approach of testing assumptions first and subsequently testing the null hypothesis of interest has severe effects on Type-I and Type-II error rates. It may seem at first sight that this is simply the result of cascaded binary decisions each of which has its own Type-I and Type-II error probabilities; yet, this is the result of more complex interactions of Type-I and Type-II error rates that do not have fixed (empirical) probabilities across the cases that end up treated one way or the other according to the outcomes of the preliminary test: The resultant Type-I and Type-II error rates of the conditional test cannot be predicted from those of the preliminary and conditioned tests. A thorough analysis of what factors affect the Type-I and Type-II error rates of two-stage approaches is beyond the scope of this paper but readers should be aware that nothing suggests in principle that a two-stage approach might be adequate. The situations that have been more thoroughly studied include preliminary goodness-of-fit tests for normality before conducting a one-sample t test (Easterling and Anderson, 1978; Schucany and Ng, 2006; Rochon and Kieser, 2011), preliminary tests of equality of variances before conducting a two-sample t test for means (Gans, 1981; Moser and Stevens, 1992; Zimmerman, 1996, 2004; Hayes and Cai, 2007), preliminary tests of both equality of variances and normality preceding two-sample t tests for means (Rasch et al., 2011), or preliminary tests of homoscedasticity before regression analyses (Caudill, 1988; Ng and Wilcox, 2011). These and other studies provide evidence that strongly advises against conducting preliminary tests of assumptions. Almost all of these authors explicitly recommended against these practices and hoped for the misleading and misguided advice given in introductory textbooks to be removed. Wells and Hintze (2007, p. 501) concluded that "checking the assumptions using the same data that are to be analyzed, although attractive due to its empirical nature, is a fruitless endeavor because of its negative ramifications on the actual test of interest." The ramifications consist of substantial but
The ramifications consist of substantial but unknown alterations of Type-I and Type-II error rates and, hence, a breach of SCV.

Some authors suggest that the problem can be solved by replacing the formal test of assumptions with a decision based on a suitable graphical display of the data that helps researchers judge by eye whether the assumption is tenable. It should be emphasized that the problem still remains, because the decision on how to analyze the data is conditioned on the results of a preliminary analysis. The problem is not brought about by a formal preliminary test, but by the conditional approach to data analysis. The use of a non-formal preliminary test only prevents a precise investigation of the consequences on Type-I and Type-II error rates. But the “out of sight, out of mind” philosophy does not eliminate the problem.

It thus seems that a researcher must make a choice between two evils: either not testing assumptions (and, thus, threatening SCV as a result of the uncontrolled Type-I and Type-II error rates that arise from a potentially undue application of the statistical test) or testing them (and, then, also losing control of Type-I and Type-II error rates owing to the two-stage approach). Both approaches are inadequate, as applying non-robust statistical tests to data that do not satisfy the assumptions has generally as severe implications on SCV as testing preliminary assumptions in a two-stage approach. One of the solutions to the dilemma consists of switching to statistical procedures that have been designed for use under the two-stage approach. For instance, Albers et al. (2000) used second-order asymptotics to derive the size and power of a two-stage test for independent means preceded by a test of equality of variances. Unfortunately, derivations of this type are hard to carry out and, hence, they are not available for most of the cases of interest. A second solution consists of using classical test statistics that have been shown to be robust to violation of their assumptions. Indeed, dependable unconditional tests for means or for regression parameters have been identified (see Sullivan and D’Agostino, 1992; Lumley et al., 2002; Zimmerman, 2004, 2011; Hayes and Cai, 2007; Ng and Wilcox, 2011). And a third solution is switching to modern robust methods (see, e.g., Wilcox and Keselman, 2003; Keselman et al., 2004; Wilcox, 2006; Erceg-Hurn and Mirosevich, 2008; Fried and Dehling, 2011).

Avoidance of the two-stage approach in either of these ways will restore SCV while observing the important requirement that statistical methods should be used whose assumptions are not violated by the characteristics of the data.

REGRESSION AS A MEANS TO INVESTIGATE BIVARIATE RELATIONS OF ALL TYPES
Correlational methods define one of the branches of scientific psychology (Cronbach, 1957) and they are still widely used these days in some areas of psychology. Whether in regression analyses or in latent variable analyses (Bollen, 2002), vast amounts of data are subjected to these methods. Regression analyses rely on an assumption that is often overlooked in psychology, namely, that the predictor variables have fixed values and are measured without error. This assumption, whose validity can obviously be assessed without recourse to any preliminary statistical test, is listed in all statistics textbooks.

In some areas of psychology, predictors actually have this characteristic because they are physical variables defining the magnitude of stimuli, and any error with which these magnitudes are measured (or with which stimuli with the selected magnitudes are created) is negligible in practice. Among others, this is the case in psychophysical studies aimed at estimating psychophysical functions describing the form of the relation between physical magnitude and perceived magnitude (e.g., Green, 1982) or psychometric functions describing the form of the relation between physical magnitude and performance in a detection, discrimination, or identification task (Armstrong and Marks, 1997; Saberi and Petrosyan, 2004; García-Pérez et al., 2011). Regression or analogous methods are typically used to estimate the parameters of these relations, with stimulus magnitude as the independent variable and perceived magnitude (or performance) as the dependent variable. The use of regression in these cases is appropriate because the independent variable has fixed values measured without error (or with a negligible error). Another area in which the use of regression is permissible is in simulation studies on parameter recovery (García-Pérez et al., 2010), where the true parameters generating the data are free of measurement error by definition.

But very few other predictor variables used in psychology meet this requirement, as they are often test scores or performance measures that are typically affected by non-negligible and sometimes large measurement error. This is the case of the proportion of hits and the proportion of false alarms in psychophysical tasks, whose theoretical relation is linear under some signal detection models (DeCarlo, 1998) and, thus, suggests the use of simple linear regression to estimate its parameters. Simple linear regression is also sometimes used as a complement to statistical tests of equality of means in studies in which equivalence or agreement is assessed (e.g., Maylor and Rabbitt, 1993; Baddeley and Wilson, 2002), and in these cases equivalence implies that the slope should not differ significantly from unity and that the intercept should not differ significantly from zero. The use of simple linear regression is also widespread in priming studies after Greenwald et al. (1995; see also Draine and Greenwald, 1998), where the intercept (and sometimes the slope) of the linear regression of priming effect on detectability of the prime are routinely subjected to NHST.

In all the cases just discussed and in many others where the X variable in the regression of Y on X is measured with error, a study of the relation between X and Y through regression is inadequate and has serious consequences on SCV. The least of these problems is that there is no basis for assigning the roles of independent and dependent variable in the regression equation (as a non-directional relation exists between the variables, often without even a temporal precedence relation), but regression parameters will differ according to how these roles are assigned. In influential papers of which most researchers in psychology seem to be unaware, Wald (1940) and Mandansky (1959) distinguished regression relations from structural relations, the latter reflecting the case in which both variables are measured with error. Both authors illustrated the consequences of fitting a regression line when a structural relation is involved and derived suitable estimators and significance tests for the slope and intercept parameters of a structural relation. This topic was brought to the attention of psychologists by Isaac (1970) in a criticism of Treisman and Watts’ (1966) use of simple linear regression to assess the equivalence of two alternative estimates of psychophysical sensitivity (d′ measures from signal detection theory analyses). The difference between regression and structural relations is briefly mentioned in passing in many elementary books on regression; the issue of fitting structural relations (sometimes referred to as Deming’s regression or the errors-in-variables regression model) is addressed in detail in most intermediate and advanced books on regression (e.g., Fuller, 1987; Draper and Smith, 1998), and hands-on tutorials have been published (e.g., Cheng and Van Ness, 1994; Dunn and Roberts, 1999; Dunn, 2007). But this type of analysis is not in the toolbox of the average researcher in psychology¹. In contrast, recourse to this type of analysis is quite common in the biomedical sciences.

Use of this commendable method may generalize when researchers realize that estimates of the slope β and the intercept α of a structural relation can be easily computed through

    β̂ = [Sy² − λSx² + √((Sy² − λSx²)² + 4λSxy²)] / (2Sxy),   (1)
    α̂ = Ȳ − β̂X̄,   (2)

where X̄, Ȳ, Sx², Sy², and Sxy are the sample means, variances, and covariance of X and Y, and λ = σ²εy/σ²εx is the ratio of the variances of measurement errors in Y and in X. When X and Y are the same variable measured at different times or under different conditions (as in Maylor and Rabbitt, 1993; Baddeley and Wilson, 2002), λ = 1 can safely be assumed (for an actual application, see Smith et al., 2004).

FIGURE 1 | Replot of data from Yeshurun et al. (2008, their Figure 8): estimates of d′ in the second interval plotted against estimates in the first interval (both axes running from 0 to 5), with their fitted regression line through the origin (red line) and a fitted structural relation (blue line). The identity line is shown with dashed trace for comparison. For additional analyses bearing on the SCV of the original study, see García-Pérez and Alcalá-Quintana (2011).
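Equations 1 and 2 are straightforward to compute. The sketch below is not part of the original article; it is a minimal Python/NumPy implementation of the two estimators, applied to simulated data in which the true relation is the identity and both variables carry measurement error of equal variance (so λ = 1). The data-generating values are arbitrary choices for the demonstration; they merely show how the ordinary least-squares slope is attenuated while the structural-relation estimate recovers the unit slope.

```python
# Sketch (not from the article): slope and intercept of a structural relation
# computed via Eqs 1 and 2, compared with ordinary least-squares regression.
import numpy as np

def structural_relation(x, y, lam=1.0):
    """Slope and intercept of a structural (errors-in-variables) relation;
    lam is the ratio of error variances in Y and X (Eqs 1 and 2).
    The positive root assumes a nonzero covariance between X and Y."""
    sx2 = np.var(x, ddof=1)
    sy2 = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    d = sy2 - lam * sx2
    beta = (d + np.sqrt(d**2 + 4.0 * lam * sxy**2)) / (2.0 * sxy)
    alpha = np.mean(y) - beta * np.mean(x)
    return beta, alpha

def ols_slope(x, y):
    """Ordinary least-squares slope of the regression of y on x."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

if __name__ == "__main__":
    # Hypothetical data: true relation is the identity (beta = 1, alpha = 0),
    # and both measurements carry error of the same variance (lam = 1).
    rng = np.random.default_rng(0)
    true_scores = rng.uniform(0.0, 5.0, 2000)
    x = true_scores + rng.normal(0.0, 0.5, 2000)
    y = true_scores + rng.normal(0.0, 0.5, 2000)
    beta, alpha = structural_relation(x, y, lam=1.0)
    print(f"OLS slope (attenuated): {ols_slope(x, y):.3f}")
    print(f"structural slope:       {beta:.3f}, intercept: {alpha:.3f}")
```

Because the error variance in X here is a sizeable fraction of the variance of the true scores, the least-squares slope falls clearly below unity, whereas the structural-relation estimate does not.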
In other cases, a rough estimate can be used, as the estimates of α and β have been shown to be robust except under extreme departures of the guesstimated λ from its true value (Ketellapper, 1983).

For illustration, consider Yeshurun et al.’s (2008) comparison of signal detection theory estimates of d′ in each of the intervals of a two-alternative forced-choice task, which they pronounced different as revealed by a regression analysis through the origin. Note that this is the context in which Isaac (1970) had illustrated the inappropriateness of regression. The data are shown in Figure 1, and Yeshurun et al. rejected equality of d′₁ and d′₂ because the regression slope through the origin (red line, whose slope is 0.908) differed significantly from unity: the 95% confidence interval for the slope ranged between 0.844 and 0.973. Using Eqs 1 and 2, the estimated structural relation is instead given by the blue line in Figure 1. The difference seems minor by eye, but the slope of the structural relation is 0.963, which is not significantly different from unity (p = 0.738, two-tailed; see Isaac, 1970, p. 215). This outcome, which reverses a conclusion reached upon inadequate data analyses, is representative of other cases in which the null hypothesis H₀: β = 1 was rejected. The reason is dual: (1) the slope of a structural relation is estimated with severe bias through regression (Riggs et al., 1978; Kalantar et al., 1995; Hawkins, 2002) and (2) regression-based statistical tests of H₀: β = 1 render empirical Type-I error rates that are much higher than the nominal rate when both variables are measured with error (García-Pérez and Alcalá-Quintana, 2011).

In sum, SCV will improve if structural relations instead of regression equations were fitted when both variables are measured with error.

¹SPSS includes a regression procedure called “two-stage least squares” which only implements the method described by Mandansky (1959) as “use of instrumental variables” to estimate the slope of the relation between X and Y. Use of this method requires extra variables with specific characteristics (variables which may simply not be available for the problem at hand) and differs meaningfully from the simpler and more generally applicable method to be discussed next.

CONCLUSION
Type-I and Type-II errors are essential components of the statistical decision theory underlying NHST and, therefore, data can never be expected to answer a research question unequivocally. This paper has promoted a view of SCV that de-emphasizes consideration of these unavoidable errors and considers instead two alternative issues: (1) whether statistical tests are used that match the research design, goals of the study, and formal characteristics of the data and (2) whether they are applied in conditions under which the resultant Type-I and Type-II error rates match those that are declared as limiting the validity of the conclusion. Some examples of common threats to SCV in these respects have been discussed and simple and feasible solutions have been proposed.

For reasons of space, another threat to SCV has not been covered in this paper, namely the problems arising from multiple testing (i.e., concurrent tests of more than one hypothesis). Multiple testing is commonplace in brain mapping studies, and some implications on SCV have been discussed, e.g., by Bennett et al. (2009), Vul et al. (2009a,b), and Vecchiato et al. (2010).
All the discussion in this paper has assumed the frequentist approach to data analysis. In closing, and before commenting on how SCV could be improved, a few words are in order about how Bayesian approaches fare on SCV.

THE BAYESIAN APPROACH
Advocates of Bayesian approaches to data analysis, hypothesis testing, and model selection (e.g., Jennison and Turnbull, 1990; Wagenmakers, 2007; Matthews, 2011) overemphasize the problems of the frequentist approach and praise the solutions offered by the Bayesian approach: Bayes factors (BFs) for hypothesis testing, credible intervals for interval estimation, Bayesian posterior probabilities, Bayesian information criterion (BIC) as a tool for model selection and, above all else, strict reliance on observed data and independence of the sampling plan (i.e., fixed vs. sequential sampling). There is unquestionable merit in these alternatives, and a fair comparison with their frequentist counterparts requires a detailed analysis that is beyond the scope of this paper. Yet, I cannot resist the temptation of commenting on the presumed problems of the frequentist approach and also on the standing of the Bayesian approach with respect to SCV.

One of the preferred objections to p values is that they relate to data that were never collected and which, thus, should not affect the decision of what hypothesis the observed data support or fail to support. Intuitively appealing as it may seem, the argument is flawed because the referent for a p value is not other data sets that could have been observed in undone replications of the same experiment. Instead, the referent is the properties of the test statistic itself, which is guaranteed to have the declared sampling distribution when data are collected as assumed in the derivation of such distribution. Statistical tests are calibrated procedures with known properties, and this calibration is what makes their results interpretable. As is the case for any other calibrated procedure or measuring instrument, the validity of the outcome only rests on adherence to the usage specifications. And, of course, the test statistic and the resultant p value on application cannot be blamed for the consequences of a failure to collect data properly or to apply the appropriate statistical test.

Consider a two-sample t test for means. Those who need a referent may want to notice that the p value for the data from a given experiment relates to the uncountable times that such a test has been applied to data from any experiment in any discipline. Calibration of the t test ensures that a proper use with a significance level of, say, 5% will reject a true null hypothesis on 5% of the occasions, no matter what the experimental hypothesis is, what the variables are, what the data are, what the experiment is about, who carries it out, or in what research field. What a p value indicates is how tenable it is that the t statistic will attain the observed value if the null were correct, with only a trivial link to the data observed in the experiment of concern. And this only places in a precise quantitative framework the logic that the man on the street uses to judge, for instance, that getting struck by lightning four times over the past 10 years is not something that could identically have happened to anybody else, or that the source of a politician’s huge and untraceable earnings is not the result of allegedly winning top lottery prizes numerous times over the past couple of years. In any case, the advantage of the frequentist approach as regards SCV is that the probability of a Type-I or a Type-II error can be clearly and unequivocally stated, which is not to be mistaken for a statement that a p value is the probability of a Type-I error in the current case, or that it is a measure of the strength of evidence against the null that the current data provide. The most prevalent problems of p values are their potential for misuse and their widespread misinterpretation (Nickerson, 2000). But misuse or misinterpretation does not make NHST and p values uninterpretable or worthless.

Bayesian approaches are claimed to be free of these presumed problems, yielding a conclusion that is exclusively grounded on the data. In a naive account of Bayesian hypothesis testing, Malakoff (1999) attributes to biostatistician Steven Goodman the assertion that the Bayesian approach “says there is an X% probability that your hypothesis is true – not that there is some convoluted chance that if you assume the null hypothesis is true, you will get a similar or more extreme result if you repeated your experiment thousands of times.” Besides being misleading and reflecting a poor understanding of the logic of calibrated NHST methods, what goes unmentioned in this and other accounts is that the Bayesian potential to find out the probability that the hypothesis is true will not materialize without two crucial extra pieces of information. One is the a priori probability of each of the competing hypotheses, which certainly does not come from the data. The other is the probability of the observed data under each of the competing hypotheses, which has the same origin as the frequentist p value and whose computation requires distributional assumptions that must necessarily take the sampling method into consideration.

In practice, Bayesian hypothesis testing generally computes BFs and the result might be stated as “the alternative hypothesis is x times more likely than the null,” although the probability that this type of statement is wrong is essentially unknown. The researcher may be content with a conclusion of this type, but how much of these odds comes from the data and how much comes from the extra assumptions needed to compute a BF is undecipherable. In many cases research aims at gathering and analyzing data to make informed decisions such as whether application of a treatment should be discontinued, whether changes should be introduced in an educational program, whether daytime headlights should be enforced, or whether in-car use of cell phones should be forbidden. Like frequentist analyses, Bayesian approaches do not guarantee that the decisions will be correct. One may argue that stating how much more likely one hypothesis is than another bypasses the decision to reject or not reject any of them and, then, that Bayesian approaches to hypothesis testing are free of Type-I and Type-II errors. Although this is technically correct, the problem remains from the perspective of SCV: Statistics is only a small part of a research process whose ultimate goal is to reach a conclusion and make a decision, and researchers are in a better position to defend their claims if they can supplement them with a statement of the probability with which those claims are wrong.

Interestingly, analyses of decisions based on Bayesian approaches have revealed that they are no better than frequentist decisions as regards Type-I and Type-II errors and that parametric assumptions (i.e., the choice of prior and the assumed distribution of the observations) crucially determine the performance of Bayesian methods.
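The role of these parametric assumptions can be illustrated with a toy Bayesian test. The sketch below is not from the original article: for normally distributed data with known unit variance, it computes the Bayes factor for H0: μ = 0 against H1: μ ~ N(0, τ²) from the sample mean alone (a standard conjugate-normal calculation). The sample values and the prior widths τ are arbitrary choices for the demonstration; the point is that the same data can favor either hypothesis depending on τ.

```python
# Sketch (not from the article): sensitivity of a Bayes factor to the prior.
# Data: x1..xn ~ N(mu, 1). H0: mu = 0 versus H1: mu ~ N(0, tau^2).
# Under both hypotheses the sample mean is normal, so BF01 is a density ratio.
from math import exp, pi, sqrt

def normal_pdf(x, var):
    """Density of a zero-mean normal with the given variance, evaluated at x."""
    return exp(-x * x / (2.0 * var)) / sqrt(2.0 * pi * var)

def bf01(xbar, n, tau, sigma=1.0):
    """Bayes factor in favor of H0 for the observed sample mean xbar."""
    se2 = sigma**2 / n                    # sampling variance of the mean
    return normal_pdf(xbar, se2) / normal_pdf(xbar, se2 + tau**2)

if __name__ == "__main__":
    # One fixed data set: n = 50 observations with mean 0.3 (two-sided p ~ .03).
    for tau in (0.1, 1.0, 10.0):
        print(f"tau = {tau:5.1f}  ->  BF01 = {bf01(0.3, 50, tau):6.2f}")
```

With a narrow prior (τ = 0.1) the BF favors H1, whereas a very diffuse prior (τ = 10) makes the very same data favor H0 by a sizable factor: the reported odds mix the evidence in the data with the analyst's choice of τ, which is the point made in the text.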
For instance, Bayesian estimation is also subject to potentially large bias and lack of precision (Alcalá-Quintana and García-Pérez, 2004; García-Pérez and Alcalá-Quintana, 2007), the coverage probability of Bayesian credible intervals can be worse than that of frequentist confidence intervals (Agresti and Min, 2005; Alcalá-Quintana and García-Pérez, 2005), and the Bayesian posterior probability in hypothesis testing can be arbitrarily large or small (Zaslavsky, 2010). On another front, use of BIC for model selection may discard a true model as often as 20% of the times, while a concurrent 0.05-size chi-square test rejects the true model between 3 and 7% of the times, closely approximating its stated performance (García-Pérez and Alcalá-Quintana, 2012). In any case, the probabilities of Type-I and Type-II errors in practical decisions made from the results of Bayesian analyses will always be unknown and beyond control.

IMPROVING THE SCV OF RESEARCH
Most breaches of SCV arise from a poor understanding of statistical procedures and the resultant inadequate usage. These problems can be easily corrected, as illustrated in this paper, but they would not have arisen if researchers had had better statistical training in the first place. There was a time in which one simply could not run statistical tests without a moderate understanding of NHST. But these days the application of statistical tests is only a mouse-click away, and all that students regard as necessary is learning the rule by which p values pouring out of statistical software tell them whether the hypothesis is to be accepted or rejected, as the study of Hoekstra et al. (2012) seems to reveal.

One way to eradicate the problem is by improving statistical education at undergraduate and graduate levels, perhaps not just focusing on giving formal training on a number of methods but by providing students with the necessary foundations that will subsequently allow them to understand and apply methods for which they received no explicit formal training. In their analysis of statistical errors in published papers, Milligan and McFillen (1984, p. 461) concluded that “in doing projects, it is not unusual for applied researchers or students to use or apply a statistical procedure for which they have received no formal training. This is as inappropriate as a person conducting research in a given content area before reading the existing background literature on the topic. The individual simply is not prepared to conduct quality research. The attitude that statistical technology is secondary or less important to a person’s formal training is shortsighted. Researchers are unlikely to master additional statistical concepts and techniques after leaving school. Thus, the statistical training in many programs must be strengthened. A single course in experimental design and a single course in multivariate analysis is probably insufficient for the typical student to master the course material. Someone who is trained only in theory and content will be ill-prepared to contribute to the advancement of the field or to critically evaluate the research of others.” But statistical education does not seem to have changed much over the subsequent 25 years, as revealed by survey studies conducted by Aiken et al. (1990), Friedrich et al. (2000), Aiken et al. (2008), and Henson et al. (2010). Certainly some work remains to be done in this arena, and I can only second the proposals made in the papers just cited. But there is also the problem of the unhealthy over-reliance on narrow-breadth, clickable software for data analysis, which practically obliterates any efforts that are made to teach and promote alternatives (see the list of “Pragmatic Factors” discussed by Borsboom, 2006, pp. 431–434).

The last trench in the battle against breaches of SCV is occupied by journal editors and reviewers. Ideally, they also watch for problems in these respects. There is no known in-depth analysis of the review process in psychology journals (but see Nickerson, 2005), and some evidence reveals that the focus of the review process is not always on the quality or validity of the research (Sternberg, 2002; Nickerson, 2005). Simmons et al. (2011) and Wicherts et al. (2012) have discussed empirical evidence of inadequate research and review practices (some of which threaten SCV) and have proposed detailed schemes through which feasible changes in editorial policies may help eradicate not only common threats to SCV but also other threats to research validity in general. I can only second proposals of this type. Reviewers and editors have the responsibility of filtering out (or requesting amendments to) research that does not meet the journal’s standards, including SCV. The analyses of Milligan and McFillen (1984) and Nieuwenhuis et al. (2011) reveal a sizeable number of published papers with statistical errors. This indicates that some work remains to be done in this arena too, and some journals have indeed started to take action (see Aickin, 2011).

ACKNOWLEDGMENTS
This research was supported by grant PSI2009-08800 (Ministerio de Ciencia e Innovación, Spain).
REFERENCES
Agresti, A., and Min, Y. (2005). Frequentist performance of Bayesian confidence intervals for comparing proportions in 2 × 2 contingency tables. Biometrics 61, 515–523.
Ahn, C., Overall, J. E., and Tonidandel, S. (2001). Sample size and power calculations in repeated measurement analysis. Comput. Methods Programs Biomed. 64, 121–124.
Aickin, M. (2011). Test ban: policy of the Journal of Alternative and Complementary Medicine with regard to an increasingly common statistical error. J. Altern. Complement. Med. 17, 1093–1094.
Aiken, L. S., West, S. G., and Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD programs in North America. Am. Psychol. 63, 32–50.
Aiken, L. S., West, S. G., Sechrest, L., and Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: a survey of PhD programs in North America. Am. Psychol. 45, 721–734.
Albers, W., Boon, P. C., and Kallenberg, W. C. M. (2000). The asymptotic behavior of tests for normal means based on a variance pre-test. J. Stat. Plan. Inference 88, 47–57.
Alcalá-Quintana, R., and García-Pérez, M. A. (2004). The role of parametric assumptions in adaptive Bayesian estimation. Psychol. Methods 9, 250–271.
Alcalá-Quintana, R., and García-Pérez, M. A. (2005). Stopping rules in Bayesian adaptive threshold estimation. Spat. Vis. 18, 347–374.
Anscombe, F. J. (1953). Sequential estimation. J. R. Stat. Soc. Series B 15, 1–29.
Anscombe, F. J. (1954). Fixed-sample-size analysis of sequential observations. Biometrics 10, 89–100.
Armitage, P., McPherson, C. K., and Rowe, B. C. (1969). Repeated significance tests on accumulating data. J. R. Stat. Soc. Ser. A 132, 235–244.
Armstrong, L., and Marks, L. E. (1997). Differential effect of stimulus context on perceived length: implications for the horizontal–vertical illusion. Percept. Psychophys. 59, 1200–1213.
Austin, J. T., Boyle, K. A., and Lualhati, J. C. (1998). Statistical conclusion validity for organizational science researchers: a review. Organ. Res. Methods 1, 164–208.
Baddeley, A., and Wilson, B. A. (2002). Prose recall and amnesia: implications for the structure of working memory. Neuropsychologia 40, 1737–1743.
Bakker, M., and Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behav. Res. Methods 43, 666–678.
Bauer, P., and Köhne, K. (1994). Evaluation of experiments with adaptive interim analyses. Biometrics 50, 1029–1041.
Beins, B. C. (2009). Research Methods. A Tool for Life, 2nd Edn. Boston, MA: Pearson Education.
Bennett, C. M., Wolford, G. L., and Miller, M. B. (2009). The principled control of false positives in neuroimaging. Soc. Cogn. Affect. Neurosci. 4, 417–422.
Bland, J. M., and Altman, D. G. (2011). Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials 12, 264.
Bollen, K. A. (2002). Latent variables in psychology and the social sciences. Annu. Rev. Psychol. 53, 605–634.
Boos, D. D., and Hughes-Oliver, J. M. (2000). How large does n have to be for Z and t intervals? Am. Stat. 54, 121–128.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika 71, 425–440.
Botella, J., Ximenez, C., Revuelta, J., and Suero, M. (2006). Optimization of sample size in controlled experiments: the CLAST rule. Behav. Res. Methods Instrum. Comput. 38, 65–76.
Campbell, D. T., and Stanley, J. C. (1966). Experimental and Quasi-Experimental Designs for Research. Chicago, IL: Rand McNally.
Caudill, S. B. (1988). Type I errors after preliminary tests for heteroscedasticity. Statistician 37, 65–68.
Chen, Y. H. J., DeMets, D. L., and Lan, K. K. G. (2004). Increasing sample size when the unblinded interim result is promising. Stat. Med. 23, 1023–1038.
Cheng, C. L., and Van Ness, J. W. (1994). On estimating linear relationships when both variables are subject to errors. J. R. Stat. Soc. Series B 56, 167–183.
Cook, T. D., and Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston, MA: Houghton Mifflin.
Crawford, E. D., Blumenstein, B., and Thompson, I. (1998). Type III statistical error. Urology 51, 675.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. Am. Psychol. 12, 671–684.
DeCarlo, L. T. (1998). Signal detection theory and generalized linear models. Psychol. Methods 3, 186–205.
Dell, R. B., Holleran, S., and Ramakrishnan, R. (2002). Sample size determination. ILAR J. 43, 207–213.
Draine, S. C., and Greenwald, A. G. (1998). Replicable unconscious semantic priming. J. Exp. Psychol. Gen. 127, 286–303.
Draper, N. R., and Smith, H. (1998). Applied Regression Analysis, 3rd Edn. New York: Wiley.
Dunn, G. (2007). Regression models for method comparison data. J. Biopharm. Stat. 17, 739–756.
Dunn, G., and Roberts, C. (1999). Modelling method comparison data. Stat. Methods Med. Res. 8, 161–179.
Easterling, R. G., and Anderson, H. E. (1978). The effect of preliminary normality goodness of fit tests on subsequent inference. J. Stat. Comput. Simul. 8, 1–11.
Elvik, R. (1998). Evaluating the statistical conclusion validity of weighted mean results in meta-analysis by analysing funnel graph diagrams. Accid. Anal. Prev. 30, 255–266.
Erceg-Hurn, C. M., and Mirosevich, V. M. (2008). Modern robust statistical methods: an easy way to maximize the accuracy and power of your research. Am. Psychol. 63, 591–601.
Faul, F., Erdfelder, E., Lang, A.-G., and Buchner, A. (2007). G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods 39, 175–191.
Fitts, D. A. (2010a). Improved stopping rules for the design of efficient small-sample experiments in biomedical and biobehavioral research. Behav. Res. Methods 42, 3–22.
Fitts, D. A. (2010b). The variable-criteria sequential stopping rule: generality to unequal sample sizes, unequal variances, or to large ANOVAs. Behav. Res. Methods 42, 918–929.
Fitts, D. A. (2011a). Ethics and animal numbers: informal analyses, uncertain sample sizes, inefficient replications, and Type I errors. J. Am. Assoc. Lab. Anim. Sci. 50, 445–453.
Fitts, D. A. (2011b). Minimizing animal numbers: the variable-criteria sequential stopping rule. Comp. Med. 61, 206–218.
Frick, R. W. (1998). A better stopping rule for conventional statistical tests. Behav. Res. Methods Instrum. Comput. 30, 690–697.
Fried, R., and Dehling, H. (2011). Robust nonparametric tests for the two-sample location problem. Stat. Methods Appl. 20, 409–422.
Friedrich, J., Buday, E., and Kerr, D. (2000). Statistical training in psychology: a national survey and commentary on undergraduate programs. Teach. Psychol. 27, 248–257.
Fuller, W. A. (1987). Measurement Error Models. New York: Wiley.
Gans, D. J. (1981). Use of a preliminary test in comparing two sample means. Commun. Stat. Simul. Comput. 10, 163–174.
García-Pérez, M. A. (2005). On the confidence interval for the binomial parameter. Qual. Quant. 39, 467–481.
García-Pérez, M. A., and Alcalá-Quintana, R. (2007). Bayesian adaptive estimation of arbitrary points on a psychometric function. Br. J. Math. Stat. Psychol. 60, 147–174.
García-Pérez, M. A., and Alcalá-Quintana, R. (2011). Testing equivalence with repeated measures: tests of the difference model of two-alternative forced-choice performance. Span. J. Psychol. 14, 1023–1049.
García-Pérez, M. A., and Alcalá-Quintana, R. (2012). On the discrepant results in synchrony judgment and temporal-order judgment tasks: a quantitative model. Psychon. Bull. Rev. (in press). doi:10.3758/s13423-012-0278-y
García-Pérez, M. A., Alcalá-Quintana, R., and García-Cueto, M. A. (2010). A comparison of anchor-item designs for the concurrent calibration of large banks of Likert-type items. Appl. Psychol. Meas. 34, 580–599.
García-Pérez, M. A., Alcalá-Quintana, R., Woods, R. L., and Peli, E. (2011). Psychometric functions for detection and discrimination with and without flankers. Atten. Percept. Psychophys. 73, 829–853.
García-Pérez, M. A., and Núñez-Antón, V. (2009). Statistical inference involving binomial and negative binomial parameters. Span. J. Psychol. 12, 288–307.
Girden, E. R., and Kabacoff, R. I. (2011). Evaluating Research Articles. From Start to Finish, 3rd Edn. Thousand Oaks, CA: Sage.
Goodwin, C. J. (2010). Research in Psychology. Methods and Design, 6th Edn. Hoboken, NJ: Wiley.
Graybill, F. A. (1958). Determining sample size for a specified width confidence interval. Ann. Math. Stat. 29, 282–287.
Green, B. G. (1982). The perception of distance and location for dual tactile figures. Percept. Psychophys. 31, 315–323.
Greenwald, A. G., Klinger, M. R., and Schuh, E. S. (1995). Activation by marginally perceptible (“subliminal”) stimuli: dissociation of unconscious from conscious cognition. J. Exp. Psychol. Gen. 124, 22–42.
Hawkins, D. M. (2002). Diagnostics for conformity of paired quantitative measurements. Stat. Med. 21, 1913–1935.
Hayes, A. F., and Cai, L. (2007). Further evaluating the conditional decision rule for comparing two independent means. Br. J. Math. Stat. Psychol. 60, 217–244.
Henson, R. K., Hull, D. M., and Williams, C. S. (2010). Methodology in our education research culture: toward a stronger collective quantitative proficiency. Educ. Res. 39, 229–240.
Hoekstra, R., Kiers, H., and Johnson, A. (2012). Are assumptions of well-known statistical techniques checked, and why (not)? Front. Psychol. 3:137. doi:10.3389/fpsyg.2012.00137
Howard, G. S., Obledo, F. H., Cole, D. A., and Maxwell, S. E. (1983). Linked raters’ judgments: combating problems of statistical conclusion validity. Appl. Psychol. Meas. 7, 57–62.
Isaac, P. D. (1970). Linear regression, structural relations, and measurement error. Psychol. Bull. 74, 213–218.
Jan, S.-L., and Shieh, G. (2011). Optimal sample sizes for Welch’s test under various allocation and cost considerations. Behav. Res. Methods 43, 1014–1022.
Jennison, C., and Turnbull, B. W. (1990). Statistical approaches to interim monitoring of clinical trials: a review and commentary. Stat. Sci. 5, 299–317.
John, L. K., Loewenstein, G., and Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532.
Kalantar, A. H., Gelb, R. I., and Alper, J. S. (1995). Biases in summary statistics of slopes and intercepts in linear regression with errors in both variables. Talanta 42, 597–603.
Keselman, H. J., Othman, A. R., Wilcox, R. R., and Fradette, K. (2004). The new and improved two-sample t test. Psychol. Sci. 15, 47–51.
Ketellapper, R. H. (1983). On estimating parameters in a simple linear errors-in-variables model. Technometrics 25, 43–47.
www.frontiersin.org August 2012 | Volume 3 | Article 325 | 25 García-Pérez Statistical conclusion validity
Conflict of Interest Statement: The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Received: 10 May 2012; paper pending published: 29 May 2012; accepted: 14 August 2012; published online: 29 August 2012.

Citation: García-Pérez MA (2012) Statistical conclusion validity: some common threats and simple remedies. Front. Psychology 3:325. doi: 10.3389/fpsyg.2012.00325

This article was submitted to Frontiers in Quantitative Psychology and Measurement, a specialty of Frontiers in Psychology.

Copyright © 2012 García-Pérez. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.
ORIGINAL RESEARCH ARTICLE
published: 15 February 2012
doi: 10.3389/fpsyg.2012.00034

Is coefficient alpha robust to non-normal data?
Yanyan Sheng 1* and Zhaohui Sheng 2
1 Department of Educational Psychology and Special Education, Southern Illinois University, Carbondale, IL, USA 2 Department of Educational Leadership, Western Illinois University, Macomb, IL, USA
Edited by: Jason W. Osborne, Old Dominion University, USA
Reviewed by: Pamela Kaliski, College Board, USA; James Stamey, Baylor University, USA
*Correspondence: Yanyan Sheng, Department of Educational Psychology and Special Education, Southern Illinois University, Wham 223, MC 4618, Carbondale, IL 62901-4618, USA. e-mail: [email protected]

Coefficient alpha has been a widely used measure by which internal consistency reliability is assessed. In addition to essential tau-equivalence and uncorrelated errors, normality has been noted as another important assumption for alpha. Earlier work on evaluating this assumption considered either exclusively non-normal error score distributions, or limited conditions. In view of this and the availability of advanced methods for generating univariate non-normal data, Monte Carlo simulations were conducted to show that non-normal distributions for true or error scores do create problems for using alpha to estimate the internal consistency reliability. The sample coefficient alpha is affected by leptokurtic true score distributions, or skewed and/or kurtotic error score distributions. Increased sample sizes, not test lengths, help improve the accuracy, bias, or precision of using it with non-normal data.

Keywords: coefficient alpha, true score distribution, error score distribution, non-normality, skew, kurtosis, Monte Carlo, power method polynomials
INTRODUCTION
Coefficient alpha (Guttman, 1945; Cronbach, 1951) has been one of the most commonly used measures today to assess internal consistency reliability, despite criticisms of its use (e.g., Raykov, 1998; Green and Hershberger, 2000; Green and Yang, 2009; Sijtsma, 2009). The derivation of the coefficient is based on classical test theory (CTT; Lord and Novick, 1968), which posits that a person's observed score is a linear function of his/her unobserved true score (or underlying construct) and error score. In the theory, measures can be parallel, (essentially) tau-equivalent, or congeneric, depending on the assumptions about units of measurement, degrees of precision, and/or error variances. When two tests are designed to measure the same latent construct, they are parallel if they measure it with identical units of measurement, the same precision, and the same amounts of error; tau-equivalent if they measure it with the same units and the same precision but possibly different error variances; essentially tau-equivalent if they assess it using the same units but with possibly different precision and different amounts of error; or congeneric if they assess it with possibly different units of measurement, precision, and amounts of error (Lord and Novick, 1968; Graham, 2006). From parallel to congeneric, tests require successively less strict assumptions and hence become more general. Studies (Lord and Novick, 1968, pp. 87–91; see also Novick and Lewis, 1967, pp. 6–7) have shown formally that the population coefficient alpha equals internal consistency reliability for tests that are tau-equivalent or at least essentially tau-equivalent; it underestimates the actual reliability for the more general congeneric test. Apart from essential tau-equivalence, coefficient alpha requires two additional assumptions: uncorrelated errors (Guttman, 1945; Novick and Lewis, 1967) and normality (e.g., Zumbo, 1999). Over the past decades, studies have well documented the effects of violating essential tau-equivalence and uncorrelated errors (e.g., Zimmerman et al., 1993; Miller, 1995; Raykov, 1998; Green and Hershberger, 2000; Zumbo and Rupp, 2004; Graham, 2006; Green and Yang, 2009), which have been considered the two major assumptions for alpha. The normality assumption, however, has received little attention. This could be a concern in typical applications, where the population coefficient is an unknown parameter that has to be estimated using the sample coefficient. When data are normally distributed, the sample coefficient alpha has been shown to be an unbiased estimate of the population coefficient alpha (Kristof, 1963; van Zyl et al., 2000); however, less is known about situations when data are non-normal.

Over the past decades, the effect of departure from normality on the sample coefficient alpha has been evaluated by Bay (1973), Shultz (1993), and Zimmerman et al. (1993) using Monte Carlo simulations, and they reached different conclusions. In particular, Bay (1973) concluded that a leptokurtic true score distribution could cause coefficient alpha to seriously underestimate internal consistency reliability. Zimmerman et al. (1993) and Shultz (1993), on the other hand, found that the sample coefficient alpha was fairly robust to departure from the normality assumption. The three studies differed in design, in the factors manipulated, and in the non-normal distributions considered, but each is limited in certain ways. For example, Zimmerman et al. (1993) and Shultz (1993) evaluated only the effect of non-normal error score distributions, and Bay (1973), while looking at the effect of non-normal true score or error score distributions, studied only conditions with 30 subjects and 8 test items. Moreover, these studies considered only two or three non-normal scenarios. Specifically, Bay (1973) employed uniform (symmetric platykurtic) and exponential (non-symmetric leptokurtic with positive skew) distributions for both true and error scores. Zimmerman et al. (1993) generated error scores from uniform, exponential, and mixed normal (symmetric leptokurtic) distributions, while Shultz (1993) generated them using exponential, mixed normal, and negative exponential
(non-symmetric leptokurtic with negative skew) distributions. Since the presence of skew and/or kurtosis determines whether and how a distribution departs from the normal pattern, it is desirable to consider distributions with varying levels of skew and kurtosis so that a set of guidelines can be provided. Generating univariate non-normal data with specified moments can be achieved via power method polynomials (Fleishman, 1978), and their recent developments (e.g., Headrick, 2010) make it possible to consider more combinations of skew and kurtosis.

Further, in the actual design of a reliability study, sample size determination is frequently an important and difficult aspect. The literature offers widely different recommendations, ranging from 15 to 20 (Fleiss, 1986) or a minimum of 30 (Johanson and Brooks, 2010) to a minimum of 300 (Nunnally and Bernstein, 1994). Although Bay (1973) used analytical derivations to suggest that coefficient alpha should be robust to violation of the normality assumption if the sample size is large, or if the number of items is large and the true score kurtosis is close to zero, it is never clear how many subjects and/or items are desirable in such situations.

In view of the above, the purpose of this study is to investigate the effect of non-normality (especially the presence of skew and/or kurtosis) on reliability estimation, and how sample sizes and test lengths affect the estimation with non-normal data. It is believed that the results will not only shed insight on how non-normality affects coefficient alpha, but also provide a set of guidelines for researchers when specifying the numbers of subjects and items in a reliability study.

MATERIALS AND METHODS
This section starts with a brief review of the CTT model underlying coefficient alpha. Then the procedures for simulating observed scores used in the Monte Carlo study are described, followed by the measures used to evaluate the performance of the sample alpha in each simulated situation.

PRELIMINARIES
Coefficient alpha is typically associated with true score theory (Guttman, 1945; Cronbach, 1951; Lord and Novick, 1968), where the test score for person i on item j, denoted as Xij, is assumed to be a linear function of a true score (tij) and an error score (eij):

    Xij = tij + eij,    (1)

i = 1, ..., n and j = 1, ..., k, where E(eij) = 0, ρ_te = 0, and ρ_(eij, ei'j') = 0. Here, eij denotes random error that reflects unpredictable trial-by-trial fluctuations; it has to be differentiated from systematic error that reflects situational or individual effects that may be specified. In the theory, items are usually assumed to be tau-equivalent, where true scores are restricted to be the same across items, or essentially tau-equivalent, where they are allowed to differ from item to item by a constant (υj). Under these conditions, (1) becomes

    Xij = ti + eij    (2)

for tau-equivalence, and

    Xij = ti + υj + eij,    (3)

where Σj υj = 0, for essential tau-equivalence.

Summing across the k items, we obtain a composite score (Xi+) and a scale error score (ei+). The variance of the composite scores is then the sum of the true score and scale error score variances:

    σ²_X+ = σ²_t + σ²_e+.    (4)

The reliability coefficient, ρ_XX', is defined as the proportion of composite score variance that is due to true score variance:

    ρ_XX' = σ²_t / σ²_X+.    (5)

Under (essential) tau-equivalence, that is, for the models in (2) and (3), the population coefficient alpha, defined as

    α = [k / (k − 1)] (Σ_(j≠j') σ_(Xj Xj') / σ²_X+),

or

    α = [k / (k − 1)] (1 − Σ_(j=1..k) σ²_j / σ²_X+),    (6)

is equal to the reliability defined in (5). As was noted, ρ_XX' and α focus on the amount of random error and do not evaluate error that may be systematic.

Although the derivation of coefficient alpha based on Lord and Novick (1968) does not require distributional assumptions for ti and eij, its estimation does (see Shultz, 1993; Zumbo, 1999), as the sample coefficient alpha estimated using sample variances s²,

    α̂ = [k / (k − 1)] (1 − Σ_(j=1..k) s²_j / s²_X+),    (7)

is shown to be the maximum likelihood estimator of the population alpha assuming normal distributions (Kristof, 1963; van Zyl et al., 2000). Typically, we assume ti ∼ N(μ_t, σ²_t) and eij ∼ N(0, σ²_e), where σ²_e has to be differentiated from the scale error score variance σ²_e+ defined in (4).

STUDY DESIGN
To evaluate the performance of the sample alpha as defined in (7) in situations where true score or error score distributions depart from normality, a Monte Carlo simulation study was carried out in which test scores of n persons (n = 30, 50, 100, 1000) on k items (k = 5, 10, 30) were generated assuming tau-equivalence, with the population reliability coefficient (ρ_XX') specified as 0.3, 0.6, or 0.8 to correspond to unacceptable, acceptable, or very good reliability (Caplan et al., 1984, p. 306; DeVellis, 1991,
p. 85; Nunnally, 1967, p. 226). These are referred to as small, moderate, and high reliabilities in subsequent discussions. Specifically, true scores (ti) and error scores (eij) were simulated from their respective distributions with σ²_e = 1, μ_t = 5, and σ²_t = σ²_e ρ_XX' / [(1 − ρ_XX') k]. The observed scores (Xij) were subsequently obtained using Eq. (2).

In addition, true score or error score distributions were manipulated to be symmetric (so that skew, γ1, is 0) or non-symmetric (γ1 > 0), with kurtosis (γ2) being 0, negative, or positive. Only positively skewed distributions were considered because, by symmetry, negative skew should have the same effect as positive skew. Generating non-normal distributions in this study involves the use of power method polynomials. Fleishman (1978) introduced this popular moment matching technique for generating univariate non-normal distributions, and Headrick (2002, 2010) further extended it from third-order to fifth-order polynomials to lower the skew and kurtosis boundary. As pointed out by Headrick (2010, p. 26), for distributions with a mean of 0 and a variance of 1, the skew and kurtosis have to satisfy γ2 ≥ γ1² − 2, and hence it is not plausible to consider all possible combinations of skew and kurtosis using power method polynomials. Given this, six distributions with the following combinations of skew and kurtosis were considered:

1. γ1 = 0, γ2 = 0 (normal distribution);
2. γ1 = 0, γ2 = −1.385 (symmetric platykurtic distribution);
3. γ1 = 0, γ2 = 25 (symmetric leptokurtic distribution);
4. γ1 = 0.96, γ2 = 0.13 (non-symmetric distribution);
5. γ1 = 0.48, γ2 = −0.92 (non-symmetric platykurtic distribution);
6. γ1 = 2.5, γ2 = 25 (non-symmetric leptokurtic distribution).

A normal distribution was included as a baseline against which the non-normal distributions could be compared. To generate univariate distributions using the fifth-order polynomial transformation, a random variate Z is first generated from a standard normal distribution, Z ∼ N(0, 1). Then the polynomial

    Y = c0 + c1 Z + c2 Z² + c3 Z³ + c4 Z⁴ + c5 Z⁵    (8)

is used to obtain Y. With appropriate coefficients (c0, ..., c5), Y follows a distribution with a mean of 0, a variance of 1, and the desired levels of skew and kurtosis (see Headrick, 2002, for a detailed description of the procedure). A subsequent linear transformation rescales the distribution to the desired location and scale. In this study, Y could be the true score (ti) or the error score (eij). For the six distributions considered for ti or eij herein, the corresponding coefficients are:

1. c0 = 0, c1 = 1, c2 = 0, c3 = 0, c4 = 0, c5 = 0;
2. c0 = 0, c1 = 1.643377, c2 = 0, c3 = −0.319988, c4 = 0, c5 = 0.011344;
3. c0 = 0, c1 = 0.262543, c2 = 0, c3 = 0.201036, c4 = 0, c5 = 0.000162;
4. c0 = −0.446924, c1 = 1.242521, c2 = 0.500764, c3 = −0.184710, c4 = −0.017947, c5 = 0.003159;
5. c0 = −0.276330, c1 = 1.506715, c2 = 0.311114, c3 = −0.274078, c4 = −0.011595, c5 = 0.007683;
6. c0 = −0.304852, c1 = 0.381063, c2 = 0.356941, c3 = 0.132688, c4 = −0.017363, c5 = 0.003570.

It is noted that the effect of the true score or error score distribution was investigated independently, holding the other constant by assuming it to be normal.

Hence, a total of 4 (sample sizes) × 3 (test lengths) × 3 (levels of population reliability) × 6 (distributions) × 2 (true or error score) = 432 conditions were considered in the simulation study. Each condition involved 100,000 replications, in which coefficient alpha was estimated using Eq. (7) for the simulated test scores (Xij). The 100,000 estimates of α can be considered random samples from the sampling distribution of α̂, and summary statistics of this distribution, including the observed mean, SD, and 95% interval, provide information about it. In particular, the observed mean indicates whether the sample coefficient is biased: if it equals α, α̂ is unbiased; otherwise, it is biased positively or negatively depending on whether it is larger or smaller than α. The SD of the sampling distribution is what we usually call the SE; it reflects the uncertainty in estimating α, with a smaller SE suggesting more precision and hence less uncertainty in the estimation. The SE is directly related to the 95% observed interval: the larger it is, the more spread out the distribution and the wider the interval. The observed interval contains about 95% of the α̂ values around the center of the empirical sampling distribution. If α falls inside the interval, α̂ is not significantly different from α even if it is not unbiased. On the other hand, if α falls outside the interval, which means that 95% of the estimates differ from α, we can consider α̂ significantly different from α.

In addition to these summary statistics, the accuracy of the estimate was evaluated by the root mean square error (RMSE) and bias, defined as

    RMSE = √[ Σ (α̂ − α)² / 100,000 ]    (9)

and

    bias = Σ (α̂ − α) / 100,000,    (10)

respectively. The larger the RMSE, the less accurate the sample coefficient is in estimating the population coefficient; similarly, the larger the absolute value of the bias, the more biased the sample coefficient. As the equations suggest, RMSE is always positive, with values close to zero reflecting less error in estimating the actual reliability. Bias, on the other hand, can be negative or positive: a positive bias suggests that the sample coefficient tends to overestimate the reliability, and a negative bias that it tends to underestimate it. In effect, bias provides similar information to the observed mean of the sampling distribution of α̂.
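The data-generating scheme described above (Eqs 2 and 8) can be sketched in a few lines. The study's own code is MATLAB (provided in its Appendix), so the Python below is an illustrative re-implementation, not the authors' program; the function names `power_method` and `simulate_scores` are ours. It uses the coefficients listed above for distributions 2 and 6 and builds tau-equivalent scores with σ²_e = 1, μ_t = 5, and σ²_t = ρ_XX' / [(1 − ρ_XX') k].

```python
import numpy as np

# Coefficients from the list above: distribution 2 (skew 0, kurtosis -1.385)
# and distribution 6 (skew 2.5, kurtosis 25).
COEFS = {
    2: (0.0, 1.643377, 0.0, -0.319988, 0.0, 0.011344),
    6: (-0.304852, 0.381063, 0.356941, 0.132688, -0.017363, 0.003570),
}

def power_method(z, coefs):
    """Fifth-order power method polynomial, Eq. (8):
    Y = c0 + c1*Z + c2*Z^2 + c3*Z^3 + c4*Z^4 + c5*Z^5, with Z ~ N(0, 1)."""
    c0, c1, c2, c3, c4, c5 = coefs
    # Horner's scheme for the quintic in z
    return c0 + z * (c1 + z * (c2 + z * (c3 + z * (c4 + z * c5))))

def simulate_scores(n, k, rho, dist, rng):
    """Tau-equivalent test scores, Eq. (2): X_ij = t_i + e_ij, with
    sigma_e^2 = 1, mu_t = 5, sigma_t^2 = rho / ((1 - rho) * k), and a
    possibly non-normal true score distribution; errors are normal."""
    sigma_t = np.sqrt(rho / ((1 - rho) * k))
    t = 5 + sigma_t * power_method(rng.standard_normal((n, 1)), COEFS[dist])
    e = rng.standard_normal((n, k))   # normal error scores, variance 1
    return t + e                      # n-by-k matrix of observed scores
```

Because the coefficients are constructed so that Y has mean 0 and variance 1, rescaling by σ_t and shifting by μ_t yields true scores with the intended first two moments while preserving the target skew and kurtosis.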
RESULTS
The simulations were carried out using MATLAB (MathWorks, 2010), with the source code provided in the Section "Appendix." Simulation results are summarized in Tables 1–3 for conditions where true scores follow one of the six distributions specified in the previous section. Here, results from the five non-normal distributions were mainly compared with those from the normal distribution to determine whether α̂ was affected by non-normality in true scores. Take as an example the condition in which a test of 5 items with an actual reliability of 0.3 was given to 30 persons. A normal distribution resulted in an observed mean of 0.230 and a SE of 0.241 for the sampling distribution of α̂ (see Table 1). Compared with it, a symmetric platykurtic distribution, with an observed mean of 0.234 and a SE of 0.235, did not differ much.
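The quantities reported in the tables — the observed mean and SD (empirical SE) of α̂ across replications, the 95% interval, and the RMSE and bias of Eqs (9) and (10) — can be computed as in the following Python sketch. Again, this is illustrative rather than the study's MATLAB code, and `sample_alpha` and `summarize` are our own helper names.

```python
import numpy as np

def sample_alpha(scores):
    """Sample coefficient alpha, Eq. (7), for an n-by-k score matrix."""
    n, k = scores.shape
    item_vars = scores.var(axis=0, ddof=1)          # s_j^2 for each item
    composite_var = scores.sum(axis=1).var(ddof=1)  # s_{X+}^2
    return (k / (k - 1)) * (1.0 - item_vars.sum() / composite_var)

def summarize(alpha_hats, alpha):
    """Summary statistics over the replications of one condition:
    observed mean, SD (the empirical SE), central 95% interval, and
    the RMSE and bias of Eqs (9) and (10)."""
    a = np.asarray(alpha_hats, dtype=float)
    lo, hi = np.percentile(a, [2.5, 97.5])
    return {
        "mean": a.mean(),
        "se": a.std(ddof=1),
        "interval_95": (lo, hi),
        "rmse": np.sqrt(np.mean((a - alpha) ** 2)),
        "bias": np.mean(a - alpha),
    }
```

Repeating `sample_alpha` over many simulated data sets for one condition and passing the estimates to `summarize` reproduces the kind of entries shown in Tables 1–3.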
Table 1 | Observed mean and SD of the sample alpha (α̂) for the simulated situations where the true score (ti) distribution is normal or non-normal.

              Mean (α̂)                                    SD (α̂)
n     k       dist1  dist2  dist3  dist4  dist5  dist6    dist1  dist2  dist3  dist4  dist5  dist6