Research on Research
A Meta-Scientific Study of Problems and Solutions in Psychological Science

Michèle B. Nuijten

Author: Michèle B. Nuijten
Cover design: Niels Bongers – www.nielsbongers.nl
Printed by: Gildeprint – www.gildeprint.nl
ISBN: 978-94-6233-928-6

Doctoral dissertation to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, prof. dr. E. H. L. Aarts, to be defended in public before a committee appointed by the Doctorate Board, in the auditorium of the university on Wednesday 30 May 2018 at 14:00, by Michèle Bieneke Nuijten, born in Utrecht.

Doctoral committee

Supervisors (promotores):
Prof. dr. J. M. Wicherts
Prof. dr. M. A. L. M. van Assen

Other members:
Prof. dr. C. D. Chambers
Prof. dr. E. J. Wagenmakers
Prof. dr. R. A. Zwaan
Dr. M. Bakker

Contents

1 Introduction

Part I: Statistical Reporting Inconsistencies
2 The prevalence of statistical reporting errors in psychology (1985-2013)
3 The validity of the tool “statcheck” in discovering statistical reporting inconsistencies
4 Journal data sharing policies and statistical reporting inconsistencies in psychology
5 Preventing statistical errors in scientific journals
6 Discussion Part I

Part II: Bias in Effect Sizes
7 The replication paradox: combining studies can decrease accuracy of effect size estimates
8 Standard analyses fail to show that US studies overestimate effect sizes in softer research
9 Effect sizes, power, and biases in intelligence research: a meta-meta-analysis
10 Discussion Part II

11 Epilogue
References
Summary
Nederlandse samenvatting (Dutch summary)
Dankwoord (Acknowledgements)

Chapter 1
Introduction

Can we trust psychological research findings? This question is being asked more and more often, and there is growing concern that many published findings are overly optimistic (Francis, 2014; Ioannidis, 2005, 2008; John, Loewenstein, & Prelec, 2012; Open Science Collaboration, 2015; Simmons, Nelson, & Simonsohn, 2011).
An increasing number of studies show that we might have good reason to doubt the validity of published psychological findings, and researchers are even starting to speak of a “crisis of confidence” or a “replicability crisis” (Baker, 2016a; Pashler & Harris, 2012; Pashler & Wagenmakers, 2012; Spellman, 2015).

1.1 Replicability in Psychology

The growing concern about psychology’s trustworthiness is fueled by the finding that a large number of published psychological findings could not be replicated in novel samples. For instance, the large-scale, collaborative Reproducibility Project: Psychology (RPP) investigated the replicability of 100 psychology studies (Open Science Collaboration, 2015). Two of the main findings in this project were that the percentage of statistically significant effects dropped from 97% in the original studies to only 36% in the replications, and that the effect sizes in the replications were on average only about half the size of those in the original studies. Other multi-lab initiatives also failed to replicate key findings in psychology (Alogna et al., 2014; Eerland et al., 2016; Hagger et al., 2016; Wagenmakers et al., 2016).

There are several possible explanations for the low replicability rates in psychology. One possibility is that meaningful differences between the original studies and their replications caused the differences in results (Baumeister, 2016; Dijksterhuis, 2014; Iso-Ahola, 2017; Stroebe & Strack, 2014). Indeed, there are indications that some effects show large between-study variability, which could explain the low replicability rates (Klein et al., 2014). Another explanation, however, is that the original studies overestimated the effects or were false positive (chance) findings.

1.2 Bias and Errors

Several research findings are in line with the notion that published effects are overestimated. For instance, the large majority of studies in psychology find support for the tested hypothesis (Fanelli, 2010; Francis, 2014; Sterling, Rosenbaum, & Weinkam, 1995). However, such a high rate of positive results is incompatible with the generally low statistical power of studies in the psychological literature (Bakker, van Dijk, & Wicherts, 2012; Button et al., 2013; Cohen, 1962; Jennions & Møller, 2003; Maxwell, 2004; Schimmack, 2012). Low power decreases the probability that a study finds a significant effect. Conversely, and perhaps counterintuitively, the lower the power, the higher the probability that a significant finding is a false positive (a numerical illustration follows below). The large number of underpowered studies in psychology that do find significant effects might therefore indicate a problem with the trustworthiness of these findings.

The notion that many findings are overestimated also becomes clear in meta-analyses. Meta-analysis is a crucial scientific tool to quantitatively synthesize the results of different studies on the same research question (Borenstein, Hedges, Higgins, & Rothstein, 2009). The results of meta-analyses inspire policies and treatments, so it is essential that the effects reported in them are valid. However, in many fields meta-analytic effects appear to be overestimated (Ferguson & Brannick, 2012; Ioannidis, 2011; Niemeyer, Musch, & Pietrowsky, 2012, 2013; Sterne, Gavaghan, & Egger, 2000; Sutton, Duval, Tweedie, Abrams, & Jones, 2000). One of the main causes seems to be publication bias: the phenomenon that statistically significant findings have a higher probability of being published than nonsignificant findings (Greenwald, 1975).
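To make the point about low power concrete, the relationship can be written out with Bayes’ rule: among significant results, the proportion of false positives is α(1 − π) / (α(1 − π) + power × π), where π is the proportion of tested hypotheses that are true. The short Python sketch below evaluates this for a few power values; the significance level, the assumed value of π, and the power values are illustrative choices, not estimates taken from the psychological literature.

```python
# Illustrative sketch: the probability that a significant result is a false
# positive, as a function of statistical power. All numbers are assumptions
# chosen for illustration (alpha = .05; half of tested hypotheses are true).

alpha = 0.05       # significance level
prior_true = 0.5   # assumed proportion of tested hypotheses that are true

def false_positive_risk(power, alpha=alpha, prior_true=prior_true):
    """P(no real effect | significant result), by Bayes' rule."""
    true_positives = power * prior_true          # real effects detected
    false_positives = alpha * (1 - prior_true)   # null effects flagged anyway
    return false_positives / (true_positives + false_positives)

for power in (0.80, 0.50, 0.25, 0.10):
    print(f"power = {power:.2f} -> P(false positive | significant) = "
          f"{false_positive_risk(power):.2f}")

# power = 0.80 -> P(false positive | significant) = 0.06
# power = 0.50 -> P(false positive | significant) = 0.09
# power = 0.25 -> P(false positive | significant) = 0.17
# power = 0.10 -> P(false positive | significant) = 0.33
```

Under these assumed values, dropping power from .80 to .10 raises the risk that a significant finding is spurious from roughly 6% to roughly 33%; with a lower prior proportion of true hypotheses the risk would be higher still.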
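The mechanism by which publication bias inflates published and meta-analytic effect sizes can likewise be illustrated with a small simulation. The sketch below assumes a true standardized effect of d = 0.2 and 25 participants per group; these values are illustrative and not taken from the studies cited above. Averaging only the studies that happen to reach significance, as a meta-analysis of a selectively published literature effectively does, yields a substantially inflated estimate.

```python
# Simulation sketch: if only significant studies are published, the published
# (and meta-analytic) effect sizes overestimate the true effect. The true
# effect, sample size, and number of studies are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
true_d, n_per_group, n_studies = 0.2, 25, 10_000

observed_d, significant = [], []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    t_stat, p_value = stats.ttest_ind(treatment, control)
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    observed_d.append((treatment.mean() - control.mean()) / pooled_sd)  # Cohen's d
    significant.append(p_value < .05)

observed_d, significant = np.array(observed_d), np.array(significant)
print(f"share of studies reaching p < .05 (power): {significant.mean():.2f}")
print(f"mean d across all studies:                 {observed_d.mean():.2f}")
print(f"mean d across significant studies only:    {observed_d[significant].mean():.2f}")
# The first two values stay close to .10 and .20; the last is roughly three
# times the true effect, purely because of selection on significance.
```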
The evidence that the field of psychology is affected by publication bias is overwhelming. Studies found that manuscripts without significant results are both less likely to be submitted and less likely to be accepted for publication (Cooper, DeNeve, & Charlton, 1997; Coursol & Wagner, 1986; Dickersin, Chan, Chalmers, Sacks, & Smith, 1987; Epstein, 1990; Franco, Malhotra, & Simonovits, 2014; Greenwald, 1975; Mahoney, 1977). Furthermore, published studies seem to have systematically larger effects than unpublished ones (Franco et al., 2014; Polanin, Tanner-Smith, & Hennessy, 2015).

The de facto requirement to report statistically significant results in journal articles can lead to unwanted strategic behavior in data analysis (Bakker et al., 2012). Data analysis in psychology is very flexible: there are many possible statistical analyses to answer the same research question (Gelman & Loken, 2014; Wicherts et al., 2016). It can be shown that strategic use of this flexibility will almost always result in at least one significant finding, one that is likely to be a false positive (Bakker et al., 2012; Simmons et al., 2011); a simulation sketch below illustrates this point. This becomes even more problematic if only the analyses that “worked” are reported and presented as if they were planned from the start (Kerr, 1998; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Survey results show that many psychologists admit to such “questionable research practices” (QRPs; Agnoli, Wicherts, Veldkamp, Albiero, & Cubelli, 2017; John et al., 2012), and use of study registers and later disclosures by researchers provide direct evidence that some of these practices are indeed quite common (Franco, Malhotra, & Simonovits, 2016; LeBel et al., 2013).

Another example of a QRP that illustrates the strong focus on finding significant results is wrongly rounding down p-values to < .05. This is a particularly surprising strategy, because such rounding can readily be detected in published papers. If a p-value is wrongly rounded down, it often leads to a statistical reporting inconsistency. Statistical reporting inconsistencies occur when the test statistic, the degrees of freedom, and the p-value in a null hypothesis significance test (NHST) are not internally consistent. If the reported p-value is significant whereas the recalculated p-value based on the reported degrees of freedom and test statistic is not, or vice versa, this is considered a gross inconsistency. Several studies found a high prevalence of such reporting inconsistencies (e.g., Bakker & Wicherts, 2011; Caperos & Pardo, 2013). Even though the majority of inconsistencies seemed to be innocent typos and rounding errors, there is evidence for a systematic bias towards finding significant results, in line with the notion that some researchers may wrongly round down their p-values.
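The consistency check described in the previous paragraph can be sketched in a few lines of Python. This is only a minimal illustration of the idea behind tools such as statcheck (an R package), not a reproduction of that tool’s decision rules, and the reported result in the example is made up.

```python
# Minimal sketch of a statistical reporting consistency check: recompute the
# p-value from a reported t statistic and its degrees of freedom, and compare
# it with the reported p-value. Simplified illustration only.
from scipy import stats

def check_t_test(t, df, reported_p, alpha=.05, decimals=2):
    """Return (recomputed p, inconsistent?, grossly inconsistent?) for a
    reported two-sided t test."""
    recomputed_p = 2 * stats.t.sf(abs(t), df)
    # Allow for the reported p-value having been rounded to `decimals` places
    inconsistent = round(recomputed_p, decimals) != round(reported_p, decimals)
    # Gross inconsistency: the recomputation changes the significance decision
    gross = inconsistent and ((reported_p < alpha) != (recomputed_p < alpha))
    return recomputed_p, inconsistent, gross

# Hypothetical reported result: "t(28) = 1.20, p = .03"
# The recomputed p is about .24, so this would be a gross inconsistency.
print(check_t_test(t=1.20, df=28, reported_p=.03))
```

A full checker also has to allow for rounding of the reported test statistic and for one-sided tests; the point here is only the recompute-and-compare logic.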
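Returning to the claim above that flexible analyses almost guarantee a significant result: the simulation sketch below reduces that flexibility to a single researcher degree of freedom, namely choosing among several outcome measures, and this alone pushes the false positive rate well above the nominal 5%. The group size, the number of outcomes, and the number of simulated studies are illustrative assumptions.

```python
# Simulation sketch: even with no true effect anywhere, testing several
# outcome measures and reporting any significant one inflates the false
# positive rate far above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n_per_group, n_outcomes, n_sims = 20, 5, 5_000

at_least_one_hit = 0
for _ in range(n_sims):
    # Both groups come from the same distribution: every "effect" is spurious
    group_a = rng.normal(size=(n_per_group, n_outcomes))
    group_b = rng.normal(size=(n_per_group, n_outcomes))
    p_values = [stats.ttest_ind(group_a[:, j], group_b[:, j]).pvalue
                for j in range(n_outcomes)]
    at_least_one_hit += any(p < .05 for p in p_values)

print(f"P(at least one 'significant' finding): {at_least_one_hit / n_sims:.2f}")
# Roughly .23 with five independent outcomes, versus the nominal .05.
```

Real analytic flexibility (covariates, exclusions, subgroups, alternative tests) offers far more than five options, so the inflation in practice can be considerably larger.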