STATISTICAL TOOLS

A replication crisis in methodological research?

Statisticians have been keen to critique statistical aspects of the “replication crisis” in other scientific disciplines. But new statistical tools are often published and promoted without any thought to replicability. This needs to change, argue Anne‑Laure Boulesteix, Sabine Hoffmann, Alethea Charlton and Heidi Seibold

Imagine you need to take a drug. A new drug is available that has been investigated mainly through in vitro experiments (i.e., in test tubes, rather than in living organisms). It was shown to improve survival in a few patients selected by the pharmaceutical company. Would you feel safe taking this drug? Probably not, especially if you are a statistician. You would ask why no pre-registered randomised clinical trial was conducted to investigate the efficacy and safety of the drug.

Now imagine you need to use a statistical method for some data analysis. One of the available methods, a new method, was investigated mainly through simulations (i.e. using synthetic data sets). It was shown to be more efficient than other statistical methods in a few example data sets selected by the developer of the method. Would you feel confident using it? Weirdly, if you are a statistician, you probably would.

Sins

Statisticians are among the first to call for more rigour in clinical trials and other applied fields of research. Yet, in their own methodological research, statisticians commonly make claims on the performance and utility of methods based merely on theory, limited simulations, or arbitrarily selected real data examples. In the current replication crisis in science, statisticians caution against questionable research practices in other fields. Yet the same questionable practices should also be avoided in the development and reporting of new statistical methods.

Table 1 sets out “seven sins of methodological research”, inspired by a recent Significance article by Held and Schwab.1 These practices include “fishing expeditions” (i.e. running numerous different analyses in the hope that one will yield good results), then “selectively reporting” the good results while leaving the others in the metaphorical “file drawer”. In some cases, this “file drawer problem” affects whole projects, whose results are deemed unexciting and are therefore not published at all, further exacerbating so-called “publication bias”.
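The mechanics of such a fishing expedition are easy to reproduce. The sketch below is purely hypothetical (it is not taken from any study cited here): it evaluates many equally worthless “method variants” on the same synthetic data, so every variant truly performs at chance level, yet reporting only the best-performing variant makes the method look like a genuine improvement.

```python
# Sketch of a "fishing expedition": every method variant is pure noise
# (random guessing on binary labels), so true accuracy is 0.5 for all.
# Selectively reporting the best of many noisy estimates inflates the
# apparent performance well above chance.
import random

random.seed(1)

N_SAMPLES = 200   # size of one synthetic test set
N_VARIANTS = 50   # number of pre-processing / tuning variants "tried"

labels = [random.randint(0, 1) for _ in range(N_SAMPLES)]

def accuracy_of_random_variant():
    """Each 'variant' just guesses labels at random -> true accuracy 0.5."""
    guesses = [random.randint(0, 1) for _ in range(N_SAMPLES)]
    return sum(g == y for g, y in zip(guesses, labels)) / N_SAMPLES

accuracies = [accuracy_of_random_variant() for _ in range(N_VARIANTS)]

mean_acc = sum(accuracies) / N_VARIANTS   # honest summary: close to 0.5
best_acc = max(accuracies)                # what gets reported: above 0.5

print(f"mean accuracy over all variants:      {mean_acc:.3f}")
print(f"best variant (selectively reported):  {best_acc:.3f}")
```

The gap between the mean and the maximum is pure selection bias: nothing about the “best” variant will replicate on fresh data.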

SIGNIFICANCE October 2020 © 2020 The Royal Statistical Society

Table 1: The seven sins of methodological statistical research.

The seven sins of methodological research | Further reading
1. Fishing expeditions/selective reporting | Jelizarow et al.2; Hutson3
2. Publication bias | Boulesteix et al.4
3. Lack of neutral comparison studies | Boulesteix et al.5,6
4. Lack of replication studies | Liu and Meng7
5. Poor design of comparison studies | Keogh and Kasetty8; Boulesteix et al.6; Christodoulou et al.9
6. Lack of meta-analyses | Gardner et al.10
7. Lack of reporting guidelines | —

In an intriguing example of how fishing expeditions and selective reporting work, Jelizarow et al.2 showed that they could make a new discriminant analysis method seem better than existing methods, simply by picking the best results using different data sets, method variants and pre-processing approaches. In reality, the new method was no better than those already in use. Such problems are not limited to classical statistical methods. We see the same issues in machine learning and artificial intelligence.3

While most scientists would agree that selective reporting is bad practice, many (ourselves included) can stumble into this pitfall subconsciously, with no intention to “cheat”, encouraged by the fact that new techniques are introduced in the scientific literature using only examples where they seem to work perfectly. In a survey of papers on new techniques, for example, we found that all – without exception – were claimed to perform better than existing competitors.11 Clearly, methodological results are affected by something akin to publication bias. However, discussing publication bias, which has attracted a lot of attention in medical and social sciences since the 1950s, seemed to be surprisingly taboo in methodological research until we tried to define the concept in this context.4

Our contention is that, as a result of publication bias and fishing expeditions, the scientific literature is rife with statistical methods that supposedly perform better than all other methods – but which are never compared to other methods except by their (potentially biased) inventors.

Comparisons

The replicability of methodological research findings has, to our knowledge, never been systematically investigated, which is somewhat unexpected given the many empirical studies devoted to replicability in other scientific fields in the last decade. It is not hard to imagine, however, that claims about the superiority of new methods over existing ones may be overly optimistic and not replicable. Such concerns could be put to rest if the statistical community were to conduct more neutral comparison studies, meaning studies that are not conducted with the aim of demonstrating the superiority of a particular (new) method, and are authored by researchers who are, on the whole, equally familiar with the various proposed methods. The STRATOS initiative (stratos-initiative.org) is, we believe, a step in the right direction, aiming to provide guidance for the statistical analysis of observational medical studies. STRATOS emphasises the importance of comparison studies performed by groups of experts from different “statistical schools”. There are also efforts such as OpenML (openml.org), which tries to tackle this issue in machine learning by opening up the results of thousands of machine learning benchmarks to the public and allowing anyone to add their own results.

However, the pressure on researchers to publish in journals, and the reluctance of journals to accept the results of neutral comparison studies, remain a crucial obstacle. Contrast this with clinical research, where clinical trials are considered important pieces of scientific work, even if the treatment approach has been described elsewhere before. If statistical methods were treated like drugs, there would be a strong demand for neutral and well-planned comparison studies: patients would refuse a drug that has not been reliably proven to be better, so why are we using statistical methods based on the results of one (potentially biased) study?

What is the role of replication studies? We all agree that they are needed in applied research, but does this also hold true for methodological research? The goal of such studies would be to confirm the results of previous methodological papers, using, say, alternative simulation designs, other real data sets and a different implementation. Such formal replication studies are rare to non-existent in methodological research. Would they be deemed non-innovative and not worthy of publication by most renowned statistics journals?

Anne‑Laure Boulesteix is a professor of biometrics in the Institute for Medical Information Processing, Biometry and Epidemiology at LMU München, Germany. Sabine Hoffmann is a postdoctoral researcher in the Institute for Medical Information Processing, Biometry and Epidemiology at LMU München, Germany.

The pitfalls of existing methods are often discovered accidentally and demonstrated in the scientific literature many years after the original publication. This may result in flawed methods becoming widely used and, in the worst cases, accumulating years of potentially misleading results. The numerous reactions from the statistical community to a tweet on this general issue (see Figure 1) suggest that this is perceived to be a huge problem, and one prominent example is the so-called magnitude-based inference method, which was widely used in sport statistics but eventually found to be flawed.12

Figure 1: Co-author’s tweet, asking for examples of widely used, but ultimately flawed, statistical methods.

Alethea Charlton is a student assistant in the Institute for Medical Information Processing, Biometry and Epidemiology at LMU München, Germany. Heidi Seibold is a postdoctoral researcher at LMU München, Bielefeld University and Helmholtz Zentrum München, Germany.

Design

There is a clear need for more neutral comparisons and replications of methodological statistical research, but how should such studies be performed?

In many fields related to statistics, such as computational biology and bioinformatics, scientists can rely on a wide body of literature offering guidance on how to perform comparison studies. But, surprisingly, the design of comparison studies of statistical methods has hardly been addressed, even though the design of experiments is an intrinsically statistical issue. Research on the appropriate design of benchmark studies, regardless of whether based on simulation13 or on real data,6 is still in its infancy. We can learn a lot from the world of clinical trials in this respect. For example, the calculation of the required number of real data sets and strategies to avoid bias (e.g., when handling missing values) are concepts that are relevant in both clinical and benchmarking studies. Thus far, however, the focus has been almost entirely on the world of clinical trials.

Another key concept, virtually unused in the field of methodological research, is meta-analysis. Well established in health and social sciences, meta-analyses systematically collate all existing research on a specific topic and are considered to be the highest level of evidence. A few first attempts to summarise existing methodological literature have emerged, including a formal meta-analysis of methods for the assessment of a certain type of software in computational biology/bioinformatics,10 and a systematic review of the performance of machine learning versus logistic regression.9 However, despite these important initial steps, quantitative or systematic reviews on the performance of methods described in the methodological statistical literature are extremely rare, and, what is more, how they should be performed is unclear.

Lastly, we should not fall at the final hurdle: reporting, another important issue related to replicability. Appropriate reporting has been the subject of much conversation over the past decade, in fields ranging from randomised clinical trials to prediction models relying on artificial intelligence in health science. To date, however, no guidance is available for reporting of methodological statistical research. Our personal experience is that critical information is often missing in methodological research articles, such as the exact way in which the simulated data were generated. This incomplete reporting impedes understanding of the advantages, limitations and expected performance of the methods considered, not to mention potentially rendering studies impossible to reproduce. A recommended approach in this context is the publication of analysis code, which allows readers to reproduce a study’s analysis at the click of a mouse. This is already required by some journals (such as the Biometrical Journal) and will hopefully become more common in the coming years.

To sum up, we argue that statisticians can learn a great deal from clinical research (and other fields) about comparison studies, reporting and research synthesis, and that these valuable lessons should be applied to methodological statistical research, ultimately leading towards more evidence-based statistical practice. After all, improving the replicability of methodological research is an important step in improving research quality across all fields that apply statistical methods.

Acknowledgements

We thank the German Research Foundation (DFG; individual grants BO3139/4-3 and BO3139/7-1 to ALB) and the German Federal Ministry of Education and Research (BMBF; grant no. 01IS18036A) for funding, and the Twitter community who helped improve our paper with valuable comments and literature recommendations (see bit.ly/2EfmHyz and bit.ly/3jgu4EL).

References

1. Held, L. and Schwab, S. (2020) Improving the reproducibility of science. Significance, 17(1), 10–11.
2. Jelizarow, M., Guillemot, V., Tenenhaus, A., Strimmer, K. and Boulesteix, A. L. (2010) Over-optimism in bioinformatics: An illustration. Bioinformatics, 26(16), 1990–1998.
3. Hutson, M. (2018) Artificial intelligence faces reproducibility crisis. Science, 359, 725–726.
4. Boulesteix, A. L., Stierle, V. and Hapfelmeier, A. (2015) Publication bias in methodological computational research. Cancer Informatics, 14, 11–19.
5. Boulesteix, A. L., Hable, R., Lauer, S. and Eugster, M. J. (2015) A statistical framework for hypothesis testing in real data comparison studies. American Statistician, 69(3), 201–212.
6. Boulesteix, A. L., Wilson, R. and Hapfelmeier, A. (2017) Towards evidence-based computational statistics: Lessons from clinical research on the role and design of real-data benchmark studies. BMC Medical Research Methodology, 17(1), 138.
7. Liu, K. and Meng, X. L. (2016) There is individualized treatment. Why not individualized inference? Annual Review of Statistics and Its Application, 3, 79–111.
8. Keogh, E. and Kasetty, S. (2003) On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Mining and Knowledge Discovery, 7(4), 349–371.
9. Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., van Calster, B., et al. (2019) A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12–22.
10. Gardner, P. P., Watson, R. J., Morgan, X. C., Draper, J. L., Finn, R. D., Morales, S. E. and Stott, M. B. (2019) Identifying accurate metagenome and amplicon software via a meta-analysis of sequence to taxonomy benchmarking studies. PeerJ, 7, e6160.
11. Boulesteix, A. L., Lauer, S. and Eugster, M. J. (2013) A plea for neutral comparison studies in computational sciences. PLoS ONE, 8(4), e61562.
12. Sainani, K. (2018) The problem with “magnitude-based inference”. Medicine & Science in Sports & Exercise, 50(1), 2166–2176.
13. Morris, T. P., White, I. R. and Crowther, M. J. (2019) Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102.

