STATISTICAL TOOLS

A replication crisis in methodological research?

Statisticians have been keen to critique statistical aspects of the “replication crisis” in other scientific disciplines. But new statistical tools are often published and promoted without any thought to replicability. This needs to change, argue Anne‑Laure Boulesteix, Sabine Hoffmann, Alethea Charlton and Heidi Seibold

Imagine you need to take a drug. A new drug is available that has been investigated mainly through in vitro experiments (i.e., in test tubes, rather than in living organisms). It was shown to improve survival in a few patients selected by the pharmaceutical company. Would you feel safe taking this drug? Probably not, especially if you are a statistician. You would ask why no pre-registered randomised clinical trial was conducted to investigate the efficacy and safety of the drug.

Now imagine you need to use a statistical method for some data analysis. One of the available methods, a new method, was investigated mainly through simulations (i.e. using synthetic data sets). It was shown to be more efficient than other statistical methods in a few example data sets selected by the developer of the method. Would you feel confident using it? Weirdly, if you are a statistician, you probably would.

Sins

Statisticians are among the first to call for more rigour in clinical trials and other applied fields of research. Yet, in their own methodological research, statisticians commonly make claims on the performance and utility of methods based merely on theory, limited simulations, or arbitrarily selected real data examples. In the current replication crisis in science, statisticians caution against questionable research practices in other fields. Yet the same questionable practices should also be avoided in the development and reporting of new statistical methods.

Table 1 sets out “seven sins of methodological research”, inspired by a recent Significance article by Held and Schwab.1 These practices include “fishing expeditions” (i.e. running numerous different analyses in the hope that one will yield good results), then “selectively reporting” the good results while leaving the others in the metaphorical “file drawer”. In some cases, this “file drawer problem” affects whole projects, whose results are deemed unexciting and are therefore not published at all, further exacerbating so-called “publication bias”.
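The mechanics of such a fishing expedition are easy to reproduce. The sketch below is purely hypothetical (it is not taken from any study cited here): it evaluates many equally worthless “method variants” on the same synthetic data, so every variant truly performs at chance level, yet reporting only the best-performing variant makes the method look like a genuine improvement.

```python
# Sketch of a "fishing expedition": every method variant is pure noise
# (random guessing on binary labels), so true accuracy is 0.5 for all.
# Selectively reporting the best of many noisy estimates inflates the
# apparent performance well above chance.
import random

random.seed(1)

N_SAMPLES = 200   # size of one synthetic test set
N_VARIANTS = 50   # number of pre-processing / tuning variants "tried"

labels = [random.randint(0, 1) for _ in range(N_SAMPLES)]

def accuracy_of_random_variant():
    """Each 'variant' just guesses labels at random -> true accuracy 0.5."""
    guesses = [random.randint(0, 1) for _ in range(N_SAMPLES)]
    return sum(g == y for g, y in zip(guesses, labels)) / N_SAMPLES

accuracies = [accuracy_of_random_variant() for _ in range(N_VARIANTS)]

mean_acc = sum(accuracies) / N_VARIANTS   # honest summary: close to 0.5
best_acc = max(accuracies)                # what gets reported: above 0.5

print(f"mean accuracy over all variants:      {mean_acc:.3f}")
print(f"best variant (selectively reported):  {best_acc:.3f}")
```

The gap between the mean and the maximum is pure selection bias: nothing about the “best” variant will replicate on fresh data.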

SIGNIFICANCE October 2020 © 2020 The Royal Statistical Society

Table 1: The seven sins of methodological statistical research.

The seven sins of methodological research | Further reading
1. Fishing expeditions/selective reporting | Jelizarow et al.2; Hutson3
2. Publication bias | Boulesteix et al.4
3. Lack of neutral comparison studies | Boulesteix et al.5,6
4. Lack of replication studies | Liu and Meng7
5. Poor design of comparison studies | Keogh and Kasetty8; Boulesteix et al.6; Christodoulou et al.9
6. Lack of meta-analyses | Gardner et al.10
7. Lack of reporting guidelines | —

In an intriguing example of how fishing expeditions and selective reporting work, Jelizarow et al.2 showed that they could make a new discriminant analysis method seem better than existing methods, simply by picking the best results using different data sets, method variants and pre-processing approaches. In reality, the new method was no better than those already in use. Such problems are not limited to classical statistical methods. We see the same issues in machine learning and artificial intelligence.3

While most scientists would agree that selective reporting is bad practice, many (ourselves included) can stumble into this pitfall subconsciously, with no intention to “cheat”, encouraged by the fact that new techniques are introduced in the scientific literature using only examples where they seem to work perfectly. In a survey of papers on new techniques, for example, we found that all – without exception – were claimed to perform better than existing competitors.11 Clearly, methodological results are affected by something akin to publication bias. However, discussing publication bias, which has attracted a lot of attention in medical and social sciences since the 1950s, seemed to be surprisingly taboo in methodological research until we tried to define the concept in this context.4

Our contention is that, as a result of publication bias and fishing expeditions, the scientific literature is rife with statistical methods that supposedly perform better than all other methods – but which are never compared to other methods except by their (potentially biased) inventors.

Comparisons

The replicability of methodological research findings has, to our knowledge, never been systematically investigated, which is somewhat unexpected given the many empirical studies devoted to replicability in other scientific fields in the last decade. It is not hard to imagine, however, that claims about the superiority of new methods over existing ones may be overly optimistic and not replicable. Such concerns could be put to rest if the statistical community were to conduct more neutral comparison studies, meaning studies that are not conducted with the aim of demonstrating the superiority of a particular (new) method, and are authored by researchers who are, on the whole, equally familiar with the various proposed methods. The STRATOS initiative (stratos-initiative.org) is, we believe, a step in the right direction, aiming to provide guidance for the statistical analysis of observational medical studies. STRATOS emphasises the importance of comparison studies performed by groups of experts from different “statistical schools”. There are also efforts such as OpenML (openml.org), which tries to tackle this issue in machine learning by opening up the results of thousands of machine learning benchmarks to the public and allowing anyone to add their own results.

However, the pressure on researchers to publish in journals, and the reluctance of journals to accept the results of neutral comparison studies, remain a crucial obstacle. Contrast this with clinical research, where clinical trials are considered important pieces of scientific work, even if the treatment approach has been described elsewhere before. If statistical methods were treated like drugs, there would be a strong demand for neutral and well-planned comparison studies: patients would refuse a drug that has not been reliably proven to be better, so why are we using statistical methods based on the results of one (potentially biased) study?

What is the role of replication studies? We all agree that they are needed in applied research, but does this also hold true for methodological research? The goal of such studies would be to confirm the results of previous methodological papers, using, say, alternative simulation designs, other real data sets and a different implementation. Such formal replication studies are rare to non-existent in methodological research. Would they be deemed non-innovative and not worthy of publication by most renowned statistics journals?

Anne‑Laure Boulesteix is a professor of biometrics in the Institute for Medical Information Processing, Biometry and Epidemiology at LMU München, Germany. Sabine Hoffmann is a postdoctoral researcher in the Institute for Medical Information Processing, Biometry and Epidemiology at LMU München, Germany.

The pitfalls of existing methods are often discovered accidentally and demonstrated in the scientific literature many years after the original publication. This may result in flawed methods becoming widely used and, in the worst cases, accumulating years of potentially misleading results. The numerous reactions from the statistical community to a tweet on this general issue (see Figure 1) suggest that this is perceived to be a huge problem, and one prominent example is the so-called magnitude-based inference method, which was widely used in sport statistics but eventually found to be flawed.12

Figure 1: Co-author’s tweet, asking for examples of widely used, but ultimately flawed, statistical methods.

Alethea Charlton is a student assistant in the Institute for Medical Information Processing, Biometry and Epidemiology at LMU München, Germany. Heidi Seibold is a postdoctoral researcher at LMU München, Bielefeld University and Helmholtz Zentrum München, Germany.

Design

There is a clear need for more neutral comparisons and replications of methodological statistical research, but how should such studies be performed?

In many fields related to statistics, such as computational biology and bioinformatics, scientists can rely on a wide body of literature offering guidance on how to perform comparison studies. But, surprisingly, the design of comparison studies of statistical methods has hardly been addressed, even though the design of experiments is an intrinsically statistical issue. Research on the appropriate design of benchmark studies, regardless of whether based on simulation13 or on real data,6 is still in its infancy. We can learn a lot from the world of clinical trials in this respect. For example, the calculation of the required number of real data sets and strategies to avoid bias (e.g., when handling missing values) are concepts that are relevant in both clinical and benchmarking studies. Thus far, however, the focus has been almost entirely on the world of clinical trials.

Another key concept, virtually unused in the field of methodological research, is meta-analysis. Well established in health and social sciences, meta-analyses systematically collate all existing research on a specific topic and are considered to be the highest level of evidence. A few first attempts to summarise existing methodological literature have emerged, including a formal meta-analysis of methods for the assessment of a certain type of software in computational biology/bioinformatics,10 and a systematic review of the performance of machine learning versus logistic regression.9 However, despite these important initial steps, quantitative or systematic reviews on the performance of methods described in the methodological statistical literature are extremely rare, and, what is more, how they should be performed is unclear.

Lastly, we should not fall at the final hurdle: reporting, another important issue related to replicability. Appropriate reporting has been the subject of much conversation over the past decade, in fields ranging from randomised clinical trials to prediction models relying on artificial intelligence in health science. To date, however, no guidance is available for reporting of methodological statistical research. Our personal experience is that critical information is often missing in methodological research articles, such as the exact way in which the simulated data were generated. This incomplete reporting impedes understanding of the advantages, limitations and expected performance of the methods considered, not to mention potentially rendering studies impossible to reproduce. A recommended approach in this context is the publication of analysis code, which allows readers to reproduce a study’s analysis at the click of a mouse. This is already required by some journals (such as the Biometrical Journal) and will hopefully become more common in the coming years.

To sum up, we argue that statisticians can learn a great deal from clinical research (and other fields) about comparison studies, reporting and research synthesis, and that these valuable lessons should be applied to methodological statistical research, ultimately leading towards more evidence-based statistical practice. After all, improving the replicability of methodological research is an important step in improving research quality across all fields that apply statistical methods.

Acknowledgements

We thank the German Research Foundation (DFG; individual grants BO3139/4-3 and BO3139/7-1 to ALB) and the German Federal Ministry of Education and Research (BMBF; grant no. 01IS18036A) for funding, and the Twitter community who helped improve our paper with valuable comments and literature recommendations (see bit.ly/2EfmHyz and bit.ly/3jgu4EL).

References

1. Held, L. and Schwab, S. (2020) Improving the reproducibility of science. Significance, 17(1), 10–11.
2. Jelizarow, M., Guillemot, V., Tenenhaus, A., Strimmer, K. and Boulesteix, A. L. (2010) Over-optimism in bioinformatics: An illustration. Bioinformatics, 26(16), 1990–1998.
3. Hutson, M. (2018) Artificial intelligence faces reproducibility crisis. Science, 359, 725–726.
4. Boulesteix, A. L., Stierle, V. and Hapfelmeier, A. (2015) Publication bias in methodological computational research. Cancer Informatics, 14, 11–19.
5. Boulesteix, A. L., Hable, R., Lauer, S. and Eugster, M. J. (2015) A statistical framework for hypothesis testing in real data comparison studies. American Statistician, 69(3), 201–212.
6. Boulesteix, A. L., Wilson, R. and Hapfelmeier, A. (2017) Towards evidence-based computational statistics: Lessons from clinical research on the role and design of real-data benchmark studies. BMC Medical Research Methodology, 17(1), 138.
7. Liu, K. and Meng, X. L. (2016) There is individualized treatment. Why not individualized inference? Annual Review of Statistics and Its Application, 3, 79–111.
8. Keogh, E. and Kasetty, S. (2003) On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Mining and Knowledge Discovery, 7(4), 349–371.
9. Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., van Calster, B., et al. (2019) A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12–22.
10. Gardner, P. P., Watson, R. J., Morgan, X. C., Draper, J. L., Finn, R. D., Morales, S. E. and Stott, M. B. (2019) Identifying accurate metagenome and amplicon software via a meta-analysis of sequence to taxonomy benchmarking studies. PeerJ, 7, e6160.
11. Boulesteix, A. L., Lauer, S. and Eugster, M. J. (2013) A plea for neutral comparison studies in computational sciences. PLoS ONE, 8(4), e61562.
12. Sainani, K. (2018) The problem with “magnitude-based inference”. Medicine & Science in Sports & Exercise, 50(1), 2166–2176.
13. Morris, T. P., White, I. R. and Crowther, M. J. (2019) Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102.

