
FEATURE ARTICLE

META-RESEARCH Dataset decay and the problem of sequential analyses on open datasets

Abstract Open data allows researchers to explore pre-existing datasets in new ways. However, if many researchers reuse the same dataset, the accumulation of statistical tests can inflate the rate of false positives. Here we demonstrate that sequential hypothesis testing on the same dataset by multiple researchers can inflate error rates. We go on to discuss a number of correction procedures that can reduce the number of false positives, and the challenges associated with these correction procedures.

WILLIAM HEDLEY THOMPSON*, JESSEY WRIGHT, PATRICK G BISSETT AND RUSSELL A POLDRACK

*For correspondence: william.[email protected]

Competing interests: The authors declare that no competing interests exist.

Funding: See page 15

Reviewing editor: Chris I Baker, National Institute of Mental Health, National Institutes of Health, United States

Copyright Thompson et al. This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Introduction

In recent years, there has been a push to make the scientific datasets associated with published papers openly available to other researchers (Nosek et al., 2015). Making data open will allow other researchers to both reproduce published analyses and ask new questions of existing datasets (Molloy, 2011; Pisani et al., 2016). The ability to explore pre-existing datasets in new ways should make research more efficient and has the potential to yield new discoveries (Weston et al., 2019).

The availability of open datasets will increase over time as funders mandate and reward data sharing and other open research practices (McKiernan et al., 2016). However, researchers re-analyzing these datasets will need to exercise caution if they intend to perform hypothesis testing. At present, researchers reusing datasets tend to correct for the number of statistical tests that they perform on the datasets. However, as we discuss in this article, when performing hypothesis testing it is important to take into account all of the statistical tests that have been performed on the datasets.

A distinction can be made between simultaneous and sequential correction procedures when correcting for multiple tests. Simultaneous procedures correct for all tests at once, while sequential procedures correct for the latest in a non-simultaneous series of tests. There are several proposed solutions to address multiple sequential analyses, namely α-spending and α-investing procedures (Aharoni and Rosset, 2014; Foster and Stine, 2008), which strictly control the false positive rate. Here we will also propose a third, α-debt, which does not maintain a constant false positive rate but allows it to grow controllably.

Sequential correction procedures are harder to implement than simultaneous procedures as they require keeping track of the total number of tests that have been performed by others. Further, in order to ensure data are still shared, the sequential correction procedures should not be antagonistic with current data-sharing incentives and infrastructure. Thus, we have identified three desiderata regarding open data and multiple hypothesis testing:

Sharing incentive

Data producers should be able to share their data without negatively impacting their initial statistical tests. Otherwise, this reduces the incentive to share data.

Thompson et al. eLife 2020;9:e53498. DOI: https://doi.org/10.7554/eLife.53498 1 of 17 Feature Article Meta-Research Dataset decay and the problem of sequential analyses on open datasets

Open access

Minimal to no restrictions should be placed on accessing open data, other than those necessary to protect the confidentiality of human subjects. Otherwise, the data are no longer open.

Stable false positive rate

The false positive rate (i.e., type I error) should not increase due to reusing open data. Otherwise, scientific results become less reliable with each reuse.

We will show that obtaining all three of these desiderata is not possible. We will demonstrate below that the current practice of ignoring sequential tests leads to an increased false positive rate in the scientific literature. Further, we show that sequentially correcting for data reuse can reduce the number of false positives compared to current practice. However, all the proposals considered here must still compromise (to some degree) on one of the above desiderata.

An intuitive example of the problem

Before proceeding with technical details of the problem, we outline an intuitive problem regarding sequential statistical testing and open data. Imagine there is a dataset which contains the variables (v1, v2, v3). Let us now imagine that one researcher performs the statistical tests to analyze the relationship between v1 ~ v2 and v1 ~ v3 and decides that a p < 0.05 is treated as a positive finding (i.e. null hypothesis rejected). The analysis yields p-values of p = 0.001 and p = 0.04, respectively. In many cases, we expect the researcher to correct for the fact that two statistical tests are being performed. Thus, the researcher chooses to apply a Bonferroni correction such that p < 0.025 is the adjusted threshold for statistical significance. In this case, both tests are published, but only one of the findings is treated as a positive finding.

Alternatively, let us consider a different scenario with sequential analyses and open data. Instead, the researcher only performs one statistical test (v1 ~ v2, p = 0.001). No correction is performed, and it is considered a positive finding (i.e. null hypothesis rejected). The dataset is then published online. A second researcher now performs the second test (v1 ~ v3, p = 0.04) and deems this a positive finding too because it is under a p < 0.05 threshold and they have only performed one statistical test. In this scenario, with the same data, we have two published positive findings compared to the single positive finding in the previous scenario. Unless a reasonable justification exists for this difference between the two scenarios, this is troubling.

What are the consequences of these two different scenarios? A famous example of the consequences of uncorrected multiple simultaneous statistical tests is the finding of fMRI BOLD activation in a dead salmon when appropriate corrections for multiple tests were not performed (Bennett et al., 2010; Bennett et al., 2009). Now let us imagine this dead salmon dataset is published online but, in the original analysis, only one part of the salmon was analyzed, and no evidence was found supporting the hypothesis of neural activity in a dead salmon. Subsequent researchers could access this dataset, test different regions of the salmon and report their uncorrected findings. Eventually, we would see reports of dead salmon activations if no sequential correction strategy is applied, but each of these individual findings would appear completely legitimate by current correction standards.

We will now explore the idea of sequential tests in more detail, but this example highlights some crucial arguments that need to be discussed. Can we justify reusing data without correcting for sequential tests? If not, what methods could sequentially correct for the multiple statistical tests? In order to fully grapple with these questions, we first need to discuss the notion of a statistical family and whether sequential analyses create new families.
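The arithmetic of the two scenarios above can be made concrete with a short sketch (our own illustration, not code from the paper; the p-values are the ones from the worked example):

```python
# The two p-values from the worked example in the text.
p_values = {"v1~v2": 0.001, "v1~v3": 0.04}
alpha = 0.05

# Scenario 1: one researcher, two simultaneous tests, Bonferroni-corrected.
bonferroni_threshold = alpha / len(p_values)  # 0.025
simultaneous = {k: p < bonferroni_threshold for k, p in p_values.items()}
print(simultaneous)  # {'v1~v2': True, 'v1~v3': False} -> one positive finding

# Scenario 2: two researchers, each performing "one" uncorrected test.
sequential = {k: p < alpha for k, p in p_values.items()}
print(sequential)    # {'v1~v2': True, 'v1~v3': True} -> two positive findings
```

Same data and same tests, but the sequential scenario reports one extra positive finding purely because each researcher's correction ignores the other's test.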


Figure 1. Correction procedures can reduce the probability of false positives. (A) The probability of there being at least one false positive (y-axis) increases as the number of statistical tests increases (x-axis). The use of a correction procedure reduces the probability of there being at least one false positive (B: α-debt; C: α-spending; D: α-investing). Plots are based on simulations: see main text for details. Dotted line in each panel indicates a probability of 0.05.

Statistical families

A family is a set of tests which we relate the same error rate to (familywise error). What constitutes a family has been challenging to precisely define, and the existing guidelines often contain additional imprecise terminology (e.g. Cox, 1965; Games, 1971; Hancock and Klockars, 1996; Hochberg and Tamhane, 1987). Generally, tests are considered part of a family when: (i) multiple variables are being tested with no predefined hypothesis (i.e. exploration or data-dredging), or (ii) multiple pre-specified tests together help support the same or associated research questions (Hancock and Klockars, 1996; Hochberg and Tamhane, 1987).

Even if following these guidelines, there can still be considerable disagreements about what constitutes a statistical family, which can include both very liberal and very conservative inclusion criteria. An example of this discrepancy is seen in using a factorial ANOVA. Some have argued that the main effect and interaction are separate families as they answer 'conceptually distinct questions' (e.g. page 291 of Maxwell and Delaney, 2004), while others would argue the opposite and state they are the same family (e.g. Cramer et al., 2016; Hancock and Klockars, 1996). Given the substantial leeway regarding the definition of family, recommendations have directed researchers to define and justify their family of tests a priori (Hancock and Klockars, 1996; Miller, 1981).

A crucial distinction in the definition of a family is whether the analysis is confirmatory (i.e. hypothesis-driven) or exploratory. Given issues regarding replicability in recent years (Open Science Collaboration, 2015), there has been considerable effort placed into clearly demarcating what is exploratory and what is confirmatory. One prominent definition is that confirmatory research requires preregistration before seeing the data (Wagenmakers et al., 2012). However, current practice often involves releasing open data with the original research article. Thus, all data reuse may be guided by the original or subsequent analyses (a HARKing-like problem where methods are formulated after some results are known [Button, 2019]). Therefore, if adopting this prominent definition of confirmatory research (Wagenmakers et al., 2012), it follows that any reuse of open data after publication must be exploratory unless the analysis is preregistered before the data release.

Some may find the Wagenmakers et al., 2012 definition to be too stringent and instead would rather allow that confirmatory hypotheses can be stated at later dates despite the researchers having some information about the data from previous use. Others have said confirmatory analyses may not require preregistrations (Jebb et al., 2017) and have argued that confirmatory analyses on open data are possible (Weston et al., 2019). If analyses on open data can be considered confirmatory, then we need to consider the second guideline about whether statistical tests are answering similar or the same research questions. The answer to this question is not always obvious, as was highlighted above regarding factorial ANOVA. However, if a study reusing data can justify itself as confirmatory, then it must also justify that it is asking a 'conceptually distinct question' from previous instances that used the data. We are not claiming that this is not possible to justify, but the justification ought to be done if no sequential correction is applied, as new families are not created just because the data is being reused (see next section).

We stress that our intention here is not to establish the absolute definition of the term family; it has been an ongoing debate for decades, which we do not intend to solve. We believe our upcoming argument holds regardless of the definition. This section aimed to provide a working definition of family that allows for both small and large families to be justified. In the next section, we argue that regardless of the specific definition of family, sequential testing by itself does not create a new family by virtue of it being a sequential test.

Families of tests through time

The crucial question for the present purpose is whether the reuse of data constitutes a new family of tests. If data reuse creates a new family of tests, then there is no need to perform a sequential correction procedure in order to maintain control over familywise error. Alternatively, if a new family has not been created simply by reusing data, then we need to consider sequential correction procedures.

There are two ways in which sequential tests with open data can differ from simultaneous tests (where correction is needed): a time lag between tests and/or different individuals performing the tests. Neither of these two properties is sufficient to justify the emergence of a new family of tests. First, the temporal displacement of statistical tests can not be considered sufficient reason for creating a new family of statistical tests, as the speed with which a researcher analyzes a dataset is not relevant to the need to control for multiple statistical tests. If it were, then a simple correction procedure would be to wait a specified length of time before performing the next statistical test. Second, it should not matter who performs the tests; otherwise, one could correct for multiple tests by crowd-sourcing the analysis. Thus, if we were to decide that either of the two differentiating properties of sequential tests on open data creates a new family, undesirable correction procedures would be allowable. To prevent this, statistical tests on open data, which can be run by different people, and at different times, can be part of the same family of tests. Since they can be in the same family, sequential tests on open data need to consider correction procedures to control the rate of false positives across the family.

We have demonstrated the possibility that families of tests can belong to sequential analyses. However, in practice, when does this occur? The scale of the problem rests partly in what is classed as an exploratory analysis or not. If all data reuse is considered part of the same family due to it being exploratory, this creates a large family. If, however, this definition is rejected, then it depends on the research question. Due to the fuzzy nature of 'family', and the argument above showing that data reuse does not create new families automatically, we propose a simple rule-of-thumb: if the sequential tests would be considered within the same family if performed simultaneously, then they are part of the same family in sequential tests. The definition of exploratory analyses and this rule-of-thumb indicate that many sequential tests should be considered part of the same family when reusing open data. We, therefore, suggest that researchers should apply corrections for multiple tests when reusing data or provide a justification for the lack of such corrections (as they would need to in the case of simultaneous tests belonging to different families).

The consequence of not taking multiple sequential testing seriously

In this section, we consider the consequences of uncorrected sequential testing and several procedures to correct for them. We start with a simulation to test the false positive rate of the different sequential correction procedures by performing 100 sequential statistical tests (Pearson correlations) where the simulated covariance between all variables was 0 (see Methods for additional details). The simulations ran for 1000 iterations, and the familywise error was calculated using a two-tailed statistical significance threshold of p < 0.05.

We first consider what happens when the sequential tests are uncorrected. Unsurprisingly, the results are identical to not correcting for simultaneous tests (Figure 1A). There will almost always be at least one false positive any time one performs 100 sequential analyses with this simulation. This rate of false positives is dramatically above the desired familywise error rate of at least one false positive in 5% of the simulation's iterations: uncorrected sequential tests necessarily lead to more false positives.

To correct for this false positive increase, we consider several correction procedures. The first sequential procedure we consider is α-debt. For the ith sequential test, this procedure considers there to be i tests that should be corrected. This procedure effectively performs a Bonferroni correction, i.e. the threshold of statistical significance becomes α1/i, where α1 is the first statistical threshold (here 0.05). Thus, on the first test α1 = 0.05, then on the second sequential test α2 = 0.025, α3 = 0.0167, and so on. While each sequential test is effectively a Bonferroni correction considering all previous tests, this does not retroactively change the inference of any previous statistical tests. When a new test is performed, the previous test's α is now too lenient considering all the tests that have been performed. Thus, when considering all tests together, the false positive rate will increase, accumulating a false positive 'debt'.
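The α-debt threshold schedule, and the familywise error it is responding to, can be sketched in a few lines (our own minimal illustration, not the code from the authors' repository; uniform p-values stand in for the null Pearson correlations):

```python
import numpy as np

def alpha_debt_thresholds(n_tests, alpha1=0.05):
    """alpha-debt: the i-th sequential test is thresholded at alpha1 / i,
    i.e. a Bonferroni correction over all tests performed so far."""
    return [alpha1 / i for i in range(1, n_tests + 1)]

print(alpha_debt_thresholds(3))  # [0.05, 0.025, 0.01666...]

# The 'debt ceiling' after t tests is alpha1 * (1 + 1/2 + ... + 1/t): it
# always grows, but only harmonically, so the rate of increase slows down.
ceiling = 0.05 * sum(1 / i for i in range(1, 101))  # ~0.26 after 100 tests

# For contrast, the uncorrected familywise error over 100 null tests:
# under the null, p-values are uniform, so a quick stand-in simulation is
rng = np.random.default_rng(2020)
fwer = np.mean([(rng.uniform(size=100) < 0.05).any() for _ in range(1000)])
# fwer comes out close to 1 - 0.95**100 (about 0.994), far above 0.05
```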


This debt entails that the method does not ensure the type I error rate remains under a specific value; instead, it allows it to controllably increase under a 'debt ceiling' with each sequential test (the debt ceiling at test t is the sum of all α1 to αt). The debt ceiling will always increase, but the rate of increase in debt slows down. These phenomena were confirmed in the simulations (Figure 1B). Finally, the method can mathematically ensure that the false negative rate (i.e., type II error) is equal to or better than simultaneous correction with Bonferroni (see Methods).

The next two procedures we consider have previously been suggested in the literature: α-spending and α-investing (Aharoni and Rosset, 2014; Foster and Stine, 2008). The first has a total amount of 'α-wealth', and the sum of all the statistical thresholds for all sequential tests can never exceed this amount (i.e., if the α-wealth is 0.05 then the sum of all thresholds on sequential tests must be less than 0.05). Here, for each sequential test, we spend half the remaining wealth (i.e., α1 is 0.025, α2 is 0.0125, and so on). In the simulations, the sequential tests limit the probability of there being at least one false positive to less than 0.05 (Figure 1C).

Finally, α-investing allows for the significance threshold to increase or decrease as researchers perform additional tests. Again there is a concept of α-wealth. If a test rejects the null hypothesis, there is an increase in the remaining α-wealth that future tests can use and, if the reverse occurs, the remaining α-wealth decreases (see Methods). α-investing ensures control of the false discovery rate at an assigned level. Here we invest 50% of the remaining wealth for each statistical test. In the simulations, this method also remains under a 0.05 familywise error rate as the sequential tests increase (Figure 1D).

The main conclusion from this set of simulations is that the current practice of not correcting for open data reuse results in a substantial increase in the number of false positives presented in the literature.

Sensitivity to the order of sequential tests

The previous simulation did not consider any true positives in the data (i.e. cases where we should reject the null hypothesis). Since the statistical threshold for significance changes as the number of sequential tests increases, it becomes crucial to evaluate the sensitivity of each method to both type I and type II errors in regards to the order of sequential tests. Thus, we simulated true positives (between 1-10) where the covariance of these variables and the dependent variable was set to ρ (ρ ranged between 0 and 1). Further, λ controlled the sequential test order by determining the probability that a test was a true positive. When λ is positive, it entails a higher likelihood that earlier tests will be one of the true positives (and vice versa when λ is negative; see Methods). All other parameters are the same as in the previous simulation. Simultaneous correction procedures (Bonferroni and FDR) of all 100 tests were also included to contrast the different sequential procedures with these methods.

The results reveal that the order of the tests is pivotal for sequential correction procedures. Unsurprisingly, the uncorrected and simultaneous correction procedures do not depend on the sequential order of tests (Figure 2ABC). The sequential correction procedures all increased their true positive rate (i.e., fewer type II errors) when the true positives were earlier in the analysis order (Figure 2A). We also observe that α-debt had the highest true positive rate of the sequential procedures and, when the true positives were later in the test sequence, performed on par with Bonferroni. Further, when the true positives were earlier, α-debt outperformed Bonferroni at identifying them. α-investing and α-spending cannot give such assurances: when the true positives are later in the analysis sequence (i.e. λ is negative), there is less sensitivity to true positives (i.e. more type II errors). α-debt is more sensitive to true positives compared to α-spending because the threshold for the mth sequential test decreases linearly in α-debt and exponentially in α-spending. This fact results in a more lenient statistical threshold for α-debt in later sequential tests.

The false positive rate and false discovery rate are both very high for the uncorrected procedure (Figure 2BC). α-debt and α-spending both have a decrease in false positives and false discovery rate when λ is positive (Figure 2BC). The false discovery rate for α-debt generally lies between the spending (smallest) and investing procedures (the largest, and the one that aims to be below 0.05). Also, for all methods, the true positive rate breaks down as expected when the covariance between variables approaches the noise level. Thus we split the false discovery rate along four quadrants based on λ and the noise floor (Figure 2D).
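The two wealth-based procedures described earlier in this section can be sketched as follows (our own simplified illustration; the α-investing update follows the general form in Foster and Stine, 2008 and may differ in detail from the implementation used here, and the payout value is our assumption):

```python
def alpha_spending(n_tests, wealth=0.05, spend_frac=0.5):
    """alpha-spending: spend a fraction of the remaining alpha-wealth on each
    test, so the thresholds can never sum to more than the initial wealth."""
    thresholds = []
    for _ in range(n_tests):
        alpha_i = wealth * spend_frac
        thresholds.append(alpha_i)
        wealth -= alpha_i
    return thresholds

print(alpha_spending(3))  # [0.025, 0.0125, 0.00625]

def alpha_investing(p_values, wealth=0.05, invest_frac=0.5, payout=0.05):
    """alpha-investing: invest a fraction of the remaining wealth per test;
    a rejection earns wealth back, a non-rejection costs wealth."""
    decisions = []
    for p in p_values:
        alpha_i = wealth * invest_frac
        reject = p < alpha_i
        if reject:
            wealth += payout                   # rejection increases wealth
        else:
            wealth -= alpha_i / (1 - alpha_i)  # failed test costs wealth
        decisions.append(reject)
    return decisions
```

Because a rejection replenishes the α-investing wealth, early true positives leave more wealth for later tests, which is one way to see why test order matters for these procedures.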


Figure 2. The order of sequential tests can impact true positive sensitivity. (A) The true positive rate in the uncorrected case (left-most panel), in two cases of simultaneous correction (second and third panels), and in three cases of sequential correction (fourth, fifth and sixth panels). In each panel the true positive rate after 100 tests is plotted as a function of two simulation parameters: λ (x-axis) and the simulated covariance of the true positives (y-axis). When λ is positive (negative), it increases the probability of the true positives being an earlier (later) test. Plots are based on simulations in which there are ten true positives in the data: see main text for details. (B) Same as A for the false positive rate. (C) Same as A for the false discovery rate. (D) Same as C for the average false discovery rate in four quadrants. Q1 has λ < 0; covariance > 0.25. Q2 has λ > 0; covariance > 0.25. Q3 has λ < 0; covariance < 0.25. Q4 has λ > 0; covariance < 0.25. The probability of true positives being an earlier test is highest in Q2 and Q4 as λ > 0 in these quadrants. (E) Same as D with the false discovery rate (y-axis) plotted against the percentage of true positives (x-axis) for the four quadrants. The dotted lines in D and E indicate a false discovery rate of 0.05. Code is available at https://github.com/wiheto/datasetdecay (Thompson, 2020; copy archived at https://github.com/elifesciences-publications/datasetdecay).

The quadrants where true positive covariance is above the noise floor (Q1 and Q2) have a false discovery rate of less than 0.05 for all procedures except uncorrected (Figure 2D). Finally, when varying the number of true positives in the dataset, we found that Q1 and Q2 generally decrease as the number of true positives grows for α-spending and α-debt,


whereas α-investing remains at the 0.05 mark regardless of the number of true positives (Figure 2E).

All three sequential correction procedures performed well at identifying true positives when these tests were made early on in the analysis sequence. When the true positive tests are performed later, α-debt has the most sensitivity for true positives, and α-investing is the only procedure that has a stable false discovery rate regardless of the number of true positives (the other two methods appear to be more conservative). The true positive sensitivity and false discovery rate of each of the three sequential correction methods considered depend on the order of statistical tests and how many true positives are in the data.

Uncorrected sequential tests will flood the scientific literature with false positives

We have demonstrated a possible problem with sequential tests on simulations. These results show that sequential correction strategies are more liberal than their simultaneous counterparts. Therefore we should expect more false positives if sequential correction methods were performed on a dataset. We now turn our attention to empirical data from a well-known shared dataset in neuroscience to examine the effect of multiple reuses of the dataset. This empirical example is to confirm the simulations and show that more positive findings (i.e. null hypothesis rejected) will be identified with sequential correction. We used 68 cortical thickness estimates from the 1200 subject release of the HCP dataset (Van Essen et al., 2012). All subjects belonging to this dataset gave informed consent (see Van Essen et al., 2013 for more details). The analysis of shared data was approved by the Stanford IRB (protocol #31848). We then used 182 behavioral measures ranging from task performance to survey responses (see Supplementary file 1). For simplicity, we ignore all previous publications using the HCP dataset (of which there are now several hundred) for our p-value correction calculation.

We fit 182 linear models in which each behavior (dependent variable) was modeled as a function of each of the 68 cortical thickness estimates (independent variables), resulting in a total of 12,376 statistical tests. As a baseline, we corrected all statistical tests simultaneously with Bonferroni and FDR. For all other procedures, the independent variables within each model (i.e. cortical thickness) had simultaneous FDR correction while considering each linear model (i.e. each behavior) sequentially. The procedures considered were: uncorrected sequential analysis with both Bonferroni and FDR simultaneous correction procedures; all three sequential correction procedures with FDR simultaneous correction within each model. For the sequential tests, the orders were randomized in two ways: (i) uniformly; (ii) weighting the earlier tests to be the significant findings found during the baseline conditions (see Methods). The latter considers how the methods perform if there is a higher chance that researchers test hypotheses that produce positive findings earlier in the analysis sequence rather than later. Sequential analyses had the order of tests randomized 100 times.

We asked two questions with these models. First, we identified the number of positive findings that would be reported for the different correction methods (a positive finding is considered to be when the null hypothesis is rejected at p < 0.05, two tail). Second, we asked how many additional scientific articles would be published claiming to have identified a positive result (i.e. a null hypothesis has been rejected) for the different correction methods. Importantly, in this evaluation of empirical data, we are not necessarily concerned with the number of true relationships with this analysis. Primarily, we consider the differences in the inferred statistical relationships when comparing the different sequential correction procedures to a baseline of the simultaneous correction procedures. These simultaneous procedures allow us to contrast the sequential approaches with current practices (Bonferroni, a conservative procedure, and FDR, a more liberal measure). Thus any procedure that is more stringent than the Bonferroni baseline will be too conservative (more type II errors). Any procedure that is less stringent than FDR will have an increased false discovery rate, implying more false positives (relative to the true positives). Note that we are tackling only issues regarding correction procedures for multiple hypothesis tests; determining the truth of any particular outcome would require additional replication.
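The evaluation scheme just described (simultaneous FDR correction within each model, at a level that shrinks sequentially across models) can be sketched as follows. This is our own illustration with random stand-in p-values, not the analysis code from the repository, and the sequential level here follows α-debt:

```python
import numpy as np

def bh_reject(p_values, q):
    """Benjamini-Hochberg FDR: reject the k smallest p-values, where k is
    the largest rank with p_(k) <= k * q / m."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(0)
n_models, n_regions = 182, 68  # behaviors x cortical thickness estimates
total_findings = 0
for i in range(1, n_models + 1):
    p_vals = rng.uniform(size=n_regions)  # stand-in for one model's p-values
    q_i = 0.05 / i                        # alpha-debt level for the i-th model
    total_findings += bh_reject(p_vals, q_i).sum()
```

With the real data, each of the 182 models would contribute its own 68 p-values, and the test order (`i`) would be permuted as described above.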


Figure 3. Demonstrating the impact of different correction procedures with a real dataset. (A) The number of significant statistical tests (x-axis) that are possible for various correction procedures in a real dataset from the Human Connectome Project: see the main text for more details, Supplementary file 1 for a list of the variables used in the analysis, and https://github.com/wiheto/datasetdecay (copy archived at https://github.com/elifesciences-publications/datasetdecay) for the code. (B) The potential number of publications (x-axis) that could result from the tests shown in panel A. This assumes that a publication requires a null hypothesis to be rejected in order to yield a positive finding. The dotted line shows the baseline from the two simultaneous correction procedures. Error bars show the standard deviation, and circles mark the min/max number of findings/studies for the sequential correction procedures with a randomly permuted test order.

Figure 3 shows the results for all correction procedures. Using sequentially uncorrected tests leads to an increase in positive findings (30/44 Bonferroni/FDR), compared to a baseline of 2 findings when correcting for all tests simultaneously (for both Bonferroni and FDR procedures). The sequentially uncorrected procedures would also result in 29/30 (Bonferroni/FDR) publications that claim to identify at least one positive result, instead of the simultaneous baseline of two publications (Bonferroni and FDR), reflecting a 1,400% increase in publications claiming positive results. If we accept that the two baseline estimates are a good trade-off between error rates, then we have good reason to believe this increase reflects false positives.

The sequential correction procedures were closer to baseline but saw variation based on the order of the statistical tests. If the order was completely random, then α-debt found, on average, 2.77 positive findings (min/max: 2/6), and 2.53 publications claiming positive results (min/max: 2/4) would be published. The random order leads to an increase in the number of false positives compared to baseline, but considerably less than the sequentially uncorrected procedure. In contrast, α-spending found 0.33 positive findings (min/max: 0/5), resulting in 0.22 studies with positive findings (min/max: 0/2), and α-investing found 0.48 (min/max: 0/8) positive findings and 0.37 (min/max: 0/4) studies with positive findings, both of which are below the conservative baseline of 2. When the order is informed by the baseline findings, the sequential correction procedures increase in the number of findings (findings [min/max]: α-debt: 3.49 [2/7], α-spending: 2.58 [1/4], α-investing: 3.54 [1/10]; and publications with positive findings [min/max]: α-debt: 2.38 [2/4], α-spending: 1.97 [1/3], α-investing: 2.54 [1/5]). All procedures now increase their number of findings above baseline. On average, α-debt with a random order has a 19% increase in the number of published studies with positive findings, substantially less than the increase in the number of uncorrected studies. Two conclusions emerge. First, α-debt remains sensitive to the number of findings found regardless of the sequence of tests (fewer type II errors) and can never fall above Bonferroni in regards to type II errors. At the same time, the other two sequential procedures can be more conservative than Bonferroni. Second, while α-debt does not ensure the false positive rate remains under a specific level (more type I errors), it dramatically closes the gap between the uncorrected and simultaneous number of findings.

We have shown, with both simulation and an empirical example, how sequential statistical tests, if left uncorrected, will lead to a rise in false positive results. Further, we have explored different sequential correction procedures and shown their susceptibility to both false negatives and false positives. Broadly, we conclude that the potential of a dataset to identify new statistically significant relationships will decay over time as the number of sequential statistical tests increases when controlling for sequential tests. In the rest of the discussion section, we first discuss the implications the different sequential

Thompson et al. eLife 2020;9:e53498. DOI: https://doi.org/10.7554/eLife.53498 8 of 17 Feature Article Meta-Research Dataset decay and the problem of sequential analyses on open datasets

procedures have in regards to the desiderata outlined in the introduction. Then we discuss other possible solutions that could potentially mitigate dataset decay.

Consequence for sequential tests and open data

We stated three desiderata for open data in the introduction: sharing incentive, open access, and a stable false positive rate. Having demonstrated some properties of sequential correction procedures, we revisit these aims and consider how the implementation of sequential correction procedures in practice would meet these desiderata. The current practice of leaving sequential hypothesis tests uncorrected leads to a dramatic increase in the false positive rate. While our proposed sequential correction techniques would mitigate this problem, all three require compromising on one or more of the desiderata (summarized in Table 1).

Implementing α-spending would violate the sharing incentive desideratum, as it forces the initial analysis to use a smaller statistical threshold to avoid using the entire wealth of α. This change could potentially happen with appropriate institutional change, but placing restrictions on the initial investigator(s) (and an increased type II error rate) would likely serve as a disincentive for those researchers to share their data. It also places incentives to restrict access to open data (violating the open access desideratum), as performing additional tests would lead to a more rapid decay in the ability to detect true positives in a given dataset.

Implementing α-investing would violate the open access desideratum for two reasons. First, like α-spending, there is an incentive to restrict incorrect statistical tests due to the sensitivity to order. Second, α-investing would require tracking and time-stamping all statistical tests made on the dataset. Given the known issues of the file drawer problem (Rosenthal, 1979), this is currently problematic to implement (see below). Also, publication bias for positive outcomes would result in statistical thresholds becoming more lenient over time with this correction procedure, which might lead to even more false positives (thus violating the no increase in false positives desideratum). Unless all statistical tests are time-stamped, which is possible but would require significant institutional change, this procedure would be hard to implement.

Implementing α-debt would improve upon current practices but will compromise on the stable false positive rate desideratum. However, it will have little effect on the sharing incentive desideratum, as the original study does not need to account for any future sequential tests. The open-access desideratum is also less likely to be compromised, as it is less critical to identify the true positives directly (i.e. it has the lowest type II error rate of the sequential procedures). Finally, while compromising the false positive desideratum, its false positive rate is a marked improvement compared to sequentially uncorrected tests.

Finally, a practical issue that must be taken into consideration with all sequential correction procedures is whether it is ever possible to know the actual number of tests performed on an unrestricted dataset. This issue relates to the file drawer problem, where there is a bias towards the publication of positive findings compared to null findings (Rosenthal, 1979). Until this is resolved, to fully sequentially correct for the number of previous tests, an estimation of the number of tests may be required (e.g. by identifying publication biases; Samartsidis et al., 2017; Simonsohn et al., 2013). Using such estimations is less problematic with α-debt, because this only requires the number of tests to be known. Comparatively, α-investing requires the entire results chain of statistical tests to be known, and α-spending requires knowing every α value that has been used, both of which would require additional assumptions to estimate. However, even if the α-debt correction underestimates the number of previous statistical tests, the number of false positives will be reduced compared to no sequential correction.
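As a rough illustration of this trade-off, the family-wise error rate across independent tests can be computed directly. This is a standard textbook formula rather than a calculation from the article, and the independence assumption is ours:

```python
# Family-wise error rate (FWER): the probability of at least one false
# positive across a series of independent tests with given per-test
# thresholds. Standard formula, used here only to illustrate the
# trade-off between uncorrected and sequentially corrected reuse.
def fwer(thresholds):
    p_no_false_positive = 1.0
    for a in thresholds:
        p_no_false_positive *= (1 - a)
    return 1 - p_no_false_positive

m = 100
uncorrected = fwer([0.05] * m)                    # every reuse tests at 0.05
debt = fwer([0.05 / i for i in range(1, m + 1)])  # alpha-debt style thresholds
print(f"uncorrected: {uncorrected:.2f}")  # near 1: a false positive is almost certain
print(f"alpha-debt:  {debt:.2f}")         # grows with m, but far more slowly
```

Under these assumptions, the uncorrected family-wise error rate is already near one after 100 reuses, while the α-debt schedule lets it grow only slowly.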

Table 1. Summary of the different sequential correction methods and the open-data desiderata. Yes indicates that the method is compatible with the desideratum.

              Sharing incentive   Open access   Stable false positive rate
α-spending    No                  No            Yes
α-investing   Yes                 No            Yes
α-debt        Yes                 Yes           No

Towards a solution

Statistics is a multifaceted tool for experimental researchers to use, but it (rarely) aims to provide universal solutions for all problems and use cases. Thus, it may be hard to expect a one-size-fits-all solution to the problem of sequential tests on open data. Indeed, the idiosyncrasies within different disciplines regarding the size of data, open data infrastructure, and how often new data is collected may necessitate that they adopt different solutions. Thus, any prescription we offer now is, at best, tentative. Further, the solutions also often compromise the desiderata in some way. That being said, there are some suggestions which should assist in mitigating the problem to different degrees. Some of these suggestions only require the individual researcher to adapt their practices, others require entire disciplines to form a consensus, and others require infrastructural changes. This section deals with solutions compatible with the null hypothesis testing framework; the next section considers solutions specific to other perspectives.

Preregistration grace period of analyses prior to open data release

To increase the number of confirmatory analyses that can be performed on an open dataset, one possible solution is to have a 'preregistration grace period'. Here a description of the data can be provided, and data re-users will have the opportunity to write a preregistration prior to the data being released. This solution allows for confirmatory analyses to be performed on open data while simultaneously being part of different statistical families. This idea follows the Wagenmakers et al., 2012 definition of confirmatory analysis. Consequently, once the dataset or the first study using the dataset are published, the problems outlined in this paper will remain for all subsequent (non-preregistered) analyses reusing the data.

Increased justification of the statistical family

One of the recurring problems regarding statistical testing is that, given the Wagenmakers et al., 2012 definition, it is hard to class open data reuse as confirmatory after data release. However, if disciplines decide that confirmatory analyses on open data (post-publication) are possible, one of our main arguments above is that a new paper does not automatically create a new statistical family. If researchers can, for other reasons, justify why their statistical family is separate in their analysis, and state how it is different from previous statistical tests performed on the data, there is no necessity to sequentially correct. Thus, providing sufficient justification for a new family in a paper can effectively reset the alpha wealth.

Restrained or coordinated alpha-levels

One of the reasons the α-values decay quickly in α-investing and α-spending is that the 50% invest/spend rate we chose in this article uses a large portion of the total α-wealth in the initial statistical tests. For example, the first two tests in α-spending use 75% of the overall α-wealth. Different spending or investing strategies are possible, which could restrain the decay of the remaining α-wealth, allowing for more discoveries in later statistical tests. For example, a discipline could decide that the first ten statistical tests spend 5% of the α-wealth, and the next ten spend 2.5% of the overall wealth. Such a strategy would still always remain under the overall wealth, but allow more people to utilize the dataset. However, imposing this restrained or fair use of α-spending would either require consensus from all researchers (a strategy that would be in vain if just one researcher fails to comply) or restricting data access (compromising the open access desideratum). Importantly, this solution does not eliminate the decay of the alpha threshold; it just reduces the decay.

Metadata about reuse coupled to datasets

One of the problems regarding sequential corrections is knowing how many tests have been made using the dataset. This issue was partially addressed above with suggestions for estimating the number of preceding tests. Additionally, repositories could provide information about all known previous uses of the data. Thus, if data repositories were able to track summaries of the tests performed and which variables were involved in the tests, this would, at the very least, help guide future users with rough estimates. In order for this number to be precise, it would, however, require limiting the access to the dataset (compromising the open access desideratum).

Held out data on repositories

A way to allow hypothesis testing or predictive frameworks (see below) to reuse the data is if the infrastructure exists that prevents the researcher from ever seeing some portion of the

data. Dataset repositories could hold out data which data re-users can query their results against, to either replicate their findings or test their predictive models. This perspective has seen success in machine learning competitions which hold out test data. Additional requirements could be added to this perspective, such as requiring preregistrations in order to query the held-out data. However, there have been concerns that held-out data can lead to overfitting (e.g. by copying the best fitting model) (Neto et al., 2016), although others have argued this does not generally appear to be the case when evaluating overfitting (Roelofs et al., 2019). However, Roelofs et al., 2019 noted that overfitting appears to occur on smaller datasets, which might prevent it from being a general solution for all disciplines.

Narrow hypotheses and minimal statistical families

One way to avoid the sequential testing problem is to ensure small family sizes. If we can justify that there should be inherently small family sizes, then there is no need to worry about the sequential problems outlined here. This solution would also entail that each researcher does not need to justify their own particular family choice (as suggested above), but rather that a specific consensus of what the contested concept family actually is be achieved. This would require: (1) deciding that confirmatory hypothesis testing on open data is possible; and (2) encouraging narrow (i.e. very specific) hypotheses that will help maintain minimal family sizes, as the specificity of the hypothesis will limit the overlap with any other statistical test. Narrow hypotheses for confirmatory analyses can lead to families which are small, and can avoid correcting for multiple statistical tests (both simultaneous and sequential). This strategy is a possible solution to the problem. However, science does not merely consist of narrow hypotheses. Broader hypotheses can still be used in confirmatory studies (for example, genetic or neuroimaging datasets often ask broader questions, not knowing which specific gene or brain area is involved, but knowing that a gene or brain region should be involved to confirm a hypothesis about a larger mechanism). Thus, while possibly solving a portion of the problem, this solution is unlikely to be a general solution for all fields, datasets, and types of hypotheses.

Different perspective-specific solutions regarding sequential testing

The solutions above focused on possible solutions compatible within the null hypothesis testing framework to deal with sequential statistical tests, although many are compatible with other perspectives as well. There are a few other perspectives about data analysis and statistical inference that are worth considering, three of which we discuss here. Each provides some perspective-specific solution to the sequential testing problem. Any of these possible avenues may be superior to the ones we have considered in this article, but none appear to be readily applicable in all situations without some additional considerations.

The first alternative is Bayesian statistics. Multiple comparisons in Bayesian frameworks are often circumnavigated by partial pooling and regularizing priors (Gelman et al., 2013; Kruschke and Liddell, 2017). While Bayesian statistics can suffer from similar problems as NHST if misapplied (Gigerenzer and Marewski, 2014), it often deals with multiple tests without explicitly correcting for them, and may provide an avenue for sequential correction to be avoided. These techniques should allow for the sequential evaluation of different independent variables against a single dependent variable when using regularizing priors, especially as these different models could also be contrasted explicitly to see which model fits the data best. However, sequential tests could be problematic when the dependent variable changes and the false positive rate should be maintained across models. If uncorrected, this could create a similar sequential problem as outlined in the empirical example in the article. Nevertheless, there are multiple avenues where this could be fixed (e.g. sequentially adjusting the prior odds in Bayes-factor inferences). The extent of sequential analysis on open datasets within Bayesian hypothesis testing frameworks, and possible solutions, is an avenue of future investigation.

The second alternative is using held-out data within prediction frameworks. Instead of using a statistical test, this framework evaluates a model by how well it performs on predicting unseen test data (Yarkoni and Westfall, 2017). However, a well-known problem when creating models to predict on test datasets is overfitting. This phenomenon occurs, for example, if a researcher queries the test dataset multiple times. Reusing test data will occur when


sequentially reusing open data. Held-out data on data repositories, as discussed above, is one potential solution here. Further, within machine learning, there have been advances towards having reusable held-out data that can be queried multiple times (Dwork et al., 2015; Dwork et al., 2017; Rogers et al., 2019). This avenue is promising, but there appear to be some drawbacks for sequential reuse. First, this line of work within 'adaptive data analysis' generally considers a single user querying the held-out test data multiple times while optimizing their model/analysis. Second, this is ultimately a cross-validation technique, which is not necessarily the best tool in datasets where sample sizes are small (Varoquaux, 2018), which is often the case with open data, and thus not a general solution to this problem. Third, additional assumptions exist in these methods (e.g., there is still a 'budget limit' in Dwork et al., 2015, and 'mostly guessing correctly' is required in Rogers et al., 2019). However, this avenue of research has the potential to provide a better solution than what we have proposed here.

The third and perhaps most radical alternative is to consider all open data analysis to be exploratory data analysis (EDA). In EDA, the primary utility becomes generating hypotheses and testing assumptions of methods (Donoho, 2017; Jebb et al., 2017; Thompson et al., 2020; Tukey, 1980). Some may still consider this reframing problematic, as it could make findings based on open data seem less important. However, accepting that all analyses on open data are EDA would involve less focus on statistical inference, and the sequential testing problem disappears. An increase of exploratory analyses would lead to an increase of EDA results which may not replicate. However, this is not necessarily problematic. There would be no increase of false positives within confirmatory studies in the scientific literature, and the increase in EDA studies will provide a fruitful guide about which confirmatory studies to undertake. Implementing this suggestion would require little infrastructural or methodological change; however, it would require an institutional shift in how researchers interpret open data results. This suggestion of EDA on open data also fits with recent proposals calling for exploration to be conducted openly (Thompson et al., 2020).

Conclusion

One of the benefits of open data is that it allows multiple perspectives to approach a question, given a particular sample. The trade-off of this benefit is that more false positives will enter the scientific literature. We remain strong advocates of open data and data sharing. We are not advocating that every single reuse of a dataset must necessarily correct for sequential tests, and we have outlined multiple circumstances throughout this article where this is the case. However, researchers using openly shared data should be sensitive to the possibility of accumulating false positives and the ensuing dataset decay that will occur with repeated reuse. Ensuring findings are replicated using independent samples will greatly decrease the false positive rate, since the chance of two identical false positive relationships occurring, even on well-explored datasets, is small.

Methods

Preliminary assumptions

In this article, we put forward the argument that sequential statistical tests on open data could lead to an increase in the number of false positives. This argument requires several assumptions regarding (1) the type of datasets analyzed; (2) what kind of statistical inferences are performed; and (3) the types of sequential analyses considered.

The type of dataset

We consider a dataset to be a fixed static snapshot of data collected at a specific time point. There are other cases of combining datasets, or datasets that grow over time, but we will not consider those here. Second, we assume a dataset to be a random sample of a population, and not a dataset that contains information about a full population.

The type of statistical testing

We have framed our discussion of statistical inference using null hypothesis statistical testing (NHST). This assumption entails that we will use thresholded p-values to infer whether a finding differs from the null hypothesis. Our decision for this choice is motivated by a belief that the NHST framework is the most established framework for dealing with multiple statistical tests. There have been multiple valid critiques and suggestions to improve upon this statistical practice by moving away from thresholded p-values to evaluate hypotheses (Cumming, 2014; Ioannidis, 2019; Lee, 2016; McShane et al., 2019; Wasserstein et al.,


2019). Crucially, however, many proposed alternative approaches within statistical inference do not circumnavigate the problem of multiple statistical testing. For example, if confidence intervals are reported and used for inference regarding hypotheses, these should also be adjusted for multiple statistical tests (see, e.g., Tukey, 1991). Thus, any alternative statistical framework that still must correct for multiple simultaneous statistical tests will have the same sequential statistical testing problem that we outline here. Thus, while we have chosen NHST for simplicity and familiarity, this does not entail that the problem is isolated to NHST. Solutions for different frameworks may, however, differ (see the discussion section for Bayesian approaches and prediction-based inference perspectives).

The types of analyses

Sequential analyses involve statistical tests on the same data. Here, we consider sequential analyses that reuse the same data and analyses that are part of the same statistical family (see the section on statistical families for more details). Briefly, this involves either the statistical inferences being classed as exploratory, or answering the same confirmatory hypothesis or research question. Further, we only consider analyses that are not supersets of previous analyses. This assumption entails that we are excluding analyses that may improve upon a previous statistical model by, for example, adding an additional layer in a hierarchical model. Other types of data reuse may not be appropriate for sequential correction methods and are not considered here.

While we have restrained our analysis with these assumptions and definitions, this is done primarily to simplify the argument regarding the problem we are identifying. The degree to which sequential tests are problematic in more advanced cases remains outside the scope of this paper.

Simulations

The first simulation sampled data for one dependent variable and 100 independent variables from a multivariate Gaussian distribution (mean: 0, standard deviation: 1, covariance: 0). We conducted 100 different pairwise sequential analyses in a random order. For each analysis, we quantified the relationship between an independent variable and the dependent variable using a Pearson correlation. If the correlation had a two-tailed p-value less than 0.05, we considered it to be a false positive. The simulation was repeated for 1000 iterations.

The second simulation had three additional variables. First, a variable that controlled the number of true positives in the data; this variable varied between 1-10. Second, the selected true positive variables, along with the dependent variable, had their covariance assigned as ρ; ρ varied between 0 and 1 in steps of 0.025. Finally, we wanted to test the effect of the analysis order, to identify when the true positives were included in the statistical tests. Each sequential analysis, (m1, m2, m3 ...), could be assigned to be a 'true positive' (i.e., covariance of ρ with the dependent variable) or a 'true negative' (covariance of 0 with the dependent variable). First, m1 would be assigned one of the trials, then m2, and so forth. This procedure continued until there were only true positives or true negatives remaining. The procedure assigns the ith analysis randomly, weighted by λ. If λ was 0, then there was a 50% chance that mi would be a true positive or true negative. If λ was 1, a true positive was 100% more likely to be assigned to mi (i.e. an odds ratio of 1+λ:1). The reverse occurred if λ was negative (i.e. -1 meant a true negative was 100% more likely at mi).

Empirical example

Data from the Human Connectome Project (HCP) 1200 subject release was used (Van Essen et al., 2012). We selected 68 estimates of cortical thickness to be the independent variables for 182 continuous behavioral and psychological variables as dependent variables. Whenever possible, the age-adjusted values were used. Supplementary file 1 shows the variables selected in the analysis.

For each analysis, we fit an ordinary least squares model using Statsmodels (0.10.0-dev0+1579, https://github.com/statsmodels/statsmodels/). For all statistical models, we first standardized all variables to have a mean of 0 and a standard deviation of 1. We dropped any missing values for a subject for that specific analysis. Significance was considered for any independent variable if it had a p-value < 0.05 (two-tailed) for the different correction methods.

We then quantified the number of findings and the number of potential published studies with positive results that the different correction methods would present. The number of findings is the sum of independent variables that were considered positive findings (i.e. p < 0.05, two-tailed). The number of potential studies that

identify positive results is the number of dependent variables that had at least one positive finding. The rationale for the second metric is to consider how many potential non-null-finding publications would exist in the literature if a separate group conducted each analysis.

For the sequential correction procedures, we used two different orderings of the tests. The first was a uniformly random order. The second was an informed order that pretends we somehow a priori knew which variables would be correlated. The motivation behind an informed order is that it may be unrealistic that scientists ask sequential questions in a random order. The 'informed' order was created by identifying the significant statistical tests when using the simultaneous correction procedures (see below). With the baseline results, we identified the analyses which were baseline positives (i.e. significant with any of the simultaneous baseline procedures; there were two such analyses) and the other analyses that were baseline negatives. Then, as in simulation 2, the first analysis, m1, was randomly assigned to be a baseline positive or negative with equal probability. This informed ordering means that the baseline positives would usually appear earlier in the sequence order.

Sequential correction procedures

Uncorrected. This procedure is to not correct for any sequential analyses. This is analogous to reusing open data with no consideration for any sequential tests that occur due to data reuse. For all sequential hypothesis tests, p < 0.05 was considered a significant or positive finding.

α-debt. A sequential correction procedure that, to our knowledge, has not previously been proposed. At the first hypothesis tested, α1 sets the statistical significance threshold (here 0.05). At the ith hypothesis tested, the statistical threshold is αi = α1 / i. The rationale here is that, at the ith test, a Bonferroni-like correction is applied that considers there to be i tests performed. This method lets the false positive rate increase (i.e. the debt of reusing the dataset), as each test corrects for the overall number of tests but all earlier tests have a more liberal threshold. The total possible 'debt' incurred for m sequential tests can be calculated as the sum of αi for i = 1 to m, and will determine the actual false positive rate.

α-spending. A predefined α0 is selected, which is called the α-wealth. At the ith test, the statistical threshold αi is a value selected to meet the condition that the sum of αj for j = 1 to i never exceeds the α-wealth α0.
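A minimal sketch of the two threshold rules defined above; the 50% spending rate matches the rate used in this article, while the specific numbers printed are only illustrative:

```python
# alpha-debt: the ith sequential test uses alpha_1 / i, so each test is
# Bonferroni-corrected for the i tests performed so far; the accumulated
# "debt" is the sum of all thresholds used.
def alpha_debt(i, alpha1=0.05):
    return alpha1 / i

def total_debt(m, alpha1=0.05):
    return sum(alpha_debt(i, alpha1) for i in range(1, m + 1))

# alpha-spending: each test spends a fraction of the remaining
# alpha-wealth, so the thresholds can never sum to more than alpha_0.
def spending_thresholds(alpha0=0.05, rate=0.5, n_tests=5):
    thresholds, remaining = [], alpha0
    for _ in range(n_tests):
        spend = remaining * rate   # alpha spent on this test
        thresholds.append(spend)
        remaining -= spend         # wealth left for future tests
    return thresholds

print(alpha_debt(10))            # 0.005: the Bonferroni threshold for 10 tests
print(round(total_debt(10), 4))  # 0.1464: the debt grows like the harmonic series
ts = spending_thresholds()       # first two tests spend 75% of the wealth
print(sum(ts) < 0.05)            # True: total spent alpha stays under the wealth
```

Note the contrast: the α-debt thresholds sum without bound (harmonically), while the α-spending thresholds are constrained to stay under α0 forever.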


When combining the simultaneous and sequential correction procedures in the empirical example, we used the sequential correction procedure to derive αi, which we then used as the threshold in the simultaneous correction.

Additional files

Supplementary files
Supplementary file 1. The variables selected for analysis in Figure 3.
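The combining step described above can be sketched as follows; the Bonferroni helper and the example p-values are our illustration, not the article's code:

```python
# Combining a sequential and a simultaneous correction: the sequential
# procedure supplies the level alpha_i for the ith reuse of the dataset,
# and that level is then used as the family-wise level of the
# simultaneous (here Bonferroni) correction within the analysis.
def bonferroni_reject(pvals, alpha):
    threshold = alpha / len(pvals)
    return [p < threshold for p in pvals]

i = 3                   # hypothetical third sequential reuse of the dataset
alpha_i = 0.05 / i      # alpha-debt threshold for this reuse
pvals = [0.001, 0.004, 0.03, 0.2]
print(bonferroni_reject(pvals, alpha_i))  # [True, True, False, False]
```

The same pattern applies with FDR in place of Bonferroni: the sequential αi simply replaces the conventional 0.05 as the level of the simultaneous procedure.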

Transparent reporting form

Acknowledgements

We thank Pontus Plavén-Sigray, Lieke de Boer, Nina Becker, Granville Matheson, Björn Schiffler, and Gitanjali Bhattacharjee for helpful discussions and feedback.

William Hedley Thompson is in the Department of Psychology, Stanford University, Stanford, United States, and the Department of Clinical Neuroscience, Karolinska Institutet, Stockholm, Sweden
[email protected]
https://orcid.org/0000-0002-0533-6035

Jessey Wright is in the Department of Psychology and the Department of Philosophy, Stanford University, Stanford, United States
https://orcid.org/0000-0001-5003-0572

Patrick G Bissett is in the Department of Psychology, Stanford University, Stanford, United States

Russell A Poldrack is in the Department of Psychology, Stanford University, Stanford, United States
https://orcid.org/0000-0001-6755-0259

Author contributions: William Hedley Thompson, Conceptualization, Formal analysis, Investigation, Visualization, Methodology; Jessey Wright, Patrick G Bissett, Conceptualization; Russell A Poldrack, Conceptualization, Resources, Supervision

Competing interests: The authors declare that no competing interests exist.

Received 12 November 2019
Accepted 04 May 2020
Published 19 May 2020

Funding
Funder: Knut och Alice Wallenbergs Stiftelse. Grant reference number: 2016.0473. Author: William Hedley Thompson.
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Decision letter and Author response
Decision letter: https://doi.org/10.7554/eLife.53498.sa1
Author response: https://doi.org/10.7554/eLife.53498.sa2

Data availability

All empirical data used in Figure 3 originates from the Human Connectome Project (https://www.humanconnectome.org/) from the 1200 healthy subject release. Code for the simulations and analyses is available at https://github.com/wiheto/datasetdecay.

References

Aharoni E, Rosset S. 2014. Generalized α-investing: definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B 76:771–794. DOI: https://doi.org/10.1111/rssb.12048
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B 57:289–300. DOI: https://doi.org/10.2307/2346101
Bennett CM, Wolford GL, Miller MB. 2009. The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience 4:417–422. DOI: https://doi.org/10.1093/scan/nsp053, PMID: 20042432
Bennett CM, Baird AA, Miller MB, George L. 2010. Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: an argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results 1:1–5. DOI: https://doi.org/10.1016/S1053-8119(09)71202-9
Button KS. 2019. Double-dipping revisited. Nature Neuroscience 22:688–690. DOI: https://doi.org/10.1038/s41593-019-0398-z, PMID: 31011228
Cox DR. 1965. A remark on multiple comparison methods. Technometrics 7:223–224. DOI: https://doi.org/10.1080/00401706.1965.10490250
Cramer AO, van Ravenzwaaij D, Matzke D, Steingroever H, Wetzels R, Grasman RP, Waldorp LJ, Wagenmakers EJ. 2016. Hidden multiplicity in exploratory multiway ANOVA: prevalence and remedies. Psychonomic Bulletin & Review 23:640–647. DOI: https://doi.org/10.3758/s13423-015-0913-5, PMID: 26374437
Cumming G. 2014. The new statistics: why and how. Psychological Science 25:7–29. DOI: https://doi.org/10.1177/0956797613504966, PMID: 24220629
Donoho D. 2017. 50 years of data science. Journal of Computational and Graphical Statistics 26:745–766. DOI: https://doi.org/10.1080/10618600.2017.1384734
Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. 2015. Preserving statistical validity in adaptive data analysis. Proceedings of the Annual ACM Symposium on Theory of Computing 14–17. DOI: https://doi.org/10.1145/2746539.2746580
Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. 2017. Guilt-free data reuse. Communications of the ACM 60:86–93. DOI: https://doi.org/10.1145/3051088

Thompson et al. eLife 2020;9:e53498. DOI: https://doi.org/10.7554/eLife.53498 15 of 17 Feature Article Meta-Research Dataset decay and the problem of sequential analyses on open datasets

Foster DP, Stine RA. 2008. α-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B 70:429–444. DOI: https://doi.org/10.1111/j.1467-9868.2007.00643.x

Games PA. 1971. Multiple comparisons of means. American Educational Research Journal 8:531–565. DOI: https://doi.org/10.3102/00028312008003531

Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. 2013. Bayesian Data Analysis. Chapman and Hall/CRC.

Gigerenzer G, Marewski JN. 2014. Surrogate science: the idol of a universal method for scientific inference. Journal of Management 41:421–440. DOI: https://doi.org/10.1177/0149206314547522

Hancock GR, Klockars AJ. 1996. The quest for α: developments in multiple comparison procedures in the quarter century since Games (1971). Review of Educational Research 66:269–306. DOI: https://doi.org/10.2307/1170524

Hochberg J, Tamhane AC. 1987. Introduction. In: Hochberg J, Tamhane AC (Eds). Multiple Comparison Procedures. John Wiley & Sons. p. 1–16. DOI: https://doi.org/10.1002/9780470316672

Ioannidis JPA. 2019. Options for publishing research without any P-values. European Heart Journal 40:2555–2556. DOI: https://doi.org/10.1093/eurheartj/ehz556, PMID: 31411717

Jebb AT, Parrigon S, Woo SE. 2017. Exploratory data analysis as a foundation of inductive research. Human Resource Management Review 27:265–276. DOI: https://doi.org/10.1016/j.hrmr.2016.08.003

Kruschke JK, Liddell TM. 2017. The Bayesian new statistics: hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review 25:178–206. DOI: https://doi.org/10.3758/s13423-016-1221-4

Lee DK. 2016. Alternatives to P value: confidence interval and effect size. Korean Journal of Anesthesiology 69:555–562. DOI: https://doi.org/10.4097/kjae.2016.69.6.555, PMID: 27924194

Maxwell SE, Delaney H. 2004. Designing experiments and analyzing data: A model comparison perspective. In: Mixed Models. 2nd Ed. Lawrence Erlbaum Associates, Inc, Publishers.

Mayo D. 2017. A poor prognosis for the diagnostic screening critique of statistical tests. OSF Preprints. DOI: https://doi.org/10.17605/OSF.IO/PS38B

McKiernan EC, Bourne PE, Brown CT, Buck S, Kenall A, Lin J, McDougall D, Nosek BA, Ram K, Soderberg CK, Spies JR, Thaney K, Updegrove A, Woo KH, Yarkoni T. 2016. How open science helps researchers succeed. eLife 5:e16800. DOI: https://doi.org/10.7554/eLife.16800, PMID: 27387362

McShane BB, Gal D, Gelman A, Robert C, Tackett JL. 2019. Abandon statistical significance. The American Statistician 73:235–245. DOI: https://doi.org/10.1080/00031305.2018.1527253

Miller RG. 1981. Simultaneous Statistical Inference. New York, NY: Springer. DOI: https://doi.org/10.1007/978-3-642-45182-9

Molloy JC. 2011. The Open Knowledge Foundation: open data means better science. PLOS Biology 9:e1001195. DOI: https://doi.org/10.1371/journal.pbio.1001195, PMID: 22162946

Neto EC, Hoff BR, Bare C, Bot BM, Yu T, Magravite L, Stolovitzky G. 2016. Reducing overfitting in challenge-based competitions. arXiv. http://arxiv.org/abs/1607.00091

Nosek BA, Alter G, Banks GC, Borsboom D, Bowman SD, Breckler SJ, Buck S, Chambers CD, Chin G, Christensen G, Contestabile M, Dafoe A, Eich E, Freese J, Glennerster R, Goroff D, Green DP, Hesse B, Humphreys M, Ishiyama J, et al. 2015. Promoting an open research culture. Science 348:1422–1425. DOI: https://doi.org/10.1126/science.aab2374

Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:aac4716. DOI: https://doi.org/10.1126/science.aac4716, PMID: 26315443

Perneger TV. 1998. What's wrong with Bonferroni adjustments. BMJ 316:1236–1238. DOI: https://doi.org/10.1136/bmj.316.7139.1236, PMID: 9553006

Pisani E, Aaby P, Breugelmans JG, Carr D, Groves T, Helinski M, Kamuya D, Kern S, Littler K, Marsh V, Mboup S, Merson L, Sankoh O, Serafini M, Schneider M, Schoenenberger V, Guerin PJ. 2016. Beyond open data: realising the health benefits of sharing data. BMJ 355:i5295. DOI: https://doi.org/10.1136/bmj.i5295, PMID: 27758792

Roelofs R, Miller J, Hardt M, Fridovich-Keil S, Schmidt L, Recht B. 2019. A meta-analysis of overfitting in machine learning. Advances in Neural Information Processing Systems. http://papers.neurips.cc/paper/9117-a-meta-analysis-of-overfitting-in-machine-learning

Rogers R, Roth A, Smith A, Srebro N, Thakkar O, Woodworth B. 2019. Guaranteed validity for empirical approaches to adaptive data analysis. arXiv. https://arxiv.org/pdf/1906.09231.pdf

Rosenthal R. 1979. The file drawer problem and tolerance for null results. Psychological Bulletin 86:638–641. DOI: https://doi.org/10.1037/0033-2909.86.3.638

Samartsidis P, Montagna S, Laird AR, Fox PT, Johnson TD, Nichols TE. 2017. Estimating the number of missing experiments in a neuroimaging meta-analysis. bioRxiv. DOI: https://doi.org/10.1101/225425

Simonsohn U, Nelson LD, Simmons JP. 2013. P-curve: a key to the file drawer. Journal of Experimental Psychology: General 143:1–38. DOI: https://doi.org/10.1037/a0033242

Thompson WH, Wright J, Bissett PG. 2020. Open exploration. eLife 9:e52157. DOI: https://doi.org/10.7554/eLife.52157, PMID: 31916934

Thompson WH. 2020. datasetdecay. GitHub. c06a705. https://github.com/wiheto/datasetdecay

Tukey JW. 1980. We need both exploratory and confirmatory. American Statistician 34:23–25. DOI: https://doi.org/10.1080/00031305.1980.10482706

Tukey JW. 1991. The philosophy of multiple comparisons. Statistical Science 6:100–116. DOI: https://doi.org/10.1214/ss/1177011945

Van Essen DC, Ugurbil K, Auerbach E, Barch D, Behrens TE, Bucholz R, Chang A, Chen L, Corbetta M, Curtiss SW, Della Penna S, Feinberg D, Glasser MF, Harel N, Heath AC, Larson-Prior L, Marcus D, Michalareas G, Moeller S, Oostenveld R, et al. 2012. The Human Connectome Project: a data acquisition perspective. NeuroImage 62:2222–2231. DOI: https://doi.org/10.1016/j.neuroimage.2012.02.018, PMID: 22366334


Van Essen DC, Smith SM, Barch DM, Behrens TE, Yacoub E, Ugurbil K, WU-Minn HCP Consortium. 2013. The WU-Minn Human Connectome Project: an overview. NeuroImage 80:62–79. DOI: https://doi.org/10.1016/j.neuroimage.2013.05.041, PMID: 23684880

Varoquaux G. 2018. Cross-validation failure: small sample sizes lead to large error bars. NeuroImage 180:68–77. DOI: https://doi.org/10.1016/j.neuroimage.2017.06.061, PMID: 28655633

Wagenmakers EJ, Wetzels R, Borsboom D, van der Maas HL, Kievit RA. 2012. An agenda for purely confirmatory research. Perspectives on Psychological Science 7:632–638. DOI: https://doi.org/10.1177/1745691612463078, PMID: 26168122

Wasserstein RL, Schirm AL, Lazar NA. 2019. Moving to a world beyond "p<0.05". American Statistician 73:1–19. DOI: https://doi.org/10.1080/00031305.2019.1583913

Weston SJ, Ritchie SJ, Rohrer JM, Przybylski AK. 2019. Recommendations for increasing the transparency of analysis of preexisting data sets. Advances in Methods and Practices in Psychological Science 2:214–227. DOI: https://doi.org/10.1177/2515245919848684, PMID: 32190814

Yarkoni T, Westfall J. 2017. Choosing prediction over explanation in psychology: lessons from machine learning. Perspectives on Psychological Science 12:1100–1122. DOI: https://doi.org/10.1177/1745691617693393, PMID: 28841086
