The Statistical Crisis in Science

Data-dependent analysis (a "garden of forking paths") explains why many statistically significant comparisons don't hold up.

Andrew Gelman and Eric Loken

Andrew Gelman is a professor in the departments of statistics and political science at Columbia University and the author of Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do (2008). Eric Loken is a research associate professor of human development at Pennsylvania State University. E-mail: gelman@stat.columbia.edu

American Scientist, Volume 102, page 460

There is a growing realization that reported "statistically significant" claims in scientific publications are routinely mistaken. Researchers typically express the confidence in their data in terms of a p-value: the probability of seeing a result at least as extreme as the one observed if random variation alone were at work. The value of p (for "probability") is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p-value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear.

The idea is that when p is less than some prespecified value such as 0.05, the null hypothesis is rejected by the data, allowing researchers to claim strong evidence in favor of the alternative. The concept of p-values was originally developed by statistician Ronald Fisher in the 1920s in the context of his research on crop variance in Hertfordshire, England. Fisher offered the idea of p-values as a means of protecting researchers from declaring truth based on patterns in noise. In an ironic twist, p-values are now often manipulated to lend credence to noisy claims based on small samples.

In general, p-values are based on what would have happened under other possible data sets. As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military. The question may be framed nonspecifically as an investigation of possible associations between party affiliation and mathematical reasoning across contexts. The null hypothesis is that the political context is irrelevant to the task, and the alternative hypothesis is that context matters and the difference in performance between the two parties would be different in the military and healthcare contexts.

At this point a huge number of possible comparisons could be performed, all consistent with the researcher's theory. For example, the null hypothesis could be rejected (with statistical significance) among men and not among women, explicable under the theory that men are more ideological than women. The pattern could be found among women but not among men, explicable under the theory that women are more sensitive to context than men. Or the pattern could be statistically significant for neither group, but the difference could be significant (still fitting the theory, as described above). Or the effect might only appear among men who are being questioned by female interviewers.

We might see a difference between the sexes in the healthcare context but not the military context; this would make sense given that health care is currently a highly politically salient issue and the military is less so. And how are independents and nonpartisans handled? They could be excluded entirely, depending on how many were in the sample. And so on: A single overarching research hypothesis (in this case, the idea that issue context interacts with political partisanship to affect mathematical problem-solving skills) corresponds to many possible choices of a decision variable.

This multiple comparisons issue is well known in statistics and has been called "p-hacking" in an influential 2011 paper by the psychology researchers Joseph Simmons, Leif Nelson, and Uri Simonsohn. Our main point in the present article is that it is possible to have multiple potential comparisons (that is, a data analysis whose details are highly contingent on data, invalidating published p-values) without the researcher performing any conscious procedure of fishing through the data or explicitly examining multiple comparisons.

How to Test a Hypothesis

In general, we could think of four classes of procedures for hypothesis testing: (1) a simple classical test based on a unique test statistic, $T$, which when applied to the observed data yields $T(y)$, where $y$ represents the data; (2) a classical test prechosen from a set of possible tests, yielding $T(y;\phi)$, with preregistered $\phi$ (for example, $\phi$ might correspond to choices of control variables in a regression, transformations, the decision of which main effect or interaction to focus on); (3) researcher degrees of freedom without fishing, which consists of computing a single test based on the data, but in an environment where a different test would have been performed given different data; the result of such a course is $T(y;\phi(y))$, where the function $\phi(\cdot)$ is observed in the observed case. It is generally considered unethical to (4) commit outright fishing, computing $T(y;\phi_j)$ for $j = 1,\ldots,J$. This would be a matter of performing $J$ tests and then reporting the best result given the data, thus $T(y;\phi^{\text{best}}(y))$.

It would take a highly unscrupulous researcher to perform test after test in a search for statistical significance (which could almost certainly be found at the 0.05 or even the 0.01 level, given all the options above and the many more that would be possible in a real study). The difficult challenge lies elsewhere: Given a particular data set, it can seem entirely appropriate to look at the data and construct reasonable rules for data exclusion, coding, and analysis that can lead to statistical significance. In such a case, researchers need to perform only one test, but that test is conditional on the data; hence, $T(y;\phi(y))$, with the same effect as if they had deliberately fished for those results. As political scientists Macartan Humphreys, Raul Sanchez de la Sierra, and Peter van der Windt wrote in 2013, a researcher faced with multiple reasonable measures can think (perhaps correctly) that the one that produces a significant result is more likely to be the least noisy measure, but then decide (incorrectly) to draw inferences based on that one measure alone. In the hypothetical example presented earlier, finding a difference in the healthcare context might be taken as evidence that that is the most important context in which to explore differences.

This error carries particular risks in the context of small effect sizes, small sample sizes, large measurement errors, and high variation (which combine to give low power, hence less reliable results even when they happen to be statistically significant, as discussed by Katherine Button and her coauthors in a 2013 paper in Nature Reviews Neuroscience). Multiplicity is less consequential in settings with large real differences, large samples, small measurement errors, and low variation. [...] data-based claim is more plausible to the extent it is a priori more likely, and any claim is less plausible to the extent that it is estimated with more error.

There are many roads to statistical significance; if data are gathered with no preconceptions at all, statistical significance can obviously be obtained even from pure noise by the simple means of repeatedly performing comparisons, excluding data in different ways, examining different interactions, controlling for different predictors, and so forth. Realistically, though, a researcher will come into a study with strong substantive hypotheses, to the extent that, for any given data set, the appropriate analysis can seem evidently clear. But even if the chosen data analysis is a deterministic function of the observed data, this does not eliminate the problem posed by multiple comparisons.

Arm Strength and Economic Status

In 2013, a research group led by Michael Petersen of Aarhus University published a study that claimed to find an association between men's upper-body strength, interacted with socioeconomic status, and their attitudes [...] economic status (SES) will oppose wealth redistribution, and that stronger men with low SES will support redistribution. These researchers had enough degrees of freedom for them to be able to find any number of apparent needles in the haystack of their data, and, again, it would be easy enough to come across the statistically significant comparisons without "fishing" by simply looking at the data and noticing large differences that are consistent with their substantive theory.

Most notably, the authors report a statistically significant interaction with no statistically significant main effect; that is, they did not find that men with bigger arm circumference had more conservative positions on economic redistribution. What they found was that the correlation of arm circumference with opposition to redistribution of wealth was higher among men of high socioeconomic status. Had they seen the main effect (in either direction), they could have come up with a theoretically justified explanation for that, too. And if there had been no main effect and no interaction, they could have looked for other interactions. Perhaps, for example, the correlations could have differed when comparing students with or without older siblings?

As we wrote in a 2013 critique for Slate, nothing in this report suggests that fishing or p-hacking (which would imply an active pursuit of statistical significance) was involved at all. Of course, it is reasonable for scientists to refine their hypotheses in light of the data. When the desired pattern does not show up as a main effect, it makes [...]
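
The contrast drawn under "How to Test a Hypothesis" between a preregistered test and a data-dependent one can be made concrete with a small simulation. The sketch below is an illustration only, not the authors' analysis. It assumes a hypothetical study in which test scores are pure noise, so the null hypothesis is true by construction. The subgroup labels, the sample size, the normal-approximation test, and the selection rule (report whichever of three comparisons looks strongest) are all assumptions chosen for simplicity. Every function name is hypothetical.

```python
import math
import random


def two_sided_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_study(rng, n=50):
    """One hypothetical study under the null: scores ignore party and sex."""
    cells = {
        (party, sex): [rng.gauss(0.0, 1.0) for _ in range(n)]
        for party in ("D", "R")
        for sex in ("M", "F")
    }
    return {
        "all": two_sided_p(cells["D", "M"] + cells["D", "F"],
                           cells["R", "M"] + cells["R", "F"]),
        "men": two_sided_p(cells["D", "M"], cells["R", "M"]),
        "women": two_sided_p(cells["D", "F"], cells["R", "F"]),
    }


def false_positive_rates(n_studies=2000, alpha=0.05, seed=1):
    """Share of null studies declared significant under each procedure."""
    rng = random.Random(seed)
    preregistered = 0  # class (2): the pooled comparison, fixed in advance
    data_dependent = 0  # class (3): one reported test, chosen after seeing data
    for _ in range(n_studies):
        p = simulate_study(rng)
        preregistered += p["all"] < alpha
        # Assumed rule: the analyst reports the comparison that looks strongest.
        data_dependent += min(p.values()) < alpha
    return preregistered / n_studies, data_dependent / n_studies


if __name__ == "__main__":
    prereg, forked = false_positive_rates()
    print(f"preregistered test:  {prereg:.3f}")
    print(f"data-dependent test: {forked:.3f}")
```

Under these assumptions the preregistered rate should stay close to the nominal 0.05, while the data-dependent rate should come out clearly higher. The men-only and women-only comparisons use separate samples, so those two alone give a chance of about 1 - 0.95^2 = 0.0975 that at least one is significant. In each simulated study the analyst reports only a single test, which is the point of class (3): the selection happens by looking at the data, not by running test after test.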
