
FAT TAILS RESEARCH PROGRAM

A Short Note on P-Value Hacking

Nassim Nicholas Taleb
Tandon School of Engineering

Abstract—We present the expected values from p-value hacking as a choice of the minimum p-value among m independent tests, which can be considerably lower than the "true" p-value, even with a single trial, owing to the extreme skewness of the meta-distribution. We first present an exact probability distribution (meta-distribution) for p-values across ensembles of statistically identical phenomena. We derive the distribution for small samples 2 < n ≤ n* ≈ 30 as well as the limiting one as the sample size n becomes large. We also look at the properties of the "power" of a test through the distribution of its inverse for a given p-value and parametrization. The formulas allow the investigation of the stability of the reproduction of results and of "p-hacking" and other aspects of meta-analysis. P-values are shown to be extremely skewed and volatile, regardless of the sample size n, and to vary greatly across repetitions of exactly the same protocols under identical stochastic copies of the phenomenon; such volatility makes the minimum p-value diverge significantly from the "true" one. Setting the power is shown to offer little remedy unless the sample size is increased markedly or the p-value is lowered by at least one order of magnitude.

Fig. 1. The "p-hacking" value across m trials for the "true" p-value pM = .15 and expected "true" value ps = .22. We can observe how easily one can reach spurious values < .02 with a small number of trials. (The plot shows the expected minimum p-value against the number of trials m = 2, ..., 14.)

P-VALUE hacking, just like an option or other members in the class of convex payoffs, is a function that benefits from higher variability of the underlying. The researcher or group of researchers have an implicit "option" to pick the most favorable result in m trials, without disclosing the number of attempts, so we tend to get a rosier picture of the end result than reality. The distribution of the minimum p-value and the "optionality" can be made explicit, expressed in a parsimonious formula allowing for the understanding of biases in scientific studies, particularly under environments with high publication pressure.

Assume that we know the "true" p-value, ps. What would its realizations look like across various attempts on statistically identical copies of the phenomena? By true value ps, we mean its expected value, by the law of large numbers, across an ensemble of m possible samples for the phenomenon under scrutiny, that is $\frac{1}{m}\sum_{i\le m} p_i \overset{P}{\longrightarrow} p_s$ (where $\overset{P}{\longrightarrow}$ denotes convergence in probability). A similar convergence argument can also be made for the corresponding "true median" pM. The distribution for small samples of size n can be made explicit (albeit with special inverse functions), as well as its parsimonious limiting one for n large, with no other parameter than the median value pM. We were unable to get an explicit form for ps but we go around it with the use of the median.

Fig. 2. The different values of Equ. 1 for n = 5, 10, 15, 20, 25, showing convergence to the limiting distribution.

It turns out, as we can see in Fig. 3, that the distribution is extremely asymmetric (right-skewed), to the point where 75% of the realizations of a "true" p-value of .05 will be <.05 (a borderline situation is 3× as likely to pass than fail a given protocol), and, what is worse, 60% of the realizations of a true p-value of .12 will be below .05. This implies serious gaming and "p-hacking" by researchers, even under a moderate amount of repetition of experiments.

Although with compact support, the distribution exhibits the attributes of extreme fat-tailedness. For an observed p-value of, say, .02, the "true" p-value is likely to be >.1 (and very possibly close to .2), with a standard deviation >.2 (sic) and a mean deviation of around .35 (sic, sic). Because of the excessive skewness, measures of dispersion in L1 and L2 (and higher norms) hardly vary with ps, so the standard deviation is not proportional, meaning an in-sample .01 p-value has a significant probability of having a true value > .3.
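To make the intuition concrete, the following is a minimal Monte Carlo sketch (not from the original paper): it repeatedly draws samples from the same Gaussian phenomenon, computes a one-tailed t-test p-value each time, and compares an honest single attempt with the minimum over m attempts. The effect size, sample size, and number of trials are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n, m, n_rep = 15, 8, 50_000        # sample size, trials per researcher, replications (illustrative)
effect = 0.5                        # assumed true standardized effect of the phenomenon

# One-tailed p-values from the one-sample t-test, across statistically identical copies.
samples = rng.normal(loc=effect, scale=1.0, size=(n_rep, m, n))
t_stat = samples.mean(axis=2) / (samples.std(axis=2, ddof=1) / np.sqrt(n))
p_vals = stats.t.sf(t_stat, df=n - 1)

single = p_vals[:, 0]               # the p-value of a single honest attempt
hacked = p_vals.min(axis=1)         # report only the best of m attempts

print("median p, single attempt :", round(np.median(single), 3))
print("share < .05, single      :", round((single < 0.05).mean(), 2))
print("median p, min of m trials:", round(np.median(hacked), 3))
print("share < .05, min of m    :", round((hacked < 0.05).mean(), 2))
```

Even without selecting the minimum, the quantiles of `single` show the dispersion discussed above; selecting the minimum shifts most of the mass below the .05 cutoff.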

Second version, January 2018; first version, March 2015.


So clearly we don't know what we are talking about when we talk about p-values.

Earlier attempts at an explicit meta-distribution in the literature can be found in [1] and [2], though for situations of Gaussian subordination and less parsimonious parametrization. The severity of the problem of significance of the so-called "statistically significant" has been discussed in [3] and offered a remedy via Bayesian methods in [4], which in fact recommends the same tightening of standards to p-values ≈ .01. But the gravity of the extreme skewness of the distribution of p-values is only apparent when one looks at the meta-distribution.

For notation, we use n for the sample size of a given study and m the number of trials leading to a p-value.

I. DERIVATION OF THE METADISTRIBUTION OF P-VALUES

Proposition 1. Let P be a random variable ∈ [0, 1] corresponding to the sample-derived one-tailed p-value from the paired T-test statistic (unknown variance) with median value M(P) = pM ∈ [0, 1] derived from a sample of size n. The distribution across the ensemble of statistically identical copies of the sample has for PDF

$$\varphi(p; p_M) = \begin{cases} \varphi(p; p_M)_L & \text{for } p < \tfrac{1}{2} \\ \varphi(p; p_M)_H & \text{for } p > \tfrac{1}{2} \end{cases}$$

$$\varphi(p; p_M)_L = \left(\frac{1-\lambda_{p_M}}{(\lambda_p-1)\,\lambda_{p_M}-2\sqrt{(1-\lambda_p)\,\lambda_p}\,\sqrt{(1-\lambda_{p_M})\,\lambda_{p_M}}+1}\right)^{\frac{n+1}{2}}$$

$$\varphi(p; p_M)_H = \left(\frac{1-\lambda_{p_M}}{1-\lambda'_p\,\lambda_{p_M}+2\sqrt{(1-\lambda'_p)\,\lambda'_p}\,\sqrt{(1-\lambda_{p_M})\,\lambda_{p_M}}}\right)^{\frac{n+1}{2}} \quad (1)$$

where $\lambda_p = I^{-1}_{2p}\!\left(\tfrac{n}{2},\tfrac{1}{2}\right)$, $\lambda_{p_M} = I^{-1}_{1-2p_M}\!\left(\tfrac{1}{2},\tfrac{n}{2}\right)$, $\lambda'_p = I^{-1}_{2p-1}\!\left(\tfrac{1}{2},\tfrac{n}{2}\right)$, and $I^{-1}_{(.)}(.,.)$ is the inverse beta regularized function.

Remark 1. For p = 1/2 the distribution doesn't exist in theory, but does in practice and we can work around it with the sequence $p_{m_k} = \tfrac{1}{2} \pm \tfrac{1}{k}$, as in the graph showing a convergence to the Uniform distribution on [0, 1] in Figure 4. Also note that what is called the "null" hypothesis is effectively a set of measure 0.

Proof. Let Z be a random normalized variable with realizations ζ, from a vector $\vec{v}$ of n realizations, with sample mean $m_v$ and sample standard deviation $s_v$, $\zeta = \frac{m_v - m_h}{s_v/\sqrt{n}}$ (where $m_h$ is the level it is tested against), hence assumed to follow a Student T distribution with n degrees of freedom, and, crucially, supposed to deliver a mean of $\bar\zeta$,

$$f(\zeta; \bar\zeta) = \frac{\left(\frac{n}{(\bar\zeta-\zeta)^2+n}\right)^{\frac{n+1}{2}}}{\sqrt{n}\, B\!\left(\frac{n}{2},\frac{1}{2}\right)}$$

where B(.,.) is the standard beta function. Let g(.) be the one-tailed survival function of the Student T distribution with zero mean and n degrees of freedom:

$$g(\zeta) = \mathbb{P}(Z > \zeta) = \begin{cases} \frac{1}{2} I_{\frac{n}{\zeta^2+n}}\!\left(\frac{n}{2},\frac{1}{2}\right) & \zeta \ge 0 \\ \frac{1}{2}\left(I_{\frac{\zeta^2}{\zeta^2+n}}\!\left(\frac{1}{2},\frac{n}{2}\right)+1\right) & \zeta < 0 \end{cases}$$

where I(.,.) is the incomplete Beta function.

We now look for the distribution of g ◦ f(ζ). Given that g(.) is a legit Borel function, and naming p the probability as a random variable, we have by a standard result for the transformation:

$$\varphi(p, \bar\zeta) = \frac{f\!\left(g^{(-1)}(p)\right)}{\left|g'\!\left(g^{(-1)}(p)\right)\right|}$$

We can convert $\bar\zeta$ into the corresponding median survival probability because of the symmetry of Z. Since one half of the observations fall on either side of $\bar\zeta$, the transformation is median preserving: $g(\bar\zeta) = \tfrac{1}{2}$, hence the median of the distribution of p is $p_M$. Hence we end up having $\{\zeta : \tfrac{1}{2} I_{\frac{n}{\bar\zeta^2+n}}(\tfrac{n}{2},\tfrac{1}{2}) = p_M\}$ (positive case) and $\{\zeta : \tfrac{1}{2}\left(I_{\frac{\zeta^2}{\zeta^2+n}}(\tfrac{1}{2},\tfrac{n}{2})+1\right) = p_M\}$ (negative case). Replacing, we get Eq. 1 and Proposition 1 is done. □
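Since Eq. (1), as written above, only involves inverse regularized incomplete beta functions, it can be evaluated directly and cross-checked against a Student T Monte Carlo. The sketch below is not from the paper; the values n = 10 and pM = .1 are illustrative.

```python
import numpy as np
from scipy import stats
from scipy.special import betaincinv

def phi(p, p_M, n):
    """Meta-distribution of the one-tailed p-value, Eq. (1) above, for p_M < 1/2."""
    lam_pM = betaincinv(0.5, n / 2, 1 - 2 * p_M)      # lambda_{p_M} = I^{-1}_{1-2p_M}(1/2, n/2)
    if p < 0.5:
        lam = betaincinv(n / 2, 0.5, 2 * p)           # lambda_p = I^{-1}_{2p}(n/2, 1/2)
        den = (lam - 1) * lam_pM - 2 * np.sqrt((1 - lam) * lam * (1 - lam_pM) * lam_pM) + 1
    else:
        lam = betaincinv(0.5, n / 2, 2 * p - 1)       # lambda'_p = I^{-1}_{2p-1}(1/2, n/2)
        den = 1 - lam * lam_pM + 2 * np.sqrt((1 - lam) * lam * (1 - lam_pM) * lam_pM)
    return ((1 - lam_pM) / den) ** ((n + 1) / 2)

# Monte Carlo cross-check: shifted Student T statistics whose median p-value is p_M.
rng = np.random.default_rng(1)
n, p_M = 10, 0.1
zeta_bar = stats.t.isf(p_M, df=n)                     # location delivering median p-value p_M
p_sim = stats.t.sf(zeta_bar + stats.t.rvs(df=n, size=500_000, random_state=rng), df=n)

for lo, hi in [(0.02, 0.03), (0.10, 0.11), (0.30, 0.31)]:
    emp = np.mean((p_sim > lo) & (p_sim < hi)) / (hi - lo)
    mid = (lo + hi) / 2
    print(f"p≈{mid:.3f}  empirical density={emp:.2f}  Eq.(1)={phi(mid, p_M, n):.2f}")
```

The empirical histogram densities and the closed form should agree closely, which is a quick way to sanity-check the formula before using it in the p-hacking calculations of Section II.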
We note that n does not increase significance, since p-values are computed from normalized variables (hence the universality of the meta-distribution); a high n corresponds to an increased convergence to the Gaussian. For large n, we can prove the following proposition:

Proposition 2. Under the same assumptions as above, the limiting distribution for ϕ(.):

$$\lim_{n\to\infty} \varphi(p; p_M) = e^{-\operatorname{erfc}^{-1}(2 p_M)\left(\operatorname{erfc}^{-1}(2 p_M)-2\operatorname{erfc}^{-1}(2 p)\right)} \quad (2)$$

where erfc(.) is the complementary error function and erfc⁻¹(.) its inverse.

The limiting CDF Φ(.):

$$\Phi(k; p_M) = \frac{1}{2}\operatorname{erfc}\!\left(\operatorname{erf}^{-1}(1-2k)-\operatorname{erf}^{-1}(1-2p_M)\right) \quad (3)$$

Proof. For large n, the distribution of $Z = \frac{m_v}{s_v/\sqrt{n}}$ becomes that of a Gaussian, and the one-tailed survival function becomes $g(\zeta) = \frac{1}{2}\operatorname{erfc}\!\left(\frac{\zeta}{\sqrt{2}}\right)$, with $\zeta(p) \to \sqrt{2}\,\operatorname{erfc}^{-1}(2p)$. □

This limiting distribution applies for paired tests with known or assumed sample variance, since the test becomes a Gaussian variable, equivalent to the convergence of the T-test (Student T) to the Gaussian when n is large.
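As a quick numerical sanity check of the limiting result (not part of the original paper), one can simulate the Gaussian-limit p-values directly: draw the normalized statistic from a Gaussian centered at $\bar\zeta = \sqrt{2}\,\operatorname{erfc}^{-1}(2p_M)$, map it through the survival function, and compare the empirical CDF with Eq. (3). The choice pM = .15 is an illustrative assumption.

```python
import numpy as np
from scipy.special import erfc, erfcinv, erfinv

rng = np.random.default_rng(7)
p_M = 0.15                                  # assumed "true" median p-value
zeta_bar = np.sqrt(2) * erfcinv(2 * p_M)    # Gaussian-limit location implied by p_M

# Simulate p-values across statistically identical copies in the large-n limit.
z = rng.normal(loc=zeta_bar, scale=1.0, size=200_000)
p = 0.5 * erfc(z / np.sqrt(2))              # one-tailed survival probability

def Phi(k, p_M):
    """Limiting CDF of the p-value meta-distribution, Eq. (3)."""
    return 0.5 * erfc(erfinv(1 - 2 * k) - erfinv(1 - 2 * p_M))

for k in (0.01, 0.05, 0.15, 0.5):
    print(f"k={k:<5} empirical={np.mean(p <= k):.3f}  Eq.(3)={Phi(k, p_M):.3f}")
```

The agreement at the .05 and .01 cutoffs is what quantifies the "3× as likely to pass than fail" asymmetry described in the introduction.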


Remark 2. For values of p close to 0, ϕ in Equ. 2 can be usefully calculated as:

$$\varphi(p; p_M) = \sqrt{2\pi}\, p_M \sqrt{\log\!\left(\tfrac{1}{2\pi p_M^2}\right)}\; e^{\sqrt{\left(-\log\left(2\pi \log\left(\frac{1}{2\pi p^2}\right)\right)-2\log(p)\right)\left(-\log\left(2\pi \log\left(\frac{1}{2\pi p_M^2}\right)\right)-2\log(p_M)\right)}} + O\!\left(p^2\right). \quad (4)$$

The approximation works more precisely for the band of relevant values $0 < p < \frac{1}{2\pi}$.

From this we can get numerical results for convolutions of ϕ using the Fourier Transform or similar methods.

Fig. 3. The probability distribution of a one-tailed p-value with expected value .11, generated by Monte Carlo (histogram) as well as analytically with ϕ(.) (the solid line). We draw all possible subsamples from an ensemble with given properties. The excessive skewness of the distribution makes the mean considerably higher than most observations, hence causing illusions of "statistical significance". (In the plot, roughly 53% of realizations fall below .05 and roughly 25% below .01; the 5% cutpoint, the true mean, and the median are marked.)

II. P-VALUE HACKING

We can now get the distribution of the minimum p-value per m trials across statistically identical situations and thus get an idea of "p-hacking", defined as attempts by researchers to get the lowest p-values of many experiments, or to try until one of the tests produces statistical significance.

Proposition 3. The distribution of the minimum of m observations of statistically identical p-values becomes (under the limiting distribution of Proposition 2):

$$\varphi_m(p; p_M) = m\, e^{\operatorname{erfc}^{-1}(2p_M)\left(2\operatorname{erfc}^{-1}(2p)-\operatorname{erfc}^{-1}(2p_M)\right)} \left(1-\tfrac{1}{2}\operatorname{erfc}\!\left(\operatorname{erfc}^{-1}(2p)-\operatorname{erfc}^{-1}(2p_M)\right)\right)^{m-1} \quad (5)$$

Proof. $\mathbb{P}(p_1 > p, p_2 > p, \ldots, p_m > p) = \prod_{i=1}^{m}\left(1-\Phi(p_i)\right) = \left(1-\Phi(p)\right)^{m}$. Taking the first derivative we get the result. □

Outside the limiting distribution, we integrate numerically for different values of m, as shown in Figure 1. So, more precisely, for m trials, the expectation is calculated as:

$$\mathbb{E}(p_{\min}) = m \int_0^1 p\, \varphi(p; p_M) \left(1-\int_0^p \varphi(u; p_M)\, du\right)^{m-1} dp$$
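In the limiting (Gaussian) case the expectation above is easy to evaluate by simulation: sample p-values as before, take the minimum of m of them, and average. This is a minimal sketch, not from the paper; pM = .15 matches the setting of Fig. 1, the other choices are illustrative.

```python
import numpy as np
from scipy.special import erfc, erfcinv

rng = np.random.default_rng(3)
p_M, n_rep = 0.15, 200_000          # "true" median p-value as in Fig. 1; replication count is illustrative
zeta_bar = np.sqrt(2) * erfcinv(2 * p_M)

def expected_min_p(m):
    """Monte Carlo estimate of E[p_min] over m trials under the limiting meta-distribution."""
    z = rng.normal(loc=zeta_bar, scale=1.0, size=(n_rep, m))
    p = 0.5 * erfc(z / np.sqrt(2))  # p-value of each trial
    return p.min(axis=1).mean()

for m in (1, 2, 5, 10, 15):
    print(f"m={m:<3} E[p_min] ≈ {expected_min_p(m):.3f}")
```

For m = 1 the estimate is simply the expected "true" value ps implied by pM; as m grows, the expected minimum drops quickly, which is the pattern plotted in Fig. 1.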

Fig. 4. The probability distribution of p at different values of pM (pM = .025, .1, .15, .5). We observe how pM = 1/2 leads to a uniform distribution.

III. OTHER DERIVATIONS

Inverse Power of Test. Let β be the power of the test for a given p-value p, for random draws X from an unobserved parameter θ and a sample size of n. To gauge the reliability of β as a true measure of power, we perform an inverse problem:

$$X_{\theta,p,n} \longrightarrow \beta, \qquad \beta^{-1}(X)$$

Proposition 4. Let βc be the projection of the power of the test from the realizations, assumed to be Student T distributed and evaluated under the parameter θ. We have

$$\Phi(\beta_c) = \begin{cases} \Phi(\beta_c)_L & \text{for } \beta_c < \tfrac{1}{2} \\ \Phi(\beta_c)_H & \text{for } \beta_c > \tfrac{1}{2} \end{cases}$$

where

$$\Phi(\beta_c)_L = 1-\sqrt{-(\gamma_1-1)\,\gamma_1}\;\gamma_1^{-\frac{n}{2}}\left(-(\gamma_1-1)\,\gamma_1-2\sqrt{-(\gamma_1-1)\,\gamma_1}\,\sqrt{\tfrac{1}{\gamma_3}-1}+\gamma_1\!\left(\tfrac{2}{\gamma_3}-1\right)-1\right)^{-\frac{n+1}{2}} \quad (6)$$

$$\Phi(\beta_c)_H = \frac{\sqrt{n}\;\gamma_2^{\frac{n}{2}}\,(1-\gamma_2)^{-\frac{1}{2}}}{2\,\sqrt{-(\gamma_2-1)\,\gamma_2}\;B\!\left(\tfrac{n}{2},\tfrac{1}{2}\right)}\left(-2\left(\sqrt{-(\gamma_2-1)\,\gamma_2}+\gamma_2\right)\left(\tfrac{1}{\gamma_3}-1\right)+2\sqrt{\tfrac{1}{\gamma_3}-1}\,\sqrt{-(\gamma_2-1)\,\gamma_2}+\tfrac{1}{\gamma_2-1}-1\right)^{\frac{n+1}{2}} \quad (7)$$

where $\gamma_1 = I^{-1}_{2\beta_c}\!\left(\tfrac{n}{2},\tfrac{1}{2}\right)$, $\gamma_2 = I^{-1}_{2\beta_c-1}\!\left(\tfrac{1}{2},\tfrac{n}{2}\right)$, and $\gamma_3 = I^{-1}_{(1,\,2p_s-1)}\!\left(\tfrac{n}{2},\tfrac{1}{2}\right)$.
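As a purely illustrative complement (not the paper's derivation), the reliability issue behind Proposition 4 can also be seen by direct simulation in the Gaussian limit: re-estimate the power of a one-tailed test by plugging the observed effect of each statistically identical sample into the standard power formula, and look at how widely those "projected" powers scatter around the true power. The level, effect size, and sample size below are assumptions for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
n, alpha, delta = 20, 0.05, 0.5     # sample size, test level, true standardized effect (all illustrative)
z_a = norm.isf(alpha)               # one-tailed critical value in the Gaussian limit

true_power = norm.cdf(delta * np.sqrt(n) - z_a)

# Plug-in ("projected") power: each replication re-estimates the effect from its own data
# (unit variance assumed known) and recomputes the power formula with that estimate.
d_hat = rng.normal(loc=delta, scale=1 / np.sqrt(n), size=100_000)
proj_power = norm.cdf(d_hat * np.sqrt(n) - z_a)

print(f"true power             : {true_power:.2f}")
print(f"median projected power : {np.median(proj_power):.2f}")
print(f"5%-95% projected power : {np.quantile(proj_power, [0.05, 0.95]).round(2)}")
```

As with p-values, a single realization of the projected power can sit far from the true value, which is the sense in which setting the power offers little remedy unless the test is made much more stringent (cf. the abstract and Section IV).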


IV. APPLICATION AND CONCLUSION

• One can safely see that under such stochasticity for the realizations of p-values and the distribution of its minimum, to get what a scientist would expect from a 5% confidence level (and the inferences they get from it), one needs a p-value of at least one order of magnitude smaller.

• Attempts at replicating papers, such as the open science project [5], should consider a margin of error in its own procedure and a pronounced bias towards favorable results (Type-I error). There should be no surprise that a previously deemed significant test fails during replication; in fact it is the replication of results deemed significant at a close margin that should be surprising.

• The "power" of a test has the same problem unless one either lowers p-values or sets the test at higher levels, such as .99.

ACKNOWLEDGMENT

Marco Avellaneda, Pasquale Cirillo, Yaneer Bar-Yam, friendly people on twitter, less friendly verbagiastic psychologists on twitter, ...

REFERENCES

[1] H. J. Hung, R. T. O'Neill, P. Bauer, and K. Kohne, "The behavior of the p-value when the alternative hypothesis is true," Biometrics, pp. 11–22, 1997.
[2] H. Sackrowitz and E. Samuel-Cahn, "P values as random variables—expected p values," The American Statistician, vol. 53, no. 4, pp. 326–331, 1999.
[3] A. Gelman and H. Stern, "The difference between "significant" and "not significant" is not itself statistically significant," The American Statistician, vol. 60, no. 4, pp. 328–331, 2006.
[4] V. E. Johnson, "Revised standards for statistical evidence," Proceedings of the National Academy of Sciences, vol. 110, no. 48, pp. 19313–19317, 2013.
[5] Open Science Collaboration, "Estimating the reproducibility of psychological science," Science, vol. 349, no. 6251, p. aac4716, 2015.
