
7

REPLICATION

PETER R. KILLEEN

We come finally, however, to the relation of the ideal theory to the real world, or "real" probability. . . . To someone who wants [applications, a consistent mathematician] would say that the ideal system runs parallel to the usual theory: "If this is what you want, try it: it is not my business to justify application of the system; that can only be done by philosophizing; I am a mathematician." In practice he is apt to say: "try this; if it works that will justify it." But now he is not merely philosophizing; he is committing the characteristic fallacy. Inductive experience that the system works is not evidence.

Littlewood (1953, p. 73)

PROBABILITY AS A MODEL SYSTEM

For millennia, Euclidean geometry was a statement of fact about the world order. Only in the 19th century did it come to be recognized instead as a model system—an "ideal theory"—that worked exceedingly well when applied to many parts of the real world. It then stepped down from a truth about the world to its current place as first among equals as models of the world—the most useful of a cohort of geometries, each of differential service in particular cases, on spherical surfaces and relativistic universes and fractal percolates. In like manner, probability theory was born as an explanation of the contingent world—"'real' probability"—and, with the work of Kolmogorov among many others, it matured as a coherent model system, inheriting most features of the earlier versions of the probability calculus.

The abstraction of model systems from the world permits their development as coherent, clear, and concise logics. But the abstraction has another legacy: the eventual need for scientists to reconnect the model system to the empirical world. That such rapprochement is even possible is amazing; it stimulated Wigner's well-known allusion to "the unreasonable effectiveness of mathematics in describing the world." Realizing such "unreasonably effective" descriptions, however, can present reasonably formidable difficulties—difficulties that are sometimes overcome only by fiat, as noted by Littlewood, a mathematician of no mean ability, and M. Kline (1980), a scholar of comparable acuity.

Author's Note: The research was supported by National Science Foundation Grant IBN 0236821 and National Institute of Mental Health Grant 1R01MH066860.



The toolbox that helps us apply the "ideal theory" of probability to scientific questions is called inferential statistics. These tools are being continually sharpened, with new designs replacing old.

Intellectual ontogeny recapitulates its cultural phylogeny. Just as we must outgrow naive physics, we must outgrow naive statistics. The former is an easier transition than the latter. Not only must we as students of contingency deal with the gamblers' fallacies and exchange paradoxes; we must also cope with the academics' fallacies and statistical paradoxes that are visited upon us as idols of our theater, the university classroom. The first step, one already taken by most readers of this volume, is to recognize that we deal with model systems, some more useful than others, not with truths about real things. The second step is to understand the character of the most relevant tools for their application, their strengths and weaknesses, and attempt to determine in which cases their marriage to data is one of mere convenience and in which it is blessed with a deeper, Wignerian resonance. That step requires us to remain appreciative but critical craftsmen. It requires us to look through the halo of mathematics that surrounds all statistical inference to assess the fit between tool and task, to ask of each statistical technique whether it gives us leverage or just adds decoration.

This chapter briefly reviews—briefly, because there are so many good alternative sources (e.g., Harlow, Mulaik, & Steiger, 1997; R. B. Kline, 2004)—the most basic statistical technique we use, null hypothesis statistical testing (NHST), and its limits. It then describes an alternative statistic, p_rep, that predicts replicability. We remain mindful of Littlewood's (1953) observation that "inductive experience that the system works is not evidence [that it is true]." But then Littlewood was a mathematician, not a scientist. The search for truth about parameters has often befuddled the progress of science, which recognizes simpler goals as well: to understand and predict. If we "try [a tool, and] it works," that can be very good news and may constitute a significant advance over what has been. So, try this new tool, and see if it works for your inferential problems.

Connecting Probability to Data

You are faced with two columns of numbers, data collected from two groups of subjects. What do you want to know? Not, of course, "whether there's a significant difference between them." If they are identical, you would have looked for the clerical error. If they are different, they are different. You can review them 100 times, and they will continue to be different, hopefully 100 times; p = 1.0. "Significantly different," you might emphasize, irritated. But what does that mean? "That the probability that they would be so different by chance is less than 5%," you recite. OK. Progress. Now we just need clarification of probability, so, and chance.

Probability. Probability theory is a deductive calculus. One starts with probability generators, such as coins or cards or dice, and makes deductions about their behavior. The premises are precise: coins with a probability of heads of .50, perfectly balanced dice, perfectly shuffled cards. Then elegant theorems solve problems such as "Given an unbiased coin, what is the probability of flipping 6 heads in a row?" But scientists are never given such ideal objects. Their modal inferences are inductions, not deductions: An informant gives them a series of outcomes from flipping a coin that landed heads six times in a row, and they must determine what probability of heads should be assigned to the coin. They can solve this mystery either as Dr. Watson or as Mr. Holmes in the cherished tale, "The Case of the Hypothesis That Had No Teeth." As you well remember, Dr. Watson studiously purged his mind of all prior biases and opined that the probability of the coin being fair was manifestly (1/2)^6 < .025 and, further, that the best estimate of the probability of a heads was 1 − 2^−6. Mr. Holmes stuffed his pipe; examined the coin; spun it; asked about its origin, how the coin was released and caught, and how many sequences were required to get that run of six; and then inquired about the bank account of the informant and his recent associates. Dr. Watson objected that that was going beyond the information given; in any case, how could one ever combine all those diverse clues into a probability statement that was not intrinsically subjective? "Elementary," Mr. Holmes observed, "probability theory this is not, my dear Watson; nor is it deduction. When I infer a state of nature from evidence, the more evidence the better I infer. My colleague shall explain how to concatenate evidence in a later chapter of this sage book I saw you nodding over."


How do we infer probability from a situation in which there is no uncertainty—the six heads in a row, last week's soccer cup, your two columns of experimental data? There are two root metaphors for probability: For frequentists, probability is the long-run relative frequency of an outcome; it does not apply to novel events (no long run) or to accomplished events (faits accomplis support no probability other than unity). For Bayesians, probability is the relative odds that an individual gives to an outcome. You do not have resources or interest to reconduct your experiment thousands of times, to estimate the relative frequency of two columns being so different. And even if you did, you'd just be left with a much larger set of accomplished data. This seems to eliminate the frequentist solution. On the other hand, the odds that you give to your outcome will be different from the odds your reviewers or editor or your significant other gives to your outcome. This eliminates any unique Bayesian probability. What next? Just imagine.

Instead of conducting the experiment thousands of times, just imagine that it has been conducted thousands of times. This is Fisher's brilliant solution to the problem of connecting data to the "ideal theory" of probability. Well then, just how big an effect should we imagine that your experiment yielded? Here we must temper imagination with discipline: We must imagine that the experiment never worked—that there was never a real effect in all those imaginary trials but that the outcomes were distributed by that rogue called Chance. Next, graph the proportion of outcomes at various effect sizes to give a sampling distribution. If your measured outcome happens only rarely among this cohort of no real effects, you may conclude that it really is not of their type—it does not belong on the group null bench. It is so deviant, you infer, because more than Chance was at work—the experimental manipulation was at work! This last inference is, as we shall see, as common, and commonly sanctioned, as it is unjustified.

If the thousands-of-hypothetical-trials scenario taxes the computational resources of your imagination, then imagine instead that the data were drawn from a large, normally distributed population of data similar to those of the control group. Increase the power of the test by estimating the population variance from the variances of both the experimental and control groups. Then a theoretical distribution, such as the t distribution, can be directly used to infer how often so deviant an outcome as what you measured would have happened by chance under repeated sampling. This set of tactics is the paradigmatic modus operandi for NHST. In modern applications, the theoretical distribution may be replaced with an empirical one, obtained by Monte Carlo elaboration of the original empirical distribution function (e.g., Davison & Hinkley, 1997; Mooney & Duval, 1993).

So in this standard scenario, probability means the long-run relative frequency that you stipulate in your test of the behavior of an ideal object. It is against this that you will test your data. Unlike a Bayesian analysis, the parameters tested (e.g., the null hypothesis that the means of the two populations are equal) are not inferred from data. Indeed, authorities such as Kyburg (1987) argue that use of Bayesian inverse inference, leading up from statistics to parameters, undermines all direct inference thereafter, including NHST.

So. Let us assume your experiment recorded a standardized difference between the responses of 30 control subjects and 30 experimental subjects of d = 0.50, from which you calculated a p value of < .05. What does that mean? Does it mean that the probability of getting a d of 0.50 under the null is less than 5%? No, because we know a priori that the probability of getting exactly that value is always very close to zero; in fact, the more decimal places in your measurements, the closer to zero. That's true for any real number you might have recorded, even d = 0.00, which Chance favors.

Again an impasse, but again one that can be solved through imagination (providing inductive evidence supporting Einstein's chestnut "Imagination is more important than knowledge"). To get a probability requires an interval on the x-axis. We could take one around the observed value: say, d ± 0.05. This would work; as we let the interval shrink toward zero, comparison with a similar extent around 0 would give us a likelihood analysis of the null versus the observed (Royall, 1997).
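Fisher's thought experiment is easy to stage on a computer. The following minimal sketch (in Python; it is an illustration of the idea, not code from the chapter, and all names are mine) runs thousands of imaginary null experiments and asks how rare the observed effect size is among them:

```python
import numpy as np

rng = np.random.default_rng(1)
n_c = n_e = 30          # group sizes, as in the running example
d_obs = 0.50            # the effect size your experiment yielded

# Imagine the experiment run 10,000 times with no real effect:
null_ds = []
for _ in range(10_000):
    c = rng.normal(0.0, 1.0, n_c)       # control group, null world
    e = rng.normal(0.0, 1.0, n_e)       # experimental group, same null world
    sp = np.sqrt(((n_c - 1) * c.var(ddof=1) + (n_e - 1) * e.var(ddof=1))
                 / (n_c + n_e - 2))     # pooled standard deviation
    null_ds.append((e.mean() - c.mean()) / sp)

# Proportion of null experiments at least as extreme as the one observed:
p = np.mean(np.array(null_ds) >= d_obs)
print(f"one-tailed p under the null: {p:.3f}")   # roughly .03 here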


Fisher was a pioneer of likelihood analyses but could not get likelihoods to behave like probabilities, and so he suggested a different interval on the evidence axis: He gave the investigator credit for everything more extreme than the observed statistic (a benefice that amazed and pleased many generations of young researchers, who might have been told instead, "The null will give effects up to that size 95% of the time"). So, what so means here is "more extreme than" what you found, including d scores of 1, 2, . . . 100. . . . Don't ask why those extents of the x-axis never visited by data should play a role in the decision; take the p < .05 and run.

Chance. "The probability [relative frequency under random sampling] that your data would be so [at least that] extreme by chance." Here chance means your manipulation did not work and only the null did. How does the null work? Typically, thanks to the central limit theorem, in ways that result in a Gaussian distribution of effects (although for other test statistics or inferences, related distributions such as the t or F are the correct asymptotic distributions). Two parameters completely determine the Gaussian: its mean and variance. Under the null, the mean is zero. Its variance? Since we have no other way to determine it, we use the data you brought with you, the variance of your control group. Well . . . the experimental group could also provide information. Hoping that the experimental operations have not perturbed that too much, we will pool the information from both sources. Then chance means that "with no help from my experimental manipulation, the impotent null could have given rise to the observed difference by the luck of the draw [of a sample from our hypothetical population]." If it would happen in less than 5% of the samples we hypothetically take, we call the data significantly different from that expected under the null.

You knew all that from Stat 101, but it is worth the review. What do you conclude with a gratifying p < .05? Also as learned in Stat 101, you conclude that those are improbably extreme data. But what many of us thought we heard was that we could then reject the null. That, of course, is simplistic at best, false at worst.

FOUNDATIONAL PROBLEMS WITH STATISTICAL INFERENCE

The Inverse Inference Problem. Assume the probability of the statistic S given the null (N) is less than a prespecified critical number, p(S|N) < α, where the null may be a hypothesis such as μ_E − μ_C = 0. It does not then follow that the probability of the null given the statistic, p(N|S), is less than α. We can get from one to the other, however, via Bayes' theorem:

p(N|S) = p(S|N)p(N)/p(S).

By the (prior) probability of the null, p(N), we mean the probability of the mathematical implication of the null (for instance, that two population means are equal, μ_E − μ_C = 0). That prior must be assigned a value before looking at the new data (S). By the probability of the statistic, p(S), we mean its probability under relevant states of the world—in this case, the probability of the data given that the null is true, plus the probability of the data given that the null is false: p(S) = p(S|N)p(N) + p(S|~N)(1 − p(N)). By normalizing the right-hand side of Bayes's equation, p(S) makes the posterior probabilities sum to 1. Only in the unlikely case that the prior probability of the null equals the probability of the statistic, p(N) = p(S), does p(N|S) = p(S|N). Otherwise, to have any sense of the probability of the null (and, by implication, of the alternative we favor, that our manipulation was effective: μ_E − μ_C > 0), we must be able to estimate p(N)/p(S). This is not easy. Even if we could agree on assigning a prior probability to the null, Bayes would not give us the probability of our favored hypothesis (unless we defined it broadly as "anything but the null").
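A hypothetical numeric illustration (the figures below are mine, not the chapter's) shows how far p(N|S) can sit from p(S|N):

```python
# Hypothetical numbers: p(S|N) = .05, a prior p(N) = .5,
# and data twice as likely under the alternative, p(S|~N) = .10.
p_S_given_N, p_N, p_S_given_notN = 0.05, 0.5, 0.10

p_S = p_S_given_N * p_N + p_S_given_notN * (1 - p_N)   # = .075
p_N_given_S = p_S_given_N * p_N / p_S                  # Bayes' theorem
print(round(p_N_given_S, 2))   # 0.33: far from .05, despite "p < .05"
```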
In light of these difficulties, Fisher (1959) made it crystal clear that we generally cannot get from our p value to any statement about the probability of the hypotheses. But it was to make exactly such statements about our hypotheses, with the blessings of statistical rigor, that we attended all those statistics courses. We were misled, but it was not, we suspect, the first or last time that happened, which goes some distance to explaining the conflation of statistics with lies in the public's mind.

"It is important to remember that [relative frequency] is but one interpretation that can be given to the formal notion of probability" (Hays, 1963, p. 63); given our inferential imbroglio, we may well wonder if that interpreter ever spoke the language of science.

Replication Statistics 107

Neyman and Pearson solved the problem of inverse inference by emphasizing comparison with an alternate hypothesis, setting criterial regions into which our statistic would either fall, or not, and noting that, whereas we have no license to change our beliefs about the null even if our statistic falls into such a critical (p < α) region, it would nonetheless be prudent to change our behavior, absent a change in belief. Like Augustine's credo quia impossibile, this tergiversation does solve the problem—but only for those whose faith is stronger than their reason.

What did we ever do that got us into this mess? We wanted to know if our manipulation (or someone else's, perhaps Nature's, if this was an observational study) really worked. We wanted to know how much of the difference between groups was caused by the factor that we used to sort the numbers into two or more columns. We wanted to know if our results will replicate or if we are likely to be embarrassed by that most odious of situations, publication of unreplicable results. Null hypothesis statistical tests cannot get us from here to there. Let us go back to basics and see if we cannot find a viable alternate route to some of our valid goals.

DEFINING REPLICATION

Curious, isn't it, that we have license to use the measured variance (s²) in estimating a parameter of the population under the null hypothesis (σ²), but we do not use the measured mean to estimate a parameter? Why evaluate data with one hand tied behind our back? Assay the following hypothesis instead: H_A: δ = d, where δ is the value of the population effect size, and d is the effect size you measured in your experiment. But to ask the probability that this alternative to the null is true seems to be creating a new logical fallacy: post hoc, ergo hoc. (Gigerenzer [2004] called a similar solecism "The Feynman Fallacy" because Feynman was so exasperated when a young colleague asked him to calculate the probability of an accomplished event.) But we can ask a different, noncircular question, one that takes advantage of the first two moments of the observed data. What is the probability that, using the same experimental operations and population of subjects, another investigator can replicate those results? This can be computed once we agree on the meaning of replicate. Consider this definition:

To replicate means to repeat the empirical operations and to record data that support the original claim.

• If the claim is as modest as "This operation works [generates a positive effect]," then any replication attempt that finds a positive effect could be deemed a successful replication. Although this may seem a too-modest threshold for replication, put it in the context of what traditional significance tests test: In a one-tailed test, 1 − p gives us the probability that our statistic d has the same sign as the population parameter (Jones & Tukey, 2000).

• If the claim is "This operation generates an effect of at least d_L," then only replication attempts that return d ≥ d_L count as successful replications.

• If the claim is "This operation generates a significant effect," then only replication attempts that return a p < .05 are successful replications.

• If the claim is "This operation is essentially worthless, generating effect sizes less than d_U," then any replication attempt that returned a d < d_U could be deemed a successful replication. One would have to decide beforehand if a d less than, say, −0.5 was consistent with the claim or constituted evidence for a stronger alternative claim, such as "This operation could backfire."

In the following, we shall show how to compute such probabilities.

PREDICTING REPLICABILITY RATHER THAN INFERRING PARAMETERS

How. How to estimate these probabilities? The sampling distribution of effect sizes in replication, p(d2|d1), is required.

Effect size is calculated as

d = (M_E − M_C)/s_p,    (1)


with M_E the mean of the experimental group, M_C the mean of the control group, and s_p the estimate of the population standard deviation based on the pooled standard deviations of both groups (see the appendix for more information). Consider an effect size d = d1 based on a total of n observations randomly sampled from a population with unknown mean δ and variance σ². The effect size d2 for the next m observations constitutes the datum that we wish to predict based on the original observations: f(d2|d1). Because d1 provides information about d2, these statistics are not independent. They are only independent when considered as samples from a large population whose parameter is δ. Solution proceeds by considering the joint density of d2 and δ conditional on the primary observations. This is developed in the appendix, leading to the distribution shown as the flatter curve in Figure 7.1. In evaluating a claim concerning experimental results, calculate replicability by integrating that density between the appropriate limits:

p_rep = ∫_{d_L}^{d_U} n(d1, s²_dR) = ∫_{(d_L − d1)/s_dR}^{(d_U − d1)/s_dR} n(0, 1).    (2)

In cases where a positive effect has been claimed, and we stipulate that the hypothetical replication would have the same power as the accomplished experiment—everything is done exactly as in the original experiment—then d_L = 0, d_U = ∞, and s²_dR = 2s²_d. This is shown as the gray area of the predictive distribution on the right in Figure 7.1. If the claim is that the effect size is greater than d*, the probability of getting supportive evidence is predicted by p_rep with d_L = d* and d_U = ∞. The probability of finding a significant effect size in replication is given by p_rep with d_L = s_dR·z_α. Other claims take other limits.

For convenience in calculation of the standard case, Equation 2 may be rewritten as

p_rep = ∫_{−∞}^{d1/σ_dR} n(0, 1),    (3)

with σ_dR estimated by s_dR. The variance of d is

s²_d ≈ n²/[n_E·n_C·(n − 4)],

with n = n_E + n_C. When experimental and control groups are of equal size, n_E = n_C, then

s²_d ≈ 4/(n − 4),

and s²_dR = 2s²_d1. Predicting the effects in a different-size prospective sample requires adjusting the variance. These considerations are reviewed in Killeen (2005a), whose derivation was corrected along the lines of the appendix by Doros and Geier (2005).


Figure 7.1 The dark area to the right of the observed effect size d1 gives the probability of finding data more extreme than d1, given the null. The gray area to the right of 0 gives the probability of finding a positive effect in replication (Equation 2). The figure is reproduced from Killeen (2005a), with permission.
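Equations 1 through 3 are direct to compute. The sketch below (Python; an illustration under the standard case of d_L = 0, d_U = ∞, and equal group sizes—the function name is mine, not the chapter's) goes from raw scores to p_rep:

```python
import numpy as np
from scipy.stats import norm

def p_rep(exp_scores, ctl_scores):
    """Probability that an exact replication yields an effect of the
    same (positive) sign; Equations 1-3, standard case."""
    e = np.asarray(exp_scores, float)
    c = np.asarray(ctl_scores, float)
    ne, nc = len(e), len(c)
    sp = np.sqrt(((ne - 1) * e.var(ddof=1) + (nc - 1) * c.var(ddof=1))
                 / (ne + nc - 2))                  # pooled SD
    d1 = (e.mean() - c.mean()) / sp                # Equation 1
    n = ne + nc
    var_d = n**2 / (ne * nc * (n - 4))             # sampling variance of d
    s_dR = np.sqrt(2 * var_d)                      # replication doubles it
    return norm.cdf(d1 / s_dR)                     # Equation 3
```

For the chapter's running example (30 subjects per group, d1 = 0.50), this gives Φ(0.50/√(8/56)) ≈ .91.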


Relation to p. How does the probability of replication compare to traditional indices of replicability such as p? The p value is the probability of rejecting the null hypothesis given that the datum, d1, is sampled from a world in which the null is true. It is shown as the area to the right of d1 in Figure 7.1, under a normal density centered at 0 and having a variance of σ²_d, estimated from s²_d. The value of p_rep for the same data—the probability of finding a positive effect in replication—is the shaded area to the right of 0 in the normal curve centered on d1 and having a variance of σ²_dR, estimated from s²_dR = 2s²_d. As d1 moves to the right, or as the variance of d1, s²_d1, decreases, p_rep will increase and p decrease in complement. For use in spreadsheets such as Excel, Cumming (2005) suggested the notation p_rep = NORMSDIST([NORMSINV(1 − p)]/√2), where NORMSDIST is the standardized normal distribution function, and NORMSINV is the normal p-to-z transformation. Drawn in probability coordinates, the relation between these two probabilities is transparent: The z scores for p_rep decrease linearly with the z scores for p, z_prep = −k·z_p, with k = 1/√2.
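Cumming's spreadsheet formula translates directly to Python, where scipy's norm.ppf and norm.cdf play the roles of NORMSINV and NORMSDIST (a one-line sketch; the function name is mine):

```python
from scipy.stats import norm

def p_to_prep(p):
    """prep = NORMSDIST(NORMSINV(1 - p) / sqrt(2)), for a one-tailed p."""
    return norm.cdf(norm.ppf(1 - p) / 2**0.5)

print(round(p_to_prep(0.029), 2))   # ~ .91, close to the .907 quoted below
```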

Advantages of p_rep over p. Given the affinity between p and p_rep, why switch? Indeed, a first reading may give the impression that p is preferable: Sentences that contain it also contain hypotheses, and those are what scientists are interested in. Conversely, p_rep does not give the probability that your hypothesis is true, or that the null is true, or that one or the other or both are false. It gives the long-run probability that an exact replication will support a claim. Both the original and the replicate may work for the wrong reasons—widdershins sampling errors in both cases. Rational scientists would prefer to know whether their hypotheses are true or false, not just whether their data are likely to replicate. But that knowledge is not granted by statistical inference. Null hypothesis statistical tests never let us assign probabilities to hypotheses (Fisher, 1959). This is a manifestation of the unsolved problem of inverse statistical inference (Cohen, 1994; Killeen, 2005c). Students are admonished, "Never use the unfortunate expression 'accept the null hypothesis'" (Wilkinson & the Task Force on Statistical Inference, 1999, p. 599); without priors on the null, they should also be admonished, "Never use the equally unfortunate expression 'reject the null hypothesis.'" One can make no claims more interesting than "My data are absurdly improbable under the null," leaving the implication unsaid. Stipulating the null keeps its truth value off the table as a conclusion.

A researcher can say, "My hypothesis led me to predict a positive effect from this manipulation. I found one, and if you repeat my manipulation, you have a probability of approximately p_rep of also finding one." As ever, the investigator can assert that the manipulation caused the effect, and the hypothesis that birthed the manipulation was correct, only to the extent all confounds were eliminated, as other causes may have been more operative than the ones manipulated; those alternative causes may have been systematic, or possibly discernable only as "error [sampling] variance." In the long run (as n increases), nonetheless, exact replications should succeed with probability p_rep. Whereas failure to reject the null is not easily interpreted, a p_rep = .85 is just as interpretable as a p_rep = .95. Other advantages are discussed in Killeen (2005a), who did not adequately emphasize that p_rep, based on a single experiment, is only an estimate of replication probability (Cumming, 2005; Iverson, Myung, & Karabatsos, 2006). Like confidence intervals and p values, its accuracy depends on just how representative of the population the original sample happened to be. Any one estimate of replicability may be off, but in the long run, p_rep provides a reliable estimate of replicability. Evidence of this is seen in p_rep's ability to predict the proportion of replications in meta-analyses (Killeen, 2005a).

THE DISTRIBUTION OF P, PREP, AND LOG-LIKELIHOOD RATIOS

In order to compare p_rep with its cousins, p and the log-likelihood ratio (LLR), a simulation was conducted to generate an empirical sampling distribution of effect sizes d_i. Twenty thousand samples of size n = 60 were taken from a normal distribution with mean 0.5 and variance σ²_d = 4/(60 − 4). This corresponds to the sampling distribution of differences between the means of experimental and control groups of 30 observations each, when the true difference between them is half a standard deviation (δ = 0.5). This should yield a typical p_rep = .907 and p = .029. Values of p_rep, p, and LLR were calculated, along with the z score transformation of


p_rep, NORMSINV(p_rep). As expected, the distribution of p_rep is negatively skewed (coefficient of skewness [CS] = −1.60; mean = .859; median = .906; see Figure 7.2 and Cumming, 2005). This skew has been cited as a fault of p_rep (Iverson et al., 2006). However, the distribution of p is even more strongly skewed (CS = 2.41; mean = .093; median = .031). Even though the means are biased, in both cases the median values of the test statistics are very close to their predicted values. The sampling distribution for the z scores of p_rep closely approximated a normal density, having a CS of 0.02. To aggregate values of p_rep over studies, it is therefore the z transform of p_rep that should be averaged (inversely weighted by the number of observations in each sample, if those differ).
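A sketch of that simulation (Python; names mine) reproduces the medians just quoted:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
delta, n = 0.5, 60
sd_d = np.sqrt(4 / (n - 4))                  # sampling SD of d

d = rng.normal(delta, sd_d, 20_000)          # 20,000 simulated effect sizes
p = 1 - norm.cdf(d / sd_d)                   # one-tailed p for each
prep = norm.cdf(d / (sd_d * np.sqrt(2)))     # Equation 3
llr = -(d / sd_d) ** 2 / 2                   # log-likelihood ratio, -z^2/2

for name, x in [("p", p), ("p_rep", prep), ("LLR", llr)]:
    print(name, np.round(np.median(x), 3))
# medians come out near .031, .906, and -1.75, as reported in the text
```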

For comparison, consider the likelihood of the data (d_i) given that the parameter equals zero, divided by the likelihood given that the parameter equals that observed, here 0.5. In the present scenario, its expected value is the ratio of the ordinate of the normal density at d = 0.5 under the null (with mean at 0) to that of the density under the alternate (with mean at 0.5). The natural logarithm of this ratio is the LLR, which is simply computed as −z²/2. For the present exercise, this is −(.5)²/(4/(60 − 4))/2 = −1.75. The distribution of LLRs is shown in the right panel of Figure 7.2. The likelihood ratio is positively skewed (CS = 0.93) and, as visible in Figure 7.2, the LLR is negatively skewed, to about the same degree as p_rep (CS = −1.46; mean = −2.25, median = −1.75, just as predicted). Readers may experiment with these and related distributions at the excellent site maintained by Cumming (2006).

The Bayes factor is an analogous statistic favored by Bayesians such as Iverson et al. (2006) and M. D. Lee and Wagenmakers (2005). If the hypotheses are simple and their prior plausibilities equal, then the Bayes factor is the likelihood ratio (P. M. Lee, 2004). If these conditions are not met, then a prior distribution for the parameters must be chosen and then integrated out (see, e.g., Wagenmakers & Grünwald, 2006). The distribution of the Bayes factor will depend on the nature of that prior distribution.

[Figure 7.2 here: two panels of histograms, "Distribution of Probabilities" (p and p_rep, abscissa 0 to 1) and "Distribution of Log Likelihood Ratios" (LLR, abscissa −10 to 0); ordinates show % Total.]

Figure 7.2 Twenty thousand samples of a normally distributed effect size, with mean δ = 0.5 and variance σ²_d = 4/(60 − 4), yielded these distributions of three test statistics.


REFUTATION AND VINDICATION

Predicting a positive effect of any size in replication may seem too weak a prediction to merit attention. It is therefore useful to identify two kindred indices of replicability, the probability of strong support, p_sup, and the probability of strong contradiction, p_con. Let us measure the degree of support that a real replication gives as its own value of p_rep. Set some threshold for calling a result strong, say, p* = .8. Then a replication that returns a p_rep > p* is called "strong support," and one that returns a p_rep greater than p* in the wrong direction (i.e., the replication's p_rep < 1 − p*) is called "strong contradiction." The middling outcomes between these are called weak support or contradiction depending on their sign. The criteria for just what constitutes strong support and strong refutation are arbitrary; we could use p*s of .75 or .8 or .9 or any other value. It is easy to calculate the resulting probabilities of support and refutation:

p_sup = 1 − normsdist[normsinv(p*) − normsinv(p_rep)];

p_con = 1 − normsdist[normsinv(p*) + normsinv(p_rep)].

Selected values of these expectations are found in Table 7.1. The first criterion in that table is p* = .5. This returns the probability that a replication will have an effect of the same sign. This is obviously just p_rep. The probability of a contradiction—an effect of an opposite sign—is the complement of p_rep. Consider next the row indexed by p_rep = .95. This corresponds to the threshold LLR that Bayesians consider strong evidence, as well as to a one-tailed critical region for p of .01. The probability of strong support at p* = .8 is .789. The probability of a replication returning an effect that is significant at p < .05 is given by p* = .88, close to p* = .9. For p_rep = .95, this happens about 2/3 of the time.

The probability of bad news—strong contradiction—is also found in Table 7.1. For p_rep = .95, p* = .8, it is .006: Fewer than one attempted replication out of a hundred will go that far in the opposite direction from the original data. Results that yield a p_rep in excess of .9 are unlikely to be refuted (on statistical grounds, at least!). Variations in the execution of replication attempts (e.g., those deriving from changes in the instruments and other features of the experimental or observational context) will inevitably add realization variance and, to the extent that they do so, will make these estimates optimistic.

Readers familiar with signal detection theory will immediately see that effect size d is nothing other than their familiar index of discriminability d′. The criterion p* is analogous to bias: Changes in p* do not move the criterion from left to right but move two criteria in and out. The data in Table 7.1 can be represented as points along receiver operating characteristics (ROCs), with p_sup corresponding to hits and p_con to false alarms. The resulting isosensitivity functions for the first few columns of Table 7.1 are shown in Figure 7.3, with the criterion p* increasing from .5 to .98 in smaller steps than shown in that table to draw the curves.
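The two expectations above are one-liners to compute; the following sketch (Python; function names mine) reproduces the p_rep = .95 row of Table 7.1:

```python
from scipy.stats import norm

def p_sup(prep, p_star):
    """Probability that a replication's own prep exceeds p_star."""
    return 1 - norm.cdf(norm.ppf(p_star) - norm.ppf(prep))

def p_con(prep, p_star):
    """Probability of a prep beyond 1 - p_star in the wrong direction."""
    return 1 - norm.cdf(norm.ppf(p_star) + norm.ppf(prep))

print(round(p_sup(0.95, 0.8), 3), round(p_con(0.95, 0.8), 3))  # 0.789 0.006
```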

Table 7.1 The Probability That a Result Will Be Replicated or Refuted at Different Levels of Confidence Given the Strength of the Original Results

                         Criterion p* for strong support or contradiction
                  0.5             0.75            0.8             0.85            0.9
  p      p_rep  p_sup   p_con   p_sup   p_con   p_sup   p_con   p_sup   p_con   p_sup   p_con
0.1000   0.818  0.818   0.182   0.592   0.057   0.526   0.040   0.448   0.026   0.354   0.014
0.0500   0.878  0.878   0.122   0.687   0.033   0.626   0.022   0.550   0.014   0.453   0.007
0.0250   0.917  0.917   0.083   0.762   0.020   0.707   0.013   0.637   0.008   0.542   0.004
0.0100   0.950  0.950   0.050   0.834   0.010   0.789   0.006   0.729   0.004   0.642   0.002
0.0050   0.966  0.966   0.034   0.874   0.006   0.836   0.004   0.784   0.002   0.705   0.001
0.0025   0.976  0.976   0.024   0.905   0.004   0.874   0.002   0.829   0.001   0.759   0.001
0.0010   0.986  0.986   0.014   0.935   0.002   0.910   0.001   0.875   0.001   0.817   0.000


[Figure 7.3 here: ROC-style curves plotting the probability of strong support, p_SUP (ordinate, 0 to 1), against the probability of strong contradiction, p_CON (abscissa, 0 to 0.2); one curve for each (d′, p_rep) pair: (1.6, .95), (1.4, .92), (1.2, .88), (0.9, .82).]

Figure 7.3 The probability of strong support (ordinates) and probability of strong contradiction (abscissae) both increase as the criterion for "strong" (p*) is reduced from an austere .98 (lower left origin of curves) to a magnanimous .50. The parameter is p_rep and its corresponding z score, d′.

These are not the traditional yes/no ROCs, but yes/no/uncertain, with the last category corresponding to replications that will fall between p_sup and p_con in strength.

One may also calculate the probability of a nil effect in replication. This is analogous to the probability of the null. If one takes a nil effect to be one that falls within, say, 10% of chance (.4 < p_rep < .6), the probability of a nil effect in replication when p_rep = .95 is 5%. This probability may easily be calculated as p_nil = p^L_sup − p^U_sup, where p^L_sup corresponds to the probability of support returned by the lower limit (p* = .4 in the example), and p^U_sup corresponds to the probability of support returned by the upper limit (p* = .6 in the example). For further discussion, see Sanabria and Killeen (2007).

Credible Confidence Intervals. A small but increasing number of editors are encouraging researchers to report measures of effect size (Cumming & Finch, 2005). A convenient way to specify the margin of error in effect size is by bounding it with confidence intervals (CI; see Chapter 17). Unfortunately, most investigators do not really understand what confidence intervals mean (Fidler, Thomason, Cumming, Finch, & Leeman, 2004). This confusion can be remedied by study of Cumming and Finch (2001), Loftus and Masson (1994), and Thompson (1999), along with the handy chapters of this volume. An alternate approach provides a more intuitive measure of the margin of error in data. One such measure is the interval over which a replication will fall with some stipulated probability. A convenient interval to use is the standard error of the statistic. Approximately half the replication attempts will fall within ±1 standard error, centered on the measured statistic (Cumming, Williams, & Fidler, 2004). I call this kind of CI a replication interval (RI). Its interpretation is direct, its calculation routine, and its presentation as error bars hedging the datum unchallenged (Estes, 1997).
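Both quantities of this passage reduce to short computations; a sketch under the worked example's numbers (Python; names mine):

```python
from scipy.stats import norm

def p_sup(prep, p_star):
    return 1 - norm.cdf(norm.ppf(p_star) - norm.ppf(prep))

# Probability of a nil effect (.4 < prep < .6) when prep = .95:
p_nil = p_sup(0.95, 0.4) - p_sup(0.95, 0.6)
print(round(p_nil, 2))        # 0.05, as in the text

# Replication interval: about half of replications fall within
# +/- 1 standard error of the measured effect size.
d1, se_d = 0.50, (4 / (60 - 4)) ** 0.5    # running example, n = 60
print(f"RI: {d1 - se_d:.2f} to {d1 + se_d:.2f}")
```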


EVIDENCE, BELIEF, AND ACTION

What prior information should be incorporated in the evaluation of a research claim? If there is a substantial literature in relevant areas, then the replicability of a new claim can be predicted more accurately by incorporating that information. For some uses, this is an optimal tactic and constitutes a running meta-analysis of the relevant literature. The evaluation of the evidence at hand would then, however, be confounded with the particular prior information that was engaged to optimize predictions. Other consumers, with other background information, would then have to deconvolute the experimental results from those priors. For most audiences, it seems best to bypass this step, letting the data speak for themselves and letting the consumers of the results add their own qualification based on their own sense of the prior results in the area (Killeen, 2005b). This decision is equivalent to assuming that the prior distribution of the population effect size is flat and is consistent with some analysts' advice to use only likelihood ratios (e.g., Glover & Dixon, 2004; Royall, 1997), not Bayesian posteriors, thereby eschewing the difficulties in the choice of priors. Royall (2004) is perhaps the most cogent, noting three questions that our statistical toolbox can help us address:

1. How should I evaluate this evidence?
2. What should I believe?
3. What should I do?

1. The first question confronts us when data are first assembled. Here, considerations of the past (priors) and the future (different prospective populations) confound analysis of the data on the table. Such externals "can obfuscate formal tests by including information not specifically contained within the experiment itself" (Maurer, 2004, p. 17). Royall (2004) and Maurer (2004) argue for an evidential approach using the likelihood ratios. They eschew the use of priors to convert these into the probability of the null because this ties the analysis to a reference set that may not be shared by other interested consumers of the evidence.

2. Belief, however, should take into account prior information, even information that may be particular to the individual. Each of us carries a unique reference set with which we update our beliefs in light of evidence. This is why serious crimes are evaluated by large juries of peers: large, to accommodate a range of reference sets; peers, to relate those priors to ones most relevant to the defendant. The transformation of evidence into belief (concerning a hypothesis or proposition) transforms a public datum into a personal probability. Shared evaluation of strong evidence will bring those personal probabilities toward convergence, but they will be identical only for those with identical reference sets. Savage (1972) took such personal probabilities to be "the only probability concept essential to science" (p. 56). Royall (2004) and Maurer (2004) disagree. Belief is indeed best constructed with Bayesian updating and motivates the acceptance or rejection of scientific theories, but evidence evaluation must be kept insulated from priors. Once that evaluation is executed, belief adjustment is natural. But beliefs should concern claims and hypotheses; they should not contaminate evidence.

3. What we should do depends both on what we believe and what we value. Decision theory tells us how to combine these factors to determine the course of action yielding the greatest expected benefit. In particular, the posterior predictive distributions shown in Figures 7.1 and 7.4 estimate the probability of different effect sizes in replication. If we can assign utility to outcomes as a function of their expected effect size, we can determine what values of d1 fall above a threshold of action. The sigmoid function in Figure 7.4 represents such a utility function. Multiplying the probability of each effect size in replication by the utility function and summing gives the expected utility of replicating the original results.

Given a posterior predictive distribution, it is straightforward to construct a decision theory to guide our action. Winkler (2003) teaches the basics, and Killeen (2006) applies them to recover traditional practices and extends them in nontraditional ways. The execution may employ flat priors and identical prospective populations; it then will guide disposition of the evidence: whether it be admitted to a corpus or rejected, whether the paper be published or not. Or the execution may employ informative priors and may take into consideration the realization variance involved in generalizing to new prospective populations. It then becomes a pragmatic guide for action, an optimal tool to guide medical, industrial, and civic programs.
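A sketch of that multiply-and-sum calculation (Python; the logistic utility and its parameters are my stand-ins for the sigmoid of Figure 7.4, not values from the chapter):

```python
import numpy as np
from scipy.stats import norm

d1, s_dR = 0.5, (2 * 4 / (60 - 4)) ** 0.5   # observed effect; SD of predictive density
grid = np.linspace(-2.0, 2.5, 1000)          # effect sizes in replication

density = norm.pdf(grid, d1, s_dR)   # posterior predictive distribution
utility = np.tanh(2 * grid)          # a symmetric, saturating utility u(d)

expected_utility = np.sum(density * utility) * (grid[1] - grid[0])
print(round(expected_utility, 2))    # positive: replication looks worthwhile
```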


[Figure 7.4 here: probability density of the replication distribution and the sigmoid utility function u(d), plotted against effect size d from −1 to 2.]

Figure 7.4 The Gaussian density is the posterior predictive distribution for a result with an effect size of d1 = 0.5. The sigmoid is an example of a utility function on effect size that expresses decreasing marginal utility as a function of effect size and treats positive and negative effects symmetrically.

REALIZATION VARIANCE

Whenever an attempt is made to replicate, it is inevitable that details of the procedure and subject population will vary. Successful replication with different instruments, instructions, and kinds of subjects lends generality to the results—but it also increases the risk of different results. This risk may be represented as realization variance, σ²_δj, the uncertainty added by deviations from exact replication (Raudenbush, 1994; Rubin, 1981; van den Noortgate & Onghena, 2003). The subscript j indicates that this variance is indexed to a particular field of research. The estimated variance of effect size in replication becomes

s²_dR = (s²_d1 + s²_δj) + (s²_d2 + s²_δj).    (4)

If the replication involves the same number of subjects, then s²_d2 = s²_d1, and the estimated variance of replication is

s²_dR = 2(s²_d1 + s²_δj).    (5)

In a meta-analysis of studies involving 8 million participants, Richard, Bond, and Stokes-Zoota (2003) reported a mean within-literature variance of σ²_δ = 0.092 (median = 0.08; σ_δ = 0.30), after correction for sampling variance (Hedges & Vevea, 1998). This substantial realization variance puts an upper limit on the probability of replicating results, even with n approaching infinity. The limit is given by Equation 3 with d_L = −∞ and

d_U = d1/(√2·s_δj).

This limit on replicability is felt most severely when d1 is small: For the typical realization variance within a field (0.08), the asymptotic p_rep for d1 = 0.1, 0.2, and 0.3 is .60, .69, and .77. It requires an effect size of d1 = 0.5 to raise replicability into the respectable range (p_rep = .9). Alas, that is substantially larger than the effect size typical of the social psychological literature, found by Richard and associates to be d1 ≈ 0.3. It appears that we can expect a typical (d1 = 0.3) research finding to receive positive support at any level in replication only about 75% of the time.
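A sketch of Equations 4 and 5 and their n → ∞ limit (Python; function name mine), reproducing the asymptotes just quoted:

```python
import numpy as np
from scipy.stats import norm

def prep_with_realization(d1, n, var_delta):
    """prep for an equal-n replication, per Equations 3-5."""
    var_d = 4 / (n - 4)                        # sampling variance of d
    s_dR = np.sqrt(2 * (var_d + var_delta))    # Equation 5
    return norm.cdf(d1 / s_dR)                 # Equation 3

for d1 in (0.1, 0.2, 0.3, 0.5):
    # n -> infinity: only the realization variance (0.08) remains
    print(d1, round(norm.cdf(d1 / np.sqrt(2 * 0.08)), 2))
# prints .6, .69, .77, and .89 -- the limits cited in the text
```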


[Figure 7.5 here: histogram titled "Distribution of Meta-analytic Effect Sizes in the Social Psychology Literatures"; Number of Research Literatures (ordinate, 0 to 40) against Mean Effect Size d (abscissa, 0 to 2), showing the Richard, Bond, & Stokes-Zoota (2003) data with a fitted curve.]

Figure 7.5 The distribution of effect sizes measured in 474 research literatures in social psychology. The data were extracted from Richard et al. (2003, Figure 1). The curve is Gaussian with mean 0 and overall variance 0.3.

Validation. Simulations were conducted to validate the logic of p_rep and the accuracy of the normal approximation for the noncentral t sampling distribution of d for small n. The analysis of Richard et al. (2003) provided a representative set of parameters. These investigators displayed the distribution of effect sizes for social psychological research involving 457 literatures, comprising 25,000 studies. Their measure of effect size, r, was converted into d by the relation d = 2r(1 − r²)^(−1/2) (Rosenthal, 1994). The authors reported absolute values of effect sizes because the direction of effect often reflects an arbitrary coding of dependent variables; however, it also may inflate the impression of replicability. The smooth curve through the data in Figure 7.5 provides another perspective on their report. (The nonpublication of small or conflicting effects—the "file drawer effect"—might be responsible for the outlier near d = 0.) The data in Figure 7.5 are used to create a population from which the simulation will sample effect sizes. The within-literature standard deviation σ_δ was set to 0.30, consistent with Richard and associates' estimate. Given these fingerposts, the simulation is described in Figure 7.6.

The relative frequency of successful replication was gauged for values of n = n_C + n_E ranging from 6 to 200, for nine ranges of |d| starting at 0 with upper limits of 0.08, 0.16, 0.24, 0.33, 0.43, 0.53, 0.65, 0.81, and 1.10. These frequencies, expressed as probabilities of replication (p_rep), are the ordinates of Figure 7.7. The abscissae are from Equation 2, with d1 taken as the midpoints of the nine ranges. s²_d was calculated from the simulata (see the appendix). The only parametric information used in the predictions was the realization variance σ²_δ, set to 0.09. The predictions held good down through very small ns, with average absolute deviations of 1.2 percentage points, on the order of binomial variability around exact predictions.

[Figure 7.6 here: simulation flowchart. Sample δ from the literature distribution, N(0, 0.55); sample δ1 (original) and δ2 (replicate) from N(δ, 0.30); for each, sample control observations from N(0, 1) and experimental observations from N(δ_i, 1); compute d1 and d2; if their signs agree, increment the count of successes for δ, otherwise the count of failures; iterate.]

Figure 7.6 The simulation: A representative effect size δ was first sampled from the distribution shown in Figure 7.5. Then two instances of δ were selected, one for the original study and one for the replicate. The difference between these parameters arises from the realization variance of 0.09 (σ_δ = 0.30). The data were normally distributed random variables for the control and experimental groups, with the latter sampled from a population shifted by the δ it represents. The proportion of times the original study predicted the sign of the replicate was tallied; information about the magnitude of effect was not retained. The process was iterated 20,000 times for each n from the literature studied.
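A compact sketch of the Figure 7.6 procedure (Python; names and seed mine; the literature SD of 0.55 and realization SD of 0.30 follow the text):

```python
import numpy as np

rng = np.random.default_rng(2)

def replication_rate(n, trials=20_000):
    """Proportion of replicates whose d has the sign of the original d."""
    ne = nc = n // 2
    hits = 0
    for _ in range(trials):
        delta = rng.normal(0.0, 0.55)          # literature effect (Fig. 7.5)
        d_pair = []
        for _ in range(2):                     # original study, then replicate
            dj = rng.normal(delta, 0.30)       # realization variance 0.09
            c = rng.normal(0.0, 1.0, nc)       # control observations
            e = rng.normal(dj, 1.0, ne)        # experimental observations
            sp = np.sqrt((c.var(ddof=1) + e.var(ddof=1)) / 2)
            d_pair.append((e.mean() - c.mean()) / sp)
        hits += (d_pair[0] > 0) == (d_pair[1] > 0)
    return hits / trials

print(replication_rate(60))   # compare with the predictions of Equation 2
```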


[Figure 7.7 here: scatterplot titled "Predicted and Obtained Replicability in Simulation"; Obtained Replicability (ordinate, 0.50 to 1.00) against Predicted Replicability (abscissa, 0.50 to 1.00); symbols mark total n of 200, 100, 50, 40, 30, 20, 16, 12, 10, 8, and 6.]

Figure 7.7 Results of the simulation. The obtained proportions of successful replications are plotted against the area over the positive axis of N(d1, s²_dR). Symbols indicate the total number of observations in the experimental and control groups combined.

META-ANALYSIS

Despite the width of the distributions of p_rep shown in Figure 7.2, p_rep proves generally accurate in predicting replicability, both in simulations such as those above and in meta-analytic compendia. For any one study, conclusions should always be tempered by the image of Figure 7.2 and the recognition that our test statistics were sampled from one like it. This is why real replications are crucial for science. These replications are best aggregated using meta-analyses. Consider, for example, the recent meta-analysis of body satisfaction among women of different ethnic groups (Grabe & Hyde, 2006). Ninety-three studies contributed data, whose median effect size was 0.30, indicating that Black women were more satisfied with their body image than were White women. The authors of the meta-analysis reported a random effects variance of 0.09, which is the average value Richard et al. (2003) reported for social-psychological literatures in general. Using that realization variance, the median p_rep for these studies was 0.78, and the proportion predicted by the 20% trimmed mean (Wilcox, 1998) of the z_prep was p_rep = .80. The percentage of published studies showing a positive effect was close to these predictions, .85. These are unweighted predictions because the point is to demonstrate the predictive accuracy of single estimates of p_rep, not to combine studies in an optimal fashion. Note that with realization variance set to zero, p_rep would seriously overestimate replicability (p_rep = .93), as expected.

Based on comparison of published and unpublished studies, as well as of studies that focused on ethnicity and those not so focused, the authors concluded that there was little evidence of publication bias—the inflation of effect sizes due to nonpublication of nonsignificant effects. Another way of estimating the influence of studies left in the "file drawer" because they did not yield a significant p value is to use the posterior predictive distribution to estimate the number of studies that should have reported nonsignificant effects.


For the median number of observations in each group of the meta-analysis, the effect size would have to exceed 0.36 for significance (α = .05, assuming a fixed effect model with zero realization variance, as is typical in conducting tests of significance, and a one-tailed critical region). Integration of the posterior predictive distribution (Equation 3) from d_L = −∞ up to d_U = 0.36 gives an expected number of studies with nonsignificant effects equal to 56; 53 were found in the literature. This corroborates the authors' inference of little publication bias. Assuming instead that authors of the original studies used two-tailed tests requires us to compare the expected and observed number of studies with effect sizes falling between ±0.43. Integration between these limits predicts 66 such results, 10 greater than the 56 observed. If 10 additional studies are inferred to reside only in file drawers because their effect sizes averaged around 0, when these are added to the 93 analyzed, the median effect size decreases from 0.30 to 0.27, leaving all conclusions intact. The same would be the case if this process were iterated, assuming the same proportion were filed for the larger hypostatized population. This further corroborates the authors' inference of little effect from publication bias.

A serious attempt to generalize—to predict replicability—must incorporate realization variance, just as traditional random and mixed effects models and hierarchical Bayesian models internalize it. But few original studies attempt to incorporate such estimates, resulting in discomfiture when results do not replicate (Ioannidis, 2005a, 2005b). This is compounded by the prevalent notion that "to replicate" means getting a significant effect in replication, rather than increasing our confidence in the original claim or narrowing our confidence intervals around an estimate. The posterior predictive distribution of Figure 7.1 and its integration over relevant domains by Equation 2 provide useful tools in the meta-analysis of results.

Deployment

This development of p_rep concerns only the simplest scenario, corresponding to a t test. Most inferential questions are more complicated. How does one calculate and interpret p_rep values in more interesting scenarios, such as analyses of variance (ANOVAs) with multiple levels, multivariate ANOVAs (MANOVAs), and multiple regressions? The provisional answer, given users' familiarity with traditional analyses, is to calculate a p value and transform it to p_rep using the standard conversion, z(p_rep) = −z(p)/√2. The p values returned from ANOVA stat-packs are analogous to t² and thus allow deviation among scores in either direction (even though they employ just the right tail of the F distribution). They should therefore be halved before converting to p_rep. In the case of multiple independent comparisons, the probability of replicating each of the observed effects equals the product of the constituent p_reps. This may be contrasted with the probability of finding positive effects if the null was true in all k cases, 2^−k. Cortina and Nouri (2000) show how to adjust d for correlated measures, and Bakeman (2005) provides detailed recommendations for the use of generalized eta squared as the preferred effect size statistic for repeated-measure designs. Manly (1997) and Westfall and Young (1993) describe resampling techniques for multiple testing. It is trivial to adjust such resampling to calculate p_rep directly: (a) Resample from within the control data and independently from within the experimental data, (b) calculate the resampled statistics (e.g., mean difference, trimmed mean difference, etc.) over half the resampled numbers to double the variance for the predictive distribution, (c) count the number of statistics of the same sign as the measured test, and (d) divide by the total number of resamples to estimate p_rep (sketched in code below).

Realization variance (σ²_δ) inflates the sampling variance of the effect size. This is awkward to introduce into the resampling process, but an adjustment can be easily made after the fact. The variance of d in replication is approximately 8/(n − 4) + 2σ²_δ. The resampling operation is essentially a compound Bernoulli process that can, for these purposes, be approximated by the normal distribution. Compute the z score corresponding to the p_rep resampled as above with no allowance for realization variance, and divide it by

√(1 + σ²_δ(n − 4)/4).

The normal transform of this adjusted z score gives the p_rep that can be expected in realistic attempts to replicate.¹

Conversion of the voluminous statistical literature into p_rep-native applications remains a task for the future—one that will be most expeditiously and accurately accomplished with randomization techniques, discussed below.


PROBLEMS WITH PREP

Frequentists will object to the introduction of distributions of parameters needed for the present derivation of p_rep. Parameters are by their definition fixed, if unknown, quantities. Frequentists will also be concerned that the choice of any particular informative prior can introduce an element of subjectivity into the calculation of probability. The introduction of uninformative priors, on the other hand, will expose the arbitrariness of choosing the particular version of them: Should an ignorance prior for the mass of a box be uniform on the length of its side or uniform on the cube of that length (Seidenfeld, 1979)? This debate has a long and nuanced history.

Bayesians (e.g., Wagenmakers & Grünwald, 2006) object that p_rep distracts us from an opportunity to compare alternative hypotheses. If credible alternate hypotheses are available, both the Neyman-Pearson framework and Bayesian analyses are to be preferred to the present one. But if those are not available, postulating them reintroduces the very sources of subjectivity and context-sensitive perspectives that p_rep permits us to sidestep.

An inelegance in the above analyses is that they invoked the unknown population parameter δ, only to marginalize it. Why not go directly from d1 to d2? As noted by O'Hagan and Forster (2004), "If we take the view that all inference is directly or indirectly motivated by decision problems, then it can be argued that all inference should be predictive, since inference about unobservable parameters has no direct relevance to decisions. . . . An extreme version of the predictivist approach is to regard parameters as neither meaningful nor necessary" (p. 90). Alas, as these authors, as well as Cumming (2005) and Doros and Geier (2005), note, unless d1 and d2 are treated as random samples from a population, they are not independent of one another, and the requisite evaluation of joint distributions of nonindependent variables is difficult. Fisher spent many of his latter years attempting to accomplish such direct inference with his fiducial probability theory, but his work was judged unsuccessful (Macdonald, 2005; Seidenfeld, 1979). We must resort to the introduction and marginalization of parameters described in the appendix.

More important than the above objections is the invalidity of a fundamental assumption in all of the above analyses. Traditional statistical tests, as well as p_rep as developed to this point, assume that the data are randomly sampled from the population to which generalization is desired—from the population whose parameter under the null (e.g., δ_0) is typically assumed to be zero. But this is a feat that is rarely attempted in science. Ludbrook and Dudley (1998; cited in Lunneborg, 2000, p. 551) surveyed 252 studies from biomedical journals and found that experimental groups were constructed by random sampling in only 4% of the cases. Of the 96% that were randomly assigned (rather than randomly sampled), 84% of the analyses employed inappropriate t or F tests—inappropriate because those tests assume random samples, not convenience samples. The situation is unlikely to be different in the fields known to the readers of this book. One can speculate why analyses are so misaligned with data. The potential causes are multivariate and include bad education, limitations to otherwise convenient stat-packs, the desire for statistics that permit generalization to a population (even though the dependence on technique a fortiori prohibits such generalization), and the ubiquity of reviewers who recognize that, even though statistical tests are merely arbitrary conventional filters that cannot legitimately be used to reject null hypotheses, they retain pragmatic value for rejecting dull hypotheses (Nickerson, 2000).

PERMUTATION STATISTICS

There is another way. It was introduced by Fisher, developed for general cases by Pitman (Pitman, 1937a, 1937b), and realized in a practical manner by modern techniques. Randomization, or rerandomization, or permutation tests are ways of comparing distributions of scores. They do not, like their computerized siblings the bootstrap tests, attempt to estimate or conditionalize on population parameters. Instead, they ask how frequently a random reassignment of the observed scores would generate differences at least as large as those observed. In executing such tests, the observed scores are randomly reassigned (without replacement) to ad hoc groups, the relevant statistics (mean, trimmed mean, median, variance, t score, etc.) computed, another random reassignment executed, and so on, thousands of times to create


In executing such tests, the observed scores are randomly reassigned (without replacement) to ad hoc groups, the relevant statistics (mean, trimmed mean, median, variance, t score, etc.) computed, another random reassignment executed, and so on, thousands of times to create a distribution of the sampling error expected under chance. Calculate the percentile of the observed statistic in that distribution. One minus that percentile gives an analog to the p value. If the observed statistic is in the 95th percentile, only 5% of the time would random assignments give deviations that large or larger.
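The mechanics are easily coded. Here is a sketch in Python with NumPy for a two-group comparison of means; the statistic, the two-sided criterion, and the number of shuffles are the analyst's (illustrative) choices.

import numpy as np

def permutation_p(control, experimental, n_shuffles=10_000, seed=None):
    # Analog of the p value: the proportion of random reassignments
    # that yield a difference at least as large as the one observed.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(control, dtype=float),
                             np.asarray(experimental, dtype=float)])
    n_c = len(control)
    observed = np.mean(experimental) - np.mean(control)

    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)            # random reassignment, without replacement
        diff = pooled[n_c:].mean() - pooled[:n_c].mean()
        extreme += abs(diff) >= abs(observed)
    return extreme / n_shuffles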
What the Permutation Distribution Means. In a deterministic world, no effects are uncaused, although the causes may be varied and complex and different for each observation (Hacking, 1990). Chance is the name for these otherwise unnamed, and generally unnamable, causes; it appears as the error variance that is added to our regression equations to permit them to balance. In resampling, we give the error variance free rein. The resulting randomization distributions may be viewed as random samples from thousands of these unnamed "hypotheses"—each corresponding to a different pattern in the data—that might account for the observations with more or less accuracy. The empirical distribution function generated by the random shuffles of data among groups gives the final distribution of hypotheses. All hypotheses except the investigator's are vague ("chance") and post hoc, so the investigator's are typically preferred, unless there are too many alternate reshuffles that sort the data into more extreme configurations—that is, unless they constitute more than, say, 5% of the distribution.

Another way of thinking about the partitioning is as the result of the random motion of particles (data) in space. This is a problem in thermodynamics, for analysis of which Boltzmann created the measure of disorder called entropy. Using similar logic and mathematics, we may calculate the amount by which entropy is reduced by learning to which group—experimental or control—each observation belongs. If there is no real effect, the information transmitted by the group designation will be approximately 0. The information transmitted by knowledge of the group grows with effect size and with the logarithm of the number of observations. Information-theoretic measures such as Kullback-Leibler (K-L) distance, as well as its unbiased realization in the Akaike information criterion, are modern extensions of Boltzmann's approach. The K-L distance is the average information that each observation adds toward discriminating the experimental and control groups. There is a natural affinity between permutation techniques and information-theoretic analyses: As the information gained from distinguishing the groups increases, there will be a corresponding decrease in the number of alternate hypotheses that will provide more information than the investigator's. Replicability may be measured in terms of the probability of finding effects in replication that continue to make the distinction between groups worthwhile.

Although these considerations can give deeper meaning to the analysis, most consumers of statistics will be content to understand the results of permutation analyses as an analog of the p value, the proportion of randomizations that provide a more informative sorting of the observations than the experimenter's labels "experimental" and "control." There are many good programs available to carry out this analysis; see Cai (2006) and the references in Good (2000), Higgins (2004), and Manly (1997).

Permutation tests ask, "How often would this happen by chance?" not "How likely is this to happen again by design?" The short answer to "How can I generalize this result?" is the same as that given to users of traditional statistical design who have not sampled randomly from the populations to which they would generalize: "At your own hazard." To predict replicability in an attempt with an n of the same size, follow the same steps as given for bootstrap techniques above, including within-group shuffling, half-sizing, and correction for realization variance (see the sketch below). For permutation techniques, however, the randomization is without replacement (whereas in bootstrapping, it is with replacement). The half-sizing makes this analysis a hybrid of permutation and bootstrap techniques—the resample is constructed not out of the unlimited population of the bootstrap but out of populations twice the size of the original investigation. It permits extrapolation to a replication that samples from a small population derived from individuals identical to those in the original experiment, one from which the original was also ostensibly sampled. Because permutation techniques are generally more powerful than bootstrap techniques (see Mielke & Berry, 2001, and Chapter 19, this volume), predictions of replicability will be higher than for bootstrap techniques, making inclusion of nonzero realization variance even more important for realistic projections.
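A sketch of that hybrid recipe, under the same illustrative assumptions as the bootstrap sketch above; the only changes are the within-group draws without replacement and the optional realization-variance correction.

import numpy as np
from scipy.stats import norm

def prep_permutation(control, experimental, sigma_delta_sq=0.0,
                     n_shuffles=10_000, seed=None):
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    sign = np.sign(experimental.mean() - control.mean())

    same_sign = 0
    for _ in range(n_shuffles):
        # half-sized draws WITHOUT replacement: in effect, sampling from
        # populations twice the size of the original investigation
        c = rng.choice(control, size=max(2, control.size // 2), replace=False)
        e = rng.choice(experimental, size=max(2, experimental.size // 2),
                       replace=False)
        same_sign += np.sign(e.mean() - c.mean()) == sign
    prep = same_sign / n_shuffles

    if sigma_delta_sq > 0:             # post hoc correction, as described above
        n = control.size + experimental.size
        z = norm.ppf(prep) / np.sqrt(1 + sigma_delta_sq * (n - 4) / 4)
        prep = norm.cdf(z)
    return prep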


The same post hoc correction described above may be used: Divide the z score of replicability by √(1 + σδ²(n − 4)/4). As noted by Higgins (2004), Lunneborg (2000), and others, computer-implemented permutation tests are the gold standard to which modern techniques such as ANOVA are an approximation; they are worth learning to use.

THE THREE PATHS OF STATISTICAL INFERENCE

Traditional Fisher/Neyman-Pearson statistics have been the primary mode of inference in the field for half a century, despite the fact that "frequentist theory is logically inadequate for the task of uncertain estimation (it provides right answers to wrong questions), ... the bridge from statistical technologies to actual working science remains sketchy" (Dempster, 1987, p. 2). Twenty years have not greatly changed that assessment. Littlewood, if you remember from the epigraph, did not see it as his "business to justify application of the system"—an Olympian view shared by some modern statisticians. Because NHST in particular cannot provide mathematical estimates of the probabilities of hypotheses, it is constitutionally unfit for deciding between null and alternate hypotheses. Introductory statistics texts should carry warning labels: "The Statistician General warns that use of the algorithms contained herein justifies no inferences about hypotheses and no generalization to populations unsampled. Their assumptions seldom match their applications. Their critical regions can displace critical judgments. Their peremptory authority can damage unborn hypotheses. Their punctilio shifts authority from scientists to stat-packs. Addictive."

Bayesian analysis (Chapter 33, this volume) is a step forward. It provides the machinery for deriving the posterior predictive distribution on which prep is based and on which a decision theory for science may be erected. It has a vigorous literature (e.g., Howson & Urbach, 1996; Jaynes & Bretthorst, 2003). As currently deployed, it is sometimes hobbled by continuing to maintain the frequentists' focus on parameters. It has been blamed for making probabilities subjective, but as long as reference sets are unique, probabilities must always be conditional on those priors.

Permutation techniques are another step forward, in that they more closely model the scientific process. Modern computer packages make them easy to implement and easier to teach than traditional statistical pedagogy based on Fisher/Neyman-Pearson inference.

What this chapter hopes to convey is that by setting our sights a bit lower—down from the heavens of Platonic parameters to the earthier Aristotelian enterprise of predicting replicability—our inferences will be simpler and more useful. Much work needs yet to be accomplished: generalizing replicability statistics to cases with multiple degrees of freedom in the numerator, securing their implementation with permutation tests, and utilizing those tests for predicting replicability in contexts involving substantial realization variance. But accomplishing these tasks should be straightforward, and their execution will bring us closer to the fundamental task of scientists: to validate observations through prediction and replication.

APPENDIX

The Statistics of Effect Size

Pooled variance is

sp² = [sC²(nC − 1) + sE²(nE − 1)] / (n − 2).   (A1)

The Route to Posterior Predictive Distributions

Bayesian statistics provides a standard way to calculate f(d2 | d1) (see, e.g., Bolstad, 2004; Winkler, 2003): Predicate the unknown parameter, such as the population mean effect size δ; update that predication with the observed data; and then calculate the posterior predictions over all possible values of the parameter, weighted by the probability of the parameter given the observed data. The predication is a nuisance, and eliminating the nuisance parameter by integrating it out in the last step is called marginalization.
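In code, Equation A1 is a weighted average of the two group variances. A minimal sketch follows; it also computes the effect size as the mean difference scaled by the pooled standard deviation, an assumed rendering of the chapter's Equation 1, which is not reproduced here.

import numpy as np

def pooled_variance(control, experimental):
    # Equation A1: group variances weighted by their degrees of freedom
    c = np.asarray(control, dtype=float)
    e = np.asarray(experimental, dtype=float)
    n = c.size + e.size
    return (c.var(ddof=1) * (c.size - 1) + e.var(ddof=1) * (e.size - 1)) / (n - 2)

def effect_size_d(control, experimental):
    # Assumed form of Equation 1: standardized mean difference
    return (np.mean(experimental) - np.mean(control)) / np.sqrt(
        pooled_variance(control, experimental))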


f(d2 | d1) = ∫ f(d2, δ | d1) dδ
           = ∫ f(d2 | δ, d1) f(δ | d1) dδ
           ∝ ∫ f(d2 | δ, d1) f(d1 | δ) f(δ) dδ,

where f(δ | d1) is the posterior distribution of the parameter in light of the observations, and f(δ) is the prior distribution of the parameter. Integration of the last line over all population parameters δ delivers the posterior predictive distribution. If f(δ) is assumed to have a very large variance (flat or ignorance priors), then the observed data dominate the result. If a credible prior distribution is available (it is generally sought by "empirical Bayesians"), then the final predictions of replicability will be more accurate.

Consider the case in which the prior on the population mean δ is normally distributed. Then its posterior f(δ | d1) is n(d1′, s1′²), where the primed variables are weighted averages of the priors and the observed statistics, with the weights proportional to the precisions (reciprocal variances) of the means (sprior⁻² and s1⁻²). If we are relatively ignorant a priori of the value of the parameter, its distribution is flat relative to that of the observed statistic (sprior² >> s1²), and then d1′ ≈ d1 and s1′ ≈ s1. This is the case developed here. When the sampling distribution of the statistic is also approximately normal—a reasonable assumption for measures of effect size even when n is relatively small (Hedges & Olkin, 1985)—then factoring, completing the square, and removing constants leads eventually (Bolstad, 2004; Winkler, 2003) to a normal density with mean d1 and variance sdR² = s1² + s2² = 2s1²:

f(d2 | d1) ∝ exp[−(d2 − d1)² / (2(s1² + s2²))].

Integration of this between appropriate limits, as in Equations 2 and 3, leads to the central results of this chapter.
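Concretely, integrating this normal predictive density over replicated effects on the same side of zero as d1 (the sign criterion used in the resampling recipe above) gives prep in closed form. A sketch, assuming the sampling-variance approximation sd² ≈ 4/(n − 4) used in the simulations below; the function name is illustrative.

from math import sqrt
from scipy.stats import norm

def prep_normal(d1, n):
    # Posterior predictive for d2 is normal with mean d1, variance 2*s1^2;
    # prep is its mass on the same side of zero as d1.
    s1_sq = 4.0 / (n - 4)              # approximate sampling variance of d
    return norm.cdf(abs(d1) / sqrt(2 * s1_sq))

For example, prep_normal(0.5, 20) returns about .76.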
The Simulations Understanding Statistics, 3, 299–311. In the simulations for Figures 7.6 and 7.7, d′ Davison, A. C., & Hinkley, D. V. (1997). Bootstrap was calculated using Equations 1 and A1. methods and their application. New York: Variance was calculated using Equation 5 and Cambridge University Press. Dempster, A. P. (1987). Probability and the future of statistics. In I. B. MacNeill & G. J. Umphrey 4 s2 ≈ . (Eds.), Foundations of statistical inference d n − 4 (Vol. 2, pp. 1–7). Dordrecht, Holland: D. Reidel. 07-Osborne (Best)-45409.qxd 10/9/2007 4:31 PM Page 123


Doros, G., & Geier, A. B. (2005). Comment on "An alternative to null-hypothesis significance tests." Psychological Science, 16, 1005–1006.
Estes, W. K. (1997). On the communication of information by displays of standard errors and confidence intervals. Psychonomic Bulletin & Review, 4, 330–341.
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can't make them think: Statistical reform lessons from medicine. Psychological Science, 15, 119–126.
Fisher, R. A. (1959). Statistical methods and scientific inference (2nd ed.). New York: Hafner.
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606.
Glover, S., & Dixon, P. (2004). Likelihood ratios: A simple and flexible statistic for empirical psychologists. Psychonomic Bulletin & Review, 11, 791–806.
Good, P. (2000). Permutation tests: A practical guide to resampling methods for testing hypotheses (2nd ed.). New York: Springer-Verlag.
Grabe, S., & Hyde, J. S. (2006). Ethnicity and body satisfaction among women in the United States: A meta-analysis. Psychological Bulletin, 132, 622–640.
Hacking, I. (1990). The taming of chance. New York: Cambridge University Press.
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Hays, W. L. (1963). Statistics for psychologists. New York: Holt, Rinehart and Winston.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
Hedges, L. V., & Vevea, J. L. (1998). Fixed- and random-effects models in meta-analysis. Psychological Methods, 3, 486–504.
Higgins, J. J. (2004). An introduction to modern nonparametric statistics. Pacific Grove, CA: Brooks/Cole.
Howson, C., & Urbach, P. (1996). Scientific reasoning: The Bayesian approach (2nd ed.). Chicago: Open Court.
Ioannidis, J. P. A. (2005a). Contradicted and initially stronger effects in highly cited clinical research. Journal of the American Medical Association, 294(2), 218–228.
Ioannidis, J. P. A. (2005b). Why most published research findings are false. PLoS Medicine, 2(8), e124.
Iverson, G., Myung, I. J., & Karabatsos, G. (2006, August). P-rep, p-values and … . Paper presented at the Society for Mathematical Psychology, Vancouver, BC.
Jaynes, E. T., & Bretthorst, G. L. (2003). Probability theory: The logic of science. Cambridge, UK: Cambridge University Press.
Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5, 411–414.
Killeen, P. R. (2005a). An alternative to null hypothesis significance tests. Psychological Science, 16, 345–353.
Killeen, P. R. (2005b). Replicability, confidence, and priors. Psychological Science, 16, 1009–1012.
Killeen, P. R. (2005c). Tea-tests. The General Psychologist, 40(2), 16–19.
Killeen, P. R. (2006). Beyond statistical inference: A decision theory for science. Psychonomic Bulletin & Review, 13, 549–562.
Kline, M. (1980). Mathematics, the loss of certainty. New York: Oxford University Press.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Kyburg, H. E., Jr. (1987). The basic Bayesian blunder. In I. B. MacNeill & G. J. Umphrey (Eds.), Foundations of statistical inference: Vol. 2 (pp. 219–232). Boston: D. Reidel.
Lee, M. D., & Wagenmakers, E.-J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668.
Lee, P. M. (2004). Bayesian statistics: An introduction (3rd ed.). New York: Hodder/Oxford University Press.
Littlewood, J. E. (1953). A mathematician's miscellany. London: Methuen.
Loftus, G. R., & Masson, M. E. J. (1994). Using confidence intervals in within-subject designs. Psychonomic Bulletin & Review, 1, 476–490.
Ludbrook, J., & Dudley, H. (1998). Why permutation tests are superior to t and F tests in biomedical research. The American Statistician, 52, 127–132.
Lunneborg, C. E. (2000). Data analysis by resampling: Concepts and applications. Pacific Grove, CA: Brooks/Cole/Duxbury.
Macdonald, R. R. (2005). Why replication probabilities depend on prior probability distributions: A rejoinder to Killeen (2005). Psychological Science, 16(12), 1007–1008.
Manly, B. F. J. (1997). Randomization, bootstrap and Monte Carlo methods in biology. New York: Chapman & Hall/CRC.
Maurer, B. A. (2004). Models of scientific inquiry and statistical practice: Implications for the structure of scientific knowledge. In M. L. Taper & S. R. Lele (Eds.), The nature of scientific evidence: Statistical, philosophical, and empirical considerations (pp. 17–50). Chicago: University of Chicago Press.
Mielke, P. W., & Berry, K. J. (2001). Permutation methods: A distance function approach. New York: Springer.


Mooney, C. Z., & Duval, R. D. (1993). Bootstrapping: A nonparametric approach to statistical inference (Vol. 95). Newbury Park, CA: Sage.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
O'Hagan, A., & Forster, J. (2004). Kendall's advanced theory of statistics: Vol. 2B. Bayesian inference (2nd ed.). New York: Oxford University Press.
Park, S. K., & Miller, K. W. (1988). Random number generators: Good ones are hard to find. Communications of the ACM, 31(10), 1192–1201.
Pitman, E. J. G. (1937a). Significance tests which may be applied to samples from any populations. Supplement to the Journal of the Royal Statistical Society, 4(1), 119–130.
Pitman, E. J. G. (1937b). Significance tests which may be applied to samples from any populations: II. The correlation coefficient test. Supplement to the Journal of the Royal Statistical Society, 4(2), 225–232.
Raudenbush, S. W. (1994). Random effects models. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 301–321). New York: Russell Sage Foundation.
Richard, F. D., Bond, C. F. J., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331–363.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 231–244). New York: Russell Sage Foundation.
Royall, R. (1997). Statistical evidence: A likelihood paradigm. London: Chapman & Hall.
Royall, R. (2004). The likelihood paradigm for statistical evidence. In M. L. Taper & S. R. Lele (Eds.), The nature of scientific evidence: Statistical, philosophical, and empirical considerations (pp. 119–152). Chicago: University of Chicago Press.
Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6, 377–400.
Sanabria, F., & Killeen, P. R. (2007). Better statistics for better decisions: Rejecting null hypothesis significance tests in favor of replication statistics. Psychology in the Schools, 44, 471–481.
Savage, L. J. (1972). The foundations of statistics (2nd ed.). New York: Dover.
Seidenfeld, T. (1979). Philosophical problems of statistical inference: Learning from R. A. Fisher. London: D. Reidel.
Thompson, B. (1999). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology, 9(2), 165–181.
van den Noortgate, W., & Onghena, P. (2003). Estimating the mean effect size in meta-analysis: Bias, precision, and mean squared error of different weighting methods. Behavior Research Methods, Instruments, & Computers, 35, 504–511.
Wagenmakers, E.-J., & Grünwald, P. (2006). A Bayesian perspective on hypothesis testing: A comment on Killeen (2005). Psychological Science, 17, 641–642.
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustments. New York: John Wiley.
Wilcox, R. R. (1998). How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53, 300–314.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Winkler, R. L. (2003). An introduction to Bayesian inference and decision (2nd ed.). Gainesville, FL: Probabilistic Publishing.