<<

Computing Bayes factors to measure evidence from : An extension of the BIC approximation

Thomas J. Faulkenberry∗ Tarleton State University

Bayesian inference affords scientists with powerful tools for testing hypotheses. One of these tools is the Bayes factor, which indexes the extent to which support for one hypothesis over another is updated after seeing the . Part of the hesitance to adopt this approach may stem from an unfamiliarity with the computational tools necessary for computing Bayes factors. Previous work has shown that closed form approximations of Bayes factors are relatively easy to obtain for between-groups methods, such as an analysis of or t-test. In this paper, I extend this approximation to develop a formula for the Bayes factor that directly uses infor- mation that is typically reported for ANOVAs (e.g., the F ratio and degrees of freedom). After giving two examples of its use, I report the results of simulations which show that even with minimal input, this approximate Bayes factor produces similar results to existing software solutions.

Note: to appear in Biometrical Letters.

I. INTRODUCTION A. The Bayes factor

Bayesian inference is a method of measurement that is Hypothesis testing is the primary tool for statistical based on the computation of P (H | D), which is called inference across much of the biological and behavioral the of a hypothesis H, given data D. sciences. As such, most scientists are trained in classical Bayes’ theorem casts this probability as null hypothesis significance testing (NHST). The scenario for testing a hypothesis is likely familiar to most readers of this journal. Suppose one wants to test a specific re- P (D | H) · P (H) P (H | D) = . (1) search hypothesis (e.g., some treatment has an effect on P (D) some outcome measure). NHST works by first assuming a null hypothesis (e.g., the treatment has no effect) and One may think of Equation 1 in the following manner: then computing some test for a sample of data. before observing data D, one assigns a This sample test statistic is then compared to a hypothet- P (H) to hypothesis H. After observing data, one can ical distribution of test that would arise if the then update this prior probability to a posterior proba- null hypothesis were true. If the sample’s test statistic bility P (H | D) by multiplying the prior P (H) by the is in the tail of the distribution (that is, it should oc- likelihood P (D | H). This product is then rescaled to cur with low probability), the scientist decides to reject a (i.e., total probability = 1) by the null hypothesis in favor of the . dividing by the marginal probability P (D). Further, the p-value, which indicates how surprising the Bayes’ theorem provides a natural way to test hy- sample would be if the null hypothesis were true, is often potheses. Suppose we have two competing hypotheses: taken as a measure of evidence: the lower the p-value, an alternative hypothesis H1 and a null hypothesis H0. the stronger the evidence. We can directly compare the posterior probabilities of H and H by computing their ratio; that is, we can While orthodox across many disciplines, NHST does 1 0 compute the posterior in favor of H over H as have philosophical criticisms [13]. Also, the p-value is 1 0 P (H1 | D)/P (H0 | D). Using Bayes’ theorem (Equation arXiv:1803.00360v2 [stat.CO] 2 Mar 2018 prone to misinterpretation [2, 3]. Finally, NHST is ide- 1), it is trivial to see that ally suited to providing support for the alternative hy- pothesis, but the procedure does not work in the case where one wants to measure support for the null hypoth- P (H1 | D) P (D | H1) P (H1) esis. That is, we can reject the null, but we cannot accept = · . (2) P (H0 | D) P (D | H0) P (H0) the null. To overcome this limitation, we can use an al- | {z } | {z } | {z } ternative method for testing hypotheses that is based on posterior odds Bayes factor prior odds Bayesian inference: the Bayes factor. This equation can also be interpreted in terms of the “updating” metaphor that was explained above. Specifi- cally, the posterior odds are equal to the prior odds multi- plied by an updating factor. This updating factor is equal to the ratio of likelihoods P (D | H1) and P (D | H0), and is called the Bayes factor [4]. Intuitively, the Bayes factor ∗ [email protected] can be interpreted as the weight of evidence provided by 2 a set of data D. For example, suppose that one assigned values are often over a continuous parameter space, this the prior odds of H1 and H0 equal to 1; that is, H1 and computation requires integration, and thus the formula H0 are a priori assumed to be equally likely. Then, sup- for the Bayes factor amounts to pose that after observing data D, the Bayes factor was computed to be 10. Now, the posterior odds (the odds of H1 over H0 after observing data) is 10:1 in favor of R H1 over H0. As such, the Bayes factor provides an easily P (D | H0, θ)π0(θ)dθ θ∈Θ0 interpretable measure of the evidence in favor of H1. BF01 = R (3) P (D | H1, θ)π1(θ)dθ In order to help with interpreting Bayes factors, var- θ∈Θ1 ious classification schemes have been proposed. One simple scheme is a four-way classification proposed by Raftery [9], where Bayes factors between 1 and 3 are where Θ0 and Θ1 are the parameter spaces for models considered weak evidence; between 3 and 20 constitutes H0 and H1, respectively, and π0 and π1 are the prior positive evidence; between 20 and 150 constitutes strong probability density functions of the parameters of H0 and evidence; and beyond 150 is considered very strong evi- H1, respectively. dence. Thus, in order to compute BF , one must specify the Note that in the discussion above, there was no spe- 01 priors π and π for H and H . Further, the integrals cific assumption about the order in which we addressed 0 1 0 1 usually do not have closed-form solutions, so numerical H and H . If instead we wanted to assess the weight of 1 0 integration techniques are necessary. These requirements evidence in favor of H over H , Equation 2 could sim- 0 1 lend a computation of the Bayes factor to be inaccessible ply be adjusted by taking reciprocals. As such, implied to all but the most ardent researchers who have at least direction is important when computing Bayes factors, so a more-than-modest amount of mathematical training. one must be careful to define notation when representing Bayes factors. A common convention is to define BF10 as Fortunately, there are an increasing number of solu- the Bayes factor for H1 over H0; similarly, BF01 would tions that avoid a direct encounter with computations represent the Bayes factor for H0 over H1. Note that of the above type. Recently, researchers have proposed BF10 = 1/BF01. default priors for standard experimental designs such as In summary, the Bayes factor provides an index of pref- t-tests [7, 11] and ANOVA [10]. These default priors are erence for one hypothesis over another that has some ad- implemented in software packages such as the R pack- vantages over NHST. First, the Bayes factor tells us by age BayesFactor [8], and as such, have provided a user- how much a data sample should update our belief in one friendly method for researchers to compute Bayes factors hypothesis over a competing one. Second, though NHST without the computational overhead needed in Equation does not allow one to accept a null hypothesis, doing so 3. While these software solutions work quite well for com- within a Bayesian framework makes perfect sense. Given puting Bayes factors from raw data, they are a bit lim- these advantages, it may be surprising that Bayesian in- ited in the following context. Suppose that in the course ference has not been used more often in the empirical of reading some published literature, a researcher comes sciences. One reason for the lack of more widespread across a result that is presented as “nonsignificant”, with adoption may be that Bayes factors are quite difficult to associated test statistic F (1, 23) = 2.21, p = 0.15. In an compute. We tackle this issue in the next section. NHST context, this nonsignificant result does not provide evidence for the null hypothesis; rather, it just implies that we cannot reject the null. A natural question would II. COMPUTING BAYES FACTORS be what, if any, support does this result provide for the null hypothesis? Of course, a Bayes factor would be use- As an example, suppose we are interested in computing ful here, but without the raw data, we cannot use the the Bayes factor for a null hypothesis H0 over an alterna- previously mentioned software solutions. To this end, it tive hypothesis H1, given data D. Recall from Equation would be advantageous if there were some easy way to 2 that this Bayes factor (denoted BF01) is equal to compute a Bayes factor directly from the reported test statistic.

P (D | H0) It turns out that this computation is indeed possible, BF01 = . at least in certain cases. In the following, I will show P (D | H1) how one particular method for computing Bayes factors While this equation may seem conceptually quite sim- [the BIC approximation; 9] can be adapted to solve this ple, it is computationally much more difficult. This is problem, thus allowing researchers to compute approxi- because in order to compute the numerator and denomi- mate Bayes factors from alone (with nator, one must parameterize the hypotheses (or models, no need for raw data). Further, I will show through sim- to be more clear), and then each likelihood is computed ulations that this method compares well to the default by conditioning over all possible parameter values and Bayes factors for ANOVA developed by Rouder et al. summing over this set. Since these potential parameter [10]. 3

III. THE BIC APPROXIMATION OF THE Now, we can use algebra to re-express ∆BIC10 in terms BAYES FACTOR of F :

Wagenmakers [13] demonstrated a method [based on   SS2 earlier work by 9] for computing approximate Bayes fac- ∆BIC10 = n log + df1 log n SS + SS tors using the BIC (Bayesian Information Criterion). For 1 2 ! a given model Hi, the BIC is defined as 1 = n log + df1 log n SS1 + 1 SS2 BIC(H ) = −2 log L + k · log n, df2 ! i i i df1 = n log + df1 log n SS1 · df2 + df2 where n is the number of observations, ki is the number SS2 df1 df1 df of free parameters of model Hi, and Li is the maximum 2 ! df1 likelihood for model Hi. He then showed that the Bayes = n log + df1 log n F + df2 factor for Ho over H1 can be approximated as df1   df2 = n log + df1 log n.   F df1 + df2 BF01 ≈ exp ∆BIC10/2 , (4) where ∆BIC10 = BIC(H1) − BIC(H0). Further, Wagen- Substituting this into Equation 4, we can compute: makers [13] showed that when comparing an alternative hypothesis H1 to a null hypothesis H0, BF01 ≈ exp (∆BIC10/2) 1   df   = exp n log 2 + df log n ! 2 F df + df 1 SSE 1 2 1     ∆BIC10 = n log + (k1 − k0) log n. (5) n df2 df1 SSE0 = exp log + log n 2 F df1 + df2 2 n/2 In this equation, SSE0 and SSE1 represent the sum of  df  2 df1/2 squares for the error terms in models H0 and H1, re- = · n F df1 + df2 spectively. Both Wagenmakers [13] and Masson [6] give s n df1 excellent examples of how to use this approximation to df2 · n = n compute Bayes factors, assuming one is given informa- (F df1 + df2) tion about SSE0 and SSE1, as is the case with most v statistical software. However, we will now consider the u ndf1 = u . situation where one is given the statistical summary (i.e., t n F df1 + 1 F (1, 23) = 2.21), but not the ANOVA output. df2 Suppose we wish to examine an effect of some indepen- Rearranging this last expression slightly yields the ap- dent variable with associated F -ratio F (df , df ), where 1 2 proximation: df1 represents the degrees of freedom associated with the s manipulation, and df2 represents the degrees of freedom  −n SS /df F df1 associated with the error term. Then, F = 1 1 = BF ≈ ndf1 1 + (6) SS2/df2 01 df2 SS1 df2 · , where SS1 and SS2 are the sum of squared er- SS2 df1 rors associated with the manipulation and the error term, Practically speaking, the approximation given in Equa- respectively. tion 6 offers nothing new over the previous formulations From Equation 5, we see that of the BIC approximation given in Wagenmakers [13] and Masson [6]. However, it does have two advantages over these previous formulations. First, one can directly take   SSE1 reported ANOVA statistics (e.g., sample size, degrees of ∆BIC10 = n log + (k1 − k0) log n SSE0 freedom, and the F -ratio) and compute BF01 without   SS2 having to compute SSE0 or SSE1. We should note that = n log + df log n. 2 1 Masson [6] correctly mentions that SSE1/SSE0 = 1−η , SS1 + SS2 p 2 so if a paper reports ηp, the need for computing SSE0 This equality holds because SSE1 represents the sum of and SSE1 is nullified. However, the method of Masson squares that is not explained by H1, which is simply SS2 [6] is still essentially a two-step process; one first com- (the error term). Similarly, SSE0 is the sum of squares putes ∆BIC10, which in turn is used to compute BF01. not explained by H0, which is the sum of SS1 and SS2 In contrast, the expression derived in Equation 6 is a [see 13, p. 799]. Finally, in the context of comparing H1 one-step process that can easily be implemented using a and H0 in an ANOVA design, we have k1 − k0 = df1. scientific calculator or a simple spreadsheet. 4

IV. EXAMPLE COMPUTATIONS a placebo, t(71) = 2.0, p = 0.049. Borota et al. [1] claimed this result as evidence that caffeine enhances In this section, we will discuss two examples of using memory consolidation. Equation 6 to compute Bayes factors. In the first exam- As before, we can measure the evidence provided from ple, I will show how to compute and interpret a Bayes this data sample by computing a Bayes factor. However, factor for a reported null effect in the field of experimen- we note that because Equation 6 casts the Bayes factor in tal psychology. In the second example, I will show how to terms of an F -ratio, it may not be immediately obvious modify Equation 6 to work with an independent samples whether we can use Equation 6 in this context. It turns t-test. out to be straightforward to modify Equation 6 to work for an independent samples t-test. All we need are two 2 simple transformations: (1) F = t , and (2) df1 = 1. A. Example 1 Applying these to Equation 6, we get

s Sevos et al. [12] performed an to assess  −n F df1 whether schizophrenics could internal simulate motor ac- BF ≈ ndf1 1 + 01 df tions when perceiving graspable objects. The evidence 2 s for such internal simulation comes from a statistical in-  t2(1)−n teraction between response-orientation compatibility and = n1 1 + df the presence of an individual name prime. Sevos et al. 2 s reported that in a sample of n = 18 schizophrenics, there  t2 −n was no between this compatibility and name = n 1 + df prime, F (1, 17) = 2.584, p = 0.126. Critically, Sevos 2 et al. claimed that this null effect was evidence for the absence of sensorimotor simulation, which, as they indi- cate, would imply that schizophrenics would have to rely We can now apply this equation to the reported results on higher cognitive processes for even the most simple of Borota et al. [1]. We see that daily tasks. This claim is based on a null effect, which as pointed out earlier, is problematic for a null hypothe- s −n  t2  sis testing framework. I will now show how to compute BF01 ≈ n 1 + df2 a Bayes factor BF01 to assess the evidence for this null effect. s −73  (2.0)2  To this end, we use Equation 6 to compute = 73 1 + 71 = 1.16 s  −n F df1 BF ≈ ndf1 1 + 01 df 2 So, perhaps counterintuitively, the significant result re- s  2.584 · 1−18 ported in Borota et al. [1] turns out to be weak evidence = 181 1 + 17 in support of the null! Such results are an example of Lindley’s paradox [5], where “significant” p values be- = 1.187. tween 0.04 and 0.05 can actually imply evidence in favor of the null when analyzed in a Bayesian framework. This Bayes factor can be interpeted as follows: after seeing the data, our belief in the null hypothesis should increase only by a factor of 1.19. In other words, this V. SIMULATIONS: BIC APPROXIMATION data is not very informative toward our belief in the null, VERSUS DEFAULT BAYESIAN ANOVA which implies that the claim of a null effect in Sevos et al. [12] may be a bit optimistic. According the classification At this stage, it is clear that Equation 6 provides a scheme of Raftery [9], this result provides weak evidence straightforward method for computing an approximate for the null. Bayes factor, especially in cases when one is given only minimal output from a reported ANOVA or t test. How- ever, it is not yet clear to what extent this BIC approx- B. Example 2 imation would result in the same decision if a Bayesian [e.g., 10] were performed on the raw Borota et al. [1] observed that with a sample of n = 73 data. To answer this question, I performed a series of participants, those who received 200 mg of caffeine per- simulations. formed significantly better on a test of object memory Each simulation consisted of 1000 randomly generated compared to a control group of participants who received data sets under a 2 × 3 factorial design. The choice of 5 this design is to replicate experimental conditions that clearly, as the kernel density plots for the two different are common across many applications in the biological types of Bayes factors exhibit a large amount of over- and behavioral sciences. Further, I simulated varying lap for the g = 0.05 and g = 0.2 conditions. It is notable common levels of statistical power in these experiments that the BIC approximation tended to underestimate the by testing 3 different cell-size conditions: n = 20, 50, or BayesFactor output in the g = 0 case. However, as can be 80. Specifically each data set consisted of a vector y seen in the “Consistency” column of Table I, regardless generated as of effect size conditions, the two different types of Bayes factors resulted in the same decision in a large proportion of simulations (at least 98.4% of simulation trials). yijk = αi + τj + γij + εijk A similar picture emerges for the main effect τ. As can be seen in Table II and Figure 2, the BIC approxima- where i = 1, 2, j = 1, 2, 3, and k = 1, . . . , n. The “effects” tion and the BayesFactor outputs are largely consistent α, τ, and γ were generated from multivariate normal dis- and result in mostly the same model choice decisions. As tributions with 0 and variance g, yielding three with the results for main effect α, there is some slight dif- different effect sizes obtained by setting g = 0, 0.05, and ference in the kernel density plots when simulating null 0.2 [as in 14]. In all, there were 9 different simulations, effects (i.e., the condition g = 0). However, both meth- generated by crossing the 3 cell sizes (n = 20, 50, 80) with ods chose the same model on at least 92.7% of simulated the 3 effect sizes (g = 0, 0.05, 0.2). trials, showing a good amount of consistency. For each data set, I computed (1) a Bayesian ANOVA Finally, we can see in Table III and Figure 3 that using the BayesFactor package in R [8] and (2) the BIC the BIC approximation closely mirrors the output of the approximation using Equation 6 from the traditional BayesFactor package for the interaction effect γ. Indeed, ANOVA. Bayes factors were computed as BF10 to as- the kernel density plots in Figure 3 show considerable sess evidence in favor of the alternative hypothesis over overlap between the distributions of BIC values and the the null hypothesis. Similar to Wang [14], I set the distributions of BayesFactor outputs, and this picture is decision criterion to select the alternative hypothesis if consistent across all three effect sizes (g = 0, 0.05, 0.2). log(BF ) > 0, and the null hypothesis otherwise. Be- As expected from this picture, both methods arrive at cause the different cell sizes resulted in similar outcomes, very similar model choices, picking the same model on at for brevity I only report the n = 50 cell size condition least 97% of trials. in the summaries below. Also note that all BayesFac- tor models were fit with a “wide” prior, which is roughly equivalent to the unit-information prior used by Raftery [9] for the BIC approximation. VI. CONCLUSION First, I will report the results of computing Bayes fac- tors for the main effect α in each of the effect size con- The BIC approximation given in Equation 6 pro- ditions g = 0, g = 0.05, and g = 0.2. Five-number vides an easy-to-use estimate of Bayes factors for simple summaries for log(BF ) are reported for the n = 50 sim- between-subject ANOVA and t test designs. It requires ulation in Table I, as well as the proportion of simulated only minimal information, which makes it well-suited for data sets for which the Bayesian ANOVA and the BIC using in a meta-analytic context. In simulations, the es- approximation from Equation 6 selected the same model. timates derived from Equation 6 compare favorably to As shown in Table I, the BIC approximation from Bayes factors computed using existing software solutions Equation 6 provides a similar distribution of Bayes fac- with raw data. Thus, the researcher can confidently add tors compared to those computed from the BayesFactor this BIC approximation to the ever-growing collection of package in R. Figure 1 shows this pattern of results quite Bayesian tools for scientific measurement.

[1] Borota, D., Murray, E., Keceli, G., Chang, A., Watabe, [5] Lindley, D. V. (1957). A statistical paradox. Biometrika, J. M., Ly, M., Toscano, J. P., and Yassa, M. A. 44(1-2):187–192. (2014). Post-study caffeine administration enhances [6] Masson, M. E. J. (2011). A tutorial on a practical memory consolidation in humans. Nature Neuroscience, Bayesian alternative to null-hypothesis significance test- 17(2):201–203. ing. Behavior Research Methods, 43(3):679–690. [2] Gigerenzer, G. (2004). Mindless statistics. The Journal [7] Morey, R. D. and Rouder, J. N. (2011). Bayes factor of Socio-Economics, 33(5):587–606. approaches for testing interval null hypotheses. Psycho- [3] Hoekstra, R., Morey, R. D., Rouder, J. N., and Wa- logical Methods, 16(4):406–419. genmakers, E.-J. (2014). Robust misinterpretation of [8] Morey, R. D. and Rouder, J. N. (2015). BayesFactor: confidence intervals. Psychonomic Bulletin & Review, Computation of Bayes Factors for Common Designs.R 21(5):1157–1164. package version 0.9.12-2. [4] Jeffreys, H. (1961). The Theory of Probability (3rd ed.). [9] Raftery, A. E. (1995). Bayesian in social Oxford University Press, Oxford, UK. research. Sociological Methodology, 25:111–163. 6

TABLE I. Summary of values for log BF for main effect α with cell size n = 50

g BF type Min Q1 Q3 Max Consistency 0 BayesFactor -2.40 -2.35 -2.14 -1.74 3.38 BIC -2.85 -2.80 -2.59 -2.17 3.17 0.985

0.05 BayesFactor -2.40 -1.83 -0.20 4.24 47.49 BIC -2.85 -2.23 -0.41 4.44 53.29 0.984

0.2 BayesFactor -2.40 -0.61 4.88 17.26 119.69 BIC -2.85 -0.63 6.76 21.71 146.02 0.987

FIG. 1. Kernel density plots of distributions of log BF for main effect α, presented as a function of Bayes factor type (BIC versus BayesFactor) and effect size (g = 0, 0.05, 0.2).

TABLE II. Summary of values for log BF for main effect τ with cell size n = 50

g BF type Min Q1 Median Q3 Max Consistency 0 BayesFactor -3.97 -3.68 -3.31 -2.70 2.25 BIC -3.40 -3.10 -2.70 -2.05 3.19 0.980

0.05 BayesFactor -3.97 -1.89 1.15 6.28 43.55 BIC -3.40 -1.05 2.28 8.07 48.17 0.927

0.2 BayesFactor -3.92 3.19 12.00 28.51 134.04 BIC -3.34 6.03 17.15 36.58 149.39 0.952

TABLE III. Summary of values for log BF for interaction effect γ with cell size n = 50

g BF type Min Q1 Median Q3 Max Consistency 0 BayesFactor -3.97 -3.08 -2.72 -2.05 3.47 BIC -3.40 -3.12 -2.74 -1.98 4.03 0.990

0.05 BayesFactor -3.57 -2.30 -0.96 1.24 20.64 BIC -3.40 -2.27 -0.78 1.64 22.40 0.970

0.2 BayesFactor -3.37 -0.40 3.26 9.57 54.17 BIC -3.39 -0.20 3.84 10.72 57.61 0.975 7

FIG. 2. Kernel density plots of distributions of log BF for main effect τ, presented as a function of Bayes factor type (BIC versus BayesFactor) and effect size (g = 0, 0.05, 0.2).

FIG. 3. Kernel density plots of distributions of log BF for the interaction effect γ, presented as a function of Bayes factor type (BIC versus BayesFactor) and effect size (g = 0, 0.05, 0.2).

[10] Rouder, J. N., Morey, R. D., Speckman, P. L., and in context on object-affordance effects in schizophrenia? Province, J. M. (2012). Default Bayes factors for ANOVA Perception of property and goals of action. Frontiers in designs. Journal of Mathematical Psychology, 56(5):356– Psychology, 7:1551. 374. [13] Wagenmakers, E.-J. (2007). A practical solution to the [11] Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., pervasive problems of p values. Psychonomic Bulletin & and Iverson, G. (2009). Bayesian t tests for accepting Review, 14(5):779–804. and rejecting the null hypothesis. Psychonomic Bulletin [14] Wang, M. (2017). Mixtures of g-priors for analysis of & Review, 16(2):225–237. variance models with a diverging number of parameters. [12] Sevos, J., Grosselin, A., Brouillet, D., Pellet, J., and Mas- Bayesian Analysis, 12(2):511–532. soubre, C. (2016). Is there any influence of variations