Bayesian alternatives for common null-hypothesis significance tests in psychiatry: A non-technical guide using JASP

Daniel S Quintana* and Jon Alm Eriksen
*Correspondence: [email protected]
NORMENT, KG Jebsen Centre for Psychosis Research, Division of Mental Health and Addiction, University of Oslo and Oslo University Hospital, Oslo, Norway. Full list of author information is available at the end of the article.

Abstract
Despite its popularity, null hypothesis significance testing (NHST) has several limitations that restrict statistical inference. Bayesian analysis can address many of these limitations; however, this approach has been underutilized, largely due to a dearth of accessible software options. JASP is a new open-source statistical package that facilitates both Bayesian and NHST analysis using a graphical interface. The aim of this article is to provide an applied introduction to Bayesian inference using JASP. First, we provide a brief background outlining the benefits of Bayesian inference. We then use JASP to demonstrate Bayesian alternatives for several common null hypothesis significance tests. While Bayesian analysis is by no means a new method, this type of statistical inference has been largely inaccessible to most psychiatry researchers. JASP provides a straightforward means of performing Bayesian analysis using a graphical “point and click” environment that will be familiar to researchers conversant with other graphical statistical packages, such as SPSS.

Keywords: Bayesian statistics; null hypothesis significance tests; p-values; statistics; software

Consider the development of a new drug to treat social dysfunction. Before the drug reaches the public, scientists need to demonstrate its efficacy by performing a randomized controlled trial (RCT) whereby participants are randomized to receive the drug or placebo. It would be expected that participants randomized to the drug group would show greater symptom improvement than those in the placebo group. Hence, this RCT evaluates two competing hypotheses: the null hypothesis (H0), that the drug performs equivalently to placebo, and the alternative hypothesis (H1), that the drug performs better than placebo. The almost universal approach to drawing these inferences from such data is null hypothesis significance testing (NHST). NHST generates a p-value, which is the probability of observing the study outcome, or a more extreme result, assuming that H0 is true. The p-value is then used as a criterion to either reject or fail to reject the null hypothesis, with the threshold for rejecting the null hypothesis typically set at 0.05.

Despite its enduring popularity, the NHST p-value has been the subject of a growing chorus of criticism. Excellent treatments of p-value limitations are already available [1, 2, 3], so we will only briefly cover issues particularly relevant for research in psychiatry. First, by performing the hypothetical RCT described above, researchers hope to discover the probability of the theory that the drug performs better than placebo (H1), given the data. However, the p-value is only concerned with disproving the null hypothesis (H0), as p-values only consider the extremeness of the data under the null hypothesis. There is no means to assess how much the data favor H1 over H0, or vice versa, with p-values.
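To make the mechanics of NHST concrete, the following minimal sketch simulates the hypothetical RCT and computes a one-sided p-value. It is written in Python rather than JASP (which is a graphical package), and every value (sample sizes, group means, the random seed) is an arbitrary assumption made for illustration.

    import numpy as np
    from scipy import stats

    # Simulated symptom-improvement scores for the hypothetical RCT;
    # the group means, spread, and sample sizes are invented for the example.
    rng = np.random.default_rng(seed=1)
    drug = rng.normal(loc=0.5, scale=1.0, size=50)     # drug group
    placebo = rng.normal(loc=0.0, scale=1.0, size=50)  # placebo group

    # One-sided Welch's t-test of H1: the drug group improves more than placebo.
    t, p = stats.ttest_ind(drug, placebo, equal_var=False, alternative="greater")

    # p is the probability of data at least this extreme, assuming H0 is true;
    # it says nothing about how probable H0 or H1 themselves are.
    print(f"t = {t:.2f}, p = {p:.4f}")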
Second, the NHST p-value assesses the probability of the observed outcome given the null hypothesis (H0), not the arguably more interesting quantity: the probability of the hypotheses (both H0 and H1) given the observation. Even a “large” non-significant p-value does not provide evidence for the null hypothesis. Perhaps the researchers in this scenario also want to demonstrate that both groups report equivalent frequencies of side effects? However, a test of group equivalency is beyond the reach of conventional p-value approaches (but see the “two one-sided tests” procedure for an approach that uses the same framework underlying p-values [4, 5]). Third, unless an a priori power analysis is performed, there is no clear indication of whether the data were sensitive enough to detect a true effect when using p-values. In other words, did a study have a large enough sample size to have sufficient statistical power to detect a given effect? A common approach to this problem is to perform a post hoc power analysis. However, this is an exercise in tail-chasing, as the observed p-value of a test directly determines the observed power [6]. In other words, post hoc power analysis only yields another way of presenting the p-value without gaining any new information. Fourth, a p-value generated from NHST evaluates the existence, but not the magnitude, of an effect, which is typically the more pertinent question for researchers. Fifth, p-values are inordinately dependent on sample size. If there is an underlying effect, a larger sample size will tend to yield a smaller p-value. This lack of equivalency complicates the comparison of p-values between studies with different sample sizes.

As the p-value is an inadequate metric for effect magnitude and between-study comparisons, researchers have been encouraged to present corresponding standardized measures of effect size, such as Cohen’s d, alongside p-values [7]. As well as providing a measure of effect magnitude, standardized effect sizes are comparable between studies, regardless of sample size differences. However, effect size interpretation comes with its own limitations. Researchers can use conventional cutoffs to quantify small (d = 0.2), medium (d = 0.5), or large (d = 0.8) effects, but like p-value cutoffs of 0.05 or 0.01, these thresholds are largely arbitrary. Although Cohen proposed these cutoffs as a last resort for when the average size of effects in a field is unknown [8], they have morphed into de facto guidelines over time. Given the breadth of research areas, recent analyses have unsurprisingly shown that effect size distributions do not necessarily follow Cohen’s guidelines [9, 10]. Moreover, effect sizes still cannot solve the inherent p-value problem of providing no information about the null hypothesis, no matter how small the effect size. Altogether, an NHST approach offers limited options for the statistical inference of biobehavioral data.
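The sample-size dependence of the p-value, and the relative stability of a standardized effect size, can be seen in a small simulation. The sketch below is illustrative only: the true effect (d = 0.5) and the sample sizes are arbitrary assumptions, and the cohens_d helper is our own, not a function from the article or from JASP.

    import numpy as np
    from scipy import stats

    def cohens_d(x, y):
        # Cohen's d: mean difference divided by the pooled standard deviation.
        nx, ny = len(x), len(y)
        pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
        return (x.mean() - y.mean()) / np.sqrt(pooled_var)

    rng = np.random.default_rng(seed=1)
    for n in (20, 80, 320):
        drug = rng.normal(0.5, 1.0, n)      # same underlying effect every time
        placebo = rng.normal(0.0, 1.0, n)
        _, p = stats.ttest_ind(drug, placebo)
        print(f"n = {n:3d} per group: p = {p:.4f}, d = {cohens_d(drug, placebo):.2f}")

As the per-group n grows, p tends toward ever smaller values while d fluctuates around the true 0.5, which is why effect sizes, unlike p-values, are comparable across studies of different sizes.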
Bayesian inference
The use of p-values to determine a true relationship falls under the traditional school of frequentist statistics, but there is an alternative approach offered within the framework of Bayesian statistics. There have been several publications on the philosophical and practical differences between the two viewpoints [11, 12, 13], but it suffices for our purposes to note that only the Bayesian view allows us to assign probabilities to the hypotheses themselves. In contrast to NHST, the Bayesian framework allows us to assign probabilities to both H0 and H1 given the study outcome, and provides a means of comparing these probabilities.

Bayes’ factor is the ratio between the marginal likelihoods of the null model (corresponding to the null hypothesis) and the alternative model (corresponding to the alternative hypothesis). It is denoted by BF01. The order of the subscripts indicates that the marginal likelihood of the null model is in the numerator and that the marginal likelihood of the alternative model is in the denominator; BF10 is the reciprocal value. Due to Bayes’ theorem, this factor has a very appealing interpretation. If we assign an equal prior probability to the alternative and the null hypothesis being true (that is, if we are willing to accept that the alternative model is just as plausible as the null model before we consider the observations), then Bayes’ factor is simply the posterior odds of the hypotheses given the observed data. In Bayesian terminology, posterior simply means “conditioned on the observation” (i.e., BF01 equals the ratio of the posterior probability of the null hypothesis being true to the posterior probability of the alternative hypothesis being true). While a detailed treatment of Bayesian statistics is outside the scope of this article (the interested reader can consult standard textbooks such as [14]), it is important to emphasize that the definition of Bayes’ factor is not an ad hoc construction; it follows naturally from using Bayes’ theorem to assign probabilities to the hypotheses given experimental observations.

Prior distributions of the parameters in a statistical model are central to Bayesian inference. A prior distribution quantifies our beliefs and prior knowledge about the parameters in question, and these beliefs are encoded as a probability distribution. Bayes’ factor depends on the prior distributions of the model parameters for both the null and the alternative model, as the marginal likelihood is computed by weighting the likelihood function by the prior distribution and then integrating over the entire parameter space. To calculate Bayes’ factor, we therefore first need to quantify our prior information about the parameters of the model (e.g., the effect) before the data are considered; we will return to this in the examples below. There are, however, no general rules, agreed upon by statisticians, for choosing a completely uninformed prior distribution. If the parameter is the effect of a drug, and if we have no a priori reason to believe that the drug has a positive or a negative effect, we may argue that the prior distribution should be symmetric and centered around zero effect, but it is hard to determine the complete functional form from such symmetry arguments alone. If the parameter is for some reason constrained to an interval, one often chooses the prior that is uniform on that interval.
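As a concrete illustration of how a Bayes factor arises from marginal likelihoods, consider a simple binomial example of our own devising (it is not from the article, and JASP computes such quantities internally): H0 fixes a success probability at 0.5, while H1 assigns it a Beta(a, b) prior, with the uniform Beta(1, 1) as the default. Both marginal likelihoods then have closed forms, so BF01 is a simple ratio.

    import numpy as np
    from scipy.special import betaln, gammaln

    def log_binom_coef(n, k):
        # log of the binomial coefficient C(n, k)
        return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

    def bf01_binomial(k, n, a=1.0, b=1.0):
        # BF01 for H0: theta = 0.5 versus H1: theta ~ Beta(a, b).
        # Marginal likelihood under H0: the binomial likelihood at theta = 0.5.
        log_m0 = log_binom_coef(n, k) + n * np.log(0.5)
        # Marginal likelihood under H1: the likelihood weighted by the prior and
        # integrated over theta, which reduces to a ratio of Beta functions.
        log_m1 = log_binom_coef(n, k) + betaln(k + a, n - k + b) - betaln(a, b)
        return np.exp(log_m0 - log_m1)

    # 62 successes in 100 trials: BF01 is roughly 0.45, i.e. the data are about
    # 2.2 times more likely under H1 than under H0 -- only modest evidence,
    # even though the corresponding two-sided p-value falls below 0.05.
    print(bf01_binomial(62, 100))

With equal prior odds for H0 and H1, this BF01 is also the posterior odds of H0 to H1, which is exactly the interpretation described above.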