
Running Head: NON-SIGNIFICANT RESULTS IN GERONTOLOGICAL PSYCHOLOGY

This manuscript has been accepted for publication in Journals of Gerontology Series B: Psychological Sciences, and can be found at https://doi.org/10.1093/geronb/gbz033. Please use this link to find the most up-to-date information for citing this paper.

A Bayesian Analysis of Evidence in Support of the Null Hypothesis in Gerontological Psychology (or Lack Thereof)

Christopher R. Brydges, PhD & Allison A. M. Bielak, PhD FGSA

Department of Human Development and Family Studies, Colorado State University

Author Contact and Address: Christopher R. Brydges, PhD Department of Human Development and Family Studies Colorado State University 1570 Campus Delivery Fort Collins, CO 80523 USA Tel: +1 970 825-3165 Fax: +1 970 491-7975 Email: [email protected]

Word count: 5,038 (main text); 206 (abstract).


Abstract

Objective: Non-significant p values derived from null hypothesis significance testing do not distinguish between true null effects and cases where the data are insensitive in distinguishing the hypotheses. This study aimed to investigate the prevalence in gerontological psychology of Bayesian analyses, a statistical technique that can distinguish between conclusive and inconclusive non-significant results, and to use Bayes factors (BFs) to reanalyze non-significant results from published gerontological research.

Method: Non-significant results mentioned in abstracts of articles published in 2017 volumes of ten top gerontological psychology journals were extracted (N = 409) and categorized based on whether Bayesian analyses were conducted. BFs were calculated from non-significant t-tests within this sample to determine how frequently the null hypothesis was strongly supported.

Results: Non-significant results were directly tested with Bayes factors in 1.22% of studies. Bayesian reanalyses of 195 non-significant t-tests found that only 7.69% of the findings provided strong evidence in support of the null hypothesis.

Conclusions: Bayesian analyses are rarely used in gerontological research, and a large proportion of null findings were deemed inconclusive when reanalyzed with BFs. Researchers are encouraged to use BFs to test the validity of non-significant results, and ensure that sufficient sample sizes are used so that the meaningfulness of null findings can be evaluated.

Keywords: Bayes factor; Meta-research; Null hypothesis significance testing; Statistical power.


A Bayesian Analysis of Evidence in Support of the Null Hypothesis in Gerontological Psychology (or Lack Thereof)

Null hypothesis significance testing (NHST; Fisher, 1925; Neyman & Pearson, 1933) is arguably the most commonly used method of data analysis in psychology (Nickerson, 2000). The basic logic of NHST is that the null hypothesis, that there is no difference/association between groups/variables and any observed difference is due to sampling error (Carver, 1978; Nickerson, 2000), may be rejected if the observed p value obtained from a statistical test (such as a t-test or correlation) is less than a predefined α threshold (typically .05 in psychology). Conversely, if the p value is greater than the α threshold, the researcher fails to reject the null hypothesis. It is important to note that a non-significant NHST result does not in fact support the null hypothesis per se; rather, the terminology is that we failed to reject the null hypothesis. Therefore, judgement regarding whether the null hypothesis is correct or not should be withheld (Cohen, 1994), as a p value > .05 does not indicate whether the non-significant result is due to a true null effect or the data being insensitive in distinguishing the hypotheses. This distinction is important because a non-significant result can mask a true effect in a study with low statistical power. Whilst this is not necessarily a limitation of NHST, researchers often make no attempt to distinguish between a true null effect and an underpowered result, which can lead to potentially incorrect conclusions being drawn from the data (Aczel, Palfi et al., 2018). Only relatively recently, however, have alternative methods to overcome this issue begun to gain increased attention (e.g., Wagenmakers et al., 2018). To date, no previous research has examined the strength of null findings (i.e., the proportion of results that are true null results in comparison to those that are underpowered) in gerontological psychology.

As noted above, the challenge of only being able to fail to reject the null hypothesis is that we never know whether a null result is meaningful or not. Isaacowitz (2018) directly addressed this issue in his editorial in Journals of Gerontology: Series B, Psychological Sciences, and encouraged researchers to apply equivalence testing and Bayes factors (Lakens, McLatchie, Isager, Scheel, & Dienes, 2018) to null results to determine their meaningfulness. Specifically, researchers in gerontological psychology were encouraged to directly test any non-significant results in their analyses to determine whether there genuinely was no effect, or if a non-significant result was observed simply because the data were not sensitive enough to observe the effect. Identifying meaningful null results may increase the likelihood of their publication (Isaacowitz, 2018), thereby reducing the pressure to only publish significant results (i.e., publication bias) and incentives to engage in questionable research practices. Testing null results is of particular relevance to gerontological research because it is very likely that there are processes that are relatively stable across age that are of interest to psychologists (Isaacowitz, 2018), such as memory formation processes (Spaniol, Schain, & Bowen, 2014), emotion regulation (Martins, Sheppes, Gross, & Mather, 2016), and subjective experience of pain (Gibson & Lussier, 2012). In these cases, it would therefore be common to not find significant effects across age groups in adulthood. Consequently, it is critical for our field to be able to accurately distinguish true null age group differences from results that did not find age group differences due to low power or data that are insensitive to distinguishing differences between groups/variables. Therefore, having the knowledge to conduct statistical analyses that directly address the meaningfulness of non-significant results is of the utmost importance. This paper aimed to investigate the prevalence in gerontological psychology of Bayesian analyses, a statistical technique that can distinguish between conclusive and inconclusive non-significant results (i.e., true null effects and results from insensitive/underpowered studies), and to assess the strength of evidence in favor of null findings by using Bayes factors (BFs) to reanalyze non-significant results from published gerontological research.

Why use Bayes Factors?

To address the lack of discrimination between meaningful and ambiguous non-significant results in NHST, BFs have become an increasingly popular statistical method to complement NHST (van de Schoot, Winter, Ryan, Zondervan-Zwijnenburg, & Depaoli, 2017). BFs are based on formal definitions of the probability distributions of the null and alternate hypotheses, and the observed data (Morey & Rouder, 2011). The null hypothesis is often modelled as no difference, and the probability distribution of the alternate hypothesis is dependent on the theory being tested. There are typically several issues to consider when deciding upon this distribution, such as the expected size and potential variability of the effect being investigated, and/or the confidence a researcher has in observing a specific effect size (Dienes & McLatchie, 2018; Etz, Gronau, Dablander, Edelsbrunner, & Baribault, 2018), and the choice is a source of contention for experts in Bayesian analyses (Aczel, Hoekstra et al., 2018).

From here, BFs are values that indicate the strength of evidence in the data that favors one hypothesis over another, by comparing the likelihood (i.e., probability) of one hypothesis being true over the likelihood of the other being true, given the observed data (see Rouder, 2016, for a walkthrough from choosing the prior distributions to obtaining a BF). This is not to say that one hypothesis is true and the other is not; rather, Bayesian analyses test whether one hypothesis is more or less likely than the other, and estimate how much more or less likely it is given the observed data. Therefore, a major advantage of Bayesian analyses over NHST is that the strength of evidence can be tested in terms of both the null and alternate hypotheses (in this study, BF01 refers to the likelihood of the null hypothesis (0) being true over the alternate (1), and BF10 refers to the likelihood of the alternate hypothesis being true over the null). Bayesian analyses make it possible to say ‘there is very strong evidence in favor of the null hypothesis over the alternate, given the data’, which cannot be done with NHST.

A second advantage of BFs over NHST is that the values are easily interpretable in terms of the strength of a finding. In NHST, a p value of .001 does not imply that a result is ‘more significant’ or has a stronger effect than a p value of .04, and a p value of .80 does not suggest stronger evidence in favor of the null hypothesis than a p value of .08 (Dienes, 2014). Conversely, BFs provide an estimate of the likelihood of one hypothesis over another given the data, and this value is commonly categorized as anecdotal, moderate, strong, or extreme evidence for/against one hypothesis over another (Jeffreys, 1961; see Table 1). In cases of an anecdotal BF, the sample size is often not sufficient for the data to provide conclusive evidence for either hypothesis (i.e., the study is underpowered). It should be noted that these values are distinct from effect sizes (e.g., Cohen’s d), which measure the magnitude of a difference in the data, rather than the likelihood of one hypothesis occurring over another (Wetzels et al., 2011).

Close approximations of BFs can easily be calculated on test statistics and do not necessarily require raw data for calculation (Ly et al., 2018). Rouder, Speckman, Sun, Morey, & Iverson (2009) provide the following formula for calculating a BF01 using only N and t values, based on the Bayesian Information Criterion (Wagenmakers, 2007):

$$BF_{01} = \sqrt{N}\left(1 + \frac{t^{2}}{N - 1}\right)^{-N/2}$$
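As a minimal sketch, the BIC-based approximation above can be evaluated directly from a reported t statistic and sample size. The Python below is illustrative only: the function name `bf01_bic` is hypothetical, N is assumed to be the number of observations in a one-sample or paired design, and the reanalyses reported in this paper used the BayesFactor package in R rather than this shortcut.

```python
import math


def bf01_bic(t: float, n: int) -> float:
    """BIC-based approximation of BF01 from a t statistic and sample size N.

    Implements BF01 = sqrt(N) * (1 + t^2 / (N - 1)) ** (-N / 2).
    """
    if n < 2:
        raise ValueError("n must be at least 2 so that N - 1 is positive")
    return math.sqrt(n) * (1.0 + t ** 2 / (n - 1)) ** (-n / 2.0)


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    for t, n in [(0.0, 100), (1.0, 100), (2.5, 100)]:
        bf01 = bf01_bic(t, n)
        print(f"t = {t:.1f}, N = {n}: BF01 = {bf01:.3f}, BF10 = {1.0 / bf01:.3f}")
```

Values above 1 favor the null hypothesis and values below 1 favor the alternate; when t = 0 the expression reduces to the square root of N.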


Table 1

Bayes Factor Values and Their Interpretation Based on Jeffreys’ (1961) Recommendations

Interpretation | BF10 | BF01

Evidence in favor of H1 in comparison to H0
Extreme | > 100 | < 1/100
Strong | 10-100 | 1/100-1/10
Moderate | 3-10 | 1/10-1/3
Anecdotal | 1-3 | 1/3-1

Evidence in favor of H0 in comparison to H1
Anecdotal | 1/3-1 | 1-3
Moderate | 1/10-1/3 | 3-10
Strong | 1/100-1/10 | 10-100
Extreme | < 1/100 | > 100

Note. H1 = alternate hypothesis; H0 = null hypothesis; BF10 = Bayes factor in favor of the alternate hypothesis; BF01 = Bayes factor in favor of the null hypothesis.
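The thresholds in Table 1 can be expressed as a small lookup. The sketch below is not from the original analysis code; the function name `classify_bf01` is hypothetical, and the handling of values that fall exactly on a boundary (e.g., BF01 = 3) is an assumption, since the table does not say which category such values belong to.

```python
def classify_bf01(bf01: float) -> str:
    """Label a BF01 value using the Jeffreys (1961) categories in Table 1."""
    if bf01 <= 0:
        raise ValueError("a Bayes factor must be positive")

    # BF01 >= 1 favors H0; otherwise take the reciprocal (BF10) and favor H1.
    if bf01 >= 1:
        favored, strength = "H0", bf01
    else:
        favored, strength = "H1", 1.0 / bf01

    # Assumption: values exactly on a threshold go to the weaker category.
    if strength > 100:
        label = "extreme"
    elif strength > 10:
        label = "strong"
    elif strength > 3:
        label = "moderate"
    else:
        label = "anecdotal"
    return f"{label} evidence for {favored}"


if __name__ == "__main__":
    for value in (0.606, 2.0, 5.0, 12.0, 150.0):
        print(value, "->", classify_bf01(value))
```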

Bayes Factors in Psychological Research

Aczel, Palfi et al. (2018) examined 137 non-significant findings mentioned in the abstracts of three prominent psychology journals: Psychonomic Bulletin & Review, Journal of Experimental Psychology: General, and Psychological Science, and conducted a Bayesian reanalysis of the non-significant t-tests extracted from these studies. They found that fewer than 5% of the findings provided strong evidence in favor of the null hypothesis (BF01 > 10) over the alternate hypothesis. In fact, 25% did not even reach the threshold considered as moderate evidence in favor of the null (BF01 > 3), implying that results being presented as having no differences were in fact inconclusive. The authors concluded that most non-significant results in general psychology do not provide strong evidence in support of the null hypothesis, defined as BF01 > 10, potentially due to psychological studies often being statistically underpowered and/or insensitive in distinguishing the hypotheses (Bakker, van Dijk, & Wicherts, 2012; Cohen, 1990). In fact, in the same way that researchers should test a large sample of participants in order to increase the probability of finding a significant result, another reason to collect sufficiently large sample sizes is to be able to conclusively determine whether there is a true null effect in a study (Lakens et al., 2018). Additionally, Aczel, Palfi et al. (2018) found that Bayesian analyses were underutilized within the context of evaluating non-significant results: only 10% of the studies (14/137) used Bayesian analyses to analyze a non-significant result.

The Current Study

It is unfortunately often the case that no attempt is made to distinguish between a true null effect and an underpowered analysis, resulting in potentially incorrect conclusions about null findings. Bayesian analyses are advantageous over NHST because they allow for the direct testing of the likelihood of one hypothesis in comparison to another, and because the reported BF is directly and easily interpretable (Dienes, 2014). However, these analyses are not commonly used by researchers (Aczel, Palfi et al., 2018), possibly due to a lack of understanding on the researchers’ part (Etz et al., 2018). Importantly, reanalysis of null NHST results showed that only a small minority of non-significant test results provided strong evidence in favor of the null hypothesis. From this, the current study had two aims: first, to determine how often Bayesian analyses are conducted when null results are reported in gerontological psychology; and second, to investigate the proportion of non-significant results in gerontological psychology that provide strong evidence in favor of the null, and the proportion that do not provide conclusive evidence in favor of either the null or the alternate hypothesis. Given that these analytical techniques have only recently begun to enter the awareness of gerontological psychologists (Isaacowitz, 2018), and that there is evidence of low statistical power in the field (Brydges & Bielak, 2018), it was hypothesized that only a very small proportion of non-significant NHST results would be accompanied by Bayesian analyses. Additionally, following the findings of Aczel, Palfi et al. (2018), it was hypothesized that a sizeable proportion of NHST results would be considered as anecdotal when BF01 values were calculated. Conversely, only a small proportion of NHST results were hypothesized to be considered as strong evidence in favor of the null hypothesis when BF01 values were calculated. Lastly, it was hypothesized that sample sizes and p values would be associated with the BF01 values, as previous research has found that sample sizes and p values correlate with BFs (Aczel, Palfi et al., 2018; Wetzels et al., 2011). Significant correlations would suggest that there is a sizeable proportion of inconclusive (i.e., underpowered) non-significant findings in the field, potentially resulting in the misinterpretation of true but underpowered effects.

Method

The study design is based on that of Aczel, Palfi et al. (2018). Raw data, an Rmarkdown code file, and figures are publicly accessible on the Open Science Framework (https://osf.io/y8e4b/).

Sample

The abstracts of every empirical research article with human participants published in 2017 in the journals Journal of the American Geriatrics Society, The Gerontologist, American Journal of Geriatric Psychiatry, Journals of Gerontology: Series B, Psychological Sciences and Social Sciences, International Journal of Geriatric Psychiatry, BMC Geriatrics, Aging & Mental Health, Geriatrics & Gerontology International, Psychology and Aging, and International Psychogeriatrics were selected (overall N = 1,819; see Table 2 for a journal-by-journal breakdown). These ten journals were chosen as they are the ten highest-ranked Gerontology journals on Clarivate Analytics’ journal citation ranking for 2017 that regularly publish psychological research (see Footnote 1). From this collection, articles that contained at least one negative empirical statement in their abstracts were selected (n = 409). Negative statements included explicit statements about the absence of an effect (e.g., “had no effect,” “did not differ”) or referring to a non-significant finding (e.g., “was not significant”). For each negative statement, the main text and supplement of the article were screened by the first author in order to record the associated p value(s), the type of statistical analysis, and the sentence describing the results of the analysis.

Categorical Analyses on Prevalence of Bayesian Analyses

To determine how often Bayesian analyses were used on non-significant results in gerontological psychology, the extracted claims from the abstracts were categorized into either Bayesian or non-Bayesian analyses. If the authors stated that BFs were used to quantify evidence in favor of the null hypothesis (e.g., “The Bayes factor favored the null hypothesis over the alternate hypothesis”), the claim was classified as Bayesian. If Bayesian analyses were not mentioned, the claim was classified as non-Bayesian.

Bayesian Analyses on Strength of Null Hypothesis

When the negative claim was based on a t-statistic (one-sample, paired-samples, or independent-samples t-test), the t-value and the number of participants in each experimental group were recorded, and BF01s were calculated to determine the strength of evidence for the null hypothesis. Only t-tests were used to remain consistent with Aczel, Palfi et al. (2018), and also because Aczel, Palfi, & Szaszi (2017) found relatively consistent results when using t, F, and r test statistics for statistically significant results. Within the context of the current study, BF01s larger than 1 indicate evidence in favor of the null hypothesis, whereas BF01s smaller than 1 indicate evidence in favor of the alternate hypothesis. Based on Jeffreys’ (1961) criteria, BF01s between 0.33 and 3 are considered anecdotal evidence, BF01s between 3 and 10 indicate moderate evidence in favor of the null hypothesis, and BF01s larger than 10 indicate strong evidence in favor of the null hypothesis (Table 1).

1 These journals were ranked 2-11, in the order listed. Journals of Gerontology: Series A, Biological Sciences and Medical Sciences was the number 1-ranked journal, and was not included in the current study as it is a medical and biological sciences journal.

The ttest.tstat function of the BayesFactor package (Morey, Rouder, & Jamil, 2015) in R version 3.4.4 (R Core Team, 2018) was used to obtain the BF01s associated with the reported t-statistics and degrees of freedom. The JZS default prior distribution (a two-tailed Cauchy distribution centered on zero with a scaling factor of 0.707; Rouder et al., 2009) was used to model the alternate hypothesis. The prior distribution describes how probable effects of various sizes are assumed to be under the alternate hypothesis. In this particular case, the prior places a 50% probability on the effect size falling within -0.707 ≤ Cohen’s d ≤ 0.707, though other distributions, such as uniform, normal, or t, can be used (with appropriate justification). The JZS default prior distribution (referred to as the ‘default prior’) is commonly used because small effects are more likely than large effects, and a scaling factor of 0.707 is considered moderate in comparison to 0.5 (too conservative) and 1.0 (too relaxed; Rouder et al., 2009).
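The 50% statement follows from the Cauchy distribution function: for a zero-centered Cauchy prior, half of the prior mass lies within one scale unit of zero. The short Python check below is a sketch for illustration, not part of the original analysis; the function name `cauchy_mass_within` is hypothetical.

```python
import math


def cauchy_mass_within(limit: float, scale: float) -> float:
    """Prior probability that -limit <= d <= limit under a Cauchy(0, scale) prior."""
    if limit < 0 or scale <= 0:
        raise ValueError("limit must be non-negative and scale must be positive")
    # CDF of Cauchy(0, scale) is 0.5 + atan(x / scale) / pi, so the symmetric
    # interval [-limit, limit] carries (2 / pi) * atan(limit / scale).
    return (2.0 / math.pi) * math.atan(limit / scale)


if __name__ == "__main__":
    # Mass within |d| <= 0.707 for the three scaling factors mentioned in the text.
    for scale in (0.5, 0.707, 1.0):
        mass = cauchy_mass_within(0.707, scale)
        print(f"scale = {scale}: P(|d| <= 0.707) = {mass:.3f}")
```

A smaller scaling factor concentrates more prior mass near zero, while a larger one spreads it toward large effects, which is why the choice of scale changes the resulting BF01.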

Additionally, following the analyses conducted by Aczel, Palfi et al. (2018), the robustness of the results was tested by recalculating the BF01s with two alternative prior distributions. This was done because there are many ways to model the predictions of the alternate hypothesis through prior distributions (Dienes, 2011). These distributions determine the probability of the alternate hypothesis given the data. As such, different distributions could result in different support towards the alternate hypothesis. First, a two-tailed normal distribution centered on zero (M = 0) and with SD = 0.5 was used. This distribution predicts that the standardized effect size will likely be in the range of -1 to 1, and that smaller effects (i.e., those closer to zero) are more likely than larger effect sizes. This prior is referred to as the ‘normal prior’ henceforth. The second alternative was a mixture of two t distributions, one centered on 0.35 and the other on -0.35, with three degrees of freedom and a scaling factor of 0.102. This is to make the final distribution non-directional and symmetrical around zero (see Gronau, Ly, & Wagenmakers, 2017, and the supplementary materials of Aczel, Palfi et al., 2018, for more information). This distribution is referred to as the ‘informed prior’.
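To make the differences between the three priors concrete, the sketch below evaluates each prior density at a few standardized effect sizes. It is an illustration only and not the authors' code: all function names are hypothetical, and the informed prior is assumed to weight its two t components equally, which the text does not state explicitly.

```python
import math


def cauchy_pdf(d: float, scale: float = 0.707) -> float:
    """Density of the default (JZS) Cauchy prior centered on zero."""
    return 1.0 / (math.pi * scale * (1.0 + (d / scale) ** 2))


def normal_pdf(d: float, sd: float = 0.5) -> float:
    """Density of the zero-centered normal prior."""
    return math.exp(-0.5 * (d / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def t_pdf(d: float, loc: float, scale: float, df: float) -> float:
    """Density of a location-scale Student t distribution."""
    z = (d - loc) / scale
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1.0 + z * z / df) ** (-(df + 1) / 2) / scale


def informed_pdf(d: float, loc: float = 0.35, scale: float = 0.102, df: float = 3) -> float:
    """Symmetric mixture of two t distributions (equal weights are an assumption)."""
    return 0.5 * t_pdf(d, loc, scale, df) + 0.5 * t_pdf(d, -loc, scale, df)


def normal_mass_within(limit: float, sd: float = 0.5) -> float:
    """Prior probability that -limit <= d <= limit under the normal prior."""
    return math.erf(limit / (sd * math.sqrt(2.0)))


if __name__ == "__main__":
    print(f"Normal prior mass within |d| <= 1: {normal_mass_within(1.0):.4f}")
    for d in (0.0, 0.35, 1.0):
        print(
            f"d = {d:4.2f}: default = {cauchy_pdf(d):.3f}, "
            f"normal = {normal_pdf(d):.3f}, informed = {informed_pdf(d):.3f}"
        )
```

The informed prior puts most of its mass near effects of about ±0.35 and relatively little at zero, whereas the default and normal priors both peak at zero; this is one reason the same t statistic can fall into different evidence categories under different priors.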

Correlational Analyses Evaluating Associations between Bayes Factors, N, and p

To test associations between BF01s, sample size, and p values, Bayesian parameter estimation was conducted (see Kruschke, 2011, for an introductory explanation). As these associations were nonlinear, Kendall’s τ with its associated 95% credible interval (CI) was calculated in order to estimate the population effect sizes (Kendall & Gibbons, 1990). The KendallTauB function from the DescTools R package (Signorell, 2017) was used to calculate Kendall’s τ. The τ value and sample size were then used to compute the 95% CIs with the credibleIntervalKendallTau function (van Doorn, Ly, Marsman, & Wagenmakers, 2018) in R. The BF01s calculated from the default prior were used, and the τ value was estimated with a two-tailed default prior distribution.
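For readers unfamiliar with the statistic, the following is a minimal Python sketch of Kendall's τ-b computed from concordant and discordant pairs. It is not the DescTools implementation, the function name `kendall_tau_b` is hypothetical, and the Bayesian credible interval step is not reproduced here.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt(pairs untied on x * pairs untied on y)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: counts toward neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    untied_on_x = concordant + discordant + tied_y_only
    untied_on_y = concordant + discordant + tied_x_only
    if untied_on_x == 0 or untied_on_y == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / math.sqrt(untied_on_x * untied_on_y)


if __name__ == "__main__":
    # Hypothetical sample sizes and BF01 values, for illustration only.
    sample_sizes = [20, 35, 60, 120, 250]
    bf01_values = [1.8, 2.5, 2.1, 6.0, 9.5]
    print(f"tau-b = {kendall_tau_b(sample_sizes, bf01_values):.2f}")
```

Because τ depends only on the ordering of pairs, it is unaffected by monotonic transformations such as the log scaling used in the figures, which suits the nonlinear associations described above.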


Table 2

Frequencies of Articles, Abstracts, Negative Statements, Statistical Tests, and p values through the Screening Process

Journal | 2017 Empirical Research Articles (N) | Abstracts containing negative statements (N) | Total negative statements (N) | Extracted statistical tests (N) | Extracted p values (N)
Journal of the American Geriatrics Society | 357 | 88 | 107 | 231 | 148
The Gerontologist | 132 | 12 | 15 | 30 | 18
American Journal of Geriatric Psychiatry | 120 | 31 | 40 | 121 | 76
Journals of Gerontology: Series B, Psychological Sciences and Social Sciences | 106 | 18 | 21 | 31 | 13
International Journal of Geriatric Psychiatry | 153 | 49 | 54 | 166 | 120
BMC Geriatrics | 266 | 58 | 74 | 219 | 151
Aging & Mental Health | 144 | 19 | 23 | 94 | 52
Geriatrics & Gerontology International | 319 | 85 | 107 | 300 | 196
Psychology and Aging | 63 | 16 | 21 | 30 | 14
International Psychogeriatrics | 159 | 33 | 46 | 135 | 71
Total | 1,819 | 409 | 508 | 1,357 | 859

Results

Screening

At least one negative statement was found in 409 of the 1,819 abstracts. These 409 abstracts contained 508 negative statements, which were linked to 1,357 statistical tests from the articles. From here, 859 reported p values were collected from these tests (see Table 2 for a detailed breakdown of these numbers). The number of reported p values is smaller than the number of tests because the p value was not reported in all cases. Additionally, a small number of tests used non-frequentist statistics (i.e., BFs).

Bayesian Reanalyses of Non-significant Results

Of the 409 studies that reported a negative statement, only five (1.22%) used Bayesian analyses to determine the strength of evidence in favor of the null hypothesis. From the 859 statistical tests collected from the articles, 195 non-significant t-tests were identified. With the default prior, the 195 BF01s (i.e., evidence in favor of the null hypothesis) resulted in 15 (7.69%) strong, 108 (55.38%) moderate, and 71 (36.41%) anecdotal BF01s. No BFs provided extreme evidence in favor of the null hypothesis (BF01 > 100). The remaining one BF01 provided anecdotal evidence in favor of the alternate hypothesis (BF01 = 0.606).
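The percentages in this paragraph follow directly from the category counts. As a brief arithmetic sketch (variable names are illustrative; the counts are those reported above):

```python
# Counts of default-prior BF01s by evidence category, as reported in the text.
counts = {
    "strong evidence for H0": 15,
    "moderate evidence for H0": 108,
    "anecdotal evidence for H0": 71,
    "anecdotal evidence for H1": 1,
}

total = sum(counts.values())  # number of reanalyzed t-tests
percentages = {label: round(100.0 * n / total, 2) for label, n in counts.items()}

if __name__ == "__main__":
    print(f"Total t-tests: {total}")
    for label, pct in percentages.items():
        print(f"{label}: {counts[label]} ({pct}%)")
    # Share of studies with a negative statement that used Bayesian analyses.
    print(f"Bayesian analyses used: {round(100.0 * 5 / 409, 2)}%")
```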

The robustness analyses showed generally similar results. Figure 1 shows the BF01s ordered by size, and the percentages of the BF01s in the different evidence categories for each of the three prior distributions. With the default prior, 63.08% (n = 123) of the BF01s provided at least moderate evidence in favor of the null hypothesis (i.e., BF01 > 3). Conversely, 44.62% (n = 87) of BF01s calculated with the informed prior indicated at least moderate support of the null hypothesis, and only 32.31% (n = 63) of BF01s calculated with the normal prior exceeded a value of 3. The evidential category of the BF01s changed in 32.82% (n = 64) of cases when the informed prior was used instead of the default, and use of the normal rather than the default prior resulted in 71 (36.41%) changes in evidential category. Importantly, however, the differences between the values of the BF01s calculated with the different prior distributions were not substantial in the majority of cases. As Figure 1 shows, the large number of differences in evidence categorizations is due to the fact that a large proportion of the BF01s were scattered around the category thresholds.

Figure 1. Bayes factors in favor of the null hypothesis for the 195 reported non-significant t-tests. For each t-test, Bayes factors were calculated with default, informed, and normal prior specifications of the alternative hypothesis. Scaling of the y-axis has been log-e transformed to facilitate visualization of the relationships between the Bayes factors calculated with different prior specifications. The labels on the right-hand side of the y-axis represent Jeffreys’ (1961) scheme for classifying the strength of evidence. To the left of each label, the numbers indicate the percentage of all results falling in the indicated category when the Bayes factors were calculated using default, informed, and normal prior specifications, respectively (from top to bottom).

Associations between Bayes Factors, N, and p

To investigate if small sample size (i.e., underpowered studies) was associated with the sizeable proportion of anecdotal BF01 values, the relationship between sample size and BF01 was investigated by computing Kendall’s τ and its 95% CI. A positive correlation between sample size and BF01 was observed, τ = .48, 95% CI = .38-.57. Figure 2 shows that studies with small sample sizes (n < 35) are overrepresented in the anecdotal category: Although studies with sample sizes less than 35 only make up 22.56% of the total cases, they make up almost half of the studies providing anecdotal evidence in favor of the null hypothesis (47.89%). Conversely, 22.73% (10 cases) of the small samples produced moderate evidence in favor of the null hypothesis. Stronger evidence appeared to be associated with sample sizes ≥ 225.
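One way to see why small samples cluster in the anecdotal category is that sample size caps the evidence a study can provide for the null, even in the best case. The sketch below uses the BIC-based approximation from the introduction with t = 0 (the most null-friendly result possible) and searches for the smallest N that crosses each threshold. It is an illustration under that approximation only: the function names are hypothetical, and the reanalyses reported here used the JZS prior, so the exact sample sizes required will differ.

```python
import math


def bf01_bic(t: float, n: int) -> float:
    """BIC-based approximation: BF01 = sqrt(N) * (1 + t^2 / (N - 1)) ** (-N / 2)."""
    return math.sqrt(n) * (1.0 + t ** 2 / (n - 1)) ** (-n / 2.0)


def smallest_n_exceeding(threshold: float, t: float = 0.0, n_max: int = 100_000):
    """Smallest N (searched from 2 upward) whose approximate BF01 exceeds threshold."""
    for n in range(2, n_max + 1):
        if bf01_bic(t, n) > threshold:
            return n
    return None  # threshold not reached within the search range


if __name__ == "__main__":
    for threshold in (3, 10):
        n_needed = smallest_n_exceeding(threshold)
        print(f"BF01 > {threshold} first reached at N = {n_needed} when t = 0")
```

Under this approximation, BF01 at t = 0 equals the square root of N, so no study with N of 100 or fewer can exceed BF01 = 10 regardless of how close to zero the observed effect is.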

Figure 2. Scatterplot showing the relationship between the sample sizes of the selected studies and the corresponding BF01s. Scaling of both axes has been log-e transformed to facilitate visualization of the relationship. Plotted points above the solid black line indicate evidence in favor of the null hypothesis.

Additionally, to explore the extent to which the p values and the corresponding BF01s were associated, the reported p values were plotted against the BF01s (see Figure 3), and Bayesian parameter estimation was conducted. The BF01s and p values were moderately correlated, τ = .41, 95% CI = .31-.49, though this association appears to be mainly driven by a positive relationship between p values smaller than .3 and the BF01s, as the values of the BF01s leveled off for p values higher than .3 (Figure 3). The figure also shows that high p values do not guarantee strong evidence for the null hypothesis, though they do appear to result in a smaller likelihood of anecdotal evidence.

Figure 3. Scatterplot showing the relationship between the p values from the selected studies and the corresponding default BF01s. Scaling of the y-axis has been log-e transformed to facilitate visualization of the relationship. Plotted points above the solid black line indicate evidence in favor of the null hypothesis.


Discussion

The current study aimed to investigate the prevalence of Bayesian analyses in gerontological psychology, and to use BFs to reanalyze non-significant results from gerontological research, demonstrating how they can distinguish between conclusive and inconclusive non-significant results. It was hypothesized that BFs were rarely used to directly test non-significant results, and that a sizeable proportion of non-significant results would be considered as anecdotal when BF01 values were calculated. Conversely, only a small proportion of non-significant results were hypothesized to be considered as strong evidence in favor of the null hypothesis when BF01 values were calculated, and positive correlations were expected to be observed between BF01 values, sample size, and p values. The results supported these hypotheses.

Results of the current study found that only five of the 409 studies (1.22%) that reported a negative statement directly tested whether the data supported the null hypothesis through Bayesian analysis. This value is even lower than the 10% reported by Aczel, Palfi et al. (2018) for general psychology. Further, none of the analyzed studies used equivalence testing, a method described by Lakens et al. (2018) that can be used to statistically reject the presence of effects large enough to be considered worthwhile. Although we can only speculate regarding the reasons for the lack of Bayesian testing in this field, such infrequent use of this method is likely to be at least partially due to researchers being unfamiliar with Bayesian methods (Etz et al., 2018). However, these analytical techniques are becoming increasingly common (van de Schoot et al., 2017), and new, free statistical software such as JASP (JASP Team, 2019; Quintana & Williams, 2018) and introductory primer papers (e.g., Dienes, 2014; Etz & Vandekerckhove, 2018; Kruschke & Liddell, 2018a) are lowering the entry barriers for researchers. Additionally, JASP allows for easy and efficient calculation of BFs from NHST summary statistics (i.e., the sample size and test statistic) for researchers wanting a very light introduction to Bayesian analyses, and/or for those reading or reviewing papers where the raw data are not publicly available (Ly et al., 2018).

It was also found that only 7.69% of non-significant t-tests provided strong evidence in favor of the null hypothesis (based on Jeffreys’, 1961, criteria of BF01 > 10), while over one-third of results only provided anecdotal evidence in favor of the null hypothesis (i.e., inconclusive evidence, based on Jeffreys’, 1961, criteria of BF01 < 3). These results were similar to those reported by Aczel, Palfi et al. (2018), who found that most non-significant results in three general psychology journals either provided anecdotal or moderate evidence in favor of the null hypothesis, and rarely strong evidence. They suggested that their results may be at least partly due to small sample sizes in psychology. Hoekstra, Monden, van Ravenzwaaij, & Wagenwakers

(2018) conducted similar analyses on non-significant results in medicine and found much stronger evidence in favor of the null with far larger sample sizes than those typically found in psychology, suggesting that this may well be the case. From a broader perspective, this lack of strong evidence in favor of the null hypothesis implies that caution should be shown in trusting null findings that were published in past research, especially those with small sample sizes.

Specifically, just as replication of a significant finding increases confidence in that result, the same logic must be applied to null effects (which may be harder to practice given a bias toward publishing only results that are significant; Kühberger, Fritz, & Scherndl, 2014). Increased sample sizes, testing the strength of evidence in favor of the null hypothesis through BFs, and calculating BFs in published research using only summary statistics (Ly et al., 2018) are simple ways to audit and increase confidence in null findings.
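The evidential categories referred to above can be expressed as a simple decision rule. The Python sketch below applies the thresholds stated in the text (BF01 > 10 as strong, BF01 < 3 as anecdotal, and values in between as moderate); the function names, the label used for BF01 values at or below 1, and the example values are hypothetical and are not data from the current study.

```python
def classify_bf01(bf01):
    """Label the evidence for the null hypothesis given BF01."""
    if bf01 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf01 > 10:
        return "strong"
    if bf01 >= 3:
        return "moderate"
    if bf01 > 1:
        return "anecdotal"
    # BF01 <= 1: the data do not favor the null over the alternative.
    return "favors alternative"


def category_percentages(bf01_values):
    """Percentage of results falling into each evidential category."""
    counts = {}
    for value in bf01_values:
        label = classify_bf01(value)
        counts[label] = counts.get(label, 0) + 1
    return {label: 100 * n / len(bf01_values) for label, n in counts.items()}


if __name__ == "__main__":
    # Made-up BF01 values, for illustration only.
    print(category_percentages([12.0, 5.0, 2.0, 2.5]))
```

Fixed thresholds of this kind are a reporting convenience; as noted later in the discussion, experts disagree about whether uniform decision thresholds should be used at all.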

Additionally, the correlation between BFs and sample size demonstrated that larger sample sizes were generally associated with larger BF01s (i.e., indicative of a true null effect), whereas small sample sizes, especially those less than N = 35, were overrepresented in the anecdotal evidence category. This is not to say that researchers should consider a sample size of

35 as an acceptable minimum to confidently state there is a true null effect. Rather, while researchers are taught about having adequate statistical power in order to correctly reject the null hypothesis and find a statistically significant result for NHST (i.e., avoid a type II error; Cohen,

1992), it is also important for sample sizes to be large enough to distinguish between evidence in support of the null hypothesis and inconclusive results.2 Regardless of whether researchers continue to use NHST, adopt Bayesian analyses, or use a combination of both, published research in psychology is commonly underpowered. Whilst Cohen (1992) recommends a minimum acceptable power of .80, where the null hypothesis will be correctly rejected in 80% of studies, Szucs and Ioannidis (2017) found that median power to detect large effects in psychology was .81, but this value was much lower for medium effects (.60) and especially for small effects (.16). Increasing study sample size would only increase the reliability and replicability of the finding, be it a true difference or a true null effect (Isaacowitz, 2018). It is also interesting to note that the correlation between BF and N may be lower than expected. Given that sample size is taken into account in the calculation of the BF01s (see the formula presented in the introduction), it stands to reason that this association could be stronger, although Aczel, Palfi et al. (2018) did report a correlation of similar strength to the current study (τ = .45, 95% CI =

.26-.59). We suspect this is at least partly due to a large degree of heterogeneity between study effect sizes (due to the broad range of topics of the articles and wide range of study designs), which are also taken into account when calculating BF01s. Wetzels et al. (2011) demonstrated that standardized effect sizes even larger than 0.8 could be considered as anecdotal evidence in favor of the null hypothesis, so it is likely that effect size has a moderating effect on the association between BFs and sample size.

2 It should be noted there are some conceptual differences between NHST and Bayesian conceptions of statistical power (see Kruschke & Liddell, 2018b). Lakens (2016) and Schimmack (2015) have written blog posts with R code to conduct power analyses for Bayesian analyses.

It should be noted, however, that Bayesian statistics do have some limitations. One criticism of Bayesian analyses is the potential subjectivity of the prior distribution (Dienes &

McLatchie, 2018; Etz et al., 2018). The prior distribution can be any distribution (e.g., normal, uniform, Cauchy, t, etc.), and an unscrupulous researcher could easily change the prior distribution during data analysis to increase the BF value to show stronger evidence in favor of one hypothesis over the other. Indeed, the robustness analyses conducted in the current study found that evidential categorization changed in approximately one-third of cases simply by changing the prior distribution (for comparison, Aczel, Palfi et al., 2018, found that 31.7-52.4% of cases changed evidential category as a result of changing the prior distribution). To combat this, researchers should a) consider conducting robustness analyses with different prior distributions, and b) be transparent with their choice of priors, and justification for the priors

(though researchers pre-registering their study would be ideal; Dienes, 2016). Relatedly, it can be difficult to choose a suitable prior distribution, especially for researchers lacking experience in Bayesian analyses (Morey, Romeijn, & Rouder, 2016). Both these limitations highlight the need for researchers to educate themselves on conducting and interpreting Bayesian analyses before running any analyses, as uninformative or inaccurate prior distributions are likely to result in spurious results.
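A robustness analysis of the kind recommended above can be sketched in a few lines. The Python example below recomputes BF01 for the same t statistic under normal priors on the standardized effect size with different standard deviations, for which the BF has a closed form. This is an illustrative sketch only: the normal prior family, the particular prior widths, and the function names are assumptions, and they are not the priors used in the robustness analyses of the current study.

```python
import math


def bf01_normal_prior(t, n_eff, df, sd):
    """BF01 for a t statistic under a Normal(0, sd^2) prior on the effect size.

    n_eff is N for a one-sample or paired test and n1*n2/(n1+n2) for two
    independent groups; df is the degrees of freedom of the t-test.
    """
    a = 1 + n_eff * sd * sd
    log_h0 = -(df + 1) / 2 * math.log1p(t * t / df)
    log_h1 = -0.5 * math.log(a) - (df + 1) / 2 * math.log1p(t * t / (a * df))
    return math.exp(log_h0 - log_h1)


def robustness_check(t, n_eff, df, sds=(0.5, 1.0, 1.5)):
    """BF01 under each prior width, so that sensitivity to the prior is visible."""
    return {sd: bf01_normal_prior(t, n_eff, df, sd) for sd in sds}


if __name__ == "__main__":
    # Hypothetical one-sample result: t = 0 with N = 100.
    for sd, bf in robustness_check(0.0, 100, 99).items():
        print(f"prior sd = {sd}: BF01 = {bf:.2f}")
```

With t = 0 the expression reduces to BF01 = √(1 + N·sd²), so for N = 100 a prior standard deviation of 0.5 gives √26 (about 5.10) whereas a standard deviation of 1.0 gives √101 (about 10.05). The same data therefore fall on either side of the BF01 > 10 threshold depending on the prior, which is why reporting results across several justified priors is informative.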

Researchers should also be aware that there is ongoing discussion of the role of BFs in social science, particularly in comparison to NHST (e.g., Albers, Kiers, & van Ravenzwaaij,

2018). Aczel, Hoekstra et al. (2018) conducted a survey of nine prominent Bayesian statisticians in behavioral sciences, and found that although experts agreed on the majority of topics regarding Bayesian analyses, there were disagreements in some topics, such as employing uniform decision thresholds and reporting of analyses and results. However, all experts agreed that blindly using blanket policies should be avoided at all costs, and that common sense should be used when deciding upon an analytical method. This is also consistent with Lakens et al.

(2018), who suggested that the type of analysis conducted, be it some kind of Bayesian analysis or an equivalence test, should be determined by the research question being asked (and that effect sizes should always accompany hypothesis tests).

In conclusion, Bayesian analyses are advantageous over NHST with regard to non-significant results in that BFs can be calculated to determine the strength of evidence in favor of the null hypothesis over the alternative hypothesis. However, these analyses are rarely used in gerontological research, making it challenging to know which null results are meaningful.

Further, a large proportion of null findings were deemed inconclusive when reanalyzed from a

Bayesian perspective. Given the range of aging research topics that may be relatively stable across age and/or time, in terms of cognitive (e.g., Spaniol et al., 2014), emotional (Martins et al., 2016), and functional outcomes (Gibson & Lussier, 2012), we encourage researchers to use

BFs in order to test the validity of non-significant results, and ensure that sufficient sample sizes and appropriate methodological and statistical designs are used so that null findings can be evaluated and interpreted with confidence.

References

Aczel, B., Hoekstra, R., Gelman, A., Wagenmakers, E.-J., Klugkist, I. G., Rouder, J. N., … van

Ravenzwaaij, D. (2018). Expert opinions on how to conduct and report Bayesian

inference. Retrieved from https://psyarxiv.com/23m7f

Aczel, B., Palfi, B., & Szaszi, B. (2017). Estimating the evidential value of significant results in

psychological science. PLoS ONE, 12(8), e0182651. doi:10.1371/journal.pone.0182651

Aczel, B., Palfi, B., Szollosi, A., Kovacs, M., Szaszi, B., Szecsi, P., … Wagenmakers, E.-J.

(2018). Quantifying support for the null hypothesis in psychology: An empirical

investigation. Advances in Methods and Practices in Psychological Science, 1, 357-366.

doi:10.1177/2515245918773742

Albers, C. J., Kiers, H. A. L., & van Ravenzwaaij, D. (2018). Credible confidence: A pragmatic

view on the frequentist vs Bayesian debate. Collabra: Psychology, 4(1), 31.

doi:10.1525/collabra.149

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological

science. Perspectives on Psychological Science, 7, 543-554.

doi:10.1177/1745691612459060

Brydges, C. R., & Bielak, A. A. M. (2018). Evaluation of publication bias and statistical power

in gerontological psychology. Innovation in Aging, 2(suppl_1), 932.

doi:10.1093/geroni/igy031.3463

Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review,

48, 378-399. doi:10.17763/haer.48.3.t490261645281841

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

doi:10.1037/0003-066X.49.12.997

Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on

Psychological Science, 6, 274-290. doi:10.1177/1745691611406920

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in

Psychology, 5, 781. doi:10.3389/fpsyg.2014.00781

Dienes, Z. (2016). How Bayes factors change scientific practice. Journal of Mathematical

Psychology, 72, 78-89. doi:10.1016/j.jmp.2015.10.003

Dienes, Z., & McLatchie, N. (2018). Four reasons to prefer Bayesian analyses over significance

testing. Psychonomic Bulletin & Review, 25, 207. doi:10.3758/s13423-017-1266-z

Etz, A., Gronau, Q. F., Dablander, F., Edelsbrunner, P. A., & Baribault, B. (2018). How to

become a Bayesian in eight easy steps: An annotated reading list. Psychonomic Bulletin

& Review, 25, 219-234. doi:10.3758/s13423-017-1317-5

Etz, A., & Vandekerckhove, J. (2018). Introduction to Bayesian inference for psychology.

Psychonomic Bulletin & Review, 25, 5-34. doi:10.3758/s13423-017-1262-3

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, UK: Oliver & Boyd.

Gibson, S. J., & Lussier, D. (2012). Prevalence and relevance of pain in older persons. Pain

Medicine, 13(suppl_2), S23–S26. doi:10.1111/j.1526-4637.2012.01349.x

Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (2017). Informed Bayesian t-tests. Retrieved from

http://arxiv.org/abs/1704.02479

Hoekstra, R., Monden, R., van Ravenzwaaij, D., & Wagenmakers, E.-J. (2018). Bayesian

reanalysis of null results reported in medicine: Strong yet variable evidence for the

absence of treatment effects. PLoS ONE, 13(4), e0195474.

doi:10.1371/journal.pone.0195474

Isaacowitz, D. M. (2018). Planning for the future of psychological research on aging. The

Journals of Gerontology: Series B, 73, 361-362. doi:10.1093/geronb/gbx142

JASP Team (2019). JASP (Version 0.9.2) [Computer software]. Retrieved from https://jasp-stats.org

Jeffreys, H. (1961). The theory of probability. Oxford, UK: Oxford University Press.

Kendall, M. G., & Gibbons, J. D. (1990). Rank correlation methods (5th ed.). London, UK:

Edward Arnold.

Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model

comparison. Perspectives on Psychological Science, 6, 299-312.

doi:10.1177/1745691611406925

Kruschke, J. K., & Liddell, T. M. (2018a). Bayesian data analysis for newcomers. Psychonomic

Bulletin & Review, 25, 155-177. doi:10.3758/s13423-017-1272-1

Kruschke, J. K., & Liddell, T. M. (2018b). The Bayesian New Statistics: Hypothesis testing,

estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic

Bulletin & Review, 25, 178-206. doi:10.3758/s13423-016-1221-4

Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: a diagnosis

based on the correlation between effect size and sample size. PLoS ONE, 9(9), e105825.

doi:10.1371/journal.pone.0105825

Lakens, D. (2016, January 14). Power analysis for default Bayesian t-tests [Blog post]. Retrieved

from http://daniellakens.blogspot.com/2016/01/power-analysis-for-default-bayesian-t.html

Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2018). Improving

inferences about null effects with Bayes factors and equivalence tests. The Journals of

Gerontology: Series B. Advance online publication. doi:10.1093/geronb/gby065

Ly, A., Raj, A., Etz, A., Marsman, M., Gronau, Q. F., & Wagenmakers, E.-J. (2018). Bayesian

reanalyses from summary statistics: A guide for academic consumers. Advances in

Methods and Practices in Psychological Science, 1, 367-374.

doi:10.1177/2515245918779348

Martins, B., Sheppes, G., Gross, J. J., & Mather, M. (2016). Age differences in emotion

regulation choice: Older adults use distraction less than younger adults in high-intensity

positive contexts. The Journals of Gerontology, Series B: Psychological Sciences and

Social Sciences, 73, 603–611. doi:10.1093/geronb/gbw028

Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2016). The philosophy of Bayes factors and the

quantification of statistical evidence. Journal of Mathematical Psychology, 72, 6-18.

doi:10.1016/j.jmp.2015.11.001

Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing null hypotheses.

Psychological Methods, 16, 406-419. doi:10.1037/a0024377

Morey, R. D., Rouder, J. N., & Jamil, T. (2015). BayesFactor: Computation of Bayes factors for

common designs (Version 0.9.12-2) [Computer software]. Retrieved from https://cran.r-project.org/web/packages/BayesFactor/BayesFactor.pdf

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical

hypotheses. Philosophical Transactions of the Royal Society A: Mathematical, Physical

and Engineering Sciences, 231, 289-337.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing

controversy. Psychological Methods, 5, 241-301. doi:10.1037/1082-989X.5.2.241

Quintana, D. S., & Williams, D. R. (2018). Bayesian alternatives for common null-hypothesis

significance tests in psychiatry: A non-technical guide to using JASP. BMC Psychiatry,

18, 178. doi:10.1186/s12888-018-1761-4

R Core Team (2018). R: A language and environment for statistical computing. Vienna, Austria:

R Foundation for Statistical Computing. Retrieved from https://www.r-project.org/

Rouder, J. N. (2016, January 24). Roll Your Own: How to Compute Bayes Factors For Your

Priors [Blog post]. Retrieved from http://jeffrouder.blogspot.com/2016/01/what-priors-should-i-use-part-i.html

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for

accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–

237. doi:10.3758/PBR.16.2.225

Schimmack, U. (2015, May 16). Power Analysis for Bayes-Factor: What is the Probability that a

Study Produces an Informative Bayes-Factor? [Blog post]. Retrieved from

https://replicationindex.wordpress.com/2015/05/16/power-analysis-for-bayes-factor-what-is-the-probability-that-a-study-produces-an-informative-bayes-factor/

Signorell, A. (2017). DescTools: Tools for descriptive statistics (Version 0.99.22) [Computer

software]. Retrieved from https://cran.r-project.org/web/packages/DescTools/index.html

Spaniol, J., Schain, C., & Bowen, H. J. (2014). Reward-enhanced memory in younger and older

adults. The Journals of Gerontology, Series B: Psychological Sciences and Social

Sciences, 69, 730–740. doi:10.1093/geronb/gbt044

Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power

in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15(3),

e2000797. doi:10.1371/journal.pbio.2000797

van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017).

A systematic review of Bayesian articles in psychology: The last 25 years. Psychological

Methods, 22, 217-239. doi:10.1037/met0000100

van Doorn, J., Ly, A., Marsman, M., & Wagenmakers, E.-J. (2018). Bayesian inference for

Kendall’s rank correlation coefficient. The American Statistician, 72, 303-308.

doi:10.1080/00031305.2016.1264998

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values.

Psychonomic Bulletin & Review, 14, 779-804. doi:10.3758/BF03194105

Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., … Morey, R. D.

(2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical

ramifications. Psychonomic Bulletin & Review, 25, 35-57. doi:10.3758/s13423-017-1343-3

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011).

Statistical evidence in experimental psychology: An empirical comparison using 855 t

tests. Perspectives on Psychological Science, 6, 291-298.

doi:10.1177/1745691611406923