A Bayesian Treatise on The Replication Crisis in Science

by

True Gibson

A thesis

submitted to

Oregon State University

Honors College

in partial fulfillment of

the requirements for the

degree of

Honors Baccalaureate of Science in Biochemistry and Biophysics

Honors Scholar

Presented May 25, 2018

Commencement June 2018

AN ABSTRACT OF THE THESIS OF

True Gibson for the degree of Honors Baccalaureate of Science in Biochemistry and Biophysics presented on May 25, 2018. Title: A Bayesian Treatise on The

Replication Crisis in Science.

Abstract approved: ______

Jonathan Kaplan

Abstract

Over the past decade, it has come to light that many published scientific findings cannot be reproduced. This has led to the replication crisis in science. Many researchers feel that they can no longer trust much of what they read in scientific journals, and the public is becoming ever more dubious of published findings. I argue herein that the replication crisis is a result of the mistaken belief that the methods of classical statistical inference provide a coherent objective basis for evaluating scientific hypotheses. To solve the replication crisis, I suggest that the scientific community ought to invoke a Bayesian framework of statistical inference which integrates both successful and failed replication attempts to determine the confidence they should have in any given scientific hypothesis.

Key Words: Inductive Reasoning, Replication Crisis, Statistical Inference, Bayesian

Corresponding e-mail address: [email protected]

© Copyright by True Gibson

May 25, 2018

All Rights Reserved

A Bayesian Treatise on The Replication Crisis in Science

by

True Gibson

A thesis

submitted to

Oregon State University

Honors College

in partial fulfillment of

the requirements for the

degree of

Honors Baccalaureate of Science in Biochemistry and Biophysics

Honors Scholar

Presented May 25, 2018

Commencement June 2018

Honors Baccalaureate of Science in Biochemistry and Biophysics

project of True Gibson presented on May 25, 2018.

APPROVED:

______Jonathan Kaplan, Mentor, representing Department of Philosophy

______Sharyn Clough, Committee Member, representing Department of Philosophy

______Chong Fang, Committee Member, representing Department of

______Toni Doolen, Dean, Oregon State University Honors College

I understand that my project will become part of the permanent collection of Oregon State University, Honors College. My signature below authorizes release of my project to any reader upon request.

______True Gibson, Author

I would like to thank Dr. Jonathan Kaplan for his valuable input into this project, as well as his patience and willingness to assist me through this tumultuous process. I would also like to thank Dr. Sharyn Clough and Dr. Chong Fang for their support and their contributions to the project. Finally, I must thank my family and friends for giving me the strength and encouragement to see this thesis through to the end.

Table of Contents

Section I. Introduction
  i. The Replication Crisis
  ii. Responses from the Scientific Community
  iii. A Superior Response

Section II. Faults in the Classical Approach
  i. Choosing a Test Statistic
  ii. Misinterpretations of Significance
  iii. Randomization as Justification
  iv. Deciding Between Hypotheses

Section III. The Bayesian Framework
  i. Bayes’s Theorem
  ii. Understanding Probabilities
  iii. Convergence of Beliefs
  iv. Bayes Factors
  v. Obstacles and Limitations

Section IV. Conclusion
  i. The State of Science
  ii. Looking Ahead

Section V. References

Section I. Introduction

i. The Replication Crisis

Concerns about the reproducibility of scientific discoveries have deep roots, but in recent years these qualms have been magnified drastically – especially in fields such as psychology and medicine. The resulting methodological crisis has been coined the “replication crisis.” It has been famously estimated that “most claimed research findings are false.” (Ioannidis, 2005a). This troubling realization has caused a good deal of strife in the scientific community: scientists are becoming increasingly unsure of how to react to the publications they read, and the public has developed an attitude of distrust concerning the state of modern science. The foregoing situation has caused many people to wonder how we got to this point, and why we are becoming less certain about our scientific hypotheses despite having greater access to more powerful tools for data collection and analysis than ever before in history.

I propose that the main causal factor of the replication crisis has been the scientific community’s adherence to the classical methods of statistical inference. The methods laid out in the 20th century by eminent statisticians such as

Ronald Fisher, Jerzy Neyman, and Egon Pearson have been incorporated ubiquitously into the sciences, due to the belief that they provide an objective basis for making inferences about scientific hypotheses. This belief is, however, mistaken. Upon closer analysis, it will be shown that the classical methods of statistical inference contain ineliminable elements of subjectivity and

moreover fail to yield a coherent objective basis of inference concerning scientific hypotheses. To solve this deep-seated issue in the methodology of science, I propose a Bayesian framework. This Bayesian framework allows one to readily incorporate new scientific findings to update one’s belief as to the verity of any scientific hypothesis. Widespread endorsement of such a Bayesian system would mitigate the lack of confidence caused by the replication crisis – particularly in fields such as psychology, medicine, and the social sciences, where contradictory findings abound.

The replication crisis is at its heart an interdisciplinary scientific phenomenon, as all sciences place reproducibility among their most important desiderata. The status of reproducibility as a sine qua non of scientific hypotheses traces back to the requisite condition of all scientific hypotheses: falsifiability. As was first popularized by Popper (1962), a scientific hypothesis must make a prediction about what ought to occur given some initial conditions. For instance, a believer in the oxygen theory of combustion would predict that a piece of wood will not burn in an atmosphere of pure nitrogen gas. Such a statement would serve as a suitable scientific hypothesis in the Popperian sense. Similarly, a more outlandish person might predict that the piece of wood will burn in a nitrogen atmosphere, if they happen to subscribe to the phlogiston theory of combustion.

This too would be a valid hypothesis. The two scientists with incongruent world views could then conduct a series of tests, by repeatedly attempting to set a piece of wood, situated in a chamber filled with pure nitrogen, aflame. According to

Popper, the logical asymmetry between verification and falsification dictates that

no amount of “confirmatory” experimental outcomes will be sufficient to verify any scientific hypothesis. On the other hand, a single experimental outcome which contradicts the prediction made by a scientific hypothesis is sufficient grounds to refute that hypothesis (Popper, 1959). It follows that a single failure to set the piece of wood aflame would effectively refute the phlogiston theorist’s hypothesis. Such an outcome, however, would not confirm the oxygen theorist’s hypothesis, since the possibility of the wood catching fire in a subsequent trial of the experiment cannot

(and can never) be ruled out.

Popper’s treatment of the topic is insufficient, because it does not hold true for statistical hypotheses. By definition, statistical hypotheses do not make deterministic predictions about what will occur given some initial conditions; instead, statistical hypotheses assign probabilities to different outcomes which may occur given some set of initial conditions. The case of statistical hypotheses severely complicates the desideratum of falsifiability. How can we determine the level of discrepancy we ought to allow between the actual outcomes of experiments and the predictions made by statistical hypotheses? It would be a clear mistake to conclude that a coin must be biased (that is, it does not have exactly a 50% chance of landing heads up and a 50% chance of landing tails up) based on the experimental outcome of 7 heads and 3 tails in a sequence of 10 coin flips. After all, the statistical nature of such an experiment allows for the occurrence of that outcome. The probability of the result 7 heads and 3 tails (abbreviated as: 7H, 3T) is of course less than the probability of the result 5H, 5T based on the hypothesis that the coin is unbiased, but it is not a large enough discrepancy to reject the hypothesis outright.


But what is the appropriate stopping point? How large must the discrepancy between prediction and observation be before we reject a statistical hypothesis? The solution to this problem that the scientific community has come to accept comes from the employment of classical statistical inference.

The core principles of the classical approach to statistical inference hold that objective conclusions as to the veracity of a hypothesis can be made by comparing the probability of the outcome of some repeatable experiment, given the hypothesis, to the probability of other possible results. Whether the methodology of classical statistics can live up to this principle will be explored in Section II. The central classical statistical method used in the scientific community is the use of significance tests to assess hypotheses. Significance tests are usually represented in the form of “p-values.” According to Fisher’s theory of significance tests, the decision to reject the null hypothesis should be made by summing the probabilities of those possible outcomes that are equally or less likely to have occurred (given the null hypothesis) than the outcome that did in fact occur. (Fisher, 1925, pg. 504).

If the sum of the probabilities of outcomes no more likely than the observed outcome – which is the “p-value” of the observed outcome – is less than a predetermined critical value (usually 0.05 or 5%), then the null hypothesis is said to have been rejected at that level. Note that this procedure in no way allows one to logically infer the truth or falsity of any statistical hypothesis from a particular result (Howson & Urbach, p. 128). Instead, it simply states that the results of an experiment are rather improbable, assuming the null hypothesis is true.
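To make Fisher’s prescription concrete, here is a minimal illustrative sketch in Python (my own, not part of the thesis): for the earlier example of 7 heads in 10 flips of a putatively fair coin, it sums the probabilities of every outcome no more likely than the one observed. The resulting p-value is roughly 0.34 – far above the conventional 0.05 threshold – which matches the intuition above that 7H, 3T is not sufficient grounds for rejection.

from math import comb

def fisher_p_value(n_flips, heads_observed, p_heads=0.5):
    # Probability of each possible number of heads under the null hypothesis.
    probs = [comb(n_flips, k) * p_heads**k * (1 - p_heads)**(n_flips - k)
             for k in range(n_flips + 1)]
    observed_prob = probs[heads_observed]
    # Fisher's prescription: sum the probabilities of every outcome that is
    # no more likely than the outcome actually observed.
    return sum(p for p in probs if p <= observed_prob)

print(fisher_p_value(10, 7))   # ~0.34: no grounds to reject the fair-coin hypothesis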

Despite the rather straightforward definitional meaning of “statistical significance,”

the term is quite often erroneously equated with de facto logical proof for or against a hypothesis. An optimist would imagine that, even if a statistically significant result were obtained from an experiment when the null hypothesis was true (a.k.a. a false positive, or a type I error), then future repetitions of the experiment would eventually reveal the error. Conversely, when an experimental outcome failed to reject the null hypothesis at the appropriate level when the null hypothesis was false (a.k.a. a false negative, or a type II error), future experiments would hopefully succeed in rejecting the null hypothesis.

The aforementioned optimist is, however, unfamiliar with the way modern science is conducted. The hypotheses of seminal studies which fail to produce statistically significant results are most often abandoned (Hewitt, 2008). Moreover, studies which do produce statistically significant results are rarely replicated, because “replication is not valued as highly as discovering theoretically novel, but possibly nonreplicable results.” (Duncan et al., 2014). The resulting dearth of follow-up studies and replication attempts is often cited as a main cause of the replication crisis. In addition, questionable research practices such as “p-hacking” and selective reporting have been implicated as factors contributing to the replication crisis (John et al., 2012; Head et al., 2015; Grange et al., 2017). Though no truly complete quantification of the scope and severity of the replication crisis exists, many researchers have attempted to estimate the reproducibility of certain scientific fields. Some of these attempts have been compiled in Table 1 below.


Author(s)                     Field        Result
Etz & Vandekerckhove (2016)   Psychology   64% of 72 replication studies proved inconclusive
Aarts et al. (2015)           Psychology   61% of 100 replication studies conducted in-house failed
Makel et al. (2012)           Psychology   Only 1% of psychology publications since 1900 were replication attempts
Ioannidis (2005b)             Medicine     Just 44% of 49 highly-cited studies were successfully replicated
Begley & Ellis (2012)         Cancer       89% of 53 landmark studies could not be reproduced

Table 1. The status of several scientific fields heavily affected by the replication crisis, as detailed by meta-analyses.

While these startling meta-analyses provide a quantitative characterization of the replication crisis in fields such as psychology and medicine, the qualitative evidence provided by high-profile studies which failed upon subsequent replication is what has sparked the general popularization of science’s reproducibility issue.

One of the most famous of these was a study claiming that holding so-called “power poses,” dominant non-verbal displays, caused appreciable neuroendocrine and behavioral changes (Carney et al., 2010). The initial study was well-received and gained quite a bit of praise from both the scientific community of social psychology and the mainstream media. Just as the young authors of the original “power pose” study were launching their careers off the momentum generated by their 2010 publication, a replication attempt was published.


The replication study (Ranehill et al., 2014) had a much larger sample size than the original study (200 versus 42), included a double-blind procedure (unlike the original study), and had participants hold the “power pose” for much longer than the original study (3 minutes versus 1 minute). The study failed to reproduce any neuroendocrine changes and reproduced only minor behavioral changes (which were assumed by Ranehill et al. to be “demand effects” – results occurring when subjects intuit the goal of the study), effectively refuting the incautious claims made by Carney et al. in their original paper. The apparent revelation made by the failed replication had a dramatic effect on the community of social psychology. The

“power pose” hypothesis became a poster-child for weakly-supported hypotheses which enjoyed undeserved credence simply because no one had yet attempted to reproduce the results of their original supporting studies (Dominus, 2017).

Harbingers of the impending replication crisis often pointed to the case of the

“power pose” hypothesis as evidence that the scientific literature in certain fields was rife with erroneous claims and fallacious hypotheses. This high-profile case of a failed replication woke many researchers up to the unpleasant fact that much of the published scientific literature may be incorrect. The resulting flurry of discussion about how to reform the practices of science and prevent the proliferation of false findings helped make the replication crisis as prominent as it is today.

Despite the seemingly dire situation at hand, the emergence of the replication crisis has not been met with complacency. Significant efforts have been made by the scientific community in an attempt to attenuate the publication of false claims and subsidize the practice of replication. We shall see in the end, though,

that these efforts do not present any tenable long-term solutions for eliminating the replication crisis in science.

ii. Responses from the Scientific Community

One approach which has been enacted to combat the replication crisis is the founding of journals which exclusively publish replication studies, and increased publishing of replications by already-established scientific journals (Novotney,

2014). It is no secret that scientific journals prefer publishing original and positive results, and are generally averse to publishing negative results and replication studies.

Indeed, a recent analysis of 1,151 psychology journals showed that only 3% of those journals explicitly stated that they were open to accepting replication studies

(Martin & Clarke, 2017). The deeply ingrained publication bias of journals is believed to be a major factor which has contributed to the replication crisis.

Following this reasoning, several new journals have been founded with the explicit mission of providing a platform for the dissemination of negative results

(Goodchild, 2015).

Such endeavors are admirable, and will surely help reduce the severity and scope of the replication crisis. Widespread endorsement of publishing negative results, however, seems unlikely. The reason is that journals profit only when people are interested in what they publish, and few people want to read studies which report negative (failed) results. This unfortunate yet undeniable fact means that

widespread endorsement of the publication of negative results by journals is unlikely to happen any time soon.

In a similar vein of logic, many researchers have advocated the practice of

“preregistration” to increase accountability and transparency while decreasing publication bias and the “file drawer” effect. The process of preregistration entails researchers specifying their plans for a study in detail before conducting the study.

This practice makes analyzing the results of the study in an unbiased manner much easier, as there is less doubt about what data could have been manipulated or omitted in the publication of the study. Moreover, preregistration helps mitigate the problem of null findings being stowed away in the author’s “file drawer,” because a documented account of the study would be available for future reference, even if no publication emerged. Preregistration has other benefits as well, but those listed above are the ones which can significantly mitigate the replication crisis.

Some researchers oppose implementing the practice of preregistration, claiming that it would “stifle creativity” and “prove paralyzing in terms of making adjustments during the research process” (Alvarez, 2014). In my view, the benefits of preregistration practices outweigh their costs – and it seems plausible that a middle-ground could exist between ultra-strict enforcement of the preregistered study plan and the status quo of an utter lack of a priori declaration of a study’s plan, allowing for maximum flexibility in the direction of the actual study itself.

Many scientific organizations support preregistration, the best-known of which is the Open Science Framework (OSF). Further support of preregistration would surely bolster the already significant efforts being made to combat the replication

crisis, but there remain various other avenues by which scientific practices could be improved.

Another measure that has been instituted to allay the replication crisis is the lowering of significance thresholds by various journals. It has been argued that the p-value threshold at which a research finding can be considered

“statistically significant” should be lowered to a value below the traditionally accepted 0.05 level. Indeed, many groups have written declarations urging prominent scientific journals to lower their p-value thresholds. One such group recently proposed “to change the default p-value threshold for statistical significance claims of new discoveries from 0.05 to 0.005.” (Benjamin et al., 2017).

This proposed change would redefine statistical significance to apply only to data corresponding to a p-value lower than the 0.005 level, while findings falling between 0.05 and 0.005 would be relabeled as “statistically suggestive.” Such a move has been previously made in limited fields of research. For example, in the field of population genomics, significance thresholds are as low as the 5 × 10⁻⁸ level (Ioannidis, 2018). Lowering p-value thresholds seems to be a plausible immediate-impact action which could help mitigate the acute problems posed by the replication crisis. For reasons which shall be delineated presently, however, the lowering of p-value thresholds will prove to be an ultimately insufficient strategy for eliminating the problem of false research claims.

A further recommendation, which has been made by many researchers, is increasing emphasis on “conceptual replication” studies. Conceptual replications are studies which do not attempt to exactly recreate the conditions of an earlier

study, but which do attempt to test the same hypothesis as the earlier study. In such cases, it is claimed that determining the veracity of the hypothesis in question can be achieved by “triangulating” the results of many related studies (Crandall &

Sherman, 2017; Lynch et al., 2015). This initiative focuses on broadening the range and generalizability of research findings, as opposed to simply attempting to exactly recreate the conditions of some original study in order to corroborate the same effect as was previously measured. The main methodology used to bolster the credibility of a hypothesis is to aggregate data over various subjects, stimuli, times

(trials and sessions), and modes of measurement. As stated by Seymour Epstein

(1980), “the value of [conceptual replication] lies in removing the influence of incidental variables of no apparent theoretical interest that either cannot be controlled or, if controlled, would yield generalizations too narrow in scope to be scientifically useful.” In this sense, I agree that conceptual replication and aggregation of different lines of evidence is useful for yielding robust, generalized hypotheses.

Notwithstanding, there is still a critical need for “direct replications” which seek to replicate the conditions of the original study in every possible way. It has been argued that placing an emphasis on conceptual replications over direct ones would "interact in an insidious fashion with publication bias and ... the natural tendency for results perceived as ‘interesting’ to circulate among scientists through informal channels.” (Pashler & Harris, 2012). This is a valid concern which further reinforces my belief that conceptual replications should only be seen as providing support as to the generalizability of a hypothesis. Indeed, without direct replications

triangulation via conceptual replication is rendered impossible, since the credibility of those individual studies which constitute the set of conceptual replications will always be meager. The ideal scenario would be a hypothesis which has been tested in many ways, and whose supporting (conceptual replication) studies have also been exactly replicated with success. In short, both conceptual and direct replications are necessary to the production of robust scientific hypotheses.

A more long-term initiative to reduce the severity of the replication crisis is to highlight the importance of replication through scientific pedagogy. In an experimental approach, currently being implemented at MIT and Stanford, undergraduate students in laboratory classes replicate recent findings as part of their training in experimental methods (Frank & Saxe, 2012). It is argued that doing so will help quickly build an evidence base for replication studies by using the collective effort of thousands of students, while allowing undergraduates to “generate reliable and valid findings in research-methods classes with associated pedagogical benefits.” (Grahe et al., 2012). While this idea is certainly creative and alluring, it faces several legitimate concerns in its implementation.

One obvious concern with this proposal is that results generated by undergraduates may be less trustworthy than those generated by professional scientists. It seems safe to assume that students who are just beginning to learn how to conduct scientific research would be more likely to make errors in the process of conducting research. Moreover, it cannot be ignored that – much like in real scientific research – students in laboratory classes often feel pressured to generate coherent, positive results in their projects, under the belief that it may improve the

grade they receive. Thus, while using students as a resource to conduct replication studies may be useful due to the sheer manpower available, it may ultimately produce an error-laden, inaccurate reservoir of replication data. Furthermore, there may be major impediments to replicating certain studies due to the elaborate nature of the study, or because of the high cost associated with the study. So, while implementing replication studies into undergraduate curricula may be useful in limited cases of testing inexpensive-to-test, simple hypotheses, it does not seem like a particularly effective large-scale approach to solving the replication crisis. It remains true, however, that there is a concerning dearth of strong data associated with replication studies, which may be due to the underpowered nature of most such studies.

An age-old practice for increasing the strength of scientific studies is to use larger sample sizes. In the field of psychology, where confounding factors abound, it has been shown that enormous sample sizes are necessary to assure that randomization generates equivalent study groups, especially when the putative effect under scrutiny is weak. Accordingly, “the implication is that if a series of replications with low sample sizes fail to confirm a study’s findings, that need not mean the original findings can be considered the result of a [false positive] …

However, if the original study, too, had a low sample size, those original results may be…unreplicable.” (Peters & Gruijters, 2017). Unfortunately, this unwelcome realization about the paucity of adequate sample sizes in the psychological literature will most likely go unheeded.


Despite the strong evidence that studies seeking to measure weak effects with small sample sizes are doomed to produce ambiguous results, the “publish or perish” nature of academia and the reluctance of funding agencies to fund studies with large sample sizes make the ideal of eliminating small-sample studies seem quite unlikely for the near future. While I do not disagree that increasing the sample sizes used in science would be beneficial in reducing the severity of the replication crisis, I do not see this measure as being useful in practice due to its infeasibility. Instead, we ought to develop a better framework for incorporating the results from many, possibly even conflicting, studies in a way that allows us to update our beliefs about hypotheses in a systematic, rational way.

iii. A Superior Response

Despite the various measures effected to mitigate the crisis of confidence in science, the reputations of fields such as psychology, medicine, and the social sciences have been particularly injured by the recent emergence of the replication crisis. In my estimation, the sum of the aforementioned measures fundamentally fails to address the cause of the replication crisis, for reasons I will now explain.

The unfortunate reality is that the efforts to combat the replication crisis will ultimately do little to improve the state of fields such as psychology and medicine.

Despite the reasonable assumptions behind many of the newly-implemented practices (such as preregistration, increased sample sizes, etc.), their efficacy will be insufficient to reverse the massive volume of false claims and questionable

practices occurring in the scientific community daily. The measures which have been listed serve as post hoc “bandage” fixes which are meant to “stop the bleeding,” so to speak. The wound, however, is far too large to be healed by simple bandages. To fully address the causal factors which have led to the replication crisis and change the trajectory of psychological, medical, and other fields of science, it will be necessary to examine the foundational assumptions upon which we evaluate hypotheses. Specifically, it is the statistical tests by which hypotheses are evaluated that must be replaced for the replication crisis to be resolved. Only by rejecting the long-held practice of using “significance tests” to evaluate scientific hypotheses will the confusion surrounding the status of psychology and related fields become solvable. It will be shown that a superior method exists by which scientists and laypeople alike can evaluate scientific hypotheses in the light of new evidence: this is the Bayesian method.

The Bayesian method of testing scientific hypotheses is based on Bayes’s theorem, which provides a formula for evaluating the probability that some hypothesis is true given some observation which may support or undermine that hypothesis. To assess this probability, one must decide how much credence they give the hypothesis before the observation in question has been made. Thus,

Bayes’s theorem effectively describes a method by which one can update the probability they assign to a given hypothesis after observing some experimental data. This characteristic of Bayes’s theorem has no counterpart in the classical statistical methods of Fisher, Neyman, and Pearson. The Bayesian method of evaluating scientific hypotheses is a method by which one can readily synthesize data of many

different kinds coming from different sources – a task which is impossible when using classical statistical methods. This is exactly why the Bayesian method is an attractive solution to the replication crisis.

Scientists currently have no good way to integrate the results of many different studies of the same scientific hypothesis, especially when the studies produce contradictory results. This would not be the case in the counterfactual world where the scientific community adheres to Bayesian statistical methods instead of classical ones. The case for the Bayesian framework will come in two parts: first, it will be shown why classical statistical methods must be rejected. The argument for this claim will be manifold, but will include demonstrations of the ways in which classical statistical methods can be manipulated, can rest on subjective choices, and can mislead.

Then, it will be demonstrated how a Bayesian framework can be used to provide an intuitive, logically consistent evaluation of scientific hypotheses in the light of mounting evidence. This will be accomplished by first explicating the logical basis of the Bayesian framework, and then showing by example how the framework can be coherently used. Now let us turn our attention to the classical statistical methods which are currently used in science, and critically examine their core assumptions.


Section II. Faults in the Classical Approach

A common charge against Bayesianism, which shall be explored further in

Section III, is that it contains too much room for subjectivity to be useful in the context of science. What the critics who make such claims fail to realize, however, is that the so-called “objective” methods of classical statistics are replete with the same elements of subjectivity which apparently disqualify Bayesian statistics from being useful in science. This fact is seldom recognized due to the surreptitious ways in which subjective decisions are used in classical statistical inference. In contrast to this “subtle subjectivism,” the Bayesian statistical framework openly admits its incorporation of subjective beliefs at the outset, and is thus attacked for it by those who value objectivity above all else. My brief characterization may seem to unfairly cast Bayesian statistics in a martyrly light and classical statistics in a perfidious one; but let us now see if this characterization is justified.

i. Choosing a Test Statistic

To illustrate the subjectivity of classical statistical inference, let us first examine its methods and commitments. The “experimental outcomes” discussed herein always come in the form of test statistics. Test statistics are ways of summarizing the outcome of an experiment. In general, a test statistic is

“characterized as a real-valued function defined on the outcome space… [it] associates a unique number with each possible outcome, thereby describing

outcomes in a concise numerical way.” (Howson & Urbach, p. 131). For instance, consider this scenario adapted from Howson & Urbach (1989, pp. 131-132). Let’s say I conducted an experiment in which I flipped a coin twenty times. I might report the results of my experiment by writing down the number of heads I observed. The value I record for the number of heads observed (let us call this value t) would serve as the test statistic for my experiment. There exist innumerable other test statistics

I could have used to summarize the results of my experiment. I could have, for example, described my result by recounting the entire sequence of coin flip outcomes in the experiment, e.g. [H, H, T, T, T, etc.]. Note that the second example statistic subsumes the first: from the sequence of coin flip outcomes, one can determine the number of heads that were observed in the entire experiment. One would nevertheless more readily employ the former statistic, despite its inferior information content in comparison to the latter. This is because, in evaluating the fairness of a coin, all possible outcome sequences have equally diminutive probabilities. The number-of-heads-observed statistic, on the other hand, discriminates between possible outcomes by ascribing to them different probabilities. Indeed, it would be impossible to make sense of a hypothesis about the fairness of the coin using the sequence-of-coin-flip-outcomes statistic, because all possible outcomes would be equally likely to have occurred; thus, the null hypothesis could never be rejected when using such a test statistic (though, if one were so inclined, a meaningless p-value using the sequence-of-coin-flip-outcomes statistic could readily be calculated). Still, it may not always be so trivial to decide which test statistic should be used.


For instance, given the outcome of six heads in a series of twenty coin flips, with the null hypothesis stating that the coin has a 50% chance of landing heads-up, the associated p-value of the outcome described by t = 6 would be 0.058, and the result would fail to reject the null hypothesis at the 0.05 significance level.

If, however, we employed a slightly different test statistic, we would obtain quite a different conclusion from the same result. Consider a modified test statistic, called t’, in which the outcomes containing 5 heads and 10 heads were combined, and those with 14 or 15 heads were combined. The associated p-value of the outcome described by t’ = 6 would be 0.049. While the difference in p-values between the two test statistics t and t’ is minuscule, it is the difference between publication and rejection in fields where the critical significance level is 0.05. One might object that using such an arbitrary test statistic as t’ would be absurd, but this practice is in fact one of the main mechanisms by which “p-hacking” occurs. Fisherian significance tests are utterly unable to prevent such manipulations, and this illustrates one of the avenues by which subjectivity creeps into classical statistical methods: the choice of test statistic. As we shall see, choices about how and when to make inferences provide numerous alternative openings for the injection of subjectivity into the methods of classical statistical inference. But first, let us explore the limitations of significance tests in more depth.
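The 0.058 figure quoted above appears to correspond to the one-tailed binomial tail probability P(X ≤ 6) for twenty flips of a fair coin; a minimal sketch of that calculation (my own, not Howson & Urbach’s) is given below. Reproducing the 0.049 value for t’ would additionally require merging the outcome classes in the way just described.

from math import comb

def lower_tail_probability(n_flips, max_heads, p_heads=0.5):
    # P(number of heads <= max_heads) under the binomial null hypothesis.
    return sum(comb(n_flips, k) * p_heads**k * (1 - p_heads)**(n_flips - k)
               for k in range(max_heads + 1))

print(round(lower_tail_probability(20, 6), 3))   # 0.058: just misses the 0.05 threshold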

ii. Misinterpretations of Significance

An all-too-common disconnect between Fisher’s theory of significance tests and the way in which significance tests are truly used by scientists poses another

problem. A significance test (p-value) is not sufficient, on its own, to provide support for or against a hypothesis. Additional statistics must be supplied alongside the traditionally-reported p-value, such as the power of the experiment and the effect size found. To demonstrate the necessity of these additional statistics in the disclosure of scientific findings, we can use a hypothetical scenario.

Consider a “phase I” randomized clinical trial designed to test whether a certain drug candidate (henceforth referred to as “baysamine”) increases the chance of recovery from a certain illness (henceforth referred to as “replicitis”). The experimenters determine before conducting the trial that the power of the study is

0.70, that is, the study has a 70% chance of observing data that indicates baysamine is effective at treating replicitis, if it is indeed the case that baysamine is effective at treating replicitis. Note that the power of the study is equivalent to 1 − β, where β is the probability of a false negative. The researchers proceed with the drug trial, following all the traditional procedures of randomization, double-blind testing, and creating a control group. After conducting the study, the researchers compile and analyze the data. According to their analysis, there was a statistically significant difference in the rates of recovery from replicitis between the baysamine-treated group and the control group. The results indicated that those treated with baysamine, on average, had a quicker recovery from replicitis than did those in the control group, with an associated p-value of 0.05. The researchers are overjoyed, because to them, the data seems to indicate that they’ve found a successful drug for treating replicitis! After all, the p-value of the data indicates that there is only a 5% chance of observing data equally or more extreme than the data collected if the null

hypothesis (that baysamine did not affect recovery from replicitis) were true. This intuitively seems like a reasonable conclusion; but in fact, it is not. The researchers cannot yet determine the likelihood of their hypothesis being true because they have not done a proper statistical analysis. Indeed, the researchers cannot calculate the probability of their hypothesis being true within the classical statistical framework.

This is because, to calculate the true probability of the hypothesis, one must first determine the prior probability of the hypothesis being tested.

The fraction of drugs tested in phase I clinical trials which eventually proceed to FDA approval is only about 1 in 10 (Hay et al., 2014). Given this information, we might reasonably assign a prior probability of 0.1 to the hypothesis that baysamine is an effective treatment for replicitis. If we decide upon this prior probability, then we can represent the logical process of assessing the posterior likelihood of the hypothesis graphically (Figure 1).

Figure 1. Logical representation of assessing the outcome of a drug trial.


Using this analysis, we can calculate the probabilities of each possible type of outcome (Table 2). In the present example, we can disregard the chance of seeing a false negative or true negative, since we have already observed a positive outcome. The only thing left to do is to determine the probability that the positive outcome we observed is a true positive, and not a false one.

                     Effective drug          Ineffective drug
Positive result      True positive (7%)      False positive (4.5%)
Negative result      False negative (3%)     True negative (85.5%)

Table 2. Probabilities of the possible drug trial outcomes.

Of all possible positive outcomes in this experiment, roughly 60% represent true positives, while 40% represent false positives. Thus, we clearly cannot conclude with confidence that baysamine is effective at treating replicitis. Despite the apparently strong evidence generated in the clinical trial, there is still only a 60% chance that the hypothesis is true! This hypothetical example is meant to illustrate how and why significance tests are wholly insufficient to assess the likelihood of a hypothesis. Even with additional measures such as power, the hypothesis cannot be truly evaluated unless it is assigned a prior probability. But we are not done examining the numerous flaws with significance tests. We will now consider the case of randomization: a technique central to the methods of classical statistical inference.
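A minimal sketch of the calculation summarized in Table 2 (my own illustration, using the hypothetical drug and illness named above): with a prior of 0.1, power of 0.70, and a false-positive rate of 0.05, Bayes’s theorem gives a posterior probability of roughly 0.61 that baysamine is effective once a positive result has been observed.

def posterior_given_positive(prior, power, alpha):
    # P(hypothesis is true | positive result), by Bayes's theorem.
    true_positive = prior * power           # 0.1 * 0.70 = 0.07
    false_positive = (1 - prior) * alpha    # 0.9 * 0.05 = 0.045
    return true_positive / (true_positive + false_positive)

print(round(posterior_given_positive(prior=0.1, power=0.70, alpha=0.05), 2))   # ~0.61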

iii. Randomization as Justification

Randomization was described by Fisher as the method “by which the validity of the test of significance may be guaranteed against corruption by the causes of

disturbance which have not been eliminated [by being controlled].” (Fisher, 1947, pg. 19). Clearly then, the procedure of randomization must be justified if tests of significance are to be trusted. Randomization has been so widely accepted in the scientific community that its validity is rarely questioned. Upon closer examination, however, randomization fails to achieve the elimination of confounders that it purports to achieve. The principle of randomization holds that by randomly sampling members of a reference population, one can assume that the effects of possibly confounding factors on the outcome of the trial will cancel out. The issue here is that to randomize a population with respect to some causal factor, one must be able to identify that causal factor at the outset. It follows, then, that all possible confounders must be identified a priori for the randomization process to eliminate the effects of all those factors. Of course, doing so would be impossible in practice, since scientists can never eliminate the possibility that any arbitrary factor could be a confounder without studying its effect in isolation. Furthermore, there are infinitely many possibly confounding factors in any experiment, and most of these are impossible to control for.

Fisher did not seem to worry too much about this fact, as he held that,

“subsequent causes of differentiation… can either be predetermined before the treatments have been randomized, or, if it has not been done, can be randomized on their own account; and other causes of differentiation will either be (a) consequences of differences already randomized, or (b) natural consequences of the difference in treatment to be tested, of which on the null hypothesis there will be none, by definition, or (c) effects supervening by chance independently from the

treatments applied.” (Fisher, 1947, pp. 20-21). Fisher’s claim is that possible confounders which have not been identified and subsequently “randomized out” will not affect the outcome of the test, because they will either not actually affect the outcome, or they will be naturally randomized by the randomization of other, identified confounding factors. This claim is strikingly naïve in my view, because in any experiment there are any number of unforeseeable influences which could confound the result if not controlled for. In such cases, the legitimacy of significance tests, which Fisher claimed was guaranteed by the process of randomization, must be called into question.

An often-argued defense of randomization is that it can be used by the skilled experimenter to “randomize out all the factors which are suspected to be causally important but which are not actually part of the experimental structure.” (Kendall

& Stuart, 1983, pg. 137). That deciding which factors to randomize is a matter of the experimenter’s best judgment is a commonly held understanding of randomization’s role in classical statistical inference. Following this understanding, it is usually said that factors which are very unlikely to be confounders can be assumed to have no bearing on the effect under investigation. Making such an assumption obviously involves a subjective judgement. But even if the influence of a “negligible” factor on the effect in question were to be systematically tested, one would be required to conduct an appropriately randomized trial to have confidence in the result of the test. This situation clearly leads to an infinite regress, and as such there can be no logical basis for assuming that the effects of certain factors are “almost certainly negligible” (Howson & Urbach, 1989, pg. 150).


Thus, the sneakily subjective practice of significance testing is itself based on randomization, which, as has been shown, necessarily involves subjective judgements on behalf of the experimenter. The role of subjective decision-making in Fisher’s theory of significance tests is clear, but what of the popular Neyman-

Pearson framework of significance tests? We shall in due course see that it too is beset by the same subjectivity as Fisher’s theory.

iv. Deciding Between Hypotheses

When Neyman & Pearson (1928) proposed their theory of statistical inference, they were primarily seeking to improve on Fisher’s theory by 1) providing a logical basis for the procedures they recommend, and 2) removing the ambiguity about which test statistics are acceptable to use in significance testing (Howson & Urbach,

1989, pg. 155). In achieving the latter, they seem to have, for the most part, succeeded. But achieving the former was a more difficult task, which they were ultimately unable to accomplish. Before diving into the criticisms of the Neyman-Pearson theory of significance tests, we must first characterize the theory itself.

The measures of a significance test in the Neyman-Pearson framework are the power (i.e. one minus the probability of observing a “false negative”), and the size

(i.e. the significance level, or the probability of observing a “false positive”). Naturally, one would aim to maximize the power and minimize the size of their experiment in order to avoid both type I and type II errors. There exists a competition here between power and size, as “in most cases, a diminution in size produces a contraction in power, and vice versa.” (Howson & Urbach, 1989, pg. 158).


Recognizing this, Neyman and Pearson recommend that one should fix the desired size of a test at an appropriate level (usually 0.05) and then maximize the power. A putative advantage of the Neyman-Pearson theory is that one can conduct sample size estimations and power analyses, which let one determine, a priori, the sample size required to achieve the desired power of a test given a fixed size.

To demonstrate the Neyman-Pearson approach, let us put it to work by considering an example initially described by Kyburg (1974, pp. 26-35). This example asks of Neyman-Pearson methods a simple task: decide between two statistical hypotheses. Consider an individual who bought a large shipment of tulip bulbs containing a mixture of bulbs producing red tulips and bulbs producing yellow tulips. Assume that the different bulb types are indistinguishable until the moment they flower. Furthermore, the buyer has forgotten whether the shipment contained 40% yellow tulip bulbs and 60% red tulip bulbs, or vice versa. Let h’ represent the claim that the shipment contains 60% yellow tulip bulbs and 40% red tulip bulbs, and h” represent the claim that the shipment contains 40% yellow tulip bulbs and 60% red tulip bulbs.

An experiment designed to decide between h’ and h” might entail randomly selecting n tulip bulbs from the shipment, and then observing how many grow into red tulips and how many grow into yellow tulips. Following the methodology of a power analysis and sample size estimation, the minimum proportion of red-flowering bulbs needed in a random sample of size n such that h’ (which we shall consider for the time being to be the null hypothesis) could be rejected at the 0.05 level is described below (Table 3).


Sample size n    Proportion of red tulips needed to reject h’    Power of the test against h”
10               0.70                                            0.37
20               0.60                                            0.50
50               0.50                                            0.93
100              0.480                                           0.99
1000             0.426                                           1.0
10000            0.4080                                          1.0
100000           0.4028                                          1.0

Table 3. Minimum experimental outcomes required to reject h’. Note that for large sample sizes, the proportion of red tulips needed to reject h’ approaches 0.4, which h’ itself predicts. Table data reproduced from Howson & Urbach (1989).

The results listed above are generated by applying the Neyman-Pearson approach, and yet they go against common sense. Recall that the task in this scenario was to decide between two hypotheses: h’ (the null hypothesis) and h”

(the alternative hypothesis). Within the constraints of the task, we are forced by the methods of Neyman and Pearson to reject h’ at the 0.05 level in the hypothetical event that we observe 40.3% red tulips in a sample of 100,000 tulip bulbs. Even though h’ predicts an outcome much closer to the observed outcome than does h”, we must apparently reject h’ and accept h” to complete the task of deciding between the two hypotheses. This absurd result demonstrates how readily the supposedly cogent methods of the Neyman-Pearson theory of statistical inference can be manipulated. Indeed, this type of manipulation – be it intentional or

accidental – is likely to be responsible for a large number of the fallacious hypotheses in the scientific literature.
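For readers curious where figures like those in Table 3 come from, the following rough sketch (my own, not reproduced from Howson & Urbach) finds, for a given sample size, the smallest number of red-flowering bulbs whose upper-tail probability under h’ (40% red) does not exceed 0.05, and then evaluates the power of that rejection rule under h” (60% red). Because the binomial distribution is discrete, the critical proportions for small n are sensitive to convention and need not match the table exactly, but the qualitative behaviour is the same: as n grows, the proportion required to reject h’ approaches the 0.4 that h’ itself predicts, while the power approaches 1.

from math import comb

def critical_proportion_and_power(n, p_null=0.4, p_alt=0.6, alpha=0.05):
    # Probability mass functions for the number of red tulips under h' and h".
    pmf_null = [comb(n, k) * p_null**k * (1 - p_null)**(n - k) for k in range(n + 1)]
    pmf_alt = [comb(n, k) * p_alt**k * (1 - p_alt)**(n - k) for k in range(n + 1)]
    # Smallest count k* of red tulips such that P(X >= k* | h') <= alpha.
    tail, k_star = 0.0, n + 1
    for k in range(n, -1, -1):
        if tail + pmf_null[k] > alpha:
            break
        tail += pmf_null[k]
        k_star = k
    power = sum(pmf_alt[k_star:])   # P(X >= k* | h")
    return k_star / n, power

for n in (10, 100, 1000):
    proportion, power = critical_proportion_and_power(n)
    print(n, round(proportion, 4), round(power, 2))   # proportion tends towards 0.4, power towards 1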

The situation becomes even murkier when we consider the subjectivity involved in choosing the null hypothesis. The prospect of achieving coherent inductive support by means of the Neyman-Pearson framework becomes quite clearly impossible when we realize that, in the foregoing example, the choice to make h’ the null hypothesis was left to my personal discretion. Indeed, to illustrate how this subjective step of Neyman-Pearson significance testing can have a crucial bearing on which hypothesis is accepted and which is rejected, consider an experiment where 50 out of 100 tulip bulbs produced red flowers (Table 3, n = 100).

Were h’ to be deemed the null hypothesis, it would in this scenario be rejected at the 0.05 level, and h” would be accepted. If, on the other hand, h” were the null hypothesis, then it too would be rejected at the 0.05 level, and h’ would be accepted!

What solution can Neyman and Pearson give to this apparently fatal flaw in their statistical framework?

Their response was to conclude that, “Without hoping to know whether each hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we will ensure that, in the long run of experience, we shall not be too often wrong.” (Neyman & Pearson, 1933, pg. 142, italics added). They seem to be stating that we ought to err on the side of caution, such that the null hypothesis should just be “the one which according to the scientist’s personal scale of values would lead to the more undesirable practical consequences, were it mistakenly assumed to be true.” (Howson & Urbach, pg. 168).


In the foregoing example, there is no discernible practical consequence to taking either h’ or h” as the null hypothesis, and thus Neyman and Pearson’s advice will prove entirely useless.

I have not exhausted all the ways in which significance tests fail to do what we assume they do, but I have included the criticisms which I felt sufficient for and germane to the exposition required in the present essay. The methods of classical statistical inference – both Fisher’s framework and the Neyman-Pearson framework – can be abused in such a way as to yield favorable yet misleading results, and can even lead honest researchers to false conclusions in many cases.

Moreover, the claims to objectivity made by Fisher, Neyman, and Pearson alike have been shown to be unsatisfactory and logically inconsistent. In truth, the methods of classical statistical inference are just as subjective as those employed by Bayesian statistics, if not more so. With these drawbacks in mind, it seems quite reasonable to conclude that the reproducibility crisis is in large part a result of science’s widespread use of the methods advocated by classical statistical inference. It is now time to consider the alternative: the Bayesian framework.


Section III. The Bayesian Framework

i. Bayes’s Theorem

One can derive Bayes’s theorem quite readily from the axioms of the probability calculus. Indeed, all sound statistical measures and probability metrics may be derived using the probability calculus and its axioms. The core axioms of the probability calculus are as follows (Howson and Urbach, 1989, pg. 16).

Consider a set S of statements a, b, c, … which may also contain truth-functional

operations of its elements (e.g. the negation ~a, the disjunction a∨b, etc.). We can then define a probability function P which assigns non-negative real numbers on the unit interval to the sentences in S and displays the following properties:

\[
(1)\quad P(a) \geq 0, \ \text{for all } a \text{ in } S
\]

\[
(2)\quad P(t) = 1, \ \text{if } t \text{ is a tautology}
\]

\[
(3)\quad P(a \vee b) = P(a) + P(b), \ \text{if } P(a \wedge b) = 0
\]

Note that equation (3) requires that a and b be mutually inconsistent statements; that is, a entails the negation of b and vice versa. These axioms constitute a basis for evaluating absolute or unconditional probabilities. Conditional probabilities can be calculated with the probability function P using the equation:

\[
(4)\quad P(a \mid b) = \frac{P(a \wedge b)}{P(b)}, \ \text{if } P(b) \neq 0
\]

From this definition, one can easily arrive at Bayes’s Theorem, as I shall now show.

\[
P(a \mid b) = \frac{P(a \wedge b)}{P(b)}, \quad \text{and} \quad P(a \wedge b) = P(b \wedge a)
\]

\[
P(b \mid a) = \frac{P(b \wedge a)}{P(a)}, \quad \text{and so} \quad P(b \wedge a) = P(a \wedge b) = P(b \mid a)\,P(a)
\]

\[
\text{Substituting, we get:} \quad P(a \mid b) = \frac{P(b \mid a)\,P(a)}{P(b)} \qquad \blacksquare
\]

While trivial to derive, this result is the foundation of Bayesian inference.

The proposition a is usually a hypothesis which is being evaluated in the light of some empirical evidence b. Definitions of the terms in Bayes’s theorem can be given intuitively. The probability of the hypothesis a conditional on the data b is referred to as the posterior probability of the hypothesis, and is symbolized by the term P(a|b).

The probability of the hypothesis a prior to consideration of the data b (or before collecting the data, or in ignorance of the data, etc.) is referred to as the prior probability of the hypothesis, and is symbolized by the term P(a). The probability of the observed data b given the veracity of the hypothesis a is symbolized by the term P(b|a), and represents the probability that the hypothesis a assigns to the particular outcome described by b. Finally, the probability of the data itself is symbolized by P(b). The probability of the data, P(b), is loosely analogous to the

Fisherian p-value, which represents the probability of observing the data under the assumption that the null hypothesis (in most cases, ~a) is true.

The prospect of determining prior probabilities of hypotheses raises the question of how one ought to reasonably do such a thing. This step in the method of Bayesian inference is where most scientists, with arms akimbo, scoff in the face of such a subjective task. The traditional ideal of objectivity in the scientific pursuit of truth elicits a certain feeling of disgust with the idea that one should “decide for themselves” how likely a hypothesis is before any data has been collected. Such a

dismissive reaction is, however, unwarranted. It is necessarily true that no one can give an objective, accurate estimate of a hypothesis’s probability before any evidence has come in; this is a problem for all reasonable individuals – not just Bayesians. Regardless, we all implicitly assign a subjective, personal probability to any yet-to-be-tested idea that has been proposed to us. This probability will reflect the degree to which the hypothesis in question seems

“reasonable” to us; the value at which we arrive will undoubtedly depend on various psychological and social factors, as well as our previous life experiences.

It is worth noting here that we shall require that no hypothesis be assigned a prior probability equal to zero or one, except for those hypotheses which express a logical contradiction or tautology, respectively. It has been provocatively claimed that all universal hypotheses ought to have prior probabilities of zero (see Popper,

1959, pg. 364), but doing so would be unnecessarily dogmatic and ultimately just as arbitrary as the subjective determination of priors advocated here. Regardless, we shall see that the nature of Bayes’s theorem is such that any prior probability may be assigned to a hypothesis (again, except for zero or one), no matter how inaccurate it is in reality, without much consequence. As will be shown later on, any prior probability on the open unit interval will ultimately be guided to an accurate, evidence-based posterior probability for a given hypothesis in the light of mounting evidence. Before proving this claim, however, we first need to make explicit the definitions used within the Bayesian framework.
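As a small illustration of this “washing out” of the prior (a simulation of my own, not the demonstration developed later in the thesis): below, three very different priors on the hypothesis that a coin lands heads with probability 0.7, as against the alternative that it is fair, are updated on the same sequence of simulated flips from the biased coin. After enough flips, all three posteriors approach 1, regardless of where they started.

import random

def update(prior, heads, p_h1=0.7, p_h0=0.5):
    # One application of Bayes's theorem for a single flip, comparing
    # h1 (P(heads) = 0.7) against h0 (P(heads) = 0.5).
    like_h1 = p_h1 if heads else 1 - p_h1
    like_h0 = p_h0 if heads else 1 - p_h0
    return prior * like_h1 / (prior * like_h1 + (1 - prior) * like_h0)

random.seed(0)
flips = [random.random() < 0.7 for _ in range(500)]   # data generated under h1

for prior in (0.01, 0.5, 0.99):
    posterior = prior
    for heads in flips:
        posterior = update(posterior, heads)
    print(prior, round(posterior, 4))   # every starting prior ends up near 1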


ii. Understanding Probabilities

There is, however, a useful analogy which can help to illustrate the logical basis for determining priors: the theory of betting odds. The Bayesian treatment of betting odds considers "those odds on a hypothesis h which, so far as you can tell, would confer no positive advantage or disadvantage to anyone betting on, rather than against, h at those odds, on the (possibly counterfactual) assumption that the truth-value of h could be unambiguously decided. Such odds, if you can determine them, we shall call your subjectively fair odds on h" (Howson & Urbach, 1989, pg. 56). It should be clarified that the utilization of subjectively fair odds does not presuppose the existence of empirically fair odds. Moreover, the theory of betting odds does not imply that individuals bet only on odds they deem to be subjectively fair. Lest we digress, we shall now continue characterizing this concept of betting odds.

Betting odds are ratios, which may be reformulated into quotients referred to as betting quotients. A betting quotient is one's subjective likelihood of a hypothesis (a number p on the unit interval) divided by one minus that likelihood, p/(1 − p). Consequently, betting quotients range from zero to infinity. Betting quotients shall henceforth be assumed to reflect one's degree of belief in some hypothesis. The justification for this assumption has been previously established in detail by many, such as Ramsey (1931), de Finetti (1937), and Carnap (1950). We impose a further constraint on our concept of degrees of belief: that one's degrees of belief must be "consistent," a term introduced by Ramsey (1931) to characterize degrees of belief having the formal and logical structure of probabilities. With this constraint in place, we may use subjective degrees of belief based on fair betting odds to assume the role of Bayesian prior probabilities. Now let us concern ourselves with the properties of these prior probabilities as functions.
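The arithmetic relating degrees of belief to betting quotients is simple enough to sketch directly in Python; the function names below are my own and are not drawn from the sources cited.

def betting_quotient(p: float) -> float:
    """Betting quotient corresponding to a degree of belief p in (0, 1)."""
    return p / (1 - p)

def degree_of_belief(q: float) -> float:
    """Degree of belief corresponding to a betting quotient q >= 0."""
    return q / (1 + q)

# A degree of belief of 0.95 corresponds to a betting quotient of 19,
# and the conversion is invertible.
q = betting_quotient(0.95)   # 19.0 (up to floating-point error)
p = degree_of_belief(q)      # 0.95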

One approach to making sense of probabilities is to define them as proportions of certain types of outcomes in a sequence of outcomes of some repeatable experiment, such as tossing a coin. Stated thusly, one would take the relative proportion of "heads" outcomes in an indefinite sequence of such an experiment as the probability of that coin landing heads-side-up given the set of conditions present in that experiment. This approach is known as the frequency theory of probability. The probability can be represented by a parameter θ: the limiting value of the ratio of the number of times the coin lands heads-up in the first n tosses to n. This conceptual understanding of probabilities has been described by many, but in the present essay we will adopt the probability theory of Richard von Mises (1964).

Von Mises’s frequentist theory of probability, as described by Howson and

Urbach (1989, pp. 202-206), aims to lay a statistical foundation for the scientific use of probabilities. It has, at its core, two axioms. The first of these axioms is the axiom of convergence. This condition stipulates that in an infinite sequence w of outcomes of some experiment E (whose elements are drawn from some outcome partition O), the relative frequency of some outcome C among the first n members in w shall tend to a definite limiting value as n tends to infinity. In short, the limit of the relative frequency of C in w must exist.


The second axiom in Von Mises’s theory is the axiom of randomness. This axiom states that the frequency limits of all outcomes are invariant in every infinite subsequence of w, regardless of the starting index at which that subsequence begins.

In other words, there can exist no two infinite subsequences within w in which the relative frequency of any outcome type C tends to different limits. Satisfaction of this condition allows one to conclude that the sequence of experimental outcomes w is random. An important consequence of the axiom of randomness is that manifestations of outcome-types in the subsequence w are independent of the other elements in w, and thus exist with constant probabilities (i.e. they are non-

Markovian). Note here that the outcome partition O need not be finite, and can thus represent a continuous distribution of outcomes. It may also, of course, represent finite experimental outcomes, such as the outcome space of a coin flip. In such a case, the outcome partition would simply have two elements, {H,T}, representing the outcome in which the coin lands with heads facing up, H, and the outcome in which the coin lands with tails facing up, T. By these core axioms, Von Mises’s theory of probability lends itself readily to the statistical analysis of experimental outcomes.
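As a loose computational illustration of the axiom of convergence (a finite simulation can, of course, only suggest the behaviour of an infinite sequence), the following sketch tracks the running relative frequency of heads in a simulated sequence of coin flips.

import random

def relative_frequencies(theta: float, n_flips: int, seed: int = 0) -> list[float]:
    """Running relative frequency of heads after each of n_flips Bernoulli(theta) trials."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for n in range(1, n_flips + 1):
        heads += rng.random() < theta  # True counts as 1
        freqs.append(heads / n)
    return freqs

# The running frequency drifts toward the limiting value theta = 0.5
# as the number of flips grows.
freqs = relative_frequencies(0.5, 10_000)
print(freqs[9], freqs[99], freqs[9_999])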

Given a hypothetical Von Mises sequence of human height measurements, for example, one could readily determine the mean human height and the standard deviation in human height from the sequence. In this sense, Von Mises's frequentist theory of probability also lends itself to the determination of probability distributions. In addition to the Gaussian distribution generated in the case of a continuous parameter such as height, Von Mises's theory is easily applied to n-ary experiments, such as flipping a coin or rolling a die, with the use of multinomial distributions (Evans, Hastings, and Peacock, 2000). Mathematical representations aside, Von Mises's theory provides a basis for probability distributions which assign some probability value to every possible outcome of an experiment. From this point of departure, we may now demonstrate how the Bayesian framework of statistical inference can logically guide our beliefs in the light of incoming data.

iii. Convergence of Beliefs

Despite the tricky business of assigning subjective prior probabilities to hypotheses, this seemingly disturbing flaw of Bayesian inference vanishes surprisingly quickly once even a little data has been collected. Consider an illustrative example borrowed from Howson and Urbach (1989, pp. 238-239), which demonstrates this speedy convergence on a shared belief between two people who started with antithetical prior beliefs. Let us say that person 1 starts with the prior belief that the distribution of a normal population has a mean of 10 and a standard deviation of 10. Then, say that person 2 has the prior belief that the distribution is centered at 100 with a standard deviation of 20. These two individuals are starting out with highly disparate prior probability distributions. These priors reflect a difference in initial opinions so stark that the region person 1 deems extremely likely to contain the true mean of the population is regarded by person 2 as practically certain not to contain it, and vice versa. Now imagine that the normal population in question has a true mean of 50 and a true standard deviation of 10. Table 4 and Figure 2 below show how the beliefs of the two individuals will change, provided they are Bayesians, as more members of the normal population are sampled.

Sample Size (n)    Person 1 Mean    Person 1 Std Dev    Person 2 Mean    Person 2 Std Dev
0                  10               10                  100              20
1                  30               7.1                 60               8.9
5                  43               4.1                 52               4.4
10                 46               3.0                 51               3.1
20                 48               2.2                 51               2.2
100                50               1.0                 50               1.0

Table 4. Posterior distributions of two individuals with contrasting prior distributions, relative to a random sample of a population with mean 50 and standard deviation 10. Table data reproduced from Howson & Urbach (1989).
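The figures in Table 4 can be reproduced with the standard conjugate updating rule for the mean of a normal population with known standard deviation; the sketch below assumes, as the table itself appears to, that the population standard deviation of 10 is treated as known and that each sample mean happens to equal the true mean of 50 exactly.

def update_normal(prior_mean: float, prior_sd: float,
                  sample_mean: float, pop_sd: float, n: int) -> tuple[float, float]:
    """Posterior (mean, sd) for a normal mean with known population sd.

    The posterior mean is a precision-weighted average of the prior mean
    and the sample mean.
    """
    if n == 0:
        return prior_mean, prior_sd
    prior_prec = 1 / prior_sd**2
    data_prec = n / pop_sd**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * sample_mean)
    return post_mean, post_var**0.5

# Person 1: prior N(10, 10^2); Person 2: prior N(100, 20^2); population N(50, 10^2).
for n in (0, 1, 5, 10, 20, 100):
    p1 = update_normal(10, 10, 50, 10, n)
    p2 = update_normal(100, 20, 50, 10, n)
    print(n, [round(x, 1) for x in p1], [round(x, 1) for x in p2])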

Figure 2. A. Probability distributions of person 1 being updated as larger samples are drawn from the population. B. Probability distributions of person 2 being updated as larger samples are drawn from the population. C. Overlay of the changing probability distributions of person 1 and person 2 shows obvious convergence on the same, true value.

These results demonstrate the power of Bayes's theorem to unite even the most disparate of prior beliefs in the light of compelling evidence. Note, too, that this characteristic is not isolated to continuous measures. A similar example could be constructed using a binary (or n-ary) experiment, such as tossing a coin (a Bernoulli process). In such a case, two individuals with opposite prior beliefs about the parameters of the coin would converge on the true parameters with a swiftness similar to that of the foregoing example. Experimental outcomes are generally limited either to (sets of) discrete values, as in the case of the coin toss, or to continuous values, as in the case of human height measurements. It can be concluded, then, that employment of Bayes's theorem will set the beliefs of any Bayesian straight in the light of sufficient data, regardless of how off-base their prior beliefs may have been.

iv. Bayes Factors

The fact remains that a popular criticism of Bayesian inference is the charge that determining a prior probability for some hypothesis is necessarily subjective. In the case of assessing a hypothesis in the light of a failed or successful replication, however, one's prior probability needn't be entirely subjective. Considering that replications are intended to test whether a previous experiment's results are in fact reproducible, the hypothesis in question is likely already supported by some data. When calculating the posterior probability of some hypothesis in the light of a replication, then, the correct prior probability to use is the posterior probability calculated after the observations in the original study have been made. If the replication is just one of many that have been attempted, then the prior probability used should reflect the most recent posterior probability calculated (i.e. the posterior probability calculated after the observations made in the original study and in all the replications conducted thus far).

It is often desirable to compare hypotheses, a fact made obvious by classical statistics, where an alternative hypothesis is always compared to the null. This can also be achieved in the Bayesian framework through the employment of Bayes factors. Introduced by Jeffreys (1961), Bayes factors give a comparative likelihood ratio between two hypotheses under consideration. While a Bayes factor can be generated between any two mutually exclusive hypotheses, in the present case it is useful for us to consider a Bayes factor which compares the likelihoods of a hypothesis and its negation (the null hypothesis). We can then derive the Bayes factor of a hypothesis h given some experimental outcome e with the following equation, adapted from Earp and Trafimow (2015).

(5)   P(h|e) / P(~h|e) = [P(h) / P(~h)] ∙ [P(e|h) / P(e|~h)]

Using equation (5), a Bayes factor can be generated which reflects the degree of belief one has in the veracity of some hypothesis h given some data e.

Note that the Bayes factor is the product of two separate components. The first component is referred to as the confidence ratio of the hypothesis, which is the ratio of the prior probabilities one assigns to the hypothesis and its negation. The second component is the likelihood ratio, which describes the ratio of the probabilities of the observed evidence conditional on the hypothesis in question and its negation.
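A minimal Python sketch of equation (5), with parameter names of my own choosing, makes the two components explicit: the confidence ratio is computed from the prior probability of h, and the likelihood ratio from the two conditional probabilities of the evidence.

def bayes_factor(prior_h: float, lik_e_given_h: float, lik_e_given_not_h: float) -> float:
    """Equation (5): the product of the confidence ratio and the likelihood ratio."""
    confidence_ratio = prior_h / (1 - prior_h)
    likelihood_ratio = lik_e_given_h / lik_e_given_not_h
    return confidence_ratio * likelihood_ratio

# Hypothetical values: even prior odds, and data twice as probable under h as under ~h.
print(bayes_factor(0.5, 0.6, 0.3))  # 2.0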


As was the case before, when considering a hypothesis which has already been tested to some extent, the confidence ratio used for the evaluation of subsequent data should reflect the Bayes factor calculated from the original experiment(s). This follows the pattern of belief-updating in which one's prior probability for a hypothesis becomes the posterior probability calculated after the observation of the experimental evidence. In the case of multiple successive experiments, such as several replication attempts conducted to evaluate some hypothesis, the Bayes factor can readily be made to incorporate all the evidence available from all the studies conducted thus far.

P(h|e_n) / P(~h|e_n) = [P(h) / P(~h)] ∙ [P(e_1|h) / P(e_1|~h)] ∙ [P(e_2|h) / P(e_2|~h)] ∙ … ∙ [P(e_n|h) / P(e_n|~h)]

(6)   P(h|e_n) / P(~h|e_n) = [P(h) / P(~h)] ∙ ∏_{i=1}^{n} [P(e_i|h) / P(e_i|~h)]

Employing equation (6) allows individuals to evaluate hypotheses in the light of multiple replications rather easily, something that has thus far been difficult to do in a systematic way. For example, consider a study which claims the verity of some hypothesis. Suppose that, having read the original study and contemplated the implications of its results, one concludes that the study was rather robust and assigns the author's hypothesis a confidence ratio of 20. This confidence ratio translates roughly to a 95% confidence in the verity of the hypothesis. If the original study were then replicated by five independent researchers, one could re-evaluate one's confidence in the hypothesis in the light of the evidence from the replication attempts. Now, assume that all five of the replication studies failed to produce the effect predicted by the original study's hypothesis. Let us furthermore assume that the replication studies were all rather weak (as they often are), and that the likelihood ratios of the replication failures were: 0.5, 0.6, 0.7, 0.65, and 0.55.

A Bayesian’s belief in the original study’s hypothesis would, after consideration of all the replication failures, be described by a confidence ratio of

1.5. This reflects a 60% confidence in the verity of the original hypothesis. This result is in accordance with scientists’ reasonable tendency to both: become less convinced about a hypothesis in the light of contradictory data, and; be hesitant about throwing a hypothesis out too quickly after the observation of a contradictory result. Note that, were a replication attempt successful in reproducing the effect predicted by the original study’s hypothesis, the confidence ratio of the hypothesis would increase as expected. Referring to equation (5), it follows that any successful replication attempt ought to yield a probability ratio greater than unity, while any failed replication attempt ought to yield a probability ratio lesser than unity. This property makes Bayes factors highly useful for integrating various sources of conflicting results in evaluating one’s confidence in a hypothesis.
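The arithmetic of this worked example can be checked with a short script implementing equation (6): starting from a confidence ratio of 20 (roughly 95% confidence, since 20/21 ≈ 0.95) and multiplying by the five likelihood ratios yields a confidence ratio of about 1.5, or roughly 60% confidence.

from math import prod

def updated_confidence_ratio(prior_ratio: float, likelihood_ratios: list[float]) -> float:
    """Equation (6): multiply the prior confidence ratio by each likelihood ratio."""
    return prior_ratio * prod(likelihood_ratios)

ratio = updated_confidence_ratio(20, [0.5, 0.6, 0.7, 0.65, 0.55])
confidence = ratio / (1 + ratio)  # convert the odds-style ratio back to a probability
print(round(ratio, 2), round(confidence, 2))  # 1.5 0.6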

The Bayesian framework of statistical inference has been explicated. The result is that one who adheres to the foregoing framework can systematically update their beliefs about hypotheses in the light of incoming evidence in a logically consistent manner. This capability is wanting in other systems of statistical inference, which has led in large part to the "reproducibility crisis" in many fields. In my view it is likely that, if this Bayesian framework were ubiquitously adopted by the scientific community, there would be far less confusion in the various fields affected by the replication crisis. Individuals would have a simple way to update their beliefs about even canonical scientific hypotheses in the face of failed (or successful) replication attempts. Moreover, there would also likely be less variance in the beliefs experts have about any given hypothesis due to the convergent properties of Bayes's theorem illustrated here. That being said, it is also worth considering the inherent limitations of the Bayesian statistical framework and the obstacles it may face if implemented.

v. Obstacles and Limitations

Despite its many advantages over classical statistical methods, the Bayesian framework is decidedly not a solution to many of the problems facing science. This does not make it inferior to classical statistical methods, however, for they too provide no solutions to these problems. For instance, the Bayesian framework tells us how to evaluate hypotheses, but it does not tell us which hypotheses to evaluate. Deciding upon the appropriate hypothesis to be tested (and the associated null hypothesis) is no small task, and a logical basis for making such decisions is provided by neither Bayesian nor classical statistics. This is to be expected, though, because such decisions are outside the scope of the field of statistics. Much has been written about whence hypotheses come and about the philosophical "problem of discovery" (see Kuhn 1962, Hanson 1965, etc.). Such writings, however, have little to do with statistical methods and are primarily concerned with logic, philosophy of mind, and the history of science. The Bayesian framework has clearly demarcated boundaries as to which problems in science it addresses, and we needn't claim that it solves all our problems. We may conclude, though, that it can solve the problems posed by the replication crisis, for the reasons already stated.

In addition to its inability to resolve some of the long-standing problems in the philosophy of science, the Bayesian framework also raises certain practical obstacles that must be addressed. One such issue is the standardization of priors for publication purposes. The subjective determination of priors recommended here may be attractive for individuals seeking to analyze their own beliefs about scientific hypotheses, but academic journals will likely be less receptive to this method. For Bayesian statistical methods to be implemented in the bureaucratic realm of modern science, prior probabilities will need to be standardized in some reasonable way.

The best such method, in my mind, comes from borrowing Ioannidis's R value, which is defined as follows: "Let R be the ratio of the number of 'true relationships' to 'no relationships' tested in the field. R is a characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated" (Ioannidis, 2005a). Academic journals which publish in specific fields might reasonably require that the standard prior for any study's hypothesis be determined by the R value of the journal's field. In the case of cross-disciplinary journals (such as Science, Nature, etc.), the regulations might require that submitted papers disclose the field most relevant to their study's hypothesis, and that the R value of that field be used to determine the appropriate prior for that study.


Ioannidis’s R value seems to be a suitable candidate for “standard” priors that journals can use as benchmarks for deciding what to publish. It can also be mentioned here that using field-dependent thresholds for considering what studies provide noteworthy evidence for or against a hypothesis is currently being done via p-value thresholds. Using field-dependent R values would thus be no different in principal to the status quo. In addition, the use of standardized R values does not exclude the ability for authors (or readers) to disclose their own priors in research papers, in addition to the standardized R value used for publication. While the notion of using R values as standardized, field-dependent prior probabilities would certainly need to be worked out in more detail before being employed, it appears to be a lucrative prospect for the large-scale implementation of the Bayesian framework. Large-scale meta-analyses may be required in order to determine R values for each scientific field, but conducting such tedious studies may be seen as investments in a superior statistical framework that brings with it many benefits.

As is the case with most innovations, problems will be dealt with as they arise, but not before.
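If R is read, as in Ioannidis (2005a), as the pre-study odds that a tested relationship is true, then one natural (though by no means the only) way for a journal to convert a field's R value into a standardized prior probability is sketched below; the R values shown are hypothetical and for illustration only.

def prior_from_r(r: float) -> float:
    """Convert a field's R value (odds of true to null relationships) into a prior probability."""
    return r / (1 + r)

# Hypothetical R values for illustration only:
print(prior_from_r(1.0))   # 0.5   -- a field that tests mostly plausible relationships
print(prior_from_r(0.01))  # ~0.01 -- a field screening thousands of candidate hypotheses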

Section III demonstrated how the influence of inaccurate prior beliefs is rapidly drowned out by mounting objective evidence when the Bayesian framework is used. Even for the most objectivist reader, the element of subjectivity in Bayesianism which has so horrified scientists in the past can be reconciled with the fact that subjectivity also rears its head in the classical framework of statistical inference, as shown in Section II. We may justifiably conclude that the Bayesian framework holds up against the tests to which it has been subjected, and moreover that it works in nearly every conceivable scenario involving the interpretation of scientific hypotheses in the face of even contradictory findings.


Section IV. Conclusion

i. The State of Science

The replication crisis in science has been at the forefront of the scientific community's attention for several years now, with additional focus in the fields most affected. An incredible amount of effort has been put into finding the causes of, and solutions to, this crisis. It seems that the futures of fields such as psychology, medicine, and the social sciences hang in the balance, and will be determined by the path that the scientific community decides to take in remedying the replication crisis.

Many of the approaches to solving the crisis which have been advocated by statisticians, philosophers, and scientists have merit. The proposals to "redefine statistical significance" (Benjamin et al., 2017), to publish more "negative" results (Goodchild, 2015), to encourage "preregistration" practices (Alvarez, 2014), and to require the use of larger sample sizes (Peters & Gruijters, 2017) appear to be attractive, immediate-impact measures to attenuate the emergent effects of the replication crisis. Others, such as the delegation of replication work to undergraduates through laboratory classes (Frank & Saxe, 2012; Grahe et al., 2012) and the endorsement of "conceptual replications" in lieu of traditional "direct replications" (Crandall & Sherman, 2016; Lynch et al., 2015), are, in my estimation, intriguing but ineffective solutions to the replication crisis.

None of the recommendations listed above, however, attempts to address the causal root of the replication crisis: the scientific community's inveterate use and endorsement of the methods of classical statistical inference.

The works of Ronald Fisher (1925) and of Jerzy Neyman and Egon Pearson (1928) were pivotal in advancing our understanding of statistical hypotheses. They are, however, imperfect. The objectivity they claim characterizes the statistical methods they recommend was never there at all. At the core of the Fisher and Neyman-Pearson theories of significance tests lie inescapably subjective practices and logically inadequate explanations, as I have shown. Moreover, the methods of classical statistical inference have been shown to be unacceptably susceptible to manipulations such as "p-hacking" and can yield misleading conclusions which contradict the evidence. It can come as no surprise, then, that a community which exclusively uses these theories to make inferences about scientific hypotheses will be led astray, especially in an era in which data are being collected at unprecedented rates and analyzed in a poorly regulated manner. The obvious solution for the scientific community, given this realization, is to replace the methods of classical statistical inference with a superior framework. I propose that the best such framework is a Bayesian framework of statistical inference like the one I have described here.

ii. Looking Ahead

As has been demonstrated, the Bayesian framework enables its adherents to integrate incoming data in a logically coherent manner. This allows Bayesians to update their beliefs about any scientific hypothesis in the light of new evidence, which is an ability direly needed in response to the replication crisis. The use of Bayes factors in the form of confidence ratios may serve as a way for researchers and laypeople alike to quantify and catalog their beliefs about scientific hypotheses (Earp & Trafimow, 2015). This is a particularly useful capacity which could resolve the confusion caused by sets of replication studies that report contradictory results. The conclusion of this exposition is that the widespread adoption of a Bayesian framework of statistical inference by the scientific community would effectively resolve the replication crisis and its associated detriments to science.

Although the replication crisis is generally seen in a decidedly negative light, it need not be thought of as a damaging event in the history of science despite its condemnatory implications about the scientific literature in fields such as psychology, medicine, and the social sciences. If the goal of science is to accurately understand and explain the natural world, then we as a community ought to welcome the discovery that many of our accepted hypotheses are in fact false. The occasional acceptance of untrue statements is inevitable in the scientific process, but the identification of erroneous beliefs is a requisite step in the procedure of removing these false hypotheses from our collective understanding of the natural world.

In the words of the late Thomas Bayes himself, "we cannot determine ... [to] what degree repeated experiments confirm a conclusion, without the particular discussion of ... inductive reasoning ... [about] which at present we seem to know little more than that it does sometimes in fact convince us, and at other times not" (Bayes, 1763). While our understanding of statistical inference and inductive reasoning has advanced dramatically since Bayes's time, his characterization of our response to repeated experiments still rings true. Due to the inability of classical statistical inference to integrate the results of experiments and their subsequent replications, we have no better recourse than to say that the cumulative results sometimes convince us, and at other times do not. Equipping ourselves with the tools of a Bayesian statistical framework might lead the scientific community towards a world in which hypotheses can be evaluated in a logically coherent manner which integrates all the available evidence.


Section V. References

Aarts, Alexander A., et al. "Estimating the Reproducibility of Psychological Science." Science 349, no. 6251 (2015): 943-53.
Alvarez, R. Michael. "The Pros and Cons of Research Preregistration." OUPblog. September 25, 2014. Accessed May 11, 2018.
Bayes, Thomas, Rev. "An Essay towards Solving a Problem in the Doctrine of Chances." Philosophical Transactions of the Royal Society of London 53 (1763): 370-418.
Begley, C. Glenn, and Lee M. Ellis. "Raise Standards for Preclinical Cancer Research." Nature 483, no. 7391 (2012): 531-33.
Benjamin, Daniel J., James Berger, Magnus Johannesson, Brian A. Nosek, Eric-Jan Wagenmakers, Richard Berk, Kenneth Bollen, et al. "Redefine Statistical Significance." PsyArXiv, July 22, 2017.
Carnap, Rudolf. Logical Foundations of Probability. Chicago: University of Chicago Press, 1950.
Carney, Dana R., Amy J. C. Cuddy, and Andy J. Yap. "Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance." Psychological Science 21, no. 10 (2010): 1363-368.
Crandall, Christian S., and Jeffrey W. Sherman. "On the Scientific Superiority of Conceptual Replications for Scientific Progress." Journal of Experimental Social Psychology 66 (2016): 93-99.
Dominus, Susan. "When the Revolution Came for Amy Cuddy." The New York Times. October 18, 2017. Accessed May 18, 2018.
Duncan, Greg J., Mimi Engel, Amy Claessens, and Chantelle J. Dowsett. "Replication and Robustness in Developmental Research." Developmental Psychology 50, no. 11 (2014): 2417-425.
Earp, Brian D., and David Trafimow. "Replication, Falsification, and the Crisis of Confidence in Social Psychology." Frontiers in Psychology 6 (2015).
Epstein, Seymour. "The Stability of Behavior: Implications for Psychological Research." American Psychologist 35, no. 9 (September 1980): 790-806.
Etz, Alexander, and Joachim Vandekerckhove. "A Bayesian Perspective on the Reproducibility Project: Psychology." PLoS One 11, no. 2 (2016).
Evans, Merran, N. A. J. Hastings, and J. Brian Peacock. Statistical Distributions. New York: Wiley, 2000.
Finetti, Bruno de. "La prévision: ses lois logiques, ses sources subjectives." Annales de l'Institut Henri Poincaré 7 (1937): 1-68.
Fisher, Ronald Aylmer. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd, 1925.
Fisher, Ronald Aylmer. The Design of Experiments. 4th ed. Edinburgh: Oliver and Boyd, 1947.
Frank, Michael C., and Rebecca Saxe. "Teaching Replication." Perspectives on Psychological Science 7, no. 6 (2012): 600-04.
Goodchild, Lucy. "Why It's Time to Publish Research 'Failures'." Elsevier. May 5, 2015. Accessed May 11, 2018.
Grahe, Jon, Alan Reifman, Anthony Hermann, Marie Walker, Kathryn Oleson, Michelle Nario-Redmond, and Richard Wiebe. "Harnessing the Undiscovered Resource of Student Research Projects." Perspectives on Psychological Science 7, no. 6 (2012): 605-07.
Grange, Jim, Danny Kingsley, and Ottoline Leyser. "The Science 'Reproducibility Crisis' – and What Can Be Done about It." The Conversation. May 10, 2018. Accessed May 11, 2018.
Hanson, Norwood Russell. Patterns of Discovery. Cambridge: Cambridge University Press, 1965.
Hay, Michael, David W. Thomas, John L. Craighead, Celia Economides, and Jesse Rosenthal. "Clinical Development Success Rates for Investigational Drugs." Nature Biotechnology 32, no. 1 (2014): 40-51.
Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. "The Extent and Consequences of P-Hacking in Science." PLoS Biology 13, no. 3 (March 13, 2015).
Hewitt, Catherine E., Natasha Mitchell, and David J. Torgerson. "Listen to the Data When Results Are Not Significant." BMJ 336, no. 7634 (January 5, 2008): 23-25.
Howson, Colin, and Peter Urbach. Scientific Reasoning: The Bayesian Approach. La Salle, IL: Open Court, 1989.
Ioannidis, John P. A. "Why Most Published Research Findings Are False." PLoS Medicine 2, no. 8, e124 (2005a): 697-701.
Ioannidis, John P. A. "Contradicted and Initially Stronger Effects in Highly Cited Clinical Research." Journal of the American Medical Association 294, no. 2 (2005b): 218-28.
Ioannidis, John P. A. "The Proposal to Lower P Value Thresholds to .005." Journal of the American Medical Association 319, no. 14 (April 10, 2018): 1429-430.
Jeffreys, Harold. Theory of Probability. Oxford: Clarendon, 1961.
John, Leslie K., George Loewenstein, and Drazen Prelec. "Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling." Psychological Science 23, no. 5 (2012): 524-32.
Kendall, Maurice G., and Alan Stuart. The Advanced Theory of Statistics. 4th ed. Vol. 3. London: Charles Griffin and Company Limited, 1983.
Kuhn, Thomas S. The Structure of Scientific Revolutions. Chicago: University of Chicago Press, 1962.
Kyburg, Henry Ely. The Logical Foundations of Statistical Inference. Dordrecht: D. Reidel, 1974.
Lynch, John G., Eric T. Bradlow, Joel C. Huber, and Donald R. Lehmann. "Reflections on the Replication Corner: In Praise of Conceptual Replications." International Journal of Research in Marketing 32, no. 4 (2015): 333-42.
Makel, Matthew C., Jonathan A. Plucker, and Boyd Hegarty. "Replications in Psychology Research: How Often Do They Really Occur?" Perspectives on Psychological Science 7, no. 6 (2012): 537-42.
Martin, G. N., and Richard M. Clarke. "Are Psychology Journals Anti-replication? A Snapshot of Editorial Practices." Frontiers in Psychology 8, no. 523 (April 11, 2017).
Mises, Richard von. Mathematical Theory of Probability and Statistics. New York: Academic Press, 1964.
Neyman, Jerzy, and Egon S. Pearson. "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference." Biometrika 20, no. 1-2 (1928): 175-240 (Part I), 263-94 (Part II).
Neyman, Jerzy, and Egon S. Pearson. "On the Problem of the Most Efficient Tests of Statistical Hypotheses." Philosophical Transactions of the Royal Society of London, Series A 231 (1933): 289-337.
Novotney, Amy. "Reproducing Results." Monitor on Psychology 45, no. 8 (September 2014): 32.
Pashler, Harold, and Christine R. Harris. "Is the Replicability Crisis Overblown? Three Arguments Examined." Perspectives on Psychological Science 7, no. 6 (2012): 531-36.
Peters, G.-J. Y., and S. L. K. Gruijters. "Why Most Experiments in Psychology Failed: Sample Sizes Required for Randomization to Generate Equivalent Groups as a Partial Solution to the Replication Crisis." 2017. https://osf.io/preprints/psyarxiv/38vfn.
Popper, Karl R. "Science: Conjectures and Refutations." In Conjectures and Refutations: The Growth of Scientific Knowledge, 33-64. Basic Books, 1962.
Popper, Karl. The Logic of Scientific Discovery. London: Hutchinson, 1959.
Ramsey, F. P. The Foundations of Mathematics and Other Logical Essays. London: Routledge & Kegan Paul, 1931.
Ranehill, Eva, Anna Dreber, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto A. Weber. "Assessing the Robustness of Power Posing." Psychological Science 26, no. 5 (March 25, 2015): 653-56.