
Reproducibility in Clinical Psychology

Christopher J. Hopwood and Simine Vazire

University of California, Davis

For: A.G.C. Wright and M.N. Hallquist

Handbook of Methods in Clinical Psychology

There is a long history of invalid ideas in clinical psychology, many of which have had profound negative effects on public health and individual lives. “Refrigerator mothers” were once blamed for causing their children’s mental disorders (Greydanus & Toledo-Pereyra, 2012). Leeching (De Young, 2015) and other discredited procedures (Ellenberger, 1970; Isaacs, 1999) were once proposed to treat mental illness. In each of these cases and many others, invalid ideas were corrected by scientific research, to the benefit of the public interest. The ability to identify and correct bad ideas about etiology and intervention is the primary virtue of a scientific approach to clinical psychology. Although the causes of psychopathology and the best ways to prevent and treat problems in living remain poorly understood, there has been clear progress in the field that can be attributed directly to the scientific method.

Invalid and harmful ideas are not isolated to the distant past. In 1998, Wakefield and colleagues published a paper in a prestigious journal ostensibly linking the measles, mumps, and rubella (MMR) vaccine to autism. Although it was based on a small sample of only 12 children, the finding was highly publicized. Even though a number of epidemiological studies found no association between the MMR vaccine and autism, rates of MMR vaccination decreased and rates of measles, mumps, and rubella increased in the United Kingdom following the publication of the study (McIntyre & Leask, 2008). After journalists uncovered multiple conflicts of interest, questionable research practices, and several ethical violations, the journal retracted the paper. Wakefield had apparently reported fraudulent data after having been paid to “find” a link between the MMR vaccine and autism (Godlee, 2011). He was found guilty of malpractice and lost his medical license in the United Kingdom.

Since the time of the Wakefield et al. study there has been a crescendo of high profile false positives in the psychology literature (e.g., Klein et al., 2014). The issue has been widely discussed in the scientific literature (Baker, 2016; Nosek, 2012), the blogosphere (e.g., Gelman, 2016; Srivastava, 2016; Vazire, 2016), and the popular press (Aschwanden, 2015; Belluz, 2015; Engber, 2016; Yong, 2016). This is a serious problem. Faulty science in clinical psychology negatively affects patients and the public because unhelpful or harmful practices are disseminated to ill effect and because persistent reports of invalid findings in the popular press can erode public trust in the scientific method.

But it is also a fortuitous opportunity to improve clinical science (Munafo et al., 2017; Tackett et al., in press). An upshot of recent discoveries of invalid findings has been a movement dedicated to teaching and disseminating more rigorous scientific methods (e.g., Open Science Collaboration, 2015). A central focus of this movement has involved investigating the reproducibility of reported effects. Genuine effects should be reproducible, and effects that cannot be reproduced should generally not be taken to be true. The purposes of this chapter are to describe the importance of reproducibility in the scientific method, review recent issues in the social sciences that contributed to the recognition of reproducibility problems, and describe best practices for conducting reproducible research in clinical psychology.

Foundations of the Scientific Method

The scientific method is one way of explaining phenomena. It can be contrasted with other methods, such as explanation via tradition or metaphysics, by a few foundational principles. In this section, we briefly review the distinguishing principles of the scientific method, with a focus on reproducibility.

Observations and Explanations

For an explanation of a phenomenon to be convincing in a scientific sense, it must explain and predict (Hempel & Oppenheim, 1948). Thus, science is fundamentally about the link between observations in nature and explanations about why those observations occurred. When explanations predict the same observations more than once they are increasingly convincing, and when they can predict similar kinds of observations across different contexts and situations, they become increasingly general.

Unlike in most other approaches to knowing, science rests on the principle of falsification (Popper, 1950), or the idea that scientists try to prove their explanations wrong rather than trying to prove them right. It follows from the idea of falsification that observations that are consistent with an explanation add confidence in that explanation but do not prove that it is true, whereas observations that are inconsistent with an explanation indicate that the explanation is at least partly inaccurate. This makes science really difficult, because our interest generally lies in proving something to be true rather than incrementally increasing our confidence that our explanation probably isn’t wrong. And as a general rule, it is easy to fool ourselves, particularly when we are motivated to see things a certain way (Feynman & Leighton, 1988).

Replication

This is why replication is so important in science. Replication means making the same explanation-relevant observations more than once in order to test the validity of the explanation. There are different levels of replication, ranging from completely direct to completely conceptual or constructive (Lykken, 1968). In the most direct replication, an observation would be sampled twice by the same person under maximally similar conditions. A replication becomes less direct when the same observations are made under highly similar conditions but by different people. This increases confidence in the observation. For instance, if you told your friend that you had blown up a plastic bottle by pouring vinegar and baking soda into it, she might be skeptical. If she saw it on video, her confidence would increase. If she did it herself, she would probably be basically convinced.

Conceptual replications push the boundaries of the explanation by changing the conditions of the observation. If you knew that mixing vinegar (an acid) and baking soda (a base) together creates a reaction that produces carbon dioxide gas, and you knew that gas building up in a closed container increases the pressure inside it, it would follow that mixing vinegar and baking soda together in a sealed plastic bottle would expand the bottle until it burst. Other things would follow, too. For instance, this should also work in closed spaces other than a plastic bottle, and it should also work with other acid-base combinations. The underlying explanation for why the bottle blew up when vinegar was added to baking soda provides a basis for setting up conceptual replications that could test that explanation further and evaluate the boundary conditions of the effect.

To use a more clinical example, it is probable that an astute physician observed many years ago that psychotic patients had an unusually high proportion of family members with mental health problems. Having shared this observation with his colleagues, others may have made similar observations. Eventually, early psychiatrists tested the concordance rates of family members in a controlled manner and confirmed an association, even across multiple ways of assessing psychotic conditions and evaluating family concordance (e.g., Jelliffe, 1911). This important set of observations could be used to support multiple possible explanations, including those related to heritability, the problematic “refrigerator mother” hypothesis, and others. These explanations were eventually tested as well, leading to the contemporary understanding that the increased incidence of psychotic symptoms among family members is most likely genetically mediated (Gottesman & Shields, 1973). While the etiology of psychotic phenomena remains poorly understood, replicated research on familial concordance that can be explained by nature to a greater degree than nurture advances our understanding significantly. From a scientific perspective, this kind of replicated evidence is always more convincing than any well-crafted argument untethered to replicated observations.

Both direct and conceptual replication are important, for different reasons. Conceptual replications are a riskier test of a theory, extrapolating from past findings to make new predictions (e.g., predicting the effect in a new setting, in a new population, or using a different measure). However, if a conceptual replication fails, this result is ambiguous. It could mean that the original theory was wrong, and even the early successful studies should be discounted as evidence for the theory, or it could mean that the researcher extrapolated one step too far, but the earlier successful studies should still be taken as strong evidence in support of the theory. Direct replications are useful because they eliminate this ambiguity. If all attempts to directly replicate a finding fail, this suggests that the original finding should not be taken as evidence for the theory, and that the theory is likely wrong. Thus, from a falsification perspective, direct replications are stronger tests because they expose the researcher to the risk of being forced to abandon their theory. Conceptual replications, on the other hand, help extend a theory when they are successful, but rarely put the researcher at risk of having to discard her entire theory.

Transparency

One of the hallmarks of science is that the basis for scientists’ claims is verifiable by others (Lupia & Elman, 2014). This is one of the key features that makes science different from other ways of knowing (e.g., intuition, faith, authority). Unlike other ways of knowing that are only accessible to some people or some types of people, science is fundamentally democratic because, in principle, anyone can do it, and anyone who can show via observation that they have identified a novel explanation for something interesting and important should get credit for having done so. From a scientific perspective, we should always be skeptical of people who claim they have made observations that we may not be able to make, or who will not share their methods or results.

As it turns out, transparency was a key issue in the recent reproducibility problems reported in the scientific and popular literature. Economic models can be used to show that a lack of transparency (i.e., “information asymmetry” between the researcher and the consumer of research) will lead to an erosion of trust in the scientific “market”, because consumers will eventually realize that they cannot tell rigorous scientific products apart from flimsy ones, and so will eventually refuse to put much stock in any scientific findings (Vazire, in press). The growing concerns about the replicability of scientific findings are one sign of this erosion of trust. Happily, the solution is relatively straightforward: increasing transparency will make it easier for others to evaluate the rigor of scientific studies and calibrate their confidence in individual results. Increasing transparency is also necessary for making replication more mainstream. Direct replications require access to a comprehensive description of the methods, procedures, and materials used. Moreover, there is growing consensus that, whenever possible, researchers should make their data available for others to attempt to reproduce their analyses and results.

Origins of the Reproducibility Problem1

1 This section borrows extensively from a blog post by Andrew Gelman (2016).

Methodologists have been emphasizing the issues that led to reproducibility problems for many decades. Meehl (1967) expressed concerns about the standard approach to determining whether an observation fits a hypothesis or not, which is generally referred to as null hypothesis significance testing (NHST). The essence of the problem with NHST is that it is probabilistic but not intuitive. Significance in an NHST framework means that the likelihood of observing a particular effect is less than some pre-determined value, usually 5%, if the effect does not actually exist in nature. However, it is possible to find apparently positive evidence for an effect that does not exist, and in fact all things equal this would be expected about 5% of the time that we are testing effects that are non-existent, according to current conventions.

One common misinterpretation of this 5% threshold is that only 5% of significant findings are false positives. This is incorrect. When NHST is done correctly, 5% of true null effects will come out significant, but this is unrelated to the proportion of significant findings that are false positives. Thus, even with a 5% threshold for significance and proper use of NHST, the rate of false positives in the published literature could be much higher than 5% if most researchers are testing effects that are actually null.
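
This base rate point can be made concrete with a brief simulation. The sketch below is purely illustrative: the proportion of true effects (10%), the level of statistical power (50%), and the variable names are our assumptions, not figures from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: researchers test many hypotheses, only 10% of
# which reflect real effects; alpha = .05; and studies have 50% power when
# an effect is real.
n_studies = 100_000
p_true_effect = 0.10
alpha = 0.05
power = 0.50

is_true = rng.random(n_studies) < p_true_effect
# A study comes out "significant" with probability = power if the effect is
# real, and probability = alpha if it is not (i.e., NHST used correctly).
significant = np.where(is_true,
                       rng.random(n_studies) < power,
                       rng.random(n_studies) < alpha)

false_positive_share = np.mean(~is_true[significant])
print(f"Share of significant findings that are false positives: "
      f"{false_positive_share:.2f}")
# Under these assumptions, roughly half of all significant results are false
# positives, even though the per-test false positive rate is held at 5%.
```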

It turns out that even the correct interpretation of the 5% false positive rate is often not valid, because many subtle things can be done to increase this false positive rate. Strictly speaking, for p-values to be interpretable in NHST (and for the false positive rate to be fixed at 5% when there is actually no effect), researchers must specify one key analysis ahead of time, and compute a p-value only for that analysis. In practice, however, this prescription is rarely followed and researchers nevertheless interpret p-values and assume that their false positive rate is 5%.

A number of common practices inflate the false positive rate (and violate the conditions under which NHST is supposed to be carried out). First, researchers can “peek” at the data as they are being collected, and decide to stop data collection once the p-value crosses below the .05 threshold. After data collection, researchers can run many analyses (e.g., for many different subgroups, or using many different measures) and only report significant effects, claiming that those were the effects they predicted from the beginning (i.e., Hypothesizing After the Results are Known, or HARKing; Kerr, 1998).

In addition, researchers can tinker with their analyses by, for example, tossing out outliers in an unprincipled fashion, transforming variables in such a way that the results come out significant, or adding or removing covariates until the results become significant. Finally, if none of these efforts lead to a significant result, researchers can put the “failed” study in a “file drawer” and start over, and then report only the studies that “worked” (i.e., produced a significant result) in their submitted manuscript. All of these practices, collectively referred to as “p-hacking” or “questionable research practices,” increase the chances of reporting a significant effect that is actually a false positive. Many of them are quite common (John, Loewenstein, & Prelec, 2012).
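
The cumulative effect of just two of these practices, peeking at accumulating data and trying multiple outcomes, can be illustrated with a short simulation. The settings below (three interchangeable outcomes, a look at the data every 10 participants per group, a true effect of exactly zero) are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_phacked_study(max_n=100, step=10, n_outcomes=3, alpha=0.05):
    """Simulate one study in which the true effect is zero but the researcher
    peeks repeatedly and tests several outcomes, stopping at the first p < .05."""
    group_a = rng.normal(size=(max_n, n_outcomes))
    group_b = rng.normal(size=(max_n, n_outcomes))
    for n in range(step, max_n + 1, step):
        for k in range(n_outcomes):
            p = stats.ttest_ind(group_a[:n, k], group_b[:n, k]).pvalue
            if p < alpha:
                return True          # a "significant" result gets written up
    return False                     # the study goes in the file drawer

n_sim = 2000
false_positive_rate = np.mean([one_phacked_study() for _ in range(n_sim)])
print(f"False positive rate with peeking and multiple outcomes: "
      f"{false_positive_rate:.2f}")  # far above the nominal .05
```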

The fact that journals have a strong bias to publish findings that are statistically significant compounds this problem (Rosenthal, 1979). Pressure to publish, complete degrees, get jobs, obtain tenure, and enhance one’s status motivates researchers to find and report significant effects, and the foibles of human reasoning make it possible even for the most well-intentioned to justify these research practices (Tversky & Kahneman, 1971). During the last few decades, it has also become increasingly common to report novel findings in the popular press. The popular press is biased toward surprising, “sexy” findings. However, by definition, surprising and counter-intuitive findings are less likely to be observed; otherwise they wouldn’t surprise us. In the end, motivated researchers built up large bodies of work seeming to support a particular explanation for what turn out to be false positive observations, by cleverly (and often unwittingly) finding loopholes in the logic of NHST.

Finally, the common practice of running studies with too-small samples further increases the rate of false positive findings in the literature (Fraley & Vazire, 2014). Even when p-hacking is avoided, studies that are underpowered (i.e., have samples that are too small to reliably detect the true effect) produce inconsistent and imprecise results. Because small samples can produce both overly large and overly small effects, but only the overly large effects are likely to get published, these small sample studies lead to a literature full of exaggerated effects and false positives.
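
The sketch below illustrates this inflation mechanism with invented numbers (a modest true standardized effect of d = 0.30, studied with 20 participants per group, and a literature that "publishes" only significant results); none of these values come from the chapter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

true_d, n_per_group, n_studies = 0.30, 20, 20_000
published_d = []
for _ in range(n_studies):
    a = rng.normal(loc=true_d, size=n_per_group)   # "treatment" group
    b = rng.normal(loc=0.0, size=n_per_group)      # "control" group
    res = stats.ttest_ind(a, b)
    if res.pvalue < 0.05:                          # only significant studies are published
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published_d.append((a.mean() - b.mean()) / pooled_sd)

print(f"True effect:            d = {true_d:.2f}")
print(f"Mean published effect:  d = {np.mean(published_d):.2f}")
# The significance filter more than doubles the apparent effect, which is why
# small-sample literatures look stronger than the phenomena they describe.
```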

Along the way there were plenty of suggestions for how to avoid these kinds of pitfalls. Cohen (1994) emphasized bypassing NHST and focusing instead on reporting effect sizes and confidence intervals, and replicating results. Calls for a transition to Bayesian alternatives to NHST (e.g., Kruschke, 2010), which focused on using observations to successively adjust effect size estimates rather than provide single, isolated tests, built upon the momentum created by Cohen and others. Nevertheless, NHST has retained its status as the dominant approach to statistical inference in psychology. It is also often done inappropriately, as described in detail below.

However, several key events took place during the last decade which have raised awareness of issues with NHST and the importance of reproducibility. One set of events involved apparent evidence for highly implausible observations, such as the existence of extrasensory perception (Bem, 2011) or strikingly high correlations between fMRI variables and personality, emotion, and social cognition constructs (Vul, Harris, Winkielman, & Pashler, 2010). Another was a variety of studies showing that previously reported effects were not reliable. Some of these challenged widely held beliefs in clinical psychology, such as the common view that cognitive behavioral therapy (Cuijpers et al., 2016) or psychotropic medications (Kirsch et al., 2008) are particularly effective treatments for depression. A third was a few high profile cases of actual fraud, including the Wakefield case described above2.

2 To be clear, while data on this are limited, we generally believe that most of the problems with reproducibility are not traceable to overt fraud, although several high profile cases of fraud did help raise awareness about reproducibility issues.

Impacts of the Reproducibility Problem

The discovery of substantial reproducibility issues in psychological research has caused significant upheaval in the field. A major transition is occurring from an old way of doing things to a new, and hopefully better, way. This transition has revealed both bad news and good news for psychological research.

The Bad News

The most obvious bad news is the dissemination of false positive effects in the scientific literature. Ideally, when an article claims to have found something, it is true. Counter-instances should be rare. When the majority of published research findings are true, consumers can have confidence in the results of empirical reviews or meta-analyses. However, when there is a bias for positive findings in the literature, and a tolerance for p-hacking, the proportion of false positives is sure to increase (Smaldino & McElreath, 2016). With a literature that is untrustworthy, even meta-analyses with many samples will over-estimate effects (Simonsohn, Nelson, & Simmons, 2014). In such cases, what appear to be valid explanations for observations are invalid.

This is potentially even more damaging than situations in which the received explanations for phenomena do not have empirical support, such as mother-blaming for psychotic conditions, because explanations that appear to have the imprimatur of science can and should get extra weight in evidence-based mental health care. Evidence-based clinical practice depends upon the publication of valid explanations that connect observations in laboratories with observations seen in the consulting room (Anderson, 2006; Drake et al., 2001). When those explanations are inaccurate, even the most up-to-date and conscientious practitioner can use invalid techniques that do not help, or that harm, her patients.

For instance, a 2011 study in a high profile journal reported the results of a large-scale, well-funded trial of treatments for chronic fatigue syndrome (White et al., 2011). The study authors concluded that cognitive-behavior therapy and exercise were moderately effective at improving functioning in individuals who suffer from chronic fatigue, and that adverse consequences were very rare. This result was covered widely in the popular press and recommended as an approved treatment by trusted institutions such as the United States Centers for Disease Control and Prevention and the Mayo Clinic. However, patients with chronic fatigue questioned the results, and after a prolonged court battle the authors of the original study were compelled to release their data (Rehmeyer, 2016). Re-analyses by independent observers showed that therapy and exercise had minimal benefits for individuals with chronic fatigue (McGrath, 2015; Tuller, 2015).

The retraction of findings that have been previously disseminated can have a negative impact on the public perception of science as well (Carlson, 2006). Data suggest that public understanding of mental health issues is generally poor and that stigma regarding mental health problems and treatments persists (Schomerus et al., 2012). Constant popular press reports that purported explanations of and effective interventions for mental health problems have turned out to be untrue cannot help the field’s image.

Finally, reproducibility problems can have a personal cost for individual scientists and create challenges for the field. Researchers who have followed the old way of doing things to produce novel but non-reproducible effects have been rewarded with degrees, jobs, tenure, and esteem. In contrast, other researchers who have tried to replicate or build upon false effects that they believed to be true have had difficulties publishing their research, with attendant impacts on their careers and status in the field.

The Good News

On the positive side, there is currently a robust movement, in the sciences generally and psychology in particular, focused on “meta-science”, or issues related to how science is conducted (Munafo et al., 2017). Reproducibility is perhaps the core value driving this movement. As described above, there are a number of blogs whose content often involves issues related to reproducibility. There are also other social media and internet platforms focused on the issue. For instance, www.retractionwatch.com is a website that publishes content related to research integrity, including recent retractions in the scientific literature. Another site, www.pubpeer.com, is an online “journal club” in which articles can be searched and on which there are blogs and discussions related to meta-science.

The Open Science Framework (www.osf.io) is an online data repository sponsored by the Center for Open Science, which provides a variety of resources for researchers interested in doing transparent and reproducible work. The Center for Open Science also sponsored research aimed at replicating existing findings across many independent labs (e.g., Open Science Collaboration, 2015), as well as the Society for the Improvement of Psychological Science (http://improvingpsych.org/), an organization of researchers interested in issues related to reproducibility. Major journals have changed their editorial policies in order to encourage reproducible work (e.g., Lindsay, 2015; Cooper, 2016; Vazire, 2015), and many individual researchers have dedicated resources to improving their methods and replicating previous studies.

Collectively these efforts suggest that the old way of doing things will eventually give way to a more reproducible science. However, this movement has differentially affected the sub-disciplines of psychology, and it has not yet made as much progress in clinical psychology as in some other branches (Tackett et al., in press). The remainder of this chapter focuses on best practices for reproducible science, with a focus on clinical psychological research.

Best Practices in Reproducible Science

The surest way to ensure that the scientific literature rests on a solid and reproducible foundation is for individual researchers to use best practices. These practices should ideally be adopted for intrinsic reasons, but the extrinsic pressures to adopt them are growing, even within clinical psychology (e.g., Lilienfeld, 2016). Below we provide guidelines for how to conduct reproducible research (see also: Funder et al., 2014; Munafo et al., 2017; Nosek et al., 2015).

Be Transparent

A virtue of scientific epistemology is that arguments rest on their merits. Cases are made empirically, rather than through appeals to authority or intuition. Making an empirical argument requires the collection of data that would provide a critical test of a hypothesis. In order to convince critical observers, the hypothesis, the methods used to test the hypothesis, and the results need to be available for scrutiny. That way, observers can draw their own conclusions about the quality of the test, the validity of the reported results, and the degree to which the results provide a convincing argument regarding the hypothesis. When information about the test is unavailable, observers cannot draw reasonable conclusions, and the scientific method, and trust in it, breaks down (Vazire, in press).

The lack of transparency about why and how scientists conducted their research was a major enabling factor in the replication problems reviewed above. Rather than clearly stating hypotheses prior to data collection, researchers often described their results as if they were confirming hypotheses that were actually generated after the results were known (Kerr, 1998). Instead of publishing all of the variables collected in the study, people reported only the variables that could be used to find significant results. Rules for when data collection would stop, which participants would be included, and which analyses would be conducted were hidden or were simply not decided on ahead of time. And the data researchers used to draw their conclusions were not made available to other scientists.

These kinds of practices led to results that could not be reproduced by other researchers, often because the methods themselves could not be reproduced. This is a clear violation of the basic premises of the scientific approach to knowing. Researchers can take several specific steps to ensure that their own work is transparent and reproducible, including pre-registering their studies, reporting everything, and distinguishing between exploratory and confirmatory hypotheses.

Pre-register. In the basketball game HORSE, the goal is to make shots that others cannot make. A common strategy in this game is to add a twist that makes a certain shot more difficult, such as bouncing the ball off the backboard before it goes in, or shooting with your non-dominant hand. In order for this game to work, you have to call your shots before shooting. If you make a shot that happens to hit the backboard before going in, your opponent does not know if this was intentional or not, and cannot fairly be expected to make the same kind of shot. On the other hand, if you “call your shot” before shooting, they have to follow the same technique.

Good science works the same way (Nosek & Lakens, 2014). Statistically significant effects can typically be found in any dataset if enough tests are conducted, just as very difficult basketball shots can occur by chance even when they were not intended to be difficult. Therefore, if you believe in a hypothesis and have carefully thought through the best way you can think of to test it, your argument will be most convincing if your data work out the way you predicted beforehand.

A number of resources are currently available to pre-register studies, including individual research websites, the Open Science Framework (OSF) website referenced above (www.osf.io) or other repositories (see http://www.nature.com/sdata/policies/repositories#social; https://aspredicted.org/).

The OSF site allows researchers to post hypotheses, data, methods, analytic plans, results, research reports, and other aspects of the research process with time stamps so that independent observers can evaluate scientific arguments. This information does not need to be made fully public right away, in case there are concerns about “scooping”, data theft, or other concerns related to research propriety.

However, because documentation is timestamped, pre-registering through such sites allows researchers to demonstrate that they made their decisions about how to collect and analyze their data before collecting the data. Such sites also motivate good scientific practice in general by compelling researchers to fully think through their designs and analyses (i.e., “call their shots”). Another advantage of using resources like OSF is that they provide a free repository for storing research materials and methods.

Obviously, some materials in clinical research are confidential or otherwise sensitive, and care needs to be taken to ensure that posting materials or data to sites like OSF is done ethically (Tackett et al., in press).

When pre-registering specific hypothesis plans is not possible, a researcher can still “pre-register” that a study will include only exploratory tests (which essentially commits them to honesty and prevents them from later claiming that something was predicted). In addition, researchers can pre-register “Standard Operating Procedures” (SOP) for their lab, such as how they typically transform variables, deal with outliers, or select covariates (Lin & Green, 2016).

Report Everything. Reproducible science depends upon the full reporting of everything relevant to a given study (Funder et al., 2014), because independent researchers cannot evaluate or replicate studies without all of the information relevant to those studies. This includes information pertinent to different steps in the process, such as the instrumentation used in the study, the nature of the sample, the analyses used to test hypotheses, the results of those analyses, and the actual data used in the study. With respect to instrumentation, it is important to report exactly which instruments or methods were used, including which version of the instrument in cases where there are multiple versions. It is also critical to report estimates of the reliability of study instruments within the study in which inferences are being made, because reliability is a characteristic of an assessment tool within a particular sample, not a characteristic of a tool in the abstract.

Sample information includes the population from which the sample was drawn and relevant demographic characteristics. It is often of interest to examine and report differences between sub-samples (e.g., women and men) on study variables, and in some cases to establish measurement equivalence between sub-samples on those variables. At times there are reasons to remove individuals from the sample prior to analysis, because of evidence that the participant did not produce reliable data or due to extreme scores on study variables. While this can be appropriate in some instances, it has also been used to “find” statistically significant effects in a sub-sample that are not present in the full sample.

This approach raises concerns about which came first: the principled removal of certain outliers based on pre-determined criteria or the discovery of significant effects after certain participants have been removed. We therefore recommend establishing outlier removal rules prior to data collection and analysis, reporting results with and without outliers, and making the data available so that others can access the full sample. We would also suggest a generally conservative approach to the issue of outliers.
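
As a simple illustration of this recommendation, the sketch below applies a hypothetical pre-specified rule (exclude observations more than 3 SD from their group mean) and reports the focal test both with and without the exclusions; the data, the rule, and the function name are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated two-group data for illustration only.
group_a = rng.normal(0.4, 1.0, 60)
group_b = rng.normal(0.0, 1.0, 60)

def apply_outlier_rule(x, z_cut=3.0):
    """Pre-specified rule: drop observations more than z_cut SD from the mean."""
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) < z_cut]

for label, a, b in [("full sample", group_a, group_b),
                    ("rule applied", apply_outlier_rule(group_a), apply_outlier_rule(group_b))]:
    res = stats.ttest_ind(a, b)
    print(f"{label:12s}  t = {res.statistic:5.2f}, p = {res.pvalue:.3f}, "
          f"n = {len(a)} vs {len(b)}")
# Because the rule was fixed in advance and both analyses are reported,
# readers can see that the conclusion does not hinge on post hoc exclusions.
```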

There are many ways to analyze data to test a given hypothesis. Decisions such as which variable to use as the primary outcome, whether or not to use covariates, which covariates to use, how to score variables (e.g., dimensional measures or extreme groups), and which specific analytic technique to use will impact the results. It is common for different techniques to give different answers, particularly when effects are small, as they often are in clinical psychology. This creates significant opportunities for “researcher degrees of freedom” (Simmons, Nelson, & Simonsohn, 2011). In other words, the availability of alternative approaches to analyzing data means that researchers motivated to find and publish a significant effect can analyze the same data multiple ways and only report significant results (i.e., “p-hack”). This is a probable pathway to false-positive findings that will not replicate. As an alternative, we recommend determining the analytic approach for testing a hypothesis before conducting the analyses, based on the research question; reporting the results of all analyses conducted; and making the data available to others who might want to analyze the data differently. We would also suggest that, all things being equal, the simplest analysis that provides an adequate test of the focal hypothesis is generally best.

Researchers sometimes report only whether or not a p-value fell below a threshold, without providing information about the magnitude of the effect or the actual test statistic value.

There are at least four problems with this practice. First, results without information about effect sizes and confidence intervals do not provide enough information for readers to interpret the practical significance of the finding. Second, insufficient reporting makes it impossible for future reviewers and meta-analysts to collate the results with those from other samples. Third, techniques designed to evaluate replicability (e.g., p-curve analysis) rely on exact p-values and other detailed information about statistical tests. Studies that do not report this information cannot easily be examined for replicability.

Finally, we all make mistakes in calculations, syntax, and other aspects of testing research hypotheses.

When all of the information about a hypothesis test is reported, such mistakes can more readily be identified and corrected than when information is limited to the statement that the test statistic fell below a certain threshold. We recommend reporting effect sizes, confidence intervals, and exact p values (or analogous results in other, e.g., Bayesian, frameworks) for all statistical tests in a study.

Finally, in an ideal case, researchers would make the data used in their study available for other researchers to use. As described above, the OSF and other sites make this possible to do with relative ease and safeguards against scooping and other concerns. There are situations in which this is still not desirable. For instance, there are large scale collaborative projects in which funded researchers are actively working on a project. Having put significant resources into obtaining funding and collecting data, those researchers should perhaps have first rights to analyze and write up the data. In that case, it might be reasonable to post only the variables used in a given manuscript rather than the entire dataset, and to have a plan for posting the full data after they have been fully collected and analyses related to primary aims are completed and published. There may also be situations, particularly in clinical psychology, where ethical issues such as re-identification risk would prevent full posting of data (Tackett et al., in press). These issues aside, we recommend that researchers make data available whenever they can. We hope that this eventually becomes the default practice in psychological research.

Distinguish exploratory and confirmatory hypotheses. It is possible to misinterpret recommendations to pre-register studies and report everything as strictures against exploratory research. Objections might follow that exploratory research is an integral part of the creative scientific process (Goldin-Meadow, 2016). The best ideas sometimes occur by accident, because people find unexpected effects in their data. To lose the possibility of principled scientific exploration would slow progress in an already slow-moving endeavor.

We agree that there needs to be room for exploration in psychological research. This includes the principled “fishing” for significant effects, meaning trying to find meaningful relationships in data that were not hypothesized. The critical issue is distinguishing between confirmatory and exploratory research. Confirmatory research involves testing hypotheses that were predicted before the data were collected, using pre-registered methods. Exploratory research involves examining data for potentially meaningful results that could form the basis for future confirmatory research.

Both confirmatory and exploratory research are critical aspects of science, but it is also critical to distinguish them. Obviously, successful confirmatory research should inspire more confidence than positive exploratory research. This distinction gets blurry when results that are actually exploratory are written up as if they were confirmatory (Kerr, 1998). From this perspective, pre-registration and full reporting enable a clearer distinction between confirmatory and exploratory research. We would hope that this kind of clear distinction does not reduce opportunities to publish results, but instead allows for calibration of conclusions to the strength of the evidence.

Exploratory and Confirmatory Hypotheses in Multivariate Modeling. Many of the issues described above have most often been discussed in the case of experimental, bivariate research (e.g., where experimental and control groups are compared on some outcome variable). However, all of the suggestions for best practice extend to multivariate research in which models are tested based on the covariance structure of some set of variables (e.g., confirmatory factor analyses, structural equation models; see Hoyle, 2014 or Kline, 2016 for general reviews). In these kinds of models, the relationships among a number of variables are conceptualized in terms of some underlying theory or conception about how they can be represented in nature.

For example, a researcher might hypothesize that the 10 items on a depression measure correlate because they are all indicators of the same latent construct, depression. In a slightly more complex model, there might be two constructs such as depression and anxiety, and a model might be fit that posits that 10 items load on depression and only depression, 10 other items load on anxiety and only anxiety, and any covariance between the depression and anxiety items can be accounted for by the correlation between the latent depression and anxiety constructs. The adequacy of the model is judged by indicators of how well the data statistically “fit” the theoretical model. These indicators essentially ask the question: if nature really worked the way the model proposes, would the variables correlate this way in these data?
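
The toy simulation below makes this comparison concrete. The loadings, factor correlation, and sample size are invented for illustration, and the snippet is not a substitute for fitting a confirmatory factor model in dedicated software; it simply shows what a model-implied correlation pattern looks like next to observed correlations.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy two-factor model: 10 items load only on "depression", 10 only on
# "anxiety", the factors correlate .6, and all loadings are .7.
n, loading, phi = 1000, 0.7, 0.6
latent_cov = np.array([[1.0, phi], [phi, 1.0]])
factors = rng.multivariate_normal([0.0, 0.0], latent_cov, size=n)

noise_sd = np.sqrt(1 - loading ** 2)                 # keeps item variances at 1
dep_items = loading * factors[:, [0]] + rng.normal(0.0, noise_sd, (n, 10))
anx_items = loading * factors[:, [1]] + rng.normal(0.0, noise_sd, (n, 10))
items = np.hstack([dep_items, anx_items])            # 20 observed items

observed_r = np.corrcoef(items, rowvar=False)
within = observed_r[:10, :10][np.triu_indices(10, k=1)].mean()
between = observed_r[:10, 10:].mean()

print(f"Mean within-factor correlation: {within:.2f} (model implies {loading**2:.2f})")
print(f"Mean cross-factor correlation:  {between:.2f} (model implies {loading**2 * phi:.2f})")
# Fit indices in CFA software formalize exactly this comparison between the
# model-implied and the observed covariance matrices.
```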

This kind of data analysis can risk the creation of models that are unlikely to generalize in order to achieve acceptable statistical fit between model and data within a certain sample (i.e., overfitting).

There are many ways to arrange variables in a covariance model, particularly as the model becomes more complex. There are also a variety of ways to alter models to achieve “good” fit. For instance, you can allow the error terms associated with two specific items (i.e., the variance that is not explained by a higher-level factor) to covary, even though there is not necessarily a good theoretical reason to do this.

Or, parameters can be fixed, such as when you specify that there should be no variation in a certain variable. As a general rule, the more models are altered, the less likely they will generalize to new samples. Thus model alterations should be done with caution. Problems associated with altering models to achieve good fit are compounded when these alterations are presented as if they were predicted all along. As with experimental studies, we would view pre-registration as a necessary condition for calling a hypothesis test (i.e., model result) “confirmatory”.

In summary, the most reproducible models are not altered to achieve acceptable fit in a particular sample, and are pre-registered and cross-validated in multiple samples. When models are modified, we recommend that researchers are transparent about what those alterations were and on what basis they were made.

Shift focus from Statistical Significance to Effect Sizes and Confidence Intervals

As discussed above, philosophers of science have been calling for a shift in emphasis from NHST to effect estimation for decades (e.g., Cohen, 1994; Fraley & Marks, 2007; Meehl, 1978). Statistical significance in psychological research almost always means that a result is less likely than some cutoff, often 5%, to have been observed in a given sample if the effect represented by that result did not actually exist in nature. The value of statistical significance is that it specifies how willing researchers are to commit Type I errors (believing an effect which does not actually exist) when there is no effect.

Criticisms of NHST. There are two main criticisms of NHST and the focus on statistical significance. First, it is a counter-intuitive scheme, in that we are actually not usually interested in how likely a given observed effect would be if there was no true effect. We are more interested in how likely it is that there is a true effect, and what the magnitude of the effect is. Second, significance values are influenced by both the magnitude of effects and statistical power, which is determined largely by sample size. In very large samples, very small effects will be statistically significant, whereas in very small samples, large effects may not be. Thus statistical significance tests do not provide direct information about the effect of interest and can only be interpreted in the context of sample size and other factors related to statistical power.

Effect Sizes and Confidence Intervals. Effect sizes convey information about the direction and magnitude of a hypothesized effect. There are a variety of effect size coefficients, but the two most common standardized metrics for effect sizes are the correlation coefficient and Cohen’s d coefficient.

When squared, correlations indicate the amount of variability in one variable that can be explained by variability in another variable. Cohen’s d indicates how different two group means are in the metric of the pooled standard deviation across groups. Effect sizes can also be expressed in unstandardized units or estimated for multivariate models. The important point here is that these values indicate what we are generally most interested in when testing univariate hypotheses: the degree to which variables are related or groups differ. However, effect sizes should never be interpreted in the absence of their corresponding confidence intervals. Confidence intervals convey the precision with which we measured the effect, and smaller confidence intervals are better (i.e., indicate more precise estimates). With small samples, it is relatively easy to get extremely large effects by chance, but the confidence intervals around these point estimates of the effect size will also be very large. Thus, the point estimates alone can be very misleading. Conversely, very small effects can nevertheless be practically meaningful (e.g., if we have identified a predictor of longevity), but only if they are estimated very precisely.
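
The point about precision can be illustrated with a brief sketch that uses the standard Fisher z approximation for a correlation's confidence interval; the sample sizes and the r = .20 point estimate are illustrative choices, and the helper function name is ours.

```python
import numpy as np
from scipy import stats

def r_confidence_interval(r, n, level=0.95):
    """Approximate confidence interval for a correlation via the Fisher z transform."""
    z = np.arctanh(r)                         # Fisher z of the estimate
    se = 1 / np.sqrt(n - 3)                   # standard error of z
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

for n in (30, 200, 2000):
    lo, hi = r_confidence_interval(0.20, n)
    print(f"n = {n:4d}:  r = .20, 95% CI [{lo:+.2f}, {hi:+.2f}]")
# At n = 30 the interval spans zero and small negative values; at n = 2000
# the same point estimate is pinned down quite precisely.
```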

The importance of effect sizes can be illustrated by the following example. If a clinician were selecting between two treatment approaches for a patient diagnosed with schizophrenia, and all she knew was that one treatment was more effective than the other, the responsible decision would be to choose the more effective treatment. However, if she also knew that the more effective treatment required twice the resources and came with a higher risk of adverse effects, the decision would become more complicated. In this situation, the effect size would be extremely valuable information. If the effect size difference between the two treatments were small, the clinician would be more inclined to choose the treatment that was slightly less effective but also less costly and which came with fewer risks. If the effect size difference were large, that treatment might still be desirable despite these other factors. Moreover, the confidence that the clinician should have in the effect size estimates should depend on how small the confidence intervals around those effect sizes are.

Conduct Adequately Powered Studies. Statistical power is the probability that a true effect in nature would be detected as statistically significant in a given experiment. From an NHST perspective, power is the probability of detecting a real effect (i.e., rejecting the null hypothesis of no effect) if a real effect exists.

Studies should generally be designed to have sufficient power to detect the effects being tested. A typical convention for power is .80, which means that there is a 20% chance that a researcher would fail to detect an effect even though there is a real effect in nature (Cohen, 1988). However, in order to determine what sample size is needed to achieve 80% power (or any other level of power), we must make an assumption about the likely size of the effect. It is very easy to go wrong at this step, especially because if we knew the size of the effect we were looking for, we wouldn’t need to conduct research on it. Thus, some researchers have called for alternative ways to determine sample size, such as planning for precision (Cumming, 2014) or sequential analysis (Lakens, 2014).

There are several ways to enhance the power to find effects. The most obvious is to increase one’s sample size, and there have been many calls for researchers to take this recommendation very seriously (e.g., Fraley & Vazire, 2014). Another advantage of large samples is that they lead to more precise estimates of effect sizes (i.e., confidence intervals around effect sizes decrease as sample sizes increase). Another approach to increasing power is to use instruments that are as reliable as possible, because measurement error lowers the sensitivity of instrumentation to detect effects. In group comparison designs, increasing the homogeneity of groups can have a similar effect, because within-group heterogeneity contributes to statistical error.

Power is related to reproducibility because many findings that have failed to replicate were originally found in very small samples. This is intuitive when you consider that most published effect sizes in psychology are around r = .20 (Fraley & Marks, 2007; Richard, Bond, & Stokes-Zoota, 2003), and this is likely an overestimate because publication bias inflates effect sizes. 194 participants are required to have .80 power to observe statistical significance when a .20 correlation is anticipated. Thus, any time a sample in a correlational study is less than 200 and the effect seems likely to be in the average range of effects found in psychology, it is reasonable to be skeptical about the finding.
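
The figure of 194 participants follows from a standard power calculation; the short sketch below reproduces it using the Fisher z approximation (a minimal illustration with a helper function we named ourselves, not a replacement for dedicated power analysis software).

```python
import numpy as np
from scipy import stats

def n_for_correlation(r, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect a correlation r with a
    two-tailed test, using the Fisher z approximation."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    effect = np.arctanh(r)                    # Fisher z of the target correlation
    return int(np.ceil(((z_alpha + z_beta) / effect) ** 2 + 3))

print(n_for_correlation(0.20))   # 194, matching the figure cited above
print(n_for_correlation(0.10))   # 783: smaller effects need much larger samples
```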

Power can be difficult to achieve, because psychological measurement is imperfect and there are practical issues associated with getting large samples. Indeed, small samples can be defended in certain cases, such as when there are ethical issues related to sampling (e.g., in research with animals or vulnerable human populations or research that uses resource-intensive methods). However, even in these cases, it is not clear that running underpowered studies with small samples is more ethical than running well-powered studies, because even though the latter accrue more costs, they also produce more benefits (i.e., more reliable findings). In other words, if it is ethically questionable to kill 50 mice to produce a robust finding, it does not follow that it is more justifiable to kill only six mice to produce a flimsy finding that is not likely to be true. Moreover, the fact that large samples are difficult or impossible in such research does not protect the results from issues related to low power. There are partial remedies for underpowered studies, including explicit reporting of confidence intervals accompanied by an acknowledgement of limited statistical power, or Bayesian data analytic frameworks in which the reduced confidence in an effect that comes with low power is an explicit part of the model.

Bayesian Inference. The tension between a focus on statistical significance and a focus on effect size actually goes back to the earliest days of inferential statistics. The 20th century statistician Sir Ronald Fisher and others (e.g., Neyman and Pearson) advocated a frequentist hypothesis testing framework that led to modern NHST and effect estimation approaches. Thomas Bayes, who lived nearly two centuries before Fisher, is credited with an alternative approach that focused on confidence in observed effects rather than the probability of inaccurate conclusions. In a Bayesian approach, a prior (prediction) is made before data are collected about the effect that will be observed in an experiment. This prior can be based on previous evidence regarding the effect or a researcher’s educated guess. In either case it is quantified and becomes a formal aspect of evaluating the validity of a hypothesis via combination with the actual magnitude of the observed effect (i.e., the likelihood). By combining the prior with the observed value, a researcher creates a posterior estimate. If the prior is extreme (i.e., strongly favors one expected result), a result that does not match that prior may not have much of an effect on the posterior. On the other hand, if the result was based on strong methods (e.g., precise instrumentation in a well-controlled study with a large sample) and the prior was weak (e.g., gave many different results equal probability of being true), then the posterior will be significantly different than the prior following the observation.
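
The arithmetic of this prior-to-posterior updating can be illustrated with the simplest case, a conjugate normal-normal model. The observed effect, its standard error, the two alternative priors, and the function name below are all invented for illustration.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, observed, se):
    """Conjugate normal-normal update for a single effect size estimate."""
    prior_prec, data_prec = 1 / prior_sd ** 2, 1 / se ** 2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * observed)
    return post_mean, np.sqrt(post_var)

# Hypothetical study result: observed d = 0.60 with a standard error of 0.15
# (roughly what a two-group design with about 90 per group might yield).
observed, se = 0.60, 0.15

# A skeptical, fairly strong prior centered on zero barely moves...
print(normal_posterior(0.0, 0.10, observed, se))   # posterior mean ~ 0.18
# ...whereas a weak (diffuse) prior lets the data dominate.
print(normal_posterior(0.0, 1.00, observed, se))   # posterior mean ~ 0.59
```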

It has been argued that Bayesian approaches are more intuitive than frequentist approaches like NHST. That is, normally scientists have expectations about what they will find but are open to adjusting those expectations based on the results of their research. In contrast, frequentist approaches are somewhat counter-intuitive, because scientists, like other people, don’t usually think in terms of the likelihood that they would observe something if it weren’t actually true.

Bayesian approaches to data analysis have been offered as one solution to reproducibility problems. In general, any approach which requires researchers to make formal predictions prior to data collection and which focuses on effect sizes and precision of effect estimates could help yield more reproducible science. However, there is nothing inherent in the Bayesian approach that requires formal predictions to be made ahead of time (i.e., pre-registration), or prevents tinkering with analyses to get a desired result, any more than the rules of NHST technically prohibit these practices.

Another touted advantage of Bayesian approaches is that they build upon previous research by definition. There would seem to be more value in previous results, which can inform prior estimates, than would be the case in frequentist research, in which every study tests hypotheses anew. However, we caution that Bayesian approaches to data analysis are not a panacea for reproducibility problems. In principle, researchers could still change their priors after the fact, exercise researcher degrees of freedom in selecting their participants, variables, or analyses, or selectively report their results. Recent trends suggest that the field may move in the direction of greater balance between Bayesian and frequentist approaches to data analysis. All things being equal, that is probably positive, if for no other reason than this balance provides researchers with more tools with which to engage in reproducible science (though it may also provide more opportunities to obfuscate).

Replicate

Replication is the sine qua non of the scientific method. Without replication, there is no way to know how reliable a particular observation is, and thus no way to know how well an observation fits with a given explanation for some natural phenomenon. As described above, replications range from direct to conceptual. Direct replications increase confidence in the particulars of a given study or experiment, whereas conceptual replications increase generalizability and test the boundaries of a given effect.

Historically, replications, and particularly direct replications, have been thought of as relatively unexciting because they do not add new information to the literature. However, it is important to remember that this view assumes that the original effects were valid, which has often not been the case in psychological research. As a result of this view, replications have been difficult to publish, with some journals explicitly stating that they are only interested in novel research. There has been a related distaste for null results, leading to the “file drawer effect” (Rosenthal, 1979) in which there is a systematic bias in the literature in favor of positive results. This includes false positive results, which skews empirical or meta-analytic reviews of the literature.

One positive outcome of the meta-science movement is that a number of journals now explicitly state their interest in replication reports. Whereas replications used to be very difficult to publish, there are now outlets for this kind of work. We encourage researchers to replicate their own work and the work of others. In general, we would suggest a progression from direct to more conceptual replication, such that studies in which novel effects are reported are initially followed by direct replications to ensure the reliability of those effects, and then eventually followed by conceptual replications to examine the boundaries of the effect. In clinical treatment research, for example, a randomized controlled efficacy trial suggesting the superiority of a treatment over treatment as usual might first be followed by a replication by different researchers. Having established a reliable effect, further research might be conducted in the form of effectiveness trials, which test the validity of the treatment in the community, or dismantling trials, in which specific elements of the treatment are evaluated for their specific effect on the clinical outcome.

In some instances, replication can be stressful for personal or political reasons. If researcher A publishes a study suggesting that the treatment they developed is superior to other treatments, researcher B might be loath to publish data suggesting that the treatment is similar in effectiveness to other approaches. Adversarial collaborations are useful in such situations. In an adversarial collaboration, researchers A and B would work together to design and conduct a study, so that both parties would presumably agree regarding the value of that study and accept the results, whatever they may be.

When there are power differences, such as when researcher A is a senior scholar with an editorial position at a major journal whereas researcher B is a graduate student, the stress associated with replicating others’ work can be amplified. It is also the case that replication studies, despite their importance, may never be valued as much as novel research. For this reason, young researchers whose careers depend on producing impactful research may be punished for focusing on replication studies.

We hope that these views will change, and that all scholars will be rewarded fairly for contributing to science, including via conducting replications. However, in the current state of the field we would suggest that there is a particular responsibility for more senior, tenured psychologists to welcome replications of their own work and conduct replication research themselves.

Summary

There is now a clearly wrong but common way to do psychology research. First, you plan a study (with or without a prediction in mind) and begin collecting data. You collect data until you get the effect you expected, aiming for the smallest sample that will allow you to detect it. You use several tools at your disposal to try to uncover the effect you are looking for, including eliminating outliers, swapping variables, adding covariates, or analyzing the data in different ways until the effect appears. If you still do not get the effect you expected, you change your hypotheses but write up the results as if the modified hypothesis had been predicted all along. If you do not get any effect at all, you sweep the study under the rug (or into the file drawer) and try again, or move on to a different question, without publishing the results. Any replications that are conducted are conceptual extensions of the original effect rather than direct replications. Journals, employers, funders, and the press reward novel and interesting findings and do not require pre-registration, transparency, or robust evidence. The literature becomes littered with false positives to the point that it is unclear how to interpret any effect, leading to a crisis of confidence in the entire endeavor. In clinical psychology, this crisis has significant negative consequences for public health.
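The cumulative effect of this kind of analytic flexibility is easy to demonstrate. The sketch below is a minimal simulation of one such practice, "collecting data until the effect appears": it assumes an illustrative stopping rule of adding ten participants per group and re-testing, up to 100 per group. Even with no true effect at all, the false positive rate climbs well above the nominal 5%.

```python
# A rough sketch of how optional stopping ("collect until it works")
# inflates the false positive rate when there is no true effect.
# The batch size, cap, and stopping rule are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations, batch, max_n = 5000, 10, 100

false_positives = 0
for _ in range(n_simulations):
    a, b = np.empty(0), np.empty(0)
    while a.size < max_n:
        # both groups come from the same population: no real effect
        a = np.concatenate([a, rng.normal(0, 1, batch)])
        b = np.concatenate([b, rng.normal(0, 1, batch)])
        _, p = stats.ttest_ind(a, b)
        if p < .05:  # stop as soon as the test "works"
            false_positives += 1
            break

# With repeated peeking, the observed rate is typically well above .05
print(f"nominal alpha: .05, observed false positive rate: "
      f"{false_positives / n_simulations:.3f}")
```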

We have described a way to do psychological research that is more likely to lead to reproducible findings. In this alternative approach, scientists think through and pre-register their studies. They make all methods and data open to the public to the extent possible. Studies have adequate power and use maximally precise measures. Null effects are valued as much as positive effects, and the question shifts from "is this effect real?" to "how large is the effect and how precisely can we estimate it?" Effects are routinely replicated. Journals and universities adjust their incentive structures to reward reproducible science, and the popular press and media follow suit.
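As a concrete illustration of planning for power and precision rather than chasing significance, the sketch below uses a standard normal-approximation formula for a two-group comparison. The target effect size, alpha, and power are assumptions chosen for the example; the output shows both the per-group sample size needed and the approximate width of the confidence interval one could expect around the estimated effect.

```python
# A minimal sketch of planning a study for power and precision.
# Normal approximation for a two-sample comparison of means;
# the smallest effect of interest (d), alpha, and power are assumed.
import numpy as np
from scipy import stats

alpha, power, d = .05, .80, 0.30  # assumed design targets

# Approximate per-group n: n ~= 2 * (z_{1-alpha/2} + z_{power})^2 / d^2
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_power = stats.norm.ppf(power)
n_per_group = int(np.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2))
print(f"~{n_per_group} participants per group for {power:.0%} power at d = {d}")

# Planning for precision: the 95% CI half-width for d is roughly
# z_{1-alpha/2} * sqrt(2 / n) when d is small.
half_width = z_alpha * np.sqrt(2 / n_per_group)
print(f"expected 95% CI: d = {d} +/- {half_width:.2f}")
```

Framing the design question this way makes the trade-off explicit: a sample that is merely large enough to "detect" an effect may still estimate it too imprecisely to be clinically informative.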

It seems clear that researchers from all areas of psychology have engaged in poor practices for a long time, and many were probably unaware they were doing so. This has led to a situation in which much of the published research is of questionable value. In clinical psychology, false positive effects can harm public health and individual lives, as illustrated by some of the examples above. The upside of this situation is that a lot can be learned from the discoveries of the last few decades. One hopes that the hard-earned lessons of the reproducibility problem will lead to a permanent correction and, ultimately, to more reproducible effects that can contribute to a better understanding of psychopathology, health, prevention, and intervention in clinical psychology.

References

Anderson, N.B. (2006). Evidence-based practice in psychology. American Psychologist, 61, 271-285.
Aschwanden, C. (2015). Science isn't broken. Retrieved from http://fivethirtyeight.com/features/science-isnt-broken/.
Baker, M. (2016). Is there a reproducibility crisis? Nature, 533, 452-455.
Belluz, J. (2015). Scientists often fail when they try to reproduce studies. This scientist explains why. Retrieved from http://www.vox.com/2015/8/27/9212161/psychology-replication.
Bem, D.J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407-425.
Carlson, E.A. (2006). Times of triumph, times of doubt: Science and the battle for public trust. Cold Spring Harbor, NY: Cold Spring Harbor Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cooper, L. (2016). Editorial. Journal of Personality and Social Psychology, 110, 431-434.
Cuijpers, P., Cristea, I.A., Karyotaki, E., Reijnders, M., & Huibers, M.J.J. (2016). How effective are cognitive behavior therapies for major depression and anxiety disorders? A meta-analytic update of the evidence. World Psychiatry, 15, 245-258.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7-29.
De Young, M. (2015). Encyclopedia of asylum therapeutics, 1750-1950. Jefferson, NC: McFarland and Co.
Drake, R.E., Goldman, H.H., Leff, H.S., Lehman, A.F., Dixon, L., Mueser, K.T., & Torrey, W.C. (2001). Implementing evidence-based practices in routine mental health service settings. Psychiatric Services, 52, 179-182.
Ellenberger, H. (1970). The discovery of the unconscious. New York, NY: Basic Books.
Engber, D. (2016). Cancer research is broken. Retrieved from http://www.slate.com/articles/health_and_science/future_tense/2016/04/biomedicine_facing_a_worse_replication_crisis_than_the_one_plaguing_psychology.html.
Feynman, R.P. & Leighton, R. (1988). Surely you're joking, Mr. Feynman! New York, NY: W.W. Norton & Co.
Fraley, R.C. & Marks, M.J. (2007). The null hypothesis significance testing debate and its implications for personality research. In R.W. Robins, R.C. Fraley, & R.F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 149-169). New York, NY: Guilford Press.
Fraley, R.C. & Vazire, S. (2014). The N-pact factor: Evaluating the quality of empirical journals with respect to sample size and statistical power. PLOS ONE, 9, e109019.
Funder, D.C., Levine, J.M., Mackie, D.M., Morf, C.C., Sansone, C., Vazire, S., & West, S.G. (2014). Improving the dependability of research in personality and social psychology: Recommendations for research and educational practice. Personality and Social Psychology Review, 18, 3-12.
Gelman, A. (2016). What has happened down here is the winds have changed. Statistical Modeling, Causal Inference, and Social Science. Retrieved from http://andrewgelman.com/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/.
Godlee, F. (2011). The fraud behind the MMR scare. British Medical Journal, 342, d22.
Goldin-Meadow, S. (2016). Why pre-registration makes me nervous. Observer, 7. Retrieved February 9, 2017, from https://www.psychologicalscience.org/observer/why-preregistration-makes-me-nervous#.WJzFz7YrLUo.
Gottesman, I.I. & Shields, J. (1973). Genetic theorizing and schizophrenia. The British Journal of Psychiatry, 122, 15-30.
Greydanus, D.E. & Toledo-Pereyra, L.H. (2012). Historical perspectives on autism: Its past record of discovery and its present state of solipsism, skepticism, and sorrowful suspicion. Pediatric Clinics of North America, 59, 1-11.
Hempel, C. & Oppenheim, P. (1948). Studies in the logic of explanation. Philosophy of Science, 15, 135-175.
Hoyle, R.H. (2012). Handbook of structural equation modeling. New York, NY: Guilford.
Isaacs, K. (1999). Searching for science in psychoanalysis. Journal of Contemporary Psychotherapy, 29, 235-252.
Jelliffe, S.E. (1911). Predementia praecox: The hereditary and constitutional features of the dementia praecox makeup. Journal of Nervous and Mental Disease, 38, 1-26.
John, L.K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524-532.
Kerr, N.L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196-217.
Kirsch, I., Deacon, B.J., Huedo-Medina, T.B., Scoboria, A., Moore, T.J., & Johnson, B.T. (2008). Initial severity and antidepressant benefits: A meta-analysis of data submitted to the Food and Drug Administration. PLOS Medicine, 5, e45.
Klein, R.A., Ratliff, K.A., Vianello, M., Adams, R.B., Bahnik, S., et al. (2014). Investigating variation in replicability: A "many labs" replication project. Social Psychology, 45, 142-152.
Kline, R.B. (2016). Principles and practice of structural equation modeling (4th ed.). New York, NY: Guilford.
Kruschke, J.K. (2010). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14, 293-300.
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44, 701-710.
Lin, W. & Green, D.P. (2016). Standard operating procedures: A safety net for pre-analysis plans. PS: Political Science & Politics, 49, 495-500.
Lindsay, D.S. (2015). Replication in psychological science. Psychological Science, 26, 1827-1832.
Lupia, A. & Elman, C. (2014). Openness in political science: Data access and research transparency. PS: Political Science & Politics, 47, 19-42.
Lykken, D.T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70, 151-159.
McGrath, S. (2015). Omission of data weakens the case for causal mediation in the PACE trial. The Lancet Psychiatry, 2, e7-e8.
McIntyre, P. & Leask, J. (2008). Improving uptake of MMR vaccine. British Medical Journal, 336, 729.
Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Munafo, M.R., Nosek, B.A., Bishop, D.V.M., Button, K.S., Chambers, C.D., et al. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021.
Nosek, B.A. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 657-660.
Nosek, B.A., Alter, G., Banks, G., Borsboom, D., Bowman, S., et al. (2015). Promoting an open research culture: The TOP guidelines for journals. Retrieved from https://osf.io/vj54c/.
Nosek, B.A. & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45, 137-141.
Nosek, B.A., Spies, J.R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.
Popper, K. (1950). The logic of scientific discovery. New York, NY: Routledge.
Rehmeyer, J. (2016). Bad science misled millions with chronic fatigue syndrome. Here's how we fought back. Retrieved from https://www.statnews.com/2016/09/21/chronic-fatigue-syndrome-pace-trial/.
Richard, F.D., Bond, C.F., & Stokes-Zoota, J.J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641.
Schomerus, G., Schwahn, C., Holzinger, A., Corrigan, P.W., Grabe, H.J., Carta, M.G., & Angermeyer, M.C. (2012). Evolution of public attitudes about mental illness: A systematic review and meta-analysis. Acta Psychiatrica Scandinavica, 125, 440-452.
Simonsohn, U., Nelson, L.D., & Simmons, J.P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143, 534-547.
Simonsohn, U., Nelson, L.D., & Simmons, J.P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666-681.
Smaldino, P.E. & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3, 160384.
Srivastava, S. (2016). Everything is fucked: The syllabus. The Hardest Science. Retrieved from https://hardsci.wordpress.com/2016/08/11/everything-is-fucked-the-syllabus/.
Tackett, J.L., Lilienfeld, S.O., Patrick, C.J., Johnson, S.L., Krueger, R.F., et al. (in press). It's time to broaden the replicability conversation: Thoughts for and from clinical psychological science. Perspectives on Psychological Science.
Tuller, D. (2015). Trial by error: The troubling case of the PACE chronic fatigue syndrome study. Virology Blog. Retrieved from http://www.virology.ws/2015/10/21/trial-by-error-i/.
Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.
Vazire, S. (in press). Quality uncertainty erodes trust in science. Collabra: Psychology.
Vazire, S. (2016). Editorial. Social Psychological and Personality Science, 7, 3-7.
Vazire, S. (2016). i have found the solution and it is us. sometimes i'm wrong. Retrieved from http://sometimesimwrong.typepad.com/wrong/2016/08/solution-is-us.html.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2010). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274-290.
Wakefield, A.J., Murch, S.H., Anthony, A., Linnell, J., Casson, D.M., Malik, M., Berelowitz, M., Dhillon, A.P., Thomson, M.A., Harvey, P., Valentine, A., Davies, S.E., & Walker-Smith, J.A. (1998). Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351, 637-641.
White, P.D., Goldsmith, K.A., Johnson, A.L., Potts, L., Walwyn, R., et al. (2011). Comparison of adaptive pacing therapy, cognitive behavior therapy, graded exercise therapy, and specialist medical care for chronic fatigue syndrome (PACE): A randomized trial. The Lancet, 377, 823-836.
Yong, E. (2016). Psychology's replication crisis can't be wished away. Retrieved from https://www.theatlantic.com/science/archive/2016/03/psychologys-replication-crisis-cant-be-wished-away/472272/.