
Common Methodological Problems in Randomized Controlled Trials of Preventive Interventions

Christine M. Steeger†, PhD, [email protected]

Pamela R. Buckley†, PhD, [email protected]

Fred C. Pampel, PhD, [email protected]

Charleen J. Gust, MA, [email protected]

Karl G. Hill, PhD, [email protected]

Institute of Behavioral Science, University of Colorado Boulder, 1440 15th St., Boulder, CO 80309

† These authors contributed equally to this work.

Correspondence concerning this article should be addressed to Christine Steeger, Institute of Behavioral Science, University of Colorado Boulder, 1440 15th St., Boulder, CO 80309; phone: 303-735-7146; [email protected]

Acknowledgements: The authors would like to thank Abigail Fagan, Delbert Elliott, Denise Gottfredson, and Amanda Ladika for their comments and critical read of the manuscript, Sharon Mihalic for paper concepts, and Jennifer Balliet for participating in data entry and data coding.

Declarations

Funding: This study was funded by Arnold Ventures.

Conflicts of interest/competing interests: The authors declare that they are members of the Blueprints for Healthy Youth Development staff and that they have no financial or other conflict of interest with respect to any of the specific interventions, policies or procedures discussed in this article.

Ethics approval/consent: This paper does not contain research with human participants or animals.

Data: Available from authors upon request.

Materials and/or Code availability: n/a

Author contributions: Concepts and design (CS; PB; FP); data entry, coding, management, and analysis (CS; PB; FP; CG); drafting of manuscript (CS; PB; FP); intellectual contributions, reviewing, and critical editing of manuscript content (CS; PB; FP; CG; KH). All authors have read and approved the final manuscript.

Abstract

Objective. Randomized controlled trials (RCTs) are often considered the gold standard in evaluating whether intervention results are in line with causal claims of beneficial effects. However, given that poor design and incorrect analysis may lead to biased outcomes, simply employing an RCT is not enough to say an intervention “works.” This paper applies a subset of the Society for Prevention Research (SPR) Standards of Evidence for Efficacy, Effectiveness, and Scale-up Research, with a focus on internal validity (making causal inferences), to determine the degree to which RCTs of preventive interventions are well-designed and analyzed, and whether authors provide a clear description of the methods used to report their study findings.

Methods. We conducted a descriptive analysis of 851 RCTs published from 2010-2020 and reviewed by the Blueprints for Healthy Youth Development web-based registry of scientifically proven and scalable interventions. We used Blueprints’ evaluation criteria that correspond to a subset of SPR’s standards of evidence.

Results. Only 22% of the sample satisfied important criteria for minimizing biases that threaten internal validity. Overall, we identified an average of 1-2 methodological weaknesses per RCT. The most frequent sources of bias were problems related to baseline non-equivalence (i.e., differences between conditions at baseline) or differential attrition (i.e., differences between completers versus attritors or differences between study conditions that may compromise the randomization). Additionally, over half the sample (51%) had missing or incomplete tests to rule out these potential sources of bias.

Conclusions. Most preventive intervention RCTs need improvement in rigor to permit causal inference claims that an intervention is effective. Researchers also must improve reporting of methods and results to fully assess methodological quality. These advancements will increase the usefulness of preventive interventions by ensuring the credibility and usability of RCT findings.

Keywords: Randomized controlled trial, RCT, preventive interventions, internal validity, CONSORT

Introduction

Randomized controlled trials (RCTs) are often considered the gold standard for determining experimental validity and the causal effects of preventive interventions (Shadish, Cook, & Campbell, 2002; West & Thoemmes, 2010). With high-quality implementation, RCTs allow for causal inferences and estimates of average treatment effects that are more reliable and credible than those from other empirical methods (Deaton & Cartwright, 2018). Despite the strength and appropriateness of the RCT to evaluate an intervention (i.e., program, practice, or policy), simply using an RCT design and reporting results is not sufficient to determine whether an intervention “works.” Given that poorly implemented RCTs may produce biased outcomes (Schulz, Altman, Moher, & Group, 2010), an RCT must be correctly designed, implemented, and analyzed in order to make causal inferences and claim beneficial effects of an intervention. That is, RCTs must be internally valid to minimize several sources of bias (systematic error).

When feasible and appropriate, randomization is necessary to ensure sound causal conclusions of positive intervention effects, which inform policy and practice decisions for communities (Montgomery et al., 2018). Social and psychological interventions, however, are often complex and contextually dependent upon the difficult-to-control environments in which they are delivered (e.g., schools, correctional facilities, health care settings; Bonell, 2002; Grant, Montgomery, et al., 2013; Grant, Mayo-Wilson, Melendez-Torres, & Montgomery, 2013). Understanding RCTs therefore requires a detailed, transparent description of the interventions tested and the methods used to evaluate them (Grant, Montgomery, et al., 2013). Transparent reporting is crucial for assessing the validity and efficacy or effectiveness of intervention studies used to inform evidence-based decision making and policymaking.

To guide social science researchers in the requisite methodological criteria for establishing efficacy (i.e., the extent to which an intervention does more good than harm when delivered under optimal conditions) and effectiveness (i.e., intervention effects when delivered in real-world conditions; Flay et al., 2005), two seminal papers on the methodological standards of evidence were developed by prevention scientists and endorsed by the Society for Prevention Research (SPR). With a goal of increasing consistency in reviews of prevention research, these standards originated in Flay et al. (2005) and were updated in Gottfredson et al. (2015) as the prevention science field has progressed in the number and quality of preventive interventions for reducing youth problem behaviors.

How well researchers apply and report the SPR standards of evidence criteria for high-quality RCTs in the prevention science field, however, is unknown. This paper uses a subset of the SPR standards of evidence that must be met for preventive interventions to be judged “tested and efficacious” or “tested and effective” (Flay et al., 2005; Gottfredson et al., 2015). We focus on threats to internal validity to determine whether RCTs of preventive interventions are well-implemented and well-reported – and if not, what are the most common design and analysis flaws? And what information on methods is missing? This study’s larger goal is to improve the design, analysis, and reporting of potential threats to internal validity of intervention research that uses an experimental design. To answer these research questions, we present findings from a large-scale descriptive analysis of RCTs testing intervention program efficacy or effectiveness using the Blueprints for Healthy Youth Development online clearinghouse database. Blueprints identifies scientifically proven and scalable interventions that prevent or reduce the likelihood of antisocial behavior and promote a healthy course of youth development (Buckley, Fagan, Pampel, & Hill, 2020; Fagan & Buchanan, 2016; Mihalic & Elliott, 2015).

The Status of RCTs in the Prevention Science Field

Increased public and private funder investments in experimental studies of social programs have led to a higher volume of preventive intervention research over the past several decades (Bastian, Glasziou, & Chalmers, 2010). Along with a greater number of published intervention studies, there is some evidence (up to 2010) that the methodological rigor in designing and evaluating RCTs has improved over time for trials in the medical and child health fields (Falagas, Grigori, & Ioannidou, 2009; Thomson et al., 2010). Still, more recent publications have highlighted that many RCTs in the social sciences have design and/or analysis flaws (Ioannidis, 2018), which contribute to sources of bias that weaken internal validity and question causal claims of intervention effectiveness.

In addition, the use of RCT findings for informing policy and practice decisions is hindered by poor or incomplete reporting of study design, procedures, and analysis (Montgomery et al., 2018; Walleser, Hill, & Bero, 2011). Grant, Mayo-Wilson, and colleagues (2013) were the first (to our knowledge) to conduct a comprehensive review of the reporting quality of social programs. They identified a sample of 239 RCTs published in 2010 across 40 high-impact-factor academic journals publishing complex interventions in fields including criminology, education, and social work. Findings revealed that many standards concerning randomization procedures were poorly reported, such as participant allocation to conditions, information about blinding, and details about actual delivery of experimental and control conditions. The authors concluded that important details are routinely missed or inadequately reported in published RCT studies (Mayo-Wilson et al., 2013), making it difficult to evaluate the quality of evidence from potentially meaningful intervention findings.

Evaluating the Internal Validity of RCTs

Several reporting guidelines for assessing the quality of randomized trials are available online, including checklists of information to include when producing reports of RCT findings to ensure transparency of the methods. For example, the CONSORT (Consolidated Standards of Reporting Trials) Statement was developed by an international team of scholars to help biomedical researchers report RCTs transparently (Schulz et al., 2010). The CONSORT-SPI 2018 checklist and flow diagram (Grant et al., 2018) is an extension to the CONSORT 2010 Statement for social and psychological interventions. Another well-known checklist is the Cochrane risk-of-bias tool for randomized trials (RoB2; Higgins et al., 2011; Sterne et al., 2019). The RoB2 generates a score to assess the risk of bias (i.e., low, high, or “some concerns”), and is recommended by the Cochrane Collaboration, an internationally recognized group that focuses on systematic reviews of human health care policy and interventions.

In addition, online clearinghouses “assess applied research and evaluation studies of programs/interventions according to evidentiary (evidence-based) standards” (Means, Magura, Burkhart, Schroter, & Coryn, 2015, p. 101) to help decision-makers identify effective interventions, with a dominant focus on internal validity. Several exist, with up to 20 within the United States alone (Burkhardt, Schroter, Magura, Means, & Coryn, 2015; Means et al., 2015). Two of the longest standing online clearinghouses in the U.S. are What Works Clearinghouse (WWC), established under the Education Sciences Reform Act of 2002, and Blueprints for Healthy Youth Development, founded in 1996. The WWC follows a standards handbook (WWC, 2020) to assess the causal validity of RCTs and quasi-experimental design (QED) studies and is used to review all efficacy or effectiveness grants submitted to the Institute of Education Sciences (the evaluation arm of the U.S. Department of Education). Meanwhile, originally introduced as Blueprints for Violence Prevention, Blueprints for Healthy Youth Development (https://www.blueprintsprograms.org/) was one of the earliest efforts to establish a clear scientific standard for evaluating a program’s evidence, implementing a rigorous expert review process and certifying those programs that met its evidentiary standards. Although there are differences between WWC and Blueprints (e.g., WWC reviews education interventions whereas Blueprints has a broader focus, and Blueprints uses a tiered format for assessing interventions against its evidentiary standards whereas WWC uses an “in/out” rating system), both employ high scientific standards to assess evaluation quality.

SPR Standards of Evidence and Causal Inferences

Due to various reporting tools and evidence standards for assessing internal validity of RCTs, there is a lack of consensus among intervention research fields about what is minimally required to make causal inferences. In the SPR standards, Gottfredson et al. (2015) offer a comprehensive list of methodological requirements for high-quality research evidence that establish an intervention as efficacious, effective, and ready for scale-up. To summarize, Gottfredson et al. discussed a set of essential or desirable cumulative standards for efficacy, which include a sufficient intervention description, measures and their properties, theory testing, valid causal inference, statistical analysis, efficacy claims, and reporting standards. For effectiveness, the standards include intervention description, generalizability, population subgroups, intervention tested (delivery and implementation), outcomes measured, effectiveness claims, and research to inform scale-up efforts. In regard to broad dissemination and readiness for scale-up, the standards mention availability of materials, training and technical assistance, fidelity assessments, improving intervention reach, and studying the scale-up efforts.

Although the SPR standards of evidence are based on several types of experimental validity that are critical for intervention research (Cook & Campbell, 1979; Shadish et al., 2002), we focus here on internal validity, which is often considered the most important type. Whereas we acknowledge the importance of the generalizability of intervention effects to different samples, settings, and measurements, a necessary first step is to establish internal validity, which refers to the ability to make causal inferences about the observed cause and effect relationship between the intervention and outcomes. High-quality and well-implemented RCTs have high internal validity because randomization seeks to eliminate the likelihood of systematic bias in the selection and assignment of participants to experimental conditions and the effects of known and unknown confounds (Bickman & Reich, 2015; Spieth et al., 2016).

Current Study

The current paper uses the Blueprints for Healthy Youth Development (herein referred to as Blueprints) evaluation criteria to examine the extent to which the prevention science field has applied a subset of SPR standards of evidence related to internal validity. The decision to use Blueprints over others (e.g., WWC, CONSORT-SPI, or RoB2) was motivated by 1) Blueprints’ range of youth behavioral outcomes crossing multiple disciplines (e.g., behavioral, mental and physical health, education, juvenile delinquency, etc.), 2) the capacity to map SPR standards of evidence criteria onto Blueprints criteria for evaluating the internal validity of RCTs, and 3) our unrestricted access to the Blueprints database.

Several articles and other resources have described methodological pitfalls in designing and analyzing intervention research (e.g., Deaton & Cartwright, 2018; Martin et al., 2018; Murray, Varnell, & Blitstein, 2004; Song & Herman, 2010; WWC, 2020). To our knowledge, Grant, Mayo-Wilson et al. (2013) have conducted one of the few prior systematic reviews of the methodological design and analysis rigor of RCTs evaluating social, psychological, and health interventions that is most like our study. Their article, however, examined only one year (2010) of peer-reviewed articles published in the social and psychological science fields, whereas our study examines 10 years of reports and refereed journal articles that evaluate interventions encompassing a broader range of outcomes relevant to the prevention science field. Our primary research questions are: In a systematic review of RCTs using the Blueprints registry database over the past decade (2010-2020), 1) What are the most common sources of bias that threaten internal validity?, and 2) Which sources of bias do these studies fail to report in their evaluation findings? We also ask an exploratory question of whether methodological or reporting flaws in RCTs have improved (i.e., fewer flaws or missing methods information, on average) over time.

In this study, we examine the following nine common threats to internal validity: 1) problems with randomization, 2) missing information on attrition, 3) biased measurement from non-independent sources, 4) problems with reliability and validity of measures, 5) lack of intent-to-treat analysis, 6) incorrect level of analysis in clustered designs, 7) no baseline controls for outcomes, 8) non-equivalent groups at baseline, and 9) differential attrition (see Gottfredson et al., 2015; Gupta, 2011; Murray, Taljaard, Turner, & George, 2020; Podsakoff, MacKenzie, Lee, & Podsakoff, 2003; Shadish & Cook, 2009; Shadish et al., 2002; Wadhwa & Cook, 2019).

Methods

Eligibility Criteria

The Blueprints database used for the analysis includes articles and reports that evaluate an intervention for youth designed to 1) prevent or reduce negative behavioral health outcomes (e.g., mental health problems, substance use, delinquency/crime, and other health-related behaviors) or 2) promote positive development (e.g., academic achievement and prosocial behavioral outcomes). The focus on youth limits programs to those targeting ages 0-25 years, an age range that includes post-secondary education and early employment experiences. The one exception comes from studies of interventions designed to reduce recidivism that follow typically young offenders to older ages. Given that the aim of Blueprints is prevention (including universal, selective, and indicated preventive interventions), the database excludes interventions with a sole focus on evaluating treatment programs for diagnosed or clinical-level mental health problems, including medical or pharmacological interventions. Blueprints also excludes articles and reports that only describe process evaluations without behavioral outcomes or that evaluate only cost-effectiveness. In addition, Blueprints considers RCTs and QEDs and excludes studies that use a pre-post design without a control group.

For the present study, eligibility was restricted to RCTs, as they are the most frequently conducted “gold standard” design for ensuring internal validity. The RCTs include experimental trials that randomized individuals or clustered units (i.e., cluster randomized trials). We did not include methodologically strong alternatives to RCTs, such as designs based on regression discontinuity, randomized instrumental variables, or comparative interrupted time-series (Henry, Tolan, Gorman-Smith, & Schoeny, 2017; West, 2009; Wing & Cook, 2013). They are less common in the Blueprints database, involve different standards for internal validity, and require separate study from RCTs. We also restricted our sample to experimental studies in the Blueprints database whose initial publication was in the last decade, from 2010 to when the present data analysis was conducted in August 2020. The 10-year time span allowed for examination of trends while also focusing on relatively recent research that incorporates up-to-date advancements in methods. No other eligibility criteria were used.

Search Strategy

Blueprints uses a systematic search process to locate evaluation studies. The approach has been identified as best able to reduce bias in literature reviews and meta-analytic studies because it entails gathering and reviewing all relevant research studies (Farrington & Petrosino, 2001; Wilson, 2009). Blueprints uses Boolean operators to create multiple search terms, as follows. First, several clauses are used to select academic journals. Second, search terms are applied to locate outcomes for youth relating to physical and mental health, delinquency, education, prosocial behavior, and problem behavior. See Online Resource Table 1 for specific search terms/clauses. Third, these Boolean operators are entered into the Web of Science online search engine, which provides subscription-based access to multiple databases with comprehensive citation data for many different academic disciplines. To locate additional studies, including both new evaluations of previously reviewed interventions and new interventions yet to be reviewed, the Blueprints team searches blogs, webpages, other registries, and research organization sites, and it accepts self-nominations from intervention developers.

Sample

In this section, “reports” refer both to reports of results in the grey literature and peer-reviewed journal articles. Between 1996 (the clearinghouse’s inception) and the time of the present study’s analysis (August 2020), the Blueprints registry database contained an ongoing total of 3,582 reports nested within 1,557 preventive interventions, as each intervention can have one or more studies that includes at least one evaluation report. However, many of the reports are part of the same evaluation, because they examine the same intervention and use the same sample but have several reports publishing different outcomes, follow-up periods, or mediator or moderator analyses. Blueprints combines multiple reports within the same evaluation into what is referred to as a single study. Coding for the quality of RCTs occurred at the study (not report) level. Combining reports that are part of the same evaluation into a single study reduced our sample to 2,568 studies. We then selected studies with a first report published between 2010-2020, which reduced our sample to 1,231 studies. Before 2010, the more narrowly defined focus of Blueprints on delinquency and drug abuse limited the coverage of program evaluations and prevented us from extending the time period any farther back. Last, we selected RCTs (excluding QEDs), leaving an analysis sample of 851 studies (see Figure 1) nested within 631 interventions.

Data Abstraction and Measures

To summarize study problems, Blueprints uses nine criteria that, while not exhaustive, prove important for distinguishing among program evaluations on the strength of their designs. These criteria, listed in Table 1, aim to assess key problems with experimental internal validity and methodological rigor (i.e., evaluation quality). As part of SPR, Gottfredson et al. (2015) and Flay et al. (2005) have described in more comprehensive terms the essential and desirable standards of how to build strong causal evidence for evaluations of preventive interventions. The Blueprints standards were developed independently but overlap with the SPR standards. Table 1 maps the updated Gottfredson et al. efficacy standards onto Blueprints’ evaluation criteria. This was accomplished by matching language and concepts from the Gottfredson et al. standards for intervention validity (Table 1, pages 897-898) to Blueprints’ evaluation criteria for threats to internal validity. The similarity offers some evidence that the Blueprints criteria cover methodological problems considered important for experimental prevention research.

Blueprints uses these criteria as part of its regular review process. For an internal review, staff members with expertise in evaluation design and statistical analysis work in dyads to write a summary of the study and its limitations. They independently read an article or report, with one writing the summary and one serving as a reviewer. They each assess the methodological problems of the evaluation, convene to discuss their assessment, and come to consensus. Then, a senior Blueprints staff member uses the summary – referencing the original article or report as needed – to apply the formal evaluative criteria in Table 1. Interventions that meet requirements for strong methods through the Blueprints internal review process undergo another review by an advisory board of experts in prevention science content areas, evaluation methodology, research design, and statistical methods. The majority of the 851 studies in our analysis sample that met eligibility criteria for the present research were rated by internal reviewers only (n = 699, 82.1%); the other 152 (17.9%) also underwent external review by the advisory board.

The end goal is to produce a set of measures assigned to each study that correspond to the criteria listed in Table 1. A score of 1 on the measures indicates a problem and a score of 0 indicates no problem. In addition, several count measures of methodological sources of bias for each study are equal to the sum of methodological problems, design and analysis problems, and missing/not reported information.

Data Analysis

To answer our primary research questions, we conducted a descriptive analysis of frequencies for each methodological source of bias or methods information that was missing/not reported using the measures that correspond with the evaluation criteria items 1-9 in Table 1. For these analyses, we report means, standard deviations and ranges, and present results for the full sample of 851 RCTs ranging from “high” (i.e., several problems/missing pieces of information) to “low” (i.e., minimal, or no methodological/reporting problems). We answer our exploratory research question with a series of linear regressions, which assess the 10-year proportional change for each flaw and missing piece of methods information over time (from 2010 to 2020).
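As a concrete illustration of this workflow, the sketch below simulates a study-level file and computes per-criterion frequencies, per-study problem counts, and the exploratory year trend regression. It is a minimal sketch: the column names (e.g., pub_year) and the simulated data are illustrative assumptions rather than the actual Blueprints database fields.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-in for the 851-study analysis file: one row per RCT, a
# binary indicator (1 = problem present) for each criterion, and the year
# of the study's first report. Column names are illustrative only.
studies = pd.DataFrame({
    "pub_year": rng.integers(2010, 2021, size=851),
    "baseline_nonequivalence": rng.integers(0, 2, size=851),
    "differential_attrition": rng.integers(0, 2, size=851),
    "incomplete_attrition_tests": rng.integers(0, 2, size=851),
})
flaw_cols = [c for c in studies.columns if c != "pub_year"]

# 1) Frequencies: percent of studies flagged on each criterion (Table 2 style).
print((studies[flaw_cols].mean() * 100).round(1))

# 2) Count of problems per study, with mean, SD, and range (Table 3 style).
studies["n_flaws"] = studies[flaw_cols].sum(axis=1)
print(studies["n_flaws"].agg(["mean", "std", "min", "max"]))

# 3) Exploratory trend: regress the per-study flaw count on publication year.
trend = sm.OLS(studies["n_flaws"], sm.add_constant(studies["pub_year"])).fit()
print(trend.params, trend.pvalues)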

Results

Descriptive Findings

The designs of the 851 studies were nearly evenly split between individual RCTs (n = 442, 51.9%) and cluster RCTs (n = 409, 48.1%). A total of 87.7% (n = 746) were published in an academic journal and 12.3% (n = 105) were unpublished reports (e.g., dissertations; evaluation reports by research organizations; working papers, etc.).

Table 2, Section A shows the percentage of these studies demonstrating sources of bias that threaten internal validity and a description/example of each. In addition, the table reports information missing from RCTs examined in our analysis (Section B). Section A indicates that the percentage of studies with design and analysis problems ranged from 23% (substantial differences between conditions at baseline) to 1% (design confound). Section B shows that the percentage of studies with missing or not reported information ranged from 23% (tests for differential attrition are incomplete) to 3% (no information on attrition).

Next, results for counts of the number of methodological sources of bias per study (described in Table 2) and averages across studies are presented in Table 3. The number of reasons for any methodological problem (i.e., design and analysis problem or missing information) ranged from 5 (<1%) to 0 (22%) problems per study and averaged 1.49 (SD = 1.10) reasons across studies. The number of design or analysis problems ranged from 4 (<1%) to 0 (45%) problems per study and averaged .82 (SD = .89) reasons across studies, and the number of missing/not reported problems ranged from 3 (<1%) to 0 (49%) problems per study with an average of .67 (SD = .75) reasons. Overall, RCTs had an average of 1-2 methodological flaws per study, such as baseline non-equivalence and differential attrition, as well as less frequent problems (i.e., less than 10%) of not adjusting for clustering when the cluster is the unit of assignment, not conducting an intent-to-treat analysis, or using non-independent outcome measures (see Table 2).

Finally, our exploratory linear regression analyses indicated no evidence that study methodological or reporting flaws are improving or worsening across time, for either frequencies of problems or counts of number of methodological sources of bias per study (all ps > .05).

Discussion

The purpose of this paper was to determine how a subset of SPR standards of evidence for assessing internal validity (Flay et al., 2005; Gottfredson et al., 2015) have been applied in studies of preventive interventions for youth over the past decade. Using the Blueprints clearinghouse database, which contains experimental evaluations of these interventions, we conducted a descriptive analysis of 851 RCTs published between 2010 and 2020 to determine 1) common methodological sources of bias that threaten internal validity among RCTs of preventive interventions, and 2) the extent to which potential sources of bias are missing/not reported. Overall, findings showed that some RCTs are well-designed and well-implemented; 22% had no serious sources of bias and provided a clear description of the methods used to evaluate bias. However, the majority (78%) had one or more identified problem(s) and/or missing information. We also separated design or analysis flaws from reporting problems and found that half (51%) lacked a clear description of the methods used to test potential biases and rule out internal validity threats. This finding is consistent with Grant, Mayo-Wilson et al. (2013), who conducted a similar systematic review of the methodological design and analysis rigor of RCTs evaluating social, psychological, and health interventions. Like the present study, Grant and colleagues found that many standards were poorly reported. And lastly, in contrast to some previous research in broader scientific fields (e.g., Falagas, Grigori, & Ioannidou, 2009; Hopewell, Dutton, Yu, Chan, & Altman, 2010; Thomson et al., 2010), our exploratory analyses did not show that methodological rigor or reporting of flaws in RCTs have improved over time.

In the following sections, we first describe the common problems based on our findings, and later discuss less frequent issues, as these potential threats are still important and contribute to overall RCT methodological quality.

Baseline Equivalence and Differential Attrition Issues as the Most Common Flaws

The most frequent flaws identified in our descriptive analysis included: 1) non-equivalent groups (i.e., several baseline differences between conditions), 2) evidence of differential attrition (i.e., differences between completers versus attritors, or baseline differences in the analysis sample after loss of subjects due to attrition), and/or 3) missing or incomplete tests of baseline equivalence and/or differential attrition. We elaborate on these points below.

Baseline equivalence. The goals of examining baseline equivalence are to a) demonstrate that the randomization has worked well by achieving well balanced treatment groups at baseline, b) identify any unlucky imbalances between treatment groups that may have arisen by chance (e.g., “unhappy randomization” due to sampling error, particularly in RCTs with smaller sample sizes) (Cook, 2018; Shadish et al., 2002; Wadhwa & Cook, 2019), and c) add credibility to the trial results by specifically encouraging confidence in unadjusted outcome analyses without any serious bias (Pocock et al., 2002). Gottfredson et al. (2015) suggest that models testing intervention effects should correct for any baseline differences between groups (e.g., control for the variable(s) showing baseline group differences in statistical analyses). However, researchers and other guidelines (e.g., CONSORT-SPI, see below; European Medicines Agency, 2015; Raab, Day, & Sales, 2000; Senn, 1994) argue that it is most important to identify potential variables a priori and adjust for these covariates regardless of whether they are significantly different between study conditions at baseline. From this perspective, post-hoc adjustment for covariates offers only secondary or exploratory analyses.

The SPR standards state that “post-randomization checks on important outcomes measured prior to the intervention should be provided so that the pretreatment similarity of the experimental groups can be assessed” (Gottfredson et al., 2015; p. 904). They do not, however, present specifics on testing for similarity. The Cochrane Collaboration RoB2 guidelines (Sterne et al., 2019; Higgins et al., 2011) say to look for deviations from chance in comparing conditions at baseline but give no further details. What Works Clearinghouse (WWC, 2020) requires calculation of effect sizes in standard deviation units to test for condition differences. Further, WWC only requires testing baseline equivalence for the analysis sample when attrition rates are high, according to their expected bias tools (WWC, 2020). For baseline equivalence, Blueprints requires significance tests of condition differences for all baseline measures, including both outcomes and sociodemographic characteristics of the randomized sample, as a check that the randomization equalized the conditions and avoided the risks of bias and misallocation (e.g., Ioannidis, 2018). Evidence of non-equivalence comes from statistically significant differences across conditions that extend beyond what might be expected by chance. The Blueprints ratings thus rely more heavily on statistical inference and completeness of the tests for diverse baseline measures than do WWC and RoB2; however, all three examine baseline equivalence for RCTs.
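To make these checks concrete, the following sketch shows one way such baseline-equivalence tests could be carried out: significance tests of condition differences across all baseline measures (the inferential approach Blueprints requires) alongside standardized mean differences (the effect-size style of check WWC describes). The data and variable names are simulated assumptions, not any registry's actual code.

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Simulated randomized sample; "condition" is 1 for treatment, 0 for control.
df = pd.DataFrame({
    "condition": rng.integers(0, 2, size=400),
    "baseline_outcome": rng.normal(size=400),
    "age": rng.normal(14, 2, size=400),
    "female": rng.integers(0, 2, size=400),
})

# Significance tests of condition differences for every baseline measure.
for var in ["baseline_outcome", "age"]:
    t, p = stats.ttest_ind(df.loc[df["condition"] == 1, var],
                           df.loc[df["condition"] == 0, var])
    print(f"{var}: t = {t:.2f}, p = {p:.3f}")

chi2, p, _, _ = stats.chi2_contingency(pd.crosstab(df["condition"], df["female"]))
print(f"female: chi2 = {chi2:.2f}, p = {p:.3f}")

# Standardized mean differences, the effect-size style of check WWC describes.
for var in ["baseline_outcome", "age"]:
    g1 = df.loc[df["condition"] == 1, var]
    g0 = df.loc[df["condition"] == 0, var]
    pooled_sd = np.sqrt((g1.var(ddof=1) + g0.var(ddof=1)) / 2)
    print(f"{var}: standardized difference = {(g1.mean() - g0.mean()) / pooled_sd:.3f}")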

In contrast, while CONSORT (Schulz et al., 2010) and CONSORT-SPI (Grant et al., 2018) call for reporting important baseline characteristics by condition, they reject the use of inferential statistics when comparing baseline measures across trial arms. The CONSORT (and by extension, CONSORT-SPI) statements for baseline data describe the use of significance tests as superfluous because such tests assess the probability that baseline differences between conditions occurred by chance, which is assumed in the absence of any bias or misallocation in the randomization to be already known (e.g., Altman, 1985; Altman & Dore, 1990). Instead, CONSORT states that baseline comparisons should be based on the strength and relevance of the variables measured, as well as the size of any chance imbalances that occurred (Altman, 1985).

In sum, inconsistencies in group equivalence approaches taken by different evidence standards and reporting guidelines may partially contribute to our findings of many studies failing to test for (or incomplete testing of) baseline equivalence.

Differential attrition. Most longitudinal studies face loss of participants. Whereas overall attrition does not threaten internal validity, differential attrition by intervention condition is especially problematic because participant attrition can be nonrandom in relation to the outcome measures and introduce systematic bias between the treatment and control groups (Gottfredson et al., 2015; Wadhwa & Cook, 2019). Little and Rubin (2019) note that the loss of randomized or assigned participants is not a problem for internal validity if the loss occurs randomly and does not compromise the original randomization (missing completely at random; MCAR), or if the loss can be fully accounted for by measured baseline variables (missing at random; MAR). Although external validity may be affected by these forms of attrition, unbiased estimates of treatment effects can be obtained with multiple imputation (MI), full information maximum likelihood (FIML), and certain forms of weighting when the data are at least MAR (Kristman, Manno, & Côté, 2005; Puma, Olsen, Bell, & Price, 2009). If the pattern of attrition does not meet MCAR or MAR assumptions and the data are missing not at random (MNAR), the consequences are more serious. This problematic form of missing data can threaten internal validity, bias point estimates and standard errors, and reduce the credibility of intervention evaluations (for reviews on missing data and differential attrition, see Graham, 2009; Jeličić, Phelps, & Lerner, 2009; Nicholson, Deboeck, & Howard, 2017; Schafer & Graham, 2002).
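As an illustration of the MAR-based approaches noted above, the sketch below estimates a treatment effect with multiple imputation by chained equations when follow-up outcomes are missing conditional on an observed baseline score. The data and variable names are simulated assumptions, and the example relies on the statsmodels MICE interface as we understand it.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "condition": rng.integers(0, 2, size=n).astype(float),
    "baseline": rng.normal(size=n),
})
# True treatment effect of 0.3; follow-up depends on the baseline score.
df["followup"] = 0.3 * df["condition"] + 0.6 * df["baseline"] + rng.normal(size=n)
# Make some follow-up scores missing as a function of the observed baseline (MAR).
missing = (df["baseline"] < -0.5) & (rng.random(n) < 0.5)
df.loc[missing, "followup"] = np.nan

imp = mice.MICEData(df)                                  # chained-equations imputation
analysis = mice.MICE("followup ~ condition + baseline", sm.OLS, imp)
results = analysis.fit(10, 20)                           # 10 burn-in cycles, 20 imputations
print(results.summary())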

For attrition, SPR guidelines state that “the extent and patterns of missing data must be addressed and reported” (Gottfredson et al., 2015, p. 907). They note that problems of differential attrition can occur even when rates of attrition are comparable across conditions, and that imputation of missing data does not solve the problem of differential attrition without evidence that data are MAR. Blueprints requires extensive evidence for studies with a loss of more than about 5% of the respondents from randomization to follow-up (Graham, 2009). For internal validity purposes, more important than examining whether different numbers of individuals drop out by intervention condition is examining whether different types of individuals drop out. The tests should show not only that completers do not differ from dropouts on sociodemographics and baseline outcomes, but also that completers and dropouts are statistically similar by condition. The latter requirement can be demonstrated by using logistic regression to predict dropout status with the baseline measures, condition, and the interactions of the baseline measures by condition. It can also be demonstrated by showing equivalence on baseline measures for both the full randomized sample and the analysis sample of completers.
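A minimal sketch of that logistic regression check follows, with simulated data and assumed variable names: the main effects ask whether completers differ from dropouts on baseline measures, and the condition-by-baseline interaction terms ask whether the type of participant lost differs across conditions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 600
df = pd.DataFrame({
    "condition": rng.integers(0, 2, size=n),
    "baseline_outcome": rng.normal(size=n),
    "age": rng.normal(14, 2, size=n),
})
# 1 = lost to follow-up, 0 = completer (simulated here at a flat 15% rate).
df["dropout"] = rng.binomial(1, 0.15, size=n)

# Main effects test completers vs. dropouts; condition-by-baseline
# interactions test for differential attrition across conditions.
model = smf.logit("dropout ~ condition * (baseline_outcome + age)", data=df).fit(disp=False)
print(model.summary())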

The RoB2 checklist (Higgins et al., 2011; Sterne et al., 2019) avoids citing a threshold for problematic attrition, as any threshold depends on the outcome and context. It does, however, recommend checking for differences in attrition or missing data rates between conditions and listing the reasons for attrition. WWC has a standard for RCTs with high attrition that is based on the combination of two pieces of information: the overall rate of attrition and the difference in attrition rates between conditions (Deke & Chiang, 2017; WWC, 2020). Once again, CONSORT (Schulz et al., 2010) and CONSORT-SPI (Grant et al., 2018) differ in that they merely state that attrition rates and reasons for attrition should be reported.

Less Frequent Methodological Problems Related to Internal Validity

As described above, our findings show additional (but less frequent) sources of bias, including incorrect level of analysis, lack of intent-to-treat analysis, and non-independent outcome measures. Given that these criteria are important for establishing internal validity and subsequent causal inferences of interventions, we briefly describe them here. Guidelines to assess these criteria across clearinghouse standards of evidence and reporting tools are generally more consistent than the baseline equivalence and differential attrition criteria discussed above.

Incorrect level of analysis. In published studies, researchers have randomized at one level, such as the school level, but conducted analyses at a different level (e.g., students). Individual-level analyses of clustered data without adjustments are problematic because they violate regression assumptions of independent errors (Shadish & Cook, 2009). Even small violations of this assumption can greatly impact the standard error of the intervention effect estimate and consequently inflate the Type I error rate (Gottfredson et al., 2015; Shadish & Cook, 2009).

Thus, to avoid overstating the statistical significance of the results, multilevel models, robust standard errors, or other methods should be used to account for nested designs (Hedges & Hedberg, 2007; Murray et al., 2004; Raudenbush & Bryk, 2002). For recent reviews discussing design and analysis of group (cluster) randomized trials, see Murray et al. (2018), Murray et al. (2020), and Raudenbush and Schwartz (2020).
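The sketch below illustrates, with simulated data and assumed variable names, two common ways of analyzing a cluster randomized trial at the correct level: a random-intercept (multilevel) model and an ordinary regression with standard errors clustered on the unit of assignment.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_schools, n_students = 40, 25

# Schools (clusters) are randomized; students are the unit of measurement.
df = pd.DataFrame({"school": np.repeat(np.arange(n_schools), n_students)})
school_effect = rng.normal(0, 0.5, size=n_schools)
df["condition"] = (df["school"] % 2).astype(int)
df["outcome"] = (0.2 * df["condition"]
                 + school_effect[df["school"]]
                 + rng.normal(size=len(df)))

# Option 1: multilevel (random intercept) model for the nested design.
mixed = smf.mixedlm("outcome ~ condition", df, groups=df["school"]).fit()
print(mixed.summary())

# Option 2: ordinary regression with standard errors clustered on the
# unit of assignment (the school).
ols = smf.ols("outcome ~ condition", df).fit(cov_type="cluster",
                                             cov_kwds={"groups": df["school"]})
print(ols.summary())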

Lack of intent-to-treat analysis. Strong studies use an intent-to-treat (ITT) approach that analyzes all participants in the treatment or control condition according to their original assignment, regardless of whether a participant completed the intervention program or switched conditions throughout the study. ITT analyses preserve the integrity of the initial random assignment (i.e., maintain the balance and random variation created by random assignment) and provide an unbiased estimate of intervention treatment effects, which is frequently of special policy interest (Gupta, 2011; Shadish & Cook, 2009). If a researcher drops non-completers from analyses, this approach may select only the best participants or those with certain favorable characteristics and thus can lead to biased results. Further, analyses using a subsample of the original randomized sample or involving post hoc exclusions of information can be biased and misleading (Lachin, 2000). Aside from attrition, analyses should reflect the condition to which participants/units were initially randomly assigned (Gottfredson et al., 2015).
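The simulated sketch below (assumed variable names) contrasts an ITT estimate with a completer-only analysis when lower-functioning participants assigned to treatment are more likely to drop out; the completer-only comparison then overstates the treatment effect.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 500
df = pd.DataFrame({
    "assigned": rng.integers(0, 2, size=n),   # original randomized condition
    "ability": rng.normal(size=n),            # prognostic baseline characteristic
})
df["outcome"] = 0.3 * df["assigned"] + 0.5 * df["ability"] + rng.normal(size=n)

# Lower-ability participants assigned to treatment are more likely to drop out,
# so the set of completers differs systematically between arms. (Outcomes are
# simulated for everyone here so the two estimates can be compared directly.)
drop_prob = 0.05 + 0.4 * df["assigned"] * (df["ability"] < 0)
df["completed"] = (rng.random(n) > drop_prob).astype(int)

# Intent-to-treat: analyze everyone by original assignment.
itt = smf.ols("outcome ~ assigned", df).fit()
# Completer-only analysis conditions on post-randomization behavior.
completers = smf.ols("outcome ~ assigned", df[df["completed"] == 1]).fit()

print("ITT estimate:", round(itt.params["assigned"], 3))
print("Completer-only estimate:", round(completers.params["assigned"], 3))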

Non-independent measures. Measures are considered independent if they are free from bias due to a strong incentive or expectancy for a positive intervention outcome (i.e., demand characteristics). Several reviews and meta-analyses discuss the importance of blinding in outcome assessors when possible, given common method biases of rater effects (e.g., social desirability, implicit theories, and consistency in responses) that may affect treatment estimates (Dechartres, Trinquart, Faber, & Ravaud, 2016; Podsakoff et al., 2003; Torgerson & Torgerson, 2003). Gottfredson et al. (2015) stated that there must be at least one form of data (measure) collected by people other than those delivering the intervention. Unless it is clear that they are blind to participant condition, researchers who rate their study participants, teachers who deliver the program and rate students, and parents who are part of a parenting program and rate children do not, according to Blueprints, provide independent measures. Self-report measures and most administrative data are considered independent.

Strengths and Limitations

This is the first study to systematically review a subset of SPR standards of evidence criteria for establishing internal validity by coding a large sample of RCTs evaluating preventive interventions and providing a descriptive summary of common methodological design and analysis flaws. We used evaluation quality criteria established by Blueprints; our analysis is not intended to be exhaustive in describing all possible issues related to research design and analysis. For example, our coding did not explicitly capture nuances of the randomization process. Many studies do not adequately report randomization procedures (see Grant, Mayo-Wilson et al., 2013; Ioannidis, 2018), which may include problems like: 1) study consent after randomization, such that participants know which condition they were assigned to and may disproportionally drop out before beginning the intervention (though this can be observed in substantial differences in treatment and control group consent rates), 2) not describing the timing and process of randomization, 3) use of multiple cohorts with deviations from random assignment throughout the study, 4) a mix of randomization and self-selection into conditions, and 5) using only a subset of randomized individuals in analyses or reassigning individuals to the other intervention condition because of study attrition (also related to intent-to-treat).

Another potential limitation relates to the varied reporting guidelines and evidence standards that exist for ensuring the internal validity of RCTs, which can be confusing for trialists (see Means et al., 2015 for a discussion on this topic). The stringent Blueprints standards for examining baseline equivalence for all RCTs and differential attrition for RCTs with high attrition contribute to the relatively high prevalence of observed problems in our sample. Others (including WWC and RoB2) agree that baseline equivalence and differential attrition are important to test even with RCT designs, particularly when attrition is high. However, the specifics of testing recommendations differ. Additionally, the Blueprints criterion for incorrect level of analysis currently applies only when studies randomly assign groups but analyze outcomes at the individual level with no adjustments for clustered data (i.e., not adjusting for clustering when the cluster was the unit of assignment). Other guidelines may require additional or more stringent statistical adjustments depending on the analytical tests and number of clusters included in the trial (Murray et al., 2018; Murray et al., 2020; Raudenbush & Schwartz, 2020).

In addition, we present the ingredients for a “gold standard” RCT using Blueprints’ evidence standards for minimizing threats to internal validity. Design and analysis methods are correlated and were thus examined collectively; our analysis presents results orthogonally (i.e., by individual design or evaluation flaw), though we attempt to minimize this limitation by also providing a count of flaws per study. Furthermore, missing information and/or unclear descriptions contribute to the complexity and difficulty of rating the evaluation quality of many RCT studies. Although experts in statistics and evaluation design coded the studies, and methods were followed to reach agreement across multiple raters and improve reliability of ratings, human judgment ultimately factored into determining the quality of the RCTs in our sample. As Ioannidis (2018) explains: “Nothing is perfect, and good application of theory, modeling and other empirical data can inform these trials before their conduct and help decide what to do with their results” (p. 55). Ioannidis concludes (and we concur) that, acknowledging these caveats, having more high-quality RCTs is what the field still lacks the most.

Next, while some QEDs (e.g., regression discontinuity designs, comparative interrupted time series, and instrumental variable analysis) may be methodologically strong alternatives when RCTs are less feasible (Henry et al., 2017; West, 2009; Wing & Cook, 2013), we limited the scope of our descriptive analysis and internal validity discussion in this paper to RCTs. Lastly, although Blueprints reviews evaluation studies of preventive interventions conducted both nationally and internationally, no robust empirical data exists specifying where the RCTs in our sample were tested; however, anecdotally we know many were conducted in the U.S. Therefore, the extent to which our findings represent the world’s prevention trials is unknown.

Future Directions and Recommendations

Our study focused on the degree to which RCTs of preventive interventions are well-designed and analyzed, and whether authors provide a clear description of the methods used to report their findings. Though we found no statistically significant reductions in methodological or reporting flaws over the past 10 years, the trend was in a positive direction. More research is needed to replicate this finding and investigate possible explanations. One hypothesis to test is that, as RCTs have become more common for program evaluations, the numbers of low- and high-quality studies have both increased in a way that hides progress.

Future research related to standards of evidence in prevention science may also include a parallel review of methods for assessing and reporting external validity criteria among RCTs (see Gottfredson et al., 2015 for a discussion of external validity standards related to effectiveness and scale-up). Additional work in this area could also incorporate methodological advancements for synthesis of preventive intervention results across multiple trials using high-quality meta-analysis methods (Pigott & Polanin, 2020) or integrative data analysis (IDA) approaches (e.g., Brincks et al., 2018; Curran & Hussong, 2009).

Regarding transparent reporting, WWC and Blueprints have standards for RCTs designed to track subjects by condition at baseline and at each follow-up assessment to ensure participants lost to attrition do not differ across conditions (by baseline measures). The RoB2 also has an “attrition bias” domain to evaluate bias due to missing outcome data (Sterne et al., 2019). Assessing attrition, however, is not possible without complete and transparent reporting of sample sizes at randomization, posttest, and each follow-up assessment (e.g., using a CONSORT Flow Diagram). This effort is being facilitated by preregistration, which involves putting the design, variables, and treatment conditions of a study into an online registry database (e.g., ClinicalTrials.gov) prior to it being conducted (Nosek, Ebersole, DeHaven, & Mellor, 2018).

To further encourage strong methodological and reporting quality, SPR might consider convening a group tasked with helping journal editors and funders in the prevention science field to 1) incentivize authors to describe how their intervention study design meets the Blueprints evidence standards mapped to Gottfredson et al. (2015) (see Table 1), and 2) encourage authors to submit a completed reporting guideline checklist (e.g., CONSORT-SPI) with their manuscript or grant proposal. Doing so will require training, as many prevention scientists conduct RCTs and serve as journal manuscript reviewers and/or external grant reviewers of intervention research (see Chilenski et al., 2020 for a summary of training needs in prevention science). Training topics could be linked to the SPR standards of evidence and/or Blueprints evaluation criteria described in this paper, with an awareness that other resources and guidelines exist in various fields, and that standards may be similar or different on certain internal validity criteria.

Conclusion

Our findings underscore the need for the prevention science field to scrutinize the methodological strength and quality of intervention studies claiming to have a strong evidence base, even among RCTs. Without such scrutiny, authors who claim positive intervention effects could be reporting biased outcomes due to poor design or incorrect analysis. Only studies with strong design and analysis quality, and with a clear description of the methods used to evaluate them, permit causal inferences of program effectiveness. Such improvements will increase the usefulness of preventive interventions by ensuring the credibility and usability of RCT findings.

Compliance with Ethical Standards

This paper does not contain research with human participants or animals. The authors declare that they have no financial or other conflict of interest.

Table 1. Internal Validity Criteria for Blueprints “Evaluation Quality” Standards of Evidence

Blueprints Evaluation Criteria [corresponding Gottfredson et al. (2015) standard in brackets]

___1. Does the study have a high-quality design that is free of threats to the random assignment (e.g., consent after randomization, design confound)? [5.a.-5.b.]
- NO IF: Randomization likely compromised
- NO IF: Design confound
___2. Does the study clearly describe the sample size at each stage of data gathering so that attrition from the randomized sample can be calculated at posttest and each follow-up? [8]
- NO IF: Missing information on attrition
___3. Is measurement of the outcomes done independently from the delivery of the intervention, are outside raters blind to condition, and do participant reports avoid social desirability and demand bias? [3.e.iii]
- NO IF: No independently measured outcomes
___4. Are the measures reliable and valid as shown by acceptable psychometric properties of the measures (e.g., interrater reliability, Cronbach’s alpha)? [3.e.]
- NO IF: Problems with reliability or validity of outcome measures
___5. Does the study use an intent-to-treat analysis by attempting to follow and analyze all subjects as assigned to their original condition? [6.c.]
- NO IF: No intent-to-treat analysis
___6. Was the analysis done at the proper level, with multilevel statistical methods or other adjustments for clustering, when clusters rather than individuals were randomized? [6.b.]
- NO IF: Incorrect level of analysis
___7. Does the analysis control for baseline outcome measures with the use of change scores, baseline outcomes as covariates, or group-by-time interactions? [6.d.]
- NO IF: No controls for baseline outcomes
___8. Does the analysis demonstrate baseline equivalence between conditions with statistical tests, effect sizes, or other measures of condition differences across all baseline sociodemographic and outcome measures? [5.b.ii]
- NO IF: No tests for baseline equivalence
- NO IF: Differences between conditions at baseline
- NO IF: Tests for baseline equivalence are incomplete
___9. Does the study demonstrate with statistical tests that any attrition beyond minimal levels is unrelated to group assignment, sociodemographic characteristics, or baseline outcomes? [5.c.]
- NO IF: Attrition (>5%) and no tests for differential attrition
- NO IF: Evidence of differential attrition
- NO IF: Tests for differential attrition are incomplete

Note. Blueprints’ current evaluation criteria were adapted from Mihalic and Elliott (2015). The bracketed entry after each criterion indicates the corresponding SPR standard of evidence as listed in Gottfredson et al. (2015) Table 1 for efficacy (pp. 897-898).

Figure 1. Flow diagram of systematic review adapted from PRISMA 2009

Identification: Reports entered in the Blueprints database between 1996 and August 2020 (n = 3,582)
Excluded (n = 1,014). Reason: Reports that are part of the same evaluation are combined into one study, with coding for evaluation quality occurring at the study level.

Screening: Studies after combining reports (n = 2,568)
Excluded (n = 1,337). Reason: Initial report within a study was published before 2010.

Eligibility: Studies assessed for eligibility (n = 1,231)
Excluded (n = 380). Reason: Quasi-experimental study.

Included: Experimental studies included in the analysis (n = 851)

Notes: Blueprints – Blueprints for Healthy Youth Development, an online registry of experimentally proven and scalable prevention interventions (https://www.blueprintsprograms.org/). Many reports in the Blueprints database (used for this analysis) are part of the same evaluation, because they examine the same intervention and use the same sample but have different outcomes, follow-up periods, or statistical methods. Blueprints combines multiple reports within the same evaluation into what is referred to as a single study.

From: Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med 6(7): e1000097. doi:10.1371/journal.pmed1000097

For more information, visit www.prisma-statement.org.

Table 2. Descriptive analysis results for methodological sources of bias and missing methods information (N = 851 RCTs)

A. Design & Analysis Problems | N (%) | Description/Example
Differences between conditions at baseline | 192 (23%) | Treatment and control conditions differ at baseline on several demographic and/or baseline outcomes
Evidence of differential attrition | 156 (18%) | Completers vs. noncompleters, and/or treatment and control conditions in the analysis sample, vary on demographic characteristics and/or baseline outcomes
Incorrect level of analysis | 77 (9%) | Randomized schools, students are unit of analysis, but no adjustment for clustering
No intent-to-treat analysis | 56 (7%) | Researcher dropped participants from the analysis if participant did not complete the intervention
No independently measured behavioral outcomes | 52 (6%) | Only teacher ratings of child behaviors following an intervention delivered by the teacher, or only parent ratings of child behaviors following an intervention delivered to parents
Randomization likely compromised | 42 (5%) | Participant consent after randomization, or participants reallocated to different intervention condition after random assignment
No controls for baseline outcomes | 38 (5%) | No use of change scores, baseline outcomes as covariates, or group-by-time interactions
Problems with reliability or validity of outcome measures | 28 (3%) | Outcome measures show low reliability or validity, or no measure information provided
Design confound | 9 (1%) | Randomly assigned only one school to the treatment condition and only one school to the control condition

B. Missing Information/Not Reported | N (%) | Description/Example
Tests for differential attrition are incomplete | 192 (23%) | Tested for only some differences between completers vs. noncompleters or across intervention conditions
Tests for baseline equivalence are incomplete | 161 (19%) | Tested whether treatment and control conditions differed on only some baseline measures (demographic or baseline outcomes)
High attrition and no tests for differential attrition | 114 (13%) | High participant dropout and no tests for completers vs. noncompleters and/or condition by baseline outcomes interaction
No tests for baseline equivalence | 74 (9%) | No tests for whether treatment and control conditions are equivalent on demographic characteristics and baseline outcomes prior to beginning the intervention
No information on attrition | 22 (3%) | Reported only randomized or analyzed sample size but did not report Ns across assessment points

Note. Percentages are rounded to the nearest percent. Percentages do not sum to 100%: each row represents a separate calculation for the N (and %) of studies with the particular source of bias/missing information divided by the total 851 studies. Some studies included multiple sources of bias/missing information.

Table 3. Full sample counts and averages of number of methodological sources of bias and missing methods information within and across studies (N = 851 RCTs)

Methodological Sources of Bias from Table 2 | 5 | 4 | 3 | 2 | 1 | 0 | Mean number of problems, SD, and range
(Cell entries are counts, N (%), of the number of problems per RCT across the dataset of all RCTs)

1. Any methodological problem or missing/not reported information (Sections A + B), N = 851 | 4 (<1%) | 33 (4%) | 96 (11%) | 292 (34%) | 241 (28%) | 185 (22%) | M = 1.49, SD = 1.10, range = 0-5
2. Design or analysis problem (Section A), N = 851 | -- | 2 (<1%) | 40 (5%) | 138 (16%) | 290 (34%) | 381 (45%) | M = 0.82, SD = 0.89, range = 0-4
3. Missing/not reported information (Section B), N = 851 | -- | -- | 5 (<1%) | 128 (15%) | 300 (35%) | 418 (49%) | M = 0.67, SD = 0.75, range = 0-3

Note. Percentages are rounded to the nearest percent. Results are presented from "high" (i.e., up to 5 methodological problems/missing pieces of information) to "low" (i.e., minimal or no methodological/reporting problems). A worked example of the mean calculation appears below the table.
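As a worked example of how the means in Table 3 follow from the count distributions, the row 1 mean number of problems per RCT is the count-weighted average:

M = \frac{5(4) + 4(33) + 3(96) + 2(292) + 1(241) + 0(185)}{851} = \frac{1265}{851} \approx 1.49

The same calculation over the row 2 and row 3 distributions gives 694/851 ≈ 0.82 and 571/851 ≈ 0.67, matching the reported means.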

References

Altman, D. G. (1985). Comparability of randomised groups. The Statistician, 34, 125-136.
Altman, D. G., & Dore, C. J. (1990). Randomisation and baseline comparisons in clinical trials. The Lancet, 335(8682), 149-153.
Bickman, L., & Reich, S. M. (2015). Randomized controlled trials: A gold standard or gold plated? In Credible and Actionable Evidence: The Foundation for Rigorous and Influential Evaluations (pp. 83-113). Sage.
Bonell, C. (2002). The utility of randomized controlled trials of social interventions: An examination of two trials of HIV prevention. Critical Public Health, 12(4), 321-334.
Brincks, A., Montag, S., Howe, G. W., Huang, S., Siddique, J., Ahn, S., ... & Brown, C. H. (2018). Addressing methodologic challenges and minimizing threats to validity in synthesizing findings from individual-level data across longitudinal randomized trials. Prevention Science, 19(1), 60-73.
Buckley, P. R., Fagan, A. A., Pampel, F. C., & Hill, K. G. (2020). Making evidence-based interventions relevant for users: A comparison of requirements for dissemination readiness across program registries. Evaluation Review, 44(1), 51-83.
Burkhardt, J. T., Schröter, D. C., Magura, S., Means, S. N., & Coryn, C. L. (2015). An overview of evidence-based program registers (EBPRs) for behavioral health. Evaluation and Program Planning, 48, 92-99.
Chilenski, S. M., Pasch, K. E., Knapp, A., Baker, E., Boyd, R. C., Cioffi, C., ... & Rulison, K. (2020). The Society for Prevention Research 20 years later: A summary of training needs. Prevention Science, 21(7), 985-1000.
Cook, T. D. (2018). Twenty-six assumptions that have to be met if single random assignment experiments are to warrant "gold standard" status: A commentary on Deaton and Cartwright. Social Science & Medicine, 210, 37-40.
Cook, T. D., & Campbell, D. T. (1979). The design and conduct of true experiments and quasi-experiments in field settings. Reproduced in part in Research in Organizations: Issues and Controversies. Goodyear Publishing Company.
Curran, P. J., & Hussong, A. M. (2009). Integrative data analysis: The simultaneous analysis of multiple data sets. Psychological Methods, 14(2), 81.
Deaton, A., & Cartwright, N. (2018). Understanding and misunderstanding randomized controlled trials. Social Science & Medicine, 210, 2-21.
Dechartres, A., Trinquart, L., Faber, T., & Ravaud, P. (2016). Empirical evaluation of which trial characteristics are associated with treatment effect estimates. Journal of Clinical Epidemiology, 77, 24-37.
Deke, J., & Chiang, H. (2017). The WWC attrition standard: Sensitivity to assumptions and opportunities for refining and adapting to new contexts. Evaluation Review, 41(2), 130-154.
European Medicines Agency (2015). Guideline on adjustment for baseline covariates in clinical trials. Retrieved on October 19, 2020 from https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-adjustment-baseline-covariates-clinical-trials_en.pdf
Fagan, A. A., & Buchanan, M. (2016). What works in crime prevention? Comparison and critical review of three crime prevention registries. Criminology & Public Policy, 15(3), 617-649.
Falagas, M. E., Grigori, T., & Ioannidou, E. (2009). A systematic review of trends in the methodological quality of randomized controlled trials in various research fields. Journal of Clinical Epidemiology, 62(3), 227-231.
Farrington, D. P., & Petrosino, A. (2001). The Campbell Collaboration Crime and Justice Group. The Annals of the American Academy of Political and Social Science, 578(1), 35-49.
Flay, B. R., Biglan, A., Boruch, R. F., Castro, F. G., Gottfredson, D., Kellam, S., ... & Ji, P. (2005). Standards of evidence: Criteria for efficacy, effectiveness and dissemination. Prevention Science, 6(3), 151-175.
Gottfredson, D. C., Cook, T. D., Gardner, F. E., Gorman-Smith, D., Howe, G. W., Sandler, I. N., & Zafft, K. M. (2015). Standards of evidence for efficacy, effectiveness, and scale-up research in prevention science: Next generation. Prevention Science, 16(7), 893-926.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.
Grant, S., Mayo-Wilson, E., Montgomery, P., Macdonald, G., Michie, S., Hopewell, S., & Moher, D. (2018). CONSORT-SPI 2018 explanation and elaboration: Guidance for reporting social and psychological intervention trials. Trials, 19(1), 406.
Grant, S., Montgomery, P., Hopewell, S., Macdonald, G., Moher, D., & Mayo-Wilson, E. (2013). Developing a reporting guideline for social and psychological intervention trials. Research on Social Work Practice, 23(6), 595-602.
Grant, S. P., Mayo-Wilson, E., Melendez-Torres, G., & Montgomery, P. (2013). Reporting quality of social and psychological intervention trials: A systematic review of reporting guidelines and trial publications. PLoS One, 8(5), e65442.
Gupta, S. K. (2011). Intention-to-treat concept: A review. Perspectives in Clinical Research, 2(3), 109.
Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29(1), 60-87.
Henry, D., Tolan, P., Gorman-Smith, D., & Schoeny, M. (2017). Alternatives to randomized control trial designs for community-based prevention evaluation. Prevention Science, 18(6), 671-680.
Higgins, J. P., Altman, D. G., Gøtzsche, P. C., Jüni, P., Moher, D., Oxman, A. D., ... & Sterne, J. A. (2011). The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ, 343, d5928.
Hopewell, S., Dutton, S., Yu, L.-M., Chan, A.-W., & Altman, D. G. (2010). The quality of reports of randomised trials in 2000 and 2006: Comparative study of articles indexed in PubMed. BMJ, 340, c723.
Ioannidis, J. P. (2018). Randomized controlled trials: Often flawed, mostly useless, clearly indispensable: A commentary on Deaton and Cartwright. Social Science & Medicine, 210, 53.
Jeličić, H., Phelps, E., & Lerner, R. M. (2009). Use of missing data methods in longitudinal studies: The persistence of bad practices in developmental psychology. Developmental Psychology, 45(4), 1195.
Kristman, V. L., Manno, M., & Côté, P. (2005). Methods to account for attrition in longitudinal data: Do they work? A simulation study. European Journal of Epidemiology, 20(8), 657-662.
Lachin, J. M. (2000). Statistical considerations in the intent-to-treat principle. Controlled Clinical Trials, 21(3), 167-189.
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). John Wiley & Sons.
Martin, J., McBride, T., Brims, L., Doubell, L., Pote, I., & Clarke, A. (2018). Evaluating early intervention programmes: Six common pitfalls, and how to avoid them. Retrieved on October 12, 2020 from http://www.eif.org.uk/publication/evaluating-early-intervention-programmes-six-common-pitfalls-and-how-to-avoid-them
Mayo-Wilson, E., Grant, S., Hopewell, S., Macdonald, G., Moher, D., & Montgomery, P. (2013). Developing a reporting guideline for social and psychological intervention trials. Trials, 14(1), 242.
Means, S. N., Magura, S., Burkhardt, J. T., Schröter, D. C., & Coryn, C. L. (2015). Comparing rating paradigms for evidence-based program registers in behavioral health: Evidentiary criteria and implications for assessing programs. Evaluation and Program Planning, 48, 100-116.
Mihalic, S. F., & Elliott, D. S. (2015). Evidence-based programs registry: Blueprints for Healthy Youth Development. Evaluation and Program Planning, 48, 124-131.
Montgomery, P., Grant, S., Mayo-Wilson, E., Macdonald, G., Michie, S., Hopewell, S., & Moher, D. (2018). Reporting randomised trials of social and psychological interventions: The CONSORT-SPI 2018 Extension. Trials, 19(1), 407.
Murray, D. M., Pals, S. L., George, S. M., Kuzmichev, A., Lai, G. Y., Lee, J. A., ... & Nelson, S. M. (2018). Design and analysis of group-randomized trials in cancer: A review of current practices. Preventive Medicine, 111, 241-247.
Murray, D. M., Taljaard, M., Turner, E. L., & George, S. M. (2020). Essential ingredients and innovations in the design and analysis of group-randomized trials. Annual Review of Public Health, 41, 1-19.
Murray, D. M., Varnell, S. P., & Blitstein, J. L. (2004). Design and analysis of group-randomized trials: A review of recent methodological developments. American Journal of Public Health, 94(3), 423-432.
Nicholson, J. S., Deboeck, P. R., & Howard, W. (2017). Attrition in developmental psychology: A review of modern missing data reporting and practices. International Journal of Behavioral Development, 41(1), 143-153.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600-2606.
Pigott, T. D., & Polanin, J. R. (2020). Methodological guidance paper: High-quality meta-analysis in a systematic review. Review of Educational Research, 90(1), 24-46.
Podsakoff, P. M., MacKenzie, S. B., Lee, J.-Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879.
Puma, M. J., Olsen, R. B., Bell, S. H., & Price, C. (2009). What to do when data are missing in group randomized controlled trials (NCEE 2009-0049). National Center for Education Evaluation and Regional Assistance.
Raab, G. M., Day, S., & Sales, J. (2000). How to select covariates to include in the analysis of a clinical trial. Controlled Clinical Trials, 21(4), 330-342.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). Sage.
Raudenbush, S. W., & Schwartz, D. (2020). Randomized experiments in education, with implications for multilevel causal inference. Annual Review of Statistics and Its Application, 7, 177-208.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147.
Schulz, K. F., Altman, D. G., Moher, D., & the CONSORT Group (2010). CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. Trials, 11(1), 32.
Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13(17), 1715-1726.
Shadish, W. R., & Cook, T. D. (2009). The renaissance of field experimentation in evaluating interventions. Annual Review of Psychology, 60, 607-629.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Song, M., & Herman, R. (2010). Critical issues and common pitfalls in designing and conducting impact studies in education: Lessons learned from the What Works Clearinghouse (Phase I). Educational Evaluation and Policy Analysis, 32(3), 351-371.
Spieth, P. M., Kubasch, A. S., Penzlin, A. I., Illigens, B. M.-W., Barlinn, K., & Siepmann, T. (2016). Randomized controlled trials – a matter of design. Neuropsychiatric Disease and Treatment, 12, 1341.
Thomson, D., Hartling, L., Cohen, E., Vandermeer, B., Tjosvold, L., & Klassen, T. P. (2010). Controlled trials in children: Quantity, methodological quality and descriptive characteristics of pediatric controlled trials published 1948-2006. PLoS One, 5(9), e13106.
Torgerson, D. J., & Torgerson, C. J. (2003). Avoiding bias in randomised controlled trials in educational research. British Journal of Educational Studies, 51(1), 36-45.
Wadhwa, M., & Cook, T. D. (2019). The set of assumptions randomized control trials make and their implications for the role of such experiments in evidence-based child and adolescent development research. New Directions for Child and Adolescent Development, 2019(167), 17-37.
Walleser, S., Hill, S. R., & Bero, L. A. (2011). Characteristics and quality of reporting of cluster randomized trials in children: Reporting needs improvement. Journal of Clinical Epidemiology, 64(12), 1331-1340.
West, S. G. (2009). Alternatives to randomized experiments. Current Directions in Psychological Science, 18(5), 299-304.
West, S. G., & Thoemmes, F. (2010). Campbell's and Rubin's perspectives on causal inference. Psychological Methods, 15(1), 18.
What Works Clearinghouse (WWC) (2020). WWC procedures and standards handbook (Version 4.1). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance.
Wilson, D. B. (2009). Missing a critical piece of the pie: Simple document search strategies inadequate for systematic reviews. Journal of Experimental Criminology, 5(4), 429-440.
Wing, C., & Cook, T. D. (2013). Strengthening the regression discontinuity design using additional design elements: A within-study comparison. Journal of Policy Analysis and Management, 32(4), 853-877. doi:10.1002/pam.21721