
REVIEW ARTICLE

Clinical Research Methodology 3: Randomized Controlled Trials

Daniel I. Sessler, MD,* and Peter B. Imrey, PhD†

From the *Department of Outcomes Research, Anesthesiology Institute, Cleveland Clinic, Cleveland, Ohio; and †Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio.

Accepted for publication May 1, 2015. Funding: Supported by internal sources. The authors declare no conflicts of interest. Address correspondence to Daniel I. Sessler, MD, Department of Outcomes Research, Cleveland Clinic, 9500 Euclid Ave./P77, Cleveland, OH 44195. Address e-mail to [email protected]; www.OR.org. DOI: 10.1213/ANE.0000000000000862

Randomized assignment of treatment excludes reverse causation and, in sufficiently large studies, effectively prevents confounding. Well-implemented blinding prevents measurement bias. Studies that include these protections are called randomized, blinded clinical trials and, when conducted with sufficient numbers of patients, provide the most valid results. Although conceptually straightforward, design of clinical trials requires thoughtful trade-offs among competing approaches—all of which influence the number of patients required, enrollment time, internal and external validity, ability to evaluate interactions among treatments, and cost. (Anesth Analg 2015;121:1052–64)

Observational study designs, as explained in our previous 2 reviews, are inherently vulnerable to systematic errors from selection and measurement biases and confounding; retrospective studies may additionally be subject to reverse causation when the timing of exposure and outcome cannot be precisely determined. Fortunately, 2 study design strategies, randomization and blinding, preclude or mitigate these major sources of error.

Randomized clinical trials (RCTs) are cohort studies—necessarily prospective—in which treatments are allocated randomly to the subjects who agree to participate. In the most rigorous randomized trials, called blinded or masked, knowledge of which treatment each patient receives is concealed from the patients and, when possible, from investigators who evaluate their progress. Blinded RCTs are particularly robust because randomization essentially eliminates the threats of reverse causation and selection bias to study validity and considerably mitigates the threat of confounding. Well-executed blinding/masking simultaneously mitigates measurement bias and placebo effects by equalizing their impacts across treatments.1 We now discuss these benefits in more detail.

RANDOMIZATION

In reviewing patient records, we would not expect 2 treatments for a medical condition to be randomly distributed among patients because care decisions are influenced by numerous factors, including patient preference. Patients given different treatments may therefore differ systematically and substantially in their risks of outcomes. Randomization eliminates selection bias in treatment comparisons because, by definition, randomized assignments are indifferent to patient characteristics of any sort. For example, investigators reviewing records might find that aggressively treated septic critical care patients do better than those treated conservatively. Improved outcomes might occur because aggressive treatment was more effective. But it might equally well be that patients who appeared stronger were selected for more aggressive therapy because they were thought better able to tolerate it and then did better because they were indeed stronger. Looking back, it is hard to distinguish selection bias from true treatment effect.

The threat of confounding, which can be latent in a population or result from selection or measurement bias, refers to misattribution error because of a third factor linking treatment and outcome. For example, anesthesiologists may prefer to use neuraxial anesthesia in older patients under the impression that it is safer. Let us say, however, that younger patients in a study cohort actually do better. But did they do better because they were given general anesthesia (a causal effect of treatment) or simply because they were younger (confounding by age, the third variable)? Looking back in time, as in a retrospective analysis of existing data, in complex medical situations, it is difficult to determine the extent to which even known mechanisms contribute causally to outcome differences. And it is essentially impossible to evaluate the potential contributions of unknown mechanisms.

Confounding can only occur when a third factor, the confounder, differs notably between treatment groups in the study sample. Randomization largely prevents confounding because, in a sufficiently large study, patients assigned to each treatment group will most likely be very similar with respect to non–treatment-related factors that might influence the outcome. The tendency of randomization to equalize allocations across treatment groups improves as sample size increases and applies to both known and unknown factors, thus making randomization an exceptionally powerful tool. Randomized groups in a sufficiently large trial will thus differ substantively only in treatment. The consequence is that, unless there is measurement bias, differences in outcome can be causally and specifically attributed to the treatment itself—which is what we really want to know.

Trials need to be larger than one might expect to provide a reasonable expectation that routine baseline factors are comparably distributed in each treatment group (i.e., n > 100 per group). For factors that are uncommon, many more patients are required.


For example, consider a feasibility trial testing tight-versus-routine perioperative glucose control with 25 patients per group. It is unlikely that average age or weight will differ greatly between the 2 groups. But it would not be too surprising if, by bad luck, 1 group had 14 diabetic patients and the other had just 7. Such a large disparity will occur in about 1 in 12 trials of this size, just by chance. But diabetes mellitus obviously has substantial potential to influence the investigators' ability to tightly control glucose concentration. Baseline inhomogeneity, that is, the disparate numbers of patients with diabetes mellitus, would then complicate interpretation of the results of an otherwise excellent trial.
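The likelihood of an imbalance of this size is easy to check. The short simulation below is our illustration (not code from the article): it repeatedly splits 50 patients, 21 of whom are diabetic (the 14 + 7 in the example), into 2 random groups of 25 and counts how often the split is at least as lopsided as 14 versus 7.

```python
import random

def chance_of_imbalance(n_diabetic=21, n_total=50, n_per_group=25,
                        worst_split=14, n_sims=50_000, seed=1):
    """Estimate how often pure chance splits the diabetic patients between
    2 equal randomized groups at least as unevenly as `worst_split` vs the rest."""
    rng = random.Random(seed)
    patients = [1] * n_diabetic + [0] * (n_total - n_diabetic)
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(patients)
        group_a = sum(patients[:n_per_group])   # diabetics randomized to group A
        group_b = n_diabetic - group_a          # diabetics randomized to group B
        if max(group_a, group_b) >= worst_split:
            hits += 1
    return hits / n_sims

print(chance_of_imbalance())  # about 0.08, i.e., roughly 1 trial in 12
```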
One approach to avoiding baseline heterogeneity is increasing the sample size. Larger sample size lowers the risk of any given level of heterogeneity across all potentially important factors, including unknown confounders. But when there are a limited number of obviously important uncommon factors, an alternative is to stratify randomization (i.e., randomize separately in groups distinguished by the values of these factors), using one of several possible devices that restrict the results so that each factor is distributed roughly evenly across treatment groups. This process assures that even uncommon characteristics, if included in the stratification, will be roughly comparably distributed across treatment groups.

With electronic randomization systems, such as those accessed in real time via the Web, it is possible to include various levels of stratification without difficulty. In multicenter studies, permuted-block randomization is virtually always used because site-related variations are assumed to be a substantial potential confounder and a given site may not enroll enough patients to assure baseline homogeneity. In this process, random treatments are assigned in small sequential blocks of each site's patients, so that different sites' patients are comparably distributed across treatment groups at the end of each block. Block sizes are concealed from clinicians enrolling patients, and often changed, so that enrolling clinicians and other investigators are masked to the allocation process and cannot know the next patient's treatment assignment.
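The permuted-block scheme described above is simple to implement. The sketch below is a minimal illustration of the idea (ours, not the authors' software): within each stratum, here the enrolling site, assignments are generated in randomly sized balanced blocks so that groups stay comparable at each site while enrolling clinicians cannot predict the next assignment.

```python
import random

def blocked_assignments(rng, treatments=("A", "B"), block_sizes=(4, 6)):
    """Yield an endless stream of treatment assignments for one stratum
    (e.g., one site), balanced within permuted blocks of concealed,
    randomly varying size."""
    while True:
        size = rng.choice(block_sizes)                        # variable block size
        block = list(treatments) * (size // len(treatments))  # balanced block
        rng.shuffle(block)                                    # permute within the block
        yield from block

# One independent assignment stream per stratum (here, per enrolling site).
master = random.Random(2015)
sites = {site: blocked_assignments(random.Random(master.random()))
         for site in ("Site 1", "Site 2", "Site 3")}

# As patients are enrolled, each draws the next assignment for his or her site.
for patient, site in enumerate(["Site 1", "Site 1", "Site 2", "Site 3", "Site 1"], start=1):
    print(f"patient {patient} at {site}: treatment {next(sites[site])}")
```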


Although theoretically straightforward, randomization can be tricky in practice. For example, patients may agree to participate in a randomized trial but then drop out if they are not assigned to the novel treatment they wanted. Similarly, consented patients may remove themselves from a study based on perceived lack of benefit or complications. To the extent that patients drop out of studies nonrandomly, there remains potential for selection bias.

BLINDING

Clinical measurements, no matter how carefully taken, are rarely precisely accurate. Preservation is imperfect, biosamples degrade, batches of reagents and biologics vary, human operators processing samples are inconsistent, and radiologists and pathologists vary in their interpretations of images and biopsy specimens, which are themselves of variable quality. But these sorts of errors occur randomly, with underestimates likely to be balanced by overestimates over large numbers of measurements. Consequently, in clinical trials, such errors are equally likely to favor 1 treatment group over the other and very unlikely to favor either substantially when averaged over large numbers of tests. Random error adds variability to results, which degrades statistical power, but power can always be augmented by increasing the numbers of patients and/or measurements.

Nonrandom errors that affect treatment groups differently are a major threat to study validity. Measurement bias, that is, error resulting from distortions of a measurement process, and thus not expected to average out over many measurements, can compromise even large RCTs. For example, suppose investigators are evaluating the effect of a new drug on postoperative nausea and vomiting. And let us say that both the investigators and patients know whether they have been assigned to the experimental or control group. Does knowing the treatment influence the amount of nausea and vomiting reported? It almost surely does: patients (and some investigators) typically assume that the novel treatment is better, and overestimate its benefits, even when the new treatment is actually a placebo or the same drug as in the control group.2 They make this assumption even though superiority is actually unknown, and comparing the novel and conventional treatments is the entire point of the study. The effect is so strong that IRBs typically forbid investigators to describe experimental treatments as novel because patients assume "new" implies "new and improved."

Overestimation of benefits is, of course, most likely with subjective outcomes such as pain or nausea and vomiting. But improvement also occurs with supposedly objective outcomes, possibly because patients expect to do better and that expectation alone improves physical characteristics such as immune function. Improved outcomes are welcome of course, but it would be a mistake under these circumstances to conclude that the novel therapy caused the improvement. The same logic applies to complications, which are often underestimated for novel treatments.

Fortunately, if done effectively, blinding can largely eliminate measurement biases by equalizing errors over treatment groups so they do not skew treatment comparisons. Blinding in research is defined by concealing which treatment a patient actually receives. Full blinding prevents bias in the measurement process because patients, clinicians who treat them, and all who evaluate their health outcomes do not know which treatment—the experimental or control—has been provided. Patient responses, and the processes by which physicians and investigators measure them, are thus not affected by their impressions about the superiority of novel treatments. Any inherent biases in the measurement process, whether random or systematic, are experienced by each study group equally and will tend to cancel out of the intergroup comparisons.

RANDOMIZED AND BLINDED STUDIES

In retrospective analyses of treatment outcomes, which by definition involve looking back in time, it is difficult to determine whether outcome differences result from baseline differences in the group characteristics (selection bias), a third factor (confounding), nonrandom measurement errors (measurement bias), or from the treatment itself (the relationship of interest).

Limited ability to recognize distortions from bias and confounding weakens retrospective studies. But in prospective cohort studies, treatments can be randomly assigned to prevent selection bias and minimize confounding; furthermore, participating patients and clinicians can be blinded to prevent measurement bias. Studies that incorporate the protections of randomization and blinding are called randomized, blinded clinical trials. And because the major sources of error are limited in such trials, they are considered to provide the highest level of clinical evidence.

Before randomized trials, it was difficult to objectively evaluate treatment efficacy, leading Oliver Wendell Holmes to comment: "If all our medications could be sunk to the bottom of the sea, it would be all the better for mankind—and all the worse for the fishes." Amazingly, the first major randomized trial, which evaluated streptomycin as a treatment for tuberculosis, was not conducted until 1948. Clinical trials have thus been an established part of medicine only for about the past 65 years.

Of course designed experimentation can only be done prospectively. And enrolling, treating, and following a sufficient number of patients may take many years and considerable expense. Thus, trials should be considered a scarce resource and used to address important questions supported by compelling mechanistic understanding and, preferably, reasonable animal and previous human experience.

CONTINUOUS VERSUS DICHOTOMOUS OUTCOMES

Clinical signs of risk or progression, and thus prognostic of outcomes, are often measurable as continuous variables, even though the outcomes they predict, and which patients directly perceive, are typically dichotomous or categorical. Examples include QT interval in relation to nonperfusing arrhythmias, such as torsade-de-pointes; diastolic blood pressure in relation to myocardial infarction or ischemic stroke; and creatinine clearance in relation to development of end-stage renal disease. The first example is typical in that QT interval per se cannot be detected by patients and would only interest them to the extent that it might predict something they do care about, such as having a cardiac arrest. That said, many continuous outcomes are clinically and economically important, such as length of critical care, duration of hospitalization, months of severe pain, or quality of life.

Continuous outcomes are easier than dichotomous outcomes to study because continuous markers or mediators can often be usefully evaluated without awaiting ultimate patient outcomes, which can involve extended patient surveillance. Moreover, because comparisons of patients with respect to dichotomous outcomes can only be of 4 sorts: both yes, both no, yes/no, or no/yes, whereas numerical values can be ordered and scaled by distance, there is simply more information in numerical markers than in dichotomous outcomes, even when the latter are far more important. A consequence is that more patients must usually be studied to reliably distinguish treatments on the basis of dichotomous as compared with continuous outcomes, a difference that increases for infrequent events.

As thus might be expected, many and perhaps most clinical trials designate continuous variables rather than more important dichotomous variables as their primary outcomes. There is an implicit assumption that what improves the continuous predecessor will similarly benefit the later patient outcome, although this does not always prove to be the case. Despite the costs, ultimately, decisions about comparative benefit to patients are thus best based on unambiguous dichotomous outcomes that damage patients in perceptible ways.
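The sample-size penalty for dichotomizing can be illustrated with the standard normal-approximation formulas. The sketch below is our own worked example under simplifying assumptions (a normally distributed outcome, dichotomized at the control-group mean so that the "event" rate is 50% in controls); none of the specific effect sizes come from the article.

```python
from math import ceil, sqrt
from statistics import NormalDist

z = NormalDist()
ALPHA, POWER = 0.05, 0.80
ZA, ZB = z.inv_cdf(1 - ALPHA / 2), z.inv_cdf(POWER)

def n_continuous(d):
    """Per-group n to detect a shift of d standard deviations in a continuous outcome."""
    return ceil(2 * ((ZA + ZB) / d) ** 2)

def n_dichotomized(d):
    """Per-group n if the same outcome is instead dichotomized at the control mean,
    so the 'event' proportion is 0.5 in controls and Phi(d) in treated patients."""
    p1, p2 = 0.5, z.cdf(d)
    p_bar = (p1 + p2) / 2
    num = ZA * sqrt(2 * p_bar * (1 - p_bar)) + ZB * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((num / (p1 - p2)) ** 2)

for d in (0.3, 0.4, 0.5):
    print(d, n_continuous(d), n_dichotomized(d))
# Across these effect sizes, dichotomizing the same underlying measurement
# raises the required enrollment by roughly 60 percent.
```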
SUBJECT SELECTION

When selecting participants for clinical trials, there is inherent tension between the scientific requirements of demonstrating treatment benefit conclusively and the desire for results that are widely clinically useful. This relates to the distinction between internal and external validity of a study. Internal validity denotes the strength of design and analysis of a study in protecting from spurious conclusions about similar participants such as might otherwise arise from reverse causation mix-ups, selection and measurement biases, confounding, and chance. External validity refers to applicability of research results to persons different from participants, conditions different from those of the trial, other doses, other routes of administration, or even other agents within a given pharmaceutical class.

Restricting participants to a near-homogeneous group reduces outcome variability, which reduces the sample size, time, and cost needed to obtain statistically significant differences between treatments, assuming one exists. It can also narrow the spread of potential confounders, which limits the distortion confounding can produce. It is also reasonable and ethically appropriate to restrict enrollment to patients who are (1) especially likely to benefit from the experimental intervention and (2) unlikely to suffer complications. A consequence is that most clinical trials evaluate only small subsets of potential populations of interest.

Although there are compelling reasons to restrict trial enrollment, technically the results of any trial apply only to patients similar to those who were studied. The problem is that many randomized trials are nonrepresentative, failing to include even substantial minority populations, along with subjects who cannot read, do not speak the dominant language, or have cognitive impairment, all of which complicate obtaining valid consent. Because enrollment is often highly restricted, many trials suffer from poor generalizability; that is, results may extrapolate poorly to the great majority of patients who might benefit from the experimental treatment.

The difficulty, of course, is that clinicians rarely care for patients whose characteristics and background closely match those who participated in relevant trials. In clinical practice, we thus need to make reasonable extrapolations from existing data to our patients. Extrapolation is almost always preferable to the alternative (best clinical judgment or "making it up"), but confidence in various extrapolations, especially among competing study results, is enhanced when studies include broad populations. In practice, real-world applications of new treatments usually prove to be less effective and more toxic than reported in clinical trials. Table 1 summarizes the advantages and disadvantages of loose and tight enrollment criteria.

Table 1. Subject Selection Strategies

Tight criteria:
- Reduce variability and sample size
- Exclude subjects at risk of treatment complications
- Include subjects most likely to benefit
- May restrict to advanced disease, compliant patients, etc.
- Slow enrollment
- Best-case results (compliant low-risk patients with ideal disease stage)

Loose criteria:
- Include more real-world participants
- Increase variability and sample size
- Speed enrollment
- Enhance generalizability


Generally speaking, observational studies such as retrospective cohort analyses tend to include broad populations. Consequently, their results, if correct, tend to generalize well. However, their internal validity is often degraded by unknown amounts of confounding and selection and measurement bias, making it difficult to assert correctness confidently. Tightly controlled clinical trials with restrictive enrollment have the opposite problem. Randomization and blinding limit bias and confounding, resulting in high internal validity; however, trial external validity often suffers as a direct consequence of tight enrollment criteria.

A further factor to consider is that the results of even the best clinical trial apply to the typical patient in each group. But it remains possible—and perhaps probable—that a treatment that on average is beneficial might be harmful for particular members of the group. Similarly, toxicity might be substantial for certain members of a particular study group, even if on average toxicity was less with the designated treatment. In fact, such divergent results are sometimes predictable and are termed practice misalignments.3 They result when enrollment criteria are broad and the study intervention is one that is typically titrated by clinicians; for example, designating fixed low and high concentrations of vasopressors for treatment of sepsis, rather than allowing clinicians to titrate lower concentrations for less severe disease and higher concentrations in more critical patients.4

CROSSOVER TRIALS

The prototypical clinical trial is a randomized and blinded parallel-group prospective experiment. In such studies, patients are randomly assigned to 1 of ≥2 treatment groups, and patient outcomes are compared among these groups. Statistically, the challenge with this approach is separating background population variability from the treatment effect. If the treatment effect is large (say comparing a highly effective treatment with placebo), it is often easy to isolate the effect of an experimental treatment. But in an era where there are many effective treatments for most conditions, the much more common problem is to identify small incremental benefits. And those can easily be obscured by natural variation in the population.

An alternative to comparing ≥2 groups of different patients is thus to randomly compare ≥2 treatments in the same patients, that is, to use a crossover design. Specifically, a single group of subjects is exposed to ≥2 treatments, with each patient crossed over from one treatment to another in a random order. This approach is statistically efficient because the comparison is between treatments within each patient. Patient variability across the population thus largely drops out of the analysis because, instead of comparing patients with other patients to determine treatment effects, the response of each patient to a treatment is first compared with his or her own responses to other treatments, and these within-patient differences in responses are then aggregated across patients.

The more that patients vary from one another, the greater the increase in sensitivity—statistical power—to detect a treatment benefit between paired t tests and related methods used in crossover trials and unpaired t tests and related methods used for parallel-group trials. The biggest advantage of crossover trials is thus that, by minimizing the effects of population variability, it is easier to observe specific intervention effects, and with far fewer patients than would be needed with a parallel-group approach.
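The gain can be quantified with textbook normal-approximation formulas: a parallel-group trial needs roughly 2σ²(z_α/2 + z_β)²/δ² patients per group, whereas a crossover needs roughly 2σ²(1 − ρ)(z_α/2 + z_β)²/δ² patients in total, where σ is the between-patient standard deviation, δ the treatment effect, and ρ the correlation between a patient's responses to the 2 treatments. The sketch below is our illustration with invented numbers, not data from the article.

```python
from statistics import NormalDist

def n_parallel_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Patients per group for a 2-arm parallel trial (normal approximation)."""
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    return 2 * (sigma * (za + zb) / delta) ** 2

def n_crossover_total(delta, sigma, rho, alpha=0.05, power=0.80):
    """Total patients for a 2-period crossover; rho is the within-patient
    correlation between responses to the 2 treatments."""
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    var_within_diff = 2 * sigma ** 2 * (1 - rho)   # variance of each patient's difference
    return var_within_diff * ((za + zb) / delta) ** 2

# Hypothetical example: detect a 10-point difference on a pain scale when the
# between-patient SD is 30 points and within-patient responses correlate at 0.7.
print(n_parallel_per_group(10, 30))    # about 141 per group, roughly 282 patients overall
print(n_crossover_total(10, 30, 0.7))  # about 42 patients in total
```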
Unfortunately, however, substantial limitations of crossover designs often preclude their use. For example, they assume that the underlying disease process is static over the trial's duration and that treatments have no permanent residual effects, so that the condition of a patient when the second treatment is initiated is no longer affected by the first treatment and is similar to when the study started. Surgery, for example, never qualifies, and often a wash-out period of no therapy is required to restore the patient to a baseline state. A corollary is that crossover trials can only evaluate transient symptoms and signs, mediators, and soft outcomes such as laboratory tests, patient function, or hemodynamic responses. They are thus perfect for comparing ≥2 analgesics for control of persistent and stable pain or statins for blood cholesterol reduction. But crossover designs cannot directly address far more important questions, such as which drug most reduces the risk of a heart attack or mortality.

FACTORIAL DESIGNS

Factorial designs have been used in other fields for decades, but only recently have they become well established and increasingly common in medicine.5 The basic approach is to simultaneously randomly assign patients to ≥4 interventions. For example, patients in a single study might be randomly assigned to clonidine or a placebo for clonidine and to aspirin or a placebo for aspirin. The randomization might be arranged so that the fraction of patients given 1 treatment, say clonidine, does not depend on whether or not the patient receives the other treatment, say aspirin. This avoids confounding, that is, contaminating, the effect of 1 treatment with that of the other, so that the overall clonidine effect can be evaluated without regard to aspirin, and vice versa.6,7 These overall results are called marginal effects because, when results are conventionally described in a 2 × 2 table with rows representing 1 treatment axis (e.g., clonidine or clonidine placebo) and columns representing the other (aspirin or aspirin placebo), they appear at the edges, or margins, of the table. Effectively, then, 2 studies are conducted simultaneously in the same group of patients. This approach is obviously efficient, in that 2 hypotheses can be tested with only slightly more patients than would be required for each question alone.

Another important benefit of factorial designs is that they allow investigators to evaluate effect modification, that is, the interaction between treatments, as well as their marginal effects. Consider, for example, a large trial of clonidine versus placebo and a separate large trial of aspirin versus placebo. The results of the first would presumably clearly identify the benefit (if any) from clonidine, and the results of the second would identify the effects of aspirin on the designated outcome. But no matter what the results, clinicians would reasonably ask what might be expected when the 2 treatments are combined.


The answer to that important question cannot be determined from 2 independent trials but is readily available from an adequately powered factorial trial. Note, however, that interactions are typically smaller than main effects and thus require larger studies to reliably detect. Moreover, once an interaction is detected, the study, in effect, breaks into pieces because the effect of each treatment now must be evaluated separately with and without the other, with only half the original number of patients available to do this, thus losing an important benefit of the factorial approach.

A disadvantage of factorial trials is their increased complexity, both in study conduct and potentially in interpretation. Another disadvantage is that each intervention usually has at least slightly different contraindications, but only patients who can be randomly assigned to each potential intervention can be included, thus narrowing the group of eligible patients.

A complex anesthesia study using a 6-way randomization of various antiemetic strategies and powered not only for marginal effects but also for second- and third-level interactions among the treatments exemplifies the strengths and limitations of such designs.8
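A small simulation makes the 2 × 2 layout concrete. The sketch below is our illustration; the event risks for the 4 cells are invented, not results from any trial. Patients are assigned to each factor independently, and the marginal effect of each drug is read off the margins of the resulting table; an interaction would appear as a clonidine effect that differs between the aspirin and aspirin-placebo columns.

```python
import random

rng = random.Random(7)
N = 40_000                                   # large, so the margins are stable

# Hypothetical outcome risks for the 4 treatment combinations (invented numbers).
risk = {("clonidine", "aspirin"): 0.06,
        ("clonidine", "aspirin placebo"): 0.07,
        ("clonidine placebo", "aspirin"): 0.08,
        ("clonidine placebo", "aspirin placebo"): 0.10}

events = {cell: 0 for cell in risk}
counts = {cell: 0 for cell in risk}
for _ in range(N):
    c = rng.choice(["clonidine", "clonidine placebo"])   # independent of the aspirin arm
    a = rng.choice(["aspirin", "aspirin placebo"])       # independent of the clonidine arm
    counts[(c, a)] += 1
    events[(c, a)] += rng.random() < risk[(c, a)]

def marginal(axis, level):
    """Event incidence among all patients whose assignment on `axis` (0 or 1) equals `level`."""
    e = sum(events[cell] for cell in risk if cell[axis] == level)
    n = sum(counts[cell] for cell in risk if cell[axis] == level)
    return e / n

# Marginal (edge-of-table) comparisons: each drug evaluated ignoring the other.
print("clonidine vs its placebo:", marginal(0, "clonidine"), marginal(0, "clonidine placebo"))
print("aspirin vs its placebo:  ", marginal(1, "aspirin"), marginal(1, "aspirin placebo"))
```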
BEFORE-AND-AFTER STUDIES AND CLUSTER TRIALS

Many patient characteristics—such as smoking status, race, sex, and obesity—cannot be randomized. Other interventions, such as changes to health systems per se, for example, introduction of electronic records, a new billing system, or a clinical pathway, cannot be randomized on an individual basis. These require extensive system and behavioral changes and months or years to implement, and cannot be undertaken or reversed on a patient-by-patient basis. But such system interventions are arguably among the most important changes to evaluate, with huge potential impact. How then to study them?

The most common approach for evaluating system interventions is a before-and-after analysis. However, this is a weak study design because it is nearly impossible to account for 4 important sources of error. The first is that most aspects of care improve over time. And although it is tempting to attribute improvement to a specific intervention of interest, many aspects of care inevitably change simultaneously. The intervention of interest is thus confounded by other simultaneous interventions or other changes, which are often unrecognized, such as subtle caregiver behavioral changes.

Consider surgical wound infections. The monthly incidence of surgical wound infections varies considerably, even in large high-quality hospitals. Why the incidence varies remains unknown, which speaks to the many factors contributing to this key safety and quality metric. When wound infections spike, as they do from time to time, high-quality hospitals recognize the problem and evaluate ameliorative approaches. Almost inevitably, this involves discussions among infection-control specialists, nurses, surgeons, and anesthesiologists. But it also involves environmental experts who assess operating room airflow, filter integrity, and bacterial colonization of caregivers and the air-handling system. Simultaneously, there is renewed focus on hand washing, surgical-site preparation, and proper draping.

The attentions of all these experts and efforts among caregivers are invariably successful, and the surgical wound infection incidence soon returns to the expected level. But why? A likely reason is that the high infection incidence that triggered concern was simply random variation and never reflected a systematic increase that would be expected to persist. If so, the infection incidence's subsequent return to baseline level was simply regression-to-the-mean, the third major problem with before-and-after studies.

Let us say, however, that the original increase in the wound infection incidence was real and that the subsequent return to baseline risk reflected enhanced care. Even assuming real improvement, a major problem with before-and-after studies is that it remains unrealistic to attribute improvement to a single intervention because several were introduced simultaneously, and it is essentially impossible to determine the extent to which the benefit of one was confounded with the benefits of others.

The fourth weakness of before-and-after studies is the Hawthorne effect. The term was introduced by Henry Landsberger based on studies conducted at the Hawthorne Works, a Western Electric factory near Chicago.9 What Landsberger noticed was that tiny environmental changes (such as more illumination) improved productivity and satisfaction, but only briefly. In fact, most any change briefly improved productivity—including subsequently reducing illumination levels! But because the improvements were not sustained, he correctly concluded that they resulted from employee engagement in the process rather than the environmental change per se.

The Hawthorne effect is undoubtedly useful in that worker engagement improves quality and satisfaction. For example, anesthesiologist-driven changes in departmental call schedules improve satisfaction even when the total workload is unchanged. Skilled chairs thus encourage member-driven process improvement discussions because engagement in this very process improves quality and satisfaction. The difficulty is that before-and-after studies often attribute all improvement to a specific factor, ignoring the very real improvement that comes just from the process of discussing changes (to say nothing of other simultaneous improvements).

An alternative to before-and-after studies, with all their weaknesses, is cluster randomized trials. Cluster randomization refers to randomly allocating treatments to entire groups or clusters, in a one-for-all fashion, en masse. Clusters might consist of classes attending continuing medical education events, all patients of individual primary care providers or entire group practices, patients or clinicians at individual outpatient clinics affiliated with a health care system, or groups of employees of several different hospitals. The difficulty, of course, is that entire care systems within a hospital, or even entire hospitals, need to be randomized, and each becomes a unit of analysis. Statistically, each care unit or hospital is treated more-or-less the way a single patient would be in a conventional trial.


In other words, statistical analysis is not based on the number of patients but on the number of units that are randomized. Considering the logistical challenges and cost of such trials, it is obvious why they are uncommon, although well-conducted cluster randomized trials can yield invaluable information.10,11

Escalating costs of large RCTs have prompted consideration of other ways to accrue and study large numbers of patients quickly at low cost. One possibility is an alternating intervention trial, a nonrandomized extension of a quality improvement demonstration approach that we mention here because of its utility and importance under special circumstances, especially when combined with electronic data capture.

Quality improvement projects typically at most use a before-and-after approach to evaluating treatment effect. For example, surgeons might switch to a new skin preparation routine and then evaluate the incidence of wound infection before and after the change. Very often, the switch will appear salutary—but quite possibly because of unrecognized concomitant changes, regression to the mean, or the Hawthorne effect. The true benefit (if any) of the new skin preparation technique is essentially impossible to determine with this approach.

Now consider an alternative and stronger study design. Say an anesthesia department is considering switching from sevoflurane (more expensive but shorter acting) to isoflurane (less expensive but longer acting) but is concerned the switch might prolong hospital stays. Instead of simply changing anesthetics and making a before-and-after comparison, 1 anesthetic can be used exclusively for a couple of weeks and then the other for a couple of weeks, with continued alternation for several months. Note that this is not an individual patient randomization; in fact, the trial is not randomized at all or necessarily blinded. But in the course of say a dozen switches, other process improvements and any regression to the mean may reasonably be expected to average out, letting the investigators isolate the nearly pure effect of anesthetic choice on duration of hospitalization. This expectation is most warranted when the intervention involves a relatively inconspicuous and noncontroversial aspect of an overall treatment process, in contrast to a major change in treatment approach.

Alternating intervention approaches work especially well when the intervention treatment is easy to switch, and the studies can be inexpensive when the outcomes are electronically recorded or otherwise easy to obtain. However, because of the lack of randomization and blinding, such studies can be completely invalidated if some clinicians preferentially schedule patients or surgeries, or if discharge decisions are influenced by which treatment is in effect that week or by secular factors correlated, for whatever reason, with both the alternation cycle and the outcome—here, duration of stay (for an example of an alternating intervention trial, see the study by Kopyeva et al.12). Note that a stepped-wedge approach to introducing an intervention at multiple sites can gain some, but not all, benefits of alternating intervention approaches.

Although alternating intervention trials share aspects of quality improvement studies, they are very much real research and absolutely require IRB approval. And of course, only selected quality-related interventions are appropriate. But when the novel intervention is reasonably expected to be at least as safe and effective, and possibly better or less expensive, many IRBs will approve this sort of trial with waived consent.

COMPOSITE OUTCOMES

Large trial sizes, factorial designs, and composite outcomes are among the major trends in clinical trials.5 The use of composite outcomes is increasing because they offer distinct advantages for studies with dichotomous outcomes—which include most hard outcomes such as major complications and deaths. Composites combine ≥2 components into a single summary, typically dichotomous, which then becomes the basis for analysis.

There are 2 major advantages to composite outcomes. The first is that a single measure may poorly characterize the anticipated effect of an experimental intervention. Consider guided fluid management, which may help prevent respiratory failure, protect the kidneys, and reduce the risk of wound complications. There is little clinical basis for choosing just 1 of these complications as the primary end point of a trial because each is important. Furthermore, an outcome such as wound complications includes infection, dehiscence, anastomotic leak, abscess, etc. A composite outcome that includes all these potential clinical events thus has better construct validity, meaning that it more fully characterizes the potential overall benefit of guided fluid management, than its single components.

The second major advantage of composite outcomes relates to sample size. Sample size for studies with dichotomous outcomes is determined by the baseline incidence of the outcome and the expected treatment effect. Treatment effect can be enhanced by optimal patient selection but is otherwise a function of the intervention. Baseline outcome incidence, however, is increased in a composite because each component contributes. Thus rare outcomes can be evaluated if a sufficient number are combined into a composite. The alternative approach, considering each component as an independent primary outcome and taking a statistical penalty for using multiple parallel significance tests, each incurring a separate risk of false-positive error, vastly increases the number of patients required. Composites thus allow investigators to simultaneously combine clinically relevant outcomes and reduce sample size.

The most common form of composite is a simple collapsed composite in which the first occurrence of any component makes the composite positive. The general rule for a collapsed composite is that its components should be of comparable severity and have roughly similar incidences. For example, it would be a poor idea to use an infection composite that combines organ space infection, deep sternal infection, abscess, sepsis, and urinary tract infection. Urinary tract infections are far more common than all the other components combined, to say nothing of being far less serious. The proposed composite is thus essentially just a measure of urinary tract infection, which is not what the investigators really want to assess.
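In code, a collapsed composite is simply the indicator that any component occurred. The sketch below is our illustration with hypothetical component names and records; tallying per-component incidences alongside the composite is the quickest way to notice a component, such as urinary tract infection above, that would dominate the summary.

```python
def collapsed_composite(patient_events, components):
    """Return 1 if any listed component occurred for this patient, else 0."""
    return int(any(patient_events.get(c, False) for c in components))

# Hypothetical per-patient event records (True = complication occurred).
patients = [
    {"organ_space_infection": False, "sepsis": False, "uti": True},
    {"organ_space_infection": True,  "sepsis": False, "uti": False},
    {"organ_space_infection": False, "sepsis": False, "uti": False},
]
components = ["organ_space_infection", "sepsis", "uti"]

composite = [collapsed_composite(p, components) for p in patients]
print("composite incidence:", sum(composite) / len(patients))
for c in components:   # check that no single component dominates the composite
    print(c, sum(p.get(c, False) for p in patients) / len(patients))
```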


A further assumption of composite outcomes is that treatments at least change component outcomes in the same direction (i.e., between no change and improved). When outcomes respond quite heterogeneously to a treatment, composites become difficult to interpret. Consider, for example, a composite of vascular complications that includes myocardial infarctions and stroke. β-Blockers affect each differently, significantly reducing myocardial infarctions while increasing stroke.13 Simply adding the incidence of each (1 positive and 1 negative) might wrongly imply that β-blockers have little overall effect, whereas they in fact have clinically important, but divergent, actions.

Fortunately, there are alternatives to simple collapsed composites. In fact, there are many ways to construct and analyze composites, some of which avoid the general requirement for components to have comparable incidences and severities (for more detail about composite outcomes and associated statistical complexities, see Mascha and Imrey14 and Mascha and Sessler15).

SAMPLE SIZE, INTERIM ANALYSES, AND STOPPING RULES

Before the mid-1970s, clinical trialists were faced with an uncomfortable choice between adequate trial monitoring and controlling chance error. Well-designed trials were planned to control the potential for false-positive and false-negative results within what were considered acceptable limits, often 5% for the chance of false-positive results and 10% or 20% for the chance of false-negative results. But such performance characteristics were based on the assumption that trials would be analyzed only once, when results from all patients had been obtained.

Investigators, however, would often monitor accumulating results that, depending on how trends developed, might raise concerns about the ethical basis or financial advisability of continuing to enroll patients. Investigators might also informally evaluate results as they accrued and stop trials when their efficacy results reached statistical significance. The difficulty with this approach is that such multiple unprotected looks at the data for efficacy signals substantially increase a study's chance of false-positive error. Similarly, multiple looks for possible safety concerns increase the chance of stopping too early because of transient random imbalances in adverse events and thus the chance of a false-negative error.

Multiple looks are problematic because trial results are influenced not only by true population differences but also by the vagaries of chance in how these play out over time, as results from a particular clinical trial sample accumulate. Observed values at various times during the trial are thus sometimes greater than the true population differences, if any, and at other times less. The degree to which observed values differ from true population values is a function of sample size, with more variability being observed in small studies, which is why power improves with larger size. The trouble is that investigators who evaluate data as they accrue, even informally, and stop a study when the results look good, are effectively choosing a random high point on the difference curve as it evolves over time. Because the chance that a random curve will meet any particular fixed criterion at any one of many possible times is, by definition, greater than the chance it will meet that criterion at a single predetermined time, random false-positive and false-negative errors will be far more likely with multiple looks than with a single evaluation at the end, unless the statistical significance criterion is made more stringent to account for multiple interrogations of the data. The risk of overstating treatment effects is not restricted to formal analysis and remains even if data are only informally inspected by investigators. Multiple unprotected evaluations of data may, in part, explain why many studies, especially small ones, are subsequently contradicted.16

Group sequential clinical trials, proposed by Pocock et al.17 and extensively developed since, provide a menu of practical solutions to this problem. Within this framework, multiple looks at trial results are completely acceptable if governed by a formal interim analysis plan incorporated in the clinical trial design. Rather than allowing each additional look to further increase the chances of false-positive and/or false-negative conclusions above the target levels, 5% and 20% for instance, group sequential designs reduce the error probabilities at each look, so that the original, planned overall false-positive and false-negative error proportions for the entire study are maintained.

In practice, the total false-positive risk (i.e., α = type 1 statistical error probability) is distributed across each of the analyses at different times to keep the total at 5%, for example. The simplest way to do this, although conservative, is to divide the intended α-risk by the number of analyses. So if there are 3 interim looks planned, plus a final evaluation, each might be assessed at 0.05/4 = 0.0125 = 1.25%. That is, only P values <0.0125 would be considered statistically significant at any given analysis. A similar process can be used to distribute the risk of false-negative error (β = type 2 statistical error probability) over various analysis points (Fig. 1).
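The arithmetic of the equal-split approach is shown below in a short sketch of ours (the article itself provides no code): with 3 interim looks plus a final analysis, each look is tested at 0.05/4 = 0.0125, and the overall chance of a false-positive finding across all 4 looks remains no greater than 5%.

```python
n_looks = 4                        # 3 interim analyses plus the final analysis
alpha_total = 0.05

alpha_per_look = alpha_total / n_looks
print(alpha_per_look)              # 0.0125: the significance threshold applied at every look

# The overall false-positive risk is at most n_looks * alpha_per_look = 0.05 (Bonferroni);
# even if the looks were statistically independent it would be about
# 1 - (1 - alpha_per_look) ** n_looks, roughly 0.049.
print(1 - (1 - alpha_per_look) ** n_looks)
```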
Most investigators regard taking equal chances of error at early and late times as placing too much emphasis and risk of error early, when few results are available, at the expense of too little emphasis late, when most or all the results are known. In practice, various formulas are used to distribute the risks of false-positive and false-negative errors in ways that allow investigators to better satisfy ethical and financial concerns by stopping trials early when data are already sufficiently conclusive as to relative efficacy and/or safety or when it is clear that continuing the trial cannot achieve the intended objective. Most trials are designed to be increasingly sensitive at later times, with the final test, if the trial does not stop early, only slightly more stringent than its fixed-sample-size counterpart.

Often, the weighting over time for α- and β-risks differs. For example, investigators may want to stop the trial early if there is no evidence of efficacy but continue to a much larger number of patients if the treatment looks effective. This is equivalent to cutting your losses when the treatment does not appear effective but requiring a high degree of evidence when it does.

Group sequential designs exact a price for their flexibility. The sample size planned for such a design is always larger than that needed for a single-look design with the same error probabilities (α and β) and hence the same statistical power. However, the difference is usually not large and is mitigated because, in practice, the actual sample size accrued is reduced when the trial stops early. Thus, one must plan for a modestly larger sample size than for a 1-look trial; but on average, group sequential trials require fewer patients, often substantially fewer, because many such trials can be stopped early. Trials that are stopped early for unplanned reasons overstate statistical significance if knowledge of accumulated results informs the stopping decision. But a study that is stopped per rule at a scheduled interim analysis is not a "stopped early" trial because it was, in fact, stopped as planned by design.

Because group sequential designs have so many advantages, many modern trials include 1 or more periodic interim analyses. Most use O'Brien-Fleming stopping boundaries, for which statistical penalties are relatively small. However, each interim look at the data requires special data quality control and database activities, a formal statistical analysis, and review by an executive committee and/or data safety monitoring board. There is thus some fixed cost to each interim analysis. The amount of relevant new information that becomes available during any calendar period between looks, and the potential benefits of stopping at the beginning rather than at the end of the period, will vary among trials and during the course of a single trial, depending on the absolute and relative durations of the patient enrollment, treatment, and follow-up periods and the risks and potential benefits of the experimental treatment. In our experience, ≤3 interim looks usually provide sufficient flexibility at the cost of a small increase in the maximum required patient enrollment and a manageable amount of additional statistical and administrative work.

Depending on the size and nature of the trial, interim analyses might be reported to an executive committee or a completely independent data safety monitoring board. Typically, they are presented on a group A versus group B basis because the decision to stop or continue the trial should usually be independent of the group assignments. Interim analysis results should not be shared with investigators involved in data collection, to reduce the risk of bias. Although the statistical properties of group sequential trial designs presume strict adherence to their stopping rules, in practice such bodies consider these statistical rules as guidance, taking into account the scientific context of the study and other factors when making stop-versus-continue decisions.


Figure 1. Observed standardized differences in primary postoperative sore throat proportions (green line) at 2 interim analyses (at N = 79 and N = 179) and the final analysis (at N = 235) of a clinical trial comparing an experimental licorice-based gargle to a sugar solution for prevention of sore throat and postextubation coughing. The upper and lower blue areas are stopping regions for benefit and harm of the licorice solution, respectively; the pink region designates early stopping for futility, and nonsignificant final differences. Although statistically significant efficacy was demonstrated at the second interim analysis based on 150 patients, this minimal-risk study was nevertheless continued to 235 patients to assess treatment benefit more precisely. This trial had a >50% chance of early stopping after 150 patients due to futility, had there been no real licorice benefit. Reprinted from Ruetzler K, et al. A randomized, double-blind comparison of licorice versus sugar-water gargle for prevention of postoperative sore throat and postextubation coughing. Anesth Analg 2013;117:614–21.

For studies with normally distributed continuous outcomes, the primary determinants of sample size are outcome variability and treatment effect (the difference between the study groups). For dichotomous outcomes, the primary determinants are baseline incidence and treatment effect. Furthermore, the number of study subjects needed increases rapidly, in proportion to the inverse square, as the treatment effect is reduced. Although less important than whether data are continuous or dichotomous, the statistical analysis approach also influences the sample size.

At a baseline incidence of 20%, for example, it takes only 398 patients—199 per group—to give a balanced 2-arm study 80% power at a 5% α for detecting a 50% reduction in the outcome incidence. But there are few new interventions that provide anything resembling a 50% treatment effect. Given the same baseline risk, powering the study for a 25% risk reduction would require 1812 patients. And for a 20% risk reduction, 2894 patients are needed. The difficulty is that 20% is often the largest treatment effect that might realistically be expected, and a trial of that size might miss even smaller effects that would be considered highly clinically important.
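Those figures follow from the standard normal-approximation sample-size formula for comparing 2 proportions; the sketch below is our code (not the authors'), and it reproduces the 398, 1812, and 2894 totals quoted above for a 20% baseline incidence, 80% power, and a 2-sided α of 5%.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Patients per group to compare 2 proportions (normal approximation,
    2-sided test); p1 = control incidence, p2 = expected treated incidence."""
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = za * sqrt(2 * p_bar * (1 - p_bar)) + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((num / (p1 - p2)) ** 2)

baseline = 0.20
for reduction in (0.50, 0.25, 0.20):
    n = n_per_group(baseline, baseline * (1 - reduction))
    print(f"{reduction:.0%} risk reduction: {n} per group, {2 * n} total")
# 50% -> 199 per group (398 total); 25% -> 906 (1812); 20% -> 1447 (2894)
```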
Of course, the treatment effect is not known when investigators begin a trial. After all, the point of a trial is exactly to determine the treatment effect. Nonetheless, comparative clinical trials would rarely meet ethical standards unless based on reasonable mechanistic, animal, and some human data. Thus investigators usually have at least some basis for expecting an effect of a given magnitude. Another consideration is clinical importance; there is no point in powering a study for an effect that would not be considered clinically important. The general approach is thus to power trials for plausibly anticipated treatment effects of clinically important magnitude.

Sample size estimation is straightforward for single-look, 2-group trials. But the calculations quickly get difficult, and numerical simulations may be required when dealing with multivariable analysis, nonparametric distributions, composite outcomes, factorial designs, multiple looks, survival or cumulative incidence analyses, and other design complexities.15,18 But in all cases, the most problematic part for investigators is developing realistic estimates of baseline incidence, population variability, treatment effect, and the potential patient enrollment rate. Too often, these estimates are optimistic, resulting in underpowered trials.

EQUIVALENCE AND NONINFERIORITY TRIALS

A placebo is the natural comparator in a trial conducted simply to show that an experimental treatment provides some degree of benefit. Historically, such placebo-controlled trials have confirmed scientific and clinical breakthroughs against previously untreatable medical problems. When comparing a novel experimental treatment with an inactive placebo, the only relevant question is superiority of the presumably active experimental agent. In practice, such studies often use 2-sided statistical hypothesis testing to recognize harm when it occurs, but the hope is to demonstrate a difference in patient responses that favors the treatment.

Increasingly, however, new treatments emerge for problems for which ≥1 existing therapy is already accepted as effective. In such cases, comparison with placebo may be insufficient to support the use of the new treatment, and the provision of some active therapy to research subjects is usually considered an ethical imperative. Instead of superiority to a placebo, investigators may thus seek to show that a novel intervention is at least as good as an active, accepted alternative. For example, a new patient-warming system that is less expensive and easier to use might be considered an improvement over a conventional system, even if it does not warm patients any better.

The statistical approach in such cases is to demonstrate that the new treatment, while it may or may not be superior to its competitor, is not enough worse to clinically matter. Not enough worse, euphemistically termed noninferiority, is defined in terms of a clinically important noninferiority margin, a performance decrement defined as the boundary between a reduction in performance that is ignorable and one that is considered unacceptable. Noninferiority is demonstrated when a 1-sided confidence interval for a comparative measure of treatment efficacy excludes underperformance by greater than or equal to the noninferiority margin.
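As a concrete illustration of that rule, the sketch below (ours, with invented numbers) checks whether the lower limit of a 1-sided 97.5% confidence interval for the difference between a new and a standard treatment stays above the prespecified noninferiority margin; the estimate, standard error, and margin are hypothetical.

```python
from statistics import NormalDist

def noninferior(estimate, std_err, margin, alpha=0.025):
    """Declare noninferiority if the 1-sided (1 - alpha) confidence interval for
    (new minus standard) excludes a deficit as large as `margin`.
    Larger outcome values are assumed to be better."""
    z = NormalDist().inv_cdf(1 - alpha)
    lower_limit = estimate - z * std_err
    return lower_limit > -margin, lower_limit

# Hypothetical example: the new warming system averages 0.1 degree C worse
# (estimate = -0.1) with SE = 0.15, and a 0.5 degree C deficit was prespecified
# as the largest clinically ignorable difference.
ok, lower = noninferior(estimate=-0.1, std_err=0.15, margin=0.5)
print(ok, round(lower, 3))   # True, -0.394: the interval excludes a 0.5 degree deficit
```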


Equivalence trials are symmetric, bidirectional noninferiority trials: "not enough different" must be shown in both directions, such as by a 2-sided confidence interval that excludes sufficiently large differences favoring either of the treatments being compared. Although noninferiority studies are far more common, equivalence studies are needed when the purpose of the treatment is to regulate and maintain a physiologic parameter within an established target range, as for hormonal agents such as thyroxine and insulin or for anticoagulants such as heparin and warfarin. The relationships between superiority, equivalence, and noninferiority approaches are shown in Figure 2 (for additional detail, see a recent review19).

Figure 2. Sample confidence intervals and inference for trials assessing superiority, noninferiority, or equivalence of treatment T to standard S. NI = noninferiority. Notice, for example, that confidence intervals for results B and C are identical, but the permitted conclusions are quite different because the designs and hypotheses differ. [Reprinted with modifications from Mascha EJ, Sessler DI. Equivalence and noninferiority testing in regression models and repeated-measures designs. Anesth Analg 2011;112:678–87.]

TRIAL SIZE

Trial size matters!20 In 1000 independent flips of a coin, the proportion of heads will very likely fall within ±3% of the expected proportion of 50% and thus give a precise estimate of the truth, meaning the coin's long-run proportion, or chance, of heads. Because flipping the coin 1000 times again would likely produce another result within the same range, the 2 sequences of coin flips will be fairly close, usually within 6% of one another.

But now consider flipping the same coin just 10 times. It would be unsurprising to observe a proportion of heads anywhere from 30% to 70%, or for that matter from 20% to 80%. Nor would it be surprising to find that the proportion of heads in a second 10 flips departs considerably from the proportion among the first 10. Such a study (10 coin flips) therefore does not produce either a stable result from one repetition to another or a precise estimate of the truth: its results are fragile. This intuitive comparison illustrates the probabilistic Central Limit Theorem that, other things being equal, the precision of a study improves in proportion to the square root of the number of subjects. Or, to put this in practical terms, the larger the study, the more likely it is that the results will closely reflect the target characteristics of the underlying biological system and be closely replicated if the study were repeated.21

Another reason trial size matters is that randomization only assures that the treatment groups will be comparable in sufficiently large trials. In large trials, treatment groups will almost always be very similar. But in small trials, by pure bad luck, the treatment groups may differ enough for such differences to distort (by confounding) the results. The number of patients required to provide reasonable assurance of baseline homogeneity is considerably larger than generally appreciated. For example, with 100 patients per group in a 2-arm trial in a population with men and women equally represented, the fractions of men and women will differ between groups by >10% in 15% of such trials and by >7% in a third of them.
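Both figures are easy to reproduce by simulation. The sketch below is ours: it randomizes 100 patients per group, with each enrolled patient male with probability 0.5, and estimates how often the male fractions differ by more than 10% or more than 7%; the estimates land near the 15% and one-third figures quoted above (exact values depend on how boundary cases are counted).

```python
import random

def fraction_imbalanced(threshold, n_per_group=100, p_male=0.5,
                        n_sims=20_000, seed=3):
    """Estimate how often the proportion of men differs between 2 randomized
    groups by more than `threshold`, when each patient is male with
    probability `p_male`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        men_a = sum(rng.random() < p_male for _ in range(n_per_group))
        men_b = sum(rng.random() < p_male for _ in range(n_per_group))
        if abs(men_a - men_b) / n_per_group > threshold:
            hits += 1
    return hits / n_sims

print(fraction_imbalanced(0.10))   # about 0.14: close to the ~15% quoted above
print(fraction_imbalanced(0.07))   # about 0.29: close to the "third of them" quoted above
```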
TRIAL SIZE

Trial size matters!20 In 1000 independent flips of a coin, the proportion of heads will very likely fall within ±3% of the expected proportion of 50% and thus give a precise estimate of the truth, meaning the coin's long-run proportion, or chance, of heads. Because flipping the coin 1000 times again would likely produce another result within the same range, the 2 sequences of coin flips will be fairly close, usually within 6% of one another.

But now consider flipping the same coin just 10 times. It would be unsurprising to observe a proportion of heads anywhere from 30% to 70%, or for that matter from 20% to 80%. Nor would it be surprising to find that the proportion of heads in a second 10 flips departs considerably from the proportion among the first 10. Such a study (10 coin flips) therefore does not produce either a stable result from one repetition to another or a precise estimate of the truth: its results are fragile. This intuitive comparison illustrates a consequence of the probabilistic Central Limit Theorem: other things being equal, the precision of a study improves in proportion to the square root of the number of subjects. Or, to put this in practical terms, the larger the study, the more likely it is that the results will closely reflect the target characteristics of the underlying biological system and be closely replicated if the study was repeated.21

Another reason trial size matters is that randomization only assures that the treatment groups will be comparable in sufficiently large trials. In large trials, treatment groups will almost always be very similar. But in small trials, by pure bad luck, the treatment groups may differ enough for such differences to distort (by confounding) the results. The number of patients required to provide reasonable assurance of baseline homogeneity is considerably larger than generally appreciated. For example, with 100 patients per group in a 2-arm trial in a population with men and women equally represented, the fractions of men and women will differ between groups by >10% in 15% of such trials and by >7% in a third of them.

As noted above, the number of patients needed for various types of studies depends on a host of factors, including the type of outcome, its incidence if dichotomous and variability if quantitative, the duration of follow-up, and the outcome difference between the experimental and control treatments (treatment effect) that the study is intended to detect. But larger trials increase the precision in estimating the treatment effects and always enhance the confidence in the results.22 Large randomized trials often reverse medical practices that were based on small studies.23
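Both of these quantitative points, precision improving with the square root of the trial size and the surprising frequency of baseline imbalance at 100 patients per group, can be checked with a short simulation; the sketch below uses invented settings (a fair coin, a 50/50 sex split, and an arbitrary random seed) purely for illustration.

```python
# Sketch: two simulations behind the claims above. Trial counts, seeds, and the
# 50/50 sex split are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_sim = 100_000

# (1) Precision: spread of the observed proportion of heads for 10 vs 1000 flips.
for n_flips in (10, 1000):
    p_hat = rng.binomial(n_flips, 0.5, n_sim) / n_flips
    lo, hi = np.percentile(p_hat, [2.5, 97.5])
    print(f"{n_flips:4d} flips: 95% of observed proportions fall in {lo:.2f}-{hi:.2f}")

# (2) Baseline balance: with 100 patients per arm from a population that is 50%
# men, how often do the arms differ by more than 10 (or 7) percentage points?
n_per_arm = 100
men_a = rng.binomial(n_per_arm, 0.5, n_sim)   # number of men in arm A
men_b = rng.binomial(n_per_arm, 0.5, n_sim)   # number of men in arm B
gap = np.abs(men_a - men_b)                   # difference in percentage points
print("fraction of trials with gap > 10 points:", np.mean(gap > 10).round(2))
print("fraction of trials with gap >  7 points:", np.mean(gap > 7).round(2))
```

The simulated frequencies land near the figures quoted above; the exact values depend on whether the boundary itself is counted as an imbalance.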


It is worth considering that with a P value of 0.05, which is generally considered significant, the probability of a repeat trial being significant when the first trial has accurately estimated the true treatment effect is only 50%. The P value needs to be 0.005 for this replication probability to reach the conventional power criterion of 80% and 0.0001 to reach 95%.24 These criteria have recently been suggested as replacements for the 5% and 1% P value criteria by which results are currently conventionally labeled statistically significant, or highly statistically significant.25 Note that significance conventionally means that, were there no real treatment effect, <5% of trials would be expected to produce a difference between outcomes of treatment and control groups as large as that observed (ignoring possible issues of bias). But that is not really what clinicians need to know; instead, they need bounds on the confidence intervals around the observed treatment effect that are narrow enough to be clinically useful. Tight confidence intervals result from sample sizes well exceeding those required to barely confirm statistical significance.
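The replication figures quoted above can be approximated with a simple normal-theory calculation; the sketch below assumes two-sided testing and that the first trial's observed effect equals the true effect, so it reproduces the flavor of the argument rather than Goodman's exact derivation.

```python
# Sketch of the replication calculation described above: if the observed effect
# from a first trial is taken as the true effect, what is the chance that an
# identical repeat trial reaches P < 0.05? Two-sided tests and normal
# approximations are assumed throughout.
from statistics import NormalDist

def replication_probability(p_observed, alpha=0.05):
    z_obs = NormalDist().inv_cdf(1 - p_observed / 2)  # z corresponding to observed P
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # z the repeat trial must exceed
    # Under these assumptions the repeat trial's z-statistic ~ Normal(z_obs, 1).
    return 1 - NormalDist().cdf(z_crit - z_obs)

for p in (0.05, 0.005):
    print(f"observed P = {p}: replication probability ~ {replication_probability(p):.2f}")
```

Under these conventions an observed P of 0.05 replicates only about half the time and 0.005 about 80% of the time; the exact P value needed for 95% replication depends on the one- versus two-sided conventions used.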
CLINICAL EQUIPOISE

The ethical basis for randomized trials is equipoise. The term means that investigators and oversight bodies collectively judge that the bases for favoring each treatment are roughly balanced, so that the groups receiving each treatment are equally likely to receive the greater benefit, with benefit being broadly defined. For example, a novel intervention might be considered beneficial if it provided: (1) a superior treatment effect; (2) a comparable treatment effect with fewer side effects; (3) a comparable treatment effect that is easier for patients or physicians to implement; or (4) a comparable treatment effect at lower cost. Of course, many other cost-benefit combinations are possible.

Equipoise is generally achieved by comparing an experimental treatment with the best existing treatment, to which it may or may not prove superior. A major task for IRBs is to judge that there truly is equipoise and that patients in the control group used for comparison are getting the best current treatment and thus are not being disadvantaged by participation in the trial. Note that the control group should receive at least best current local practice and preferably best available treatment. Thus, placebo controls should be restricted to situations in which there is no generally accepted treatment or in which suitable rescue with clearly effective drugs is provided (as in many pain studies). For example, it would seem questionable these days to conduct an antiemetic trial in which high-risk control patients were given a placebo now that many safe and effective treatments are available.

An important point is that during a trial, individuals may have strong (and varied) opinions about which treatment is best. Nonetheless, participants and safety monitors may agree that the trial will provide crucial evidence for adjudicating these opinions. In other words, individual opinions and expectations do not preclude equipoise when there is a consensus that additional evidence is required. While it is tempting to assume that a novel treatment will be better (the source of much measurement bias), that turns out to be the case well under half of the time in substantial trials.26 History documents that it is very hard to predict whether a novel treatment will actually prove superior in terms of efficacy and safety—which is exactly why trials are necessary.

THREATS TO VALIDITY

Randomization and blinding provide excellent protection against selection bias, confounding, and measurement bias. However, there are numerous other threats to trial validity that can compromise conclusions. Consequently, the fraction of unrepeatable published trials is probably far higher than generally appreciated.16,22 Care in design, conduct, and reporting of trials is associated with improved validity.27

Inadequate trial size, as discussed above, reduces trial validity. But there are many other subtle threats to individual studies and to the literature in general. Among the most important is publication bias, which has considerable potential to seriously distort the medical literature and is hardly restricted to clinical trials.28

There are 3 main reasons why studies may not be published. The first is that the study might not even be completed. To the extent that investigators see negative results while the study is in progress, they may decide not to complete the project, or a sponsor may decline to continue funding it. (The bias inherent in this decision is why investigators and sponsors should normally be blinded to trial results, and why results should only be evaluated at predefined interim analysis points.) Trials may also be stopped because of inadequate enrollment.29 The resulting underpowered study may not be publishable even if the investigators wish otherwise, but often they do not even try.

The second reason negative studies may not be published is that investigators often do not even submit negative results for peer review, concluding that the results are uninteresting or unlikely to be accepted. And to some extent, they may be correct that their work may not be accepted by their journal(s) of choice, which is the third reason negative results may remain unpublished. A more subtle problem is that corporate sponsors may deliberately restrict submission of negative results.30

Competent studies with negative results should be accepted for publication, but they may be given lower priority than positive results by reviewers and editors. The pervasive damage from publication bias is now better understood by editors, who seem more willing to publish high-quality negative results. But to the extent that studies are stopped early without statistical justification, results are likely to be underpowered, with an ambiguous rather than a truly negative result, reducing a publication's attractiveness to journal editors.

The primary result of publication bias is that, for 1 reason or another, negative results are less likely to appear in print than positive ones. Obviously, this distorts the literature and makes it more likely that a clinician will conclude—based on available evidence—that a treatment is effective when it actually is not. The problem is especially serious when available results are included in systematic reviews and meta-analyses that essentially assume that all results are available for analysis. Nonetheless, evidence suggests that meta-analyses of small trials are generally consistent with subsequent large trials.31


Design choices obviously influence the interpretation of study results. For example, using a placebo control rather than the best currently available treatment makes experimental interventions appear superior. Although considered unethical for serious conditions where there are effective alternatives, the practice continues because of pressure from sponsors and because the expected larger treatment effect allows smaller and less expensive trials. A more insidious approach is the use of control and experimental drug doses that are not comparable; to the extent that the experimental dose is functionally higher, its efficacy will presumably be enhanced.

Loss to follow-up (attrition bias) is another subtle way trial results can be compromised. Consider, for example, a trial comparing an effective analgesic—say morphine—with an ineffective experimental treatment. Patients randomly assigned to the experimental treatment will find their analgesia inadequate, and many will drop out of the study. And once out, they will no longer contribute pain scores or other outcome data. Those who remain will be the ones who have the least pain or are otherwise pain tolerant. Their pain scores will typically be considerably lower than those of the patients on the experimental therapy who dropped out. Average results for patients who remained in the study will thus far overestimate the analgesic benefit of the experimental treatment, perhaps even making it appear comparable with morphine.

Intention-to-treat analysis is a widely used strategy for countering such postrandomization biases, which, in effect, break the balance between treatment groups fostered by the randomization process. The intent-to-treat concept is neatly captured by the catch-phrase "Where randomized, there analyzed." This analytic philosophy retains patients in the group to which they were randomly assigned, no matter what occurs during the trial. Patients who drop out of the study before the recording of an outcome are analyzed by using an imputed outcome, preferably by a multiple imputation technique grounded in statistical theory, but sometimes in practice by an average value from other patients or that patient's last observation carried forward. Patients who remain in a study but are noncompliant with the assigned treatment, even those who switch to another treatment and/or never receive the one randomly assigned, are analyzed as members of their original randomized group.

The intention-to-treat approach is counterintuitive but well justified when understood. Retaining randomized groups as originally constituted assures that, in the analysis, departures from assigned treatments and differential dropout will not exaggerate true treatment differences. In contrast, if dropouts and treatment crossovers are excluded from analysis, or crossovers are analyzed in the groups to which they move, or other ad hoc approaches are used, biases of unknown magnitude and direction may occur. Intention-to-treat analyses thus protect against exaggeration of a true treatment effect, or even production of a spurious one, at the expense of potentially underestimating treatment effects and diminishing study power. But because intention-to-treat analyses also attenuate differences in harm, they are generally used only for efficacy comparisons, with adverse events compared between groups reflecting treatments actually received. Comparisons on this basis are termed per-protocol analyses and are often used for secondary analyses.
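The morphine example above is easy to simulate; the sketch below uses invented pain scores and an arbitrary dropout rule, and substitutes simple last-observation-carried-forward imputation for the preferred multiple-imputation approach only to keep the code short.

```python
# Sketch of attrition bias in a trial of an ineffective analgesic: patients in
# severe pain drop out, so averaging only completers flatters the treatment.
# Pain scores run from 0 (none) to 10 (worst); all numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
n = 500  # patients randomized to the ineffective experimental arm

week1_pain = np.clip(rng.normal(6, 2, n), 0, 10)               # early on-treatment score
final_pain = np.clip(week1_pain + rng.normal(0, 1, n), 0, 10)  # no real improvement

dropped = week1_pain > 7          # patients with severe early pain leave the study

completers = final_pain[~dropped].mean()      # analysis of observed outcomes only

locf = final_pain.copy()
locf[dropped] = week1_pain[dropped]           # carry the last observed score forward
itt_locf = locf.mean()                        # intention-to-treat: everyone analyzed

print(f"mean final pain if no one had dropped out: {final_pain.mean():.1f}")
print(f"completers-only mean:                      {completers:.1f}  <- biased low")
print(f"ITT mean with LOCF imputation:             {itt_locf:.1f}")
```

With these invented numbers the completers-only average is roughly a point lower than the intention-to-treat estimate, flattering a treatment that produced no improvement at all.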
CONSORT REPORTING GUIDELINES

Various checklists and guidelines have been proposed for reporting of clinical trials, with the Consolidated Standards of Reporting Trials (CONSORT) guidelines being by far the most commonly used. Some journals now require authors to submit a specific checklist documenting that all relevant elements are addressed in their articles. But, with or without a formal checklist, skilled reviewers will expect to see the relevant critical components addressed.32,33

The CONSORT checklist includes 25 items, many with subelements. Among these, perhaps the most important are (1) type of trial (i.e., parallel group, factorial); (2) important changes after starting enrollment; (3) eligibility criteria; (4) interventions, with sufficient detail to allow replication; (5) completely defined, prespecified primary and secondary outcomes; (6) interim analyses, stopping rules, and sample size determination and justification; (7) how randomization was generated and allocation concealed; (8) blinding; (9) statistical methods; (10) subject enrollment, participation, and loss (usually presented as a diagram); (11) why the study ended, if not taken to completion; (12) baseline characteristics for each group (usually a table); (13) outcomes with appropriate measures of variance; (14) harms in each group; and (15) public trial registration (registry and number).

The most obvious way for investigators to be sure of being able to provide each of the CONSORT elements is to incorporate them into protocols. Investigators can also consult the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT) checklist during the trial design phase.34

CONCLUSIONS

Randomized assignment of treatment excludes reverse causation and selection bias and, in sufficiently large studies, effectively prevents confounding. Well-implemented blinding prevents measurement bias. Studies that include these protections are called randomized, blinded clinical trials and, when conducted with sufficient numbers of patients, provide the most valid results.

Although conceptually straightforward, design of clinical trials requires thoughtful trade-offs among competing approaches—all of which influence the number of patients required, enrollment time, internal and external validity, ability to evaluate interactions among treatments, and cost.

Because randomized trials are so expensive and time consuming, the number of patients required is an overriding concern. Many trials evaluate mediators or intermediate outcomes, often characterized by continuous measurements. Hard outcomes (i.e., myocardial infarction, respiratory arrest, or death) that are much more serious and interesting are usually dichotomous and thus require far more patients. Subject selection also influences the sample size by increasing baseline event rates or reducing variability.

Crossover designs reduce the required sample size but are limited to short-term interventions with minimal carry-over effects. Factorial designs are efficient, have the capacity to detect interactions, and can be used with any type of outcome. And novel study designs, such as alternating intervention approaches, can speed enrollment and enhance efficiency. The field of clinical trial design is rapidly evolving, and this review has attempted only a basic introduction.
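As a rough illustration of why hard dichotomous outcomes demand so many more patients than continuous intermediate outcomes, the following sketch compares standard normal-approximation sample-size formulas under invented effect sizes; it is not a substitute for formal power calculations.

```python
# Sketch: normal-approximation per-group sample sizes at two-sided alpha = 0.05
# and 90% power, for (a) a continuous outcome shifted by 0.5 SD and (b) a hard
# dichotomous outcome reduced from 4% to 3%. Effect sizes and event rates are
# invented; real planning would use dedicated software and account for the
# design features discussed in this review.
from statistics import NormalDist

alpha, power = 0.05, 0.90
z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)  # ~1.96 + 1.28

# (a) Continuous outcome: detect a shift of 0.5 standard deviations.
delta_sd = 0.5
n_continuous = 2 * (z / delta_sd) ** 2

# (b) Dichotomous outcome: detect a drop in event rate from 4% to 3%.
p1, p2 = 0.04, 0.03
n_binary = z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

print(f"continuous outcome, 0.5 SD shift: ~{round(n_continuous)} patients per group")
print(f"event rate 4% -> 3%:              ~{round(n_binary)} patients per group")
```

With these assumed inputs, the continuous outcome needs fewer than 100 patients per group whereas the dichotomous one needs several thousand, which is the disparity noted above.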


Estimating the number of patients required for a trial can range from trivial to extraordinarily complex depending on the trial design and is further complicated by inclusion of interim analyses. Even beyond the basics of selecting a valid design for a particular problem, choices made within the context of a general trial design have considerable potential for enhancing a trial's strength and feasibility.

DISCLOSURES
Name: Daniel I. Sessler, MD.
Contribution: This author helped write the manuscript.
Attestation: Daniel I. Sessler approved the final manuscript.
Name: Peter B. Imrey, PhD.
Contribution: This author helped design the study and write the manuscript.
Attestation: Peter B. Imrey approved the final manuscript.
This manuscript was handled by: Steven L. Shafer, MD.

REFERENCES
1. Devereaux PJ, Yusuf S. The evolution of the randomized controlled trial and its role in evidence-based decision making. J Intern Med 2003;254:105–13
2. Kaptchuk TJ, Friedlander E, Kelley JM, Sanchez MN, Kokkotou E, Singer JP, Kowalczykowski M, Miller FG, Kirsch I, Lembo AJ. Placebos without deception: a randomized controlled trial in irritable bowel syndrome. PLoS One 2010;5:e15591
3. Deans KJ, Minneci PC, Danner RL, Eichacker PQ, Natanson C. Practice misalignments in randomized controlled trials: identification, impact, and potential solutions. Anesth Analg 2010;111:444–50
4. Deans KJ, Minneci PC, Suffredini AF, Danner RL, Hoffman WD, Ciu X, Klein HG, Schechter AN, Banks SM, Eichacker PQ, Natanson C. Randomization in clinical trials of titrated therapies: unintended consequences of using fixed treatment protocols. Crit Care Med 2007;35:1509–16
5. Sessler DI, Devereaux PJ. Emerging trends in clinical trial design. Anesth Analg 2013;116:258–61
6. Devereaux PJ, Mrkobrada M, Sessler DI, Leslie K, Alonso-Coello P, Kurz A, Villar JC, Sigamani A, Biccard BM, Meyhoff CS, Parlow JL, Guyatt G, Robinson A, Garg AX, Rodseth RN, Botto F, Lurati Buse G, Xavier D, Chan MT, Tiboni M, Cook D, Kumar PA, Forget P, Malaga G, Fleischmann E, Amir M, Eikelboom J, Mizera R, Torres D, Wang CY, VanHelder T, Paniagua P, Berwanger O, Srinathan S, Graham M, Pasin L, Le Manach Y, Gao P, Pogue J, Whitlock R, Lamy A, Kearon C, Baigent C, Chow C, Pettit S, Chrolavicius S, Yusuf S; POISE-2 Investigators. Aspirin in patients undergoing noncardiac surgery. N Engl J Med 2014;370:1494–503
7. Devereaux PJ, Sessler DI, Leslie K, Kurz A, Mrkobrada M, Alonso-Coello P, Villar JC, Sigamani A, Biccard BM, Meyhoff CS, Parlow JL, Guyatt G, Robinson A, Garg AX, Rodseth RN, Botto F, Lurati Buse G, Xavier D, Chan MT, Tiboni M, Cook D, Kumar PA, Forget P, Malaga G, Fleischmann E, Amir M, Eikelboom J, Mizera R, Torres D, Wang CY, Vanhelder T, Paniagua P, Berwanger O, Srinathan S, Graham M, Pasin L, Le Manach Y, Gao P, Pogue J, Whitlock R, Lamy A, Kearon C, Chow C, Pettit S, Chrolavicius S, Yusuf S; POISE-2 Investigators. Clonidine in patients undergoing noncardiac surgery. N Engl J Med 2014;370:1504–13
8. Apfel CC, Korttila K, Abdalla M, Kerger H, Turan A, Vedder I, Zernak C, Danner K, Jokela R, Pocock SJ, Trenkler S, Kredel M, Biedler A, Sessler DI, Roewer N; IMPACT Investigators. A factorial trial of six interventions for the prevention of postoperative nausea and vomiting. N Engl J Med 2004;350:2441–51
9. Landsberger H. Hawthorne Revisited: Management and the Worker: Its Critics, and Developments in Human Relations in Industry. Ithaca, NY: Cornell University Press
10. Huang SS, Septimus E, Kleinman K, Moody J, Hickok J, Avery TR, Lankiewicz J, Gombosev A, Terpstra L, Hartford F, Hayden MK, Jernigan JA, Weinstein RA, Fraser VJ, Haffenreffer K, Cui E, Kaganov RE, Lolans K, Perlin JB, Platt R; CDC Prevention Epicenters Program; AHRQ DECIDE Network and Healthcare-Associated Infections Program. Targeted versus universal decolonization to prevent ICU infection. N Engl J Med 2013;368:2255–65
11. Kappen TH, Moons KG, van Wolfswinkel L, Kalkman CJ, Vergouwe Y, van Klei WA. Impact of risk assessments on prophylactic antiemetic prescription and the incidence of postoperative nausea and vomiting: a cluster-randomized trial. Anesthesiology 2014;120:343–54
12. Kopyeva T, Sessler DI, Weiss S, Dalton JE, Mascha EJ, Lee JH, Kiran RP, Udeh B, Kurz A. Effects of volatile anesthetic choice on hospital length-of-stay: a retrospective study and a prospective trial. Anesthesiology 2013;119:61–70
13. Devereaux PJ, Yang H, Yusuf S, Guyatt G, Leslie K, Villar JC, Xavier D, Chrolavicius S, Greenspan L, Pogue J, Pais P, Liu L, Xu S, Malaga G, Avezum A, Chan M, Montori VM, Jacka M, Choi P. Effects of extended-release metoprolol succinate in patients undergoing non-cardiac surgery (POISE trial): a randomised controlled trial. Lancet 2008;371:1839–47
14. Mascha EJ, Imrey PB. Factors affecting power of tests for multiple binary outcomes. Stat Med 2010;29:2890–904
15. Mascha EJ, Sessler DI. Design and analysis of studies with binary-event composite endpoints. Anesth Analg 2011;112:1461–71
16. Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA 2005;294:218–28
17. Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med 2002;21:2917–30
18. Mascha EJ, Turan A. Joint hypothesis testing and gatekeeping procedures for studies with multiple endpoints. Anesth Analg 2012;114:1304–17
19. Mascha EJ, Sessler DI. Equivalence and noninferiority testing in regression models and repeated-measures designs. Anesth Analg 2011;112:678–87
20. Devereaux PJ, Chan MT, Eisenach J, Schricker T, Sessler DI. The need for large clinical studies in perioperative medicine. Anesthesiology 2012;116:1169–75
21. Walsh M, Srinathan SK, McAuley DF, Mrkobrada M, Levine O, Ribic C, Molnar AO, Dattani ND, Burke A, Guyatt G, Thabane L, Walter SD, Pogue J, Devereaux PJ. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. J Clin Epidemiol 2014;67:622–8
22. Ioannidis JP. Why most published research findings are false. PLoS Med 2005;2:e124
23. Prasad V, Vandross A, Toomey C, Cheung M, Rho J, Quinn S, Chacko SJ, Borkar D, Gall V, Selvaraj S, Ho N, Cifu A. A decade of reversal: an analysis of 146 contradicted medical practices. Mayo Clin Proc 2013;88:790–8
24. Goodman SN. A comment on replication, p-values and evidence. Stat Med 1992;11:875–9
25. Johnson VE. Revised standards for statistical evidence. Proc Natl Acad Sci U S A 2013;110:19313–7
26. Gordon D, Taddei-Peters W, Mascette A, Antman M, Kaufmann PG, Lauer MS. Publication of trials funded by the National Heart, Lung, and Blood Institute. N Engl J Med 2013;369:1926–34
27. Nowbar AN, Mielewczik M, Karavassilis M, Dehbi HM, Shun-Shin MJ, Jones S, Howard JP, Cole GD, Francis DP; DAMASCENE Writing Group. Discrepancies in autologous bone marrow stem cell trials and enhancement of ejection fraction (DAMASCENE): weighted regression and meta-analysis. BMJ 2014;348:g2688
28. Dwan K, Altman DG, Arnaiz JA, Bloom J, Chan AW, Cronin E, Decullier E, Easterbrook PJ, Von Elm E, Gamble C, Ghersi D, Ioannidis JP, Simes J, Williamson PR. Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS One 2008;3:e3081
29. Kasenda B, von Elm E, You J, Blümle A, Tomonaga Y, Saccilotto R, Amstutz A, Bengough T, Meerpohl JJ, Stegert M, Tikkinen KA, Neumann I, Carrasco-Labra A, Faulhaber M, Mulla SM, Mertz D, Akl EA, Bassler D, Busse JW, Ferreira-González I, Lamontagne F, Nordmann A, Gloy V, Raatz H, Moja L, Rosenthal R, Ebrahim S, Schandelmaier S, Xin S, Vandvik PO, Johnston BC, Walter MA, Burnand B, Schwenkglenks M, Hemkens LG, Bucher HC, Guyatt GH, Briel M. Prevalence, characteristics, and publication of discontinued randomized trials. JAMA 2014;311:1045–51


30. Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med 2008;358:252–60
31. Cappelleri JC, Ioannidis JP, Schmid CH, de Ferranti SD, Aubert M, Chalmers TC, Lau J. Large trials vs meta-analysis of smaller trials: how do their results compare? JAMA 1996;276:1332–8
32. Schulz KF, Altman DG, Moher D; CONSORT Group. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMC Med 2010;8:18
33. Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c869
34. Chan AW, Tetzlaff JM, Gøtzsche PC, Altman DG, Mann H, Berlin JA, Dickersin K, Hróbjartsson A, Schulz KF, Parulekar WR, Krleza-Jeric K, Laupacis A, Moher D. SPIRIT 2013 explanation and elaboration: guidance for protocols of clinical trials. BMJ 2013;346:e7586
