
New Methods and Software for Designing Adaptive Clinical Trials of New Medical Treatments

Michael Rosenblum1, PhD, Jon Arni Steingrimsson2, PhD, Josh Betz1, MS

Affiliations: 1Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA; 2Department of Biostatistics, Brown University, Providence, RI 02903, USA

Original Project Title: Innovative Randomized Trial Designs to Generate Stronger Evidence about Subpopulation Benefits and Harms PCORI ID: ME-1306-03198 HSRProj ID: 20143600 Institution: Johns Hopkins University

To cite this document, please use: Rosenblum M, Steingrimsson JA, Betz J. (2019). New Methods and Software for Designing Adaptive Clinical Trials of New Medical Treatments. Washington, DC: Patient-Centered Outcomes Research Institute (PCORI). https://doi.org/10.25302/10.2019.ME.130603198

Table of Contents

Abstract
Background
Specific Aims
Participation of patients and/or other stakeholders in the design and conduct of research and dissemination of findings
Methods
    Aim 1: Develop and evaluate adaptive enrichment designs for time-to-event and other delayed endpoints
    Aim 2: Conduct extensive simulation studies
    Aim 3: Produce user-friendly, free, open-source software to optimize our adaptive enrichment designs and compare performance versus standard designs
Discussion
    Study results in context
    Uptake of study results
    Study limitations
    Future research
Conclusions
References
Related Publications
    Published Manuscripts
Acknowledgments

Abstract

Background: Standard designs aim to determine whether a treatment is beneficial, on average, for a target population. Such trials can have low power if the treatment only benefits a subpopulation, e.g., defined by severity, a biomarker, or a risk score at baseline. Randomized trial designs that adaptively change enrollment criteria during a trial, called adaptive enrichment designs, have the potential to provide improved information about which subpopulations benefit from new treatments.

Objectives: We aimed to (i) develop new adaptive enrichment designs and prove their key statistical properties; (ii) conduct simulations that mimic features of completed trial data sets in order to evaluate the new trial designs' performance (such as sample size, duration, power, and bias); the data sets are from trials involving treatments for HIV, stroke, and heart failure; (iii) develop user-friendly software for optimizing the performance of our new adaptive designs and comparing them to standard designs. The goal was to construct designs that satisfy power and Type I error requirements at the minimum cost in terms of expected sample size, i.e., average sample size over a set of plausible scenarios. We also considered the maximum sample size, i.e., the number of participants enrolled if there is no early stopping.

Methods: We constructed new adaptive trial designs (including new rules for modifying enrollment and new procedures for testing multiple hypotheses) and proved key statistical properties such as control of the study-wide Type I error rate.

Results: For the simulation study involving stroke, the new adaptive design reduced expected sample size by 32% compared to standard designs; the tradeoff is that the maximum sample size was 22% larger for the adaptive design. For the simulation study involving the cardiac resynchronization device for treating heart failure, the benefit of the adaptive design was a 25% reduction in expected sample size but an 8% increase in maximum sample size versus standard designs. For the simulation study involving HIV, the adaptive designs did not provide substantial benefits.

Conclusions: Optimized, adaptive enrichment designs can lead to reduced expected sample size compared to standard designs, in some settings. For adaptive enrichment to substantially add value, a sufficient number of primary outcomes needs to be observed before enrollment is exhausted; this depends on the enrollment rate and the time from enrollment to observation of the primary outcome. Adaptive designs often involve tradeoffs, such as reduced expected sample size at the price of greater maximum sample size compared to standard designs. Our software can reveal these tradeoffs and determine whether certain adaptive enrichment designs substantially add value for a given trial design problem; this enables trial statisticians to make informed decisions among trial design options. Our designs assumed that subpopulations are defined before the trial starts, which requires prior data and scientific understanding of who may be more likely to benefit from the treatment. The sample size required to determine treatment effects for subpopulations can be substantially greater than for the overall population.

Background

Adaptive designs involve preplanned rules for modifying how the trial is conducted based on accruing data. For example, adaptations could be made to the number enrolled, the probability of being randomized to treatment or control, the inclusion criteria, the length of follow-up, etc. According to the Patient-Centered Outcomes Research Institute (PCORI) Methodology Report1, "Adaptive designs are particularly appealing for PCOR because they could maintain many of the advantages of randomized clinical trials while minimizing some of the disadvantages."

We focus on one type of adaptive design called adaptive enrichment designs. Adaptive enrichment designs involve preplanned rules for modifying enrollment criteria in an ongoing trial.2 They typically involve multiple, pre-planned stages, each ending with an analysis of the cumulative data and a decision as to enrollment in the subsequent stage. These designs have potential to learn more about treatment effects in subpopulations.3 For example, enrollment of a subpopulation for which there is sufficient evidence of treatment efficacy, futility, or harm could be stopped, while enrollment continues for the remaining subpopulations. Figure 1, based on a similar figure in our paper4, gives a schematic of a 2-stage adaptive enrichment design in the context of a trial to evaluate a surgical treatment for stroke in 2 subpopulations.


Figure 1:4 Schematic of a 2-stage adaptive enrichment design.

[Figure 1 shows the flow of enrollment in a 2-stage adaptive enrichment design: in stage 1, both subpopulations are enrolled; in stage 2, the design follows one of four options: (1) enroll both subpopulations; (2) enroll only subpopulation 1; (3) enroll only subpopulation 2; (4) stop the trial.]

A decision is made after stage 1 to: (1) continue enrolling both subpopulations; (2) enroll only subpopulation 1; (3) enroll only subpopulation 2; (4) stop the trial. The adaptive enrichment designs considered in this paper generally involve more than 2 stages, where decisions similar to those in this figure are made at the interim analysis after each stage using the cumulative data available.

We developed new statistical methodology and an open-source, freely available software tool that optimizes new adaptive enrichment designs and compares their performance (via simulation) versus standard designs. Our designs aimed to determine treatment benefits and harms for subpopulations defined by a risk factor such as age, disease severity, or a biomarker measured at baseline. We also assessed tradeoffs involved in using adaptive enrichment designs versus standard designs.

Our project addressed research priorities of the U.S. Food and Drug Administration (FDA) and the National Institutes of Health (NIH). The FDA, in their draft guidance on adaptive designs for drugs and biologics5, highlighted the importance of developing new statistical methods for adaptive enrichment

designs. Adaptive designs research is also important to the NIH. As stated in the PCORI Methodology Report1, "Recognizing the need for innovation in clinical trial design, representatives from the NIH's Clinical and Translational Science Award programs have identified adaptive trial design as a high-priority methodological issue 'to increase the efficiency of comparative effectiveness trials.'" As stated in a paper6 co-authored by the former Chief of the Biometric Research Branch, Division of Cancer Research and Diagnosis, National Cancer Institute (NCI), "There has been relatively little previous methodological work on adaptively changing the eligibility criteria during a clinical trial." Aim 1 of our research was to develop methods to address a facet of this gap.

We focused on developing new adaptive enrichment designs for time-to-event and other delayed endpoints. Most existing methods for constructing adaptive enrichment designs are limited to situations where patient outcomes are observed soon after enrollment. This is a major barrier to the use of such designs in practice, since for many trials the outcome of most clinical importance may occur long after enrollment. Building on preliminary work7,8,9,10,11,12,13 that focused on the simpler setting of outcomes observed soon after enrollment, we developed new designs and analytic tools to handle time-to-event and delayed outcomes, that is, outcomes measured a fixed time after enrollment.

As stated in the PCORI Methodology Report1, the chief statistical concerns for adaptive designs include "type I error, power, and sample size distributions, as well as the precision and bias in the estimation of treatment effects." Similar concerns are emphasized in the FDA draft guidance on adaptive designs for drugs and biologics5 and the FDA guidance on adaptive designs for medical devices.14 We addressed these concerns by evaluating the above statistical properties of our new designs. The adaptations in our designs are based on predefined rules, as recommended in the PCORI Methodology Report and the aforementioned FDA guidance documents.

Although there is potential for producing stronger evidence about which subpopulations benefit by using adaptive designs, these designs are no panacea. In fact, indiscriminate application of adaptive designs could be quite detrimental. An adaptive design should only be used in situations where it offers clear advantages over standard designs. Previously, the only practical way to decide whether an adaptive design should be used was to hire a statistical consulting firm specializing in adaptive designs, or to experiment with different designs, e.g., using commercial software. The first option can be expensive, putting it beyond the reach of many trial statisticians. The second option is a time-intensive, trial-and-error approach, where the user must try a multitude of possible design parameter settings and scenarios. Even when a good adaptive design exists, it may be missed by this approach. The problem is that currently available software for adaptive designs does not automatically search over many possible designs to find those with the best performance in the scenarios most relevant to a particular research

question. This represented a barrier preventing trial statisticians from tapping the potential of adaptive designs. Our project addressed this barrier by producing a freely available, user-friendly software tool that optimizes adaptive enrichment designs to the scenarios and performance criteria input by the user (e.g., a statistician designing a new clinical trial) and compares their performance to standard designs. Our software tool is intended to help trial statisticians learn whether the adaptive designs implemented in our project offer advantages for their research questions, and if so, to provide designs optimized to scenarios relevant for their study goals. We also summarized our results in a plain-language, non-technical document to help make the key ideas involving adaptive enrichment designs accessible to clinical investigators (Appendix 1). Throughout, we refer to adaptive enrichment designs simply as adaptive designs, for conciseness.

Specific Aims

Aim 1: Develop and evaluate adaptive enrichment designs for time-to-event and other delayed endpoints.
1.1 Combine advantages of group sequential and adaptive enrichment designs in a unified framework.
1.2 Enhance the evidence available at each interim analysis by using improved estimators.
1.3 Optimize adaptive enrichment designs.

Aim 2: Conduct extensive simulation studies. The goal is to determine how the new adaptive designs in Aim 1 perform in scenarios derived from our completed trial data sets.

Aim 3: Produce user-friendly, free, open-source software to optimize our adaptive enrichment designs and compare performance versus standard designs.

Participation of patients and/or other stakeholders in the design and conduct of research and dissemination of findings

Because this was a methodology project, we did not include patients/stakeholders.

Methods

Overview: Aim 1 involved development of new methods, which are described in this section. The methods consist of new adaptive trial designs (including new rules for modifying enrollment and new multiple testing procedures) and proofs of their statistical properties (such as control of the study-wide Type I error rate). We also describe the data sets that were used in our simulation studies in Aim 2.

Aim 1: Develop and evaluate adaptive enrichment designs for time-to-event and other delayed endpoints.

Aim 1.1 Combine advantages of group sequential and adaptive enrichment designs in a unified framework. The methods developed for this aim are presented in our manuscript15. Group sequential designs involve rules for early stopping of an entire trial for efficacy, futility, or harm. They are often used in practice due to their flexibility, simplicity, and familiarity to those designing clinical trials. Adaptive enrichment designs have a different set of desirable features, including the potential to improve detection of treatment effects for subpopulations; they require multiple testing procedures since they involve testing a null hypothesis for each subpopulation. Our general framework for adaptive enrichment designs can be applied to clinical trials with primary outcomes that are continuous, binary, or time-to-event.

All of our adaptive enrichment designs involve preplanned interim analyses, where the following type of enrollment modification rule is applied to each subpopulation: if the cumulative test statistic (on the z-scale) comparing treatment versus control for that subpopulation is above an efficacy boundary or below a futility boundary, accrual for that subpopulation is stopped; otherwise, accrual continues. (We also considered variations on this enrollment modification rule, such as adding rules for stopping the combined population or for stopping assignment to a treatment arm in trials evaluating more than 1 treatment versus control.) The main challenge is to determine the interim analysis times and efficacy/futility boundaries for which the corresponding design satisfies user-specified requirements on power and Type I error at the minimum cost in terms of expected sample size.
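To make this rule concrete, here is a minimal sketch in code, under the simplifying assumption that the z-statistics and boundaries for the current interim analysis have already been computed; all function and variable names (and the example boundary values) are illustrative, not taken from our software:

```python
# Hypothetical sketch of the per-subpopulation enrollment modification rule
# described above; names and boundary values are illustrative assumptions.

def update_enrollment(z_stats, efficacy_bounds, futility_bounds, enrolling):
    """At an interim analysis, stop accrual for each subpopulation whose
    cumulative z-statistic crosses its efficacy or futility boundary;
    otherwise continue accrual for that subpopulation."""
    decisions = {}
    for s in sorted(enrolling):
        if z_stats[s] >= efficacy_bounds[s]:
            decisions[s] = "stop accrual: efficacy boundary crossed"
        elif z_stats[s] <= futility_bounds[s]:
            decisions[s] = "stop accrual: futility boundary crossed"
        else:
            decisions[s] = "continue accrual"
    still_enrolling = {s for s in enrolling if decisions[s] == "continue accrual"}
    return still_enrolling, decisions

# Example with two subpopulations at one interim analysis:
still_enrolling, decisions = update_enrollment(
    z_stats={1: 2.9, 2: -0.5},
    efficacy_bounds={1: 2.5, 2: 2.5},
    futility_bounds={1: 0.0, 2: 0.0},
    enrolling={1, 2},
)
# Subpopulation 1 stops for efficacy; subpopulation 2 stops for futility.
```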

Aim 1.2 Enhance the evidence available at each interim analysis by using improved estimators. We considered estimators that leverage information in baseline variables and short-term outcomes in order to improve estimator precision. These estimators, based on targeted maximum likelihood estimation16, use prediction models that synthesize information available at a given analysis. We wrote several papers17,18,19,20 on the impact of such estimators, described below.

We first considered estimators that adjust only for prognostic baseline variables (called "adjusted estimators" below) to improve precision, in the context of standard trial designs and adaptive enrichment designs. We did this for binary, continuous, and time-to-event outcomes. We constructed new, adjusted estimators for time-to-event outcomes that provide theoretical guarantees (such as equal or better precision compared to the Kaplan-Meier estimator, asymptotically)17; in simulation studies that mimic features of the CLEAR stroke treatment trial, the new adjusted estimators led to a 12% precision gain compared to the standard, unadjusted estimator (the Kaplan-Meier estimator). We also used the CLEAR trial to illustrate strengths and limitations of different types of adjusted estimators, where we showed that a commonly used adjusted estimator is uninterpretable if there is any treatment effect heterogeneity; we proposed that other adjusted estimators (called standardization estimators) be used, since they lead to comparable precision gains and do not have this flaw.18 Such estimators were applied in simulation studies that mimic key features of the MISTIE trial19,20 and of the PEARLS trial19; precision gains for the MISTIE trial context ranged from 29% to 39%, while they were 2% for the PEARLS trial. The main reason is that in the MISTIE trial context, there were strongly prognostic baseline variables, unlike in the PEARLS trial.
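To illustrate what a standardization estimator does, here is a generic sketch for a binary outcome, assuming a logistic working model; this illustrates the standardization (G-computation) idea in general, and the function and names are ours, not code from our papers or software:

```python
# Hypothetical sketch of a standardization (G-computation) estimator of the
# risk difference for a binary outcome, adjusting for baseline covariates.
# Assumes numpy and scikit-learn are available; all names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def standardized_risk_difference(X, treatment, outcome):
    """Fit a working outcome regression, then average each participant's
    predicted outcome probability under treatment and under control over the
    whole sample; return the difference (a marginal treatment effect)."""
    model = LogisticRegression()
    model.fit(np.column_stack([treatment, X]), outcome)
    p_treated = model.predict_proba(np.column_stack([np.ones(len(X)), X]))[:, 1]
    p_control = model.predict_proba(np.column_stack([np.zeros(len(X)), X]))[:, 1]
    return p_treated.mean() - p_control.mean()
```

Unlike the coefficient on treatment in an adjusted regression model, this marginal contrast retains its interpretation even when the treatment effect varies across covariate strata, which is the flaw in the commonly used adjusted estimator noted above.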

Aim 1.3 Optimize adaptive enrichment designs. This is challenging since the adaptive designs we considered have many parameters to set, including sample sizes per stage and stopping boundaries for efficacy or futility (for each subpopulation). The goal is to conduct an automated search over these parameters to find a trial design that minimizes expected sample size under requirements on power and Type I error. We used two optimization approaches: first21, we used an optimization algorithm called simulated annealing; second22, we used an optimization algorithm called sparse linear programming. Both approaches led to substantial reductions in expected sample size. We discuss results based on the first approach throughout this report; the next paragraph summarizes results based on the second approach.

The second approach22 led to a 17% reduction in expected sample size, comparing our new optimized adaptive designs to an existing set of adaptive designs from related work, in the main example in that paper. A limitation of these new optimized designs is that they are substantially more complex than the designs output by the first approach, which may make them more difficult to communicate. We therefore decided to implement the designs using the first approach in our software tool (see Aim 3).
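In code, the optimization problem can be summarized as a constrained objective of the following form; this sketch assumes a simulation harness (passed in as the simulate argument) that estimates a candidate design's operating characteristics, and all names are illustrative rather than taken from our software:

```python
# Hypothetical sketch of the constrained objective behind the design search:
# minimize expected sample size subject to power and Type I error requirements.

def design_cost(design, scenarios, simulate, alpha=0.025, required_power=0.80):
    """simulate(design, scenarios) is assumed to return estimated operating
    characteristics: familywise Type I error, power for each requirement,
    and expected sample size (averaged over the plausible scenarios)."""
    perf = simulate(design, scenarios)
    if perf["familywise_type_i_error"] > alpha:
        return float("inf")  # infeasible: Type I error requirement violated
    if any(p < required_power for p in perf["power_per_requirement"]):
        return float("inf")  # infeasible: a power requirement violated
    return perf["expected_sample_size"]  # feasible designs ranked by expected n
```

An optimizer such as simulated annealing (described under Aim 3) then searches over the design parameters to drive this cost down.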

Data sources and datasets: We used datasets from 4 completed randomized trials. The trials were used for two purposes. First, 3 of the 4 trials motivated new trial design problems where it was of interest to

learn about treatment effects in specific subpopulations; the remaining trial was used only for assessing the value added by improved estimators that adjust for prognostic baseline variables. Second, we used the data from each of these trials to construct realistic simulation scenarios by mimicking key features of these data sets.

The first trial is the Prospective Evaluation of Antiretrovirals in Resource Limited Settings (PEARLS) study of the AIDS Clinical Trials Group (ACTG Trial A5175)23. As we described in our manuscript24, this was a randomized, non-inferiority trial that enrolled 1,571 HIV-positive participants. Participants were randomized to three HIV treatments, referred to here as treatment arms A, B, and C. The primary outcome was time to the composite endpoint of virologic failure, HIV disease progression (AIDS), or death. There was a difference in treatment effects for men and women when comparing two of the arms. The trial's lead statistician, Dr. Victor DeGruttola (who was on our advisory board), asked whether an adaptive design might be useful in contexts such as this one to learn about treatment effects in subpopulations defined by sex.

The second trial is a phase 2 trial of a surgical treatment for severe stroke, called Minimally-Invasive Surgery Plus rt-PA for Intracerebral Hemorrhage (MISTIE)25. Daniel Hanley, the principal investigator for the MISTIE trial, was on our advisory board. Each participant's functional disability was measured 180 days from enrollment, based on the modified Rankin Scale (mRS). The primary outcome was the indicator of having an mRS of 3 or less. As we described in our paper21: in planning the phase 3 trial, the investigators were interested in two subpopulations defined by the size of intraventricular hemorrhage (IVH) at baseline. "Small IVH" participants are defined as having IVH volume less than 10 mL and not requiring a catheter for intracranial pressure monitoring. The remaining participants are called "large IVH". The phase 2 trial only recruited small IVH participants. Knowledge of the underlying biology of these types of brain hemorrhage suggested a possible benefit for those with large IVH as well. However, there was greater uncertainty about the treatment effect in the large IVH subpopulation. Investigators inquired about the possibility of running a phase 3 trial that included both small IVH and large IVH participants (called subpopulations 1 and 2, respectively), but with the option to stop a subpopulation's accrual (using a preplanned rule) if interim data indicated that a benefit was unlikely. We evaluated the strengths and limitations of such an adaptive enrichment design21, with our findings presented in the Results section of this report. Throughout, we assume that the subpopulation definitions are based on clinical knowledge and are defined before the trial starts.

The third trial is the phase 3 trial called Clot Lysis: Evaluating Accelerated Resolution of Intraventricular Hemorrhage (CLEAR), involving a surgical treatment for a different type of severe stroke than the MISTIE trial.26 A total of 500 individuals with intraventricular hemorrhage (IVH) were randomized with equal allocation to receive either alteplase (treatment) or saline (control) for IVH removal. The primary outcome was the same as in the MISTIE II trial. We used the two stroke trials to investigate different design features; MISTIE was primarily used to assess the value added from adaptive enrichment designs, while CLEAR was used to assess the value added from adjusting for prognostic baseline variables. The CLEAR trial enrolled a different population than the MISTIE trial.

The fourth trial, the SMART-AV trial27, is described in our manuscript28. The SMART-AV trial compared two methods (called treatments) of optimizing atrioventricular (AV) delay versus a fixed delay (called control) for heart failure patients having cardiac resynchronization therapy with a defibrillator (CRT-D). The trial is named "SMART-AV" because one of the atrioventricular delay optimization methods is called SmartDelay. Prior scientific knowledge indicated that participants with narrow QRS width (QRS ≤ 150 ms) may be more likely to benefit from optimizing atrioventricular delay. This raises the question of whether a design targeted to test for treatment effects in subpopulations defined by QRS width would have been more successful at identifying the participants (if any) who benefit from treatment. To address this question, we developed, optimized, and evaluated a new class of adaptive enrichment designs for comparing two treatments to a common control in two disjoint subpopulations for outcomes measured with delay; we compared these designs versus standard designs in scenarios that mimic features of the SMART-AV trial.28

Evaluative framework: An important choice in our project was what designs we should compare our new adaptive enrichment designs to. Our choice of comparator designs, which was informed by input from our advisory board, consists of standard, fixed (single-stage) designs that do not involve adaptation, and designs that use simple rules for early stopping described in the Results section.

Study outcomes: The primary outcomes considered in the four trials are as follows. For the PEARLS trial, the primary outcome was time to the composite endpoint of virologic failure, HIV disease progression (AIDS), or death. For the MISTIE and CLEAR trials, the primary outcome was the modified Rankin Scale (measuring functional disability) 180 days after enrollment. For the SMART-AV trial, the primary outcome was the 180-day change in left ventricular end-systolic volume. We used these as the primary outcomes in our simulation studies comparing adaptive enrichment designs versus standard designs.

Analytical and statistical approaches: The performance metrics we used to compare adaptive enrichment designs versus standard designs are the following: sample size; duration; power; Type I

error; estimator bias; mean squared error; and coverage probability of confidence intervals. These performance metrics were compared using data generating distributions (called scenarios) that we constructed to mimic key features of the completed trial data sets (such as subpopulation proportions and the correlations among baseline variables, short-term outcomes, and the primary outcome). We considered scenarios where the treatment benefited neither subpopulation, a single subpopulation, or both subpopulations. For the SMART-AV application, we considered a larger set of scenarios, since the goals of that design involve both multiple treatments and multiple populations.

Conduct of Study: N/A

Results

Our results consist of simulation studies evaluating the benefits and limitations of adaptive enrichment designs (Aim 2) and the software tool for optimizing these designs (Aim 3). We first report the results from Aim 2.

Aim 2: Conduct extensive simulation studies. The goal was to determine how the new adaptive designs in Aim 1 perform in simulation scenarios derived from the data sets in the MISTIE, SMART-AV, and PEARLS trials. For each data set, we present our trial design optimization problem, the types of designs that were compared, and the performance of the best designs found by the optimization algorithm (simulated annealing). These results are also presented in a manuscript that includes simulations outside the scope of this project.29

MISTIE optimization problem: We first present the results from our paper21 for the MISTIE stroke trial application. The clinically meaningful, minimum treatment effect δ was defined as a 12% absolute increase in the probability of a good outcome, defined as an mRS of 3 or less at 180 days. The subpopulations of interest were those with small IVH at baseline (subpopulation 1) and large IVH at baseline (subpopulation 2). There were 3 null hypotheses of interest: no average treatment benefit for subpopulation 1 (H01), for subpopulation 2 (H02), and for the combined population (H00). The trial design optimization problem involved 4 requirements: (i) 80% power to reject H01 when the treatment only benefits subpopulation 1 at level δ; (ii) 80% power to reject H02 when the treatment only benefits subpopulation 2 at level δ; (iii) 80% power to reject H00 when the treatment benefits both subpopulations at level δ; (iv) strong control of the familywise Type I error rate at level 0.025 (one-sided). The goal was to construct a design meeting all four of the above requirements, with the minimum expected sample size, i.e., expected sample size averaged over the scenarios in requirements (i)-(iii) and the global null hypothesis of no treatment effect in either subpopulation.

MISTIE designs compared: We compared adaptive enrichment designs with up to 10 stages versus single stage designs for this problem. The number 10 was chosen somewhat arbitrarily.

Typical group sequential designs involve 3-5 stages, and we wanted to allow more than this; it turned out that the optimal adaptive designs all had 6 or fewer stages. The adaptive enrichment designs used the following rule for stopping enrollment at the end of each stage: if the test statistic (Z-score) corresponding to the estimated treatment effect for a subpopulation exceeds the corresponding efficacy boundary or is below the corresponding futility boundary, enrollment for that subpopulation is stopped; also, if the test statistic corresponding to the estimated treatment effect for the combined population is below a corresponding futility boundary, all enrollment is stopped. The efficacy and futility boundaries were either set according to standard methods (including O'Brien-Fleming and Pocock boundaries) or were optimized. The single stage designs only involved efficacy boundaries, which were either set based on an equal allocation of alpha (representing Type I error) to the null hypotheses or based on an optimized allocation.

MISTIE performance comparison: The efficacy and futility stopping boundaries for the optimized adaptive enrichment design are shown in Figure 2, and the performance is compared to the other design types in Figure 3. The optimized adaptive design had an expected sample size of 981, compared to 1443 for the optimized single stage design; the tradeoff is that the maximum sample size of the former design was 1762, compared to 1443 for the latter. That is, the 32% reduction in expected sample size was accompanied by a 22% larger maximum sample size for the adaptive design compared to the optimized single stage design. The designs that did not involve optimization all had worse performance than the optimized designs. We also published a paper in the journal Stroke giving a non-technical overview of the strengths and limitations of adaptive enrichment designs for the MISTIE trial application4.

Figure 2: Efficacy and futility boundaries for the 6-stage optimized adaptive design from the MISTIE stroke application21.

Note: There are efficacy (blue circles) and futility (red triangles) stopping boundaries corresponding to each null hypothesis, where H01=no average treatment benefit for subpopulation 1 (solid lines); H02=no average treatment benefit for subpopulation 2 (dotted lines); H00=no average treatment benefit for the combined population (dashed lines).

Figure 3: Comparison of sample size distributions between adaptive enrichment designs and standard designs for the MISTIE trial application21.

Five designs are shown. The simplest are the 2 standard (non-adaptive) single stage designs, "1 stage, non-opt." and "1 stage, opt.", where the former uses a non-optimized (standard) multiple testing procedure and the latter uses an optimized multiple testing procedure; their sample sizes are constant and indicated by the dashed and solid lines, respectively. The three adaptive enrichment designs (defined next) have sample sizes that depend on the enrollment modifications made during the trial (e.g., stopping a subpopulation or the combined population); their sample sizes are represented by distributions using violin plots (approximate density plots, with width proportional to density), with the expected sample size marked as 'x'. The design "optim" represents the optimized adaptive enrichment design, while "OBF" and "Pocock" represent non-optimized designs using O'Brien-Fleming and Pocock efficacy stopping boundaries, respectively.

There is asymmetry in the MISTIE trial design problem in the subpopulation proportions: 1/3 and 2/3 for subpopulations 1 and 2, respectively. This led to asymmetries in the optimized design shown in Figure 2, e.g., futility boundaries for H01 being below those for H02. Intuitively, subpopulation 1

information accrues at half the rate compared to subpopulation 2 (since enrollment is assumed to be proportional to subpopulation size), and so the design is more reluctant to stop subpopulation 1 early for futility compared to subpopulation 2. The efficacy boundaries for H01 and H02 have an analogous relationship at the first 3 interim analyses; after that, they cross. Such a crossing has to occur if (as occurred in the optimized design) the efficacy boundaries start out higher for H01 compared to H02 and more total alpha is allocated to subpopulation 1 than subpopulation 2. Intuitively, one may expect such an allocation since the maximum sample size (and maximum information) for subpopulation 1 is substantially smaller than for subpopulation 2.

SMART-AV optimization problem: We next present our results28 for simulation studies motivated by the SMART-AV trial, which involved 3 arms (2 treatments versus control). We focused on two subpopulations defined by QRS duration at baseline being narrow or wide, as defined above. The accrual rate was set to be 20 participants per month. The proportion of participants with narrow QRS was estimated to be 0.49. The minimum, clinically meaningful treatment effect was a reduction by 15 mL in the primary outcome of 6-month change in left ventricular end-systolic volume (LVESV), and the outcome standard deviation for each treatment by subpopulation combination was set to 60 mL. The familywise Type I error rate was 0.05. There were 4 null hypotheses of interest, one for each treatment by subpopulation combination, e.g., no average treatment effect for treatment 1 (versus control) for subpopulation 1. A greater variety of scenarios and corresponding power requirements were considered, compared to the MISTIE trial application. Table 1 shows the 6 different scenarios (one per row) used to define the objective function and power constraints. The table caption describes how the power constraints are represented. The objective function is the expected sample size, where the expectation (average) is with respect to an equally weighted combination of the 6 scenarios.
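In symbols (our notation, not taken from the cited paper), the objective for a candidate design d is the equally weighted average of expected sample sizes over the 6 scenarios, minimized subject to the power and Type I error requirements:

$$\mathrm{ESS}(d) \;=\; \frac{1}{6}\sum_{s=1}^{6} E\left[\, N_d \mid \text{scenario } s \,\right],$$

where N_d denotes the (random) total sample size under design d.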

Table 1: Scenarios used to define the objective function and power constraints for the SMART-AV trial simulation study.28

Scenario      (S1, T1)   (S2, T1)   (S1, T2)   (S2, T2)   Power
Scenario 1        0          0          0          0        0
Scenario 2       15          0          0          0        0.8
Scenario 3       15         15          0          0        0.8
Scenario 4       15          0         15          0        0.8
Scenario 5       15         15         15          0        0.8
Scenario 6       15         15         15         15        0.8

S1 and S2 refer to subpopulations 1 and 2, respectively. T1 and T2 refer to treatments 1 and 2, respectively. Each of columns 2-5 corresponds to a treatment by subpopulation combination. Each cell in columns 2-5 gives the treatment effect (mean difference between treatment and control) for the corresponding combination, with positive values indicating a benefit from treatment compared to control. The power column shows the power required for each false null hypothesis in the corresponding row, i.e., for each subpopulation by treatment combination where the treatment effect is 15 mL. For example, in scenario 3 the requirements are: 80% power to reject the null hypothesis for the combination (subpopulation 1, treatment 1), and 80% power to reject the null hypothesis for the combination (subpopulation 2, treatment 1).
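One way to encode Table 1's scenarios and power constraints as data for a design search is sketched below; the representation is our own illustration, not the format used by our software:

```python
# Hypothetical encoding of Table 1. Each scenario gives the true effect (mL)
# for every (subpopulation, treatment) combination; 80% power is required for
# each combination whose effect equals the clinically meaningful level.
EFFECT = 15  # minimum clinically meaningful LVESV reduction, in mL
COMBOS = [("S1", "T1"), ("S2", "T1"), ("S1", "T2"), ("S2", "T2")]

# Rows of Table 1 as effect vectors over COMBOS.
rows = [
    [0, 0, 0, 0],                      # Scenario 1: global null
    [EFFECT, 0, 0, 0],                 # Scenario 2
    [EFFECT, EFFECT, 0, 0],            # Scenario 3
    [EFFECT, 0, EFFECT, 0],            # Scenario 4
    [EFFECT, EFFECT, EFFECT, 0],       # Scenario 5
    [EFFECT, EFFECT, EFFECT, EFFECT],  # Scenario 6
]

scenarios = [
    {
        "effects": dict(zip(COMBOS, row)),
        # 80% power required for each false null (nonzero effect) in the row.
        "power_constraints": {c: 0.8 for c, e in zip(COMBOS, row) if e > 0},
    }
    for row in rows
]
```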

SMART-AV designs compared: We considered adaptive enrichment designs that use preplanned rules for early stopping of subpopulation by treatment combinations, e.g., if the estimated treatment effect for treatment A versus control in subpopulation 2 is below a futility threshold at an interim analysis, then new participants from that subpopulation are no longer assigned to treatment A (but treatment B and the control arm may continue). Designs could have up to 4 stages. We compared four types of designs of increasing complexity: (i) a simple 1-stage design with equal alpha allocation (e.g., the alpha allocated to each subpopulation is 0.025), where the only design parameter to be optimized is the sample size; (ii) an optimized-alpha 1-stage design, where the alpha allocation was optimized between subpopulations to minimize the expected sample size; (iii) a simple adaptive enrichment design with equal alpha allocation, all futility boundaries set to zero, and equally spaced interim analyses; (iv) an adaptive enrichment design where the futility boundaries, alpha allocation, and timing of the interim analyses were optimized.

SMART-AV performance comparison: Table 2 shows the expected and maximum sample size for the best design from each of the four design types. There was a 25% reduction in expected sample size comparing the optimized adaptive enrichment design versus the optimized 1-stage design, and a 12% reduction compared to the simple adaptive enrichment design. This shows that there is added value both from optimization of the design parameters and from adaptive enrichment. A disadvantage of the optimized adaptive enrichment design is that it has an 8% larger maximum sample size than the optimized 1-stage design.

Table 2:28 Results for the SMART-AV trial. The table shows the expected and maximum sample size for each optimal design from the 4 types compared in the context of the SMART-AV trial.

Design                            Expected Sample Size   Maximum Sample Size
Simple 1-Stage Design                     1818                  1818
Optimized-alpha 1-Stage Design            1779                  1779
Simple Adaptive Design                    1525                  2154
Optimized Adaptive Design                 1339                  1917

Hypothetical Example of a SMART-AV Trial: Here we provide an example of how an adaptive enrichment design could play out in a hypothetical CRT-D trial comparing two treatments to a common control in two subpopulations (1: narrow QRS patients; 2: wide QRS patients). For simplicity, we consider an optimized, adaptive enrichment design with two stages: an interim analysis and a final analysis. While this design has a larger expected sample size than the four-stage design that gave the lowest expected sample size in Table 2 (two stage: 1385 vs. four stage: 1339), it has a lower maximum sample size (two stage: 1890 vs. four stage: 1917), meets the same power and Type I error constraints, and the reduction in the number of analyses may be preferred for logistical reasons. For ease of clinical interpretation, we present unstandardized decision boundaries on the mean difference scale rather than standardized boundaries on the Z-score scale.

We describe the stopping rules for this two-stage adaptive design. In narrow QRS patients, treatment A will be stopped for futility in stage 1 if the mean difference in LVESV from control is less than 0 mL, and treatment B will be stopped for futility in stage 1 if the mean difference in LVESV from control is less than 1 mL. In wide QRS patients, each treatment arm will be stopped for futility if the mean difference in LVESV from control is less than 1 mL. In narrow QRS patients, treatment A or B will be declared

effective at the interim analysis if the mean difference in LVESV from control is greater than 18 mL. In wide QRS patients, each treatment will be declared effective at the interim analysis if the mean difference in LVESV from control is greater than 20 mL. At the final analysis, treatments will be declared effective in narrow QRS patients if the mean difference from control is greater than 11 mL, and treatments will be declared effective in wide QRS patients if the mean difference from control is greater than 12 mL.

We next present a hypothetical example of how such an adaptive design might proceed. At the interim analysis, each treatment is evaluated separately in each subpopulation. First, consider the narrow QRS subpopulation. Suppose we see a positive effect in arm A (a mean difference of 15 mL) and no appreciable difference in arm B (a mean difference of 0 mL) when each is compared with control in narrow QRS patients. Since the mean difference between arm A and control is above the futility boundary but does not meet the efficacy criterion (0 mL < 15 mL < 18 mL), we will continue randomizing patients to arm A and control in the second stage of the trial. However, since arm B in narrow QRS patients was below the futility boundary at the interim analysis (0 mL < 1 mL), no patients from this subpopulation will be randomized to arm B in stage 2. Next, consider the wide QRS subpopulation. We may see arm A stopped for futility (a mean difference of -3 mL, below the 1 mL futility boundary) and arm B stopped for efficacy in this subpopulation (a mean difference of 22 mL, above the 20 mL efficacy boundary). Since both arms A and B have stopped in this subpopulation, no further recruitment from subpopulation 2 is needed in the final stage. The final stage of the trial will only recruit patients in subpopulation 1, randomizing 1:1 between the remaining arm A and control. Since patients are only being randomized 1:1 between arm A and control in stage 2, instead of 1:1:1 as in stage 1, stage 2 completes accrual faster than if arm B had not been stopped in this subpopulation. Arm A would be declared effective in narrow QRS patients if the mean difference from control at the final analysis were greater than 11 mL.
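To make the decision logic concrete, here is a minimal sketch of the stage-1 rules just described, using the stated boundaries; the function and names are illustrative, not taken from our software:

```python
# Hypothetical sketch of the two-stage design's interim decision rules, on the
# mean-difference (mL) scale; boundary values are those stated in the text.
BOUNDS = {
    # subpopulation: {arm: (stage-1 futility, stage-1 efficacy, final efficacy)}
    "narrow_QRS": {"A": (0.0, 18.0, 11.0), "B": (1.0, 18.0, 11.0)},
    "wide_QRS":   {"A": (1.0, 20.0, 12.0), "B": (1.0, 20.0, 12.0)},
}

def interim_decision(subpop, arm, mean_diff):
    """Stage-1 decision for one (subpopulation, arm): stop for futility,
    declare effective, or continue that arm to stage 2."""
    futility, efficacy, _ = BOUNDS[subpop][arm]
    if mean_diff < futility:
        return "stop for futility"
    if mean_diff > efficacy:
        return "declare effective"
    return "continue to stage 2"

# The hypothetical interim results from the example above:
results = {("narrow_QRS", "A"): 15.0, ("narrow_QRS", "B"): 0.0,
           ("wide_QRS", "A"): -3.0, ("wide_QRS", "B"): 22.0}
for (subpop, arm), diff in results.items():
    print(subpop, arm, "->", interim_decision(subpop, arm, diff))
# narrow_QRS A -> continue to stage 2; narrow_QRS B -> stop for futility;
# wide_QRS A -> stop for futility; wide_QRS B -> declare effective.
```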

PEARLS optimization problem: We compared adaptive enrichment designs in the context of two-arm non-inferiority trials with time-to-event endpoints,24 motivated by the PEARLS trial application. The goal of our trial design problem was to test whether treatment arm A is non-inferior to arm B for each of the two subpopulations (defined by sex). There were two null hypotheses: inferiority for subpopulation 1 and inferiority for subpopulation 2, respectively. We next describe features from the completed PEARLS trial that we mimicked in our simulation study. Women (subpopulation 1) were 47% of the enrolled participants in the PEARLS trial. The accrual rate was approximately 724 participants per year. The hazard rate for study arm B was approximately 0.08. The studywide Type I error rate was 0.05, and the non-inferiority margin was a hazard ratio of 1.35. The power constraints and expected sample size involved four scenarios, shown in Table 3. These 4 scenarios represent, respectively: (1) equivalence of treatments A and B in both subpopulations; (2) equivalence of treatments A and B for subpopulation 1 (women) but inferiority at hazard ratio 1.35 of treatment A for subpopulation 2 (men); (3) equivalence of treatments A and B for subpopulation 1 and inferiority at hazard ratio 2.14 of treatment A for subpopulation 2; (4) inferiority at hazard ratio 1.35 of treatment A for each subpopulation. The motivation for these scenarios was that the PEARLS trial showed evidence of treatment A being inferior to treatment B for men (estimated hazard ratio 2.14), while a significant difference was not observed for women. Adaptive enrichment designs allow, for example, stopping enrollment of men if treatment A is shown to be inferior to B only for this subpopulation, while continuing enrollment for women (with the hope that A may subsequently be demonstrated to be non-inferior to B for women).

Table 3: Scenarios in the PEARLS simulation studies.24

Scenario      Subpopulation 1   Subpopulation 2   Required Power
Scenario 1          1                 1                 0.8
Scenario 2          1                 1.35              0.8
Scenario 3          1                 2.14              0.8
Scenario 4          1.35              1.35              0

Note: Columns 2 and 3 show the hazard ratios (comparing arm A to arm B) for each subpopulation and scenario. The power column represents the required power for each false null hypothesis (i.e., each subpopulation with hazard ratio less than the non-inferiority margin of 1.35).
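For readers unfamiliar with non-inferiority testing on the hazard ratio scale, here is a generic sketch of the test statistic involved; this is a standard construction rather than code from our software, and the estimated hazard ratio and its standard error would typically come from a Cox model:

```python
# Generic sketch of a non-inferiority z-statistic on the hazard ratio scale.
# H0: HR >= margin (arm A inferior to arm B); HA: HR < margin (non-inferior).
import math

def noninferiority_z(hr_hat, se_log_hr, margin=1.35):
    """Larger z favors non-inferiority; hr_hat is the estimated hazard ratio
    of arm A versus arm B, se_log_hr its standard error on the log scale."""
    return (math.log(margin) - math.log(hr_hat)) / se_log_hr

# Example: an estimated hazard ratio of 1.0 with log-scale standard error 0.12
# gives z of about 2.5, which would be compared to the efficacy boundary
# allocated to that subpopulation's hypothesis test.
z = noninferiority_z(hr_hat=1.0, se_log_hr=0.12)
```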

PEARLS designs compared: We compared the following three types of designs: (i) a single stage design; (ii) a two-stage adaptive enrichment design (denoted DAdapt-Both) that starts enrolling both subpopulations during stage 1; (iii) a two-stage adaptive enrichment design (denoted DAdapt-Only-Subpop.1) that enrolls only subpopulation 1 (women) in stage 1 but may expand the enrollment criteria to include men after the first interim analysis. All designs had a maximum duration of 8 years, but enrollment could be stopped earlier. For each type of design, we optimized the corresponding design parameters. These included the alpha allocation and the duration of enrollment for all designs; for the two-stage designs, we also optimized the timing of the interim analysis and the futility boundaries for each enrolled subpopulation; for design type (iii), we also optimized the rule for deciding whether to enroll men during

stage 2.

PEARLS performance comparison: For the original enrollment rate of 724 per year, the optimized adaptive designs did not improve on the optimized single stage design. We also considered half the original enrollment rate, in which case the optimal adaptive design slightly improved on the single stage design, as described in the following paragraphs. The optimized single stage design continued enrollment until 4.7 years and then followed enrolled participants until the end of the study (8 years). It allocated 88% of alpha (Type I error) to the hypothesis test for subpopulation 1 (women). The sample size was 1702.

For the optimized DAdapt-Both design, the analysis times were at 3.4 and 8 years, respectively. For each subpopulation that was not stopped at the interim analysis, enrollment continued to 5.0 years (and follow-up continued to the end of the study at 8 years). The proportions of total alpha (0.05) allocated to subpopulation 1 (women) at the interim and final analyses were 15% and 74%, respectively, while the proportions allocated to men at the interim and final analyses were 1% and 10%, respectively. The futility boundaries used at the interim analysis were -2.1 for women and -0.74 for men (on the z-scale). A potential reason for this asymmetry is that in all scenarios in Table 3, if treatment A is non-inferior to B for men it is also non-inferior for women, but not vice versa; intuitively, it may be advantageous to stop enrolling men and continue to enroll women, but not vice versa. The expected sample size for DAdapt-Both was 1660 and the maximum sample size was 1799. The tradeoff compared to the optimized single stage design was a reduction in expected sample size of 2% at the cost of an increase in maximum sample size of 6%.

For the optimized DAdapt-Only-Subpop.1 design, the optimizer set the interim analysis time to be immediately after the trial starts, effectively replicating a single stage design and making no performance improvement compared to the single stage design.

This completes the description of the results from our simulation studies that mimicked key features of the MISTIE, SMART-AV, and PEARLS trial data sets. We next present two more sets of results. The first assessed the sensitivity of one of our optimized adaptive enrichment designs to delay times, enrollment rates, and the population distribution of the variables measured in the trial. The second examined the impact of statistical adjustment for imbalances between study arms in prognostic baseline variables (i.e., baseline variables that are correlated with the outcome).

We investigated the sensitivity of an optimized adaptive enrichment design to features including: the time between enrollment and observation of the primary outcome (delay time), accrual rates, and the prognostic value of auxiliary variables (including baseline variables and short-term outcomes).30 We found that faster patient accrual results in shorter trial duration, but can have the negative consequence

of increasing the sample size when the primary outcome is measured with delay. For trials using adjusted estimators, larger prognostic value leads to increased power and decreased expected sample size and trial duration when using information monitoring; a prognostic baseline variable typically results in better precision and power compared to an equally prognostic short-term outcome.

We examined the impact of covariate adjustment on group sequential designs, i.e., designs with preplanned interim analyses where the trial may be stopped early for efficacy or futility.31 We derived new formulas that decompose the overall precision gain from covariate adjustment into contributions from different sources, including: the number of pipeline participants, analysis timing, enrollment rate, and treatment effect heterogeneity. We proved that, ceteris paribus, larger treatment effect heterogeneity decreases the value added from covariate adjustment. Another implication of our formulas is that in most practical situations, adjusting for a prognostic short-term outcome leads to smaller precision gains than adjusting for an equally prognostic baseline variable. We conducted simulation studies based on data from the MISTIE phase 2 trial to check the predictions made by our theoretical results; the simulations matched the theoretical derivations and showed that using an adjusted estimator can result in up to a 38% sample size reduction while maintaining 80% power, compared to the unadjusted estimator.

We summarized our results from Aims 1 and 2 in a plain-language, non-technical document to help clinical investigators think through whether an adaptive enrichment design may be useful for their specific applications (Appendix 1). In Sections 2b and 2c of Appendix 1, we discuss the impact of the subpopulation sizes and of the information accrual rate (which involves the enrollment rate and the time from enrollment to primary outcome measurement), respectively, on the potential value added by adaptive enrichment designs.
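Returning to the covariate adjustment results above: a back-of-the-envelope calculation, based on the standard asymptotic relative efficiency argument (an illustration of the general principle, not a formula from the cited papers), shows how precision gains translate into sample size reductions:

```python
# Hypothetical back-of-the-envelope calculation: if, at a common sample size,
# the adjusted estimator's variance is variance_ratio times the unadjusted
# estimator's variance, then roughly variance_ratio times as many participants
# are needed to match the unadjusted design's power.

def adjusted_sample_size(n_unadjusted, variance_ratio):
    """Approximate sample size needed with adjustment to match the power of
    an unadjusted analysis enrolling n_unadjusted participants."""
    return n_unadjusted * variance_ratio

# Example: a 38% variance reduction (variance ratio 0.62) would allow roughly
# a 38% smaller trial at the same power:
n_adjusted = adjusted_sample_size(1000, 0.62)  # -> 620.0 participants
```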

Aim 3: Produce user-friendly, free, open-source software to optimize our adaptive enrichment designs and compare performance versus standard designs.

The software is intended for a trial statistician who is planning a confirmatory trial where it is suspected that a subpopulation may benefit more than the overall population. The software applies the simulated annealing optimization approach described above in order to search for adaptive enrichment designs that satisfy the user's requirements for Type I and Type II error at the minimum cost in expected sample size. The software is for superiority trials and handles outcomes that are continuous, binary, or time-to-event. For the case of time-to-event outcomes, it also has the option to implement non-inferiority trials.

Precision gains from adjustment for baseline variables can be incorporated. The software implements our new adaptive enrichment designs24,28. These adaptive enrichment trial designs are defined by the following design parameters: number of stages, sample size per stage, interim analysis times, and efficacy and futility stopping boundaries. The software searches over values of these parameters with the goal of minimizing the expected sample size (ESS).

The search is done using a simulated annealing algorithm: starting from an initial design, variations to this design are proposed (new values are chosen for the trial parameters based on the current values of the parameters), and the performance of each candidate design is evaluated. If the proposed candidate design is superior to the current design, the candidate is 'accepted' and replaces the current design. If the candidate is not superior, it is either 'rejected' (the current design remains in consideration) or 'accepted' with a small probability. Occasionally accepting an inferior design is necessary to better explore the space of possible designs without getting stuck at local optima, i.e., designs that are better than their neighbors but far from the global optimum. The algorithm becomes more conservative as the search progresses: the variations proposed to the current design are smaller in scale, and designs not superior to the current design are less likely to be accepted.

In each iteration of the algorithm, trial design performance is evaluated using simulation: a large number of simulated trials are carried out to estimate power, error rates, statistical properties of estimators, and the distributions of trial duration and sample size. These operating characteristics depend upon characteristics of the patient population and outcome (e.g., the size of subpopulations, the distributions of outcomes, and the delay between enrollment and observation of the primary outcome), logistical constraints (e.g., on sample size and duration), and the rules for performing interim analyses and modifying enrollment. While simulated annealing is not guaranteed to find the globally optimal design (finding it is currently an open problem), it can be useful in identifying potential improvements and tradeoffs between study designs. The simulated annealing algorithm requires substantial computation; it typically completes within 24 hours, at which time a summary report is emailed to the user.
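A minimal sketch of such a simulated annealing loop is given below, assuming a cost function like the one sketched under Aim 1.3 (expected sample size for feasible designs, infinity for infeasible ones); the perturbation rule, cooling schedule, and all names are illustrative simplifications, not our software's implementation:

```python
# Hypothetical sketch of a simulated annealing search over design parameters.
# `cost` should return expected sample size for feasible designs and
# float("inf") for designs violating power or Type I error requirements;
# `perturb` proposes a nearby design, with smaller moves at lower temperature.
import math
import random

def anneal(initial_design, perturb, cost, n_iter=10_000, temp0=1.0):
    current, current_cost = initial_design, cost(initial_design)
    best, best_cost = current, current_cost
    for i in range(n_iter):
        temp = temp0 * (1.0 - i / n_iter)  # simple linear cooling schedule
        candidate = perturb(current, scale=max(temp, 1e-3))
        candidate_cost = cost(candidate)
        # Always accept improvements; accept worse designs with a probability
        # that shrinks as the temperature drops, to escape local optima.
        accept = candidate_cost < current_cost or (
            math.isfinite(candidate_cost)
            and random.random()
            < math.exp(-(candidate_cost - current_cost) / max(temp, 1e-9))
        )
        if accept:
            current, current_cost = candidate, candidate_cost
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost
```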

The user of our software can specify their priorities through the power constraints and the objective function. For example, if there is a high priority on rejecting the null hypotheses for both subpopulations when the treatment benefits both of them at a given level, this can be entered as a power constraint. Intuitively, the resulting optimized design will be less likely to stop the trial when only one subpopulation's treatment effect estimate is high, compared to a similar design optimized without this constraint (e.g., one with constraints only on individual subpopulation power but not on the power to reject both simultaneously). It is possible to investigate this tradeoff using our software tool by trying different power constraints.

The software has a graphical user interface that runs on all major platforms (e.g., Windows, Mac OS X, and Linux) because it runs in web browsers, including Google Chrome and Firefox (both of which are free to download). The trial optimization software tool can be accessed at http://rosenblum.jhu.edu, where a user can request a (free) account giving access to the software. All of the code used in our software tool is open-source and freely available at: https://github.com/mrosenblum/AdaptiveDesignStreamlinedOptimizer

Discussion

Study results in context: To put the above results in the context of the existing literature, we note that our general approach for controlling the familywise Type I error rate is based on extending the group sequential approach32 to multiple hypotheses, as has been done in different contexts by, e.g.,33,34,35. Unlike alternative approaches such as the p-value combination approach36, our approach uses minimal sufficient statistics, thereby avoiding potential precision losses37.

To put our software tool for optimizing and evaluating adaptive enrichment designs in context, we describe several other software tools for adaptive enrichment designs. ADDPLAN PE38 is commercial software that uses the p-value combination approach; its advantages include computing trial design performance very quickly and offering versatile features. FACTS39 is commercial software that evaluates adaptive enrichment designs using a Bayesian approach. Both ADDPLAN PE and FACTS involve classes of adaptive designs that differ from those in our project. The main potential advantage of our software, compared to ADDPLAN PE and FACTS, is that our software tool searches over many trial designs to optimize expected sample size under power and Type I error constraints.

Uptake of study results: We disseminated our results by teaching two free short courses, where we presented (1) an overview of the strengths and limitations of these designs, (2) case studies for the MISTIE and SMART-AV trial applications, and (3) the software tool. The first short course was taught on June 13, 2017, at the University of California Berkeley's Forum for Collaborative Research in Washington, D.C. The course had 36 attendees in person plus 27 remotely; 20 were from the FDA, 8 were from PCORI, and others were from academia and industry. The second short course, with the same format, was taught on August 30, 2017, at Johns

Hopkins University, and had 127 registered participants. The video recording of each short course is available free at http://rosenblum.jhu.edu.

A potential barrier to dissemination of our new analytic methods and software tool is that, as with any new method, they require a time investment to learn. Our project involved several approaches to address this obstacle. First, the overview document in Appendix 1 is written in clear language to succinctly illustrate the potential advantages and limitations of adaptive enrichment designs. Second, our software tool runs on all major operating systems and has a graphical interface that we hope will make it widely accessible. Third, we posted the video recordings of our short course on adaptive enrichment designs.

Study limitations: Limitations of our adaptive enrichment designs include that they only allow enrichment of a single subpopulation. Another limitation is that the proofs of statistical properties, such as Type I error control, are asymptotic. Except in special cases, such as when the outcome is assumed to be normally distributed, proofs of statistical properties typically involve asymptotic arguments, even for standard designs. A limitation of our approach for time-to-event outcomes is that our simulations involve distributions with constant hazard rates over time.
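To make the last limitation concrete: a constant hazard corresponds to exponentially distributed event times, as in the following generic simulation sketch (illustrative names; the hazard of roughly 0.08 per year is the PEARLS arm B value from the Methods section):

```python
# Generic sketch: a constant hazard over time implies exponential event times
# with mean 1 / hazard. Real event-time distributions may have hazards that
# rise or fall over time, which simulations like this do not capture.
import numpy as np

rng = np.random.default_rng(seed=0)

def simulate_event_times(n, hazard_per_year):
    """Draw n event times (in years) under a constant hazard."""
    return rng.exponential(scale=1.0 / hazard_per_year, size=n)

# Example: control-arm event times at roughly the PEARLS arm B hazard; an arm
# with hazard ratio 1.35 would use hazard 0.08 * 1.35.
control_times = simulate_event_times(1000, hazard_per_year=0.08)
```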

Future research: Areas of future research include optimization of other types of adaptive designs. Our software, which is open-source and freely available for anyone to build on, includes a general trial design optimizer that can be applied to a wide variety of trial design types. For example, we could include adaptive enrichment designs with different types of enrollment modification rules and multiple testing procedures (e.g., based on the p-value combination approach). Other types of adaptation could be considered as well, e.g., sample size modifications or dose selection. Another area of future research is to consider larger numbers of subpopulations or treatments. It is more challenging to optimize designs for these cases, in that greater computational resources would be needed.

Conclusions

We developed and evaluated new adaptive enrichment designs, that is, designs with preplanned rules for modifying enrollment criteria based on accrued data in an ongoing trial. Such designs may be useful when there is prior evidence that the treatment effect may differ between subpopulations, e.g., defined by a biomarker or risk score measured at baseline. We developed user-friendly software that automatically searches over these adaptive designs with the aim of optimizing performance (in terms of average sample size) given a user's requirements (in terms of power and Type I error). The best adaptive enrichment designs found by the software are compared to standard designs that achieve the same goals; key tradeoffs between these designs are highlighted so that the user can determine which design makes the most sense for their specific application.

While all of the designs presented focus on settings where the treatment effects in each subpopulation are of interest, some investigations may focus on the treatment effect in one subpopulation or in the overall population, but not the treatment effect in the complementary subpopulation (e.g., trials of targeted therapies). Such designs are not discussed here, but our software could be modified to explore them. Special considerations are necessary in these settings to minimize the risk of recommending ineffective treatments to the biomarker-negative subpopulation while maintaining a feasible sample size40.

Optimized, adaptive enrichment designs can lead to reduced expected sample size compared to standard designs, in some settings. For adaptive enrichment to substantially add value, a sufficient number of primary outcomes need to be observed before enrollment is exhausted; this depends on the enrollment rate and the time from enrollment to observation of the primary outcome. Adaptive designs often involve tradeoffs, such as reduced expected sample size at the price of greater maximum sample size compared to standard designs. Because the performance and tradeoffs of an adaptive design depend on many population- and treatment-specific factors, it is important that trial statisticians consider a range of possible designs and potential treatment effects. Our software can help navigate design choices, revealing the performance of designs and the tradeoffs between them, and helping to determine whether certain adaptive enrichment designs substantially add value for a specific trial design problem. This enables trial statisticians to make informed decisions among trial design options rather than relying on general strategies that may not apply to their setting.

Limitations of our adaptive enrichment designs include that they allow enrichment for only a single subpopulation, they require preplanned analyses, the proofs of their statistical properties are asymptotic, and our time-to-event simulations involve distributions with constant hazard rates over time.

References

1. Patient-Centered Outcomes Research Institute (PCORI) Methodology Committee. (2017). The PCORI Methodology Report. http://www.pcori.org/sites/default/files/PCORI-Methodology-Report.pdf Accessed 8/15/17.

2. Wang, S. J., Hung, H. & O'Neill, R. T. (2009). Adaptive patient enrichment designs in therapeutic trials. Biomet. J. 51, 358–74.

3. Rosenblum, M. (2015), Adaptive Randomized Trial Designs that Cannot be Dominated by Any Standard Design at the Same Total Sample Size. Biometrika. 102(1). 191-202. http://goo.gl/ovNu1S

4. Rosenblum, M., and Hanley, D.F. (2017) Topical Review: Adaptive Enrichment Designs for Stroke Clinical Trials. Stroke. 48(6). https://doi.org/10.1161/STROKEAHA.116.015342

5. FDA. (2010). Draft guidance for industry. Adaptive design clinical trials for drugs and biologics. http://goo.gl/sND1mU Accessed 8/15/17.

6. Simon, N. and Simon, R. (2013). Adaptive enrichment designs for clinical trials. Biostatistics 14, 613–25.

7. Rosenblum, M. & van der Laan, M. (2009) Using Regression Models to Analyze Randomized Trials: Asymptotically Valid Hypothesis Tests Despite Incorrectly Specified Models. Biometrics. 65(3): 937-945.

8. Rosenblum, M. & van der Laan, M. (2010) Simple, Efficient Estimators of Treatment Effects in Randomized Trials Using Generalized Linear Models to Leverage Baseline Variables. International Journal of Biostatistics. 6(1).

9. Rosenblum, M. & van der Laan, M. (2011). Optimizing Randomized Trial Designs to Distinguish which Subpopulations Benefit from Treatment. Biometrika. 98(4), 845–860.

10. Rosenblum, M. (2013) Confidence Intervals for the Selected Population in Randomized Trials that Adapt the Population Enrolled, Biometrical Journal. 55(3), 322-340.

11. Rosenblum, M. (2014), Uniformly Most Powerful Tests for Simultaneously Detecting a Treatment Effect in the Overall Population and at Least One Subpopulation. Journal of Statistical Planning and Inference, (155). 107-116.

12. Rosenblum, M., Liu, H., and Yen, E.-H. (2014), Optimal Tests of Treatment Effects for the Overall Population and Two Subpopulations in Randomized Trials, using Sparse Linear Programming, Journal of the American Statistical Association, Theory and Methods Section. 109(507). 1216-1228. http://goo.gl/kUqxVU

13. Rosenblum, M., Thompson, R., Luber, B., Hanley, D. (2016) Group Sequential Designs with Prospectively Planned Rules for Subpopulation Enrichment. Statistics in Medicine. 35(21), 3776- 3791. http://goo.gl/7nHAVn

14. U.S. Food and Drug Administration (2016), Adaptive Designs for Medical Device Clinical Studies. Guidance for Industry and Food and Drug Administration Staff. http://www.fda.gov/downloads/medicaldevices/deviceregulationandguidance/guidancedocuments/ucm446729.pdf Accessed February 23, 2017.

15. Rosenblum, M., Qian, T., Du, Y., Qiu, H., and Fisher, A. (2016) Multiple Testing Procedures for Adaptive Enrichment Designs: Combining Group Sequential and Reallocation Approaches. Biostatistics. 17(4), 650-662. https://goo.gl/c8GlcH

16. van der Laan, M. J. and Gruber, S. (2012). Targeted minimum loss based estimation of causal effects of multiple time point interventions. Int. J. Biostat. 8.

17. Díaz, I., Colantuoni, E., Hanley, D. F., and Rosenblum, M. Improved Precision in the Analysis of Randomized Trials with Survival Outcomes, without Assuming Proportional Hazards. Lifetime Data Analysis (under review; submitted February 1, 2017, revision requested July 13, 2017). http://arxiv.org/abs/1511.08404

18. Steingrimsson, J. A., Hanley, D. F., and Rosenblum, M. (2017) Improving Precision by Adjusting For Baseline Variables in Randomized Trials with Binary Outcomes, without Regression Model Assumptions. Contemporary Clinical Trials. 54. 18-24. http://dx.doi.org/10.1016/j.cct.2016.12.026

19. Colantuoni, E. and Rosenblum, M. (2015) Leveraging Prognostic Baseline Variables to Gain Precision in Randomized Trials. Statistics in Medicine. 34(18), 2602-2617. http://goo.gl/evGHF6

20. Diaz, I., Colantuoni, E., Rosenblum, M. (2016) Enhanced Precision in the Analysis of Randomized Trials with Ordinal Outcomes. Biometrics. (72) 422-431. http://goo.gl/bnTjff

21. Fisher, A. and Rosenblum, M., (In Press) Stochastic Optimization of Adaptive Enrichment Designs for Two Subpopulations. Journal of Biopharmaceutical Statistics. Working paper: http://biostats.bepress.com/jhubiostat/paper279

22. Rosenblum, M., Fang, X., and Liu, H. Optimal, Two Stage, Adaptive Enrichment Designs for Randomized Trials Using Sparse Linear Programming. JRSS B (Under review. Submitted May 30, 2017). Working paper: https://goo.gl/67lBga

23. Campbell TB, Smeaton LM, Kumarasamy N, et al. PEARLS study team of the ACTG. Efficacy and safety of three antiretroviral regimens for initial treatment of HIV-1: a randomized clinical trial in diverse multinational settings. PLoS Med. 2012;9(8):e1001290. Epub 2012 Aug 14.

24. Betz, J., Steingrimsson, J. A., Qian, T., and Rosenblum, M. (2017) Comparison of Adaptive Randomized Trial Designs for Time-to-Event Outcomes That Expand Versus Restrict Enrollment Criteria, to Test Non-Inferiority. Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 289. http://biostats.bepress.com/jhubiostat/paper289

25. Hanley, D. F., Thompson, R. E., Muschelli, J., et al. (2016) Safety and efficacy of minimally invasive surgery plus alteplase in intracerebral haemorrhage evacuation (MISTIE): a randomized, controlled, open-label, phase 2 trial. Lancet Neurology. 15(12). 1228-1237. http://dx.doi.org/10.1016/S1474-4422(16)30234-4

26. Hanley, D.F., Lane, K., McBee, et al. (2017) Thrombolytic removal of intraventricular haemorrhage in treatment of severe stroke: results of the randomised, multicentre, multiregion, placebo-controlled CLEAR III trial. The Lancet. 389(10069), 603–611. http://dx.doi.org/10.1016/S0140-6736(16)32410-2

27. Diaz, I., Colantuoni, E., Rosenblum, M. (2016) Enhanced Precision in the Analysis of Randomized Trials with Ordinal Outcomes. Biometrics. (72) 422-431. http://goo.gl/bnTjff

28. Steingrimsson, J. A., Betz, J., Qian, T., and Rosenblum, M. (2017) Optimized Adaptive Enrichment Designs for Multi-Arm Trials: Learning Which Subpopulations Benefit from Different Treatments. Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 288. http://biostats.bepress.com/jhubiostat/paper288

29. Rosenblum, M., Betz, J., Steingrimsson, J., Fisher, A., and Qian, T. Performance of optimized adaptive enrichment designs in simulation studies of treatments for stroke, heart disease, HIV, and Alzheimer's disease. Manuscript in preparation.

30. Qian, T., Colantuoni, E., Fisher, A., and Rosenblum, M. (2017) Sensitivity of adaptive enrichment trial designs to accrual rates, time to outcome measurement, and prognostic variables. Contemporary Clinical Trials Communications, Volume 8. https://doi.org/10.1016/j.conctc.2017.08.003.

31. Qian, T., Rosenblum, M., and Qiu, H. (2016) Improving Power in Group Sequential, Randomized Trials by Adjusting for Prognostic Baseline Variables and Short-Term Outcomes. Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 285. http://biostats.bepress.com/jhubiostat/paper285

32. Jennison, C. and Turnbull, B. W. (1999). Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall/CRC Press.

33. Jennison, C. and B. W. Turnbull (2007). Adaptive seamless designs: Selection and prospective testing of hypotheses. J. Biopharmaceutical Statistics 17 (6), 1135-1161, doi: 10.1080/10543400701645215.

34. Maurer, W. and Bretz, F. (2013). Multiple testing in group sequential trials using graphical approaches. Statistics in Biopharmaceutical Research, 5(4):311-320.

35. Magnusson B, Turnbull B (2013) Group sequential enrichment design incorporating subgroup selection. Statistics in Medicine. 32(16):2695–2714. doi:10.1002/sim.5738.

36. Bretz F, Schmidli H, König F, Racine A, Maurer W. Confirmatory seamless phase II/III clinical trials with hypotheses selection at interim: general concepts. Biometrical Journal 2006; 48(4):623–634.

37. Emerson SS. (2006) Issues in the use of adaptive clinical trial designs. Statistics in Medicine; 25(19):3270–3296, doi:10.1002/sim.2626.

38. Wassmer, G. and Brannath, W. (2016) Group Sequential and Confirmatory Adaptive Designs in Clinical Trials. Chapter "Appendix: Software for Adaptive Designs". Springer Series in Pharmaceutical Statistics. Springer.

39. “FACTS -- Fixed and Adaptive Clinical Trial Simulator” https://www.berryconsultants.com/software Accessed August 28, 2017.


40. Freidlin B, Sun Z, Gray R, Korn EL. Phase III clinical trials that integrate treatment and biomarker evaluation. Journal of Clinical Oncology 2013; Sep 1; 31(25): 3158–3161.

Related Publications

Published Manuscripts:

1. Qian, T., Colantuoni, E., Fisher, A., and Rosenblum, M. (2017) Sensitivity of Trial Performance to Delay Outcomes, Accrual Rates, and Prognostic Variables Based on a Simulated Randomized Trial with Adaptive Enrichment. Contemporary Clinical Trials Communications. Volume 8. 39-48. https://doi.org/10.1016/j.conctc.2017.08.003

2. Rosenblum, M., and Hanley, D.F. (2017) Topical Review: Adaptive Enrichment Designs for Stroke Clinical Trials. Stroke. 48(6). https://doi.org/10.1161/STROKEAHA.116.015342

3. Steingrimsson, J. A., Hanley, D. F., and Rosenblum, M. (2017) Improving Precision by Adjusting for Baseline Variables in Randomized Trials with Binary Outcomes, without Regression Model Assumptions. Contemporary Clinical Trials. 54. 18-24. http://dx.doi.org/10.1016/j.cct.2016.12.026

4. Huang, E., Fang, E., Hanley, D., and Rosenblum, M. (2017) Inequality in Treatment Benefits: Can We Determine If a New Treatment Benefits the Many or the Few? Biostatistics. 18(2). 308-324. https://doi.org/10.1093/biostatistics/kxw049

5. Rosenblum, M., Qian, T., Du, Y., Qiu, H., and Fisher, A. (2016) Multiple Testing Procedures for Adaptive Enrichment Designs: Combining Group Sequential and Reallocation Approaches. Biostatistics. 17(4), 650-662. https://goo.gl/c8GlcH

6. Diaz, I., Colantuoni, E., Rosenblum, M. (2016) Enhanced Precision in the Analysis of Randomized Trials with Ordinal Outcomes. Biometrics. (72) 422-431. http://goo.gl/bnTjff

7. Colantuoni, E. and Rosenblum, M. (2015) Leveraging Prognostic Baseline Variables to Gain Precision in Randomized Trials. Statistics in Medicine. 34(18), 2602-2617. http://goo.gl/evGHF6

8. Diaz, I. and Rosenblum, M. (2015) Targeted Maximum Likelihood Estimation Using Exponential Families. International Journal of Biostatistics. 11(2), 233-251.

9. Webb, A., Ullman, N., Morgan, T., et al. (2015) Accuracy of the ABC/2 Score for Intracerebral Hemorrhage: Systematic Review and Analysis of MISTIE, CLEAR-IVH, and CLEAR III. Stroke. 46(9), 2470-6.

10. Huang, E. J., Fang, E. X., Hanley, D. F., and Rosenblum, M. Constructing a Confidence Interval for the Fraction Who Benefit from Treatment, Using Randomized Trial Data. Journal of the American Statistical Association, Theory and Methods Section (submitted October 17, 2017). Working paper: http://biostats.bepress.com/jhubiostat/paper287

11. Rosenblum, M., Fang, X., and Liu, H. (June 2017) Optimal, Two Stage, Adaptive Enrichment Designs for Randomized Trials Using Sparse Linear Programming. Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 273. https://biostats.bepress.com/jhubiostat/paper273

12. Díaz, I., Colantuoni, E., Hanley, D. F., and Rosenblum, M. Improved Precision in the Analysis of Randomized Trials with Survival Outcomes, without Assuming Proportional Hazards. Lifetime Data Analysis. http://arxiv.org/abs/1511.08404

13. Fisher, A. and Rosenblum, M. (December 2016) Stochastic Optimization of Adaptive Enrichment Designs for Two Subpopulations. Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 279. https://biostats.bepress.com/jhubiostat/paper279

14. Rosenblum, M., Miller, P., Stuart, E., Thieme, M., Reist, B., and Louis, T. (2019) Adaptive Design in Surveys and Clinical Trials: Similarities, Differences, and Opportunities for Cross-Fertilization. Journal of the Royal Statistical Society: Series A. 182(3), 963-982. https://doi.org/10.1111/rssa.12438


Acknowledgments: Our project was fortunate to receive periodic feedback from a project advisory board of 8 individuals with a wide range of experience in the design, analysis, and implementation of clinical trials. The advisory board included: Mark J. van der Laan, Professor of Biostatistics, University of California, Berkeley; Scott Zeger, Professor of Biostatistics, Johns Hopkins University; James Tonascia, Professor of Biostatistics, Johns Hopkins University; Daniel Hanley, Professor of Neurology, Johns Hopkins University; Patrick Heagerty, Professor of Biostatistics, University of Washington; Ruth Pfeiffer, Senior Investigator, National Cancer Institute (NCI); Victor DeGruttola, Professor of Biostatistics, Harvard University; and Sean Tunis, President of the Center for Medical Technology Policy, a Baltimore nonprofit organization that supports the design and implementation of research that helps patients, clinicians, and payers make informed treatment and policy decisions.


Copyright© 2019. Johns Hopkins University. All Rights Reserved.

Disclaimer: The views, statements, and opinions presented in this report are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute® (PCORI®), its Board of Governors, or its Methodology Committee.

Acknowledgement: Research reported in this report was funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (#ME-1306-03198). Further information available at: https://www.pcori.org/research-results/2013/new-methods-and-software-designing-adaptive-clinical-trials-new-medical
