Biometrika (2010), 97, 2, pp. 405–418 doi:10.1093/biomet/asq003 © 2010 Biometrika Trust. Advance Access publication 14 April 2010. Printed in Great Britain.

Interval estimation for drop-the-losers designs

BY SAMUEL S. WU
Department of Epidemiology and Health Policy Research, University of Florida, Gainesville, Florida 32610, U.S.A. [email protected]fl.edu

WEIZHEN WANG
Department of Mathematics and Statistics, Wright State University, Dayton, Ohio 45435, U.S.A. [email protected]

AND MARK C. K. YANG

Department of Statistics, University of Florida, Gainesville, Florida 32610, U.S.A. [email protected]fl.edu

SUMMARY

In the first stage of a two-stage, drop-the-losers design, a candidate for the best treatment is selected. At the second stage, additional observations are collected to decide whether the candidate is actually better than the control. The design also allows the investigator to stop the trial for ethical reasons at the end of the first stage if there is already strong evidence of futility or superiority. Two types of tests have recently been developed, one based on the combined means and the other based on the combined p-values, but corresponding interval estimators are unavailable except in special cases. The problem is that, in most cases, the interval estimators depend on the configuration of all treatments in the first stage, which is unknown. In this paper, we prove a basic stochastic ordering lemma that enables us to bridge the gap between hypothesis testing and interval estimation. The proposed confidence intervals achieve the nominal confidence level in certain special cases. Simulations show that decisions based on our intervals are usually more powerful than those based on existing methods.

Some key words: Adaptive design; Drop-the-losers design; p-value combination; Two-stage test.

1. INTRODUCTION

The purpose of this paper is to construct confidence limits for the mean difference between the selected treatment and the control in a two-stage, drop-the-losers clinical trial. As the name implies, when there are k treatments to be compared with a control, we wish to pick the most promising treatment using a two-stage design, in which we drop all other treatments after the first stage and confirm that the selected treatment is indeed better than the control in the second stage. The investigator can also stop the trial for ethical reasons at the end of the first stage if there is already strong evidence for or against the null hypothesis that no treatment is better than the control. Multiple interim looks may be incorporated after the treatment selection. In drug testing, the selection stage corresponds to a Phase II trial and the confirmation stage corresponds to a Phase III study. Traditionally, these two trials have been conducted separately, with the Phase III study ignoring all the Phase II information except the selection. A seamless Phase II/III design combines the data from both stages to make the final decision.

Since the primary goal of most clinical trials is to make sure that the selected treatment is indeed better than the control, most two-stage designs are investigated using hypothesis testing. Thall et al. (1988) considered the two-stage, drop-the-losers strategy for dichotomous responses with an early futility stop, but no early rejection. Bischoff & Miller (2005) constructed an optimal test based on the likelihood ratio principle for k = 2, but an extension of their result to k > 2 was not explored. Sampson & Sill (2005) proposed a uniformly most powerful conditional test for general k, conditioned on the first-stage results. Unfortunately, this test is not an unconditionally uniformly most powerful test.
Wang (2006) constructed a rigorous level-α test under the assumption that all treatments are not inferior to the control. The methods in the four papers mentioned so far use the likelihood ratio principle on data from the two stages; i.e. the decision is based on combined means. Alternatively, one could use p-values from the two stages to integrate information. Popular methods of combining p-values include Fisher's product of p-values (Bauer & Köhne, 1994; Bretz et al., 2006), weighted linear p-value combination (Chang, 2007) and weighted linear p-value combination under the inverse transformation of the normal distribution function (Lehmacher & Wassmer, 1999). In contrast to likelihood-based methods, the techniques that combine p-values do not exploit all of the available information and are therefore less efficient (Tsiatis & Mehta, 2003; Jennison & Turnbull, 2006). On the other hand, the p-value methods offer more flexibility. For example, the statistical methods that lead to the p-values at the two different stages do not have to be the same, and the second-stage design may depend on all first-stage data.

Once we have confirmed that the candidate treatment is indeed better than the control, it is important from a practical standpoint to quantify the difference. In some cases, an interval estimator may be obtained as a by-product of the hypothesis-testing procedure. For example, Liu et al. (2002) describe how this can be accomplished in a situation where there exists a family of tests for a univariate parameter. Unfortunately, the parameter of interest in our problem is random due to selection, and there seems no clear way to extend the method of Liu et al. (2002) to this case. To be more specific, inverting a test in our context would require the least favourable configuration for the coverage probability of the interval estimator.
Lehmacher & Wassmer (1999) obtained a confidence interval for the k = 1 case, and Brannath et al. (2006) provided a comprehensive discussion of confidence interval construction under various conditions, such as designs with upper and lower bounds on the second-stage sample size, but they also considered only the k = 1 case. Bischoff & Miller (2005) considered the k = 2 case but did not provide an interval estimator. Sampson & Sill (2005) were able to convert their conditional test into a confidence interval because, by conditioning on the first-stage data, the final decision is a one-stage problem. Since the Sampson and Sill method seems to be the only confidence interval for the general k-arm drop-the-losers design, we use it as the standard reference for comparison.

In this paper, we develop interval estimators for the mean difference between the selected treatment and the control. Our methods work for both types of testing; i.e. both mean combination and p-value combination methods. While the unknown mean configuration prevents us from constructing the tightest possible lower confidence limit, we are able to prove that our interval estimators do achieve the nominal confidence level in certain special cases. Moreover, our simulation study shows that the new intervals usually perform better than Sampson and Sill's intervals.

2. BASIC FORMULATION AND DECISION RULES

2·1. Basic formulation

Let the response of the $j$th patient from the $i$th treatment at stage $t$ be $X_{tij}$ and

$$X_{tij} = \mu_i + \epsilon_{tij} \quad (j = 1,\ldots,n_{ti};\ i = 0,1,\ldots,k;\ t = 1,2),$$

where $\mu_i$ stands for the $i$th treatment mean, with $i = 0$ for the control, and $\epsilon_{tij} \sim N(0,\sigma^2)$ are independent errors. Furthermore, we assume that the larger the mean the better.

Let $\bar X_{ti}$ and $S_{ti}^2$ be the sample mean and sample variance from the $i$th treatment at stage $t$. From the first stage, let $S_1^2$ be the pooled estimate of the common variance as defined in Appendix 1, and let $Y_i = (\bar X_{1i} - \bar X_{10})/a_i$ be the standardized difference between $\bar X_{1i}$ and $\bar X_{10}$ for $i = 1,\ldots,k$, where $a_i = (1/n_{1i} + 1/n_{10})^{1/2}$. Let $\tau$ be the selected treatment; i.e. $\tau = i$ if $Y_i = \max_{j=1,\ldots,k} Y_j$. We assume that the $n_{2i}$ $(i = 0,\ldots,k)$ are prespecified functions of $S_1$ for the mean combination method, and functions of both $S_1$ and $\bar X_{1i}$ $(i = 0,\ldots,k)$ for the p-value combination method; hence they are known at the end of the first stage. Also, only the selected treatment and the control are continued after the first stage; i.e. the second-stage data from the dropped treatments are not observed.

Since the selected treatment $\tau$ for the second stage is random, we are actually estimating the difference between the means of an unprespecified treatment and the control, $\Delta = \mu_\tau - \mu_0$. It is clear that for continuous errors $\epsilon_{tij}$, $\tau$ and $\mu_\tau$ are uniquely defined with probability one.

2·2. The mean combination decision rule

The difference between the sample means of the selected treatment and the control, based on the data from both stages, is $D = (n_{1\tau}\bar X_{1\tau} + n_{2\tau}\bar X_{2\tau})/(n_{1\tau}+n_{2\tau}) - (n_{10}\bar X_{10} + n_{20}\bar X_{20})/(n_{10}+n_{20})$. When the sample sizes $n_{2i}$ $(i=1,\ldots,k)$ are different, the distribution of the usual sample variance depends on $\tau$; consequently, it depends on $\bar X_{1i}$ $(i=1,\ldots,k)$. Therefore, we elect to use a modified variance estimate $S^2$ defined in Appendix 1. In most practical situations, the $n_{2i}$ are all equal and $S^2$ is the usual variance estimator.

Let $b_\tau = \{1/(n_{1\tau}+n_{2\tau}) + 1/(n_{10}+n_{20})\}^{1/2}$ and $T = D/b_\tau$. For a given set $(d_0, d_1, d_2)$ such that $d_0 \leq d_1$, our decision rule is:

$$\text{at stage 1}\ \begin{cases} \text{if } Y_\tau/S_1 < d_0, & \text{stop for futility},\\ \text{if } Y_\tau/S_1 \geq d_1, & \text{stop for superiority of treatment } \tau,\\ \text{if } d_0 \leq Y_\tau/S_1 < d_1, & \text{continue to stage 2}; \end{cases}$$
$$\text{at stage 2}\ \begin{cases} \text{if } T/S < d_2, & \text{claim futility},\\ \text{if } T/S \geq d_2, & \text{claim superiority of treatment } \tau. \end{cases} \quad (1)$$
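As a concrete reading of rule (1), here is a minimal sketch in Python; the function name, argument layout and the return strings are ours, not part of the design:

```python
def mean_combination_decision(Y_tau, S1, d0, d1, d2, T=None, S=None):
    """Apply the two-stage mean combination rule (1).

    Y_tau : standardized first-stage difference for the selected treatment
    S1    : pooled first-stage standard deviation
    T, S  : combined statistic and modified variance estimate (stage 2 only)
    """
    if Y_tau / S1 < d0:
        return "stage 1: stop for futility"
    if Y_tau / S1 >= d1:
        return "stage 1: stop for superiority"
    # d0 <= Y_tau/S1 < d1: the trial continues, so stage-2 data are needed
    if T / S >= d2:
        return "stage 2: claim superiority"
    return "stage 2: claim futility"
```

With $S_1 = 1$ the first argument can be read directly as the ratio $Y_\tau/S_1$; passing 2·28 with thresholds $(-0{\cdot}1914, 3{\cdot}5161)$, for example, routes the trial to stage 2.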

Thus the $i$th treatment will be confirmed as effective if the event $\{\tau = i\} \cap \{(Y_i/S_1 \geq d_1) \cup (Y_i/S_1 \geq d_0,\ T/S \geq d_2)\}$ occurs. The determination of $(d_0, d_1, d_2)$ depends on the probability of making a Type I error and on the desired power at a given mean configuration. Bischoff & Miller (2005) explain how to do this analytically, involving triple integrals, for $k = 2$. They showed that the global null hypothesis, in which all treatments are the same as the control, is the least favourable configuration in terms of Type I error. This result has been extended to any $k$ in a 2008 technical report by Wu, Sun, Yang and Yang at the University of Florida. The decision rule of Bischoff and Miller is the same as (1) except that they used $T/S_1$ in the second stage. Our $S$ should be a better estimate of the variance than $S_1$, but the difference is negligible when the sample size is not too small.

2·3. The p-value combination decision rule

Let $p_1$ and $p_2$ be p-values from the first and second stages, respectively, and $C(p_1, p_2)$ be a p-value combination function. Two commonly used combination functions are Fisher's product $C(p_1,p_2) = p_1 p_2$ (Bauer & Köhne, 1994), and the weighted sum of inverse normals $C(p_1,p_2) = 1 - \Phi\{w_1\Phi^{-1}(1-p_1) + w_2\Phi^{-1}(1-p_2)\}$, where the $w_i$ are positive weights with $w_1^2 + w_2^2 = 1$ (Lehmacher & Wassmer, 1999). The decision rule is, for given $(\alpha_0,\alpha_1,c)$,

$$\text{at stage 1}\ \begin{cases} \text{if } p_1 > \alpha_0, & \text{stop for futility},\\ \text{if } p_1 \leq \alpha_1, & \text{stop for superiority of treatment } \tau,\\ \text{if } \alpha_1 < p_1 \leq \alpha_0, & \text{continue to stage 2}; \end{cases}$$
$$\text{at stage 2}\ \begin{cases} \text{if } C(p_1,p_2) > c, & \text{claim futility},\\ \text{if } C(p_1,p_2) \leq c, & \text{claim superiority of treatment } \tau. \end{cases}$$

There is no unique choice of $\alpha_0$, $\alpha_1$ or $c$. The Type I error is controlled at $\alpha$, defined as

$$\alpha = \alpha_1 + \int_0^1\!\!\int_0^1 I\{\alpha_1 < p_1 \leq \alpha_0,\ C(p_1,p_2) \leq c\}\,dp_1\,dp_2.$$

The relation between these numbers and the overall Type I error is tabulated in Chapter 7 of Chow & Chang (2007). Examples can also be found in our simulation study.
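For Fisher's product combination, the double integral above reduces to a one-dimensional integral with a logarithmic closed form. A small sketch, assuming (as the integral does) independent Uniform(0,1) p-values under the null and $c > 0$; the function name is ours:

```python
import math

def type1_error_fisher(alpha0, alpha1, c):
    """Overall Type I error alpha = alpha1 + integral of
    I{alpha1 < p1 <= alpha0, p1*p2 <= c} over the unit square,
    for Fisher's product combination C(p1, p2) = p1*p2.

    For p1 <= c every p2 in (0,1) rejects; for p1 > c only p2 <= c/p1 does,
    so the double integral splits into a linear and a logarithmic term.
    """
    lo = max(alpha1, min(c, alpha0))          # boundary between the two regions
    full = lo - alpha1                        # strip where all p2 reject
    partial = c * (math.log(alpha0) - math.log(lo)) if alpha0 > lo else 0.0
    return alpha1 + full + partial
```

For example, with the no-termination choices $\alpha_0 = 1$, $\alpha_1 = 0$ and $c = 0{\cdot}0087$ (the PC1 values of Table 2), this evaluates to approximately 0·05, the level used in the simulation study.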

3. CONFIDENCE LIMITS CONSTRUCTION AND JUSTIFICATION

3·1. The mean combination test

For any given constants $(d_0, d_1, d_2)$ in (1), we denote the random stage at which recruitment is stopped by $M \in \{1,2\}$; i.e. $M = 1$ if $(Y_\tau/S_1 \geq d_1)$ or $(Y_\tau/S_1 < d_0)$, and $M = 2$ otherwise. Let $\mu_{\mathrm{null}} = (0,0,\ldots,0)$, $\mu_{00} = (0,0,-\infty,\ldots,-\infty)$ and $D_{ti} = \bar X_{ti} - \bar X_{t0}$. We define

$$\alpha(d_0,d_1,d_2) = \mathrm{pr}_{\mu_{\mathrm{null}}}\{(Y_\tau/S_1 \geq d_1) \cup (Y_\tau/S_1 \geq d_0,\ T/S \geq d_2)\},$$
$$\alpha'(d_1',d_2') = \mathrm{pr}_{\mu_{00}}\{(Y_\tau/S_1 \leq d_1') \cup (T/S \leq d_2')\},$$
$$\Delta_L = \begin{cases} D_{1\tau} - d_1 a_\tau S_1, & M = 1,\\ (D_{1\tau} - d_1 a_\tau S_1) \vee \{(D_{1\tau} - d_0 a_\tau S_1) \wedge (D - d_2 b_\tau S)\}, & M = 2, \end{cases}$$
$$\Delta_U = \begin{cases} D_{1\tau} - d_1' a_\tau S_1, & M = 1,\\ (D_{1\tau} - d_1' a_\tau S_1) \wedge (D - d_2' b_\tau S), & M = 2, \end{cases}$$

where $x \vee y$ is defined as $\max(x,y)$ and $x \wedge y = \min(x,y)$.

THEOREM 1. For any given constants $(d_0,d_1,d_2)$ and $(d_1',d_2')$, let $\Delta_L$ and $\Delta_U$ be defined as above. If $n_{2\tau}/n_{1\tau} = n_{20}/n_{10}$, then we have

$$\mathrm{pr}_\mu(\Delta > \Delta_L) \geq 1 - \alpha, \qquad \mathrm{pr}_\mu(\Delta < \Delta_U) \geq 1 - \alpha',$$

for any $\mu = (\mu_0,\mu_1,\ldots,\mu_k)$ and $\sigma^2 > 0$.

Remark 1. Theorem 1 is valid even if the second-stage sample sizes are re-estimated based on $S_1$. Suppose $n_{1i}$ and $n_{2i}$ $(i = 0,1,\ldots,k)$ are designed to achieve power $1-\beta$ at level $\alpha$ for the mean combination test with thresholds $(d_0,d_1,d_2)$ at a prespecified alternative $\mu$ and $\sigma = \sigma_0$. After the first stage, one may wish to re-assess the $n_{2i}$ so that the original test will have the desired power at the alternative $\mu$ and $\sigma = S_1$. Such an adaptation would require the final test to use a new threshold, $\tilde d_2$, that is slightly different from the original $d_2$. Also, the requirement $n_{2\tau}/n_{1\tau} = n_{20}/n_{10}$ means that we do not need to assign the same number of subjects to the treatment groups and the control group. For example, the sample size of the control group can be $k^{1/2}$ times that of the treatments, as suggested by Dunnett (1955) for a single-stage design.

Remark 2. If we choose $d_0 = -\infty$ and $d_1 = \infty$, then the design reduces to the case with no interim termination. In this case, we may also choose $d_1' = -\infty$, which implies that $\Delta_U$ is the usual upper limit of a $t$-interval when there is no sample-size re-estimation.

We actually proved that $\mathrm{pr}(\Delta > \Delta_L^*) \geq 1 - \alpha$, where

$$\Delta_L^*(d_0,d_1,d_2) = (D_{1\tau} - d_1 a_\tau S_1) \vee \{(D_{1\tau} - d_0 a_\tau S_1) \wedge (D - d_2 b_\tau S)\},$$

which is not a statistic when $M = 1$ because it uses additional observations from the second stage as if the trial had not stopped at the end of the first stage. However, we always have $\{\Delta_L^* < 0\} = \{\Delta_L < 0\}$, for two reasons. First, since $\Delta_L \leq \Delta_L^*$, $\Delta_L^* < 0$ implies $\Delta_L < 0$. On the other hand, if $\Delta_L < 0$, then we must have $D_{1\tau} - d_1 a_\tau S_1 < 0$. From here, we can show that $\Delta_L < 0$ implies $\Delta_L^* < 0$ by inspecting two cases: if $D_{1\tau} - d_0 a_\tau S_1 < 0$, then $\Delta_L^* < 0$; if $D_{1\tau} - d_0 a_\tau S_1 \geq 0$, then $M = 2$ and $\Delta_L^* = \Delta_L < 0$. These facts imply that even though $\Delta_L$ is a more conservative lower confidence limit than $\Delta_L^*$, they do not differ in determining whether $\Delta$ is larger than 0.

The essence of Theorem 1 is that $\mu_{\mathrm{null}}$ and $\mu_{00}$ are the least favourable configurations for the coverage probability of the lower and upper confidence limits, respectively. The key to its proof is the following stochastic ordering lemma. All proofs are given in Appendix 2.

LEMMA 1. Let $W = \{(\bar X_{1\tau} - \bar X_{10}) - (\mu_\tau - \mu_0)\}/a_\tau$. Then, for each fixed $\sigma^2$, $W$ is stochastically largest at $\mu_{\mathrm{null}}$ and stochastically smallest at $\mu_{00}$, among all $\mu = (\mu_0,\mu_1,\ldots,\mu_k)$.

Equations (A5) and (A6) in the proof of Lemma 1 imply that the distribution of $W$ is invariant for any $\mu = (\mu_0,\mu_0+\delta a_1,\ldots,\mu_0+\delta a_k)$, regardless of the values of $\mu_0$ and $\delta$. Similarly, $W$ has the same distribution for any $\mu = (\mu_0,\mu_0+\delta,-\infty,\ldots,-\infty)$. Therefore, we have the following.

THEOREM 2. For a drop-the-losers design with no interim termination (that is, $M = 2$ with probability one), $\Delta_L$ achieves the confidence level $1-\alpha$ at parameter configurations $\mu = (\mu_0,\mu_0+\delta a_1,\ldots,\mu_0+\delta a_k)$, and $\Delta_U$ achieves the confidence level $1-\alpha'$ at $\mu = (\mu_0,\mu_0+\delta,-\infty,\ldots,-\infty)$, for any constants $\mu_0$ and $\delta$.
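In computational terms, the limits $\Delta_L$ and $\Delta_U$ of Theorem 1 are plug-in formulas once the summary statistics are in hand. A minimal sketch; the function and argument names are ours, and the plug-in numbers in the note below come from the Galantamine example of §4, taking $d_2' = -1{\cdot}9787$ so that $\Delta_U$ acts as an upper limit:

```python
def mean_combination_limits(D1, D, a, b, S1, S, M, d0, d1, d2, d1p, d2p):
    """Lower and upper confidence limits of Theorem 1.

    D1 = first-stage difference D_{1,tau};  D = combined difference;
    a, b = a_tau, b_tau;  S1, S = variance estimates;  M = stopping stage;
    d1p, d2p = the thresholds d1', d2' for the upper limit.
    """
    if M == 1:
        lower = D1 - d1 * a * S1
        upper = D1 - d1p * a * S1
    else:
        lower = max(D1 - d1 * a * S1,
                    min(D1 - d0 * a * S1, D - d2 * b * S))
        upper = min(D1 - d1p * a * S1, D - d2p * b * S)
    return lower, upper
```

Plugging in the §4 summaries ($D_{1\tau} = 4{\cdot}2$, $D \approx 3{\cdot}91$, $a_\tau = (2/23)^{1/2}$, $b_\tau = (2/44)^{1/2}$, $S_1 \approx 6{\cdot}26$, $S = 6{\cdot}28$, $M = 2$) with $(d_0,d_1,d_2) = (-0{\cdot}1914, 3{\cdot}5161, 2{\cdot}3040)$ and $(d_1',d_2') = (-\infty, -1{\cdot}9787)$ reproduces the mean combination interval of roughly (0·83, 6·56).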

3·2. The p-value combination test

Lemma 1 enables us to apply the duality between hypothesis testing and confidence sets to the p-value combination method in the drop-the-losers design. More specifically, the lower confidence limit can be defined as follows. Let $G(\cdot)$ be the cumulative distribution function of $Y_\tau/S_1$ under $\mu_{\mathrm{null}}$ and $p_1(\delta) = 1 - G\{(D_{1\tau} - \delta)/(a_\tau S_1)\}$. Conditioned on $\tau$, let $F(\cdot)$ be the distribution function of a $t$-distribution with $n_{2\tau} + n_{20} - 2$ degrees of freedom and $p_2(\delta) = 1 - F[(D_{2\tau} - \delta)/\{S_2(1/n_{2\tau} + 1/n_{20})^{1/2}\}]$. We define

$$\Delta_{L1} = \sup\{\delta : p_1(\delta) \leq \alpha_1\}, \qquad \Delta_{L2} = \sup\{\delta : p_1(\delta) \leq \alpha_0,\ C(p_1(\delta), p_2(\delta)) \leq c\},$$
$$\Delta_L = \begin{cases} \Delta_{L1}, & M = 1,\\ \max(\Delta_{L1}, \Delta_{L2}), & M = 2. \end{cases}$$

Similarly, we may construct the upper confidence limit for the p-value combination method as follows. Let $H$ be the distribution function of a $t$-distribution with $\sum_{i=0}^k n_{1i} - k - 1$ degrees of freedom, $q_1(\delta) = H\{(D_{1\tau} - \delta)/(a_\tau S_1)\}$ and $q_2(\delta) = 1 - p_2(\delta)$. We define

$$\Delta_{U1} = \inf\{\delta : q_1(\delta) \leq \alpha_1\}, \qquad \Delta_{U2} = \inf\{\delta : q_1(\delta) \leq \alpha_0,\ C(q_1(\delta), q_2(\delta)) \leq c\},$$
$$\Delta_U = \begin{cases} \Delta_{U1}, & M = 1,\\ \min(\Delta_{U1}, \Delta_{U2}), & M = 2. \end{cases}$$

THEOREM 3. If $C(p_1,p_2)$ is a nondecreasing function in $p_1$ and $p_2$, then

$$\mathrm{pr}_\mu(\Delta > \Delta_L) \geq 1 - \alpha, \qquad \mathrm{pr}_\mu(\Delta < \Delta_U) \geq 1 - \alpha,$$

for any $\mu = (\mu_0,\mu_1,\ldots,\mu_k)$ and $\sigma^2 > 0$.

Remark 3. All commonly used combination functions, such as Fisher's combination method and the inverse normal method, satisfy the monotonicity requirement. Also, Theorem 3 remains true even if the second-stage sample sizes are re-estimated based on the first-stage sample means and sample variance. For example, $n_{2i}$ can be determined according to the predictive power rule (Lehmacher & Wassmer, 1999; Brannath et al., 2006) to achieve approximately a predetermined conditional power of $1-\beta$ at the estimated alternative $\hat\Delta = \bar X_{1\tau} - \bar X_{10}$ and $\hat\sigma = S_1$. More specifically, for some prespecified $n_{\max}$ and $n_{\min}$, this gives the sample size rule

$$n_{2i} = \max[n_{\min},\ \min\{n_{\max},\ 2(z_\beta + z_{p_{20}})^2 (S_1/\hat\Delta)^2\}] \quad (i = 0,1,\ldots,k),$$

where $z_\alpha$ is the $100(1-\alpha)$th percentile of the standard normal distribution and $p_{20}$ satisfies $C(p_1, p_{20}) = c$. The additional second-stage data are not observed if $p_1 > \alpha_0$ or $p_1 \leq \alpha_1$.

3·3. Unequal variance across treatment groups

It is common in practice that the variances are not the same for all treatments. If the response model is changed to $\epsilon_{tij} \sim N(0, \sigma_i^2)$ $(i = 0,\ldots,k)$ and the variances may be assumed to be known, then Theorems 1–3 remain valid with the following changes: replace $S_1$ and $S$ by 1; redefine $a_i = (\sigma_i^2/n_{1i} + \sigma_0^2/n_{10})^{1/2}$ and $b_\tau = \{\sigma_\tau^2/(n_{1\tau}+n_{2\tau}) + \sigma_0^2/(n_{10}+n_{20})\}^{1/2}$; and change the distributions $F$ and $H$ to standard normal. This result follows from the fact that Lemma 1 remains true for the inhomogeneous variance model if $c_i$ is replaced by $n_{1i}^{1/2}/\sigma_i$ in the proof. Furthermore, for the case where the variances are unknown but can be estimated consistently with large sample sizes, our confidence limits should have good coverage with the above changes.
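The predictive power sample-size rule of Remark 3 is straightforward to implement. A sketch; the rounding up to an integer and the dependency-free normal quantile are our additions:

```python
import math

def normal_quantile(p):
    """Phi^{-1}(p) by bisection on the standard normal cdf (no scipy needed)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def reestimated_n2(S1, delta_hat, beta, p20, n_min, n_max):
    """Second-stage sample size per arm from the rule in Remark 3,
    n_2i = max[n_min, min{n_max, 2 (z_beta + z_{p20})^2 (S1/delta_hat)^2}],
    where z_x = Phi^{-1}(1 - x) is the 100(1-x)th percentile."""
    z_beta = normal_quantile(1 - beta)
    z_p20 = normal_quantile(1 - p20)
    n = 2 * (z_beta + z_p20) ** 2 * (S1 / delta_hat) ** 2
    return max(n_min, min(n_max, math.ceil(n)))
```

For instance, with $\beta = 0{\cdot}2$, $p_{20} = 0{\cdot}025$ and $S_1/\hat\Delta = 2$, the unconstrained rule gives about 63 subjects per arm, which is then clipped to $[n_{\min}, n_{\max}]$.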

4. EXAMPLE

The methods for construction of confidence intervals are illustrated by a Phase III trial of galantamine to improve the symptoms of Alzheimer's disease. There were three dosage levels, 18, 24 and 36 mg/day, administered for 3 months, and a placebo group. The primary endpoint was the improvement in a subject's cognitive score on the Alzheimer's Disease Assessment Scale. Scores were assumed to be normally distributed. The trial was originally conducted as a group sequential study (Wilkinson & Murray, 2001; Stallard & Todd, 2003), but for illustrative purposes we pretend that a two-stage, drop-the-losers design was used.

We use $\alpha_0 = 0{\cdot}1850$, $\alpha_1 = 0{\cdot}0010$, $d_0 = -0{\cdot}1914$, $d_1 = 3{\cdot}5161$ and $d_1' = -\infty$ to match the original boundaries for superiority and futility at the first interim analysis. This implies that $c = 0{\cdot}0046$ for Fisher's combination method, and $d_2 = 2{\cdot}3040$ and $d_2' = -1{\cdot}9787$ for the mean combination test, in order to control the Type I error at 0·025. The summary data in Table 1, which were generated based on Wilkinson & Murray (2001) and Stallard & Todd (2003), suggest

Table 1. Efficacy outcomes from the Galantamine study

                         First analysis         Second analysis
i (Treatment group)    n1i   X̄1i   S1i        n2i   X̄2i   S2i
0 (Placebo)            23   −2·3   6·3         21   −1·8   6·4
1 (18 mg/day)          23    0·1   6·2          –     –     –
2 (24 mg/day)          23    1·9   6·6         21    1·8   6·3
3 (36 mg/day)          23    0·7   5·9          –     –     –

that the 24 mg/day group should be selected and the study should continue with the selected dose and the placebo, since $Y_\tau/S_1 = 2{\cdot}28$ was between $d_0$ and $d_1$. Similarly, the study should also continue based on the p-value combination method, since $p_1 = 0{\cdot}0326$ is between $\alpha_1$ and $\alpha_0$. At the second interim analysis, $S = 6{\cdot}28$, $D = 3{\cdot}91$ and $T/S = 2{\cdot}92 \geq d_2$, which implies that the 24 mg/day dose was effective by the mean combination method. In addition, since $p_2 = 0{\cdot}0368$ and Fisher's p-value combination $p_1 p_2 < c$, we can again claim that the selected dose was effective.
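The second-stage quantities quoted above can be checked directly from Table 1. A quick sketch; we plug in the reported $S = 6{\cdot}28$ rather than recomputing the modified variance estimate:

```python
# 24 mg/day arm versus placebo, first- and second-stage summaries from Table 1
n1, x1_trt, x1_ctl = 23, 1.9, -2.3
n2, x2_trt, x2_ctl = 21, 1.8, -1.8
S = 6.28                                    # modified pooled estimate from the text

# combined mean difference D and the standardized statistic T/S of rule (1)
D = (n1 * x1_trt + n2 * x2_trt) / (n1 + n2) \
    - (n1 * x1_ctl + n2 * x2_ctl) / (n1 + n2)
b = (1 / (n1 + n2) + 1 / (n1 + n2)) ** 0.5  # b_tau with equal combined group sizes
T_over_S = D / b / S
print(round(D, 2), round(T_over_S, 2))      # 3.91 2.92, matching the text
```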
The 95% confidence intervals for $\Delta$ based on the mean combination, the p-value combination and the Sampson and Sill methods are, respectively, (0·83, 6·56), (0·61, 6·79) and (0·51, 6·18). The $t$-interval based on Phase III data only is (−0·36, 7·56), while the naive $t$-interval using the combined data but incorrectly ignoring the sequential nature of the design is (1·26, 6·56). The lower confidence limits from the first three methods are in decreasing order, and they are all larger than the limit based on Phase III data only and smaller than the liberal limit from the naive approach. Matlab codes for constructing these confidence intervals and computing the sample size and critical threshold of a two-stage, drop-the-losers design can be found at: https://www.ehpr.ufl.edu/wu programs.

5. SIMULATION EVIDENCE

Since only the lower confidence limit can be used in deciding whether the selected treatment is better than the control, we focus on comparisons of the three lower bounds based on the mean combination, the p-value combination, and the Sampson and Sill methods. These confidence limits are consistent with their corresponding tests; i.e. a test claims superiority of a treatment if and only if the corresponding lower confidence limit is greater than 0, so comparing the powers of the tests compares the lower limits. Because analytic comparisons cannot be derived, simulation was used.

We report simulation results for the nine types of mean configurations shown as $\mu$ in Fig. 1. The simulation was conducted using $\alpha = 0{\cdot}05$, $\sigma = 1$ and the $\delta$ in $\mu$ taking values from 0·1 to 1·0 in steps of 0·1. The total sample sizes were determined using the one-stage test of Dunnett (1955) to achieve 70% power, because we would like the more efficient two-stage tests to reach powers around 80%. More specifically, the total sample sizes were 102, 128 and 150 based on the alternatives (0, 0·5, 0·5), (0, 0·5, 0·5, 0·5) and (0, 0·5, 0·5, 0·5, 0·5), respectively. For the two-stage designs with possible interim termination, the total sample size would be random, so we used expected sample sizes under the global null and under the alternative at $\delta = 0{\cdot}5$, and made the average of the two expected sample sizes equal to 102, 128 and 150, respectively. In addition, we let the total sample sizes in both stages be the same; consequently, the sample sizes per arm are smaller in the first stage than in the second because the first stage has more treatments. This equal split between the two stages has proven to be very efficient according to Bretz et al. (2006) and our experience.

[Figure 1 appears here: nine panels of power differences against δ, for the mean configurations μ = (0,δ,δ), (0,δ,δ,δ), (0,δ,δ,δ,δ); (0,δ/2,δ), (0,δ/3,2δ/3,δ), (0,δ/4,δ/2,3δ/4,δ); (0,−δ,δ), (0,0,0,δ) and (0,0,0,0,δ).]

Fig. 1. The differences in power using Dunnett's method as the reference (dashed line). The plot symbols correspond to the mean combination (○), the p-value combination (□) and the Sampson and Sill method (◇); the line types indicate with (dotted) and without (solid) interim termination.

Table 2. The cut-off values for the four new procedures

Method  Cut-offs      k = 2                   k = 3                   k = 4
MC1     (d0 d1 d2)    (−∞  ∞  1·85)           (−∞  ∞  1·94)           (−∞  ∞  1·99)
PC1     (α0 α1 c)     (1  0  0·0087)          (1  0  0·0087)          (1  0  0·0087)
MC2     (d0 d1 d2)    (0  3·61  1·87)         (0  3·62  1·95)         (0  3·65  1·98)
PC2     (α0 α1 c)     (0·67  0·0007  0·0072)  (0·75  0·0009  0·0073)  (0·80  0·0009  0·0072)

MC1, mean combination without interim termination; PC1, Fisher's p-value combination without interim termination; MC2, mean combination with interim termination; PC2, Fisher's p-value combination with interim termination.

Two versions of the mean combination and p-value combination methods were included to examine the interim termination options. First, since the Sampson and Sill conditional test does not allow interim termination, we match it by letting $d_0 = -\infty$ and $d_1 = \infty$ for the mean combination, and $\alpha_0 = 1$ and $\alpha_1 = 0$ for the p-value combination. A second mean combination procedure uses $d_0 = 0$ and the optimal cut-off values $(d_1, d_2)$ that achieve the best power for the mean configuration at $\delta = 0{\cdot}5$; a second p-value combination procedure uses $(\alpha_0,\alpha_1)$ that matches the second set of $(d_0, d_1)$. The details of these four procedures are given in Table 2.

Figure 1 shows the power differences from the Dunnett test for the five methods. All powers were based on 10 000 simulations. The new tests based on mean combination and p-value combination are better than the Sampson and Sill conditional test in the first six cases. The Sampson and Sill test is less powerful than the Dunnett test in the first case. However, their test is better than the mean combination method in the last three cases, when one treatment is far superior to the others. These relative disadvantages and advantages of Sampson and Sill's test are due to the fact that the statistics their test conditions on are related to the parameter of interest. Furthermore, compared to the mean combination test, the p-value combination method loses some efficiency. A similar result was shown by Tsiatis & Mehta (2003) and Jennison & Turnbull (2006) for the case of two-arm clinical trials. The reason is that the mean combination method uses the two sample means from both stages more efficiently.
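As an indication of how such power comparisons can be set up, here is a much-simplified Monte Carlo sketch of the mean combination test without interim termination (known variance, equal arm sizes). The function is our own illustration, not the simulation code used for Fig. 1:

```python
import numpy as np

def simulate_power(mu, n1, n2, d2, sigma=1.0, reps=10000, seed=1):
    """Estimate the probability that the selected arm is declared superior.

    mu : (k+1)-vector of true means, mu[0] = control. The best first-stage
    arm is carried forward, and the combined difference D is compared with
    d2 * b_tau * sigma (sigma known, so S is replaced by 1 as in Section 3.3).
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    rejections = 0
    for _ in range(reps):
        xbar1 = rng.normal(mu, sigma / np.sqrt(n1))
        tau = 1 + int(np.argmax(xbar1[1:] - xbar1[0]))  # all a_i equal here
        xbar2 = rng.normal([mu[0], mu[tau]], sigma / np.sqrt(n2))
        D = ((n1 * xbar1[tau] + n2 * xbar2[1]) / (n1 + n2)
             - (n1 * xbar1[0] + n2 * xbar2[0]) / (n1 + n2))
        b = (2.0 / (n1 + n2)) ** 0.5
        rejections += D / b >= d2 * sigma
    return rejections / reps
```

Under the global null the selection inflates the rejection rate above the nominal single-comparison level, which is why the cut-off values in Table 2 grow with $k$; under configurations such as (0, 0·5, 0·5) the estimated power is far larger.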

6. DISCUSSION

The main contribution of this paper is to lay down confidence limits for two tests commonly used in drop-the-losers designs. The confidence limits provide crucial information concerning how much better the selected treatment is compared to the control. On the theoretical side, it is surprising that the bound obtained from Theorem 1 applies to unequal sample sizes. Simultaneous confidence intervals for unequal sample sizes are usually difficult to derive. For example, the confidence intervals are very complicated in Dunnett's test; see Theorem 3.1.2 on page 55 of Hsu (1996); and Bischoff & Miller (2005) discussed only the equal-sample-size scenario.

One concern about using the drop-the-losers design in drug development is that only the treatment with the highest efficacy is allowed to be selected after the first stage. There are reasons that the second-best treatment might be chosen (Kieser, 2005; Jennison & Turnbull, 2006). In this case, certain hypothesis testing procedures can be shown to remain valid, albeit conservative (Stallard & Friede, 2008). Interval estimation for the corresponding analysis merits further study. However, a decision based on efficacy only is appropriate in a two-by-two randomized clinical trial, in which two treatments A and B and their combination A + B are compared to the placebo, because the treatment arms are usually similar in safety. Moreover, away from medicine, the drop-the-losers design can be applied to many one-sided multiple comparisons with a control when there is no safety concern.

The confidence limits constructed in this paper are valid for certain adaptations. When data are observed at the first stage, we may find that the real variance is probably larger than expected; thus, we may wish to increase the sample sizes to achieve the planned power. On the other hand, if we wish to increase the sample sizes when the mean differences are found to be smaller than expected after the first stage, the mean combination interval does not apply, but the p-value combination method still holds. For example, the original sample sizes may be determined to achieve 80% power of claiming superiority of the selected treatment based on the effect-size pattern (0, 0·5, 0·5, 0·8). If the interim estimates of effect sizes are smaller, say (0, 0·4, 0·4, 0·1), we may wish to increase the second-stage sample sizes.

Finally, it can be seen that the normality assumption in our model is not crucial. With moderate sample sizes, sample means tend to normality very quickly for most practical distributions. This also applies to dichotomous responses, in which case the sample mean $\bar X_{ti}$ becomes the estimated success rate $\hat p_{ti}$. Because the true $p_i$'s are expected to be different, it is usually difficult to derive theoretical bounds. Since Theorem 1 requires only $n_{2\tau}/n_{1\tau} = n_{20}/n_{10}$, the results apply to unequal $p_i$. Thus, our test and confidence limits can be used without the arcsine transformation.

ACKNOWLEDGEMENT The authors would like to thank the editor, the associate editor and three referees for their insightful comments and valuable suggestions. Wang’s research is supported in part by the National Science Foundation, U.S.A.

APPENDIX 1

Definition and properties of $S_1^2$ and $S^2$

From the first stage, we have a common estimated variance

$$S_1^2 = \sum_{i=0}^{k}\sum_{j=1}^{n_{1i}} (X_{1ij} - \bar X_{1i})^2 \Big/ \Big(\sum_{i=0}^{k} n_{1i} - k - 1\Big).$$

Let $m_2 = \min\{n_{2i},\ i = 1,\ldots,k\}$, which depends on how the second-stage sample sizes for all treatments are chosen based on $S_1$, and let $\bar X_{2\tau}^* = \sum_{j=1}^{m_2} X_{2\tau j}/m_2$. Define

$$S^2 = \Big\{\sum_{i=0}^{k}\sum_{j=1}^{n_{1i}} (X_{1ij} - \bar X_{1i})^2 + \sum_{j=1}^{n_{20}} (X_{20j} - \bar X_{20})^2 + \sum_{j=1}^{m_2} (X_{2\tau j} - \bar X_{2\tau}^*)^2\Big\} \Big/ \Big(\sum_{i=0}^{k} n_{1i} + n_{20} + m_2 - k - 3\Big).$$

Since $m_2$ and $\{n_{2i},\ i = 0,\ldots,k\}$ are prespecified functions of $S_1$, they are independent of the first-stage sample means. Hence, the distribution of $S$, conditioned on $S_1$, does not depend on $\bar X_{1i}$ $(i = 0,\ldots,k)$, which implies that $S$ is independent of $\bar X_{1i}$ $(i = 0,\ldots,k)$ and $W$.
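The two estimators can be computed directly from the raw observations. A sketch; the helper is our own and takes the selected arm's second-stage data already truncated to $m_2$ observations:

```python
import numpy as np

def pooled_variances(stage1, X20, X2tau_star):
    """S_1^2 and the modified S^2 of Appendix 1.

    stage1      : list of k+1 first-stage samples, index 0 = control
    X20         : second-stage control sample (n_20 observations)
    X2tau_star  : first m_2 observations from the selected treatment
    """
    k = len(stage1) - 1
    ss1 = sum(float(((x - x.mean()) ** 2).sum()) for x in stage1)
    n1_total = sum(len(x) for x in stage1)
    S1_sq = ss1 / (n1_total - k - 1)                      # df: sum n_1i - k - 1
    ss2 = (((X20 - X20.mean()) ** 2).sum()
           + ((X2tau_star - X2tau_star.mean()) ** 2).sum())
    S_sq = (ss1 + ss2) / (n1_total + len(X20) + len(X2tau_star) - k - 3)
    return S1_sq, S_sq
```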

APPENDIX 2

Proofs of Theorems 1–3 and of Lemma 1

Proof of Theorem 1. Let $\lambda = n_{2\tau}/n_{1\tau} = n_{20}/n_{10}$. We have $(1/n_{2\tau} + 1/n_{20})^{1/2} = a_\tau/\lambda^{1/2}$, $b_\tau = a_\tau/(1+\lambda)^{1/2}$ and $D - (\mu_\tau - \mu_0) = \lambda\{\bar X_{2\tau} - \bar X_{20} - (\mu_\tau - \mu_0)\}/(1+\lambda) + a_\tau W/(1+\lambda)$. Hence, $D - (\mu_\tau - \mu_0) - d_2 b_\tau S < 0$ is equivalent to

$$Z = \frac{\bar X_{2\tau} - \bar X_{20} - (\mu_\tau - \mu_0)}{a_\tau \sigma/\lambda^{1/2}} < U(W, S),$$

where $U(W,S) = \{d_2(1+\lambda)^{1/2} S - W\}/(\sigma\lambda^{1/2})$. Consequently, we have

$$I(\Delta > \Delta_L^*) = I(W < d_1 S_1)\, I\{W < d_0 S_1 \text{ or } Z < U(W,S)\}$$
$$= I(W < d_0 S_1) + I(d_0 S_1 \leq W < d_1 S_1)\, I\{Z < U(W,S)\}.$$

Therefore,

$$\mathrm{pr}_\mu(\Delta > \Delta_L) \geq \mathrm{pr}_\mu(\Delta > \Delta_L^*) \tag{A1}$$
$$= E_\mu[I(W < d_0 S_1) + I(d_0 S_1 \leq W < d_1 S_1)\,\Phi\{U(W,S)\}] \tag{A2}$$
$$\geq E_{\mu_{\mathrm{null}}}[I(W < d_0 S_1) + I(d_0 S_1 \leq W < d_1 S_1)\,\Phi\{U(W,S)\}] \tag{A3}$$
$$= \mathrm{pr}_{\mu_{\mathrm{null}}}(\Delta > \Delta_L^*) = \mathrm{pr}_{\mu_{\mathrm{null}}}(0 > \Delta_L^*)$$
$$= 1 - \mathrm{pr}_{\mu_{\mathrm{null}}}\{(Y_\tau/S_1 \geq d_1) \cup (Y_\tau/S_1 \geq d_0,\ T/S \geq d_2)\} = 1 - \alpha.$$

The equality (A2) is due to the fact that, conditioned on the first-stage data, $Z$ follows a standard normal distribution and is independent of $S$. The second inequality (A3) holds because $W$ is independent of $S_1$ and $S$; given $S_1$ and $S$, the sum in the bracket, $I(W < d_0 S_1) + I(d_0 S_1 \leq W < d_1 S_1)\Phi\{U(W,S)\}$, is nonincreasing as a function of $W$: it equals 1 on $\{W < d_0 S_1\}$, is less than 1 and decreasing on $\{d_0 S_1 \leq W < d_1 S_1\}$, and is 0 on $\{W \geq d_1 S_1\}$; and $W$ is stochastically largest at $\mu_{\mathrm{null}}$ by Lemma 1.

The proof for the upper confidence limit is nearly the same. Let $\Delta_U^* = (D_{1\tau} - d_1' a_\tau S_1) \wedge (D - d_2' b_\tau S)$ and $U'(W,S) = \{d_2'(1+\lambda)^{1/2} S - W\}/(\sigma\lambda^{1/2})$. Then $I(\Delta < \Delta_U^*) = I(W > d_1' S_1)\, I\{Z > U'(W,S)\}$. Therefore,

$$\mathrm{pr}_\mu(\Delta < \Delta_U) \geq \mathrm{pr}_\mu(\Delta < \Delta_U^*)$$
$$= E_\mu\big(I(W > d_1' S_1)[1 - \Phi\{U'(W,S)\}]\big)$$
$$\geq E_{\mu_{00}}\big(I(W > d_1' S_1)[1 - \Phi\{U'(W,S)\}]\big)$$
$$= E_{\mu_{00}}[I(W > d_1' S_1)\, I\{Z > U'(W,S)\}]$$
$$= \mathrm{pr}_{\mu_{00}}\{(Y_\tau/S_1 > d_1') \cap (T/S > d_2')\} = 1 - \alpha'.$$

The first inequality is true because $\Delta_U \geq \Delta_U^*$. The second inequality holds because $W$ is independent of $S_1$ and $S$; given $S_1$ and $S$, the product in the bracket is nondecreasing as a function of $W$; and the random variable $W$ is stochastically smallest at $\mu_{00}$ by Lemma 1. □

Proof of Theorem 2. Equality holds in (A1) because $\theta_L^* = \theta_L$ when there is no interim termination. Equations (A5) and (A6) imply that the distribution of $W$ is the same at $\mu = (\mu_0, \mu_0 + \delta a_1, \ldots, \mu_0 + \delta a_k)$ for any constants $\mu_0$ and $\delta$; hence the confidence level is achieved at these parameter configurations. The proof for the upper confidence limit is almost the same. $\square$
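The invariance invoked here can be illustrated by simulation: along $\mu = (\mu_0, \mu_0 + \delta a_1, \ldots, \mu_0 + \delta a_k)$ every standardized difference $(\bar X_{1i} - \bar X_{10})/a_i$ is shifted by the same constant $\delta$, so the selected index, and hence $W$, has a $\delta$-free distribution. A minimal Monte Carlo sketch, with assumed unequal sample sizes so that the invariance is not trivial:

```python
import bisect
import math
import random

random.seed(7)

N1, N10, SIGMA, MU0 = (30, 50, 80), 40, 1.0, 0.5   # illustrative design
A = [math.sqrt(1.0 / n + 1.0 / N10) for n in N1]

def sample_W(delta):
    """Draw W at the configuration mu_i = mu0 + delta * a_i."""
    x10 = random.gauss(MU0, SIGMA / math.sqrt(N10))
    v = [(random.gauss(MU0 + delta * A[i], SIGMA / math.sqrt(N1[i])) - x10) / A[i]
         for i in range(3)]                  # selection statistics (X1i - X10)/a_i
    t = max(range(3), key=lambda i: v[i])    # selected treatment tau
    return v[t] - delta                      # delta_tau = delta for every i

n = 40000
w0 = sorted(sample_W(0.0) for _ in range(n))
w1 = sorted(sample_W(1.2) for _ in range(n))

def ecdf(ws, x):
    return bisect.bisect_right(ws, x) / len(ws)

# the two empirical distributions should agree up to Monte Carlo error
max_gap = max(abs(ecdf(w0, x) - ecdf(w1, x)) for x in (-1.0, 0.0, 1.0, 2.0))
```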

Proof of Theorem 3. We define $\theta_L^* = \max(\theta_{L1}, \theta_{L2})$, which is not observable if $M = 1$. Obviously, $\theta_L \leqslant \theta_L^*$. Let $A(p) = \sup\{q : C(p, q) \leqslant c\}$, which is a nonincreasing function because $C(p_1, p_2)$ is a nondecreasing function. Note that
\begin{align*}
h(\delta) &= I(\{p_1(\delta) \leqslant \alpha_1\} \cup [p_1(\delta) \leqslant \alpha_0,\ C\{p_1(\delta), p_2(\delta)\} \leqslant c]) \\
&= I(\{p_1(\delta) \leqslant \alpha_1\} \cup [p_1(\delta) \leqslant \alpha_0,\ p_2(\delta) \leqslant A\{p_1(\delta)\}]) \\
&= I\{p_1(\delta) \leqslant \alpha_1\} + I\{\alpha_1 < p_1(\delta) \leqslant \alpha_0\}\, I[p_2(\delta) \leqslant A\{p_1(\delta)\}]
\end{align*}
is a nonincreasing function changing from 1 to 0, since $p_1(\delta)$ and $p_2(\delta)$ are increasing functions of $\delta$ and $C(p_1, p_2)$ is nondecreasing. Therefore, $\theta_L^* = \sup\{\delta : h(\delta) = 1\}$. Hence, we have
\begin{align*}
\mathrm{pr}_\mu(\theta_L \geqslant \theta) &\leqslant \mathrm{pr}_\mu(\theta_L^* \geqslant \theta) \\
&= \mathrm{pr}_\mu\{h(\theta) = 1\} \\
&= E_\mu\big(I\{p_1(\theta) \leqslant \alpha_1\} + I\{\alpha_1 < p_1(\theta) \leqslant \alpha_0\}\, I[p_2(\theta) \leqslant A\{p_1(\theta)\}]\big) \\
&= E_\mu[I\{p_1(\theta) \leqslant \alpha_1\} + I\{\alpha_1 < p_1(\theta) \leqslant \alpha_0\}\, A\{p_1(\theta)\}] & \text{(A4)} \\
&= E_\mu[I\{1 - G(W/S_1) \leqslant \alpha_1\} + I\{\alpha_1 < 1 - G(W/S_1) \leqslant \alpha_0\}\, A\{1 - G(W/S_1)\}] \\
&\leqslant E_{\mu_{\mathrm{null}}}[I\{1 - G(W/S_1) \leqslant \alpha_1\} + I\{\alpha_1 < 1 - G(W/S_1) \leqslant \alpha_0\}\, A\{1 - G(W/S_1)\}] = \alpha.
\end{align*}
Equation (A4) holds because $p_2(\theta)$ follows a uniform distribution conditional on the first-stage data. The last inequality holds because the random variable $W$ is stochastically largest at $\mu_{\mathrm{null}}$ by Lemma 1, and the sum
\[
I\{1 - G(W/S_1) \leqslant \alpha_1\} + I\{\alpha_1 < 1 - G(W/S_1) \leqslant \alpha_0\}\, A\{1 - G(W/S_1)\}
\]
is nondecreasing as a function of $W$, which follows from the properties that $1 - G(W/S_1)$ is a decreasing function of $W$ and $I(y \leqslant \alpha_1) + I(\alpha_1 < y \leqslant \alpha_0)A(y)$ is a nonincreasing function of $y$. The proof for the upper confidence limit is omitted. $\square$
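The reduction of the rejection event $C\{p_1(\delta), p_2(\delta)\} \leqslant c$ to $p_2(\delta) \leqslant A\{p_1(\delta)\}$ can be made concrete with a specific combination function. The sketch below assumes, for illustration only, the product rule $C(p_1, p_2) = p_1 p_2$ (the argument requires only that $C$ be nondecreasing in each argument), for which $A(p) = \min(1, c/p)$; the critical value $c$ is an arbitrary choice.

```python
def C(p1, p2):
    """Illustrative nondecreasing combination function: product of the p-values."""
    return p1 * p2

def A(p, c):
    """A(p) = sup{q in [0,1] : C(p, q) <= c}; closed form for the product rule."""
    return min(1.0, c / p) if p > 0 else 1.0

c = 0.00871   # illustrative second-stage critical value

grid = [i / 200 for i in range(1, 201)]
# A is nonincreasing in p ...
nonincreasing = all(A(p, c) >= A(q, c) for p, q in zip(grid, grid[1:]))
# ... and the event {C(p1, p2) <= c} coincides with {p2 <= A(p1)}
equivalent = all((C(p1, p2) <= c) == (p2 <= A(p1, c))
                 for p1 in grid for p2 in grid)
```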

Proof of Lemma 1. In this proof, we denote $c_i = (n_{1i})^{1/2}/\sigma$, $Z_i = c_i(\bar{X}_{1i} - \mu_i)$, $\delta_i = (\mu_i - \mu_0)/a_i$, and let $\phi(z)$ and $\Phi(z)$ be the probability density function and the cumulative distribution function of $N(0, 1)$. Let $f_0(y)$ be the probability density function of $N(0, \sigma^2/n_{10})$. We have
\begin{align*}
\mathrm{pr}_\mu(W \leqslant x) &= \sum_{i=1}^k \mathrm{pr}\left\{\frac{\bar{X}_{1i} - \bar{X}_{10} - (\mu_i - \mu_0)}{a_i} \leqslant x,\ \frac{\bar{X}_{1i} - \bar{X}_{10}}{a_i} > \frac{\bar{X}_{1j} - \bar{X}_{10}}{a_j}\ (j \neq i)\right\} \\
&= \sum_{i=1}^k \mathrm{pr}\left\{\frac{Z_i/c_i - Z_0/c_0}{a_i} \leqslant x,\ \frac{Z_i/c_i + \mu_i - Z_0/c_0 - \mu_0}{a_i} > \frac{Z_j/c_j + \mu_j - Z_0/c_0 - \mu_0}{a_j}\ (j \neq i)\right\} \\
&= \int \sum_{i=1}^k \mathrm{pr}\left\{\frac{Z_i/c_i - y}{a_i} \leqslant x,\ \frac{Z_i/c_i - y}{a_i} + \delta_i > \frac{Z_j/c_j - y}{a_j} + \delta_j\ (j \neq i)\right\} f_0(y)\,dy \\
&= \int F(\delta_1, \delta_2, \ldots, \delta_k, y)\, f_0(y)\,dy, & \text{(A5)}
\end{align*}
where
\[
F(\delta_1, \delta_2, \ldots, \delta_k, y) = \sum_{i=1}^k \mathrm{pr}\left\{\frac{Z_i/c_i - y}{a_i} \leqslant x,\ \frac{Z_i/c_i - y}{a_i} + \delta_i > \frac{Z_j/c_j - y}{a_j} + \delta_j\ (j \neq i)\right\}.
\]
Note that $\mathrm{pr}_\mu(W \leqslant x)$ depends on the $\mu_i$'s only through the $\delta_i$'s for any $x \in R^1$. First, we need to show that $\mathrm{pr}_\mu(W \leqslant x)$, as a function of $(\delta_1, \ldots, \delta_k)$, is minimized at $(\delta_k, \ldots, \delta_k)$ for any constant $\delta_k$. It suffices to show that $F(\delta_1, \ldots, \delta_k, y)$ is minimized at $(\delta_k, \ldots, \delta_k)$ for any $y$. Note that $(Z_i/c_i - y)/a_i \leqslant x$ is equivalent to $Z_i \leqslant (xa_i + y)c_i$, and $(Z_i/c_i - y)/a_i + \delta_i > (Z_j/c_j - y)/a_j + \delta_j$ is the same as $Z_j < c_j a_j\{(Z_i/c_i - y)/a_i + \delta_i - \delta_j\} + c_j y$. Therefore,
\[
F(\delta_1, \delta_2, \ldots, \delta_k, y) = \sum_{i=1}^k \int_{-\infty}^{(xa_i+y)c_i} \phi(z) \prod_{j \neq i} \Phi[c_j a_j\{(z/c_i - y)/a_i + \delta_i - \delta_j\} + c_j y]\,dz. \qquad \text{(A6)}
\]
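Formula (A6) lends itself to a numerical sanity check for $k = 2$: evaluating its right-hand side by quadrature should reproduce a direct Monte Carlo estimate of the defining probability of $F(\delta_1, \delta_2, y)$. All constants below are illustrative assumptions.

```python
import math
import random

random.seed(3)

# illustrative constants for k = 2
c_ = [1.0, 1.4]        # c_1, c_2
a_ = [0.9, 0.7]        # a_1, a_2
d_ = [0.5, -0.3]       # delta_1, delta_2
x, y = 0.6, 0.2

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def quad_F():
    """Right-hand side of (A6) for k = 2, by the trapezoid rule."""
    total = 0.0
    for i in range(2):
        j = 1 - i
        lo, hi, n = -8.0, (x * a_[i] + y) * c_[i], 4000
        h = (hi - lo) / n
        f = lambda z: phi(z) * Phi(c_[j] * a_[j] * ((z / c_[i] - y) / a_[i]
                                                    + d_[i] - d_[j]) + c_[j] * y)
        total += h * (0.5 * f(lo) + sum(f(lo + m * h) for m in range(1, n)) + 0.5 * f(hi))
    return total

def mc_F(n=100000):
    """Direct Monte Carlo estimate of the defining probability of F."""
    hits = 0
    for _ in range(n):
        z = [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]
        v = [(z[i] / c_[i] - y) / a_[i] for i in range(2)]
        i = 0 if v[0] + d_[0] > v[1] + d_[1] else 1   # index attaining the maximum
        hits += v[i] <= x
    return hits / n

qF, mF = quad_F(), mc_F()
gap = abs(qF - mF)
```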

Without loss of generality, assume that $\delta_1 \geqslant \delta_2 \geqslant \cdots \geqslant \delta_k$. We claim that, for any $m \geqslant 1$,
\[
F(\delta_m, \ldots, \delta_m, \delta_{m+1}, \ldots, \delta_k, y) \geqslant F(\delta_{m+1}, \ldots, \delta_{m+1}, \delta_{m+1}, \ldots, \delta_k, y). \qquad \text{(A7)}
\]

For any fixed $y$, if we let $\Phi_{ijst}(z) = \Phi[c_j a_j\{(z/c_i - y)/a_i + \delta_s - \delta_t\} + c_j y]$, then
\begin{align*}
F(\delta_m, \ldots, \delta_m, \delta_{m+1}, \ldots, \delta_k, y) &= \sum_{i=1}^m \int_{-\infty}^{(xa_i+y)c_i} \phi(z) \prod_{j \leqslant m,\, j \neq i} \Phi_{ijjj}(z) \prod_{j \geqslant m+1} \Phi_{ijmj}(z)\,dz \\
&\quad + \sum_{i=m+1}^k \int_{-\infty}^{(xa_i+y)c_i} \phi(z) \prod_{j \leqslant m} \Phi_{ijim}(z) \prod_{j \neq i,\, j \geqslant m+1} \Phi_{ijij}(z)\,dz.
\end{align*}

Let $\gamma_{ijst}(z) = c_j a_j \phi[c_j a_j\{(z/c_i - y)/a_i + \delta_s - \delta_t\} + c_j y]$. Taking the derivative with respect to $\delta_m$, we have
\begin{align*}
\frac{\partial F(\delta_m, \ldots, \delta_m, \delta_{m+1}, \ldots, \delta_k, y)}{\partial\delta_m}
&= \sum_{i=1}^m \int_{-\infty}^{(xa_i+y)c_i} \phi(z) \prod_{j \leqslant m,\, j \neq i} \Phi_{ijjj}(z) \sum_{v=m+1}^k \Big\{\gamma_{ivmv}(z) \prod_{j \geqslant m+1,\, j \neq v} \Phi_{ijmj}(z)\Big\}\,dz \\
&\quad - \sum_{i=m+1}^k \int_{-\infty}^{(xa_i+y)c_i} \phi(z) \sum_{v=1}^m \Big\{\gamma_{ivim}(z) \prod_{j \leqslant m,\, j \neq v} \Phi_{ijim}(z)\Big\} \prod_{j \neq i,\, j \geqslant m+1} \Phi_{ijij}(z)\,dz.
\end{align*}
Exchanging the indices $i$ and $v$ in the last line and rearranging the summations, we have
\begin{align*}
\frac{\partial F(\delta_m, \ldots, \delta_m, \delta_{m+1}, \ldots, \delta_k, y)}{\partial\delta_m}
&= \sum_{i=1}^m \sum_{v=m+1}^k \int_{-\infty}^{(xa_i+y)c_i} \phi(z)\gamma_{ivmv}(z) \prod_{j \leqslant m,\, j \neq i} \Phi_{ijjj}(z) \prod_{j \geqslant m+1,\, j \neq v} \Phi_{ijmj}(z)\,dz \\
&\quad - \sum_{i=1}^m \sum_{v=m+1}^k \int_{-\infty}^{(xa_v+y)c_v} \phi(z)\gamma_{vivm}(z) \prod_{j \leqslant m,\, j \neq i} \Phi_{vjvm}(z) \prod_{j \neq v,\, j \geqslant m+1} \Phi_{vjvj}(z)\,dz. \qquad \text{(A8)}
\end{align*}
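The step from (A8) to (A9) rests on a change of variables and four substitution identities, which can be verified directly in floating point. The constants and the indices $i \leqslant m < v$ in the sketch below are arbitrary illustrative choices (written 0-based in the code).

```python
import math
import random

random.seed(11)

# illustrative constants (indices are 0-based here)
c_ = [0.8, 1.1, 1.5, 0.6]
a_ = [1.2, 0.9, 0.7, 1.4]
d_ = [1.0, 0.4, -0.2, -0.9]   # deltas in nonincreasing order
y = 0.3

def phi(z):
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def arg(i, j, s, t, z):
    """Common argument c_j a_j {(z/c_i - y)/a_i + delta_s - delta_t} + c_j y."""
    return c_[j] * a_[j] * ((z / c_[i] - y) / a_[i] + d_[s] - d_[t]) + c_[j] * y

Phi_ijst = lambda i, j, s, t, z: Phi(arg(i, j, s, t, z))
gamma_ijst = lambda i, j, s, t, z: c_[j] * a_[j] * phi(arg(i, j, s, t, z))

i, v, m, j = 0, 2, 1, 3       # i <= m < v, plus an arbitrary fourth index j
z = random.gauss(0.0, 1.0)
w = c_[i] * a_[i] * ((z / c_[v] - y) / a_[v] + d_[v] - d_[m]) + c_[i] * y

errors = [
    abs(phi(z) - gamma_ijst(i, v, m, v, w) / (a_[v] * c_[v])),
    abs(gamma_ijst(v, i, v, m, z) - a_[i] * c_[i] * phi(w)),
    abs(Phi_ijst(v, j, v, m, z) - Phi_ijst(i, j, j, j, w)),
    abs(Phi_ijst(v, j, v, j, z) - Phi_ijst(i, j, m, j, w)),
]
max_err = max(errors)
```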

Let $w = c_i a_i\{(z/c_v - y)/a_v + \delta_v - \delta_m\} + c_i y$. Then $z = c_v a_v\{(w/c_i - y)/a_i + \delta_m - \delta_v\} + c_v y$. Hence,
\[
\phi(z) = \gamma_{ivmv}(w)/(a_v c_v), \quad \gamma_{vivm}(z) = a_i c_i \phi(w), \quad \Phi_{vjvm}(z) = \Phi_{ijjj}(w), \quad \Phi_{vjvj}(z) = \Phi_{ijmj}(w).
\]
Now insert these transformations into the last line of equation (A8):

\begin{align*}
\frac{\partial F(\delta_m, \ldots, \delta_m, \delta_{m+1}, \ldots, \delta_k, y)}{\partial\delta_m}
&= \sum_{i=1}^m \sum_{v=m+1}^k \int_{-\infty}^{(xa_i+y)c_i} \phi(z)\gamma_{ivmv}(z) \prod_{j \leqslant m,\, j \neq i} \Phi_{ijjj}(z) \prod_{j \geqslant m+1,\, j \neq v} \Phi_{ijmj}(z)\,dz \\
&\quad - \sum_{i=1}^m \sum_{v=m+1}^k \int_{-\infty}^{(xa_i+y)c_i - a_i c_i(\delta_m - \delta_v)} \gamma_{ivmv}(w)\phi(w) \prod_{j \leqslant m,\, j \neq i} \Phi_{ijjj}(w) \prod_{j \neq v,\, j \geqslant m+1} \Phi_{ijmj}(w)\,dw \\
&= \sum_{i=1}^m \sum_{v=m+1}^k \int_{(xa_i+y)c_i - a_i c_i(\delta_m - \delta_v)}^{(xa_i+y)c_i} \phi(z)\gamma_{ivmv}(z) \prod_{j \leqslant m,\, j \neq i} \Phi_{ijjj}(z) \prod_{j \geqslant m+1,\, j \neq v} \Phi_{ijmj}(z)\,dz \geqslant 0, \qquad \text{(A9)}
\end{align*}
when $\delta_m \geqslant \delta_{m+1} \geqslant \cdots \geqslant \delta_k$. Therefore, the claim (A7) is true. Thus, applying (A7) repeatedly for $m = 1, 2, \ldots, k-1$, we obtain
\[
F(\delta_1, \delta_2, \ldots, \delta_k, y) \geqslant F(\delta_2, \delta_2, \ldots, \delta_k, y) \geqslant \cdots \geqslant F(\delta_k, \delta_k, \ldots, \delta_k, y).
\]
Therefore, $W$ is stochastically largest at $\mu_{\mathrm{null}}$. On the other hand, inequality (A9) implies that $\partial F(\delta_1, \delta_2, \ldots, \delta_k, y)/\partial\delta_1 \geqslant 0$; hence, we have
\[
F(\delta_1, \delta_2, \ldots, \delta_k, y) \leqslant F(\infty, \delta_2, \ldots, \delta_k, y) = F(\delta, -\infty, \ldots, -\infty, y) = \Phi\{(xa_1 + y)c_1\}.
\]
Therefore, $W$ is stochastically smallest at $\mu_{00}$, where it follows the $N(0, \sigma^2)$ distribution. $\square$
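The two stochastic orderings established by this lemma can be illustrated by simulation: with all $\delta_i$ equal (a $\mu_{\mathrm{null}}$-type configuration) the first-stage selection makes $W$ the maximum of correlated variables, hence stochastically largest, while with $\delta_2, \ldots, \delta_k$ pushed far down (approximating $\mu_{00}$) $W$ reduces to a single $N(0, \sigma^2)$ draw. Sample sizes and configurations below are illustrative assumptions.

```python
import math
import random

random.seed(5)

N1, N10, SIGMA = (50, 50, 50), 50, 1.0
A = [math.sqrt(1.0 / n + 1.0 / N10) for n in N1]

def sample_W(deltas):
    """W = (X1tau - X10)/a_tau - delta_tau for the first-stage winner tau."""
    x10 = random.gauss(0.0, SIGMA / math.sqrt(N10))
    v = [(random.gauss(deltas[i] * A[i], SIGMA / math.sqrt(N1[i])) - x10) / A[i]
         for i in range(3)]
    t = max(range(3), key=lambda i: v[i])
    return v[t] - deltas[t]

n = 40000
w_null = [sample_W((0.0, 0.0, 0.0)) for _ in range(n)]      # all delta_i equal
w_spread = [sample_W((2.0, 0.0, -2.0)) for _ in range(n)]   # ordered delta_i
w_00 = [sample_W((0.0, -30.0, -30.0)) for _ in range(n)]    # approximates mu_00

def ecdf(ws, x):
    return sum(w <= x for w in ws) / len(ws)

# pr(W <= x) is smallest at the equal-delta (mu_null type) configuration ...
ordered = all(ecdf(w_null, x) <= ecdf(w_spread, x) + 0.01 for x in (-1.0, 0.0, 1.0))
# ... and W is close to N(0, sigma^2) at the mu_00 type configuration
sd_00 = math.sqrt(sum(w * w for w in w_00) / n)
```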

REFERENCES

BAUER, P. & KÖHNE, K. (1994). Evaluation of experiments with adaptive interim analyses. Biometrics 50, 1029–41.
BISCHOFF, W. & MILLER, F. (2005). Adaptive two-stage procedures to find the best treatment in clinical trials. Biometrika 92, 197–212.
BRANNATH, W., KÖNIG, F. & BAUER, P. (2006). Estimation in flexible two stage designs. Statist. Med. 25, 3366–81.
BRETZ, F., SCHMIDLI, H., KÖNIG, F., RACINE, A. & MAURER, W. (2006). Confirmatory seamless phase II/III clinical trials with hypotheses selection at interim: general concepts. Biomet. J. 48, 623–34.
CHANG, M. (2007). Adaptive design method based on sum of p-values. Statist. Med. 26, 2772–84.
CHOW, S. C. & CHANG, M. (2007). Adaptive Design Methods in Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC.
DUNNETT, C. (1955). A multiple comparison procedure for comparing several treatments with a control. J. Am. Statist. Assoc. 50, 1096–121.
HSU, J. C. (1996). Multiple Comparisons, Theory and Methods. London: Chapman and Hall.
JENNISON, C. & TURNBULL, B. W. (2006). Adaptive and nonadaptive group sequential tests. Biometrika 93, 1–21.
KIESER, M. (2005). Discussion of the paper by Sampson and Sill. Biomet. J. 47, 271–3.
LEHMACHER, W. & WASSMER, G. (1999). Adaptive sample size calculation in group sequential trials. Biometrics 55, 1286–90.
LIU, Q., PROSCHAN, M. A. & PLEDGER, G. W. (2002). A unified theory of two-stage adaptive designs. J. Am. Statist. Assoc. 97, 1034–41.
SAMPSON, A. R. & SILL, M. W. (2005). Drop-the-losers design: normal case. Biomet. J. 47, 257–68.
STALLARD, N. & FRIEDE, T. (2008). A group-sequential design for clinical trials with treatment selection. Statist. Med. 27, 620–7.
STALLARD, N. & TODD, S. (2003). Sequential designs for phase III clinical trials incorporating treatment selection. Statist. Med. 22, 689–703.
THALL, P. F., SIMON, R. & ELLENBERG, S. S. (1988). Two-stage selection and testing designs for comparative clinical trials. Biometrika 75, 303–10.
TSIATIS, A. A. & MEHTA, C. (2003). On the inefficiency of the adaptive design for monitoring clinical trials. Biometrika 90, 367–78.
WANG, J. (2006). An adaptive two-stage design with treatment selection using the conditional approach. Biomet. J. 48, 679–89.
WILKINSON, D. & MURRAY, J. R. (2001). Galantamine: a randomized, double-blind, dose comparison in patients with Alzheimer's disease. Int. J. Geriatric Psychiat. 16, 852–7.

[Received December 2008. Revised September 2009]