Drug Information Journal, Vol. 34, pp. 591±597, 2000 0092-8615/2000 Printed in the USA. All rights reserved. Copyright  2000 Drug Information Association Inc.

SEQUENTIAL AND MULTIPLE TESTING FOR DOSE-RESPONSE

WALTER LEHMACHER,PHD Professor, Institute for Medical , Informatics and , University of Cologne, Germany MEINHARD KIESER,PHD Head, Department of Biometry, Dr. Willmar Schwabe Pharmaceuticals, Karlsruhe, Germany LUDWIG HOTHORN,PHD Professor, Department of , University of Hannover, Germany

For the analysis of dose-response relationship under the assumption of ordered alterna- tives several global trend tests are available. Furthermore, there are multiple test proce- dures which can identify doses as effective or even as minimally effective. In this paper it is shown that the principles of multiple comparisons and interim analyses can be combined in flexible and adaptive strategies for dose-response analyses; these procedures control the experimentwise error rate.

Key Words: Dose-response; Interim analysis; Multiple testing

INTRODUCTION rate in a strong sense) are necessary. The global and multiplicity aspects are also re- THE FIRST QUESTION in dose-response flected by the actual International Confer- analysis is whether a monotone trend across ence for Harmonization Guidelines (1) for dose groups exists. This leads to global trend dose-response analysis. Appropriate order tests, where the global level (familywise er- restricted tests can be confirmatory for dem- ror rate in a weak sense) is controlled. In onstrating a global trend and suitable multi- clinical research, one is additionally inter- ple procedures can identify a recommended ested in which doses are effective, which dose. minimal dose is effective, or which dose For the global hypothesis, there are sev- steps are effective, in order to substantiate eral suitable trend tests. For multiple compar- the choice of a dose. Dose differences are isons, there are several approaches for multi- considered effective if they are clinically rel- ple test procedures, some of which are evant and statistically significant. Therefore, described in the following. It is easy to per- multiple comparisons which control the ex- form global tests in group sequential or adap- perimentwise level (familywise type-I error tive interim versions, but it seems that there are only a few methods which combine mul- tiple test procedures with interim analyses. Reprint address: Prof. Dr. Walter Lehmacher, Institut In this paper, some new approaches are pro- fuÈr Medizinische Statistik, Informatik und Epidemiolo- gie der UniversitaÈtzuKoÈln, Joseph-Stelzmann-Str. 9, D- posed for the combination of these multiple 50931, KoÈln, Germany. E-mail: Walter.Lehmacher@ and sequential dose-response analyses, medizin.uni-koeln.de. which can link proof of global trend with

591 592 Walter Lehmacher, Meinhard Kieser, and Ludwig Hothorn

proof of efficacy of a chosen dose. Sequential K = 3:H0 = H123 = H13 : µ1 =µ2 =µ3 procedures can be especially useful in this H12 : µ1 =µ2 area because the multiplicity of this problem K = 4:H = H = H : µ =µ =µ =µ makes a reasonable sample size calculation 0 1234 14 1 2 3 4 = difficult. H123 H13

H12 K = 5:H = H = H : TESTS OF TREND 0 12345 15 µ1 =µ2 =µ3 =µ4 =µ5

K groups with increasing dose levels are con- H1234 = H14 sidered, where the smallest dose is often the H = H null-dose (placebo) as a control. A one-way 123 13 H12 layout with nk experimental units tested at the k-th dose level, k = 1,...,K, is consid- As usual, the definition H1...k : µ1 =µ2 = ... ered. All x ki observations are assumed to be 2 =µk is used; the monotonicity implies H1...k mutually independent with xki ϳ N(µk, σ ), = H1k. For each K, these families of K − 1 k = 1,..., K,andi = 1,...,nk. In the fol- lowing, a monotone dose-response relation- hypotheses are closed under intersection. Un- der the assumption of a monotone dose-re- ship is assumed: µ1 ≤ ...≤µK. The global sponse relationship all relevant hypotheses hypothesis is H0: µ1 = ...=µK, and the trend which are necessary to identify an (mini- alternative is *H*: µ1 ≤ ...≤µK with at least mally) effective dose are included. In practi- µ1 <µK. Related trend tests are reviewed in Tamhane et al. (2). Most are based on special cal applications, the recommended dose will contrasts (eg, Helmert contrasts or simple be chosen among the effective doses by care- pairwise contrasts between the maximal and fully considering the magnitude of the ob- minimal dose). Phillips (3) described meth- served benefits and side effects. ods for sample size estimation in the random- For this family of hypotheses the closed ized k-sample design including single-step test procedure is identical to the method of dose-response studies. Recently, a review of a-priori ordered hypotheses, where the the performance of several tests of trend for hypotheses are tested step by step, beginning = dose-response studies was given by Phillips with the global hypothesis H0 H1K , fol- (4). Here a detailed discussion of the perfor- lowed by H1,K−1, H1,K−2,..., H13 and ending mance of these trend tests is not repeated, with the ªsmallestº hypothesis H12. All tests α because in the framework of this paper all are performed at local level , and all hypoth- such test statistics can be used providing one eses are rejected as long as significance at α considers their well-known pros and cons. level is reached; if a test for a hypothesis H1k is not significant at level α, H1k is not rejected, and no further test (for H1,k−1, H1,k−2, MULTIPLE TEST PROCEDURES etc.) is necessary because no further rejection is possible. For a more detailed description Three multiple test procedures which can be of this a-priori method see Hothorn and Leh- extended to designs with interim analyses are macher (5), Kieser and Lehmacher (6), or mentioned here. Maurer et al. (7). For local tests of these elementary hypotheses, all (local) trend tests A-priori Ordered Hypotheses can be chosen (eg, linear contrasts or com- parisons of highest vs. lowest doses), as men- The family of the K − 1 elementary hypothe- tioned earlier. In view of the monitonicity, ses which contain the smallest dose group is these tests should be 1±sided. The assump- considered, that is, H1k: µ1 = ...=µk,k= 2, tion of monotonicity can be omitted, because ...,K. Then for K = 3, 4, 5 one obtains the the a-priori method works generally with following families of hypotheses: each predefined order of hypotheses. In this Sequential and Multiple Dose-Response Analysis 593 case, however, pairwise t-tests should be pre- application of the closed test principle was ferred as local test statistics to trend tests. proposed by Marcus et al. (9). Some other multiple testing methods fail to Each hypothesis H of this closed family control the experimentwise error rate if mon- is tested at local level α; a hypothesis H is otonicity is not given; see Bauer (8). rejected if its local test and all local tests of hypotheses H′ inducing H (ie, H′ ʚ H)are significant at level α; if a local test of H is The Closed Test Procedure not significant at level α, H is not rejected, ″ ʚ The K − 1 elementary hypotheses Hk,k+1 stat- and all tests for H induced by H (ie, H ing the identity of two neighboured dose H″) are not necessary because rejections of ″ groups are considered, that is, H12 : µ1 =µ2, H are not possible. H23 : µ2 =µ3,...,HK−1,K : µK−1 =µK. The fam- Alternatively to the original approach of ily of hypotheses is chosen, which is gener- Marcus et al. (9), a partition hypothesis = ʝ ated by all intersections of these elementary H12/34 H12 H34 can be tested very sim- hypotheses. As usual, the abbreviation Hk k ply by combining the two 1-sided p-values 1 2 with the Bonferroni method or with the ...k : µk = ...µk is used. For example, for m 1 m Fisher combination test; or the test statistics K = 3, 4, 5 the closed systems of hypotheses can be combined to (z12 + z34)/√2 or another are given by the families: weighted sum of these statistics. Rom et al. (10) proposed a modified closed test proce- = = = K 3:H0 H123 H13 dure, where the partition hypotheses are

H12 H23 tested with a test for the ªsmallestº partial hypothesis, for example, H is tested by a K = 4:H0 = H1234 = H14 12/34 level-α-test for H12, which is evidently a H123 = H13 H234 = H24 level-α-test for H12/34 as well. As a conse- H := H ʝ H 12/34 12 34 quence, H34 can be tested only after the rejec-

H12 H23 H34 tion of H12, and according to the general closed test principle after the rejection of all K = 5:H0 = H12345 = H15 other inducing hypotheses H , H , and so H = H H = H 0 234 1234 14 2345 25 forth. As tests for the simple pairwise com- = ʝ = ʝ H123/45 : H123 H45 H13 H45 parison hypotheses all (local) trend tests (see

H12/345: = H12 ʝ H345 = H12 ʝ H35 above) can be chosen. The above mentioned H = H H = H H = H H Rom procedure is ªuniformly betterº than 123 12 234 24 345 35 12/34 the a-priori method, because after the rejec- H12/45 H23/45 tion of a simple pairwise comparison hypoth-

H12 H23 H34 H45 esis H1k of the family described in the above a-priori ordered hypotheses section, only ad-

For each K, these families of hypotheses ditional partial hypotheses induced by H1k are closed under intersection by definition. can be tested without loss of power concern- They consist not only of simple pairwise ing these K − 1 tests. comparisons such as H123 = H13, but even of partition hypotheses such as H12/34 : = H12 ʝ H34. Under the assumption of a monotone The Bonferroni-Holm Procedure dose-response relationship, this family con- tains all pairwise comparisons, because un- Here the K − 1 elementary hypotheses of der order restriction, for example, H234 = H24 neighboured comparisons Hk,k+1 : µk =µk+1 are and so forth. Therefore, this family contains tested according to the Bonferroni-Holm all hypotheses necessary to identify effective method; see Budde and Bauer (11). This pro- dose steps as well as the minimally effective cedure has the primary goal of indentifying dose, too. For this family of hypotheses, the effective dose steps. This method, however, 594 Walter Lehmacher, Meinhard Kieser, and Ludwig Hothorn has bad power characteristics if the investi- (Lehmacher, Wassmer [16]), which allows gated doses are too close together. for data-driven sample size reestimation after each of the interim analyses; this approach is essentially based on the combination of SEQUENTIAL TEST PROCEDURES − the Φ 1-transformation of the independent p- values. Such adaptive interim analyses are of Group Sequential Procedures great importance especially in dose-response For many of the global trend tests the classi- analyses, because here in most cases no valid cal group sequential test procedures can be sample size estimation is possible, and there- directly performed with 1-sided tests; see De- fore the idea of an internal pilot study looks Mets and Ware (12). Bauer and Budde (13) attractive. Because the adaptive methods of applied these 1-sided group sequential tests Bauer-KoÈhne and Lehmacher-Wassmer are to multiple comparisons concerning neigh- based only on the combination of indepen- boured dose groups in combination with the dent p-values, they have the additional ad- Bonferroni-Holm and the a-priori ordered vantage that a change of the test is methods. possible, too. Even a change of the hypothe-

ses to be tested is possible, say, H1 in the first sequence and H in the second sequence. Adaptive Interim Analyses 2 In this case, however, a significant result in Bauer and KoÈhne (14) proposed procedures general enables only the rejection of the with one (or two) adaptive interim analyses, global intersection hypothesis H0 = H1 ʝ H2. which, in to group sequential de- Bauer and RoÈhmel (17) applied the Bauer- signs, allow for a new estimation of the final KoÈhne procedure to dose-response analyses, sample size using the results of the interim where a change of dose groups (ie, a change analysis. The approach is based on the com- of hypotheses) is admitted. This approach bination of the independent 1-sided p-values has the primary goal of demonstrating a of the sequences of the study by Fisher's global dose-response relationship, and in combination test. For α=0.05, and one in- their paper it is only mentioned that after the terim analysis and one null hypothesis to be rejection of the global hypothesis H0, a closed tested in confirmatory analysis, this proce- test procedure can follow. dure works as follows: Let p1 denote the 1- For many trials which aim to demonstrate sided p-value of the first sequence of the only a global trend, a fixed sample size de- study. If p1 ≥α0, the trial stops without rejec- sign may be sufficient. But even in these tion of the null hypothesis. If p1 ≤α1, the null simple cases, group sequential trials can re- hypothesis can be rejected and the trial can duce the average sample number (ASN). The stop. If p1 ∈ (α1; α0), no final decision can multiplicity of the additional proof of effec- be made, and the second sequence has to be tive doses makes a sample size calculation planned. For example, α1 = 0.023 for α0 = essentially more complicated, and sequential 0.5. approaches are especially useful. Therefore,

Let p2 denote the 1-sided p-value of the multiple comparisons within group sequen- test statistic generated by the data of the sec- tial approaches are particularly proposed in

.ond sequence. If p1 ؒ p2 ≤ 0.0087 (ie, Fisher's the following section product criterion) then the null hypothesis can be rejected. COMBINATION OF MULTIPLE AND The procedure allows a free, that is, adap- SEQUENTIAL PROCEDURES tive choice of sample sizes after an interim analysis. A similar adaptive approach was All types of interim analyses (group sequen- proposed by Proschan and Hunsberger (15). tial or adaptive) can be combined with multi- Recently, a modification of the classical ple test procedures (a-priori ordered, closed group sequential procedures was proposed test,orBonferroni-Holm) described above. Sequential and Multiple Dose-Response Analysis 595

The general construction rule is: ªEach hy- sequence; even unbalanced sample sizes can pothesis Hi of the family of hypotheses is be considered. tested with a sequential analysis at local level The adaptive interim approaches enable α, or the related p-value is calculated. It is further flexibilities of the study conduct: If locally rejected, if it is rejected in one of the after an interim analysis a certain dose group interim or final analyses of the respective gives no hope of a relevant contribution, in sequential test, taking into account the α- the second sequence this dose group can be adjustments induced by the performance of cancelled, even without acceptance or rejec- interim analyses.º An hypothesis Hi is re- tion of related hypotheses. For testing a local jected if the multiple test procedure allows (intersection) hypothesis in these adaptive for the rejection of Hi (possibly in depen- approaches, only independent p-values have dence from other rejections). to be combined: In the closed test procedure, In the a-priori and closed test procedures, in the second sequence the needed p-value each hypothesis Hi is tested at local level α; of an intersection hypothesis can be ªsubsti- in the Bonferroni-Holm method the p-values tutedº by a p-value of one of its inducing are compared with the K − 1 adjusted levels hypotheses (by an a-priori definition). This α/(K − 1), α/(K − 2), . . . , α/2, α. is again possible because a level-α-test for

Evidently, these procedures control the an inducing hypothesis Hi is also a level-α- experimentwise error rate α. Flow charts of test for an intersection hypothesis Hi ʝ Hj. the a-priori method with K = 3 and of the For example: If with K = 3 after the in- closed test procedure with K = 2 are given for terim analysis the largest dose group has to designs with one interim analysis in Figures 1 be dropped (eg, for safety reasons), in the and 2. For a detailed description of the deci- closed test procedure (see Figure 2) for the sion rules see Kieser, Bauer, and Lehmacher test of the global hypothesis H0,thep-value

(18). for H0 = H123 from the first sequence has to In practical applications, relevant short- be combined with the p-value of H12 from cuts in dependence of early rejections are the second sequence. The same strategy can possible: If an hypothesis is rejected at an be used if the method of a-priori ordered interim analysis, certain dose groups can be hypotheses is chosen. Additionally in the eliminated in the following sequences. In adaptive interim approaches, the test statistic adaptive interim analyses, the sample sizes can be changed, depending upon the type of can be recalculated for the second (or third) contrasts which seem most appropriate.

FIGURE 1. A-priori ordered hypotheses. 596 Walter Lehmacher, Meinhard Kieser, and Ludwig Hothorn

FIGURE 2. The closed test procedure.

A PRACTICAL EXAMPLE pericum extract with a content of only 0.5 % hyperforin could not be rejected in the In a three-armed randomized, double-blind interim analysis. Due to the small difference multicenter study, the clinical efficacy of two in HAM-D between placebo and the extract different extracts of St. John's wort (Hyperi- with the lower hyperforin content the study cum perforatum) was investigated against was stopped with this result. In principle, a placebo (k = 1) in patients suffering from second sequence could have been planned mild or moderate depression according to the for testing H , where the 1-sided p-value criteria of the Diagnostic and Statistical Man- 12 must fall short of 0.0087/0.19 = 0.0458. ual of Mental Disorders, fourth edition (DSM-IV) of the American Psychiatric As- sociation (19). The Hypericum preparations DISCUSSION differed only with respect to their content of hyperforin (k = 2 : 0.5 % hyperforin; k = 3 : In principle, all of the sequential and multiple 5 % hyperforin), and it was the aim of the procedures mentioned can be combined. Due trial to explore the relationship of the antide- to the complexity of the problem, however, pressant efficacy of Hypericum extract from an optimal procedure cannot be derived in the hyperforin content. The study was general. The choice of the test statistic, and planned according to the adaptive two-stage the multiple and sequential procedures must Bauer-KoÈhne design with 1-sided α=0.05, be determined by the well-known pros and

α1 = 0.0299 and α0 = 0.3. It was assumed that cons of these procedures. the treatment effect measured by the change Because valid sample size estimations are in Hamilton Rating Scale for Depression rarely available, the combinations of the pro- (HAM-D) (20) increases with increasing hy- cedures described in this paper are suitable perforin content of the Hypericum extract to join pilot and confirmative studies. For and the hypotheses were a-priori ordered ac- identifying primarily minimally effective cordingly. The prespecified nonparametric doses, a simple but rather effective combina- Jonckheere-test (21) revealed a 1-sided p- tion is the a-priori method (or the slightly value of p13 = 0.017 <α1 for H0 = H13 = H123, more complicated but generally ªbetterº leading to the rejection of H0 and thus demon- Rom method) with the adaptive interim anal- strating the efficacy of the extract with the yses based on simple pairwise contrasts. Fur- higher content of hyperforin. The U-test for ther, the assumptions of homogene-

H12 showed a 1-sided p-value of p12 = 0.19 > ity are not necessary, if only pairwise α1. Therefore, the null hypothesis concerning contrasts with the Welch modification as lo- the comparison between placebo and the Hy- cal test statistics are used. Related nonpara- Sequential and Multiple Dose-Response Analysis 597 metric versions or trend tests for binary data in der chemisch-pharmazeutischen Industrie.) Stutt- are available; therefore, all the procedures gart: Fischer: 1995;6:3±21. 8. Bauer P. A note on multiple testing procedures in proposed can be applied essentially for the dose finding. Biometrics. 1997;53:1125±1128. nonnormal situation, too. 9. Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered . Biometrika. 1976;63:655±660. AcknowledgementÐThe authors thank two anonymous 10. Rom DM, Costello RJ, Connell LT. On closed test referees for critical comments and helpful suggestions. procedures for dose-response analysis. Stat Med. 1994;13:1583±1596. REFERENCES 11. Budde M, Bauer P. Multiple test procedures in clini- cal dose finding studies. J Am Stat Assoc. 1989;84: 1. Note for guidance on dose-response information to 792±796. support drug registration. International Conference 12. DeMets DL, Ware JH. Group sequential methods in for Harmonization. Topic E, CPMP/ICH/378/95. clinical trials with a one-sided hypothesis. Biome- 2. Tamhane AC, Hochberg Y, Dunnett CW. Multiple trika. 1980;67:651±660. test procedures for dose finding. Biometrics. 1996; 13. Bauer P, Budde M. Multiple testing for detecting 52:21±37. efficient dose steps. Biom J. 1994;36:3±15. 3. Phillips A. Sample size estimation when comparing 14. Bauer P, KoÈhne K. Evaluation of with more than two treatment groups. Drug Inf J. 1998; adaptive interim analyses. Biometrics. 1994;50: 32:193±200. 1029±1041. 4. Phillips A. A review of the performance of tests 15. Proschan MA, Hunsberger SA. Designed extension used to establish whether there is a drug effect in of studies based on conditional power. Biometrics. dose-response studies. Drug Inf J. 1998;32:683± 1995;51:1315±1324. 692. 16. Lehmacher W, Wassmer G. Adaptive sample size 5. Hothorn L, Lehmacher W. A simple testing proce- calculation in group sequential trials. Biometrics. dure ªcontrol versus k treatmentsº for one-sided or- 1999;55:1286±1290. dered alternatives, with application in toxicology. 17. Bauer P,RoÈhmel J. An adaptive method for establish- Biom J. 1991;33:179±189. ing a dose response relationship. Stat Med. 1995; 6. Kieser M, Lehmacher W. Multiple testing in clinical 14:1595±1607. trials with interim analyses and a-priori ordered 18. Kieser M, Bauer P, Lehmacher W. Inference on mul- hypotheses. (Multiples Testen bei klinischen PruÈ- tiple endpoints in clinical trials with adaptive interim fungen mit Zwischenauswertungen und a priori analyses. Biom J. 1999;41:261±277. geordneten Hypothesen.) In: Trampisch HJ, Lange 19. Laakmann D, SchuÈle C, Baghai T, Kieser M. St. S. (eds.): Medical ResearchÐMedical Practice. John's Wort in mild to moderate depression: the (Medizinische ForschungÐAÈrztliches Handeln.) Pro- relevance of hyperforin for the clinical efficacy. ceedings, MMV Medizin Verlag MuÈnchen. 1995; Pharmacopsychiatry. 1998;31:54±59. 162±165. 20. Hamilton M. A rating scale for depression. J Neurol 7. Maurer W, Hothorn L, Lehmacher W. Multiple com- Neuro-surgery Psychiatry. 1960;23:56±62. parisons in drug clinical trials and preclinical assays: 21. Jonckheere AR. A distribution-free k sample test a priori ordered hypotheses. In: Vollmar J (ed.). Bio- against ordered alternatives. Biometrika. 1954;41: metrics in the Pharmaceutical Industry. (Biometrie 133±145.