
Journal of and Applications Volume 10, Number 4, 2011, pp. 553-569 ISSN 1538-7887

Comparison of Wald, Score, and Likelihood Ratio Tests for Response Adaptive Designs

Yanqing Yi1∗ and Xikui Wang2

1 Division of Community Health and Humanities, Faculty of Medicine,

Memorial University of Newfoundland, St. Johns, Newfoundland, Canada A1B 3V6

2 Department of , University of Manitoba, Winnipeg, Manitoba, Canada R3T 2N2

Abstract

Data collected from response adaptive designs are dependent, so traditional statistical methods need to be justified before they are used in such designs. This paper generalizes Rao’s score test to response adaptive designs and introduces a generalized score statistic. Simulation is conducted to compare the statistical powers of the Wald, the score, the generalized score and the likelihood ratio statistics. The overall statistical power of the Wald statistic is better than that of the score, the generalized score and the likelihood ratio statistics for small to medium sample sizes. The score statistic does not show good small-sample properties for adaptive designs, and the generalized score statistic is better than the score statistic under the adaptive designs considered. When the sample size becomes large, the statistical power is similar for the Wald, the score, the generalized score and the likelihood ratio test statistics.

MSC: 62L05, 62F03

Keywords and Phrases: Response adaptive design, likelihood ratio test, maximum likelihood estimation, Rao’s score test, statistical power, the Wald test

∗Corresponding author. Fax: 1-709-777-7382.

1. Introduction

The Wald, Rao’s score and likelihood ratio tests are regarded as the Holy Trinity of asymptotic statistics. These tests are first-order equivalent and asymptotically optimal, but they differ in small samples and, under certain conditions, in second-order properties. The likelihood ratio test was introduced by Neyman and Pearson (1928), the Wald test by Wald (1943) and the score test by Rao (1948). Aitchison and Silvey (1958) and Silvey (1959) derived the Lagrange multiplier (LM) test independently of the score test, although the LM and score tests are equivalent. Neyman’s C(α) test (Neyman 1959, 1979) may be regarded as a conditional Rao’s score test (Bera and Bilias, 2001). Bera and Bilias (2001) provided historical perspectives on Rao’s score test, Silvey’s LM test and Neyman’s C(α) test.

Expository studies of the Wald, score and likelihood ratio tests are given in Buse (1982) and Rayner (1997). Engle (1984) provided a review of these tests and Ghosh (1991) reviewed the higher-order statistical power performance of these test statistics. Comparisons of these tests are given in Rao (2005) with respect to their merits and defects, in Molenberghs and Verbeke (2007) in a constrained parameter space, in Sutradhar and Bartlett (1993) by Monte Carlo simulation, in Li (2001) on the sensitivity to nuisance parameters, and in Chandra and Joshi (1983), Chandra and Mukerjee (1985), and Mukerjee (1990a, 1990b) under contiguous alternatives. Furthermore, Rao and Mukerjee (1997) and Taniguchi (1991) compared these tests in a possibly non-iid set-up. Ghosh and Mukerjee (2001) considered the higher-order asymptotics of statistical power for a large class of test statistics, including the Wald, score and likelihood ratio statistics based on quasi-likelihood. However, their assumptions do not apply to data from response adaptive clinical trials.

Response adaptive designs of clinical trials use accruing information to improve the efficacy and ethics of the trials without undermining the validity and integrity of the clinical research. In response adaptive designs, the probability of treatment allocation is sequentially modified depending on the information accumulated so far in the trial. Consequently, treatment allocation is deliberately biased in order to assign more patients to the potentially superior treatment, while a valid statistical comparison of the alternative treatments is still feasible at the conclusion of the study. However, due to the particular dependence structure in data collected from response adaptive trials, the statistical comparison of treatment effectiveness is non-traditional.

The Wald and likelihood ratio tests have been extended to analyze data from response adaptive designs. Hu and Rosenberger (2003) analyzed the statistical power based on the Wald test and found that the power is a decreasing function of the variability of the allocation proportion. Rosenberger et al. (2001) used the Wald statistic to analyze power and proposed an optimal adaptive design (the RSIHR design) which optimally balances the expected number of failures and the statistical power of the test. Ivanova (2003) introduced the drop-the-loser design (the DL design) and used the Wald statistic to compare the statistical power under the DL design with that under other designs. Yi and Wang (2009) also proposed a design (the YW design) and compared the statistical power of the Wald test under different adaptive designs. The likelihood ratio test was applied to the birth and death urn design in Ivanova et al. (2000). Yi and Wang (2007) justified the use of the likelihood ratio test for a general class of response adaptive designs. The Wald and likelihood ratio tests are based on the usual likelihood, and the maximum likelihood estimators are used in these statistics.

The Wald, score and likelihood ratio test statistics have been generalized based on quasi-likelihood functions. Ghosh and Mukerjee (2001) generalized these test statistics to quasi-likelihood settings and considered their higher-order statistical power. Taniguchi (1991) compared the higher-order statistical power of these statistics in a general setting including iid and non-iid data. Heyde (1997) introduced a general framework for optimal parameter estimation using quasi-likelihood functions. The assumptions required for the higher-order statistical power of these test statistics are not satisfied by data collected from response adaptive designs. Regarding estimation for response adaptive designs, Coad and Woodroofe (1998) investigated the bias of the maximum likelihood estimator for sequential clinical trials, Coad and Ivanova (2001) proposed bias-corrected estimators for response adaptive designs, and Yi and Wang (2008) proved that the maximum likelihood estimators are efficient in the Bahadur sense. This paper uses the usual likelihood and the maximum likelihood estimators in the Wald, score and likelihood ratio statistics. The results in this paper can be generalized to quasi-likelihood settings.

With a response adaptive design, the adaptation of the treatment allocation introduces more variation into the data and results in a loss of statistical power. Most of the research on statistical power for response adaptive designs is based on the Wald test statistic. The statistical powers of the Wald, score and likelihood ratio tests have not been compared for response adaptive designs. It is well known that, for iid data with small sample sizes, the statistical power of the Wald test is not satisfactory whereas the score test performs well. However, it is unclear whether the small sample performance of the score test carries over to data collected from response adaptive designs. This paper generalizes the score test to response adaptive designs and compares the statistical power of the Wald, score and likelihood ratio tests for such designs. Considering the variability in the test statistics added by adaptive designs, the sensitivity of these tests to the type of design is also explored.

The paper is organized as follows. Section 2 introduces necessary assumptions and the asymptotic distributions of the Wald, score and likelihood ratio tests under the null hypothesis and under contiguous alternatives. Section 3 presents simulation results to compare statistical powers of these tests under different response adaptive designs. An application of these statistics to real data is included in Section 4. Section 5 concludes the paper.

2. Formulation of the problem

Suppose patients arrive sequentially in the trial and each patient receives one and only one of $k$ treatments, $k \ge 2$. Patients’ responses $Y_{1j}, Y_{2j}, \cdots$ from treatment $j$, $j = 1, \cdots, k$, are independent and identically distributed with density function $f_j(y, \theta_j)$. We assume that $\theta = (\theta_1, \theta_2, \cdots, \theta_k)^T \in \Theta$ is an unknown parameter, where $T$ stands for transpose. Let $\delta_i = (\delta_{i1}, \delta_{i2}, \cdots, \delta_{ik})$ be the treatment assignment of the $i$th patient, such that $\delta_{ij} = 1$ if the $i$th patient receives treatment $j$ and $\delta_{ij} = 0$ otherwise, and let $y_i = (Y_{i1}\delta_{i1}, Y_{i2}\delta_{i2}, \cdots, Y_{ik}\delta_{ik})$ be the corresponding response. Here we use the convention that if treatment $j$ is not applied to patient $i$, then the response from treatment $j$ is 0. When the $i$th patient, $i \ge 2$, is to be treated, the information available is given by the $\sigma$-algebra $\mathcal{F}_{i-1}$ generated by $\{(\delta_1, y_1), \cdots, (\delta_{i-1}, y_{i-1})\}$. After $n$ patients have been treated in the trial, let $N_j(n) = \sum_{i=1}^{n} \delta_{ij}$, $j = 1, 2, \cdots, k$, be the number of patients receiving treatment $j$. For simplicity, denote $N_j(n)$ by $N_j$. For response adaptive designs, assume

(A) As $n \to \infty$, we have $N_j/n \to v_j(\theta) \in (0, 1)$ almost surely for every $\theta \in \Theta$ and $N_j(n) \to \infty$ almost surely, where $v_j(\theta)$ is a continuous function of $\theta$, $j = 1, 2, \cdots, k$, representing the desired allocation proportion to treatment $j$.

This assumption holds for a number of response adaptive designs, including the randomized play-the-winner design, the design proposed by Rosenberger et al. (2001), and the allocation rule in Melfi et al. (2001).

A randomized response adaptive allocation rule $\pi = \{\pi_i, i = 1, 2, \cdots\}$ consists of a sequence of conditional probabilities $\pi_{ij} = P(\delta_{ij} = 1 \mid \mathcal{F}_{i-1})$ with $\sum_{j=1}^{k} \pi_{ij} = 1$, $i \ge 2$; the initial, possibly randomized, treatment allocation probabilities $\pi_{1j} = P(\delta_{1j} = 1)$, $1 \le j \le k$, are pre-fixed (such as $1/k$). A response adaptive design uses the accumulated information $\mathcal{F}_{i-1}$ to adapt the treatment allocation probability $\pi_{ij}$ for the purpose of assigning more patients to the potentially superior treatment.
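As one concrete instance of such an allocation rule, the sketch below simulates the randomized play-the-winner urn for two treatments with binary responses. It assumes the common form of the rule (start with one ball per treatment, add one ball of the same type after a success and one ball of the opposite type after a failure) and instantaneous responses. The function name `simulate_rpw` and its parameters are illustrative and do not come from the paper.

```python
import random

def simulate_rpw(theta_a, theta_b, n, seed=None):
    """Simulate n patients under a two-arm RPW urn (illustrative sketch).

    Returns (assignments, responses): assignments[i] is 'A' or 'B',
    responses[i] is 1 for a success and 0 for a failure.
    """
    rng = random.Random(seed)
    balls = {"A": 1, "B": 1}              # assumed initial urn: one ball each
    theta = {"A": theta_a, "B": theta_b}  # true success probabilities
    assignments, responses = [], []
    for _ in range(n):
        # pi_iA = P(delta_iA = 1 | F_{i-1}) is the current share of A balls
        p_a = balls["A"] / (balls["A"] + balls["B"])
        arm = "A" if rng.random() < p_a else "B"
        success = 1 if rng.random() < theta[arm] else 0
        # success: reinforce the same arm; failure: reinforce the other arm
        other = "B" if arm == "A" else "A"
        balls[arm if success else other] += 1
        assignments.append(arm)
        responses.append(success)
    return assignments, responses

arms, ys = simulate_rpw(0.9, 0.1, 2000, seed=1)
n_a = arms.count("A")
print(n_a / len(arms))  # observed allocation proportion N_A / n
```

The allocation probability at each step depends only on the urn composition, which is itself a function of the past assignments and responses, so the rule is adapted to the information available before each patient is treated.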

For each observed sequence $\{(\delta_1, y_1), \cdots, (\delta_n, y_n)\}$, the likelihood function is

$$L(\theta) = \prod_{i=1}^{n} \prod_{j=1}^{k} [\pi_{ij} f_j(y_{ij}, \theta_j)]^{\delta_{ij}} = b(\pi) \prod_{i=1}^{n} \prod_{j=1}^{k} [f_j(y_{ij}, \theta_j)]^{\delta_{ij}},$$

where $b(\pi) = \prod_{i=1}^{n} \prod_{j=1}^{k} \pi_{ij}^{\delta_{ij}}$ does not depend on the unknown parameter $\theta$. We consider the constraint formulation of the hypothesis

$$H_o : h(\theta) = 0,$$

where $h(\theta) = (h_1(\theta), h_2(\theta), \cdots, h_r(\theta)) : \mathbb{R}^k \to \mathbb{R}^r$ is a vector-valued function of $\theta$ and $h_m(\theta)$ is a function on $\Theta$, $m = 1, 2, \cdots, r$. We assume that the $k \times r$ matrix $H(\theta) = \partial h(\theta)/\partial \theta$ exists and is a continuous function of $\theta$, and that the rank of $H(\theta)$ is $r$, $r < k$. The constraint formulation includes a wide class of hypotheses, including the test of $k$ equally effective treatments and the test of a partial collection of equally effective treatments. To ensure almost sure existence of a strongly consistent root of the likelihood equation, certain regularity conditions are required. As in Serfling (1980), we assume that $\Theta$ is an open set and

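(B1) For each $\theta \in \Theta$, the derivatives $\dfrac{\partial \log f_j(y, \theta_j)}{\partial \theta_j}$, $\dfrac{\partial^2 \log f_j(y, \theta_j)}{\partial \theta_j^2}$, $\dfrac{\partial^3 \log f_j(y, \theta_j)}{\partial \theta_j^3}$, $j = 1, \cdots, k$, exist for all $y$.

(B2) For $\theta_0 \in \Theta$, there are functions $H_1(y)$, $H_2(y)$ and $H_3(y)$ such that for all $\theta$ in a small, open neighborhood $N(\theta_0)$ of $\theta_0$ we have

$$\left| \frac{\partial \log f_j(y, \theta_j)}{\partial \theta_j} \right| < H_1(y), \quad \left| \frac{\partial^2 \log f_j(y, \theta_j)}{\partial \theta_j^2} \right| < H_2(y), \quad \left| \frac{\partial^3 \log f_j(y, \theta_j)}{\partial \theta_j^3} \right| < H_3(y)$$

for all $y$ and $j = 1, \cdots, k$, where $H_1(y)$ and $H_2(y)$ are finitely integrable over $(-\infty, +\infty)$, and $E_\theta(H_3(Y)) < \infty$ for all $\theta \in N(\theta_0)$.

(B3) For any $\theta \in N(\theta_0)$, the Fisher information numbers $I_j(\theta_j) = -E\left[ \dfrac{\partial^2 \log f_j(Y, \theta_j)}{\partial \theta_j^2} \right]$, $j = 1, 2, \cdots, k$, are finite and non-zero.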

Condition (B1) ensures that the Taylor expansion in $\theta$ of the score function exists for each $y$. Condition (B2) justifies the interchange of integration in $y$ and differentiation with respect to $\theta$. Condition (B3) guarantees that the random variables $\partial \log f_j(Y, \theta_j)/\partial \theta_j$, $j = 1, \cdots, k$, have finite, positive variances. Under these regularity assumptions, there exists a strongly consistent root of the likelihood equation $S(\theta) = 0$, where $S(\theta)$ is the score function defined as

$$S(\theta) = \left( \frac{\partial \ln L(\theta)}{\partial \theta_1}, \frac{\partial \ln L(\theta)}{\partial \theta_2}, \cdots, \frac{\partial \ln L(\theta)}{\partial \theta_k} \right)^T.$$
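For intuition, consider binary responses, which is the case simulated in Section 3. Taking $f_j(y, \theta_j) = \theta_j^{y} (1 - \theta_j)^{1 - y}$ and writing $s_j = \sum_{i} \delta_{ij} Y_{ij}$ for the number of successes on treatment $j$ (a symbol introduced here for illustration only, not used in the paper), the log-likelihood, score components and Fisher information work out as follows:

```latex
\ln L(\theta) = \ln b(\pi) + \sum_{j=1}^{k} \left[ s_j \ln \theta_j + (N_j - s_j) \ln (1 - \theta_j) \right],
\qquad
\frac{\partial \ln L(\theta)}{\partial \theta_j}
  = \frac{s_j}{\theta_j} - \frac{N_j - s_j}{1 - \theta_j}
  = \frac{s_j - N_j \theta_j}{\theta_j (1 - \theta_j)},
\qquad
I_j(\theta_j) = \frac{1}{\theta_j (1 - \theta_j)}.
```

Setting each score component to zero gives the unrestricted MLE $\hat\theta_j = s_j / N_j$, the observed success proportion on treatment $j$ (defined when $N_j > 0$).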

The consistency and asymptotic normality of the MLE $\hat\theta_n$ for response adaptive designs have been studied by Rosenberger et al. (1997), Melfi and Page (2000), Melfi et al. (2001), Hu et al. (2006) and Yi and Wang (2008). Using Taylor expansion, the law of large numbers, central limit theorems, and the methods in Yi and Wang (2007) and in Aitchison and Silvey (1958), we can also show the existence of the restricted MLE $\tilde\theta_n$.

Lemma 1 Under regularity assumptions (A) and (B1−B3), for each given design, the restricted MLE $\tilde\theta_n$, subject to $h(\tilde\theta_n) = 0$, exists with probability 1 under the distribution specified by the true parameter $\theta_0$. Moreover,

(1) $\tilde\theta_n$ is a strongly consistent estimator of $\theta_0$, and

(2) $\sqrt{n}(\tilde\theta_n - \theta_0)$ is asymptotically normally distributed with mean 0 and variance-covariance matrix

$$[\Gamma(\theta_0)]^{-1} \left\{ I_k - H(\theta_0) \left[ (H(\theta_0))^t (\Gamma(\theta_0))^{-1} H(\theta_0) \right]^{-1} (H(\theta_0))^t (\Gamma(\theta_0))^{-1} \right\},$$

where $I_k$ is the $k \times k$ identity matrix and $\Gamma(\theta) = \mathrm{diag}(v_1(\theta) I_1(\theta_1), v_2(\theta) I_2(\theta_2), \cdots, v_k(\theta) I_k(\theta_k))$ is a diagonal matrix.

The Wald test statistic is given by

$$W = n [h(\hat\theta_n)]^T \left\{ [H(\hat\theta_n)]^T [\Gamma(\hat\theta_n)]^{-1} [H(\hat\theta_n)] \right\}^{-1} [h(\hat\theta_n)],$$

where $\Gamma(\hat\theta_n) = \mathrm{diag}(v_1(\hat\theta_n) I_1(\hat\theta_n), v_2(\hat\theta_n) I_2(\hat\theta_n), \cdots, v_k(\hat\theta_n) I_k(\hat\theta_n))$. Note that the unrestricted MLE $\hat\theta_n$ has an asymptotic normal distribution, so after applying the multivariate delta method, the statistic $h(\hat\theta_n)$ asymptotically follows a normal distribution. The likelihood ratio test statistic is expressed in the general form

$$LR = 2\{\log L(\hat\theta_n) - \log L(\tilde\theta_n)\}.$$

The score test statistic is defined as

$$RS = \left[ \frac{1}{\sqrt{n}} S(\tilde\theta_n) \right]^T [\Gamma(\tilde\theta_n)]^{-1} \left[ \frac{1}{\sqrt{n}} S(\tilde\theta_n) \right],$$

where $\Gamma(\tilde\theta_n) = \mathrm{diag}(v_1(\tilde\theta_n) I_1(\tilde\theta_n), v_2(\tilde\theta_n) I_2(\tilde\theta_n), \cdots, v_k(\tilde\theta_n) I_k(\tilde\theta_n))$, $I_j(\theta) = E\left[ (\partial \ln f_j(Y, \theta_j)/\partial \theta_j)^2 \right]$, and $v_j(\theta)$ is the limiting proportion of patients assigned to treatment $j$, $j = 1, 2, \cdots, k$. By the Courant Theorem (Sen and Singer, 1993), $v_j(\tilde\theta_n)$ may be replaced by any consistent estimator of $v_j(\theta)$, including $N_j/n$. Therefore we define a generalized score test statistic as

$$RS^* = \left[ \frac{1}{\sqrt{n}} S(\tilde\theta_n) \right]^T [G(\tilde\theta_n)]^{-1} \left[ \frac{1}{\sqrt{n}} S(\tilde\theta_n) \right],$$

where $G(\tilde\theta_n) = \mathrm{diag}\left( \frac{N_1}{n} I_1(\tilde\theta_n), \frac{N_2}{n} I_2(\tilde\theta_n), \cdots, \frac{N_k}{n} I_k(\tilde\theta_n) \right)$. For the Holy Trinity of the Wald, score and likelihood ratio tests, we have

Theorem 2 Assume regularity assumptions (A) and (B1−B3). Under the null hypothesis $H_o : h(\theta) = 0$, each of the above statistics $W$, $LR$, $RS$, and $RS^*$ follows asymptotically a $\chi^2$ distribution with $r$ degrees of freedom.

Let us now incorporate local Pitman-type (contiguous) alternatives (Sen and Singer, 1993) and consider testing the hypotheses

$$H_o : \theta = \theta_0 \quad \text{versus} \quad H_a : \theta = \theta_0 + n^{-1/2} \Delta,$$

where $\Delta$ is a fixed vector in

3. Simulation comparisons

We compare the statistical power of the Wald, score, generalized score and likelihood ratio tests for a total number of 30, 50 or 100 patients under different response adaptive designs: the randomized play-the-winner design (RPW for short) by Wei and Durham (1978), the design proposed by Rosenberger et al. (2001) (RSIHR for short), the drop-the-loser design (DL) by Ivanova (2003) and the design introduced by Yi and Wang (2009) (YW for short). We consider binary responses

with θA and θB representing the success probabilities of treatments A and B respectively. Without loss of generality, we assume θA > θB.

We use the RPW design with an initial urn of one ball for each of treatments A and B. The DL design initially uses three balls for treatment A, three balls for treatment B, and one immigration ball. The RSIHR design balances the expected number of failures and the power of the test. The target proportion of patients receiving treatment A in the RSIHR design is

$$\rho = \frac{\sqrt{\theta_A}}{\sqrt{\theta_A} + \sqrt{\theta_B}}.$$

The target proportion for treatment A under the YW design is

$$\rho = \frac{1 - \theta_B + \epsilon \min\{1 - \theta_A, 1 - \theta_B\}\, \mathrm{sign}(\theta_A - \theta_B)}{2 - \theta_A - \theta_B},$$

where $0 \le \epsilon \le 1$ measures the tradeoff between individual and collective ethics. The results in Yi and Wang (2009) show that the YW design with a higher value of $\epsilon$ assigns a higher proportion of patients to the better treatment but results in lower statistical power. We set $\epsilon$ at 1/4 in this paper, following Yi and Wang (2009). In our simulations, the first two patients are assigned by complete randomization, with one on each treatment. The remaining patients are assigned using the respective adaptive designs.
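The two target proportions can be evaluated directly. The sketch below is a minimal illustration and not the authors' code: the function names are hypothetical, the tradeoff parameter is called `eps`, and the YW formula is read with `eps` multiplying the min term. For the AZT rates used in Section 4 it gives roughly 0.53 for RSIHR and 0.81 for YW, which is close to the E(NA/n) entries reported for those designs in Table 2.

```python
import math

def rsihr_target(theta_a, theta_b):
    """Target proportion on A under RSIHR: sqrt(tA) / (sqrt(tA) + sqrt(tB)).

    Undefined when both success probabilities are 0.
    """
    root_a, root_b = math.sqrt(theta_a), math.sqrt(theta_b)
    return root_a / (root_a + root_b)

def yw_target(theta_a, theta_b, eps=0.25):
    """Target proportion on A under the YW design, tradeoff eps in [0, 1].

    Undefined when both success probabilities are 1.
    """
    sign = (theta_a > theta_b) - (theta_a < theta_b)  # sign(tA - tB)
    numerator = 1 - theta_b + eps * min(1 - theta_a, 1 - theta_b) * sign
    return numerator / (2 - theta_a - theta_b)

for t_a, t_b in [(0.7, 0.3), (0.9, 0.5), (0.916, 0.748)]:
    print(t_a, t_b, round(rsihr_target(t_a, t_b), 4), round(yw_target(t_a, t_b), 4))
```

With `eps = 0` the YW target reduces to (1 − θB)/(2 − θA − θB), and both targets equal 1/2 when the two treatments are equally effective.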

Table 1 reports the simulated statistical powers of the Wald, score, generalized score and likelihood ratio tests under the RPW, RSIHR, DL and YW designs for a total number of 30, 50 and 100 patients respectively. Each run consists of 10000 replications, and each statistical power in the table is the average over 100 runs.
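To make the four statistics concrete for the two-arm binary case with H0: θA = θB, the sketch below specializes the formulas of Section 2 using h(θ) = θA − θB, H = (1, −1)^T and I_j(θ_j) = 1/(θ_j(1 − θ_j)). It is an illustration under stated assumptions rather than the authors' implementation: the function name and arguments are hypothetical, and the limiting proportions used in the Wald and score statistics default to the observed N_A/n unless design-specific values are passed as `v_hat` and `v_tilde`.

```python
import math

def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def _loglik(s, n, p):
    """Bernoulli log-likelihood kernel: s*log(p) + (n - s)*log(1 - p)."""
    return _xlogy(s, p) + _xlogy(n - s, 1 - p)

def trinity_two_arm(s_a, n_a, s_b, n_b, v_hat=None, v_tilde=None):
    """Wald, score, generalized score and LR statistics for H0: theta_A = theta_B.

    s_a, s_b: successes; n_a, n_b: patients per arm (both must be > 0).
    v_hat, v_tilde: assumed values of v_A at the unrestricted and restricted
    MLEs; both default to N_A / n. Degenerate data (all successes or all
    failures) raise ZeroDivisionError.
    """
    n = n_a + n_b
    p_a, p_b = s_a / n_a, s_b / n_b   # unrestricted MLEs
    p0 = (s_a + s_b) / n              # restricted MLE under H0 (pooled rate)
    v_hat = n_a / n if v_hat is None else v_hat
    v_tilde = n_a / n if v_tilde is None else v_tilde

    # Wald: n * h^2 / (H' Gamma^{-1} H)
    wald_den = p_a * (1 - p_a) / v_hat + p_b * (1 - p_b) / (1 - v_hat)
    wald = n * (p_a - p_b) ** 2 / wald_den

    # Score components are (s_j - N_j * p0) / (p0 * (1 - p0))
    q0 = p0 * (1 - p0)
    u_a, u_b = s_a - n_a * p0, s_b - n_b * p0
    score = (u_a ** 2 / v_tilde + u_b ** 2 / (1 - v_tilde)) / (n * q0)
    gscore = (u_a ** 2 / n_a + u_b ** 2 / n_b) / q0  # v_j replaced by N_j / n

    lr = 2 * (_loglik(s_a, n_a, p_a) + _loglik(s_b, n_b, p_b)
              - _loglik(s_a + s_b, n, p0))
    return {"Wald": wald, "Score": score, "GScore": gscore, "LR": lr}

stats = trinity_two_arm(15, 20, 10, 20)
print(stats)
```

Each statistic would be compared with the 0.95 quantile of the χ² distribution with one degree of freedom (about 3.841) for a test at the 0.05 level. In this form the generalized score statistic coincides with the familiar Pearson chi-square statistic for a 2 × 2 table.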

Table 1 shows that the statistical power is in general higher or similar with the Wald test

than with the other tests for different values of θA and θB, when the total number of patients is no larger than 50. Under each of the RPW, DL and YW designs, the statistical power remains

similar with the four tests when both θA and θB are no larger than 0.5 for total numbers of patients of 30 and 50. Under the RSIHR design, the Wald, the generalized score and the likelihood

ratio tests perform similarly when θA = 0.5, but the score test has a smaller statistical power

than the other tests when the total number of patients is no more than 50 and θA − θB = 0.4. When the total number of patients is as small as 30, the Wald and likelihood ratio tests have similar statistical powers under each of the designs when θA and θB are no larger than 0.7. When

θA = 0.9 and the difference θA − θB is no larger than 0.4, the Wald test has a higher statistical power under each of the designs than the other tests. As the difference θA − θB increases to as large as 0.6, all tests perform similarly under the DL and RSIHR designs. When θA = 0.9 and

θA − θB = 0.6, the score test has a power lower than those of the other tests under the RPW and YW designs. If θA is no less than 0.7, the Wald test is far better than the other tests under the YW design, particularly when θA = 0.9. For a sample of size 50, the Wald and likelihood ratio tests perform similarly under each of the RPW, DL and RSIHR designs. The Wald test is better than the other tests under the YW design when the difference θA − θB is no larger than 0.4 for either θA = 0.7 or θA = 0.9. When the difference θA − θB increases to as large as 0.6, all tests have similar statistical powers. When the total number of patients is as large as 100, all tests perform similarly in terms of statistical power under each of the RPW, DL, RSIHR and YW designs, except when θA = 0.9 and θB = 0.7, in which case the Wald statistic has a higher statistical power.

Table 1 also shows that for each given test, there are differences in statistical power among the response adaptive designs. For example, the RSIHR design has the highest overall statistical power among the designs when the total number of patients is no more than 50. Except for the case of θA = 0.5 and θB = 0.1, the statistical powers are similar under the DL and RSIHR designs for each of the tests, and these statistical powers are larger than those under the YW design. The RSIHR and DL designs have statistical powers higher than the RPW and

YW designs when θA = 0.7 and θA = 0.9. Compared with the DL design, the RSIHR design has a similar or higher statistical power. When the total number of patients is as small as 30 and

θA = 0.9 with a difference θA − θB no larger than 0.4, the statistical power under the RSIHR design is larger than that under the DL design for the Wald, generalized score and likelihood ratio tests. When θA = 0.7, the RSIHR design has a statistical power similar to that of the DL design when the difference θA − θB is as small as 0.2 or as large as 0.6, and the RSIHR design is better than the DL design when the difference is 0.4. When the total number of patients is as large as 50, the statistical power under the RSIHR design is no worse than that under the DL design. When the total number of patients is as large as 100, the RSIHR design is similar to the

DL design. Although the YW design has the merit of assigning a high proportion of patients to the potentially better treatment and maintains good statistical power when the total number of patients is no less than 200 (Yi and Wang 2009), it has a statistical power lower than the RPW, DL and RSIHR designs when the total number of patients is no more than 100.

4. Application

In the zidovudine (AZT) trial (Connor et al., 1994), the success rates of reducing HIV transmission for the AZT group and the placebo group were observed to be θA = 0.916 and θB = 0.748, respectively, for a total number of patients n = 477. Yao and Wei (1996) and Ivanova (2003) redesigned the trial using the RPW and DL designs. Yi and Wang (2009) compared the YW design with other response adaptive designs, namely the DL, RPW and RSIHR designs, in terms of statistical power and the proportion of patients allocated to the AZT treatment for a total of 477 patients. The statistical power in Yi and Wang (2009) was based on the Wald test statistic.

Assuming instantaneous responses, we simulate the statistical powers of the Wald, score, generalized score and likelihood ratio tests to detect the observed difference in the success rate of reducing HIV transmission in the AZT trial under the RPW, DL, RSIHR and YW designs. Total numbers of patients of 200, 250 and 300 are considered. The type I error rate is kept at approximately 0.05 for all the tests under these designs. In the simulation, each run consists of 10000 replications, and each statistical power in Table 2 is the average over 100 runs.

Table 2 shows that the Wald, score, generalized score and likelihood ratio tests have similar statistical powers under each of the RPW, DL, RSIHR and YW designs to detect the observed difference in the AZT example for total numbers of patients of 200, 250 and 300. When the total number of patients increases to 300, the statistical powers of these tests are about 0.95 or higher under the RPW, DL and RSIHR designs. The statistical powers under the YW design are close to 0.95.

Table 2 also indicates that the statistical power is sensitive to the type of response adaptive design. Under each of the Wald, score, generalized score and likelihood ratio tests, the statistical powers under the DL and RSIHR designs are better than those under the RPW and YW designs. However, as the total number of patients increases to 300, the difference in statistical power among these designs becomes small. The proportion of patients allocated to the AZT treatment, E(NA/n), and its standard deviation (Std) are also given in Table 2.
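A Monte Carlo power estimate of this kind can be sketched as follows for the RPW design and the Wald test at the AZT success rates. This is a simplified illustration, not the simulation code behind Table 2: it uses far fewer replications, places the first two patients on A and B in a fixed order, plugs N_j/n into the Wald statistic in place of v_j, and all function names are hypothetical.

```python
import random

CHI2_1DF_95 = 3.841  # 0.95 quantile of the chi-square distribution, 1 df

def rpw_trial(theta_a, theta_b, n, rng):
    """Simulate one RPW trial; returns (s_a, n_a, s_b, n_b)."""
    balls = {"A": 1, "B": 1}
    theta = {"A": theta_a, "B": theta_b}
    succ = {"A": 0, "B": 0}
    count = {"A": 0, "B": 0}
    for i in range(n):
        if i < 2:
            arm = "AB"[i]  # simplification: first two patients, one per arm
        else:
            p_a = balls["A"] / (balls["A"] + balls["B"])
            arm = "A" if rng.random() < p_a else "B"
        ok = rng.random() < theta[arm]
        other = "B" if arm == "A" else "A"
        balls[arm if ok else other] += 1  # RPW urn update
        succ[arm] += ok
        count[arm] += 1
    return succ["A"], count["A"], succ["B"], count["B"]

def wald_stat(s_a, n_a, s_b, n_b):
    """Wald statistic for H0: theta_A = theta_B with v_j estimated by N_j/n."""
    p_a, p_b = s_a / n_a, s_b / n_b
    var = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
    if var == 0:
        return 0.0 if p_a == p_b else float("inf")
    return (p_a - p_b) ** 2 / var

def estimate_power(theta_a, theta_b, n, reps, seed=0):
    """Share of simulated trials in which the Wald test rejects at level 0.05."""
    rng = random.Random(seed)
    rejections = sum(
        wald_stat(*rpw_trial(theta_a, theta_b, n, rng)) > CHI2_1DF_95
        for _ in range(reps)
    )
    return rejections / reps

power = estimate_power(0.916, 0.748, 200, reps=2000, seed=0)
print(power)
```

Calling `estimate_power` with equal success probabilities gives a rough check of the type I error rate under the same simplifications.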

5. Conclusion

This paper justifies the use of the score test and introduces a generalized score test for response adaptive designs. We compare the statistical power of the Wald, score, generalized score and likelihood ratio tests by means of simulation. The simulation indicates that the overall performance of the Wald statistic is better than that of the score, the generalized score and the likelihood ratio statistics. For small sample sizes, the score test does not demonstrate good performance for response adaptive designs. The simulation results also indicate that the statistical power varies among the RPW, DL, RSIHR and YW designs of response adaptive clinical trials.

Under each of the RPW, DL, RSIHR and YW designs, the Wald test overall has better statistical power for small to moderate sample sizes. When the total sample size is no larger than 50, the Wald test has a higher statistical power than the other tests when θA = 0.9 and the difference θA − θB is no larger than 0.4, under each of the designs considered. For other values of θA and θB, the Wald test has a statistical power no worse than the other tests. When the total sample size is as large as 100 under each of the designs considered, all tests behave similarly in terms of statistical power except when θA = 0.9 and θB = 0.7, in which case the Wald statistic is the best.

The simulation results also show that the statistical power is affected by the type of response adaptive design employed. Both the DL and RSIHR designs have statistical powers higher than the RPW and YW designs except when θA = 0.5 and θB = 0.1. The RSIHR design has a statistical power that is overall better than that of the DL design when the sample size is no larger than 50. However, the statistical power becomes similar under these designs as the total sample size becomes large.

Acknowledgement

We acknowledge the reviewer’s comments, which helped improve the presentation and quality of the paper. Yanqing Yi acknowledges the IRIF Start-up Fund from the Government of Newfoundland and Labrador, through the Department of Innovation, Trade and Rural Development. Both authors acknowledge research support from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Conflict of interests statement

The authors have declared no conflict of interest.

Appendix

Table 1: Simulation of the statistical power for the Wald, score, generalized score (GScore for short) and likelihood ratio tests under the RPW, DL, RSIHR and YW designs.

[Table 1 body: for each of the RPW, DL, RSIHR and YW designs, columns Wald, Score, GScore and LR; rows for n = 30, 50 and 100 and for each pair (θA, θB) with θA in {0.5, 0.7, 0.9}. The cell values are not recoverable from the rotated page layout.]

Table 2: Simulation of the statistical power for the Wald, score, generalized score (GScore for short) and likelihood ratio tests for the AZT trial with total number of patients of 200, 250 and 300 under the RPW, DL, RSIHR and YW designs.

n     Design   E(NA/n) (Std)    Wald    Score   GScore  LR
200   RPW      66.68 (13.98)    83.44   83.17   86.29   85.39
200   DL       65.67 (5.18)     89.07   90.11   89.79   89.65
200   RSIHR    52.49 (3.74)     90.22   90.46   90.47   90.58
200   YW       80.61 (5.52)     77.10   80.38   82.39   80.61
250   RPW      67.24 (13.23)    90.78   90.01   92.54   91.90
250   DL       66.88 (4.82)     94.83   95.28   95.28   95.11
250   RSIHR    52.50 (3.35)     95.40   95.51   95.50   95.57
250   YW       80.80 (4.80)     87.61   89.05   90.20   89.18
300   RPW      67.69 (12.62)    94.95   94.03   96.02   95.56
300   DL       67.81 (4.52)     97.46   97.69   97.68   97.61
300   RSIHR    52.50 (3.05)     97.94   98.02   98.01   98.03
300   YW       80.89 (4.34)     93.45   94.29   94.97   94.30

References

[1] Aitchison, J. and Silvey, S. D. (1958). Maximum-likelihood estimation of parameters subject to restraints. Annals of Mathematical Statistics, 29, 813-828.

[2] Bera, A. K. and Bilias, Y. (2001). Rao’s score, Neyman’s C(α) and Silvey’s LM tests: an essay on historical developments and some new results. Journal of Statistical Planning and Inference, 97, 9-44.

[3] Buse, A. (1982). The likelihood ratio, Wald, and Lagrange Multiplier tests: an expository note. American Statistician, 36, 153-157.

[4] Chandra, T. K. and Joshi, S. N. (1983). Comparison of the likelihood ratio, Wald’s and Rao’s tests. Sankhy¯a, Series A, 45, 226-246.

[5] Chandra, T. K. and Mukerjee, R. (1985). Comparison of the Likelihood Ratio, Wald’s and Rao’s Tests. Sankhy¯a, Series A, 47, 271-284.

[6] Coad, D. S. and Ivanova, A. (2001). Bias calculations for adaptive urn designs. Sequential Analysis, 20, 91-116.

[7] Coad, D. S. and Woodroofe, M. (1998). Approximate bias calculations for sequentially designed experiments. Sequential Analysis, 17, 1-31.

[8] Connor, E. M., Sperling, R. S., Gelber, R., Kiselev, P., Scott, G., O’Sullivan, M. J., VanDyke, R., Bey, M., Shearer, W., Jacobson, R. L., Jiminez, E., O’Neil, E., Bazin, B., Delfraissy, J., Culnane, M., Coombs, R., Elkins, M., Moye, J., Stratton, P. and Balsley, J. for the Pediatric AIDS Clinical Trials Group Protocol 076 Study Group (1994). Reduction of maternal-infant transmission of human immunodeficiency virus type 1 with zidovudine treatment. New England Journal of Medicine, 331, 1173-1180.

[9] Engle, R. F. (1984). Wald, likelihood ratio and Lagrange Multiplier tests in econometrics. In Handbook of Econometrics, Vol. 2, Chapter 13, 775-826, (Eds., Z. Griliches and M. Intriligator), North-Holland Science Publishers, Amsterdam.

[10] Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman and Hall, New York.

[11] Ghosh, J. K. (1991). Higher order asymptotics for the likelihood ratio, Rao’s and Wald’s tests. Statistics & Probability Letters, 12(6), 505-509.

[12] Ghosh, J. K. and Mukerjee, R. (2001). Test statistics arising from quasi likelihood: Bartlett adjustment and higher-order power. Journal of Statistical Planning and Inference, 97, 45-55.

[13] Heyde, C. C. (1997). Quasi-likelihood and its application: a general approach to optimal parameter estimation. Springer, New York.

[14] Hu, F. and Rosenberger, W. F. (2003). Optimality, variability, power: evaluating response-adaptive randomization procedures for treatment comparisons. Journal of the American Statistical Association, 98, 671-678.

[15] Hu, F., Rosenberger, W. F. and Zhang, L. X. (2006). Asymptotically best response-adaptive randomization procedures. Journal of Statistical Planning and Inference, 136, 1911-1922.

[16] Ivanova, A. (2003). A play-the-winner-type urn design with reduced variability. Metrika, 58, 1-13.

[17] Ivanova, A., Rosenberger, W. F., Durham, S. D. and Flournoy, N. (2000). A birth and death urn for randomized clinical trials: asymptotic methods. Sankhyā, Series B, 62, 104-118.

[18] Li, B. (2001). Sensitivity of Rao’s score test, the Wald test and the likelihood ratio test to nuisance parameters. Journal of Statistical Planning and Inference, 97, 57-66.

[19] Melfi, V. F. and Page, C. (2000). Estimation after adaptive allocation. Journal of Statistical Planning and Inference, 87, 353-363.

[20] Melfi, V. F., Page, C. and Geraldes, M. (2001). An adaptive randomized design with application to estimation. Canadian Journal of Statistics, 29, 107-116.

[21] Molenberghs, G. and Verbeke, G. (2007). Likelihood ratio, score, and Wald tests in a constrained parameter space. American Statistician, 61, 22-27.

[22] Mukerjee, R. (1990a). Comparison of tests in the multiparameter case I. Second-order power. Journal of Multivariate Analysis, 33, 17-30.

[23] Mukerjee, R. (1990b). Comparison of tests in the multiparameter case II. A third-order optimality property of Rao’s test. Journal of Multivariate Analysis, 33, 31-48.

[24] Neyman, J. (1959). Optimal asymptotic test of composite statistical hypothesis. In Probability and Statistics (Ed., U. Grenander), John Wiley and Sons, New York.

[25] Neyman, J. (1979). C(α) tests and their uses. Sankhyā, Series A, 41, 1-21.

[26] Neyman, J. and Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20, 175-240, 263-294.

[27] Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters with application to problems of estimation. Proceedings of the Cambridge Philosophical Society, 44, 50-57.

[28] Rao, C. R. (2005). Score test: historical review and recent developments. In Advances in Ranking and Selection, Multiple Comparisons, and Reliability, (Eds., N. Balakrishnan, N. Kannan, H. N. Nagaraja), Chapter 1, pp. 3-20, Birkhäuser, Boston.

[29] Rao, C. R. and Mukerjee, R. (1997). Comparison of LR, score, and Wald tests in a non-iid setting. Journal of Multivariate Analysis, 60, 99-110.

[30] Rayner, J. C. W. (1997). The asymptotically optimal tests. Statistician, 46, 337-346.

[31] Rosenberger, W. F., Flournoy, N. and Durham, S. D. (1997). Asymptotic normality of maximum likelihood estimators from multiparameter response-driven designs. Journal of Statistical Planning and Inference, 60, 69-76.

[32] Rosenberger, W. F., Stallard, N., Ivanova, A., Harper, C. N. and Ricks, M. L. (2001). Optimal adaptive designs for binary response trials. Biometrics, 57, 909-913.

[33] Sen, P. K. and Singer, J. M. (1993). Large Sample Methods in Statistics - An Introduction with Applications. Chapman and Hall, New York.

[34] Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley and Sons, New York.

[35] Silvey, S. D. (1959). The Lagrangian multiplier test. Annals of Mathematical Statistics, 30, 389-407.

[36] Sutradhar, B. C. and Bartlett, R. F. (1993). Monte Carlo comparison of Wald’s, likelihood ratio and Rao’s tests. Journal of Statistical Computation and Simulation, 46, 23-33.

[37] Taniguchi, M. (1991). Third-order asymptotic properties of a class of test statistics under a local alternative. Journal of Multivariate Analysis, 37, 223-238.

[38] Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426-482.

[39] Yao, Q. and Wei, L. J. (1996). Play the winner for phase II/III clinical trials. Statistics in Medicine, 15, 2415-2423.

[40] Yi, Y. and Wang, X. (2007). Goodness-of-fit test for response adaptive clinical trials. Statistics and Probability Letters, 77, 1014-1020.

[41] Yi, Y. and Wang, X. (2008). Asymptotically Efficient Estimation in Response Adaptive Trials. Journal of Statistical Planning and Inference, 138, 2899-2905.

[42] Yi, Y. and Wang, X. (2009). Response adaptive designs with a variance-penalized criterion. Biometrical Journal, 51, 763-773.