
Math 659: Survival Analysis Chapter 7 — Hypothesis Testing

Wenge Guo

October 17, 2011

Motivation for Hypothesis Testing

[Figure: estimated disease-free survival (0.0 to 1.0) against days post transplant (0 to 2500) for the ALL, AML-low and AML-high groups]

I Bone marrow transplant: survival curves for 3 groups

2/68 General Problems

I One-sample problem: test whether the sample comes from a population with a prespecified hazard rate h0(t)

I Two or more samples: test whether there is no difference in survival among K treatments

I Test for trends: are the hazard functions ordered?

I Stratified tests

I Do hazard functions cross?

I ...

3/68 The Basic Idea of Hypothesis Testing

I We do nonparametric hypothesis testing

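I Focus on hypothesis tests that compare the Nelson-Aalen estimator, obtained from the data, with the cumulative hazard rate expected under the null hypothesis

I Rather than comparing these two quantities directly, we examine tests based on weighted differences between the observed and expected hazard rates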

I The weights will allow us to put more emphasis on certain parts of the curves

I Different weights allow us to present tests that are most sensitive to early or late departures

4/68 One-sample Tests

I The problem: we have a censored sample of size n from some population

I Want to test the hypothesis that the population hazard rate is h0(t) for all t < τ against the alternative that the hazard rate is not h0(t) for some t < τ

I That is:

$$H_0: h(t) = h_0(t) \text{ for all } t < \tau$$

$$H_A: h(t) \neq h_0(t) \text{ for some } t < \tau$$

I Here h0(t) is a completely specified function over the range 0 to τ

I Typically, τ is the largest of the observed study times

5/68 The Test

I Notation

I The Nelson-Aalen estimator is $\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{Y(t_i)}$

I $d_i$ is the number of events at the observed event time $t_i$, where $t_1 < t_2 < \cdots < t_D$

I $Y(t_i)$ is the number of individuals under study just prior to the observed event time $t_i$

I The quantity $d_i / Y(t_i)$ gives a crude estimate of the hazard rate at an event time $t_i$

I When H0 is true, the expected hazard rate at $t_i$ is $h_0(t_i)$

I Compare the sum of weighted differences between the observed and expected hazard rates

I The test statistic is

$$Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} W(t_i)\,\frac{d_i}{Y(t_i)} - \int_0^{\tau} W(s)\,h_0(s)\,ds$$

6/68 The Weights

I Here $W(t)$ is a weight function that needs to be specified

I When H0 is true, the variance of $Z(\tau)$ is

$$V[Z(\tau)] = \int_0^{\tau} W^2(s)\,\frac{h_0(s)}{Y(s)}\,ds$$

I For large samples, $Z(\tau)^2 / V[Z(\tau)] \sim \chi^2_1$ when H0 is true

I Weights:

I The most popular choice is $W(t) = Y(t)$, which yields the one-sample log-rank test

I Harrington and Fleming family: $W_{HF}(t) = Y(t)\,S_0(t)^p\,[1 - S_0(t)]^q$, $p \ge 0$, $q \ge 0$, where $S_0(t)$ is the survival function under H0

I p = q = 0 gives the log-rank test

I One can put more weight on early departures (p much larger than q), on late departures (p much smaller than q), or on departures in the mid-range (p = q > 0)

7/68 Example

I Consider the mortality in a sample of 26 Iowa psychiatric patients (details in Section 1.15)

I 26 psychiatric inpatients admitted to the University of Iowa hospitals during the years 1935-1984

I Want to know whether psychiatric patients tend to have shorter lifetimes than the general Iowa population (mortality rates given in Table 6.2, pages 179-180)

I The data are left-truncated because of delayed entry

I We use the log-rank test, that is, $W(t) = Y(t)$

I $O(\tau)$ is the observed number of events at or prior to time τ

I $E(\tau) = V[Z(\tau)] = \sum_{j=1}^{n} [H_0(T_j) - H_0(L_j)]$, where $L_j$ is the entry time and $T_j$ the exit time of the jth patient

I $\chi^2 = (15 - 4.474)^2 / 4.474 = 24.7645$; the p-value is close to zero, indicating a significant difference
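The arithmetic of this example can be checked with a short R sketch. The values 15 and 4.474 are the observed and expected numbers of deaths quoted above; the variable names are illustrative only.

```r
# One-sample log-rank test for the Iowa psychiatric patients example.
# With W(t) = Y(t), the variance of Z(tau) equals E(tau).
O <- 15      # observed number of deaths, O(tau)
E <- 4.474   # expected number of deaths under H0, E(tau)

chisq <- (O - E)^2 / E
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

print(round(chisq, 3))
print(p_value)
```

The statistic is referred to a chi-squared distribution with 1 degree of freedom, so the upper tail of `pchisq` gives the p-value.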

8/68 Calculation

9/68 Tests for Two or More Samples

I The problem: to compare hazard rates of K (K ≥ 2) populations

I The test is:

H0 : h1(t) = h2(t) = ··· = hK (t), for all t ≤ τ

HA : at least one of the hj (t)’s is different for some t ≤ τ

I Here τ is the largest time at which all the groups have at least one subject at risk

10/68 Notation

I Let K denote the number of populations

I Let $t_1 < t_2 < \cdots < t_D$ be the distinct death times in the pooled sample

I At time $t_i$ we observe $d_{ij}$ events in the jth sample out of $Y_{ij}$ individuals at risk, $j = 1, \cdots, K$; $i = 1, \cdots, D$

I $d_i = \sum_{j=1}^{K} d_{ij}$ is the number of deaths in the pooled sample at time $t_i$

I $Y_i = \sum_{j=1}^{K} Y_{ij}$ is the number at risk in the pooled sample at time $t_i$

11/68 The Test Statistic

I Compare weighted estimates of the hazard rate of the jth population under the null and alternative hypotheses

I If H0 is true, an estimator of the expected hazard rate in the jth population is the pooled sample estimator of the hazard rate, $d_i / Y_i$

I Using only data from the jth sample, the estimator of the hazard rate is $d_{ij} / Y_{ij}$

I The test of H0 is based on the statistics

$$Z_j(\tau) = \sum_{i=1}^{D} W_j(t_i)\left\{\frac{d_{ij}}{Y_{ij}} - \frac{d_i}{Y_i}\right\}, \quad j = 1, 2, \cdots, K$$

I Here Wj (t) is a positive weight function

I If all the Zj (τ)’s are close to zero, then, there is little evidence to believe that the null hypothesis is false

I If one of the Zj (τ)’s is far from zero, then there is evidence that this population has a hazard rate differing from that expected under the null hypothesis

12/68 Weight Function and Matrix

I In practice, all of the commonly used tests have a weight function Wj (ti ) = Yij W (ti )

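I $W(t_i)$ is a common weight shared by each group

I We have

$$Z_j(\tau) = \sum_{i=1}^{D} W(t_i)\left\{d_{ij} - Y_{ij}\,\frac{d_i}{Y_i}\right\}, \quad j = 1, 2, \cdots, K$$

I The variance of $Z_j(\tau)$ is given by

$$\hat\sigma_{jj} = \sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{ij}}{Y_i}\left(1 - \frac{Y_{ij}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j = 1, 2, \cdots, K$$

I The covariance of $Z_j(\tau)$ and $Z_g(\tau)$ is

$$\hat\sigma_{jg} = -\sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{ij}}{Y_i}\,\frac{Y_{ig}}{Y_i}\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j \neq g$$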
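The formulas on this slide can be sketched in base R as follows. This is a minimal illustration, not the implementation used by any package: the function name `weighted_logrank` and the toy counts are hypothetical. The inputs `d` and `Y` are D x K matrices of deaths and numbers at risk, and `W` holds the common weight at each death time.

```r
# Z_j(tau) and the covariance matrix for a common weight W(t_i).
# d, Y: D x K matrices (rows = death times, columns = groups).
weighted_logrank <- function(d, Y, W = rep(1, nrow(d))) {
  K  <- ncol(d)
  di <- rowSums(d)            # pooled deaths d_i
  Yi <- rowSums(Y)            # pooled number at risk Y_i
  Z  <- colSums(W * (d - Y * di / Yi))
  # W^2 * (Y_i - d_i) / (Y_i - 1) * d_i, taken as 0 when Y_i = 1
  fac <- W^2 * ifelse(Yi > 1, (Yi - di) / (Yi - 1), 0) * di
  P <- Y / Yi                 # proportions Y_ij / Y_i
  Sigma <- matrix(0, K, K)
  for (i in seq_len(nrow(d))) {
    p <- P[i, ]
    # diagonal p_j (1 - p_j), off-diagonal -p_j p_g
    Sigma <- Sigma + fac[i] * (diag(p, K) - outer(p, p))
  }
  list(Z = Z, Sigma = Sigma)
}

# Hypothetical toy data: 3 death times, 2 groups
d <- rbind(c(1, 0), c(0, 1), c(1, 1))
Y <- rbind(c(3, 3), c(2, 3), c(2, 2))
res <- weighted_logrank(d, Y)   # log-rank weights W = 1
print(res$Z)
print(res$Sigma)
```

The components of `res$Z` sum to zero and each row of `res$Sigma` sums to zero, which is why only K - 1 of the statistics are needed to build a test.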

13/68 The Asymptotic Distribution of Test Statistic

I Because $\sum_{j=1}^{K} Z_j(\tau) = 0$, the test statistic is constructed by selecting any $K - 1$ of the $Z_j$'s

I The corresponding estimated variance-covariance matrix is denoted by the $(K - 1) \times (K - 1)$ matrix $\Sigma$

I The test statistic is given by the quadratic form

$$\chi^2 = (Z_1(\tau), \cdots, Z_{K-1}(\tau))\,\Sigma^{-1}\,(Z_1(\tau), \cdots, Z_{K-1}(\tau))^t \sim \chi^2_{K-1}$$

I An α level test of H0 rejects when $\chi^2$ is larger than the αth upper percentage point of a chi-squared random variable with $K - 1$ degrees of freedom

14/68 Special case: K = 2

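I The test statistic can be simplified to

$$Z = \frac{\sum_{i=1}^{D} W(t_i)\left\{d_{i1} - Y_{i1}\,\frac{d_i}{Y_i}\right\}}{\sqrt{\sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{i1}}{Y_i}\left(1 - \frac{Y_{i1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i}}$$

I Z has a standard normal distribution for large samples when H0 is true

I An α level test against the alternative $H_A: h_1(t) > h_2(t)$ for some $t \le \tau$ rejects H0 when $Z \ge Z_\alpha$, the αth upper percentage point of the standard normal distribution

I The test against $H_A: h_1(t) \neq h_2(t)$ for some t rejects H0 when $|Z| > Z_{\alpha/2}$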

15/68 Weights

I Log-rank test: $W(t) = 1$

I a test available in most statistical packages

I has optimum power to detect alternatives where the hazard rates in the K populations are proportional to each other

I Gehan: $W(t_i) = Y_i$

I Tarone and Ware: $W(t_i) = f(Y_i)$

I f is a fixed function

I they suggest $f(y) = y^{1/2}$

I gives more weight to differences between the observed and expected number of deaths in sample j at time points where there is the most data

16/68 More on Weights

I Peto-Peto: $W(t_i) = \tilde{S}(t_i)$

I here $\tilde{S}(t) = \prod_{t_i \le t}\left(1 - \frac{d_i}{Y_i + 1}\right)$

I Modified Peto-Peto: $W(t_i) = \tilde{S}(t_i)\,Y_i / (Y_i + 1)$

I Fleming-Harrington: $W_{p,q}(t_i) = \hat{S}(t_{i-1})^p\,[1 - \hat{S}(t_{i-1})]^q$, $p \ge 0$, $q \ge 0$, where $\hat{S}$ is the Kaplan-Meier estimator based on the pooled sample

I the log-rank test is the special case p = q = 0

I the survival function at the previous death time is used as a weight to ensure that these weights are known just prior to the time at which the comparison is to be made

I when q = 0 and p > 0, these weights give the most weight to early departures

I when p = 0 and q > 0, these tests give the most weight to departures which occur late in time

I by an appropriate choice of p and q, one can construct tests which have the most power against alternatives in which the hazard rates differ over a chosen time region

17/68 Example: Time to Infection of Kidney Dialysis Patients

I Compare the effectiveness of two methods for placing catheters in kidney dialysis patients (details in Section 1.4)

I Two treatments: patients whose catheter was placed surgically (group 1) are compared with patients whose catheter was placed percutaneously (group 2)

I Data: time to cutaneous exit-site infection

I Use this example to illustrate the effect of weights

I The log-rank test: $Z_{obs} = 3.964 / \sqrt{6.211} = 1.59$, which has a p-value of $2\Pr[Z > 1.59] = 0.1117$

I Log-rank test suggests no difference between the two procedures in the distribution of the time to exit-site infection
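The two-sided p-value quoted for the log-rank test can be reproduced from the two summary numbers on this slide. This is only an arithmetic check; the variable names are illustrative.

```r
# Two-sample log-rank statistic for the kidney dialysis example.
Z1  <- 3.964   # Z_1(tau): observed minus expected infections in group 1
v11 <- 6.211   # estimated variance of Z_1(tau)

z_obs   <- Z1 / sqrt(v11)
p_value <- 2 * pnorm(-abs(z_obs))   # two-sided p-value

print(round(z_obs, 2))
print(round(p_value, 4))
```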

18/68 KM Curves

19/68 Construction of Two-Sample, Log-Rank Test

20/68 Comparison of Two-Sample Tests

21/68 Relative Weights $W(t_i) / \sum_{i=1}^{D} W(t_i)$

22/68 Computing of the Kidney Infection Example

I The Data

> library(KMsurv)
> data(kidney)
> kidney
  time delta type
   1.5     1    1
   3.5     1    1
   4.5     1    1
  ......
   0.5     1    2
   0.5     1    2
   0.5     1    2
  .....

23/68 Log-rank Test Using survdiff

> library(survival)
> fit=survdiff(Surv(time,delta)~type,data=kidney)
> fit
Call:
survdiff(formula = Surv(time,delta)~type,data = kidney)

        N Observed Expected (O-E)^2/E (O-E)^2/V
type=1 43       15     11.0      1.42      2.53
type=2 76       11     15.0      1.05      2.53

 Chisq= 2.5  on 1 degrees of freedom, p= 0.112

> fit$obs
[1] 15 11
> fit$exp
[1] 11.03645 14.96355
> fit$var
          [,1]      [,2]
[1,]  6.210596 -6.210596
[2,] -6.210596  6.210596

24/68 Other Weights

I Peto-Peto

> survdiff(Surv(time,delta)~type,data=kidney,rho = 1)
Call:
survdiff(formula=Surv(time, delta)~type,data=kidney,rho=1)

        N Observed Expected (O-E)^2/E (O-E)^2/V
type=1 43     12.0     9.48     0.686      1.39
type=2 76     10.4    12.98     0.501      1.39

 Chisq= 1.4  on 1 degrees of freedom, p= 0.239

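I survdiff produces the Fleming-Harrington class of tests with q = 0; its rho argument plays the role of p, so rho = 0 gives the log-rank test and rho = 1 the Peto-Peto test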

I The SAS procedure LIFETEST can be used to perform the log-rank test and Gehan’s test for right-censored data

25/68 Example: Channing House Data

I Section 1.16 reports data on 462 individuals who lived at the Channing House retirement center

I These data are left-truncated by the individual’s entry time into the retirement center

I We want to test whether females tend to live longer than males

I We test the hypothesis

$$H_0: h_F(t) = h_M(t), \quad 777 \le t \le 1152 \text{ months}$$

$$H_A: h_F(t) \le h_M(t) \text{ for all } t \in [777, 1152] \text{ and } h_F(t) < h_M(t) \text{ for some } t \in [777, 1152]$$

26/68 Survival Curves

27/68 Number of Units at Risk: YiM and YiF

28/68 Computing the Statistic

I Need to compute $Y_{iF}$ and $Y_{iM}$ as the number of females and males, respectively, who were in the center at age $t_i$

I The test will be based on the weighted difference between the observed and expected number of male deaths

I Using the log-rank weights, we find $Z_M(1152) = 9.682$ and $\hat{V}(Z_M(1152)) = 28.19$, so $Z_{obs} = 9.682 / \sqrt{28.19} = 1.82$ and the one-sided p-value is 0.0341

I This provides evidence that males are dying at a faster rate than females

29/68 Example: BMT Data

I Test the hypothesis that the disease-free survival functions of these three populations are the same over the range of observation, t < 2204 days, versus the alternative that at least one of the populations has a different survival rate.

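I Using the log-rank weights, we find $Z_1(2204) = 2.148$, $Z_2(2204) = -14.966$ and $Z_3(2204) = 12.818$ (the three statistics sum to zero)

I The covariance matrix is given by the survdiff output on the Computing slide; the test statistic is $\chi^2 = 13.8$ on 2 degrees of freedom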

30/68 Survival Curves

[Figure: estimated disease-free survival (0.0 to 1.0) against days post transplant (0 to 2500) for the ALL, AML-low and AML-high groups]

31/68 Computing

> library(survival)
> library(KMsurv)
> data(bmt)
> bmt.test=survdiff(Surv(t2,d3)~group,data=bmt)
> bmt.test
         N Observed Expected (O-E)^2/E (O-E)^2/V
group=1 38       24     21.9     0.211     0.289
group=2 54       25     40.0     5.604    11.012
group=3 45       34     21.2     7.756    10.529

 Chisq= 13.8  on 2 degrees of freedom, p= 0.00101
> print.default(bmt.test)
$obs
[1] 24 25 34
$exp
[1] 21.85171 39.96612 21.18217
$var
           [,1]       [,2]      [,3]
[1,]  15.955175 -10.345092 -5.610084
[2,] -10.345092  20.339789 -9.994697
[3,]  -5.610084  -9.994697 15.604781
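As a check on how the chi-square value arises, the quadratic form can be recomputed by hand from the `$obs`, `$exp` and `$var` components printed above. This sketch copies those printed numbers rather than refitting the model, so it needs no packages; the object names are illustrative.

```r
# Rebuild the K-sample chi-square statistic from the printed survdiff components.
obs  <- c(24, 25, 34)
expd <- c(21.85171, 39.96612, 21.18217)
V <- matrix(c( 15.955175, -10.345092, -5.610084,
              -10.345092,  20.339789, -9.994697,
               -5.610084,  -9.994697, 15.604781),
            nrow = 3, byrow = TRUE)

Z <- obs - expd               # Z_j(tau) with log-rank weights

# Use any K - 1 = 2 components; here groups 1 and 2
keep  <- 1:2
chisq <- drop(t(Z[keep]) %*% solve(V[keep, keep]) %*% Z[keep])
p_value <- pchisq(chisq, df = length(keep), lower.tail = FALSE)

# Dropping a different group should give the same value up to rounding
alt <- 2:3
chisq_alt <- drop(t(Z[alt]) %*% solve(V[alt, alt]) %*% Z[alt])

print(round(c(chisq, chisq_alt), 2))
print(signif(p_value, 3))
```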

32/68 Results from Other Weights

I Apply other weight functions

Weights                                  χ²        p-value
Gehan, W(t_i) = Y_i                      16.2407   0.0003
Tarone-Ware, W(t_i) = Y_i^{1/2}          15.6529   0.0004
Fleming-Harrington, p = 1, q = 0         15.6725   0.0004
Fleming-Harrington, p = 0, q = 1          6.1097   0.0471
Fleming-Harrington, p = q = 1             9.9331   0.0070

I All of these tests agree with the conclusion that the disease-free survival curves are not the same in these three disease categories
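The p-values for these tests can be recomputed from the chi-square statistics, each referred to a chi-squared distribution with K - 1 = 2 degrees of freedom. The vector names below are illustrative.

```r
# p-values for the three-sample BMT tests under different weights (df = K - 1 = 2).
chisq <- c(gehan       = 16.2407,
           tarone_ware = 15.6529,
           fh_p1_q0    = 15.6725,
           fh_p0_q1    = 6.1097,
           fh_p1_q1    = 9.9331)

p_values <- pchisq(chisq, df = 2, lower.tail = FALSE)
print(round(p_values, 4))
```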

33/68 Remarks

I In an applied setting, the choice of weight function is important

I In most applications, the strategy is to compute the statistics using the log-rank weights $W(t_i) = 1$ and the Gehan weights $W(t_i) = Y_i$. Tests using these weights are available in most statistical packages

I For the two-sample tests, the log-rank weights have optimal local power to detect differences in the hazard rates, when the hazard rates are proportional

I In some applications, one of the other weight functions may be more appropriate, based on the investigator’s desire to emphasize either late or early departures between the hazard rates

34/68 One More Example

I Want to compare the efficacy of autologous (auto) versus allogeneic (allo) transplants for acute myelogenous leukemia patients (details in Section 1.9)

I The interest is in comparing disease-free survival for these two types of transplants. The event of interest is death or relapse

I It is well known that patients given an allogeneic transplant tend to have more complications early in their recovery process

I The most critical of these complications is acute graft-versus-host disease (GVHD) which occurs within the first 100 days after the transplant and is often lethal

35/68 KM Curves for Auto-Allo AML Transplant Data

[Figure: Kaplan-Meier curves for the two transplant types, vertical axis 0.0 to 1.0, horizontal axis 0 to 60]

36/68 The

I Patients given an autologous transplant are not at risk of developing acute graft-versus-host disease

I They tend to have a higher survival rate during this period

I The log-rank test and Gehan’s test have p-values of 0.5368 and 0.7556, respectively

I These statistics have large p-values because the hazard rates of the two types of transplants cross at about 12 months

I The late advantage of allogeneic transplants is negated by the high, early mortality of this type of transplant

37/68 The F-H Test and Results

I The primary interest of most investigators in this area is comparing the treatment failure rates (death or relapse) among long-term survivors

I We use a test with the Fleming and Harrington weight function $W(t_i) = 1 - \hat{S}(t_{i-1})$

I This function downweights events (primarily due to acute GVHD) which occur early

I For these weights, we find that $Z_1(\tau) = 2.093$ and $\hat\sigma_{11}(\tau) = 1.02$

I $\chi^2 = 4.2$ and the p-value is 0.0404

I This suggests that there is a difference in the treatment failure rates for the two types of transplants.

38/68 Test for Trends

I The goal is to develop a test statistic with power to detect ordered alternatives

I That is, we test

$$H_0: h_1(t) = h_2(t) = \cdots = h_K(t), \text{ for } t \le \tau$$

$$H_A: h_1(t) \le h_2(t) \le \cdots \le h_K(t) \text{ for } t \le \tau, \text{ with at least one strict inequality}$$

I Use $Z_j(\tau)$, $j = 1, 2, \cdots, K$

I Any of the weight functions discussed previously can be used here

I We let $\Sigma$ be the full $K \times K$ covariance matrix, $\Sigma = (\hat\sigma_{jg},\ j, g = 1, \cdots, K)$

39/68 The Score

I Need a sequence of scores $a_1 < a_2 < \cdots < a_K$

I Any increasing set of scores can be used

I The test is invariant under linear transformations of the scores

I In most cases, the scores $a_j = j$ are used

I One may take the scores to be some numerical characteristic of the jth population

40/68 The Test Statistic

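I The test statistic is

$$Z = \frac{\sum_{j=1}^{K} a_j Z_j(\tau)}{\sqrt{\sum_{j=1}^{K}\sum_{g=1}^{K} a_j a_g \hat\sigma_{jg}}}$$

I Under the null hypothesis, when the sample sizes are sufficiently large, Z has a standard normal distribution

I If the alternative hypothesis is true, the $Z_j(\tau)$ associated with larger values of $a_j$ should tend to be large. Thus, the null hypothesis is rejected in favor of HA at level α when Z exceeds the αth upper percentage point of the standard normal distribution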

41/68 Example: Larynx Cancer Data

I A study of 90 patients diagnosed with cancer of the larynx in the 1970s at a Dutch hospital (details in Section 1.8)

I The data consist of the times between first treatment and either death or the end of the study

I Patients were classified by the stage of their disease using the American Joint Committee for Cancer Staging

42/68 KM Curves for Different Stages

43/68 The Test and Results

I Want to test the hypothesis that there is no difference in the death rates among the four stages of the disease against the alternative that the higher the stage, the higher the death rate

I Use the scores $a_j = j$

I Using the log-rank weights, the value of the test statistic is 3.72 and the p-value of the test is less than 0.0001

I Also significant under Tarone-Ware, Gehan and Peto-Peto weights

44/68 Computing

> library(survival)
> library(KMsurv)
> data(larynx)
> fit=survdiff(Surv(time,delta)~stage,data=larynx)
> aj=1:4
> z=sum(aj*(fit$obs-fit$exp))/sqrt(t(aj)%*%(fit$var)%*%aj)
> z
3.718959
> 1-pnorm(z)
0.0001

45/68 Stratified Tests

I Example:

I In a study, we want to compare the effectiveness of allogeneic (allo) transplants versus autologous (auto) transplants for lymphoma patients (details in Section 1.10)

I We also know the type of lymphoma: Hodgkin’s disease (HOD) or non-Hodgkin lymphoma (NHL)

I We want to test H0 that there is no difference in the disease-free survival rate between patients given an allo or auto transplant, adjusting for the patient’s disease state

I We will use the stratified test. Let M be the number of levels of a set of covariates; each level is called a stratum

I We test H0 : h1s(t) = h2s(t) = ··· = hKs(t), s = 1, ··· , M, for t ≤ τ

46/68 The Test

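I Using the data from the sth stratum,

$$Z_{js}(\tau) = \sum_{i=1}^{D} W(t_{is})\left\{d_{ijs} - Y_{ijs}\,\frac{d_{is}}{Y_{is}}\right\}, \quad j = 1, 2, \cdots, K$$

I The variances and covariances of the $Z_{js}(\tau)$ are given by

$$\hat\sigma_{jjs} = \sum_{i=1}^{D} W(t_{is})^2\,\frac{Y_{ijs}}{Y_{is}}\left(1 - \frac{Y_{ijs}}{Y_{is}}\right)\left(\frac{Y_{is} - d_{is}}{Y_{is} - 1}\right) d_{is}$$

$$\hat\sigma_{jgs} = -\sum_{i=1}^{D} W(t_{is})^2\,\frac{Y_{ijs}}{Y_{is}}\,\frac{Y_{igs}}{Y_{is}}\left(\frac{Y_{is} - d_{is}}{Y_{is} - 1}\right) d_{is}, \quad j \neq g$$

I Any choice of weight functions can be used for $Z_{js}$

I Let $Z_{j\cdot}(\tau) = \sum_{s=1}^{M} Z_{js}(\tau)$ and $\hat\sigma_{jg\cdot} = \sum_{s=1}^{M} \hat\sigma_{jgs}$

I The test statistic is

$$\chi^2 = (Z_{1\cdot}(\tau), \cdots, Z_{K-1\,\cdot}(\tau))\,\Sigma_{\cdot}^{-1}\,(Z_{1\cdot}(\tau), \cdots, Z_{K-1\,\cdot}(\tau))^t \sim \chi^2_{K-1}$$

I For the two-sample problem, the stratified test statistic is

$$\frac{\sum_{s=1}^{M} Z_{1s}(\tau)}{\sqrt{\sum_{s=1}^{M} \hat\sigma_{11s}}} \sim N(0, 1), \text{ under } H_0$$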
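The pooling step of the stratified test can be sketched in a few lines of R. The helper `stratified_chisq` and the per-stratum inputs are hypothetical; in practice the Z vectors and covariance matrices would come from a within-stratum computation with the chosen weights.

```r
# Stratified test: sum the Z vectors and covariance matrices over strata,
# then form the usual quadratic form on K - 1 components.
stratified_chisq <- function(Z_list, Sigma_list) {
  Z     <- Reduce(`+`, Z_list)       # Z_j. = sum over strata of Z_js
  Sigma <- Reduce(`+`, Sigma_list)   # sigma_jg. = sum over strata of sigma_jgs
  K     <- length(Z)
  keep  <- seq_len(K - 1)
  chisq <- drop(t(Z[keep]) %*% solve(Sigma[keep, keep, drop = FALSE]) %*% Z[keep])
  list(chisq = chisq, df = K - 1,
       p_value = pchisq(chisq, df = K - 1, lower.tail = FALSE))
}

# Hypothetical two-sample, two-stratum inputs
Z_list     <- list(c(1.5, -1.5), c(-0.5, 0.5))
Sigma_list <- list(matrix(c(2, -2, -2, 2), 2), matrix(c(2, -2, -2, 2), 2))

res <- stratified_chisq(Z_list, Sigma_list)
z   <- sum(sapply(Z_list, `[`, 1)) / sqrt(sum(sapply(Sigma_list, `[`, 1, 1)))

print(res$chisq)   # equals z^2 in the two-sample case
print(res$p_value)
```

For K = 2 the chi-square statistic is the square of the normal statistic, so the two-sided normal p-value and the chi-square p-value agree.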

47/68 Example: KM Curves

[Figure: Kaplan-Meier curves, vertical axis 0.0 to 1.0, horizontal axis 0 to 2000]

48/68 Computing and Results

> fit.diff=survdiff(Surv(time,delta)~gtype+strata(dtype),data=hodg)
> fit.diff$obs-fit.diff$exp
          [,1]      [,2]
[1,] -2.343717  3.106206
[2,]  2.343717 -3.106206
> fit.diff$var
          [,1]      [,2]
[1,]  4.836347 -4.836347
[2,] -4.836347  4.836347
> z=(-2.343717+3.106206)/sqrt(4.836347)
> z
[1] 0.3467168
> 2*pnorm(-abs(z))
[1] 0.7288041

49/68