Math 659: Survival Analysis Chapter 7 — Hypothesis Testing
Wenge Guo
October 17, 2011
Motivation for Hypothesis Testing
[Figure: estimated disease-free survival versus days post transplant (0 to 2500) for the ALL, AML-low, and AML-high groups]
- Bone Marrow Transplant Data: survival curves for 3 groups
General Problems
- One-sample problem: test whether the sample comes from a population with a prespecified hazard rate $h_0(t)$
- Two or more samples: test whether there is no difference in survival between $K$ treatments
- Test for trends: are the hazard functions ordered?
- Stratified tests
- Do hazard functions cross?
- ...
The Basic Idea of Hypothesis Testing
- We do nonparametric hypothesis testing
- Focus on hypothesis tests that compare the Nelson-Aalen estimator of the cumulative hazard with the cumulative hazard expected under the null hypothesis
- Rather than comparing these two quantities directly, we examine tests based on weighted differences between the observed and expected hazard rates
- The weights allow us to put more emphasis on certain parts of the curves
- Different weights yield tests that are most sensitive to early or to late departures from the null hypothesis
One-sample Tests
- The problem: we have a censored sample of size $n$ from some population
- We want to test the hypothesis that the population hazard rate is $h_0(t)$ for all $t < \tau$ against the alternative that the hazard rate is not $h_0(t)$ for some $t < \tau$
- That is:
$$H_0: h(t) = h_0(t) \text{ for all } t < \tau$$
$$H_A: h(t) \ne h_0(t) \text{ for some } t < \tau$$
- Here $h_0(t)$ is a completely specified function over the range 0 to $\tau$
- Typically, $\tau$ is the largest of the observed study times
The Test Statistic
- Notation
  - The Nelson-Aalen estimator is $\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{Y(t_i)}$
  - $d_i$ is the number of events at the observed event time $t_i$, where $t_1 < t_2 < \cdots < t_D$
  - $Y(t_i)$ is the number of individuals under study just prior to the observed event time $t_i$
- The quantity $d_i / Y(t_i)$ gives a crude estimate of the hazard rate at an event time $t_i$
- When $H_0$ is true, the expected hazard rate at $t_i$ is $h_0(t_i)$
- Compare the sum of weighted differences between the observed and expected hazard rates
- The test statistic is
$$Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} W(t_i) \frac{d_i}{Y(t_i)} - \int_0^{\tau} W(s) h_0(s)\, ds$$
The Weights
- Here $W(t)$ is the weight function, which needs to be specified
- When $H_0$ is true, the variance of $Z(\tau)$ is
$$V[Z(\tau)] = \int_0^{\tau} W^2(s) \frac{h_0(s)}{Y(s)}\, ds$$
- For large samples, $Z(\tau)^2 / V[Z(\tau)] \sim \chi^2_1$ when $H_0$ is true
- Weights:
  - The most popular choice is $W(t) = Y(t)$, which yields the one-sample log-rank test
  - Harrington and Fleming family, where $S_0(t) = \exp[-H_0(t)]$ is the survival function under $H_0$:
$$W_{HF}(t) = Y(t)\, S_0(t)^p\, [1 - S_0(t)]^q, \quad p \ge 0,\ q \ge 0$$
  - $p = q = 0$ gives the log-rank test
  - One can put more weight on early departures ($p$ much larger than $q$), on late departures ($p$ much smaller than $q$), or on departures in the mid-range ($p = q > 0$)
Example
- Consider the mortality in a sample of 26 Iowa psychiatric patients (details in Section 1.15)
- The 26 psychiatric inpatients were admitted to the University of Iowa hospitals during the years 1935-1948
- We want to know whether the psychiatric patients tend to have shorter lifetimes than the general Iowa population (mortality rates given in Table 6.2 on pages 179-180)
- The data are left-truncated because of delayed entry
- We use the log-rank test, that is, $W(t) = Y(t)$
- $O(\tau)$ is the observed number of events at or prior to time $\tau$
- $E(\tau) = V[Z(\tau)] = \sum_{j=1}^{n} [H_0(T_j) - H_0(L_j)]$, where $L_j$ is the entry time and $T_j$ the exit time (death or censoring) of patient $j$
- $\chi^2 = (15 - 4.474)^2 / 4.474 = 24.7645$; the p-value is close to zero, indicating a significant difference
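The arithmetic of this example can be checked in a few lines of R. This is only a sketch that starts from the two summary numbers quoted on the slide (15 observed deaths and 4.474 expected deaths); the variable names are ours, and the raw patient data are not used.

```r
# One-sample log-rank test with W(t) = Y(t): here E(tau) = V[Z(tau)],
# so the chi-square statistic reduces to (O - E)^2 / E.
O <- 15      # observed number of deaths (from the slide)
E <- 4.474   # expected number of deaths under H0 (from the slide)

chisq <- (O - E)^2 / E
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

print(round(chisq, 2))   # 24.76
print(p_value < 1e-5)    # TRUE
```

Recomputing from the rounded value 4.474 can differ from the slide in the last printed digit.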
Calculation
Tests for Two or More Samples
- The problem: to compare the hazard rates of $K$ ($K \ge 2$) populations
- The test is:
$$H_0: h_1(t) = h_2(t) = \cdots = h_K(t) \text{ for all } t \le \tau$$
$$H_A: \text{at least one of the } h_j(t)\text{'s is different for some } t \le \tau$$
- Here $\tau$ is the largest time at which all the groups have at least one subject at risk
Notation
- $K$ denotes the number of populations
- $t_1 < t_2 < \cdots < t_D$ are the distinct death times in the pooled sample
- At time $t_i$ we observe $d_{ij}$ events in the $j$th sample out of $Y_{ij}$ individuals at risk, $j = 1, \ldots, K$; $i = 1, \ldots, D$
- $d_i = \sum_{j=1}^{K} d_{ij}$ is the number of deaths in the pooled sample at time $t_i$
- $Y_i = \sum_{j=1}^{K} Y_{ij}$ is the number at risk in the pooled sample at time $t_i$
The Test Statistic
- Compare weighted estimates of the hazard rate of the $j$th population under the null and alternative hypotheses
- If $H_0$ is true, an estimator of the expected hazard rate in the $j$th population is the pooled-sample estimator of the hazard rate, $d_i / Y_i$
- Using only data from the $j$th sample, the estimator of the hazard rate is $d_{ij} / Y_{ij}$
- The test of $H_0$ is based on the statistics
$$Z_j(\tau) = \sum_{i=1}^{D} W_j(t_i) \left\{ \frac{d_{ij}}{Y_{ij}} - \frac{d_i}{Y_i} \right\}, \quad j = 1, 2, \ldots, K$$
- Here $W_j(t)$ is a positive weight function
- If all the $Z_j(\tau)$'s are close to zero, there is little evidence that the null hypothesis is false
- If one of the $Z_j(\tau)$'s is far from zero, there is evidence that this population has a hazard rate differing from that expected under the null hypothesis
Weight Function and Covariance Matrix
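- In practice, all of the commonly used tests have a weight function $W_j(t_i) = Y_{ij} W(t_i)$
- $W(t_i)$ is a common weight shared by each group
- We have
$$Z_j(\tau) = \sum_{i=1}^{D} W(t_i) \left\{ d_{ij} - Y_{ij} \frac{d_i}{Y_i} \right\}, \quad j = 1, 2, \ldots, K$$
- The variance of $Z_j(\tau)$ is given by
$$\hat{\sigma}_{jj} = \sum_{i=1}^{D} W(t_i)^2 \frac{Y_{ij}}{Y_i} \left(1 - \frac{Y_{ij}}{Y_i}\right) \left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j = 1, 2, \ldots, K$$
- The covariance of $Z_j(\tau)$ and $Z_g(\tau)$ is
$$\hat{\sigma}_{jg} = -\sum_{i=1}^{D} W(t_i)^2 \frac{Y_{ij}}{Y_i} \frac{Y_{ig}}{Y_i} \left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j \ne g$$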
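The formulas for $Z_j(\tau)$, $\hat{\sigma}_{jj}$ and $\hat{\sigma}_{jg}$ translate almost directly into matrix operations. The following is a minimal sketch in base R, not the code used in the course: the function name `weighted_logrank` and the tiny data set are hypothetical, and the sketch assumes $Y_i > 1$ at every event time so that the factor $(Y_i - d_i)/(Y_i - 1)$ is defined.

```r
# d, Y: D x K matrices of events d_ij and numbers at risk Y_ij
# W:    common weight W(t_i) at each of the D event times (default: log-rank)
weighted_logrank <- function(d, Y, W = rep(1, nrow(d))) {
  d_tot <- rowSums(d)          # d_i
  Y_tot <- rowSums(Y)          # Y_i
  frac  <- Y / Y_tot           # Y_ij / Y_i (row i divided by Y_i)

  # Z_j = sum_i W(t_i) * (d_ij - Y_ij * d_i / Y_i)
  Z <- colSums(W * (d - frac * d_tot))

  # common factor W(t_i)^2 * d_i * (Y_i - d_i) / (Y_i - 1)
  fac <- W^2 * d_tot * (Y_tot - d_tot) / (Y_tot - 1)

  # off-diagonal: -sum_i fac_i * frac_ij * frac_ig
  Sigma <- -t(frac) %*% (fac * frac)
  # diagonal: sum_i fac_i * frac_ij * (1 - frac_ij)
  diag(Sigma) <- colSums(fac * frac * (1 - frac))

  list(Z = Z, Sigma = Sigma)
}

# Hypothetical toy data: D = 3 event times, K = 2 groups
d <- matrix(c(1, 0, 1,
              0, 1, 1), ncol = 2)
Y <- matrix(c(5, 4, 3,
              5, 5, 3), ncol = 2)

res <- weighted_logrank(d, Y)
print(res$Z)       # the two components sum to zero
print(res$Sigma)   # each row sums to zero
```

Both zero-sum properties follow from the formulas themselves, which is why only $K - 1$ of the $Z_j(\tau)$'s carry information.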
The Asymptotic Distribution of the Test Statistic
- Because $\sum_{j=1}^{K} Z_j(\tau) = 0$, the $Z_j(\tau)$'s are linearly dependent, so the test statistic is constructed from any $K - 1$ of them
- The corresponding estimated variance-covariance matrix is the $(K-1) \times (K-1)$ matrix $\Sigma$
- The test statistic is given by the quadratic form
$$\chi^2 = (Z_1(\tau), \ldots, Z_{K-1}(\tau))\, \Sigma^{-1}\, (Z_1(\tau), \ldots, Z_{K-1}(\tau))^t \sim \chi^2_{K-1}$$
- An $\alpha$-level test of $H_0$ rejects when $\chi^2$ is larger than the upper $\alpha$th percentage point of a chi-squared random variable with $K - 1$ degrees of freedom
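Special case: K = 2
- The test statistic simplifies to
$$Z = \frac{\sum_{i=1}^{D} W(t_i) \left\{ d_{i1} - Y_{i1} \frac{d_i}{Y_i} \right\}}{\sqrt{\sum_{i=1}^{D} W(t_i)^2 \frac{Y_{i1}}{Y_i} \left(1 - \frac{Y_{i1}}{Y_i}\right) \left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i}}$$
- $Z$ has a standard normal distribution for large samples when $H_0$ is true
- An $\alpha$-level test against the alternative $H_A: h_1(t) > h_2(t)$ for some $t \le \tau$ rejects $H_0$ when $Z \ge Z_\alpha$, where $Z_\alpha$ is the upper $\alpha$th percentile of the standard normal distribution
- The test against $H_A: h_1(t) \ne h_2(t)$ for some $t$ rejects $H_0$ when $|Z| > Z_{\alpha/2}$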
Weights
- Log-rank test: $W(t) = 1$
  - a test available in most statistical packages
  - has optimum power to detect alternatives where the hazard rates in the $K$ populations are proportional to each other
- Gehan: $W(t_i) = Y_i$
- Tarone and Ware: $W(t_i) = f(Y_i)$
  - $f$ is a fixed function
  - they suggest $f(y) = y^{1/2}$
  - gives more weight to differences between the observed and expected number of deaths in sample $j$ at time points where there is the most data
More on Weights
- Peto-Peto: $W(t_i) = \tilde{S}(t_i)$
  - here $\tilde{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{Y_i + 1}\right)$
- Modified Peto-Peto: $W(t_i) = \tilde{S}(t_i) Y_i / (Y_i + 1)$
- Fleming-Harrington: $W_{p,q}(t_i) = \hat{S}(t_{i-1})^p\, [1 - \hat{S}(t_{i-1})]^q$, $p \ge 0$, $q \ge 0$, where $\hat{S}$ is the Kaplan-Meier estimator from the pooled sample
  - the log-rank test is the special case $p = q = 0$
  - the survival function at the previous death time is used as a weight to ensure that these weights are known just prior to the time at which the comparison is to be made
  - when $q = 0$ and $p > 0$, these weights give the most weight to early departures
  - when $p = 0$ and $q > 0$, these tests give the most weight to departures that occur late in time
  - by an appropriate choice of $p$ and $q$, one can construct tests that have the most power against alternatives in which the hazard rates differ over a chosen region
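To see how these weight functions distribute emphasis over time, one can tabulate them for a small pooled sample. The sketch below uses base R and a hypothetical set of three event times (the vectors `d` and `Y` are made up for illustration); it also rescales each weight to sum to one, which makes the different schemes directly comparable.

```r
# Hypothetical pooled data at D = 3 ordered event times
d <- c(1, 1, 2)    # deaths d_i
Y <- c(10, 9, 6)   # numbers at risk Y_i

S_km    <- cumprod(1 - d / Y)         # pooled Kaplan-Meier estimate at t_i
S_prev  <- c(1, head(S_km, -1))       # Kaplan-Meier estimate at t_{i-1}, with value 1 before t_1
S_tilde <- cumprod(1 - d / (Y + 1))   # Peto-Peto survival estimate at t_i

weights <- data.frame(
  logrank     = rep(1, length(d)),
  gehan       = Y,
  tarone_ware = sqrt(Y),
  peto_peto   = S_tilde,
  mod_peto    = S_tilde * Y / (Y + 1),
  fh_1_0      = S_prev,               # Fleming-Harrington p = 1, q = 0
  fh_0_1      = 1 - S_prev            # Fleming-Harrington p = 0, q = 1
)

# Relative weights W(t_i) / sum_i W(t_i): each column sums to 1
rel <- sapply(weights, function(w) w / sum(w))
print(round(rel, 3))
```

In this toy example the Gehan and $p = 1, q = 0$ columns decrease over time (emphasis on early differences), while the $p = 0, q = 1$ column starts at zero and increases (emphasis on late differences).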
Example: Time to Infection of Kidney Dialysis Patients
- Compare the effectiveness of two methods for placing catheters in kidney dialysis patients (details in Section 1.4)
- Two treatments: the catheter was placed surgically (group 1) or percutaneously (group 2)
- Data: time to cutaneous exit-site infection
- Use this example to illustrate the effect of weights
- The log-rank test: $Z_{obs} = 3.964 / \sqrt{6.211} = 1.59$, which has a p-value of $2\Pr[Z > 1.59] = 0.1117$
- The log-rank test suggests no difference between the two procedures in the distribution of the time to exit-site infection
KM Curves
Construction of Two-Sample, Log-Rank Test
Comparison of Two-Sample Tests
Relative Weights $W(t_i) / \sum_{i=1}^{D} W(t_i)$
Computing of the Kidney Infection Example
- The Data
> library(KMsurv)
> data(kidney)
> kidney
  time delta type
   1.5     1    1
   3.5     1    1
   4.5     1    1
  ......
   0.5     1    2
   0.5     1    2
   0.5     1    2
  .....
Log-rank Test Using survdiff
> library(survival)
> fit=survdiff(Surv(time,delta)~type,data=kidney)
> fit
Call:
survdiff(formula = Surv(time,delta)~type,data = kidney)

        N Observed Expected (O-E)^2/E (O-E)^2/V
type=1 43       15     11.0      1.42      2.53
type=2 76       11     15.0      1.05      2.53

 Chisq= 2.5  on 1 degrees of freedom, p= 0.112
> fit$obs
[1] 15 11
> fit$exp
[1] 11.03645 14.96355
> fit$var
          [,1]      [,2]
[1,]  6.210596 -6.210596
[2,] -6.210596  6.210596
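The two-sample $Z$ statistic can be recovered from the three components printed above. The sketch below retypes those printed values instead of calling `survdiff`, so it runs without the `kidney` data; with the fitted object available, the same quantities would be `fit$obs`, `fit$exp` and `fit$var`.

```r
# Values copied from the survdiff output on the slide
obs  <- c(15, 11)                 # fit$obs
expd <- c(11.03645, 14.96355)     # fit$exp
v11  <- 6.210596                  # fit$var[1, 1]

z <- (obs[1] - expd[1]) / sqrt(v11)   # Z = Z_1(tau) / sqrt(sigma_11)
chisq <- z^2                          # equals the (O-E)^2/V column
p_two_sided <- 2 * pnorm(-abs(z))

print(round(z, 2))            # 1.59
print(round(chisq, 2))        # 2.53
print(round(p_two_sided, 3))  # 0.112
```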
Other Weights
- Peto-Peto
> survdiff(Surv(time,delta)~type,data=kidney,rho = 1)
Call:
survdiff(formula=Surv(time, delta)~type,data=kidney,rho=1)

        N Observed Expected (O-E)^2/E (O-E)^2/V
type=1 43     12.0     9.48     0.686      1.39
type=2 76     10.4    12.98     0.501      1.39

 Chisq= 1.4  on 1 degrees of freedom, p= 0.239
- survdiff produces the Fleming-Harrington class of tests with $q = 0$; its argument rho plays the role of $p$ (rho = 0 gives the log-rank test, rho = 1 the Peto-Peto test)
- The SAS procedure LIFETEST can be used to perform the log-rank test and Gehan's test for right-censored data
Example: Channing House Data
- Section 1.16 reports data on 462 individuals who lived at the Channing House retirement center
- These data are left-truncated by the individual's entry time into the retirement center
- We want to test the hypothesis that females tend to live longer than males
- We test
$$H_0: h_F(t) = h_M(t), \quad 777 \le t \le 1152 \text{ months}$$
$$H_A: h_F(t) \le h_M(t) \text{ for all } t \in [777, 1152] \text{ and } h_F(t) < h_M(t) \text{ for some } t \in [777, 1152]$$
Survival Curves
Number of Units at Risk: $Y_{iM}$ and $Y_{iF}$
Computing the Statistic
- Need to compute $Y_{iF}$ and $Y_{iM}$ as the number of females and males, respectively, who were in the center at age $t_i$
- The test is based on the weighted difference between the observed and expected number of male deaths
- Using the log-rank weights, we find $Z_M(1152) = 9.682$ and $\hat{V}(Z_M(1152)) = 28.19$, so $Z_{obs} = 1.82$ and the one-sided p-value is 0.0341
- This provides evidence that males are dying at a faster rate than females
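Because the alternative here is one-sided, only the upper tail of the normal distribution is used. A short check in R, starting from the two summary values quoted on the slide (the variable names are ours):

```r
z_m <- 9.682    # Z_M(1152), from the slide
v_m <- 28.19    # estimated variance of Z_M(1152), from the slide

z_obs <- z_m / sqrt(v_m)
p_one_sided <- pnorm(z_obs, lower.tail = FALSE)   # Pr[Z > z_obs]

print(round(z_obs, 2))         # 1.82
print(round(p_one_sided, 4))   # 0.0341
```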
Example: BMT Data
- Test the hypothesis that the disease-free survival functions of these three populations are the same over the range of observation, $t < 2204$ days, versus the alternative that at least one of the populations has a different survival rate
- Using the log-rank weights, we find $Z_1(2204) = 2.148$, $Z_2(2204) = -14.966$ and $Z_3(2204) = 12.818$
- The covariance matrix and the test statistic are
$$\hat{\Sigma} = \begin{pmatrix} 15.955 & -10.345 & -5.610 \\ -10.345 & 20.340 & -9.995 \\ -5.610 & -9.995 & 15.605 \end{pmatrix}, \quad \chi^2 = 13.8 \text{ on 2 degrees of freedom}, \quad p = 0.001$$
Survival Curves
[Figure: estimated disease-free survival versus days post transplant (0 to 2500) for the ALL, AML-low, and AML-high groups]
Computing
> bmt.test=survdiff(Surv(t2,d3)~group,data=bmt)
> bmt.test
         N Observed Expected (O-E)^2/E (O-E)^2/V
group=1 38       24     21.9     0.211     0.289
group=2 54       25     40.0     5.604    11.012
group=3 45       34     21.2     7.756    10.529

 Chisq= 13.8  on 2 degrees of freedom, p= 0.00101
> print.default(bmt.test)
$obs
[1] 24 25 34
$exp
[1] 21.85171 39.96612 21.18217
$var
           [,1]       [,2]      [,3]
[1,]  15.955175 -10.345092 -5.610084
[2,] -10.345092  20.339789 -9.994697
[3,]  -5.610084  -9.994697 15.604781
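The chi-square value reported by `survdiff` is the quadratic form in any $K - 1 = 2$ of the $Z_j$'s. The sketch below retypes the printed `$obs`, `$exp` and `$var` values (so it runs without the `bmt` data) and evaluates the quadratic form twice, once dropping group 3 and once dropping group 1; the helper `quad_form` is a hypothetical name introduced only for this illustration.

```r
# Values copied from the print.default(bmt.test) output
obs  <- c(24, 25, 34)
expd <- c(21.85171, 39.96612, 21.18217)
V <- matrix(c( 15.955175, -10.345092, -5.610084,
              -10.345092,  20.339789, -9.994697,
               -5.610084,  -9.994697, 15.604781),
            nrow = 3, byrow = TRUE)

Z <- obs - expd   # log-rank Z_j(tau); the components sum to zero

# Quadratic form Z' Sigma^{-1} Z using only the groups listed in 'keep'
quad_form <- function(keep) {
  drop(t(Z[keep]) %*% solve(V[keep, keep]) %*% Z[keep])
}

chisq     <- quad_form(1:2)   # drop group 3
chisq_alt <- quad_form(2:3)   # drop group 1 instead
p_value   <- pchisq(chisq, df = 2, lower.tail = FALSE)

print(round(chisq, 1))     # 13.8
print(round(p_value, 5))   # 0.00101
```

Up to rounding in the retyped inputs, `chisq` and `chisq_alt` agree, which illustrates that the choice of the omitted group does not matter.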
Results from Other Weights
- Apply other weight functions

Test                 Weights                 Chi-square   p-value
Gehan                W(t_i) = Y_i            16.2407      0.0003
Tarone-Ware          W(t_i) = Y_i^{1/2}      15.6529      0.0004
Fleming-Harrington   p = 1, q = 0            15.6725      0.0004
Fleming-Harrington   p = 0, q = 1             6.1097      0.0471
Fleming-Harrington   p = q = 1                9.9331      0.0070

- All of these tests agree with the conclusion that the disease-free survival curves are not the same in these three disease categories
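With three groups, each statistic is referred to a chi-squared distribution with $K - 1 = 2$ degrees of freedom, for which the upper tail probability is simply $\exp(-\chi^2 / 2)$. The following sketch recomputes the p-values from the chi-square values listed on this slide; the vector names are ours.

```r
# Chi-square statistics from the slide (3 groups, so df = 2)
chisq <- c(gehan       = 16.2407,
           tarone_ware = 15.6529,
           fh_p1_q0    = 15.6725,
           fh_p0_q1    = 6.1097,
           fh_p1_q1    = 9.9331)

p <- pchisq(chisq, df = 2, lower.tail = FALSE)
print(round(p, 4))

# For df = 2 the tail probability has the closed form exp(-x / 2)
print(all.equal(unname(p), unname(exp(-chisq / 2))))   # TRUE
```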
Remarks
- In applied settings, the choice of the weight function is important
- In most applications, the strategy is to compute the statistics using the log-rank weights $W(t_i) = 1$ and the Gehan weights $W(t_i) = Y_i$. Tests using these weights are available in most statistical packages
- For the two-sample tests, the log-rank weights have optimal local power to detect differences in the hazard rates when the hazard rates are proportional
- In some applications, one of the other weight functions may be more appropriate, based on the investigator's desire to emphasize either late or early departures between the hazard rates
One More Example
- We want to compare the efficacy of autologous (auto) versus allogeneic (allo) transplants for acute myelogenous leukemia patients (details in Section 1.9)
- The interest is in comparing disease-free survival for these two types of transplants. The event of interest is death or relapse
- It is well known that patients given an allogeneic transplant tend to have more complications early in their recovery process
- The most critical of these complications is acute graft-versus-host disease (GVHD), which occurs within the first 100 days after the transplant and is often lethal
KM Curves for Auto-Allo AML Transplant Data
[Figure: Kaplan-Meier curves for the auto and allo groups; vertical axis 0.0 to 1.0, horizontal axis 0 to 60]
The Logrank Test
- Patients given an autologous transplant are not at risk of developing acute graft-versus-host disease
- They tend to have a higher survival rate during this early period
- The log-rank test and Gehan's test have p-values of 0.5368 and 0.7556, respectively
- These statistics have large p-values because the hazard rates of the two types of transplants cross at about 12 months
- The late advantage of allogeneic transplants is negated by the high early mortality of this type of transplant
The F-H Test and Results
- The primary interest of most investigators in this area is comparing the treatment failure rate (death or relapse) among long-term survivors
- We use a test with the Fleming and Harrington weight function $W(t_i) = 1 - \hat{S}(t_{i-1})$, that is, $p = 0$ and $q = 1$
- This function downweights events (primarily due to acute GVHD) which occur early
- For these weights, we find that $Z_1(\tau) = 2.093$ and $\hat{\sigma}_{11}(\tau) = 1.02$
- $\chi^2 = 4.2$ and the p-value is 0.0404
- This suggests that there is a difference in the treatment failure rates for the two types of transplants
Test for Trends
- The goal is to develop a test statistic with power to detect ordered alternatives
- That is, we test
$$H_0: h_1(t) = h_2(t) = \cdots = h_K(t) \text{ for } t \le \tau$$
$$H_A: h_1(t) \le h_2(t) \le \cdots \le h_K(t) \text{ for } t \le \tau, \text{ with at least one strict inequality}$$
- Use $Z_j(\tau)$, $j = 1, 2, \ldots, K$
- Any of the weight functions discussed previously can be used here
- We let $\Sigma$ be the full $K \times K$ covariance matrix, $(\hat{\sigma}_{jg};\ j, g = 1, \ldots, K)$
The Score
- Need a sequence of scores $a_1 < a_2 < \cdots < a_K$
- Any increasing set of scores can be used
- The test is invariant under linear transformations of the scores
- In most cases, the scores $a_j = j$ are used
- One may take the scores to be some numerical characteristic of the $j$th population
The Test Statistic
- The test statistic is
$$Z = \frac{\sum_{j=1}^{K} a_j Z_j(\tau)}{\sqrt{\sum_{j=1}^{K} \sum_{g=1}^{K} a_j a_g \hat{\sigma}_{jg}}}$$
- Under the null hypothesis, when the sample sizes are sufficiently large, $Z$ has a standard normal distribution
- If the alternative hypothesis is true, the $Z_j(\tau)$ associated with larger values of $a_j$ should tend to be large. Thus, the null hypothesis is rejected at level $\alpha$ when $Z$ exceeds the upper $\alpha$th percentile of the standard normal distribution
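The trend statistic is a ratio of a linear form to the square root of a quadratic form in the scores, which takes one line in R. The sketch below uses hypothetical values of $Z_j(\tau)$ and $\hat{\Sigma}$ (chosen so that the $Z_j$'s and each row of $\hat{\Sigma}$ sum to zero, as they do in a real analysis) and the hypothetical helper name `trend_z`; it also illustrates the invariance to linear transformations of the scores.

```r
# Trend statistic: sum_j a_j Z_j / sqrt(a' Sigma a)
trend_z <- function(Z, Sigma, a = seq_along(Z)) {
  sum(a * Z) / sqrt(drop(t(a) %*% Sigma %*% a))
}

# Hypothetical inputs for K = 3 groups (not taken from any data set)
Z <- c(-2, 0.5, 1.5)
Sigma <- matrix(c( 2, -1, -1,
                  -1,  2, -1,
                  -1, -1,  2), nrow = 3, byrow = TRUE)

z_default <- trend_z(Z, Sigma)                       # scores a_j = j
z_rescaled <- trend_z(Z, Sigma, a = 5 + 2 * (1:3))   # scores 7, 9, 11
p_one_sided <- pnorm(z_default, lower.tail = FALSE)

print(z_default)
print(z_rescaled)   # same value as z_default
```

The two values coincide because adding a constant to the scores changes neither the numerator (the $Z_j$'s sum to zero) nor the denominator (the rows of $\hat{\Sigma}$ sum to zero), and a positive rescaling cancels in the ratio.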
Example: Larynx Cancer Data
- A study of 90 patients diagnosed with cancer of the larynx in the 1970s at a Dutch hospital (details in Section 1.8)
- The data consist of the times between first treatment and either death or the end of the study
- Patients were classified by the stage of their disease using the American Joint Committee for Cancer Staging
KM Curves for Different Stages
The Test and Results
- We want to test the hypothesis that there is no difference in the death rates among the four stages of the disease against the alternative that the higher the stage, the higher the death rate
- Use the scores $a_j = j$
- Using the log-rank weights, the value of the test statistic is 3.72 and the p-value of the test is less than 0.0001
- The trend is also significant under the Tarone-Ware, Gehan and Peto-Peto weights
Computing
> library(survival)
> library(KMsurv)
> data(larynx)
> fit=survdiff(Surv(time,delta)~stage,data=larynx)
> aj=1:4
> z=sum(aj*(fit$obs-fit$exp))/sqrt(t(aj)%*%(fit$var)%*%aj)
> z
3.718959
> 1-pnorm(z)
0.0001
Stratified Tests
- Example:
  - In a study, we want to compare the effectiveness of allogeneic (allo) transplants versus autologous (auto) transplants for lymphoma patients (details in Section 1.10)
  - We also know the type of lymphoma: Hodgkin's disease (HOD) or non-Hodgkin lymphoma (NHL)
  - We are interested in testing $H_0$ that there is no difference in the disease-free survival rate between patients given an allo or auto transplant, adjusting for the patient's disease state
- We will use the stratified test. Let $M$ be the number of levels of a set of covariates; each level is called a stratum
- We test
$$H_0: h_{1s}(t) = h_{2s}(t) = \cdots = h_{Ks}(t), \quad s = 1, \ldots, M, \text{ for } t \le \tau$$
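The Test
- Using the data from the $s$th stratum,
$$Z_{js}(\tau) = \sum_{i=1}^{D} W(t_{is}) \left\{ d_{ijs} - Y_{ijs} \frac{d_{is}}{Y_{is}} \right\}, \quad j = 1, 2, \ldots, K$$
- The variances and covariances of the $Z_{js}(\tau)$ are given by
$$\hat{\sigma}_{jjs} = \sum_{i=1}^{D} W(t_{is})^2 \frac{Y_{ijs}}{Y_{is}} \left(1 - \frac{Y_{ijs}}{Y_{is}}\right) \left(\frac{Y_{is} - d_{is}}{Y_{is} - 1}\right) d_{is}$$
$$\hat{\sigma}_{jgs} = -\sum_{i=1}^{D} W(t_{is})^2 \frac{Y_{ijs}}{Y_{is}} \frac{Y_{igs}}{Y_{is}} \left(\frac{Y_{is} - d_{is}}{Y_{is} - 1}\right) d_{is}, \quad j \ne g$$
- Any choice of weight functions can be used for $Z_{js}$
- Let $Z_{j\cdot}(\tau) = \sum_{s=1}^{M} Z_{js}(\tau)$ and $\hat{\sigma}_{jg\cdot} = \sum_{s=1}^{M} \hat{\sigma}_{jgs}$
- The test statistic is
$$\chi^2 = (Z_{1\cdot}(\tau), \ldots, Z_{K-1\cdot}(\tau))\, \Sigma_{\cdot}^{-1}\, (Z_{1\cdot}(\tau), \ldots, Z_{K-1\cdot}(\tau))^t \sim \chi^2_{K-1}$$
- For the two-sample problem, the stratified test statistic is
$$\frac{\sum_{s=1}^{M} Z_{1s}(\tau)}{\sqrt{\sum_{s=1}^{M} \hat{\sigma}_{11s}}} \sim N(0, 1) \text{ under } H_0$$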
Example: KM Curves
[Figure: Kaplan-Meier curves; vertical axis 0.0 to 1.0, horizontal axis 0 to 2000]
Computing and Results
> fit.diff=survdiff(Surv(time,delta)~gtype+strata(dtype),data=hodg)
> fit.diff$obs-fit.diff$exp
          [,1]      [,2]
[1,] -2.343717  3.106206
[2,]  2.343717 -3.106206
> fit.diff$var
          [,1]      [,2]
[1,]  4.836347 -4.836347
[2,] -4.836347  4.836347
> z=(-2.343717+3.106206)/sqrt(4.836347)
> z
[1] 0.3467168
> 2*pnorm(-abs(z))
[1] 0.7288041