<<

Multivariate Survival Analysis

Previously we have assumed that either (Xi, δi) or (Xi, δi,Zi), i = 1, ..., n, are i.i.d.. This may not always be the case.

Multivariate survival can arise in practice in difference ways:

• Clustered survival data. This happens when failure times (often of the same type, eg. ) from the same cluster are dependent, such as in – multicenter clinical trials: cluster = clinical center – familial data: cluster = family – matched pairs: cluster = pair • Recurrent event data. This happens when an in- dividual may experience events of the same type more than once, eg, infections. It is actually a special case of clustered data, where the failure times are ordered.

What are the failure time random variables in each case?

1 Work has been done (and still being done) to estimate non- parametrically the joint survival distribution, without any predictors, eg. bivariate . This turns out to be much harder than the one dimensional case (KM), mainly due to .

In practice, we are often interested in relating certain covari- ates to the survival time (regression setting), while taking into consideration the dependence among the survival times.

Here we will introduce extensions of the Cox model to handle correlated survival data.

There are two types of approaches:

• the conditional approach (using random effects) • the marginal (WLW) approach

2 The random effects approach

The dependence among failure times may be viewed as in- duced by unmeasured factors that are common to subjects within a cluster but may vary from cluster to cluster.

For example,

• EST 1582 is a phase III multi-institutional lung cancer trial to compare two treatments, CAV vs. CAV-HEM. It was shown (Gray 1995) that there is significant insti- tutional variation in the treatment effects (see plot); • for familial data, common genetic or environmental fac- tors might have induce the dependence of survival times among the family members; • in recurrent events, multiple events of the same individ- ual are likely correlated due to some factor unique to that individual but different from individual to individual.

3 · 19 28·

·29

·

· ·

· · · · · · · · · · · · · ·

· · · · Estimated treatment effects (log HR) · · · · · · 16 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 · 18

0 5 10 15 20 25 30 Institution

Figure 1: Estimated institutional treatment effects (E1582). Institutions are sorted in increasing order of number of patients.

One may model these cluster-to-cluster variation as fixed effects. For example, in the E1582 data, if Z = 1 for CAV- HEM, and Z = 0 for CAV, instead of using a single β for the treatment difference, we may use β1 for the treatment effect in institution 1, β2 for the treatment effect in institution 2, ... there are 31 institutions (579 observations). How do we write down such a fixed effects Cox model?

Obviously the above fixed effects model will introduce many parameters. In the above figure, that is NOT how we ob- tained the estimated effects.

4 Alternatively one may try to use a stratified Cox model. But this will still have problems with small strata. In fact, the number of patients per institution varied from 1 to 56.

One way to solve this problem is to turn the many (31) fixed treatment effects into random effects, i.e. put a (prior) distribution on them (in a Bayesian sense). This effectively shrinks the magnitude of the estimated effects towards zero.

Shrinkage estimators are known to be biased, but nonetheless have smaller MSE (Morris, 1983).

5 The univariate frailty model

The simplest random effects model for clustered survival data is the univariate frailty model

0 λij(t) = λ0(t) exp{β Zij + bi}. Here i indicates the cluster, and j the individual from cluster i (i = 1, ..., n, j = 1, ...ni), and bi is the ‘random cluster effect’. This is sometimes called a hierarchical model.

The assumption of the random effects model is that b1, ..., bn are i.i.d. random variables, from a distribution known up to 2 a finite number of parameters. For example, bi ∼ N(0, σ ). This can substantially reduce the number of parameters com- pared to a fixed effects model.

Such modeling is most suitable when there are many clusters and each of relative small size.

In general bi induces correlation among the individuals in the same cluster i.

6 We can equivalently write the above model as

0 λij(t) = λ0(t) exp{β Zij}ωi. so that bi = log ωi; ωi is often called the frailty term. Early use of frailty models was mainly to count for ‘un- observed heterogeneity’, or unmeasured covariates, for i.i.d. data (ni = 1).

In the past, the commonly used distribution for the ωi’s is the with of one and unknown . This is mainly due to mathematical convenience (conjugate prior).

The mean of bi (or ωi) needs to be fixed to avoid problems of identifiability (why?).

The univariate frailty model can be fitted using coxph() with option frailty. The distributions allowed for ωi are gamma and log-normal.

7 In terms of the data generating mechanism, we can think of the clusters as being sampled from a population.

bi here is sometimes called the random intercept; it’s a ran- dom effect on the baseline .

This model counts for the correlation among the failure times within a cluster, but not the covariate effects (like in E1582) that vary from cluster to cluster.

8 General random effects Cox model

A more general random effects Cox model is

0 0 λij(t) = λ0(t) exp{β Zij + biWij},

Here Wij is a vector of covariates that have random effects, such as treatment in E1582 data; bi is the ‘covariate by cluster ’ (compare to the fixed effects model we might have written down earlier).

When Wij = 1, the above is the univariate frailty model.

Often the covariates in W are also included in Z, so that they have fixed effects (can be thought of as ‘main effects’), and we impose that E(bi) = 0.

The above model is also called the proportional mixed-effects model (PHMM). It parallels the linear mixed- effects model (LMM) and generalized mixed-effects model (GLMM).

The bi’s are often assumed to be normal. Gamma distribu- tion is no longer suitable, as it is not scale invariant.

This model was considered in Vaida and Xu (2000), and Ripatti and Palmgren (2000).

9 Cluster

log hazard Treatment -1.0 -0.5 0.0 0.5 1.0

0246810 time

Trt, Cluster 1

Trt, Cluster 2 log hazard -1.0 -0.5 0.0 0.5 1.0

0246810 time

Figure 2: Top: cluster effects (βZ + bi0); bottom: cluster x treatment interaction (βZ + bi0 + bi1Z).

10 Inference

Inference under the random effects Cox model is carried out using the full likelihood (as opposed to the partial likeli- hood).

• The assumptions are: data from different clusters are independent; within the same cluster i, the dependence is induced by bi; in other words, conditional on bi, the individuals in cluster i are independent.

• Conditional on bi, we may write down the full likelihood for cluster i: ni Y δij Li = λij(Xij) Sij(Xij) j=1

ni 0 0 0 0 Y β Zij+b Wij δij β Zij+b Wij = {λ0(Xij)e i } exp{−Λ0(Xij)e i } j=1

• The whole likelihood condition on the bi’s is n Y Li. i=1

11 • The likelihood for observed data is then n Y Z L(β, λ0, σ) = Li · p(bi; σ)dbi. i=1 Here σ denotes the unknown parameter(s) in the distri- bution of the random effects, and p(bi; σ) is the density of bi. The parameters to be estimated are (β, λ0, σ).

In order to solve for MLE, one needs to evaluate the above integrals. Often these integrals are not of closed forms. It- erative algorithms such as EM are often used.

An R package ‘phmm’ is available for the above NPMLE.

Table 1: Estimates (standard errors) from E1582 lung cancer trial

d = 0 d = 1 d = 2 β β Var(b) β Var(b) treatment -0.254 (0.085) -0.250 (0.104) 0.071 (0.069) -0.247 (0.119) 0.046 (0.184) bone 0.223 (0.093) 0.212 (0.095) - 0.230 (0.144) 0.129 (0.083) liver 0.429 (0.090) 0.423 (0.091) - 0.393 (0.094) - ps -0.602 (0.104) -0.641 (0.109) - -0.649 (0.131) - wt loss 0.200 (0.087) 0.218 (0.089) - 0.208 (0.092) -

12 Notes

In a previous figure of estimated treatment effects by insti- tution, those are the ‘estimated’ bi’s under the d = 1 model. They are obtained as empirical Bayes estimate following the MLE; they are the posterior (can also be ) of the bi’s given the observed data.

In addition,

• R package ‘phmm’ gives the NPMLE, as well as the em- pirical Bayes estimate of the random effects, and various testing, AIC etc.; • Asymptotic properties for the NPMLE has been estab- lished, with also stable numerical properties (Gamst et al, 2009). • coxph() with ‘frailty’ option gives a penalized partial likelihood (PPL) estimator; • there is also an R package ‘coxme’ that gives the PPL under the general PHMM. • No theoretical properties for the PPL has been estab- lished (inconsistent in general). It has numerical prob- lems sometimes, and the estimated (hence CI’s) can be inaccurate. • PPL is faster to compute.

13 (a)

· · · · ·· · ····· · · · ···· · · · · ·· ·· · ·· · ·· ·· · · · · · · · ·· · · ·· · ········ · ··· · · ············ ·· · ····· ··· · ··· · ·· ···· · · · ·· ·· ···· · · · · ·· · · ···· · · · ··· · · Random effect

1 event/patient 2 events/patient 3 4 -3 -2 -1 0 1 2 3 0 20 40 60 80 100 120 Patient

(b)

· ·· ·· ·· ·· · · · ·· · · · · ·· ·· · · · ·· · · · · ··· · · · ·· ··· ··· · ··· · · ·· ·· ·· ···· · ············ · ··· · ·· ········ · · · · · · ····· · ··· ·· · · · · · ·· · · · · · · ··· · · · · · · Random effect

1 event/patient 2 events/patient 3 4 -3 -2 -1 0 1 2 3 0 20 40 60 80 100 120 Patient

Figure 3: Estimated random effects from univariate lognormal frailty model and 95% confidence intervals (CGD data). (a) without adjustment for covariates other than treatment; (b) with adjustment for covariates.

Notes on recurrent events

Both the marginal model and the random effects model can be used for recurrent event data. When the random effects model is used, the assumption is a common baseline hazard function for the events of the same individual, like in the CGD example.

14 • Other methods have been studied that are aimed specifi- cally for recurrent event data. One popular such method is PWP (Prentice, Williams and Peterson 1981), which is a conditional approach that takes into account the ordering of the events, and was the recommended method by Oakes (1992, in Survival Analysis, State of the Art, ed. Klein and Goel). • Another simple approach that is sometimes used is the Anderson-Gill model, or the counting process ap- proach. This has the favor of the simple Cox model, and requires the relatively strong assumption that the times between successive events be conditionally independent given the history. The covariate ‘prior number of events’ is sometimes included in such a model. • Therneau and Hamilton (1997, Stat in Med, p.2029- 2047) compared the above methods using example datasets, showing how they are implemented using popular statis- tical software. Similar material can also be found in Therneau and Grambsch’s book.

15 The marginal approach

The marginal approach (WLW method) was proposed by Wei, Lin and Weissfeld (1989).

An example: in an AIDS trial to evaluate the drug rib- avirin, there are 3 treatment groups: placebo, low dose, and high dose. To study the antiretroviral capability of the drug, blood samples were collected at 4, 8 and 12 weeks. For each sample, the ‘viral load’ was measured by the number of days until virus positivity was detected. So potentially each pa- tient in the study should have 3 event times.

In general, suppose that there are m types of failure. Let Tij be the failure time of the jth type for the ith subject, j = 1, ..., m, i = 1, ..., n. We observe

(Xij, δij) where Xij = min(Tij,Cij), δij = I(Tij ≤ Cij), and Cij is the censoring time.

Let Zij(t) be a p × 1 vector of possibly time-dependent co- variates for subject i with respect to failure type j. Note that some covariates can be common for different types of failure. In the example above, Zij = Zi =treatment assign- ment for patient i, stays the same for all three event times of the same patient.

16 The WLW model is 0 λij(t) = λ0j(t) exp{βjZij(t)}.

Here βj is the failure-specific regression parameter. This is a marginal model. It models the marginal hazard of type j failure time, and says nothing about the joint distri- bution of different types of failures.

The assumption regarding censoring is that given Zij, the Tij’s and the Cij’s are independent.

Inference under the WLW model is carried out via the failure-specific partial likelihood

n ( 0 )δij Y exp[βjZij(Xij)] Lj(βj) = P 0 , exp[β Zlj(Xij)] i=1 l∈Rj(Xij) j

where Rj(t) = {l : Xlj ≥ t} is the set of subjects at risk for the jth type of failure just prior to time t. ˆ The MPLE (not a MLE) βj is solution to ∂ log Lj(βj)/∂βj = 0.

ˆ • βj is consistent for βj (this is exactly the same as fit the Cox model to one failure type at a time); ˆ • βj’s are generally correlated; ˆ ˆ • (β1, ..., βm) are asymptotically normal with matrix that can be consistently estimated by a sandwich estimator.

This allows simultaneous inference about the βj’s.

17 In the AIDS example, if β1j indicates the difference between low-dose and placebo, we may want test H0 : β11 = β12 = β13 = 0. This can be done using ˆ ˆ ˆ −1 ˆ ˆ ˆ 0 (β11, β12, β13)Ψˆ (β11, β12, β13) ˆ ˆ ˆ where Ψˆ is the estimated of (β11, β12, β13).

Furthermore, if we can assume that β11 = β12 = β13 = β1, ˆ we can estimate β1 by a linear combination of the β1j’s. The vector of weights that gives the smallest aymptotic variance is (eΨˆ −1e0)−1Ψˆ −1e0, where e = (1, ..., 1) is the vector of 1’s.

18 Notes:

• The marginal model is suitable for multivariate survival data that are correlated and have well-defined failure types, when we are interested in relating the marginal survival to the covariates, while seeing the correlation among the survival times as a nuisance parameter. The coefficients in the model has a ‘population-averaged’ in- terpretation (to be contrasted with the random effects approach later). • This model is sometimes used for recurrent event data. The interpretation, however, has some difficulty. This is because the model does not take the ordering of the events into account so that a subject can be at risk for the second event even before the first event has occurred. The method is also not efficient in the recurrent events setting (because it does not take into account the order- ing of the events) though can be ‘convenient and useful’. coxph() with option cluster (see its relation with robust)

Therneau and Grambsch book p.186-189 also discusses ways of creating ‘new’ data set to fit the model.

19 Random effects vs. marginal models

• Marginal models are often simpler to fit (computation- ally faster). • One advantage of the random effects model over the marginal (WLW) model is that it allows ‘estimation’ of the random effects; this usually gives us more insight into the data and the scientific problem. – In E1582 data, estimated institutional treatment ef- fects would allow the researchers to further investi- gate why certain institutions have larger treatment effects than others. – Estimation of the random effects and their variance is also of interest in genetic studies, where the variance components can be translated to the correlation. – The magnitudes of the random effects across clusters reflect the ‘heterogeneity’ among the clusters. In a recurrent events (CGD) example in Vaida and Xu (2000), where cluster = patient, and survival times = time to first infection, from the first infection to the second infection, .... The fact that people with more infections have larger random effects from fitting a univariate frailty model, reflects that these people are different from those with fewer infections, in a way not captured by the covariates that were collected.

20 A relation with robust estimation under the Cox model

‘Robust’ refers to when the assumption(s) are wrong. There are several possibilities:

1. the i.i.d. assumption is wrong: this becomes the WLW ap- proach, which parallels the generalized (GEE) approach for generalized linear models (GLM). In this case the original partial likelihood estimator is still consistent for the true regression coefficients, but its variance needs to be estimated by the robust sandwich estimator. 2. the Cox model is wrong, eg. non-PH; in this case the partial likelihood estimator converges in to some ‘least- false’ parameter defined by a population estimating equation, eg. an average regression effect under non-PH. Again its vari- ance needs to be estimated by a sandwich estimator. (Lin and Wei, 1989; Xu and O’Quilgley, 2000) 3. missing covariate: this is specific to non-linear models like the Cox or , i.e. if a true covariate is not included the model, then the resulting model is no longer a Cox or logistic model. In that sense the model is wrong, and 2) applies. 4. In addition, sometimes we might wish to use the weighted Cox model, eg. inverse probability weighting (IPW) for causal infer- ence. (Lin, 1991; Sasieni, 1993)

All of the above fall under estimating equations instead of the likelihood approach, and use the ‘robust’ sandwich esti- mator for variance.

21