An Introduction to Survival Analysis∗
Total Page:16
File Type:pdf, Size:1020Kb
An Introduction to Survival Analysis∗ Mark Stevenson EpiCentre, IVABS, Massey University December 2007 Contents 1 General principles 3 1.1 Describing time to event . .3 Deathdensity...........................................3 Survival ..............................................3 Hazard...............................................3 1.2 Censoring . .4 2 Non-parametric survival 6 2.1 Kaplan-Meier method . .7 2.2 Life table method . .8 2.3 Flemington-Harrington estimator . .8 2.4 Worked examples . .9 Kaplan-Meier method . .9 Flemington-Harrington estimator . 10 Instantaneous hazard . 11 Cumulative hazard . 11 3 Parametric survival 11 3.1 The exponential distribution . 12 3.2 The Weibull distribution . 13 3.3 Worked examples . 13 The exponential distribution . 13 The Weibull distribution . 14 4 Comparing survival distributions 15 4.1 The log-rank test . 16 4.2 Othertests ............................................ 16 4.3 Worked examples . 16 ∗Notes for MVS course 195.721 Analysis and Interpretation of Animal Health Data. http://epicentre.massey.ac.nz 1 5 Non-parametric and semi-parametric models 16 5.1 Modelbuilding .......................................... 17 Selection of covariates . 17 Tiedevents ............................................ 18 Fitting a multivariable model . 18 Check the scale of continuous covariates . 19 Interactions . 19 5.2 Testing the proportional hazards assumption . 19 5.3 Residuals . 20 5.4 Overall goodness-of-fit . 20 5.5 Worked examples . 21 Selection of covariates . 21 Fit multivariable model . 22 Check scale of continuous covariates (method 1) . 23 Check scale of continuous covariates (method 2) . 23 Interactions . 24 Testing the proportional hazards assumption . 24 Residuals ............................................. 25 Overall goodness of fit . 26 Dealing with violation of the proportional hazards assumption . 26 5.6 Poisson regression . 26 6 Parametric models 27 6.1 Exponential model . 27 6.2 Weibull model . 27 6.3 Accelerated failure time models . 27 6.4 Worked examples . 29 Exponential and Weibull models . 29 Accelerated failure time models . 30 7 Time dependent covariates 30 2 1 General principles Survival analysis is the name for a collection of statistical techniques used to describe and quantify time to event data. In survival analysis we use the term `failure' to define the occurrence of the event of interest (even though the event may actually be a `success' such as recovery from therapy). The term `survival time' specifies the length of time taken for failure to occur. Situations where survival analyses have been used in epidemiology include: Survival of patients after surgery. The length of time taken for cows to conceive after calving. The time taken for a farm to experience its first case of an exotic disease. 1.1 Describing time to event Death density When the variable under consideration is the length of time taken for an event to occur (e.g. death) a frequency histogram can be constructed to show the count of events as a function of time. A curve fitted to this histogram produces a death density function f(t), as shown in Figure 1. If we set the area under the curve of the death density function to equal 1 then for any given time t the area under the curve to the left of t represents the proportion of individuals in the population who have experienced the event of interest. The proportion of individuals who have died as a function of t is known as the cumulative death distribution function and is called F(t). Survival Considering again the death density function shown in Figure 1. The area under the curve to the right of time t is the proportion of individuals in the population who have survived to time t, S(t). S(t) can be plotted as a function of time to produce a survival curve, as shown in Figure 2. At t = 0 there have been no failures so S(t) = 1. By day 15 all members of the population have failed and S(t) = 0. Because we use counts of individuals present at discrete time points, survival curves are usually presented in step format. Hazard The instantaneous rate at which a randomly-selected individual known to be alive at time (t - 1) will die at time t is called the conditional failure rate or instantaneous hazard, h(t). Mathematically, instantaneous hazard equals the number that fail between time t and time t + ∆(t) divided by the size of the population at risk at time t, divided by ∆(t). This gives the proportion of the population present at time t that fail per unit time. Instantaneous hazard is also known as the force of mortality, the instantaneous death rate, or the failure rate. An example of an instantaneous hazard curve is shown in Figure 3. Figure 3 shows the the weekly probability of foot-and-mouth disease occurring in two farm types in Cumbria (Great Britain) in 2001. You should interpret this curve in exactly the same way you would an epidemic curve. The advantage of this approach is that it shows how disease risk changes over time, correcting for changes in the size of the population that might occur over time (something that should be considered when dealing with foot-and-mouth disease data). 3 Figure 1: Line plot f(t) (death density) as a function of time. The cumulative proportion of the population that has died up to time t equals F (t). The proportion of the population that has survived to time t is S(t) = F (t) − 1. The cumulative hazard (also known as the integrated hazard) at time t, H(t) equals the area under the hazard curve up until time t. A cumulative hazard curve shows the (cumulative) probability that the event of interest has occurred up to any point in time. 1.2 Censoring In longitudinal studies exact survival time is only known for those individuals who show the event of interest during the follow-up period. For others (those who are disease free at the end of the observation period or those that were lost) all we can say is that they did not show the event of interest during the follow-up period. These individuals are called censored observations. An attractive feature of survival analysis is that we are able to include the data contributed by censored observations right up until they are removed from the risk set. The following terms are used in relation to censoring: Right censoring: a subject is right censored if it is known that failure occurs some time after the recorded follow-up period. Left censoring: a subject is left censored it it is known that the failure occurs some time before the recorded follow-up period. For example, you conduct a study investigating factors influencing days to first oestrus in dairy cattle. You start observing your population (for argument's sake) at 40 days after calving but find that several cows in the group have already had an oestrus event. These cows are said to be left censored at day 40. Interval censoring: a subject is interval censored if it is known that the event occurs between two times, but the exact time of failure is not known. In effect we say `I know that the event occurred between date A and date B: I know that the event occurred, but I don't know exactly when.' In an 4 Figure 2: Survival curve showing the cumulative proportion of the population who have `survived' (not experienced the event of interest) as a function of time. observational study of EBL seroconversion you sample a population of cows every six months. Cows that are negative on the first test and positive at the next are said to have seroconverted. These individuals are said to be interval censored with the first sampling date being the lower interval and the second sampling date the upper interval. We should distinguish between the terms censoring and truncation (even though the two events are handled the same way analytically). A truncation period means that the outcome of interest cannot possibly occur. A censoring period means that the outcome of interest may have occurred. There are two types of trunction: Left truncation: a subject is left truncated if it enters the population at risk some stage after the start of the follow-up period. For example, in a study investigating the date of first BSE diagnosis on a group of farms, those farms that are established after the start of the study are said to be left truncated (the implication here is that there is no way the farm can experience the event of interest before the truncation date). Right truncation: a subject is right truncated if it leaves the population at risk some stage after the study start (and we know that there is no way the event of interest could have occurred after this date). For example, in a study investigating the date of first foot-and-mouth disease diagnosis on a group of farms, those farms that are pre-emptively culled as a result of control measures are right truncated on the date of culling. Consider a study illustrated in Figure 4. Subjects enter at various stages throughout the study period. An `X' indicates that the subject has experienced the outcome of interest; a `O' indicates censoring. Subject A experiences the event of interest on day 7. Subject B does not experience the event during the 5 Figure 3: Weekly hazard of foot-and-mouth disease infection for cattle holdings (solid lines) and `other' holdings (dashed lines) in Cumbria (Great Britain) in 2001. Reproduced from Wilesmith et al. (2003). study period and is right censored on day 12 (this implies that subject B experienced the event sometime after day 12). Subject C does not experience the event of interest during its period of observation and is censored on day 10.