HST 190: Introduction to Biostatistics

Lecture 8: Analysis of time-to-event data

• Survival analysis is studied on its own because survival data have features that distinguish them from other types of data.
• Nonetheless, our tour of survival analysis will visit all of the topics we have learned so far:
  § Estimation
  § One-sample inference
  § Two-sample inference
  § Regression

Survival analysis
• Survival analysis is the area of statistics that deals with time-to-event data.
• Despite the name, “survival” analysis isn’t only for analyzing time until death.
  § It deals with any situation where the quantity of interest is the amount of time until a study subject experiences some relevant endpoint.
• The terms “failure” and “event” are used interchangeably for the endpoint of interest.
• For example, say researchers record time from diagnosis to death (in months) for a sample of patients diagnosed with laryngeal cancer.
  § How would you summarize the survival of laryngeal cancer patients?

• A simple idea is to report the mean or median survival time.
  § However, the sample mean is not robust to outliers: a small increase in the mean doesn’t show whether everyone lived a little bit longer or a few people lived a lot longer.
  § The median is also just a single summary measure.
• If considering your own prognosis, you’d want to know more!
• Another simple idea: report the survival rate at $t$ years.
  § How to choose the cutoff time $t$?
  § This is also just a single summary measure: the graphs show the same 5-year rate.

• Summary measures such as means and proportions are highly useful for concisely describing data. However, they capture some features of the data but not all.
  § As we see, the survival curve itself is far more informative.
  § So, let’s estimate that!
Patient   Months
1         1
2         2.5
3         3
4         3
5         5.5
6         6
7         7
8         7
9         9
10        13

Measuring survival time
• We measure time with the variable $t$. The beginning of the period we’re studying (e.g., time of diagnosis) is $t = 0$.
  § $t$ can represent days, weeks, months, years, etc.
• Time is recorded relative to when a person enters the study, not according to the calendar.
  § In other words, a person reaches the point $t = 1$ after being in the study for 1 year (or month, etc.).
  § It doesn’t matter how far along the others are.

• Person #1 enters the study in 2003
• Person #2 enters the study in 2002
• Both die in 2006
• If $t$ is measured in years, person #1 dies at $t = 3$ and person #2 dies at $t = 4$

[Figure: timeline for Person #1 and Person #2 on a calendar axis from 2002 to 2007, with each death marked by an X]

• The probability of an individual in our population surviving beyond a given time $t$ is called the survival function, written as $S(t)$.
• Here we want to make inference about a function, not just a single population value.
• It seems logical to estimate $S(t)$ by taking a sample and recording the proportion of the sample still alive after time $t$ has elapsed.
  § However, it is not as simple as that sounds.

Censored data
• Analyses of survival times often include censored data (a type of missingness).
• Valid inference in the presence of missing data is a topic of ongoing research in statistics.
• To do valid inference when some data are missing, we must make assumptions.
• Time-to-event data often have data missing in a particular way: individuals are lost to follow-up (or the study ends) before they experience the event of interest. This is called (right) censoring.
• Censored data provide partial information: you don’t know exactly how long a patient lived, but you know that she/he lived at least as long as the time before being lost to follow-up.

• Why would a person be lost to follow-up?
• The person could…
  § have moved to another city
  § have withdrawn from the study
  § have died of a different cause
  § still be in the study without an event at the time of the analysis
• To do inference in the setting of missing data, we must be willing to make a big assumption.
  § The assumption is that censoring is non-informative.
• In other words, assume that being lost to follow-up is unrelated to prognosis.
• If this assumption can’t be made, inference becomes more complicated and requires strong assumptions.

• As an important counterexample, say researchers administer a new chemotherapy drug to 10 cancer patients to estimate survival time while on the drug.
• 5 patients can’t tolerate the side effects and drop out of the study.
• If non-informative censoring were assumed, the drug would probably appear falsely impressive.
  § Those who dropped out were probably more ill; hence shorter survival times were disproportionately removed from the sample.

Estimating the survival curve
• The Kaplan-Meier estimator provides an estimate of $S(t)$ at all time points, even if some data are censored.
  § Also known as the product-limit estimator
• Using the rules of probability, we’ll see where this estimator comes from. Choose specific times $t_1, \dots, t_k$; then at $t_j$,

$$
\begin{aligned}
S(t_j) &= P(\text{alive at } t_j) \\
&= P(\text{alive at } t_j \cap \text{alive at } t_{j-1}) \\
&= P(\text{alive at } t_j \mid \text{alive at } t_{j-1}) \cdot P(\text{alive at } t_{j-1}) \\
&= P(\text{alive at } t_j \mid \text{alive at } t_{j-1}) \cdot S(t_{j-1})
\end{aligned}
$$

  § Put simply, the probability of surviving to time $t_j$ is the probability of surviving to $t_{j-1}$ and then, given you made it that far, surviving to $t_j$.
  § What if we applied this trick repeatedly?

$$
\begin{aligned}
S(t_j) &= P(\text{alive at } t_j \mid \text{alive at } t_{j-1}) \cdot S(t_{j-1}) \\
&= P(\text{alive at } t_j \mid \text{alive at } t_{j-1}) \cdot P(\text{alive at } t_{j-1} \mid \text{alive at } t_{j-2}) \cdot S(t_{j-2}) \\
&= P(\text{alive at } t_j \mid \text{alive at } t_{j-1}) \cdot \ldots \cdot P(\text{alive at } t_2 \mid \text{alive at } t_1) \cdot S(t_1)
\end{aligned}
$$

• After writing the survival function as the product of these individual pieces, we can estimate it by estimating each piece individually.
• If there were no censoring, we could simply estimate $P(\text{alive at } t_j \mid \text{alive at } t_{j-1})$ by the sample quantity

$$
\frac{\#\text{ alive at } t_j}{\#\text{ alive at } t_{j-1}}
$$

• However, a patient who is alive but censored at time $t_{j-1}$ never really had a chance to make it to $t_j$.
  § That patient was not eligible to die during the interval from $t_{j-1}$ to $t_j$ and therefore shouldn’t be counted when computing the survival rate in this interval.

• When there is censoring, we estimate $P(\text{alive at } t_j \mid \text{alive at } t_{j-1})$ using the sample quantity

$$
\frac{\#\text{ alive at } t_j}{\#\text{ alive at } t_{j-1} - \#\text{ censored at } t_{j-1}}
$$

• The denominator counts those at risk for the event at time $t_j$.
  § It is exactly because of independent censoring that we can estimate the conditional probability as $P(\text{alive at } t_j \mid \text{alive at } t_{j-1} \text{ and uncensored at } t_j)$.

• Let’s define the following:
  § $d_j$ = # died at time $t_j$
  § $c_j$ = # censored at time $t_j$
  § $s_j$ = # still alive and not censored at $t_j$
• Notice that $s_{j-1} = d_j + c_j + s_j$, so we can write the previous estimator as

$$
\frac{\#\text{ alive at } t_j}{\#\text{ alive at } t_{j-1} - \#\text{ censored at } t_{j-1}}
= \frac{s_j + c_j}{s_{j-1} + c_{j-1} - c_{j-1}}
= \frac{s_{j-1} - d_j}{s_{j-1}}
= 1 - \frac{d_j}{s_{j-1}}
$$

• So, given a set of observed time points $t_1, \dots, t_k$, the Kaplan-Meier estimator of the survival probability at time $t_j$ is

$$
\hat{S}(t_j) = \left(1 - \frac{d_1}{s_0}\right) \cdot \left(1 - \frac{d_2}{s_1}\right) \cdot \ldots \cdot \left(1 - \frac{d_j}{s_{j-1}}\right)
$$

  § Two key features:
    1) The estimated curve jumps at event times only
    2) The curve goes to zero if the last observed time is an event, not a censoring
• Consider outcomes, at 2-year intervals, for 100 patients with some disease or other.
  § Estimate the survival function.

year   fail   censored
2      7      2
4      16     5
6      19     8
8      14     7
10     11     4
12     5      2

• Working through the first three intervals:
  § $\hat{S}(2) = 1 - \frac{7}{100} = 0.930$
  § $\hat{S}(4) = \left(1 - \frac{7}{100}\right)\left(1 - \frac{16}{91}\right) = 0.766$
  § $\hat{S}(6) = \left(1 - \frac{7}{100}\right)\left(1 - \frac{16}{91}\right)\left(1 - \frac{19}{70}\right) = 0.558$ …

year   fail = d_j   censored = c_j   survive = s_j         total = s_{j-1}
2      7            2                100 − (7 + 2) = 91    100
4      16           5                91 − (16 + 5) = 70    91
6      19           8                70 − (19 + 8) = 43    70
8      14           7
10     11           4
12     5            2

• Estimating $S(t)$ this way is also called the life-table method because it pre-specifies the time intervals.
• With software, estimates of $S(t)$ are usually computed at each individual event time, which gives a smoother curve.

Liu R et al. NEJM 2007 Jan 18; 356(3):217-226 (Fig. 2)

Brahmer J et al. NEJM 2015; 373(2):123-135

Confidence intervals for KM estimator
• In addition to estimating $S(t)$ at any time point, we can also form a confidence interval for it.
• As with the odds ratio, we use the log-transformation for this (which improves the normal approximation):
  1) Take the logarithm of $\hat{S}(t)$
  2) Form a CI for $\ln S(t)$
  3) Convert back to a CI for $S(t)$

• At time $t_j$, the variance of $\ln \hat{S}(t_j)$ is

$$
\mathrm{Var}\left[\ln \hat{S}(t_j)\right] = \sum_{i=1}^{j} \frac{d_i}{s_{i-1}\,(s_{i-1} - d_i)}
$$

• Therefore, the $100(1-\alpha)\%$ CI for $\ln S(t_j)$ is

$$
\ln \hat{S}(t_j) \pm z_{1-\alpha/2} \sqrt{\sum_{i=1}^{j} \frac{d_i}{s_{i-1}\,(s_{i-1} - d_i)}} = (L, U)
$$

• So the $100(1-\alpha)\%$ CI for $S(t_j)$ is $\left(e^{L}, e^{U}\right)$.

• Returning to our example, $\hat{S}(6) = 0.558 \Rightarrow \ln \hat{S}(6) = -0.583$
• Then the variance of $\ln \hat{S}(6)$ is

$$
\sum_{i=1}^{3} \frac{d_i}{s_{i-1}\,(s_{i-1} - d_i)}
= \frac{d_1}{s_0 (s_0 - d_1)} + \frac{d_2}{s_1 (s_1 - d_2)} + \frac{d_3}{s_2 (s_2 - d_3)}
= \frac{7}{100\,(100 - 7)} + \frac{16}{91\,(91 - 16)} + \frac{19}{70\,(70 - 19)}
\approx 0.0084
$$

• Then the 95% CI for $\ln S(6)$ is $-0.583 \pm 1.96\sqrt{0.0084} = (-0.763, -0.403)$
• Thus, the 95% CI for $S(6)$ is $\left(e^{-0.763}, e^{-0.403}\right) = (0.466, 0.668)$

Log-rank test
• In addition to estimating survival, we may want to compare two groups’ survival functions using a hypothesis test.
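Returning to the 100-patient life-table example, the Kaplan-Meier estimates and the log-transformed confidence intervals can be reproduced with a short script. This is a minimal sketch in plain Python, not the lecture's own code: the function name `km_life_table`, the constants `INTERVALS` and `N_START`, and the dictionary keys are illustrative assumptions. It applies the product $\prod (1 - d_i/s_{i-1})$ and the variance sum $\sum d_i / (s_{i-1}(s_{i-1} - d_i))$ exactly as defined in the slides, and it assumes a 95% interval (z = 1.96).

```python
import math

# (year, failures d_j, censored c_j) for the 100-patient example
INTERVALS = [(2, 7, 2), (4, 16, 5), (6, 19, 8), (8, 14, 7), (10, 11, 4), (12, 5, 2)]
N_START = 100  # s_0: number at risk at time 0


def km_life_table(intervals, n_start, z=1.96):
    """Kaplan-Meier estimate and log-scale CI for each pre-specified interval.

    Hypothetical helper for illustration; z=1.96 assumes a 95% interval.
    """
    rows = []
    at_risk = n_start  # s_{j-1}, those at risk entering the interval
    surv = 1.0         # running product for S-hat(t_j)
    var_log = 0.0      # running sum for Var[ln S-hat(t_j)]
    for year, died, censored in intervals:
        surv *= 1 - died / at_risk
        if surv > 0:
            var_log += died / (at_risk * (at_risk - died))
            half_width = z * math.sqrt(var_log)
            log_surv = math.log(surv)
            # Build the CI on the log scale, then convert back with exp
            ci = (math.exp(log_surv - half_width), math.exp(log_surv + half_width))
        else:
            # ln(0) is undefined, so no log-scale interval is reported
            ci = (None, None)
        rows.append({"year": year, "at_risk": at_risk, "surv": surv, "ci": ci})
        at_risk -= died + censored  # s_j = s_{j-1} - d_j - c_j
    return rows


if __name__ == "__main__":
    for row in km_life_table(INTERVALS, N_START):
        lo, hi = row["ci"]
        print(f"year {row['year']:>2}: at risk {row['at_risk']:>3}, "
              f"S-hat = {row['surv']:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Because the script does not round intermediate values, its interval limits can differ from the hand calculation in the third decimal place. The remaining rows of the table (years 8, 10 and 12) follow from the same loop.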
