Kaplan-Meier Estimator

If we didn’t have censoring, we could just use the ECDF and subtract it from 1 to get the estimated survival function. What’s brilliant about the K-M approach is that it generalizes to allow censoring, in a way that would not be obvious how to do with the ECDF.

To work with the K-M estimator, it is helpful to lay out all the terms in a table. We can also compute the estimated variance of $\hat{S}(t)$, which is denoted $\hat{V}[\hat{S}(t)]$. The standard error is the square root of the estimated variance. This allows us to put confidence limits on $\hat{S}(t)$.

One formula (there are others that are not equivalent) for the estimated variance is:

$$\hat{V}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_i \le t} \frac{d_i}{Y_i (Y_i - d_i)}$$

STAT574: Biostatistics February 10, 2016

Kaplan-Meier example with censoring

Now let’s try an example with censoring. We’ll use the example that we used for the exponential:

1.5, 2.4, 10.5, 12.5+, 15.1, 20.2+

In this case there are no ties, but recall that $t_i$ refers to the $i$th death time.


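Consequently, we have

$$t_1 = 1.5,\quad t_2 = 2.4,\quad t_3 = 10.5,\quad t_4 = 15.1,\qquad d_1 = d_2 = d_3 = d_4 = 1$$

$$Y_1 = 6,\quad Y_2 = 5,\quad Y_3 = 4,\quad Y_4 = 2$$

Following the formula we have

$$\hat{S}(1.5) = 1 - \frac{1}{6} = 0.83$$

$$\hat{S}(2.4) = \left(1 - \frac{1}{6}\right)\left(1 - \frac{1}{5}\right) = 0.67$$

$$\hat{S}(10.5) = \left(1 - \frac{1}{6}\right)\left(1 - \frac{1}{5}\right)\left(1 - \frac{1}{4}\right) = 0.5$$

$$\hat{S}(15.1) = (0.5)\left(1 - \frac{1}{2}\right) = 0.25$$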
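The hand calculation can be checked with a few lines of base R. This is a minimal sketch, not the course's own code: the variable names (`time`, `status`, `t_i`, `d_i`, `Y_i`, `km`) are illustrative choices, and the variance line simply applies the formula quoted earlier.

```r
# Six observations from the example; status is 1 for a death, 0 for a censored time ("+").
time   <- c(1.5, 2.4, 10.5, 12.5, 15.1, 20.2)
status <- c(1,   1,   1,    0,    1,    0)

# Distinct death times t_i, deaths d_i at each, and number at risk Y_i just before t_i.
t_i <- sort(unique(time[status == 1]))
d_i <- sapply(t_i, function(tt) sum(time == tt & status == 1))
Y_i <- sapply(t_i, function(tt) sum(time >= tt))

# Product-limit estimate and the estimated variance from the formula in the text.
S_hat <- cumprod(1 - d_i / Y_i)
V_hat <- S_hat^2 * cumsum(d_i / (Y_i * (Y_i - d_i)))

km <- data.frame(t = t_i, d = d_i, Y = Y_i, S = S_hat, se = sqrt(V_hat))
print(round(km, 3))
```

Each row of `km` is one row of the K-M table: a death time, the deaths and number at risk there, and the running product with its standard error.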

Comparison to MLE

It is interesting to compare to the MLE that we obtained earlier under the exponential model. For the exponential model, we obtained $\hat{\lambda} = 0.066$. The estimated survival probabilities at the observed death times are

> 1-pexp(1.5,rate=.066)
[1] 0.9057427
> 1-pexp(2.4,rate=.066)
[1] 0.8535083
> 1-pexp(10.5,rate=.066)
[1] 0.5000736
> 1-pexp(15.1,rate=.066)
[1] 0.3691324

K-M versus exponential

The exponential model predicts higher survival probabilities at the observed death times than Kaplan-Meier, except that both estimate $\hat{S}(10.5)$ to be 0.5 (or very close to it for the exponential model). Note that the Kaplan-Meier estimate is still 50% survival at, say, 12.3 months, whereas the exponential model estimates 44% for this time. As another example, $\hat{S}(10.0) = 0.67$ for Kaplan-Meier but 0.52 for the exponential model. The exponential model seems to be roughly interpolating between the values obtained by K-M.
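The comparison at times between deaths relies on the K-M estimate being a step function that holds its value until the next death. A small base-R sketch of that lookup follows; `km_at` and `exp_at` are hypothetical helper names, and the rate 0.066 is the value quoted in the text.

```r
# K-M estimate at the four death times of the six-observation example.
t_i   <- c(1.5, 2.4, 10.5, 15.1)
S_hat <- cumprod(1 - 1 / c(6, 5, 4, 2))

# stepfun() builds a right-continuous step function: the value is 1 before
# the first death and drops to S_hat[i] at t_i.
km_at <- stepfun(t_i, c(1, S_hat))

# Exponential survival function with the rate quoted in the text.
exp_at <- function(t, rate = 0.066) 1 - pexp(t, rate = rate)

t_new <- c(10.0, 10.5, 12.3)
print(data.frame(t = t_new,
                 km = round(km_at(t_new), 3),
                 exponential = round(exp_at(t_new), 3)))
```

Between 10.5 and 15.1 the K-M column stays flat while the exponential column keeps decreasing, which is the interpolating behavior described above.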


Example with lots of censoring


Leukemia example

The data for the auto group (own bone marrow reinfused): 0.658, 0.822, 1.414, 2.500, 3.322, 3.816, 4.737, 4.836+, 4.934, 5.033, 5.757, 5.855, 5.987, 6.151, 6.217, 6.447+, 8.651, 8.717, 9.441+, 10.329, 11.480, 12.007, 12.007+, 12.237+, 12.401+, 13.059+, 14.474+, 15.000+, 15.461, 15.757, 16.480, 16.711, 17.204+, 17.237, 17.303+, 17.664+, 18.092+, 18.092, 18.750+, 20.625+, 23.158, 27.730+, 31.184+, 32.434+, 35.921+, 42.237+, 44.638+, 46.480+, 47.467+, 48.322+, 56.086

We’ll arrange this into a spreadsheet.
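The same table can be built in base R instead of a spreadsheet. The sketch below is one possible layout, not the one used in the lecture: it assumes a trailing "+" marks a censored time, counts a time censored at a death time as still at risk there, and uses illustrative column names.

```r
# Auto-group times as listed above; a trailing "+" marks a censored time.
raw <- paste(c(
  "0.658, 0.822, 1.414, 2.500, 3.322, 3.816, 4.737, 4.836+, 4.934, 5.033",
  "5.757, 5.855, 5.987, 6.151, 6.217, 6.447+, 8.651, 8.717, 9.441+, 10.329",
  "11.480, 12.007, 12.007+, 12.237+, 12.401+, 13.059+, 14.474+, 15.000+",
  "15.461, 15.757, 16.480, 16.711, 17.204+, 17.237, 17.303+, 17.664+",
  "18.092+, 18.092, 18.750+, 20.625+, 23.158, 27.730+, 31.184+, 32.434+",
  "35.921+, 42.237+, 44.638+, 46.480+, 47.467+, 48.322+, 56.086"
), collapse = ", ")

obs    <- trimws(strsplit(raw, ", ", fixed = TRUE)[[1]])
status <- !grepl("+", obs, fixed = TRUE)            # TRUE = death, FALSE = censored
time   <- as.numeric(sub("+", "", obs, fixed = TRUE))

# One row per distinct death time: deaths d, number at risk Y, and the K-M estimate S.
t_i <- sort(unique(time[status]))
d_i <- sapply(t_i, function(tt) sum(time == tt & status))
Y_i <- sapply(t_i, function(tt) sum(time >= tt))
km  <- data.frame(t = t_i, d = d_i, Y = Y_i, S = cumprod(1 - d_i / Y_i))

print(head(round(km, 3)))
```

In practice one would likely use `survfit()` from the survival package; the base-R version here just makes each column of the table explicit.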


K-M estimator is defined

The K-M estimator is defined for all values less than the largest observed death time, $t_D$. For times beyond this, the K-M estimator of $S(t)$ is considered undefined if the largest study time is censored (i.e., if there are any censored observations beyond $t_D$). If the largest observation is not censored, then $\hat{S}(t) = 0$ for $t > t_D$.

In the case of the largest time being censored, different authors have suggested using, for $t > t_D$, either $\hat{S}(t) = 0$ or $\hat{S}(t) = \hat{S}(t_D)$ (which suggests that you can live indefinitely long). Both estimates are biased (clearly an underestimate and an overestimate, respectively).

Cumulative Hazard

The cumulative hazard can be estimated from the estimated survival function by

$$\hat{H}(t) = -\ln \hat{S}(t)$$

An alternative estimator for the cumulative hazard is called the Nelson-Aalen estimator and is

$$\hat{H}(t) = \begin{cases} 0, & \text{if } t < t_1 \\ \sum_{t_i \le t} \dfrac{d_i}{Y_i}, & \text{if } t \ge t_1 \end{cases}$$

We can go from the cumulative hazard to the survival function to get

$$\hat{S}(t) = \exp[-\hat{H}(t)]$$

which gives an alternative estimate of the survival function compared to the Kaplan-Meier. The Nelson-Aalen estimator is particularly useful for estimating the hazard and cumulative hazard nonparametrically from the data, to get an idea of an appropriate parametric model.

Nelson-Aalen estimator

The variance for the Nelson-Aalen estimator for the cumulative hazard is

$$\sigma_H^2(t) = \sum_{t_i \le t} \frac{d_i}{Y_i^2}$$
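As a quick illustration, the Nelson-Aalen quantities can be computed for the six-observation example used earlier. This is a sketch in base R with illustrative variable names; the at-risk counts are those of that example.

```r
# Death times, deaths, and numbers at risk from the six-observation example.
t_i <- c(1.5, 2.4, 10.5, 15.1)
d_i <- rep(1, 4)
Y_i <- c(6, 5, 4, 2)

H_na   <- cumsum(d_i / Y_i)        # Nelson-Aalen cumulative hazard at each t_i
var_na <- cumsum(d_i / Y_i^2)      # its estimated variance
S_na   <- exp(-H_na)               # survival estimate implied by Nelson-Aalen
S_km   <- cumprod(1 - d_i / Y_i)   # Kaplan-Meier, for comparison

print(round(data.frame(t = t_i, H = H_na, se_H = sqrt(var_na),
                       S_na = S_na, S_km = S_km), 3))
```

Because $\exp(-x) \ge 1 - x$, the survival estimate derived from Nelson-Aalen is never below the Kaplan-Meier estimate, and the gap is most visible when few subjects remain at risk.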

Nelson-Aalen and K-M estimators

The estimators are asymptotically equivalent (so they differ more for smaller samples), and both are consistent estimators (both converge to the true function being estimated as the sample size increases).


Confidence intervals for the survival function

A confidence interval for the survival function can be constructed at each time $t$ of interest. This is called a pointwise confidence interval. The most straightforward approach takes the estimated survival function plus or minus a critical z-score times the standard error:

$$\hat{S}(t) \pm Z_{1-\alpha/2} \sqrt{\hat{V}[\hat{S}(t)]}$$

where, for the Kaplan-Meier estimator,

$$\hat{V}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_i \le t} \frac{d_i}{Y_i (Y_i - d_i)}$$
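A short base-R sketch of this interval for the six-observation example is given here. The names `sigma2`, `se`, `lower`, and `upper` are illustrative, and `alpha = 0.05` is an assumed level.

```r
# K-M quantities from the six-observation example.
t_i <- c(1.5, 2.4, 10.5, 15.1)
d_i <- rep(1, 4)
Y_i <- c(6, 5, 4, 2)

S_hat  <- cumprod(1 - d_i / Y_i)
sigma2 <- cumsum(d_i / (Y_i * (Y_i - d_i)))   # the sum in the variance formula
se     <- S_hat * sqrt(sigma2)                # standard error of S_hat

alpha <- 0.05                                 # assumed level, for illustration
z     <- qnorm(1 - alpha / 2)

lower <- S_hat - z * se
upper <- S_hat + z * se
print(round(data.frame(t = t_i, S = S_hat, lower = lower, upper = upper), 3))
```

With only six observations the untransformed limits can fall outside [0, 1]; here the upper limit at t = 1.5 exceeds 1.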


The variance and confidence interval are based on assuming that the number of deaths at a particular time is a binomial random variable (with probability of death estimated from the data), and using a normal approximation for the proportion of deaths. Clearly the number (or proportion) of deaths is not exactly normal, although the binomial tends to be well-approximated by the normal for large samples. This is also the confidence interval usually implemented in statistical packages.


It is possible to get better confidence intervals by first transforming the estimated survival probability. Proportions are often transformed using an arcsine-square root transformation, because this transformation often makes proportion data more closely resemble normally distributed data. This leads to the interval with left endpoint

$$\sin^2\left\{\max\left[0,\ \arcsin\left(\hat{S}(t)^{1/2}\right) - \frac{Z_{1-\alpha/2}}{2}\left(\sum_{t_i \le t}\frac{d_i}{Y_i (Y_i - d_i)}\right)^{1/2}\left(\frac{\hat{S}(t)}{1 - \hat{S}(t)}\right)^{1/2}\right]\right\}$$

and right endpoint

$$\sin^2\left\{\min\left[\frac{\pi}{2},\ \arcsin\left(\hat{S}(t)^{1/2}\right) + \frac{Z_{1-\alpha/2}}{2}\left(\sum_{t_i \le t}\frac{d_i}{Y_i (Y_i - d_i)}\right)^{1/2}\left(\frac{\hat{S}(t)}{1 - \hat{S}(t)}\right)^{1/2}\right]\right\}$$
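The arcsine-square root interval is easier to read as code than as a formula. The sketch below, in base R with illustrative names and an assumed `alpha = 0.05`, applies it to the six-observation example. The truncation at 0 and at pi/2 keeps the argument of the sine in the range where sin^2 is increasing.

```r
# K-M quantities from the six-observation example.
d_i <- rep(1, 4)
Y_i <- c(6, 5, 4, 2)
S_hat  <- cumprod(1 - d_i / Y_i)
sigma2 <- cumsum(d_i / (Y_i * (Y_i - d_i)))

alpha <- 0.05                                  # assumed level, for illustration
z     <- qnorm(1 - alpha / 2)

# Half-width on the arcsine-square root scale.
half   <- 0.5 * z * sqrt(sigma2) * sqrt(S_hat / (1 - S_hat))
center <- asin(sqrt(S_hat))

lower <- sin(pmax(0,      center - half))^2
upper <- sin(pmin(pi / 2, center + half))^2
print(round(data.frame(S = S_hat, lower = lower, upper = upper), 3))
```

Unlike the untransformed interval, these limits always stay within [0, 1].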


Another approach is to log-transform the cumulative hazard. This leads to the interval $[\hat{S}(t)^{1/\theta}, \hat{S}(t)^{\theta}]$, where

$$\theta = \exp\left\{\frac{Z_{1-\alpha/2}\sqrt{\sum_{t_i \le t}\frac{d_i}{Y_i (Y_i - d_i)}}}{\ln[\hat{S}(t)]}\right\}$$
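A minimal base-R sketch of the log-transformed interval for the six-observation example follows; the variable names and `alpha = 0.05` are assumptions made for illustration. Because $\ln \hat{S}(t)$ is negative, $\theta$ is less than 1, so $\hat{S}(t)^{1/\theta}$ is the lower limit and $\hat{S}(t)^{\theta}$ the upper.

```r
# K-M quantities from the six-observation example.
d_i <- rep(1, 4)
Y_i <- c(6, 5, 4, 2)
S_hat  <- cumprod(1 - d_i / Y_i)
sigma2 <- cumsum(d_i / (Y_i * (Y_i - d_i)))

alpha <- 0.05                                  # assumed level, for illustration
z     <- qnorm(1 - alpha / 2)

# theta < 1 because log(S_hat) < 0 whenever 0 < S_hat < 1.
theta <- exp(z * sqrt(sigma2) / log(S_hat))

lower <- S_hat^(1 / theta)
upper <- S_hat^theta
print(round(data.frame(S = S_hat, theta = theta, lower = lower, upper = upper), 3))
```

The formula needs 0 < S_hat < 1, so it cannot be evaluated at a time where the K-M estimate has dropped to 0.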

Confidence bands

Pointwise confidence intervals give an estimated interval for the survival function at a particular time t. However, the pointwise 95% confidence intervals should not be interpreted as 95% confidence that the entire survival function falls within the bounds.

To get a $100(1 - \alpha)\%$ confidence band valid over an interval of time $[t_L, t_U]$, we want two functions, $L(t)$ and $U(t)$, for which

$$\Pr\{L(t) \le S(t) \le U(t) \text{ for all } t \in [t_L, t_U]\} = 1 - \alpha$$


Let

$$\sigma_S^2(t) = \sum_{t_i \le t} \frac{d_i}{Y_i (Y_i - d_i)}$$

then let

$$a_L = \frac{n\sigma_S^2(t_L)}{1 + n\sigma_S^2(t_L)}, \qquad a_U = \frac{n\sigma_S^2(t_U)}{1 + n\sigma_S^2(t_U)}$$

The confidence interval works as before with $Z_{1-\alpha/2}$ replaced with $c_\alpha(a_L, a_U)$ as the critical value, and this value is given in Table C.3 in Appendix C of the book.


The values $a_L$ and $a_U$ transform the lower and upper times to be between 0 and 1. Since you are replacing $Z_{1-\alpha/2}$ with $c_\alpha(a_L, a_U)$, which is a constant, you are scaling the half-width of the pointwise confidence interval by the same factor at each $t$.
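To make the recipe concrete, here is a base-R sketch that computes $a_L$ and $a_U$ for the six-observation example. Several things are assumptions for illustration only: the band limits `t_L` and `t_U` are taken to be the first and last death times, the band uses the untransformed (linear) form of the interval, and the critical value is left as the argument `c_crit` of a hypothetical helper `band()` because it has to be looked up in the table.

```r
# K-M quantities from the six-observation example (n = 6 subjects).
n   <- 6
t_i <- c(1.5, 2.4, 10.5, 15.1)
d_i <- rep(1, 4)
Y_i <- c(6, 5, 4, 2)
S_hat  <- cumprod(1 - d_i / Y_i)
sigma2 <- cumsum(d_i / (Y_i * (Y_i - d_i)))

# Illustrative band limits (an assumption): the first and last death times.
t_L <- 1.5
t_U <- 15.1
s2_L <- sigma2[max(which(t_i <= t_L))]
s2_U <- sigma2[max(which(t_i <= t_U))]

a_L <- n * s2_L / (1 + n * s2_L)
a_U <- n * s2_U / (1 + n * s2_U)
print(c(a_L = a_L, a_U = a_U))

# Band in the linear form, with the tabulated critical value supplied by the user.
band <- function(c_crit) {
  half <- c_crit * sqrt(sigma2) * S_hat
  keep <- t_i >= t_L & t_i <= t_U
  data.frame(t = t_i[keep],
             lower = (S_hat - half)[keep],
             upper = (S_hat + half)[keep])
}
```

Calling `band()` with the tabulated value for the computed `a_L` and `a_U` gives the band at each death time in `[t_L, t_U]`; every half-width is the pointwise one rescaled by the same constant.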
