Kernel density estimation in R

Kernel density estimation can be done in R using the density() function. The default is a Gaussian kernel, but other kernels are available as well. The function uses its own algorithm to determine the bandwidth, but you can override this and choose your own. If you rely on the density() function, you are limited to the built-in kernels; if you want to try a different one, you have to write the code yourself.
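As a quick illustration (my own, not from the slides), the kernel and bandwidth arguments of density() can be set directly; the simulated data here is just a placeholder.

set.seed(1)
x <- rnorm(200)                                     # placeholder data

d1 <- density(x)                                    # Gaussian kernel, automatic bandwidth
d2 <- density(x, kernel = "rectangular", bw = 0.5)  # different kernel, chosen bandwidth

plot(d1, main = "Kernel density estimates")
lines(d2, col = "red")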

Kernel density estimation in R: effect of bandwidth for rectangular kernel

Kernel density estimation in R

Note that exponential densities are a bit tricky to estimate using kernel methods. Here is the default behavior when estimating the density for exponential data.

> x <- rexp(100)
> (density(x))

Kernel density estimation in R: exponential data with Gaussian kernel

There are better approaches for doing nonparametric estimation of an exponential density, but they are a bit beyond the scope of the course.

Violin plots: a nice application of kernel density estimation

Violin plots are an alternative to boxplots that show nonparametric density estimates of the distribution in addition to the median and interquartile range. The densities are rotated sideways to have a similar orientation as a boxplot.

> x <- rexp(100)
> install.packages("vioplot")
> library(vioplot)
> x <- vioplot(x)


Kernel density estimation in R: violin plot

The violin plot uses the function sm.density() rather than density() for the nonparametric density estimate, and this leads to smoother density estimates. If you want to modify the behavior of the violin plot, you can copy the original code to your own function and change how the nonparametric density estimate is done (e.g., replacing sm.density with density, or changing the kernel used).



Kernel density estimation in R: violin plot

> vioplot
function (x, ..., range = 1.5, h = NULL, ylim = NULL, names = NULL,
    horizontal = FALSE, col = "magenta", border = "black", lty = 1,
    lwd = 1, rectCol = "black", colMed = "white", pchMed = 19,
    at, add = FALSE, wex = 1, drawRect = TRUE)
{
    datas <- list(x, ...)
    n <- length(datas)
    if (missing(at))
        at <- 1:n
    upper <- vector(mode = "numeric", length = n)
    lower <- vector(mode = "numeric", length = n)
    q1 <- vector(mode = "numeric", length = n)
    q3 <- vector(mode = "numeric", length = n)
    ...
    args <- list(display = "none")
    if (!(is.null(h)))
        args <- c(args, h = h)
    for (i in 1:n) {
        ...
        smout <- do.call("sm.density", c(list(data, xlim = est.xlim),
            args))

Kernel density estimation

There are lots of popular kernel density estimates, and researchers have put a lot of work into establishing their properties, showing when some kernels work better than others (for example, using integrated squared error as a criterion), determining how to choose bandwidths, and so on.

In addition to the Gaussian, common choices for the kernel include

- Uniform: $K(u) = \frac{1}{2}\, I(-1 \le u \le 1)$
- Epanechnikov: $K(u) = 0.75\,(1 - u^2)\, I(-1 \le u \le 1)$
- Biweight: $K(u) = \frac{15}{16}\,(1 - u^2)^2\, I(-1 \le u \le 1)$

Kernel-smoothed hazard estimation

To estimate a smoothed version of the hazard function using a kernel method, first pick a kernel, then use

\[
\hat{h}(t) = \frac{1}{b} \sum_{i=1}^{D} K\!\left(\frac{t - t_i}{b}\right) \Delta \tilde{H}(t_i)
\]
where D is the number of death times and b is the bandwidth. A common notation for the bandwidth is h, but we use b because h is used for the hazard function. Also, $\tilde{H}(t)$ is the Nelson-Aalen estimator of the cumulative hazard function:
\[
\tilde{H}(t) =
\begin{cases}
0, & \text{if } t \le t_1 \\
\sum_{t_i \le t} \frac{d_i}{Y_i}, & \text{if } t > t_1
\end{cases}
\]
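A minimal sketch of this estimator in R (my own code, not from the slides), assuming the data are in vectors time and status, using the Epanechnikov kernel, and getting the Nelson-Aalen increments d_i/Y_i from survfit() in the survival package:

library(survival)

# Kernel-smoothed hazard: hhat(t) = (1/b) * sum_i K((t - t_i)/b) * dH(t_i),
# where dH(t_i) = d_i/Y_i are the Nelson-Aalen increments at the death times.
smooth_hazard <- function(time, status, b, tgrid) {
  fit  <- survfit(Surv(time, status) ~ 1)
  keep <- fit$n.event > 0
  ti   <- fit$time[keep]
  dH   <- (fit$n.event / fit$n.risk)[keep]
  K    <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)   # Epanechnikov kernel
  sapply(tgrid, function(t) sum(K((t - ti) / b) * dH) / b)
}

# Example with simulated (hypothetical) data
set.seed(1)
hhat <- smooth_hazard(rexp(100), rbinom(100, 1, 0.8), b = 0.5,
                      tgrid = seq(0.1, 2, by = 0.1))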

Kernel-smoothed hazard estimation

The variance of the smoothed hazard is

\[
\sigma^2[\hat{h}(t)] = b^{-2} \sum_{i=1}^{D} K\!\left(\frac{t - t_i}{b}\right)^2 \Delta \hat{V}[\tilde{H}(t_i)]
\]

Asymmetric kernels

A difficulty that we saw with the exponential density can also occur here: the estimated hazard can give negative values. Consequently, you can use an asymmetric kernel instead for small t.

For t < b, let q = t/b. A similar approach can be used for large t, when $t_D - b < t < t_D$. In this case, you can use $q = (t_D - t)/b$ and replace x with $-x$ in the kernel density estimate for these larger times.



Confidence intervals

A pointwise confidence interval can be obtained with lower and upper limits
\[
\left( \hat{h}(t) \exp\left[ -\frac{Z_{1-\alpha/2}\, \sigma(\hat{h}(t))}{\hat{h}(t)} \right], \;
\hat{h}(t) \exp\left[ \frac{Z_{1-\alpha/2}\, \sigma(\hat{h}(t))}{\hat{h}(t)} \right] \right)
\]

Note that the confidence interval is really a confidence interval for the smoothed hazard function and not for the actual hazard function, making it difficult to interpret. In particular, the confidence interval will depend on both the kernel and the bandwidth. Coverage probabilities for smoothed hazard estimates (the proportion of times the confidence interval includes the true hazard rate) appear to be an area of ongoing research.
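As a small worked example (mine, with made-up numbers), the interval is simple to compute once the point estimates and their standard errors are available on a grid of times:

# hhat and sigma would come from the kernel-smoothed estimate and its
# estimated standard error; the values below are placeholders.
hhat  <- c(0.20, 0.25, 0.30)
sigma <- c(0.05, 0.06, 0.08)
z     <- qnorm(0.975)                     # Z_{1 - alpha/2} for alpha = 0.05

lower <- hhat * exp(-z * sigma / hhat)
upper <- hhat * exp( z * sigma / hhat)
cbind(lower, upper)                       # pointwise 95% limits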



Effect of bandwidth

Because the bandwidth has a big impact on the estimate, we want to pick it well. One idea is to minimize the squared area between the true hazard function and the estimated hazard function. This (expected) squared area between the two functions is called the Mean Integrated Squared Error (MISE):

\[
\begin{aligned}
\mathrm{MISE}(b) &= E \int_{\tau_L}^{\tau_U} [\hat{h}(u) - h(u)]^2 \, du \\
&= E \int_{\tau_L}^{\tau_U} \hat{h}^2(u) \, du - 2\, E \int_{\tau_L}^{\tau_U} \hat{h}(u)\, h(u) \, du + f(h(u))
\end{aligned}
\]
The last term doesn't depend on b, so it is sufficient to minimize the function ignoring the last term. The first term can be estimated by $\int_{\tau_L}^{\tau_U} \hat{h}^2(u) \, du$, which can be evaluated using the trapezoid rule from calculus.

Effect of bandwidth

The second term can be approximated by
\[
\frac{1}{b} \sum_{i \ne j} K\!\left(\frac{t_i - t_j}{b}\right) \Delta \tilde{H}(t_i)\, \Delta \tilde{H}(t_j)
\]

summing over event times between $\tau_L$ and $\tau_U$. Minimizing the MISE can be done approximately by minimizing
\[
g(b) = \sum_{i} \left(\frac{u_{i+1} - u_i}{2}\right) \left[ \hat{h}^2(u_i) + \hat{h}^2(u_{i+1}) \right] - \frac{2}{b} \sum_{i \ne j} K\!\left(\frac{t_i - t_j}{b}\right) \Delta \tilde{H}(t_i)\, \Delta \tilde{H}(t_j)
\]
The minimization can be done numerically by plugging in different values of b and evaluating g(b).
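A rough sketch of this numerical search (my own code, using simulated placeholder data, the Epanechnikov kernel, and arbitrary grids for u and b):

library(survival)

set.seed(1)
time   <- rexp(100)
status <- rbinom(100, 1, 0.8)

fit  <- survfit(Surv(time, status) ~ 1)
keep <- fit$n.event > 0
ti   <- fit$time[keep]
dH   <- (fit$n.event / fit$n.risk)[keep]          # Nelson-Aalen increments
K    <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

ugrid <- seq(0.1, 2, by = 0.05)                   # grid between tau_L and tau_U
gfun <- function(b) {
  hh    <- sapply(ugrid, function(t) sum(K((t - ti) / b) * dH) / b)
  term1 <- sum((diff(ugrid) / 2) * (hh[-length(hh)]^2 + hh[-1]^2))  # trapezoid rule
  Kmat  <- K(outer(ti, ti, "-") / b)
  diag(Kmat) <- 0                                 # keep only i != j terms
  term2 <- (2 / b) * sum(Kmat * outer(dH, dH))
  term1 - term2
}

bgrid <- seq(0.1, 1, by = 0.05)
gvals <- sapply(bgrid, gfun)
bgrid[which.min(gvals)]                           # approximate minimizer of g(b)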


Effect of bandwidth

For this example, the minimum occurs around b = 0.17 to b = 0.23 depending on the kernel.

Generally, there is a trade-off: smaller bandwidths have smaller bias but higher variance, and larger bandwidths (more smoothing) have less variance but greater bias. Measuring the quality of bandwidths and kernels using MISE is standard in kernel density estimation (not just in survival analysis). Bias here means that $E[\hat{h}(t)] \ne h(t)$.

Section 6.3: Estimation of Excess Mortality

The idea for this topic is to compare the survival curve or hazard rate for one group against a reference group, particularly if the non-reference group is thought to have higher risk. The reference group might come from a much larger sample, so that its survival curve can be considered known.

An example is to compare the mortality of psychiatric patients against the general population. You could use census data to get the lifetable for the general population, and determine the excess mortality for the psychiatric patients. Two approaches are a multiplicative model and an additive model. In the multiplicative model, belonging to a particular group multiplies the hazard rate by a factor. In the additive model, belonging to a particular group adds a factor to the hazard rate.

Excess mortality

For the multiplicative model, if there is a reference hazard rate $\theta_j(t)$ for the jth individual in a study (based on sex, age, ethnicity, etc.), then due to other risk factors, the hazard rate for the jth individual is

\[
h_j(t) = \beta(t)\, \theta_j(t)
\]

where $\beta(t) \ge 1$ implies that the hazard rate is higher than the reference hazard. We define
\[
B(t) = \int_0^t \beta(u) \, du
\]
as the cumulative relative excess mortality.

Excess mortality

Note that $\frac{d}{dt} B(t) = \beta(t)$. To estimate B(t), let $Y_j(t) = 1$ if the jth individual is at risk at time t; otherwise, let $Y_j(t) = 0$. Here $Y_j(t)$ is defined for left-truncated and right-censored data. Let

\[
Q(t) = \sum_{j=1}^{n} \theta_j(t)\, Y_j(t)
\]

where n is the sample size. Then we estimate B(t) by

\[
\hat{B}(t) = \sum_{t_i \le t} \frac{d_i}{Q(t_i)}
\]
This value compares the actual number of deaths that have occurred by time $t_i$ with the expected number of deaths based on the reference hazard rate and the number of patients available to have died.

Excess mortality

The variance is estimated by

\[
\hat{V}[\hat{B}(t)] = \sum_{t_i \le t} \frac{d_i}{Q(t_i)^2}
\]

$\beta(t)$ can be estimated by the slope of $\hat{B}(t)$, and the estimate can be improved by using kernel-smoothing methods on $\hat{B}(t)$.
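A sketch of these estimates (my own code, not the book's), assuming vectors of follow-up times and death indicators, each subject's age at study entry, and a hypothetical reference hazard function ref_hazard() that plays the role of θ_j(t) as a function of current age:

# Bhat(t) = sum_{t_i <= t} d_i / Q(t_i), with Q(t) = sum_j theta_j(t) Y_j(t),
# and estimated variance sum_{t_i <= t} d_i / Q(t_i)^2.
excess_B <- function(time, status, age0, ref_hazard) {
  ts <- sort(unique(time[status == 1]))          # distinct death times
  B <- V <- numeric(length(ts))
  for (i in seq_along(ts)) {
    tcur   <- ts[i]
    atrisk <- time >= tcur                       # Y_j(t) = 1 for subjects still at risk
    d      <- sum(time == tcur & status == 1)    # deaths at this time
    Q      <- sum(ref_hazard(age0[atrisk] + tcur))
    B[i]   <- d / Q
    V[i]   <- d / Q^2
  }
  data.frame(time = ts, Bhat = cumsum(B), var = cumsum(V))
}

# Hypothetical reference hazard, increasing with age
ref_hazard <- function(age) 0.001 * exp(0.08 * age)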

Excess mortality

For the additive model, the hazard is

\[
h_j(t) = \alpha(t) + \theta_j(t)
\]
Similarly to the multiplicative model, we estimate the cumulative excess mortality
\[
A(t) = \int_0^t \alpha(u) \, du
\]
In this case the expected cumulative hazard rate is
\[
\Theta(t) = \sum_{j=1}^{n} \int_0^t \theta_j(u)\, \frac{Y_j(u)}{Y(u)} \, du
\]
where
\[
Y(u) = \sum_{j=1}^{n} Y_j(u)
\]
is the number at risk at time u.

Excess mortality

The estimated excess mortality is

\[
\hat{A}(t) = \sum_{t_i \le t} \frac{d_i}{Y_i} - \Theta(t)
\]
where the first term is the Nelson-Aalen estimator of the cumulative hazard. The variance is

\[
\hat{V}[\hat{A}(t)] = \sum_{t_i \le t} \frac{d_i}{Y_i^2}
\]

Excess mortality

For a lifetable where times are every year, you can also compute

\[
\Theta(t) = \Theta(t-1) + \sum_{j} \frac{\lambda(a_j + t - 1)}{Y(t)}
\]

where the sum is over individuals at risk at time t, $a_j$ is the age at the beginning of the study for patient j, and λ is the reference hazard. Note that Θ(t) is a smooth function of t while $\hat{A}(t)$ has jumps.

Excess mortality

A more general model combines multiplicative and additive components, using
\[
h_j(t) = \beta(t)\, \theta_j(t) + \alpha(t)
\]
which is covered in Chapter 10.

Example: Iowa psychiatric patients

As an example, starting with the multiplicative model, consider 26 psychiatric patients from Iowa, where we compare to census data.

Iowa psychiatric patients

Census data for Iowa

Excess mortality for Iowa psychiatric patients


Excess mortality

The cumulative excess mortality is difficult to interpret; the slope of the curve is more meaningful. The curve is relatively linear. If we consider age 10 to age 30, the curve goes from roughly 50 to 100, suggesting a slope of (100 − 50)/(30 − 10) = 2.5, so that patients aged 10 to 30 died at roughly 2.5 times the rate of the reference population.

This is a fairly low-risk age group, for which suicide is a high risk factor. Note that the census data might include psychiatric patients who have committed suicide, so we might be comparing psychiatric patients to a general population that includes psychiatric patients, rather than to people who have never been psychiatric patients; this might bias the results.

Survival curves

You can use the reference distribution to inform the survival curve instead of just relying on the data. This results in an adjusted or corrected survival curve. Let $S^*(t) = \exp[-\Theta(t)]$ (or use the cumulative hazard based on multiplying the reference hazard by the excess hazard) and let $\hat{S}(t)$ be the standard Kaplan-Meier survival curve (using only the data, not the reference survival data). Then $S^c(t) = \hat{S}(t)/S^*(t)$ is the corrected survival curve. The estimate can be greater than 1, in which case it is set to 1. Typically, $S^*(t)$ is less than 1, so dividing by this quantity increases the estimated survival probabilities. This is somewhat similar to the use of a prior in Bayesian statistics, with the reference survival times acting as a prior for what the psychiatric patients are likely to experience. Consequently, the adjusted survival curve is in between the Kaplan-Meier (data only) estimate and the reference survival times.



Bayesian nonparametric survival analysis

The previous example leads naturally to Bayesian nonparametric survival analysis. Here we have prior information (or prior beliefs) about the shape of the survival curve (such as based on a reference survival function). The survival curve based on this previous information is combined with the likelihood of the survival data to produce a posterior estimate of the survival function.

Reasons for using a prior are: (1) to take advantage of prior information or expertise of someone familiar with the type of data, (2) to get a reasonable estimate when the sample size is small.

Bayesian survival analysis

In frequentist statistical methods, parameters are treated as fixed, but unknown, and an estimator is chosen to estimate the parameters based on the data and a model (including model assumptions). Parameters are unknown, but are treated as not being random.

Philosophically, the Bayesian approach is to try to model all uncertainty using random variables. Uncertainty exists both in the data that would arise from a probability model and in the parameters of the model itself, so both observations and parameters are treated as random. Typically, the observations have a distribution that depends on the parameters, and the parameters themselves come from some other distribution. Bayesian models are therefore often hierarchical, with multiple levels in the hierarchy.

Bayesian survival analysis

For survival analysis, we think of the (unknown) survival curve as the parameter. From a frequentist point of view, survival probabilities determine the probabilities of observing different death times, but there are no probabilities of the survival function itself.

From a Bayesian point of view, you can imagine that there was some stochastic process generating survival curves according to some distribution on the space of survival curves. One of the survival curves happened to occur for the population we are studying. Once that survival function was chosen, event times could occur according to that survival curve.


Bayesian survival analysis

We imagine that there is a true survival curve S(t) and an estimated survival curve $\hat{S}(t)$. We define a loss function as
\[
L(S, \hat{S}) = \int_0^{\infty} [\hat{S}(t) - S(t)]^2 \, dt
\]

The function $\hat{S}$ that minimizes the (posterior) expected value of the loss function is the posterior mean, which is used to estimate the survival function.

A prior for survival curves

A typical way to assign a prior on the survival function is to use a Dirichlet process prior. For a Dirichlet process, we partition the real line into intervals $A_1, \ldots, A_k$, so that $P(X \in A_i) = W_i$. The numbers $(W_1, \ldots, W_k)$ have a k-dimensional Dirichlet distribution with parameters $\alpha_1, \ldots, \alpha_k$. For this to be a Dirichlet distribution, we require that $Z_i$, $i = 1, \ldots, k$, are independent gamma random variables with shape parameter $\alpha_i$, and $W_i = Z_i / \sum_{j=1}^{k} Z_j$. By construction, the $W_i$'s are between 0 and 1 and sum to 1, so when interpreted as probabilities they form a discrete probability distribution.

Essentially, we can think of a Dirichlet distribution as a distribution on unfair dice with k sides. We want to make a die that has k sides, and we want the probabilities of each side to be randomly determined. How fair or unfair the die is partly depends on the α parameters and partly depends on chance itself.
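The gamma construction is easy to simulate directly; here is a small sketch (mine) of drawing one such "unfair die" in R, with arbitrary alpha values:

# Draw one Dirichlet(alpha) vector by normalizing independent gammas
rdirichlet1 <- function(alpha) {
  z <- rgamma(length(alpha), shape = alpha, rate = 1)
  z / sum(z)
}

w <- rdirichlet1(c(2, 1, 1, 0.5, 0.5))  # probabilities for a 5-sided "unfair die"
sum(w)                                  # always equals 1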

A prior for survival curves

We can also think of the Dirichlet distribution as generalizing the beta distribution. A beta random variable is a number between 0 and 1, which partitions the interval [0, 1] into two pieces, [0, x) and [x, 1]. A Dirichlet random variable partitions the interval into k regions, using k − 1 values between 0 and 1. The joint density for these k − 1 values is
\[
f(w_1, \ldots, w_{k-1}) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \left[ \prod_{i=1}^{k-1} w_i^{\alpha_i - 1} \right] \left[ 1 - \sum_{i=1}^{k-1} w_i \right]^{\alpha_k - 1}
\]
which reduces to a beta density with parameters $(\alpha_1, \alpha_2)$ when k = 2.

Assigning a prior

To assign a prior on the space of survival curves, first assume an average survival function, $S_0(t)$. The Dirichlet prior determines when the jumps occur, and the exponential curve gives the decay of the curve between jumps. Simulated survival curves when $S_0(t) = e^{-0.1t}$ and $\alpha = 5 S_0(t)$ are given below.


Bayesian survival analysis

Other approaches are to put a prior on the cumulative hazard function and to use Gibbs sampling or Markov chain Monte Carlo. These topics would be more appropriate to cover after a class in Bayesian methods.

Chapter 7: Hypothesis testing

Hypothesis testing is typically done based on the cumulative hazard function. Here we’ll use the Nelson-Aalen estimate of the cumulative hazard. The survival function is used to weight differences between the observed and expected cumulative hazard.

Recall that the Nelson-Aalen estimate of the cumulative hazard is

\[
\tilde{H}(t) = \sum_{t_i \le t} \frac{d_i}{Y_i}
\]

In a one-sample problem, you test whether the hazard rate h(t) is equal to some reference hazard, h0(t). The null hypothesis is H0 : h(t) = h0(t). Under the null hypothesis, the expected hazard rate at time ti is h0(ti ).

Hypothesis testing: one sample

The idea is then to compare observed minus expected cumulative hazard rates at the time τ, the largest time in the study (τ = $t_D$ if the largest time is a death time). The test statistic is

\[
Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} W(t_i)\, \frac{d_i}{Y_i} - \int_0^{\tau} W(s)\, h_0(s) \, ds
\]
where $W(\cdot)$ is a weight function.

The variance is
\[
V[Z(\tau)] = \int_0^{\tau} W^2(s)\, \frac{h_0(s)}{Y(s)} \, ds
\]

Hypothesis testing

The expected value of Z(τ) is 0, so if we take a z-score of Z(τ) (subtracting the mean and dividing by the standard deviation), we get

\[
Z(\tau) / \sqrt{V[Z(\tau)]}
\]

which has an approximate standard normal distribution. This can be used for either a two-sided or one-sided test. For example, a one-sided test would be $H_1: h(t) > h_0(t)$, and you would reject only for large values of $Z(\tau)/\sqrt{V[Z(\tau)]}$.

Hypothesis testing

The most popular choice for a weighting function is W(t) = Y(t), which leads to
\[
O(\tau) = \sum_{i=1}^{D} Y(t_i)\, \frac{d_i}{Y_i} = \sum_{i=1}^{D} d_i
\]
This is also called the log-rank test (not sure why).
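With this weight, O(τ) is the total number of observed deaths, E(τ) works out to the reference cumulative hazard $H_0$ summed over each subject's follow-up time, and V[Z(τ)] = E(τ). A sketch of the resulting one-sample log-rank test (my own code; the constant reference hazard is hypothetical):

# One-sample log-rank test: O = total deaths, E = sum over subjects of the
# reference cumulative hazard H0 at their follow-up times, Var = E.
one_sample_logrank <- function(time, status, H0) {
  O <- sum(status)
  E <- sum(H0(time))
  Z <- (O - E) / sqrt(E)
  c(O = O, E = E, Z = Z, p.two.sided = 2 * pnorm(-abs(Z)))
}

# Example with simulated data and a hypothetical reference hazard h0(t) = 0.1
set.seed(1)
time   <- rexp(50, rate = 0.15)
status <- rbinom(50, 1, 0.8)
H0     <- function(t) 0.1 * t
one_sample_logrank(time, status, H0)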

Other weight functions are possible, for example
\[
W(t) = Y(t)\, S_0(t)^p\, [1 - S_0(t)]^q
\]
with $0 \le p, q \le 1$ (you don't necessarily need q = 1 − p here). The choice of p affects whether you care more about departures from the hypothesized hazard for small t or for large t. For example, if p is large, then more emphasis is placed on the estimated hazard matching the null hazard for small values of t.

$S_0(t)$ can be obtained from $S_0(t) = \exp[-H_0(t)]$.

Hypothesis testing

An example where you would use the one-sided hypothesis test is in testing whether some population has a higher hazard than a reference population, such as the psychiatric patients from Iowa. Recall that for this example, we looked at excess mortality previously.

Hypothesis testing: two or more samples

If you have two or more samples (e.g., mortality for three different treatments or three different risk groups), then the null and alternative hypotheses are similar to those for ANOVA:

\[
H_0: h_1(t) = h_2(t) = \cdots = h_K(t), \quad \text{for all } t \le \tau
\]

\[
H_A: h_i(t) \ne h_j(t) \quad \text{for some } i \ne j \text{ and some } t \le \tau
\]
where τ is the largest time at which all of the groups have at least one subject at risk.

Hypothesis testing: two or more samples

We now define ti as the unique death times for the pooled data (i.e., ignoring the group that each observation comes from), and again tD is the largest death time.

We observe $d_{ij}$ deaths at time $t_i$ in sample j, and there are $Y_{ij}$ individuals at risk at time $t_i$ in sample j. We let $d_i = \sum_{j=1}^{K} d_{ij}$ be the total number of deaths at time $t_i$ and $Y_i = \sum_{j=1}^{K} Y_{ij}$ be the total number of individuals at risk (available for death?) at time $t_i$.

Hypothesis testing: two or more samples

The idea for testing the hypothesis is that under the null hypothesis, the estimate of the hazard (and cumulative hazard) should be the same (in expectation) using the pooled data (ignoring the group the samples are from) and for the individual samples. We can think of the pooled data as providing a more precise estimate of the hazard for the jth sample than the jth sample itself, so using the idea of observed minus expected, we can write
\[
Z_j(\tau) = \sum_{i=1}^{D} W_j(t_i) \left( \frac{d_{ij}}{Y_{ij}} - \frac{d_i}{Y_i} \right), \quad j = 1, \ldots, K
\]

If all of the Zj (τ) terms are close to 0, then all of the sample estimated cumulative hazards are close to the pooled cumulative hazard, so they all must be close to each other, and this supports the null hypothesis.

Hypothesis testing: two or more samples

The typical weight function used is $W_j(t_i) = Y_{ij}\, W(t_i)$, where $W(t_i)$ is a common weight shared by each group. For this weighting scheme,

\[
Z_j(\tau) = \sum_{i=1}^{D} W(t_i) \left( d_{ij} - Y_{ij}\, \frac{d_i}{Y_i} \right)
\]

\[
\hat{V}[Z_j(\tau)] = \hat{\sigma}_{jj} = \sum_{i=1}^{D} W(t_i)^2\, \frac{Y_{ij}}{Y_i} \left(1 - \frac{Y_{ij}}{Y_i}\right) \left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j = 1, \ldots, K
\]
\[
\widehat{\operatorname{cov}}(Z_j(\tau), Z_k(\tau)) = \hat{\sigma}_{jk} = -\sum_{i=1}^{D} W(t_i)^2\, \frac{Y_{ij}}{Y_i}\, \frac{Y_{ik}}{Y_i} \left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j \ne k
\]

Hypothesis testing: two or more samples

Based on the second formula for $Z_j(\tau)$, the sum $\sum_{j=1}^{K} Z_j(\tau)$ is equal to 0, meaning that the $Z_j(\tau)$ are not independent of one another. In particular, $Z_K(\tau)$ is a linear combination of $Z_1(\tau), \ldots, Z_{K-1}(\tau)$. Consequently, we construct a test statistic based only on the first K − 1 $Z_j(\tau)$ terms:

\[
\chi^2 = (Z_1(\tau), \ldots, Z_{K-1}(\tau))\, \hat{\Sigma}^{-1}\, (Z_1(\tau), \ldots, Z_{K-1}(\tau))'
\]

where $(Z_1(\tau), \ldots, Z_{K-1}(\tau))$ is interpreted as a row vector of length K − 1 and $\hat{\Sigma}$ is the $(K-1) \times (K-1)$ covariance matrix (if you had used all K components, the $K \times K$ matrix would not be full rank, and therefore not invertible). The $\chi^2$ statistic has K − 1 degrees of freedom, and you can base the test on this distribution.

Hypothesis testing: two samples

Several weight functions are possible. W(t) = 1 for all t leads to the two-sample log-rank test. $W(t_i) = Y_i$ and $W(t_i) = \sqrt{Y_i}$ have also been used.

In the case of K = 2 samples, the test statistic can be written as
\[
Z = \frac{\sum_{i=1}^{D} W(t_i) \left[ d_{i1} - Y_{i1}\, \frac{d_i}{Y_i} \right]}
         {\sqrt{\sum_{i=1}^{D} W(t_i)^2\, \frac{Y_{i1}}{Y_i} \left(1 - \frac{Y_{i1}}{Y_i}\right) \left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i}}
\]

Since we don't have to square the statistic in this case, we can do one-sided as well as two-sided hypothesis tests based on a standard normal distribution instead of a $\chi^2$, or you can square the statistic and use a $\chi^2_1$ distribution.


Hypothesis testing: two samples

This example involves kidney dialysis patients with surgically implanted catheters versus percutaneous (needle-puncture) placement of catheters. Even though the survival curves look fairly different after a year or so, the differences are not statistically significant. Note that there are also very few observations for the percutaneous sample.

Actually the number of observations is fairly small for both samples, so the confidence intervals would be fairly wide.



Hypothesis testing: two samples

Different choices for the weight function affect the p-value. It is reassuring if a lot of weighting schemes give the same conclusion. The cases where the p-value was low were those where the weighting scheme gave a lot of weight to differences in the hazard at large values of $t_i$, which of course is where the curves appear different. This can also be sensitive to differences in patterns in the two samples, so it should be used cautiously.

A problem with using lots of weighting schemes arises if different weights give conflicting results and you only report the ones that give the result you want. This would be dishonest, so you should either pick a weighting scheme and stick to it, or report the results of all the weighting schemes you used.


Hypothesis testing: weight functions

The most common weight functions are either flat, $W(t_i) = 1$, or decreasing, with $W(t_i) = Y_i$. An increasing weight function might be used to compare longer-term survival when early survival might be driven by complications rather than the long-term effectiveness of a treatment.

An example is comparing autologous transplants versus allogeneic bone marrow transplants for leukemia. Allogeneic transplant patients (receiving bone marrow from a sibling) tend to have more complications early on, reducing early survival rates (and increasing early hazard rates), but if interest is in long-term survival, then a weight function could be used that emphasizes later times.

Hypothesis testing in R

To test the difference in survival curves in R, you can use survdiff() from the survival library. An example is with the allo- versus auto- patients in the leukemia data.

> x <- read.table("leukemia2.txt")
> a <- survdiff(Surv(x$V1,x$V2)~factor(x$V3))
Call:
survdiff(formula = Surv(x$V1, x$V2) ~ factor(x$V3))

                 N Observed Expected (O-E)^2/E (O-E)^2/V
factor(x$V3)=1  51       28     25.8     0.182     0.382
factor(x$V3)=2  50       22     24.2     0.195     0.382

 Chisq= 0.4  on 1 degrees of freedom, p= 0.537

The results suggest that the two groups had survival experiences that were not statistically significantly different from each other.

Hypothesis testing in R

To plot the two survival curves together you can use

> x <- read.table("leukemia2.txt")
> a <- survfit(Surv(x$V1[x$V3==1],x$V2[x$V3==1])~1)
> b <- survfit(Surv(x$V1[x$V3==2],x$V2[x$V3==2])~1)
> plot(a,conf=F)
> points(b$time,b$surv,type="s",col="red",lwd=3)
> legend(20,1,legend=c("auto","allo"),col=c("black","red"),
    lty=c(1,1),lwd=c(1,3),cex=1.3)


Hypothesis testing in R

The survdiff() function in R has an optional parameter rho whose default is 0, which gives the log-rank test. Nonzero values of rho change how the event times are weighted (rho = 1 gives the Peto & Peto modification of the Gehan-Wilcoxon test, which puts more weight on earlier times) and can have a big impact on the p-value.
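For instance, reusing the leukemia data from the earlier example (with the survival library loaded), a brief illustration of changing rho:

> a0 <- survdiff(Surv(x$V1, x$V2) ~ factor(x$V3))           # rho = 0: log-rank test
> a1 <- survdiff(Surv(x$V1, x$V2) ~ factor(x$V3), rho = 1)  # Peto-Peto type weights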

Tests of trend

For multiple samples (K > 2), a different alternative hypothesis is the following:
\[
H_A: h_1(t) \le h_2(t) \le \cdots \le h_K(t), \quad \text{for } t \le \tau,
\]
where at least one inequality is strict. This is equivalent to

\[
H_A: S_1(t) \ge \cdots \ge S_K(t)
\]

Tests of trend

We construct the $Z_j(\tau)$'s as before and use any weight functions $W_j(t_i)$. We also pick a new set of weights $a_j$, $j = 1, \ldots, K$, where $a_j = j$ is often used.

The test statistic is now

\[
Z = \frac{\sum_{j=1}^{K} a_j Z_j(\tau)}{\sqrt{\sum_{j=1}^{K} \sum_{k=1}^{K} a_j a_k\, \hat{\sigma}_{jk}}}
\]

where $\hat{\Sigma} = (\hat{\sigma}_{jk})$ is the $K \times K$ covariance matrix. (It isn't full rank, but we don't need the inverse.) The test statistic can be compared to a standard normal distribution.


Stratified tests

If different populations have different covariates (age, sex, etc.), then ideally, you could use a regression approach to survival analysis to adjust for covariates before comparing survival curves or hazard rates. This is done in Chapter 8.

If there are a small number of levels for a predictor, then you can use a stratified test instead.

Let

\[
H_0: h_{1s}(t) = h_{2s}(t) = \cdots = h_{Ks}(t), \quad s = 1, \ldots, M, \; t \le \tau
\]
The idea is that for each level of the covariate (indexed by s), the hazard rate should be the same. Typically, M is small.

Stratified tests

For the stratified test, let

\[
Z_{j\cdot}(\tau) = \sum_{s=1}^{M} Z_{js}(\tau)
\]

\[
\hat{\sigma}_{jk} = \sum_{s=1}^{M} \hat{\sigma}_{jks}
\]
Then the test statistic is as before with multiple samples:

\[
(Z_{1\cdot}(\tau), \ldots, Z_{K-1,\cdot}(\tau))\, \hat{\Sigma}^{-1}\, (Z_{1\cdot}(\tau), \ldots, Z_{K-1,\cdot}(\tau))'
\]

which is approximately $\chi^2$ with K − 1 degrees of freedom. Here we have K samples and M strata within each sample.

Renyi type tests

For a two-sample problem, if the hazard functions cross, then the previous tests might not detect much overall difference in the hazard rates. The overall survival experience might be similar, but it could differ in the short term and in the long term. If one group is more at risk in the short term and the other in the long term, these changes of direction can cancel out, leading one not to reject the null hypothesis even though the hazards differ.

Renyi-type tests are based on the maximum absolute value of the differences between cumulative hazard rates rather than the summed differences.

The idea is similar to the Kolmogorov-Smirnov test for comparing two distributions, which uses the largest absolute value of the difference between the two empirical CDFs, but Renyi tests allow for censoring.

Renyi type tests

To construct this test, let
\[
Z(t_i) = \sum_{t_k \le t_i} W(t_k) \left[ d_{k1} - Y_{k1}\, \frac{d_k}{Y_k} \right], \quad i = 1, \ldots, D
\]

where as usual $d_k = d_{k1} + d_{k2}$ and $Y_k = Y_{k1} + Y_{k2}$ (i.e., $d_k$ and $Y_k$ are the pooled number of deaths and number at risk at time $t_k$ over both samples). The variance of $Z(\tau)$ is
\[
\sigma^2(\tau) = \sum_{t_k \le \tau} W(t_k)^2\, \frac{Y_{k1}}{Y_k}\, \frac{Y_{k2}}{Y_k} \left(\frac{Y_k - d_k}{Y_k - 1}\right) d_k
\]

where τ is the largest death time $t_k$ with $Y_{k1}, Y_{k2} > 0$.

Renyi type tests

The test statistic is

\[
Q = \sup\{|Z(t)| : t \le \tau\} \, / \, \sigma(\tau)
\]

You can think of the supremum here as just the maximum of the absolute values of the $Z(t_i)$ values. Critical values are given in Appendix Table C.5 and are based on the theory of Brownian motion.
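A minimal sketch (mine) of computing Q from the pooled event-time summaries; the inputs d1, Y1 (deaths and numbers at risk in sample 1) and d, Y (pooled values) at each pooled death time are hypothetical, and the default W = 1 corresponds to log-rank weights:

# Renyi statistic: Q = max_i |Z(t_i)| / sigma(tau)
renyi_Q <- function(d1, Y1, d, Y, W = rep(1, length(d))) {
  Z      <- cumsum(W * (d1 - Y1 * d / Y))                             # Z(t_i)
  sigma2 <- sum(W^2 * (Y1 / Y) * (1 - Y1 / Y) * ((Y - d) / (Y - 1)) * d)
  max(abs(Z)) / sqrt(sigma2)                                          # compare to Table C.5
}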


Renyi type tests: finding the maximum |Z(t_j)|

Renyi type tests

The maximum occurs at 315 days, with the maximum value being 9.8. The p-value (based on Table C.5) is 0.053, which is not significant at α = 0.05 but still gives more signal for the curves being different than the log-rank test, which gives p = 0.63.

Testing based on a fixed point in time

Instead of testing survival and hazard rates over all time points, you might be interested in, say, the 1-year survival rate. Note that the time being tested should be chosen before doing the test. If you look at two survival curves and say, "Wow, they look really different at year 3, is that significant?" then the p-value will be biased too low.

This is similar to testing at many time points but not adjusting for multiple comparisons. In practice, this is what happens all the time, though: people look at a graph of the data, which is perhaps meant to be descriptive, something jumps out at them as unusual, and they ask, "Wow, is that significant?" It's extremely difficult to answer this type of question. A better approach in this kind of case might be the Renyi type of test, because it accounts for the fact that you are looking at the maximum difference over the entire time frame.

Testing based on a fixed point in time

Here we want to test $H_0: S_1(t_0) = S_2(t_0)$ against $H_A: S_1(t_0) \ne S_2(t_0)$ for two survival curves. (The method can be generalized to more survival curves.) The test statistic is

\[
Z = \frac{\hat{S}_1(t_0) - \hat{S}_2(t_0)}{\sqrt{\hat{V}[\hat{S}_1(t_0)] + \hat{V}[\hat{S}_2(t_0)]}}
\]

which has an approximate standard normal distribution for large samples.
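A hedged sketch of this comparison in R (my own code; the column names time, status, and group are hypothetical), pulling the Kaplan-Meier estimates and their standard errors at t0 from summary() of a survfit object:

library(survival)

compare_at_time <- function(dat, t0) {
  # Kaplan-Meier estimate and standard error of S(t0) within each of two groups
  fits <- lapply(split(dat, dat$group), function(d)
    summary(survfit(Surv(time, status) ~ 1, data = d), times = t0))
  S  <- sapply(fits, function(f) f$surv)
  se <- sapply(fits, function(f) f$std.err)
  Z  <- (S[1] - S[2]) / sqrt(se[1]^2 + se[2]^2)
  c(Z = Z, p.two.sided = 2 * pnorm(-abs(Z)))
}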

Testing based on a fixed point in time

If you want to test multiple fixed time points, such as the 1-year and 5-year survival rates, then you should adjust for multiple comparisons. For testing two time points, a Bonferroni adjustment could be made, meaning that you reject each hypothesis only if its p-value is less than α/2. The more time points you check, the less power you have to find significant differences.

Bonferroni adjustments

Probably the most popular and simplest adjustment for multiple testing is the Bonferroni adjustment. The idea is that to carry out k tests at overall level α (meaning that if the null hypotheses are true for all k tests, there is only an α chance of making an error on any one of them), you use a level of α/k for each individual test.
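In R this can be done either by comparing each raw p-value to α/k or with the built-in p.adjust(); a small worked example with made-up p-values:

> p <- c(0.012, 0.034, 0.21)          # p-values from k = 3 tests
> p.adjust(p, method = "bonferroni")  # multiplies each p-value by k (capped at 1)
[1] 0.036 0.102 0.630
> p < 0.05 / length(p)                # equivalently, compare to alpha/k
[1]  TRUE FALSE FALSE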

What is the rationale for doing this?

Bonferroni adjustments

There are several ways to justify Bonferroni adjustments. One is to look at the expected number of false positives under the null. Let $X_i = 1$ if you incorrectly reject the null hypothesis on test i (a false positive), and otherwise $X_i = 0$. What type of random variable is $X_i$? What is the probability that $X_i = 1$ if the null hypothesis (for test i) is true? What is the expected value of $X_i$?

Bonferroni adjustments

$X_i$ as defined previously is Bernoulli with p = α if testing at level α. The expected value of a Bernoulli(p) random variable is p (why?), so the expected value of $X_i$ is α.

If you do k tests, the expected number of false positives is
\[
E\left[\sum_{i=1}^{k} X_i\right] = k\alpha
\]
However, if you test at the α/k level, then the expected number of false positives is α. Thus, the Bonferroni adjustment controls the expected number of false positives.

Bonferroni adjustments

Another approach is to use something called Bonferroni's inequality. Let $A_i$ be the event that you don't reject the null hypothesis for test i. Suppose we set $P(A_i) = 1 - \alpha/k$ when the null is true. From the inclusion-exclusion formula,

\[
P(A_1 A_2) = P(A_1) + P(A_2) - P(A_1 \cup A_2) \ge P(A_1) + P(A_2) - 1
\]

If we apply the formula again, setting $B = A_1 A_2$, we get

\[
P(A_1 A_2 A_3) \ge P(A_1 A_2) + P(A_3) - 1 \ge [P(A_1) + P(A_2) - 1] + P(A_3) - 1 = P(A_1) + P(A_2) + P(A_3) - 2
\]

In general for k events

\[
P(A_1 \cdots A_k) \ge \sum_{i=1}^{k} P(A_i) - (k - 1)
\]

Bonferroni adjustments

If $P(A_i) = 1 - \alpha/k$, then we get
\[
P(A_1 \cdots A_k) \ge k\left(1 - \frac{\alpha}{k}\right) - k + 1 = 1 - \alpha
\]
Thus, the probability of all decisions being correct is at least 1 − α, and the probability of making any wrong decision is at most α.

Bonferroni adjustments

Bonferroni’s inequality can be useful in other probabilistic arguments as well.
