Chapter 5-23. Cox Regression Proportional Hazards (PH) Assumption

In this chapter, we will discuss the c-statistic for Cox regression, as a way to assess goodness-of-fit, or more correctly, the discriminatory ability of the model. We will discuss the proportional hazards assumption of Cox regression and how to test the assumption. We will then discuss the “stratification” approach for dealing with violations of the proportional hazards assumption, in a detailed Cox regression example.

We will begin with the same dataset we used to introduce Cox regression in Chapter 5-7, the LeeLife dataset (see box).

LeeLife dataset

This dataset came from Lee (1980, Table 3.5, p.31), which originally came from Myers (1969). The data concern male patients with localized cancer of the rectum diagnosed in Connecticut from 1935 to 1954. The research question is whether survival improved for the 1945-1954 cohort of patients (cohort = 1) relative to the earlier 1935-1944 cohort (cohort = 0).

Data Codebook ______

id study ID number

cohort 1 = 1945-1954 patient cohort, 0 = 1935-1944 patient cohort

interval 1 to 10, time interval (year) following cancer diagnosis; 11 = still alive and being followed at end of year 10

died 1 = died, 0 = withdrawn alive or lost to follow-up during year interval

______

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.

Reading the data in,

File Open Find the directory where you copied the course CD Change to the subdirectory datasets & do-files Single click on LeeLife.dta Open

use "C:\Documents and Settings\u0032770.SRVR\Desktop\ Biostats & Epi With Stata\datasets & do-files\LeeLife.dta", clear

* which must be all on one line, or use:

cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "Biostats & Epi With Stata\datasets & do-files" use LeeLife.dta, clear

In preparation for using the survival time commands, including Cox regression, which begin with st, we use the stset command to inform Stata which variable is the death, or event, variable, and which is the time variable.

Statistics Survival analysis Setup & utilities Declare data to be survival time data Main tab: Time variable: interval Failure variable: died OK

stset interval, failure(died)

Assessing Goodness of Fit with the c Statistic

Fitting a Cox regression model using,

Statistics Survival analysis Regression models Cox proportional hazards model Model tab: Independent variables: cohort OK

stcox cohort

Cox regression -- Breslow method for ties

No. of subjects =         1137                 Number of obs    =      1137
No. of failures =          798
Time at risk    =         4835
                                               LR chi2(1)       =     39.74
Log likelihood  =   -5245.5703                 Prob > chi2      =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cohort |   .6282795   .0454925    -6.42   0.000     .5451539    .7240802
------------------------------------------------------------------------------

Using the hazard ratio as an estimate of relative risk, we see that the later cohort had a smaller risk of dying from cancer than the earlier cohort (HR = 0.63, p < 0.001).

In the logistic regression part of the course, we assessed goodness of fit using the c-statistic. The same statistic is used to assess goodness of fit, or discriminatory ability, in Cox regression.

The original paper for what has become known as the c-statistic was published for use with Cox regression (Harrell et al., 1982). Harrell presented it as,

“Draw a pair of patients and determine which patient lived longer from his baseline evaluation. Survival times can be validly compared either when both patients have died, or when one has died and the other’s follow-up time has exceeded the survival time of the first. If both patients are still alive, which will live longer is not known, and that pair of patients is not used in the analysis. Otherwise, it can be determined whether the patient with the higher prognostic score (i.e., the weighted combination of baseline and test variables used to predict survival) also had the longer survival time. The process is repeated until all possible pairs of patients have been examined. Of the pairs of patients for which the ordering of survival times could be inferred, the fraction of pairs such that the patient with the higher score had the longer survival time will be denoted by c.

The index c estimates the probability that, of two randomly chosen patients, the patient with the higher prognostic score will outlive the patient with the lower prognostic score. Values of c near .5 indicate that the prognostic score is no better than a coin-flip in determining which patient will live longer. Values of c near 0 or 1 indicate the baseline data virtually always determine which patient has a better prognosis.”

Computing the c-statistic,

Statistics Survival analysis Regression models Test proportional hazards assumption Main tab: Reports and statistics: Harrell’s C index (concordance) OK

estat concordance

Harrell's C concordance statistic

failure _d: died analysis time _t: interval2

  Number of subjects (N)              =     1137
  Number of comparison pairs (P)      =   488830
  Number of orderings as expected (E) =   150665
  Number of tied predictions (T)      =   262446

  Harrell's C = (E + T/2) / P         =    .5767
  Somers' D                           =    .1533

We see that c = .58 does not achieve the 0.70 mark for acceptable discrimination (see box) for a prognostic model, but that is fine since we are not attempting to derive one.
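To make the quoted definition concrete, the following is a minimal brute-force sketch of how the comparison-pair counts arise. It is illustrative only, not how estat concordance computes them: it skips pairs with tied survival times (which the command handles), it is slow for a dataset of this size, and the variable score and the loop locals are names introduced here only for the sketch.

* Brute-force sketch of Harrell's c, following the quoted definition.
* Illustrative only -- estat concordance does this properly and efficiently.
quietly stcox cohort
capture drop score
predict double score, xb                  // prognostic score (higher = worse prognosis)
local P 0                                 // comparable pairs
local E 0                                 // orderings as expected
local T 0                                 // tied predictions
forvalues i = 1/`=_N-1' {
    forvalues j = `=`i'+1'/`=_N' {
        if _t[`i'] == _t[`j'] continue    // tied survival times skipped in this sketch
        local s = cond(_t[`i'] < _t[`j'], `i', `j')    // shorter follow-up
        local l = cond(_t[`i'] < _t[`j'], `j', `i')    // longer follow-up
        if _d[`s'] == 0 continue          // shorter time was censored: pair not usable
        local ++P
        if score[`s'] == score[`l'] {
            local ++T                     // tied prognostic scores
        }
        else if score[`s'] > score[`l'] {
            local ++E                     // higher-risk score died sooner, as expected
        }
    }
}
display "Harrell's C = " (`E' + `T'/2)/`P'

Because tied survival times are ignored in the sketch, its counts will not exactly reproduce the estat concordance output above.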

Rule of Thumb for Interpreting ROC and c-statistic

Hosmer and Lemeshow (2000, p. 162) suggest the following general rule for interpreting the area under the ROC curve:

ROC = 0.5 suggests no discrimination (i.e., no better than flipping a coin)
0.7 ≤ ROC < 0.8 is considered acceptable discrimination
0.8 ≤ ROC < 0.9 is considered excellent discrimination
ROC ≥ 0.9 is considered outstanding discrimination (extremely unusual to observe this in practice)

The same rule of thumb holds for the c-statistic, since the c-statistic is identically the area under the ROC curve (Hosmer and Lemeshow, 2000, p.163; Harrell, 2001, p.248).

Testing the Proportional Hazards Assumption of Cox Regression

In epidemiology courses, when the topic of a stratified analysis is presented, it is pointed out that the Mantel-Haenszel pooled estimate requires an assumption of homogeneity of the stratum-specific effect estimates (Chapter 3-11, p.7). This is required for the pooled estimate, or summary estimate, to be a good representation of what happens in the individual strata.

Since we verified in Chapter 5-7 that the Cox regression HR closely resembles the individual time-stratum RR estimates, it makes sense that Cox regression has a similar assumption. In Cox regression, this is called the proportional hazards assumption.

The assumption actually arises in how the log likelihood is specified, so that the assumption is an inherent part of the way the regression coefficients are estimated (Harrell, 2001, pp.466-468).

To test the proportional hazards assumption using a significance test approach, we use

estat phtest , detail

Test of proportional-hazards assumption

Time: Time
----------------------------------------------------------------
             |       rho            chi2       df       Prob>chi2
-------------+--------------------------------------------------
      cohort |      0.04554         1.64        1         0.1999
-------------+--------------------------------------------------
 global test |                      1.64        1         0.1999
----------------------------------------------------------------

We see that the proportional hazards assumption is justified, since the test is not significant (p = 0.1999).

The first section of the table lists each predictor separately, testing the proportional hazards assumption for that predictor specifically. The second section provides an overall test, the global test, which tests whether the model, overall, meets the proportional hazards assumption. If you leave off the “detail” option, you get only the global test.

The tests for the individual predictors use the scaled Schoenfeld residuals, while the global test uses the unscaled Schoenfeld residuals (Grambsch and Therneau, 1994). If significant, then the PH assumption is rejected. See the Kleinbaum quote on the next page, which justifies using a small alpha here (p < 0.01 or perhaps even p < 0.001).

______Note: In Stata version 9, this test was done using:

capture drop sc*
stcox cohort , schoenfeld(sch*) scal(sc*)
stphtest, detail

Protocol Suggestion

Suggested wording for describing using the PH test based on Schoenfeld residuals is:

A test of the proportional hazards, a required assumption of Cox regression, will be performed for each covariate and globally using a formal significance test based on the unscaled and scaled Schoenfeld residuals. (Grambsch and Therneau, 1994)

Example

Sjöström et al. (N Engl J Med, 2007) reported in their Statistical Analysis section using Schoenfeld residuals to test the proportional hazards assumption,

“Schoenfeld residuals from the models were examined to assess possible departures from model assumptions.32”
______
32 Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 1994;81:515-526.

To test the proportional hazards assumption with a graphical approach, we use

Statistics Survival analysis Regression models Graphically assess proportional hazards assumption Main tab: Independent variable: cohort OK

stphplot, by(cohort)

[Graph: log-log survival plot, -ln(-ln(Survival Probability)) versus ln(analysis time), with separate lines for cohort = 0 and cohort = 1]

The advantage of the graphical approach over the significance test is that it permits us to be more liberal in our assessment. If the lines look approximately parallel, then the PH assumption is met. If the lines cross, then clearly there is a problem. Short of the lines crossing, it is usually reasonable to assume the assumption is met. Kleinbaum (1996, p.141) recommends,

“We recommend that one should use a conservative strategy for this decision of assuming the PH assumption is satisfied unless there is strong evidence of nonparallelism of the log-log curves.”

Revised Protocol Suggestion

There is no reason not to use both methods, graphical and significance test, and to consider the evidence from both approaches. Here is another way to state how the PH assumption will be tested:

A test of the proportional hazards, a required assumption of Cox regression, will be performed for each covariate and globally using a formal significance test based on the unscaled and scaled Schoenfeld residuals (Grambsch and Therneau, 1994). In addition, a graphical assessment of proportional hazards will be made using log-log survival curves.

Testing Proportional Hazards With a Time-Dependent Covariate

The proportional hazards (PH) assumption is just another way of saying that the hazard ratio (HR) does not change over time for some predictor X. If the HR does change over time for X, then X is said to interact with time, so adding an X × time interaction term provides a better fitting model and fixes the problem of non-PH. To test the PH assumption, then, you can add an X × time interaction term and test it for significance. If it is not significant, then the PH assumption is satisfied and the interaction term can be dropped from the model. If it is significant, then the interaction term can be kept in the model as a fix for the PH assumption violation.

Adding an X × time interaction term is done using the time-dependent covariate option, tvc( ), and the time expression (function of analysis time) option, texp( ), which takes as an argument _t, the time variable created by the stset command.

Although it is mostly a matter of taste, some researchers like to use the log of time, ln(t), instead of the original time, t, which would be specified as texp(ln(_t)) instead of texp(_t). There is no clear guideline, so a more complete approach would be to try both. (Cleves, et al, 2004, p.177).
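To see what model the tvc( ) and texp( ) options fit, it helps to write it out. The following is a sketch in the usual Cox notation, with g(t) denoting the chosen time expression (g(t) = t or g(t) = ln t):

    h(t \mid x) = h_0(t) \exp\{ \beta_1 x + \beta_2 \, x \, g(t) \}

The HR for x at time t is exp{β1 + β2 g(t)}, so β2 = 0 corresponds to a constant HR, that is, proportional hazards; testing the tvc( ) coefficient against zero is the PH test just described.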

Testing for a cohort × time interaction,

stcox cohort, tvc(cohort) texp(_t)

Cox regression -- Breslow method for ties No. of subjects = 1137 Number of obs = 1137 No. of failures = 798 Time at risk = 4835 LR chi2(2) = 41.39 Log likelihood = -5244.7466 Prob > chi2 = 0.0000 ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------rh | cohort | .5613034 .063886 -5.07 0.000 .4490724 .701583 ------+------t | cohort | 1.043777 .0350215 1.28 0.202 .9773441 1.114725 ------Note: Second equation contains variables that continuously vary with respect to time; variables are interacted with current values of _t.

The output is now split into two panels. The first contains the predictors that are constant with time, or fixed covariates (rh). The second contains the time-varying covariates (t), with a footnote to remind you which is which, and also to remind you what function of time was used.

Testing for a cohort × time interaction, but this time using the log of time,

stcox cohort, tvc(cohort) texp(ln(_t))

Cox regression -- Breslow method for ties

No. of subjects = 1137 Number of obs = 1137 No. of failures = 798 Time at risk = 4835 LR chi2(2) = 41.28 Log likelihood = -5244.7997 Prob > chi2 = 0.0000 ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------rh | cohort | .5773083 .0573695 -5.53 0.000 .4751388 .7014475 ------+------t | cohort | 1.131921 .1133381 1.24 0.216 .9302206 1.377355 ------Note: Second equation contains variables that continuously vary with respect to time; variables are interacted with current values of ln(_t).

Usually you would not include the variable twice like this in Cox regression. If you really wanted to model cohort as a time-dependent covariate, you would just include it in the tvc( ) option. Separating it as we did, however, allows us to split the predictor into a “main effect” and an “interaction” effect.

We see that p = 0.202 with untransformed time and p = 0.216 with log-transformed time are very similar to the Schoenfeld residuals test for proportional hazards computed above, p = 0.1999. We conclude that the proportional hazards assumption is met and drop the interaction term. Our final model is

stcox cohort

Cox regression -- Breslow method for ties

No. of subjects = 1137 Number of obs = 1137 No. of failures = 803 Time at risk = 4841 LR chi2(1) = 42.31 Log likelihood = -5271.7697 Prob > chi2 = 0.0000 ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------cohort | .6204562 .0447145 -6.62 0.000 .5387255 .7145865 ------

For completeness, if we had chosen to model cohort as a time-dependent covariate, we would just include it in the tvc( ) option. Trying this,

stcox , tvc(cohort) texp(_t)

option tvc() not allowed
r(198);

We get an error message. It turns out that Stata needs at least one variable in front of the comma.

Since we don’t have any other predictor in this dataset, we can include a variable that contains all ones, which represents the baseline hazard. Ordinarily you would not do this, because it is done anyway behind the scenes.

gen ones = 1
stcox ones , tvc(cohort) texp(_t)

Cox regression -- Breslow method for ties

No. of subjects = 1137 Number of obs = 1137 No. of failures = 798 Time at risk = 4835 LR chi2(1) = 15.96 Log likelihood = -5257.4616 Prob > chi2 = 0.0001 ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------t | cohort | .9173471 .0195219 -4.05 0.000 .8798719 .9564184 ------Note: Second equation contains variables that continuously vary with respect to time; variables are interacted with current values of _t.

This is not as impressive as our fixed predictor approach (HR=0.62) on the previous page. Trying log time,

stcox ones , tvc(cohort) texp(ln(_t))

Cox regression -- Breslow method for ties

No. of subjects = 1137 Number of obs = 1137 No. of failures = 798 Time at risk = 4835 LR chi2(1) = 11.36 Log likelihood = -5259.7624 Prob > chi2 = 0.0008 ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------t | cohort | .7780817 .0568277 -3.44 0.001 .6743061 .8978282 ------Note: Second equation contains variables that continuously vary with respect to time; variables are interacted with current values of ln(_t).

This time we get a slightly more impressive result, but still not as good as the fixed approach.

The time-dependent predictor variables are a type of combined main effect and interaction term with time, so you don’t need to have the same variable listed before the comma to model the variable.

A Detailed Worked Example

In this detailed example, we will go into much more analysis detail than you would probably care to do in practice. It is a Rolls Royce approach.

We will analyze the Reyes dataset, reyeswithdates.dta (see box).

Reyeswithdates dataset

This dataset came from Cleves et al. (2004, p.185). Cleves’ file reyes.dta was modified slightly, by replacing the days variable with beginning and ending dates, representing the same number of days of follow-up.

This is a randomized clinical trial involving N=150 children diagnosed with Reye’s syndrome. Study subjects were randomized to a new medication or to a standard medication. The study hypothesis is that the new treatment will be effective in preventing death from Reye’s syndrome.

Cleves et al. (2004) describe the dataset,

“Reye’s syndrome is a rare disease, usually affecting children under the age of fifteen who are recovering from an upper respiratory illness, chicken pox, or flu. The condition causes severe brain swelling and inflammation of the liver. This acute illness requires immediate and aggressive medical attention. The earlier the disease is diagnosed, the better the chances of a successful recovery. Treatment protocols include drugs to control the brain swelling and intravenous fluids to restore normal blood chemistry. For this study of a new medication to control the brain swelling, and thus to prevent death, 150 Reye’s syndrome patients were randomly allocated at time of hospital presentation to either the standard high-dose barbiturate treatment protocol or to a treatment protocol that included the new experimental drug. The time from treatment allocation to death or end of follow-up was recorded in days.”

Data Codebook

id study ID number (one observation per subject)
begindate date of treatment allocation
enddate date of death or end of follow-up
dead 1 = death, 0 = censored (survived)
treat treatment, 1 = patient on experimental protocol, 0 = patient on standard protocol
age patient age (years)
sex gender, 0 = ?, 1 = ? (Cleves did not say)
ftliver fatty liver disease, 1 = present, 0 = absent (from liver biopsy within 24 hours of treatment allocation)
ammonia baseline blood ammonia level (mg/dl)
sgot baseline serum level of aspartate transaminase (SGOT) (I.U.)

Reading in the data,

use "C:\Documents and Settings\u0032770.SRVR\Desktop\ Biostats & Epi With Stata\datasets & do-files\reyeswithdates.dta", clear

* which must be all on one line, or use:

cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "Biostats & Epi With Stata\datasets & do-files" use reyeswithdates, clear

Checking the variable types,

describe

Contains data from reyeswithdates.dta
  obs:           150
 vars:            10                          27 Apr 2006 11:50
 size:         4,950 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              int    %8.0g
begindate       str8   %9s
enddate         str8   %9s
dead            byte   %8.0g
treat           byte   %8.0g
age             byte   %8.0g
sex             byte   %8.0g
ftliver         byte   %8.0g
ammonia         float  %9.0g
sgot            int    %8.0g
-------------------------------------------------------------------------------
Sorted by:

We discover that the two date variables have storage type “str8”, meaning they are strings of up to 8 text characters, or string variables. The other variables have storage types of int, byte, and float, all of which are numeric variables.

Before dates can be used in Stata, such as computing the number of days between them, they have to be converted to “date” variables.

If we look at the date using Stata’s browser, we will see dates of the form “7/2/01”. To create two “date” variables, we use,

capture drop begindate2 enddate2
gen begindate2 = date(begindate,"md20y")   // Stata 10
gen enddate2 = date(enddate,"md20y")       // Stata 10

* capture drop begindate2 enddate2
gen begindate2 = date(begindate,"MD20Y")   // Stata 11
gen enddate2 = date(enddate,"MD20Y")       // Stata 11

The second argument of the date function, the “md20y” part, informs Stata that the dates are month/day/year. If the dates were of the form “7/2/2001”, or “7/2/1991”, it would be sufficient to use “mdy” as the second argument. Since the years are only two digits, Stata requires us to inform it whether the dates are from the 1900s or 2000s. The “20y” informs Stata to put “20” in front of the two digits of the year. (Note that in Stata version 10, the “mdy” are lowercase, but in Stata version 11, these must be uppercase “MDY”.)

If we look at the data now, using the data browser, we will see that the new date variables are of the form “15158”. Statistical software uses what are called “elapsed dates”, which in Stata is the number of days since January 1, 1960.
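As a quick sanity check of the conversion (a sketch using the Stata 11 mask shown above), display can show how a single date string maps to an elapsed date and back to a readable form:

display date("7/2/01", "MD20Y")        // elapsed date, days since 01jan1960: 15158
display %td date("7/2/01", "MD20Y")    // the same value with a date format applied: 02jul2001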

To make the dates appear in the format “02jul2001”, use

format begindate2 enddate2 %d

The “%” indicates that a format specification follows, which in this case is “d” for date. Arithmetic can be done on date variables, so we can finally compute the days of follow-up. Do this using,

capture drop days gen days = enddate2 - begindate2

To check our work, we look at a few dates, using

list begindate enddate begindate2 enddate2 days in 1/5 , abbrev(15)

+------+ | begindate enddate begindate2 enddate2 days | |------| 1. | 7/2/01 7/10/01 02jul2001 10jul2001 8 | 2. | 7/2/01 9/22/01 02jul2001 22sep2001 82 | 3. | 7/3/01 7/22/01 03jul2001 22jul2001 19 | 4. | 7/4/01 8/19/01 04jul2001 19aug2001 46 | 5. | 7/5/01 8/7/01 05jul2001 07aug2001 33 | +------+

To get the means, standard deviations, percents, etc., for the potential covariates for the Cox regression model, which we might use for a “Table 1. Patient Characteristics” table in our manuscript, use,

ttest age , by(treat)
ttest ammonia , by(treat) unequal
ttest sgot , by(treat)
tab sex treat, expect
tab sex treat, col exact
tab ftliver treat, expect
tab ftliver treat, col chi2

Notice we asked for an “unequal” variance t test for ammonia, because we noticed the standard deviation was twice as large in one group (pretending we ran the command a first time), so the equal variances assumption was suspect. Actually, our sample size is large enough that this assumption is not critical, so the ordinary equal variance t test would probably be just fine. We also used the “expect” option to get the expected frequencies for the crosstabulations, and then commented it out so that we don’t get confused later thinking these are column frequencies. When at least one expected frequency, for a 2 × 2 table, was < 5, we used Fisher’s exact test; otherwise we used the chi-square test (this minimum expected frequency rule is in Chapter 2-3, p.18).

. ttest age , by(treat)

Two-sample t test with equal variances ------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ------+------0 | 71 12.49296 .2268437 1.911419 12.04053 12.94538 1 | 79 12.29114 .2125195 1.888914 11.86805 12.71423 ------+------combined | 150 12.38667 .1547999 1.895904 12.08078 12.69255 ------+------diff | .2018185 .3106441 -.4120523 .8156893 ------diff = mean(0) - mean(1) t = 0.6497 Ho: diff = 0 degrees of freedom = 148

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.7415 Pr(|T| > |t|) = 0.5169 Pr(T > t) = 0.2585

. ttest ammonia , by(treat) unequal

Two-sample t test with unequal variances ------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ------+------0 | 71 2.683099 .3921769 3.304542 1.900926 3.465271 1 | 79 5.143038 .7920279 7.039698 3.566231 6.719844 ------+------combined | 150 3.978667 .4661303 5.708907 3.057587 4.899746 ------+------diff | -2.459939 .8838049 -4.210859 -.7090201 ------diff = mean(0) - mean(1) t = -2.7834 Ho: diff = 0 Satterthwaite's degrees of freedom = 113.345

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.0032 Pr(|T| > |t|) = 0.0063 Pr(T > t) = 0.9968

. ttest sgot , by(treat)

Two-sample t test with equal variances ------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ------+------0 | 71 279.9155 .962422 8.109512 277.996 281.835 1 | 79 283.2911 1.04432 9.282119 281.2121 285.3702 ------+------combined | 150 281.6933 .7250671 8.880222 280.2606 283.1261 ------+------diff | -3.375646 1.430435 -6.202361 -.5489318 ------diff = mean(0) - mean(1) t = -2.3599 Ho: diff = 0 degrees of freedom = 148

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.0098 Pr(|T| > |t|) = 0.0196 Pr(T > t) = 0.9902

. tab sex treat, col exact

| treat sex | 0 1 | Total ------+------+------0 | 3 0 | 3 | 4.23 0.00 | 2.00 ------+------+------1 | 68 79 | 147 | 95.77 100.00 | 98.00 ------+------+------Total | 71 79 | 150 | 100.00 100.00 | 100.00

Fisher's exact = 0.104 1-sided Fisher's exact = 0.104

. tab ftliver treat, col chi2

| treat ftliver | 0 1 | Total ------+------+------0 | 62 59 | 121 | 87.32 74.68 | 80.67 ------+------+------1 | 9 20 | 29 | 12.68 25.32 | 19.33 ------+------+------Total | 71 79 | 150 | 100.00 100.00 | 100.00

Pearson chi2(1) = 3.8310 Pr = 0.050

These data are from a randomized trial. Notice randomization achieved balance on age and sex, but it did not achieve balance on ammonia, sgot, and ftliver. With these variables reported in a Table 1, the reader will question the randomization procedure so that should be discussed. The reader will also expect to see ammonia, sgot, and ftliver included in the final Cox model, or at least a statement that it was included and was dropped due to lack of significance or was dropped after determining it was not a confounder.

To analyze these data using the survival analysis procedures, including Cox regression, we first “stset” the data, informing Stata which variable is the follow-up time and which variable is the event outcome. Use,

stset days , failure(dead==1)

Next, try looking at the hazard function using Kaplan-Meier estimates,

ltable days dead, by(treat) hazard

This will give a table that is too long to be useful. Next, collapse days into two-week intervals,

ltable days dead, by(treat) hazard intervals(14)

                 Beg.     Cum.      Std.               Std.
     Interval   Total   Failure    Error    Hazard    Error    [95% Conf. Int.]
--------------------------------------------------------------------------------
treat 0
     0    14      71     0.0851   0.0332    0.0063   0.0026     0.0013   0.0114
    14    28      64     0.2833   0.0552    0.0174   0.0048     0.0080   0.0267
    28    42      43     0.4303   0.0638    0.0163   0.0057     0.0051   0.0276
    42    56      27     0.5363   0.0706    0.0147   0.0073     0.0004   0.0289
    56    70      12     0.6754   0.0834    0.0252   0.0143     0.0000   0.0533
    70    84       5     0.7836   0.1044    0.0286   0.0280     0.0000   0.0834
treat 1
     0    14      79     0.1282   0.0379    0.0098   0.0031     0.0037   0.0158
    14    28      67     0.2683   0.0533    0.0125   0.0041     0.0044   0.0206
    28    42      36     0.2931   0.0570    0.0025   0.0025     0.0000   0.0073
    42    56      22     0.3360   0.0677    0.0045   0.0045     0.0000   0.0132
    56    70      10     0.4245   0.1012    0.0102   0.0102     0.0000   0.0302
    70    84       4     0.6547   0.1884    0.0357   0.0346     0.0000   0.1035
--------------------------------------------------------------------------------

Looking at either the “Cumulative Failure” or the “Hazard” columns, the study treatment does not appear to have any effect until after the first month.

Looking at a Kaplan-Meier cumulative survival graph,

sts graph , by(treat)

[Graph: Kaplan-Meier survival estimates, by treat; survival probability (0.00 to 1.00) versus analysis time (0 to 80 days), with curves for treat = 0 and treat = 1]

Looking at a Kaplan-Meier cumulative hazard graph,

sts graph , by(treat) failure

[Graph: Kaplan-Meier failure estimates, by treat; cumulative failure probability (0.00 to 1.00) versus analysis time (0 to 80 days), with curves for treat = 0 and treat = 1]

Which of these two graphs we would choose to publish depends on what is a more natural presentation. Do we want to make statements about the treatment improving survival (survival graph) or reducing mortality (failure graph)?

The failure graph is more aligned with the hazard ratio from Cox regression, which gives it a particular intuitive appeal. If the study treatment is effective, the HR < 1, which corresponds to the cumulative hazard line for study treatment being drawn below the cumulative hazard line for the standard treatment.

This particular graph will make some readers uncomfortable, since the graphs are not proportionally separated along the range of the follow-up time. It appears the proportional hazards assumption is not met. We see no treatment effect for the first month, after which the drug is providing a protective effect against death. Perhaps this is due, at least in part, to the sicker patients ending up in the treatment group, which maybe offsets any early protective effect, or perhaps it just takes a few weeks for the treatment effect to be discernible.

Although not sufficient to test the study hypothesis, given that we suspect confounding due to the imbalance of baseline covariates, we can compare the treatments using the log-rank survival test by,

sts test treat

failure _d: dead == 1 analysis time _t: days

Log-rank test for equality of survivor functions

| Events Events treat | observed expected ------+------0 | 35 29.56 1 | 23 28.44 ------+------Total | 58 58.00

chi2(1) = 2.07 Pr>chi2 = 0.1504

The univariable Cox regression gives a p value very similar to the log-rank test. A Cox regression without covariates is also called the Cox-Mantel test (another of the many survival analysis tests like the log-rank test). Computing the univariable Cox regression,

stcox treat

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(1) = 2.06 Log likelihood = -253.1617 Prob > chi2 = 0.1508

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .682312 .1833923 -1.42 0.155 .4028993 1.155499 ------

At this stage of our analysis, it appears that the treatment effect is not going to be significant.

Let’s throw all of the covariates into a multivariable model and see what happens.

stcox treat age sex ftliver ammonia sgot

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(6) = 66.00 Log likelihood = -221.1916 Prob > chi2 = 0.0000

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .3380707 .1104805 -3.32 0.001 .1781712 .6414721 age | 1.101588 .0834979 1.28 0.202 .9495113 1.278022 sex | 1.002736 .7487444 0.00 0.997 .2320559 4.332915 ftliver | 1.778003 .6144496 1.67 0.096 .9031728 3.500211 ammonia | 1.145347 .0263403 5.90 0.000 1.094867 1.198154 sgot | 1.056477 .0183852 3.16 0.002 1.021051 1.093133 ------

Apparently, there was some confounding, since the HR for treatment changed by > 10% and became significant once covariates were included. Confounding can either detract from or enhance significance.

“drop only if p > 0.20” variable selection rule

A variable might confound a result even if statistical significance is not obtained. It is easy to imagine this could be the case for a potential confounder with a p = 0.06, for example, so where should we draw the line? Vittinghoff et al (2005, p.146) support this idea,

“...we do not recommend ‘parsimonious’ models that only include predictors that are statistically significant at P < 0.05 or even stricter criteria, because the potential for residual confounding in such models is substantial.”

To protect against residual confounding, it has been suggested that potential confounders be eliminated only if p > 0.20. (Maldonado and Greenland, 1993).

For now, let’s retain all predictors but sex (p = 0.997), using the “drop only if p > 0.20” variable selection rule.

stcox treat age ftliver ammonia sgot

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(5) = 66.00 Log likelihood = -221.1916 Prob > chi2 = 0.0000

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .3381223 .1095964 -3.35 0.001 .1791315 .6382281 age | 1.101635 .0825134 1.29 0.196 .9512224 1.275832 ftliver | 1.778163 .6129628 1.67 0.095 .9047903 3.494581 ammonia | 1.145348 .0263389 5.90 0.000 1.094871 1.198152 sgot | 1.056471 .0183002 3.17 0.002 1.021205 1.092955 ------

In linear regression, the completeness, or goodness of fit, of the model can be assessed with the multiple R, or multiple R-squared, statistic. In logistic regression, this is popularly done with the c-statistic. In Cox regression, we use the c-statistic as well. Let’s compute it now, using the estat command, so that we can compare the fit of this model with other models we will derive.

stcox treat age ftliver ammonia sgot
estat concordance

Harrell's C concordance statistic

failure _d: dead == 1 analysis time _t: days

Number of subjects (N) = 150 Number of comparison pairs (P) = 5545 Number of orderings as expected (E) = 4358 Number of tied predictions (T) = 0

Harrell's C = (E + T/2) / P = .7859 Somers' D = .5719

We see that the c-statistic = 0.79 is close to the upper end of the 0.7 ≤ c-statistic < 0.8 range for “acceptable discrimination”, using the Hosmer and Lemeshow (2000, p. 162) rule of thumb.

Let’s assess if keeping age provides any benefit, by dropping it and comparing the results to the previous model.

stcox treat ftliver ammonia sgot
estat concordance

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(4) = 64.34 Log likelihood = -222.02514 Prob > chi2 = 0.0000

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .3522512 .1123675 -3.27 0.001 .188504 .6582401 ftliver | 1.747151 .6099943 1.60 0.110 .8813431 3.463507 ammonia | 1.139031 .0258365 5.74 0.000 1.089502 1.190812 sgot | 1.056876 .0183906 3.18 0.001 1.021438 1.093542 ------

. estat concordance

Harrell's C concordance statistic

failure _d: dead == 1 analysis time _t: days

Number of subjects (N) = 150 Number of comparison pairs (P) = 5545 Number of orderings as expected (E) = 4351 Number of tied predictions (T) = 1

Harrell's C = (E + T/2) / P = .7848 Somers' D = .5695

The c-statistic changed from 0.7859 to 0.7848, so age did not help much with the overall discriminatory ability of the model.

“10% change in estimate” variable selection rule

Confounding is said to be present if the unadjusted effect differs from the effect adjusted for putative confounders. [Rothman, 1998].

A variable selection rule consistent with this definition of confounding is the change-in-estimate method of variable selection. In this method, a potential confounder is included in the model if it changes the coefficient, or effect estimate, of the primary exposure variable (treat in our example) by 10%. This method has been shown to produce more reliable models than variable selection methods based on statistical significance [Greenland, 1989].

By dropping age, the HR = 0.338 for treat changed to HR = 0.352 in the reduced model (a (0.352 - 0.338)/0.338 = 0.04, or 4%, relative change). The c-statistic changed from 0.7859 to 0.7848, hardly at all, suggesting age contributed nothing to the overall goodness of fit. (The c-statistic is actually only important for models developed for prediction. For the purpose of this example, which is to test the treatment effect while controlling for confounders, the c-statistic does not apply. We are considering it only for our own illustration.)
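The same change-in-estimate calculation can also be scripted rather than done by hand. Here is a minimal sketch based on the two models above; the local macro names hr_full and hr_red are introduced only for this illustration:

* Change-in-estimate check for treat when age is dropped (a sketch).
quietly stcox treat age ftliver ammonia sgot
local hr_full = exp(_b[treat])             // adjusted HR for treat, full model
quietly stcox treat ftliver ammonia sgot
local hr_red = exp(_b[treat])              // adjusted HR for treat, reduced model
display "relative change in HR for treat = " abs(`hr_red' - `hr_full')/`hr_full'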

Both the statistical significance and the 10% change-in-estimate variable selection methods suggest dropping age from the model. We do not have to be concerned here with the “drop only if p > 0.20” variable selection rule, since we have already identified that confounding is not a problem using the “10% change in estimate” rule.

Let’s assess if keeping ftliver provides any benefit, by dropping it and comparing the results to the previous model.

stcox treat ammonia sgot
estat concordance

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(3) = 61.92 Log likelihood = -223.23418 Prob > chi2 = 0.0000

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .3555073 .1127858 -3.26 0.001 .1908983 .6620564 ammonia | 1.155364 .02421 6.89 0.000 1.108874 1.203803 sgot | 1.058267 .0185142 3.24 0.001 1.022595 1.095183 ------

. estat concordance

Harrell's C concordance statistic

failure _d: dead == 1 analysis time _t: days

Number of subjects (N) = 150 Number of comparison pairs (P) = 5545 Number of orderings as expected (E) = 4340 Number of tied predictions (T) = 1

Harrell's C = (E + T/2) / P = .7828 Somers' D = .5656

We see the HR = 0.352 in the previous model with ftliver increased to HR = 0.356, a change of (0.356 - 0.352)/0.352 = 0.01, or 1%. The c-statistic changed from c = 0.7848 to c = 0.7828, essentially no change at all.

Notice we have used a combination of the statistical significance and change-in-effect rules for variable selection. Using one or the other, or the combination, are three possible variable selection strategies.

Backwards Variable Selection

To arrive at this “final” model, we have used a variable selection approach known as backward selection, where one begins with all of the predictor variables of interest, and removes them in the order of least significant (or a combination of least significant and least clinically relevant).

Backwards selection is considered superior to forwards selection (forward selection adds one variable at a time), because negatively confounded sets of variables are less likely to be omitted from the model (Sun et al, 1999), since the complete set is included in the initial model. In contrast, forward and stepwise (stepwise is where variables can be added and subsequently removed) selection procedures will only include such sets if at least one member meets the inclusion criterion in the absence of the others (Vittinghoff et al, 2005, p.151).

By “negatively confounded sets” , we are referring to the situation where two or more variables must be included in the model as a set to control for confounding. When one of the variables is dropped, confounding increases.
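For comparison with the manual backward elimination used in this chapter, Stata also offers an automated version through the stepwise prefix, which works with stcox. This is only a sketch of that idea, not the approach taken here; pr(0.20) mirrors the “drop only if p > 0.20” rule, and lockterm1 keeps the first term, treat, in every model:

* Automated backward elimination (a sketch; this chapter removes variables by hand instead).
stepwise, pr(0.20) lockterm1: stcox treat age sex ftliver ammonia sgot

Automated selection based only on p values ignores the change-in-estimate criterion entirely, which is one reason the manual approach is illustrated in this chapter.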

Should we check the model assumptions?

It is tempting to publish without bothering to check the model assumptions, because most papers reporting Cox regression models do not mention having checked the model assumptions in their Statistical Methods section. The same can be said for all types of regression models.

Devereaux et al (2006), for example, published a Cox regression analysis in JAMA without any mention of checking the model assumptions. In fact, most authors do not mention checking the assumptions, including the proportional hazards assumption.

In this tutorial, we will compare our “final model” to models where violated assumptions are recognized and adjusted for, so that we can see what effect violated assumptions have on the model results.

Although this is something we should have done at the very beginning, let’s now see how many deaths occurred in this sample, so we can determine how many variables we can test for model inclusion, without introducing “overfitting”.

tab dead

       dead |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         92       61.33       61.33
          1 |         58       38.67      100.00
------------+-----------------------------------
      Total |        150      100.00

We see that there are 58 dead events. Therefore, we can model 58/10 = 5.8 variables, let’s say 6, without introducing overfitting. Strictly applying this rule, we should only consider a total of 6 variables in our variable selection exercise, not just limiting the final model to 6 variables.

We have 6 predictor variables in our dataset. If we take any one of these variables and convert it to quartiles, which would require 3 dummy variables to model, this would count as 3 variables rather than one.

Actually, 58/5 = 11.6 variables would be okay, since it has lately been shown that 5 events per predictor are as good as 10 events per predictor when the goal is to test the effect of a primary predictor while controlling for confounding (see “overfitting” in the sample size chapter, Chapter 2-5, p.30).

In logistic regression, it is assumed that the odds increase exponentially across the range of the predictor. In Cox regression, it is assumed that the hazard increases exponentially across the range of the predictor. One way to check this is to first convert a continuous predictor to quartiles, or some other quantile. Let’s do this for our three continuous predictors.

xtile age4 = age , nq(4)
tabstat age, stat ( count min max ) by(age4) nototal col(stat)
xtile ammonia4 = ammonia , nq(4)
tabstat ammonia, stat ( count min max ) by(ammonia4) nototal col(stat)
xtile sgot4 = sgot , nq(4)
tabstat sgot, stat ( count min max ) by(sgot4) nototal col(stat)

. tabstat age, stat ( count min max ) by(age4) nototal col(stat)

Summary for variables: age by categories of: age4 (4 quantiles of age )

    age4 |    N    min    max
---------+--------------------
       1 |   48      9     11
       2 |   63     12     13
       3 |   14     14     14
       4 |   25     15     17
------------------------------

. xtile ammonia4 = ammonia , nq(4)

. tabstat ammonia, stat ( count min max ) by(ammonia4) nototal col(stat)

Summary for variables: ammonia
by categories of: ammonia4 (4 quantiles of ammonia)

ammonia4 |    N    min    max
---------+--------------------
       1 |   42     .2     .6
       2 |   34     .7    1.2
       3 |   37    1.3    5.3
       4 |   37    5.4   27.8
------------------------------

. xtile sgot4 = sgot , nq(4)

. tabstat sgot, stat ( count min max ) by(sgot4) nototal col(stat)

Summary for variables: sgot by categories of: sgot4 (4 quantiles of sgot )

   sgot4 |    N    min    max
---------+--------------------
       1 |   41    263    276
       2 |   37    277    281
       3 |   36    282    288
       4 |   36    289    306
------------------------------

The xtile command created a new variable with 4 categories, each category representing 25% of the sample, as best it could. The tabstat command verified the variable was set up correctly, and also tells us the range of values composing each quartile.

Next we model the quartiles very easily using the “xi” (generate indicator variables) facility. We will include treat in each model as well, since we know we want treat in our final model.

xi: stcox treat i.age4        // Stata version 10
xi: stcox treat i.ammonia4    // Stata version 10
xi: stcox treat i.sgot4       // Stata version 10

In Stata version 11, we can use “ib1”, which means generate indicator variables behind the scenes, using the first category as the baseline, or referent category,

stcox treat ib1.age4        // Stata version 11
stcox treat ib1.ammonia4    // Stata version 11
stcox treat ib1.sgot4       // Stata version 11

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(4) = 2.93 Log likelihood = -252.7307 Prob > chi2 = 0.5702

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .6782666 .1823941 -1.44 0.149 .4004074 1.148944 | age4 | 2 | 1.224904 .3872601 0.64 0.521 .6591589 2.276219 3 | 1.01796 .5181894 0.03 0.972 .375344 2.760783 4 | 1.384599 .5396833 0.83 0.404 .6449801 2.972363 ------

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(4) = 56.49 Log likelihood = -225.9478 Prob > chi2 = 0.0000

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .7238032 .2030522 -1.15 0.249 .4176655 1.254332 | ammonia4 | 2 | 2.065668 1.097025 1.37 0.172 .7294713 5.849419 3 | 5.955138 2.902689 3.66 0.000 2.290836 15.48067 4 | 18.29635 8.854006 6.01 0.000 7.086789 47.23671 ------

. stcox treat ib1.sgot4 // Stata version 11

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(4) = 17.33 Log likelihood = -245.52822 Prob > chi2 = 0.0017

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .5737291 .1582266 -2.01 0.044 .3341621 .9850462 | sgot4 | 2 | 2.367308 1.13604 1.80 0.073 .9242179 6.063664 3 | 2.887154 1.423618 2.15 0.032 1.098382 7.58903 4 | 5.128798 2.404224 3.49 0.000 2.046436 12.85384 ------

Looking at age, we see that no quartile is significant, just as the continuous variable was not significant. The hazard ratios for the quartiles are 1.0 (the 1st quartile, which is the referent),

1.2, 1.0, and 1.4. This is not an exponential increase, but we don’t care because we will drop age out of the model as being not significant.

Looking at ammonia, we see hazard ratios of 1.0, 2.1, 6.0, and 18.3. This is probably close enough to an exponential increase that modeling the variable as a continuous variable would not be a problem.

Looking at SGOT, we see hazard ratios of 1.0, 2.4, 2.9, and 5.1. This is probably close enough to an exponential increase that modeling the variable as a continuous variable would not be a problem.
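A brief note on what “close enough to an exponential increase” means here. Under the Cox model, a fixed increment in a continuous predictor multiplies the hazard by the same factor wherever the increment occurs, so, to the extent that the quartiles span roughly equal increments of the predictor, the quartile HRs should grow roughly geometrically (a clarifying sketch, not a formal test):

    h(t \mid x + \Delta) / h(t \mid x) = \exp(\beta \Delta)    for any starting value x

so the quartile HRs relative to the first quartile would look like 1, r, r^2, r^3 for some constant r. The ammonia HRs (1.0, 2.1, 6.0, 18.3) roughly triple from one quartile to the next, which is the kind of pattern this check looks for; quartiles with very unequal widths weaken the comparison, a point taken up below.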

Another way to make the determination made above is to compare the effect estimate of treat when modeling the predictor either as a continuous variable or as quantiles. Obtaining the models with these predictors included as continuous variables,

stcox treat age
stcox treat ammonia
stcox treat sgot

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(2) = 2.26 Log likelihood = -253.06596 Prob > chi2 = 0.3237

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .6828319 .1835354 -1.42 0.156 .4032024 1.156391 age | 1.031866 .0739947 0.44 0.662 .8965695 1.187579 ------

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(2) = 51.28 Log likelihood = -228.55595 Prob > chi2 = 0.0000

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .433673 .1309922 -2.77 0.006 .2399134 .7839172 ammonia | 1.168163 .0230618 7.87 0.000 1.123826 1.214249 ------

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(2) = 22.56 Log likelihood = -242.91484 Prob > chi2 = 0.0000

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .5262863 .147063 -2.30 0.022 .304345 .9100765 sgot | 1.083774 .0199286 4.38 0.000 1.04541 1.123545 ------

We get treat HR = .68 for age included as quartiles, and treat HR = 0.68 for age included as continuous.

We get treat HR = .72 for ammonia included as quartiles, and treat HR = 0.43 for ammonia included as continuous.

We get treat HR = .57 for sgot included as quartiles, and treat HR = 0.53 for sgot included as continuous.

It appears that the HR for treat depends on how we choose to model ammonia, since 0.72 is very different from 0.43.

We could consider finer quantiles, such as deciles (10 categories), but we would run into an overfitting problem.

The big difference is probably due to the range of the quartile. Above, we found the ranges of ammonia for the four quartiles to be 0.2-0.6, 0.7-1.2, 1.3-5.3, and 5.4-27.8. The fourth quartile is just too wide.

There is another way to determine the best functional form of a predictor variable. We can use martingale residuals, obtained by specifying the mgale( ) option when fitting the Cox model. These residuals can be interpreted simply as the difference between the observed number of failures in the data and the number of failures predicted by the model (Cleves et al, 2004, p.186-187). We begin by fitting a Cox model without predictor variables, which models the baseline hazard, to create a variable containing the martingale residuals. The “estimate” option is required when fitting a model without covariates (called the null model).

capture drop mgresid
stcox , mgale(mgresid) estimate

Next, we separately plot each predictor, whose functional form we are interested in determining, against mgresid. We use the lowess (locally weighted regression) smoother to get a more easily interpreted graph. Run each of the lowess graphs separately

lowess mgresid age
lowess mgresid ammonia
lowess mgresid sgot

Chapter 5-23 (revision 16 May 2010) p. 29 Chapter 5-23(revisionChapter 2010)16May linear. it should make a log function, transform shapeofa logarithm this ofammonia.graph the has Since values lower at the relationship a linear the from that we graph deviates see ammonia variable, the For using its originalscale. a as continuous variable can be included variable sgot, the with it as age and did linear, variableappears approximately graph fora predictor If the

martingale martingale -1.5 -1 -.5 0 .5 1 -1.5 -1 -.5 0 .5 1 260 bandwidth = .8 bandwidth = .8 0 270 10 Lowess smoother Lowess smoother 280 ammonia sgot 290 20 300 310 30 p. 30 Creating a logged transformed variable of ammonia, and regraphing

gen lnammonia = ln(ammonia)
lowess mgresid lnammonia

[Graph: lowess smoother of the martingale residuals versus lnammonia (bandwidth = .8)]

This is a more linear relationship.

Fitting a model with the transformed variable

stcox treat lnammonia

Cox regression -- Breslow method for ties

No. of subjects = 150 Number of obs = 150 No. of failures = 58 Time at risk = 4971 LR chi2(2) = 63.22 Log likelihood = -222.58415 Prob > chi2 = 0.0000

------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .5957854 .1659649 -1.86 0.063 .3451238 1.028501 lnammonia | 2.518588 .3149692 7.39 0.000 1.971096 3.218152 ------

Our three different models gave us

Ammonia             treat effect
continuous          HR = 0.43 , p = 0.006
quartiles           HR = 0.72 , p = 0.249
log transformed     HR = 0.60 , p = 0.063

which are very different results. With ammonia modeled as continuous in its original scale, we get the biggest and most significant effect of treat, but we might be getting fooled. Quartiles were no fun at all, since the effect went away, but this might be because the fourth quartile had too wide a range. The log-transformed variable seems like a good compromise.

Let’s see if using the c-statistic helps determine which is the best fit.

stcox treat ammonia
estat concordance
xi: stcox treat i.ammonia4
estat concordance
stcox treat lnammonia
estat concordance

Cox regression -- Breslow method for ties ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .433673 .1309922 -2.77 0.006 .2399134 .7839172 ammonia | 1.168163 .0230618 7.87 0.000 1.123826 1.214249 ------

Harrell's C concordance statistic

Harrell's C = (E + T/2) / P = .7737

Cox regression -- Breslow method for ties ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .7238032 .2030522 -1.15 0.249 .4176655 1.254332 _Iammonia4_2 | 2.065668 1.097025 1.37 0.172 .7294713 5.849419 _Iammonia4_3 | 5.955138 2.902689 3.66 0.000 2.290836 15.48067 _Iammonia4_4 | 18.29635 8.854006 6.01 0.000 7.086789 47.23671 ------

Harrell's C concordance statistic

Harrell's C = (E + T/2) / P = .7788 Somers' D = .5576

Cox regression -- Breslow method for ties ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .5957854 .1659649 -1.86 0.063 .3451238 1.028501 lnammonia | 2.518588 .3149692 7.39 0.000 1.971096 3.218152 ------

Harrell's C concordance statistic

Harrell's C = (E + T/2) / P = .7959

The three different models give us

Ammonia             treat effect               c statistic
continuous          HR = 0.43 , p = 0.006         0.77
quartiles           HR = 0.72 , p = 0.249         0.78
log transformed     HR = 0.60 , p = 0.063         0.80

so the log-transformed model provides the best goodness of fit.

Let’s do our backwards elimination variable selection again, this time with the log transformed ammonia variable, along with a c-statistic for the final model

stcox treat age ftliver lnammonia sgot
stcox treat ftliver lnammonia sgot
estat concordance

Cox regression -- Breslow method for ties ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .4875879 .142898 -2.45 0.014 .2745303 .8659954 age | 1.047432 .075271 0.64 0.519 .9098217 1.205855 ftliver | 2.106065 .677367 2.32 0.021 1.121251 3.955861 lnammonia | 2.187477 .2828102 6.05 0.000 1.697833 2.818331 sgot | 1.047659 .0182932 2.67 0.008 1.012412 1.084134 ------

Cox regression -- Breslow method for ties ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .4924155 .1444691 -2.41 0.016 .2770759 .8751139 ftliver | 2.078928 .6711683 2.27 0.023 1.104166 3.914212 lnammonia | 2.174083 .2798651 6.03 0.000 1.689284 2.798012 sgot | 1.048324 .018281 2.71 0.007 1.013099 1.084774 ------

Harrell's C concordance statistic

Harrell's C = (E + T/2) / P = .8183

Now that we have correctly specified ammonia, by log transforming it, the ftliver predictor remains significant, whereas it was dropped from the model when ammonia was not log transformed.

Let’s see if ftliver is a confounder, using the 10% rule.

stcox treat ftliver lnammonia sgot
stcox treat lnammonia sgot

Cox regression -- Breslow method for ties ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .4924155 .1444691 -2.41 0.016 .2770759 .8751139 ftliver | 2.078928 .6711683 2.27 0.023 1.104166 3.914212 lnammonia | 2.174083 .2798651 6.03 0.000 1.689284 2.798012 sgot | 1.048324 .018281 2.71 0.007 1.013099 1.084774 ------

Cox regression -- Breslow method for ties ------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ------+------treat | .5152024 .1481634 -2.31 0.021 .2932154 .9052509 lnammonia | 2.321445 .2980431 6.56 0.000 1.804992 2.985668 sgot | 1.050645 .0187338 2.77 0.006 1.014561 1.088012 ------

Dropping ftliver, the effect for treat changes by (0.515 - 0.492)/0.492 = 0.047, or 4.7%. We could conclude that ftliver is not a confounder, by the 10% rule, and drop it from the model anyway. However, the three putative confounders, ftliver, ammonia, and sgot, have been shown to be important predictors of the final outcome of children with Reye’s syndrome (Cleves et al, 2004, p.186). Along with the treated and untreated groups being imbalanced on each of these three variables, the reader will feel much more comfortable with our final model if all three variables are included. That is, by the “association definition” of a confounder, where each of these putative confounders is associated with both the treatment exposure and the death outcome, these three variables can be considered confounders, and should be in the final model.

In other words, it is useful to retain the variables to provide “face validity”.

Proportional Hazards Assumption

Next we will check the proportional hazards assumption.

The Cox proportional hazards (Cox PH) model has the form:

    h(t, X_1, \ldots, X_k) = h_0(t) \exp\left( \sum_{i=1}^{k} \beta_i X_i \right)

where h_0(t) is the baseline hazard, and X_1, \ldots, X_k are the predictor variables.

The model predicts an individual’s hazard for the event by multiplying the baseline hazard at time t (t being the individual’s follow-up time) by the exponentiated linear predictor. The linear predictor is the sum of the regression weights, or betas, multiplied by the individual’s values of the predictor variables.

It was demonstrated in Chapter 5-7 of this course manual that the hazard ratio represents the time-specific risk ratios, being a type of pooled estimate (or weighted average) across the individual time strata. As taught in epidemiology courses, the Mantel-Haenszel pooled estimate of risk ratios, odds ratios, or rate ratios across strata assumes homogeneity of these stratum-specific estimates in order for a single pooled estimate to represent what is happening in the individual strata. In an analogous fashion, in Cox regression, it is assumed that the hazard ratio at each time point is homogeneous in order for the single HR provided by the Cox model to be a good estimate of what is happening at each follow-up time. This assumption is called the proportional hazards assumption.

Since the hazard ratio (HR) is a single number that summarizes the hazard for all follow-up times, it can only be a good estimate if the hazard ratio remains constant across the range of follow-up times. When looking at a cumulative hazard graph, such as the one computed above, at any point of follow-up time (the X axis), the ratio of the values of the cumulative hazard (the Y axis) for the two curves should be the same (said to be “proportional hazards”). This ratio is the HR.
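A one-line derivation in the notation above shows why the Cox HR is free of follow-up time. For two covariate patterns X^{(1)} and X^{(0)},

    HR = \frac{h(t, X^{(1)})}{h(t, X^{(0)})}
       = \frac{h_0(t) \exp\left( \sum_i \beta_i X_i^{(1)} \right)}{h_0(t) \exp\left( \sum_i \beta_i X_i^{(0)} \right)}
       = \exp\left( \sum_i \beta_i \left( X_i^{(1)} - X_i^{(0)} \right) \right)

The baseline hazard cancels, so the ratio does not involve t; that constancy over time is exactly the proportional hazards assumption.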

Recall that the graph of the univariable analysis, duplicated here, looks like it might not meet the proportional hazards assumption, because the curves do not separate during the first 30 days of follow-up, whereas there is wide separation at 50 days.

[Figure: Kaplan-Meier failure estimates, by treat; failure probability (0.00 to 1.00) plotted against analysis time (0 to 80 days); separate curves for treat = 0 and treat = 1]

Let’s test the proportional hazards assumption for this univariable model, using

stcox treat
estat phtest, detail

Cox regression -- Breslow method for ties

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       treat |    .682312   .1833923    -1.42   0.155     .4028993    1.155499
------------------------------------------------------------------------------

. estat phtest, detail

Test of proportional-hazards assumption

Time:  Time
----------------------------------------------------------------
             |       rho            chi2       df       Prob>chi2
-------------+--------------------------------------------------
       treat |     -0.22228         2.78        1         0.0956
-------------+--------------------------------------------------
 global test |                      2.78        1         0.0956
----------------------------------------------------------------

The significance test based on the Schoenfeld residuals (Grambsch and Therneau, 1994) is not statistically significant, which by itself suggests the assumption is sufficiently met.

However, the p value for the PH assumption test is marginal (p = 0.096), so we should verify the assumption graphically as well.
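One quick graphical companion to the Schoenfeld test, before turning to the log-log plot, is to plot the scaled Schoenfeld residuals for treat against time; a roughly flat, zero-slope scatter supports proportional hazards. The two lines below are our sketch, not part of the course text, and use the plot() option of estat phtest available in current Stata (older versions obtained the same plot from stphtest after saving the residuals at estimation).

* Sketch: scaled Schoenfeld residual plot for treat (a flat trend supports PH)
quietly stcox treat
estat phtest, plot(treat)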

We do this using the log-log graph, by

stphplot, by(treat)

[Figure: log-log plot of -ln(-ln(Survival Probability)) against ln(analysis time), 0 to 4; separate curves for treat = 0 and treat = 1]

In this graph, the proportional hazards assumption is met if the curves are approximately parallel. Since the curves clearly cross at about ln(time) = 3, or roughly exp(3) ≈ 20 days, the proportional hazards assumption is not met. This is the same place the curves cross, although slightly, on the cumulative hazard graph shown above.

In the univariable model, then, the HR=0.68 is not a good estimate of the effect, since it does not hold for the whole range of follow-up time.

Since the univariable model is not the one we will use to test the study hypothesis, we can ignore the fact that the proportional hazards (PH) assumption does not hold for it. Testing the PH assumption for our final multivariable model,

stcox treat ftliver lnammonia sgot
estat phtest, detail

Cox regression -- Breslow method for ties

No. of subjects =          150                 Number of obs    =         150
No. of failures =           58
Time at risk    =         4971
                                               LR chi2(4)       =       75.88
Log likelihood  =   -216.25577                 Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       treat |   .4924155   .1444691    -2.41   0.016     .2770759    .8751139
     ftliver |   2.078928   .6711683     2.27   0.023     1.104166    3.914212
   lnammonia |   2.174083   .2798651     6.03   0.000     1.689284    2.798012
        sgot |   1.048324    .018281     2.71   0.007     1.013099    1.084774
------------------------------------------------------------------------------

. estat phtest, detail

Test of proportional-hazards assumption

Time:  Time
----------------------------------------------------------------
             |       rho            chi2       df       Prob>chi2
-------------+--------------------------------------------------
       treat |      0.00580         0.00        1         0.9657
     ftliver |     -0.34388         5.13        1         0.0236
   lnammonia |      0.08187         0.38        1         0.5394
        sgot |     -0.02179         0.03        1         0.8667
-------------+--------------------------------------------------
 global test |                      5.25        4         0.2628
----------------------------------------------------------------

We see that the PH assumption was met for the model overall (p = 0.263), but not for the predictor ftliver (p = 0.024). Interestingly, it is now met very nicely for treat (p = 0.966).

The simplest approach to dealing with a violation of the PH assumption, as long as the variable in violation is not our primary exposure variable, is to stratify the Cox model on that variable. If the variable is continuous, we must first convert it to an ordered categorical variable, with, say, 5 categories (quintiles). Since ftliver is already categorical, we do not need to categorize it. Stratifying the Cox model by ftliver, and again checking the PH assumption,

capture drop sch* sca*
stcox treat lnammonia sgot, strata(ftliver) schoenfeld(sch*) scal(sca*)
stphtest, detail

Stratified Cox regr. -- Breslow method for ties

No. of subjects =          150                 Number of obs    =         150
No. of failures =           58
Time at risk    =         4971
                                               LR chi2(3)       =       51.91
Log likelihood  =   -191.27316                 Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       treat |   .4668395   .1374445    -2.59   0.010     .2621568    .8313312
   lnammonia |   2.063892   .2678633     5.58   0.000     1.600344    2.661709
        sgot |   1.041826   .0181285     2.35   0.019     1.006894     1.07797
------------------------------------------------------------------------------
                                                         Stratified by ftliver

. estat phtest, detail

Test of proportional-hazards assumption

Time:  Time
----------------------------------------------------------------
             |       rho            chi2       df       Prob>chi2
-------------+--------------------------------------------------
       treat |     -0.02878         0.05        1         0.8308
   lnammonia |      0.05933         0.19        1         0.6594
        sgot |     -0.04538         0.12        1         0.7277
-------------+--------------------------------------------------
 global test |                      0.34        3         0.9522
----------------------------------------------------------------

We see that the PH assumption is now met both globally (the overall model) and individually for each predictor variable.
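This is what we should expect, because stratifying changes how ftliver enters the model. In a stratified Cox model, written here in the same notation as before (this is the standard form of the stratified model, not something specific to our data), each level s of the stratification variable gets its own baseline hazard while the regression coefficients are shared across strata:

h_s(t, X_1, \ldots, X_k) = h_{0s}(t) \exp\left( \sum_{i=1}^{k} \beta_i X_i \right), \quad s = 0, 1 \text{ (the levels of ftliver)}

Because ftliver now acts through its own baseline hazard rather than through a coefficient, it no longer has to satisfy the proportional hazards assumption; the price is that the model no longer reports a hazard ratio for ftliver itself.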

This model, with three predictor variables and one stratification variable, is the final model we would report in our article.
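As an aside, had the variable violating the PH assumption been continuous (for example, sgot) rather than a 0/1 variable like ftliver, we would first have needed to cut it into ordered categories before stratifying, as noted above. The lines below are a purely hypothetical sketch of that step, using Stata’s xtile command to form quintiles; the variable name sgot5 and the particular model are ours, not part of the course analysis.

* Hypothetical sketch: stratify on quintiles of a continuous PH violator
xtile sgot5 = sgot, nq(5)                       // five roughly equal-sized groups
stcox treat ftliver lnammonia, strata(sgot5)    // sgot now enters only through the strata
estat phtest, detail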

A popular way to report a final model is to just show the HR, Confidence Interval, and p value for the primary exposure variable (treat in our case) and then report what was controlled for in a footnote. Look at Table 2 of the Devereux et al (2004) paper for an example of this. In our footnote, we would state, “Adjusted for log-transformed ammonia and SGOT as predictor variables, and stratified by fatty liver disease since this variable did not meet the proportional hazards assumption.”

Finally, we want to be able to report a Kaplan-Meier graph, but if we report the graph produced above, our paper will lose credibility. That is, many readers will be nervous about the lack of proportional hazards in the univariable graph, even if we make efforts to correct for it in the multivariable model. This occurs because most readers familiar with Cox regression are trained to look for violations of proportional hazards in the Kaplan-Meier graph, but usually are not trained in how to deal with such violations.

To get a Kaplan-Meier graph that is adjusted for covariates, we might try

sts graph , by(treat) failure adjust(lnammonia sgot ftliver)

Chapter 5-23 (revision 16 May 2010) p. 39 extreme extrapolation from the actual values of actualvalues from the extrapolation extreme ofzero, an value at covariates a graph,holdsthe bydefault, of263to 306.The a range having variable from the the comes problem would discover a time,we one at covariates three ofthe tried Ifwe foreach graph isaadjusting disaster. This Chapter 5-23(revisionChapter 2010)16May

0.00 0.25 0.50 0.75 1.00 0 adjusted for lnammonia sgot ftliver 20 Failure functions, by treat treat = 0 analysis time 40 sgot . treat = 1 sgot 60 . This could possibly be due to could possiblybe due . This 80 p. 40 sgot

To resolve this problem, we can use mean centering. That is, we first subtract the mean from each of the continuous covariates, and then adjust for these mean-centered variables. The zero value then represents the mean in such a variable. For dichotomous variables, such as ftliver, the variable is held at 0, or “no exposure”. [It turns out if we used the mean-centered values in the Cox regression model, the effect estimates would be unchanged.]

Computing the mean-centered variables, using “r(mean)” after the “sum” command (which is where Stata returns the mean), and then including these new variables in the graph,

sum sgot
gen sgotcen = sgot - r(mean)
sum lnammonia
gen lnammcen = lnammonia - r(mean)
sts graph , by(treat) failure adjustfor(sgotcen lnammcen ftliver)

[Figure: Failure functions, by treat, adjusted for sgotcen lnammcen ftliver; failure probability (0.00 to 1.00) plotted against analysis time (0 to 80 days); separate curves for treat = 0 and treat = 1]

This is the Kaplan-Meier graph we would report in our article. In the figure legend, we could state, “This graph displays the Kaplan-Meier cumulative hazard, after adjusting for the same covariates used in the final multivariable model, with ammonia and SGOT held constant at their mean value, and fatty liver disease held constant at absence of disease.”

Although we stratified on ftliver in our Cox model, we can still include it in the “adjustfor()” option, since only one stratum (ftliver = 0) is used in the adjustment anyway, which amounts to the same thing as stratifying on it.
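Before leaving the example, the bracketed claim above, that fitting the Cox model with the mean-centered covariates leaves the effect estimates unchanged, is easy to verify, because centering only shifts the covariates and the shift is absorbed by the baseline hazard. The check below is a sketch that reuses the centered variables created for the graph; the stored-estimate names raw and centered are ours.

* Sketch: hazard ratios are unchanged by mean-centering the covariates
stcox treat lnammonia sgot, strata(ftliver)
estimates store raw
stcox treat lnammcen sgotcen, strata(ftliver)
estimates store centered
estimates table raw centered, eform

The two columns of exponentiated coefficients should agree to the displayed precision.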

References

Blossfeld H-P, Hamerle A, Mayer KU. (1989). Event History Analysis: Statistical Theory and Application in the Social Sciences. Hillsdale NJ, Lawrence Erlbaum Associates.

Choudhury JB. (2002). Non-parametric confidence interval estimation for competing risks analysis: application to contraceptive data. Statist. Med. 21:1129-1144.

Cleves MA, Gould WW, Gutierrez RG. (2004). An Introduction to Survival Analysis Using Stata. Revised edition. College Station, TX, Stata Press.

Coviello V, Boggess M. (2004). Cumulative incidence estimation in the presence of competing risks. The Stata Journal 4(2):103-112.

Devereux RB, Wachtell K, Gerdts E, et al. (2004). Prognostic significance of left ventricular mass change during treatment of hypertension. JAMA 292(19):2350-2356.

Freireich EJ, et al. (1963). The effect of 6-mercaptopurine on the duration of steroid-induced remission in acute leukemia. Blood 21:699-716.

Grambsch PM, Therneau TM. (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81:515-526.

Greenland S. (editor) (1987). Evolution of Epidemiologic Ideas: Annotated Readings on Concepts and Methods. Chestnut Hill, Massachusetts.

Harrell FE, Califf RM, Pryor DB, et al. (1982). Evaluating the yield of medical tests. JAMA 247(18):2543-2546.

Harrell Jr FE. (2001). Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag.

Hosmer DW, Lemeshow S. (2000) Applied Logistic Regression. 2nd ed. New York, John Wiley & Sons.

Kalbfleisch JD, Prentice RL. (1980). The Statistical Analysis of Failure Time Data. New York, John Wiley & Sons.

Kleinbaum DG (1996). Survival Analysis: A Self-Learning Text. New York, Springer-Verlag.

Lee ET. (1980). Statistical Methods for Survival Data Analysis. Belmont CA, Lifetime Learning Publications.

Maldonado G, Greenland S. (1993). Simulation study of confounder-selection strategies. Am J Epidemiol 138:923-936.

Mantel N. (1966). Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep 50:163-170.

Myers MH. (1969). A Computing Procedure for a Significance Test of the Difference Between Two Survival Curves. Methodological Note No. 18 in Methodological Notes compiled by the End Results Sections, National Cancer Institute, National Institutes of Health, Bethesda, Maryland.

Sun GW, Shook TL, Kay GL. (1996). Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. Journal of Clinical Epidemiology 49:907-916.

Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. (2005). Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. New York, Springer.

