Datum Vol. 19, No. 2 April/May 2013
Institute for Health and Society, Medical College of Wisconsin
Datum
Biostatistics NEWSLETTER Key Function of the CTSI & MCW Cancer Center Biostatistics Unit
Regression Splines
Aniko Szabo, PhD, Associate Professor, Division of Biostatistics
In this issue:
Regression Splines
Biostatistics Drop-In Consulting Schedule
Poisson Regression
Subscribe to Datum
BiostatisticsMCW YouTube Channel

Regression analysis is a powerful tool for evaluating the effect of multiple predictors simultaneously. Continuous predictors, such as age or blood glucose levels, are often included in the regression model with a linear term, say b*Age, which implies that at any age, getting one year older changes the outcome by the same amount b. In practice, however, the relationship is often nonlinear.

Consider, for example, a study of acute meningitis done at Duke University Medical Center (Spanos A, Harrell FE, Durack DT. Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA 262: 2700-2707, 1989). The goal of the study was to develop a predictive rule for determining the underlying cause of meningitis: bacterial versus viral. For this example, we will concentrate on evaluating the effect of patient age. Since the outcome is binary, bacterial versus viral meningitis, logistic regression will be used for the analysis. Data from 420 patients is available.

As a first step, we break age into groups, and look at the probability of both types of meningitis by age group. As Table 1.1 shows, the pattern is not monotonic: the probability of bacterial meningitis starts very high in infants, then decreases and reaches its lowest levels in young adults, and then increases again in elderly patients.
Two commonly used approaches for modeling nonlinear effects are categorizing the continuous variable into groups, as done in Table 1.1, or using polynomial regression, that is, including quadratic, cubic, and even higher-order terms. Figure 1.1 shows the observed data and the probability of bacterial meningitis estimated using logistic regression with age groups, or a fifth-order polynomial. The order of the polynomial was chosen to match the number of parameters for the spline model that will be fitted later, but changing it does not substantially affect the model fit.

Table 1.1: Distribution of meningitis type by age group.

                          Meningitis type
Age group         N       Bacterial   Viral
≤ 1 year          169     74%         26%
2 – 10 years      105     41%         59%
11 – 20 years     43      19%         81%
21 – 40 years     56      14%         86%
41 – 60 years     20      55%         45%
> 60 years        27      81%         19%
Overall           420     52%         48%
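When age enters a logistic regression only as a set of categories, the fitted probability in each group equals the observed proportion in that group, so the group-level estimates can be checked directly from Table 1.1. The sketch below uses only the rounded percentages printed in the table (the variable and function names are illustrative, not taken from the original analysis) and converts each proportion to the log-odds scale on which the model is linear.

```python
import math

# Rounded values from Table 1.1: (age group, N, proportion bacterial).
TABLE_1_1 = [
    ("<= 1 year", 169, 0.74),
    ("2-10 years", 105, 0.41),
    ("11-20 years", 43, 0.19),
    ("21-40 years", 56, 0.14),
    ("41-60 years", 20, 0.55),
    ("> 60 years", 27, 0.81),
]


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Log-odds of bacterial meningitis in each age group.
group_log_odds = {name: logit(p) for name, _, p in TABLE_1_1}

# Overall proportion bacterial, weighting each group by its size.
total_n = sum(n for _, n, _ in TABLE_1_1)
overall_bacterial = sum(n * p for _, n, p in TABLE_1_1) / total_n

for name, n, p in TABLE_1_1:
    print(f"{name:12s} N={n:3d}  p={p:.2f}  log-odds={group_log_odds[name]:+.2f}")
print(f"Overall: N={total_n}, proportion bacterial = {overall_bacterial:.2f}")
```

The log-odds go from positive to negative and back to positive across the age groups, which is the non-monotonic pattern a single linear term in age cannot capture.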
Figure 1.1: Probability of bacterial meningitis as a function of age. Each point represents a patient, with the x-coordinate representing the age, and the y-coordinate the type of meningitis (1 – bacterial, 0 – viral). The lines show logistic regression estimates of the probability of bacterial meningitis: the solid blue line is based on a fifth-order polynomial in age, the dashed black line is based on the six age groups shown in Table 1.1.

Equation 1.1: equation for dashed black line in Figure 1.1

Equation 1.2: equation for solid blue line in Figure 1.1

Regression splines provide an alternative approach that combines the ideas of categorization and polynomial regression. Splines are piecewise polynomials: the range of the predictor is divided into several subgroups with cut-points called knots, and a separate polynomial function is used within each subgroup. Usually additional restrictions, such as continuity, that is no jumps at the connections, and smoothness, that is no sharp turns, are imposed.
Figure 1.2 shows a typical example: a cubic spline with two internal knots constructed from three cubic polynomials that are tangent to each other at the connection points. A polynomial corresponds to a spline with no knots, while categorization corresponds to using 0-order (constant) splines with knots at each category boundary.
Figure 1.2: Construction of a cubic spline with internal knots at x=3 and x=6. The left panel shows the three component cubic functions; the wide gray curve in the right panel shows the resulting spline that follows the path of a different component function on each interval defined by the knots. The vertical bars mark the location of the knots.
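The piecewise construction can be written compactly with a truncated power basis, one common way to represent a cubic spline (assumed here for illustration; the article does not say which basis was used). With knots at x=3 and x=6, as in Figure 1.2, the spline has six parameters, the same number as a fifth-order polynomial. The coefficients below are arbitrary illustrative values, not estimates from any data.

```python
KNOTS = (3.0, 6.0)  # internal knots, as in Figure 1.2


def spline_basis(x, knots=KNOTS):
    """Truncated power basis for a cubic spline: 1, x, x^2, x^3, (x-k)_+^3."""
    return [1.0, x, x**2, x**3] + [max(x - k, 0.0) ** 3 for k in knots]


def spline(x, coefs, knots=KNOTS):
    """Evaluate the spline as a linear combination of the basis functions."""
    return sum(c * b for c, b in zip(coefs, spline_basis(x, knots)))


def cubic_pieces(coefs, knots=KNOTS):
    """Ordinary cubic (a0, a1, a2, a3) that the spline follows on each interval."""
    a = list(coefs[:4])
    pieces = [tuple(a)]
    for k, c in zip(knots, coefs[4:]):
        # To the right of knot k the term c*(x-k)^3 is switched on:
        # c*(x-k)^3 = c*x^3 - 3*c*k*x^2 + 3*c*k^2*x - c*k^3
        a = [a[0] - c * k**3, a[1] + 3 * c * k**2, a[2] - 3 * c * k, a[3] + c]
        pieces.append(tuple(a))
    return pieces


def poly(x, a):
    """Value of the cubic a0 + a1*x + a2*x^2 + a3*x^3."""
    return a[0] + a[1] * x + a[2] * x**2 + a[3] * x**3


def poly_slope(x, a):
    """First derivative of the cubic."""
    return a[1] + 2 * a[2] * x + 3 * a[3] * x**2


# Arbitrary illustrative coefficients (hypothetical, not fitted to data).
coefs = [1.0, 0.5, -0.2, 0.02, -0.1, 0.15]
pieces = cubic_pieces(coefs)

for k, left, right in zip(KNOTS, pieces, pieces[1:]):
    # Adjacent cubics agree in value and slope at the knot:
    # no jump and no sharp turn.
    print(k, poly(k, left), poly(k, right), poly_slope(k, left), poly_slope(k, right))
```

Each interval gets its own cubic, yet only one extra coefficient is spent per knot, because the continuity and smoothness restrictions tie neighbouring cubics together.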
Figure 1.3 shows the fitted cubic spline with knots at the tertiles of the age distribution (ages 1 and 13) for the acute meningitis data. While the general pattern is similar, there is a striking difference in the fitted curves at the early ages: the spline model suggests that there is a spike in the probability of bacterial meningitis among very young children. Is it an artifact of the fitting process? We can explore this further by “zooming in” to the under-6 age group.
Figure 1.3: Probability of bacterial meningitis estimated using a cubic spline
Equation 1.3: equation for solid blue line in Figure 1.3.
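As a sketch of the general form such a model takes, assuming a truncated power basis (the article does not state which spline basis was used, and the coefficient symbols b_0 through b_5 are notation introduced here), a logistic regression with a cubic spline in age and knots at ages 1 and 13 can be written as:

```latex
\operatorname{logit} P(\text{bacterial} \mid \text{Age})
  = b_0 + b_1\,\text{Age} + b_2\,\text{Age}^2 + b_3\,\text{Age}^3
  + b_4\,(\text{Age}-1)_+^3 + b_5\,(\text{Age}-13)_+^3,
\qquad (u)_+ = \max(u, 0).
```

Written this way the model has six coefficients, the same number as the fifth-order polynomial used for comparison.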
Figure 1.4 shows the polynomial fit (dashed line), the spline fit (solid line), and the observed proportions at each individual age (plus signs). It is apparent that the spline picked up a phenomenon that the age categorization and polynomial missed: infants under 3 months of age have a much lower chance of having bacterial meningitis than slightly older infants.
Figure 1.4: Acute meningitis data restricted to ages 6 and under with fitted probability of bacterial meningitis. The solid green line is the spline fit, the dashed blue line is the polynomial fit, and the red plus signs show the observed proportion with bacterial meningitis for each given age.
This example shows one of the main benefits of using splines: they allow one to estimate nonlinear relationships in a way that is sensitive to localized effects, while fitting into a regression framework that allows adjustment for other covariates. Another benefit compared to polynomials is that it is possible to ensure that the splines are monotone increasing or decreasing. This feature is not useful for this particular example, as the effect of age is not monotonic, but in many contexts increasing values of a predictor are known to increase the value of the outcome.
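One simple way to see how monotonicity can be enforced, offered as an illustrative sketch and not necessarily the method the author had in mind: if every basis function is nondecreasing and every coefficient on it is nonnegative, the resulting curve cannot decrease. The basis, knots, and coefficients below are hypothetical choices, and the condition is sufficient but not necessary; dedicated monotone bases such as I-splines rest on the same idea.

```python
KNOTS = (3.0, 6.0)  # hypothetical knots, reused from Figure 1.2


def monotone_basis(x, knots=KNOTS):
    """Basis functions that are each nondecreasing in x: x, x^3, (x-k)_+^3."""
    return [x, x**3] + [max(x - k, 0.0) ** 3 for k in knots]


def monotone_spline(x, intercept, coefs, knots=KNOTS):
    """Cubic spline that is nondecreasing because all coefs are >= 0."""
    if any(c < 0 for c in coefs):
        raise ValueError("all coefficients must be nonnegative")
    return intercept + sum(c * b for c, b in zip(coefs, monotone_basis(x, knots)))


# Hypothetical nonnegative coefficients (not estimated from data).
coefs = [0.3, 0.0, 0.05, 0.2]
grid = [i / 10 for i in range(0, 101)]  # x from 0 to 10
values = [monotone_spline(x, -2.0, coefs) for x in grid]

# Check that the curve never decreases along the grid.
is_nondecreasing = all(b >= a for a, b in zip(values, values[1:]))
print(is_nondecreasing)
```

In a regression setting the same restriction would be imposed on the estimated coefficients, for example through a constrained optimizer; that step is not shown here.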
Biostatistics Drop-In Consulting Schedule
What is Drop-In Consulting?
Free statistical consultations are provided by staff biostatisticians throughout the month at various Medical College of Wisconsin and affiliate locations. The Drop-In Service is a great way to get short statistical questions answered. Drop-In is free, try it today.

Locations & Hours:

1. Medical College of Wisconsin
Tuesdays & Thursdays, 1:00-3:00 PM
Building: Health Research Center
Room: H2400

2. Froedtert Hospital
Mondays & Wednesdays, 1:00-3:00 PM
Building: Froedtert Pavilion
Room: #L756- TRU Offices

3. MCW Cancer Center
Wednesdays 10:00 AM-12:00 PM
Fridays 1:00-3:00 PM
Building: MCW Clinical Cancer Center
Room: Clinical Trials Support Room CLCC:3236 (Enter through C3233)

4. Zablocki VA Medical Center
1st & 3rd Monday of the month, 9:00-11:00 AM
Building: Building 111, 5th Floor B-wing
Room: 5423

5. Marquette University
Every Tuesday, 8:30-10:30 AM
Building: School of Nursing-Clark Hall
Room: Office of Research & Scholarship: 112D
*Please note: Priority given to MU Nursing and Dental School

Questions:
Haley Montsma
414.955.7439
[email protected]
Poisson Regression
Kwang Woo Ahn, PhD, Assistant Professor, Division of Biostatistics
Count