Datum Vol. 19, No. 2, April/May 2013, Institute for Health and Society, Medical College of Wisconsin

Datum

Biostatistics NEWSLETTER Key Function of the CTSI & MCW Cancer Center Unit

In this issue:
Regression Splines ...... 1
Biostatistics Drop-In Consulting Schedule ...... 3
...... 4
Subscribe to Datum ...... 5
BiostatisticsMCW YouTube Channel ...... 6

Regression Splines

Aniko Szabo, PhD, Associate Professor, Division of Biostatistics

Regression analysis is a powerful tool for evaluating the effect of multiple predictors simultaneously. Continuous predictors, such as age or blood glucose levels, are often included in the regression model with a linear term, say b*Age, which implies that at any age, getting one year older changes the outcome by the same amount b. In practice, however, the relationship is often nonlinear.

Consider, for example, a study of acute meningitis done at Duke University Medical Center (Spanos A, Harrell FE, Durack DT. Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA 262: 2700-2707, 1989). The goal of the study was to develop a predictive rule for determining the underlying cause of meningitis: bacterial versus viral. For this example, we will concentrate on evaluating the effect of patient age. Since the outcome is binary, bacterial versus viral meningitis, logistic regression will be used for the analysis. Data from 420 patients are available.

As a first step, we break age into groups and look at the probability of both types of meningitis by age group. As Table 1.1 shows, the pattern is not monotonic: the probability of bacterial meningitis starts very high in infants, then decreases and reaches its lowest levels in young adults, and then increases again in elderly patients.

Table 1.1: Distribution of meningitis type by age group.

                         Meningitis type
Age group        N       Bacterial   Viral
≤ 1 years        169     74%         26%
2 – 10 years     105     41%         59%
11 – 20 years    43      19%         81%
21 – 40 years    56      14%         86%
41 – 60 years    20      55%         45%
> 60 years       27      81%         19%
Overall          420     52%         48%

Two commonly used approaches for modeling nonlinear effects are categorizing the continuous variable into groups, as done in Table 1.1, or using polynomial regression, that is, including quadratic, cubic, and even higher order terms. Figure 1.1 shows the observed data and the probability of bacterial meningitis estimated using logistic regression with age groups, or a fifth order polynomial. The order of the polynomial was chosen to match the number of parameters for the spline model that will be fitted later, but changing it does not substantially affect the model fit.
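As a quick arithmetic check, the overall row of Table 1.1 can be approximately recomputed from the group rows. The sketch below uses Python, which is an assumption here since the article does not name any software. It relies only on the rounded percentages printed in the table, so the reconstructed case counts are approximate.

```python
# Rows of Table 1.1: (age group, N, percent bacterial), as printed in the table.
table = [
    ("<= 1 years", 169, 74),
    ("2-10 years", 105, 41),
    ("11-20 years", 43, 19),
    ("21-40 years", 56, 14),
    ("41-60 years", 20, 55),
    ("> 60 years", 27, 81),
]

total_n = sum(n for _, n, _ in table)

# Approximate number of bacterial cases. The percentages are rounded,
# so each group's count is only accurate to within about one patient.
bacterial = sum(n * pct / 100 for _, n, pct in table)

overall_bacterial_pct = 100 * bacterial / total_n
overall_viral_pct = 100 - overall_bacterial_pct

# Age group with the lowest percentage of bacterial meningitis.
lowest = min(table, key=lambda row: row[2])

print(total_n, round(overall_bacterial_pct), round(overall_viral_pct), lowest[0])
```

With these inputs the overall split comes out at roughly 52% bacterial and 48% viral, and the lowest bacterial percentage falls in the 21 to 40 year group, in line with the non-monotonic pattern described above.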

Regression splines provide an alternative approach that combines the ideas of categorization and polynomial regression. Splines are piecewise polynomials: the range of the predictor is divided into several subgroups with cut-points called knots, and a separate polynomial function is used within each subgroup. Usually additional restrictions are imposed, such as continuity (no jumps at the connections) and smoothness (no sharp turns).

Figure 1.1: Probability of bacterial meningitis as a function of age. Each point represents a patient, with the x-coordinate representing the age, and the y-coordinate the type of meningitis (1 = bacterial, 0 = viral). The lines show logistic regression estimates of the probability of bacterial meningitis: the solid blue line is based on a fifth-order polynomial in age, the dashed black line is based on the six age groups shown in Table 1.1.

Equation 1.1: equation for the dashed black line in Figure 1.1

Equation 1.2: equation for the solid blue line in Figure 1.1
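For reference, the two models behind Figure 1.1 (Equations 1.1 and 1.2) have the following generic form. This is only a sketch: the fitted coefficients are not reproduced here, and the symbols p, gamma, beta, G_j, and the choice of the first age group as the reference category are notation assumed for illustration.

```latex
% Categorical model (dashed black line): six age groups G_1, ..., G_6,
% with G_1 taken as the reference group (an assumption).
\log\frac{p(\mathrm{Age})}{1 - p(\mathrm{Age})}
  = \gamma_0 + \sum_{j=2}^{6} \gamma_j \, I(\mathrm{Age} \in G_j)

% Fifth-order polynomial model (solid blue line).
\log\frac{p(\mathrm{Age})}{1 - p(\mathrm{Age})}
  = \beta_0 + \sum_{k=1}^{5} \beta_k \, \mathrm{Age}^{k}
```

Here p(Age) is the probability of bacterial meningitis and I(·) equals 1 when its condition holds and 0 otherwise. Both models have an intercept plus five further coefficients.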

Figure 1.2 shows a typical example: a cubic spline with two internal knots constructed from three cubic polynomials that are tangent to each other at the connection points. A polynomial corresponds to a spline with no knots, while categorization corresponds to using 0-order (constant) splines with knots at each category boundary.

Figure 1.2: Construction of a cubic spline with internal knots at x=3 and x=6. The left panel shows the three component cubic functions; the wide gray curve in the right panel shows the resulting spline that follows the path of a different component function on each interval defined by the knots. The vertical bars mark the location of the knots.
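One standard way to write such a spline is the truncated power basis; the article does not say which basis was used, so this choice is an assumption. The Python sketch below builds a cubic spline with knots at x=3 and x=6, as in Figure 1.2, using hypothetical coefficients, and checks numerically that the slope and curvature agree on both sides of each knot.

```python
def truncated_power_basis(x, knots, degree=3):
    """Basis values [1, x, ..., x^degree, (x - k)_+^degree for each knot k]."""
    powers = [x ** p for p in range(degree + 1)]
    truncated = [max(x - k, 0.0) ** degree for k in knots]
    return powers + truncated


def spline(x, coefs, knots):
    """Evaluate the cubic spline as a weighted sum of the basis functions."""
    return sum(c * b for c, b in zip(coefs, truncated_power_basis(x, knots)))


def one_sided(f, x, h, side):
    """Finite-difference slope and curvature of f using points on one side of x.

    side = -1 uses points to the left of x, side = +1 points to the right.
    The curvature estimate is centered one step h away from x.
    """
    f0, f1, f2 = f(x), f(x + side * h), f(x + 2 * side * h)
    slope = side * (-3 * f0 + 4 * f1 - f2) / (2 * h)
    curvature = (f0 - 2 * f1 + f2) / h ** 2
    return slope, curvature


knots = [3.0, 6.0]                           # internal knots, as in Figure 1.2
coefs = [1.0, 0.5, -0.2, 0.03, -0.1, 0.15]   # hypothetical, for illustration only


def f(x):
    return spline(x, coefs, knots)


h = 1e-3
slope_gaps, curvature_gaps = [], []
for k in knots:
    left_slope, left_curv = one_sided(f, k, h, -1)
    right_slope, right_curv = one_sided(f, k, h, +1)
    slope_gaps.append(abs(left_slope - right_slope))
    curvature_gaps.append(abs(left_curv - right_curv))

# Number of regression terms besides the intercept: 3 powers + 1 per knot.
n_terms = len(truncated_power_basis(0.0, knots)) - 1
print(n_terms, max(slope_gaps), max(curvature_gaps))
```

Both gaps shrink toward zero as h decreases, which is the "no sharp turns" restriction in numerical form. With two knots the basis has five terms besides the intercept, the same count as a fifth-order polynomial.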

Figure 1.3 shows the fitted cubic spline with knots at the tertiles of the age distribution (ages 1 and 13) for the acute meningitis data. While the general pattern is similar, there is a striking difference in the fitted curves at the early ages: the spline model suggests that there is a spike in the probability of bacterial meningitis among very young children. Is it an artifact of the fitting process? We can explore this further by “zooming in” to the under-6 age group.

Figure 1.3: Probability of bacterial meningitis estimated using a cubic spline

Equation 1.3: equation for the solid blue line in Figure 1.3.
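In generic form, a cubic spline logistic model with knots at ages 1 and 13 can be written with a truncated power basis as shown below. This is a sketch under assumptions: the fitted coefficients are not reproduced, the symbols are illustrative, and the article does not state which spline basis was used. Statistical software often uses a B-spline basis instead, which describes the same family of curves with differently defined coefficients.

```latex
\log\frac{p(\mathrm{Age})}{1 - p(\mathrm{Age})}
  = \beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Age}^2 + \beta_3 \mathrm{Age}^3
  + \beta_4 (\mathrm{Age} - 1)_+^3 + \beta_5 (\mathrm{Age} - 13)_+^3,
\qquad (u)_+ = \max(u, 0)
```

Each truncated term is zero below its knot, so it changes the cubic only for ages above that knot while keeping the curve, its slope, and its curvature continuous there.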

Figure 1.4 shows the polynomial fit (dashed line), the spline fit (solid line), and the observed proportions at each individual age (plus signs). It is apparent that the spline picked up a phenomenon that the age categorization and polynomial missed: infants under 3 months of age have a much lower chance of having bacterial meningitis than slightly older infants.

Figure 1.4: Acute meningitis data restricted to ages 6 and under with fitted probability of bacterial meningitis. The solid green line is the spline fit, the dashed blue line is the polynomial fit, and the red plus signs show the observed proportion with bacterial meningitis for each given age.

This example shows one of the main benefits of using splines: they allow one to estimate nonlinear relationships in a way that is sensitive to localized effects, while fitting into a regression framework that allows adjustment for other covariates. Another benefit compared to polynomials is that it is possible to ensure that the splines are monotone increasing or decreasing. This feature is not useful for this particular example, as the effect of age is not monotonic, but in many contexts increasing values of a predictor are known to increase the value of the outcome.

Biostatistics Drop-In Consulting Schedule

What is Drop-In Consulting?
Free statistical consultations are provided by staff biostatisticians throughout the month at various Medical College of Wisconsin and affiliate locations. The Drop-In Service is a great way to get short statistical questions answered. Drop-In is free, try it today.

Questions:
Haley Montsma
414.955.7439
[email protected]

Locations & Hours:

1. Medical College of Wisconsin
Tuesdays & Thursdays 1:00-3:00 PM
Building: Health Research Center
Room: H2400

2. Froedtert Hospital
Mondays & Wednesdays 1:00-3:00 PM
Building: Froedtert Pavilion
Room: #L756- TRU Offices

3. MCW Cancer Center
Wednesdays 10:00 AM -12:00 PM
Fridays 1:00-3:00 PM
Building: MCW Clinical Cancer Center
Room: Clinical Trials Support Room CLCC:3236 (Enter through C3233)

4. Zablocki VA Medical Center
1st & 3rd Monday of the month 9:00-11:00 AM
Building: Building 111, 5th Floor B-wing
Room: 5423

5. Marquette University
Every Tuesday 8:30-10:30 AM
Building: School of Nursing-Clark Hall
Room: Office of Research & Scholarship: 112D
*Please note: Priority given to MU Nursing and Dental School

Poisson Regression

Kwang Woo Ahn, PhD, Assistant Professor, Division of Biostatistics

Count